BLTools v2.2 — Detailed Technical Paper
Tip 2: Incremental Processing with State Files
The new --state flag records progress in a state file; combined with --resume, it lets you restart an interrupted job from that file:
bltools transform --input weekly_data --state process.state --resume
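One way a state file can support resuming is to record each completed input unit and skip it on the next run. The sketch below only illustrates that idea: the JSON layout and the names load_state, save_state and process_all are assumptions made for this example, not BLTools internals.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the set of item names already completed (empty if no state file)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh)["done"])


def save_state(path, done):
    """Write the state via a temp file and rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"done": sorted(done)}, fh)
    os.replace(tmp, path)


def process_all(items, state_path, handler):
    """Run handler on every item that is not yet recorded in the state file."""
    done = load_state(state_path)
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        handler(item)
        done.add(item)
        save_state(state_path, done)
    return done
```

Saving after every item keeps at most one unit of work at risk; a real tool would likely checkpoint less often for throughput.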
4. Algorithms & Implementation Details
4.1 Adapter and Quality Trimming (bltrim)
- Adapter detection: seed-and-extend k-mer matching with spaced seeds (default k=12) and x-drop alignment; fallback to Smith–Waterman for ambiguous matches.
- Quality trimming: sliding-window algorithm (Phred score threshold default 20), with optional minimum read length.
- Paired-end aware trimming: preserves pairing and keeps reads synchronized; orphan handling options: drop, write to single-end file, or keep with placeholder.
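The sliding-window step can be sketched as follows. This is a minimal illustration, not the bltrim implementation: the window size of 4, the Phred+33 encoding, and the rule of cutting at the start of the first failing window are assumptions; only the threshold default of 20 and the optional minimum length come from the text above.

```python
def phred33(qual_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual_string]


def sliding_window_trim(quals, window=4, threshold=20, min_length=0):
    """Return how many bases to keep from the 5' end.

    The read is cut at the start of the first window whose mean quality
    falls below `threshold`. Reads shorter than `window` are left untouched.
    A result of 0 means the read is discarded (shorter than `min_length`).
    """
    n = len(quals)
    keep = n
    window_sum = sum(quals[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one base to the right in O(1).
            window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum < threshold * window:
            keep = start
            break
    return keep if keep >= min_length else 0
```

For a pair, the same function would be applied to each mate, with the orphan policy (drop, single-end file, placeholder) deciding what happens when only one mate survives.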
4.2 Streaming Conversion & Interleaving (blfastq2bam)
- Converts FASTQ to unaligned BAM (uBAM) while preserving read-group and platform metadata.
- Interleaving: supports processing of paired FASTQ as interleaved stream to reduce intermediate files.
- Supports read name normalization and sanitization.
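Interleaving two paired FASTQ inputs as a stream might look like the sketch below. The function names and the /1 and /2 suffix convention for mate names are assumptions for illustration; blfastq2bam's actual name handling is not described here.

```python
import itertools


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = list(itertools.islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header[1:].split()[0], seq, qual


def strip_mate_suffix(name):
    """Normalize a read name by dropping a trailing /1 or /2."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def interleave(r1_lines, r2_lines):
    """Yield R1, R2, R1, R2, ... while checking that mates stay in sync."""
    missing = object()
    pairs = itertools.zip_longest(
        read_fastq(r1_lines), read_fastq(r2_lines), fillvalue=missing
    )
    for rec1, rec2 in pairs:
        if rec1 is missing or rec2 is missing:
            raise ValueError("R1 and R2 contain different numbers of records")
        if strip_mate_suffix(rec1[0]) != strip_mate_suffix(rec2[0]):
            raise ValueError(f"mate names differ: {rec1[0]} vs {rec2[0]}")
        yield rec1
        yield rec2
```

Because both inputs are consumed lazily, no intermediate interleaved file is needed.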
4.3 Duplicate marking (bldedup)
- Implements two modes:
  - Hash-based in-memory marking for single-lane, low-depth data.
  - Sort-and-scan external merge algorithm for large datasets: compute a read signature (chromosome, position, orientation, insert size, first-sequence k-mer), externally sort by signature, then mark duplicates in a single streaming pass.
- UMI-aware duplicate handling: optional UMI whitelist, hamming-distance tolerant grouping, and consensus calling.
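The hash-based mode can be illustrated with a short sketch built on the signature fields listed above. The Read tuple, the k-mer length of 8, and the keep-first-seen policy are assumptions for this example; bldedup may choose the representative read differently (for instance by base quality).

```python
from collections import namedtuple

# Hypothetical minimal read record for the example.
Read = namedtuple("Read", "name chrom pos reverse insert_size seq")


def signature(read, k=8):
    """Build the duplicate signature: position, orientation, insert size, leading k-mer."""
    return (read.chrom, read.pos, read.reverse, read.insert_size, read.seq[:k])


def mark_duplicates(reads, k=8):
    """Return the names of reads whose signature was already seen."""
    seen = set()
    duplicates = set()
    for read in reads:
        sig = signature(read, k)
        if sig in seen:
            duplicates.add(read.name)
        else:
            seen.add(sig)
    return duplicates
```

The sort-and-scan mode produces the same grouping without holding every signature in memory: after an external sort, equal signatures are adjacent and can be compared pairwise in one pass.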
4.4 BAM Cleanup (blbamclean)
- Canonicalizes CIGAR strings, fixes mate flags, ensures coordinate-sorted consistency.
- Soft-clip normalization: trims excessive internal clipping, optionally realigns clipped sequences locally to reference using banded alignment.
- Index-aware: can update BAM index or produce coordinate-sorted BAM and BAI.
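The text does not spell out what CIGAR canonicalization involves. One plausible reading, sketched here as an assumption, is to drop zero-length operations and merge adjacent operations of the same type; the function name is hypothetical.

```python
import re

# CIGAR operators defined by the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def canonicalize_cigar(cigar):
    """Drop zero-length operations and merge adjacent identical operators."""
    if cigar == "*":
        return cigar
    ops = CIGAR_RE.findall(cigar)
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    merged = []
    for length, op in ops:
        length = int(length)
        if length == 0:
            continue
        if merged and merged[-1][1] == op:
            merged[-1][0] += length
        else:
            merged.append([length, op])
    return "".join(f"{length}{op}" for length, op in merged) or "*"
```

For example, `10M0I5M2S` becomes `15M2S`: the empty insertion is removed, which makes the two match runs adjacent so they merge.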
4.5 Base Quality Recalibration (blrecal)
- Lightweight BQSR: collects covariates (cycle, context k-mer default k=5, machine-reported quality) and fits an additive model using robust linear regression per covariate; applies corrections in streaming pass.
- Supports known-sites mask (VCF) to avoid counting true variants in error model.
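To make the covariate idea concrete, the sketch below tabulates an empirical quality per machine cycle and derives an additive correction. It is a deliberately simplified stand-in: it uses plain counting with +1/+2 smoothing instead of the robust regression described above, handles a single covariate, and all function names are hypothetical. Bases at known variant sites are skipped, as the mask option implies.

```python
import math
from collections import defaultdict


def empirical_quality(errors, observations):
    """Phred-scaled error rate; +1/+2 smoothing keeps zero-error bins finite."""
    return -10.0 * math.log10((errors + 1) / (observations + 2))


def per_cycle_deltas(observations):
    """Compute an additive correction per cycle.

    observations: iterable of (cycle, reported_q, is_mismatch, at_known_site).
    """
    tally = defaultdict(lambda: [0, 0, 0.0])  # mismatches, bases, summed reported error prob
    for cycle, reported_q, is_mismatch, at_known_site in observations:
        if at_known_site:
            continue  # likely a true variant, not a sequencing error
        entry = tally[cycle]
        entry[0] += int(is_mismatch)
        entry[1] += 1
        entry[2] += 10 ** (-reported_q / 10)
    deltas = {}
    for cycle, (mismatches, bases, summed_prob) in tally.items():
        reported = -10.0 * math.log10(summed_prob / bases)
        deltas[cycle] = empirical_quality(mismatches, bases) - reported
    return deltas


def recalibrate(reported_q, cycle, deltas):
    """Apply the additive correction; cycles without data are left unchanged."""
    return max(1, round(reported_q + deltas.get(cycle, 0.0)))
```

With several covariates, one delta table per covariate would be summed onto the reported quality, which is what makes the model additive.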
4.6 Variant Filtering (blfilter)
- Rule-based filtering expressions, implemented as a small expression DSL:
  - Example: DP < 10 || QUAL < 30 || (AF < 0.2 && MQ < 40)
- Vectorized evaluation per record; supports user-defined functions for allele-balance, strand-bias.
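A rule such as the example above can be evaluated with a small, restricted interpreter. The sketch below is not the blfilter DSL: it borrows Python's parser by translating || and && to or and and, then walks the syntax tree while allowing only comparisons, boolean operators, field names and numeric literals. Whether a matching record is kept or rejected is not stated in the text, so the function simply reports the match.

```python
import ast
import operator

_COMPARE = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def compile_rule(expr):
    """Turn a rule string into a function record -> bool."""
    source = expr.replace("||", " or ").replace("&&", " and ").strip()
    tree = ast.parse(source, mode="eval").body

    def evaluate(node, record):
        if isinstance(node, ast.BoolOp):
            values = (evaluate(v, record) for v in node.values)
            return any(values) if isinstance(node.op, ast.Or) else all(values)
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            compare = _COMPARE.get(type(node.ops[0]))
            if compare is not None:
                left = evaluate(node.left, record)
                right = evaluate(node.comparators[0], record)
                return compare(left, right)
        if isinstance(node, ast.Name):
            return record[node.id]  # field lookup, e.g. DP or QUAL
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported syntax in rule: {ast.dump(node)}")

    return lambda record: bool(evaluate(tree, record))
```

Usage: `rule = compile_rule("DP < 10 || QUAL < 30 || (AF < 0.2 && MQ < 40)")`, then `rule({"DP": 20, "QUAL": 50, "AF": 0.1, "MQ": 30})` matches through the parenthesized clause. Walking a whitelisted tree avoids calling eval on user input.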
4.7 Statistics & Reporting (blstats)
- Per-base quality distribution, per-cycle error rates, GC bias plots (data output as TSV for plotting), alignment concordance tables.
- Outputs JSON summary for automated pipelines plus a human-readable multi-page text report.
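As an illustration of the kind of summary involved, the sketch below computes mean quality per cycle and overall GC fraction in one streaming pass and serializes them as JSON. The field names in the output are assumptions for this example, not the blstats schema.

```python
import json


def summarize(records):
    """Summarize an iterable of (sequence, qualities) pairs in a single pass."""
    cycle_sum, cycle_count = [], []
    gc_bases = total_bases = reads = 0
    for seq, quals in records:
        reads += 1
        gc_bases += sum(base in "GCgc" for base in seq)
        total_bases += len(seq)
        for cycle, q in enumerate(quals):
            if cycle == len(cycle_sum):
                # First read long enough to reach this cycle.
                cycle_sum.append(0)
                cycle_count.append(0)
            cycle_sum[cycle] += q
            cycle_count[cycle] += 1
    return {
        "reads": reads,
        "gc_fraction": gc_bases / total_bases if total_bases else None,
        "mean_quality_per_cycle": [s / n for s, n in zip(cycle_sum, cycle_count)],
    }


def to_json(summary):
    """Serialize the summary with stable key order for automated pipelines."""
    return json.dumps(summary, sort_keys=True)
```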
12. Security & Data Handling
- Support for streaming encrypted files via gpg pipes.
- Temporary files created with secure permissions and optional automatic shredding.
- No network calls by default; plugin model can allow network but must be explicitly enabled.
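The temporary-file behaviour can be approximated in Python with the standard library. This is a sketch under assumptions, not the BLTools implementation: tempfile.mkstemp creates the file readable and writable only by the owner, and the optional overwrite before deletion is a best-effort stand-in for shredding (it offers no guarantee on SSDs or journaling and copy-on-write filesystems). The bltools- prefix and the function name are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def secure_tempfile(shred=False):
    """Yield the path of an owner-only temp file and remove it afterwards."""
    fd, path = tempfile.mkstemp(prefix="bltools-")
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            if shred:
                # Best-effort overwrite with zeros before unlinking.
                size = os.path.getsize(path)
                with open(path, "r+b") as fh:
                    fh.write(b"\0" * size)
                    fh.flush()
                    os.fsync(fh.fileno())
            os.remove(path)
```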
3. bltools test
Version 2.2 introduces three types of tests:
- Singular tests: Custom SQL queries that are expected to return 0 rows.
- Generic tests: Built-in checks (unique, not_null, accepted_values).
- Freshness tests: Ensure source data is up-to-date.
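The logic behind the generic and freshness checks can be sketched in plain Python over rows represented as dictionaries. The functions below are illustrative assumptions, not the bltools test engine: each generic check returns the offending rows, so an empty result is treated as a pass, and the freshness check compares the newest timestamp against a maximum age.

```python
from collections import Counter
from datetime import datetime, timezone


def not_null(rows, column):
    """Return rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return rows whose non-null column value occurs more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [r for r in rows if r.get(column) is not None and counts[r[column]] > 1]


def accepted_values(rows, column, values):
    """Return rows whose non-null column value is outside the allowed set."""
    allowed = set(values)
    return [r for r in rows if r.get(column) is not None and r[column] not in allowed]


def is_fresh(rows, column, max_age, now=None):
    """True if the newest timestamp in the column is at most max_age old."""
    now = now or datetime.now(timezone.utc)
    latest = max((r[column] for r in rows), default=None)
    return latest is not None and now - latest <= max_age
```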
8. API and Library Use
- libbltools C API:
  - Reader/Writer interfaces for FASTQ and BAM
  - Stream abstraction allowing plugins to be linked into pipelines
- Python bindings:
  - bltools.py provides wrappers for common subcommands, streaming iterators for records, and plugin registration.
  - Example Python snippet:
from bltools import FastqReader, BWAStream

for rec in FastqReader('reads.fq.gz'):
    pass  # process read
Run only specific models
bltools run --select models/finance/* --exclude *_test
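How the two patterns might combine is sketched below with shell-style matching from the standard library. Matching --select against the full path and --exclude against the file name without its extension is an assumption made for this example; the actual selector semantics are not documented here. Note that fnmatch's * also matches the / character, so the select pattern covers nested directories as well.

```python
import posixpath
from fnmatch import fnmatchcase


def model_name(path):
    """File name without directory or extension, e.g. 'revenue_test'."""
    return posixpath.splitext(posixpath.basename(path))[0]


def select_models(paths, select, exclude=None):
    """Keep paths matching `select` whose model name does not match `exclude`."""
    chosen = []
    for path in paths:
        if not fnmatchcase(path, select):
            continue
        if exclude and fnmatchcase(model_name(path), exclude):
            continue
        chosen.append(path)
    return chosen
```

On the command line, quoting the patterns (for example 'models/finance/*') prevents the shell from expanding them before the tool sees them.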