NGS variant calling QC — from FASTQ to defendable variants

NGS makes the step from raw reads to a clinical variant call look like a button click. It is not. The pipeline between FASTQ and "pathogenic" holds at least eight places where a bad default quietly produces a false positive that then gets reported to a patient. This piece walks through the QC gates that stop that from happening.

The stages and their failure modes

1. FASTQ QC

FastQC + MultiQC are the baseline but the thresholds matter. For clinical germline:

Per-base Q30 ≥ 80% across the read, not just the mean.
Adapter contamination < 1%.
Per-tile sequence quality flat: tile hot-spots usually point to a flow-cell problem (bubbles, debris, or surface damage) rather than to the library.
Overrepresented sequences investigated individually (often index hopping or primer dimer).
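
To make the "across the read, not just the mean" point concrete, here is a minimal sketch that computes the Q30 fraction at each read position from FASTQ quality strings. It assumes Phred+33 encoding; the function names are illustrative and this is not how FastQC implements its own module.

```python
from typing import Iterable, List

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def per_position_q30(quality_strings: Iterable[str]) -> List[float]:
    """Return, for each read position, the fraction of bases with Phred >= 30."""
    passed: List[int] = []
    seen: List[int] = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(seen):  # first read that reaches this position
                seen.append(0)
                passed.append(0)
            seen[i] += 1
            if ord(ch) - PHRED_OFFSET >= 30:
                passed[i] += 1
    return [p / s for p, s in zip(passed, seen)]


def positions_below(fractions: List[float], minimum: float = 0.80) -> List[int]:
    """0-based read positions whose Q30 fraction falls under the threshold."""
    return [i for i, f in enumerate(fractions) if f < minimum]


# Toy input: 'I' encodes Q40, '5' encodes Q20.
quals = ["IIII", "III5", "II55", "IIII", "III5"]
fractions = per_position_q30(quals)
failing = positions_below(fractions)
```

In this toy input the Q30 fraction averaged over positions is exactly 80%, so a mean-only check passes, while the last position sits at 40% and is reported in `failing`.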

2. Trimming and alignment

Aggressive quality trimming (e.g., sliding-window Q20) throws away real signal in low-coverage regions and can distort variant allele fractions. Limit trimming to adapter removal and let downstream callers handle base quality, which they model directly. Align short reads with BWA-MEM2 and long reads with minimap2. Report duplicate rate, insert-size distribution, and per-chromosome coverage.

3. Coverage QC

The number most clinical reports hide behind: mean coverage. A mean coverage of 150x can still leave, say, 3% of the target panel below 20x, and that 3% is where your reportable variant might be. Report:

Mean and median coverage.
Fraction of target at ≥ 20x, ≥ 30x, ≥ 50x.
Per-exon coverage table with dropouts flagged.
Uniformity (fraction of target bases covered at ≥ 0.2x of the mean).
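
The metrics above can be sketched from a flat list of per-base depths over the target. This is a minimal illustration, not a replacement for a dedicated coverage tool: the function name and output keys are made up for the example, and uniformity is taken here as the share of bases at or above 0.2x of the mean, which is one common definition and an assumption on our part.

```python
from statistics import mean, median
from typing import Dict, Sequence


def coverage_summary(
    depths: Sequence[int],
    thresholds: Sequence[int] = (20, 30, 50),
    uniformity_factor: float = 0.2,
) -> Dict[str, float]:
    """Summarise per-base depths over the target region."""
    if not depths:
        raise ValueError("no target bases supplied")
    n = len(depths)
    avg = mean(depths)
    summary = {"mean": float(avg), "median": float(median(depths))}
    for t in thresholds:
        summary[f"frac_ge_{t}x"] = sum(d >= t for d in depths) / n
    # Assumed uniformity definition: share of bases at >= 0.2x of the mean.
    summary["uniformity"] = sum(d >= uniformity_factor * avg for d in depths) / n
    return summary


# Toy panel of 100 bases: high mean, yet 3 bases sit at 10x.
toy_depths = [200] * 70 + [100] * 27 + [10] * 3
report = coverage_summary(toy_depths)
```

On the toy panel the mean is well above 150x, yet `frac_ge_20x` comes out at 0.97: the three under-covered bases are invisible in the mean and obvious in the threshold fractions.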

4. Variant calling

DeepVariant, GATK HaplotypeCaller, or Strelka2 — each has known blind spots. The practical answer is ensemble calling: run two callers, intersect calls, and flag disagreements for manual review. For somatic: Mutect2 + Strelka2 with tumor-normal pairing; panel-of-normals is mandatory.
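
A rough sketch of the intersect-and-flag step, assuming both call sets have already been normalised (left-aligned, multiallelics split) so that the same variant has the same key in each. The `Variant` tuple and `ensemble` function are hypothetical names for illustration; real pipelines would read these from VCFs.

```python
from typing import NamedTuple, Set, Tuple

# (chrom, pos, ref, alt); assumes both callers' output was normalised first.
Variant = Tuple[str, int, str, str]


class EnsembleResult(NamedTuple):
    concordant: Set[Variant]  # called by both: candidates for reporting
    only_a: Set[Variant]      # disagreement: send to manual review
    only_b: Set[Variant]      # disagreement: send to manual review


def ensemble(calls_a: Set[Variant], calls_b: Set[Variant]) -> EnsembleResult:
    """Intersect two call sets and keep the disagreements visible."""
    return EnsembleResult(
        concordant=calls_a & calls_b,
        only_a=calls_a - calls_b,
        only_b=calls_b - calls_a,
    )


# Toy call sets with made-up coordinates.
caller_a = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
caller_b = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "GA")}
result = ensemble(caller_a, caller_b)
```

Keeping the caller-specific sets, rather than discarding them, is what makes the manual-review queue possible.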

5. Variant QC filters

Defaults from the caller are rarely clinically appropriate. Minimum useful filter set:

Depth ≥ 20 at the variant position.
Allele fraction ≥ 0.3 for germline het (flag 0.2–0.3 for mosaic investigation); somatic thresholds are panel-specific.
Strand bias (SOR / FS) within caller-recommended range.
Base quality and mapping quality of supporting reads.
Position within read (end-of-read variants are error-prone).
Repetitive-region flag from a masked reference.
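
As an illustration, a subset of these filters can be written as a function that returns the reasons a germline het call needs attention. The `Call` fields are hypothetical, and the strand-bias and mapping-quality cut-offs are placeholders to be replaced with your caller's recommended values; only the depth and allele-fraction numbers come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    depth: int              # read depth at the variant position
    allele_fraction: float  # alt reads / total reads
    sor: float              # strand odds ratio reported by the caller
    mean_mapq: float        # mean mapping quality of supporting reads
    in_repeat: bool         # overlaps a masked repetitive region


def germline_het_flags(
    call: Call,
    min_depth: int = 20,
    min_af: float = 0.3,
    mosaic_af: float = 0.2,
    max_sor: float = 3.0,    # placeholder; use the caller's recommendation
    min_mapq: float = 40.0,  # placeholder; use the caller's recommendation
) -> List[str]:
    """Return filter flags for a call; an empty list means it passed."""
    flags: List[str] = []
    if call.depth < min_depth:
        flags.append("low_depth")
    if call.allele_fraction < mosaic_af:
        flags.append("low_af")
    elif call.allele_fraction < min_af:
        flags.append("mosaic_review")
    if call.sor > max_sor:
        flags.append("strand_bias")
    if call.mean_mapq < min_mapq:
        flags.append("low_mapq")
    if call.in_repeat:
        flags.append("repeat_region")
    return flags


clean = germline_het_flags(Call(35, 0.48, 1.1, 60.0, False))
messy = germline_het_flags(Call(12, 0.25, 4.2, 60.0, True))
```

Returning named flags instead of a bare pass/fail keeps the reason for each rejection available for the report.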

6. Annotation

VEP or snpEff with a pinned cache version. Population frequencies from gnomAD (pinned version). ClinVar + HGMD for known pathogenicity. OMIM for gene-disease linkage. Every annotation source version must be in the report — "gnomAD frequency 0.001" is meaningless without the dump date.
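
One way to enforce the version requirement is to refuse to build a report unless every annotation source carries a version and a dump date. The sketch below is a hypothetical manifest builder; the class, the function, and the placeholder version strings are all invented for illustration and are not real release identifiers.

```python
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass(frozen=True)
class AnnotationSource:
    name: str
    version: str    # pinned release identifier
    retrieved: str  # ISO date of the dump or cache download


def version_manifest(sources: Sequence[AnnotationSource]) -> str:
    """Serialise source versions for the report; fail if any is unpinned."""
    unpinned = [s.name for s in sources if not s.version or not s.retrieved]
    if unpinned:
        raise ValueError(f"unpinned annotation sources: {', '.join(unpinned)}")
    return json.dumps([asdict(s) for s in sources], indent=2)


# Placeholder values only; substitute the releases you actually pinned.
sources = [
    AnnotationSource("gnomAD", "PLACEHOLDER-RELEASE", "2000-01-01"),
    AnnotationSource("ClinVar", "PLACEHOLDER-RELEASE", "2000-01-01"),
]
manifest_json = version_manifest(sources)
```

Failing loudly on an unpinned source is a deliberate choice: a report with a missing dump date is harder to defend than a run that stops.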

7. ACMG classification

For germline, the ACMG/AMP 2015 guidelines plus the 2018 refinements are the reference. Automated classifiers (InterVar, Franklin) get to the same call as expert review ~80% of the time; the remaining 20% is where clinical judgment matters. Use the automated call as a starting point, never as the final call, and document every override.

8. Reporting

A clinical NGS report needs: specimen ID, panel version, coverage summary, reportable variants with ACMG classification and evidence, VUSes with evidence, technical limitations (regions not covered), signatory. Everything that a physician might need to question the call has to be retrievable from the report.

ISO 15189 specifics for NGS

Method validation per panel: analytical sensitivity (≥ 99% at ≥ 30x), specificity (< 1 false positive per Mb), reproducibility across runs / operators / reagent lots.
Limit of detection for mosaic and somatic variants with explicit allele-fraction cut-offs.
Proficiency testing — participation in a recognized EQA (GenQA, NEQAS) is not optional.
Change control — reference genome, caller version, annotation cache version all need impact assessment before update.
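
For the method-validation numbers, the arithmetic against a truth set is simple enough to sketch. This assumes calls and truth variants are restricted to the same target region and normalised identically; the function name and toy coordinates are illustrative only.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def validation_metrics(
    calls: Set[Variant], truth: Set[Variant], target_bp: int
) -> Dict[str, float]:
    """Analytical sensitivity and false positives per Mb against a truth set."""
    if not truth:
        raise ValueError("truth set is empty")
    if target_bp <= 0:
        raise ValueError("target size must be positive")
    tp = len(calls & truth)  # true positives: called and in truth
    fn = len(truth - calls)  # false negatives: in truth, not called
    fp = len(calls - truth)  # false positives: called, not in truth
    return {
        "sensitivity": tp / (tp + fn),
        "fp_per_mb": fp / (target_bp / 1_000_000),
    }


# Toy data with made-up coordinates over a 2 Mb target.
truth_set = {("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
             ("chr2", 30, "G", "A"), ("chr2", 40, "T", "C")}
call_set = {("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
            ("chr2", 30, "G", "A"), ("chr3", 99, "A", "T")}
metrics = validation_metrics(call_set, truth_set, target_bp=2_000_000)
```

Here three of four truth variants are recovered (sensitivity 0.75) and one extra call over 2 Mb gives 0.5 false positives per Mb; a real validation would compute the same quantities per variant type and per coverage bin.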

How AiLabrix fits

Drop FASTQ (or BAM, or VCF if you only want post-call QC). The pipeline runs FastQC → alignment → coverage QC → ensemble variant calling → filters → VEP+gnomAD annotation → InterVar scaffolding → signed PDF with coverage heatmap, variant tables, ACMG evidence bullets and the full pipeline version lock. Reference material and truth sets (GIAB) are baked in for per-run sanity. [email protected] for a pilot.
