Metabolomics is the omics where batch effects win more papers than biology. Between peak picking, alignment, normalization, identification and enrichment, a typical LC-MS untargeted run accumulates so many defaults that two labs analyzing the same data rarely converge. This piece walks through a pipeline that makes the run reproducible and the biology defensible.
Reproducibility starts at the bench
No pipeline saves a run with poor acquisition. The bench-side discipline that actually matters:
- Randomized injection order across biological groups — not all controls first.
- Pooled QC samples injected every 5–10 samples, plus at the start and end.
- Blanks between sample clusters to catch carry-over.
- Internal standards spiked in at extraction — isotope-labeled where possible, multiple spanning the retention range.
- A temperature-controlled autosampler tray held at a stable setpoint; lipid signals drift at room temperature.
Without these, no downstream correction recovers the lost signal-to-noise.
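The injection-order rules above can be written down as code so the sequence is regenerated rather than typed by hand. What follows is a minimal sketch, not a vendor worklist format: the function name `build_run_order`, the `QC` and `BLANK` labels, the default of one pooled QC every 8 samples, and the choice to follow each mid-run QC with a blank are all assumptions made for illustration.

```python
import random


def build_run_order(samples, qc_every=8, seed=42):
    """Return an injection sequence with shuffled samples and bracketing QCs.

    samples:  list of (sample_id, group) tuples.
    qc_every: study samples between pooled QC injections (assumed default).
    seed:     fixed so the same sequence can be regenerated for the audit trail.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)  # interleaves biological groups instead of blocking them

    order = ["BLANK", "QC"]  # open with a blank and a pooled QC
    for i, (sample_id, _group) in enumerate(shuffled, start=1):
        order.append(sample_id)
        # Mid-run checkpoint: pooled QC, then a blank to expose carry-over.
        if i % qc_every == 0 and i < len(shuffled):
            order += ["QC", "BLANK"]
    order += ["QC", "BLANK"]  # close with a pooled QC and a blank
    return order


if __name__ == "__main__":
    demo = [(f"S{i:02d}", "control" if i < 10 else "case") for i in range(20)]
    for position, label in enumerate(build_run_order(demo, qc_every=5), start=1):
        print(position, label)
```

Storing the seed next to the sample sheet is the design choice that matters here: anyone can rebuild the exact sequence and confirm that group membership and run order are not confounded.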
The pipeline
1. Raw data conversion
Vendor format → mzML via ProteoWizard msconvert with centroiding applied at conversion (not re-applied downstream). Keep the raw files and the mzML — they will be asked for.
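As a hedged illustration, the conversion step can be scripted so the same msconvert call is used for every file. The sketch assumes `msconvert` is on the PATH and uses the `peakPicking vendor msLevel=1-` filter form; check that spelling against the ProteoWizard version you have installed. The helper names `msconvert_command` and `convert` are hypothetical.

```python
import subprocess
from pathlib import Path


def msconvert_command(raw_file, out_dir):
    """Build an msconvert call that writes centroided mzML."""
    return [
        "msconvert", str(raw_file),
        "--mzML",
        # Centroid at conversion, using the vendor algorithm on all MS levels.
        "--filter", "peakPicking vendor msLevel=1-",
        "-o", str(out_dir),
    ]


def convert(raw_file, out_dir):
    """Run the conversion; raises CalledProcessError if msconvert fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(msconvert_command(raw_file, out_dir), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without msconvert.
    print(" ".join(msconvert_command("run01.raw", "mzml")))
```

Building the argument list in one function means the exact command can be logged alongside the output files.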
2. Peak picking + alignment
XCMS (R/Bioconductor) remains the reference. For LC-MS DDA:
- CentWave for peak detection with ppm tolerance matched to your instrument (2.5–5 ppm for Orbitrap, 15–25 ppm for QToF).
- Obiwarp for retention-time correction.
- PeakDensity grouping with density bandwidth tuned on the QC samples.
- Peak filling to reduce missing values before statistics.
Lock every parameter in the SOP. "We tuned peak picking until it looked right" is not a method.
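One way to make "locked" concrete is to store the parameters as JSON with a fingerprint, so any later edit is detectable. The sketch below is illustrative only: the keys mirror XCMS parameter objects (`CentWaveParam`, `ObiwarpParam`, `PeakDensityParam`), but every numeric value is a placeholder rather than a recommendation, and the file layout and function names are assumptions.

```python
import hashlib
import json

# Placeholder values; replace with the settings tuned and fixed in your SOP.
XCMS_PARAMS = {
    "CentWaveParam": {"ppm": 5, "peakwidth": [5, 30], "snthresh": 10},
    "ObiwarpParam": {"binSize": 0.5},
    "PeakDensityParam": {"bw": 5, "minFraction": 0.5},
    "fillChromPeaks": True,
}


def fingerprint(params):
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_locked(params, path):
    """Write parameters plus their fingerprint; return the fingerprint."""
    record = {"params": params, "sha256": fingerprint(params)}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return record["sha256"]


def verify_locked(path):
    """True if the stored parameters still match the stored fingerprint."""
    with open(path, encoding="utf-8") as fh:
        record = json.load(fh)
    return fingerprint(record["params"]) == record["sha256"]
```

The R script that runs XCMS would read this file instead of hard-coding values, and the fingerprint goes into the report.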
3. QC-based normalization
Raw intensities can drift by roughly 10–30% across a 200-sample batch. QC-RSC (quality-control robust spline correction, fitted to the pooled QC injections in run order) corrects most of that drift without overfitting. Report the pre- and post-normalization RSD on the QC samples for every feature; an RSD below 20% is the working threshold for keeping a feature.
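The RSD report itself is simple to compute. The sketch below assumes the QC intensities are already available before and after normalization as plain dictionaries keyed by feature ID; it does not implement the spline correction, and the names `rsd_percent` and `qc_rsd_report` are hypothetical.

```python
from statistics import mean, stdev

RSD_THRESHOLD = 20.0  # percent; the working threshold quoted in the text


def rsd_percent(values):
    """Relative standard deviation: sample SD divided by mean, in percent."""
    return 100.0 * stdev(values) / mean(values)


def qc_rsd_report(raw_qc, normalized_qc, threshold=RSD_THRESHOLD):
    """Per-feature QC RSD before and after normalization.

    raw_qc / normalized_qc: dict of feature_id -> list of pooled-QC intensities.
    A feature is kept when its post-normalization RSD is below the threshold.
    """
    report = []
    for feature_id in sorted(raw_qc):
        after = rsd_percent(normalized_qc[feature_id])
        report.append({
            "feature": feature_id,
            "rsd_before": rsd_percent(raw_qc[feature_id]),
            "rsd_after": after,
            "keep": after < threshold,
        })
    return report


if __name__ == "__main__":
    raw = {"FT001": [100, 130, 160, 190], "FT002": [100, 150, 60, 90]}
    norm = {"FT001": [100, 102, 98, 100], "FT002": [100, 150, 60, 90]}
    for row in qc_rsd_report(raw, norm):
        print(row)
```

Keeping both columns in the output shows how much the normalization actually bought for each feature, instead of reporting only the survivors.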
4. Batch effect detection
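PCA on QC samples is your check on technical variation across batches; PCA on biological samples is your biology check. Pooled QCs are the same material injected repeatedly, so they should collapse into one tight cluster. If they spread as widely as the biological samples, or split by batch, you have a batch effect that is masking your signal. ComBat or RUV-random can correct it, but correction is a last resort: fix the cause upstream where possible.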
5. Identification
Metabolomics Standards Initiative (MSI) levels:
- Level 1 — standard reference compound run on the same system (m/z + RT + MS/MS match).
- Level 2 — MS/MS match against a public library (HMDB, MoNA, METLIN) without reference standard.
- Level 3 — putative class annotation.
- Level 4 — unknown feature.
Every identified metabolite must carry its MSI level in the report. A paper claiming "120 metabolites" where 115 are Level 3 is a paper about putative class annotations, not identified metabolites.
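A small check like the following makes the MSI breakdown impossible to leave out of a report. It is a sketch under assumptions: the row layout (`feature`, `name`, `msi_level`) and the function name `msi_summary` are invented for illustration, and the level descriptions paraphrase the list above.

```python
from collections import Counter

MSI_LEVELS = {
    1: "reference standard run on the same system",
    2: "library MS/MS match without reference standard",
    3: "putative class annotation",
    4: "unknown feature",
}


def msi_summary(identifications):
    """Count identifications per MSI level; reject rows without a valid level.

    identifications: list of dicts with 'feature', 'name' and 'msi_level' keys.
    """
    for row in identifications:
        if row.get("msi_level") not in MSI_LEVELS:
            raise ValueError(f"{row.get('feature')}: missing or invalid MSI level")
    counts = Counter(row["msi_level"] for row in identifications)
    return {level: counts.get(level, 0) for level in MSI_LEVELS}


if __name__ == "__main__":
    table = [
        {"feature": "FT001", "name": "citrate", "msi_level": 1},
        {"feature": "FT002", "name": "a hexose", "msi_level": 3},
        {"feature": "FT003", "name": None, "msi_level": 4},
    ]
    for level, count in msi_summary(table).items():
        print(f"Level {level} ({MSI_LEVELS[level]}): {count}")
```

Reporting the count at each level next to the headline number lets a reader see at once how many identifications rest on a reference standard.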
6. Statistical analysis
Log-transform intensities (log2 with a small offset so zeros stay defined) before testing. Correct for multiple testing with Benjamini-Hochberg FDR, not Bonferroni, and never report raw p-values on their own. Report effect size (fold change) alongside q-values. Treat PLS-DA as a visualization, not a hypothesis test, and validate it with permutation testing.
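For readers who want to see what the correction does, here is a minimal pure-Python sketch of the log2 transform, a log2 fold change, and the Benjamini-Hochberg adjustment. The offset of 1.0 is an assumption rather than a recommendation, and in practice you would likely use an established statistics package instead of hand-rolled code.

```python
import math


def log2_offset(x, offset=1.0):
    """log2 with an additive offset so that zero intensities stay finite."""
    return math.log2(x + offset)


def log2_fold_change(case, control, offset=1.0):
    """Difference of group means on the log2 scale (case minus control)."""
    case_mean = sum(log2_offset(v, offset) for v in case) / len(case)
    control_mean = sum(log2_offset(v, offset) for v in control) / len(control)
    return case_mean - control_mean


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]
    for raw_p, q in zip(p, benjamini_hochberg(p)):
        print(f"p={raw_p:.3f}  q={q:.3f}")
```

The backward pass is what keeps q-values from decreasing as p-values increase, which a naive `p * n / rank` calculation does not guarantee.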
7. Pathway enrichment
MetaboAnalyst or Mummichog for untargeted, with KEGG or Reactome pathway sets. Report pathway hits with the feature count, the p-value, and the FDR. Interpret hits with suspicion when feature count < 3 — too few to carry meaning.
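To show what a pathway p-value means, the sketch below implements a generic hypergeometric over-representation test. This is a stand-in for illustration, not the exact procedure used by MetaboAnalyst or Mummichog. The function names and the `low_support` flag, which encodes the "fewer than 3 features" caution above, are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(hits, pathway_size, selected, universe):
    """P(X >= hits) for a hypergeometric draw.

    universe:     all features that could have been selected.
    pathway_size: features in the universe that map to the pathway.
    selected:     significant features submitted to enrichment.
    hits:         significant features that map to the pathway.
    """
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def flag_pathway(name, hits, pathway_size, selected, universe, min_hits=3):
    """Bundle the numbers a report should show, with a low-support warning."""
    return {
        "pathway": name,
        "hits": hits,
        "p_value": hypergeom_enrichment_p(hits, pathway_size, selected, universe),
        "low_support": hits < min_hits,
    }


if __name__ == "__main__":
    print(flag_pathway("example pathway", hits=2, pathway_size=12,
                       selected=40, universe=900))
```

The resulting p-values across all tested pathways would then be FDR-adjusted before reporting, and a small p-value with `low_support` set to true is exactly the kind of hit to treat with suspicion.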
The audit trail that survives peer review
- Instrument and method file versions.
- mzML checksums.
- XCMS parameter JSON (not a paragraph in Methods).
- Normalization report with RSD before/after.
- Identification table with MSI level per feature.
- Pipeline version + dependency lock.
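Part of that trail can be generated automatically. The sketch below builds a manifest with SHA-256 checksums for every mzML file plus the pipeline version; the manifest layout, the function names, and the `*.mzML` naming pattern are assumptions, and a real trail would also record instrument method versions and the dependency lock file.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large mzML files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(mzml_dir, pipeline_version, params_file=None):
    """Collect checksums and version information into one dictionary."""
    files = sorted(Path(mzml_dir).glob("*.mzML"))
    manifest = {
        "pipeline_version": pipeline_version,
        "python_version": platform.python_version(),
        "mzml_sha256": {f.name: sha256_file(f) for f in files},
    }
    if params_file is not None:
        manifest["params_sha256"] = sha256_file(params_file)
    return manifest


def write_manifest(manifest, path):
    """Write the manifest as sorted, indented JSON for stable diffs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

Sorting the keys keeps the file diff-friendly, so two manifests from the same inputs are byte-identical apart from version fields.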
How AiLabrix fits
Drop the mzML batch plus the sample metadata CSV. The pipeline runs XCMS with SOP-locked parameters, QC-RSC normalization, batch-effect detection, library-based identification with MSI leveling, statistical tests with FDR, and pathway enrichment via MetaboAnalyst. Output is a signed PDF plus machine-readable feature tables that Methods sections can link to directly.