Multi-profile Bayesian Alignment Model for LC-MS Data Analysis with Integration of Internal Standards

Tsung-Heng Tsai1,2, Mahlet G. Tadesse3, Cristina Di Poto1, Lewis K. Pannell4, Yehia Mechref5, Yue Wang2, and Habtom W. Ressom1

1Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC. 2Department of Electrical and Computer Engineering, Virginia Tech, Arlington, VA. 3Department of Mathematics and Statistics, Georgetown University, Washington, DC. 4Mitchell Cancer Institute, University of South Alabama, Mobile, AL. 5Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX.

Motivation:
Liquid chromatography-mass spectrometry (LC-MS) has been widely used for profiling expression levels of biomolecules in various "-omic" studies including proteomics, metabolomics and glycomics. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment, which ensures that ion intensity measurements are comparable across multiple LC-MS runs, is one of the most important yet challenging preprocessing steps. Current alignment approaches estimate retention time variability using either single chromatograms or detected peaks, and often overlook complementary information embedded in the LC-MS data.

Results:
We propose a Bayesian alignment model (BAM) for LC-MS data analysis. The alignment model provides estimates of the retention time variability along with uncertainty measures. The model enables integration of multiple sources of information, including internal standards and clustered chromatograms. We apply the model to LC-MS metabolomic, proteomic and glycomic data. The performance of the model is evaluated on ground-truth data by measuring correlation of variation, retention time difference across runs, and peak matching performance. We demonstrate that the BAM significantly improves retention time alignment performance by integrating relevant information, such as internal standards and clustered chromatograms, in a mathematically rigorous framework.

This webpage provides the data sets, the MATLAB code, and the supplementary information for the main paper.

Analyzed Data Sets

LC-MS Proteomic Data Set:

  • LC-MS/MS raw data
  • MASCOT search results (.csv): serum samples, internal standards
  • Peak lists (.csv) generated by DifProWare
  • Peak lists (.txt) in the format of SIMA (indexed based on the analysis order)

LC-MS Glycomic Data Set:

  • LC-MS raw data
  • DeconTools output files (.csv) based on the parameter file
  • Peak lists (.csv) generated using the MATLAB script isoFeature.m
  • Peak lists (.txt) in the format of SIMA (indexed based on the analysis order)

MATLAB Code and Associated Files

README and required MATLAB functions

LC-MS Proteomic Data Set:

  • Retention times of internal standards: time_standard_proteomics.txt
  • Base peak chromatograms (1000 registered RT points): bpc_proteomics.mat
  • Binned chromatograms (0.5 Da/bin, 1000 registered RT points): data_matrix_proteomics.mat
  • Single-profile alignment: bioinfo_proteo_sp.m (without Gaussian process prior), bioinfo_proteo_gpsp.m (with Gaussian process prior)
  • Multi-profile alignment (G=4): bioinfo_proteo_mp4.m (without Gaussian process prior), bioinfo_proteo_gpmp4.m (with Gaussian process prior)
  • Ground-truth for peak matching assessment: ground_truth_proteomics.mat, and the evaluation script: eval_proteo.m
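
The base peak and binned chromatograms listed above can be pictured with a small sketch. This page does not describe how the .mat files were actually produced, so the following Python code is only an illustration under stated assumptions: intensities falling in the same 0.5 Da bin are summed, the base peak is the maximum intensity in each scan, and "registered RT points" are taken to mean linear interpolation onto an evenly spaced retention time grid. The function names (bin_scan, base_peak, resample) and the toy scans are hypothetical and are not part of the distributed MATLAB code.

```python
from bisect import bisect_right

def bin_scan(mz_values, intensities, mz_min, mz_max, bin_width=0.5):
    """Sum the intensities of one scan into fixed-width m/z bins (assumed rule)."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    row = [0.0] * n_bins
    for mz, inten in zip(mz_values, intensities):
        idx = int((mz - mz_min) // bin_width)
        if 0 <= idx < n_bins:          # ignore ions outside the m/z range
            row[idx] += inten
    return row

def base_peak(intensities):
    """Base peak intensity of one scan: the largest ion intensity."""
    return max(intensities) if intensities else 0.0

def resample(times, values, n_points):
    """Linearly interpolate (times, values) onto n_points evenly spaced RTs."""
    t0, t1 = times[0], times[-1]
    step = (t1 - t0) / (n_points - 1)
    out = []
    for k in range(n_points):
        t = t0 + k * step
        j = bisect_right(times, t) - 1
        j = min(max(j, 0), len(times) - 2)   # clamp to a valid segment
        w = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] * (1 - w) + values[j + 1] * w)
    return out

# Toy run with three scans: (retention time, m/z values, intensities).
scans = [
    (0.0, [400.1, 400.3, 400.7], [10.0, 5.0, 20.0]),
    (1.0, [400.2, 400.9], [4.0, 8.0]),
    (2.0, [400.6], [6.0]),
]
times = [t for t, _, _ in scans]
binned = [bin_scan(mz, inten, 400.0, 401.0) for _, mz, inten in scans]
bpc = [base_peak(inten) for _, _, inten in scans]
bpc_grid = resample(times, bpc, 5)   # the files on this page use 1000 points
```

In this sketch each row of binned corresponds to one scan and each column to one 0.5 Da bin. Resampling every column the same way as the base peak chromatogram would place all runs on a common retention time grid before alignment.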

LC-MS Glycomic Data Set:

  • Retention times of internal standards: time_standard_glycomics.txt
  • Base peak chromatograms (1000 registered RT points): bpc_glycomics.mat
  • Binned chromatograms (0.5 Da/bin, 1000 registered RT points): data_matrix_glycomics.mat
  • Single-profile alignment: bioinfo_glyco_sp.m (without Gaussian process prior), bioinfo_glyco_gpsp.m (with Gaussian process prior)
  • Multi-profile alignment (G=4): bioinfo_glyco_mp4.m (without Gaussian process prior), bioinfo_glyco_gpmp4.m (with Gaussian process prior)
  • Ground-truth for peak matching assessment: ground_truth_glycomics.mat, and the evaluation script: eval_glyco.m
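
For readers who want a feel for how peak matching can be scored against a ground truth, here is a minimal Python sketch of precision and recall. The exact definitions used by eval_proteo.m and eval_glyco.m are not given on this page, so this assumes a common pairwise convention: every group of matched peaks is expanded into peak pairs, and only pairs whose two peaks both appear in the ground truth are judged. The function names and the toy peak labels are hypothetical.

```python
from itertools import combinations

def peak_pairs(groups):
    """Expand groups of matched peaks into the set of unordered peak pairs."""
    out = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            out.add((a, b))
    return out

def precision_recall(matched_groups, truth_groups):
    """Pairwise precision and recall of an alignment against ground truth."""
    found = peak_pairs(matched_groups)
    truth = peak_pairs(truth_groups)
    truth_peaks = {p for g in truth_groups for p in g}
    # Assumption: pairs involving peaks outside the ground truth are not judged.
    judged = {pr for pr in found if pr[0] in truth_peaks and pr[1] in truth_peaks}
    tp = len(judged & truth)
    precision = tp / len(judged) if judged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Peaks are labelled (run, peak index); the values are made up for illustration.
truth_groups = [
    {("run1", 3), ("run2", 5), ("run3", 4)},
    {("run1", 7), ("run2", 9)},
]
matched_groups = [
    {("run1", 3), ("run2", 5)},   # correct pair
    {("run3", 4), ("run1", 7)},   # wrong pair between two ground-truth peaks
    {("run2", 9), ("run1", 8)},   # involves a peak outside the ground truth
]
precision, recall = precision_recall(matched_groups, truth_groups)
```

Restricting the judged pairs to ground-truth peaks is a design choice: it avoids penalizing matches that the ground truth simply cannot confirm or refute.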

Supplementary Information

The supplementary information document covers the following topics:

  • Gaussian Process Regression
  • Profile-based Alignment
  • Full Conditionals of Model Parameters
  • Metropolis-Hastings Algorithm
  • Analyzed Data Sets
  • Peak Matching and Performance Evaluation
  • Metabolomic Ground-truth Data Set
  • Precision and Recall Measures in the Metabolomic Data Set
  • Peaks of Internal Standard in the Proteomic Data Set
  • Proteomic Ground-truth Data
  • Trace Plots of MCMC Samples in the Proteomic Data Set
  • Precision and Recall Measures in the Proteomic Data Set
  • Internal Standard in the Glycomic Data Set
  • Glycomic Ground-truth Data
  • Chromatographic Clustering for the Glycomic Data
  • Precision and Recall Measures in the Glycomic Data Set
Please email your questions/comments to thtsai@vt.edu
Last updated on 2013-03-25.