Biomedicines | Free Full-Text | Application of SWATH Mass Spectrometry and Machine Learning in the Diagnosis of Inflammatory Bowel Disease Based on the Stool Proteome
Nevertheless, these studies demonstrated the feasibility of using mass spectrometry on stool samples to identify specific biomarkers that can contribute to the diagnosis of IBD.
However, analyzing such a large DIA dataset, especially from complex samples such as stool, is challenging and requires advanced bioinformatics to identify reliable patterns. In this regard, machine learning (ML) and advanced feature selection methods have emerged as promising tools. Our hypothesis was that a proteomic analysis of clinical laboratory samples collected for the fecal calprotectin test might enable the development of a highly sensitive and specific non-invasive stool test based on mass spectrometry. To investigate this hypothesis, we combined our expertise in basic research, clinical practice, and bioinformatics to develop a machine learning model that accurately distinguishes active IBD patients from symptomatic non-IBD patients.
This study advances the field by demonstrating the effectiveness of SWATH-DIA proteomic profiling in distinguishing active IBD patients from non-IBD controls. Integrating this proteomic approach with machine learning techniques to create a predictive model enhances diagnostic accuracy. The model’s practicality was confirmed by validation on a separate set of samples, achieving 96% sensitivity with a 0.96 AUC. Furthermore, the robustness of the model is evident in its ability to process data from multiple batches with different collection times, showcasing its real-world applicability. Importantly, the stool samples were obtained under clinically compatible SOP conditions, emphasizing the study’s relevance to clinical practice.
2. Materials and Methods
2.1. Sample Collection and Research Ethics
A total of 123 samples were obtained from the Clinical Hematology Lab of the CIUSSS de l’Estrie-CHUS in the context of the fecal calprotectin (f-cal) testing program. The research protocol for accessing stool samples from patients who have been tested for f-cal includes a reverse consent procedure for using residual stool samples and accessing the related clinical data on the Ariane network for diagnosis. This protocol has been approved by the Research Ethics Committee of the CIUSSS de l’Estrie-CHUS (Protocol number 1991-17, 90-18; last date of approval 27 August 2023). Patients under 18 years were excluded from the study. When prescribed an f-cal test by their doctor, patients were instructed to collect a stool sample at home and bring it to the hospital within 24 h (according to the CHUS protocol: at most 2 h at room temperature, or up to 24 h if kept in the fridge at 4 °C). In the Hematology lab, a special device was used to collect a fixed amount of stool (~50 mg) and perform the extraction to be tested for calprotectin using ELISA. The remaining stool samples were stored frozen at −80 °C pending confirmation from the Archive Division of the patient’s lack of objection, after which they were stored in the lab and included in the study.
Furthermore, we excluded samples with ambiguous diagnoses, retaining only those with clear-cut diagnoses made by the attending physician using imaging, colonoscopy, fecal calprotectin tests, and histological data. The control group consisted of individuals who consulted a doctor for symptoms mimicking IBD but in whom subsequent tests confirmed the absence of IBD. The control group predominantly consisted of individuals with irritable bowel syndrome (IBS), and some had infectious colitis. Hence, we refer to them as symptomatic non-IBD controls.
2.2. Sample Preparation
2.3. SWATH-MS Data Acquisition
The acquisition of LC-MS/MS data was conducted at the proteomics facility located at Allumiqs Solutions in Sherbrooke, Quebec, Canada. Samples were analyzed using an Eksigent μUHPLC (Eksigent, Redwood City, CA, USA) coupled to an ABSciex TripleTOF 6600 mass spectrometer equipped with an electrospray interface with a 25 μm i.d. capillary. Raw data from the individual samples were acquired in Sequential Window Acquisition of All Theoretical Mass Spectra (SWATH) mode, a Data-Independent Acquisition (DIA) method. The source voltage was set to 5.5 kV and the source temperature was maintained at 325 °C; the curtain gas was set at 35 psi, gas one at 27 psi, and gas two at 10 psi. Separation was performed on a reverse-phase Kinetex XB column (0.3 mm i.d., 2.6 μm particles, 150 mm length; Phenomenex), which was maintained at 60 °C. Samples were injected by loop overfilling into a 5 μL loop. For the 60 min LC gradient, the mobile phase consisted of the following: solvent A (0.2% v/v formic acid and 3% DMSO v/v in water) and solvent B (0.2% v/v formic acid and 3% DMSO in EtOH) at a flow rate of 3 μL/min. DDA analyses were conducted with a 60 min LC gradient, while SWATH analyses utilized a 30 min LC gradient under the following conditions: 0 to 4 min, maintaining a constant 98%/2% solvent A/B mixture; 4 to 16 min, transitioning to a 75%/25% mixture; 16 to 21 min, transitioning to a 55%/45% mixture; 21 to 25 min, transitioning to 100% solvent B, which continued until 27 min; and 27 to 30 min for column re-equilibration. The decision to reduce the LC gradient length to 30 min for SWATH was driven by logistical considerations. To ensure optimal SWATH data quality, various combinations of parameters were assessed using variable acquisition windows for an MS scanning range from 350 to 1250 m/z. Parameters evaluated encompassed the number, width, and distribution of the SWATH windows, as well as ion accumulation times.
Optimization of SWATH windows was executed using the SWATH Variable Window Calculator (Sciex), scaling window sizes across the m/z range based on the m/z intensity distribution. The selected optimized SWATH method was determined by identifying the combination that provided a minimum of 6 MS2 data points per peak while maximizing quantifiable proteins and peptides.
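As an illustration, the 30 min SWATH gradient described above can be written as a small table of breakpoints. The sketch below is not part of the acquisition method: it assumes linear ramps between the stated time points, and the names GRADIENT and percent_b are hypothetical. The 27 to 30 min re-equilibration step is left out because its solvent composition is not stated.

```python
from bisect import bisect_right

# (time in min, % solvent B) breakpoints taken from the gradient description.
# Linear ramps between breakpoints are an assumption made for this sketch.
GRADIENT = [
    (0.0, 2.0),     # start at 98%/2% A/B
    (4.0, 2.0),     # held constant until 4 min
    (16.0, 25.0),   # 75%/25% A/B at 16 min
    (21.0, 45.0),   # 55%/45% A/B at 21 min
    (25.0, 100.0),  # 100% B at 25 min
    (27.0, 100.0),  # held until 27 min
]


def percent_b(t_min):
    """Return the % of solvent B at time t_min (0 to 27 min) by interpolation."""
    times = [t for t, _ in GRADIENT]
    if not times[0] <= t_min <= times[-1]:
        raise ValueError("time outside the described gradient (0 to 27 min)")
    i = bisect_right(times, t_min) - 1
    if i == len(GRADIENT) - 1:
        return GRADIENT[-1][1]
    (t0, b0), (t1, b1) = GRADIENT[i], GRADIENT[i + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (0, 4, 10, 16, 21, 23, 25, 27):
        print(f"{t:>2} min: {percent_b(t):.1f}% B")
```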
2.4. Spectral Library Generation
2.5. Label Free Quantification Analysis
All DIA-converted data in mzML format were processed using DIA-NN software (version 1.8.1) with the following parameters: a fragment ion m/z range of 200 to 1800, a precursor m/z range of 300 to 1800, a precursor false-discovery rate (FDR) threshold of 1%, automatic settings for mass accuracy at both the MS2 and MS1 levels, and the scan window. Protein inference was set to ‘Genes’, and the quantification strategy was ‘robust LC (high accuracy)’. Cross-run normalization was disabled, while match between runs (MBR) was enabled.
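A minimal sketch of how a long-format quantification report could be turned into a protein-by-sample matrix for downstream modeling is given below. It is an assumption about the post-processing, not the pipeline used in this study: the column names (Run, Genes, Q.Value, PG.MaxLFQ) follow the DIA-NN main report, the values are toy data, and the function name to_protein_matrix is hypothetical.

```python
import pandas as pd

# Toy long-format report; values are invented for illustration only.
report = pd.DataFrame({
    "Run": ["s1", "s1", "s1", "s2", "s2"],
    "Genes": ["S100A8", "S100A9", "LTF", "S100A8", "LTF"],
    "Q.Value": [0.001, 0.004, 0.02, 0.002, 0.008],
    "PG.MaxLFQ": [1.0e6, 8.0e5, 3.0e4, 2.0e6, 5.0e4],
})


def to_protein_matrix(df, fdr=0.01):
    """Keep rows passing the FDR threshold and pivot to genes x runs."""
    kept = df[df["Q.Value"] <= fdr]
    return kept.pivot_table(
        index="Genes", columns="Run", values="PG.MaxLFQ", aggfunc="max"
    )


matrix = to_protein_matrix(report)
print(matrix)
```

Entries that fail the 1% threshold or were never detected appear as missing values in the matrix, which is why an imputation step is needed before model training.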
2.6. Statistical and Modeling Analysis
This study demonstrated the potential use of SWATH-DIA proteomic profiling of stool samples as a tool for distinguishing active IBD patients from symptomatic non-IBD patients. This was achieved by employing machine learning techniques to develop a robust predictive model. To accomplish this, we designed an experiment with three main steps: (1) data acquisition and processing, (2) training and optimizing a machine learning model based on 78 retrospective samples, and (3) validating the model’s performance on 45 prospective samples. Achieving 96% sensitivity with a 0.96 AUC on a blind dataset confirmed the model’s robustness and indicated that the data obtained from four separate batches with different collection times were processed effectively. The processing steps included removal of batch effects, normalization, and missing value imputation.
To impute missing values, it is crucial to understand the nature of the data and determine the reasons for their absence, which will guide the selection of an appropriate imputation method. Upon comparing the replicated samples, we observed that the missing values were missing at random. “Zero”, “mean”, and “minimum value” are straightforward and commonly used imputation methods, but they may not always be suitable, especially when the missing values occur randomly rather than because of limits of detection or true absence of the protein. In such cases, imputing them with these methods can introduce bias into the analysis. Therefore, we chose to employ the k-nearest neighbors (KNN) imputation method with a setting of five neighbors. This means that the method uses information from the five most similar samples in the dataset to estimate each missing value.
Among the differentially expressed proteins (DEPs), the highest-scoring proteins in the volcano plot were S100A8 and S100A9, which are well-known neutrophil-derived proteins predominantly found as the S100A8/S100A9 complex, also known as calprotectin. This finding further supports the validity of the analysis pipeline.
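For readers less familiar with the two reported metrics, the sketch below shows how sensitivity and AUC can be computed from labels and model scores. The labels and scores are toy values, not study data, and the function names are hypothetical. AUC is computed here through its rank interpretation: the probability that a randomly chosen IBD sample receives a higher score than a randomly chosen control.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with 1 = active IBD, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (invented values).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]

print(sensitivity(y_true, y_pred))
print(auc(y_true, scores))
```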
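A minimal sketch of KNN imputation with five neighbors is shown below, using scikit-learn's KNNImputer. The choice of library is an assumption, since the implementation used in the study is not stated here, and the matrix is toy data with samples in rows and proteins in columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy intensity matrix (rows = samples, columns = proteins); values are invented.
# The first sample is missing its second protein.
X = np.array([
    [1.0, np.nan, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 12.0, 1.0],
    [1.0, 14.0, 1.0],
    [1.0, 16.0, 1.0],
    [1.0, 18.0, 1.0],
    [50.0, 100.0, 50.0],  # dissimilar sample, should not act as a neighbor
])

# Each missing value is replaced by the mean of that protein across the
# five samples closest to the incomplete sample (distances ignore missing entries).
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

print(X_imputed[0])
```

In this toy case the five nearest samples are the ones with values 10 to 18 for the missing protein, so the outlying last sample does not influence the estimate, unlike a global mean imputation.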
Let us take a closer look at the cost and gamma hyperparameters to understand their impact on the model. The scale parameter (γ, or gamma) controls how tightly the SVM model fits the training data. Gamma typically ranges between 0.01 and 10. Smaller values, such as our chosen value of 0.001, give each training point a wider radius of influence and therefore a smoother decision boundary. In contrast, larger values such as 1 or 10 produce tighter, more complex decision boundaries, which can lead to overfitting. The cost parameter (C), on the other hand, controls the trade-off between a wide margin and a low training error, and thereby influences how well the model generalizes to test data. C typically lies between 0.1 and 1000. A smaller C allows for a larger margin and tolerates some misclassification of training points. In our dataset, C = 8 performed better than the other values tested. This value is neither so large that it leads to overfitting nor so small that it risks underfitting.
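The effect of gamma can be illustrated with a short sketch using scikit-learn's SVC. This is not the study's model: the RBF kernel is an assumption (suggested by the use of a gamma parameter), the data are synthetic, and only the sample sizes (78 training, 45 validation) and the values C = 8 and gamma = 0.001 are taken from the text.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_features = 20  # hypothetical number of selected proteins


def make_data(n_per_class):
    """Two synthetic Gaussian classes standing in for control (0) and IBD (1)."""
    controls = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    ibd = rng.normal(4.0, 1.0, size=(n_per_class, n_features))
    X = np.vstack([controls, ibd])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = make_data(39)  # 78 samples, mirroring the training set size
X_test, y_test = make_data(23)
X_test, y_test = X_test[:45], y_test[:45]  # 45 samples, as in the validation set

results = {}
for gamma in (0.001, 10):
    model = SVC(kernel="rbf", C=8, gamma=gamma)
    model.fit(X_train, y_train)
    results[gamma] = (
        model.score(X_train, y_train),
        model.score(X_test, y_test),
    )
    print(f"gamma={gamma}: train={results[gamma][0]:.2f}, test={results[gamma][1]:.2f}")
```

Comparing training and test accuracy for the two gamma values gives a quick check for overfitting: a large gap between the two suggests that the decision boundary is fitted too tightly to the training points.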
One limitation of this study is that it involved Canadian IBD patients aged 18 and above. Therefore, applying the machine learning algorithm to populations from different regions and ages should be approached with caution. Additionally, while we were able to correct the batch effect, it is essential to note that all samples were analyzed using a single mass spectrometer. To ensure the broader applicability of this method in different clinical laboratories, it might be advantageous to analyze data from various spectrometers.
The primary objective of this study was to provide proof of concept that a SWATH-based MS analysis yielding a protein signature can serve as an additional tool to assist the gastroenterologist. This, in turn, can significantly enhance the effectiveness of IBD therapy and overall disease management. Moreover, this approach offers substantial advantages in terms of expediting and improving the precision of IBD diagnoses, thereby preventing the deterioration of the patient’s condition due to delayed colonoscopy or inaccurate diagnosis. It also supports the optimal prescription of drugs from the outset, maximizing treatment efficacy. Additionally, by reducing the number of unnecessary colonoscopies, it not only carries financial benefits but also minimizes patient discomfort and anxiety, saves time, enhances convenience, and streamlines the diagnosis and monitoring processes.
In conclusion, this study presents a proof of concept for the application of SWATH for precise IBD diagnosis using stool proteomics and showcases the effectiveness of the data processing and machine learning approaches. Additionally, it highlights the potential of this method for classifying Crohn’s disease (CD) vs. ulcerative colitis (UC) and distinguishing active IBD from remission. The creation of a non-invasive, precise, and sensitive method for diagnosing and monitoring IBD can have a substantial positive impact on the quality of life of IBD patients and lessen the burden of unnecessary or repeated invasive procedures.