Biomedicines | Free Full-Text | Application of SWATH Mass Spectrometry and Machine Learning in the Diagnosis of Inflammatory Bowel Disease Based on the Stool Proteome

1. Introduction

Inflammatory bowel disease (IBD) is a chronic disorder of the gastrointestinal tract that affects millions of people worldwide. It is characterized by inflammation of the intestinal mucosa, leading to symptoms such as abdominal pain, diarrhea, rectal bleeding, and weight loss [1]. During flare-ups, patients require drug treatment, such as steroids, immunosuppressants, and biological therapies, to reduce inflammation and promote healing [2]. Several other diseases and conditions can present symptoms similar to those of IBD, including celiac disease, irritable bowel syndrome (IBS), and infectious colitis [3]. However, each of these diseases requires different treatments. Consequently, the rapid and accurate diagnosis of IBD flare-ups is essential to ensure appropriate treatment and management of this condition. This is especially true as IBD is associated with both unique and severe complications, sometimes requiring hospitalization and intestinal resection.
Currently, the gold standard for diagnosing and monitoring IBD is colonoscopy and biopsy, invasive procedures that can be uncomfortable and present risks of complications [4]. Moreover, IBD is a lifelong disease, and repeated colonoscopies are necessary for disease follow-up, representing a significant burden for patients. It is therefore necessary to develop non-invasive methods for IBD diagnosis and follow-up [5]. Stool biomarkers have emerged as a promising non-invasive approach for IBD diagnosis and monitoring because they are in direct contact with the affected area of inflammation and pathology in IBD and can be utilized repeatedly as required. Among stool biomarkers, protein biomarkers have several advantages over other molecules since they are more stable in stool samples and can provide information on the activity and severity of the disease. Calprotectin is a calcium-binding protein that is released by inflammatory cells and is highly elevated in the feces of patients with IBD [6]. Calprotectin is a common clinically used fecal biomarker to monitor disease activity and the response to treatment and to distinguish between IBD and other gastrointestinal conditions that may have similar symptoms. However, it is not always accurate, and false-positive or false-negative results can occur. Especially when the calprotectin value falls within the range of 100 to 300 µg/g, it can be challenging to predict the transition from the remission phase to the flare-up phase of IBD [7]. Given this, it is reasonable to expect that combining multiple biomarkers can enhance accuracy and sensitivity in diagnostic or research applications [8].
Recently, there have been promising developments in technology and platforms that can identify and measure a large number of targets simultaneously, such as mass spectrometry-based approaches. Mass spectrometry holds great potential for clinical proteomics, which is used for a comprehensive study of proteins in clinical samples with the aim of discovering the most relevant disease markers [9]. Data-independent acquisition (DIA) mass spectrometry enables comprehensive quantification of all detectable proteins in a sample and allows retrospective data analysis. It also has several advantages over data-dependent acquisition (DDA) for proteomic profiling, such as higher reproducibility, a lower missing value rate, and better quantification accuracy [10]. In comparison to various DIA methods [11], Sequential Window Acquisition of All Theoretical Mass Spectra (SWATH) typically provides a combination of deep proteome coverage capabilities with quantitative consistency and accuracy [11,12,13].
Overall, only a few published studies have used mass spectrometry (MS) analysis on human stool samples to identify protein profiles for specific pathologies, including IBD. For example, a pilot study on a cohort of 10 patients sought to discriminate between active and remission phases. It identified 30 differentially expressed proteins between two groups of five patients but did not use a validation group [14]. Another study on a cohort of IBD patients relied on spectrum analysis instead of quantitative data. Its validation cohort yielded low specificity (55%), and its standard operating procedure (SOP) for sample collection and storage required dispatch to the laboratory within 2 h and freezing at −80 °C, which may not be compatible with the general constraints of a standard clinical setup [15]. Recently, Vitali et al. identified three single fecal biomarkers using 2-DIGE and MALDI-TOF/TOF MS on stool samples [16]. Among them, only RhoGDI2 performed better than calprotectin in discriminating control from IBD patients. However, this marker, like calprotectin, was not able to identify patients in the middle zone, encompassing those in remission and with moderate activity.

Nevertheless, these studies demonstrated the feasibility of using mass spectrometry on stool samples to identify specific biomarkers that can contribute to the diagnosis of IBD.

However, analyzing such a large DIA dataset, especially from complex samples such as stool, is challenging and necessitates advanced bioinformatics to identify reliable patterns. In this regard, machine learning (ML) and advanced feature selection methods have emerged as promising tools. Our hypothesis was that conducting a proteomic analysis on clinical laboratory samples that are intended for the fecal calprotectin test might enable the development of a highly sensitive and specific non-invasive stool test based on mass spectrometry. To investigate this hypothesis, we combined our expertise in basic research, clinical practice, and bioinformatics to develop a precise machine learning model that distinguishes active IBD patients from symptomatic non-IBD patients.

This study represents a significant advancement in the field by demonstrating the effectiveness of SWATH-DIA proteomic profiling in distinguishing active IBD patients from non-IBD controls. The novel integration of this proteomic approach with machine learning techniques to create a predictive model enhances the diagnostic accuracy. The model’s practicality was confirmed through successful validation on a separate set of samples, achieving 96% sensitivity with a 0.96 AUC. Furthermore, the robustness of the model is evident in its ability to process data from multiple batches with different collection times, showcasing its real-world applicability. Importantly, the stool samples were obtained under clinically compatible SOP conditions, emphasizing the study’s relevance to clinical practice.

2. Materials and Methods

2.1. Sample Collection and Research Ethics

A total of 123 samples were obtained from the Clinical Hematology Lab of the CIUSSS de l’Estrie-CHUS in the context of the fecal calprotectin (f-cal) testing program. The research protocol for accessing stool samples from patients that have been tested for f-cal includes a reverse consent procedure for using residual stool samples and accessing the related clinical data on the Ariane network for diagnosis. This protocol has been approved by the Research Ethics Committee of the CIUSSS de l’Estrie-CHUS (Protocol number 1991-17, 90-18; last date of approval 27 August 2023). Patients under 18 years were excluded from the study. When prescribed an f-cal test by their doctor, patients were instructed to collect a stool sample at home and bring it to the hospital within 24 h (according to the CHUS protocol: a maximum of 2 h at room temperature, or up to 24 h if kept in the fridge at 4 °C). In the Hematology lab, a special device was used to collect a fixed amount of stool (~50 mg) and perform the extraction to be tested for calprotectin using ELISA. The remaining stool samples were stored frozen at −80 °C pending confirmation of the patient’s lack of objection from the Archive Division, after which they were stored in the lab and included in the study.

Furthermore, in our study, we excluded samples with ambiguous diagnoses, retaining only those with clear-cut diagnoses made using imaging, colonoscopy, fecal calprotectin tests, and histological data by the attending physician. The control group in our study consisted of individuals who consulted a doctor for symptoms mimicking IBD. However, subsequent tests confirmed the absence of IBD in these patients. The control group predominantly consisted of individuals with irritable bowel syndrome (IBS), and some had infectious colitis. Hence, we refer to them as symptomatic non-IBD controls.

2.2. Sample Preparation

Sample preparation was implemented as previously described [17]. Briefly, 100 mg of frozen stool specimens was solubilized in 1 mL of lysis buffer (25 mM Tris, 1% SDS, pH 7.5) and centrifuged. Then the aqueous phase between the pellet and the floating residual was recovered and stored at −80 °C until preparation for LC-MS/MS analysis. The concentration of solubilized proteins in the individual samples was measured using a BCA test. For reduction, the samples were treated with 10 mM dithiothreitol (DTT) and, for alkylation, the samples were exposed to 15 mM iodoacetamide. Subsequently, the quenching step was implemented using 10 mM DTT. The proteins were precipitated with cold acetone and methanol and digested with Trypsin/Lys-C. The cleaning and recovery of the peptides were performed with a reverse-phase Strata-X polymeric SPE sorbent column (Phenomenex, Torrance, CA, USA) according to the manufacturer’s instructions. The recovered peptides were dried under nitrogen flow at 37 °C for 45 min and stored at 4 °C until being resuspended in 20 µL of mobile phase solvent A (0.2% v/v formic acid and 3% DMSO v/v in water) before LC-MS/MS analysis.

2.3. SWATH-MS Data Acquisition

The acquisition of LC-MS/MS data was conducted at the proteomics facility located at Allumiqs Solutions in Sherbrooke, Quebec, Canada. Samples were analyzed using an Eksigent μUHPLC (Eksigent, Redwood City, CA, USA) coupled to an ABSciex TripleTOF 6600 mass spectrometer equipped with an electrospray interface with a 25 μm i.d. capillary. Data-Independent Acquisition (DIA) Sequential Window Acquisition of All Theoretical Mass Spectra (SWATH) acquisition mode was used to acquire raw data from the individual samples. The source voltage was set to 5.5 kV and the source temperature was maintained at 325 °C; the curtain gas was set at 35 psi, gas one at 27 psi, and gas two at 10 psi. Separation was performed on a reverse-phase Kinetex XB column (0.3 mm i.d., 2.6 μm particles, 150 mm length; Phenomenex), which was maintained at 60 °C. Samples were injected by loop overfilling into a 5 μL loop. For the 60 min LC gradient, the mobile phase consisted of solvent A (0.2% v/v formic acid and 3% DMSO v/v in water) and solvent B (0.2% v/v formic acid and 3% DMSO in EtOH) at a flow rate of 3 μL/min. DDA analyses were conducted with a 60 min LC gradient, while SWATH analyses used a 30 min LC gradient under the following conditions: 0 to 4 min, maintaining a constant 98%/2% solvent A/B mixture; 4 to 16 min, transitioning to a 75%/25% mixture; 16 to 21 min, transitioning to a 55%/45% mixture; 21 to 25 min, transitioning to 100% solvent B, which continued until 27 min; and 27 to 30 min for column re-equilibration. The decision to reduce the LC gradient length to 30 min for SWATH was driven by logistical considerations. To ensure optimal SWATH data quality, various combinations of parameters were assessed using variable acquisition windows for an MS scanning range from 350 to 1250 m/z. The parameters evaluated encompassed the number, width, and distribution of the SWATH windows, as well as ion accumulation times.
Optimization of SWATH windows was executed using the SWATH Variable Window Calculator (Sciex), scaling window sizes across the m/z range based on the m/z intensity distribution. The selected optimized SWATH method was determined by identifying the combination that provided a minimum of 6 MS2 data points per peak while maximizing quantifiable proteins and peptides.
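The 30 min SWATH gradient described above can be written down as a small lookup, which makes the solvent composition at any time point easy to check. The following is a minimal sketch only: the name `percent_b` is hypothetical, the ramps between breakpoints are assumed to be linear (the text only says "transitioning"), and the 27 to 30 min re-equilibration step is not modeled.

```python
# Breakpoints (time in min, % solvent B) taken from the SWATH gradient in the text.
# Assumption: composition changes linearly between consecutive breakpoints.
GRADIENT = [(0.0, 2.0), (4.0, 2.0), (16.0, 25.0), (21.0, 45.0), (25.0, 100.0), (27.0, 100.0)]


def percent_b(t_min):
    """Return the interpolated % of solvent B at time t_min (0 to 27 min)."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the modeled 0-27 min range")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (2, 10, 23, 26):
        print(f"{t:>2} min: {percent_b(t):.1f}% B")
```

Under these assumptions, the midpoint of the 4 to 16 min ramp (10 min) corresponds to 13.5% solvent B.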

2.4. Spectral Library Generation

To generate an ion library, extracted proteins from a representative pool of samples (3 IBD and 3 symptomatic non-IBD patients) were separated on a 4–20% polyacrylamide gel and then reduced, alkylated, and digested in the gel. Peptides were extracted from the gel using successive rounds of dehydration and sonication and purified using reverse-phase SPE. Data-Dependent Acquisition (DDA) mode was used to acquire raw data from 12 gel fractions of a pooled sample. The spectral library was created following the procedure outlined in a previous study [17]. Briefly, the raw data (.wiff) files obtained in DDA and DIA mode were converted into mzML format with MSConvert (GUI) from ProteoWizard (v3.0.22074) [18]. Subsequently, we used FragPipe software (accessed on 10 March 2022) to search the MS/MS spectra against the human proteome reviewed database (UP000005640; including isoforms and contaminants; accessed on 15 March 2022; containing 20,411 reviewed proteins) via the MSFragger search engine [19]. This search was conducted with default open search parameters, specifying a peptide length between 6 and 42, using strict trypsin as the enzyme with a maximum of 1 missed cleavage allowed, setting the maximum fragment charge to 4, and designating methionine oxidation as a variable modification and carbamidomethylation as a fixed modification. The mass tolerance was set to ±20 ppm for precursor ions and 20 ppm for fragment ions. The false discovery rate (FDR) for both peptide and protein identifications was set at 5%. The DDA- and DIA-based libraries were merged and filtered to remove duplicated precursors, yielding a total of 2000 proteins. This integration increased the human proteome coverage of the library.
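To make the 20 ppm mass tolerance concrete, the short sketch below converts a ppm tolerance into an absolute m/z window. It is an illustration only, not the matching logic of MSFragger; the function names and the example m/z values are hypothetical.

```python
def ppm_window(mz, tol_ppm=20.0):
    """Return the (low, high) m/z bounds for a symmetric ppm tolerance."""
    delta = mz * tol_ppm / 1e6
    return (mz - delta, mz + delta)


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=20.0):
    """True if the observed m/z deviates from the theoretical m/z by at most tol_ppm."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tol_ppm


if __name__ == "__main__":
    low, high = ppm_window(500.0)
    print(f"20 ppm window at m/z 500: {low:.4f} to {high:.4f}")
    print(within_tolerance(500.005, 500.0))  # 10 ppm deviation
    print(within_tolerance(500.020, 500.0))  # 40 ppm deviation
```

At m/z 500, a 20 ppm tolerance corresponds to ±0.01 m/z, and the absolute window widens proportionally with m/z.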

2.5. Label Free Quantification Analysis

All DIA-converted data in mzML format were processed using DIA-NN software (version 1.8.1) with the following parameters: a fragment ion m/z range of 200 to 1800, a precursor m/z range of 300 to 1800, a precursor false-discovery rate (FDR) threshold of 1%, automatic settings for mass accuracy at both the MS2 and MS1 levels, and the scan window. Protein inference was set to ‘Genes’, and the quantification strategy was ‘robust LC (high accuracy)’. Cross-run normalization was disabled, while match between runs (MBR) was enabled.

2.6. Statistical and Modeling Analysis

The statistical analysis was conducted with R software (version 4.2.2) in RStudio, using the packages ggplot2 for visualization, limma [20] for normalization, sva [21] for batch effect correction, and impute for imputation [22]. Differentially expressed proteins were identified using ProStar software (version 1.30.5) [23]. Machine learning and the feature selection analysis were mainly performed using the freely available WEKA software (version 3.8.6, accessed on 15 January 2023) [24] and the R packages Caret (Classification And REgression Training) [25], caretEnsemble [26], and Boruta [27].
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [28] partner repository with the dataset identifier PXD047585.

4. Discussion

This study demonstrated the potential use of SWATH-DIA proteomic profiling of stool samples as a tool for diagnosing active-IBD patients from symptomatic non-IBD patients. This was achieved by employing machine learning techniques to develop a robust predictive model. To accomplish this, we designed an experiment with three main steps of (1) data acquisition and processing, (2) training and optimizing a machine learning model based on 78 retrospective samples, and (3) validating the model’s performance on 45 prospective samples. Achieving 96% sensitivity with a 0.96 AUC using a blind dataset confirmed the model’s robustness, also indicating our ability to successfully and effectively process the data obtained from four separate batches with different collection times. The processing steps included the successful removal of batch effects and employed effective methods for normalization and missing value imputation.
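For readers less familiar with the two reported metrics, the sketch below shows how sensitivity and AUC can be computed from class labels and classifier scores. The labels and scores are made-up toy values, not data from this study, and the function names are hypothetical. The AUC is computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN), with 1 = active IBD and 0 = non-IBD."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example only (hypothetical values).
    y_true = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(f"sensitivity = {sensitivity(y_true, y_pred):.3f}")
    print(f"AUC         = {auc(y_true, scores):.3f}")
```

Sensitivity depends on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the two are reported together.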

We have corrected the batch effect using the ComBat method. ComBat starts by adjusting each batch of data separately to have similar means and variances, and then calculates the differences between the batches and uses this information to “harmonize” the data. ComBat adjusts the data for each sample in a way that minimizes the batch-related differences while preserving the true biological differences [37].
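The core idea of aligning batch means and variances can be illustrated with a deliberately simplified location/scale adjustment for a single protein. This is not ComBat: it omits the empirical Bayes shrinkage of batch parameters and the protection of biological covariates that the sva implementation provides, and the function name and example intensities are hypothetical.

```python
from statistics import mean, pstdev


def adjust_batches(values, batches):
    """Rescale one protein's intensities so every batch shares the same mean and spread.

    Simplified sketch: each value is standardized within its batch, then mapped
    back onto the grand mean and the pooled within-batch standard deviation.
    """
    grand_mean = mean(values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    stats = {b: (mean(g), pstdev(g)) for b, g in groups.items()}
    # Pooled within-batch standard deviation (root mean squared residual).
    pooled_sd = mean((v - stats[b][0]) ** 2 for v, b in zip(values, batches)) ** 0.5
    adjusted = []
    for v, b in zip(values, batches):
        batch_mean, batch_sd = stats[b]
        z = (v - batch_mean) / batch_sd if batch_sd > 0 else 0.0
        adjusted.append(grand_mean + pooled_sd * z)
    return adjusted


if __name__ == "__main__":
    # Hypothetical log-intensities for one protein measured in two batches.
    print(adjust_batches([10.0, 12.0, 20.0, 22.0], ["A", "A", "B", "B"]))
```

A naive adjustment like this would also erase real biological differences if disease status were unevenly distributed across batches, which is one reason a model-based method such as ComBat is preferable in practice.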

To impute missing values, it is crucial to understand the nature of the data and determine the reasons for their absence, which will guide the selection of an appropriate imputation method. Upon comparing the replicated samples, we observed that the missing values were missing at random. “Zero”, “mean”, and “minimum value” are the straightforward imputation methods commonly used, but they may not always be suitable, especially when the missing values occur randomly and are not due to limits of detection or actual missing data. In such cases, imputing them with these methods can introduce bias into the analysis. Therefore, we chose to employ the k-nearest neighbors (KNN) imputation method with a setting of five neighbors. This implies that it leverages information from the five most similar samples in the dataset to estimate the missing values.
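A minimal sketch of KNN imputation follows, assuming samples are rows, proteins are columns, and missing values are encoded as `None`. It follows the description above (neighbors are the most similar samples) rather than reproducing the exact behavior of the R impute package; the function name and toy matrix are hypothetical.

```python
import math


def knn_impute(rows, k=5):
    """Fill each missing value with the mean of that protein in the k nearest samples."""

    def distance(a, b):
        # Root mean squared difference over proteins observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples in which protein j was measured.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
    return imputed


if __name__ == "__main__":
    # Toy matrix (hypothetical values): 4 samples x 3 proteins, one missing entry.
    data = [
        [1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [1.1, 2.1, 3.2],
        [9.0, 9.0, 9.0],
    ]
    print(knn_impute(data, k=2)[0])
```

With `k=2`, the missing value is estimated from the two samples whose remaining proteins are closest, so the distant fourth sample does not influence the result.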

Among the differentially expressed proteins (DEPs), the highest-scoring proteins in the volcano plot were S100A8 and S100A9, which are well-known neutrophil-derived proteins predominantly found as the S100A8/S100A9 complex, also known as calprotectin. This finding supports the validity of the analysis pipeline.

Utilizing all 48 differentially expressed proteins as biomarker signatures for classification may not be practical. Therefore, we needed to reduce the number of biomarkers without compromising prediction accuracy. However, selecting only the best proteins and combining them based on previous studies does not guarantee an improvement in overall classification performance. Furthermore, in machine learning, a specific coefficient is assigned to each biomarker, known as a weight, based on its importance and effect on classification to achieve an optimal result. For instance, Mooiweer et al. found that the combination of fecal hemoglobin and calprotectin did not enhance their predictive accuracy compared to using fecal Hb and FC individually [57]. Similarly, Schröder et al. found that the combination of calprotectin, lactoferrin, and neutrophil elastase did not increase predictive accuracy when compared with calprotectin alone [58]. In this regard, using correlation-based feature selection in this study helped us to keep only the seven most relevant proteins, those with maximum correlation with the class variable and minimum intercorrelation. For instance, retaining both the S100A9 and S100A8 proteins does not provide significant additional information because both of them are subunits of calprotectin and are highly correlated with each other. Moreover, the correlation heatmap in Figure 5 indicates that S100A9 and S100A8 also share a high correlation with lactoferrin, and there is also a noticeable correlation between azurocidin, myeloblastin, and myeloperoxidase. Although all of them were identified previously as potential IBD markers, keeping only one of them might give nearly the same results.
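The trade-off between relevance and redundancy can be illustrated with the merit score commonly used in correlation-based feature selection, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the number of features, r_cf is the mean feature-class correlation, and r_ff is the mean feature-feature correlation. The sketch below assumes absolute Pearson correlation as the correlation measure and uses made-up toy vectors; it is not the WEKA implementation, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cfs_merit(features, labels):
    """Merit of a feature subset: high class correlation, low intercorrelation."""
    k = len(features)
    r_cf = mean(abs(pearson(f, labels)) for f in features)
    r_ff = mean(abs(pearson(a, b)) for a, b in combinations(features, 2)) if k > 1 else 0.0
    return k * r_cf / (k + k * (k - 1) * r_ff) ** 0.5


if __name__ == "__main__":
    # Toy data (hypothetical): 4 samples, binary class labels.
    labels = [0, 0, 1, 1]
    f1 = [1.0, 2.0, 3.0, 4.0]
    f2 = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated with f1 (redundant)
    f3 = [1.0, -1.0, 2.0, 0.0]  # uncorrelated with f1 (complementary)
    print(f"f1 alone : {cfs_merit([f1], labels):.3f}")
    print(f"f1 + f2  : {cfs_merit([f1, f2], labels):.3f}")
    print(f"f1 + f3  : {cfs_merit([f1, f3], labels):.3f}")
```

In this toy case, adding the redundant feature leaves the merit unchanged, while adding the uncorrelated feature raises it, which mirrors the reasoning for retaining only one of S100A8 and S100A9.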
The seven selected proteins include the upregulated proteins S100A9, azurocidin (AZU1), immunoglobulin lambda constant 3, hemoglobin subunit delta, phospholipase B-like 1 (PLBD1), and alpha-1-acid glycoprotein 1 (alpha 1-AGP), and the downregulated protein neutral ceramidase (ASAH2). Two of these proteins, S100A9 and AZU1, are associated with neutrophils and play a key role in the host’s defense against bacterial infections. S100A9 is a subunit of calprotectin, which accounts for approximately 60% of the total soluble proteins in the cytosol fraction of neutrophils, while AZU1 is found in the azurophilic granules of neutrophils, alongside other proteins [59]. Hemoglobin delta is linked to occult intestinal bleeding in IBD patients, and previous research has highlighted a correlation between fecal hemoglobin and calprotectin [57]. Immunoglobulin lambda is an immunoglobulin light chain and can be indicative of an active immune system in IBD patients. The increase in free light chains (FLCs), including kappa and lambda immunoglobulins, in plasma has previously been shown in diabetes and immune system abnormalities, as well as autoimmune-based inflammatory diseases [60,61]. However, the dysregulation of lambda light chains in stool and its relevance to IBD have not been studied in detail. PLBD1 is a phospholipase that can generate lipid mediators of inflammation and was first identified in neutrophils [62]. However, to the best of our knowledge, its relationship with IBD has not been specifically investigated. Alpha 1-AGP is one of the major acute phase proteins in humans, and its serum concentration increases in response to systemic tissue injury, inflammation, or infection [63]. Takashi et al. demonstrated a significant increase in fecal alpha 1-AGP in active IBD patients compared to non-active patients, suggesting alpha 1-AGP as a potential biomarker for evaluating IBD activity [64]. ASAH2 is involved in breaking down ceramides to sphingosines. Its downregulation in IBD causes ceramide accumulation in microdomains of cholesterol- and sphingolipid-enriched membranes, resulting in an impairment of the barrier function of the gut [65,66]. The loss of ASAH2 causes elevated levels of sphingosine-1-phosphate and systemic inflammation in ASAH2 knockout mice [67]. These proteins collectively offer insights into the complex molecular mechanisms and potential biomarkers associated with IBD.
The superiority of SVM over other models can be attributed to various factors, including the characteristics of the data, the nature of the classes, the distribution of the features, and the inherent strengths and weaknesses of each algorithm [68]. Some advantages of SVM over other classifiers include being less prone to overfitting due to its optimization process and regularization (controlled by the parameters C and gamma), and greater robustness to outliers and noisy data [69,70].
SVM serves as a robust technique for constructing a classifier [71]. Its primary objective is to establish a decision boundary between two classes, facilitating the classification of data points based on their features. This decision boundary, referred to as a hyperplane, is positioned in a manner that maximizes its distance from the nearest data points of each class, which are known as support vectors [72]. Vapnik initially introduced the SVM algorithm in 1963 to create linear classifiers [73]. Additionally, SVMs can employ kernel methods to model complex, non-linear patterns in higher dimensions. The choice of a suitable kernel function, among other considerations, can significantly impact the performance of an SVM model. However, there is no definitive method to determine the optimal kernel for a specific pattern recognition problem. It often involves a trial-and-error approach, beginning with a basic SVM and experimenting with various standard kernel functions [72]. In this study, the selection of the optimal kernel function was part of the hyperparameter tuning process. For our data, a polynomial kernel with a degree of one outperformed the others. This configuration is commonly referred to as “Linear SVM” or “SVM with a Linear Kernel” [74]. This setup assumes that the classes are approximately linearly separable, which has the advantage of keeping the model simple.
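To illustrate why a degree-one kernel yields a linear classifier, the sketch below evaluates an SVM decision function with a polynomial kernel of the form (scale · ⟨x, y⟩ + offset)^degree. This assumes that the "degree of one" mentioned above refers to a polynomial kernel; the support vectors, coefficients, and bias are made-up values, not parameters of the trained model, and the function names are hypothetical.

```python
def poly_kernel(x, y, degree=1, scale=1.0, offset=0.0):
    """Polynomial kernel: (scale * <x, y> + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (scale * dot + offset) ** degree


def decision_value(x, support_vectors, coefs, bias, **kernel_args):
    """Signed score f(x) = sum_i coef_i * K(sv_i, x) + bias.

    Each coef_i stands for alpha_i * y_i from the trained model.
    """
    return sum(
        c * poly_kernel(sv, x, **kernel_args) for sv, c in zip(support_vectors, coefs)
    ) + bias


def classify(x, support_vectors, coefs, bias, **kernel_args):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if decision_value(x, support_vectors, coefs, bias, **kernel_args) >= 0 else -1


if __name__ == "__main__":
    # Hypothetical model with two support vectors in a 2-D feature space.
    svs = [(1.0, 1.0), (-1.0, -1.0)]
    coefs = [0.5, -0.5]
    print(classify((2.0, 0.0), svs, coefs, bias=0.0, degree=1))
    print(classify((-1.0, -3.0), svs, coefs, bias=0.0, degree=1))
```

With `degree=1`, the kernel is an affine function of the dot product, so the decision function is linear in the input features and the boundary is a hyperplane.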

Let us take a closer look at the cost and gamma hyperparameters to gain insights into their impacts on the model. The scale parameter (γ or gamma) controls how tightly the SVM model fits the training data. The gamma parameter typically ranges between 0.01 and 10. Opting for smaller values, such as our chosen value of 0.001, implies a more extensive decision boundary. In contrast, larger values such as 1 or 10 result in narrower decision boundaries, which, if not carefully considered, can trigger overfitting. The cost parameter (C), on the other hand, controls the trade-off between a wide margin and a low training error. It typically ranges between 0.1 and 1000. A smaller C allows for a larger margin and tolerates some misclassification of training points. In our dataset, C = 8 exhibited better performance than the other values. This value strikes a balance between being too large, which can lead to overfitting, and too small, which can risk underfitting.
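The role of C can be made concrete with the standard soft-margin objective for a linear SVM, 0.5·‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)): the first term rewards a wide margin, and the second penalizes training points that fall inside the margin or on the wrong side. The sketch below only evaluates this objective for fixed, made-up weights and data points; it does not train a model, and the function name is hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Evaluate 0.5 * ||w||^2 + C * sum of hinge losses for labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return margin_term + C * hinge


if __name__ == "__main__":
    # Toy data (hypothetical): the third point lies inside the margin.
    X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
    y = [1, -1, 1]
    w, b = (1.0, 0.0), 0.0
    for C in (0.1, 8.0, 1000.0):
        print(f"C = {C:>6}: objective = {soft_margin_objective(w, b, X, y, C):.2f}")
```

For the same margin violation, the penalty grows in proportion to C, so a large C pushes the optimizer toward fitting every training point while a small C favors a wider margin.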

One limitation of this study is that it involved Canadian IBD patients aged 18 and above. Therefore, applying the machine learning algorithm to populations from different regions and ages should be approached with caution. Additionally, while we were able to correct the batch effect, it is essential to note that all samples were analyzed using a single mass spectrometer. To ensure the broader applicability of this method in different clinical laboratories, it might be advantageous to analyze data from various spectrometers.

The primary objective of this study was to provide proof of concept that a SWATH-based MS analysis can be used as an additional tool to assist the gastroenterologist through a protein signature. This, in turn, can significantly enhance the effectiveness of IBD therapy and overall disease management. Moreover, this approach offers substantial advantages in terms of expediting and improving the precision of IBD diagnoses, thereby preventing the deterioration of the patient’s condition due to delayed colonoscopy or inaccurate diagnosis. It also supports the optimal prescription of drugs from the outset, maximizing treatment efficacy. Additionally, by reducing the number of unnecessary colonoscopies, it not only carries financial benefits but also minimizes patient discomfort and anxiety, saves time, enhances convenience, and streamlines the diagnosis and monitoring processes.

In conclusion, this study presents a proof of concept for the application of SWATH for precise IBD diagnosis using stool proteomics and showcases the effectiveness of the data processing and machine learning approaches. Additionally, it highlights the potential of this method for classifying Crohn’s disease (CD) vs. ulcerative colitis (UC) and distinguishing active IBD from remission. The creation of a non-invasive, precise, and sensitive method for diagnosing and monitoring IBD can have a substantial positive impact on the quality of life of IBD patients and lessen the burden of unnecessary or repeated invasive procedures.
