
Bayesian additive regression trees for predicting childhood asthma in the CHILD cohort study

BMC Medical Research Methodology, volume 24, Article number: 262 (2024)

Asthma is a heterogeneous disease that affects millions of children and adults. There is no objective gold-standard diagnostic test that spans all ages; instead, diagnoses are made by clinician assessment based on a cluster of signs, symptoms and objective tests that depend on age. Yet, there is clear morbidity associated with chronic asthma symptoms. Machine learning has become a popular tool to improve asthma diagnosis and classification. There is a paucity of literature on the use of Bayesian machine learning algorithms to predict asthma diagnosis in children. This paper develops a prediction model using Bayesian additive regression trees (BART) and compares its performance to various machine learning algorithms in predicting the diagnosis of childhood asthma.

Clinically relevant variables collected at or before 3 years of age from 2794 participants in the CHILD Cohort Study were used to predict physician-diagnosed asthma at age 5. BART and six other commonly used machine learning algorithms, namely adaptive boosting, logistic regression, decision tree, neural network, random forest, and support vector machine, were trained. Measures of performance, including sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve, were calculated, and confidence intervals were calculated using bootstrap samples. Important predictors and interaction effects associated with asthma were also identified using BART.

BART, logistic regression and random forest showed the highest area under the ROC curve compared to other machine learning algorithms. Based on BART, recurrent wheeze, respiratory infection and food sensitization at 3 years of age were the most important predictors. The three most important interaction effects were the interactions between respiratory infection and recurrent wheezing at 3 years, between maternal and paternal asthma, and between maternal wheezing and the child's inhalant sensitization at 3 years.

BART demonstrated promising prediction performance compared to other machine learning algorithms. Future research could validate BART in an external cohort to evaluate its reliability and generalizability.

Asthma is a heterogeneous disease that manifests the common symptoms of wheeze, breathlessness, and airflow obstruction [1]. This heterogeneity is further complicated in preschool-aged children (age ≤ 6 years) in whom the diagnosis of asthma is based on the clinical history of the wheeze symptom pattern, triggers and response to therapy. Identifying the risk factors that predict asthma and children who are at a higher risk allows early intervention in order to prevent or minimize the onset and progression of the disease. Such interventions involve timely treatment and management strategies, potentially reducing the severity and frequency of asthma symptoms.

In the era of big data and the advancement of computational tools, there are trends toward using more observations, more variables and therefore more complex machine learning algorithms to provide better classification and prediction of asthma [2]. For example, machine learning algorithms have been used to predict school-age asthma [3] and adult asthma [4] in general populations, and school-age asthma in a high-risk population [5]. Moreover, studies have utilized information from children (\(\le\) 5 years of age) to predict asthma in school-age children (6–13 years) [6]. However, few studies have focused on predicting asthma at preschool age, a critical time window in which to implement early intervention and facilitate the management of asthma [2].

Many different statistical and machine-learning approaches have been used to develop prediction models for asthma outcomes. For example, in a retrospective cohort study using electronic health record data from the Children’s Hospital of Philadelphia, Bose et al. [6] trained five machine learning models to identify children with asthma diagnosed before age 5 who would continue to have asthma-related visits. The gradient boosted trees method was found to have the best predictive performance among the models compared. As another example, using data from the Isle of Wight Birth Cohort, Kothalawala et al. [3] found that support vector machine algorithms provided the best performance, as measured by the area under the curve (AUC), compared to other approaches such as decision tree, random forest, and multilayer perceptron. A comprehensive review of methods for predicting childhood asthma can be found elsewhere [2].

While many machine learning approaches have been explored in predicting asthma, Bayesian machine learning approaches have received very limited attention [2]. Nonetheless, Bayesian approaches have demonstrated substantial promise in simulation studies and real-world applications. Bayesian additive regression trees (BART) [7, 8] approach is one example that facilitates variable selection, interaction detection, disease outcome prediction, and model uncertainty quantification. This approach, which combines the strengths of both machine learning and Bayesian inference, has the potential to yield a better prediction performance and understanding of the important predictors of childhood asthma. Moreover, while many studies have developed prediction models for childhood asthma, there is a knowledge gap as to what risk factors and prediction models can effectively predict asthma in early childhood [2].

The goal of the current study is to develop a prediction model for childhood physician-diagnosed asthma using BART and to identify potential risk factors and their interaction effects in the CHILD Cohort Study. The predictive performance of BART is compared against other commonly explored machine learning methods using a variety of model performance criteria.

Participants from the CHILD Cohort Study [9] are included in the current analysis. The CHILD Cohort Study is a prospective longitudinal birth cohort study. Pregnant women were initially recruited between 2009 and 2012 at the four study sites (i.e., Vancouver, Edmonton, Winnipeg and Toronto) across Canada. Through longitudinal follow-up, the study collects information on children and their parents via biological samples (cord blood, meconium, breast milk, urine, blood, nasal swabs, stool), questionnaires (family history, maternal stress, nutrition, child health, medications, indoor and outdoor environment), home assessments (visual home inspections, dust sampling) and clinical assessments (lung function and skin tests). After excluding participants who were ineligible at birth and those who withdrew before any data were collected, before birth, or before the 36-week review, a total of 3224 participants were eligible at 5 years of age.

Physician diagnosis of asthma at 5 years of age is the outcome of interest and is defined based on Clinical Assessment Questionnaires. Specifically, each participant underwent a specialist clinical assessment at 5 years of age by a trained pediatric respirologist or allergist. The assessment included focused allergic history inquiries, wheeze symptom history, physical examination and skin prick testing. Study physicians were asked the question: “In your opinion, does this child have asthma? (Definite/Possible/No)”. Asthma was considered definite if the parent reported more than 2 episodes of wheezing within the last year and one of the following: a prior diagnosis of asthma by a physician reported by the parent, use of a bronchodilator prescribed by a physician for coughing or wheezing episodes, use of prescribed daily controller medication such as inhaled steroids, or frequent wheezing (> 4 distinct episodes) with no alternative diagnosis. Possible asthma was recorded if there were less frequent episodes of wheezing or coughing without colds, and no report of medication use. In the current analysis, definite (“Yes”) and possible diagnoses of asthma were combined into a single category. A sensitivity analysis was performed by removing the possible asthma category.
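For concreteness, the “definite asthma” rule described above can be written as a simple boolean check. The sketch below is purely illustrative; the argument names are hypothetical and are not CHILD Study variable names.

```r
## Illustrative encoding of the "definite asthma" rule described above.
## Argument names are hypothetical, not CHILD Study variable names.
is_definite_asthma <- function(wheeze_episodes_last_year,
                               prior_physician_asthma_dx,
                               prescribed_bronchodilator_use,
                               daily_controller_use,
                               frequent_wheeze_no_alternative_dx) {
  wheeze_episodes_last_year > 2 &&
    (prior_physician_asthma_dx ||
       prescribed_bronchodilator_use ||
       daily_controller_use ||
       frequent_wheeze_no_alternative_dx)
}

## Example: > 2 wheeze episodes plus physician-prescribed bronchodilator use.
is_definite_asthma(3, FALSE, TRUE, FALSE, FALSE)  # TRUE
```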

After screening, a total of 51 clinically relevant variables collected at or before 3 years of age during the follow-up were included in the current analysis. The choice of these variables was informed by the current literature on risk factors and health outcomes associated with asthma [10,11,12,13,14]. These variables include demographic information (e.g., sex, ethnicity, income status, mother’s and father’s education status), growth characteristics (e.g., height, weight, BMI in z-score), family history of allergic diseases (e.g., wheezing, food or inhalant sensitization, and diagnosis of asthma for the mother and father individually), and questionnaire data related to the child’s symptom presence and severity (e.g., recurrent wheeze), environmental exposure (e.g., pet ownership, smoking exposure, breastfeeding status), genetic factors (e.g., genetic risk score), and biological factors (e.g., eosinophil count). A full list of variables included in the current study is provided in Table E1 of the supplementary materials.

Bayesian Additive Regression Trees (BART) is a machine learning approach that combines the flexibility of regression trees with Bayesian inference [7]. It is a powerful tool for nonparametric regression and classification, which provides a flexible approach to modeling complex relationships between variables. It can handle nonlinear main effects and multiple-way interactions without much input from the users. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of predictors. Tutorial and review of BART have been published elsewhere [15, 16].

Briefly, BART is a sum-of-trees model. For classification problems with a binary outcome, BART can be expressed as

$$P\left(y=1\mid {\varvec{x}}\right)=\Phi \left(f\left({\varvec{x}}\right)\right), \quad f\left({\varvec{x}}\right)=\sum_{t=1}^{m}g\left({\varvec{x}};{T}_{t}^{{M}_{t}}\right),$$

where \(y\) denotes the binary outcome of interest and \(\Phi\) denotes the cumulative distribution function of the standard normal distribution. The sum-of-trees model serves as an estimate of the conditional probit at \({\varvec{x}}\), which can be transformed into a conditional probability estimate of \(y=1\). In this formulation, the unknown function \(f({\varvec{x}})\) is approximated by \(m\) distinct regression trees, each composed of a tree structure, denoted by \(T\), and parameters at the terminal nodes (known as leaves), denoted by \(M\). Therefore, \({T}^{M}\) represents an entire tree with both its structure and its set of leaf parameters.

BART utilizes a regularization prior that constrains the size and fit of each tree \({T}^{M}\) to avoid overfitting, such that each tree contributes only a small part to the overall fit. For classification problems, the prior for the BART model has two components: (a) the tree structures and (b) the leaf parameters given the tree structures. They can be expressed as

$$p\left({T}_{1}^{{M}_{1}},\dots ,{T}_{m}^{{M}_{m}}\right)=\prod_{t=1}^{m}p\left({M}_{t}\mid {T}_{t}\right)p\left({T}_{t}\right), \quad p\left({M}_{t}\mid {T}_{t}\right)=\prod_{k=1}^{{b}_{t}}p\left({\mu }_{t,k}\mid {T}_{t}\right),$$

where the set of a tree’s leaf parameters is denoted as \({M}_{t}=\{{\mu }_{t,1},\dots ,{\mu }_{t, {b}_{t}}\}\), and the number of terminal nodes for a given tree as \({b}_{t}\). Each tree is assigned an independent prior. The probability that a node at depth \(d\) splits (i.e., is not terminal) is \(\alpha {\left(1+d\right)}^{-\beta }\), for \(\alpha \in \left(0, 1\right)\) and \(\beta \in [0, \infty )\). The hyperparameters are set to \(\alpha =0.95\) and \(\beta =2\) by default to strongly favor small trees [7].
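As a worked illustration of how this prior penalizes depth under the default settings, the prior probability that a node splits is \(0.95{\left(1+0\right)}^{-2}=0.95\) at depth 0, \(0.95{\left(1+1\right)}^{-2}\approx 0.24\) at depth 1, and \(0.95{\left(1+2\right)}^{-2}\approx 0.11\) at depth 2, so trees more than two or three levels deep are strongly discouraged a priori.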

To determine variable importance for prediction, BART calculates the proportion of times each predictor is chosen as a splitting rule, divided by the total number of splitting rules appearing in the model. This allows straightforward evaluation and interpretation of variable importance. Three variable selection rules are available [17]: (a) Local threshold: a predictor is included if its variable inclusion proportion exceeds the \(1-\alpha\) quantile of its null distribution. (b) Global Max threshold: a predictor is included if its variable inclusion proportion exceeds the \(1-\alpha\) quantile of the distribution of the maximum of the null variable inclusion proportions from each permutation of the response. (c) Global SE threshold: a predictor is included if its variable inclusion proportion exceeds a threshold based on the mean and standard deviation of its null distribution, with a global multiplier shared by all predictors.
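For readers using the bartMachine package, the inclusion proportions and the three permutation-based thresholds can be obtained roughly as sketched below. Here bm denotes a fitted bartMachine classification model (a fitting sketch appears after the missing-data paragraph below); the argument and return-value names reflect our reading of the package documentation and should be checked against the installed version.

```r
library(bartMachine)

## bm: a bartMachine classification model fitted on the training data
## (see the fitting sketch further below).

## Average variable inclusion proportions: the share of all splitting rules
## that use each predictor, averaged over several refits of the model.
vip <- investigate_var_importance(bm, type = "splits",
                                  num_replicates_for_avg = 5, plot = TRUE)
head(sort(vip$avg_var_props, decreasing = TRUE), 10)

## Permutation-based variable selection implementing the Local, Global Max
## and Global SE thresholds of Bleich et al. [17].
sel <- var_selection_by_permute(bm, num_permute_samples = 100,
                                alpha = 0.05, plot = TRUE)
sel$important_vars_local_names       # Local threshold
sel$important_vars_global_max_names  # Global Max threshold
sel$important_vars_global_se_names   # Global SE threshold
```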

To handle missing data, BART implements the “Missing Incorporated in Attributes” (MIA) method [18]. This approach incorporates missingness by augmenting the nodes’ splitting rules to (a) also handle sorting the missing data to the left or right and (b) consider the missingness itself as a variable to be included in a splitting rule. Variable selection is then performed by evaluating the variable inclusion proportions defined above.
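A minimal sketch of fitting the classification model with MIA enabled is shown below, assuming the predictors are in a data frame X (which may contain missing values) and the outcome y is a binary factor; X, X_new, and the argument values are illustrative rather than the study's actual settings.

```r
library(bartMachine)
set_bart_machine_num_cores(4)  # optional: parallelize the Gibbs sampler

## X: data frame of predictors (may contain NAs); y: binary factor outcome.
## use_missing_data = TRUE routes missing values via the augmented (MIA)
## splitting rules; use_missing_data_dummies_as_covars additionally offers
## the missingness indicators themselves as candidate split variables.
bm <- bartMachine(X, y,
                  num_trees = 50,
                  use_missing_data = TRUE,
                  use_missing_data_dummies_as_covars = TRUE,
                  seed = 123)

## Predicted probabilities on new data (check which factor level bartMachine
## treats as the modeled class before thresholding).
p_hat <- predict(bm, X_new, type = "prob")
```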

Participants’ baseline information was summarized using mean and standard deviation (SD) for continuous variables (e.g., age) and the frequency and percentage for categorical variables (e.g., sex).

To establish a prediction model for the outcome of interest, participants were randomly divided into a training set (70%) and a testing set (30%). To implement BART, we used the bartMachine package [8]. The package offers many features for data analysis based on BART, such as handling missing data, variable selection, detection of interactions, and prediction of outcomes. Five-fold cross-validation was used to determine the size of the trees and the hyperparameters. In addition to BART, six other machine learning algorithms that have been commonly used to predict asthma were trained. These algorithms include adaptive boosting (adaboost) via the adabag package [19], logistic regression with lasso regularization via the glmnet package [20], decision tree (dt) via the rpart package [21], neural networks via the neuralnet package [22], random forest (rf) via the randomForest package [23], and support vector machine (SVM) via the e1071 package [24]. For all methods except BART, missing predictor data need to be imputed prior to the analysis. To handle missing data, we performed imputation using multivariate imputation by chained equations (MICE) via the mice R package [25]. The default method option of the mice() function was used, i.e., predictive mean matching for numeric data, logistic regression imputation for binary data, polytomous regression imputation for nominal categorical data, and the proportional odds model for ordered categorical data.
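The following condensed sketch illustrates this workflow with hypothetical object names (dat, with an outcome column asthma_5y coded as a binary factor); only two of the six comparator models are shown, and hyperparameter tuning is omitted for brevity. The BART fit itself, with missing data handled natively, is sketched above.

```r
library(mice)
library(glmnet)
library(randomForest)

set.seed(2023)

## 70/30 train/test split; `dat` holds the predictors plus the outcome
## `asthma_5y` (binary factor). Both names are hypothetical.
idx   <- sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

## MICE imputation (default methods: pmm / logreg / polyreg / polr by type)
## for the methods that cannot handle missing predictors.
train_imp <- complete(mice(train, m = 5, printFlag = FALSE), 1)
test_imp  <- complete(mice(test,  m = 5, printFlag = FALSE), 1)

## Lasso-penalized logistic regression (glmnet).
x_tr <- model.matrix(asthma_5y ~ ., data = train_imp)[, -1]
x_te <- model.matrix(asthma_5y ~ ., data = test_imp)[, -1]
fit_lasso <- cv.glmnet(x_tr, train_imp$asthma_5y, family = "binomial", alpha = 1)
p_lasso   <- predict(fit_lasso, newx = x_te, s = "lambda.min", type = "response")

## Random forest.
fit_rf <- randomForest(asthma_5y ~ ., data = train_imp)
p_rf   <- predict(fit_rf, newdata = test_imp, type = "prob")[, 2]
```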

To evaluate prediction performance, several measures were used: (a) sensitivity, defined as the ratio of true positives to the sum of true positives and false negatives; (b) specificity, defined as the ratio of true negatives to the sum of true negatives and false positives; (c) positive predictive value (PPV), defined as the ratio of true positives to the sum of true positives and false positives; (d) negative predictive value (NPV), defined as the ratio of true negatives to the sum of true negatives and false negatives; (e) F score, defined as the harmonic mean of precision (PPV) and recall (true positive rate, TPR); (f) accuracy, defined as the percentage of correctly classified cases; (g) balanced accuracy, a useful metric for imbalanced datasets, defined as the average of the sensitivity (also known as recall or true positive rate) for each class; (h) Brier score, calculated by averaging the squared differences between predicted probabilities and observed outcomes, where a lower score indicates better performance, 0 being perfect accuracy and 1 the worst possible predictive performance; (i) calibration intercept, which quantifies the overall bias in the predicted probabilities, with zero indicating perfect calibration; (j) calibration slope, which measures the extent to which the predicted probabilities are too extreme or not extreme enough, with one indicating perfect calibration (both calibration intercept and slope are derived from fitting a logistic regression model where the observed outcome is the dependent variable and the predicted probabilities are the independent variable); and (k) area under the curve (AUC), which quantifies the model's ability to rank positive instances higher than negative instances, such that a higher AUC indicates better discriminatory power. These performance measures are commonly used and capture the model's performance in various aspects. In addition, 95% confidence intervals were calculated using 100 bootstrap samples. Performance measures based on the test set are reported. Figure 1 provides an overview of the workflow of the analysis. All analyses were performed in R version 4.2.1.
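A hedged sketch of how these measures could be computed from test-set predictions is given below, where obs is a 0/1 vector of observed outcomes and p the corresponding predicted probabilities, with a 0.5 threshold assumed for the class-based metrics. The calibration model follows the description above but is fitted on the logit of the predicted probabilities, which is one common convention; the paper does not state the exact scale used.

```r
library(pROC)

eval_metrics <- function(obs, p, threshold = 0.5) {
  pred <- as.integer(p >= threshold)
  tp <- sum(pred == 1 & obs == 1); fp <- sum(pred == 1 & obs == 0)
  tn <- sum(pred == 0 & obs == 0); fn <- sum(pred == 0 & obs == 1)

  sens  <- tp / (tp + fn)                # (a) sensitivity (recall, TPR)
  spec  <- tn / (tn + fp)                # (b) specificity
  ppv   <- tp / (tp + fp)                # (c) positive predictive value
  npv   <- tn / (tn + fn)                # (d) negative predictive value
  f1    <- 2 * ppv * sens / (ppv + sens) # (e) F score
  acc   <- (tp + tn) / length(obs)       # (f) accuracy
  bacc  <- (sens + spec) / 2             # (g) balanced accuracy
  brier <- mean((p - obs)^2)             # (h) Brier score

  ## (i)-(j) calibration intercept and slope: logistic regression of the
  ## observed outcome on the logit of the predicted probabilities.
  cal <- glm(obs ~ qlogis(p), family = binomial)

  ## (k) AUC.
  auc_val <- as.numeric(auc(roc(obs, p, quiet = TRUE)))

  c(sensitivity = sens, specificity = spec, PPV = ppv, NPV = npv,
    F_score = f1, accuracy = acc, balanced_accuracy = bacc, Brier = brier,
    cal_intercept = unname(coef(cal)[1]), cal_slope = unname(coef(cal)[2]),
    AUC = auc_val)
}

## 95% bootstrap confidence intervals from 100 resamples of the test set.
boot_ci <- function(obs, p, B = 100) {
  stats <- replicate(B, {
    i <- sample(seq_along(obs), replace = TRUE)
    eval_metrics(obs[i], p[i])
  })
  apply(stats, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
}
```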

Workflow for the development and testing of prediction models using different machine learning approaches

A total of 2794 children who had an asthma diagnosis recorded were included in the current analysis (Table 1), of whom 53% were male, 64% were Caucasian, 7.1% had a household income of less than $40,000, and 18% of mothers and 13.9% of fathers had education above the university level. The mean (SD) gestational age and BMI z-score at birth were 39.2 (1.37) weeks and -0.38 (1.22), respectively.

Of the 2794 children, 171 (6.2%) were diagnosed with definite asthma, 261 (9.3%) were diagnosed with possible asthma, while 2362 (84.5%) were classified as no asthma. A higher proportion of males was observed amongst children with a diagnosis of asthma compared to children with no asthma (57.4% vs. 52.2%). In addition, a lower proportion of children with asthma were Caucasian compared to children without asthma (58.8% vs. 64.9%). Compared to children with asthma, a higher proportion of children without asthma came from households with incomes above $80,000 and parents who had attained university-level education. The gestational age was similar between the two groups but children with no asthma had a lower mean BMI z-score at birth (-0.39 vs. -0.34).

Model performance of BART and other machine learning algorithms in predicting physician-diagnosed asthma (definite/possible vs. no asthma) at 5 years was assessed using measures including sensitivity, specificity, PPV, NPV, F score, balanced accuracy, accuracy, Brier score, calibration intercept, and calibration slope. These values are visualized in Fig. 2 and reported along with their corresponding 95% bootstrap confidence intervals in Table E2.

Spider chart of performance measures for machine learning algorithms under comparison based on the test set for physician-diagnosed asthma (definite/possible vs. no asthma). BART: Bayesian additive regression tree; PPV: Positive predictive value; NPV: Negative predictive value

BART yielded a comparable sensitivity (0.24 [95%CI: 0.18, 0.31]) and specificity (0.98 [95%CI: 0.97, 1]) to other machine-learning methods. The decision tree yielded the highest sensitivity (0.25 [95%CI: 0.17, 0.33]) whereas the highest specificity was found in random forest and support vector machine (0.99 [95%CI: 0.98, 1], in both). Furthermore, BART yielded a similar PPV (0.7 [95%CI: 0.56, 0.84]) and the highest NPV (0.89 [95%CI: 0.87, 0.91]) compared to other approaches. Of note, support vector machine yielded the highest PPV (0.8 [95%CI: 0.59, 1]).

Notably, BART yielded the highest F score (0.36 [95%CI: 0.29, 0.43]) compared to all other approaches, indicating that this approach has both high precision and high recall, and that it was performing reasonably well in terms of both identifying true positives and minimizing false positives and false negatives. Furthermore, BART and decision tree yielded the highest balanced accuracy (0.61 [95%CI: 0.58, 0.64] and 0.61 [95%CI: 0.57, 0.64], respectively). BART also yielded the highest accuracy (0.88 [95%CI: 0.86, 0.9]). The lowest Brier score was found in the decision tree (0.05 [95%CI: 0.04, 0.07]), suggesting it yielded the highest prediction accuracy, followed by BART (0.1 [95%CI: 0.09, 0.11]). The largest Brier score was observed in logistic regression with lasso (0.68 [95%CI: 0.66, 0.7]). The best calibration performance, measured by calibration intercept and slope, was observed in the decision tree, whereas the worst was observed in adaptive boosting.

The ROC curves along with the AUC measures and the corresponding 95% bootstrap confidence intervals for different machine learning approaches are shown in Fig. 3. The highest AUC values were associated with BART (0.75 [95%CI: 0.72, 0.79]), logistic regression with lasso (0.75 [95%CI: 0.72, 0.79]) and random forest (0.75 [95%CI: 0.71, 0.79]), suggesting a better overall predictive performance compared to other approaches. Both adaptive boosting (0.71 [95%CI: 0.67, 0.74]) and support vector machine (0.72 [95%CI: 0.69, 0.75]) yielded an AUC higher than 0.7, followed by decision tree (0.69 [95%CI: 0.65, 0.73]) and neural network (0.65 [95%CI: 0.5, 0.79]).

Receiver operating characteristic curves, areas under the curve and 95% confidence intervals for the machine learning algorithms based on the test set for physician-diagnosed asthma (definite/possible vs. no asthma)

A sensitivity analysis was performed by comparing definite asthma diagnosis and no asthma (i.e., removing the possible asthma category), and similar model performance was observed (Figure E1 & Table E3). In particular, random forest (0.85 [95%CI: 0.8, 0.9]) yielded the highest AUC (Figure E2), followed by adaptive boosting (0.84 [95%CI: 0.79, 0.89]) and BART (0.83 [95%CI: 0.78, 0.88]), respectively. The best calibration performance measured by calibration intercept and slope was observed in the decision tree whereas the worst was observed in random forest.

The inclusion proportion for a given predictor is the proportion of times that variable is chosen as a splitting rule, out of the total number of splitting rules, across the posterior draws of the sum-of-trees model. The inclusion proportions revealed that symptoms at 3 years were among the most important predictors of asthma diagnosis at 5 years of age (Fig. 4). The top five important symptoms were recurrent wheezing (RW_3y), respiratory infection (RI_3y), food sensitization (FS_3y), cough without a cold (Cough_3y), and atopic dermatitis (AD_3y). Other important predictors included the genetic risk score for asthma (GRS), inhalant sensitization at 3 years (IS_3y), and paternal asthma (Father_Asthma).

Important predictors determined by BART. A Variable inclusion proportions. The vertical bars represent 95% confidence interval. B Local procedure to determine important predictors for α = 0.05. The vertical lines (in green) are the threshold levels determined from the permutation distributions. The plotted points are inclusion proportions that should exceed the vertical line for a predictor to be selected. If the inclusion proportion is higher than the vertical bar, the variable is included and is displayed as a solid dot, otherwise, it is not included and it is displayed as an open dot. C Global Max and Global SE thresholds to determine the important predictors. The horizontal reference line (in red) is the cutoff for the Global Max threshold. Variables with inclusion proportions higher than this threshold are represented by solid dots. The vertical bars (in blue) represent the thresholds for the Global SE procedure. Variables that exceed this threshold but not the Global Max threshold are displayed as asterisks. Variables that exceed neither threshold are displayed as open dots. Abbreviation: RW_3y: Recurrent wheeze at 3 years; RI_3y: Respiratory infection at 3 years; FS_3y: Food sensitization at 3 years; Cough_3y: Cough without cold at 3 years; AD_3y: Atopic dermatitis at 3 years; GRS: Genetic risk score for asthma; IS_3y: Inhalant sensitization at 3y; Father_Asthma: Father asthma; zhei_birth: height z-score at birth; zwei_12m: weight z-score at 1 year; Wheeze_mother: Mother wheezing

A similar set of important predictors was identified based on the Local procedure and the Global Max and Global SE threshold procedures (Fig. 4). Specifically, seven variables exceeded the green-line threshold: recurrent wheezing at 3 years (RW_3y), respiratory infection at 3 years (RI_3y), food sensitization at 3 years (FS_3y), cough without cold at 3 years (Cough_3y), atopic dermatitis at 3 years (AD_3y), genetic risk score (GRS), and paternal asthma (Father_Asthma). Recurrent wheeze at 3 years (RW_3y) was the only variable that passed both the Global SE and Global Max thresholds, suggesting it is the most important predictor of asthma at 5 years.

Interaction effects in a prediction model play a crucial role in capturing and accounting for complex relationships between variables. Algorithms based on sums of regression trees, such as BART, have a greater ability than single trees to flexibly fit interactions and non-linearities. In a given tree, variables are considered to interact only when they appear together in a continuous downward path from the root node to a terminal node. Figure 5 shows the average relative importance of the top 10 interaction terms, with 95% confidence intervals shown as segments atop the bars. The three most important interaction effects were those between respiratory infection (RI_3y) and recurrent wheezing (RW_3y) at 3 years of age, between maternal asthma (Mother_Asthma) and paternal asthma (Father_Asthma), and between maternal wheezing (Wheeze_Mother) and inhalant sensitization at 3 years (IS_3y).

The top 10 interaction effects based on relative importance. The bars represent 95% confidence intervals. Abbreviation: Father_Asthma: Father asthma; Mother_Asthma: Mother asthma; Cough_3y: Cough without cold at 3 years; EBF_6m: Exclusive breastfeeding at 6 months; Wheeze_Mother: Mother wheeze; IS_3y: Inhalant sensitization at 3 years; RI_3y: Respiratory infection at 3 years; RW_3y: Recurrent wheeze at 3 years; FS_3y: Food sensitization at 3 years; AD_3y: Atopic dermatitis at 3 years; Mother_ED: Mother’s education status; zhei_36m: height z-score at 36 months; zhei_3m: height z-score at 3 months; zwei_12m: weight z-score at 1 year; GA: gestational age (weeks)
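In bartMachine, these pairwise interaction importances can be extracted as sketched below from the same hypothetical fitted model bm; the function averages, over several refits of the model, how often each pair of predictors appears together on a root-to-leaf path. The argument names reflect our reading of the package documentation and should be verified against the installed version.

```r
library(bartMachine)

## Average pairwise interaction counts across several refits of `bm`; the plot
## displays the top pairs with uncertainty bars (num_var_plot limits how many
## are shown).
interaction_investigator(bm, num_replicates_for_avg = 5,
                         num_var_plot = 10, plot = TRUE)
```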

In this study, we developed a prediction model for childhood asthma using Bayesian Additive Regression Trees (BART) and compared its performance with other commonly used machine learning algorithms. The predictive capabilities of BART were demonstrated through competitive performance across multiple measures, including sensitivity, specificity, positive and negative predictive values, F-score, accuracy, balanced accuracy, and AUC.

Studies focusing on the prediction of childhood asthma have demonstrated considerable variability in terms of sample size, methodology, and achieved prediction accuracy. For instance, Chatzimichail et al. [26, 27] conducted research on a cohort of 112 children under the age of 5 diagnosed with asthma. Utilizing multi-layer perceptron neural network and support vector machine techniques, they achieved an impressive prediction accuracy of approximately 95% for asthma development between the ages of 7 and 14. Another study by Smolinska et al. [28] examined a larger cohort of 252 children (202 of whom had recurrent wheezing) aged 2 to 4 years. Their approach involved employing the random forest algorithm to predict asthma status at 6 years of age, resulting in a prediction accuracy of 80%. In studies with even larger sample sizes, exceeding 9000 participants, the focus was on utilizing respiratory symptoms before age 5 to predict the occurrence of school-age asthma. These studies achieved prediction accuracies ranging between 74% [29] and 81% [5]. In this context, our study adds to the expanding body of research that leverages advanced machine-learning techniques for predictive modeling in childhood asthma. Our approach, based on a birth cohort study, explored the performance of seven distinct machine-learning algorithms. The outcomes revealed prediction accuracies spanning from 0.84 to 0.88 (Table E2). Among these algorithms, BART demonstrated the highest accuracy of 0.88 (95% CI: 0.86, 0.9), emphasizing its potential as a robust tool for pediatric asthma prediction (Table E2). However, we note that all the algorithms under comparison suffer from poor sensitivity; therefore, they may not be a good diagnostic test but a potentially good screening test for physician-diagnosed asthma. This finding contributes to the growing understanding of the capabilities and limitations of machine-learning methods in childhood asthma prediction, further paving the way for improved patient care and management strategies.

The process of selecting and determining the significance of predictors for inclusion in an analysis poses significant challenges. Many studies have tackled this task by considering a diverse array of potential predictors spanning sociodemographic, medical, environmental, genetic, and clinical domains [5, 26,27,28]. However, the methods employed to select these predictors have varied across studies, potentially leading to disparities in the final set of variables under consideration. Despite this variability, a consistent trend emerges, with wheezing, cough, and medication consistently being identified as important predictors in the majority of finalized predictive models [26, 27, 30]. In our current analysis, BART highlighted key predictors, including wheezing and coughing at 3 years of age, alongside other factors such as respiratory infection, and sensitization to food and inhalants. Notably, these influential key predictors identified by the data-driven approach are consistent with the definition of the asthma diagnosis which includes allergic history inquiries, wheeze symptom history, and skin prick testing. Therefore, given the inherent relationship between these predictors measured before and at age 3 and the subsequent outcome at age 5, both obtained from questionnaire data, it is expected that these variables are chosen as the most influential predictors in the models. Importantly, while much of the literature has focused on individual predictor variables, the exploration of interactions between predictors has remained relatively underexplored. Our analysis seeks to bridge this gap by shedding light on potential interactions between predictors. bartMachine, however, does not provide signs of directionality for the interaction effects it identifies. This limits the accurate interpretation of these effects and can be considered as a limitation.

In the area of predictive modeling for childhood asthma, BART emerges as a compelling tool. Its flexibility in capturing intricate non-linear relationships between predictors and outcomes proves particularly advantageous in deciphering the multifaceted nature of childhood asthma dynamics. Moreover, BART's capability for automated variable selection and interaction detection streamlines the model-building process and enhances interpretability. The provision of uncertainty estimates allows clinicians and researchers to gauge prediction reliability, which is crucial for personalized intervention strategies.

Several research directions can be considered in future studies. While several studies have explored the use of genetic and clinical data for asthma prediction, there appears to be a gap in the integration of multi-omics data. Future research could explore the potential benefits of combining genomics, transcriptomics, proteomics, and other omics data to enhance prediction accuracy and gain a more comprehensive understanding of asthma phenotypes. Moreover, many machine learning models, particularly complex ones, lack transparency and interpretability. Addressing this gap by developing models that offer understandable explanations for their predictions could enhance their clinical utility and acceptance, particularly in medical decision-making scenarios. Furthermore, while many of the studies present promising results, a potential knowledge gap is the lack of extensive external validation and generalizability assessments. Future research could validate BART in an external cohort to evaluate its reliability and generalizability, and compare it with other Bayesian approaches such as Bayesian model averaging [31, 32] and the Bayesian support vector machine [33]. Finally, it is crucial to acknowledge that all the methods compared in the current study present a very low sensitivity. This could potentially be attributed to an imbalance in the outcome variable. Further investigation and tuning may be necessary to enhance the models' sensitivity and overall performance.

The CHILD Cohort Study data that support the findings of this paper are not openly available due to study participant consent restrictions. Information on accessing CHILD Cohort Study data and samples is available here: https://childstudy.ca/for-researchers/data-access/.

BART: Bayesian additive regression trees

AUC: Area under the curve

ROC: Receiver operating characteristics

SD: Standard deviation

MICE: Multivariate imputation by chained equations

SVM: Support vector machine

NPV: Negative predictive value

PPV: Positive predictive value

TPR: True positive rate

CI: Confidence interval

RW: Recurrent wheeze

RI: Respiratory infection

FS: Food sensitization

AD: Atopic dermatitis

IS: Inhalant sensitization

AR: Atopic rhinitis

GRS: Genetic risk score

EBF: Exclusive breastfeeding

Subbarao P, Mandhane PJ, Sears MR. Asthma: epidemiology, etiology and risk factors. CMAJ. 2009;181(9):E181–90.

Patel D, Hall GL, Broadhurst D, Smith A, Schultz A, Foong RE. Does machine learning have a role in the prediction of asthma in children? Paediatr Respir Rev. 2022;41:51–60.

Kothalawala DM, Murray CS, Simpson A, et al. Development of childhood asthma prediction models using machine learning approaches. Clinical and Translational Allergy. 2021;11(9):e12076.

Prosperi MC, Marinho S, Simpson A, Custovic A, Buchan IE. Predicting phenotypes of asthma and eczema with machine learning. BMC Med Genomics. 2014;7:1–10.

Bose S, Kenyon CC, Masino AJ. Personalized prediction of early childhood asthma persistence: a machine learning approach. PLoS ONE. 2021;16(3):e0247784.

Kothalawala DM, Kadalayil L, Weiss VB, et al. Prediction models for childhood asthma: a systematic review. Pediatr Allergy Immunol. 2020;31(6):616–27.

Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4(1):266–98.

Kapelner A, Bleich J. bartMachine: Machine Learning with Bayesian Additive Regression Trees. J Stat Softw. 2016;70(4):1–40.

Subbarao P, Anand SS, Becker AB, et al. The Canadian Healthy Infant Longitudinal Development (CHILD) Study: examining developmental origins of allergy and asthma. Thorax. 2015;70(10):998–1000.

Tse SM, Rifas-Shiman SL, Coull BA, Litonjua AA, Oken E, Gold DR. Sex-specific risk factors for childhood wheeze and longitudinal phenotypes of wheeze. J Allergy Clin Immunol. 2016;138:1561–8. e6.

Jaakkola J, Nafstad P, Magnus P. Environmental tobacco smoke, parental atopy, and childhood asthma. Environ Health Perspect. 2001;109:579–82.

Lodge CJ, Allen KJ, Lowe AJ, et al. Perinatal cat and dog exposure and the risk of asthma and allergy in the urban environment: a systematic review of longitudinal studies. Clin Dev Immunol. 2012;2012.

Melen E, Wickman M, Nordvall S, Van Hage‐Hamsten M, Lindfors A. Influence of early and current environmental exposure factors on sensitization and outcome of asthma in pre‐school children. Allergy. 2001;56:646–52.

Pividori M, Schoettler N, Nicolae DL, Ober C, Im HK. Shared and distinct genetic risk factors for childhood-onset and adult-onset asthma: genome-wide and transcriptome-wide studies. Lancet Respir Med. 2019;7:509–22.

Tan YV, Roy J. Bayesian additive regression trees and the General BART model. Stat Med. 2019;38(25):5048–69.

Hill J, Linero A, Murray J. Bayesian additive regression trees: A review and look forward. Ann Rev Stat Appl. 2020;7:251–78.

Bleich J, Kapelner A, George EI, Jensen ST. Variable selection for BART: An application to gene regulation. Ann Appl Stat. 2014;8(3):1750–81.

Twala BE, Jones M, Hand DJ. Good methods for coping with missing data in decision trees. Pattern Recogn Lett. 2008;29(7):950–6.

Alfaro E, Gamez M, Garcia N. adabag: An R package for classification with boosting and bagging. J Stat Softw. 2013;54:1–35.

Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.

Therneau T, Atkinson B, Ripley B, Ripley MB. Package ‘rpart’. 2015. Available online: http://cran.ma.ic.ac.uk/web/packages/rpart/rpart.pdf. Accessed on 20 Apr 2016.

Günther F, Fritsch S. Neuralnet: training of neural networks. R J. 2010;2(1):30.

Liaw A, Wiener M. Package ‘randomForest’. Berkeley, CA, USA: University of California, Berkeley; 2018.

Karatzoglou A, Meyer D, Hornik K. Support vector machines in R. J Stat Softw. 2006;15:1–28.

Van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.

Chatzimichail E, Paraskakis E, Rigas A. An evolutionary two-objective genetic algorithm for asthma prediction. In: 2013 UKSim 15th International Conference on Computer Modelling and Simulation. Cambridge: IEEE; 2013. p. 90–94.

Chatzimichail E, Paraskakis E, Sitzimi M, Rigas A. An intelligent system approach for asthma prediction in symptomatic preschool children. Comput Math Methods Med. 2013;2013:240182.

Smolinska A, Klaassen EM, Dallinga JW, et al. Profiling of volatile organic compounds in exhaled breath as a strategy to find early predictive signatures of asthma in children. PLoS ONE. 2014;9(4):e95668.

AlSaad R, Malluhi Q, Janahi I, Boughorbel S. Interpreting patient-Specific risk prediction using contextual decomposition of BiLSTMs: application to children with asthma. BMC Med Inform Decis Mak. 2019;19:1–11.

Harvey JL, Kumar SAP. Machine learning for predicting development of asthma in children. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI). Xiamen: IEEE; 2019. p. 596–603.

Fragoso TM, Bertoli W, Louzada F. Bayesian model averaging: A systematic review and conceptual classification. Int Stat Rev. 2018;86(1):1–28.

Lu Z, Lou W. Bayesian approaches to variable selection: a comparative study from practical perspectives. Int J Biostat. 2022;18(1):83–108.

Wenzel F, Galy-Fajou T, Deutsch M, Kloft M. Bayesian nonlinear support vector machines for big data. In: Ceci M, Hollmén J, Todorovski L, Vens C, Džeroski S, editors. Machine learning and knowledge discovery in databases. ECML PKDD 2017. Lecture notes in computer science. Cham: Springer; 2017;10534:447–63.

We thank the CHILD Cohort Study (CHILD) participant families for their dedication and commitment to advancing health research. CHILD was initially funded by CIHR and AllerGen NCE. Visit CHILD at childcohort.ca.

ZL is supported by a Discovery Grant funded by the Natural Sciences and Engineering Research Council of Canada, and by an Early Career Researcher Award in Asthma jointly funded by the Canadian Institutes of Health Research, Institute of Circulatory and Respiratory Health, the Canadian Allergy, Asthma and Immunology Foundation, AstraZeneca Canada, Asthma Canada, and the Canadian Lung Association. Padmaja Subbarao holds a Tier 1 Canada Research Chair in Pediatric Asthma and Lung Health. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, or the writing of the manuscript.

Department of Public Health Sciences, Queen’s University, Kingston, ON, K7L 3N6, Canada

Mojtaba Ahmadiankalati, Himani Boury & Zihang Lu

Department of Pediatrics and Translational Medicine, SickKids Research Institute, The Hospital for Sick Children, Toronto, ON, Canada

Padmaja Subbarao

Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada

Padmaja Subbarao & Wendy Lou

Department of Mathematics and Statistics, Queen’s University, Kingston, ON, Canada

Zihang Lu

MA, HB and ZL designed the study and conducted the literature review. MA and ZL conducted the statistical analysis and drafted the manuscript. ZL, WL, and PS interpreted the results and substantially revised the manuscript. All authors have reviewed and approved the manuscript.

Correspondence to Zihang Lu.

The current study was approved by Queen’s University Ethics Board. Informed consent was obtained from the parents/guardians of all study participants and from the study participants if they had the capacity to do so.

Not applicable.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Ahmadiankalati, M., Boury, H., Subbarao, P. et al. Bayesian additive regression trees for predicting childhood asthma in the CHILD cohort study. BMC Med Res Methodol 24, 262 (2024). https://doi.org/10.1186/s12874-024-02376-2

Received: 02 December 2023

Accepted: 17 October 2024

Published: 01 November 2024

DOI: https://doi.org/10.1186/s12874-024-02376-2
