Identifying and characterizing how mixtures of exposures are associated with health endpoints is challenging. We demonstrate how classification and regression trees can be used to generate hypotheses regarding joint effects from exposure mixtures.
In this paper we describe how classification and regression trees (C&RT) can be used as an alternative method for identifying complex joint effects, including interactions, for multiple exposures. The proposed approach expands the applicability of C&RT to epidemiologic research by demonstrating how it can be used for risk estimation. We view this method as a means to generate hypotheses about joint effects that may merit further investigation. We illustrate this approach with an investigation of the effect of outdoor air pollutant concentrations on emergency department visits for pediatric asthma.
Perhaps the most important way in which the proposed algorithm differs from available C&RT programs is in its control for confounding. Rarely in observational epidemiologic research are we immune to the hazards of confounding. Nonetheless, because most C&RT programs were developed for prediction and classification, not causal inference, they do not directly account for confounding. The typical C&RT approach is to consider all covariates one at a time in the search for the optimal split [7]; however, this one-at-a-time approach ignores confounding. One approach to handling confounding is to first remove the association with the confounders and then fit a regression tree to the residuals [15]; unfortunately, this approach is appropriate only for Gaussian outcomes and cannot be easily applied to the residuals from generalized linear models (e.g., binomial or Poisson data) [16]. Conditional inference trees, first proposed by Hothorn et al. in 2006, offer a framework for recursive partitioning in which the best split is chosen conditional on all possible model splits [17]; however, this approach requires that all covariates in the conditional model be eligible for partitioning. The C&RT algorithm we propose differentiates exposure covariates from control covariates: it allows user-defined a priori control of confounding while restricting the selection of the optimal splits to the exposure covariates, making the approach better suited to epidemiologic research when effect estimation is of interest. Bertolet et al. identified many of the same limitations of existing C&RT approaches and presented a similar method that uses classification and regression trees to control for confounding with Cox proportional hazards models and survival data [18].
While most C&RT packages use measures of node impurity, such as the Gini index for classification trees and least squares for regression trees [7], to guide splitting decisions, there are situations in which other criteria may be justifiable. One approach is to base the best split on statistical significance, as was done in this paper and has been favored by others [17, 18]. Selecting splits based on the smallest P-value (or largest chi-square statistic) illustrates how recursive partitioning can be used to capture the strongest association present in the data.
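To make this concrete, the sketch below shows one way a confounder-adjusted, significance-based split search could be implemented. It is a minimal illustration, not the authors' software: the data frame, the variable names (daily asthma emergency department visit counts with temperature and day of week as control covariates), and the Poisson model are assumptions. The key point is that the control covariates appear in every candidate model, only the exposure covariates are eligible to define splits, and the split with the smallest likelihood-ratio P-value (largest chi-square) is chosen.

```r
# Sketch: pick the best binary split among exposure covariates while always
# adjusting for control covariates. All variable names are hypothetical.
best_split <- function(dat, exposures) {
  # Confounder-only model; `visits` is a daily count, `temp` and `dow` are controls.
  base <- glm(visits ~ temp + dow, family = poisson(), data = dat)
  best <- list(p = Inf)
  for (expo in exposures) {
    for (cutpt in quantile(dat[[expo]], probs = seq(0.1, 0.9, by = 0.1))) {
      dat$split_ind <- as.integer(dat[[expo]] > cutpt)        # candidate split
      fit <- glm(visits ~ temp + dow + split_ind, family = poisson(), data = dat)
      p   <- anova(base, fit, test = "Chisq")[2, "Pr(>Chi)"]  # likelihood-ratio test
      if (!is.na(p) && p < best$p) best <- list(p = p, exposure = expo, cut = cutpt)
    }
  }
  best  # exposure, cut point, and P-value of the most significant split
}
```

The same search would then be repeated within each daughter node until a stopping rule (for example, a minimum node size or a significance threshold) is met.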
With advances in science and technology, high-dimensional datasets are increasingly common, leading many researchers to ask how best to characterize and analyze these mixtures of exposures. Many issues arise when dealing with mixtures, including exposure covariation, physiological and chemical interaction, joint effects, and novel exposure metrics. Classification and regression trees offer an alternative to traditional regression approaches and may be well suited to identifying complex patterns of joint effects in the data. While recursive partitioning approaches such as C&RT are not new, they are seldom used in epidemiologic research. We believe that the aforementioned modifications to the C&RT algorithm, namely the differentiation of exposure and control covariates to account for confounding and the withholding of a referent group, can aid researchers interested in generating hypotheses about exposure mixtures.
Random forests (RFs) have been widely used as a powerful classification method. However, because of the randomization in both bagging and feature selection, the trees in the forest tend to select uninformative features for node splitting, which can give RFs poor accuracy when working with high-dimensional data. In addition, the feature selection process in RFs is biased in favor of multivalued features. Aiming to debias feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove the uninformative features using a p-value assessment, and a subset of unbiased features is then selected based on statistical measures. This feature subset is then partitioned into two subsets, and a feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach generates more accurate trees while reducing the dimensionality and the amount of data needed for learning RFs. An extensive set of experiments was conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs built with the proposed approach outperform existing random forests in terms of both accuracy and AUC.
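As a rough sketch of how such a pipeline could look (an illustration under our own assumptions, not the authors' implementation), the R code below screens features with per-feature chi-square tests against the class label, partitions the retained features into "strong" and "weak" subsets by their test statistics, and then draws a node subspace by weighted sampling that favors the strong subset. All thresholds, proportions, and function names are illustrative.

```r
# Illustrative sketch: p-value screening and weighted subspace sampling,
# assuming a numeric feature matrix `X` and a class label vector `y`.
screen_and_sample <- function(X, y, alpha = 0.05, mtry = floor(sqrt(ncol(X)))) {
  # 1. Per-feature association with the class (features discretized for the test).
  stats <- apply(X, 2, function(x) {
    tst <- suppressWarnings(chisq.test(table(cut(x, breaks = 5), y)))
    c(stat = unname(tst$statistic), p = tst$p.value)
  })
  keep <- which(stats["p", ] < alpha)           # drop uninformative features
  # 2. Partition retained features into strong and weak subsets.
  s      <- stats["stat", keep]
  strong <- keep[s >= median(s)]
  weak   <- keep[s <  median(s)]
  # 3. Weighted sampling: draw most of the node subspace from the strong subset.
  pick <- function(v, k) v[sample.int(length(v), min(k, length(v)))]
  chosen_strong <- pick(strong, ceiling(0.8 * mtry))
  chosen_weak   <- pick(weak, mtry - length(chosen_strong))
  c(chosen_strong, chosen_weak)                 # candidate features for one node
}
```

In a full implementation, this sampling step would replace the uniform feature sampling used at each node when growing the trees of the forest.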
RFs have shown excellent performance for both classification and regression problems. The RF model works well even when the predictive features contain irrelevant features (noise), and it can be used when the number of features is much larger than the number of samples. However, because of the randomization in both bagging and feature selection, RFs can give poor accuracy when applied to high-dimensional data. The main cause is that, when a tree is grown from a bagged sample, the subspace of features randomly sampled from thousands of features to split a node is often dominated by uninformative features (noise), and a tree grown from such a subspace has low predictive accuracy, which affects the final prediction of the RF. Furthermore, Breiman et al. noted that feature selection in the classification and regression tree (CART) model is biased because it is based on an information criterion; this is known as the multivalue problem [2]. Split selection tends to favor features with more values (e.g., fewer missing values, many categorical levels, or many distinct numerical values), even if those features are less important than others or have no relationship with the response feature [3, 4].
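This multivalue bias is easy to probe with a toy simulation. In the sketch below (an invented example, not data from the paper), a completely uninformative categorical feature with 20 levels competes against a genuinely informative numeric feature for the splits of a CART tree grown with the rpart package; inspecting the root split and the variable importance shows how much credit the many-valued noise feature can attract.

```r
# Toy illustration of the multivalue problem (hypothetical simulation).
library(rpart)

set.seed(1)
n     <- 300
x1    <- rnorm(n)                                 # informative numeric feature
noise <- factor(sample(letters[1:20], n, TRUE))   # uninformative 20-level feature
y     <- factor(rbinom(n, 1, plogis(0.8 * x1)))   # outcome depends only on x1

fit <- rpart(y ~ x1 + noise, method = "class",
             control = rpart.control(cp = 0, minsplit = 20))
fit$frame$var[1]          # variable chosen for the root split
fit$variable.importance   # credit assigned to each feature
```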
Random forests are an ensemble approach that makes classification decisions by voting over the results of individual decision trees. An ensemble learner with excellent generalization accuracy has two properties: high accuracy of each component learner and high diversity among the component learners [5]. Unlike other ensemble methods such as bagging [1] and boosting [6, 7], which create base classifiers from random samples of the training data, the random forest approach creates base classifiers from randomly selected subspaces of the data [8, 9]. The randomly selected subspaces increase the diversity of the base classifiers learned by a decision tree algorithm.
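A minimal sketch of this idea is given below, assuming the predictors are held in a data frame `X` (with matching column names in `Xnew`) and the class labels in a factor `y`. Each base tree is grown with rpart on a bootstrap sample restricted to a random feature subspace, and the ensemble predicts by majority vote; this illustrates the random subspace idea only and is not a full random forest (there is no per-node feature resampling).

```r
# Minimal random-subspace ensemble with majority voting (illustrative sketch).
library(rpart)

subspace_forest <- function(X, y, ntree = 50, mtry = floor(sqrt(ncol(X)))) {
  lapply(seq_len(ntree), function(i) {
    rows  <- sample(nrow(X), replace = TRUE)            # bootstrap sample (bagging)
    feats <- sample(ncol(X), mtry)                      # random feature subspace
    dat   <- data.frame(y = y[rows], X[rows, feats, drop = FALSE])
    list(tree = rpart(y ~ ., data = dat, method = "class"), feats = feats)
  })
}

predict_forest <- function(forest, Xnew) {
  votes <- sapply(forest, function(m) {
    as.character(predict(m$tree, newdata = Xnew[, m$feats, drop = FALSE],
                         type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote per row
}
```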
For feature weighting techniques, Xu et al. [13] recently proposed an improved RF method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data; the feature weights were calculated using the information gain ratio or a statistical test. Ye et al. [14] then used these weights in a stratified sampling method to select feature subspaces for RF in classification problems, and Chen et al. [15] applied a similar stratified idea to propose a new clustering method. However, the implementation of the random forest model suggested by Ye et al. is based on a binary classification setting and uses linear discriminant analysis as the splitting criterion; this stratified RF model is not efficient on high-dimensional datasets with multiple classes. Taking a similar approach to the two-class problem, Amaratunga et al. [16] presented a feature weighting method for subspace sampling to deal with microarray data, in which an analysis-of-variance test is used to compute weights for the features. Genuer et al. [12] proposed a strategy that ranks explanatory features using RF importance scores and introduces features in a stepwise ascending manner. Deng and Runger [17] proposed a guided regularized RF (GRRF), in which importance scores from an ordinary random forest (RF) are used to guide the feature selection process. They found that the least regularized subset selected by GRRF with minimal regularization gives better accuracy than the complete feature set. However, a regular RF was used as the final classifier because a regularized RF may have higher variance than RF, as its trees are correlated.
In the experiments, we use a bag-of-words representation of image features for the Caltech and Horse datasets. To obtain feature vectors with the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features, a clustering technique is used to cluster them, and the cluster centers serve as the visual code words that form the visual codebook. Each image is then represented as a histogram over these visual words, and a classifier is learned from this feature set for classification.
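A simplified sketch of the codebook construction and histogram encoding is shown below; it assumes the local descriptors have already been extracted (one matrix of descriptor rows per image) and uses plain k-means from base R. The function names, the codebook size, and the omission of a real descriptor extractor are simplifications for illustration.

```r
# Sketch of bag-of-visual-words encoding. `train_desc` is assumed to be a list
# with one matrix of local descriptors (rows = patches) per training image.
build_codebook <- function(train_desc, k = 200) {
  all_desc <- do.call(rbind, train_desc)          # pool descriptors from all images
  kmeans(all_desc, centers = k, iter.max = 50)$centers
}

encode_image <- function(desc, codebook) {
  # Assign each local descriptor to its nearest visual word ...
  d <- as.matrix(dist(rbind(codebook, desc)))
  d <- d[-seq_len(nrow(codebook)), seq_len(nrow(codebook)), drop = FALSE]
  words <- max.col(-d)                            # nearest codebook center per patch
  # ... and represent the image as a normalized histogram over the codebook.
  tabulate(words, nbins = nrow(codebook)) / nrow(desc)
}
```

The resulting histograms are the fixed-length feature vectors on which the forest models are trained.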
The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended for classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. For each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold. The experimental results were evaluated with two measures, AUC and test accuracy, according to (9).
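For reference, a simplified version of this evaluation protocol, using the standard randomForest classifier in place of the models compared in the paper, could look as follows; the data frame `dat`, its factor class column `label`, and the use of the pROC package for the AUC are assumptions of the sketch.

```r
# Sketch of 10-fold cross-validation with 500 trees, reporting accuracy and AUC
# (binary classification assumed; variable names are illustrative).
library(randomForest)
library(pROC)

set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(dat)))     # random fold assignment
res <- sapply(1:10, function(k) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  rf    <- randomForest(label ~ ., data = train, ntree = 500)
  prob  <- predict(rf, test, type = "prob")[, 2]       # predicted class probabilities
  pred  <- predict(rf, test, type = "response")
  c(accuracy = mean(pred == test$label),
    auc      = as.numeric(auc(roc(test$label, prob, quiet = TRUE))))
})
rowMeans(res)   # average test accuracy and AUC over the 10 folds
```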