Appendix A. Technical details and additional capabilities of random forests.
TECHNICAL DETAILS
The Gini Index
For the K-class classification problem the Gini index is defined to be $G = \sum_{k=1}^{K} p_k (1 - p_k)$, where $p_k$ is the proportion of observations at the node in the kth class. The index takes its minimum value, 0, when one of the $p_k$ takes the value 1 and all the others have the value 0. In this case the node is said to be pure and no further partitioning of the observations in that node will take place. The Gini index takes its maximum value, $1 - 1/K$, when all the $p_k$ take on the value 1/K, so the observations at the node are spread equally among the K classes. The Gini index for an entire classification tree is a weighted sum of the values of the Gini index at the terminal nodes, with the weights being the numbers of observations at the nodes. Thus, in the selection of the next node to split on, nodes which have large numbers of observations but for which only small improvements in the $p_k$ can be realized may be offset against nodes that have small numbers of observations but for which large improvements in the $p_k$ are possible.
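The definition above can be checked with a few lines of Python. This is a minimal sketch rather than code from any RF implementation; the function names `gini_index` and `tree_gini` and the toy labels are illustrative assumptions.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of one node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def tree_gini(terminal_nodes):
    """Weighted sum of node Gini values, weighted by the node sizes."""
    return sum(len(node) * gini_index(node) for node in terminal_nodes)

pure = ["a"] * 10                  # one class only: a pure node
even = ["a", "b", "c", "d"] * 5    # K = 4 classes, each with p_k = 1/4

print(gini_index(pure))            # 0.0, the minimum
print(gini_index(even))            # 0.75, the maximum 1 - 1/K for K = 4
print(tree_gini([pure, even]))     # 10 * 0.0 + 20 * 0.75 = 15.0
```

Because the tree-level index weights each node by its size, a large node with a small possible reduction in impurity can contribute as much as a small node with a large one.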
The Number of Variables Available for Splitting at Each Node
The parameter mtry controls the number of variables available for splitting at each node in a tree in a random forest. The default values of mtry are (the integer value of) the square root of the number of variables for classification, and the number of variables divided by three for regression. The smaller value of mtry for classification is to ensure that the fitted classification trees in the random forest have small pairwise correlations, a characteristic that is not needed for regression trees in a random forest. In principle, for both applications, if there is strong predictive capability in a few variables then a smaller value of mtry is appropriate, and if the data contains a lot of variables that are weakly predictive of the response variable larger values of mtry are appropriate. In practice, RF results are quite insensitive to the values of mtry that are selected. To illustrate this point, for Verbascum thapsus in the Lava Beds NM invasive plants data, we ran RF five times at the default settings (mtry = 5) and obtained out-of-bag PCC values of 95.3%, 95.2%, 95.2%, 95.3%, 95.4%. Next, we ran RF once for each of the following values of mtry: 3, 4, 6, 7, 10, and 15. The out-of-bag PCC values for these six cases were 95.2%, 95.3%, 95.3%, 95.2%, 95.3%, and 95.3%. So, in this example, decreasing mtry to three and increasing it to half the total number of predictor variables had no effect on the correct classification rates. The other metrics we used (sensitivity, specificity, kappa, and AUC) exhibited the same kind of stability to changes in the value of mtry.
The R implementation of RF (Liaw and Wiener 2002) contains a function called tuneRF which will automatically select the optimal value of mtry with respect to out-of-bag correct classification rates. We have not used this function, in part because the performance of RF is insensitive to the chosen value of mtry, and in part because there is no research as yet to assess the effects of choosing RF parameters such as mtry to optimize out-of-bag error rates on the generalization error rates for RF.
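The default rule for mtry described in this section is simple arithmetic. The sketch below is illustrative only: `default_mtry` is a hypothetical helper, the lower bound of 1 is a guard added here, and the figure of 30 predictors is an assumed example, not a value taken from the data.

```python
import math

def default_mtry(n_predictors, task):
    """Default number of candidate variables per split, as described in the text."""
    if task == "classification":
        # Integer part of the square root of the number of predictors.
        return max(1, math.floor(math.sqrt(n_predictors)))
    if task == "regression":
        # Integer part of one third of the number of predictors.
        return max(1, n_predictors // 3)
    raise ValueError("task must be 'classification' or 'regression'")

# Hypothetical data set with 30 predictor variables.
print(default_mtry(30, "classification"))  # 5
print(default_mtry(30, "regression"))      # 10
```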
The Number of Trees in the Random Forest
Another parameter that may be controlled in RF is the number of bootstrap samples selected from the raw data, which determines the number of trees in the random forest (ntree). The default value of ntree is 500. Very small values of ntree can result in poor classification performance, but ntree = 50 is adequate in most applications. Larger values of ntree result in more stable classifications and variable importance measures, but in our experience, the differences in stability are very small for large ranges of possible values of ntree. For example, we ran RF with ntree = 50 five times for V. thapsus using the Lava Beds NM data and obtained out-of-bag PCC values of 95.2%, 94.9%, 95.0%, 95.2%, and 95.2%. These numbers show slightly more variability than the five values listed in the previous section for the default ntree = 500, but the difference is very modest.
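Each tree is grown on its own bootstrap sample, and the observations left out of that sample are the tree's out-of-bag observations. The following sketch, which assumes nothing beyond sampling with replacement, draws ntree bootstrap samples and reports the average out-of-bag fraction; the names `bootstrap_indices` and `oob_indices` and the sample size of 1000 are illustrative.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n observation indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def oob_indices(n, in_bag):
    """Indices never drawn into the bootstrap sample: the out-of-bag set."""
    in_bag_set = set(in_bag)
    return [i for i in range(n) if i not in in_bag_set]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
n, ntree = 1000, 50         # hypothetical sample size; ntree as in the example above

oob_fractions = []
for _ in range(ntree):
    bag = bootstrap_indices(n, rng)
    oob_fractions.append(len(oob_indices(n, bag)) / n)

mean_oob = sum(oob_fractions) / ntree
# Expected to be close to (1 - 1/n)^n, roughly 0.37.
print(round(mean_oob, 3))
```

Since each observation is out-of-bag for only about a third of the trees, its out-of-bag prediction is based on roughly ntree/3 trees, which is one reason very small values of ntree give less stable results.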
ADDITIONAL APPLICATIONS OF RANDOM FORESTS IN ECOLOGICAL STUDIES
In this section we describe RF's capabilities for types of statistical analyses other than classification.
Regression and Survival Analysis
RF may be used to analyze data with a numerical response variable without making any distributional assumptions about the response or predictor variables, or the nature of the relationship between the response and predictor variables. Regression trees are fit to bootstrap samples of the data and the numerical predictions of the out-of-bag response variable values are averaged. Regression functions for regression trees are piecewise constant, or "stepped." The same is true of regression functions from RF, but the steps are smaller and more numerous, allowing for better approximations of continuous functions. Prasad et al. (2006) apply RF to the prediction of abundance and basal area of four tree species in the southeastern United States. When the response variable is a survival or failure time, with or without censoring, RF may be used to compute fully non-parametric survival curves for each distinct combination of predictor variable values in the data set. The approach is similar to Cox's proportional hazards model, but does not require the proportional hazards assumption that results in all the survival curves having the same general shape. Details of survival forests may be found in Breiman and Cutler (2005).
Proximities, Clustering, and Imputation of Missing Values
The proximity, or similarity, between any two points in a dataset is defined as the proportion of times the two points occur at the same terminal node. Two types of proximities may be obtained from RF. Out-of-bag proximities, which use only out-of-bag observations in the calculations, are the default proximities. Alternatively, proximities may be computed using all the observations. At this time proximities are the subject of intense research and the relative merits of the two kinds of proximities have yet to be resolved. Calculation of proximities is very computationally intensive. For the Lava Beds NM data (n = 8251) the memory required to compute the proximities exceeded the memory available in the Microsoft Windows version of R (4 GB). The FORTRAN implementation of RF (Breiman and Cutler 2005) has an option that permits the storage of a user-specified fixed number of largest proximities for each observation and this greatly reduces the amount of memory required.
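The definition of proximity can be illustrated with a small sketch that uses all observations (not the out-of-bag variant). It assumes a hypothetical table `leaf_ids` in which entry `[i][t]` is the terminal node reached by observation i in tree t; the node numbers below are made up.

```python
def proximity_matrix(leaf_ids):
    """Proportion of trees in which each pair of observations shares a terminal node."""
    n = len(leaf_ids)
    ntree = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            same = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            prox[i][j] = prox[j][i] = same / ntree
    return prox

# Three observations, four trees (hypothetical terminal node labels).
leaf_ids = [
    [1, 4, 2, 7],
    [1, 4, 3, 7],
    [2, 5, 3, 8],
]
prox = proximity_matrix(leaf_ids)
print(prox[0][1])  # 0.75: same terminal node in 3 of 4 trees
print(prox[0][2])  # 0.0: never in the same terminal node
print(prox[1][2])  # 0.25: same terminal node in 1 of 4 trees
```

The full matrix has n × n entries, so storage grows quadratically with the number of observations, which is consistent with the memory problem reported above.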
Proximities may be used for cluster analysis and for graphical representation of the data by multidimensional scaling (MDS) (Breiman and Cutler 2005). See Appendix C for an example of an MDS plot for the classification of the nest and non-nest sites in the cavity nesting birds data. Proximities also may be used to impute missing values. Missing numerical observations are initially imputed using the median for the variable. Proximities are computed, and the missing values are replaced by weighted averages of values on the variable using the proximities as weights. The process may be iterated as many times as desired (the default is five times). For categorical explanatory variables, the imputed value is taken from the observation that has the largest proximity to the observation with a missing value.
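The imputation procedure for a numerical variable can be sketched as follows. This is a simplified illustration, not the RF code: `proximity_fn` is a hypothetical callback standing in for "refit the forest on the filled-in data and return its proximity matrix", the weighted average is taken over the originally observed values (an assumption of this sketch), and the numbers are invented.

```python
from statistics import median

def impute_numeric(values, proximity_fn, n_iter=5):
    """Median start, then proximity-weighted averages, repeated n_iter times."""
    observed = [i for i, v in enumerate(values) if v is not None]
    missing = [i for i, v in enumerate(values) if v is None]
    start = median(values[i] for i in observed)
    filled = [start if v is None else v for v in values]
    for _ in range(n_iter):
        prox = proximity_fn(filled)   # in RF this would come from a refitted forest
        updated = list(filled)
        for i in missing:
            total = sum(prox[i][j] for j in observed)
            if total > 0:
                updated[i] = sum(prox[i][j] * filled[j] for j in observed) / total
        filled = updated
    return filled

# Hypothetical values of one variable; None marks a missing value.
values = [1000.0, 1200.0, None, 400.0]
fixed_prox = [
    [1.0, 0.6, 0.5, 0.0],
    [0.6, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# For illustration the proximities are held fixed instead of being recomputed.
imputed = impute_numeric(values, lambda filled: fixed_prox)
print(imputed[2])  # 1100.0: the average of the two observations it is close to
```

Because each imputed value is an average of observed values, imputed values cannot fall outside the observed range and tend to be pulled toward its centre.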
As a sample application of imputation in RF, in three separate experiments using the LAQ data, we randomly selected and replaced 5%, 10%, and 50% of the values on the variable Elevation with missing values. We then imputed the missing values using RF with the number of iterations ranging from 1 to 25. The results for all the combinations of percentages of observations replaced by missing values and numbers of iterations of the RF imputation procedure were qualitatively extremely similar: the means of the original and imputed values were about the same (1069 vs. 1074, for one typical case); the correlations between the true and imputed values ranged from 0.964 to 0.967; and the imputed values were less dispersed than the true values, with standard deviations of about 335 for the imputed values compared to about 460 for the true values. This kind of contraction, or shrinkage, is typical of regression-based imputation procedures. When a large percentage of the values in a dataset have been imputed, Breiman and Cutler (2005) warn that, in subsequent classifications using the imputed data, the out-of-bag estimates of correct classification rates may overestimate the true generalization correct classification rate.
Detecting Multivariate Structure by Unsupervised Learning
Proximities may be used as inputs to traditional clustering algorithms to detect groups in multivariate data, but not all multivariate structure takes the form of clustering. RF uses a form of unsupervised learning (Hastie et al. 2001) to detect general multivariate structure without making assumptions on the existence of clusters within the data. The general approach is as follows: The original data is labeled class 1. The same data but with the values for each variable independently permuted constitute class 2. If there is no multivariate structure among the variables, RF should misclassify about 50% of the time. Misclassification rates substantially lower than 50% are indicative of multivariate structure that may be investigated using other RF tools, including variable importance, proximities, MDS plots, and clustering using proximities.
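The construction of the two-class problem can be sketched in a few lines. The function name `make_two_class_data` and the toy data are assumptions made for illustration; fitting the classifier itself is left out.

```python
import random

def make_two_class_data(rows, rng):
    """Class 1 = original rows; class 2 = same values with each column permuted independently."""
    n = len(rows)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)          # breaks the dependence between variables
    permuted = [list(r) for r in zip(*columns)]
    data = [list(r) for r in rows] + permuted
    labels = [1] * n + [2] * n
    return data, labels

# Toy data with strong structure: the second variable is twice the first.
rows = [(i, 2 * i) for i in range(6)]
data, labels = make_two_class_data(rows, random.Random(1))

for row, label in zip(data, labels):
    print(label, row)
```

Each variable keeps its marginal distribution in class 2, but the relationship between variables is destroyed, so a classifier can separate the two classes only to the extent that the original data has multivariate structure.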
SOFTWARE USED IN ANALYSES
Stepwise discriminant analysis and preliminary data analyses and manipulations were carried out in SAS version 9.1.3 for Windows (SAS Institute, Cary NC). All other classifications and calculations of accuracy measures were carried out in R version 2.4.0 (R Development Core Team 2006). Logistic regression is part of the core distribution of R. LDA is included in the MASS package (Venables and Ripley 2002). Classification trees are fit in R using the rpart package (Therneau and Atkinson 2006). The R implementation of RF, randomForest, is due to Liaw and Wiener (2002).
SOURCES OF RANDOM FORESTS SOFTWARE
Three sources of software for RF currently exist. These are:
1. FORTRAN code is available from the RF website (http://www.math.usu.edu/~adele/forests).
2. Liaw and Wiener (2002) have implemented an earlier version of the FORTRAN code for RF in the R statistical package.
3. Salford Systems (www.Salford-Systems.com) markets a professional implementation of RF with an easy-to-use interface.
The use of trade, product, or firm names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.
LITERATURE CITED
Breiman, L., and A. Cutler. 2005. Random Forests website: http://www.math.usu.edu/~adele/forests
Hastie, T. J., R. J. Tibshirani, and J. H. Friedman. 2001. The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics, New York, New York, USA.
Liaw, A., and M. Wiener. 2002. Classification and Regression by randomForest. R News: The Newsletter of the R Project (http://cran.r-project.org/doc/Rnews/) 2(3):18–22.
Prasad, A. M., L. R. Iverson, and A. Liaw. 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199.
R Development Core Team. 2006. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org
Therneau, T. M., and E. Atkinson. 2006. rpart: Recursive Partitioning. R package version 3.1.
Venables, W. N., and B. D. Ripley. 2002. Modern applied statistics with S (Fourth Edition). Springer, New York, New York, USA.
Appendix C. Visualization techniques for random forests.
This appendix contains some technical details concerning partial dependence plots and information about additional visualization techniques for random forests.
MULTIDIMENSIONAL SCALING PLOTS FOR CLASSIFICATION RESULTS
With regard to the data for the cavity nesting birds in the Uinta Mountains, Utah, USA, a natural question to ask is whether there are any differences on the measured stand characteristics among the nest sites for the three bird species. An RF classification of the nest sites produced correct classification rates at the chance level (i.e., very low), and we sought to understand why this was the case through graphical summaries. RF produces measures of similarity of data points called proximities (see Appendix A). The matrix of proximities is symmetric with all entries taking values between 0 and 1. The value (1 - proximity between points j and k) is a measure of the distance between these points. A bivariate, metric, multidimensional scaling (MDS) plot is a scatter plot of the values of the data points on the first two principal components of the distance matrix (the matrix of 1 - proximities). Using the RF classification for the combined data on the cavity nesting birds, we constructed an MDS plot (Fig. C1). Note that the nest sites for the three species are completely intermingled, showing that it is not possible to separate the nest sites for the different species on the basis of the measured stand characteristics. The nest sites for all the species are fairly well separated from the non-nest sites, which explains why the classification accuracies for nest sites versus non-nest sites were high. Plots of pairs of measured stand characteristics, including the two stand characteristics that RF identifies as most important to the classification, do not show such a clear separation of the nest and non-nest sites.
FIG. C1. Random forest-based multidimensional scaling plot of non-nest vs. nest sites for three species of cavity nesting birds in the Uinta Mountains, Utah, USA. Non-nest sites are labeled "N". Nest sites are coded "S" for Sphyrapicus nuchalis, "C" for Parus gambeli, and "F" for Colaptes auratus.
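The step from proximities to MDS coordinates can be sketched with classical (metric) scaling. The code below is an assumption-laden illustration, not the procedure used to draw Fig. C1: it treats 1 - proximity as a distance, squares and double-centres it, and extracts the leading eigenvectors by shifted power iteration so that only the Python standard library is needed. The function name `classical_mds` and the toy proximities are hypothetical.

```python
import math
import random

def classical_mds(dist, n_dims=2, n_iter=500, seed=0):
    """Classical scaling of a symmetric distance matrix into n_dims coordinates."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    # Shifting by a bound on the spectral radius makes power iteration
    # converge to the largest eigenvalue, not the largest in magnitude.
    shift = max(sum(abs(x) for x in row) for row in b)
    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Toy proximities for four points that lie on a line (hypothetical values).
positions = [0.0, 0.2, 0.8, 1.0]
prox = [[1 - abs(a - b) for b in positions] for a in positions]
dist = [[1 - p for p in row] for row in prox]
coords = classical_mds(dist)
for point in coords:
    print([round(c, 3) for c in point])
```

In this toy case the first coordinate reproduces the original spacing of the points (up to sign and a shift), and the second coordinate is essentially zero because the points are one-dimensional.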
PARTIAL DEPENDENCE PLOTS
Partial dependence plots (Hastie et al. 2001; Friedman 2001) are tools for visualizing the effects of small numbers of variables on the predictions of "blackbox" classification and regression tools, including RF, boosted trees, support vector machines, and artificial neural networks. In general, a regression or classification function, f, will depend on many predictor variables. We may write $f(X) = f(X_1, X_2, X_3, \ldots, X_s)$, where $X = (X_1, X_2, \ldots, X_s)$ are the predictor variables. The partial dependence of the function f on the variable $X_j$ is the expectation of f with respect to all the variables except $X_j$. That is, if $X_{(-j)}$ denotes all the variables except $X_j$, the partial dependence of f on $X_j$ is given by $f_j(X_j) = E_{X_{(-j)}}[f(X)]$. In practice we estimate this expectation by fixing the values of $X_j$, and averaging the prediction function over all the combinations of observed values of the other predictors in the data set. This process requires prediction from the entire dataset for each value of $X_j$ in the training data. In the R implementation of partial dependence plots for RF (Liaw and Wiener 2002), instead of using the values of the variable $X_j$ in the training data set, the partialPlot function uses an equally spaced grid of values over the range of $X_j$ in the training data, and the user gets to specify how many points are in the grid. This feature can be very helpful with large data sets where the number of values of $X_j$ may be large.
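The estimation procedure just described can be written as a short sketch. It is not the partialPlot code; `partial_dependence` and `equally_spaced_grid` are hypothetical helpers, `predict` stands for any fitted prediction function, and the linear toy model is used only because its partial dependence is known exactly.

```python
def equally_spaced_grid(data, j, n_points):
    """Equally spaced grid over the observed range of predictor j (n_points >= 2)."""
    lo = min(row[j] for row in data)
    hi = max(row[j] for row in data)
    step = (hi - lo) / (n_points - 1)
    return [lo + k * step for k in range(n_points)]

def partial_dependence(predict, data, j, grid):
    """For each grid value, fix predictor j at that value and average the predictions."""
    result = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)
            modified[j] = value       # the other predictors keep their observed values
            total += predict(modified)
        result.append(total / len(data))
    return result

# Toy prediction function and data (assumptions for illustration).
predict = lambda x: 2 * x[0] + x[1]
data = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]

grid = equally_spaced_grid(data, 0, 3)
pd_values = partial_dependence(predict, data, 0, grid)
print(grid)        # [0.0, 1.0, 2.0]
print(pd_values)   # [3.0, 5.0, 7.0], i.e. 2 * x + mean of the second predictor
```

The cost is one pass of prediction over the whole data set per grid point, which is why a user-specified grid size matters for large data sets.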
The partial dependence for two variables, say $X_j$ and $X_l$, is defined as the expectation of the function f(X) with respect to all variables except $X_j$ and $X_l$. Partial dependence plots for two predictor variables are perspective (three-dimensional) plots (see Fig. 4 in the main article). Even with moderate sample sizes (5,000–10,000), such as the Lava Beds NM invasive plants data, bivariate partial dependence plots can be very computationally intensive.
In classification problems with, say, K classes, there is a separate response function for each class. Letting $p_k(X)$ be the probability of membership in the kth class given the predictors $X = (X_1, X_2, X_3, \ldots, X_s)$, the kth response function is given by
$$f_k(X) = \log p_k(X) - \frac{1}{K} \sum_{j=1}^{K} \log p_j(X)$$
(Hastie et al. 2001, Liaw and Wiener 2002). For the case when K = 2, if p denotes the probability of "success" (i.e., presence, in species distribution models), the above expression reduces to
$$f(X) = 0.5 \log\left(\frac{p(X)}{1 - p(X)}\right) = 0.5\,\mathrm{logit}(p(X)).$$
Thus, the scale on the vertical axis of Figs. 2–4 is half of the logit of the probability of presence.
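The reduction to half the logit when K = 2 can be confirmed numerically. The sketch below assumes the response function is the log probability of a class minus the average log probability over all K classes; `response_functions` is a hypothetical helper and 0.8 is an arbitrary example probability.

```python
import math

def response_functions(probs):
    """Centred log probabilities: log p_k minus the mean of log p_j over all classes."""
    logs = [math.log(p) for p in probs]   # requires every probability to be > 0
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

p = 0.8                                   # hypothetical probability of presence
f_presence = response_functions([p, 1 - p])[0]
half_logit = 0.5 * math.log(p / (1 - p))

print(f_presence)
print(half_logit)                         # agrees with f_presence up to rounding
```

With two classes, log p - (log p + log(1 - p))/2 = (log p - log(1 - p))/2, which is the half-logit. A value of 0 on this scale therefore corresponds to a probability of presence of 0.5.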
REAL-TIME 3D GRAPHICS WITH rgl
Bivariate partial dependence plots are an excellent way to visualize interactions between two predictor variables, but choosing exactly the correct viewing angle to see the interaction can be quite an art. The rgl real-time 3D graphics driver in R (Adler and Murdoch 2007) allows one to take a 3D plot and spin it in three dimensions using the computer mouse. In a matter of seconds one can view a three-dimensional plot from literally hundreds of angles, and finding the "best" perspective to view the interaction between two variables is quick and easy. Figure C2 is a screen snapshot of an rgl 3D plot for the cavity nesting birds data, using the same variables as in Fig. 4 of the main article.
FIG. C2. Screen snapshot of 3D rgl partial dependence plot for variables NumTree3to6in and NumTree9to15in. Nest site data for three species of cavity nesting birds collected in the Uinta Mountains, Utah, USA.
LITERATURE CITED
Adler, D., and D. Murdoch. 2007. rgl: 3D visualization device system (openGL). R package version 0.71. URL http://rgl.neoscientists.org
Hastie, T. J., R. J. Tibshirani, and J. H. Friedman. 2001. The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics, New York, New York, USA.
Friedman, J. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 29(5):1189–1232.
Liaw, A., and M. Wiener. 2002. Classification and Regression by randomForest. R News: The Newsletter of the R Project (http://cran.r-project.org/doc/Rnews/) 2(3):18–22.