A random forest model builds on the idea that an accurate model can be obtained by taking a simple model and fitting it to a data set many times, each time on a random subset of the training data. Further, if fitting the simple model itself involves selecting a subset of the data, that step is randomized as well. Injecting this much randomness is expected to reduce the effects of noise and outliers.
As the name indicates, a random forest model uses a tree model as the “simple” model. A random subset of the training data is chosen and a tree model is constructed from it. This is repeated many times, creating a “forest”. Further randomization is injected during the construction of each tree. Growing a tree means deciding on a best split at each node, so at each node a random subset of the predictor variables is chosen, and the best split is calculated using only those variables and the data that reached the node (itself part of the random subset of the overall training data). Both continuous and categorical variables are used for this purpose. Continuous variables are split the usual way, by comparing their values with a threshold; categorical variables are split by assigning their possible categories to different branches of the node.
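The two sampling steps described above can be sketched in base R. This is only an illustration of the sampling, not of tree construction. The names `boot_rows`, `oob_rows`, `mtry` and `split_candidates` are chosen here for illustration, the subset size of 2 is an assumption, and the built-in airquality data serves only as a convenient example.

```r
# Illustrative sketch of the two sources of randomness in a random forest.
set.seed(42)                                # make the random draws reproducible

aq <- na.omit(airquality)                   # built-in data set, incomplete rows dropped
predictors <- c("Wind", "Temp", "Month")    # candidate predictor variables
n_obs <- nrow(aq)

# Step 1 (once per tree): draw a random sample of the training rows.
# Sampling with replacement (a bootstrap sample) is assumed here.
boot_rows <- sample(n_obs, size = n_obs, replace = TRUE)
tree_data <- aq[boot_rows, ]

# Rows never drawn are "out of bag" for this tree.
oob_rows <- setdiff(seq_len(n_obs), boot_rows)

# Step 2 (once per node): only a random subset of the predictors
# is considered when searching for the best split.
mtry <- 2                                   # assumed subset size, for illustration
split_candidates <- sample(predictors, size = mtry)
```

Repeating step 1 for every tree and step 2 for every node is what makes the individual trees differ from one another.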
The predictions of all the tree models are then combined to give the final result. Regression models usually take the average of all the predictions, while classification models usually use a majority voting scheme. Given training data, comparing these predictions with the expected output also makes it possible to estimate the importance of each variable. This may be done by rescoring the data with altered values of a variable, or of a combination of variables, and computing the change in the final result. A forest can also yield a proximity measure, which describes how often two observations end up in the same terminal node. Both measures are useful when deciding on a best model.
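The two combination schemes can be sketched in base R. The per-tree predictions below are made-up numbers used only to show the arithmetic, and `majority_vote` is a helper name introduced for this illustration.

```r
# Hypothetical per-tree predictions: rows are observations, columns are trees.
reg_preds <- matrix(c(40, 44, 38, 42, 41,
                      12, 15, 11, 14, 13,
                      80, 70, 75, 85, 90),
                    nrow = 3, byrow = TRUE)

# Regression: the forest prediction is the mean over the trees.
forest_reg <- rowMeans(reg_preds)           # 41, 13, 80

class_preds <- matrix(c("high", "high", "low",  "high", "low",
                        "low",  "low",  "low",  "high", "low",
                        "high", "high", "high", "high", "low"),
                      nrow = 3, byrow = TRUE)

# Classification: each tree casts one vote and the most frequent class wins.
majority_vote <- function(votes) names(which.max(table(votes)))
forest_class <- apply(class_preds, 1, majority_vote)   # "high", "low", "high"
```

With an odd number of trees and two classes, as here, ties cannot occur; a real implementation needs a rule for breaking them.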
Studies have shown that, given an appropriate number of trees, random forest models can perform as well as other modeling techniques. The appropriate number of trees may be found by constructing models with different numbers of trees and observing how the error rate changes. A convenient error estimate is the out-of-bag error, computed for each tree on the training observations that were left out of its random subset. The number of trees is usually high, typically a few hundred at least. Such models may have an advantage with noisy data and with highly correlated data.
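One way to look at the effect of the number of trees, sketched here with the randomForest package, is to grow a single large forest and read off its out-of-bag error after a given number of trees. For a regression forest the fitted object stores this in its `mse` component. The tree counts below and the names `rf`, `tree_counts` and `oob_error` are arbitrary choices for this illustration.

```r
library(randomForest)

set.seed(1)                                 # forests are random; fix the seed to reproduce
aq <- na.omit(airquality)
rf <- randomForest(Ozone ~ Wind + Temp + Month, data = aq, ntree = 500)

# For a regression forest, rf$mse[k] is the out-of-bag mean squared error
# of the forest made up of the first k trees.
tree_counts <- c(10, 50, 100, 200, 500)
oob_error <- data.frame(ntree = tree_counts, oob_mse = rf$mse[tree_counts])
print(oob_error)
```

Calling plot(rf) draws the same error curve over all tree counts. The error typically flattens out after some point, and adding trees beyond that mainly increases model size.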
Random Forest Models in R
Random forest models are simple to construct in R, as shown in the example below. The model can then be converted to PMML, ready to be loaded into ADAPA. Even though such models can be large, they can be uploaded quickly into ADAPA and used to score data in batch or in real time. The example below uses the R package randomForest (Authors: Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener) and the “airquality” dataset, which ships with R in the datasets package.
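library(randomForest)  # provides randomForest()
library(pmml)          # provides pmml(); saveXML() comes from the XML package
ozone.out <- randomForest(Ozone ~ Wind + Temp + Month, data = na.omit(airquality), ntree = 200)
XML::saveXML(pmml(ozone.out, data = airquality), "airquality_rf.pmml")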
The resulting forest models the amount of ozone in the atmosphere as a function of wind speed, temperature and month. Although this model uses simple predictor variables, it is possible to use more complicated combinations of variables as well. Examples include variables with interactions (in R's formula notation, ozone levels depending on the product of wind and temperature are indicated by the term Wind:Temp) and continuous variables treated as categorical variables (wrapping the variable as as.factor(Month) ensures that month, even though numeric, is treated as categorical in R). Such models can also be represented in PMML and uploaded into ADAPA for scoring.
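A conservative way to try both ideas is to prepare the columns in the data frame before fitting, rather than transforming them inside the formula. The sketch below does that. The column name `WindTemp` and the object name `ozone.fac` are introduced here for illustration, and converting a model with such prepared columns to PMML is not shown.

```r
library(randomForest)

set.seed(1)                                 # forests are random; fix the seed to reproduce
aq <- na.omit(airquality)

aq$Month <- as.factor(aq$Month)             # month as a categorical variable (levels 5 to 9)
aq$WindTemp <- aq$Wind * aq$Temp            # hypothetical column holding the Wind-Temp product

ozone.fac <- randomForest(Ozone ~ Wind + Temp + Month + WindTemp,
                          data = aq, ntree = 200)
print(ozone.fac)                            # summary including the out-of-bag error
```

Note that as.factor is written in lower case; R is case-sensitive, so as.Factor would not be found.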