From the original creators of the Random Forests algorithm and their original implementation in Java comes a translation of their work into R as the package Rfviz.
First, I would just say I am in no way deserving of any credit for this package. That credit deservedly goes to Dr. Leo Breiman and Dr. Adele Cutler, the original creators of this wildly popular and successful algorithm. I am just the lucky student who got to work with Dr. Cutler in my graduate program.
Dr. Breiman and Dr. Cutler originally created plots in Java that allow you to visualize and interpret Random Forests. The original plots and their style can be seen on their Random Forests graphics page (Breiman and Cutler 2004).
I was fortunate enough to go to Utah State University starting in 2017, where Dr. Cutler was a professor in the Statistics Department. I wanted to do my graduate work in Machine Learning and asked her to be my graduate advisor. She agreed and showed me potential projects. At that point I didn’t know she had helped create the Random Forests algorithm. It actually wasn’t until I was about halfway done with the project that I found out from another faculty member. When I approached her and asked why she hadn’t told me, she replied with something along the lines of, “Oh well, I don’t really like to toot my own horn.” I hope this tells you how down to earth and kind she is.
Anyways, one of the potential projects she showed me was translating some Java plots for interactive visualization and interpretation of Random Forests into R. I chose this project and finished the translation into R in 2018 and published the package to CRAN as Rfviz.
Recently, I have been using it in my job and was able to gain, in my opinion, deeper insights beyond what is available through methods such as the Shapley method or Overall Variable Importance Plots. I thought that other people should know more about the benefits I am experiencing from Dr. Breiman and Dr. Cutler’s work. This is what inspired this article.
Random forests (Breiman 2001) fit a number of trees (typically 500 or more) to regression or classification data. Each tree is fit to a bootstrap sample of the data, so some observations are not included in the fit of each tree (these are called out-of-bag observations for the tree). Independently at each node of each tree, a relatively small number of predictor variables (called mtry) is randomly chosen and these variables are used to find the best split. The trees are grown deep and not pruned. To predict for a new observation, the observation is passed down all the trees and the predictions are averaged (regression) or voted (classification).
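The bootstrap and out-of-bag idea can be sketched in a few lines of base R. This is only an illustration of the sampling step, not the randomForest implementation; the values of `n` and `ntree` are arbitrary choices made for the example.

```r
# Sketch: which observations are out of bag (OOB) for each tree?
set.seed(1)
n <- 200       # number of observations (illustrative)
ntree <- 500   # number of trees (illustrative)

# For each tree, draw a bootstrap sample of row indices (with replacement)
# and flag the rows that were never drawn, i.e. are out of bag for that tree.
oob <- sapply(seq_len(ntree), function(b) {
  in_bag <- sample.int(n, size = n, replace = TRUE)
  !(seq_len(n) %in% in_bag)
})
# oob is an n x ntree logical matrix: TRUE means "row is OOB for this tree".

oob_fraction <- mean(oob)          # observed share of OOB entries
theoretical <- (1 - 1 / n)^n       # chance a given row is missed by one bootstrap sample
trees_per_obs <- rowSums(oob)      # number of trees in which each row is OOB

print(c(observed = oob_fraction, theoretical = theoretical))
```

Each observation is out of bag for a bit more than a third of the trees, and those are the trees used for the out-of-bag quantities described next.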
A local importance score is obtained for each observation in the data set, for each variable. To obtain the local importance score for observation i and variable j, randomly permute variable j for each of the trees in which observation i is out of bag, and compare the error for the variable-j-permuted data to the actual error. The average difference in the errors across all trees for which observation i is out of bag is its local importance score.
The (overall) variable importance score for variable j is the average value of its local importance scores over all observations.
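To see how the overall score relates to the local scores, here is a small sketch with made-up numbers. The matrix `local_imp`, the variable names `var_a` and `var_b`, and the `group` labels are all hypothetical and exist only for this illustration.

```r
# Hypothetical local importance scores: rows are variables, columns are
# observations. These numbers are made up purely for illustration.
local_imp <- matrix(
  c(0.30, 0.28, 0.02, 0.01,   # var_a: matters mostly for obs 1 and 2
    0.10, 0.12, 0.11, 0.09),  # var_b: matters a little for every observation
  nrow = 2, byrow = TRUE,
  dimnames = list(c("var_a", "var_b"), paste0("obs", 1:4))
)
group <- c("g1", "g1", "g2", "g2")  # hypothetical grouping of the observations

# Overall importance: average the local scores over all observations.
overall_imp <- rowMeans(local_imp)

# Group-wise averages of the same local scores.
by_group <- sapply(split(seq_len(ncol(local_imp)), group),
                   function(idx) rowMeans(local_imp[, idx, drop = FALSE]))

print(overall_imp)
print(by_group)
```

In this toy example the overall average for `var_a` hides the fact that it matters almost entirely for one group, which is the kind of detail the local scores make visible.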
Proximities are the proportion of time two observations end up in the same terminal node when both observations are out of bag. The proximity scores are obtained for all combinations of pairs of observations, giving a symmetric proximity matrix.
Note: It is recommended to use 9 or 10 times more trees when dealing with proximities, since a proximity is a proportion taken only over the trees in which both observations are out of bag. This helps ensure that each observation can be compared with every other observation.
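A quick back-of-the-envelope calculation shows why pairs of observations get far fewer usable trees than single observations. This is a sketch under the assumption of plain bootstrap sampling; the values of `n` and `ntree` are illustrative and not taken from the data set used later.

```r
n <- 700       # number of observations (illustrative)
ntree <- 500   # number of trees (illustrative)

# Chance that one given observation is out of bag for a tree:
# it must be missed by all n bootstrap draws.
p_one <- (1 - 1 / n)^n    # close to exp(-1), about 0.37

# Chance that two given observations are BOTH out of bag for a tree:
# every draw must miss both of them.
p_both <- (1 - 2 / n)^n   # close to exp(-2), about 0.135

# Expected number of trees usable for one observation vs. for one pair.
trees_one <- ntree * p_one
trees_both <- ntree * p_both

print(round(c(p_one = p_one, p_both = p_both), 3))
print(round(c(trees_one = trees_one, trees_both = trees_both)))
```

Under these assumptions, only a small share of the forest contributes to any single proximity, so growing more trees gives each pair more chances to be compared.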
Parallel Coordinate Plots
Parallel coordinate plots are used for plotting observations on a handful of variables. The variables can be discrete or continuous, or even categorical. Each variable has its own axis and these are represented as equally-spaced parallel vertical lines. Usually, the axes extend from the minimum to the maximum of the observed data values although other definitions are possible. A given observation is plotted by connecting its observed value on each of the variables using a piecewise linear function.
Static parallel coordinate plots are not particularly good for discovering relationships between variables because the display depends strongly on the order of the variables in the plot. In addition, they suffer very badly from overplotting for large data sets. However, by brushing the plot (highlighting subsets of observations in a contrasting color) parallel coordinate plots can be useful in investigating unusual groups of observations and relating the groups to high/low values of the variables.
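The mechanics described above can be sketched in base R: rescale each variable to its own minimum and maximum, draw one line per observation, and "brush" a subset by giving it a contrasting color. This is a static stand-in, not how Rfviz draws its interactive plots; it uses the built-in `iris` data purely as an example, and the helper names `rescale01` and `draw_parcoord` are made up for this sketch.

```r
# Rescale a numeric vector to [0, 1] so its axis runs from min to max.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

x <- iris[, 1:4]                               # four numeric predictors
scaled <- as.data.frame(lapply(x, rescale01))  # one [0, 1] axis per variable

# "Brush" a subset: highlight one group in a contrasting color.
brushed <- iris$Species == "setosa"
line_col <- ifelse(brushed, "blue", "gray70")

draw_parcoord <- function(scaled, line_col) {
  # matplot() draws one line per matrix column, so transpose the data
  # to make each observation a column and each variable an x position.
  matplot(t(as.matrix(scaled)), type = "l", lty = 1, col = line_col,
          xaxt = "n", yaxt = "n", xlab = "", ylab = "")
  axis(1, at = seq_along(scaled), labels = names(scaled))
}

# Only draw when running interactively.
if (interactive()) draw_parcoord(scaled, line_col)
```

Reordering the columns of `scaled` changes the picture considerably, which illustrates the dependence on variable order mentioned above.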
Plots Produced by Rfviz
Parallel coordinate plot of the input data.
The predictor variables are plotted in a parallel coordinate plot. The observations are colored according to their value of the response variable. Brushing on this plot allows investigators to examine the input data interactively and look for unusual observations, outliers, or any obvious patterns between the predictors and the response. Often this plot is used in conjunction with the local importance plot, which allows users to focus more heavily on the predictors that are important for a given group of observations.
Parallel coordinate plot of the local importance scores.
The local importance scores of each observation are plotted in a parallel coordinate plot. Brushing observations with high local importance can allow the user to look at the corresponding variable on the raw input parallel coordinate plot and observe whether the variable has high or low values, allowing an interpretation such as ‘for this group the most important variable is variable j’.
Rotational scatterplot of the proximities.
This plot allows the user to select groups of observations that appear to be similar and brush them, with the corresponding observations showing up in the two parallel coordinate plots. In classification, for example, if the user brushes a group of observations that are from class 1, they can then examine the local importance parallel coordinate plot. Variables that are important for classifying the group correctly will be highlighted and any variables that have high importance can then be studied in the raw input parallel coordinate plot, to see whether high or low values of the important variable(s) are associated with the group.
bc[is.na(bc)] <- -3
data_x <- bc[,-10]
data_y <- bc$Class
#The prep function. This runs default randomForest() and prepares it for the plotting function.
rfprep <- rf_prep(data_x, data_y)
Our use case: Let’s say we look at the variable importance plots within Random Forests.
Looking at some of the most important variables, we can see that, according to Mean Decrease in Gini, “Uniformity of Cell Size” and “Uniformity of Cell Shape” are the two most important predictors. But which values of these are most important, and to which class? Let’s pull up the visualization tool and dig into one.
#Pull up the visualization tool
bcrf <- rf_viz(rfprep, input=TRUE, imp=TRUE, cmd=TRUE)
We can see that the three plots are there, along with a “loon inspector” to interact with the plots. Each of the parallel coordinate plots has a separate scale for each column of data, which is why there are no y-axis ticks or labels. The scales are relative to the max and min of that column. The proximities plot is an XYZ scatterplot. Its main purpose is to show, in space, how the different classes are grouped within the trees.
First, let’s see which color corresponds to each class. Within the “loon inspector” on the right, inside the “select” section, click on one of the colors under the subsection “by color”. Let’s click on blue first. You can see the result in Screenshot 5.
We can see that all the data belonging to the blue class is now highlighted. Which class does this correspond to? Back in R, run:
I know it is rudimentary, but it’s all I have figured out so far to identify the classes quickly and easily. I haven’t figured out how to label the colors on the inspector. Now we know that blue is malignant, or class 1, and gray is benign, or class 0. Now click anywhere on the visualization tool and the selection will be cleared.
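As a rough sketch of this kind of check, one way to see which class a brushed selection belongs to is to tabulate the response for the selected rows. So that it runs without the interactive tool, the example below uses a small made-up data frame `toy` and a hypothetical logical vector `selected` standing in for the selection state that `bcrf$imp['selected']` provides in the article's code.

```r
# Toy stand-in for the data; the real bc has many more rows and columns.
toy <- data.frame(
  Class = c("malignant", "benign", "malignant", "benign", "benign"),
  shape = c(8, 1, 6, 2, 1)
)

# Hypothetical selection state, standing in for bcrf$imp['selected'].
selected <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Count the classes among the selected rows only.
class_counts <- table(toy$Class[selected])
print(class_counts)
```

If every selected row falls into a single class, that class is the one drawn in the color that was clicked.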
We now know that blue is class 1, or those with malignant cancer. Now let’s focus on the “Uniformity of Cell Shape” column, the second most important variable according to the Mean Decrease in Gini overall importance plot. Take a look at the Local Importance Scores plot and the “Uniformity of Cell Shape” column in Screenshot 7. Visually, the local importance scores seem to trend higher for class 1/malignant than for class 0/benign.
Now here is where the deep interpretation and understanding can come. On the Local Importance Scores plot, using your mouse, click and drag up on the column for “Uniformity of Cell Shape” near where you think the separation between the two classes or colors of lines happens. Here is what I selected within Screenshot 8.
Within R, run:
bc[bcrf$imp['selected'],'Uniformity of Cell Shape']
c1 <- bc[bcrf$imp['selected'],]
summary(c1$`Uniformity of Cell Shape`)
Now do the same for the portion of data we did not select within the “Uniformity of Cell Shape” column on the Local Importance Score plot. Here is what I selected in Screenshot 10.
And again in R, run:
bc[bcrf$imp['selected'],'Uniformity of Cell Shape']
c2 <- bc[bcrf$imp['selected'],]
summary(c2$`Uniformity of Cell Shape`)
Now let’s look at the results. Of what we first selected, 198/236 (~84%) were from class 1/malignant. The values of “Uniformity of Cell Shape” have a 1st Quartile of 4, Median of 6, and 3rd Quartile of 9.
For the second grouping, 423/466 (~91%) were from class 0/benign. The values of “Uniformity of Cell Shape” have a 1st Quartile of 1, Median of 1, and 3rd Quartile of 2.
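The percentages quoted for the two selections follow directly from the counts. A minimal check, using only the counts reported above (the vector names are made up for this sketch):

```r
# Counts reported for the two brushed selections.
n_selected <- c(first = 236, second = 466)   # rows in each selection
n_majority <- c(first = 198, second = 423)   # rows from the dominant class

# Share of each selection that belongs to its dominant class, in percent.
pct <- round(100 * n_majority / n_selected, 1)
print(pct)
# first: 83.9, second: 90.8
```

So roughly 16% of the first selection and 9% of the second come from the other class, which is why the conclusions that follow are worded as "generally".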
From this, we can conclude that for the second most important variable to the prediction, higher values of “Uniformity of Cell Shape”, with a 1st Quartile of 4, Median of 6, and 3rd Quartile of 9, were generally important to a classification of class 1/malignant for Random Forests. I say “generally” because some of class 0/benign were included in that data we selected.
On the other hand, lower values of “Uniformity of Cell Shape” with a 1st Quartile of 1, Median of 1, and 3rd quartile of 2 were generally important to class 0 for Random Forests. Even more, we can look at the exact data and save it as objects within R for even more manipulation.
For instance, let’s say we didn’t want any of the values of the other class to show up in the summary data for “Uniformity of Cell Shape” for each selection of data.
summary(c1$`Uniformity of Cell Shape`[c1$Class!='benign'])
summary(c2$`Uniformity of Cell Shape`[c2$Class!='malignant'])
So what does Rfviz allow us to do? Rfviz allows us deeper interaction and interpretation of Random Forests. We can visualize and interact with the data, quickly see the differences between the classes, and see why and how overall important variables are locally important to each class.
For a more in-depth tutorial on how to interact with the tool, see the Rfviz site (Beckett 2018).
I hope you enjoyed this read, and good luck in your interpretation and inference of Random Forests.
Breiman, L. 2001. “Random Forests.” Machine Learning. http://www.springerlink.com/index/u0p06167n6173512.pdf.
Breiman, L., and A. Cutler. 2004. Random Forests. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_graphics.htm.
Beckett, C. 2018. Rfviz: An Interactive Visualization Package for Random Forests in R. https://chrisbeckett8.github.io/Rfviz.