Decision trees suffer from high variance: if we fit a tree to different subsets of the training data, or the data set itself changes slightly, the resulting trees can look very different. Bootstrap Aggregation, or Bagging, is a method that tries to overcome this problem.
Averaging a set of independent random variables reduces variance. Consider $n$ independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$; the variance of their mean $\bar{Z}$ is $\sigma^2 / n$. We train $B$ models $\hat{f}^1, \hat{f}^2, \dots, \hat{f}^B$ on $B$ separate training sets and average their predictions to obtain a single low-variance model,

$$\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x).$$
This reduction is obvious when the trees are completely independent, which is not true in the majority of cases. Since the trees are deep, we expect them to have very low bias and similar expectations/means; the noise is introduced by their variance. Suppose the $B$ trees come from the same distribution, each with variance $\sigma^2$, but have some pairwise correlation $\rho$. The variance of their average is then

$$\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2,$$

so increasing $B$ drives the second term to zero, but the first term remains: the correlation between trees limits how much averaging can help.
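As a quick numerical check of this formula, the sketch below simulates $B$ identically distributed Gaussians with pairwise correlation $\rho$ and compares the empirical variance of their average with $\rho\sigma^2 + (1-\rho)\sigma^2/B$; the particular values of $B$, $\rho$, and $\sigma^2$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2, n_sim = 25, 0.6, 4.0, 100_000

# Compound-symmetric covariance: sigma2 on the diagonal, rho * sigma2 elsewhere.
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))
samples = rng.multivariate_normal(np.zeros(B), cov, size=n_sim)  # shape (n_sim, B)

empirical = samples.mean(axis=1).var()              # variance of the average of B draws
theoretical = rho * sigma2 + (1 - rho) * sigma2 / B
print(f"empirical variance:   {empirical:.3f}")
print(f"theoretical variance: {theoretical:.3f}")
```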
Since we do not have access to different training data sets, we instead average over trees grown on bootstrap samples of the original data set. Note that trees grown on bootstrapped data sets are deep and not pruned, so that they have low bias. Even though each individual tree may have high variance, averaging reduces it.
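A minimal sketch of this procedure for regression, assuming scikit-learn's `DecisionTreeRegressor` and NumPy arrays `X`, `y` for the training data (the helper names `bagged_regressor` and `bagged_predict` are illustrative, not a standard API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_regressor(X, y, B=100, random_state=0):
    """Grow B deep (unpruned) trees on bootstrap samples and return them."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # sample n rows with replacement
        tree = DecisionTreeRegressor()        # no max_depth / pruning: low bias, high variance
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagged_predict(trees, X_new):
    """Average the predictions of all trees (regression)."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```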
Averaging bootstrapped predictors works for regression, while in classification we take the majority vote: the overall prediction is the most commonly occurring class across the $B$ predictions.
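Only the aggregation step changes. A hedged sketch of the vote count, again assuming a list of fitted scikit-learn trees (here `DecisionTreeClassifier`) and integer class labels:

```python
import numpy as np

def bagged_predict_class(trees, X_new):
    """Majority vote across trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)    # (B, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, votes,
                                 minlength=votes.max() + 1)            # (n_classes, n_samples)
    return counts.argmax(axis=0)                                       # modal class per sample
```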
It can be shown that the probability of an observation being present in a bootstrapped data set is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ as $n$ grows; in other words, each tree sees roughly two-thirds of the observations.
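This is easy to check numerically, since drawing $n$ times with replacement misses a given observation with probability $(1 - 1/n)^n$:

```python
import numpy as np

# Probability that a specific observation appears in a bootstrap sample of size n.
for n in [10, 100, 1000, 10_000]:
    p = 1 - (1 - 1 / n) ** n
    print(f"n = {n:>6}: P(included) = {p:.4f}")
print(f"limit 1 - 1/e   = {1 - np.exp(-1):.4f}")
```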
Any observation not part of a bootstrapped sample is called Out of Bag (OOB), and the error we are about to calculate is the out-of-bag error. For each observation, we can take the predictions of the roughly $B/3$ trees for which it was OOB, average them (regression) or take their majority vote (classification), and compare the result with the true response. Doing this for every observation gives the OOB error, a valid estimate of the test error, since no observation is used to fit the trees that predict it.
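A hedged sketch of how the OOB error might be computed for regression, under the same assumptions as the earlier snippets (scikit-learn trees, NumPy arrays `X`, `y`; the function name `oob_error` is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_error(X, y, B=100, random_state=0):
    """Mean squared out-of-bag error of a bagged regression ensemble (sketch)."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    oob_sum = np.zeros(n)      # running sum of OOB predictions per observation
    oob_count = np.zeros(n)    # number of trees for which each observation was OOB
    for _ in range(B):
        idx = rng.integers(0, n, size=n)             # bootstrap indices
        oob_mask = ~np.isin(np.arange(n), idx)       # observations left out of this sample
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        oob_sum[oob_mask] += tree.predict(X[oob_mask])
        oob_count[oob_mask] += 1
    seen = oob_count > 0                             # each point is OOB for roughly B/3 trees
    return np.mean((y[seen] - oob_sum[seen] / oob_count[seen]) ** 2)
```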
We have been able to reduce the variance using the bagging approach, but the model is no longer as interpretable due to the presence of multiple trees. To recover a measure of variable importance, we need to somehow aggregate over all the trees.
A simple workaround is to record the total amount by which the RSS (regression) or Gini index (classification) decreases due to splits on a particular variable within each tree. To make the values comparable, we then average this quantity across all $B$ trees; a large value indicates an important predictor.
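With scikit-learn trees, one hedged way to compute this is via each fitted tree's `feature_importances_` attribute, which holds the normalized total decrease in the splitting criterion attributable to each variable; averaging it across the bagged trees gives the measure described above.

```python
import numpy as np

def variable_importance(trees):
    """Average per-variable importance across all bagged trees."""
    # feature_importances_ is the total criterion decrease (RSS or Gini) per
    # feature within a single tree, normalized to sum to one.
    return np.mean([t.feature_importances_ for t in trees], axis=0)
```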