As soon as I got to know some uncommon algorithms, I rushed to Kaggle’s web page to enter a competition. Despite my strong belief that I would score well, I found my entry lagging far behind. The reason? Most of the other scripts use Ensemble Learning, which builds on multiple models, so individual models score lower. The concept of Ensemble Learning is so predominant on Kaggle that most of the winning scripts employ it.
Ensembles work well because they eliminate some of the most common and significant errors of individual models. They help approximate the true function, cancel out uncorrelated errors of individual classifiers, and reduce overfitting.
What is Ensemble Learning?
As the name suggests, this kind of learning bases itself on a group of items, which are viewed as a whole rather than individually.
The notion of ‘Wisdom of Crowds’ states that, more often than not, the collective opinion of many is more likely to be correct than a single expert’s opinion. Hence, in Ensemble Learning, instead of relying on a single model for predictions, it is often useful to rely on a set of learners.
Here, we pick a set of weak (simple) learners and train them on the data. When we need to predict on a new data point, we collect and combine the outcomes of all the models, so the final prediction is a function of these outcomes. The learners can be from the same family or from different ones.
Some of the simplest ensembles are –
- Simple Averages for Regression or Majority Vote for Classification – the final prediction is either a simple average of the individual learners’ outcomes or the most common class among them.
- Weighted Averages or Weighted Votes – the final prediction is a weighted average (or weighted vote) of the outcomes, with better learners given larger weights.
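The two simplest combiners above can be sketched in a few lines. This is a minimal illustration; the helper names and the model outputs below are hypothetical placeholders, not real predictions from trained models.

```python
from collections import Counter

def majority_vote(labels):
    """Most common class among the individual classifiers' outputs."""
    return Counter(labels).most_common(1)[0][0]

def simple_average(values):
    """Unweighted average of the individual regressors' outputs."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """Weighted average: better learners get larger weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Three hypothetical models classify an image / predict a price:
vote = majority_vote(["cat", "dog", "cat"])                      # -> "cat"
avg = simple_average([3.1, 2.9, 3.4])
wavg = weighted_average([3.1, 2.9, 3.4], [0.5, 0.3, 0.2])
```

Note that the weighted version reduces to the simple average when all weights are equal.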
Interesting Fact- Ensemble Learning is great for both small and large amounts of data.
Why is Ensemble Learning better?
To understand why it performs better than individual models, knowledge of variance and bias is vital. The error of a model decomposes into three components:

Error = Bias² + Variance + Intrinsic Noise
1. Variance

Models with high variance overfit the data, i.e. they memorise it. High variance implies that the model is sensitive to even small fluctuations in the data: instead of generalising, it models the noise present in the data. Although such a model performs excellently on training data, its performance on test data is worse.
2. Bias

Models with high bias underfit (the opposite of overfitting) the data. They fail to capture the key relations between the features and the target variable, owing to our inability to represent the true best predictor.
This intuitive image by Scott Fortmann-Roe explains the concept of bias and variance. In the picture, the red circle is the region of true outcomes, and the blue dots are our predicted outcomes; blue dots inside the red circle indicate correct predictions.
3. Intrinsic Target Noise
It is the minimum error that can be achieved and cannot be reduced further. It corresponds to the inherent noise present in the data and represents a lower bound on the error of any learning algorithm.
Each model faces a bias-variance trade-off: a model with low bias tends to have high variance, and vice versa. Ensemble Learning plays on this trade-off, decreasing one component at the cost of the other as required. However, either a high variance or a high bias indicates poor model performance.
The graph shows that error due to variance increases with model complexity, while error due to bias increases as complexity decreases. The model’s complexity is optimal at the point where the total error is minimum.
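The decomposition Error = Bias² + Variance + Intrinsic Noise can be checked numerically. The sketch below is illustrative, assuming a deliberately biased estimator (a shrunken sample mean) predicting a noisy target; none of this setup comes from the article.

```python
import random
import statistics

random.seed(0)
MU, SIGMA, N, TRIALS = 2.0, 1.0, 5, 20000

preds, sq_errors = [], []
for _ in range(TRIALS):
    train = [random.gauss(MU, SIGMA) for _ in range(N)]
    pred = 0.5 * statistics.mean(train)      # deliberately biased predictor
    y_new = random.gauss(MU, SIGMA)          # fresh noisy observation
    preds.append(pred)
    sq_errors.append((pred - y_new) ** 2)

bias_sq = (statistics.mean(preds) - MU) ** 2  # analytic value: (0.5*MU - MU)^2 = 1.0
variance = statistics.pvariance(preds)        # analytic value: 0.25 * SIGMA^2 / N = 0.05
noise = SIGMA ** 2                            # intrinsic, irreducible error
mse = statistics.mean(sq_errors)
# empirically, mse is close to bias_sq + variance + noise
```

Shrinking the estimator trades variance (down by a factor of four) for bias, which is exactly the trade-off ensembles exploit.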
Some of the most popular Ensemble Learning models/techniques are – ‘Bagging’, ‘Boosting’ and ‘Stacking’.
“Bagging” stands for –”Bootstrap Aggregation.”
First, multiple datasets are created by sampling the original dataset with replacement (the Bootstrap). A learner is trained on each of these samples, and the learners are then combined to give a final prediction (the Aggregation).
Bootstrap: The process of sampling random subsets, with replacement, from the full training data.
A learner is trained on each of these subsets.
The final model is an aggregation of all these individual learners.
Since each of these bags doesn’t contain the full information, no learner can memorise the complete training data. Even if the models overfit their own ‘bags’, they do not overfit the entire training dataset. This is how Bagging reduces overfitting.
One of the main problems with bagging is that it consumes a lot of resources, because instead of one learner we train several. With “k” bags, resource usage increases roughly k-fold.
Bagging is one of the simplest algorithms. The algorithm in a nutshell:
- Random subsets (“Bags”) of the training sample are taken.
- A model is then trained on each of these subsets.
- The final prediction is an unweighted combination of these models’ outcomes.
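The three steps above can be sketched in plain Python. This is a minimal, hypothetical implementation assuming the weak learner is a one-split regression stump; the function names are illustrative and not from any library.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression stump: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = statistics.mean(left), statistics.mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:                      # degenerate bag: predict the mean
        m = statistics.mean(ys)
        return xs[0], m, m
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagged_fit(xs, ys, n_bags=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_bags):
        # Step 1: bootstrap a random "bag" by sampling with replacement
        idx = [rng.randrange(len(xs)) for _ in xs]
        # Step 2: train one model per bag
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def bagged_predict(stumps, x):
    # Step 3: unweighted average of the individual learners' outputs
    return statistics.mean(predict_stump(s, x) for s in stumps)

xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5              # a step function to learn
stumps = bagged_fit(xs, ys)
low, high = bagged_predict(stumps, 1), bagged_predict(stumps, 8)
```

Each bag sees a slightly different dataset, so the stumps disagree near the step; averaging their outputs smooths those disagreements out.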
Boosting trains the predictors sequentially. The first predictor is trained on the whole dataset, and each subsequent one is trained taking the previous predictors’ performance into account. Initially, every observation in the data is given an equal weight. Observations that the first learner misclassifies then receive an increased weight, so the next learner focuses on them. This is an iterative method: models are added until a certain level of accuracy is reached or a set number of models has been built.
Algorithm in a nutshell:
- Train a simple model
- Train another model on the errors
- Integrate the new model into the ensemble
- Repeat steps 2 and 3
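The steps above can be sketched as residual boosting for regression. This is a hedged, minimal illustration, assuming a one-split regression stump as the simple model; names and the toy data are hypothetical, not from any library.

```python
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression stump: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = statistics.mean(left), statistics.mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=20, lr=0.5):
    f0 = statistics.mean(ys)                  # step 1: a very simple model
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)      # step 2: train on the errors
        t, lm, rm = stump
        # step 3: integrate into the ensemble (scaled by a learning rate)
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
        stumps.append(stump)
    return f0, stumps

def boost_predict(model, x, lr=0.5):
    f0, stumps = model
    return f0 + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)

xs = list(range(8))
ys = [float(x * x) for x in xs]
model = boost(xs, ys)
sse = sum((boost_predict(model, x) - y) ** 2 for x, y in zip(xs, ys))
baseline = sum((statistics.mean(ys) - y) ** 2 for y in ys)
```

Each round fits the current errors, so the training error shrinks steadily; this is also why letting it run too long can overfit, as noted below.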
We start with a simple model, simple in the sense that the learner performs only slightly better than chance. For a binary classifier, ‘chance’ is a 50% probability. Ideally, the ensemble should leave nothing but white noise behind in its errors.
We make the model more complex as we iterate through the algorithm, and halt the process once we reach the required level of complexity.
Since Boosting keeps predicting the errors and thereby approximates the true function, it decreases bias substantially.
Did You Know? It’s a myth that boosting doesn’t overfit. It does!
Interesting fact- According to this video, Jeong-Yoon Lee, a Master Kaggler whose highest rank is 10, considers eXtreme Gradient Boosting to be the best out-of-the-box technique out there. It is an alternative you can count on when you have no idea what to do next!
In Stacking, we reduce error in either bias or variance by using another learner, a meta-learner, to combine the predictions of different base learners.
Ensemble Learning offers several benefits. The aggregate opinion of multiple models is less noisy, and predictions become more accurate because the outputs of multiple experts are combined rather than relying on a single expert. Moreover, a complex problem can be decomposed into many subproblems, making it easier to understand and solve.
Overall, Ensemble Learning is like not putting all your eggs in the same basket and diversifying opinions!
Hope now you’ve understood why so many entries that use Ensemble Learning rank high in data science competitions on platforms like Kaggle.