What are Bias and Variance?
A good way to understand supervised learning is through the bias-variance tradeoff. Whenever we discuss model prediction, it is important to understand prediction error. The prediction error of any machine learning model can be broken down into three parts: bias error, variance error, and irreducible error. Irreducible error cannot be reduced by choosing a better algorithm. It can come from noise in the way the data was collected, or from unknown variables that influence the target. In simple words, we cannot control irreducible error.
Error due to bias is the difference between a model's average prediction and the correct value that we are trying to predict.
Examples of low-bias machine learning algorithms include Decision Trees, k-Nearest Neighbors, and Support Vector Machines.
Examples of high-bias machine learning algorithms include Linear Regression and Logistic Regression.
Error due to variance is the variability of a model's prediction for a given data point when the model is trained on different samples of the training data.
As data scientists, our objective should be to minimize the error from both bias and variance in a balanced way. Understanding bias and variance helps us decide whether a model is overfitting or underfitting. The mathematical representation of error is shown below.
Error = Bias² + Variance + Irreducible Error
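The decomposition can be checked numerically. The following is a minimal sketch under stated assumptions, not a general recipe: the ground-truth function `true_f`, the noise level, the test point `X0`, and the two toy models (`predict_mean` and `predict_1nn`) are all hypothetical and chosen only for illustration. It retrains each model on many freshly sampled training sets, estimates the squared bias and the variance of the prediction at one test point, and compares their sum (plus the noise variance) with the measured squared error.

```python
import random
import statistics

random.seed(0)

NOISE_SD = 1.0   # standard deviation of the irreducible noise (assumed)
X0 = 2.0         # the single test point we study (assumed)

def true_f(x):
    """Hypothetical ground-truth relationship."""
    return x * x

def sample_training_set(n=30):
    """Draw a fresh noisy training set from the hypothetical process."""
    xs = [random.uniform(0.0, 3.0) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """High-bias model: ignores x and predicts the average target."""
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    """Low-bias model: copies the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def decompose(predict, trials=2000):
    """Estimate bias^2, variance, and measured squared error at X0."""
    preds, sq_errors = [], []
    for _ in range(trials):
        xs, ys = sample_training_set()
        p = predict(xs, ys, X0)
        y_new = true_f(X0) + random.gauss(0.0, NOISE_SD)  # unseen noisy target
        preds.append(p)
        sq_errors.append((y_new - p) ** 2)
    bias_sq = (statistics.fmean(preds) - true_f(X0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance, statistics.fmean(sq_errors)

results = {
    name: decompose(fn)
    for name, fn in (("mean", predict_mean), ("1-NN", predict_1nn))
}
for name, (bias_sq, variance, measured) in results.items():
    total = bias_sq + variance + NOISE_SD ** 2
    print(f"{name}: bias^2={bias_sq:.2f} variance={variance:.2f} "
          f"noise={NOISE_SD ** 2:.2f} sum={total:.2f} measured={measured:.2f}")
```

In this setup the averaging model should show the larger squared bias and the nearest-neighbor model the larger variance, while in both cases the three terms add up to roughly the measured error.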
Understanding with an Example
We are trying to predict height using weight. The two hypothetical algorithms below are used on the training data set represented by blue dots. Figure 1 represents the fit of the model on the training data set.
Algorithm 1 has some error on the training data, shown by the perpendicular dotted line, whereas Algorithm 2 squiggles and fits the training data very well with almost zero error.
When we run both models on the testing data, represented by green dots in Figure 2, the error of Algorithm 1 is lower than that of Algorithm 2.
Thus, we can say that Algorithm 2 has low bias (it fits the training data well, as in Figure 1) but high variance (it fails to generalize to the testing data, as in Figure 2). Algorithm 1, by contrast, has higher bias but lower variance.
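The same behavior can be reproduced on made-up numbers. The sketch below is only an illustration: the data generator `make_data` (with its invented linear rule and noise level) is hypothetical, a least-squares straight line stands in for Algorithm 1, and a nearest-neighbor lookup that memorizes the training points stands in for the squiggly Algorithm 2. Neither is claimed to be the algorithm drawn in the figures.

```python
import random

random.seed(1)

def make_data(n):
    """Synthetic (weight_kg, height_cm) pairs; the rule and noise are made up."""
    data = []
    for _ in range(n):
        weight = random.uniform(50.0, 100.0)
        height = 120.0 + 0.6 * weight + random.gauss(0.0, 5.0)
        data.append((weight, height))
    return data

def fit_line(train):
    """Stand-in for Algorithm 1: ordinary least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_1nn(train):
    """Stand-in for Algorithm 2: answer with the nearest memorized point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(100), make_data(1000)
line, squiggle = fit_line(train), fit_1nn(train)

errors = {
    "line": (mse(line, train), mse(line, test)),
    "squiggle": (mse(squiggle, train), mse(squiggle, test)),
}
for name, (train_error, test_error) in errors.items():
    print(f"{name}: train MSE={train_error:.2f}  test MSE={test_error:.2f}")
```

The memorizing model reproduces every training point exactly, so its training error is zero, yet it carries the training noise into every new prediction and loses to the straight line on unseen data.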
So, what is the best model?
- Low Bias: Accurately models the relationship.
- Low Variance: Predicts consistently across new data sets.
The commonly used methods to find the sweet spot are:
- Regularization: A model tuning technique where an additional penalty term is added to the loss function.
- Boosting: A sequential ensemble method where weak learners are trained one after another, each one focusing on the errors of the previous ones.
- Bagging: A parallel ensemble method where the base learners are trained independently, in parallel, and their predictions are combined.
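As an illustration of the first method only (boosting and bagging are not shown), here is a minimal sketch of how a penalty moves a model along the bias-variance scale. It assumes a one-feature linear model without an intercept and an L2 (ridge) penalty, for which the fitted slope has the closed form sum(x·y) / (sum(x²) + penalty). The true slope, the noise level, and the fixed inputs `XS` are hypothetical values picked for the demonstration.

```python
import random
import statistics

random.seed(2)

TRUE_SLOPE = 2.0                               # assumed ground truth
XS = [-1.0 + 2.0 * i / 9 for i in range(10)]   # fixed inputs, symmetric around 0

def ridge_slope(ys, penalty):
    """Closed-form minimizer of sum((y - w*x)^2) + penalty * w^2."""
    sxy = sum(x * y for x, y in zip(XS, ys))
    sxx = sum(x * x for x in XS)
    return sxy / (sxx + penalty)

def slope_stats(penalty, trials=2000):
    """Bias and variance of the fitted slope over many noisy training sets."""
    slopes = []
    for _ in range(trials):
        ys = [TRUE_SLOPE * x + random.gauss(0.0, 2.0) for x in XS]
        slopes.append(ridge_slope(ys, penalty))
    bias = statistics.fmean(slopes) - TRUE_SLOPE
    return bias, statistics.pvariance(slopes)

stats = {penalty: slope_stats(penalty) for penalty in (0.0, 1.0, 5.0)}
for penalty, (bias, variance) in stats.items():
    print(f"penalty={penalty}: bias={bias:+.2f}  variance={variance:.2f}")
```

A larger penalty shrinks the slope toward zero, which lowers its variance across training sets at the cost of a larger bias. Finding the sweet spot means picking the penalty where the combined error is smallest.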
How Does a Learning Curve Help in Understanding Bias and Variance?
Minimizing the error from bias and from variance at the same time is one of the toughest parts of the model building framework, and the abstract nature of these errors makes them difficult to visualize. The learning curve is a tool for visualizing the errors caused by bias and variance together.
A learning curve shows the relationship between the training score and the cross-validated test score for an algorithm as the number of training data points varies. The learning curve discussed here was plotted using the Yellowbrick library.
The classification data was created using a built-in dataset loaded from scikit-learn.
This learning curve shows high test variability and a low score up to around 500 instances. Beyond that level, the model begins to converge on an accuracy score of around 85%, and the training and test scores converge to form a plateau. This visualization is typically used to show two things:
- How much the model benefits from more data. The X axis, ‘Training Instances’, represents the number of data points.
- Whether the model is more sensitive to error due to variance or error due to bias.
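The computation behind such a curve can be sketched without a plotting library. The sketch below is not the Yellowbrick code and does not reproduce the curve described above: it assumes a hypothetical regression data set generated by `make_point`, a least-squares line as the model, mean squared error instead of accuracy, and repeated random train/validation splits as a simple stand-in for cross-validation.

```python
import random

random.seed(3)

def make_point():
    """One synthetic (x, y) pair; the linear rule and noise are made up."""
    x = random.uniform(0.0, 10.0)
    return x, 3.0 + 2.0 * x + random.gauss(0.0, 5.0)

def fit_line(train):
    """Ordinary least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

data = [make_point() for _ in range(600)]
train_sizes = [5, 10, 20, 50, 100, 400]
n_splits = 50

curve = []  # (training size, mean training error, mean validation error)
for size in train_sizes:
    train_errors, val_errors = [], []
    for _ in range(n_splits):
        shuffled = random.sample(data, len(data))   # a new random split
        validation, pool = shuffled[:100], shuffled[100:]
        train = pool[:size]                          # use only `size` points
        model = fit_line(train)
        train_errors.append(mse(model, train))
        val_errors.append(mse(model, validation))
    curve.append((size, sum(train_errors) / n_splits, sum(val_errors) / n_splits))

for size, train_error, val_error in curve:
    print(f"n={size:>3}  train MSE={train_error:6.2f}  "
          f"validation MSE={val_error:6.2f}  gap={val_error - train_error:6.2f}")
```

Reading the printed rows from top to bottom gives the two curves: the gap between them reflects variance, and the level at which they settle reflects bias plus irreducible noise.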
Types of Learning Curve
Bad Learning Curve due to High Bias
- The training and testing curves converge at a low score, i.e., a high error.
- No matter how much data is fed into the model, the model will not represent the underlying relationship and will keep a high error.
- Poor fit with the training data set.
- Poor generalization, as the model has not captured enough information.
- This model can be termed an “underfitted model.”
Possible remedies for high bias:
- Train on more data so that the cross-validation curve has more points over which to converge. Weigh the time spent running a low-complexity model on more data against the alternative of increasing the complexity of the model.
- Train with a more complex model (e.g., a kernelized model, a non-linear model, or an ensemble).
- Get more features into the dataset. Explore the possibility of ‘Omitted Variable Bias’.
- Decrease regularization.
- Try different cross-validation strategies, revisit preprocessing, study class imbalance, etc.
- Try a new model architecture (e.g., boosting).
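A small experiment shows why adding data alone does little for an underfitted model, while a better feature does. This is a hedged sketch on invented data: `make_data` produces a curved relationship (y = x² plus noise), and the helper `fit_line` fits a least-squares line either on the raw input or on a squared feature. Both names and all constants are hypothetical.

```python
import random

random.seed(4)

def make_data(n):
    """Synthetic curved relationship: y = x^2 + noise (made up)."""
    points = []
    for _ in range(n):
        x = random.uniform(-3.0, 3.0)
        points.append((x, x * x + random.gauss(0.0, 1.0)))
    return points

def fit_line(train, feature):
    """Least-squares line on feature(x); returns a prediction function of x."""
    zs = [feature(x) for x, _ in train]
    ys = [y for _, y in train]
    n = len(train)
    mean_z, mean_y = sum(zs) / n, sum(ys) / n
    szz = sum((z - mean_z) ** 2 for z in zs)
    szy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    slope = szy / szz
    intercept = mean_y - slope * mean_z
    return lambda x: intercept + slope * feature(x)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

test = make_data(2000)
results = {}  # training size -> (test MSE with raw x, test MSE with x^2)
for n in (50, 5000):
    train = make_data(n)
    results[n] = (
        mse(fit_line(train, lambda x: x), test),       # raw feature: underfits
        mse(fit_line(train, lambda x: x * x), test),   # squared feature
    )

for n, (raw_error, squared_error) in results.items():
    print(f"n={n:>4}  line on x: {raw_error:.2f}   line on x^2: {squared_error:.2f}")
```

With the raw feature, the test error stays high whether the model sees 50 or 5000 points. With the squared feature, it drops close to the noise level already at 50 points.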
Bad Learning Curve due to High Variance
- A large gap remains between the training and testing curves at their final points.
- The model does not predict consistently on new data sets.
- This model can be termed overfitted if the bias is very low and the gap is large.
Possible remedies for high variance:
- Get more data.
- Decrease the number of features.
- Increase regularization.
- Try a new model architecture (e.g., oversampling, bagging).
- Reduce model complexity (complex models are prone to high variance).
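The effect of reducing model complexity can be sketched with a k-nearest-neighbors regressor, where a larger k means a smoother, less complex model. The data generator `make_data` and the chosen values of k are assumptions made for this illustration. The gap between test and training error serves as a rough proxy for variance.

```python
import random

random.seed(5)

def make_data(n):
    """Synthetic (weight_kg, height_cm) pairs; the rule and noise are invented."""
    data = []
    for _ in range(n):
        weight = random.uniform(50.0, 100.0)
        height = 120.0 + 0.6 * weight + random.gauss(0.0, 5.0)
        data.append((weight, height))
    return data

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model built on `train`, scored on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(100), make_data(1000)

gaps = {}  # k -> test error minus training error
for k in (1, 5, 15):
    train_error = mse(train, train, k)
    test_error = mse(train, test, k)
    gaps[k] = test_error - train_error
    print(f"k={k:>2}  train MSE={train_error:6.2f}  "
          f"test MSE={test_error:6.2f}  gap={gaps[k]:6.2f}")
```

With k = 1 the model memorizes the training set and the gap is large. Averaging over more neighbors narrows it, which is the pattern the remedies above aim for.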
Ideal Learning Curve
- The model generalizes to new data.
- The training and testing learning curves converge at similar values.
- The smaller the gap between the learning curves, the better the model.
To summarize, bias and variance play a major role in the training process of a model. We want each of these error types to be as small as possible, but any effort to decrease one of them beyond a certain limit tends to increase the other. Thus, a trade-off needs to be made in terms of both model complexity and interpretability. Having said that, this article stresses that, given a data set, we should not jump directly to a very complex algorithm, because the bias-variance trade-off then becomes more difficult to manage. The best approach is to start with a statistical linear model and gradually increase model complexity, keeping an eye on the bias-variance trade-off with the help of the learning curve. For me, simpler is better.