Bias-Variance tradeoff
Bias refers to the deviation of the predicted values from the correct values. This error arises when you make wrong assumptions about the data, that is, when you represent a complex real-life problem with a simpler model, even though the simpler model may be easier to understand. For instance, building a linear model to solve a non-linear problem. High bias results in underfitting and a less flexible model. Parametric algorithms like Linear Regression can produce high bias, while non-parametric algorithms like Decision Trees make fewer assumptions about the training data and target function and hence tend not to suffer from high bias.

Variance refers to how much the model's predictions change when it is trained on different training data. It occurs when the model captures not just the underlying pattern but the noise as well, which results in overfitting; in other words, the model memorises the data. This is often observed in Decision Trees. When observations are limited but the number of parameters is high, the model can fit the noise (and, in regression, predictors can become collinear), resulting in high variance. Likewise, when we don't limit the maximum depth, a tree can keep growing until there is a leaf node for every observation.

Put differently, bias reflects how much the data is ignored, and variance reflects how dependent the model is on the data. Adding more parameters increases model complexity, which increases variance and reduces bias; reducing the number of parameters yields a simpler model with higher bias. The optimally complex model is one where any further reduction in one error causes an equivalent increase in the other, and this balance is what Data Scientists strive for.
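A small sketch can make this concrete. The example below (the synthetic sine data, noise level, and depth settings are illustrative choices, not from the original text) fits a straight line, an unrestricted Decision Tree, and a depth-limited tree to non-linear data, then compares train and test error to show underfitting (high bias), overfitting (high variance), and a middle ground.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data: y = sin(x) + noise (illustrative assumption)
rng = np.random.RandomState(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# High bias: a straight line cannot capture the sine shape, so it underfits
linear = LinearRegression().fit(X_train, y_train)

# High variance: with no depth limit the tree grows a leaf per observation
# and memorises the training noise, so it overfits
deep_tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_train, y_train)

# Limiting depth trades a little bias for a large reduction in variance
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in [("linear", linear),
                    ("unlimited tree", deep_tree),
                    ("depth-4 tree", shallow_tree)]:
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:15s} train MSE = {train_mse:.3f}  test MSE = {test_mse:.3f}")
```

Typically the linear model shows similar (but high) train and test error, the unlimited tree shows near-zero train error with much higher test error, and the depth-limited tree lands in between, which is the tradeoff described above.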