A Different Way to Think About Overfitting and Underfitting: 3 Theories You Should Know About
This is the second part of an essay explaining some methods and techniques to reason through the problems of overfitting and underfitting in machine learning models. Yesterday, we introduced the notion of Model Capacity( universe of potential hypothesis functions) as an efficient way to estimate the model’s propensity to overfit or underfit. Today, I would like to explore some theories in machine learning and mathematics that will help to solidify that concept.
The principle of Occam’s Razor is what happens when philosophers get involved in machine learning :) The origins of the this ancient philosophical theory dates back to somewhere between 1287 and 1347 associating it with philosophers like Ptolemy. In essence, the Occam’s Razor theory states that if we have competing hypothesis that explain known observations we should choose the simplest one. From Sherlock Holmes to Monk, Occam’s Razor has been omnipresent in world class’s detectives that often follow the simplest and most logical hypothesis to uncover complex mysteries.
The Occam’s Razor is a wise philosophical principle to follow in our daily lives but its application in machine learning results controversial at best. Simpler hypothesis are certainly preferred from a computational standpoint in a world in which algorithms are notorious for being resource expensive. Additionally, simpler hypothesis are computationally easier to generalize. However, the challenge with ultra-simple hypothesis is that they often result too abstract to model complex scenarios. As a result, a model with a large enough training set and a decent size number of dimensions should select a complex enough hypothesis that can produce a low training error. Otherwise it will be prompt to underfit.
The VC Dimension
The Occam’s Razor is a nice principle of parsimony but those abstract ideals don’t directly translate into machine learning models that live in the universe of numbers. That challenge was addressed by the founders to statistical theory Vapnik and Chervonekis(VC) who came out with a model to quantify the Capacity of a statistic algorithm. Known as the VC Dimension, this techniques is based on determining the largest number m from which exists a training set of m different x points that the target machine learning function can label arbitrarily.
The VC Dimension is one of the cornerstones of statistical learning and has been used as the basics of many interesting theories. For instance, the VC Dimension helps explain that the gap between the generalization error and the training error in a machine learning model decreases as the size of the training set increases but the same gap increases as the Capacity of the model grows. In other words, models with large training sets are more likely to pick the approximately correct hypothesis but if there are too many potential hypothesis then we are likely to end up with the wrong one.
The No Free Lunch Theorem
I would like to end this article with one of my favorite principles iof machine learning relevant to the the overfitting-underfitting problem. The No Free Lunch Theorem states that, averaged over all possible data-generating distributions, every classification algorithm has approximately the same error rate when classifying previously unobserved points. I like to think about the No Free Lunch Theorem as the mathematical counter-theory to the limitation of machine learning algorithms that force us to generalize semi-absolute knowledge using a finite training set. In logic, for instance, inferring universal rules from a finite set of examples is considered “illogical”. For machine learning practitioners, the No Free Lunch Theorem is another way to say that no algorithm is better than others given enough observations. In other words,thee role of a machine learning model is not to find a universal learning function but rather the hypothesis that better fits the target scenario.