What Borges Can Teach Us About Overfitting and Underfitting in Deep Learning Models
Selective forgetting and memory prioritization are key elements of learning.
This is an aggregation of various posts I published last year about overfitting and underfitting.
Overfitting and underfitting are two of the biggest challenges in modern deep learning solutions. I often like to compare deep learning overfitting to human hallucinations, as the former occurs when algorithms start inferring non-existent patterns in datasets. Despite its importance, there is no easy solution to overfitting, and deep learning applications often need techniques specific to individual algorithms in order to avoid it. The problem gets even scarier if you consider that humans are also incredibly prone to overfitting. Just think about how many stereotypes you used in the last week. Yeah, I know….
Unquestionably, our hallucinations or illusions of validity are present somewhere in the datasets used to train deep learning algorithms, which creates an even more chaotic picture. Intuitively, we think about data when working on deep learning algorithms, but there is another equally important and often forgotten element of deep learning models: knowledge. In the context of deep learning algorithms, data is often represented as persisted records in one or more databases, while knowledge is typically represented as logic rules that can be validated against the data. The role of deep learning models is to infer rules that can be applied to new datasets in the same domain. Unfortunately for deep learning agents, powerful computation capabilities do not translate directly into knowledge building, and that is where overfitting occurs.
One of my favorite ways to explain overfitting is using a story from the great Argentine writer Jorge Luis Borges.
Borges, Planes and Deep Learning Overfitting
Jorge Luis Borges is considered one of the most emblematic Latin American writers and was one of my favorite authors during my teenage years. In his story “Funes the Memorious”, Borges tells the story of Funes, a young man with a prodigious memory. Funes is able to remember the exact details of everything he sees, like the shapes of the clouds in the sky at 3:45pm yesterday. However, Funes is tormented by his inability to generalize visual information into knowledge. Borges’ character is regularly surprised by his own image every time he sees himself in the mirror and is unable to determine if the dog seen from the side at 3:14pm is the same dog seen from the back at 3:15pm. To Funes, two things are the same only if every single detail is identical in both of them.
Funes’ story is a great metaphor to explain that knowledge is not only about processing large volumes of information but also about generalizing rules that ignore some of the details in the data. Just like Funes, deep learning algorithms have almost unlimited capacity to process information. However, that computation power is a direct cause of overfitting, as deep learning agents can infer millions of patterns in data sources without incurring a major cost.
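To make the metaphor concrete, here is a minimal sketch of a "Funes" learner next to a learner that generalizes. The even/odd task, the example numbers and the function names are all assumptions made up for illustration; they are not part of the story or of any library.

```python
# Hypothetical toy task: label an integer as "even" or "odd".
train = {2: "even", 7: "odd", 10: "even", 13: "odd"}
unseen = {4: "even", 9: "odd", 22: "even"}


def funes(x):
    """Pure memorizer: recognizes only inputs identical to a stored example."""
    return train.get(x, "never seen this before")


def generalizer(x):
    """Keeps a single rule and ignores every other detail of the input."""
    return "even" if x % 2 == 0 else "odd"


def accuracy(model, examples):
    """Fraction of examples for which the model returns the right label."""
    return sum(model(x) == y for x, y in examples.items()) / len(examples)


print("funes:      ", accuracy(funes, train), accuracy(funes, unseen))
print("generalizer:", accuracy(generalizer, train), accuracy(generalizer, unseen))
```

The memorizer is perfect on what it has already seen and useless on anything new, while the rule that throws away detail handles both sets.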
What You Don’t See is as Important as What You See
During World War II, the Pentagon assembled a team of the country’s most renowned mathematicians in order to develop statistical models that could assist the Allied troops during the war. One of the first assignments consisted of estimating the level of extra protection that should be added to US planes in order to survive the battles with the German air force. Like good statisticians, the team collected data on the damage caused to planes returning from encounters with the Nazis.
For each plane, the mathematicians counted the number of bullet holes across different parts of the plane (doors, wings, motor, etc.). The group then proceeded to make recommendations about which areas of the planes should have additional protection. Not surprisingly, the vast majority of the recommendations focused on the areas that had more bullet holes, assuming that those were the areas targeted by the German planes. There was one exception in the group, a young statistician named Abraham Wald, who recommended focusing the extra protection on the areas that hadn’t shown any damage in the inventoried planes. Why? Very simply, the young mathematician argued that the input dataset (planes) only included planes that had survived the battles with the Germans. Although severe, the damage suffered by those planes was not catastrophic enough to keep them from returning to base. Therefore, he concluded that the planes that didn’t return were likely to have suffered impacts in other areas. Very clever, huh?
The previous story has some very profound lessons for anti-overfitting deep learning techniques. The only way to validate new knowledge is to apply it to unseen datasets, and many times hidden datasets are as important as existing ones. This is known in cognitive psychology as “learning by omission”. As many data scientists know: “one successful deep learning experiment is not enough to prove you right, but one failed experiment is certainly enough to prove you wrong”.
Overfitting and Underfitting in Deep Learning Models
Dumb or Hallucinating
Challenges such as overfitting and underfitting are related to the capacity of a machine learning model to build relevant knowledge based on an initial set of training examples. Conceptually, underfitting is associated with the inability of a machine learning algorithm to infer valid knowledge from the initial training data. In contrast, overfitting is associated with models that create hypotheses tailored so closely to the training data, noise included, that they are of little practical use on new data. Putting it in simpler terms, underfitting models are sort of dumb while overfitting models tend to hallucinate (imagine things that don’t exist) :).
Understanding Model Capacity
Let’s try to formulate a simple methodology to understand overfitting and underfitting in the context of machine learning algorithms.
A typical machine learning scenario starts with an initial dataset that we use to train and test the performance of an algorithm. Conventional statistical wisdom suggests that we use 80% of the dataset to train the model while holding out the remaining 20% to test it. During the training phase, our model will produce a certain deviation from the training data, which is often referred to as the Training Error. Similarly, the deviation produced during the test phase is referred to as the Test Error. From that perspective, the performance of a machine learning model can be judged on its ability to accomplish two fundamental things:
1 — Reduce the Training Error
2 — Reduce the gap between the Training and Test Errors
Those two simple rules can help us understand the concepts of overfitting and underfitting. Basically, underfitting occurs when a model fails at rule #1 and is not able to obtain a sufficiently low error on the training set. Overfitting happens when a model fails at rule #2 and the gap between the test and training errors is too large. You see? Two simple rules help us quantify the levels of overfitting and underfitting in machine learning algorithms.
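The two rules can be checked with a few lines of code. The following is only a sketch under stated assumptions: the synthetic data (y = 2x + 1 plus Gaussian noise), the three toy models and all function names are invented for illustration, and mean squared error stands in for "error".

```python
import random

random.seed(0)

# Hypothetical data: y = 2x + 1 plus Gaussian noise (an assumption for illustration).
data = [(i / 1000, 2 * (i / 1000) + 1 + random.gauss(0, 0.3)) for i in range(1000)]
random.shuffle(data)
train, test = data[:800], data[800:]  # the 80/20 split


def mse(model, examples):
    """Mean squared error of a model over a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)


def fit_mean(examples):
    """Too simple: always predicts the average training label."""
    mean_y = sum(y for _, y in examples) / len(examples)
    return lambda x: mean_y


def fit_line(examples):
    """Ordinary least squares for y = w*x + b, using the closed-form solution."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in examples)
         / sum((x - mean_x) ** 2 for x, _ in examples))
    b = mean_y - w * mean_x
    return lambda x: w * x + b


def fit_memorizer(examples):
    """1-nearest neighbor: returns the label of the closest training point."""
    return lambda x: min(examples, key=lambda p: abs(p[0] - x))[1]


def report(name, fit):
    """Fit on the training split and print Training Error, Test Error and their gap."""
    model = fit(train)
    train_err, test_err = mse(model, train), mse(model, test)
    print(f"{name:10s} train={train_err:.3f} test={test_err:.3f} gap={test_err - train_err:.3f}")
    return train_err, test_err


results = {
    "mean": report("mean", fit_mean),
    "line": report("line", fit_line),
    "memorizer": report("memorizer", fit_memorizer),
}
```

In this setup the constant model fails rule #1 (its Training Error stays high), the memorizer reaches zero Training Error but fails rule #2 (the gap is the largest of the three), and the straight line does reasonably well on both.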
Another super important concept that tremendously helps machine learning practitioners deal with underfitting and overfitting is the notion of Capacity. Conceptually, Capacity represents the set of functions that a machine learning model can select as a possible solution. For instance, a linear regression model has as its Capacity all degree 1 polynomials of the form y = w*x + b (meaning all the potential solutions).
Capacity is an incredibly relevant concept in machine learning models. Technically, a machine learning algorithm performs best when it has a Capacity that is proportional to the complexity of its task and the size of the training dataset. Machine learning models with low Capacity are impractical when it comes to solving complex tasks and tend to underfit. Along the same lines, models with higher Capacity than needed are prone to overfit. From that perspective, Capacity represents a measure by which we can estimate the propensity of the model to underfit or overfit.
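One way to see the effect of Capacity is to sweep it and watch both errors. The sketch below makes several assumptions that are not in the text above: the task is a noisy sine wave, the model family is a piecewise-constant function with k equal-width bins, and k is used as a stand-in for Capacity (more bins means more functions the model can represent).

```python
import math
import random

random.seed(1)


def make_data(n):
    """Hypothetical task: noisy samples of sin(2*pi*x) for x in [0, 1)."""
    data = []
    for _ in range(n):
        x = random.random()
        data.append((x, math.sin(2 * math.pi * x) + random.gauss(0, 0.3)))
    return data


def fit_bins(examples, k):
    """Piecewise-constant model with k equal-width bins; k plays the role of Capacity."""
    overall = sum(y for _, y in examples) / len(examples)
    sums, counts = [0.0] * k, [0] * k
    for x, y in examples:
        i = min(int(x * k), k - 1)
        sums[i] += y
        counts[i] += 1
    # Each bin predicts its own mean; empty bins fall back to the overall training mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * k), k - 1)]


def mse(model, examples):
    """Mean squared error of a model over a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)


train, test = make_data(60), make_data(2000)
capacities = [1, 2, 4, 8, 16, 32, 64, 128, 256]
curve = []
for k in capacities:
    model = fit_bins(train, k)
    curve.append((k, mse(model, train), mse(model, test)))
    print(f"bins={k:3d} train={curve[-1][1]:.3f} test={curve[-1][2]:.3f}")

best_k = min(curve, key=lambda row: row[2])[0]
print("lowest test error at bins =", best_k)
```

Because each bin count doubles the previous one, the Training Error can only go down as Capacity grows, while the Test Error is expected to fall at first and then rise again once the model has more bins than the 60 training points can support.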
The principle of Occam’s Razor is what happens when philosophers get involved in machine learning :) The principle is attributed to William of Ockham, who lived from around 1287 to 1347, although similar ideas have been associated with much earlier thinkers like Ptolemy. In essence, Occam’s Razor states that if we have competing hypotheses that explain known observations, we should choose the simplest one. From Sherlock Holmes to Monk, Occam’s Razor has been omnipresent among world-class detectives, who often follow the simplest and most logical hypothesis to uncover complex mysteries.
Occam’s Razor is a wise philosophical principle to follow in our daily lives, but its application in machine learning is controversial at best. Simpler hypotheses are certainly preferred from a computational standpoint in a world in which algorithms are notorious for being resource expensive. Additionally, simpler hypotheses tend to generalize more easily. However, the challenge with ultra-simple hypotheses is that they are often too abstract to model complex scenarios. As a result, a model with a large enough training set and a decent number of dimensions should select a hypothesis complex enough to produce a low training error. Otherwise it will be prone to underfit.
The VC Dimension
Occam’s Razor is a nice principle of parsimony, but those abstract ideals don’t directly translate into machine learning models that live in the universe of numbers. That challenge was addressed by Vapnik and Chervonenkis (VC), two of the founders of statistical learning theory, who came up with a model to quantify the Capacity of a statistical algorithm. Known as the VC Dimension, this technique is based on determining the largest number m for which there exists a training set of m different x points that the target machine learning function can label arbitrarily.
The VC Dimension is one of the cornerstones of statistical learning and has been used as the basis for many interesting theories. For instance, the VC Dimension helps explain that the gap between the generalization error and the training error in a machine learning model decreases as the size of the training set increases, but the same gap increases as the Capacity of the model grows. In other words, models with large training sets are more likely to pick the approximately correct hypothesis, but if there are too many potential hypotheses then we are likely to end up with the wrong one.
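For very small hypothesis classes, the definition can be checked by brute force. The sketch below is an illustration under assumptions: the two classifier families (one-dimensional thresholds and intervals), the pool of candidate points and every function name are chosen here for the example. Searching a finite pool only gives a lower bound in general, although for these two families it matches the known values.

```python
from itertools import combinations


def threshold_labelings(points):
    """All labelings of `points` produced by h_t(x) = 1 if x >= t else 0."""
    cuts = list(points) + [max(points) + 1]
    return {tuple(int(x >= t) for x in points) for t in cuts}


def interval_labelings(points):
    """All labelings of `points` produced by h_ab(x) = 1 if a <= x <= b else 0."""
    labelings = {tuple(0 for _ in points)}  # the empty interval labels everything 0
    for a in points:
        for b in points:
            labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings


def shatters(labelings_of, points):
    """True if the class can label the points in every one of the 2^m possible ways."""
    return len(labelings_of(points)) == 2 ** len(points)


def vc_dimension(labelings_of, pool):
    """Largest m such that some m-point subset of `pool` is shattered."""
    best = 0
    for m in range(1, len(pool) + 1):
        if any(shatters(labelings_of, subset) for subset in combinations(pool, m)):
            best = m
    return best


pool = [0.0, 1.0, 2.0, 3.0]
print("thresholds:", vc_dimension(threshold_labelings, pool))
print("intervals: ", vc_dimension(interval_labelings, pool))
```

A threshold can label one point arbitrarily but never two (it cannot mark the smaller point 1 and the larger point 0), while an interval handles any two points but fails on three (it cannot mark the outer points 1 and the middle point 0).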
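The behavior of that gap can be tabulated with one commonly quoted form of the VC generalization bound. Treat the formula in this sketch as an assumption: the exact constants differ between references, the bound only makes sense when the VC Dimension d is smaller than the number of training examples m, and the function name is made up for illustration.

```python
import math


def vc_gap_bound(m, d, delta=0.05):
    """Upper bound on (generalization error - training error) that holds with
    probability at least 1 - delta, for VC Dimension d and m training examples."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)


for d in (10, 100):
    for m in (1_000, 10_000, 100_000):
        print(f"d={d:3d} m={m:6d} gap <= {vc_gap_bound(m, d):.3f}")
```

Reading the table row by row shows both effects described above: for a fixed d the bound shrinks as m grows, and for a fixed m it grows with d.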
The No Free Lunch Theorem
I would like to end this article with one of my favorite principles of machine learning relevant to the overfitting-underfitting problem. The No Free Lunch Theorem states that, averaged over all possible data-generating distributions, every classification algorithm has approximately the same error rate when classifying previously unobserved points. I like to think about the No Free Lunch Theorem as the mathematical counter-theory to the limitation of machine learning algorithms that forces us to generalize semi-absolute knowledge using a finite training set. In logic, for instance, inferring universal rules from a finite set of examples is considered “illogical”. For machine learning practitioners, the No Free Lunch Theorem is another way to say that no algorithm is universally better than any other across all possible problems. In other words, the role of a machine learning model is not to find a universal learning function but rather the hypothesis that best fits the target scenario.
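The averaging argument can be reproduced on a tiny scale. The following sketch is a toy illustration, not the theorem itself: it assumes a world of 3-bit inputs, treats every possible labeling of those inputs as equally likely, and compares three made-up learners on the inputs they never saw during training.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))   # all 8 possible 3-bit inputs
train_x, test_x = inputs[:5], inputs[5:]   # 5 seen in training, 3 never seen


def always_zero(train_set):
    """Ignores the training data and predicts 0 everywhere."""
    return lambda x: 0


def majority_vote(train_set):
    """Predicts whichever label was more common in the training data."""
    guess = int(2 * sum(train_set.values()) > len(train_set))
    return lambda x: guess


def copy_first_bit(train_set):
    """Predicts the first bit of the input."""
    return lambda x: x[0]


def average_unseen_error(learner):
    """Error on the unseen inputs, averaged over all 256 possible target functions."""
    mistakes = total = 0
    for labels in product([0, 1], repeat=len(inputs)):
        target = dict(zip(inputs, labels))
        model = learner({x: target[x] for x in train_x})
        mistakes += sum(model(x) != target[x] for x in test_x)
        total += len(test_x)
    return mistakes / total


for learner in (always_zero, majority_vote, copy_first_bit):
    print(learner.__name__, average_unseen_error(learner))
```

Every learner ends up with the same average error on unseen inputs, because for any prediction exactly half of the possible target functions disagree with it. A learner only pulls ahead once we restrict attention to the kinds of targets that actually occur in the scenario we care about.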