Practical Deep Learning: These Metrics Will Help Evaluate the Performance of Your Deep Learning Model
One of the most difficult tasks in deep learning applications is establishing and evaluating the performance metrics of a model. In the deep learning world, perfection is often not a feasible objective, so clear performance metrics are fundamental to understanding the behavior of a model.
The process of selecting performance metrics for a deep learning algorithm is far from trivial. To begin with, we are not talking about a single metric but rather a collection of them. While training processes can use well-established metrics such as cost functions, the story is fairly different when it comes to evaluating the runtime behavior of a model.
The Accuracy Fallacy
We often describe the performance of a model in terms of accuracy, a metric that reflects the percentage of cases in which the model produces the right answer. However, accuracy is often misleading because, as we know all too well, some errors are more costly than others.
Let’s take the example of a deep learning model that attempts to detect voter fraud. The algorithm will be a binary classifier that processes individual votes and classifies them as legit or fraudulent. Clearly, the cost of an error that flags a fraudulent vote as legit is bigger than that of an error that marks a legit vote as potentially fraudulent (some countries might disagree but you get the idea ;) ). Suppose that our model processes a million votes and misses 100 fraudulent votes. If those are its only mistakes, the accuracy of the model is 99.99%, which looks pretty high, but most likely our model won’t be used in real elections any time soon :).
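To make the arithmetic concrete, here is a minimal sketch of the accuracy calculation for the vote example. It assumes that the 100 missed fraudulent votes are the only errors the model makes, which the example does not state explicitly; the function and variable names are purely illustrative.

```python
def accuracy(num_correct: int, num_total: int) -> float:
    """Fraction of cases in which the model produced the right answer."""
    return num_correct / num_total


total_votes = 1_000_000
missed_fraud = 100  # fraudulent votes wrongly classified as legit

# Assumption: the missed fraudulent votes are the model's only errors.
acc = accuracy(total_votes - missed_fraud, total_votes)
print(f"accuracy = {acc:.2%}")  # about 99.99%
```

The number looks excellent, yet it says nothing about how many of the fraudulent votes were actually caught, which is the question that matters here.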
Precision and Recall
One way to address the accuracy fallacy is to introduce two metrics known as precision and recall. Precision is the fraction of the model's detections that are correct, while recall is the fraction of true events that were detected. In our example, where a fraudulent vote is the event to detect, a binary classifier that predicts all votes to be legit has perfect precision (it never raises a false alarm) but zero recall. Similarly, a system that claims that all votes are fraudulent will have perfect recall, while its precision will equal the real percentage of fraudulent votes.
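The two degenerate classifiers described above can be checked with a small sketch. The labels are made up for illustration (1 means fraudulent, 0 means legit), and treating precision as 1.0 when nothing is flagged is a convention assumed here, since the ratio is otherwise undefined.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), with 1 = fraudulent and 0 = legit."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: with no detections at all, precision counts as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    # Assumed convention: with no true events at all, recall counts as 1.0.
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical data: 2 of 10 votes are fraudulent.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
all_legit = [0] * len(y_true)  # classifier that never flags a vote
all_fraud = [1] * len(y_true)  # classifier that flags every vote

print(precision_recall(y_true, all_legit))  # (1.0, 0.0)
print(precision_recall(y_true, all_fraud))  # (0.2, 1.0)
```

In the second case the precision of 0.2 is exactly the share of fraudulent votes in the data, as the text predicts.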
The relationship between precision and recall is typically shown in a two-dimensional graph with precision on the y-axis and recall on the x-axis. That chart is known as the Precision-Recall or PR Curve and is one of the most important visualizations for understanding the performance of deep learning models. Using the PR Curve allows us to adjust the model, trading precision for recall and vice versa. Often we want to summarize that performance with a single number rather than a whole curve. One option is the area beneath the curve; another is to combine precision and recall using the following expression:
F = 2 * precision * recall / (precision + recall)
This metric is known as the F-Score (more precisely the F1 score, the harmonic mean of precision and recall) and is often used in deep learning frameworks.
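As a rough sketch of how the trade-off works in practice, the snippet below sweeps a decision threshold over model scores, computes precision, recall and the F-Score at each threshold, and picks the threshold with the best F-Score. The scores and labels are invented for illustration, the helper names are hypothetical, and the code assumes the data contains at least one fraudulent vote.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pr_points(y_true, scores):
    """One (threshold, precision, recall, F) tuple per distinct score.

    A vote is flagged as fraudulent when its score is >= the threshold.
    Assumes y_true contains at least one positive (1) label.
    """
    total_pos = sum(y_true)
    points = []
    for th in sorted(set(scores), reverse=True):
        flagged = sum(1 for s in scores if s >= th)
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        precision = tp / flagged  # flagged >= 1 because th is one of the scores
        recall = tp / total_pos
        points.append((th, precision, recall, f_score(precision, recall)))
    return points


# Hypothetical labels (1 = fraudulent) and model scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]

points = pr_points(y_true, scores)
for th, p, r, f in points:
    print(f"threshold={th:.1f} precision={p:.2f} recall={r:.2f} F={f:.2f}")

# Threshold whose operating point has the highest F-Score.
best_threshold = max(points, key=lambda pt: pt[3])[0]
```

Plotting the (recall, precision) pairs from `points` gives the PR Curve; lowering the threshold moves along it toward higher recall, usually at the cost of precision.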
Another very useful metric in deep learning is coverage. This metric quantifies the fraction of examples for which the system is able to produce a response. Coverage is particularly useful in scenarios in which a deep learning model may decline to produce a response, for instance when it is not confident enough in its answer. A system can trivially achieve high accuracy by responding only to the few cases it is sure about, but its coverage will then be very low.
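The interplay between coverage and accuracy can be illustrated with a small sketch. Representing "no response" as `None` is an assumption made for this example, as are the data and the function name.

```python
def coverage_and_accuracy(y_true, predictions):
    """Return (coverage, accuracy on answered examples).

    A prediction of None means the system produced no response.
    Accuracy is None when the system answered nothing.
    """
    answered = [(t, p) for t, p in zip(y_true, predictions) if p is not None]
    coverage = len(answered) / len(y_true)
    if not answered:
        return coverage, None
    accuracy = sum(1 for t, p in answered if t == p) / len(answered)
    return coverage, accuracy


# Hypothetical system that only answers the two cases it is sure about.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predictions = [1, None, None, None, 0, None, None, None, None, None]

print(coverage_and_accuracy(y_true, predictions))  # (0.2, 1.0)
```

Here every answer is correct, yet the system responds to only 20% of the examples, so reporting accuracy alone would paint far too rosy a picture.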