No Data, No Problem: Some Thoughts About Dataset Augmentation in Deep Learning Solutions
Deep learning models are only as good as the quality of their training dataset. People often fall into the trap of thinking that every challenge in a deep learning model can be solved with additional training. The reality is that, in many scenarios, data is simply not available or is of very poor quality. So what do you do when there is no data? Fake it :)
Dataset augmentation is a very popular technique in modern deep learning solutions. The premise of dataset augmentation is to create artificial data records and add them to the training set in order to improve the training process. How is that even possible? If we train a model with “non-real” data it should produce the wrong results, shouldn’t it? Well, not really. By fake data we don’t mean any fake-fake data :) Think about it like a sophisticated forgery of a painter’s masterpiece. It turns out that, in many deep learning scenarios, the results of a deep learning algorithm don’t change drastically with small variations in the training data. From that perspective, it is relatively simple to augment the training dataset by introducing small variations of the original data that leave the expected output unchanged.
Traditional deep learning tasks such as image recognition or speech analysis have been some of the greatest beneficiaries of dataset augmentation. High-dimensional images vary along many factors that are mathematically simple to simulate. For instance, shifting an image by a few pixels in any direction doesn’t really alter the results of many image recognition algorithms. Obviously, there are many image variations that cannot be produced with simple pixel transformations, which makes dataset augmentation impractical for those use cases.
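The pixel-shift idea above can be sketched in a few lines of NumPy. This is only an illustration under some assumptions: images are 2-D grayscale arrays, vacated pixels are filled with zeros, and the helper names `shift_image` and `augment_with_shifts` are made up for this example rather than taken from any library.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image by dx columns and dy rows, zero-filling the gap."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Overlapping region between the original and the shifted image.
    src_rows = slice(max(0, -dy), max(0, min(h, h - dy)))
    dst_rows = slice(max(0, dy), max(0, min(h, h + dy)))
    src_cols = slice(max(0, -dx), max(0, min(w, w - dx)))
    dst_cols = slice(max(0, dx), max(0, min(w, w + dx)))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

def augment_with_shifts(images, labels, max_shift=1):
    """Return every image shifted by up to max_shift pixels in each direction.

    The unshifted original (dx=0, dy=0) is included, and each shifted copy
    keeps the label of the image it came from.
    """
    aug_images, aug_labels = [], []
    for image, label in zip(images, labels):
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                aug_images.append(shift_image(image, dx, dy))
                aug_labels.append(label)
    return np.stack(aug_images), np.array(aug_labels)
```

With `max_shift=1`, each original image yields nine training records (the original plus eight shifted copies), all sharing the same label. Whether zero-filling is the right choice depends on the data; it is simply the easiest option to show here.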
Do you remember the chaos monkey techniques that are widely used for testing large distributed software applications? The principle is to increase the robustness of an architecture by introducing arbitrary failure conditions. Well, noise injection can be seen as a distant cousin of chaos monkey techniques, but applied to knowledge building scenarios.
Noise injection is a form of dataset augmentation that attempts to increase the robustness of deep neural networks by introducing random noise in the training data. In the case of deep learning models, noise injection is not limited to the input data: it is also regularly applied to the input of the hidden nodes and, in some extreme scenarios, to the output data.
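As a rough sketch of what noise injection looks like in practice, the snippet below adds Gaussian noise to the input and to the values feeding the hidden nodes of a tiny two-layer network. Everything here is an assumption made for illustration: the function names, the Gaussian noise, the ReLU activation, and the default noise levels are not prescribed by any particular framework.

```python
import numpy as np

def add_gaussian_noise(x, std, rng):
    """Return x plus zero-mean Gaussian noise with standard deviation std."""
    return x + rng.normal(0.0, std, size=x.shape)

def noisy_forward(x, w1, w2, rng, input_std=0.1, hidden_std=0.05, training=True):
    """Forward pass of a two-layer network with optional noise injection.

    Noise is only applied while training; at evaluation time the pass is
    deterministic.
    """
    if training:
        # Noise on the input data.
        x = add_gaussian_noise(x, input_std, rng)
    z = x @ w1
    if training:
        # Noise on the input of the hidden nodes (before the activation).
        z = add_gaussian_noise(z, hidden_std, rng)
    h = np.maximum(0.0, z)  # ReLU activation
    return h @ w2
```

A call such as `noisy_forward(x, w1, w2, np.random.default_rng(0))` produces a slightly different output on each training step, while `training=False` gives the clean prediction. Noise on the output data is left out of this sketch.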
Dataset augmentation can have a profound impact on the performance of deep learning models. As a result, it is recommended to keep very accurate measurements of specific data transformations and their impact on the performance of the model. Many times, improvements in the output of a deep learning algorithm come from augmentation of the training dataset rather than from improvements in the model itself. Quantifying the impact of specific dataset augmentations is key to understanding the runtime behavior and improving the training of deep learning models.
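One simple way to keep those measurements is to train once without augmentation and once per augmentation, then record the difference in score. The sketch below assumes a hypothetical `train_and_score(X, y)` callable that trains a model and returns a single validation metric, and a dictionary of hypothetical augmentation functions that each return an augmented `(X, y)` pair; neither comes from a specific library.

```python
def compare_augmentations(train_and_score, X, y, augmentations):
    """Score each augmentation against an un-augmented baseline.

    train_and_score: callable (X, y) -> float, supplied by the caller.
    augmentations: dict mapping a name to a callable (X, y) -> (X_aug, y_aug).
    """
    baseline = train_and_score(X, y)
    report = {"baseline": {"score": baseline, "delta": 0.0}}
    for name, augment in augmentations.items():
        X_aug, y_aug = augment(X, y)
        score = train_and_score(X_aug, y_aug)
        # delta > 0 means the augmentation improved on the baseline.
        report[name] = {"score": score, "delta": score - baseline}
    return report
```

Because the model and training procedure stay fixed across runs, any change in the score can be attributed to the augmentation rather than to the model, at least up to the run-to-run variance of training.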