This is the second part of an essay that explores fundamental deep learning techniques in natural language processing (NLP) platforms. In the first part, we discussed the main value propositions as well as the history of NLP models. We also explored the grandfather of NLP algorithms, n-grams, which provides a statistical model that estimates the probability of a token in a sequence based on the previous n-1 tokens or words. Today, I would like to expand our discussion to other techniques actively used in NLP models.
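As a quick refresher on that idea, here is a minimal sketch of a bigram (n=2) model in Python. The toy corpus and counts are purely illustrative; a real model would be trained on a large dataset and smoothed to handle unseen pairs:

```python
from collections import Counter

# Toy corpus for illustration only
corpus = "the cat sat on the mat the cat ate".split()

# Count each (previous word, word) pair and each context word
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """Estimate P(word | prev) from raw counts."""
    return bigrams[(prev, word)] / contexts[prev]

# "the" is followed by "cat" twice out of three occurrences
print(bigram_prob("the", "cat"))  # 0.666...
```

In practice, this count-based estimate is exactly where the curse of dimensionality bites: the number of possible n-token combinations explodes as n grows, and most of them never appear in the training data.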
One of the main challenges with n-gram models is that they are vulnerable to the curse of dimensionality (see my previous article about the curse of dimensionality). This is particularly relevant when processing grammatically and syntactically rich languages, which contain multi-dimensional structures. One of the popular techniques that tries to address some of the challenges of n-grams is known as Neural Language Models (NLMs), which have been widely implemented in modern deep learning frameworks.
Neural Language Models
The curse of dimensionality is the Achilles' heel of models like n-grams that rely purely on statistical computations, which tend to underperform in scenarios with a large number of dimensions. NLMs overcome this challenge by using techniques that can recognize similarities between two words while still recognizing the unique characteristics of each word. From that perspective, NLMs avoid having to build knowledge subsets for each specific word in a dataset.
The magic behind NLMs comes from a technique called Distributed Representation, a modality of Representation Learning that tries to learn individual features within a dataset by segmenting the input into smaller combinations of attributes. The classic example of Distributed Representation is generating new images by composing objects identified in other images.
In the context of NLP, NLMs leverage Representation Learning to create a knowledge structure that links similar words based on their statistical strengths. For instance, let's take media articles that often use the terms "artificial intelligence", "deep learning" and "machine learning" interchangeably. In that domain, an NLM will map the attributes of those three terms and infer predictions about one term [e.g., deep learning] based on sentences containing the other terms [artificial intelligence, machine learning].
Technically, NLMs refer to these individual word representations as word embeddings. If we visualize the output of NLM models, we will see the different word embeddings positioned at a proximity determined by the similarities of their attributes.
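To make the proximity idea concrete, here is a minimal sketch that compares word embeddings with cosine similarity. The vectors below are hypothetical 4-dimensional embeddings invented for illustration; real embeddings learned by an NLM typically have hundreds of dimensions:

```python
import math

# Hypothetical embeddings for illustration only; real ones are learned from data
embeddings = {
    "artificial intelligence": [0.90, 0.80, 0.10, 0.20],
    "deep learning":           [0.85, 0.75, 0.20, 0.10],
    "banana":                  [0.10, 0.00, 0.90, 0.80],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, 0.0 unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related terms land close together in the embedding space
print(cosine(embeddings["artificial intelligence"], embeddings["deep learning"]))
# Unrelated terms land far apart
print(cosine(embeddings["artificial intelligence"], embeddings["banana"]))
```

This is exactly what an embedding visualization shows: the first pair of terms would appear as near neighbors, while the unrelated word sits in a distant region of the space.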
n-grams + NLMs
NLMs provide many tangible advantages over n-grams and other NLP techniques, but they are not without drawbacks. n-grams tend to be more efficient than NLMs at achieving high model capacity because they require little computation to process an input dataset. If NLMs and n-grams bring complementary benefits, then why not combine the two?
The idea of aggregating NLMs and n-grams has been gaining momentum in the deep learning community. Ensemble learning techniques offer many ways in which the two algorithms can be combined to deliver highly sophisticated NLP models.
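One simple way to combine the two is linear interpolation: blend the n-gram estimate with the neural estimate using a mixing weight. The sketch below uses placeholder probability functions standing in for a trained n-gram model and a trained NLM; the probabilities and the weight are illustrative assumptions, not real model outputs:

```python
# Placeholder for a count-based n-gram model's estimate (illustrative values)
def ngram_prob(word, context):
    return {"learning": 0.5, "cat": 0.1}.get(word, 0.01)

# Placeholder for a neural language model's estimate (illustrative values)
def nlm_prob(word, context):
    return {"learning": 0.7, "cat": 0.05}.get(word, 0.02)

def interpolated_prob(word, context, lam=0.5):
    """Blend the two estimates: lam * P_ngram + (1 - lam) * P_nlm."""
    return lam * ngram_prob(word, context) + (1 - lam) * nlm_prob(word, context)

print(interpolated_prob("learning", ["deep"]))  # 0.5 * 0.5 + 0.5 * 0.7 = 0.6
```

In a real ensemble, the weight would be tuned on held-out data, letting the fast count-based model and the generalizing neural model cover each other's weaknesses.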