Deep learning is a broad discipline that has been steadily expanding its areas of research. As a result, the number and variety of deep learning models has grown drastically over the last few years. Different types of deep neural networks have specialized in solving specific types of problems and processing specific types of data. For instance, convolutional neural networks (CNNs) are mostly used to process multi-dimensional datasets such as images. Another type of architecture that has become very popular in deep learning systems is the recurrent neural network (RNN), which has been widely implemented in many of the popular deep learning frameworks on the market.
Just as CNNs specialize in processing multi-dimensional data, the focus of RNNs is on handling sequential datasets. A sequential dataset is typically represented by a vector (x(1), x(2), …, x(t)) whose entries are individual input records, typically distributed across a time sequence t.
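To make the notation concrete, here is a minimal sketch of a sequential dataset as a time-indexed vector. The values are purely illustrative; any time-ordered series (stock prices, sensor readings, word embeddings) has the same shape.

```python
import numpy as np

# A toy sequential dataset: one observation per time step t = 1..5.
# The values here are made up; any time-indexed series works the same way.
x = np.array([0.2, 0.5, 0.1, 0.9, 0.4])

# Each entry x(t) is an individual input record in the sequence.
for t, x_t in enumerate(x, start=1):
    print(f"x({t}) = {x_t}")
```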
The origins of RNNs date back to the mid-1980s, when computer scientists such as Rumelhart started applying new ideas from statistical models, such as parameter sharing, to generalize knowledge across a sequence of values. Parameter sharing proved pivotal for the creation of RNNs, as it made it possible to apply machine learning models to sequential datasets of different lengths and generalize knowledge from them while maintaining the performance of the computation graph.
To illustrate the use of parameter sharing, let’s take an example from the natural language processing (NLP) space. Imagine that we are having a conversation about the NFL playoffs that just started this weekend and we say something along the lines of “the Saints beat the Panthers last night” or “last night, the Saints beat the Panthers”. If we have a machine learning model that is trying to identify the objects in the sentence, we would like it to recognize “the Saints” and “the Panthers” regardless of the position in which they appear in each sentence. A traditional fully connected feed-forward network uses different parameters for each input position, which means it would assign different weights to the two sentences, resulting in a computationally inefficient graph. Parameter sharing addresses that challenge by sharing weights across different time steps, allowing the model to scale across inputs of different lengths.
You can visualize RNNs as a sequential computation graph composed of inputs, hidden units, loss functions and output units. In mathematical terms, an RNN can be expressed using variations of the following equation:
h(t) = f(h(t-1), x(t), p)
where h is the hidden state, x represents the inputs, p the parameters, and t the time step.
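The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: here f is assumed to be the classic tanh transition tanh(W_h·h + W_x·x + b), and the weight matrices stand in for the parameter set p. All sizes and values are made up for the example.

```python
import numpy as np

# Minimal vanilla RNN recurrence h(t) = f(h(t-1), x(t), p), sketched in NumPy.
# We assume f is tanh(W_h @ h + W_x @ x + b); sizes and values are illustrative.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # p: recurrent weights
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1   # p: input weights
b = np.zeros(hidden_size)                                    # p: bias

def step(h_prev, x_t):
    """One state transition: h(t) = f(h(t-1), x(t), p)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Unroll the recurrence over a sequence of 5 input vectors.
xs = rng.standard_normal((5, input_size))
h = np.zeros(hidden_size)  # initial state h(0)
for x_t in xs:
    h = step(h, x_t)
print(h.shape)  # the final hidden state has shape (4,)
```

Note that the loop applies the same `step` function, with the same weights, at every time step; that is exactly the parameter sharing described above.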
The equation is a fancy way of saying that an RNN is trained to predict the future based on the immediate past. From that perspective, we can infer that RNN models tend to have the same input size at every step regardless of the sequence length, because the model is expressed in terms of transitions between states rather than in terms of variable-length histories. Similarly, RNNs reuse the same transition function with the same parameters at every time step (parameter sharing). That characteristic allows RNNs to be very efficient at generalizing knowledge, even to sequences unlike those seen in the training dataset.
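A quick sketch of why this matters for variable-length inputs: because the transition function is reused at every step, one fixed set of weights can process sequences of any length. The weights and inputs below are random placeholders, purely for illustration.

```python
import numpy as np

# Parameter sharing in action: one fixed set of weights (the same transition
# function) processes sequences of different lengths. Values are illustrative.
rng = np.random.default_rng(1)
W_h = rng.standard_normal((4, 4)) * 0.1  # shared recurrent weights
W_x = rng.standard_normal((4, 3)) * 0.1  # shared input weights

def run(xs):
    """Fold an entire sequence into a final hidden state of fixed size."""
    h = np.zeros(4)
    for x_t in xs:                       # same W_h, W_x at every time step
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

short_seq = rng.standard_normal((3, 3))   # a length-3 sequence
long_seq = rng.standard_normal((10, 3))   # a length-10 sequence
print(run(short_seq).shape, run(long_seq).shape)  # both (4,)
```

A feed-forward network with per-position weights would need a separate parameter set for each sequence length; here both sequences flow through identical parameters and produce a state of the same size.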
Later this week, we will continue the discussion about RNNs and their close cousin: recursive networks.