Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks.
What are RNNs?
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that is a poor assumption: to predict the next word in a sentence, you need to know which words came before it.
Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later).
If the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs.
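Parameter sharing can be made concrete with a short sketch. This is a toy example: the `step` function and the shapes (hidden size 4, input size 3) are illustrative assumptions, not the exact model built later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared set of weights, reused at every time step.
W_in = rng.normal(size=(4, 3))   # input -> hidden
W_rec = rng.normal(size=(4, 4))  # hidden -> hidden

def step(x_t, s_prev):
    # The same W_in and W_rec are applied at every step.
    return np.tanh(W_in @ x_t + W_rec @ s_prev)

s = np.zeros(4)                       # the initial "memory"
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 inputs
    s = step(x_t, s)                  # s carries information forward

print(s.shape)  # (4,)
```

Unrolling the loop over a 5-step sequence gives exactly the 5-layer network described above, except that every "layer" uses the same two weight matrices.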
What can RNNs do?
Language Modeling and Generating Text
Given a sequence of words we want to predict the probability of each word given the previous words. A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities.
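Sampling from the output probabilities can be sketched in a few lines. The vocabulary and the probability vector here are made up for illustration; in a real model the probabilities would come from the network's output at each step.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["the", "cat", "sat", "on", "mat"]
# Pretend these are the model's next-word probabilities at some step.
probs = np.array([0.4, 0.1, 0.2, 0.2, 0.1])

# Generate a short text by repeatedly sampling from the output distribution.
words = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(6)]
print(" ".join(words))
```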
A key difference from language modeling is that in tasks such as Machine Translation, the output only starts after we have seen the complete input: the first word of a translation may depend on the whole source sentence.
Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.
Training an RNN is similar to training a traditional neural network. We also use the backpropagation algorithm, but with a little twist. Because the parameters are shared across all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on those of the previous time steps. This variant of the algorithm is called Backpropagation Through Time (BPTT).
In words, the probability of a sentence is the product of probabilities of each word given the words that came before it.
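Concretely, for a sentence of m words w_1, ..., w_m:

P(w_1, ..., w_m) = prod_{i=1}^{m} P(w_i | w_1, ..., w_{i-1})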
Such a model can be used as a scoring mechanism. For example, a Machine Translation system typically generates multiple candidate translations for an input sentence, and a language model can pick the most probable one.
Because we can predict the probability of a word given the preceding words, we are able to generate new text: it is a generative model.
Training Data and Preprocessing
- Tokenize Text
- Remove infrequent words: Most words in the text will only appear once or twice. It’s a good idea to remove these infrequent words, since a huge vocabulary will make our model slow to train.
- Prepend special start and end tokens: We also want to learn which words tend to start and end a sentence.
- Build training data matrices: We construct a word_to_index mapping over the vocabulary, and the SENTENCE_START and SENTENCE_END tokens let us shift each sentence by one position to form input/target pairs.
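The steps above can be sketched end to end. This is a minimal version: the sample corpus, the tiny vocabulary size, and the UNKNOWN_TOKEN name are illustrative assumptions.

```python
import numpy as np
from collections import Counter

sentences = ["the cat sat", "the dog sat", "the cat ran"]
VOCAB_SIZE = 6  # keep only the most frequent words (tiny for the example)

# Tokenize, then prepend/append the special start and end tokens.
tokenized = [["SENTENCE_START"] + s.split() + ["SENTENCE_END"] for s in sentences]

# Count word frequencies; words outside the vocabulary become UNKNOWN_TOKEN.
counts = Counter(w for sent in tokenized for w in sent)
vocab = [w for w, _ in counts.most_common(VOCAB_SIZE - 1)] + ["UNKNOWN_TOKEN"]
word_to_index = {w: i for i, w in enumerate(vocab)}

unk = word_to_index["UNKNOWN_TOKEN"]
as_indices = [[word_to_index.get(w, unk) for w in sent] for sent in tokenized]

# Inputs are each sentence minus the last word; targets are shifted by one,
# so at every position the target is the next word.
X_train = [np.array(s[:-1]) for s in as_indices]
y_train = [np.array(s[1:]) for s in as_indices]
```

The shift by one is the whole trick: the SENTENCE_START token gives the model something to condition on for the first real word, and SENTENCE_END is what it should predict after the last one.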
Building the RNN
Input x will be a sequence of words, x_t is a single word. Because of how matrix multiplication works we can’t simply use a word index. Instead we represent each word as a one-hot vector.
Output of the network o has a similar format. Each o_t is a vector, and each element represents the probability of that word being the next word in the sentence.
- x_t is R^8000 (a one-hot vector over the vocabulary)
- o_t is R^8000
- s_t is R^100 (the hidden layer has size 100)
- U is R^(100×8000), transforming the 8000-dimensional input to the 100-dimensional hidden state
- V is R^(8000×100), transforming the 100-dimensional hidden state to the 8000-dimensional output
- W is R^(100×100), transforming the previous hidden state to the current hidden state
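With those dimensions, forward propagation can be sketched in numpy. A useful consequence of x_t being one-hot is that multiplying U by x_t just selects one column of U, so we can index instead of multiplying. The initialization scale here is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 8000, 100

U = rng.normal(0, 0.01, size=(hidden_size, vocab_size))   # input -> hidden
V = rng.normal(0, 0.01, size=(vocab_size, hidden_size))   # hidden -> output
W = rng.normal(0, 0.01, size=(hidden_size, hidden_size))  # hidden -> hidden

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    """x: list of word indices. Returns per-step outputs o and hidden states s."""
    T = len(x)
    s = np.zeros((T + 1, hidden_size))  # s[-1] is the initial zero state
    o = np.zeros((T, vocab_size))
    for t in range(T):
        # U[:, x[t]] equals U @ one_hot(x[t]): the one-hot input picks a column.
        s[t] = np.tanh(U[:, x[t]] + W @ s[t - 1])
        o[t] = softmax(V @ s[t])
    return o, s

o, s = forward([0, 17, 42])  # a 3-word input sequence
```

Each row of o is a full probability distribution over the 8000-word vocabulary, i.e. the model's guess for the next word at that step.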
Training the RNN
To train our network we need a way to measure the errors it makes. We call this the loss function L, and our goal is to find the parameters U, V, and W that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss. If we have N training examples and C classes, then the loss with respect to our predictions o and the true labels y is given by:
L(y, o) = -1/N * sum_n (y_n * log o_n)
In words: we sum over our training examples and add to the loss based on how far off our predictions are.
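A sketch of that loss computation, with shapes shrunk for the example. Here `o` stands in for the predicted probability vectors at each step and `y` for the true next-word indices; because y is one-hot, only the probability assigned to the correct word contributes to each term.

```python
import numpy as np

rng = np.random.default_rng(1)

N, C = 4, 5                             # N examples, C classes (vocabulary words)
o = rng.dirichlet(np.ones(C), size=N)   # each row is a valid probability vector
y = np.array([0, 2, 1, 4])              # true word indices

# Cross-entropy with one-hot labels: pick o[n, y[n]] for each example,
# take the log, and average the negated result over the N examples.
loss = -np.mean(np.log(o[np.arange(N), y]))
print(loss)
```

The closer the predicted probability of the correct word is to 1, the closer its term is to 0; confident wrong predictions are punished with a large loss.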