The motivation is quite straightforward. Producing a dataset with clean labels is expensive but unlabeled data is being generated all the time. To make use of this much larger amount of unlabeled data, one way is to set the learning objectives properly so as to get supervision from the data itself.
The self-supervised task, also known as pretext task, guides us to a supervised loss function. However, we usually don’t care about the final performance of this invented task. Rather we are interested in the learned intermediate representation with the expectation that this representation can carry good semantic or structural meanings and can be beneficial to a variety of practical downstream tasks.
Broadly speaking, all generative models can be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic images, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.
Many ideas have been proposed for self-supervised representation learning on images. A common workflow is to train a model on one or multiple pretext tasks with unlabelled images and then use one intermediate feature layer of this model to feed a multinomial logistic regression classifier on ImageNet classification. The final classification accuracy quantifies how good the learned representation is.
The pretext task in generative modeling is to reconstruct the original input while learning meaningful latent representation.
Contrastive Predictive Coding (CPC) is an approach for unsupervised learning from high-dimensional data by translating a generative modeling problem to a classification problem.
Finally, I am working with ImageNet… The MNIST and CIFAR-10 datasets give good insight into how networks can be trained and how they generalize over variations within each class of data, but they are too small to offer real-world deep learning experience!
Dataset Input Pipeline
First, I will begin with the details of AlexNet’s pipeline. Before training and testing, images were down-sampled to 256×256: each original image was rescaled so that its shorter side is 256, and the central 256×256 patch was then cropped out. (central_crop of TensorFlow does not do this!)
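The resize-then-crop geometry described above can be sketched in a few lines. This is only an illustration of the arithmetic, not AlexNet's actual code; the function name and the 375×500 example size are my own assumptions. It also shows why a fraction-based center crop is not the same thing: the crop here is a fixed 256×256 window after an aspect-preserving resize.

```python
def resize_and_center_crop_box(height, width, target=256):
    """Return (new_height, new_width, top, left) for a shorter-side resize
    to `target`, followed by a central target x target crop.

    Hypothetical helper for illustration only.
    """
    # Scale so that the shorter side becomes exactly `target`.
    scale = target / min(height, width)
    new_h = max(target, round(height * scale))
    new_w = max(target, round(width * scale))
    # Offsets of the central target x target window in the resized image.
    top = (new_h - target) // 2
    left = (new_w - target) // 2
    return new_h, new_w, top, left


# Example: a 375x500 (height x width) image.
box = resize_and_center_crop_box(375, 500)
```

The returned offsets can then be passed to whatever crop routine the input pipeline uses.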
VGGNet and ResNet both used “scale jittering”, which just means sampling the shorter-side size from [256, 480] instead of using a fixed 256. Other than this, they use the principal components of the dataset to augment the data, and the mean to normalize it. However, in my case, I will use a few more augmentation schemes such as random contrast, etc. I referred to the following link.
Since this dataset is much bigger than CIFAR, I was thinking that I should use aggressive prefetching to the GPUs. However, it does not seem to work that well with what I implemented. Related discussion is in the following link.
Some say that I should use copy_to_device, but this actually seems to cause a leak or some other problem behind the scenes, because I see a major slowdown after 500 iterations when I use it. Therefore, I decided to just prefetch to the CPU and leave the TensorFlow scheduler to do all the work for me. (I am not really sure about this though, it actually seems like throttling due to poor cooling too… I have to look into this)
This is just to note a few things that made me waste a lot of time.
First, some images in ILSVRC 2012 are in CMYK… and this should be made consistent or be dealt with later.
Since I learnt a lot from my former project to train ResNet20 on CIFAR, I managed to make the network in one shot! 🙂 However, one change that had to be noted was regarding which option to use out of the ones mentioned in the paper. (A, B, or C… I will try to add information on this)
This is an interesting blog as to how certain changes affect accuracy on ResNet
Everything was perfect, except that it is taking way too long to train. One epoch is taking around an hour (maybe worse due to throttling), so 120 epochs will take 5 days. I will leave it running and see what I get.
I originally thought that my implementation had a serious problem, but after scaling the timings reported in the following link, I see that every 10 steps with a mini-batch size of 256 should take around 5-6 s, which is a little less than mine. I will deal with this problem later. (The following link is for PyTorch, but the amount of computation should not be that far off.)
Apparently, many papers report their accuracy using 10-crop testing. According to the TF Slim page, this gives somewhat better performance, but they did not implement it. I think it is unnecessarily complicated to implement in TensorFlow, so I am skipping it too. It seems to account for around 1-2% accuracy.
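For reference, 10-crop testing is usually understood as taking the four corner crops and the center crop of an image, plus the horizontal flip of each, and averaging the predictions over all ten. Below is a minimal NumPy sketch of just the cropping step. The function name and the 256/224 sizes are assumptions for illustration, and this is not how TF Slim or any particular paper implements it.

```python
import numpy as np


def ten_crop(image, size):
    """Return the 10 test crops of an (H, W, C) image as a (10, size, size, C) array."""
    h, w = image.shape[:2]
    # Top-left corners of the four corner crops and the center crop.
    corners = [
        (0, 0),
        (0, w - size),
        (h - size, 0),
        (h - size, w - size),
        ((h - size) // 2, (w - size) // 2),
    ]
    crops = [image[t:t + size, l:l + size] for t, l in corners]
    # Add the horizontally flipped version of each crop.
    crops += [c[:, ::-1] for c in crops]
    return np.stack(crops)


image = np.arange(256 * 256 * 3).reshape(256, 256, 3)
crops = ten_crop(image, 224)
```

At test time the model would be run on all ten crops and the resulting class probabilities averaged.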
Over the past two weeks, I have implemented an input data pipeline using the Dataset API, LeNet-5, CifarNet, and so on… This was in the hope of gaining more insight into the training procedure as well as into deep neural networks themselves.
This was my first time using TFRecord. It is supposed to be simple, but there seem to be a lot of limitations in using it… It cannot be accessed randomly, unlike LMDB. Besides this, there were many things to keep in mind when using this format to store training data.
The first step is to actually convert a directory of RAW, JPEG or PNG images to TFRecord. There seems to be a lot of code around to guide you through this step.
Also, there was a good blog post that explains in detail about the anatomy of the TFRecord.
“A TFRecord file contains an array of Examples. Example is a data structure for representing a record, like an observation in a training or test dataset. A record is represented as a set of features, each of which has a name and can be an array of bytes, floats, or 64-bit integers.” Also, “with the cost of having to use the definition files and tooling, protocol buffers can offer a lot faster processing speed compared to text-based formats like JSON or XML,” which led TFRecord to be based on protocol buffers.
Since there were not many tutorials with good information on this, I had to go through a lot of trial and error to make it work. Even now, there are some problems that I should fix, but this will take some time and studying. I hope the API gets better so that it is easier to use than it is now…
Although there weren’t many tutorials, there were a few which helped me get started.
I honestly did not know much about implementing from scratch in TensorFlow. Also, it was not easy to decipher occasional obscurities. However, thankfully, there were some Github repositories that helped me a lot on this.
ResNet only uses average pooling, removes all biases from the convolution layers because batch normalization takes care of the shifting that is usually done by the bias terms, uses fixed padding that adds values to both sides, and the original model uses the true average instead of the moving average that most implementations use.
I could run the training, but it just would not train that well. The loss was not dropping as fast as I expected, and the accuracy was maxing out at around 82%, which is 10% worse than what I should have got. Therefore, I had to read through various links to find out how to fix this.
The biggest problem was actually my own mistakes in building the model. One of them was plugging wrong values into an operation, which was costing at least 3% of the performance, and this sometimes even led to the training not converging.
Another problem was the differences among frameworks and APIs. TensorFlow, Caffe, and PyTorch all have different default hyper-parameters embedded in their operations, which could degrade the overall accuracy.
Such hyper-parameters inside the API also seemed to affect the overall parametrization, which led to a completely different loss surface. Therefore, I had to take into account that the specific hyper-parameters in the paper were taken from the authors’ specific environment, which may not have been the same as mine.
Overall I went through a lot of trial and error to achieve around 90~% with ResNet-20 on CIFAR10. I am still on my journey to find the missing 1%, but overall I believe I am close. I am done with my journey!
Augmentation is important, so I put this as a separate section.
ResNet uses a simple augmentation scheme where they zero-pad the original or the flipped image by 4 pixels on each side to make it 40x40x3, and then take a random crop to get back to 32x32x3.
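The pad, flip and random-crop scheme can be sketched with NumPy as follows. This is a minimal illustration rather than my actual TensorFlow input pipeline; the function name is hypothetical, and the pad width is left as a parameter (the ResNet paper pads 4 pixels per side for CIFAR-10).

```python
import numpy as np


def pad_flip_crop(image, pad, rng):
    """Randomly flip an (H, W, C) image, zero-pad it, and crop back to (H, W, C)."""
    h, w, _ = image.shape
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Zero-pad `pad` pixels on every side of the spatial dimensions.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Pick a random top-left corner so the crop stays inside the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]


rng = np.random.default_rng(0)
original = np.ones((32, 32, 3), dtype=np.float32)
augmented = pad_flip_crop(original, pad=4, rng=rng)
```

Because the transform is applied on the fly each time an image is drawn, the dataset itself never grows.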
Many say that with small datasets augmentation is done statically, but in ResNet (judging from the number of training steps and the mini-batch size) it seemed that they used dynamic data augmentation. This does not increase the number of images that they start off with; it just applies changes to the images one by one. (I found a good blog about this but just can’t find the link again…) * This was different from what one of the blogs that I referred to argued! Actually, I think they augmented the dataset statically, or at least ran duplicate models with the same dataset on multiple GPUs (as stated in the paper), which has a similar effect to doubling the batch size. Therefore, I ran the training with a mini-batch size of 256 instead of 128 for the same number of iterations to achieve the accuracy stated in the paper!
There are many other augmentations, like varying the lighting, contrast, etc., but I did not use any of them to increase the accuracy. (actually I did try, but it didn’t seem to help)
One of the biggest difficulties I had while doing this was playing around with regularization. Before going further, I really want to thank one of the blogs that explicitly tackles the confusion about the definition of regularization, more specifically the mixed usage of weight decay and L2 regularization.
Anyway, even having the pesky definition out of the way, the value in the paper which was 0.0001 (1e-4) did not give me the accuracy that I hoped for. After lots of experiments that I have done for the past few weeks, I landed at 0.001 (1e-3). This actually gave me above 91.25% performance.
I think the original value left too much variance in my model, which needed stronger regularization.
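To make the weight decay versus L2 regularization confusion concrete, here is a small NumPy sketch comparing one update step of each under plain SGD. The function names are hypothetical, and the L2 penalty is assumed to be (lam / 2) * ||w||^2. With that convention the two updates coincide for plain SGD; with momentum or adaptive optimizers they generally do not, and a framework that defines the penalty without the 1/2 factor effectively doubles the coefficient.

```python
import numpy as np


def sgd_step_l2(w, grad, lr, lam):
    """SGD step when (lam / 2) * ||w||^2 is added to the loss.

    The penalty contributes lam * w to the gradient.
    """
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """SGD step with explicit weight decay: shrink w, then apply the data gradient."""
    return w * (1.0 - lr * lam) - lr * grad


rng = np.random.default_rng(0)
w = rng.normal(size=10)
grad = rng.normal(size=10)
w_l2 = sgd_step_l2(w, grad, lr=0.1, lam=1e-3)
w_wd = sgd_step_weight_decay(w, grad, lr=0.1, lam=1e-3)
```

So when a paper quotes a value such as 1e-4, it is worth checking which of these conventions the framework at hand actually applies.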
Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks.
What are RNNs?
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other.
Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later).
If the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs.
What can RNNs do?
Language Modeling and Generating Text
Given a sequence of words we want to predict the probability of each word given the previous words. A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities.
A key difference from language modeling is that our output only starts after we have seen the complete input.
Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.
Training an RNN is similar to training a traditional neural network. We also use the backpropagation algorithm, but with a little twist. Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on those of the previous time steps. This is Backpropagation Through Time (BPTT)!
In words, the probability of a sentence is the product of probabilities of each word given the words that came before it.
Such a model can be used as a scoring mechanism. For example, a Machine Translation system typically generates multiple candidates for an input sentence. You could use a language model to pick the most probable sentence.
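The product-of-conditionals rule and its use for scoring candidates can be sketched as below. The toy bigram table, its probabilities, the "<s>" start marker, and the function names are all made up for illustration; a real language model would supply the conditional probabilities. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long sentences.

```python
import math

# Hypothetical conditional probabilities P(word | previous word), illustration only.
TOY_BIGRAMS = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("the", "cta"): 0.001,
}


def toy_cond_prob(word, history):
    """P(word | history) from the toy table; only the last word of history is used."""
    prev = history[-1] if history else "<s>"
    return TOY_BIGRAMS.get((prev, word), 1e-6)


def sentence_log_prob(words, cond_prob):
    """log P(w_1..w_m) = sum_i log P(w_i | w_1..w_{i-1})."""
    log_p = 0.0
    for i, word in enumerate(words):
        log_p += math.log(cond_prob(word, tuple(words[:i])))
    return log_p


candidates = [["the", "cat"], ["the", "cta"]]
best = max(candidates, key=lambda s: sentence_log_prob(s, toy_cond_prob))
```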
Because we can predict the probability of a word given the preceding words, we are able to generate new text. It is a generative model.
Training Data and Preprocessing
Remove infrequent words: Most words in text will only appear one or two times. It’s a good idea to remove these infrequent words. Having a huge vocabulary will make our model slow to train.
Prepend special start and end tokens: We also want to learn which words tend to start and end a sentence.
Building training data matrices: a word_to_index mapping is needed (Word2Vec), and the SENTENCE_START and SENTENCE_END tokens help with this
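The three preprocessing steps above can be sketched roughly as follows. This is a minimal version under a few assumptions: sentences are already tokenized, infrequent words are replaced by a placeholder named UNKNOWN_TOKEN (that name and the function name are mine, not from the notes), and the targets are the inputs shifted by one position.

```python
from collections import Counter

SENTENCE_START = "SENTENCE_START"
SENTENCE_END = "SENTENCE_END"
UNKNOWN_TOKEN = "UNKNOWN_TOKEN"  # assumption: placeholder for removed infrequent words


def build_training_data(tokenized_sentences, vocabulary_size):
    """Return (x_train, y_train, word_to_index) as lists of word-index sequences."""
    # Prepend and append the special start and end tokens.
    sentences = [[SENTENCE_START] + s + [SENTENCE_END] for s in tokenized_sentences]
    # Keep only the most frequent words; everything else maps to UNKNOWN_TOKEN.
    counts = Counter(w for s in sentences for w in s)
    index_to_word = [w for w, _ in counts.most_common(vocabulary_size - 1)]
    index_to_word.append(UNKNOWN_TOKEN)
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    unk = word_to_index[UNKNOWN_TOKEN]
    ids = [[word_to_index.get(w, unk) for w in s] for s in sentences]
    # Inputs are all tokens but the last; targets are the same sequence shifted by one.
    x_train = [s[:-1] for s in ids]
    y_train = [s[1:] for s in ids]
    return x_train, y_train, word_to_index


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
x_train, y_train, word_to_index = build_training_data(corpus, vocabulary_size=5)
```

In this toy corpus "cat" and "dog" each appear once, so both fall outside the 5-word vocabulary and share the unknown-word index.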
Building the RNN
Input x will be a sequence of words, x_t is a single word. Because of how matrix multiplication works we can’t simply use a word index. Instead we represent each word as a one-hot vector.
Output of the network o has a similar format. Each o_t is a vector, and each element represents the probability of that word being the next word in the sentence.
x_t is R^8000
o_t is R^8000
s_t is R^100 (hidden layer has size of 100)
U is R^(100×8000) to transform 8000 input to 100 hidden
V is R^(8000×100) to transform 100 hidden to 8000 output
W is R^(100×100) to transform 100 hidden to 100 hidden
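With these shapes, a forward pass can be sketched in NumPy as below. The tanh hidden activation, the softmax output, the zero initial state, and the small uniform initialization are assumptions on my part (they are the usual choices for a vanilla RNN), and W denotes the hidden-to-hidden matrix. Multiplying U by a one-hot vector just selects one column of U, so the code indexes the column directly.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def rnn_forward(x, U, V, W):
    """Run a vanilla RNN over a sequence of word indices.

    Returns outputs o of shape (T, word_dim) and states s of shape (T, hidden_dim).
    """
    s_prev = np.zeros(U.shape[0])  # assumption: initial hidden state is all zeros
    outputs, states = [], []
    for word_index in x:
        # U @ one_hot(word_index) equals column `word_index` of U.
        s_t = np.tanh(U[:, word_index] + W @ s_prev)
        o_t = softmax(V @ s_t)
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return np.array(outputs), np.array(states)


rng = np.random.default_rng(0)
word_dim, hidden_dim = 8000, 100
U = rng.uniform(-0.01, 0.01, (hidden_dim, word_dim))
V = rng.uniform(-0.01, 0.01, (word_dim, hidden_dim))
W = rng.uniform(-0.01, 0.01, (hidden_dim, hidden_dim))
o, s = rnn_forward([0, 179, 341, 416], U, V, W)
```

The same U, V and W are reused at every step of the loop, which is exactly the parameter sharing that distinguishes an RNN from a feed-forward network.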
Training the RNN
To train our network we need a way to measure the errors it makes. We call this the loss function L, and our goal is to find the parameters U, V, and W that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss. If we have N training examples and C classes then the loss with respect to our predictions o and the true labels y is given by:
L(y, o) = -1/N * sum_n (y_n * log o_n)
It just sums over our training examples and adds to the loss based on how far off our predictions are.
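The loss above can be sketched in NumPy as follows, assuming the true labels are given as class indices (with one-hot y_n, the product y_n * log o_n simply picks out the log-probability of the correct class). The function name and the example labels are hypothetical. A useful sanity check is that a model predicting uniformly over C classes should give a loss of log(C).

```python
import numpy as np


def cross_entropy_loss(o, y):
    """Average cross-entropy.

    o: (N, C) array of predicted class probabilities.
    y: (N,) array of true class indices.
    """
    n = len(y)
    # Probability that the model assigned to the correct class of each example.
    correct_probs = o[np.arange(n), y]
    return -np.sum(np.log(correct_probs)) / n


C = 8000
uniform_predictions = np.full((5, C), 1.0 / C)
labels = np.array([1, 20, 300, 4000, 7999])
baseline = cross_entropy_loss(uniform_predictions, labels)  # equals log(C)
```

Any trained model should end up well below this baseline.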
In NMT, we map the meaning of a sentence into a fixed-length vector representation and then generate a translation based on that vector. By not relying on things like n-gram counts and instead trying to capture the higher-level meaning of a text, NMT systems generalize to new sentences better than many other approaches.
If you plot the embeddings of different sentences in a low dimensional space using PCA or t-SNE for dimensionality reduction, you can see that semantically similar phrases end up close to each other.
Recurrent Neural Networks are known to have problems dealing with such long-range dependencies. In theory, architectures like LSTMs should be able to deal with this, but in practice long-range dependencies are still problematic.
The approach of reversing the source sentence is a “hack”. It makes things work better in practice, but it’s not a principled solution. There are also languages (like Japanese) where the last word of a sentence could be highly predictive of the first word in an English translation. In that case, reversing the input would make things worse.
Is one hidden state enough to capture everything about the sequence? NO!
We allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far.
A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing.
The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector.
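A single attention step can be sketched in NumPy as below. The decoder state is scored against every encoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of encoder states. Dot-product scoring is an assumption made here for brevity (attention models often use a small learned network for the score instead), and the function name and array sizes are hypothetical.

```python
import numpy as np


def attend(decoder_state, encoder_states):
    """One attention step.

    decoder_state: (hidden,) current decoder state.
    encoder_states: (src_len, hidden) one state per source token.
    Returns (context, weights).
    """
    # Assumption: dot-product score between the decoder state and each encoder state.
    scores = encoder_states @ decoder_state
    # Softmax over source positions (shifted by the max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder states.
    context = weights @ encoder_states
    return context, weights


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))
decoder_state = rng.normal(size=4)
context, weights = attend(decoder_state, encoder_states)
```

Collecting the weights for every output step gives a target-length by source-length matrix, which is what attention visualizations plot. It also shows the cost: every output step looks at every input position.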
Downside of Attention model
If you do character-level computations and deal with sequences consisting of hundreds of tokens the above attention mechanisms can become prohibitively expensive.
By focusing on one thing, we can neglect many other things. But that’s not really what we’re doing in the above model. We’re essentially looking at everything in detail before deciding what to focus on.
An alternative approach to attention is to use Reinforcement Learning to predict an approximate location to focus on.
There is support for “fake quantization operators” in TensorFlow. Including them where quantization is expected to occur rounds the float values to a specified number of levels to simulate quantization, and also gives recalculated min/max ranges for the 32-bit to 8-bit downscaling.
For most numbers, quantization is like adding noise, but this is not the case for zero. Zero shows up a lot in neural network calculations. If zero is not represented exactly, these zeros will contribute disproportionately to the overall result.
There is not much principle behind it, but the evidence suggests that avoiding -128 may be helpful.
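The idea of simulating quantization while keeping zero exactly representable can be sketched in NumPy as follows. This is not TensorFlow's fake quantization operator, only an illustration of the arithmetic under my own assumptions: an unsigned 8-bit range (0..255), a range that is widened to include zero, and an integer zero point. The function names are hypothetical.

```python
import numpy as np


def choose_quant_params(x_min, x_max, num_levels=256):
    """Pick (scale, zero_point) so that the float value 0.0 maps to an integer level."""
    # Widen the range so that zero always lies inside it.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0  # degenerate all-zero range
    scale = (x_max - x_min) / (num_levels - 1)
    # Rounding the zero point to an integer is what makes 0.0 exactly representable.
    zero_point = int(round(-x_min / scale))
    return scale, zero_point


def fake_quantize(x, scale, zero_point, num_levels=256):
    """Quantize to integer levels and immediately dequantize back to float."""
    q = np.clip(np.round(x / scale) + zero_point, 0, num_levels - 1)
    return (q - zero_point) * scale


scale, zero_point = choose_quant_params(-1.3, 2.7)
x = np.array([-1.3, 0.0, 0.5, 2.7])
x_fq = fake_quantize(x, scale, zero_point)
```

Non-zero values come back with a small rounding error, which behaves like noise, while 0.0 comes back as exactly 0.0.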