Past two weeks, I have implemented input data pipeline using Dataset API, LeNet-5, CifarNet, and so on… This was in hope to gain more insight about the training procedure as well as the deep neural networks itself.
This was my first time using TFRecord. It is supposed to be simple but there seems to be a lot of limitations in using this… It cannot be accessed randomly unlike LMDB. Besides this, there were many things to have in mind when using this format to store training data.
First step is to actually make the directory of RAW, JPEG or PNG format images to TFRecord. There seems to be a lot of code to guide through this step.
Also, there was a good blog post that explains in detail about the anatomy of the TFRecord.
“A TFRecord file contains an array of Examples. Example is a data structure for representing a record, like an observation in a training or test dataset. A record is represented as a set of features, each of which has a name and can be an array of bytes, floats, or 64-bit integers.” Also, “with the cost of having to use the definition files and tooling, protocol buffers can offer a lot faster processing speed compared to text-based formats like JSON or XML,” which lead the TFRecord to be based on the protocol buffers.
This can be found here
Since, there were not much tutorials to get good information on this, I had to go through a lot of trial and error to make this. Even now, there are some problems that I should fix, but this will take some time and studying. I hope the API gets better so that it is easier to use than how it is now…
Although there weren’t many tutorials, there were few which helped me start.
general guide: https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428
example code: https://kratzert.github.io/2017/06/15/example-of-tensorflows-new-input-pipeline.html
example code: https://sebastianwallkoetter.wordpress.com/2018/02/24/optimize-tf-input-pipeline/
shape error with “parse_example”: https://stackoverflow.com/questions/41951433/tensorflow-valueerror-shape-must-be-rank-1-but-is-rank-0-for-parseexample-pa
The original paper is here
* I implemented v1
I honestly did not know much about implementing from scratch in TensorFlow. Also, it was not easy to decipher occasional obscurities. However, thankfully, there were some Github repositories that helped me a lot on this.
* Many including the links above make changes to the original design from time to time which have to be noted with care.
It was first time I ran into Global Average Pooling. Apparently, this was one of the key features that had to be there to reduce number of parameters. Anyways, following link was a good guide.
ResNet only use Average pooling, removed all bias from convolution layers because batch normalization takes care the shifting that is usually done using bias terms, fixed padding that adds values to both sides, and the original model uses true average instead of moving average which most implementations use.
I can just run the training, however it just would not train that well. The loss was not dropping as fast as I expected, and accuracy was max-ing out at around 82% which is 10% worse than what I should have got. Therefore, I had to read through various links to find how to do this.
Biggest problem was actually my mistakes in making the model. One of them was plugging in wrong values in the operation, which was stealing at least 3% of the performance originally, and this sometimes even led to non-converging behavior of the training.
Also, one other problem was differences among frameworks and APIs. TensorFlow, Caffe, and PyTorch all have different default hyper-parameters embedded in the operation which could degrade the overall accuracy.
Also such hyper-parameters inside the API seemed to affect overall parametrization which led to completely different loss plane. Therefore, I had to take into account that the specific hyper-parameters in the paper were taken from the author’s specific environment which may have not been the same as that of mine.
Batchsize vs Training: https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-stochastic-gradient-descent
Implementing Batch Norm in TF: https://r2rt.com/implementing-batch-normalization-in-tensorflow.html
Reparametrizing the model changes everything: https://arxiv.org/pdf/1703.04933.pdf
Overall I went through a lot of trial and error to achieve around 90~% with ResNet-20 on CIFAR10.
I am still on my journey to find the missing 1%, but overall I believe I am close. I am done with my journey!
Augmentation is important, so I put this as a separate section.
ResNet uses simple augmentation scheme where they zero-pad the original or the flipped data to make in 36x36x3, and then random crop to get 32x32x3.
Many say that with small datasets it is done statically, but in ResNet (deciphering from the numbers of train-steps and mini batch sizes) it seemed that they used dynamic data augmentation. This does not increase the number of images that they start off with and just applies changes to images one by one. (I found a good blog about this but just can’t find the link again…)
* This was different to what one of the blog that I referred to argued!
Actually, I think they augmented the dataset statically or at least run duplicate model with same dataset in multiple GPUs (as stated in the paper) having a similar effect as doubling the batch size. Therefore, I ran the training with a mini-batch size of 256 instead of 128, and running the same iterations to achieve the accuracy stated in the paper!
There are many augmentations like varying the lighting, contrast, and etc, but I did not use any of them to increase the accuracy. (actually I did try, but it didn’t seem to help)
One of the biggest difficulty I had while doing this was to play around with Regularization. Before going further, I really want to thank one of the blogs that explicitly tackle the confusion about the definition of Regularization, more specifically mixed usage of weight decay and the L2 Regularization.
Anyway, even having the pesky definition out of the way, the value in the paper which was 0.0001 (1e-4) did not give me the accuracy that I hoped for. After lots of experiments that I have done for the past few weeks, I landed at 0.001 (1e-3). This actually gave me above 91.25% performance.
I think the original value seemed to leave too much variance in the model that needed more regularization.