```python
#hide
import graphviz

def gv(s):
    return graphviz.Source('digraph G{ rankdir="LR"' + s + '; }')
```
May 24, 2021
I recently read the Universal Language Model Fine-tuning for Text Classification paper (Howard and Ruder 2018) (ULMFit) which outlines how a language model can be fine tuned for different tasks without retraining the entire model. This reminded me of a technique that I tried to apply to image classification. I’m going to cover the core idea in the ULMFit paper as I see it and then discuss how this can be used to create an easily extended image classifier.
At the time that the paper was written, applying NLP models to a new domain required task-specific modifications and training from scratch. The writers of the paper are Jeremy Howard and Sebastian Ruder, who are better known for running the FastAI course. In that course they heavily promote fine tuning existing models as a fast and effective way to tackle a problem in a new domain, and show how to do this several times for different image models. This paper tries to apply fine tuning to NLP.
To start with I want to define two terms - the body and the head of a model. The body is the largest part of the model and is responsible for feature extraction. The head takes the final output of the body and uses it to perform the task specific classification.
The technique that the FastAI framework uses for fine tuning is to replace the existing classification head of a model with a fresh one and then do a two stage retrain. The first stage trains only the new classification head, as its weights are random. If the body were retrained while the head is still random, the noise from the poorly performing head would degrade the body weights to some degree. Once the head is achieving reasonable performance the entire model can be further trained to adjust the weights of the body to better fit the head.
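To make the recipe concrete, here is a minimal PyTorch sketch of the two stage retrain. The resnet34 body and the 10 class target task are placeholders for illustration, not anything from the paper.

```python
import torch
from torch import nn
from torchvision import models

# a sketch of head replacement followed by a two stage retrain
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)  # fresh, randomly initialized head

# stage 1: freeze the body and train only the head
for name, parameter in model.named_parameters():
    parameter.requires_grad = name.startswith("fc")
head_optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train for a few epochs with head_optimizer ...

# stage 2: unfreeze everything and train the whole model with a lower learning rate
for parameter in model.parameters():
    parameter.requires_grad = True
full_optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# ... continue training with full_optimizer ...
```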
The models that are produced for Natural Language tasks are too specific to fine tune well. It seems that the features that the models (of the time) naturally produce are not general enough to transfer to new tasks well when using this head replacement technique. This is likely because the features that are produced at the end of the model are too task specific. At the time the paper was written attention / transformers had just been introduced but BERT had not been released. This means that the models that Jeremy Howard and Sebastian Ruder were considering were LSTM architectures.
To help address this problem Jeremy Howard and Sebastian Ruder had the insight that the features produced earlier in the model may transfer to a new task better than those produced by the later layers. This means that a classification head that takes the outputs of every layer could be retrained on a different task more easily. They also progressively unfreeze the body layer by layer instead of all at once, essentially treating a larger and larger part of the model as the classification head.
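As a rough illustration, here is a head that pools the output of every layer rather than just the last one, together with a helper for gradual unfreezing. This is my sketch of the principle rather than the actual ULMFit classifier, which uses concat pooling and its own unfreezing schedule.

```python
import torch
from torch import nn

# a classification head that sees the output of every layer, not just the last
class MultiLayerHead(nn.Module):
    def __init__(self, layer_sizes: list, classes: int) -> None:
        super().__init__()
        self.classifier = nn.Linear(sum(layer_sizes), classes)

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # mean pool each layer over the sequence, then classify the concatenation
        pooled = [output.mean(dim=1) for output in layer_outputs]
        return self.classifier(torch.cat(pooled, dim=-1))

def unfreeze_last(layers: list, count: int) -> None:
    # gradual unfreezing: only the last `count` body layers get trained
    for index, layer in enumerate(layers):
        trainable = index >= len(layers) - count
        for parameter in layer.parameters():
            parameter.requires_grad = trainable
```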
Their technique works very well and they are able to fine tune their model against six different datasets covering three distinct tasks (sentiment analysis, question classification, and topic classification). They achieve state of the art results on the different datasets. So this technique certainly has merit.
The underlying model that they used was an LSTM. Transformer based architectures are almost universally used in NLP now, so it is a pity that this technique was not applied to BERT.
To my knowledge the technique outlined in ULMFit is not used for transfer learning - instead the entire model is fine tuned for the new task. The BERT paper was accompanied by a set of pretrained models that accelerated model development in NLP by starting from a solid base. It seems a shame that the new technique did not gain greater traction.
The FastAI course extensively covers fine tuning image models to classify new classes of objects. Image models have a small difference in input though - images can vary in size. To handle this it is common to resize the image to fit the model. The body also needs a small adjustment to ensure that the classification head receives input of a consistent shape. It does this by using a pooling layer, which reduces each feature map to a single value (either the maximum or the average of the map).
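This is easy to see with adaptive average pooling - whatever the spatial size of the feature maps, the head always receives one value per feature:

```python
import torch
from torch import nn

# feature maps of different spatial sizes, e.g. from differently sized images
small = torch.randn(1, 512, 7, 7)
large = torch.randn(1, 512, 12, 12)

pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())

print(pool(small).shape)  # torch.Size([1, 512])
print(pool(large).shape)  # torch.Size([1, 512])
```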
When training an image classifier the classes that you want to predict are known in advance. You can then just replace the head and retrain according to the recipe above.
This works very well for single class and multi class image classifiers. A single class classifier is one that predicts a single label for an image. A multi class classifier is one that predicts many labels for an image. The primary difference is that the single class classifier predicts the label of the most dominant object in the image. If you want to find all images containing a chair then you want a multi class classifier, as you need to find every image with a chair even if something else in the image is more dominant.
Consider two images: one of an empty chair, and one of a man sitting in a chair. A single class classifier would label these images as “chair” and “man” respectively. A multi class classifier would label both images with “chair” (and the second image would also be labelled with “man”).
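Another way to see the difference is in how the raw scores from the head are interpreted - a single class classifier takes the most likely class, while a multi class classifier makes an independent yes/no decision per class. The class names and scores here are made up for illustration:

```python
import torch

classes = ["chair", "man", "dog"]
scores = torch.tensor([2.0, 3.5, -1.0])  # raw scores from the head, made up

# single class: softmax over the classes and take the most dominant one
single = classes[scores.softmax(dim=0).argmax()]

# multi class: an independent sigmoid per class with a threshold
multi = [label for label, score in zip(classes, scores.sigmoid()) if score > 0.5]

print(single)  # man
print(multi)   # ['chair', 'man']
```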
When you have a multi class image classifier in production you have a bigger problem. You may wish to add new classes to the classifier based on client requests.
If you retrain the entire model then the performance of the new model may suffer for the existing classes. You would also need to relabel all of your existing training images to ensure that the presence of the new class is correctly identified. So adding new classes is time consuming and risky.
It would be better to be able to freeze the entire body and add new classes by extending the head. Whether this works depends on the suitability of the existing model features for classifying the new class. This is where the reuse of earlier layers can be helpful. A colleague came up with the idea of using the outputs of the earlier layers in the classifier, and I evaluated it.
The core idea is that for each class the multi class classifier takes all of the features from the body and performs an independent yes/no classification. This means that we can extend the existing model with a new class by training a fresh yes/no classifier and appending it to the classification head.
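A sketch of what such a head could look like is below - one small yes/no classifier per class, where adding a class is just appending another entry. The 512 feature size is a placeholder.

```python
import torch
from torch import nn

class ExtensibleHead(nn.Module):
    """One yes/no classifier per class, so new classes can be appended."""

    def __init__(self, features: int, classes: int) -> None:
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(features, 1) for _ in range(classes))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # each class gets its own independent yes/no score
        return torch.cat([head(features) for head in self.heads], dim=-1).sigmoid()

    def add_class(self) -> nn.Module:
        # append a fresh classifier; only this one needs to be trained
        head = nn.Linear(self.heads[0].in_features, 1)
        self.heads.append(head)
        return head

head = ExtensibleHead(features=512, classes=3)
head.add_class()
print(head(torch.randn(2, 512)).shape)  # torch.Size([2, 4])
```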
For this to work without degrading the existing classifiers the body of the model must be completely frozen when training the new classifier. One thing to be particularly careful of is the batch normalization layers, which update their running statistics during training even when gradient updates are disabled, so they need to be put in evaluation mode as well.
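Freezing therefore has to cover both the parameters and the batch norm statistics. A minimal sketch, assuming a torchvision resnet as the body:

```python
from torch import nn
from torchvision import models

body = nn.Sequential(*list(models.resnet34(pretrained=True).children())[:-1])

def freeze(module: nn.Module) -> None:
    # stop gradient updates to the weights
    for parameter in module.parameters():
        parameter.requires_grad = False
    # stop the batch norm running statistics from updating; note that calling
    # .train() on the model later will undo this, so it needs to be reapplied
    for layer in module.modules():
        if isinstance(layer, nn.BatchNorm2d):
            layer.eval()

freeze(body)
```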
Anyway, this is the principle. How well does it work in practice?
TBC.