Matthew’s Blog - FastAI Course

It’s weekly learning time again and now I am on the FastAI course lesson 2. Once again the notebooks for this are either here or in the associated fastbook repo.

Questionnaire

Once again there is an associated questionnaire. It was quite fun doing this before so let’s try this one.

1. Can we always use a random sample for a validation set? Why or why not?

No. A randomly sampled validation set may include points whose values can be inferred from the training set. If we have a time series of values and randomly sample points then we could have a validation point that is in the time period between two training points. It would be far easier to estimate the value of that point from the known values of the neighbouring training points than to predict a genuinely unseen value.

In this example the validation set should be a separate time range that is in the future of the training set. This would be best because the produced model will be used to predict the future, and this validation set is the best simulation of that.
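
To make the difference concrete, here is a minimal sketch comparing the two kinds of split. It assumes the items are already sorted by time and are identified by their index; the helper names (`random_split`, `time_split`, `count_interpolated`) are made up for illustration and are not fastai functions.

```python
import random

def random_split(n_items, valid_pct, seed):
    """Randomly pick validation indices; everything else is training."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_valid = int(n_items * valid_pct)
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])

def time_split(n_items, valid_pct):
    """Hold out the most recent items (assumes items are sorted by time)."""
    cutoff = n_items - int(n_items * valid_pct)
    return list(range(cutoff)), list(range(cutoff, n_items))

def count_interpolated(train, valid):
    """Count validation points that sit between two training points in time."""
    return sum(1 for v in valid if min(train) < v < max(train))

train_r, valid_r = random_split(100, 0.2, seed=0)
train_t, valid_t = time_split(100, 0.2)

print(count_interpolated(train_r, valid_r))  # greater than zero: leaky
print(count_interpolated(train_t, valid_t))  # 0: every validation point is in the future
```

With the time based split every validation point lies after the last training point, which is the situation the model will actually face when it is used.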

2. What is overfitting? Provide an example.

Overfitting is when a model has been trained so much that it has memorized the training set. The memorized training set means that when presented with a datapoint from the training set it knows the exact answer to produce. This makes the model look very good when evaluated against the training set.

However, when generalizing beyond the training set the performance of the model is likely to be bad as it has tied itself too closely to the training set, instead of determining generalizable rules for the data. For example, an image classifier that is almost perfect on its training photos but little better than guessing on new photos has overfit.
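
The most extreme version of this is a "model" that is nothing more than a lookup table. The sketch below is a toy illustration under that assumption, not how a neural network actually stores anything; the filenames and labels are invented.

```python
def fit_memorizer(examples):
    """'Train' by storing every (input, label) pair verbatim."""
    return dict(examples)

def predict(model, x, default="cat"):
    """Return the memorized label, or a fixed guess for unseen inputs."""
    return model.get(x, default)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for x, y in examples if predict(model, x) == y)
    return correct / len(examples)

train = [("img_01", "cat"), ("img_02", "dog"), ("img_03", "dog"), ("img_04", "cat")]
valid = [("img_05", "dog"), ("img_06", "cat"), ("img_07", "dog"), ("img_08", "dog")]

model = fit_memorizer(train)
print(accuracy(model, train))  # 1.0 - perfect on data it has seen
print(accuracy(model, valid))  # 0.25 - it learned no general rule
```

The gap between the two numbers is the thing to watch for, which is why the validation set has to be kept separate.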

3. What is a metric? How does it differ from “loss”?

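A metric is an indicator of the quality of the model which is easily interpreted by a person. The loss is the value that the training process tries to minimize when it updates the parameters of the model, so it has to respond to small changes in those parameters even when the metric does not move.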

The means of calculating the loss is the most important part of it, not the actual value. There is not a “good” loss value or a “bad” loss value, instead it depends on the architecture and the hyperparameters. So you cannot reasonably interpret a single loss value. Instead the best you can do is look at a trend across loss values (going down, levelling out, etc).

Metrics are designed for interpretation so should be comparable across architectures and hyperparameters.
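
A small sketch of the distinction, assuming accuracy as the metric and cross entropy as the loss (both common choices for classification, picked here for illustration). The two sets of predicted probabilities are invented.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of examples where the most probable class is the correct one."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

labels = [0, 1, 1]
hesitant = [[0.6, 0.4], [0.4, 0.6], [0.45, 0.55]]    # right, but only just
confident = [[0.9, 0.1], [0.1, 0.9], [0.05, 0.95]]   # right, and sure of it

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name, accuracy(probs, labels), round(cross_entropy(probs, labels), 3))
```

Both sets of predictions have the same accuracy, so the metric cannot tell them apart, while the loss is lower for the confident one. That sensitivity is what makes the loss useful for training and the metric useful for reading.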

4. How can pretrained models help?

When training a model for a specific problem domain you may only have a small amount of data available. There may be a lot of data in an associated domain, making it easier to train a high quality model for that domain. If expertise in the associated domain is useful for solving the specific problem then a model trained on the associated data would be able to learn the specific problem more easily than a randomly initialized model.

For example you may only have a small amount of text labelled with sentiment. The associated domain of predicting the next word (language modelling) is much larger, as it does not require labelled data. A language model learns a lot about the language in question and this language level knowledge (of grammar and meaning) is useful when trying to predict sentiment. So using a model trained as a language model and fine tuning it for the sentiment classification task is likely to be easier than training a sentiment classifier from scratch.

5. What is the “head” of a model?

A model is broadly split into two parts. There is the head and the body.

The body is responsible for extracting features from the input data. The head is responsible for performing the final task with these features. This final task could be something like classification.
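
As a toy sketch of that split, the code below uses a hand written feature extractor as the body and two different linear heads on top of it. Everything here (`body`, `make_head`, `make_model`, the weights and class names) is hypothetical and only shows the shape of the idea; in a real network both parts are learned layers.

```python
def body(pixels):
    """Toy feature extractor: mean brightness and contrast of the input."""
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return [mean, contrast]

def make_head(weights, bias, class_names):
    """Toy head: a linear score over the features, thresholded at zero."""
    def head(features):
        score = sum(w * f for w, f in zip(weights, features)) + bias
        return class_names[1] if score > 0 else class_names[0]
    return head

def make_model(body_fn, head_fn):
    """A model is the head applied to the output of the body."""
    return lambda x: head_fn(body_fn(x))

# Same body, two different heads for two different tasks.
day_night = make_model(body, make_head([1.0, 0.0], -0.5, ["night", "day"]))
flat_busy = make_model(body, make_head([0.0, 1.0], -0.5, ["flat", "busy"]))

image = [0.9, 0.8, 0.85, 0.9]
print(day_night(image), flat_busy(image))  # day flat
```

Swapping the head while keeping the body is the same move that fine tuning a pretrained model makes.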

6. What kinds of features do the early layers of a CNN find? How about the later layers?

The early layers of a CNN spot things like specific colors, color gradients, lines or corners and other such simple image features. These features are combined by successive layers. This combination allows for increasingly specific feature detectors such as eye detectors, face detectors, text detectors and so on.
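
To show what one of those simple early features looks like, here is a sketch of a vertical edge detector applied to a tiny image. The kernel is written by hand for illustration; a trained CNN learns its own kernel weights, and `convolve2d` is a plain Python stand-in rather than a library call.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# 4x6 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Responds when the right column is brighter than the left column.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

for row in convolve2d(image, vertical_edge):
    print(row)  # [0, 3, 3, 0] for each row
```

The output is zero over the flat regions and large where the dark to bright boundary sits, so this one kernel acts as an edge detector.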

7. Are image models only useful for photos?

Image models are built around convolutions, which take small areas of the input data and produce features based on the combination of the values in those areas. This combination can be done in any number of dimensions - it happens that 2D images are well suited to them.

If you can represent the data in a 2d way and the local arrangement of the data is significant then convolutions may be an appropriate tool. More broadly if the local arrangement of the data is significant then N dimensional convolutions may be appropriate.
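
The same operation works on one dimensional data. This is a minimal sketch on a made up series of sensor readings, with a hand picked two value kernel that responds to sudden changes; `convolve1d` is an illustrative helper, not a library function.

```python
def convolve1d(signal, kernel):
    """Slide the kernel along the signal and take a weighted sum at each step."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

readings = [1, 1, 1, 5, 5, 5, 1, 1]
change_kernel = [-1, 1]  # next value minus current value

response = convolve1d(readings, change_kernel)
print(response)  # [0, 0, 4, 0, 0, -4, 0]
```

The non zero entries mark where the readings jump up and drop back down. Only the local arrangement of neighbouring values matters, which is the property that makes convolutions a reasonable fit.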

Examples of non image use of convolutions are as follows:

Creating specific sound detectors - time and frequency are the two dimensions
Spotting unusual mouse movements across a web page - the screen width and height are two dimensions and color was used to indicate speed
Natural Language Processing - in this case the single dimension is the text. Using convolutions for processing text has fallen out of favour, and instead RNNs, LSTMs and Transformers are more popular now.

The paper by Collobert et al. 2011 describes the use of convolutions for NLP. You can read a description of the approach starting with section 3.3.2 Sentence Approach, in the linked PDF (page 11 of the PDF, printed page number 2,503).

8. What is an “architecture”?

An architecture is the specific arrangement of layers that forms a neural network.

9. What is segmentation?

The task of segmentation for images is the classification of an image on a per pixel basis.
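
In other words the output has one label per pixel instead of one label per image. A minimal sketch, assuming the model has already produced a score per class for every pixel (the scores and class names below are invented):

```python
def segment(scores, class_names):
    """Give each pixel the class with the highest score."""
    return [
        [class_names[max(range(len(px)), key=px.__getitem__)] for px in row]
        for row in scores
    ]

class_names = ["road", "car"]

# A 2x3 image; each pixel holds one score per class.
scores = [
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]],
]

mask = segment(scores, class_names)
for row in mask:
    print(row)
```

The resulting mask has the same height and width as the input image.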

10. What is y_range used for? When do we need it?

11. What are “hyperparameters”?

Hyperparameters are settings which alter the training or performance of the model without being directly encoded in the parameters of the model. Batch Size and Learning Rate are two such hyperparameters.
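
Here is a toy sketch of the difference between a parameter and a hyperparameter. The function being minimized, `(w - 3)^2`, and the two learning rates are arbitrary choices for illustration: `w` is the parameter that training changes, while `learning_rate` and `steps` are hyperparameters that I set by hand.

```python
def train(learning_rate, steps=50, start=0.0):
    """Minimize (w - 3)^2 by gradient descent; w is the only parameter."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)          # derivative of (w - 3)^2
        w -= learning_rate * grad   # the update the hyperparameter scales
    return w

w_small = train(learning_rate=0.1)
w_large = train(learning_rate=1.1)

print(w_small)  # very close to 3
print(w_large)  # far from 3: the steps overshoot and grow each time
```

Neither learning rate appears anywhere in the final model, yet one run finds the answer and the other diverges.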

12. What’s the best way to avoid failures when using AI in an organization?

Have humans in the loop. Have limited (both in time and location) rollouts. Check for bias and ensure that the model is periodically rechecked for bias.

13. What is a p value?

A p value is intended to be the probability of seeing results at least as extreme as the observed values if they had occurred purely by random chance, that is, if there were no real effect.
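
As a worked example, suppose a coin comes up heads 60 times in 100 flips and we ask how surprising that would be if the coin were fair. The numbers are made up, and `binomial_p_value` is an illustrative helper that computes the one sided tail exactly.

```python
from math import comb

def binomial_p_value(heads, flips):
    """One-sided p value: P(at least `heads` heads in `flips` fair flips)."""
    tail = sum(comb(flips, k) for k in range(heads, flips + 1))
    return tail / 2 ** flips

p = binomial_p_value(60, 100)
print(round(p, 4))
```

The result is a little under 0.03, so a fair coin would produce a result this lopsided only about three times in a hundred. Note that this is not the probability that the coin is fair.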

14. What is a prior?

A prior is the belief about the behaviour of the system before making the current set of observations.

15. Provide an example of where the bear classification model might work poorly in production, due to structural or style differences in the training data.

The images of bears on the net are well lit and have the bear in the center of the image. Images from CCTV can be poorly lit and may only have the bear off to the side. A model trained well on the internet images may not be able to work with CCTV images.

16. Where do text models currently have a major deficiency?

17. What are possible negative societal implications of text generation models?

A lot of the text that is available online has biases in it. The text may view certain races, genders, sexual orientations, religions etc in ways that are not accurate. This bias could then lead to the models producing predictions which result in harm.

For example there is more negative text about black people online. This negative text could cause the model to associate being black with being bad. The negative association could cause the predictions of the model to be biased against black people, harming them.

18. In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?

Performing the task manually or always having a human in the loop - either as a reviewer or as a point of contact to remedy harm.

19. What kind of tabular data is deep learning particularly good at?

20. What’s a key downside of directly using a deep learning model for recommendation systems?

The model predicts how the individual would rate the product, not whether you should recommend it to them. For example, they may already have a washing machine so advertising one to them would be a waste of time as they will not buy another, even if they would rate this particular washing machine highly.
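
A small sketch of the gap between the two, assuming the model has already produced a predicted rating per product. The products, ratings and the `recommend` helper are all invented, and filtering out owned items is only one of the adjustments a real system would need.

```python
def recommend(predicted_ratings, already_owned, top_n=2):
    """Rank by predicted rating, skipping products the customer already has."""
    candidates = {p: r for p, r in predicted_ratings.items()
                  if p not in already_owned}
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

predicted = {"washing machine": 4.9, "detergent": 4.2, "toaster": 3.1, "kettle": 3.8}
owned = {"washing machine"}

# Using the raw model output puts the washing machine first.
naive = sorted(predicted, key=predicted.get, reverse=True)[:2]
print(naive)                       # ['washing machine', 'detergent']
print(recommend(predicted, owned)) # ['detergent', 'kettle']
```

The predicted ratings are identical in both cases; only the decision about what to show changes.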

21. What are the steps of the Drivetrain Approach?

22. How do the steps of the Drivetrain Approach map to a recommendation system?

23. Create an image recognition model using data you curate, and deploy it on the web.

I did this in a separate blog post. https://blog.franglen.io/posts/2021-02-18-man-woman-classifier/

It’s not really “deployed”, but gradio can run it online anytime.

24. What is DataLoaders?

It’s a fastai construct that pairs a training dataloader with a validation dataloader. It has a lot of useful methods like showing a batch or applying transformations correctly to test and validation.

25. What four things do we need to tell fastai to create DataLoaders?

In my experience it has been:

Where the data is
What split to use for validation
How to transform it

I’m not sure what number 4 is.

26. What does the splitter parameter to DataBlock do?

It decides how the data is divided into the training and validation sets.

27. How do we ensure a random split always gives the same validation set?

Provide a fixed seed before initializing the split.
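
In the fastbook notebook this is done with `RandomSplitter(valid_pct=0.2, seed=42)`. The sketch below is a plain Python stand-in that I wrote to show the idea, not fastai's implementation: `make_random_splitter` is a hypothetical helper that returns a function mapping the items to a pair of index lists.

```python
import random

def make_random_splitter(valid_pct=0.2, seed=None):
    """Return a function that splits item indices into (train, valid)."""
    def splitter(items):
        rng = random.Random(seed)  # fresh generator per call, so a seed repeats
        indices = list(range(len(items)))
        rng.shuffle(indices)
        cut = int(len(items) * valid_pct)
        return indices[cut:], indices[:cut]
    return splitter

items = [f"img_{i}.jpg" for i in range(10)]
split = make_random_splitter(valid_pct=0.2, seed=42)

train_a, valid_a = split(items)
train_b, valid_b = split(items)
print(valid_a == valid_b)  # True - same seed, same validation set
```

Without a fixed seed the validation set changes from run to run, which makes it hard to tell whether a change in the metric came from the model or from the split.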