Matthew’s Blog - Machine Learning Interview Questions

I recently saw this Twitter thread about the questions that @svpino likes to ask Machine Learning candidates during interviews. It would be interesting to investigate what a good answer to each of these might be.

The full thread follows below, with my answers interleaved.

@svpino

16 questions that I really like to ask during interviews for machine learning candidates.

These will help you practice, but more importantly, they will help you think and find ways to improve!

Let’s do this!

How do you handle an imbalanced dataset and avoid the majority class completely overtaking the other?

I had this problem when trying to train using the Open Images dataset. The person class just dominated the other classes.

There were two broad approaches that I tried to fix it. The first was to construct a balanced dataset (either by undersampling the overrepresented classes, or by repeating examples from the underrepresented ones). The other was to weight the updates to the network inversely to the prevalence of the class, so that rarer classes would result in larger weight updates.
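
A minimal sketch of both approaches, assuming integer class labels held in a NumPy array. The function names and the toy label counts are made up for illustration, and this is not the code used for the Open Images experiment.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Per-class loss weights that grow as a class gets rarer."""
    counts = np.bincount(labels, minlength=num_classes)
    # Guard against division by zero for classes that never appear.
    safe_counts = np.maximum(counts, 1)
    return len(labels) / (num_classes * safe_counts)


def oversample_indices(labels, num_classes, rng):
    """Indices into the dataset that repeat rare classes until all match the largest."""
    counts = np.bincount(labels, minlength=num_classes)
    target = counts.max()
    picked = []
    for cls in range(num_classes):
        cls_idx = np.flatnonzero(labels == cls)
        if len(cls_idx) == 0:
            continue  # nothing to repeat for an absent class
        # Keep every original example, then top up with random repeats.
        extra = rng.choice(cls_idx, size=target - len(cls_idx), replace=True)
        picked.append(np.concatenate([cls_idx, extra]))
    return rng.permutation(np.concatenate(picked))


# Toy labels: class 0 plays the role of the dominant "person" class.
labels = np.array([0] * 90 + [1] * 8 + [2] * 2)
rng = np.random.default_rng(0)

weights = inverse_frequency_weights(labels, num_classes=3)
balanced = oversample_indices(labels, num_classes=3, rng=rng)
print("class weights:", weights)
print("balanced class counts:", np.bincount(labels[balanced]))
```

The weights would typically be passed to a weighted loss function, while the balanced indices would be used to draw training examples. Oversampling repeats the same rare examples many times, so it can encourage overfitting on them.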

I wasn’t super happy with either approach but the balanced dataset approach seemed to get better results.

How do you deal with out-of-distribution samples in a classification problem?
How would you design a system that minimizes Type II errors?

The same place that sent me to Twitter also linked some blog posts by Cassie Kozyrkov about statistics. One of them incidentally covers Type I and Type II errors:

Type I error is changing your mind when you shouldn’t. Type II error is NOT changing your mind when you should.

It is stupidly hard to select text accurately on Medium.

Type I error is like convicting an innocent person and Type II error is like failing to convict a guilty person.

So at least I have some pithy definitions for the question.

One thing that comes up with data is the problem of posing more and more hypotheses against the same data until you find one where the data happens to show a statistically significant correlation. That would be a Type I error, as you would be changing your mind when you shouldn’t.
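
A small sketch of why repeated testing inflates Type I errors. It assumes the tests are independent and that every null hypothesis is actually true, in which case each p-value is uniform on [0, 1]. The function names, the significance level of 0.05 and the count of 20 hypotheses are illustrative choices, not something from the thread.

```python
import random


def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positive_rate(alpha, num_tests, trials, seed=0):
    """Estimate the same chance by simulating p-values under a true null."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each rng.random() stands in for the p-value of one hypothesis.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


analytic = family_wise_error_rate(alpha=0.05, num_tests=20)
simulated = simulate_false_positive_rate(alpha=0.05, num_tests=20, trials=10_000)
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

With 20 hypotheses at a 0.05 threshold the chance of at least one spurious "significant" result is well over half, even though each individual test only has a 5% false positive rate.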

Sometimes, the validation loss of your model is consistently lower than your training loss. Why could this be happening, and how can you fix it?

I should point out that in the Twitter thread this was mentioned as potentially indicating that the validation dataset contains easier examples than the training dataset.

This problem doesn’t seem to relate to the validation data being in the training dataset, as that would lead to the loss on the validation data being the same as the training loss. If it does relate to the validation examples being easier, then there must be some systematic difference between the validation data and the training data. That would make me question how well the accuracy measured on the validation dataset reflects real performance.

I would be inclined to review how the validation data was selected and possibly reselect it. I’m still not sure about this.

Explain what you would expect to see if we use a learning rate that’s too large to train a neural network.

I’ve done this before and it’s something that can happen if you don’t use a learning rate scheduler. The model can do one of two things:

It can plateau at a loss / accuracy score that is worse than what it could achieve with a more appropriate learning rate
It can just break and become very bad

The two different outcomes seem to relate to the degree to which the learning rate is wrong. If the learning rate is just slightly too high then the adjustments made each batch overshoot the optimal point and result in no improvement. The model still stays in the right area to produce reasonable results.

If the learning rate is massively too large then the model can jump around the loss landscape without ever settling, which is when it just breaks.
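
Both outcomes show up even on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is an assumption chosen for illustration rather than a real network; the function name and the three learning rates are hypothetical.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(x) = x**2 and return the loss after each step."""
    x = start
    losses = []
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x**2
        x = x - learning_rate * grad
        losses.append(x * x)
    return losses


for lr in (0.1, 1.0, 1.5):
    losses = gradient_descent(lr)
    print(f"lr={lr}: first loss {losses[0]:.3g}, final loss {losses[-1]:.3g}")
```

Each update multiplies x by (1 - 2 * learning_rate). At 0.1 the loss shrinks towards zero. At 1.0 every step jumps to the mirror-image point on the other side of the minimum, so the loss never improves. At 1.5 every step lands further away than it started and the loss blows up.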

What are the advantages and disadvantages of using a single sample (batch = 1) on every iteration when training a neural network?
What’s the process you follow to determine how many layers and nodes per layer you need for the neural network you are designing.
What do you think about directly using the model’s accuracy as the loss function to train a neural network?
Batch normalization is a popular technique when training deep neural networks. Why do you think this is the case? What benefits does it provide?
Why do you think combining a few weak learners into an ensemble would give us better results than any of the individual models alone?
Early stopping is a popular regularization technique. How does it work, and what are some of the triggers that you think could be useful?
Explain how one-hot-encoding the features of your dataset works. What are some disadvantages of using it?
Explain how Dropout works as a regularizer of your deep learning model.
How would you go about reducing the dimensionality of a dataset?
Can you achieve translation invariance when processing images using a fully-connected neural network? Why is this the case?
Why are deep neural networks usually more powerful than shallower but wider networks?