Distillation - Cosine Similarity

Create a distilled model with a different adjustment to the teacher model

distillation
Published

April 24, 2022

Distillation is the process of training a small model on a task with the help of a larger, already trained, model. The small model is referred to as the student and the large model is the teacher. Training the student combines the regular process of measuring accuracy against the labels with encouraging the student to match the distribution of predictions from the teacher. Matching the distribution shows the student which other classes are similar to the correct class, and guides it to the correct weights in a more holistic way.
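
In practice these two objectives get combined into a single loss. The exact weighting varies, so this is just a minimal sketch of the idea, with alpha as a hypothetical blending factor:

Code
import torch
import torch.nn.functional as F

def combined_loss(
    student_logits: torch.Tensor,
    labels: torch.Tensor,
    teacher_loss: torch.Tensor,
    alpha: float = 0.5,
) -> torch.Tensor:
    # regular classification loss against the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # blend in the loss measured against the teacher output
    return alpha * hard_loss + (1 - alpha) * teacher_loss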

The teacher is large and well trained, which means that it predicts the correct class very strongly. For the teacher to add value to the training we must soften this output so that it shows more of the class distribution, boosting the low probability classes and reducing the high probability class(es). This softening is controlled by a parameter called temperature.
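
As a rough sketch of what the temperature does (the training code may formulate this slightly differently): dividing the logits by a temperature above 1 before the softmax flattens the distribution.

Code
import torch
import torch.nn.functional as F

# made up teacher logits for a four class problem
logits = torch.tensor([8.0, 2.0, 1.0, 0.5])

print(F.softmax(logits, dim=-1))        # sharp: nearly all of the mass on class 0
print(F.softmax(logits / 4.0, dim=-1))  # temperature of 4: a much flatter distribution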

When trying out distillation I was concerned that the temperature parameter was not shaping the teacher output in an appropriate way. In the last blog post I investigated changing the temperature algorithm to better reflect the underlying probability distribution. This worked, but the original temperature algorithm performed better and was faster.

The measurement of loss against the teacher output is done using KL Divergence. When I have wanted to measure the similarity of two distributions I have used Cosine Similarity. In this post I investigate changing the distillation process to use Cosine Similarity Loss.
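
For reference, the KL Divergence teacher loss looks roughly like this (a sketch rather than the exact code from the earlier posts, and the temperature scaling factor in particular varies between formulations):

Code
import torch
import torch.nn.functional as F

def kl_teacher_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    temperature: float,
) -> torch.Tensor:
    # soften both outputs, then measure how far the student distribution
    # is from the teacher distribution
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2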

What is the difference?

Kullback–Leibler divergence is a statistical distance: a measure of how one probability distribution Q is different from a second, reference probability distribution P. wikipedia

Cosine similarity is a measure of similarity between two sequences of numbers. wikipedia

These sound extremely similar.

I have more experience with cosine similarity. The cosine similarity measures the angle between two vectors and produces a value between 1 (same direction) and -1 (opposite direction). It does not measure the magnitude.
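
A quick illustration with made up vectors: scaling a vector changes its magnitude but not its cosine similarity.

Code
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction, twice the magnitude
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

print(F.cosine_similarity(a, b, dim=0))  # 1.0
print(F.cosine_similarity(a, c, dim=0))  # -1.0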

When using Cosine Similarity as a training metric you also indicate whether the two vectors should be considered the same or different. It is frequently used to train semantic systems, where you can say that two input sequences have the same meaning (1) or different meanings (-1). The additional argument allows you to force two inputs to be different, like two utterances which have different meanings.
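
In PyTorch this shows up as the target argument of the cosine embedding loss. This sketch with made up vectors just shows the shape of the API:

Code
import torch
import torch.nn.functional as F

anchor = torch.tensor([[1.0, 2.0, 3.0]])
other = torch.tensor([[1.0, 2.0, 2.5]])

# a target of 1 trains the pair to point the same way, -1 pushes them apart
same = F.cosine_embedding_loss(anchor, other, torch.tensor([1.0]))
different = F.cosine_embedding_loss(anchor, other, torch.tensor([-1.0]))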

KL Divergence, in contrast, is only for making distributions match. I’m currently thinking of it like training a classifier with cross entropy, except that the correct output is a specific known distribution instead of a single class.
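
That analogy can be made concrete: recent versions of PyTorch let cross entropy take either a class index or a full probability distribution as the target (the numbers here are made up).

Code
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1]])

# hard label: class 0 is the single correct answer
F.cross_entropy(logits, torch.tensor([0]))

# soft target: the correct output is a known distribution over all classes
F.cross_entropy(logits, torch.tensor([[0.7, 0.2, 0.1]]))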

This does suggest to me that KL Divergence is the better metric. I do think that Cosine Similarity can be used, we are just not using it to its full extent.

Cosine Teacher Loss

I can just smash this out now. The cosine loss measures the student outputs against the teacher outputs. It’s important to realise that this measures the angle and not the magnitude, so this can change how well the student learns the decision boundaries from the teacher.

Given that the teacher will produce strongly confident predictions you would normally scale the model outputs in some way.

Code
import torch
import torch.nn.functional as F
from transformers.modeling_outputs import SequenceClassifierOutput

def cosine_loss(
    student_outputs: SequenceClassifierOutput,
    teacher_outputs: SequenceClassifierOutput,
    temperature: float,  # not used by this loss
    k: float,  # not used by this loss
) -> torch.Tensor:
    assert student_outputs.logits.size() == teacher_outputs.logits.size()

    # a target of 1 for every row tells the loss that the student logits
    # should point in the same direction as the teacher logits
    batch_size = student_outputs.logits.shape[0]
    loss = F.cosine_embedding_loss(
        input1=student_outputs.logits,
        input2=teacher_outputs.logits,
        target=torch.ones(batch_size, device=student_outputs.logits.device),
        reduction="mean",
    )
    return loss
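
A quick sanity check with random logits (hypothetical shapes, not part of the actual training code):

Code
dummy_student = SequenceClassifierOutput(logits=torch.randn(4, 3))
dummy_teacher = SequenceClassifierOutput(logits=torch.randn(4, 3))

# temperature and k are accepted but ignored by this loss
cosine_loss(dummy_student, dummy_teacher, temperature=1.0, k=1.0)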

Results

I’m recording this in the same WandB project as before, since it strongly relates to that work. The code for this can be found here.

Code
sweep_id_to_name = {
    "rc7u9aec": "temperature",
    "y2nqpee4": "cosine-scaling-fixed-k",
    "1s1egorq": "cosine-temperature",
}
Code
import pandas as pd
import wandb.apis.public as wandb_api

api = wandb_api.Api()
runs = api.runs(path="matthewfranglen/distillation-temperature")
results = pd.DataFrame([
    {"sweep_id": run.sweep.name, "accuracy": run.summary["eval/accuracy"]}
    for run in runs
])
results["sweep"] = results.sweep_id.map(sweep_id_to_name)
results = results[~results.sweep.isnull()]
Code
results
sweep_id accuracy sweep
0 1s1egorq 0.034516 cosine-temperature
1 1s1egorq 0.946452 cosine-temperature
2 1s1egorq 0.951935 cosine-temperature
3 1s1egorq 0.942581 cosine-temperature
4 1s1egorq 0.947742 cosine-temperature
... ... ... ...
631 rc7u9aec 0.248710 temperature
632 rc7u9aec 0.012258 temperature
633 rc7u9aec 0.601613 temperature
634 rc7u9aec 0.032258 temperature
635 rc7u9aec 0.567097 temperature

364 rows × 3 columns

Code
(
    results[["sweep", "accuracy"]]
        .groupby("sweep")
        .agg(max)
        .rename(columns={"accuracy": "best_accuracy"})
        .sort_values(by="best_accuracy", ascending=False)
)
best_accuracy
sweep
cosine-scaling-fixed-k 0.954839
temperature 0.954194
cosine-temperature 0.952581
Code
(
    results[["sweep", "accuracy"]]
        .groupby("sweep")
        .agg(len)
        .rename(columns={"accuracy": "runs"})
)
runs
sweep
cosine-scaling-fixed-k 173
cosine-temperature 63
temperature 128
Code
(
    results
        .sort_values(by="accuracy")
        [["accuracy", "sweep"]]
        .boxplot(by="sweep", grid=False, figsize=(9,5))
) ; None

The Cosine Similarity version of Scaling has slightly edged out both the original temperature approach used in the workshop and the Cosine Similarity temperature version. Is this difference significant?

This is the chance for me to do a significance test!

I’m probably going to choose the wrong test, and interpret it in the wrong way. I have found the t-test, which is described as:

Suppose we observe two independent samples, e.g. flower petal lengths, and we are considering whether the two samples were drawn from the same population (e.g. the same species of flower or two species with similar petal characteristics) or two different populations.

The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the probability of observing as or more extreme values assuming the null hypothesis, that the samples are drawn from populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means.

Given the accuracy scores from the Cosine Similarity version of Scaling and KL Divergence version of Temperature, can we say that there is a significant difference?

Code
from scipy.stats import ttest_ind

ttest_ind(
    results[results.sweep == "cosine-scaling-fixed-k"].accuracy,
    results[results.sweep == "temperature"].accuracy,
    alternative="greater",
)
Ttest_indResult(statistic=1.3472007428341934, pvalue=0.08946782978885727)

I’ve chosen to test if the Cosine Scaling runs are more accurate than the Temperature runs.

The p-value here does not indicate a significant difference, as it does not fall below the 5% threshold. The number of runs is large enough that the test itself should be valid. Really, the uncertainty comes from whether this is the right test to use at all. Remember that the hyperparameters for these two sweeps were selected using the Bayesian method, so they are not randomly distributed across the two techniques.

Anyway, that was fun. Doing this was more a way to explore distillation by digging into the teacher-student training.