Prompt Training - Strange Results

Prompt Training GPT2 works well until GPT2-large
Published June 27, 2021

I’ve been preparing for the paper that I am writing by training GPT2 at varying sizes and prompt lengths. The results can be viewed here, and the code is here.

It’s been going well; however, I was struck by something very unusual. The GPT2-small and GPT2-medium models have trained well and have comparable performance. When I train GPT2-large, the performance is terrible and never improves.

model         batch size   best accuracy   token count
GPT2-small    32           0.8704          20
GPT2-medium   32           0.8933          5
GPT2-large    16           0.4908          ALL

I’ve been looking at the code, trying to figure out why this is happening. There is clearly some underlying reason, as every epoch of every prompt length so far has resulted in the same accuracy: 0.4908.


Reviewing the Dataset

I suspect that something about the training has caused the model to collapse immediately, so that it predicts a single value over and over. The dataset is GLUE SST2 (the Stanford Sentiment Treebank), so investigating that would be a good start. Is there a single value that could produce this accuracy?

Code
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
sst2_df = sst2["validation"].to_pandas()
sst2_df
Reusing dataset glue (/home/matthew/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
idx label sentence
0 0 1 it 's a charming and often affecting journey .
1 1 0 unflinchingly bleak and desperate
2 2 1 allows us to hope that nolan is poised to emba...
3 3 1 the acting , costumes , music , cinematography...
4 4 0 it 's slow -- very , very slow .
... ... ... ...
867 867 0 has all the depth of a wading pool .
868 868 1 a movie with a real anarchic flair .
869 869 0 a subject like this should inspire reaction in...
870 870 0 ... is an arthritic attempt at directing by ca...
871 871 1 looking aristocratic , luminous yet careworn i...

872 rows × 3 columns

Code
sst2_df.label.value_counts()
1    444
0    428
Name: label, dtype: int64
Code
len(sst2_df[sst2_df.label == 0]) / len(sst2_df)
0.4908256880733945

So the model could produce this accuracy by predicting zero every time. Is the trained model actually doing that?


Reviewing the Model Output

I’ve saved the “best” model for each run, so it should be possible to load it up and test it. To do that in this blog post I need to copy over the model definition, which is included below.

Code
#collapse

"""
This defines the prompt training models.
"""

# pylint: disable=abstract-method, arguments-differ, too-many-ancestors

import logging

import torch
import transformers.models.gpt2.modeling_gpt2 as gpt2_module
from transformers import GPT2ForSequenceClassification

# disable warning about padding tokens:
# > MODEL will not detect padding tokens in `inputs_embeds`. Results may be
# > unexpected if using padding tokens in conjunction with `inputs_embeds.`
gpt2_module.logger.setLevel(logging.CRITICAL)


class PromptTrainingGPT2ForSequenceClassification(GPT2ForSequenceClassification):
    def __init__(self, config) -> None:
        super().__init__(config)

        assert self.config.pad_token_id is not None

        embedding = self.transformer.wte
        vocab_size = embedding.weight.shape[0]
        prompt_indexes = torch.randint(
            size=(config.prompt_tokens,), low=0, high=vocab_size, device=self.device
        )
        self.prompt = torch.nn.Parameter(embedding(prompt_indexes).clone()[None, :, :])

        # This is appended to the input to make room for the prompt tokens;
        # it is full of padding token embeddings. Making it a parameter means
        # it will be put on the right device and the loss will still work
        # even though it is not trained.
        input_extension = torch.ones(
            (config.prompt_tokens,), dtype=int, device=self.device
        )
        input_extension *= self.config.pad_token_id
        self.input_extension = torch.nn.Parameter(
            embedding(input_extension)[None, :, :]
        )

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        **kwargs,
    ):
        """
        This converts the input_ids into the embedding space and adds the prompt onto it.
        Then the base forward method can be invoked.
        """
        if attention_mask is not None:
            inputs_embeds = self._extend_inputs_embeds(self._to_embedding(input_ids))
            attention_mask = self._extend_attention_mask(attention_mask)
            self._copy_prompt(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
            )
        else:
            inputs_embeds = self._add_prompt(self._to_embedding(input_ids))

        return super().forward(
            inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs
        )

    def _to_embedding(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.transformer.wte(tokens)
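
    def _add_prompt(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # NOTE: this helper is called in forward when no attention mask is
        # passed, but it was not included in the copied definition above.
        # This is an assumed minimal version that, consistent with
        # _copy_prompt, appends the prompt after every sequence in the batch.
        batch_size = inputs_embeds.shape[0]
        prompt = self.prompt.repeat_interleave(batch_size, dim=0)
        return torch.cat([inputs_embeds, prompt], dim=1)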

    def _extend_inputs_embeds(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = inputs_embeds.shape[0]
        input_extension = self.input_extension.repeat_interleave(batch_size, dim=0)
        return torch.cat(
            [
                inputs_embeds,
                input_extension,
            ],
            dim=1,
        )

    def _extend_attention_mask(self, attention_mask: torch.Tensor) -> torch.Tensor:
        batch_size = attention_mask.shape[0]
        prompt_size = self.prompt.shape[1]
        return torch.cat(
            [
                attention_mask,
                torch.zeros((batch_size, prompt_size), device=self.device),
            ],
            dim=1,
        )

    def _copy_prompt(
        self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor
    ) -> None:
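        # The inputs have already been extended with prompt-sized padding, so
        # for each sequence write the prompt embeddings into the positions
        # straight after the real tokens (the first padded position, assuming
        # right padding) and mark those positions as attended.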
        prompt = self.prompt
        prompt_size = prompt.shape[1]
        attention_indexes = attention_mask.sum(dim=1).long().tolist()
        for batch_index, token_index in enumerate(attention_indexes):
            end_index = token_index + prompt_size
            inputs_embeds[batch_index, token_index:end_index] = prompt[0]
            attention_mask[batch_index, token_index:end_index] = 1
Code
from pathlib import Path

GPT2_LARGE_MODELS = sorted(
    Path("/home/matthew/Programming/Python/prompt-training/models").glob("gpt2-large_*")
)
Code
from transformers import AutoTokenizer

model = PromptTrainingGPT2ForSequenceClassification.from_pretrained(GPT2_LARGE_MODELS[0])
model.eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
Code
model.device
device(type='cpu')

I’m still running the sweeps for the prompt training, so I want to keep the model on the CPU.

Code
with torch.no_grad():
    output = model(**tokenizer(sst2_df.iloc[0].sentence, return_tensors="pt"))
output.logits
tensor([[1.2729]])
Code
with torch.no_grad():
    output = model(**tokenizer(sst2_df.iloc[1].sentence, return_tensors="pt"))
output.logits
tensor([[0.0388]])

So this doesn’t make any sense: the 0th sentence has a label of 1 and the 1st has a label of 0, so these predictions are correct. How does the model perform against all of the rows?

I wonder if my metric is off.

Code
def predict(text: str) -> float:
    with torch.no_grad():
        output = model(**tokenizer(text, return_tensors="pt"))
    return output.logits.item()
Code
sst2_df["prediction"] = sst2_df.sentence.apply(predict)
Code
sst2_df.prediction
0      1.272908
1      0.038817
2      0.882452
3      1.114154
4      0.079426
         ...   
867    0.487619
868    0.789458
869    0.668021
870    0.136636
871    0.842498
Name: prediction, Length: 872, dtype: float64
Code
((sst2_df.prediction > 0.5) == (sst2_df.label == 1)).sum()
763
Code
((sst2_df.prediction > 0.5) == (sst2_df.label == 1)).sum() / len(sst2_df)
0.875

So I think my metric is off? How did it work for the previous models then?


Reviewing the Metric Calculation

The metric calculation hasn’t changed, so why is it now broken? This is the exact code that is used to calculate the metric.

Code
from typing import Dict

from datasets import load_metric
from transformers import EvalPrediction

sst2_metric = load_metric("glue", "sst2")

def compute_sst2_metrics(run: EvalPrediction) -> Dict[str, float]:
    targets = run.label_ids
    predictions = run.predictions.argmax(axis=1)
    return sst2_metric.compute(predictions=predictions, references=targets)

I think I know what the problem is. The GPT2-large model is performing a regression: it has a single output. That means that run.predictions.argmax(axis=1) calculates the argmax over a single value, which is always 0. So the metric thinks that every prediction is 0, which is consistent with the accuracy values.

We can recreate this as follows:

Code
import numpy as np

# [[1.2729]] was the output from the first row
np.array([[1.2729]]).argmax(axis=1)
array([0])
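
For contrast, the two-logit output that GPT2-small and GPT2-medium produce lets argmax pick the index of the larger logit, which is the predicted class, so the metric worked for those runs. A minimal illustration (these logit values are made up):

Code
import numpy as np

# a hypothetical two-logit prediction, as produced by a num_labels=2 head;
# argmax along axis 1 returns the index of the larger logit, i.e. the class
np.array([[0.2, 1.3]]).argmax(axis=1)  # -> array([1])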

I suspect that I need to be explicit about the number of labels to produce. Is this the difference between GPT2-{small,medium} and GPT2-large?

We can check this by reviewing the score layer of each model, as that is the classification head. I expect that out_features for gpt2 is 2 while for gpt2-large it is 1.

Code
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5

PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2", config=config).score
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight', 'input_extension', 'prompt']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Linear(in_features=768, out_features=2, bias=False)
Code
config = AutoConfig.from_pretrained("gpt2-large")
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5

PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2-large", config=config).score
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['prompt', 'input_extension', 'score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Linear(in_features=1280, out_features=1, bias=False)

So that’s it: the score layer differs between the two models. This feels like a case of relying on defaults that change unexpectedly.

I should probably open a ticket about this. To verify it, I should check the underlying huggingface model.

Code
from transformers import AutoModelForSequenceClassification

gpt2_small_features = AutoModelForSequenceClassification.from_pretrained("gpt2").score.out_features
gpt2_large_features = AutoModelForSequenceClassification.from_pretrained("gpt2-large").score.out_features

gpt2_small_features, gpt2_large_features
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(2, 1)

I’ve opened this ticket here. Hopefully it’s an easy fix.

So the fix for my code? I need to set num_labels explicitly rather than relying on the default.

Code
config = AutoConfig.from_pretrained("gpt2-large")
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5
config.num_labels = 2

PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2-large", config=config).score
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['prompt', 'input_extension', 'score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Linear(in_features=1280, out_features=2, bias=False)

So this seems like an easy fix, and I’ll apply it to the code. Running the GPT2-large sweep takes about a day, so that will have to wait until I have some free time.

On the plus side, this does show that prompt training can be used for regression tasks.
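
As an aside, the already-trained single-output checkpoints could still be scored without retraining by making the metric aware of the regression-style output. This is only a sketch, not the code used for the sweeps: compute_sst2_metrics_regression is a hypothetical name, and the 0.5 threshold simply mirrors the manual check above.

Code
from typing import Dict

from datasets import load_metric
from transformers import EvalPrediction

sst2_metric = load_metric("glue", "sst2")

def compute_sst2_metrics_regression(run: EvalPrediction) -> Dict[str, float]:
    targets = run.label_ids
    logits = run.predictions
    if logits.shape[-1] == 1:
        # single-output (regression style) head: threshold the score
        predictions = (logits[:, 0] > 0.5).astype(int)
    else:
        # standard two-logit classification head: take the argmax as before
        predictions = logits.argmax(axis=1)
    return sst2_metric.compute(predictions=predictions, references=targets)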