I’ve been preparing for the paper that I am writing by training GPT2 in varying sizes and with varying prompt lengths. The results can be viewed here, and the code is here.
It has been going well, however I was struck by something very unusual. The GPT2-small and GPT2-medium models have trained well and have comparable performance. When I train GPT2-large the performance is terrible and never improves.
model         batch size   best accuracy   token count
GPT2-small    32           0.8704          20
GPT2-medium   32           0.8933          5
GPT2-large    16           0.4908          ALL
I’ve been looking at the code trying to figure out why this is. There is clearly some underlying reason, as every epoch of every prompt length so far has resulted in exactly the same accuracy: 0.4908.
Reviewing the Dataset
I suspect that something about the training has caused the model to collapse immediately, so that it predicts a single value over and over. The dataset is GLUE SST2 (Stanford Sentiment Treebank), so investigating that would be a good start. Is there a single value that could produce this accuracy?
Code
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
sst2_df = sst2["validation"].to_pandas()
sst2_df
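Looking at the dataframe alone doesn’t answer that, so it’s worth checking the label balance of the validation split directly. The fraction of rows labelled 0 (the negative class) comes out at the same 0.4908 as the stuck accuracy.
Code
# what fraction of the validation rows have a label of zero?
(sst2_df.label == 0).mean()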
So a model that predicts zero every time would produce exactly this accuracy. Is the trained model doing that?
Reviewing the Model Output
I’ve saved the “best” model for each run, so it should be possible to load it up and test it. To do that in this blog post I need to copy over the model definition, which is available below.
Code
#collapse
"""This defines the prompt training models."""
# pylint: disable=abstract-method, arguments-differ, too-many-ancestors
import logging

import torch
import transformers.models.gpt2.modeling_gpt2 as gpt2_module
from transformers import GPT2ForSequenceClassification

# disable warning about padding tokens:
# > MODEL will not detect padding tokens in `inputs_embeds`. Results may be
# > unexpected if using padding tokens in conjunction with `inputs_embeds.`
gpt2_module.logger.setLevel(logging.CRITICAL)


class PromptTrainingGPT2ForSequenceClassification(GPT2ForSequenceClassification):
    def __init__(self, config) -> None:
        super().__init__(config)
        assert self.config.pad_token_id is not None

        embedding = self.transformer.wte
        vocab_size = embedding.weight.shape[0]
        prompt_indexes = torch.randint(
            size=(config.prompt_tokens,), low=0, high=vocab_size, device=self.device
        )
        self.prompt = torch.nn.Parameter(embedding(prompt_indexes).clone()[None, :, :])

        # We add this to the input before adding the prompt tokens, it's full of the padding tokens.
        # Making it a parameter means it will get put on the right device and
        # loss will work even though it is not trained.
        input_extension = torch.ones(
            (config.prompt_tokens,), dtype=int, device=self.device
        )
        input_extension *= self.config.pad_token_id
        self.input_extension = torch.nn.Parameter(
            embedding(input_extension)[None, :, :]
        )

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        **kwargs,
    ):
        """
        This converts the input_ids into the embedding space and adds the
        prompt onto it. Then the base forward method can be invoked.
        """
        if attention_mask is not None:
            inputs_embeds = self._extend_inputs_embeds(self._to_embedding(input_ids))
            attention_mask = self._extend_attention_mask(attention_mask)
            self._copy_prompt(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
            )
        else:
            inputs_embeds = self._add_prompt(self._to_embedding(input_ids))
        return super().forward(
            inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs
        )

    def _to_embedding(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.transformer.wte(tokens)

    def _extend_inputs_embeds(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = inputs_embeds.shape[0]
        input_extension = self.input_extension.repeat_interleave(batch_size, dim=0)
        return torch.cat(
            [
                inputs_embeds,
                input_extension,
            ],
            dim=1,
        )

    def _extend_attention_mask(self, attention_mask: torch.Tensor) -> torch.Tensor:
        batch_size = attention_mask.shape[0]
        prompt_size = self.prompt.shape[1]
        return torch.cat(
            [
                attention_mask,
                torch.zeros((batch_size, prompt_size), device=self.device),
            ],
            dim=1,
        )

    def _copy_prompt(
        self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor
    ) -> None:
        prompt = self.prompt
        prompt_size = prompt.shape[1]
        attention_indexes = attention_mask.sum(dim=1).long().tolist()
        for batch_index, token_index in enumerate(attention_indexes):
            end_index = token_index + prompt_size
            inputs_embeds[batch_index, token_index:end_index] = prompt[0]
            attention_mask[batch_index, token_index:end_index] = 1
Code
from pathlib import Path

GPT2_LARGE_MODELS = sorted(
    Path("/home/matthew/Programming/Python/prompt-training/models").glob("gpt2-large_*")
)
Code
from transformers import AutoTokenizer

model = PromptTrainingGPT2ForSequenceClassification.from_pretrained(GPT2_LARGE_MODELS[0])
model.eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
Code
model.device
device(type='cpu')
I’m still running the sweeps for the prompt training, so I want to keep this model on the CPU.
Code
with torch.no_grad():
    output = model(**tokenizer(sst2_df.iloc[0].sentence, return_tensors="pt"))
output.logits
tensor([[1.2729]])
Code
with torch.no_grad():
    output = model(**tokenizer(sst2_df.iloc[1].sentence, return_tensors="pt"))
output.logits
tensor([[0.0388]])
So this doesn’t make any sense: the 0th sentence has a label of 1 and the 1st has a label of 0, so these predictions look correct. How does the model perform against all of the rows?
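As a rough check (this mirrors the argmax-based accuracy metric that the sweep uses, rather than being the actual evaluation code), I can run the saved model over every validation sentence and compare the predictions to the labels:
Code
# score the saved model over the whole validation split, taking the argmax of
# the logits as the prediction, the same way the sweep metric does
predictions = []
with torch.no_grad():
    for sentence in sst2_df.sentence:
        logits = model(**tokenizer(sentence, return_tensors="pt")).logits
        predictions.append(logits.argmax(dim=1).item())

labels = torch.tensor(sst2_df.label.tolist())
(torch.tensor(predictions) == labels).float().mean()
Sure enough, this comes back at 0.4908 again.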
I think I know what the problem is. The GPT2-large model is performing a regression: it has a single output. That means that run.predictions.argmax(axis=1) calculates the argmax of a single value, which is always 0. So the metric thinks that every prediction is 0, which is consistent with the accuracy values.
We can recreate this as follows:
Code
import numpy as np

# [[1.2729]] was the output from the first row
np.array([[1.2729]]).argmax(axis=1)
array([0])
I suspect that I need to be explicit about the number of labels to produce. Is this the difference between GPT2-{small,medium} and GPT2-large?
We can see if it is by reviewing the score layer of each model, as that is the classification layer. I think that the out_features for gpt2 is 2 while for gpt2-large it is 1.
Code
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5
PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2", config=config).score
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight', 'input_extension', 'prompt']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
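And the same check for gpt2-large, assuming the same config settings as in the gpt2 cell above:
Code
config = AutoConfig.from_pretrained("gpt2-large")
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5
PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2-large", config=config).score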
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['prompt', 'input_extension', 'score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
So this is it: the score layer differs between the two models. I feel this is a case of relying on defaults that change unexpectedly between checkpoints.
I should probably open a ticket about this. To verify that the difference comes from huggingface rather than my subclass, I can check the underlying huggingface model.
Code
from transformers import AutoModelForSequenceClassification

gpt2_small_features = AutoModelForSequenceClassification.from_pretrained("gpt2").score.out_features
gpt2_large_features = AutoModelForSequenceClassification.from_pretrained("gpt2-large").score.out_features
gpt2_small_features, gpt2_large_features
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(2, 1)
I’ve opened this ticket here. Hopefully it’s an easy fix.
So the fix for my code? I need to set num_labels explicitly rather than relying on the default.
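Something along these lines, reusing the config pattern from earlier (the values here mirror the gpt2 cell above rather than my actual sweep configuration):
Code
config = AutoConfig.from_pretrained("gpt2-large")
config.num_labels = 2  # SST2 is binary classification, so be explicit about the two labels
config.pad_token_id = config.eos_token_id
config.prompt_tokens = 5
PromptTrainingGPT2ForSequenceClassification.from_pretrained("gpt2-large", config=config).score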
Some weights of PromptTrainingGPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['prompt', 'input_extension', 'score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
So this seems like an easy fix, and I’ll apply it to the code. Running the GPT2-large sweep takes about a day, so that will have to wait until I have some free time.
On the plus side, this does show that prompt training can be used for regression tasks.