Evaluating prompt training on a more popular dataset
Published
April 28, 2021
I’ve recently been investigating prompt training, and I’ve read a paper by Google on the same technique. The biggest problem with my evaluation was the inability to compare my results to the work of others - the sentiment140 dataset just doesn’t seem to be widely studied.
So I’ve found another dataset which is more popular {% cite maas-EtAl:2011:ACL-HLT2011 %}. There are several entries for it on Papers with Code, and it’s another binary sentiment classification problem, so I should be able to apply the same techniques as before.
This is going to be a comparatively focused evaluation of prompt training GPT-2 small on the dataset. If you want more details on the technique then consider reading my previous posts (my proposal or the Google paper review).
Dataset
The dataset is 50,000 movie reviews, split into equal numbers of positive and negative reviews. A train/test split has already been provided (a 50/50 ratio), which I will honor because I want to compare the results of prompt training to other work on this dataset. Each review is stored in a separate file.
The first thing to do will be to load the data into dataframes.
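The raw dataset comes as one folder per split, each with pos and neg subfolders containing one review per text file. Here is a minimal loading sketch, assuming the archive has been extracted to aclImdb/ and mapping the folders onto the good/bad labels that the dataloader below expects; the load_split helper is just for illustration.

Code

from pathlib import Path

import pandas as pd

def load_split(folder: Path) -> pd.DataFrame:
    """Read every review in the pos/ and neg/ subfolders into a dataframe."""
    rows = [
        {"text": path.read_text(encoding="utf-8"), "sentiment": label}
        for subfolder, label in [("pos", "good"), ("neg", "bad")]
        for path in sorted((folder / subfolder).glob("*.txt"))
    ]
    return pd.DataFrame(rows)

train_df = load_split(Path("aclImdb/train"))
test_df = load_split(Path("aclImdb/test"))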
I can reuse the dataloader from my previous evaluation as the dataframe format matches. Once again I am using GPT-2, so the past key values are available. I’ve adjusted the dataloader to better match regular dataloaders - it now works by epoch. This is because the dataset is small enough that iterating over it several times is worthwhile.
Code
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval() ; None
Code
from typing import *

import pandas as pd
import torch

Past = Tuple[Tuple[torch.Tensor, ...], ...]

class PastDataloader:
    def __init__(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        df: pd.DataFrame,
        batch_size: int,
        max_length: int,
        device: torch.device = torch.device("cuda"),
        shuffle: bool = True,
    ) -> None:
        tokenizer.pad_token = tokenizer.eos_token  # needed to enable padding
        model.to(device)
        self.model = model
        self.tokenizer = tokenizer
        self.df = df
        self.batch_size = batch_size
        self.max_length = max_length
        self.device = device
        self.shuffle = shuffle

    def __iter__(self) -> Iterator[Dict[str, torch.Tensor]]:
        """Returns an iterator that returns the batched rows in a random order.
        This always returns full batches."""
        if self.shuffle:
            df = self.df.sample(frac=1).reset_index(drop=True)
        else:
            df = self.df
        batch_size = self.batch_size
        for i in range(len(self)):
            start = i * batch_size
            end = start + batch_size
            yield self._get(df[start:end])

    def __len__(self) -> int:
        """Returns the total number of full batches that can be returned."""
        return len(self.df) // self.batch_size

    @torch.no_grad()
    def _get(self, rows: pd.DataFrame) -> Dict[str, Union[torch.Tensor, Past]]:
        tokens = self.tokenizer(
            rows.text.tolist(),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=self.max_length,
        ).to(self.device)
        past_key_values = self.model(**tokens).past_key_values
        # GOOD_TOKEN and BAD_TOKEN are the token ids used as the two sentiment labels
        labels = torch.tensor([
            GOOD_TOKEN if label == "good" else BAD_TOKEN
            for label in rows.sentiment
        ], dtype=torch.long, device=self.device)
        return {
            "past_key_values": past_key_values,
            "attention_mask": tokens["attention_mask"],
            "labels": labels,
        }
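With the dataframes loaded, the batches for the shorter runs below (128 text tokens, batch size 32 - the values mentioned later in this post) can be produced like this:

Code

train_dataloader = PastDataloader(
    model,
    tokenizer,
    train_df,
    batch_size=32,
    max_length=128,
)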
Now let’s see how the model performs with our custom prompts. This is going to vary the number of tokens taken from the text (some of the reviews are longer than the model can take, so the input will sometimes be truncated no matter what we do). We will also evaluate 5 and 20 token prompts, as prompts of that size performed strongly in the Google evaluation.
Baseline Comparison
We can start with a baseline comparison to see what the model is like without a prompt.
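One way to get such a baseline is to run each review through the model and compare the next-token logits for the two label tokens (GOOD_TOKEN and BAD_TOKEN from the dataloader above). The sketch below does exactly that; the no_prompt_predictions helper and its defaults are illustrative rather than the exact evaluation code behind the numbers that follow.

Code

from sklearn.metrics import classification_report

device = torch.device("cuda")

@torch.no_grad()
def no_prompt_predictions(
    df: pd.DataFrame, batch_size: int = 32, max_length: int = 128
) -> List[str]:
    predictions = []
    for start in range(0, len(df), batch_size):
        rows = df[start:start + batch_size]
        tokens = tokenizer(
            rows.text.tolist(),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(device)
        logits = model(**tokens).logits
        # index of the last real (non-padding) token of each review
        last_token = tokens["attention_mask"].sum(dim=-1) - 1
        final_logits = logits[torch.arange(len(rows), device=device), last_token]
        predictions += [
            "good" if row[GOOD_TOKEN] > row[BAD_TOKEN] else "bad"
            for row in final_logits
        ]
    return predictions

print(classification_report(test_df.sentiment, no_prompt_predictions(test_df)))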
precision recall f1-score support
good 0.84 0.89 0.87 12497
bad 0.88 0.83 0.86 12495
accuracy 0.86 24992
macro avg 0.86 0.86 0.86 24992
weighted avg 0.86 0.86 0.86 24992
Current SOTA on this dataset is 97.4 with other training data, or 94.5 without other training data (Al-Shedivat, Dubey, and Xing 2020), so I have some way to go. This result sits near the bottom of the Papers with Code charts. I think that using GPT-2 would count as using other training data, since it fine-tunes a pretrained language model. It’s still interesting to compare against the no-extra-data evaluations, because that suggests how well this approach could perform on novel tasks that may previously have been handled by statistical methods (e.g. random forests).
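The next step is to train a prompt while keeping the model itself frozen, feeding each review in as past key values. Below is a minimal sketch of what that training loop can look like; the prompt length, initialization and optimizer settings are illustrative assumptions rather than the exact configuration behind the results that follow.

Code

import torch.nn.functional as F

prompt_length = 20  # a 5 token prompt is evaluated as well

# Only the prompt is trained; the model stays frozen.
for parameter in model.parameters():
    parameter.requires_grad_(False)

# Initialize the prompt from the first few token embeddings (an arbitrary choice here).
prompt = torch.nn.Parameter(
    model.transformer.wte.weight[:prompt_length].clone()
)
optimizer = torch.optim.Adam([prompt], lr=1e-3)  # learning rate is a placeholder

for batch in train_dataloader:  # one epoch; in practice this repeats
    batch_size = batch["labels"].shape[0]

    # The review is already encoded in the past key values, so the prompt
    # embeddings are the only "input" tokens for this forward pass.
    inputs_embeds = prompt.unsqueeze(0).expand(batch_size, -1, -1)
    attention_mask = torch.cat([
        batch["attention_mask"],
        torch.ones(
            batch_size, prompt_length,
            dtype=torch.long, device=batch["attention_mask"].device,
        ),
    ], dim=-1)

    logits = model(
        inputs_embeds=inputs_embeds,
        past_key_values=batch["past_key_values"],
        attention_mask=attention_mask,
    ).logits

    # Predict the sentiment token (good or bad) at the final prompt position.
    loss = F.cross_entropy(logits[:, -1], batch["labels"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()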
precision recall f1-score support
good 0.90 0.83 0.86 12496
bad 0.84 0.90 0.87 12496
accuracy 0.87 24992
macro avg 0.87 0.87 0.87 24992
weighted avg 0.87 0.87 0.87 24992
Looks like it is hard to move the dial. Let’s try with more text.
1,000 Tokens
I’ve had to limit the text length because some of the tokenized reviews exceed what the model can handle. Limiting the text to 128 tokens also makes training faster, as I can run a batch size of 32. Inferring sentiment from so few tokens is harder, though, so I might be able to improve accuracy by making more tokens available.
The token limit needs to account for the size of the prompt, as that contributes to the overall token count. Currently I’m varying the prompt between 5 and 20 tokens. Since the maximum token count is 1,024, I can let the text go up to 1,000 tokens and still leave room for the largest prompt.
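Concretely, that just means raising max_length on the same dataloader. The batch size below is a placeholder, since 1,000 token sequences need a much smaller batch than the 32 used at 128 tokens.

Code

# 1,000 text tokens plus a 20 token prompt still fits in GPT-2's 1,024 token context.
long_dataloader = PastDataloader(
    model,
    tokenizer,
    train_df,
    batch_size=8,  # placeholder value, chosen to fit in memory
    max_length=1_000,
)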
So it edges slightly closer with a longer train. It’s possible that this has overfitted the training data and that a better result could have been achieved at a different point in the training run. I also think that using a bigger model could lead to better performance. Still, it’s good to understand the limitations of GPT-2 small.
All in all this is very encouraging.
Sanity Check
I just want to load the tokenizer / model / prompt from disk and check that the accuracy is still the same. Previously when doing this kind of thing I have found that the model changes even though it is not touched by the optimizer. If I have made such a mistake then I want to know about it, and then fix it.
Have I messed this up? Has the model been altered? I guess there is one way to check - compare it to the pretrained model.
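A quick way to do that is to reload a fresh copy of the pretrained weights and compare every parameter against the model that was in memory during training; a sketch of that check:

Code

fresh_model = AutoModelForCausalLM.from_pretrained("gpt2")

changed_layers = [
    name
    for (name, trained), (_, original) in zip(
        model.named_parameters(), fresh_model.named_parameters()
    )
    if not torch.equal(trained.detach().cpu(), original)
]
print(changed_layers)  # an empty list means no layer was altered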
Looks like none of the layers in the model have changed. Another way to do this sanity check would be to load the model and prompt from disk and see how they compare to the previous results.