import pandas as pd

sentiment_train_df = pd.read_parquet("/data/sentiment/imdb-movie-reviews/train.gz.parquet")
sentiment_validation_df = pd.read_parquet("/data/sentiment/imdb-movie-reviews/validation.gz.parquet")
May 23, 2021
I’ve been using the prompt training technique to refine a language model into a classifier by training a very small set of parameters. This has been going well so far for tasks where I can easily select target tokens (like sentiment classification - the tokens good and bad work very well).
When the relationship between the tokens and the task is more distant, the classifier performs poorly. If I wanted a classifier that determined whether a piece of text was written by an author in the northern or southern hemisphere, then target tokens like relevant and irrelevant would not perform well. So part of the problem is selecting appropriate tokens to compare.
I have been trying to generalize this approach by training the selection of the target tokens, using the idea of a centroid over the model output. Each class in the classifier has its own centroid, and the centroid closest to a given output determines the classification. This was very tricky to train well, and while I finally got reasonable results using cross entropy loss, they still showed a significant drop in accuracy compared to the good and bad tokens.
I think that centroid training is a poor proxy for actually training a new linear classification head for the language model, so I am now going to train a classification head directly. I still want to be able to perform multiple tasks in a single batch (a key benefit that prompt training unlocks), so after this I am going to investigate training multiple classifiers and concatenating them, so that every classifier runs for each entry in the batch and the task specific output can be selected. Finally, multi task training of the prompt itself can be evaluated: can a single prompt classify the text according to multiple criteria simultaneously?
The two tasks that I am going to evaluate are sentiment analysis and emotion classification. I already have the IMDB dataset for sentiment; I just need to find a dataset for emotion.
Let’s start with a quick review of the sentiment dataset. It contains 25,000 IMDB movie reviews, each labelled positive or negative based on the associated score, so this is a binary classification problem.
 | label | text |
---|---|---|
0 | good | Bromwell High is a cartoon comedy. It ran at t... |
1 | good | Homelessness (or Houselessness as George Carli... |
2 | good | Brilliant over-acting by Lesley Ann Warren. Be... |
3 | good | This is easily the most underrated film inn th... |
4 | good | This is not the typical Mel Brooks film. It wa... |
... | ... | ... |
24995 | bad | Towards the end of the movie, I felt it was to... |
24996 | bad | This is the kind of movie that my enemies cont... |
24997 | bad | I saw 'Descent' last night at the Stockholm Fi... |
24998 | bad | Some films that you pick up for a pound turn o... |
24999 | bad | This is one of the dumbest films, I've ever se... |
25000 rows × 2 columns
For the emotion dataset I have to find and preprocess the data first. One dataset that I have found is the SemEval 2018 competition here (search for E-c) (Mohammad et al. 2018).
The two files I am using are the English training and development sets.
import pandas as pd
emotion_train_df = (
pd.read_csv("/data/emotion/sem-eval-2018/train.zip", delimiter="\t")
.drop(columns="ID")
.rename(columns={"Tweet": "text"})
)
emotion_validation_df = (
pd.read_csv("/data/emotion/sem-eval-2018/dev.zip", delimiter="\t")
.drop(columns="ID")
.rename(columns={"Tweet": "text"})
)
 | text | anger | anticipation | disgust | fear | joy | love | optimism | pessimism | sadness | surprise | trust |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | “Worry is a down payment on a problem you may ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | Whatever you decide to do make sure it makes y... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | @Max_Kellerman it also helps that the majorit... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | Accept the challenges so that you can literall... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | My roommate: it's okay that we can't spell bec... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6833 | @nicky57672 Hi! We are working towards your hi... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6834 | @andreamitchell said @berniesanders not only d... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
6835 | @isthataspider @dhodgs i will fight this guy! ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
6836 | i wonder how a guy can broke his penis while h... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
6837 | I'm highly animated even though I'm decomposing. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
6838 rows × 12 columns
I need to convert the different columns into a single multi target label.
EMOTION_LABELS = [
"anger",
"anticipation",
"disgust",
"fear",
"joy",
"love",
"optimism",
"pessimism",
"sadness",
"surprise",
"trust"
]
emotion_train_df["label"] = emotion_train_df.apply(
lambda row: row[EMOTION_LABELS].to_numpy(),
axis=1
)
emotion_validation_df["label"] = emotion_validation_df.apply(
lambda row: row[EMOTION_LABELS].to_numpy(),
axis=1
)
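The per-label frequencies shown below are presumably the column means over the training set; a minimal sketch of that calculation (an assumption, since the original cell is collapsed):

# Sketch: fraction of examples carrying each emotion, sorted as printed below.
# Assumes the frequencies were computed over the training set.
emotion_train_df[EMOTION_LABELS].mean().sort_values(ascending=False)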
disgust 0.380521
anger 0.372039
joy 0.362240
sadness 0.293653
optimism 0.290143
fear 0.181632
anticipation 0.143024
pessimism 0.116262
love 0.102369
surprise 0.052793
trust 0.052208
dtype: float64
I need to come up with a weighting parameter to address the unbalanced labels. The loss function I am going to use is BCEWithLogitsLoss, which can accommodate unbalanced labels through its pos_weight argument. It wouldn’t be possible to rebalance the dataset itself as the labels are intertwined: balancing one would unbalance the others.
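The weighting cell is collapsed, but the values printed below match the negative to positive ratio for each label, so that is what this sketch assumes; the label_frequency, emotion_pos_weight and emotion_loss_fn names are my own.

# Sketch only: pos_weight as the negative / positive ratio per label,
# which reproduces the values printed below.
import torch

label_frequency = emotion_train_df[EMOTION_LABELS].mean()
emotion_pos_weight = (1 - label_frequency) / label_frequency
emotion_loss_fn = torch.nn.BCEWithLogitsLoss(
    pos_weight=torch.tensor(
        emotion_pos_weight.to_numpy(), dtype=torch.float, device="cuda"
    )
)
emotion_pos_weight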
anger 1.687893
anticipation 5.991820
disgust 1.627978
fear 4.505636
joy 1.760597
love 8.768571
optimism 2.446573
pessimism 7.601258
sadness 2.405378
surprise 17.941828
trust 18.154062
dtype: float64
So this is a binary classification problem when considering each individual emotion. The dataset is smaller and the individual emotions are unbalanced, so it will be slightly harder. The only evaluation with published results I can quickly find is this one, which uses Pearson correlation as the evaluation metric:
Method | Joy | Anger | Sadness | Fear | Valence |
---|---|---|---|---|---|
Bidirectional LSTM | 0.49 | 0.35 | 0.47 | 0.49 | 0.32 |
Bidirectional LSTM + Lexicon Features | 0.54 | 0.43 | 0.47 | 0.55 | 0.51 |
Bidirectional LSTM with pretraining | 0.62 | 0.48 | 0.63 | 0.58 | 0.68 |
Bidirectional LSTM with pretraining + Lexicon Features | 0.6 | 0.5 | 0.64 | 0.55 | 0.71 |
Valence, or hedonic tone, is the affective quality referring to the intrinsic attractiveness/“good”-ness or averseness/“bad”-ness of an event, object, or situation. The term also characterizes and categorizes specific emotions. For example, emotions popularly referred to as “negative”, such as anger and fear, have negative valence. Valence - wikipedia
The valence label comes from a dataset that I have not downloaded so I will not be training or evaluating based on that.
I asked a work colleague about the results of this competition and they were able to find results much more easily than I could. The official competition results appear to be here. The top result is significantly better than what I found:
User | macro-avg | anger | fear | joy | sadness |
---|---|---|---|---|---|
venkatesh-1729 | 0.799 (1) | 0.827 (1) | 0.779 (1) | 0.792 (1) | 0.798 (1) |
It’s odd that this only covers 4 emotions though. I wonder if this is a different task (SemEval is made up of several tasks).
There is also this Papers with Code leaderboard, which reports accuracy and F1 instead of per-emotion scores:
model | accuracy | micro-f1 | macro-f1 |
---|---|---|---|
SpanEmo | 0.601 | 0.713 | 0.578 |
BERT+DK | 0.591 | 0.713 | 0.549 |
BERT-GCN | 0.589 | 0.707 | 0.563 |
Transformer | — | — | 0.561 |
I had a quick look at SpanEmo {% cite alhuzali-ananiadou-2021-spanemo %} and it looks like it is BERT feeding into a downstream network that then performs token classification? I would have to read the paper to get a proper idea of the technique.
Anyway, this is an exact match for the task and dataset that I am using, so aiming for these numbers should be reasonable.
A lot of this is copied from the previous notebooks and adjusted to fit the new classification layer.
As before we have a dataloader. This should work with both datasets.
#collapse
from typing import Dict, Iterator, Optional, Tuple, Union
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
Past = Tuple[Tuple[torch.Tensor, ...], ...]
TextBatch = Dict[str, torch.Tensor]
PastBatch = Dict[str, Union[torch.Tensor, Past]]
class TextDataloader:
"""Provides a dataloader over a text dataframe"""
def __init__(
self,
df: pd.DataFrame,
*,
tokenizer: AutoTokenizer,
batch_size: int,
max_length: int,
device: torch.device = torch.device("cuda"),
shuffle: bool = True,
multi_target: bool = False,
) -> None:
self.tokenizer = tokenizer
self.df = df
self.batch_size = batch_size
self.max_length = max_length
self.device = device
self.shuffle = shuffle
self.label_dtype = torch.float if multi_target else torch.long
def __iter__(self) -> Iterator[TextBatch]:
"""Returns an iterator that returns batches.
The final batch can be a partial batch."""
if self.shuffle:
df = self.df.sample(frac=1).reset_index(drop=True)
else:
df = self.df
batch_size = self.batch_size
for i in range(len(self)):
start = i * batch_size
end = start + batch_size
yield self.to_batch(df[start:end])
def __len__(self) -> int:
"""Returns the total number of batches that can be returned."""
full_batches = len(self.df) // self.batch_size
if len(self.df) % self.batch_size:
return full_batches + 1
return full_batches
def to_batch(self, rows: pd.DataFrame) -> TextBatch:
"""Converts the rows into a batch"""
tokens = self.tokenizer(
rows.text.tolist(),
return_tensors="pt",
padding=True,
truncation=True,
max_length=self.max_length,
).to(self.device)
labels = torch.tensor(rows.label.tolist(), dtype=self.label_dtype, device=self.device)
return {
"input_ids": tokens["input_ids"],
"attention_mask": tokens["attention_mask"],
"labels": labels,
}
class PastDataloader(TextDataloader): # pylint: disable=too-few-public-methods
"""Provides a dataloader which converts the text into past tensors"""
def __init__(
self,
df: pd.DataFrame,
*,
model: AutoModelForCausalLM,
tokenizer: AutoTokenizer,
batch_size: int,
max_length: int,
label_map: Optional[Dict[str, int]] = None,
device: torch.device = torch.device("cuda"),
shuffle: bool = True,
multi_target: bool = False,
) -> None:
if label_map:
df = df.copy()
df["label"] = df.label.map(label_map)
super().__init__(
df=df,
tokenizer=tokenizer,
batch_size=batch_size,
max_length=max_length,
device=device,
shuffle=shuffle,
multi_target=multi_target,
)
model.to(device)
self.model = model
@torch.no_grad()
def to_batch(self, rows: pd.DataFrame) -> PastBatch:
batch = super().to_batch(rows)
past_key_values = self.model(
input_ids=batch["input_ids"],
attention_mask=batch.get("attention_mask", None),
).past_key_values
return {
"past_key_values": past_key_values,
"attention_mask": batch["attention_mask"],
"labels": batch["labels"],
}
Then we have the training loop, modified from the previous posts. This version trains the prompt together with the linear classification head.
#collapse
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Tuple, Union
import torch
import numpy as np
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
LossFunction = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]
OptimizerFactory = Callable[[torch.nn.Parameter, torch.nn.Parameter], torch.optim.Optimizer]
@dataclass
class TrainedPrompt:
prompt: torch.Tensor
prompt_attention: torch.Tensor
head: torch.nn.Linear
@staticmethod
def make(
model: AutoModelForCausalLM,
tokenizer: AutoTokenizer,
prompt_tokens: int,
device: torch.device,
classes: int = 2
) -> TrainedPrompt:
prompt_indexes = torch.randint(
size=(prompt_tokens,),
low=0,
high=tokenizer.vocab_size,
device=device
)
prompt = torch.nn.Parameter(
model.transformer.wte(prompt_indexes).clone()[None, :, :]
)
attention = torch.ones(1, prompt.shape[1], device=device)
head = torch.nn.Linear(
in_features=model.config.n_embd,
out_features=classes,
).to(device)
return TrainedPrompt(
prompt=prompt,
prompt_attention=attention,
head=head
)
@staticmethod
def load(folder: Path) -> TrainedPrompt:
assert folder.exists()
prompt = torch.load(folder / "prompt.pt")
attention = torch.ones(1, prompt.shape[1], device=prompt.device)
return TrainedPrompt(
prompt=prompt,
prompt_attention=attention,
head=torch.load(folder / "head.pt"),
)
def save(self, folder: Path) -> None:
folder.mkdir(parents=True, exist_ok=True)
torch.save(self.prompt, folder / "prompt.pt")
torch.save(self.head, folder / "head.pt")
def optimizer(self, lr: float = 1e-3) -> torch.optim.Optimizer:
parameters = [self.prompt] + list(self.head.parameters())
return torch.optim.Adam(parameters, lr=lr)
def train(
*,
dl: PastDataloader,
model: AutoModelForCausalLM,
tokenizer: AutoTokenizer,
prompt_tokens: int,
epochs: int,
loss_fn: LossFunction,
classes: int = 2,
) -> TrainedPrompt:
"""Train the prompt"""
prompt = TrainedPrompt.make(
model=model,
tokenizer=tokenizer,
prompt_tokens=prompt_tokens,
device=dl.device,
classes=classes,
)
optimizer = prompt.optimizer()
total_loss = 0.0
current_loss = 0.0
bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}] - {postfix[0]:>8.4f}"
with tqdm(
range(epochs), leave=False, bar_format=bar_format, postfix=[total_loss]
) as bar:
for _epoch in bar:
with tqdm(
dl, leave=False, bar_format=bar_format, postfix=[current_loss]
) as epoch_bar:
for batch in epoch_bar:
current_loss = _process(
batch=batch,
model=model,
optimizer=optimizer,
prompt=prompt,
loss_fn=loss_fn,
)
total_loss += current_loss
epoch_bar.postfix[0] = current_loss
average_loss = total_loss / len(dl)
bar.postfix[0] = average_loss
print(f"Average loss: {average_loss:0.4f}")
total_loss = 0.0
return prompt
def _process(
*,
batch: Dict[str, Union[torch.Tensor, Past]],
model: AutoModelForCausalLM,
optimizer: torch.optim.Optimizer,
prompt: TrainedPrompt,
loss_fn: LossFunction,
) -> float:
optimizer.zero_grad()
logits = _get_output_with_past(
model=model,
prompt=prompt,
past=batch["past_key_values"],
past_attention_mask=batch["attention_mask"],
)
labels = batch["labels"]
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
return loss.item()
def _get_output_with_past(
*,
model: AutoModelForCausalLM,
prompt: TrainedPrompt,
past: Past,
past_attention_mask: torch.Tensor,
) -> torch.Tensor:
"""Get the predictions for the next token after the prompt"""
# concatenate the past attention with the prompt attention
batch_size = past_attention_mask.shape[0]
attention_mask = prompt.prompt_attention.repeat_interleave(batch_size, dim=0)
attention_mask = torch.cat([past_attention_mask, attention_mask], dim=-1)
# expand the prompt to match the batch size
input_ids = prompt.prompt.repeat_interleave(batch_size, dim=0)
state = model.transformer(
inputs_embeds=input_ids,
attention_mask=attention_mask,
past_key_values=past,
).last_hidden_state
return prompt.head(state[:, -1])
Here we have the evaluation code that can determine the accuracy of the trained prompt. The linear head makes this quite a lot simpler than before.
#collapse
from typing import List
from dataclasses import dataclass
from sklearn.metrics import classification_report
from tqdm.auto import tqdm
import numpy as np
@dataclass
class LabelledOutputs:
outputs: np.ndarray
labels: np.ndarray
predictions: np.ndarray
def generate_outputs(
dl: PastDataloader,
model: AutoModelForCausalLM,
prompt: TrainedPrompt,
multi_target: bool = False,
) -> LabelledOutputs:
raw_outputs = []
raw_predictions = []
for current_outputs, current_predictions in iterate_outputs(
dl=dl, model=model, prompt=prompt, multi_target=multi_target
):
raw_outputs.append(
current_outputs.cpu().numpy(),
)
raw_predictions.append(
current_predictions.cpu().numpy(),
)
outputs = np.concatenate(raw_outputs)
if dl.df.label.dtype.name == "object": # hack for multi label outputs
labels = (
np.concatenate(
dl.df.label
)
.reshape(len(dl.df), -1)
.astype(int)
)
else:
labels = dl.df.label.to_numpy()
predictions = np.concatenate(raw_predictions)
return LabelledOutputs(
outputs=outputs,
labels=labels,
predictions=predictions,
)
@torch.no_grad()
def iterate_outputs(
dl: PastDataloader,
model: AutoModelForCausalLM,
prompt: TrainedPrompt,
multi_target: bool,
) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
for batch in tqdm(dl):
output = _get_output_with_past(
model=model,
prompt=prompt,
past=batch["past_key_values"],
past_attention_mask=batch["attention_mask"],
)
if multi_target:
predicted_labels = (output > 0).long()
else:
predicted_labels = output.argmax(dim=-1)
yield output, predicted_labels
@torch.no_grad()
def accuracy(outputs: LabelledOutputs, target_names: List[str] = ["bad", "good"]) -> None:
print(classification_report(
y_true=outputs.labels,
y_pred=outputs.predictions,
target_names=target_names,
zero_division=0
))
Finally we are loading the model and the tokenizer. Once again we are using GPT2-small.
This will use cross entropy loss while training both the prompt and the linear layer.
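The cell that loads the model and defines the loss is collapsed; a minimal sketch of what it would contain, assuming the standard Hugging Face gpt2 checkpoint with the EOS token reused for padding (the sentiment_loss_fn name is mine):

# Sketch of the collapsed loading cell (assumes the standard "gpt2" checkpoint).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no padding token by default
model.eval()

# cross entropy over the two sentiment classes
sentiment_loss_fn = torch.nn.CrossEntropyLoss()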
Let’s train a model to classify the IMDB dataset.
BATCH_SIZE = 32
MAX_LENGTH = 1_000
sentiment_train_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=sentiment_train_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=True,
label_map={"bad": 0, "good": 1},
)
sentiment_validation_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=sentiment_validation_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=False,
label_map={"bad": 0, "good": 1},
)
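The training call itself is also collapsed; a sketch of how it would invoke the train function defined above. The prompt length is an assumption, and the losses below come from a 3 epoch run followed by a 10 epoch run.

# Sketch of the collapsed training call; prompt_tokens is an assumption.
sentiment_prompt = train(
    dl=sentiment_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=20,
    epochs=3,  # repeated afterwards with epochs=10
    loss_fn=sentiment_loss_fn,
    classes=2,
)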
Average loss: 0.3775
Average loss: 0.2392
Average loss: 0.2149
Average loss: 0.3440
Average loss: 0.2271
Average loss: 0.2124
Average loss: 0.2039
Average loss: 0.1993
Average loss: 0.1954
Average loss: 0.1922
Average loss: 0.1877
Average loss: 0.1905
Average loss: 0.1861
Now let’s train another model to classify the SemEval 2018 emotion dataset.
BATCH_SIZE = 32
MAX_LENGTH = 1_000
emotion_train_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=emotion_train_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=True,
multi_target=True,
)
emotion_validation_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=emotion_validation_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=False,
multi_target=True,
)
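Again the training call is collapsed; a sketch, reusing the weighted BCEWithLogitsLoss sketched earlier (emotion_loss_fn) and the same assumed prompt length.

# Sketch of the collapsed emotion training call.
emotion_prompt = train(
    dl=emotion_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=20,  # assumption, as before
    epochs=3,          # again followed by a 10 epoch run
    loss_fn=emotion_loss_fn,
    classes=len(EMOTION_LABELS),
)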
Average loss: 1.1200
Average loss: 0.8796
Average loss: 0.7809
Average loss: 1.0890
Average loss: 0.9252
Average loss: 0.8341
Average loss: 0.7857
Average loss: 0.7548
Average loss: 0.7386
Average loss: 0.7220
Average loss: 0.7129
Average loss: 0.7020
Average loss: 0.6950
Let’s see how well they perform.
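The evaluation cells are collapsed; they would look roughly like this, using the generate_outputs and accuracy helpers defined above (the sentiment_outputs name is mine). The two reports below are presumably for the 3 epoch and 10 epoch sentiment prompts.

# Sketch of the collapsed sentiment evaluation.
sentiment_outputs = generate_outputs(
    dl=sentiment_validation_dataloader,
    model=model,
    prompt=sentiment_prompt,
)
accuracy(sentiment_outputs)  # target names default to ["bad", "good"]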
precision recall f1-score support
bad 0.93 0.91 0.92 12500
good 0.92 0.93 0.92 12500
accuracy 0.92 25000
macro avg 0.92 0.92 0.92 25000
weighted avg 0.92 0.92 0.92 25000
precision recall f1-score support
bad 0.94 0.93 0.93 12500
good 0.93 0.94 0.93 12500
accuracy 0.93 25000
macro avg 0.93 0.93 0.93 25000
weighted avg 0.93 0.93 0.93 25000
So when trained for an equivalent number of epochs this consistently beats the “good” and “bad” tokens:
epochs | good / bad token accuracy | linear head accuracy |
---|---|---|
3 | 0.91 | 0.92 |
10 | 0.92 | 0.93 |
The difference isn’t great but it does show that the “good” and “bad” tokens are not optimal for this dataset. The state of the art for this dataset using a pretrained model is 0.97 accuracy.
In order to compare the trained model to the LSTM results that I found earlier I need to calculate the Pearson correlation between the labels and the predictions. Let’s start with the regular classification report first.
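As with sentiment, the emotion evaluation cells are collapsed; a sketch of the call behind the report below (emotion_outputs is my name):

# Sketch of the collapsed emotion evaluation (multi target, so thresholding at 0).
emotion_outputs = generate_outputs(
    dl=emotion_validation_dataloader,
    model=model,
    prompt=emotion_prompt,
    multi_target=True,
)
accuracy(emotion_outputs, target_names=EMOTION_LABELS)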
precision recall f1-score support
anger 0.69 0.83 0.76 315
anticipation 0.32 0.42 0.36 124
disgust 0.71 0.80 0.75 319
fear 0.48 0.91 0.63 121
joy 0.82 0.83 0.83 400
love 0.40 0.80 0.54 132
optimism 0.67 0.84 0.75 307
pessimism 0.21 0.81 0.33 100
sadness 0.53 0.78 0.63 265
surprise 0.08 0.94 0.15 35
trust 0.13 0.53 0.21 43
micro avg 0.49 0.80 0.60 2161
macro avg 0.46 0.77 0.54 2161
weighted avg 0.60 0.80 0.67 2161
samples avg 0.50 0.80 0.59 2161
This classification report suggests to me that the classifier works much better on the emotions that have more support. The lowest F1 score of the emotions with at least 300 support is 0.75 (disgust and optimism) while the highest F1 score of the other emotions is 0.63 (fear and sadness).
As a comparison to SpanEmo:
model | accuracy | micro-f1 | macro-f1 |
---|---|---|---|
Prompt Training (this) | — | 0.60 | 0.54 |
SpanEmo | 0.601 | 0.713 | 0.578 |
BERT+DK | 0.591 | 0.713 | 0.549 |
BERT-GCN | 0.589 | 0.707 | 0.563 |
Transformer | — | — | 0.561 |
So it lags quite a bit on the micro-f1 and not so much on the macro-f1. It is the worst performing out of these models. I don’t know how significant this is. It certainly isn’t something I would want to build a product around though.
I now need to calculate the Pearson correlation for the joy (F1 0.83), anger (0.76), sadness (0.63) and fear (0.63) emotions.
from scipy.stats import pearsonr
def calculate_pearson_correlation(outputs: LabelledOutputs) -> pd.DataFrame:
results = []
for emotion_name, reference_result in [
("joy", 0.62),
("anger", 0.5),
("sadness", 0.64),
("fear", 0.58)
]:
index = EMOTION_LABELS.index(emotion_name)
correlation, p_value = pearsonr(
outputs.predictions[:, index],
outputs.labels[:, index]
)
results.append({
"emotion": emotion_name,
"correlation": correlation,
"p_value": p_value,
"reference_result": reference_result
})
return pd.DataFrame(results)
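The table below presumably comes from calling this helper on the outputs of the shorter emotion run:

calculate_pearson_correlation(emotion_outputs)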
 | emotion | correlation | p_value | reference_result |
---|---|---|---|---|
0 | joy | 0.683822 | 3.851568e-123 | 0.62 |
1 | anger | 0.607447 | 1.818293e-90 | 0.50 |
2 | sadness | 0.446928 | 1.008134e-44 | 0.64 |
3 | fear | 0.591053 | 1.436841e-84 | 0.58 |
precision recall f1-score support
anger 0.73 0.77 0.75 315
anticipation 0.29 0.55 0.38 124
disgust 0.70 0.87 0.78 319
fear 0.40 0.91 0.55 121
joy 0.83 0.81 0.82 400
love 0.39 0.89 0.55 132
optimism 0.74 0.69 0.72 307
pessimism 0.25 0.62 0.36 100
sadness 0.59 0.74 0.65 265
surprise 0.13 0.83 0.22 35
trust 0.11 0.86 0.20 43
micro avg 0.50 0.78 0.61 2161
macro avg 0.47 0.78 0.54 2161
weighted avg 0.62 0.78 0.67 2161
samples avg 0.52 0.78 0.60 2161
 | emotion | correlation | p_value | reference_result |
---|---|---|---|---|
0 | joy | 0.669055 | 4.640033e-116 | 0.62 |
1 | anger | 0.607317 | 2.032116e-90 | 0.50 |
2 | sadness | 0.489182 | 1.714383e-54 | 0.64 |
3 | fear | 0.513182 | 1.109112e-60 | 0.58 |
It’s interesting that training the emotion classifier for longer has decreased the performance. While the micro and macro average stats of the classification report are nearly identical, out of the 4 comparison emotions only sadness experienced an improvement.
Overall these results seem positive. The prompt + linear head is capable of performance comparable to a fine tuned LSTM. The LSTM was even pretrained on the same domain (tweets) as the dataset, while GPT2-small was not.
I think that the accuracy of the individual classifiers has suffered because the prompt has been trying to distinguish all emotions at the same time. If a prompt per emotion were trained, would it perform better?
Now we can just concatenate the two linear heads together to create a composite classifier. Each output of a linear layer depends only on its own row of the weight matrix, so concatenating the heads does not alter their outputs. We do have to take care that the comparison is done over the correct indices.
To demonstrate that these statements are true let’s create a composite head and then run the two evaluations again.
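The shapes printed below are the weight and bias shapes of the two trained heads (sentiment first, then emotion); a sketch of the collapsed inspection cell:

# Sketch: weight and bias shapes of the sentiment and emotion heads.
print((sentiment_prompt.head.weight.shape, sentiment_prompt.head.bias.shape))
print((emotion_prompt.head.weight.shape, emotion_prompt.head.bias.shape))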
(torch.Size([2, 768]), torch.Size([2]))
(torch.Size([11, 768]), torch.Size([11]))
So you can see that the weight and bias shapes are compatible. They can just be concatenated across dimension 0 to produce the composite head.
@torch.no_grad()
def make_composite_head(*heads: torch.nn.Linear) -> torch.nn.Linear:
    in_features = heads[0].weight.shape[1]
    out_features = sum(head.weight.shape[0] for head in heads)
    composite_head = torch.nn.Linear(in_features=in_features, out_features=out_features)
composite_head.weight.data = torch.cat([
head.weight.data
for head in heads
], dim=0)
composite_head.bias.data = torch.cat([
head.bias.data
for head in heads
], dim=0)
return composite_head
To get this working with the existing evaluation code I want to wrap this in a lambda that restricts the output to the specified indices. This is the easiest way to get outputs comparable to the original: the outputs will still originate from the composite head.
Now I can create a mock trained prompt object to wrap all this up.
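The wrapping cell is collapsed; a sketch of what it might contain, given the TrainedPrompt dataclass above. The slicing assumes sentiment occupies the first two outputs and the emotions the remaining eleven; all of the names here are mine.

# Sketch of the collapsed wrapping: slice the composite output back down to the
# task specific indices so the existing evaluation code can be reused unchanged.
composite_head = make_composite_head(sentiment_prompt.head, emotion_prompt.head)

def restrict(start: int, end: int) -> Callable[[torch.Tensor], torch.Tensor]:
    return lambda hidden: composite_head(hidden)[:, start:end]

composite_sentiment_prompt = TrainedPrompt(
    prompt=sentiment_prompt.prompt,
    prompt_attention=sentiment_prompt.prompt_attention,
    head=restrict(0, 2),   # sentiment: first two outputs
)
composite_emotion_prompt = TrainedPrompt(
    prompt=emotion_prompt.prompt,
    prompt_attention=emotion_prompt.prompt_attention,
    head=restrict(2, 13),  # emotion: remaining eleven outputs
)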
precision recall f1-score support
bad 0.93 0.91 0.92 12500
good 0.92 0.93 0.92 12500
accuracy 0.92 25000
macro avg 0.92 0.92 0.92 25000
weighted avg 0.92 0.92 0.92 25000
precision recall f1-score support
anger 0.69 0.83 0.76 315
anticipation 0.32 0.42 0.36 124
disgust 0.71 0.80 0.75 319
fear 0.48 0.91 0.63 121
joy 0.82 0.83 0.83 400
love 0.40 0.80 0.54 132
optimism 0.67 0.84 0.75 307
pessimism 0.21 0.81 0.33 100
sadness 0.53 0.78 0.63 265
surprise 0.08 0.94 0.15 35
trust 0.13 0.53 0.21 43
micro avg 0.49 0.80 0.60 2161
macro avg 0.46 0.77 0.54 2161
weighted avg 0.60 0.80 0.67 2161
samples avg 0.50 0.80 0.59 2161
So, remembering that these are the prompts from the shorter training runs, you should be able to see that the results are exactly the same.
Something to evaluate quickly is a spot check of the emotional content of the IMDB reviews. This should give an idea of the degree to which a single text can be classified in multiple ways.
This does involve domain switching so some loss of performance is expected. It’s not possible to quantify the performance loss as there is no ground truth for this task. That is why a comprehensive evaluation is not being done.
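The cells behind the two correlation tables below are collapsed, so this is a guess at what they do: one plausible version runs the emotion prompt over the IMDB validation set and correlates each emotion’s prediction with the binary sentiment label (all names here are mine).

# Rough sketch only; exactly what the tables below compare is not shown in the post.
from scipy.stats import pearsonr

imdb_emotion_outputs = generate_outputs(
    dl=sentiment_validation_dataloader,
    model=model,
    prompt=composite_emotion_prompt,
    multi_target=True,
)
rows = []
for index, name in enumerate(EMOTION_LABELS):
    correlation, _ = pearsonr(
        imdb_emotion_outputs.predictions[:, index],
        imdb_emotion_outputs.labels,
    )
    rows.append({"label": name, "correlation": correlation})
pd.DataFrame(rows).sort_values(by="correlation", ascending=False).reset_index(drop=True)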
 | label | correlation |
---|---|---|
0 | optimism | 0.531295 |
1 | love | 0.512116 |
2 | trust | 0.480094 |
3 | joy | 0.454155 |
4 | anticipation | 0.266926 |
5 | surprise | 0.156752 |
6 | fear | -0.125623 |
7 | pessimism | -0.309884 |
8 | sadness | -0.320460 |
9 | anger | -0.463072 |
10 | disgust | -0.551651 |
This seems to have a reasonably clear split. The top 4 emotions are positively correlated with positive sentiment, and the bottom 4 emotions are negatively correlated. That feels right.
This should be something that is possible to reason about. The crosstab of emotion to sentiment should show that negative emotions are correlated with negative sentiment. It is possible to write something that is positive and sad, so I do not expect the crosstab to be a 100% split.
 | label | correlation |
---|---|---|
0 | joy | 0.490373 |
1 | optimism | 0.442756 |
2 | love | 0.296635 |
3 | trust | 0.134409 |
4 | anticipation | 0.110726 |
5 | surprise | 0.012533 |
6 | fear | -0.108274 |
7 | pessimism | -0.172231 |
8 | sadness | -0.283275 |
9 | anger | -0.441441 |
10 | disgust | -0.459849 |
Once again the split seems reasonable. While the top 4 and bottom 4 emotions are the same as before, this time only the top 2 and bottom 2 stand out as well correlated. It’s interesting how the sentiment classifier appears to be the simpler classifier. This may be down to the information available during training.
These results don’t feel wildly surprising. It’s almost like the emotion classifier is a refinement of the sentiment classifier.
I was going to try to combine these two tasks in a single classification head and prompt. The training would be a little involved as there is not a unified dataset for this task - and normally there would not be.
So the training data would be mixed with a 50/50 split. This would balance the two tasks so that the prompt can learn to do both of them. The loss over the linear head would be directed to the task specific indices by selecting them first and then computing the task loss, so that the weights for the unrelated task are not affected (a rough sketch of this idea follows).
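This is not implemented in this post; the sketch below just illustrates the routing described above, assuming the composite layout used earlier (sentiment in the first two outputs, emotions in the rest).

# Rough sketch of the described idea: route each batch's loss to its own
# task's slice of the composite logits so the other task's weights get no gradient.
def multi_task_loss(logits: torch.Tensor, labels: torch.Tensor, task: str) -> torch.Tensor:
    if task == "sentiment":
        # first two outputs, single label classification
        return torch.nn.functional.cross_entropy(logits[:, :2], labels)
    # remaining eleven outputs, multi label classification
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits[:, 2:], labels.float()
    )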
However it occurs to me that the emotion dataset is already an example of multi task training. So we can estimate the cost of doing multiple tasks at once by training a classifier for a single emotion. Let’s choose the best performing emotion, joy:
emotion | precision | recall | f1 | support |
---|---|---|---|---|
joy | 0.82 | 0.83 | 0.83 | 400 |
#collapse
joy_train_df = (
emotion_train_df[["text", "joy"]]
.rename(columns={"joy": "label"})
.copy()
)
joy_validation_df = (
emotion_validation_df[["text", "joy"]]
.rename(columns={"joy": "label"})
.copy()
)
joy_train_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=joy_train_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=True,
)
joy_validation_dataloader = PastDataloader(
model=model,
tokenizer=tokenizer,
df=joy_validation_df,
batch_size=BATCH_SIZE,
max_length=MAX_LENGTH,
shuffle=False,
)
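The weighting and training cells are collapsed. The tensor printed below matches 0.5 divided by the class frequency for the two joy classes, so that is what this sketch assumes; joy_weight, joy_loss_fn and joy_prompt are my names.

# Sketch of the collapsed cells: class weights as 0.5 / class frequency
# (this reproduces the tensor printed below), then the usual training call.
joy_frequency = joy_train_df.label.mean()
joy_weight = torch.tensor(
    [0.5 / (1 - joy_frequency), 0.5 / joy_frequency], device="cuda"
)
joy_loss_fn = torch.nn.CrossEntropyLoss(weight=joy_weight)

joy_prompt = train(
    dl=joy_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=20,  # assumption, as before
    epochs=3,          # again followed by a 10 epoch run
    loss_fn=joy_loss_fn,
    classes=2,
)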
tensor([0.7840, 1.3803], device='cuda:0')
Average loss: 0.5597
Average loss: 0.4182
Average loss: 0.3923
Average loss: 0.5273
Average loss: 0.4281
Average loss: 0.3973
Average loss: 0.3725
Average loss: 0.3587
Average loss: 0.3496
Average loss: 0.3465
Average loss: 0.3404
Average loss: 0.3365
Average loss: 0.3214
precision recall f1-score support
not-joy 0.84 0.84 0.84 486
joy 0.81 0.81 0.81 400
accuracy 0.83 886
macro avg 0.82 0.82 0.82 886
weighted avg 0.83 0.83 0.83 886
precision recall f1-score support
not-joy 0.81 0.92 0.86 486
joy 0.89 0.74 0.80 400
accuracy 0.84 886
macro avg 0.85 0.83 0.83 886
weighted avg 0.84 0.84 0.84 886
This is quite interesting, as the classifier actually performs worse when turned into a single task classifier. I think this deserves more investigation, although this post is already quite long.