Prompt Training - Linear Head

Train a linear classifier per prompt
Published

May 23, 2021

I’ve been using the prompt training technique to refine a language model into a classifier by training a very small set of parameters. This has been going well so far for tasks where I can easily select target tokens (like sentiment classification - the tokens good and bad work very well).
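
As a reminder, the technique from the earlier posts classifies by comparing the logits that the model assigns to the target tokens at the position after the text (and trained prompt). A minimal standalone sketch of that token comparison, ignoring the trained prompt for brevity (the classify helper here is illustrative, not the code from those posts):

Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# token ids for the two target tokens (the leading space matters for the GPT2 BPE)
target_ids = [
    tokenizer(" bad")["input_ids"][0],
    tokenizer(" good")["input_ids"][0],
]

@torch.no_grad()
def classify(text: str) -> str:
    tokens = tokenizer(text, return_tensors="pt")
    # next-token logits after the text; the larger target logit wins
    logits = model(**tokens).logits[0, -1]
    return ["bad", "good"][logits[target_ids].argmax().item()]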

When there is a more distant relationship between the available tokens and the task, the classifier performs poorly. If I wanted a classifier that determined whether a piece of text was written by an author in the northern or southern hemisphere, then using target tokens like relevant and irrelevant would not perform well. So part of the problem is the appropriate selection of the tokens to compare.

I have been trying to generalize this approach by training the selection of the target tokens as well, using the idea of a centroid over the model output. Each class in the classifier has its own centroid and the closest centroid to a given output is the classification. This was very tricky to train well and I finally got reasonable results using cross entropy loss (even those results showed a significant drop in accuracy compared to the good and bad tokens).
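
A minimal sketch of the centroid idea, assuming the centroids live in the same space as the model output and using the negative distance as the logit for cross entropy loss (the CentroidClassifier class and its dimensions are illustrative):

Code
import torch

class CentroidClassifier(torch.nn.Module):
    """Classifies a model output by its nearest trainable centroid."""

    def __init__(self, classes: int, hidden_size: int = 768) -> None:
        super().__init__()
        # one trainable centroid per class in the model output space
        self.centroids = torch.nn.Parameter(torch.randn(classes, hidden_size))

    def forward(self, output: torch.Tensor) -> torch.Tensor:
        # negative euclidean distance, so the closest centroid has the
        # highest score and the result can be fed to cross entropy loss
        return -torch.cdist(output, self.centroids)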

I think that centroid training is a poor proxy for actually training a new linear classification head for the language model, so I am now going to train a new classification head instead. I still want to be able to perform multiple tasks in a single batch (a key benefit that prompt training unlocks), so after this I am going to investigate training multiple different classifiers and concatenating them, so that they all run for each entry in the batch and the task specific output can be selected. Finally, an evaluation of multi task training for the prompt can be performed: can a single prompt classify the text according to multiple criteria simultaneously?


Datasets

The two tasks that I am going to evaluate are sentiment analysis and emotion classification. I already have the IMDB dataset for sentiment, so I just need to find a dataset for emotion.

Let’s start with a quick review of the sentiment dataset. It consists of 25,000 IMDB movie reviews which are labelled as either positive or negative based on the associated score, so this is a binary classification problem.

Code
import pandas as pd

sentiment_train_df = pd.read_parquet("/data/sentiment/imdb-movie-reviews/train.gz.parquet")
sentiment_validation_df = pd.read_parquet("/data/sentiment/imdb-movie-reviews/validation.gz.parquet")
Code
sentiment_train_df
label text
0 good Bromwell High is a cartoon comedy. It ran at t...
1 good Homelessness (or Houselessness as George Carli...
2 good Brilliant over-acting by Lesley Ann Warren. Be...
3 good This is easily the most underrated film inn th...
4 good This is not the typical Mel Brooks film. It wa...
... ... ...
24995 bad Towards the end of the movie, I felt it was to...
24996 bad This is the kind of movie that my enemies cont...
24997 bad I saw 'Descent' last night at the Stockholm Fi...
24998 bad Some films that you pick up for a pound turn o...
24999 bad This is one of the dumbest films, I've ever se...

25000 rows × 2 columns

Code
sentiment_train_df.label.value_counts()
bad     12500
good    12500
Name: label, dtype: int64

For the emotion dataset I have to find and preprocess it first. One dataset that I have found comes from the SemEval 2018 competition here (search for E-c) (Mohammad et al. 2018).

Mohammad, Saif M., Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. “SemEval-2018 Task 1: Affect in Tweets.” In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018). New Orleans, LA, USA.

The two files I am using are the English training and development set.

Code
import pandas as pd

emotion_train_df = (
    pd.read_csv("/data/emotion/sem-eval-2018/train.zip", delimiter="\t")
        .drop(columns="ID")
        .rename(columns={"Tweet": "text"})
)

emotion_validation_df = (
    pd.read_csv("/data/emotion/sem-eval-2018/dev.zip", delimiter="\t")
        .drop(columns="ID")
        .rename(columns={"Tweet": "text"})
)
Code
emotion_train_df
text anger anticipation disgust fear joy love optimism pessimism sadness surprise trust
0 “Worry is a down payment on a problem you may ... 0 1 0 0 0 0 1 0 0 0 1
1 Whatever you decide to do make sure it makes y... 0 0 0 0 1 1 1 0 0 0 0
2 @Max_Kellerman it also helps that the majorit... 1 0 1 0 1 0 1 0 0 0 0
3 Accept the challenges so that you can literall... 0 0 0 0 1 0 1 0 0 0 0
4 My roommate: it's okay that we can't spell bec... 1 0 1 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
6833 @nicky57672 Hi! We are working towards your hi... 0 0 0 0 0 0 0 0 0 0 0
6834 @andreamitchell said @berniesanders not only d... 0 1 0 0 0 0 0 0 0 1 0
6835 @isthataspider @dhodgs i will fight this guy! ... 1 0 1 0 0 0 0 1 0 0 0
6836 i wonder how a guy can broke his penis while h... 0 0 0 0 0 0 0 0 0 1 0
6837 I'm highly animated even though I'm decomposing. 0 0 0 0 0 0 0 1 0 0 0

6838 rows × 12 columns

I need to convert the separate emotion columns into a single multi-target label column.

Code
EMOTION_LABELS = [
    "anger",
    "anticipation",
    "disgust",
    "fear",
    "joy",
    "love",
    "optimism",
    "pessimism",
    "sadness",
    "surprise",
    "trust"
]

emotion_train_df["label"] = emotion_train_df.apply(
    lambda row: row[EMOTION_LABELS].to_numpy(),
    axis=1
)

emotion_validation_df["label"] = emotion_validation_df.apply(
    lambda row: row[EMOTION_LABELS].to_numpy(),
    axis=1
)
Code
emotion_train_df.to_parquet("/data/emotion/sem-eval-2018/train.gz.parquet", compression="gzip")
emotion_validation_df.to_parquet("/data/emotion/sem-eval-2018/test.gz.parquet", compression="gzip")
Code
(
    emotion_train_df[EMOTION_LABELS].sum() / len(emotion_train_df)
)
disgust         0.380521
anger           0.372039
joy             0.362240
sadness         0.293653
optimism        0.290143
fear            0.181632
anticipation    0.143024
pessimism       0.116262
love            0.102369
surprise        0.052793
trust           0.052208
dtype: float64

I need to come up with a weighting parameter to address the unbalanced labels. The loss function I am going to use is BCEWithLogitsLoss, which can accommodate unbalanced labels through its pos_weight argument. It wouldn’t be possible to rebalance the dataset itself as the labels are intertwined, so balancing one would unbalance the others.

Code
# the pos_weight for BCEWithLogitsLoss is defined as negative_examples / positive_examples
# it needs to be in the same order as the EMOTION_LABELS array, as the targets are indexed by position
emotion_weights = (emotion_train_df[EMOTION_LABELS] == 0).sum() / emotion_train_df[EMOTION_LABELS].sum()

emotion_weights
anger            1.687893
anticipation     5.991820
disgust          1.627978
fear             4.505636
joy              1.760597
love             8.768571
optimism         2.446573
pessimism        7.601258
sadness          2.405378
surprise        17.941828
trust           18.154062
dtype: float64

So this is a binary classification problem when considering each individual emotion. The dataset is smaller and the individual emotions are unbalanced, so it will be slightly harder. The only evaluation with published results I can quickly find is this one, which uses Pearson correlation as an evaluation metric:

Method Joy Anger Sadness Fear Valence
Bidirectional LSTM 0.49 0.35 0.47 0.49 0.32
Bidirectional LSTM + Lexicon Features 0.54 0.43 0.47 0.55 0.51
Bidirectional LSTM with pretraining 0.62 0.48 0.63 0.58 0.68
Bidirectional LSTM with pretraining + Lexicon Features 0.6 0.5 0.64 0.55 0.71

Valence, or hedonic tone, is the affective quality referring to the intrinsic attractiveness/“good”-ness or averseness/“bad”-ness of an event, object, or situation. The term also characterizes and categorizes specific emotions. For example, emotions popularly referred to as “negative”, such as anger and fear, have negative valence. Valence - wikipedia

The valence label comes from a dataset that I have not downloaded so I will not be training or evaluating based on that.

I asked a work colleague about the results of this competition and they were able to find results much more easily than I could. The official competition results appear to be here. The top result is significantly better than what I found:

User macro-avg anger fear joy sadness
venkatesh-1729 0.799 (1) 0.827 (1) 0.779 (1) 0.792 (1) 0.798 (1)

It’s odd that this only has 4 emotions though. I wonder if this is a different task (SemEval is made up of several tasks).

There is also this Papers with Code page which reports accuracy and F1 instead of per-emotion scores:

model accuracy micro-f1 macro-f1
SpanEmo 0.601 0.713 0.578
BERT+DK 0.591 0.713 0.549
BERT-GCN 0.589 0.707 0.563
Transformer 0.561

I had a quick look at SpanEmo {% cite alhuzali-ananiadou-2021-spanemo %} and it looks like it is BERT feeding into a downstream network that then performs token classification? I would have to read the paper to get a proper idea of the technique.

Anyway, this is an exact match for the task and dataset that I am using, so aiming for these stats seems reasonable.


Code, Dataloader, Training Loop

A lot of this is copied from the previous notebooks and adjusted to fit the new classification layer.

As before we have a dataloader. This should work with both datasets.

Code
#collapse

from typing import Dict, Iterator, Optional, Tuple, Union

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

Past = Tuple[Tuple[torch.Tensor, ...], ...]
TextBatch = Dict[str, torch.Tensor]
PastBatch = Dict[str, Union[torch.Tensor, Past]]


class TextDataloader:
    """Provides a dataloader over a text dataframe"""

    def __init__(
        self,
        df: pd.DataFrame,
        *,
        tokenizer: AutoTokenizer,
        batch_size: int,
        max_length: int,
        device: torch.device = torch.device("cuda"),
        shuffle: bool = True,
        multi_target: bool = False,
    ) -> None:
        self.tokenizer = tokenizer
        self.df = df
        self.batch_size = batch_size
        self.max_length = max_length
        self.device = device
        self.shuffle = shuffle
        self.label_dtype = torch.float if multi_target else torch.long

    def __iter__(self) -> Iterator[TextBatch]:
        """Returns an iterator that returns batches.
        The final batch can be a partial batch."""
        if self.shuffle:
            df = self.df.sample(frac=1).reset_index(drop=True)
        else:
            df = self.df
        batch_size = self.batch_size

        for i in range(len(self)):
            start = i * batch_size
            end = start + batch_size
            yield self.to_batch(df[start:end])

    def __len__(self) -> int:
        """Returns the total number of batches that can be returned."""
        full_batches = len(self.df) // self.batch_size
        if len(self.df) % self.batch_size:
            return full_batches + 1
        return full_batches

    def to_batch(self, rows: pd.DataFrame) -> TextBatch:
        """Converts the rows into a batch"""
        tokens = self.tokenizer(
            rows.text.tolist(),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=self.max_length,
        ).to(self.device)
        labels = torch.tensor(rows.label.tolist(), dtype=self.label_dtype, device=self.device)
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "labels": labels,
        }


class PastDataloader(TextDataloader):  # pylint: disable=too-few-public-methods
    """Provides a dataloader which converts the text into past tensors"""

    def __init__(
        self,
        df: pd.DataFrame,
        *,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        batch_size: int,
        max_length: int,
        label_map: Optional[Dict[str, int]] = None,
        device: torch.device = torch.device("cuda"),
        shuffle: bool = True,
        multi_target: bool = False,
    ) -> None:
        if label_map:
            df = df.copy()
            df["label"] = df.label.map(label_map)
        super().__init__(
            df=df,
            tokenizer=tokenizer,
            batch_size=batch_size,
            max_length=max_length,
            device=device,
            shuffle=shuffle,
            multi_target=multi_target,
        )
        model.to(device)
        self.model = model

    @torch.no_grad()
    def to_batch(self, rows: pd.DataFrame) -> PastBatch:
        batch = super().to_batch(rows)
        past_key_values = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch.get("attention_mask", None),
        ).past_key_values
        return {
            "past_key_values": past_key_values,
            "attention_mask": batch["attention_mask"],
            "labels": batch["labels"],
        }

Then we have the modified training loop from the previous posts. This one trains the prompt together with the new linear classification head.

Code
#collapse

from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Tuple, Union

import torch
import numpy as np
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

LossFunction = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]
OptimizerFactory = Callable[[torch.nn.Parameter, torch.nn.Parameter], torch.optim.Optimizer]


@dataclass
class TrainedPrompt:
    prompt: torch.Tensor
    prompt_attention: torch.Tensor
    head: torch.nn.Linear

    @staticmethod
    def make(
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        prompt_tokens: int,
        device: torch.device,
        classes: int = 2
    ) -> TrainedPrompt:
        prompt_indexes = torch.randint(
            size=(prompt_tokens,),
            low=0,
            high=tokenizer.vocab_size,
            device=device
        )
        prompt = torch.nn.Parameter(
            model.transformer.wte(prompt_indexes).clone()[None, :, :]
        )
        attention = torch.ones(1, prompt.shape[1], device=device)
        head = torch.nn.Linear(
            in_features=model.config.n_embd,
            out_features=classes,
        ).to(device)
        return TrainedPrompt(
            prompt=prompt,
            prompt_attention=attention,
            head=head
        )

    @staticmethod
    def load(folder: Path) -> TrainedPrompt:
        assert folder.exists()
        prompt = torch.load(folder / "prompt.pt")
        attention = torch.ones(1, prompt.shape[1], device=prompt.device)
        return TrainedPrompt(
            prompt=prompt,
            prompt_attention=attention,
            head=torch.load(folder / "head.pt"),
        )

    def save(self, folder: Path) -> None:
        folder.mkdir(parents=True, exist_ok=True)
        torch.save(self.prompt, folder / "prompt.pt")
        torch.save(self.head, folder / "head.pt")

    def optimizer(self, lr: float = 1e-3) -> torch.optim.Optimizer:
        parameters = [self.prompt] + list(self.head.parameters())
        return torch.optim.Adam(parameters, lr=lr)


def train(
    *,
    dl: PastDataloader,
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    prompt_tokens: int,
    epochs: int,
    loss_fn: LossFunction,
    classes: int = 2,
) -> TrainedPrompt:
    """Train the prompt"""
    prompt = TrainedPrompt.make(
        model=model,
        tokenizer=tokenizer,
        prompt_tokens=prompt_tokens,
        device=dl.device,
        classes=classes,
    )
    optimizer = prompt.optimizer()

    total_loss = 0.0
    current_loss = 0.0
    bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}] - {postfix[0]:>8.4f}"

    with tqdm(
        range(epochs), leave=False, bar_format=bar_format, postfix=[total_loss]
    ) as bar:
        for _epoch in bar:
            with tqdm(
                dl, leave=False, bar_format=bar_format, postfix=[current_loss]
            ) as epoch_bar:
                for batch in epoch_bar:
                    current_loss = _process(
                        batch=batch,
                        model=model,
                        optimizer=optimizer,
                        prompt=prompt,
                        loss_fn=loss_fn,
                    )
                    total_loss += current_loss
                    epoch_bar.postfix[0] = current_loss

            average_loss = total_loss / len(dl)
            bar.postfix[0] = average_loss
            print(f"Average loss: {average_loss:0.4f}")
            total_loss = 0.0

    return prompt

def _process(
    *,
    batch: Dict[str, Union[torch.Tensor, Past]],
    model: AutoModelForCausalLM,
    optimizer: torch.optim.Optimizer,
    prompt: TrainedPrompt,
    loss_fn: LossFunction,
) -> float:
    optimizer.zero_grad()

    logits = _get_output_with_past(
        model=model,
        prompt=prompt,
        past=batch["past_key_values"],
        past_attention_mask=batch["attention_mask"],
    )
    labels = batch["labels"]
    loss = loss_fn(logits, labels)

    loss.backward()
    optimizer.step()

    return loss.item()


def _get_output_with_past(
    *,
    model: AutoModelForCausalLM,
    prompt: TrainedPrompt,
    past: Past,
    past_attention_mask: torch.Tensor,
) -> torch.Tensor:
    """Get the predictions for the next token after the prompt"""
    # concatenate the past attention with the prompt attention
    batch_size = past_attention_mask.shape[0]
    attention_mask = prompt.prompt_attention.repeat_interleave(batch_size, dim=0)
    attention_mask = torch.cat([past_attention_mask, attention_mask], dim=-1)

    # expand the prompt to match the batch size
    input_ids = prompt.prompt.repeat_interleave(batch_size, dim=0)

    state = model.transformer(
        inputs_embeds=input_ids,
        attention_mask=attention_mask,
        past_key_values=past,
    ).last_hidden_state
    return prompt.head(state[:, -1])

Here we have the evaluation code that can determine the accuracy of the trained prompt. The linear head makes this quite a lot simpler than before.

Code
#collapse

from typing import List
from dataclasses import dataclass
from sklearn.metrics import classification_report
from tqdm.auto import tqdm
import numpy as np

@dataclass
class LabelledOutputs:
    outputs: np.ndarray
    labels: np.ndarray
    predictions: np.ndarray

def generate_outputs(
    dl: PastDataloader,
    model: AutoModelForCausalLM,
    prompt: TrainedPrompt,
    multi_target: bool = False,
) -> LabelledOutputs:
    raw_outputs = []
    raw_predictions = []
    for current_outputs, current_predictions in iterate_outputs(
        dl=dl, model=model, prompt=prompt, multi_target=multi_target
    ):
        raw_outputs.append(
            current_outputs.cpu().numpy(),
        )
        raw_predictions.append(
            current_predictions.cpu().numpy(),
        )

    outputs = np.concatenate(raw_outputs)
    if dl.df.label.dtype.name == "object": # hack for multi label outputs
        labels = (
            np.concatenate(
                dl.df.label
            )
                .reshape(len(dl.df), -1)
                .astype(int)
        )
    else:
        labels = dl.df.label.to_numpy()
    predictions = np.concatenate(raw_predictions)

    return LabelledOutputs(
        outputs=outputs,
        labels=labels,
        predictions=predictions,
    )

@torch.no_grad()
def iterate_outputs(
    dl: PastDataloader,
    model: AutoModelForCausalLM,
    prompt: TrainedPrompt,
    multi_target: bool,
) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
    for batch in tqdm(dl):
        output = _get_output_with_past(
            model=model,
            prompt=prompt,
            past=batch["past_key_values"],
            past_attention_mask=batch["attention_mask"],
        )
        if multi_target:
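            # a logit above zero corresponds to a sigmoid probability above 0.5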
            predicted_labels = (output > 0).long()
        else:
            predicted_labels = output.argmax(dim=-1)
        yield output, predicted_labels

@torch.no_grad()
def accuracy(outputs: LabelledOutputs, target_names: List[str] = ["bad", "good"]) -> None:
    print(classification_report(
        y_true=outputs.labels,
        y_pred=outputs.predictions,
        target_names=target_names,
        zero_division=0
    ))

Finally we are loading the model and the tokenizer. Once again we are using GPT2-small.

Code
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.to("cuda")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # needed to enable padding

Linear Head Training

This will use cross entropy loss while training both the prompt and the linear layer.

Sentiment Training

Let’s train a model to classify the IMDB dataset.

Code
BATCH_SIZE = 32
MAX_LENGTH = 1_000

sentiment_train_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=sentiment_train_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=True,
    label_map={"bad": 0, "good": 1},
)
sentiment_validation_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=sentiment_validation_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=False,
    label_map={"bad": 0, "good": 1},
)
Code
sentiment_trained_prompt = train(
    dl=sentiment_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=3,
    loss_fn=torch.nn.functional.cross_entropy,
)
Average loss: 0.3775
Average loss: 0.2392
Average loss: 0.2149
Code
sentiment_trained_prompt.save(Path("/data/blog/2021-05-23-prompt-training-linear-head/sentiment-linear-head"))
Code
sentiment_long_trained_prompt = train(
    dl=sentiment_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=10,
    loss_fn=torch.nn.functional.cross_entropy,
)
Average loss: 0.3440
Average loss: 0.2271
Average loss: 0.2124
Average loss: 0.2039
Average loss: 0.1993
Average loss: 0.1954
Average loss: 0.1922
Average loss: 0.1877
Average loss: 0.1905
Average loss: 0.1861
Code
sentiment_long_trained_prompt.save(Path("/data/blog/2021-05-23-prompt-training-linear-head/sentiment-linear-head-long"))

Emotion Training

Now let’s train another model to classify the SemEval 2018 emotion dataset.

Code
BATCH_SIZE = 32
MAX_LENGTH = 1_000

emotion_train_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=emotion_train_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=True,
    multi_target=True,
)
emotion_validation_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=emotion_validation_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=False,
    multi_target=True,
)
Code
emotion_trained_prompt = train(
    dl=emotion_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=3,
    loss_fn=torch.nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor(emotion_weights, device="cuda")
    ),
    classes=len(EMOTION_LABELS),
)
Average loss: 1.1200
Average loss: 0.8796
Average loss: 0.7809
Code
emotion_long_trained_prompt = train(
    dl=emotion_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=10,
    loss_fn=torch.nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor(emotion_weights, device="cuda")
    ),
    classes=len(EMOTION_LABELS),
)
Average loss: 1.0890
Average loss: 0.9252
Average loss: 0.8341
Average loss: 0.7857
Average loss: 0.7548
Average loss: 0.7386
Average loss: 0.7220
Average loss: 0.7129
Average loss: 0.7020
Average loss: 0.6950

Linear Head Evaluation

Let’s see how well they perform.

Sentiment Evaluation

Code
sentiment_outputs = generate_outputs(
    dl=sentiment_validation_dataloader,
    model=model,
    prompt=sentiment_trained_prompt
)
Code
accuracy(sentiment_outputs)
              precision    recall  f1-score   support

         bad       0.93      0.91      0.92     12500
        good       0.92      0.93      0.92     12500

    accuracy                           0.92     25000
   macro avg       0.92      0.92      0.92     25000
weighted avg       0.92      0.92      0.92     25000
Code
sentiment_long_outputs = generate_outputs(
    dl=sentiment_validation_dataloader,
    model=model,
    prompt=sentiment_long_trained_prompt
)
Code
accuracy(sentiment_long_outputs)
              precision    recall  f1-score   support

         bad       0.94      0.93      0.93     12500
        good       0.93      0.94      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000

So when trained for an equivalent number of epochs this consistently beats the “good” and “bad” tokens:

epochs good / bad token accuracy linear head accuracy
3 0.91 0.92
10 0.92 0.93

The difference isn’t large, but it does show that the “good” and “bad” tokens are not optimal for this dataset. For context, the state of the art for this dataset using a pretrained model is 0.97 accuracy.

Emotion Evaluation

In order to compare the trained model to the LSTM results that I found earlier I need to calculate the Pearson correlation of the labels against the predictions. Let’s start with the regular classification report first.

Code
emotion_outputs = generate_outputs(
    dl=emotion_validation_dataloader,
    model=model,
    prompt=emotion_trained_prompt,
    multi_target=True,
)
Code
accuracy(emotion_outputs, target_names=EMOTION_LABELS)
              precision    recall  f1-score   support

       anger       0.69      0.83      0.76       315
anticipation       0.32      0.42      0.36       124
     disgust       0.71      0.80      0.75       319
        fear       0.48      0.91      0.63       121
         joy       0.82      0.83      0.83       400
        love       0.40      0.80      0.54       132
    optimism       0.67      0.84      0.75       307
   pessimism       0.21      0.81      0.33       100
     sadness       0.53      0.78      0.63       265
    surprise       0.08      0.94      0.15        35
       trust       0.13      0.53      0.21        43

   micro avg       0.49      0.80      0.60      2161
   macro avg       0.46      0.77      0.54      2161
weighted avg       0.60      0.80      0.67      2161
 samples avg       0.50      0.80      0.59      2161

This classification report suggests to me that the classifier works much better on the emotions that have more support. The lowest F1 score of the emotions with at least 300 support is 0.75 (disgust and optimism) while the highest F1 score of the other emotions is 0.63 (fear and sadness).

As a comparison to SpanEmo:

model accuracy micro-f1 macro-f1
Prompt Training (this) - 0.60 0.54
SpanEmo 0.601 0.713 0.578
BERT+DK 0.591 0.713 0.549
BERT-GCN 0.589 0.707 0.563
Transformer 0.561

So it lags quite a bit on the micro-f1 and not so much on the macro-f1, making it the worst performing of these models. I don’t know how significant that gap is, but it certainly isn’t something I would want to build a product around.

I now need to calculate the Pearson correlation for the four comparison emotions: joy (F1 0.83), anger (0.76), sadness (0.63) and fear (0.63).

Code
from scipy.stats import pearsonr

def calculate_pearson_correlation(outputs: LabelledOutputs) -> pd.DataFrame:
    results = []
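    # the reference results appear to be the best per-emotion Pearson
    # correlations from the bidirectional LSTM table earlier in this post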
    for emotion_name, reference_result in [
        ("joy", 0.62),
        ("anger", 0.5),
        ("sadness", 0.64),
        ("fear", 0.58)
    ]:
        index = EMOTION_LABELS.index(emotion_name)
        correlation, p_value = pearsonr(
            outputs.predictions[:, index],
            outputs.labels[:, index]
        )
        results.append({
            "emotion": emotion_name,
            "correlation": correlation,
            "p_value": p_value,
            "reference_result": reference_result
        })
    return pd.DataFrame(results)
Code
calculate_pearson_correlation(emotion_outputs)
emotion correlation p_value reference_result
0 joy 0.683822 3.851568e-123 0.62
1 anger 0.607447 1.818293e-90 0.50
2 sadness 0.446928 1.008134e-44 0.64
3 fear 0.591053 1.436841e-84 0.58
Code
emotion_long_outputs = generate_outputs(
    dl=emotion_validation_dataloader,
    model=model,
    prompt=emotion_long_trained_prompt,
    multi_target=True,
)
Code
accuracy(emotion_long_outputs, target_names=EMOTION_LABELS)
              precision    recall  f1-score   support

       anger       0.73      0.77      0.75       315
anticipation       0.29      0.55      0.38       124
     disgust       0.70      0.87      0.78       319
        fear       0.40      0.91      0.55       121
         joy       0.83      0.81      0.82       400
        love       0.39      0.89      0.55       132
    optimism       0.74      0.69      0.72       307
   pessimism       0.25      0.62      0.36       100
     sadness       0.59      0.74      0.65       265
    surprise       0.13      0.83      0.22        35
       trust       0.11      0.86      0.20        43

   micro avg       0.50      0.78      0.61      2161
   macro avg       0.47      0.78      0.54      2161
weighted avg       0.62      0.78      0.67      2161
 samples avg       0.52      0.78      0.60      2161
Code
calculate_pearson_correlation(emotion_long_outputs)
emotion correlation p_value reference_result
0 joy 0.669055 4.640033e-116 0.62
1 anger 0.607317 2.032116e-90 0.50
2 sadness 0.489182 1.714383e-54 0.64
3 fear 0.513182 1.109112e-60 0.58

It’s interesting that training the emotion classifier for longer has decreased the performance. While the micro and macro average stats of the classification report are nearly identical, out of the 4 comparison emotions only sadness experienced an improvement.

Overall these results seem positive. The prompt + linear head is capable of performance comparable to a fine tuned LSTM. The LSTM was even pretrained on the same domain (tweets) as the dataset, while GPT2-small was not.

I think that the accuracy of the individual classifiers has suffered because the prompt has been trying to distinguish all emotions at the same time. If a prompt per emotion were trained would it perform better?

Concatenated Linear Head Evaluation

Now we can just concatenate the two linear heads together to create a composite classifier. Each output of a linear layer depends only on its own row of weights and its own bias term, so concatenating the heads does not alter the outputs. We do have to take care that the comparison is done over the correct indices.

To demonstrate that these statements are true lets create a composite head and then run the two evaluations again.

Code
sentiment_trained_prompt.head.weight.shape, sentiment_trained_prompt.head.bias.shape
(torch.Size([2, 768]), torch.Size([2]))
Code
emotion_trained_prompt.head.weight.shape, emotion_trained_prompt.head.bias.shape
(torch.Size([11, 768]), torch.Size([11]))

So you can see that the weight and bias shapes are compatible. They can just be concatenated along dimension 0 to produce the composite head.

Code
@torch.no_grad()
def make_composite_head(*heads: torch.nn.Linear) -> torch.nn.Linear:
    # the heads share their input size and their outputs are stacked
    in_features = heads[0].in_features
    out_features = sum(head.out_features for head in heads)

    composite_head = torch.nn.Linear(in_features=in_features, out_features=out_features)
    composite_head.weight.data = torch.cat([
        head.weight.data
        for head in heads
    ], dim=0)
    composite_head.bias.data = torch.cat([
        head.bias.data
        for head in heads
    ], dim=0)

    return composite_head
Code
sentiment_emotion_head = make_composite_head(sentiment_trained_prompt.head, emotion_trained_prompt.head)
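
As a quick sanity check on the claim that concatenation does not alter the outputs, slicing the composite head at the right indices should reproduce the original heads (up to floating point noise). A hypothetical check using a random batch of hidden states of GPT2-small’s size:

Code
with torch.no_grad():
    # random hidden states standing in for real model output
    hidden = torch.randn(4, 768, device="cuda")

    assert torch.allclose(
        sentiment_emotion_head(hidden)[:, :2],
        sentiment_trained_prompt.head(hidden),
    )
    assert torch.allclose(
        sentiment_emotion_head(hidden)[:, 2:],
        emotion_trained_prompt.head(hidden),
    )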

To get this working with the existing evaluation code I want to wrap the composite head in a small function that restricts the output to the specified indices. This is the easiest way to get outputs comparable to the original - the outputs will still originate from the composite head.

Code
from typing import Callable

@torch.no_grad()
def restrict_output(head: torch.nn.Linear, indices: List[int]) -> Callable[[torch.Tensor], torch.Tensor]:
    def wrapper(x: torch.Tensor) -> torch.Tensor:
        return head(x)[:, indices]
    return wrapper

Now I can create a mock trained prompt object to wrap all this up.

Code
from dataclasses import replace

accuracy(
    generate_outputs(
        dl=sentiment_validation_dataloader,
        model=model,
        prompt=replace(sentiment_trained_prompt, head=restrict_output(sentiment_emotion_head, indices=[0,1])),
    )
)

              precision    recall  f1-score   support

         bad       0.93      0.91      0.92     12500
        good       0.92      0.93      0.92     12500

    accuracy                           0.92     25000
   macro avg       0.92      0.92      0.92     25000
weighted avg       0.92      0.92      0.92     25000
Code
from dataclasses import replace

accuracy(
    generate_outputs(
        dl=emotion_validation_dataloader,
        model=model,
        prompt=replace(emotion_trained_prompt, head=restrict_output(sentiment_emotion_head, indices=range(2,13))),
        multi_target=True,
    ),
    target_names=EMOTION_LABELS
)

              precision    recall  f1-score   support

       anger       0.69      0.83      0.76       315
anticipation       0.32      0.42      0.36       124
     disgust       0.71      0.80      0.75       319
        fear       0.48      0.91      0.63       121
         joy       0.82      0.83      0.83       400
        love       0.40      0.80      0.54       132
    optimism       0.67      0.84      0.75       307
   pessimism       0.21      0.81      0.33       100
     sadness       0.53      0.78      0.63       265
    surprise       0.08      0.94      0.15        35
       trust       0.13      0.53      0.21        43

   micro avg       0.49      0.80      0.60      2161
   macro avg       0.46      0.77      0.54      2161
weighted avg       0.60      0.80      0.67      2161
 samples avg       0.50      0.80      0.59      2161

Remembering that these are the short trained prompts, you can see that the results are exactly the same as before.

Emotion of IMDB

Something to evaluate quickly is a spot check of the emotional content of the IMDB reviews. This should give an idea of the degree to which a single text can be classified in multiple ways.

This does involve domain switching so some loss of performance is expected. It’s not possible to quantify the performance loss as there is no ground truth for this task. That is why a comprehensive evaluation is not being done.

Code
emotion_of_sentiment_outputs = generate_outputs(
    dl=sentiment_validation_dataloader,
    model=model,
    prompt=emotion_trained_prompt,
    multi_target=True,
)
Code
emotion_of_sentiment_df = pd.DataFrame(
    np.concatenate([
        emotion_of_sentiment_outputs.labels[:, None], # good
        (~emotion_of_sentiment_outputs.labels.astype(bool)).astype(int)[:, None], # bad
        emotion_of_sentiment_outputs.predictions
    ], axis=1),
    columns=["good", "bad"] + EMOTION_LABELS
)
Code
def show_correlation(df: pd.DataFrame, target: str, labels: List[str]) -> pd.DataFrame:
    correlations = pd.DataFrame({
        "label": label,
        "correlation": pearsonr(df[target], df[label])[0]
    } for label in labels)
    return (
        correlations
            .sort_values(by="correlation", ascending=False)
            .reset_index(drop=True)
    )
Code
show_correlation(
    emotion_of_sentiment_df,
    target="good",
    labels=EMOTION_LABELS,
)
label correlation
0 optimism 0.531295
1 love 0.512116
2 trust 0.480094
3 joy 0.454155
4 anticipation 0.266926
5 surprise 0.156752
6 fear -0.125623
7 pessimism -0.309884
8 sadness -0.320460
9 anger -0.463072
10 disgust -0.551651

This seems to have a reasonably clear split. The top four emotions are strongly positively correlated with positive sentiment and the bottom four are strongly negatively correlated, with fear, anticipation and surprise only weakly correlated in between. That feels right.

Sentiment of SemEval 2018

This should be something that is possible to reason about. The correlation of emotion to sentiment should show that negative emotions are associated with negative sentiment. It is possible to write something that is positive and sad, so I do not expect a perfect split.

Code
sentiment_of_emotion_outputs = generate_outputs(
    dl=emotion_validation_dataloader,
    model=model,
    prompt=sentiment_trained_prompt,
)
Code
sentiment_of_emotion_df = pd.DataFrame(
    np.concatenate([
        sentiment_of_emotion_outputs.predictions[:, None], # good
        (~sentiment_of_emotion_outputs.predictions.astype(bool)).astype(int)[:, None], # bad
        sentiment_of_emotion_outputs.labels
    ], axis=1),
    columns=["good", "bad"] + EMOTION_LABELS
)
Code
show_correlation(
    sentiment_of_emotion_df,
    target="good",
    labels=EMOTION_LABELS,
)
label correlation
0 joy 0.490373
1 optimism 0.442756
2 love 0.296635
3 trust 0.134409
4 anticipation 0.110726
5 surprise 0.012533
6 fear -0.108274
7 pessimism -0.172231
8 sadness -0.283275
9 anger -0.441441
10 disgust -0.459849

Once again the split seems reasonable. While the top 4 / bottom 4 emotions are the same as before, this time only the top 2 and bottom 2 stand out as well correlated. It’s interesting how the sentiment classifier appears to be the simpler classifier. This may come down to the information available during training.

These results don’t feel wildly surprising. It’s almost like the emotion classifier is a refinement of the sentiment classifier.


Multi Task Training

I was going to try to combine these two tasks in a single classification head and prompt. The training would be a little involved as there is not a unified dataset for this task - and normally there would not be.

So the training data would be mixed with a 50/50 split. This would balance the two tasks so that the prompt can learn to do both of them. The loss over the linear head would be directed to the task specific indices by selecting them before computing the loss, so that the weights for the unrelated task should not suffer. A sketch of what that could look like is below.
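
This is only a hypothetical sketch of the task specific loss routing, assuming each batch comes from a single task and that the composite head covers both tasks (the TASK_INDICES mapping and multi_task_loss name are illustrative):

Code
# sentiment occupies the first two outputs, emotion the remaining eleven
TASK_INDICES = {
    "sentiment": [0, 1],
    "emotion": list(range(2, 13)),
}

def multi_task_loss(logits: torch.Tensor, labels: torch.Tensor, task: str) -> torch.Tensor:
    # restrict the loss to the outputs that belong to the current task so
    # that the gradients do not touch the other task's rows of the head
    task_logits = logits[:, TASK_INDICES[task]]
    if task == "sentiment":
        return torch.nn.functional.cross_entropy(task_logits, labels)
    return torch.nn.functional.binary_cross_entropy_with_logits(task_logits, labels)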

However it occurs to me that the emotion dataset is already an example of multi task training. So we can estimate how much the multi task setup costs by training a classifier for a single emotion. Let’s choose the best performing emotion - joy:

emotion precision recall f1 support
joy 0.82 0.83 0.83 400
Code
#collapse

joy_train_df = (
    emotion_train_df[["text", "joy"]]
        .rename(columns={"joy": "label"})
        .copy()
)
joy_validation_df = (
    emotion_validation_df[["text", "joy"]]
        .rename(columns={"joy": "label"})
        .copy()
)

joy_train_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=joy_train_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=True,
)
joy_validation_dataloader = PastDataloader(
    model=model,
    tokenizer=tokenizer,
    df=joy_validation_df,
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH,
    shuffle=False,
)
Code
joy_count = joy_train_df.label.sum()
not_joy_count = len(joy_train_df) - joy_count
balanced_count = len(joy_train_df) // 2

joy_weights = torch.tensor([
    balanced_count / not_joy_count,
    balanced_count / joy_count,
], dtype=torch.float, device="cuda")
joy_weights
tensor([0.7840, 1.3803], device='cuda:0')
Code
joy_trained_prompt = train(
    dl=joy_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=3,
    # loss_fn=torch.nn.CrossEntropyLoss(weight=joy_weights)
    loss_fn=torch.nn.functional.cross_entropy,
)
Average loss: 0.5597
Average loss: 0.4182
Average loss: 0.3923
Code
joy_long_trained_prompt = train(
    dl=joy_train_dataloader,
    model=model,
    tokenizer=tokenizer,
    prompt_tokens=5,
    epochs=10,
    # loss_fn=torch.nn.CrossEntropyLoss(weight=joy_weights)
    loss_fn=torch.nn.functional.cross_entropy,
)
Average loss: 0.5273
Average loss: 0.4281
Average loss: 0.3973
Average loss: 0.3725
Average loss: 0.3587
Average loss: 0.3496
Average loss: 0.3465
Average loss: 0.3404
Average loss: 0.3365
Average loss: 0.3214
Code
joy_outputs = generate_outputs(
    dl=joy_validation_dataloader,
    model=model,
    prompt=joy_trained_prompt
)
Code
accuracy(joy_outputs, target_names=["not-joy", "joy"])
              precision    recall  f1-score   support

     not-joy       0.84      0.84      0.84       486
         joy       0.81      0.81      0.81       400

    accuracy                           0.83       886
   macro avg       0.82      0.82      0.82       886
weighted avg       0.83      0.83      0.83       886
Code
joy_long_outputs = generate_outputs(
    dl=joy_validation_dataloader,
    model=model,
    prompt=joy_long_trained_prompt
)
Code
accuracy(joy_long_outputs, target_names=["not-joy", "joy"])
              precision    recall  f1-score   support

     not-joy       0.81      0.92      0.86       486
         joy       0.89      0.74      0.80       400

    accuracy                           0.84       886
   macro avg       0.85      0.83      0.83       886
weighted avg       0.84      0.84      0.84       886

This is quite interesting as the single task classifier actually performs slightly worse than the multi task one. I think this deserves more investigation - this post is already quite long though.