Wikipedia Perplexity Train

Trying to train the model using the class-based approach hinted at by the perplexity metric
Published

September 2, 2021

The perplexity metric seems to be a good way to measure the performance of the model, as it measures how well the model predicts the Wikipedia page. Perplexity is also directly related to cross entropy loss. If it's such a good metric to optimize for, how would it perform if we use it as the loss function?
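
To make that relationship concrete, here is a minimal sketch (the shapes are illustrative, not taken from the actual dataset) showing that perplexity is just the exponential of the cross entropy loss:

Code
import torch
import torch.nn.functional as F

# toy logits over a tiny vocabulary with one target token per example
logits = torch.randn(4, 10)           # (examples, vocabulary size)
targets = torch.randint(0, 10, (4,))  # the "correct" token for each example

cross_entropy = F.cross_entropy(logits, targets)
perplexity = torch.exp(cross_entropy)  # perplexity is just e^(cross entropy)

print(cross_entropy.item(), perplexity.item())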

Model and Metric Definitions

We can copy over a lot of this from the previous notebooks. The model will need a bit of work around the loss function, so let's define that first and then copy over all the other code. To make it slightly easier to read, all of the old code will be folded.

Code
MODEL_NAME = "facebook/bart-base"
BATCH_SIZE = 4
EPOCHS = 5
Code
import blog.transformers_logging

from blog.wikipedia_link.loss.link_perplexity import calculate_loss_link_perplexity
from blog.wikipedia_link.metrics.link_perplexity import metric_link_perplexity

from blog.wikipedia_link.loss.boundary_iob import calculate_loss_boundary_iob
from blog.wikipedia_link.metrics.boundary_iob import metric_boundary_iob

from blog.wikipedia_link.data.boundary_iob import load_dataset_iob
from blog.wikipedia_link.data.page_tokens import load_page_tokens
from blog.wikipedia_link.data.title_to_index import load_title_to_index

from blog.wikipedia_link.model.bart_boundary_iob import BartLinksBoundaryIOB

As you can see I've had to work on this a bit. One problem is expanding every token in the batch to a distribution over the full ~50k token vocabulary, which makes my memory weep.
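
As a rough back-of-the-envelope estimate of what that expansion costs (the sequence length and vocabulary size here are assumptions based on BART's defaults):

Code
# memory for a single float32 tensor of per-token scores over the whole vocabulary
batch_size = 4
sequence_length = 1024  # BART's maximum input length
vocab_size = 50_265     # approximate BART tokenizer vocabulary size
bytes_per_float32 = 4

gigabytes = (batch_size * sequence_length * vocab_size * bytes_per_float32) / 1024**3
print(f"{gigabytes:.2f} GB")  # ~0.77 GB for one tensor, before gradients and intermediates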

Training

With all of that code defined we can load up the data and model and train it.

Code
#hide_output

token_indices = load_page_tokens()
title_to_index = load_title_to_index()
split = load_dataset_iob(test_size=BATCH_SIZE*2)
Code
#hide_output
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartLinksBoundaryIOB.from_pretrained(MODEL_NAME)

model.token_indices = token_indices
model.boundary_loss = calculate_loss_boundary_iob
model.link_loss = calculate_loss_link_perplexity
Some weights of BartLinksBoundaryIOB were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['link_head.weight', 'link_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Code
from typing import *
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

MODEL_RUN_FOLDER = Path("/data/blog/2021-09-02-wikipedia-train-perplexity/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    labels = preds.label_ids.reshape(-1, 2)
    predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 3)
    return {
        **{
            f"boundary_{key}": value
            for key, value in metric_boundary_iob(
                predictions=predictions[:, -3:],
                labels=labels[:, 0],
            ).items()
        },
        **{
            f"link_{key}": value
            for key, value in metric_link_perplexity(
                predictions=predictions[:, :tokenizer.vocab_size],
                labels=labels,
                token_indices=token_indices
            ).items()
        }
    }

training_args = TrainingArguments(
    report_to=[],            
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",

    num_train_epochs=EPOCHS,
    logging_steps=1_000,

    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
[ 2/206065 : < :, Epoch 0.00/5]
Step Training Loss Validation Loss

RuntimeError: CUDA out of memory. Tried to allocate 4.72 GiB (GPU 0; 23.65 GiB total capacity; 21.01 GiB already allocated; 2.44 MiB free; 22.54 GiB reserved in total by PyTorch)

This is destroying my memory. I’ll have to change the model to just calculate the link loss instead of considering the boundary too.

Code
#collapse
from transformers import BartForConditionalGeneration, BartConfig
from transformers.models.bart.modeling_bart import shift_tokens_right
import torch

class BartOnlyLinks(BartForConditionalGeneration):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        labels=None,
    ):
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
        )
        logits = self.lm_head(outputs[0]) + self.final_logits_bias

        # cut down the output to make the metrics easy to calculate
        # output = (logits,) + outputs[1:]

        if labels is not None:
            loss = calculate_loss_link_perplexity(
                predictions=logits,
                boundary_labels=labels[:, :, :1],
                link_labels=labels[:, :, 1],
                token_indices=token_indices,
            )
            return (loss, logits) # ((loss,) + output)
        return (logits,)
Code
#hide_output
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartOnlyLinks.from_pretrained(MODEL_NAME)
Code
from typing import *
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

# metric_link_jaccard is used in compute_metrics below but was never imported;
# this module path is an assumption based on the layout of the other metric modules
from blog.wikipedia_link.metrics.link_jaccard import metric_link_jaccard

MODEL_RUN_FOLDER = Path("/data/blog/2021-09-02-wikipedia-train-perplexity/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)
token_indices_npy = token_indices.cpu().numpy()

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    labels = preds.label_ids.reshape(-1, 2)
    predictions = preds.predictions.reshape(-1, tokenizer.vocab_size)
    return {
        **metric_link_perplexity(
            predictions=predictions,
            labels=labels,
            token_indices=token_indices
        ),
        **metric_link_jaccard(
            predictions=predictions,
            labels=labels,
            token_indices=token_indices_npy
        )
    }

training_args = TrainingArguments(
    report_to=[],            
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",

    #num_train_epochs=EPOCHS,
    max_steps=1_000,
    logging_steps=10,
    # adafactor=True,

    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
[ 2/1000 : < :, Epoch 0.00/1]
Step Training Loss Validation Loss

RuntimeError: CUDA out of memory. Tried to allocate 2.36 GiB (GPU 0; 23.65 GiB total capacity; 18.66 GiB already allocated; 1.22 GiB free; 21.32 GiB reserved in total by PyTorch)

So even the reduced version is unable to run on a beefy GPU. This is the smallest BART model available and it still uses too much memory. Training in this way is not feasible.

I did manage to get this working once. The problem was that the final model only ever predicted a single set of tokens. So even if I can wrangle this into running, it is not the right way to train the model.

It is annoying that I’ve managed to break the training though.


Next Steps

The memory usage of this approach is too high, but which part is responsible? I need a way to inspect the memory usage.
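
One starting point, assuming I stick with PyTorch's built-in tooling, is the CUDA memory statistics; a sketch of how they could be used around a single forward and backward pass:

Code
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one forward/backward pass of the model here ...

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# per-allocation breakdown, useful for spotting the offending tensors
print(torch.cuda.memory_summary(abbreviated=True))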

I do feel that this is a better approach to training than binary cross entropy. The problem is the memory usage and the slowness. Word2Vec handles this by selecting a small number of negative examples per iteration and using only those as the comparison points (negative sampling). The memory usage and training speed could be improved by taking that approach, as sketched below.
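
A sketch of what that could look like for the link loss; the shapes, the helper name and the assumption of a single positive target token per link are mine, not what calculate_loss_link_perplexity currently does:

Code
import torch
import torch.nn.functional as F

def sampled_link_loss(logits: torch.Tensor, targets: torch.Tensor, negatives: int = 50) -> torch.Tensor:
    """Score each link against its target token and a handful of random
    negatives instead of the full ~50k token vocabulary."""
    # logits: (links, vocab_size), targets: (links,)
    negative_tokens = torch.randint(
        0, logits.shape[-1], (logits.shape[0], negatives), device=logits.device
    )  # may occasionally include the positive token, ignored here for simplicity
    # column 0 holds the positive token, the remaining columns hold negatives
    candidates = torch.cat([targets[:, None], negative_tokens], dim=-1)
    candidate_logits = logits.gather(-1, candidates)
    # the positive token is always at index 0 among the candidates
    positive_index = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(candidate_logits, positive_index)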

It would also be good to incorporate the actual PMI values into this. If the top 50 PMI token scores were passed through softmax that would give a set of values on a consistent scale. The softmax output of the model could then be compared to that target distribution using a dot product (which, for normalized vectors, is equivalent to cosine similarity). To speed this up the softmax could be applied to the model output in its entirety and then only the PMI tokens compared, or another randomly chosen set of tokens could be compared as well.
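
A sketch of that comparison, assuming the top PMI tokens and their raw scores are available for each link (pmi_tokens and pmi_scores are hypothetical inputs):

Code
import torch

def pmi_similarity_loss(
    logits: torch.Tensor, pmi_tokens: torch.Tensor, pmi_scores: torch.Tensor
) -> torch.Tensor:
    """Compare the model's softmax output to a softmax over the top PMI scores.

    logits:     (links, vocab_size) raw model output
    pmi_tokens: (links, 50) indices of the top PMI tokens for each link
    pmi_scores: (links, 50) the corresponding raw PMI values
    """
    predictions = logits.softmax(dim=-1)   # softmax over the full vocabulary
    targets = pmi_scores.softmax(dim=-1)   # puts the PMI values on a consistent scale
    selected = predictions.gather(-1, pmi_tokens)   # only the PMI tokens are compared
    similarity = (selected * targets).sum(dim=-1)   # dot product per link
    return (1 - similarity).mean()                  # higher similarity -> lower loss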

Either way this negative result has plenty going for it.