Matthew’s Blog - Wikipedia Boundary Metrics

Progress on Wikipedia link resolution is difficult to track because no metrics are produced as the model trains. Metrics could help spot when the model is underperforming, as happened with the cross entropy loss.

The model is performing two tasks so there should be two sets of metrics. Determining the link boundaries is simple enough that it could be passed to a classification report, while the prediction of the top tokens is more involved. For the top tokens there are many different ways to measure accuracy - should the overlap between the tokens and the targets be used? Should it be the predicted page? Should it be the sum of the correct token indices? A more interesting approach would be to treat the page prediction as a language modelling task and try to measure the perplexity of it. Another might be to treat the output as a vector and use cosine similarity.
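
Two of these candidates, token overlap and cosine similarity, are simple enough to sketch directly. The function names and the example values below are illustrative assumptions and are not taken from the model code in this post.

```python
import math
from typing import Sequence


def token_overlap(predicted_tokens: Sequence[int], target_tokens: Sequence[int]) -> float:
    """Fraction of the target tokens that appear among the predicted tokens."""
    targets = set(target_tokens)
    if not targets:
        return 0.0
    return len(targets & set(predicted_tokens)) / len(targets)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# made-up token ids: two of the four target tokens are predicted
overlap = token_overlap([5, 9, 12, 40], [9, 12, 77, 80])

# made-up output vectors: identical direction and orthogonal direction
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The overlap treats the targets as an unordered set, while the cosine similarity keeps the relative weight of each output dimension, so the two can rank the same prediction differently.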

Investigating these different approaches can also shape the possible loss functions that are used - for example using the perplexity score involves calculating the cross entropy loss.
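
The relationship between perplexity and cross entropy can be made concrete with a small sketch. It assumes that the probability the model assigned to the correct token at each position is available; the function name and the example probabilities are illustrative only.

```python
import math
from typing import Sequence


def perplexity(correct_token_probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean cross entropy (in nats)."""
    # cross entropy: mean negative log probability given to the correct token
    cross_entropy = -sum(math.log(p) for p in correct_token_probs) / len(
        correct_token_probs
    )
    return math.exp(cross_entropy)


# spreading probability evenly over 4 options is as uncertain as a
# uniform choice between 4 tokens
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# a more confident model has a perplexity closer to 1
confident = perplexity([0.9, 0.8, 0.95])
```

Because the cross entropy is already computed as the loss, a perplexity metric of this kind would add very little cost to evaluation.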

Begin Within Metrics

To start with I should measure the performance of the begin/within pair of binary classifiers. This is the entity extraction part of the best performing model.

Code

MODEL_NAME = "facebook/bart-base"
BATCH_SIZE = 32

Here is the metric code. I’m describing it as 2 class because it measures the accuracy of the begin and within classifiers separately.

Code

import blog.transformers_logging

from blog.wikipedia_link.metrics.boundary_bce_2class import metric_boundary_bce_2class
from blog.wikipedia_link.loss.boundary_bce import calculate_loss_boundary_bce

from blog.wikipedia_link.loss.link_bce import calculate_loss_link_bce
from blog.wikipedia_link.model.bart_boundary_bce import BartLinksBoundaryBCE

from blog.wikipedia_link.data.boundary_bce import load_dataset_bce
from blog.wikipedia_link.data.page_tokens import load_page_tokens
from blog.wikipedia_link.data.title_to_index import load_title_to_index

Now that we have defined the metrics, loss and model we can load the data and train it.

Code

#hide_output
from transformers import AutoTokenizer

token_indices = load_page_tokens()
title_to_index = load_title_to_index()
split = load_dataset_bce(test_size=BATCH_SIZE*2)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartLinksBoundaryBCE.from_pretrained(MODEL_NAME)

model.token_indices = token_indices
model.boundary_loss = calculate_loss_boundary_bce
model.link_loss = calculate_loss_link_bce

Some weights of BartLinksBoundaryBCE were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['link_head.bias', 'link_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Code

from typing import Dict
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    boundary_labels = preds.label_ids.reshape(-1, 3)[:, :2]
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 2)[:, -2:]
    return metric_boundary_bce_2class(boundary_predictions, boundary_labels)

training_args = TrainingArguments(
    report_to=[],            
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",

    max_steps=100,
    logging_steps=10,

    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

[100/100 01:32, Epoch 0/1]

Step	Training Loss	Validation Loss	Start Accuracy	Within Accuracy	Within Precision	Within Recall	Within Fscore
10	1.210000	0.796951	0.968628	0.901611	0.000000	0.000000	0.000000
20	0.798000	0.744180	0.968750	0.901855	0.000000	0.000000	0.000000
30	0.737100	0.712299	0.968750	0.901855	0.000000	0.000000	0.000000
40	0.722800	0.695815	0.968750	0.901855	0.000000	0.000000	0.000000
50	0.699700	0.673633	0.968750	0.901855	0.000000	0.000000	0.000000
60	0.679000	0.647545	0.968750	0.898804	0.264151	0.017413	0.032672
70	0.623600	0.626958	0.968750	0.904175	0.552486	0.124378	0.203046
80	0.642100	0.621785	0.968750	0.906616	0.566553	0.206468	0.302644
90	0.634900	0.613633	0.968750	0.905151	0.560811	0.154851	0.242690
100	0.615200	0.612778	0.968750	0.910034	0.583127	0.292289	0.389395

TrainOutput(global_step=100, training_loss=0.7362331295013428, metrics={'train_runtime': 93.2935, 'train_samples_per_second': 34.3, 'train_steps_per_second': 1.072, 'total_flos': 487796726169600.0, 'train_loss': 0.7362331295013428, 'epoch': 0.02})

So this is quite noisy output and I’m not super satisfied with the results. I don’t feel that I have a good idea of the improvement or degradation of the model.
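
Part of the difficulty is class imbalance. The start accuracy of 0.968750 stays constant through the run, which is consistent with the classifier never predicting a start while roughly 1 in 32 tokens begins a link. A minimal sketch of that effect, using made-up labels rather than the real evaluation data (the helper name is hypothetical):

```python
from typing import Dict, Sequence


def binary_scores(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    true_positive = sum(1 for label, pred in pairs if label == 1 and pred == 1)
    false_positive = sum(1 for label, pred in pairs if label == 0 and pred == 1)
    false_negative = sum(1 for label, pred in pairs if label == 1 and pred == 0)
    correct = sum(1 for label, pred in pairs if label == pred)

    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    return {
        "accuracy": correct / len(pairs),
        # report 0.0 when there is nothing to divide by
        "precision": true_positive / predicted_positive if predicted_positive else 0.0,
        "recall": true_positive / actual_positive if actual_positive else 0.0,
    }


# made-up data: 1 positive token in every 32 and a model that always says "no"
labels = [1] + [0] * 31
predictions = [0] * 32
scores = binary_scores(labels, predictions)
```

The accuracy looks high even though the classifier has found nothing, which is why precision and recall for the rare classes are more informative here.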

Begin Within Class Metrics

Mapping the two binary classifiers to the 4 classes should help with this. The four classes are similar to the IOB (inside, outside, beginning) classes except that one of the 4 combinations of the two binary classifiers is invalid (a token that is the start of a link but not within a link). Once that has been done I can generate more meaningful precision and recall figures.

Code

# from src/main/python/blog/wikipedia_link/metrics/boundary_bce_4class.py
from typing import Dict

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def metric_boundary_bce_4class(
    predictions: np.array,
    labels: np.array,
    start_index: int = 0,
    within_index: int = 1,
) -> Dict[str, float]:
    # predictions and labels have been flattened and sliced
    # this means that labels is [n, 2] ints and predictions is [n, 2] floats
    beginning_label = 0
    inside_label = 1
    outside_label = 2
    invalid_label = 3

    def beginning(ary: np.array) -> np.array:
        return (ary[:, start_index] > 0) & (ary[:, within_index] > 0)

    def within(ary: np.array) -> np.array:
        return (ary[:, start_index] <= 0) & (ary[:, within_index] > 0)

    def outside(ary: np.array) -> np.array:
        return (ary[:, start_index] <= 0) & (ary[:, within_index] <= 0)

    def invalid(ary: np.array) -> np.array:
        return (ary[:, start_index] > 0) & (ary[:, within_index] <= 0)

    def classes(ary: np.array) -> np.array:
        return (
            (beginning(ary) * beginning_label)
            + (within(ary) * inside_label)
            + (outside(ary) * outside_label)
            + (invalid(ary) * invalid_label)
        )

    predictions = classes(predictions)
    labels = classes(labels)

    accuracy = accuracy_score(labels, predictions)
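    # fix the label order so that each index below refers to the intended
    # class even when a class is absent from both labels and predictions
    metrics = precision_recall_fscore_support(
        labels,
        predictions,
        labels=[beginning_label, inside_label, outside_label, invalid_label],
        zero_division=0,
    )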

    result = {
        "accuracy": accuracy,
        "beginning_precision": metrics[0][beginning_label],
        "beginning_recall": metrics[1][beginning_label],
        "beginning_fscore": metrics[2][beginning_label],
        "inside_precision": metrics[0][inside_label],
        "inside_recall": metrics[1][inside_label],
        "inside_fscore": metrics[2][inside_label],
        "outside_precision": metrics[0][outside_label],
        "outside_recall": metrics[1][outside_label],
        "outside_fscore": metrics[2][outside_label],
    }
    if len(metrics[0]) == 4:
        result = {
            **result,
            "invalid_precision": metrics[0][invalid_label],
            "invalid_recall": metrics[1][invalid_label],
            "invalid_fscore": metrics[2][invalid_label],
        }
    else:
        result = {
            **result,
            "invalid_precision": 0.0,
            "invalid_recall": 0.0,
            "invalid_fscore": 0.0,
        }

    return result
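
To see the mapping on a concrete input, the following sketch restates the four masks with `np.select` and applies them to one row per combination. It is a compact restatement for illustration (the helper name `to_classes` is hypothetical), and it assumes the same threshold of zero on the logits and the same class ids as the function above.

```python
import numpy as np


def to_classes(ary: np.ndarray) -> np.ndarray:
    """Map [n, 2] start/within scores to class ids 0..3."""
    start = ary[:, 0] > 0
    within = ary[:, 1] > 0
    return np.select(
        [start & within, ~start & within, ~start & ~within, start & ~within],
        [0, 1, 2, 3],  # beginning, inside, outside, invalid
    )


logits = np.array(
    [
        [2.0, 1.5],  # start and within -> beginning
        [-1.0, 0.7],  # within only -> inside
        [-3.0, -2.0],  # neither -> outside
        [0.4, -0.9],  # start only -> invalid
    ]
)
classes = to_classes(logits)
```

The same function works on the 0/1 integer labels, because a label of 1 is greater than zero and a label of 0 is not.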

Code

from typing import Dict
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    boundary_labels = preds.label_ids.reshape(-1, 3)[:, :2]
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 2)[:, -2:]
    return metric_boundary_bce_4class(boundary_predictions, boundary_labels)

training_args = TrainingArguments(
    report_to=[],
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",

    max_steps=100,
    logging_steps=10,

    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

using `logging_steps` to initialize `eval_steps` to 10
PyTorch: setting up devices

[100/100 01:34, Epoch 0/1]

Step	Training Loss	Validation Loss	Accuracy	Beginning Precision	Beginning Recall	Beginning Fscore	Inside Precision	Inside Recall	Inside Fscore	Outside Precision	Outside Recall	Outside Fscore
10	0.624400	0.604486	0.904114	0.000000	0.000000	0.000000	0.426696	0.355839	0.388060	0.932321	0.976110	0.953713
20	0.617700	0.587958	0.906372	0.750000	0.029297	0.056391	0.437500	0.319343	0.369198	0.930673	0.980306	0.954845
30	0.594300	0.580028	0.906006	0.762712	0.087891	0.157618	0.428678	0.526460	0.472563	0.949463	0.962507	0.955940
40	0.593200	0.577678	0.911194	0.644860	0.404297	0.496999	0.465568	0.721715	0.566011	0.969855	0.942813	0.956143
50	0.584700	0.570706	0.921753	0.641791	0.587891	0.613660	0.549356	0.700730	0.615878	0.967593	0.949716	0.958571
60	0.581300	0.570254	0.922546	0.655172	0.519531	0.579521	0.558360	0.645985	0.598985	0.961907	0.957025	0.959460
70	0.552800	0.564304	0.922852	0.643172	0.570312	0.604555	0.558431	0.649635	0.600590	0.963944	0.955333	0.959619
80	0.578900	0.564882	0.922424	0.641975	0.609375	0.625251	0.552790	0.668796	0.605285	0.965811	0.952084	0.958899
90	0.579800	0.562999	0.923218	0.655405	0.568359	0.608787	0.554545	0.667883	0.605960	0.965100	0.954453	0.959747
100	0.565100	0.562203	0.923096	0.654018	0.572266	0.610417	0.552906	0.677007	0.608696	0.965794	0.953506	0.959610

TrainOutput(global_step=100, training_loss=0.5872174787521363, metrics={'train_runtime': 95.2434, 'train_samples_per_second': 33.598, 'train_steps_per_second': 1.05, 'total_flos': 487796726169600.0, 'train_loss': 0.5872174787521363, 'epoch': 0.02})

So I think that these boundary metrics are far better. They make it much easier to see where the model is strong and weak, although training for 100 batches is not going to show good performance. It would be good to evaluate the other form of this metric against the cross entropy version of the model to check I have implemented it well.

IOB Boundary Metrics

The Inside / Outside / Beginning 3 class boundary classifier has already been tried, so let’s see what the metrics say about it.
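
For the three class version the predicted class is the argmax over the logits, and precision and recall are then computed per class. A minimal numpy sketch of that calculation, with made-up logits and assuming the class ids outside=0, inside=1 and beginning=2 (the helper name is hypothetical):

```python
from typing import Tuple

import numpy as np


def per_class_precision_recall(
    labels: np.ndarray, predictions: np.ndarray, class_id: int
) -> Tuple[float, float]:
    """One-vs-rest precision and recall for a single class id."""
    predicted = predictions == class_id
    actual = labels == class_id
    true_positive = np.sum(predicted & actual)
    precision = true_positive / predicted.sum() if predicted.sum() else 0.0
    recall = true_positive / actual.sum() if actual.sum() else 0.0
    return float(precision), float(recall)


logits = np.array(
    [
        [3.0, 0.1, 0.2],  # predicted outside
        [0.2, 0.1, 2.5],  # predicted beginning
        [0.3, 1.9, 0.4],  # predicted inside
        [1.2, 0.9, 0.1],  # predicted outside, but the label is inside
        [2.2, 0.3, 0.1],  # predicted outside
    ]
)
labels = np.array([0, 2, 1, 1, 0])
predictions = logits.argmax(axis=1)

inside_precision, inside_recall = per_class_precision_recall(labels, predictions, 1)
outside_precision, outside_recall = per_class_precision_recall(labels, predictions, 0)
```

The single mistake lowers the recall of the inside class and the precision of the outside class, which is the kind of detail that an overall accuracy figure hides.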

Code

# from src/main/python/blog/wikipedia_link/metrics/boundary_iob.py
from typing import Dict

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def metric_boundary_iob(
    predictions: np.array,
    labels: np.array,
    inside_label: int = 1,
    beginning_label: int = 2,
    outside_label: int = 0,
) -> Dict[str, float]:
    # predictions and labels have been flattened and sliced
    # this means that labels is [n] ints and predictions is [n, 3] floats
    predictions = predictions.argmax(axis=1)

    accuracy = accuracy_score(labels, predictions)
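    # fix the label order so that indexing by class id stays correct
    # even when a class is absent from both labels and predictions
    metrics = precision_recall_fscore_support(
        labels, predictions, labels=list(range(3)), zero_division=0
    )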

    return {
        "accuracy": accuracy,
        "beginning_precision": metrics[0][beginning_label],
        "beginning_recall": metrics[1][beginning_label],
        "beginning_fscore": metrics[2][beginning_label],
        "inside_precision": metrics[0][inside_label],
        "inside_recall": metrics[1][inside_label],
        "inside_fscore": metrics[2][inside_label],
        "outside_precision": metrics[0][outside_label],
        "outside_recall": metrics[1][outside_label],
        "outside_fscore": metrics[2][outside_label],
    }



# from src/main/python/blog/wikipedia_link/loss/boundary_iob.py
import torch


def calculate_loss_boundary_iob(
    predictions: torch.Tensor,  # [:, n, 3] outside, inside, begin
    labels: torch.Tensor,  # [:, n, 1] class
) -> torch.Tensor:
    """Calculate the loss for the boundary predictions (outside, inside, beginning).
    The predictions are only the boundary predictions.
    The labels combine the boundary labels and the link target index."""
    return torch.nn.functional.cross_entropy(
        predictions.reshape(-1, 3), labels.flatten()
    )



# from src/main/python/blog/wikipedia_link/loss/link_bce.py
import numpy as np
import torch


def calculate_loss_link_bce(
    predictions: torch.Tensor,  # [:, n, vocab_size]
    boundary_labels: torch.Tensor,  # [:, n, 1] or [:, n, 2] for iob or bce
    link_labels: torch.Tensor,  # [:, n] for index
    token_indices: np.array,  # index -> 50 tokens
) -> torch.Tensor:
    """Calculate the loss for the link predictions.
    The labels for this are only valid within a link,
    so the boundary_labels are used to spot the links.
    The predictions are only the link target predictions."""

    boundary_labels = boundary_labels.view(-1, boundary_labels.shape[-1])
    mask = boundary_labels.sum(dim=1) > 0

    link_labels = link_labels.view(-1, 1)[mask].long()
    rows = link_labels.shape[0]

    vocab_size = predictions.shape[-1]
    predictions = predictions.view(-1, vocab_size)[mask]

    targets = torch.zeros(vocab_size * rows, device=predictions.device)
    target_offsets = torch.arange(rows, device=predictions.device) * vocab_size
    target_indexes = (token_indices[link_labels] + target_offsets[:, None]).flatten()
    targets[target_indexes] = 1

    return torch.nn.functional.binary_cross_entropy_with_logits(
        predictions, targets.view(-1, vocab_size)
    )



# from src/main/python/blog/wikipedia_link/data/boundary_iob.py
from pathlib import Path
from typing import Dict

import pandas as pd
from datasets import Dataset


def load_dataset_iob(test_size: int = 64) -> Dict[str, Dataset]:
    df = pd.read_parquet(
        sorted(
            Path("/data/blog/2021-08-21-link-evaluation").glob(  # different folder
                "*.gz.parquet"
            )
        )[-1]
    )
    df = df[["input_ids", "attention_mask", "label"]]

    return Dataset.from_pandas(df).train_test_split(test_size=test_size)



# from src/main/python/blog/wikipedia_link/data/page_tokens.py
from pathlib import Path

import numpy as np
import pandas as pd
import torch


def load_page_tokens(device: torch.device = torch.device("cuda")) -> torch.Tensor:
    token_df = pd.concat(
        [
            pd.read_parquet(path)
            for path in sorted(
                Path("/data/blog/2021-08-01-wikipedia-page-pmi/").glob(
                    "*-pmi.gz.parquet"
                )
            )
        ]
    )
    token_df = token_df.set_index("title")

    token_indices = np.concatenate(token_df.tokens.values).reshape(-1, 50)
    token_indices = torch.from_numpy(token_indices).long()
    return token_indices.detach().to(device)



# from src/main/python/blog/wikipedia_link/data/title_to_index.py
from pathlib import Path
from typing import Dict

import pandas as pd


def convert_pmi_to_title_index(
    source: Path,
    destination: Path,
) -> None:
    destination.parent.mkdir(exist_ok=True, parents=True)

    if destination.exists():
        print(f"Skipping title-to-index aggregation, already exists at {destination}")
        return

    df = pd.read_parquet(source)[["title"]]
    df = df.sort_values(by="title")
    df = df.reset_index()
    df = df.set_index("title")
    df.to_parquet(destination, compression="gzip")


def load_title_to_index() -> Dict[str, int]:
    title_to_index = pd.read_parquet(
        "/data/blog/2021-07-30-wikipedia-data-generation/title-to-index.gz.parquet"
    )
    return title_to_index["index"].to_dict()



# from src/main/python/blog/wikipedia_link/model/bart_boundary_iob.py
import numpy as np
import torch
from transformers import BartConfig, BartForConditionalGeneration


class BartLinksBoundaryIOB(BartForConditionalGeneration):
    def __init__(self, config: BartConfig) -> None:
        super().__init__(config)
        self.link_head = torch.nn.Linear(
            in_features=config.d_model, out_features=3, bias=True
        )
        self.token_indices = None

    @staticmethod
    def boundary_loss(
        predictions: torch.Tensor,  # [:, n, 3] outside, inside, begin
        labels: torch.Tensor,  # [:, n, 1] class: outside, inside, begin
    ) -> torch.Tensor:
        raise NotImplementedError()

    @staticmethod
    def link_loss(
        predictions: torch.Tensor,  # [:, n, vocab_size]
        boundary_labels: torch.Tensor,  # [:, n, 1] class: outside, inside, begin
        link_labels: torch.Tensor,  # [:, n] for index
        token_indices: np.array,  # index -> 50 tokens
    ) -> torch.Tensor:
        raise NotImplementedError()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        labels=None,
    ):
        assert self.token_indices is not None, "Model misconfigured, set token_indices"

        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
        )
        link_logits = self.lm_head(outputs[0]) + self.final_logits_bias
        boundary_logits = self.link_head(outputs[0])
        logits = torch.cat(
            [
                link_logits,
                boundary_logits,
            ],
            dim=-1,
        )

        # Including the full outputs means more stuff gets passed to the metrics method.
        # Keeping the output just the logits or loss and logits makes metrics easier.
        # The base model uses this approach:
        # output = (logits,) + outputs[1:]

        if labels is not None:
            boundary_loss = self.boundary_loss(
                predictions=boundary_logits, labels=labels[:, :, :1]
            )
            link_loss = self.link_loss(
                predictions=link_logits,
                boundary_labels=labels[:, :, :1],
                link_labels=labels[:, :, 1],
                token_indices=self.token_indices,
            )
            loss = boundary_loss + link_loss
            return (loss, logits)
        return (logits,)
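
Because `forward` concatenates the link logits and the boundary logits along the last axis, the boundary predictions can be recovered by slicing the final columns, which is what `compute_metrics` relies on. A small numpy sketch of that round trip, assuming a toy vocabulary of 5 tokens instead of the real 50265 and random values in place of model output:

```python
import numpy as np

# toy sizes, chosen only for illustration
batch, tokens, vocab_size, boundary_classes = 2, 4, 5, 3

rng = np.random.default_rng(0)
link_logits = rng.normal(size=(batch, tokens, vocab_size))
boundary_logits = rng.normal(size=(batch, tokens, boundary_classes))

# same layout as the model output: [..., vocab_size + boundary_classes]
logits = np.concatenate([link_logits, boundary_logits], axis=-1)

# flatten to one row per token and keep only the trailing boundary columns
recovered = logits.reshape(-1, vocab_size + boundary_classes)[:, -boundary_classes:]
```

Returning a single concatenated tensor keeps the evaluation output simple, at the cost of needing to know the column layout when computing metrics.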

Code

#hide_output
from transformers import AutoTokenizer

token_indices = load_page_tokens()
title_to_index = load_title_to_index()
split = load_dataset_iob(test_size=BATCH_SIZE*2)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartLinksBoundaryIOB.from_pretrained(MODEL_NAME)

model.token_indices = token_indices
model.boundary_loss = calculate_loss_boundary_iob
model.link_loss = calculate_loss_link_bce

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/facebook/bart-base/resolve/main/config.json from cache at /home/matthew/.cache/huggingface/transformers/f5310d276a6d1648d00c32fadc8bf7b4607e0fbd5b404fc4a0045960aa2bdfdb.8512cdf8592f538a7fd4b40eecaa096285410ec6494049568b3300922ab71165
Model config BartConfig {
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "normalize_embedding": true,
  "num_beams": 4,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "length_penalty": 1.0,
      "max_length": 128,
      "min_length": 12,
      "num_beams": 4
    },
    "summarization_cnn": {
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "num_beams": 4
    },
    "summarization_xsum": {
      "length_penalty": 1.0,
      "max_length": 62,
      "min_length": 11,
      "num_beams": 6
    }
  },
  "transformers_version": "4.9.2",
  "use_cache": true,
  "vocab_size": 50265
}

loading file https://huggingface.co/facebook/bart-base/resolve/main/vocab.json from cache at /home/matthew/.cache/huggingface/transformers/43978bdeaa326572886b44fcfed82f932f76571095ce31973e51c3da8ccade7f.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
loading file https://huggingface.co/facebook/bart-base/resolve/main/merges.txt from cache at /home/matthew/.cache/huggingface/transformers/3c167ed8af56e6605eeb794b63a79d65d85e6708c9b04408d41946337030f5cd.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/facebook/bart-base/resolve/main/tokenizer.json from cache at /home/matthew/.cache/huggingface/transformers/a878fcd69bba037c9b1b227f4213579ae43d0aaa9374e167bc6c5f41b1cfeb30.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
loading file https://huggingface.co/facebook/bart-base/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/bart-base/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/bart-base/resolve/main/tokenizer_config.json from cache at None
loading weights file https://huggingface.co/facebook/bart-base/resolve/main/pytorch_model.bin from cache at /home/matthew/.cache/huggingface/transformers/486355ec722ef05fd480e999d4c763be56549ae930f6a3742ee721a5d2a05647.9faea28a6782a9589c09b1942c039943df02232d83d2ac288a69ddfa928eae22
All model checkpoint weights were used when initializing BartLinksBoundaryIOB.

Some weights of BartLinksBoundaryIOB were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['link_head.bias', 'link_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Code

from typing import Dict
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    boundary_labels = preds.label_ids.reshape(-1, 2)[:, 0]
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 3)[:, -3:]
    return metric_boundary_iob(boundary_predictions, boundary_labels)

training_args = TrainingArguments(
    report_to=[],            
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",

    max_steps=100,
    logging_steps=10,

    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

using `logging_steps` to initialize `eval_steps` to 10
PyTorch: setting up devices

[100/100 01:31, Epoch 0/1]

Step	Training Loss	Validation Loss	Accuracy	Beginning Precision	Beginning Recall	Beginning Fscore	Inside Precision	Inside Recall	Inside Fscore	Outside Precision	Outside Recall	Outside Fscore
10	1.479900	0.935388	0.899658	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.899713	0.999932	0.947179
20	0.839800	0.795681	0.899719	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.899719	1.000000	0.947213
30	0.750300	0.721273	0.901245	0.000000	0.000000	0.000000	0.595420	0.068602	0.123028	0.903710	0.996405	0.947796
40	0.698600	0.680169	0.907898	0.000000	0.000000	0.000000	0.622018	0.298153	0.403092	0.917735	0.986093	0.950687
50	0.671300	0.665654	0.910828	0.727273	0.063241	0.116364	0.668750	0.282322	0.397032	0.918663	0.988400	0.952256
60	0.666300	0.651620	0.920044	0.610860	0.533597	0.569620	0.620942	0.521548	0.566922	0.948222	0.964046	0.956068
70	0.656500	0.641215	0.921082	0.641026	0.444664	0.525088	0.606164	0.622691	0.614317	0.952439	0.960450	0.956428
80	0.634100	0.642581	0.921692	0.599206	0.596838	0.598020	0.599219	0.674582	0.634671	0.961096	0.951903	0.956477
90	0.620200	0.641026	0.916504	0.574627	0.608696	0.591171	0.563187	0.721196	0.632472	0.964981	0.942134	0.953421
100	0.644800	0.638462	0.918945	0.580038	0.608696	0.594021	0.580364	0.701847	0.635350	0.963531	0.946340	0.954858

TrainOutput(global_step=100, training_loss=0.7661756467819214, metrics={'train_runtime': 92.7329, 'train_samples_per_second': 34.508, 'train_steps_per_second': 1.078, 'total_flos': 487800505958400.0, 'train_loss': 0.7661756467819214, 'epoch': 0.02})

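This works, and the scores after 100 steps are close to the previous run: the inside F-score is marginally better (0.635 against 0.609) while the accuracy and beginning F-score are marginally lower. When I trained the model with this approach it did very poorly though, so it will be interesting to see what the metrics are for a full training run.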

I’m going to work on the link description metrics in another post as this is quite long.