MODEL_NAME = "facebook/bart-base"
BATCH_SIZE = 32
August 30, 2021
Progress on Wikipedia link resolution is difficult to track because no metrics are produced as the model trains. Metrics could help spot when the model is underperforming, as happened with the cross entropy loss.
The model is performing two tasks so there should be two sets of metrics. Determining the link boundaries is simple enough that it could be passed to a classification report, while the prediction of the top tokens is more involved. For the top tokens there are many different ways to measure accuracy - should the overlap between the tokens and the targets be used? Should it be the predicted page? Should it be the sum of the correct token indices? A more interesting approach would be to treat the page prediction as a language modelling task and try to measure the perplexity of it. Another might be to treat the output as a vector and use cosine similarity.
Investigating these different approaches can also shape the possible loss functions that are used - for example using the perplexity score involves calculating the cross entropy loss.
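To make a few of these candidates concrete, here is a rough sketch of token overlap, cosine similarity and perplexity on toy data. The function names and the toy numbers are illustrations only and are not part of the blog code. It uses just the standard library so it can be read in isolation.

```python
import math
from typing import Sequence, Set


def token_overlap(predicted: Set[int], target: Set[int]) -> float:
    """Fraction of the target tokens that were also predicted."""
    if not target:
        return 0.0
    return len(predicted & target) / len(target)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def perplexity(target_probabilities: Sequence[float]) -> float:
    """exp of the mean cross entropy, given the probability assigned to each correct token."""
    cross_entropy = -sum(math.log(p) for p in target_probabilities) / len(
        target_probabilities
    )
    return math.exp(cross_entropy)


# two of the four target tokens were predicted
overlap = token_overlap({1, 2, 3, 4}, {3, 4, 5, 6})
# identical vectors point in the same direction
similarity = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
# a model that is uniformly unsure between 4 options has a perplexity of 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

The perplexity function shows why that metric is tied to the loss: it is just the exponential of the mean cross entropy.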
To start with I should measure the performance of the begin/within pair of binary classifiers. This is the entity extraction part of the best performing model.
Here is the metric code. I’m describing it as 2 class because it measures the accuracy of the begin and within classifiers separately.
import blog.transformers_logging
from blog.wikipedia_link.metrics.boundary_bce_2class import metric_boundary_bce_2class
from blog.wikipedia_link.loss.boundary_bce import calculate_loss_boundary_bce
from blog.wikipedia_link.loss.link_bce import calculate_loss_link_bce
from blog.wikipedia_link.model.bart_boundary_bce import BartLinksBoundaryBCE
from blog.wikipedia_link.data.boundary_bce import load_dataset_bce
from blog.wikipedia_link.data.page_tokens import load_page_tokens
from blog.wikipedia_link.data.title_to_index import load_title_to_index
Now that we have defined the metrics, loss and model we can load the data and train it.
#hide_output
from transformers import AutoTokenizer
token_indices = load_page_tokens()
title_to_index = load_title_to_index()
split = load_dataset_bce(test_size=BATCH_SIZE*2)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartLinksBoundaryBCE.from_pretrained(MODEL_NAME)
model.token_indices = token_indices
model.boundary_loss = calculate_loss_boundary_bce
model.link_loss = calculate_loss_link_bce
Some weights of BartLinksBoundaryBCE were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['link_head.bias', 'link_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from typing import *
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction
MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    # each token has 3 label columns, keep the two boundary columns (start, within)
    boundary_labels = preds.label_ids.reshape(-1, 3)[:, :2]
    # the 2 boundary logits come after the vocab_size link logits
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 2)[:, -2:]
    return metric_boundary_bce_2class(boundary_predictions, boundary_labels)


training_args = TrainingArguments(
    report_to=[],
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",
    max_steps=100,
    logging_steps=10,
    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
Step | Training Loss | Validation Loss | Start Accuracy | Start Precision | Start Recall | Start Fscore | Within Accuracy | Within Precision | Within Recall | Within Fscore |
---|---|---|---|---|---|---|---|---|---|---|
10 | 1.210000 | 0.796951 | 0.968628 | 0.000000 | 0.000000 | 0.000000 | 0.901611 | 0.000000 | 0.000000 | 0.000000 |
20 | 0.798000 | 0.744180 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.901855 | 0.000000 | 0.000000 | 0.000000 |
30 | 0.737100 | 0.712299 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.901855 | 0.000000 | 0.000000 | 0.000000 |
40 | 0.722800 | 0.695815 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.901855 | 0.000000 | 0.000000 | 0.000000 |
50 | 0.699700 | 0.673633 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.901855 | 0.000000 | 0.000000 | 0.000000 |
60 | 0.679000 | 0.647545 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.898804 | 0.264151 | 0.017413 | 0.032672 |
70 | 0.623600 | 0.626958 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.904175 | 0.552486 | 0.124378 | 0.203046 |
80 | 0.642100 | 0.621785 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.906616 | 0.566553 | 0.206468 | 0.302644 |
90 | 0.634900 | 0.613633 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.905151 | 0.560811 | 0.154851 | 0.242690 |
100 | 0.615200 | 0.612778 | 0.968750 | 0.000000 | 0.000000 | 0.000000 | 0.910034 | 0.583127 | 0.292289 | 0.389395 |
TrainOutput(global_step=100, training_loss=0.7362331295013428, metrics={'train_runtime': 93.2935, 'train_samples_per_second': 34.3, 'train_steps_per_second': 1.072, 'total_flos': 487796726169600.0, 'train_loss': 0.7362331295013428, 'epoch': 0.02})
So this is quite noisy output and I’m not super satisfied with the results. I don’t feel that I have a good idea of the improvement or degradation of the model.
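Part of the problem is class imbalance. A start accuracy of 0.96875 alongside zero precision and recall is what a classifier that never predicts a start would score if one token in 32 were a start. The sketch below reproduces that arithmetic on synthetic labels. The 1-in-32 ratio is inferred from the accuracy figure in the table above, not measured from the dataset, and `binary_report` is a hypothetical helper rather than the metric used in the run.

```python
from typing import Dict, List


def binary_report(labels: List[int], predictions: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F-score for the positive class."""
    pairs = list(zip(labels, predictions))
    true_positive = sum(1 for label, pred in pairs if label == 1 and pred == 1)
    false_positive = sum(1 for label, pred in pairs if label == 0 and pred == 1)
    false_negative = sum(1 for label, pred in pairs if label == 1 and pred == 0)
    correct = sum(1 for label, pred in pairs if label == pred)

    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    # follow the zero_division=0 convention: undefined ratios become 0
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


# synthetic labels: 1 start token in every 32, purely for illustration
labels = [1 if index % 32 == 0 else 0 for index in range(3200)]
never_start = [0] * len(labels)
report = binary_report(labels, never_start)
```

The accuracy here is 0.96875 even though the classifier has learned nothing, which is why the precision, recall and F-score columns are the ones worth watching.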
Mapping the two binary classifiers to the 4 classes should help with this. The four classes are similar to the IOB (inside, outside, beginning) classes except that the 4 combinations of the two binary classifiers can result in an invalid combination (a token that is the start of a link but not within a link). Once that has been done I can generate more meaningful precision and recall figures.
# from src/main/python/blog/wikipedia_link/metrics/boundary_bce_4class.py
from typing import Dict
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def metric_boundary_bce_4class(
    predictions: np.ndarray,
    labels: np.ndarray,
    start_index: int = 0,
    within_index: int = 1,
) -> Dict[str, float]:
    # predictions and labels have been flattened and sliced
    # this means that labels is [n, 2] ints and predictions is [n, 2] floats
    beginning_label = 0
    inside_label = 1
    outside_label = 2
    invalid_label = 3

    # thresholding at zero: for logits this is a sigmoid probability above 0.5,
    # and 0/1 labels pass the same test
    def beginning(ary: np.ndarray) -> np.ndarray:
        return (ary[:, start_index] > 0) & (ary[:, within_index] > 0)

    def within(ary: np.ndarray) -> np.ndarray:
        return (ary[:, start_index] <= 0) & (ary[:, within_index] > 0)

    def outside(ary: np.ndarray) -> np.ndarray:
        return (ary[:, start_index] <= 0) & (ary[:, within_index] <= 0)

    def invalid(ary: np.ndarray) -> np.ndarray:
        return (ary[:, start_index] > 0) & (ary[:, within_index] <= 0)

    def classes(ary: np.ndarray) -> np.ndarray:
        # exactly one of the four masks is true for each row
        return (
            (beginning(ary) * beginning_label)
            + (within(ary) * inside_label)
            + (outside(ary) * outside_label)
            + (invalid(ary) * invalid_label)
        )

    predictions = classes(predictions)
    labels = classes(labels)

    accuracy = accuracy_score(labels, predictions)
    # list every class explicitly so the result always has four entries in label
    # order, even when a class (usually invalid) never appears in the batch
    metrics = precision_recall_fscore_support(
        labels,
        predictions,
        labels=[beginning_label, inside_label, outside_label, invalid_label],
        zero_division=0,
    )
    return {
        "accuracy": accuracy,
        "beginning_precision": metrics[0][beginning_label],
        "beginning_recall": metrics[1][beginning_label],
        "beginning_fscore": metrics[2][beginning_label],
        "inside_precision": metrics[0][inside_label],
        "inside_recall": metrics[1][inside_label],
        "inside_fscore": metrics[2][inside_label],
        "outside_precision": metrics[0][outside_label],
        "outside_recall": metrics[1][outside_label],
        "outside_fscore": metrics[2][outside_label],
        "invalid_precision": metrics[0][invalid_label],
        "invalid_recall": metrics[1][invalid_label],
        "invalid_fscore": metrics[2][invalid_label],
    }
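As a quick check of the mapping, here is a standalone sketch that classifies a handful of (start, within) logit pairs. It re-implements the thresholding for a single token in plain Python rather than calling the function above, so `boundary_class` and the toy logits are hypothetical.

```python
from typing import List, Tuple

BEGINNING, INSIDE, OUTSIDE, INVALID = 0, 1, 2, 3


def boundary_class(start_logit: float, within_logit: float) -> int:
    """Map one (start, within) pair of logits onto one of the four classes."""
    is_start = start_logit > 0
    is_within = within_logit > 0
    if is_start and is_within:
        return BEGINNING
    if is_within:
        return INSIDE
    if is_start:
        # start of a link without being within a link
        return INVALID
    return OUTSIDE


# (start, within) logits for five tokens
logits: List[Tuple[float, float]] = [
    (2.1, 3.0),
    (-1.5, 0.7),
    (-0.2, 1.9),
    (-3.0, -2.0),
    (0.4, -0.6),
]
classes = [boundary_class(start, within) for start, within in logits]
```

The last token is the invalid combination, which can only ever appear in the predictions as the labels never mark a start outside a link.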
from typing import *
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction
MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    # each token has 3 label columns, keep the two boundary columns (start, within)
    boundary_labels = preds.label_ids.reshape(-1, 3)[:, :2]
    # the 2 boundary logits come after the vocab_size link logits
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 2)[:, -2:]
    return metric_boundary_bce_4class(boundary_predictions, boundary_labels)


training_args = TrainingArguments(
    report_to=[],
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",
    max_steps=100,
    logging_steps=10,
    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
using `logging_steps` to initialize `eval_steps` to 10
PyTorch: setting up devices
Step | Training Loss | Validation Loss | Accuracy | Beginning Precision | Beginning Recall | Beginning Fscore | Inside Precision | Inside Recall | Inside Fscore | Outside Precision | Outside Recall | Outside Fscore | Invalid Precision | Invalid Recall | Invalid Fscore |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 0.624400 | 0.604486 | 0.904114 | 0.000000 | 0.000000 | 0.000000 | 0.426696 | 0.355839 | 0.388060 | 0.932321 | 0.976110 | 0.953713 | 0.000000 | 0.000000 | 0.000000 |
20 | 0.617700 | 0.587958 | 0.906372 | 0.750000 | 0.029297 | 0.056391 | 0.437500 | 0.319343 | 0.369198 | 0.930673 | 0.980306 | 0.954845 | 0.000000 | 0.000000 | 0.000000 |
30 | 0.594300 | 0.580028 | 0.906006 | 0.762712 | 0.087891 | 0.157618 | 0.428678 | 0.526460 | 0.472563 | 0.949463 | 0.962507 | 0.955940 | 0.000000 | 0.000000 | 0.000000 |
40 | 0.593200 | 0.577678 | 0.911194 | 0.644860 | 0.404297 | 0.496999 | 0.465568 | 0.721715 | 0.566011 | 0.969855 | 0.942813 | 0.956143 | 0.000000 | 0.000000 | 0.000000 |
50 | 0.584700 | 0.570706 | 0.921753 | 0.641791 | 0.587891 | 0.613660 | 0.549356 | 0.700730 | 0.615878 | 0.967593 | 0.949716 | 0.958571 | 0.000000 | 0.000000 | 0.000000 |
60 | 0.581300 | 0.570254 | 0.922546 | 0.655172 | 0.519531 | 0.579521 | 0.558360 | 0.645985 | 0.598985 | 0.961907 | 0.957025 | 0.959460 | 0.000000 | 0.000000 | 0.000000 |
70 | 0.552800 | 0.564304 | 0.922852 | 0.643172 | 0.570312 | 0.604555 | 0.558431 | 0.649635 | 0.600590 | 0.963944 | 0.955333 | 0.959619 | 0.000000 | 0.000000 | 0.000000 |
80 | 0.578900 | 0.564882 | 0.922424 | 0.641975 | 0.609375 | 0.625251 | 0.552790 | 0.668796 | 0.605285 | 0.965811 | 0.952084 | 0.958899 | 0.000000 | 0.000000 | 0.000000 |
90 | 0.579800 | 0.562999 | 0.923218 | 0.655405 | 0.568359 | 0.608787 | 0.554545 | 0.667883 | 0.605960 | 0.965100 | 0.954453 | 0.959747 | 0.000000 | 0.000000 | 0.000000 |
100 | 0.565100 | 0.562203 | 0.923096 | 0.654018 | 0.572266 | 0.610417 | 0.552906 | 0.677007 | 0.608696 | 0.965794 | 0.953506 | 0.959610 | 0.000000 | 0.000000 | 0.000000 |
TrainOutput(global_step=100, training_loss=0.5872174787521363, metrics={'train_runtime': 95.2434, 'train_samples_per_second': 33.598, 'train_steps_per_second': 1.05, 'total_flos': 487796726169600.0, 'train_loss': 0.5872174787521363, 'epoch': 0.02})
So I think that these boundary metrics are far better. They make it much easier to see where the model is strong and weak - not that training for 100 batches is going to really show good performance. It would be good to evaluate the other form of this metric against the cross entropy version of the model to check I have implemented it well.
The Inside / Outside / Beginning 3 class boundary classifier has already been tried, let’s see what the metrics say about it.
# from src/main/python/blog/wikipedia_link/metrics/boundary_iob.py
from typing import Dict
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def metric_boundary_iob(
    predictions: np.ndarray,
    labels: np.ndarray,
    inside_label: int = 1,
    beginning_label: int = 2,
    outside_label: int = 0,
) -> Dict[str, float]:
    # predictions and labels have been flattened and sliced
    # this means that labels is [n] ints and predictions is [n, 3] floats
    predictions = predictions.argmax(axis=1)

    accuracy = accuracy_score(labels, predictions)
    # fix the class order to 0, 1, 2 so the label ids index the result directly,
    # even when a class never appears in the batch
    metrics = precision_recall_fscore_support(
        labels, predictions, labels=[0, 1, 2], zero_division=0
    )
    return {
        "accuracy": accuracy,
        "beginning_precision": metrics[0][beginning_label],
        "beginning_recall": metrics[1][beginning_label],
        "beginning_fscore": metrics[2][beginning_label],
        "inside_precision": metrics[0][inside_label],
        "inside_recall": metrics[1][inside_label],
        "inside_fscore": metrics[2][inside_label],
        "outside_precision": metrics[0][outside_label],
        "outside_recall": metrics[1][outside_label],
        "outside_fscore": metrics[2][outside_label],
    }
# from src/main/python/blog/wikipedia_link/loss/boundary_iob.py
import torch

def calculate_loss_boundary_iob(
    predictions: torch.Tensor,  # [:, n, 3] outside, inside, begin
    labels: torch.Tensor,  # [:, n, 1] class
) -> torch.Tensor:
    """Calculate the loss for the boundary predictions (outside, inside, beginning).
    The predictions are only the boundary predictions.
    The labels are the boundary class of each token."""
    # flatten to one row of 3 logits and one class id per token
    return torch.nn.functional.cross_entropy(
        predictions.reshape(-1, 3), labels.flatten()
    )
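For reference, this is the quantity that `torch.nn.functional.cross_entropy` returns with its default mean reduction: the average negative log softmax probability of the correct class. The sketch below is a standard-library re-implementation on toy logits for illustration only and is not the code used in training.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log softmax probability of the correct class."""
    losses: List[float] = []
    for row, label in zip(logits, labels):
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(row)
        log_normaliser = peak + math.log(sum(math.exp(value - peak) for value in row))
        losses.append(log_normaliser - row[label])
    return sum(losses) / len(losses)


# equal logits spread the probability evenly over the 3 classes, so the loss is log(3)
untrained = cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
# a large logit on the correct class drives the loss towards zero
confident = cross_entropy([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]], [0, 2])
```

A boundary loss near log(3), about 1.1, would mean the classifier has no preference between the three classes.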
# from src/main/python/blog/wikipedia_link/loss/link_bce.py
import torch


def calculate_loss_link_bce(
    predictions: torch.Tensor,  # [:, n, vocab_size]
    boundary_labels: torch.Tensor,  # [:, n, 1] or [:, n, 2] for iob or bce
    link_labels: torch.Tensor,  # [:, n] for index
    token_indices: torch.Tensor,  # index -> 50 tokens
) -> torch.Tensor:
    """Calculate the loss for the link predictions.
    The labels for this are only valid within a link,
    so the boundary_labels are used to spot the links.
    The predictions are only the link target predictions."""
    boundary_labels = boundary_labels.view(-1, boundary_labels.shape[-1])
    mask = boundary_labels.sum(dim=1) > 0
    # keep link_labels one dimensional so that indexing token_indices gives
    # [rows, 50]; a [rows, 1] index would give [rows, 1, 50] and broadcast
    # every row's tokens against every offset
    link_labels = link_labels.reshape(-1)[mask].long()
    rows = link_labels.shape[0]
    vocab_size = predictions.shape[-1]
    predictions = predictions.view(-1, vocab_size)[mask]

    # build the multi-hot targets in a flat buffer where row r owns the slice
    # from r * vocab_size up to (r + 1) * vocab_size
    targets = torch.zeros(vocab_size * rows, device=predictions.device)
    target_offsets = torch.arange(rows, device=predictions.device) * vocab_size
    target_indexes = (token_indices[link_labels] + target_offsets[:, None]).flatten()
    targets[target_indexes] = 1
    return torch.nn.functional.binary_cross_entropy_with_logits(
        predictions, targets.view(-1, vocab_size)
    )
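The intent of the target construction is that every token inside a link gets a multi-hot row over the vocabulary, with a one at each token that describes the linked page. The flat buffer with per-row offsets is just a way to set all of those ones in a single assignment. Here is a small sketch of the same idea in plain Python, with a made-up vocabulary of 6 and 2 tokens per page instead of 50. The helper name and the data are hypothetical.

```python
from typing import List


def multi_hot_targets(
    link_labels: List[int],
    token_indices: List[List[int]],
    vocab_size: int,
) -> List[List[float]]:
    """One multi-hot row per link token, built through a flat buffer."""
    rows = len(link_labels)
    targets = [0.0] * (vocab_size * rows)
    for row, page in enumerate(link_labels):
        # row r owns the slice starting at r * vocab_size
        offset = row * vocab_size
        for token in token_indices[page]:
            targets[offset + token] = 1.0
    # equivalent of targets.view(-1, vocab_size)
    return [
        targets[start : start + vocab_size]
        for start in range(0, len(targets), vocab_size)
    ]


vocab_size = 6
# hypothetical page -> token table, 2 tokens per page
token_indices = [[0, 3], [2, 5], [1, 4]]
# two link tokens, the first pointing at page 1 and the second at page 0
link_labels = [1, 0]
targets = multi_hot_targets(link_labels, token_indices, vocab_size)
```

Each row only contains the tokens of its own page, which is the property the offsets have to preserve.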
# from src/main/python/blog/wikipedia_link/data/boundary_iob.py
from pathlib import Path
from typing import Dict
import pandas as pd
from datasets import Dataset

def load_dataset_iob(test_size: int = 64) -> Dict[str, Dataset]:
    # sorted(...)[-1] picks the last file by name
    df = pd.read_parquet(
        sorted(
            Path("/data/blog/2021-08-21-link-evaluation").glob(  # different folder
                "*.gz.parquet"
            )
        )[-1]
    )
    df = df[["input_ids", "attention_mask", "label"]]
    return Dataset.from_pandas(df).train_test_split(test_size=test_size)
# from src/main/python/blog/wikipedia_link/data/page_tokens.py
from pathlib import Path
import numpy as np
import pandas as pd
import torch

def load_page_tokens(device: torch.device = torch.device("cuda")) -> torch.Tensor:
    token_df = pd.concat(
        [
            pd.read_parquet(path)
            for path in sorted(
                Path("/data/blog/2021-08-01-wikipedia-page-pmi/").glob(
                    "*-pmi.gz.parquet"
                )
            )
        ]
    )
    token_df = token_df.set_index("title")
    # every page is described by 50 tokens, so this is [pages, 50]
    token_indices = np.concatenate(token_df.tokens.values).reshape(-1, 50)
    token_indices = torch.from_numpy(token_indices).long()
    return token_indices.detach().to(device)
# from src/main/python/blog/wikipedia_link/data/title_to_index.py
from pathlib import Path
from typing import Dict
import pandas as pd

def convert_pmi_to_title_index(
    source: Path,
    destination: Path,
) -> None:
    destination.parent.mkdir(exist_ok=True, parents=True)
    if destination.exists():
        print(f"Skipping title-to-index aggregation, already exists at {destination}")
        return

    df = pd.read_parquet(source)[["title"]]
    df = df.sort_values(by="title")
    df = df.reset_index()
    df = df.set_index("title")
    df.to_parquet(destination, compression="gzip")


def load_title_to_index() -> Dict[str, int]:
    title_to_index = pd.read_parquet(
        "/data/blog/2021-07-30-wikipedia-data-generation/title-to-index.gz.parquet"
    )
    return title_to_index["index"].to_dict()
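One detail worth spelling out: `reset_index` is called after the sort, so the `index` column holds each title's row position in the source file rather than its rank in the sorted order. That reading assumes the source frame has the default integer index. The sketch below mimics the behaviour in plain Python with made-up titles. It is an illustration and not the pandas code itself.

```python
from typing import Dict, List


def title_to_index(titles: List[str]) -> Dict[str, int]:
    """Map each title to its row position in the original, unsorted list."""
    # pair every title with its original position, then order the pairs by title
    indexed = sorted(enumerate(titles), key=lambda pair: pair[1])
    return {title: index for index, title in indexed}


titles = ["Zebra", "Apple", "Mango"]
mapping = title_to_index(titles)
```

The keys come out in alphabetical order while the values still point back at the original rows, so "Zebra" maps to 0 even though it sorts last.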
# from src/main/python/blog/wikipedia_link/model/bart_boundary_iob.py
import torch
from transformers import BartConfig, BartForConditionalGeneration


class BartLinksBoundaryIOB(BartForConditionalGeneration):
    def __init__(self, config: BartConfig) -> None:
        super().__init__(config)
        # despite the name, this head produces the 3 boundary logits
        self.link_head = torch.nn.Linear(
            in_features=config.d_model, out_features=3, bias=True
        )
        self.token_indices = None

    @staticmethod
    def boundary_loss(
        predictions: torch.Tensor,  # [:, n, 3] outside, inside, begin
        labels: torch.Tensor,  # [:, n, 1] class: outside, inside, begin
    ) -> torch.Tensor:
        raise NotImplementedError()

    @staticmethod
    def link_loss(
        predictions: torch.Tensor,  # [:, n, vocab_size]
        boundary_labels: torch.Tensor,  # [:, n, 1] class: outside, inside, begin
        link_labels: torch.Tensor,  # [:, n] for index
        token_indices: torch.Tensor,  # index -> 50 tokens
    ) -> torch.Tensor:
        raise NotImplementedError()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        labels=None,
    ):
        assert self.token_indices is not None, "Model misconfigured, set token_indices"

        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
        )
        link_logits = self.lm_head(outputs[0]) + self.final_logits_bias
        boundary_logits = self.link_head(outputs[0])
        # the output is [:, n, vocab_size + 3] with the boundary logits last
        logits = torch.cat(
            [
                link_logits,
                boundary_logits,
            ],
            dim=-1,
        )

        # Including the full outputs means more stuff gets passed to the metrics method.
        # Keeping the output just the logits or loss and logits makes metrics easier.
        # The base model uses this approach:
        # output = (logits,) + outputs[1:]

        if labels is not None:
            # labels is [:, n, 2]: the boundary class followed by the link index
            boundary_loss = self.boundary_loss(
                predictions=boundary_logits, labels=labels[:, :, :1]
            )
            link_loss = self.link_loss(
                predictions=link_logits,
                boundary_labels=labels[:, :, :1],
                link_labels=labels[:, :, 1],
                token_indices=self.token_indices,
            )
            loss = boundary_loss + link_loss
            return (loss, logits)
        return (logits,)
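Because `forward` concatenates the two sets of logits, anything downstream has to slice them apart again before it can score the boundaries. A minimal sketch of that step on nested lists, assuming a toy vocabulary of 2 followed by the 3 boundary logits in outside, inside, beginning order. The helper name and the numbers are made up for illustration.

```python
from typing import List

# [batch, sequence, vocab_size + boundary_size]
Batch = List[List[List[float]]]


def boundary_predictions(logits: Batch, boundary_size: int) -> List[int]:
    """Flatten to one row per token, keep the trailing boundary logits, argmax them."""
    classes: List[int] = []
    for sequence in logits:
        for token_logits in sequence:
            boundary = token_logits[-boundary_size:]
            classes.append(max(range(boundary_size), key=boundary.__getitem__))
    return classes


# 2 link logits then 3 boundary logits per token, 2 sequences of 2 tokens
logits = [
    [[0.3, 0.1, 2.0, 0.1, 0.2], [0.9, 0.4, 0.1, 0.2, 1.5]],
    [[0.2, 0.8, 0.1, 1.1, 0.3], [0.5, 0.5, 3.0, -1.0, -2.0]],
]
classes = boundary_predictions(logits, boundary_size=3)
```

The result has one class per token across the whole batch, which is the flattened shape that the metric functions expect.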
#hide_output
from transformers import AutoTokenizer
token_indices = load_page_tokens()
title_to_index = load_title_to_index()
split = load_dataset_iob(test_size=BATCH_SIZE*2)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BartLinksBoundaryIOB.from_pretrained(MODEL_NAME)
model.token_indices = token_indices
model.boundary_loss = calculate_loss_boundary_iob
model.link_loss = calculate_loss_link_bce
Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/facebook/bart-base/resolve/main/config.json from cache at /home/matthew/.cache/huggingface/transformers/f5310d276a6d1648d00c32fadc8bf7b4607e0fbd5b404fc4a0045960aa2bdfdb.8512cdf8592f538a7fd4b40eecaa096285410ec6494049568b3300922ab71165
Model config BartConfig {
"activation_dropout": 0.1,
"activation_function": "gelu",
"add_bias_logits": false,
"add_final_layer_norm": false,
"architectures": [
"BartModel"
],
"attention_dropout": 0.1,
"bos_token_id": 0,
"classif_dropout": 0.1,
"classifier_dropout": 0.0,
"d_model": 768,
"decoder_attention_heads": 12,
"decoder_ffn_dim": 3072,
"decoder_layerdrop": 0.0,
"decoder_layers": 6,
"decoder_start_token_id": 2,
"dropout": 0.1,
"early_stopping": true,
"encoder_attention_heads": 12,
"encoder_ffn_dim": 3072,
"encoder_layerdrop": 0.0,
"encoder_layers": 6,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_position_embeddings": 1024,
"model_type": "bart",
"no_repeat_ngram_size": 3,
"normalize_before": false,
"normalize_embedding": true,
"num_beams": 4,
"num_hidden_layers": 6,
"pad_token_id": 1,
"scale_embedding": false,
"task_specific_params": {
"summarization": {
"length_penalty": 1.0,
"max_length": 128,
"min_length": 12,
"num_beams": 4
},
"summarization_cnn": {
"length_penalty": 2.0,
"max_length": 142,
"min_length": 56,
"num_beams": 4
},
"summarization_xsum": {
"length_penalty": 1.0,
"max_length": 62,
"min_length": 11,
"num_beams": 6
}
},
"transformers_version": "4.9.2",
"use_cache": true,
"vocab_size": 50265
}
loading file https://huggingface.co/facebook/bart-base/resolve/main/vocab.json from cache at /home/matthew/.cache/huggingface/transformers/43978bdeaa326572886b44fcfed82f932f76571095ce31973e51c3da8ccade7f.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
loading file https://huggingface.co/facebook/bart-base/resolve/main/merges.txt from cache at /home/matthew/.cache/huggingface/transformers/3c167ed8af56e6605eeb794b63a79d65d85e6708c9b04408d41946337030f5cd.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/facebook/bart-base/resolve/main/tokenizer.json from cache at /home/matthew/.cache/huggingface/transformers/a878fcd69bba037c9b1b227f4213579ae43d0aaa9374e167bc6c5f41b1cfeb30.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
loading file https://huggingface.co/facebook/bart-base/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/bart-base/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/bart-base/resolve/main/tokenizer_config.json from cache at None
loading weights file https://huggingface.co/facebook/bart-base/resolve/main/pytorch_model.bin from cache at /home/matthew/.cache/huggingface/transformers/486355ec722ef05fd480e999d4c763be56549ae930f6a3742ee721a5d2a05647.9faea28a6782a9589c09b1942c039943df02232d83d2ac288a69ddfa928eae22
All model checkpoint weights were used when initializing BartLinksBoundaryIOB.
Some weights of BartLinksBoundaryIOB were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['link_head.bias', 'link_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from typing import *
from pathlib import Path
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction
MODEL_RUN_FOLDER = Path("/data/blog/2021-08-30-wikipedia-metrics/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)

def compute_metrics(preds: EvalPrediction) -> Dict[str, float]:
    # each token has 2 label columns, the first is the boundary class
    boundary_labels = preds.label_ids.reshape(-1, 2)[:, 0]
    # the 3 boundary logits come after the vocab_size link logits
    boundary_predictions = preds.predictions.reshape(-1, tokenizer.vocab_size + 3)[:, -3:]
    return metric_boundary_iob(boundary_predictions, boundary_labels)


training_args = TrainingArguments(
    report_to=[],
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    evaluation_strategy="steps",
    logging_dir=MODEL_RUN_FOLDER / "output",
    max_steps=100,
    logging_steps=10,
    # not really training properly here
    # load_best_model_at_end=True,
    # metric_for_best_model="quality",
    # greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
using `logging_steps` to initialize `eval_steps` to 10
PyTorch: setting up devices
Step | Training Loss | Validation Loss | Accuracy | Beginning Precision | Beginning Recall | Beginning Fscore | Inside Precision | Inside Recall | Inside Fscore | Outside Precision | Outside Recall | Outside Fscore |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 1.479900 | 0.935388 | 0.899658 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.899713 | 0.999932 | 0.947179 |
20 | 0.839800 | 0.795681 | 0.899719 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.899719 | 1.000000 | 0.947213 |
30 | 0.750300 | 0.721273 | 0.901245 | 0.000000 | 0.000000 | 0.000000 | 0.595420 | 0.068602 | 0.123028 | 0.903710 | 0.996405 | 0.947796 |
40 | 0.698600 | 0.680169 | 0.907898 | 0.000000 | 0.000000 | 0.000000 | 0.622018 | 0.298153 | 0.403092 | 0.917735 | 0.986093 | 0.950687 |
50 | 0.671300 | 0.665654 | 0.910828 | 0.727273 | 0.063241 | 0.116364 | 0.668750 | 0.282322 | 0.397032 | 0.918663 | 0.988400 | 0.952256 |
60 | 0.666300 | 0.651620 | 0.920044 | 0.610860 | 0.533597 | 0.569620 | 0.620942 | 0.521548 | 0.566922 | 0.948222 | 0.964046 | 0.956068 |
70 | 0.656500 | 0.641215 | 0.921082 | 0.641026 | 0.444664 | 0.525088 | 0.606164 | 0.622691 | 0.614317 | 0.952439 | 0.960450 | 0.956428 |
80 | 0.634100 | 0.642581 | 0.921692 | 0.599206 | 0.596838 | 0.598020 | 0.599219 | 0.674582 | 0.634671 | 0.961096 | 0.951903 | 0.956477 |
90 | 0.620200 | 0.641026 | 0.916504 | 0.574627 | 0.608696 | 0.591171 | 0.563187 | 0.721196 | 0.632472 | 0.964981 | 0.942134 | 0.953421 |
100 | 0.644800 | 0.638462 | 0.918945 | 0.580038 | 0.608696 | 0.594021 | 0.580364 | 0.701847 | 0.635350 | 0.963531 | 0.946340 | 0.954858 |
TrainOutput(global_step=100, training_loss=0.7661756467819214, metrics={'train_runtime': 92.7329, 'train_samples_per_second': 34.508, 'train_steps_per_second': 1.078, 'total_flos': 487800505958400.0, 'train_loss': 0.7661756467819214, 'epoch': 0.02})
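This works and it gets a marginally better inside F-score than the 4 class metrics did (0.635 against 0.609 at step 100), although the accuracy and the beginning F-score are slightly lower. When I trained the model with this approach it did very poorly though, so it will be interesting to see what the metrics are for a real training run.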
I’m going to work on the link description metrics in another post as this is quite long.