Creating a Train Dataset for Cross Language WSI

Using Wikipedia and Wikidata to link other languages to English articles
prompt internalization
multilingual prompt internalization
cross language word sense induction
Published

September 5, 2022

I want to create a model that can understand the meaning of things in many languages. To do this I have to have a baseline for what a thing is. I’m using Wikipedia articles as both the list of things I recognize and the definition of what they are. When some text on Wikipedia links to an article, I can use that text as a description of the target article. Aggregating these descriptions then gives me a definition for the article.

I want to train a model to produce these descriptions for text in any language. To do this I need text in each language that mentions things whose definitions I know. I previously tried to do this with a parallel sentence dataset, but that led to some problems.

The Tatoeba dataset has sentences, but they are quite short. To work with them I had to identify the nouns in them, which I did with part of speech tagging. Since I had no way to associate the nouns, I relied on having a single noun in each sentence. I also had no canonical form of the noun, so the teacher was providing a target based on a single short example. This led to a model that would not distinguish between different tokens in the input, and was not used to handling long sequences.

Since I am processing a lot of Wikipedia data I now have a way to create a much better training dataset. The inputs can be much longer and they can have many targets in them. This does involve quite a lot of data processing, which I am going to cover in this post.

Code
import blog.transformers_logging
Code
from pathlib import Path

INTERIM_FOLDER = Path("/data/prompt-internalization/multilingual/wikipedia/interim")
PROCESSED_FOLDER = Path("/data/prompt-internalization/multilingual/wikipedia/processed")

Data Processing

The Wikipedia and Wikidata data files are bz2 encoded xml. Code to handle this has already been written as part of coming up with the English article definitions. I can use that as a base to handle the training data.

This post is going to focus on the extraction of the training data, which requires the text with links and a way to resolve the links to the English article definitions.
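For readers who have not worked with these dumps, here is a minimal sketch of streaming pages out of a bz2 compressed XML file. It is not the code used for this post: the function name `iter_pages` is hypothetical, and it assumes each `<page>` element holds a `<title>` and a `<text>` element, possibly inside an XML namespace.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for every <page> in a bz2 compressed dump."""
    with bz2.open(stream, "rb") as xml_file:
        for _event, element in ET.iterparse(xml_file, events=("end",)):
            # tags may carry a namespace prefix like "{...}page", so strip it
            if element.tag.rsplit("}", 1)[-1] != "page":
                continue
            title, text = "", ""
            for child in element.iter():
                name = child.tag.rsplit("}", 1)[-1]
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # free the memory held by the page that has just been processed
            element.clear()


# a tiny in-memory dump (made up for illustration) standing in for the real file
sample = (
    b"<mediawiki><page><title>Q1</title>"
    b"<revision><text>{}</text></revision></page></mediawiki>"
)
pages = list(iter_pages(io.BytesIO(bz2.compress(sample))))
```

Streaming with `iterparse` keeps memory use flat, which matters because the real dumps are far too large to load in one go.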

Mapping Article Name

Finding the English article name for a link that is in a different language can be done using Wikidata, as a Wikidata entry will contain links to the Wikipedia pages for the topic in different languages (you can see this on the right hand side of this page).

The Wikidata entries look like this:

<page>
  <title>Q10222280</title>
  <ns>0</ns>
  <id>11495715</id>
  <revision>
    <id>1010802678</id>
    <parentid>838728765</parentid>
    <timestamp>2019-09-10T00:22:02Z</timestamp>
    <contributor>
      <username>Edoderoo</username>
      <id>7150</id>
    </contributor>
    <comment>/* wbeditentity-update:0| */ https://www.wikidata.org/w/index.php?title=Wikidata:Bot_requests&amp;oldid=1007509180 Wikimedia-kategori</comment>
    <model>wikibase-item</model>
    <format>application/json</format>
    <text bytes="12680" xml:space="preserve">... xml encoded json ...</text>
    <sha1>n35r4gz4obbpsw97p1qvfzghknsb189</sha1>
  </revision>
</page>

The most interesting part of this is the xml encoded json which contains the titles of the Wikipedia pages:

{
  "type": "item",
  "id": "Q10222285",
  "labels": {
    "sv": {
      "language": "sv",
      "value": "Kategori:Ilithucia"
    },
    "ceb": {
      "language": "ceb",
      "value": "Kategoriya:Ilithucia"
    },
    "war": {
      "language": "war",
      "value": "Kaarangay:Ilithucia"
    },
    "en": {
      "language": "en",
      "value": "Category:Ilithucia"
    },
    "bg": {
      "language": "bg",
      "value": "Категория:Ilithucia"
    },
    "it": {
      "language": "it",
      "value": "Categoria:Ilithucia"
    }
  },
  "descriptions": {
    "es": {
      "language": "es",
      "value": "categoría de Wikimedia"
    },

Here the labels entry has the Wikipedia page title for this entry in different languages. By reading this we can create a mapping between the different languages and the English article title.
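As a rough illustration of reading one entity, the sketch below pulls per-site titles out of the decoded JSON. It assumes the entity carries a `sitelinks` entry keyed by site name (such as `enwiki`), which is where Wikidata records the linked Wikipedia page titles. The function name `title_rows`, the default site list, the lowercasing, and the example entity are assumptions for illustration and not the code used for this post.

```python
import json
from typing import Iterator, Sequence, Tuple


def title_rows(
    entity_json: str, sites: Sequence[str] = ("eswiki", "frwiki", "itwiki")
) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, site, target) rows linking other wikis to the English title."""
    entity = json.loads(entity_json)
    sitelinks = entity.get("sitelinks", {})
    if "enwiki" not in sitelinks:
        return  # no English article, so there is nothing to map to
    target = sitelinks["enwiki"]["title"].lower()
    for site in sites:
        if site in sitelinks:
            yield sitelinks[site]["title"].lower(), site, target


# a hand written entity, not copied from a real Wikidata entry
example = json.dumps(
    {
        "type": "item",
        "sitelinks": {
            "enwiki": {"site": "enwiki", "title": "Landlocked country"},
            "eswiki": {"site": "eswiki", "title": "Estado sin litoral"},
        },
    }
)
rows = list(title_rows(example))
```

Entities without an English page yield nothing, so they never enter the mapping.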

I’ve created such a mapping:

Code
import pandas as pd

title_df = pd.read_parquet(INTERIM_FOLDER / "wikidatawiki/20220701/titles.gz.parquet")
title_df[title_df.title.str.len() > 5]
         title                      site    target
0        ! (álbum de trippie redd)  eswiki  ! (trippie redd album)
1        ! (trippie redd)           itwiki  ! (trippie redd album)
2        ! (альбом trippie redd)    ruwiki  ! (trippie redd album)
11       !oka tokat                 itwiki  !oka tokat
12       !oka tokat                 ptwiki  !oka tokat
...      ...                        ...     ...
3833552  класс ♯p                   ruwiki  ♯p
3833553  numeral-p-completo         eswiki  ♯p-complete
3833554  sharp-p-complet            frwiki  ♯p-complete
3833555  sharp-p-completo           itwiki  ♯p-complete
3833556  p-sharp completude         ptwiki  ♯p-complete

3645941 rows × 3 columns

With this it is now possible to take an article title in one language and map it back to the English article. I’m using the same languages as before, but this could be done with any language that has reasonable Wikipedia support.

Creating Dataset

The dataset can be made from the text in the Wikipedia articles of different languages. A link from one of these articles can be used only if it exists in the mapping and the English article it maps to has a description. Finally, at least two usable links must be present in each input row.
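These rules can be sketched as a small filter. This is only an illustration under stated assumptions: the name `resolve_links`, the `(start, end, title)` link layout, the mapping keyed by `(site, title)`, and the example data are all hypothetical and may differ from the actual processing code.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[int, int, str]  # (start token, end token, linked title or target)


def resolve_links(
    links: List[Link],
    site: str,
    title_to_english: Dict[Tuple[str, str], str],
    described: Set[str],
    minimum_links: int = 2,
) -> Optional[List[Link]]:
    """Map links to English articles, or return None if too few survive."""
    resolved = []
    for start, end, title in links:
        target = title_to_english.get((site, title.lower()))
        # drop links that are unmapped or whose English article has no description
        if target is None or target not in described:
            continue
        resolved.append((start, end, target))
    if len(resolved) < minimum_links:
        return None
    return resolved


# made up example data
mapping = {
    ("eswiki", "micro-estado"): "microstate",
    ("eswiki", "europa"): "europe",
    ("eswiki", "españa"): "spain",
}
described = {"europe", "spain"}
links = [(14, 18, "Micro-Estado"), (30, 31, "Europa"), (33, 34, "España")]
kept = resolve_links(links, "eswiki", mapping, described)
```

Here the microstate link is dropped because its English article has no description, and the row is kept because two links remain.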

I’ve done something similar for the English article definitions, so a lot of that code can be reused. The English article descriptions only have a single link per row, so that will be the major change. For the student I want to maximize the number of links in an input, and since there are millions of articles available I expect to generate a single row per article.

There are problems with this as the English article referred to has to have a valid description. It’s very expensive to calculate the descriptions for the different articles. If I calculate every single one then it could take weeks of GPU time.

To make the process more efficient I can determine the English articles which are referred to by the student datasets and describe only those articles. By filtering it down to the most popular articles I can cut down the number of articles that need to be described. If I have one hundred thousand English articles to use then that should allow a large enough training dataset without spending too long on set up.
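Picking the most referenced English articles is a counting problem. A minimal sketch, assuming each student row has already been reduced to a list of English target titles; the function name `most_referenced` and the example rows are hypothetical:

```python
from collections import Counter
from typing import Iterable, List


def most_referenced(rows: Iterable[List[str]], limit: int = 100_000) -> List[str]:
    """Return the English articles that are linked to most often."""
    counts = Counter(target for targets in rows for target in targets)
    return [target for target, _count in counts.most_common(limit)]


# made up rows of English targets
rows = [["spain", "france"], ["spain", "andorra"], ["spain", "france"]]
popular = most_referenced(rows, limit=2)
```

Only the articles returned here would then need a description, which bounds the GPU time spent on set up.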

I’ve created such a dataset for Spanish Wikipedia and the data looks like this:

Code
import pandas as pd

df = pd.read_parquet(
    INTERIM_FOLDER / "eswiki/20220701/student/dataset/article-000-000000000.gz.parquet"
)
df
input_ids targets
0 [0, 6, 124180, 4, 197594, 79680, 1138, 8, 6, 1... [{'end': 18, 'start': 14, 'target': 'microstat...
1 [0, 1832, 19265, 124851, 220, 2855, 41767, 381... [{'end': 10, 'start': 9, 'target': 'climate'},...
2 [0, 540, 79680, 1138, 8, 6, 124180, 15, 19, 66... [{'end': 25, 'start': 23, 'target': 'southern ...
3 [0, 6, 162518, 11598, 5, 1388, 198, 8, 8156, 3... [{'end': 11, 'start': 10, 'target': 'spain'}, ...
4 [0, 5599, 57252, 7, 136749, 84891, 110, 15636,... [{'end': 282, 'start': 278, 'target': 'composi...
... ... ...
922 [0, 1818, 9641, 13085, 21, 9596, 40, 3814, 855... [{'end': 32, 'start': 29, 'target': 'aragon'},...
923 [0, 1657, 88, 12024, 146, 7493, 113, 21376, 10... [{'end': 107, 'start': 96, 'target': 'national...
924 [0, 3731, 121218, 124716, 115723, 1183, 159175... [{'end': 23, 'start': 21, 'target': 'the corrs...
925 [0, 503, 82687, 533, 435, 53251, 516, 10, 2124... [{'end': 32, 'start': 28, 'target': 'symmetry ...
926 [0, 7244, 40266, 198, 51, 3128, 8, 66708, 8, 5... [{'end': 12, 'start': 5, 'target': 'database m...

927 rows × 2 columns

This isn’t very readable so let’s try expanding the first row.

Code
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

row = df.iloc[0]
print(tokenizer.decode(row.input_ids)[:256] + "...")
print()

for term in row.targets[:3]:
    target = term["target"]
    text = tokenizer.decode(row.input_ids[term["start"]:term["end"]])
    print(f"{target} appears as {text}")
print(f"... {len(row.targets) - 3} other links")

pd.DataFrame(row.targets.tolist()).target.value_counts()
<s> Andorra, oficialmente Principado de Andorra (), es un micro-Estado soberano sin litoral ubicado en el suroeste de Europa, entre España y Francia, en el límite de la península ibérica. Se constituye en Estado independiente, de derecho, democrático y soc...

microstate appears as micro-Estado
landlocked country appears as sin litoral
europe appears as Europa
... 32 other links
spain                               3
microstate                          2
france                              2
andorra la vella                    2
roman catholic diocese of urgell    1
tourism                             1
world war ii                        1
flood                               1
emergency                           1
french language                     1
portuguese language                 1
spanish language                    1
catalan language                    1
prime minister of andorra           1
head of government                  1
president of france                 1
head of state                       1
co-princes of andorra               1
landlocked country                  1
pyrénées-orientales                 1
ariège (department)                 1
province of lleida                  1
catalonia                           1
pyrenees                            1
principality                        1
democracy                           1
state (polity)                      1
iberian peninsula                   1
europe                              1
tax haven                           1
Name: target, dtype: int64

Now we can see that this is working quite well. The entry has been extracted from the Andorra article on Spanish Wikipedia and we have the English article links for each term. All of this fits within a single model input and we have 35 reasonably diverse links available to train with.

With this I can then create something more suitable for training. That means creating a regular sized label of integers and limiting the targets to those that have descriptions.

As there is quite a lot of data available I am restricting the rows to those that have between 5 and 10 valid targets. Every row then carries exactly ten labels, as a consistent size is required for batching, without wasting too much space on padding. I’ve set that up and it looks like this:
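The restriction and the fixed label size can be sketched as follows. Padding with rows of -1 is an assumption here, as are the name `to_label` and the `[start, end, article index]` layout.

```python
from typing import List, Optional

PAD = [-1, -1, -1]  # assumed padding row for a missing label


def to_label(
    targets: List[List[int]], minimum: int = 5, maximum: int = 10
) -> Optional[List[List[int]]]:
    """Pad [start, end, article index] triples to a fixed number of rows."""
    if not minimum <= len(targets) <= maximum:
        return None  # rows with too few or too many targets are dropped
    padding = [list(PAD) for _ in range(maximum - len(targets))]
    return targets + padding


label = to_label([[37, 41, 4664], [49, 56, 1384], [70, 72, 8386], [80, 81, 12], [90, 93, 7]])
```

With between 5 and 10 real targets per row, at most half of each label is padding.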

Code
import pandas as pd

df = pd.read_parquet(PROCESSED_FOLDER / "20220701/student/valid-dataset.gz.parquet")
df
input_ids label
0 [0, 67538, 503, 51086, 4, 6, 4, 6, 4, 6, 4, 6,... [[37, 41, 4664], [49, 56, 1384], [70, 72, 8386...
1 [0, 180, 1657, 85, 246, 8, 83366, 5076, 393, 2... [[16, 17, 9006], [17, 20, 9070], [21, 27, 9535...
2 [0, 786, 771, 66847, 223, 110536, 7, 393, 788,... [[11, 14, 4269], [14, 18, 1143], [18, 21, 6794...
3 [0, 10250, 538, 16615, 7, 332, 519, 164, 198, ... [[10, 11, 8224], [16, 17, 3069], [24, 25, 6310...
4 [0, 241, 634, 89408, 57282, 7118, 2069, 6896, ... [[15, 16, 6785], [16, 18, 8467], [111, 115, 26...
... ... ...
9995 [0, 44532, 865, 15, 69990, 587, 12, 527, 10593... [[4, 6, 6775], [20, 23, 6767], [25, 28, 9425],...
9996 [0, 180, 113666, 31, 8, 46932, 161808, 158850,... [[67, 70, 2186], [89, 92, 7478], [184, 186, 74...
9997 [0, 188075, 395, 9903, 178434, 46, 33, 50648, ... [[7, 10, 9030], [23, 26, 9029], [31, 32, 217],...
9998 [0, 11852, 90565, 93, 1391, 127, 188, 15, 2856... [[15, 16, 178], [19, 20, 5795], [20, 23, 3184]...
9999 [0, 2758, 5708, 76, 393, 286, 59403, 48, 20833... [[8, 9, 8208], [10, 11, 7087], [15, 17, 958], ...

10000 rows × 2 columns

Again let’s explore the first row to check that I have done this correctly.

Code
import numpy as np
import pandas as pd
from transformers import AutoTokenizer

index_to_article = (
    pd.read_parquet(
        PROCESSED_FOLDER / "20220701/article-descriptions.gz.parquet",
        columns=["target"]
    ).target.to_dict()
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

row = df.iloc[0]
print(tokenizer.decode(row.input_ids)[:256] + "...")
print()

targets = np.array(row.label.tolist())
for (start, end, target) in targets[:3]:
    target = index_to_article[target]
    text = tokenizer.decode(row.input_ids[start:end])
    print(f"{target} appears as {text}")
print(f"... {sum(targets[:, 0] != -1) - 3} other links")
<s> Johann Segner,,,,, (9 de octubre de 1704 - 5 de octubre de 1777) fue un científico húngaro. Nacido en el Reino de Hungría, en la entonces ciudad húngara de Pozsony/Presburgo (hoy Bratislava), sus antepasados habían emigrado ahí desde Estiria en el sigl...

kingdom of hungary appears as Reino de Hungría
bratislava appears as Pozsony/Presburgo
styria appears as Estiria
... 6 other links

There are a lot of commas at the start of this. The Wikipedia page contains a list of his name in different languages. While it would be nice to clean this up, I want to see how well the student model can perform. To do that I need to create a trainer.

Training

Code
from pathlib import Path

ARTICLE_FILE = PROCESSED_FOLDER / "20220701" / "article-descriptions.gz.parquet"
DATASET_FOLDER = PROCESSED_FOLDER / "20220701" / "student"
MODEL_FOLDER = Path("/data/prompt-internalization/multilingual/models/wikipedia")
RUN_FOLDER = Path("/tmp/runs")

MODEL_FOLDER.mkdir(parents=True, exist_ok=True)
RUN_FOLDER.mkdir(parents=True, exist_ok=True)

Distance Training

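This will use the weighted distance between two points as the loss. The first point is the softmax of the model output at the first token of a link, restricted to the token indices stored for the target article. The second point is the mean description of that article, and each dimension is weighted by the inverse of its standard deviation.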
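To make the measure concrete, here is the same calculation as `ArticleMeasure.distance` written in plain Python for a toy vocabulary of three tokens. The numbers are made up, and the function name `weighted_distance` is hypothetical.

```python
import math
from typing import Sequence


def weighted_distance(
    probabilities: Sequence[float],
    indices: Sequence[int],
    mean: Sequence[float],
    std: Sequence[float],
) -> float:
    """Euclidean norm of (output - mean) / std over the article's token indices."""
    total = 0.0
    for index, mu, sigma in zip(indices, mean, std):
        total += ((probabilities[index] - mu) / sigma) ** 2
    return math.sqrt(total)


# softmax output over a three token vocabulary
probabilities = [0.5, 0.25, 0.25]
# the article description only covers tokens 0 and 2
distance = weighted_distance(
    probabilities, indices=[0, 2], mean=[0.25, 0.25], std=[0.25, 0.5]
)
```

Token 0 is one standard deviation away from the article mean and token 2 matches it exactly, so the distance is 1. Dimensions with a small standard deviation contribute more, which is the intent of the weighting.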

Code
from itertools import starmap
from typing import Any, Dict, List, Optional, Tuple, Union
from pathlib import Path

import pandas as pd
import datasets
import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
from transformers.modeling_outputs import MaskedLMOutput

class ArticleTrainingArguments(TrainingArguments):
    def __init__(
        self,
        *args,
        temperature: float = 2.0,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.temperature = temperature


class ArticleMeasure:
    def __init__(self, file: Path) -> None:
        df = pd.read_parquet(file)
        self.indices = [
            torch.tensor(values, dtype=torch.long)
            for values in df["indices"]
        ]
        self.mean = [
            torch.tensor(values)
            for values in df["mean"]
        ]
        self.weight = [
            torch.tensor(1 / values)
            for values in df["std"]
        ]

    def to(self, device) -> None:
        self.indices = [entry.to(device) for entry in self.indices]
        self.mean = [entry.to(device) for entry in self.mean]
        self.weight = [entry.to(device) for entry in self.weight]

    def distance(self, output: torch.Tensor, index: int) -> torch.Tensor:
        output = output[self.indices[index]]
        return torch.linalg.norm(
            (output - self.mean[index]) * self.weight[index]
        )

class ArticleTrainer(Trainer):
    def __init__(
        self,
        *args,
        article_file: Optional[Path] = None,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.measure = ArticleMeasure(article_file)
        self.measure.to(self.model.device)

    def compute_loss(
        self,
        model: AutoModelForMaskedLM,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        outputs: MaskedLMOutput = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        # distances are -1 for the missing labels
        distances: torch.Tensor = self.distances(outputs.logits, labels=inputs["labels"])
        loss: torch.Tensor = self.loss(distances)

        if not return_outputs:
            return loss
        return loss, distances

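    def distances(self, outputs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # every distance starts at -1 so that the loss can skip missing labels
        result = torch.ones(size=(outputs.shape[0], labels.shape[1]), dtype=torch.float, device=outputs.device)
        result = result * -1
        for row_index, (row, row_labels) in enumerate(zip(outputs, labels)):
            for label_index, (start, _, target) in enumerate(row_labels):
                if target < 0:
                    # padded label (assumed to be -1): without this check start == -1
                    # would select the last token and target == -1 the last article
                    continue
                # only the first token of the link is used as the prediction
                logits = row[start]
                logits = logits.softmax(dim=0)
                result[row_index, label_index] = self.measure.distance(logits, target)
        return result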
    
    def loss(self, distances: torch.Tensor) -> torch.Tensor:
        # should this be kldiv instead?
        loss = torch.tensor(0., dtype=torch.float, device=distances.device)
        count = 0
        for row_distances in distances:
            for label_distance in row_distances:
                if label_distance < 0:
                    continue
                loss += label_distance
                count += 1
        loss = loss / count
        return loss



def compute_metrics(model_output: EvalPrediction) -> Dict[str, float]:
    # distance is just loss already
    kl_div = model_output.predictions[:, 0].mean()
    overlap = model_output.predictions[:, 1].mean()
    return {
        "kl_div": kl_div,
        "overlap": overlap,
    }



def train(
    *,
    model_name: str = "xlm-roberta-base",
    # dataset_name: str = "xlm-roberta",
    batch_size: int = 32,
    learning_rate: float = 1e-4,
    # temperature: float = 2,
    fp16: bool = False,
    # mean_prediction: bool = False,
    # ignore_tokens: Optional[List[int]] = None,
    epochs: Optional[float] = 2,
    max_steps: int = -1,
    evaluation_steps: int = 500,
    article_file: Optional[Path] = None,
) -> Path:
    assert article_file is not None
    run_name = "-".join(
        [
            f"{model_name}",
            f"e{epochs}" if max_steps == -1 else f"ms{max_steps}",
            f"bs{batch_size}",
            f"lr{learning_rate}",
            # f"t{temperature}",
        ]
        + (["fp16"] if fp16 else [])
        # + (["mean"] if mean_prediction else [])
        # + ([f"it{len(ignore_tokens)}"] if ignore_tokens else [])
    )
    print(f"Starting {run_name}")
    train_ds = datasets.load_from_disk(DATASET_FOLDER / "train.dataset")
    test_ds = datasets.load_from_disk(DATASET_FOLDER / "valid.dataset")

    training_args = ArticleTrainingArguments(
        report_to="none",
        output_dir=RUN_FOLDER,
        num_train_epochs=epochs,
        max_steps=max_steps,
        seed=33,
        # number of steps before moving evaluation results from GPU to CPU see
        # https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
        eval_accumulation_steps=5,
        #
        # hyperparameters
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=fp16,
        # temperature=temperature,
        # mean_prediction=mean_prediction,
        # ignore_tokens=ignore_tokens,
        learning_rate=learning_rate,
        #
        # evaluation settings
        evaluation_strategy="steps",
        logging_steps=evaluation_steps,
        eval_steps=evaluation_steps,
        save_steps=evaluation_steps,
        #
        # checkpoint settings
        logging_dir=RUN_FOLDER / "logs",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        # remove_unused_columns=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    trainer = ArticleTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        data_collator=data_collator,
        tokenizer=tokenizer,
        # compute_metrics=compute_metrics,
        article_file=article_file,
    )

    trainer.train()
    model.save_pretrained(MODEL_FOLDER / run_name)

    return MODEL_FOLDER / run_name
Code
model_path = train(
    model_name="xlm-roberta-base",
    batch_size=8,
    learning_rate=1e-4,
    # epochs=20,
    max_steps=10_000,
    evaluation_steps=1_000,
    article_file=ARTICLE_FILE,
)
Starting xlm-roberta-base-ms10000-bs8-lr0.0001
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[10000/10000 2:59:18, Epoch 0/1]
Step Training Loss Validation Loss
1000 47.021400 16.955563
2000 16.655400 16.955563
3000 16.682800 16.955563
4000 16.690500 16.955563
5000 16.685800 16.955563
6000 16.646200 16.955563
7000 16.699800 16.955563
8000 16.687000 16.955563
9000 16.685600 16.955563
10000 16.684500 16.955563

There is a problem here. The validation loss never changes, and the training loss stays at about 16.7 from step 2000 onwards. I think that using KL divergence as the loss might be better.

I can also make this more efficient by expanding the target out to the full 250k tokens and then computing the KL divergence against that. The output would be repeated N times for each label and the target would be an expanded version of the indices + tokens. It may be appropriate to zero out all the indices that don’t appear in the target.
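The idea can be sketched in plain Python on a toy vocabulary of four tokens. This is an illustration and not a training loss: the names `expand_target` and `kl_divergence` are hypothetical, renormalising the expanded target so that it sums to one is an assumption, and a real version would work on batched tensors.

```python
import math
from typing import List, Sequence


def expand_target(
    indices: Sequence[int], values: Sequence[float], vocabulary_size: int
) -> List[float]:
    """Scatter the sparse article description into a dense distribution."""
    dense = [0.0] * vocabulary_size
    for index, value in zip(indices, values):
        dense[index] = value
    total = sum(dense)
    # assumption: renormalise so the target is a probability distribution
    return [value / total for value in dense]


def kl_divergence(target: Sequence[float], output: Sequence[float]) -> float:
    """KL(target || output), skipping tokens the target gives no mass to."""
    return sum(
        t * (math.log(t) - math.log(q)) for t, q in zip(target, output) if t > 0
    )


target = expand_target(indices=[1, 3], values=[0.5, 0.5], vocabulary_size=4)
uniform_output = [0.25, 0.25, 0.25, 0.25]
divergence = kl_divergence(target, uniform_output)
```

The divergence is zero only when the output matches the target on the tokens that the target covers, so unlike the distance above it penalises probability mass placed on tokens outside the description.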

Code
# from src/main/python/blog/prompt_internalization/multilingual/roberta/evaluate.py
from pathlib import Path
from typing import List, Optional, Tuple

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def evaluate(
    model_name: str, model_path: Path, ignore_tokens: Optional[List[int]] = None
) -> None:
    if ignore_tokens is None:
        ignore_tokens = []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_path)
    model.eval()

    bass_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    friday_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    malibu_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    football_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)


def bass_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    first_phrase = "We spotted a large bass in the ocean."
    second_phrase = "The bass player did not receive the acknowledgment she deserves."
    third_phrase = "The black sea bass, is a member of the wreckfish family."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=first_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=second_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    third_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=third_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )

    print("=== BASS EVALUATION ===")
    print(f"First Phrase is: {first_phrase} Target is: bass")
    print(f"Description is: {', '.join(first_predicted_words)}")
    print()

    print(f"Second Phrase is: {second_phrase} Target is: bass")
    print(f"Description is: {', '.join(second_predicted_words)}")
    print()

    print(f"Third Phrase is: {third_phrase} Target is: bass")
    print(f"Description is: {', '.join(third_predicted_words)}")
    print()

    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First & Third: {sorted(set(first_predicted_words) & set(third_predicted_words))}"
    )
    print(
        f"Second & Third: {sorted(set(second_predicted_words) & set(third_predicted_words))}"
    )
    print()


def friday_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_text = "Friday es mi canción favorita."
    english_text = "Friday is my favourite song."

    spanish_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=spanish_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    english_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=english_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )

    overlap = set(spanish_predicted_words) & set(english_predicted_words)
    difference = set(spanish_predicted_words) ^ set(english_predicted_words)

    print("=== FRIDAY EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_text}")
    print(f"Spanish Description is: {', '.join(spanish_predicted_words)}")

    print(f"English Phrase is: {english_text}")
    print(f"English Description is: {', '.join(english_predicted_words)}")
    print()

    print(f"Description Overlap is: {', '.join(sorted(overlap))}")
    print(f"Description Difference is: {', '.join(sorted(difference))}")
    print()


def malibu_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    text = "I like to drive my Malibu while drinking Malibu."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        index=1,
        ignore_tokens=ignore_tokens,
    )

    print("=== MALIBU EVALUATION ===")
    print(f"Phrase is: {text}")
    print(f"First Malibu (car) Description is: {', '.join(first_predicted_words)}")
    print(f"Second Malibu (drink) Description is: {', '.join(second_predicted_words)}")
    print()

    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First ^ Second: {sorted(set(first_predicted_words) ^ set(second_predicted_words))}"
    )
    print()


def football_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_phrase = (
        "Retiremos el equipo de la cancha, "
        "Boca no merece jugar esta copa que "
        "hace tiempo viene siendo desprestigiada.\n"
        "Ya no se juega al futbol."
    )

    english_phrase = (
        "Let's remove the team from the field, "
        "Boca does not deserve to play this cup that "
        "has long been discredited. "
        "Football is no longer played."
    )

    print("=== FOOTBALL EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_phrase}")
    print(f"English Phrase is: {english_phrase}")
    print()

    for spanish_noun, english_noun in [
        ["equipo", "team"],
        ["Boca", "Boca"],
        ["copa", "cup"],
        ["tiempo", "long"],
        ["futbol", "Football"],
    ]:
        spanish_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=spanish_phrase,
            noun=spanish_noun,
            ignore_tokens=ignore_tokens,
        )
        english_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=english_phrase,
            noun=english_noun,
            ignore_tokens=ignore_tokens,
        )
        overlap = set(spanish_description) & set(english_description)
        difference = set(spanish_description) ^ set(english_description)

        print(f"Spanish word is: {spanish_noun}, English word is: {english_noun}")
        print(f"Spanish Description is: {', '.join(spanish_description)}")
        print(f"English Description is: {', '.join(english_description)}")
        print(f"Overlap is: {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()


@torch.inference_mode()
def get_predictions(
    *,
    model: AutoModelForMaskedLM,
    tokenizer: AutoTokenizer,
    text: str,
    noun: str,
    index: int = 0,
    ignore_tokens: Optional[List[int]] = None,
) -> List[str]:
    if ignore_tokens is None:
        ignore_tokens = []

    tokens = tokenizer(text, return_tensors="pt")
    start, _end = get_noun(
        tokenizer=tokenizer, tokens=tokens.input_ids[0], noun=noun, index=index
    )

    output = model(**tokens)
    predictions = output.logits[0, start]
    predictions[ignore_tokens] = predictions.min()
    predicted_tokens = predictions.argsort(descending=True)[:10]
    predicted_words = [
        word.strip() for word in tokenizer.batch_decode(predicted_tokens)
    ]

    return predicted_words


def get_noun(
    tokenizer: AutoTokenizer, tokens: torch.Tensor, noun: str, index: int
) -> Tuple[int, int]:
    length = tokens.shape[0]
    current_index = index
    for start_index in range(length):
        word = tokenizer.decode(tokens[start_index]).strip()
        if not noun.startswith(word):
            continue
        for end_index in range(start_index + 1, length):
            word = tokenizer.decode(tokens[start_index:end_index]).strip()
            if not noun == word:
                continue
            if current_index > 0:
                current_index -= 1
            else:
                return start_index, end_index
    raise AssertionError(f"Did not find {noun}[{index}] in {tokenizer.decode(tokens)}")
Code
evaluate("xlm-roberta-base", model_path)
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten

Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten

Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten

First & Second: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
First & Third: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
Second & Third: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Phrase is: Friday is my favourite song.
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten

Description Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen
Description Difference is: 

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
Second Malibu (drink) Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten

First & Second: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
First ^ Second: []

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is:  (0)

Spanish word is: Boca, English word is: Boca
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is:  (0)

Spanish word is: copa, English word is: cup
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is:  (0)

Spanish word is: tiempo, English word is: long
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is:  (0)

Spanish word is: futbol, English word is: Football
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is:  (0)

It’s clear that the distance based training has fundamentally broken the model: it produces the same ten tokens for every word in every phrase. I need to try using KL Divergence as the loss metric. Part of the problem may be the use of softmax during the extraction of the rows from the teacher. Changing that would require reprocessing the teacher data, which would be tiresome as it takes 15 hours to complete.

KL Divergence

This time I am going to use KL Divergence instead. To try to make this easier to work with I want to load all of the target values into memory. I am going to discard the standard deviation and just train against the distribution represented by the mean.

Loading all of them into memory should take \(250\text{k}_{\text{tokens}} \times 8_{\text{bytes per float}} \times 10\text{k}_{\text{descriptions}} = 20\text{GB}\) which is too much.
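
As a rough check of that arithmetic, here is a small sketch. The constants are the rounded figures from the estimate above, and the float32 line is an extra comparison that I have added, on the assumption that the values would be stored as `torch.float` (4 bytes each) as they are in the code further down.

```python
# Back-of-the-envelope size of a dense target table, using the rounded
# figures from the text (assumed, not measured).
VOCAB_SIZE = 250_000    # "250k tokens"
DESCRIPTIONS = 10_000   # "10k descriptions"


def dense_bytes(bytes_per_value: int) -> int:
    """Bytes needed to hold one full-vocabulary row per description."""
    return VOCAB_SIZE * DESCRIPTIONS * bytes_per_value


float64_total = dense_bytes(8)  # the 8 bytes per float used in the estimate
float32_total = dense_bytes(4)  # what 4 byte floats would need instead

print(f"float64: {float64_total / 1e9:.0f} GB")
print(f"float32: {float32_total / 1e9:.0f} GB")
```

Even at half the size the dense table would not fit comfortably, which is why only the indexed values are kept.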

However, it should be possible to load in just the targets needed for a single inference. This is less than ideal as it involves shipping them from the CPU each time, but it should speed up the inference process.
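
To make the loss concrete, here is a minimal pure Python sketch of the idea for a single target: the stored index and mean pairs are scattered into a dense row, and the KL divergence is measured against the log-softmax of the model output. The function names and the toy five token vocabulary are illustrative assumptions, and the toy target is chosen to sum to one, which the real extracted means may not do.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def densify(indices: List[int], mean: List[float], vocab_size: int) -> List[float]:
    """Scatter the stored (index, mean) pairs into a full vocabulary-sized row."""
    row = [0.0] * vocab_size
    for index, value in zip(indices, mean):
        row[index] = value
    return row


def kl_divergence(log_predictions: List[float], target: List[float]) -> float:
    """KL(target || prediction); entries with zero target mass contribute nothing."""
    return sum(
        t * (math.log(t) - log_p)
        for t, log_p in zip(target, log_predictions)
        if t > 0
    )


# hypothetical description stored for tokens 1 and 3 of a 5 token vocabulary
target = densify(indices=[1, 3], mean=[0.75, 0.25], vocab_size=5)
close = kl_divergence(log_softmax([0.0, 5.0, 0.0, 4.0, 0.0]), target)
far = kl_divergence(log_softmax([5.0, 0.0, 4.0, 0.0, 0.0]), target)
print(f"close={close:.3f} far={far:.3f}")
```

The prediction that puts its mass on the described tokens gets the lower loss. Tokens that are absent from the description have a target of zero, so they only affect the loss through the softmax normalisation.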

Code
from itertools import starmap
from typing import Any, Dict, List, Optional, Tuple, Union
from pathlib import Path

import pandas as pd
import datasets
import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
from transformers.modeling_outputs import MaskedLMOutput

class ArticleTrainingArguments(TrainingArguments):
    def __init__(
        self,
        *args,
        temperature: float = 2.0,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.temperature = temperature


class ArticleMeasure:
    def __init__(self, file: Path) -> None:
        df = pd.read_parquet(file)
        self.indices = [
            torch.tensor(values, dtype=torch.long)
            for values in df["indices"]
        ]
        self.mean = [
            torch.tensor(values, dtype=torch.float)
            for values in df["mean"]
        ]

    def to(self, device) -> None:
        self.indices = [entry.to(device) for entry in self.indices]
        self.mean = [entry.to(device) for entry in self.mean]

    def loss(self, output: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        batch_size, token_count, vocab_size = output.shape

        # labels[:, :, 0] holds the start token of each target, padded with -1
        start_tokens = labels[:, :, 0]
        # offset of each sequence's first token once the output is flattened
        offsets = torch.arange(start=0, end=batch_size, dtype=torch.long, device=output.device) * token_count
        # repeat each offset once per valid target in that sequence
        offsets = offsets.repeat_interleave((start_tokens != -1).sum(dim=1))

        start_tokens = start_tokens.flatten()
        # labels[:, :, 2] holds the row of the article file describing each target
        target_indices = labels[:, :, 2].flatten()
        token_mask = start_tokens != -1
        start_tokens = start_tokens[token_mask]
        target_indices = target_indices[token_mask]
        # offsets = offsets[token_mask] # repeat_interleave has already established this

        # select the prediction at the start token of every target
        output = output.reshape(-1, vocab_size)
        predictions = output[start_tokens + offsets]
        predictions = F.log_softmax(predictions, dim=-1)
        # scatter the stored means into dense rows, leaving unindexed tokens at zero
        targets = torch.zeros_like(predictions, dtype=torch.float, device=output.device, requires_grad=False)
        for row_index, index in enumerate(target_indices):
            targets[row_index, self.indices[index]] = self.mean[index]

        # kl_div expects log-probabilities as input and probabilities as target
        return F.kl_div(
            input=predictions,
            target=targets,
            reduction="batchmean",
            log_target=False,
        )

class ArticleTrainer(Trainer):
    def __init__(
        self,
        *args,
        article_file: Optional[Path] = None,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.measure = ArticleMeasure(article_file)
        self.measure.to(self.model.device)

    def compute_loss(
        self,
        model: AutoModelForMaskedLM,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        outputs: MaskedLMOutput = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        loss: torch.Tensor = self.measure.loss(outputs.logits, labels=inputs["labels"])

        if not return_outputs:
            return loss
        return loss, outputs



def compute_metrics(model_output: EvalPrediction) -> Dict[str, float]:
    # distance is just loss already
    kl_div = model_output.predictions[:, 0].mean()
    overlap = model_output.predictions[:, 1].mean()
    return {
        "kl_div": kl_div,
        "overlap": overlap,
    }



def train(
    *,
    model_name: str = "xlm-roberta-base",
    # dataset_name: str = "xlm-roberta",
    batch_size: int = 32,
    learning_rate: float = 1e-4,
    # temperature: float = 2,
    fp16: bool = False,
    # mean_prediction: bool = False,
    # ignore_tokens: Optional[List[int]] = None,
    epochs: Optional[float] = 2,
    max_steps: int = -1,
    evaluation_steps: int = 500,
    article_file: Optional[Path] = None,
) -> Path:
    assert article_file is not None
    run_name = "-".join(
        [
            f"{model_name}",
            f"e{epochs}" if max_steps == -1 else f"ms{max_steps}",
            f"bs{batch_size}",
            f"lr{learning_rate}",
            # f"t{temperature}",
        ]
        + (["fp16"] if fp16 else [])
        # + (["mean"] if mean_prediction else [])
        # + ([f"it{len(ignore_tokens)}"] if ignore_tokens else [])
    )
    print(f"Starting {run_name}")
    train_ds = datasets.load_from_disk(DATASET_FOLDER / "train.dataset")
    test_ds = datasets.load_from_disk(DATASET_FOLDER / "valid.dataset")

    training_args = ArticleTrainingArguments(
        report_to="none",
        output_dir=RUN_FOLDER,
        num_train_epochs=epochs,
        max_steps=max_steps,
        seed=33,
        # number of steps before moving evaluation results from GPU to CPU see
        # https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
        eval_accumulation_steps=5,
        #
        # hyperparameters
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=fp16,
        # temperature=temperature,
        # mean_prediction=mean_prediction,
        # ignore_tokens=ignore_tokens,
        learning_rate=learning_rate,
        #
        # evaluation settings
        evaluation_strategy="steps",
        logging_steps=evaluation_steps,
        eval_steps=evaluation_steps,
        save_steps=evaluation_steps,
        #
        # checkpoint settings
        logging_dir=RUN_FOLDER / "logs",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        # remove_unused_columns=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    trainer = ArticleTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        data_collator=data_collator,
        tokenizer=tokenizer,
        # compute_metrics=compute_metrics,
        article_file=article_file,
    )

    trainer.train()
    model.save_pretrained(MODEL_FOLDER / run_name)

    return MODEL_FOLDER / run_name
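
The indexing in `ArticleMeasure.loss` is the least obvious part of the code above, so here is a small sketch of the same bookkeeping with plain lists. It assumes, as the loss does, that each sequence carries a padded list of start token positions with -1 marking the unused slots. The function name is hypothetical.

```python
from typing import List

PADDING = -1  # marks unused label slots, as in the loss above


def flat_positions(start_tokens: List[List[int]], token_count: int) -> List[int]:
    """Map (sequence, token position) pairs to rows of the flattened model output."""
    positions = []
    for row_index, row in enumerate(start_tokens):
        offset = row_index * token_count  # where this sequence starts after flattening
        positions.extend(offset + start for start in row if start != PADDING)
    return positions


# two sequences of 6 tokens: targets at tokens 1 and 4 of the first, token 2 of the second
print(flat_positions([[1, 4], [2, PADDING]], token_count=6))
```

The tensor version gets the same result without a Python loop by repeating each sequence offset once per valid target and adding it to the masked, flattened start tokens.
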
Code
model_path = train(
    model_name="xlm-roberta-base",
    batch_size=8,
    learning_rate=1e-4,
    # epochs=20,
    max_steps=1_000,
    evaluation_steps=100,
    article_file=ARTICLE_FILE,
)
Starting xlm-roberta-base-ms1000-bs8-lr0.0001
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[1000/1000 17:25, Epoch 0/1]
Step Training Loss Validation Loss
100 1.090300 0.486406
200 0.489100 0.400390
300 0.416900 0.375600
400 0.406200 0.361904
500 0.355400 0.342380
600 0.330400 0.308772
700 0.307400 0.313152
800 0.289600 0.291620
900 0.271900 0.276856
1000 0.269700 0.272617

Code
# from src/main/python/blog/prompt_internalization/multilingual/roberta/evaluate.py
from pathlib import Path
from typing import List, Optional, Tuple

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def evaluate(
    model_name: str, model_path: Path, ignore_tokens: Optional[List[int]] = None
) -> None:
    if ignore_tokens is None:
        ignore_tokens = []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_path)
    model.eval()

    bass_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    friday_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    malibu_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    football_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)


def bass_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    first_phrase = "We spotted a large bass in the ocean."
    second_phrase = "The bass player did not receive the acknowledgment she deserves."
    third_phrase = "The black sea bass, is a member of the wreckfish family."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=first_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=second_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    third_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=third_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )

    print("=== BASS EVALUATION ===")
    print(f"First Phrase is: {first_phrase} Target is: bass")
    print(f"Description is: {', '.join(first_predicted_words)}")
    print()

    print(f"Second Phrase is: {second_phrase} Target is: bass")
    print(f"Description is: {', '.join(second_predicted_words)}")
    print()

    print(f"Third Phrase is: {third_phrase} Target is: bass")
    print(f"Description is: {', '.join(third_predicted_words)}")
    print()

    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First & Third: {sorted(set(first_predicted_words) & set(third_predicted_words))}"
    )
    print(
        f"Second & Third: {sorted(set(second_predicted_words) & set(third_predicted_words))}"
    )
    print()


def friday_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_text = "Friday es mi canción favorita."
    english_text = "Friday is my favourite song."

    spanish_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=spanish_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    english_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=english_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )

    overlap = set(spanish_predicted_words) & set(english_predicted_words)
    difference = set(spanish_predicted_words) ^ set(english_predicted_words)

    print("=== FRIDAY EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_text}")
    print(f"Spanish Description is: {', '.join(spanish_predicted_words)}")

    print(f"English Phrase is: {english_text}")
    print(f"English Description is: {', '.join(english_predicted_words)}")
    print()

    print(f"Description Overlap is: {', '.join(sorted(overlap))}")
    print(f"Description Difference is: {', '.join(sorted(difference))}")
    print()


def malibu_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    text = "I like to drive my Malibu while drinking Malibu."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        index=1,
        ignore_tokens=ignore_tokens,
    )

    print("=== MALIBU EVALUATION ===")
    print(f"Phrase is: {text}")
    print(f"First Malibu (car) Description is: {', '.join(first_predicted_words)}")
    print(f"Second Malibu (drink) Description is: {', '.join(second_predicted_words)}")
    print()

    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First ^ Second: {sorted(set(first_predicted_words) ^ set(second_predicted_words))}"
    )
    print()


def football_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_phrase = (
        "Retiremos el equipo de la cancha, "
        "Boca no merece jugar esta copa que "
        "hace tiempo viene siendo desprestigiada.\n"
        "Ya no se juega al futbol."
    )

    english_phrase = (
        "Let's remove the team from the field, "
        "Boca does not deserve to play this cup that "
        "has long been discredited. "
        "Football is no longer played."
    )

    print("=== FOOTBALL EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_phrase}")
    print(f"English Phrase is: {english_phrase}")
    print()

    for spanish_noun, english_noun in [
        ["equipo", "team"],
        ["Boca", "Boca"],
        ["copa", "cup"],
        ["tiempo", "long"],
        ["futbol", "Football"],
    ]:
        spanish_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=spanish_phrase,
            noun=spanish_noun,
            ignore_tokens=ignore_tokens,
        )
        english_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=english_phrase,
            noun=english_noun,
            ignore_tokens=ignore_tokens,
        )
        overlap = set(spanish_description) & set(english_description)
        difference = set(spanish_description) ^ set(english_description)

        print(f"Spanish word is: {spanish_noun}, English word is: {english_noun}")
        print(f"Spanish Description is: {', '.join(spanish_description)}")
        print(f"English Description is: {', '.join(english_description)}")
        print(f"Overlap is: {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()


@torch.inference_mode()
def get_predictions(
    *,
    model: AutoModelForMaskedLM,
    tokenizer: AutoTokenizer,
    text: str,
    noun: str,
    index: int = 0,
    ignore_tokens: Optional[List[int]] = None,
) -> List[str]:
    if ignore_tokens is None:
        ignore_tokens = []

    tokens = tokenizer(text, return_tensors="pt")
    start, _end = get_noun(
        tokenizer=tokenizer, tokens=tokens.input_ids[0], noun=noun, index=index
    )

    output = model(**tokens)
    predictions = output.logits[0, start]
    predictions[ignore_tokens] = predictions.min()
    predicted_tokens = predictions.argsort(descending=True)[:10]
    predicted_words = [
        word.strip() for word in tokenizer.batch_decode(predicted_tokens)
    ]

    return predicted_words


def get_noun(
    tokenizer: AutoTokenizer, tokens: torch.Tensor, noun: str, index: int
) -> Tuple[int, int]:
    length = tokens.shape[0]
    current_index = index
    for start_index in range(length):
        word = tokenizer.decode(tokens[start_index]).strip()
        if not noun.startswith(word):
            continue
        for end_index in range(start_index + 1, length):
            word = tokenizer.decode(tokens[start_index:end_index]).strip()
            if not noun == word:
                continue
            if current_index > 0:
                current_index -= 1
            else:
                return start_index, end_index
    raise AssertionError(f"Did not find {noun}[{index}] in {tokenizer.decode(tokens)}")
Code
evaluate("xlm-roberta-base", model_path)
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: Material, Color, Type, Area, Surface, Size, Water, Description, Location, Feature

Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: Instrument, Material, Style, Music, Type, Sport, Language, Sports, Description, Color

Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: Color, Animal, Type, Material, Cat, Name, Fish, Food, Description, Plant

First & Second: ['Color', 'Description', 'Material', 'Type']
First & Third: ['Color', 'Description', 'Material', 'Type']
Second & Third: ['Color', 'Description', 'Material', 'Type']

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: Date, Day, Time, Description, Tag, Name, Year, Color, Age, Birthday
English Phrase is: Friday is my favourite song.
English Description is: Date, Day, Time, Tag, Description, Name, Year, Color, Age, Birthday

Description Overlap is: Age, Birthday, Color, Date, Day, Description, Name, Tag, Time, Year
Description Difference is: 

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: Country, Location, Land, Language, Region, City, Origin, State, Area, Source
Second Malibu (drink) Description is: Country, Food, Language, Land, Location, Source, Culture, Type, Color, Region

First & Second: ['Country', 'Land', 'Language', 'Location', 'Region', 'Source']
First ^ Second: ['Area', 'City', 'Color', 'Culture', 'Food', 'Origin', 'State', 'Type']

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish Description is: Name, Type, Organization, Sponsor, Owner, Location, Sport, Sports, Title, Team
English Description is: Team, Type, Organization, Name, Sport, Sports, Sponsor, Location, Title, Company
Overlap is: Location, Name, Organization, Sponsor, Sport, Sports, Team, Title, Type (9)
Difference is: Company, Owner (2)

Spanish word is: Boca, English word is: Boca
Spanish Description is: City, Location, Sponsor, Owner, Company, Country, Team, Organization, Name, Land
English Description is: City, Sponsor, Location, Company, Owner, Country, Organization, Name, Team, Land
Overlap is: City, Company, Country, Land, Location, Name, Organization, Owner, Sponsor, Team (10)
Difference is:  (0)

Spanish word is: copa, English word is: cup
Spanish Description is: Title, Type, Series, Sports, Category, Game, Sport, Match, Sponsor, Organization
English Description is: Title, Series, Type, Sport, Sports, Category, Game, Organization, Status, Match
Overlap is: Category, Game, Match, Organization, Series, Sport, Sports, Title, Type (9)
Difference is: Sponsor, Status (2)

Spanish word is: tiempo, English word is: long
Spanish Description is: Year, Time, Age, Date, Description, Country, History, Location, Duration, Title
English Description is: Title, Description, Age, Status, Year, Subject, Type, Country, Religion, Date
Overlap is: Age, Country, Date, Description, Title, Year (6)
Difference is: Duration, History, Location, Religion, Status, Subject, Time, Type (8)

Spanish word is: futbol, English word is: Football
Spanish Description is: Sports, Sport, Type, Style, Football, Game, Language, Category, Religion, Title
English Description is: Sports, Sport, Football, Style, Type, Game, Language, Religion, Category, Culture
Overlap is: Category, Football, Game, Language, Religion, Sport, Sports, Style, Type (9)
Difference is: Culture, Title (2)
Code
model_path = train(
    model_name="xlm-roberta-base",
    batch_size=8,
    learning_rate=1e-4,
    epochs=1,
    # max_steps=1_000,
    evaluation_steps=1_000,
    article_file=ARTICLE_FILE,
)
Starting xlm-roberta-base-e1-bs8-lr0.0001
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[ 41236/125000 9:06:49 < 18:30:49, 1.26 it/s, Epoch 0.33/1]
Step Training Loss Validation Loss
1000 0.471600 0.336681
2000 0.300800 0.286805
3000 0.267600 0.284328
4000 0.248100 0.281049
5000 0.229100 0.274642
6000 0.214900 0.261699
7000 0.205300 0.255581
8000 0.193600 0.245989
9000 0.194000 0.241496
10000 0.181600 0.239476
11000 0.171500 0.252683
12000 0.167000 0.237175
13000 0.164200 0.240240
14000 0.158500 0.243052
15000 0.156400 0.237823
16000 0.155900 0.244386
17000 0.150900 0.235650
18000 0.142800 0.230899
19000 0.139900 0.228503
20000 0.138100 0.226981
21000 0.136700 0.230420
22000 0.134300 0.218844
23000 0.133900 0.219605
24000 0.126200 0.215770
25000 0.127300 0.221229
26000 0.128100 0.222714
27000 0.122700 0.219199
28000 0.119200 0.221566
29000 0.118900 0.218871
30000 0.118800 0.212282
31000 0.118300 0.211434
32000 0.114100 0.214648
33000 0.111900 0.219043
34000 0.112000 0.216563
35000 0.109100 0.220941
36000 0.106500 0.216465
37000 0.105700 0.219658
38000 0.106400 0.210428
39000 0.105100 0.208662
40000 0.101300 0.221220
41000 0.103000 0.206249

KeyboardInterrupt: 

I’ve interrupted this because it looks like it is going to take more than a day to do an entire epoch: after 9 hours the progress bar still estimates another 18 and a half hours. The checkpoints have been saved, so it’s still possible to review the performance of the model.
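
Rather than typing the step number by hand, the most recent checkpoint can be looked up. This is a minimal sketch that assumes the run folder contains the `checkpoint-<step>` folders that the Trainer writes; `latest_checkpoint` is a hypothetical helper, not part of transformers.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(run_folder: Path) -> Optional[Path]:
    """Return the checkpoint-<step> folder with the highest step, if there is one."""
    checkpoints = [
        path
        for path in run_folder.glob("checkpoint-*")
        if path.is_dir() and path.name.split("-")[-1].isdigit()
    ]
    if not checkpoints:
        return None
    # compare steps numerically so that checkpoint-9000 sorts before checkpoint-41000
    return max(checkpoints, key=lambda path: int(path.name.split("-")[-1]))
```

Calling `latest_checkpoint(Path("/tmp/runs"))` should then give a path that can be passed to `evaluate`. Note that the most recent checkpoint is not necessarily the one with the lowest validation loss.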

Code
evaluate("xlm-roberta-base", Path("/tmp/runs/checkpoint-41000/"))
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: Material, Area, Type, Location, Source, Description, Category, Site, Name, Surface

Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: Instrument, Material, Type, Music, Style, System, Player, Guitar, Track, Description

Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: Material, Area, Type, Location, Source, Description, Surface, Application, Land, Name

First & Second: ['Description', 'Material', 'Type']
First & Third: ['Area', 'Description', 'Location', 'Material', 'Name', 'Source', 'Surface', 'Type']
Second & Third: ['Description', 'Material', 'Type']

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: Date, Day, Time, Holiday, Weekend, Night, Sunday, Birthday, Event, Saturday
English Phrase is: Friday is my favourite song.
English Description is: Date, Day, Time, Holiday, Weekend, Night, Sunday, Birthday, Friday, Saturday

Description Overlap is: Birthday, Date, Day, Holiday, Night, Saturday, Sunday, Time, Weekend
Description Difference is: Event, Friday

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: Location, Country, City, Land, Place, Region, Area, State, Local, Address
Second Malibu (drink) Description is: Location, Country, City, Land, Place, Region, Area, State, Address, Local

First & Second: ['Address', 'Area', 'City', 'Country', 'Land', 'Local', 'Location', 'Place', 'Region', 'State']
First ^ Second: []

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish Description is: Sponsor, Owner, Organization, Team, Company, Name, Location, Title, Type, Member
English Description is: Sponsor, Organization, Owner, Team, Name, Company, Location, Sports, Type, Title
Overlap is: Company, Location, Name, Organization, Owner, Sponsor, Team, Title, Type (9)
Difference is: Member, Sports (2)

Spanish word is: Boca, English word is: Boca
Spanish Description is: Sponsor, Club, City, Company, Organization, Team, Owner, Location, Land, Country
English Description is: Sponsor, Club, City, Company, Owner, Organization, Team, Location, Brand, Land
Overlap is: City, Club, Company, Land, Location, Organization, Owner, Sponsor, Team (9)
Difference is: Brand, Country (2)

Spanish word is: copa, English word is: cup
Spanish Description is: Title, Type, Sponsor, Series, Sports, Sport, Club, Location, Category, Football
English Description is: Title, Type, Cup, Series, Sport, Sponsor, Sports, Location, Category, Match
Overlap is: Category, Location, Series, Sponsor, Sport, Sports, Title, Type (8)
Difference is: Club, Cup, Football, Match (4)

Spanish word is: tiempo, English word is: long
Spanish Description is: Time, Age, Duration, Weight, Type, Game, Size, Speed, Year, Sport
English Description is: Type, Age, Sport, Year, Sports, Game, Title, Location, Duration, Time
Overlap is: Age, Duration, Game, Sport, Time, Type, Year (7)
Difference is: Location, Size, Speed, Sports, Title, Weight (6)

Spanish word is: futbol, English word is: Football
Spanish Description is: Sports, Sport, Football, Type, Game, Style, Title, Theme, Religion, Organization
English Description is: Sports, Sport, Football, Type, Style, Game, Title, Religion, Series, Category
Overlap is: Football, Game, Religion, Sport, Sports, Style, Title, Type (8)
Difference is: Category, Organization, Series, Theme (4)

This model has not collapsed, unlike the last one. The Malibu evaluation is not great as it has suggested the same ten words for both instances, all of them describing a place rather than a car or a drink. The Football evaluation is much improved with distinct suggestions for the different words.

Overall I think this is an improvement over the previous approach. The generation of the features needs to be improved. Performing the softmax over the values during feature generation is premature and makes aggregation trickier. If I stop doing this then I should have a way to provide a value for all of the missing values. I can use a fixed index (like 0) to hold the mean of all of the unindexed values, and use that to fill them out.
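
One way to see why the order of the softmax and the aggregation matters is to compare the two on a toy example. This is only a sketch with made up logits for two hypothetical teacher outputs over a three token vocabulary; it is not the feature generation code.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [value / total for value in exps]


def mean_rows(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of equally sized rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# two hypothetical teacher outputs for links to the same article
teacher_logits = [[4.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

softmax_then_mean = mean_rows([softmax(row) for row in teacher_logits])
mean_then_softmax = softmax(mean_rows(teacher_logits))

print([round(p, 3) for p in softmax_then_mean])
print([round(p, 3) for p in mean_then_softmax])
```

Both results are valid distributions but they are not the same, so the choice of when to normalise changes the target that the student is trained against. Keeping the raw values until after aggregation leaves that choice open.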