import blog.transformers_logging
September 5, 2022
I want to create a model that can understand the meaning of things in many languages. To do this I have to have a baseline for what a thing is. I’m using Wikipedia articles as both the list of things I recognize and the definition of what they are. When some text on Wikipedia links to an article, I can use that text as a description of the target article. Aggregating these descriptions then gives me a definition for the article.
I want to be able to train a model to produce these descriptions for text of any language. To do this I need text in a language where there are things in the text and I know the definition of those things. I’ve previously been trying to do this using a parallel sentence dataset, however that led to some problems.
The Tatoeba dataset has sentences but they are quite short. To be able to work with them I had to identify the nouns in them, which I did with part of speech tagging. Since I had no way to associate the nouns across the parallel sentences I relied on having a single noun in each sentence. I also had no canonical form of the noun so the teacher was providing a target based on a single short example. This led to a model that would not distinguish between different tokens in the input, and was not used to handling long sequences.
Since I am processing a lot of Wikipedia data I now have a way to create a much better training dataset. The inputs can be much longer and they can have many targets in them. This does involve quite a lot of data processing, which I am going to cover in this post.
The Wikipedia and Wikidata data files are bz2 compressed XML. Code to handle this has already been written as part of coming up with the English article definitions. I can use that as a base to handle the training data.
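As a rough illustration of what reading such a file involves, here is a minimal sketch that streams `<page>` elements out of a bz2 compressed XML dump using only the standard library. This is not the code I wrote for the English definitions; `iter_pages`, `local_name` and the tiny `SAMPLE` document are made up for the example, and real dumps carry more fields and an XML namespace (which is why the tag names are stripped).

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(raw: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for each <page> in a bz2 compressed dump."""
    with bz2.open(raw, mode="rb") as stream:
        # iterparse reads incrementally, so the whole dump is never in memory
        for _event, element in ET.iterparse(stream, events=("end",)):
            if local_name(element.tag) != "page":
                continue
            title, text = "", ""
            for child in element.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # release the children of the page that was just processed
            element.clear()


# tiny in-memory stand-in for a real dump file (hypothetical content)
SAMPLE = (
    b"<mediawiki><page><title>Andorra</title>"
    b"<revision><text>[[Europa]]</text></revision></page></mediawiki>"
)
pages = list(iter_pages(io.BytesIO(bz2.compress(SAMPLE))))
print(pages)
```

Streaming matters here because the dumps are far too large to parse in one go.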
This post is going to focus on the extraction of the training data, which requires the text with links and a way to resolve the links to the English article definitions.
Finding the English article name for a link that is in a different language can be done using Wikidata, as a Wikidata entry will contain links to the Wikipedia pages for the topic in different languages (you can see this on the right hand side of this page).
The Wikidata entries look like this:
<page>
<title>Q10222280</title>
<ns>0</ns>
<id>11495715</id>
<revision>
<id>1010802678</id>
<parentid>838728765</parentid>
<timestamp>2019-09-10T00:22:02Z</timestamp>
<contributor>
<username>Edoderoo</username>
<id>7150</id>
</contributor>
<comment>/* wbeditentity-update:0| */ https://www.wikidata.org/w/index.php?title=Wikidata:Bot_requests&oldid=1007509180 Wikimedia-kategori</comment>
<model>wikibase-item</model>
<format>application/json</format>
<text bytes="12680" xml:space="preserve">... xml encoded json ...</text>
<sha1>n35r4gz4obbpsw97p1qvfzghknsb189</sha1>
</revision>
</page>
The most interesting part of this is the XML encoded JSON which contains the titles of the Wikipedia pages:
{
"type": "item",
"id": "Q10222285",
"labels": {
"sv": {
"language": "sv",
"value": "Kategori:Ilithucia"
},
"ceb": {
"language": "ceb",
"value": "Kategoriya:Ilithucia"
},
"war": {
"language": "war",
"value": "Kaarangay:Ilithucia"
},
"en": {
"language": "en",
"value": "Category:Ilithucia"
},
"bg": {
"language": "bg",
"value": "Категория:Ilithucia"
},
"it": {
"language": "it",
"value": "Categoria:Ilithucia"
}
},
"descriptions": {
"es": {
"language": "es",
"value": "categoría de Wikimedia"
},
Here the labels entry has the Wikipedia page title for this entry in different languages. By reading this we can create a mapping between the different languages and the English article title.
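A minimal sketch of how one entry could be turned into mapping rows follows. It rests on an assumption: the excerpt above only shows `labels`, while this sketch reads a `sitelinks` object keyed by site name (such as `eswiki`), which is what the `site` column of the mapping suggests. The function name `title_mappings`, the default `sites` and the example entity are all illustrative.

```python
import json
from typing import Iterator, Sequence, Tuple


def title_mappings(
    entity_json: str, sites: Sequence[str] = ("eswiki", "frwiki")
) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, site, english_title) rows for one Wikidata entity."""
    entity = json.loads(entity_json)
    sitelinks = entity.get("sitelinks", {})
    if "enwiki" not in sitelinks:
        return  # no English article to map back to
    target = sitelinks["enwiki"]["title"].lower()
    for site in sites:
        if site in sitelinks:
            yield sitelinks[site]["title"].lower(), site, target


# made-up entity, only the fields that the sketch reads
EXAMPLE = json.dumps(
    {
        "type": "item",
        "id": "Q0",
        "sitelinks": {
            "enwiki": {"site": "enwiki", "title": "Landlocked country"},
            "eswiki": {"site": "eswiki", "title": "Estado sin litoral"},
        },
    }
)
rows = list(title_mappings(EXAMPLE))
print(rows)
```

Titles are lower cased so that lookups do not depend on how a link was capitalised in the article text.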
I’ve created such a mapping:
index | title | site | target |
---|---|---|---|
0 | ! (álbum de trippie redd) | eswiki | ! (trippie redd album) |
1 | ! (trippie redd) | itwiki | ! (trippie redd album) |
2 | ! (альбом trippie redd) | ruwiki | ! (trippie redd album) |
11 | !oka tokat | itwiki | !oka tokat |
12 | !oka tokat | ptwiki | !oka tokat |
... | ... | ... | ... |
3833552 | класс ♯p | ruwiki | ♯p |
3833553 | numeral-p-completo | eswiki | ♯p-complete |
3833554 | sharp-p-complet | frwiki | ♯p-complete |
3833555 | sharp-p-completo | itwiki | ♯p-complete |
3833556 | p-sharp completude | ptwiki | ♯p-complete |
3645941 rows × 3 columns
With this it is now possible to take an article title in one language and map it back to the English article. I’m using the same languages that I was before, although it could be done with any language that has reasonable Wikipedia support.
The dataset can be made from the text in the Wikipedia articles of different languages. Links from these articles can be used only if they exist in the mapping and the English article has a description. Finally, at least two links must be present in each input row.
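The three rules above (link is in the mapping, English article has a description, at least two links survive) can be sketched as a small filter. The names `resolve_links`, `mapping` and `described` are hypothetical, as is the choice to key the mapping by `(site, title)`; the output dictionaries use the `start`, `end` and `target` keys that the dataset rows have.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int, str]  # (start token, end token, linked title)


def resolve_links(
    links: List[Link],
    site: str,
    mapping: Dict[Tuple[str, str], str],
    described: Set[str],
    minimum: int = 2,
) -> List[dict]:
    """Keep links that resolve to a described English article."""
    targets = []
    for start, end, title in links:
        target = mapping.get((site, title.lower()))
        if target is None or target not in described:
            continue  # unknown title, or no description to train against
        targets.append({"start": start, "end": end, "target": target})
    # a row with too few usable links is dropped entirely
    return targets if len(targets) >= minimum else []


mapping = {
    ("eswiki", "europa"): "europe",
    ("eswiki", "españa"): "spain",
    ("eswiki", "francia"): "france",
}
described = {"europe", "spain"}
links = [(1, 2, "Europa"), (3, 4, "España"), (5, 6, "Francia"), (7, 8, "Desconocido")]
kept = resolve_links(links, "eswiki", mapping, described)
print(kept)
```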
I’ve done something similar for the English article definitions so a lot of the code for that can be reused. The English article descriptions only have a single link per row, so that will be the major change. For the student I want to maximize the number of links in an input and since there are millions of articles available I am expecting to generate a single test row per article.
There are problems with this as the English article referred to has to have a valid description. It’s very expensive to calculate the descriptions for the different articles. If I calculate every single one then it could take weeks of GPU time.
To make the process more efficient I can determine the English articles which are referred to by the student datasets and describe only those articles. By filtering it down to the most popular articles I can cut down the number of articles that need to be described. If I have one hundred thousand English articles to use then that should allow a large enough training dataset without spending too long on set up.
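Picking the most referenced articles is a counting problem. A minimal sketch, assuming each student row has already been reduced to a list of English article titles (`most_referenced` and the sample rows are made up for the example):

```python
from collections import Counter
from typing import Iterable, List


def most_referenced(rows: Iterable[List[str]], limit: int) -> List[str]:
    """Return the `limit` English articles that are linked to most often."""
    counts = Counter(target for row in rows for target in row)
    return [target for target, _count in counts.most_common(limit)]


rows = [
    ["spain", "france", "europe"],
    ["spain", "europe"],
    ["spain", "tax haven"],
]
popular = most_referenced(rows, limit=2)
print(popular)
```

With the real data the `limit` would be the one hundred thousand articles mentioned above.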
I’ve created such a dataset for Spanish Wikipedia and the data looks like this:
index | input_ids | targets |
---|---|---|
0 | [0, 6, 124180, 4, 197594, 79680, 1138, 8, 6, 1... | [{'end': 18, 'start': 14, 'target': 'microstat... |
1 | [0, 1832, 19265, 124851, 220, 2855, 41767, 381... | [{'end': 10, 'start': 9, 'target': 'climate'},... |
2 | [0, 540, 79680, 1138, 8, 6, 124180, 15, 19, 66... | [{'end': 25, 'start': 23, 'target': 'southern ... |
3 | [0, 6, 162518, 11598, 5, 1388, 198, 8, 8156, 3... | [{'end': 11, 'start': 10, 'target': 'spain'}, ... |
4 | [0, 5599, 57252, 7, 136749, 84891, 110, 15636,... | [{'end': 282, 'start': 278, 'target': 'composi... |
... | ... | ... |
922 | [0, 1818, 9641, 13085, 21, 9596, 40, 3814, 855... | [{'end': 32, 'start': 29, 'target': 'aragon'},... |
923 | [0, 1657, 88, 12024, 146, 7493, 113, 21376, 10... | [{'end': 107, 'start': 96, 'target': 'national... |
924 | [0, 3731, 121218, 124716, 115723, 1183, 159175... | [{'end': 23, 'start': 21, 'target': 'the corrs... |
925 | [0, 503, 82687, 533, 435, 53251, 516, 10, 2124... | [{'end': 32, 'start': 28, 'target': 'symmetry ... |
926 | [0, 7244, 40266, 198, 51, 3128, 8, 66708, 8, 5... | [{'end': 12, 'start': 5, 'target': 'database m... |
927 rows × 2 columns
This isn’t very readable so let’s try expanding the first row.
import pandas as pd
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
row = df.iloc[0]
print(tokenizer.decode(row.input_ids)[:256] + "...")
print()
# show how the first three links appear in the Spanish text
for term in row.targets[:3]:
    target = term["target"]
    text = tokenizer.decode(row.input_ids[term["start"]:term["end"]])
    print(f"{target} appears as {text}")
print(f"... {len(row.targets) - 3} other links")
pd.DataFrame(row.targets.tolist()).target.value_counts()
<s> Andorra, oficialmente Principado de Andorra (), es un micro-Estado soberano sin litoral ubicado en el suroeste de Europa, entre España y Francia, en el límite de la península ibérica. Se constituye en Estado independiente, de derecho, democrático y soc...
microstate appears as micro-Estado
landlocked country appears as sin litoral
europe appears as Europa
... 32 other links
spain 3
microstate 2
france 2
andorra la vella 2
roman catholic diocese of urgell 1
tourism 1
world war ii 1
flood 1
emergency 1
french language 1
portuguese language 1
spanish language 1
catalan language 1
prime minister of andorra 1
head of government 1
president of france 1
head of state 1
co-princes of andorra 1
landlocked country 1
pyrénées-orientales 1
ariège (department) 1
province of lleida 1
catalonia 1
pyrenees 1
principality 1
democracy 1
state (polity) 1
iberian peninsula 1
europe 1
tax haven 1
Name: target, dtype: int64
Now we can see that this is working quite well. The entry has been extracted from the Andorra article on Spanish Wikipedia and we have the English article links for each term. All of this fits within a single model input and we have 35 reasonably diverse links available to train with.
With this I can then create something more suitable for training. That means creating a regular sized label of integers and limiting the targets to those that have descriptions.
As there is quite a lot of data available I am restricting the rows to those that have between 5 and 10 valid targets. That means there will always be ten labels (as a consistent size is required for batching) without wasting too much space. I’ve set that up and it looks like this:
index | input_ids | label |
---|---|---|
0 | [0, 67538, 503, 51086, 4, 6, 4, 6, 4, 6, 4, 6,... | [[37, 41, 4664], [49, 56, 1384], [70, 72, 8386... |
1 | [0, 180, 1657, 85, 246, 8, 83366, 5076, 393, 2... | [[16, 17, 9006], [17, 20, 9070], [21, 27, 9535... |
2 | [0, 786, 771, 66847, 223, 110536, 7, 393, 788,... | [[11, 14, 4269], [14, 18, 1143], [18, 21, 6794... |
3 | [0, 10250, 538, 16615, 7, 332, 519, 164, 198, ... | [[10, 11, 8224], [16, 17, 3069], [24, 25, 6310... |
4 | [0, 241, 634, 89408, 57282, 7118, 2069, 6896, ... | [[15, 16, 6785], [16, 18, 8467], [111, 115, 26... |
... | ... | ... |
9995 | [0, 44532, 865, 15, 69990, 587, 12, 527, 10593... | [[4, 6, 6775], [20, 23, 6767], [25, 28, 9425],... |
9996 | [0, 180, 113666, 31, 8, 46932, 161808, 158850,... | [[67, 70, 2186], [89, 92, 7478], [184, 186, 74... |
9997 | [0, 188075, 395, 9903, 178434, 46, 33, 50648, ... | [[7, 10, 9030], [23, 26, 9029], [31, 32, 217],... |
9998 | [0, 11852, 90565, 93, 1391, 127, 188, 15, 2856... | [[15, 16, 178], [19, 20, 5795], [20, 23, 3184]... |
9999 | [0, 2758, 5708, 76, 393, 286, 59403, 48, 20833... | [[8, 9, 8208], [10, 11, 7087], [15, 17, 958], ... |
10000 rows × 2 columns
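The fixed size label described above can be sketched like this. It assumes that unused slots are filled with `[-1, -1, -1]` and that rows outside the 5 to 10 range are dropped rather than truncated; `to_label`, `PADDING` and `article_index` are names invented for the example.

```python
from typing import Dict, List, Optional

PADDING = [-1, -1, -1]  # assumed filler for unused label slots


def to_label(
    targets: List[dict],
    article_index: Dict[str, int],
    minimum: int = 5,
    size: int = 10,
) -> Optional[List[List[int]]]:
    """Return `size` [start, end, article] triples, or None to drop the row."""
    label = [
        [target["start"], target["end"], article_index[target["target"]]]
        for target in targets
        if target["target"] in article_index  # described articles only
    ]
    if not minimum <= len(label) <= size:
        return None
    return label + [list(PADDING) for _ in range(size - len(label))]


names = ["spain", "france", "europe", "pyrenees", "catalonia"]
article_index = {name: index for index, name in enumerate(names)}
targets = [
    {"start": 2 * index + 1, "end": 2 * index + 2, "target": name}
    for index, name in enumerate(names)
]
label = to_label(targets, article_index)
too_few = to_label(targets[:4], article_index)
print(label)
```

Every kept row then has the same label shape, which is what allows the rows to be batched.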
Again let’s explore the first row to check that I have done this correctly.
import numpy as np
import pandas as pd
from transformers import AutoTokenizer
index_to_article = (
    pd.read_parquet(
        PROCESSED_FOLDER / "20220701/article-descriptions.gz.parquet",
        columns=["target"]
    ).target.to_dict()
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
row = df.iloc[0]
print(tokenizer.decode(row.input_ids)[:256] + "...")
print()
targets = np.array(row.label.tolist())
for (start, end, target) in targets[:3]:
    target = index_to_article[target]
    text = tokenizer.decode(row.input_ids[start:end])
    print(f"{target} appears as {text}")
print(f"... {sum(targets[:, 0] != -1) - 3} other links")
<s> Johann Segner,,,,, (9 de octubre de 1704 - 5 de octubre de 1777) fue un científico húngaro. Nacido en el Reino de Hungría, en la entonces ciudad húngara de Pozsony/Presburgo (hoy Bratislava), sus antepasados habían emigrado ahí desde Estiria en el sigl...
kingdom of hungary appears as Reino de Hungría
bratislava appears as Pozsony/Presburgo
styria appears as Estiria
... 6 other links
There are a lot of commas at the start of this. The Wikipedia page contains a list of his name in different languages, and only the separators of that list have survived the extraction. While it would be nice to clean this up I want to see how well the student model can perform. To do that I need to create a trainer.
from pathlib import Path
ARTICLE_FILE = PROCESSED_FOLDER / "20220701" / "article-descriptions.gz.parquet"
DATASET_FOLDER = PROCESSED_FOLDER / "20220701" / "student"
MODEL_FOLDER = Path("/data/prompt-internalization/multilingual/models/wikipedia")
RUN_FOLDER = Path("/tmp/runs")
MODEL_FOLDER.mkdir(parents=True, exist_ok=True)
RUN_FOLDER.mkdir(parents=True, exist_ok=True)
This will use the weighted distance between the two points as the loss.
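Written out, and going by the `ArticleMeasure.distance` method in the code that follows, the loss for a single link is a Euclidean distance weighted by the inverse standard deviation. The notation is introduced here for illustration only: \(p\) is the softmax of the student output at the first token of the link, \(I_a\) is the set of token indices stored for article \(a\), and \(\mu_a\) and \(\sigma_a\) are the stored mean and standard deviation over those indices.

\[
d(p, a) = \left\lVert \frac{p_{I_a} - \mu_a}{\sigma_a} \right\rVert_2 = \sqrt{\sum_{i \in I_a} \left( \frac{p_i - \mu_{a,i}}{\sigma_{a,i}} \right)^2}
\]

The loss for a batch is then the mean of \(d\) over the labels in that batch.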
from itertools import starmap
from typing import Any, Dict, List, Optional, Tuple, Union
from pathlib import Path
import pandas as pd
import datasets
import torch
import torch.nn.functional as F
from transformers import (
AutoModelForMaskedLM,
AutoTokenizer,
DataCollatorWithPadding,
EvalPrediction,
Trainer,
TrainingArguments,
)
from transformers.modeling_outputs import MaskedLMOutput
class ArticleTrainingArguments(TrainingArguments):
    def __init__(
        self,
        *args,
        temperature: float = 2.0,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.temperature = temperature
class ArticleMeasure:
    def __init__(self, file: Path) -> None:
        df = pd.read_parquet(file)
        self.indices = [
            torch.tensor(values, dtype=torch.long)
            for values in df["indices"]
        ]
        self.mean = [
            torch.tensor(values)
            for values in df["mean"]
        ]
        self.weight = [
            torch.tensor(1 / values)
            for values in df["std"]
        ]
    def to(self, device) -> None:
        self.indices = [entry.to(device) for entry in self.indices]
        self.mean = [entry.to(device) for entry in self.mean]
        self.weight = [entry.to(device) for entry in self.weight]
    def distance(self, output: torch.Tensor, index: int) -> torch.Tensor:
        output = output[self.indices[index]]
        return torch.linalg.norm(
            (output - self.mean[index]) * self.weight[index]
        )
class ArticleTrainer(Trainer):
    def __init__(
        self,
        *args,
        article_file: Optional[Path] = None,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.measure = ArticleMeasure(article_file)
        self.measure.to(self.model.device)
    def compute_loss(
        self,
        model: AutoModelForMaskedLM,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        outputs: MaskedLMOutput = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        # distances are -1 for the missing labels
        distances: torch.Tensor = self.distances(outputs.logits, labels=inputs["labels"])
        loss: torch.Tensor = self.loss(distances)
        if not return_outputs:
            return loss
        return loss, distances
    def distances(self, outputs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # start from -1 everywhere so that missing labels can be recognised
        result = torch.ones(size=(outputs.shape[0], labels.shape[1]), dtype=torch.float, device=outputs.device)
        result = result * -1
        for row_index, (row, row_labels) in enumerate(zip(outputs, labels)):
            for label_index, (start, _, target) in enumerate(row_labels):
                # NOTE: padded labels are not skipped here, so a start of -1
                # selects the last token and overwrites the -1 placeholder
                logits = row[start]
                logits = logits.softmax(dim=0)
                result[row_index, label_index] = self.measure.distance(logits, target)
        return result
    def loss(self, distances: torch.Tensor) -> torch.Tensor:
        # should this be kldiv instead?
        loss = torch.tensor(0., dtype=torch.float, device=distances.device)
        count = 0
        # average the distances, ignoring the negative placeholders
        for row_distances in distances:
            for label_distance in row_distances:
                if label_distance < 0:
                    continue
                loss += label_distance
                count += 1
        loss = loss / count
        return loss
def compute_metrics(model_output: EvalPrediction) -> Dict[str, float]:
    # distance is just loss already
    kl_div = model_output.predictions[:, 0].mean()
    overlap = model_output.predictions[:, 1].mean()
    return {
        "kl_div": kl_div,
        "overlap": overlap,
    }
def train(
    *,
    model_name: str = "xlm-roberta-base",
    # dataset_name: str = "xlm-roberta",
    batch_size: int = 32,
    learning_rate: float = 1e-4,
    # temperature: float = 2,
    fp16: bool = False,
    # mean_prediction: bool = False,
    # ignore_tokens: Optional[List[int]] = None,
    epochs: Optional[float] = 2,
    max_steps: int = -1,
    evaluation_steps: int = 500,
    article_file: Path = None,
) -> Path:
    assert article_file is not None
    run_name = "-".join(
        [
            f"{model_name}",
            f"e{epochs}" if max_steps == -1 else f"ms{max_steps}",
            f"bs{batch_size}",
            f"lr{learning_rate}",
            # f"t{temperature}",
        ]
        + (["fp16"] if fp16 else [])
        # + (["mean"] if mean_prediction else [])
        # + ([f"it{len(ignore_tokens)}"] if ignore_tokens else [])
    )
    print(f"Starting {run_name}")
    train_ds = datasets.load_from_disk(DATASET_FOLDER / "train.dataset")
    test_ds = datasets.load_from_disk(DATASET_FOLDER / "valid.dataset")
    training_args = ArticleTrainingArguments(
        report_to="none",
        output_dir=RUN_FOLDER,
        num_train_epochs=epochs,
        max_steps=max_steps,
        seed=33,
        # number of steps before moving evaluation results from GPU to CPU see
        # https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
        eval_accumulation_steps=5,
        #
        # hyperparameters
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=fp16,
        # temperature=temperature,
        # mean_prediction=mean_prediction,
        # ignore_tokens=ignore_tokens,
        learning_rate=learning_rate,
        #
        # evaluation settings
        evaluation_strategy="steps",
        logging_steps=evaluation_steps,
        eval_steps=evaluation_steps,
        save_steps=evaluation_steps,
        #
        # checkpoint settings
        logging_dir=RUN_FOLDER / "logs",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        # remove_unused_columns=False,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainer = ArticleTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        data_collator=data_collator,
        tokenizer=tokenizer,
        # compute_metrics=compute_metrics,
        article_file=article_file,
    )
    trainer.train()
    model.save_pretrained(MODEL_FOLDER / run_name)
    return MODEL_FOLDER / run_name
Starting xlm-roberta-base-ms10000-bs8-lr0.0001
Step | Training Loss | Validation Loss |
---|---|---|
1000 | 47.021400 | 16.955563 |
2000 | 16.655400 | 16.955563 |
3000 | 16.682800 | 16.955563 |
4000 | 16.690500 | 16.955563 |
5000 | 16.685800 | 16.955563 |
6000 | 16.646200 | 16.955563 |
7000 | 16.699800 | 16.955563 |
8000 | 16.687000 | 16.955563 |
9000 | 16.685600 | 16.955563 |
10000 | 16.684500 | 16.955563 |
This has a problem. The validation loss is identical at every evaluation and the training loss stays at about 16.7 after the first 1,000 steps, so the model has stopped learning. I think that using KL Divergence as a loss might be better.
I can also make this more efficient by expanding the target out to the full 250k token vocabulary and then doing KL Divergence against that. The output would be repeated N times for each label and the target would be an expanded version of the indices + tokens. It may be appropriate to zero out all the indices that don’t appear in the target.
# from src/main/python/blog/prompt_internalization/multilingual/roberta/evaluate.py
from pathlib import Path
from typing import List, Optional, Tuple
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
def evaluate(
    model_name: str, model_path: Path, ignore_tokens: Optional[List[int]] = None
) -> None:
    if ignore_tokens is None:
        ignore_tokens = []
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_path)
    model.eval()
    bass_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    friday_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    malibu_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    football_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
def bass_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    first_phrase = "We spotted a large bass in the ocean."
    second_phrase = "The bass player did not receive the acknowledgment she deserves."
    third_phrase = "The black sea bass, is a member of the wreckfish family."
    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=first_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=second_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    third_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=third_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    print("=== BASS EVALUATION ===")
    print(f"First Phrase is: {first_phrase} Target is: bass")
    print(f"Description is: {', '.join(first_predicted_words)}")
    print()
    print(f"Second Phrase is: {second_phrase} Target is: bass")
    print(f"Description is: {', '.join(second_predicted_words)}")
    print()
    print(f"Third Phrase is: {third_phrase} Target is: bass")
    print(f"Description is: {', '.join(third_predicted_words)}")
    print()
    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First & Third: {sorted(set(first_predicted_words) & set(third_predicted_words))}"
    )
    print(
        f"Second & Third: {sorted(set(second_predicted_words) & set(third_predicted_words))}"
    )
    print()
def friday_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_text = "Friday es mi canción favorita."
    english_text = "Friday is my favourite song."
    spanish_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=spanish_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    english_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=english_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    overlap = set(spanish_predicted_words) & set(english_predicted_words)
    difference = set(spanish_predicted_words) ^ set(english_predicted_words)
    print("=== FRIDAY EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_text}")
    print(f"Spanish Description is: {', '.join(spanish_predicted_words)}")
    print(f"English Phrase is: {english_text}")
    print(f"English Description is: {', '.join(english_predicted_words)}")
    print()
    print(f"Description Overlap is: {', '.join(sorted(overlap))}")
    print(f"Description Difference is: {', '.join(sorted(difference))}")
    print()
def malibu_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    text = "I like to drive my Malibu while drinking Malibu."
    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        index=1,
        ignore_tokens=ignore_tokens,
    )
    print("=== MALIBU EVALUATION ===")
    print(f"Phrase is: {text}")
    print(f"First Malibu (car) Description is: {', '.join(first_predicted_words)}")
    print(f"Second Malibu (drink) Description is: {', '.join(second_predicted_words)}")
    print()
    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First ^ Second: {sorted(set(first_predicted_words) ^ set(second_predicted_words))}"
    )
    print()
def football_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_phrase = (
        "Retiremos el equipo de la cancha, "
        "Boca no merece jugar esta copa que "
        "hace tiempo viene siendo desprestigiada.\n"
        "Ya no se juega al futbol."
    )
    english_phrase = (
        "Let's remove the team from the field, "
        "Boca does not deserve to play this cup that "
        "has long been discredited. "
        "Football is no longer played."
    )
    print("=== FOOTBALL EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_phrase}")
    print(f"English Phrase is: {english_phrase}")
    print()
    for spanish_noun, english_noun in [
        ["equipo", "team"],
        ["Boca", "Boca"],
        ["copa", "cup"],
        ["tiempo", "long"],
        ["futbol", "Football"],
    ]:
        spanish_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=spanish_phrase,
            noun=spanish_noun,
            ignore_tokens=ignore_tokens,
        )
        english_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=english_phrase,
            noun=english_noun,
            ignore_tokens=ignore_tokens,
        )
        overlap = set(spanish_description) & set(english_description)
        difference = set(spanish_description) ^ set(english_description)
        print(f"Spanish word is: {spanish_noun}, English word is: {english_noun}")
        print(f"Spanish Description is: {', '.join(spanish_description)}")
        print(f"English Description is: {', '.join(english_description)}")
        print(f"Overlap is: {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()
@torch.inference_mode()
def get_predictions(
    *,
    model: AutoModelForMaskedLM,
    tokenizer: AutoTokenizer,
    text: str,
    noun: str,
    index: int = 0,
    ignore_tokens: Optional[List[int]] = None,
) -> List[str]:
    if ignore_tokens is None:
        ignore_tokens = []
    tokens = tokenizer(text, return_tensors="pt")
    start, _end = get_noun(
        tokenizer=tokenizer, tokens=tokens.input_ids[0], noun=noun, index=index
    )
    output = model(**tokens)
    predictions = output.logits[0, start]
    predictions[ignore_tokens] = predictions.min()
    predicted_tokens = predictions.argsort(descending=True)[:10]
    predicted_words = [
        word.strip() for word in tokenizer.batch_decode(predicted_tokens)
    ]
    return predicted_words
def get_noun(
    tokenizer: AutoTokenizer, tokens: torch.Tensor, noun: str, index: int
) -> Tuple[int, int]:
    length = tokens.shape[0]
    current_index = index
    for start_index in range(length):
        word = tokenizer.decode(tokens[start_index]).strip()
        if not noun.startswith(word):
            continue
        for end_index in range(start_index + 1, length):
            word = tokenizer.decode(tokens[start_index:end_index]).strip()
            if not noun == word:
                continue
            if current_index > 0:
                current_index -= 1
            else:
                return start_index, end_index
    raise AssertionError(f"Did not find {noun}[{index}] in {tokenizer.decode(tokens)}")
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
First & Second: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
First & Third: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
Second & Third: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Phrase is: Friday is my favourite song.
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
Description Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen
Description Difference is:
=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
Second Malibu (drink) Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
First & Second: ['<s>', 'azaltzen', 'dauden', 'der', 'eskaera', 'görmək', 'ikusten', 'laguntzen', 'tatzen', 'zusehen']
First ^ Second: []
=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.
Spanish word is: equipo, English word is: team
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is: (0)
Spanish word is: Boca, English word is: Boca
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is: (0)
Spanish word is: copa, English word is: cup
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is: (0)
Spanish word is: tiempo, English word is: long
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is: (0)
Spanish word is: futbol, English word is: Football
Spanish Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, tatzen, der, laguntzen, ikusten
English Description is: <s>, görmək, dauden, azaltzen, zusehen, eskaera, der, tatzen, laguntzen, ikusten
Overlap is: <s>, azaltzen, dauden, der, eskaera, görmək, ikusten, laguntzen, tatzen, zusehen (10)
Difference is: (0)
It’s clear that the distance based training has fundamentally broken the model: every description is the same ten tokens regardless of the input. I need to try out using KL Divergence as the loss metric. Part of the problem may be the use of softmax during the extraction of the rows from the teacher. Fixing that would require reprocessing the teacher data, which would be tiresome as it takes 15 hours to complete.
This time I am going to use KL Divergence instead. To try to make this easier to work with I want to load all of the target values into memory. I am going to discard the standard deviation and just train against the distribution represented by the mean.
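For reference, this is the standard definition of the KL Divergence between a target distribution \(t\) and a prediction \(p\) over the vocabulary. The symbols are introduced here for illustration: \(t\) is the target built from the stored mean, and \(p\) is the softmax of the student output at the first token of the link.

\[
D_{\mathrm{KL}}(t \,\|\, p) = \sum_{v} t_v \left( \log t_v - \log p_v \right)
\]

Terms where \(t_v = 0\) contribute nothing to the sum, so if the target is zero outside the token indices stored for the article then only those indices affect the loss.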
Loading all of them into memory should take \(250\text{k tokens} \times 8 \text{ bytes per float} \times 10\text{k descriptions} = 20\text{ GB}\) which is too much.
However it should be possible to load them in for a single inference. This is less ideal as it involves shipping them from CPU each time, but it should speed up the inference process.
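The memory estimate above is easy to check. This sketch repeats the arithmetic, and also shows the figure for 4 byte values, since `torch.float` is a 32 bit type; the function name `dense_target_bytes` is made up for the example.

```python
def dense_target_bytes(vocab_size: int, descriptions: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one dense vocabulary-sized row per description."""
    return vocab_size * descriptions * bytes_per_value


# figures from the text: 250k tokens, 10k descriptions, 8 bytes per float
as_float64 = dense_target_bytes(250_000, 10_000, 8)
# the same targets stored as 32 bit floats
as_float32 = dense_target_bytes(250_000, 10_000, 4)
print(f"{as_float64 / 1e9:.0f} GB as 64 bit floats")
print(f"{as_float32 / 1e9:.0f} GB as 32 bit floats")
```

Even the smaller figure is a lot to keep resident, which is why the dense targets are only built for the rows in the current batch.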
from itertools import starmap
from typing import Any, Dict, List, Optional, Tuple, Union
from pathlib import Path
import pandas as pd
import datasets
import torch
import torch.nn.functional as F
from transformers import (
AutoModelForMaskedLM,
AutoTokenizer,
DataCollatorWithPadding,
EvalPrediction,
Trainer,
TrainingArguments,
)
from transformers.modeling_outputs import MaskedLMOutput
class ArticleTrainingArguments(TrainingArguments):
    def __init__(
        self,
        *args,
        temperature: float = 2.0,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.temperature = temperature
class ArticleMeasure:
    def __init__(self, file: Path) -> None:
        df = pd.read_parquet(file)
        self.indices = [
            torch.tensor(values, dtype=torch.long)
            for values in df["indices"]
        ]
        self.mean = [
            torch.tensor(values, dtype=torch.float)
            for values in df["mean"]
        ]
    def to(self, device) -> None:
        self.indices = [entry.to(device) for entry in self.indices]
        self.mean = [entry.to(device) for entry in self.mean]
    def loss(self, output: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        batch_size, token_count, vocab_size = output.shape
        start_tokens = labels[:, :, 0]
        # offset of each row once the output is flattened to (batch * tokens, vocab)
        offsets = torch.arange(start=0, end=batch_size, dtype=torch.long, device=output.device) * token_count
        # repeat each row offset once per valid label in that row
        offsets = offsets.repeat_interleave((start_tokens != -1).sum(axis=1))
        start_tokens = start_tokens.flatten()
        target_indices = labels[:, :, 2].flatten()
        # drop the padded labels
        token_mask = start_tokens != -1
        start_tokens = start_tokens[token_mask]
        target_indices = target_indices[token_mask]
        # offsets = offsets[token_mask] # repeat_interleave has already established this
        output = output.reshape(-1, vocab_size)
        predictions = output[start_tokens + offsets]
        predictions = F.log_softmax(predictions, dim=-1)
        # dense targets that are zero outside the stored indices of each article
        targets = torch.zeros_like(predictions, dtype=torch.float, device=output.device, requires_grad=False)
        for row_index, index in enumerate(target_indices):
            targets[row_index, self.indices[index]] = self.mean[index]
        return F.kl_div(
            input=predictions,
            target=targets,
            reduction="batchmean",
            log_target=False,
        )
class ArticleTrainer(Trainer):
def __init__(
self,
*args,
article_file: Path = None,
**kwargs,
) -> None:
super().__init__(*args, **kwargs)
self.measure = ArticleMeasure(article_file)
self.measure.to(self.model.device)
def compute_loss(
self,
model: AutoModelForMaskedLM,
inputs: Dict[str, Union[torch.Tensor, Any]],
return_outputs: bool = False,
) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
outputs: MaskedLMOutput = model(
input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
loss: torch.Tensor = self.measure.loss(outputs.logits, labels=inputs["labels"])
if not return_outputs:
return loss
return loss, outputs
def compute_metrics(model_output: EvalPrediction) -> Dict[str, float]:
# distance is just loss already
kl_div = model_output.predictions[:, 0].mean()
overlap = model_output.predictions[:, 1].mean()
return {
"kl_div": kl_div,
"overlap": overlap,
}

def train(
    *,
    model_name: str = "xlm-roberta-base",
    # dataset_name: str = "xlm-roberta",
    batch_size: int = 32,
    learning_rate: float = 1e-4,
    # temperature: float = 2,
    fp16: bool = False,
    # mean_prediction: bool = False,
    # ignore_tokens: Optional[List[int]] = None,
    epochs: Optional[float] = 2,
    max_steps: int = -1,
    evaluation_steps: int = 500,
    article_file: Optional[Path] = None,
) -> Path:
    assert article_file is not None

    run_name = "-".join(
        [
            f"{model_name}",
            f"e{epochs}" if max_steps == -1 else f"ms{max_steps}",
            f"bs{batch_size}",
            f"lr{learning_rate}",
            # f"t{temperature}",
        ]
        + (["fp16"] if fp16 else [])
        # + (["mean"] if mean_prediction else [])
        # + ([f"it{len(ignore_tokens)}"] if ignore_tokens else [])
    )
    print(f"Starting {run_name}")

    train_ds = datasets.load_from_disk(DATASET_FOLDER / "train.dataset")
    test_ds = datasets.load_from_disk(DATASET_FOLDER / "valid.dataset")

    training_args = ArticleTrainingArguments(
        report_to="none",
        output_dir=RUN_FOLDER,
        num_train_epochs=epochs,
        max_steps=max_steps,
        seed=33,
        # number of steps before moving evaluation results from GPU to CPU see
        # https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
        eval_accumulation_steps=5,
        #
        # hyperparameters
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=fp16,
        # temperature=temperature,
        # mean_prediction=mean_prediction,
        # ignore_tokens=ignore_tokens,
        learning_rate=learning_rate,
        #
        # evaluation settings
        evaluation_strategy="steps",
        logging_steps=evaluation_steps,
        eval_steps=evaluation_steps,
        save_steps=evaluation_steps,
        #
        # checkpoint settings
        logging_dir=RUN_FOLDER / "logs",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        # remove_unused_columns=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    trainer = ArticleTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        data_collator=data_collator,
        tokenizer=tokenizer,
        # compute_metrics=compute_metrics,
        article_file=article_file,
    )
    trainer.train()

    model.save_pretrained(MODEL_FOLDER / run_name)
    return MODEL_FOLDER / run_name
Starting xlm-roberta-base-ms1000-bs8-lr0.0001
Step | Training Loss | Validation Loss |
---|---|---|
100 | 1.090300 | 0.486406 |
200 | 0.489100 | 0.400390 |
300 | 0.416900 | 0.375600 |
400 | 0.406200 | 0.361904 |
500 | 0.355400 | 0.342380 |
600 | 0.330400 | 0.308772 |
700 | 0.307400 | 0.313152 |
800 | 0.289600 | 0.291620 |
900 | 0.271900 | 0.276856 |
1000 | 0.269700 | 0.272617 |
# from src/main/python/blog/prompt_internalization/multilingual/roberta/evaluate.py
from pathlib import Path
from typing import List, Optional, Tuple
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def evaluate(
    model_name: str, model_path: Path, ignore_tokens: Optional[List[int]] = None
) -> None:
    if ignore_tokens is None:
        ignore_tokens = []
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_path)
    model.eval()

    bass_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    friday_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    malibu_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    football_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)


def bass_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    first_phrase = "We spotted a large bass in the ocean."
    second_phrase = "The bass player did not receive the acknowledgment she deserves."
    third_phrase = "The black sea bass, is a member of the wreckfish family."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=first_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=second_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    third_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=third_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )

    print("=== BASS EVALUATION ===")
    print(f"First Phrase is: {first_phrase} Target is: bass")
    print(f"Description is: {', '.join(first_predicted_words)}")
    print()
    print(f"Second Phrase is: {second_phrase} Target is: bass")
    print(f"Description is: {', '.join(second_predicted_words)}")
    print()
    print(f"Third Phrase is: {third_phrase} Target is: bass")
    print(f"Description is: {', '.join(third_predicted_words)}")
    print()
    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First & Third: {sorted(set(first_predicted_words) & set(third_predicted_words))}"
    )
    print(
        f"Second & Third: {sorted(set(second_predicted_words) & set(third_predicted_words))}"
    )
    print()


def friday_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_text = "Friday es mi canción favorita."
    english_text = "Friday is my favourite song."

    spanish_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=spanish_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    english_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=english_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    overlap = set(spanish_predicted_words) & set(english_predicted_words)
    difference = set(spanish_predicted_words) ^ set(english_predicted_words)

    print("=== FRIDAY EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_text}")
    print(f"Spanish Description is: {', '.join(spanish_predicted_words)}")
    print(f"English Phrase is: {english_text}")
    print(f"English Description is: {', '.join(english_predicted_words)}")
    print()
    print(f"Description Overlap is: {', '.join(sorted(overlap))}")
    print(f"Description Difference is: {', '.join(sorted(difference))}")
    print()


def malibu_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    text = "I like to drive my Malibu while drinking Malibu."

    first_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        ignore_tokens=ignore_tokens,
    )
    second_predicted_words = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        index=1,
        ignore_tokens=ignore_tokens,
    )

    print("=== MALIBU EVALUATION ===")
    print(f"Phrase is: {text}")
    print(f"First Malibu (car) Description is: {', '.join(first_predicted_words)}")
    print(f"Second Malibu (drink) Description is: {', '.join(second_predicted_words)}")
    print()
    print(
        f"First & Second: {sorted(set(first_predicted_words) & set(second_predicted_words))}"
    )
    print(
        f"First ^ Second: {sorted(set(first_predicted_words) ^ set(second_predicted_words))}"
    )
    print()


def football_evaluation(
    model: AutoModelForMaskedLM, tokenizer: AutoTokenizer, ignore_tokens: List[int]
) -> None:
    spanish_phrase = (
        "Retiremos el equipo de la cancha, "
        "Boca no merece jugar esta copa que "
        "hace tiempo viene siendo desprestigiada.\n"
        "Ya no se juega al futbol."
    )
    english_phrase = (
        "Let's remove the team from the field, "
        "Boca does not deserve to play this cup that "
        "has long been discredited. "
        "Football is no longer played."
    )

    print("=== FOOTBALL EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_phrase}")
    print(f"English Phrase is: {english_phrase}")
    print()

    for spanish_noun, english_noun in [
        ["equipo", "team"],
        ["Boca", "Boca"],
        ["copa", "cup"],
        ["tiempo", "long"],
        ["futbol", "Football"],
    ]:
        spanish_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=spanish_phrase,
            noun=spanish_noun,
            ignore_tokens=ignore_tokens,
        )
        english_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=english_phrase,
            noun=english_noun,
            ignore_tokens=ignore_tokens,
        )
        overlap = set(spanish_description) & set(english_description)
        difference = set(spanish_description) ^ set(english_description)

        print(f"Spanish word is: {spanish_noun}, English word is: {english_noun}")
        print(f"Spanish Description is: {', '.join(spanish_description)}")
        print(f"English Description is: {', '.join(english_description)}")
        print(f"Overlap is: {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()

@torch.inference_mode()
def get_predictions(
    *,
    model: AutoModelForMaskedLM,
    tokenizer: AutoTokenizer,
    text: str,
    noun: str,
    index: int = 0,
    ignore_tokens: Optional[List[int]] = None,
) -> List[str]:
    if ignore_tokens is None:
        ignore_tokens = []
    tokens = tokenizer(text, return_tensors="pt")
    start, _end = get_noun(
        tokenizer=tokenizer, tokens=tokens.input_ids[0], noun=noun, index=index
    )

    output = model(**tokens)
    predictions = output.logits[0, start]
    predictions[ignore_tokens] = predictions.min()
    predicted_tokens = predictions.argsort(descending=True)[:10]
    predicted_words = [
        word.strip() for word in tokenizer.batch_decode(predicted_tokens)
    ]

    return predicted_words


def get_noun(
    tokenizer: AutoTokenizer, tokens: torch.Tensor, noun: str, index: int
) -> Tuple[int, int]:
    length = tokens.shape[0]
    current_index = index
    for start_index in range(length):
        word = tokenizer.decode(tokens[start_index]).strip()
        if not noun.startswith(word):
            continue
        for end_index in range(start_index + 1, length):
            word = tokenizer.decode(tokens[start_index:end_index]).strip()
            if not noun == word:
                continue
            if current_index > 0:
                current_index -= 1
            else:
                return start_index, end_index
    raise AssertionError(f"Did not find {noun}[{index}] in {tokenizer.decode(tokens)}")
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: Material, Color, Type, Area, Surface, Size, Water, Description, Location, Feature
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: Instrument, Material, Style, Music, Type, Sport, Language, Sports, Description, Color
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: Color, Animal, Type, Material, Cat, Name, Fish, Food, Description, Plant
First & Second: ['Color', 'Description', 'Material', 'Type']
First & Third: ['Color', 'Description', 'Material', 'Type']
Second & Third: ['Color', 'Description', 'Material', 'Type']
=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: Date, Day, Time, Description, Tag, Name, Year, Color, Age, Birthday
English Phrase is: Friday is my favourite song.
English Description is: Date, Day, Time, Tag, Description, Name, Year, Color, Age, Birthday
Description Overlap is: Age, Birthday, Color, Date, Day, Description, Name, Tag, Time, Year
Description Difference is:
=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: Country, Location, Land, Language, Region, City, Origin, State, Area, Source
Second Malibu (drink) Description is: Country, Food, Language, Land, Location, Source, Culture, Type, Color, Region
First & Second: ['Country', 'Land', 'Language', 'Location', 'Region', 'Source']
First ^ Second: ['Area', 'City', 'Color', 'Culture', 'Food', 'Origin', 'State', 'Type']
=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.
Spanish word is: equipo, English word is: team
Spanish Description is: Name, Type, Organization, Sponsor, Owner, Location, Sport, Sports, Title, Team
English Description is: Team, Type, Organization, Name, Sport, Sports, Sponsor, Location, Title, Company
Overlap is: Location, Name, Organization, Sponsor, Sport, Sports, Team, Title, Type (9)
Difference is: Company, Owner (2)
Spanish word is: Boca, English word is: Boca
Spanish Description is: City, Location, Sponsor, Owner, Company, Country, Team, Organization, Name, Land
English Description is: City, Sponsor, Location, Company, Owner, Country, Organization, Name, Team, Land
Overlap is: City, Company, Country, Land, Location, Name, Organization, Owner, Sponsor, Team (10)
Difference is: (0)
Spanish word is: copa, English word is: cup
Spanish Description is: Title, Type, Series, Sports, Category, Game, Sport, Match, Sponsor, Organization
English Description is: Title, Series, Type, Sport, Sports, Category, Game, Organization, Status, Match
Overlap is: Category, Game, Match, Organization, Series, Sport, Sports, Title, Type (9)
Difference is: Sponsor, Status (2)
Spanish word is: tiempo, English word is: long
Spanish Description is: Year, Time, Age, Date, Description, Country, History, Location, Duration, Title
English Description is: Title, Description, Age, Status, Year, Subject, Type, Country, Religion, Date
Overlap is: Age, Country, Date, Description, Title, Year (6)
Difference is: Duration, History, Location, Religion, Status, Subject, Time, Type (8)
Spanish word is: futbol, English word is: Football
Spanish Description is: Sports, Sport, Type, Style, Football, Game, Language, Category, Religion, Title
English Description is: Sports, Sport, Football, Style, Type, Game, Language, Religion, Category, Culture
Overlap is: Category, Football, Game, Language, Religion, Sport, Sports, Style, Type (9)
Difference is: Culture, Title (2)
Starting xlm-roberta-base-e1-bs8-lr0.0001
Step | Training Loss | Validation Loss |
---|---|---|
1000 | 0.471600 | 0.336681 |
2000 | 0.300800 | 0.286805 |
3000 | 0.267600 | 0.284328 |
4000 | 0.248100 | 0.281049 |
5000 | 0.229100 | 0.274642 |
6000 | 0.214900 | 0.261699 |
7000 | 0.205300 | 0.255581 |
8000 | 0.193600 | 0.245989 |
9000 | 0.194000 | 0.241496 |
10000 | 0.181600 | 0.239476 |
11000 | 0.171500 | 0.252683 |
12000 | 0.167000 | 0.237175 |
13000 | 0.164200 | 0.240240 |
14000 | 0.158500 | 0.243052 |
15000 | 0.156400 | 0.237823 |
16000 | 0.155900 | 0.244386 |
17000 | 0.150900 | 0.235650 |
18000 | 0.142800 | 0.230899 |
19000 | 0.139900 | 0.228503 |
20000 | 0.138100 | 0.226981 |
21000 | 0.136700 | 0.230420 |
22000 | 0.134300 | 0.218844 |
23000 | 0.133900 | 0.219605 |
24000 | 0.126200 | 0.215770 |
25000 | 0.127300 | 0.221229 |
26000 | 0.128100 | 0.222714 |
27000 | 0.122700 | 0.219199 |
28000 | 0.119200 | 0.221566 |
29000 | 0.118900 | 0.218871 |
30000 | 0.118800 | 0.212282 |
31000 | 0.118300 | 0.211434 |
32000 | 0.114100 | 0.214648 |
33000 | 0.111900 | 0.219043 |
34000 | 0.112000 | 0.216563 |
35000 | 0.109100 | 0.220941 |
36000 | 0.106500 | 0.216465 |
37000 | 0.105700 | 0.219658 |
38000 | 0.106400 | 0.210428 |
39000 | 0.105100 | 0.208662 |
40000 | 0.101300 | 0.221220 |
41000 | 0.103000 | 0.206249 |
KeyboardInterrupt:
I’ve interrupted this because it looks like a full epoch would take more than a day. The checkpoints have been saved, so it’s still possible to review the performance of the model.
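To evaluate an interrupted run, the most recent checkpoint has to be found in the output folder. The `Trainer` writes checkpoints to folders named `checkpoint-<step>`, so a small helper can pick the one with the highest step. This is a minimal sketch: `latest_checkpoint` is a hypothetical helper, and the demonstration uses a temporary folder with made-up step numbers rather than the real run folder.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(run_folder: Path) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number, if any."""
    candidates = [
        path
        for path in run_folder.glob("checkpoint-*")
        if path.is_dir() and path.name.rsplit("-", 1)[-1].isdigit()
    ]
    if not candidates:
        return None
    # compare the step as an integer, a string comparison would rank 9000 above 41000
    return max(candidates, key=lambda path: int(path.name.rsplit("-", 1)[-1]))


# Demonstration against a throwaway folder (the step numbers are made up).
with tempfile.TemporaryDirectory() as folder:
    run_folder = Path(folder)
    for step in (9000, 40000, 41000):
        (run_folder / f"checkpoint-{step}").mkdir()
    found = latest_checkpoint(run_folder)
    found_name = found.name if found is not None else None

print(found_name)
```

The returned path can be passed as `model_path` to `evaluate`. The latest checkpoint is not necessarily the one with the best validation loss.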
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Description is: Material, Area, Type, Location, Source, Description, Category, Site, Name, Surface
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Description is: Instrument, Material, Type, Music, Style, System, Player, Guitar, Track, Description
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass
Description is: Material, Area, Type, Location, Source, Description, Surface, Application, Land, Name
First & Second: ['Description', 'Material', 'Type']
First & Third: ['Area', 'Description', 'Location', 'Material', 'Name', 'Source', 'Surface', 'Type']
Second & Third: ['Description', 'Material', 'Type']
=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
Spanish Description is: Date, Day, Time, Holiday, Weekend, Night, Sunday, Birthday, Event, Saturday
English Phrase is: Friday is my favourite song.
English Description is: Date, Day, Time, Holiday, Weekend, Night, Sunday, Birthday, Friday, Saturday
Description Overlap is: Birthday, Date, Day, Holiday, Night, Saturday, Sunday, Time, Weekend
Description Difference is: Event, Friday
=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.
First Malibu (car) Description is: Location, Country, City, Land, Place, Region, Area, State, Local, Address
Second Malibu (drink) Description is: Location, Country, City, Land, Place, Region, Area, State, Address, Local
First & Second: ['Address', 'Area', 'City', 'Country', 'Land', 'Local', 'Location', 'Place', 'Region', 'State']
First ^ Second: []
=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.
Spanish word is: equipo, English word is: team
Spanish Description is: Sponsor, Owner, Organization, Team, Company, Name, Location, Title, Type, Member
English Description is: Sponsor, Organization, Owner, Team, Name, Company, Location, Sports, Type, Title
Overlap is: Company, Location, Name, Organization, Owner, Sponsor, Team, Title, Type (9)
Difference is: Member, Sports (2)
Spanish word is: Boca, English word is: Boca
Spanish Description is: Sponsor, Club, City, Company, Organization, Team, Owner, Location, Land, Country
English Description is: Sponsor, Club, City, Company, Owner, Organization, Team, Location, Brand, Land
Overlap is: City, Club, Company, Land, Location, Organization, Owner, Sponsor, Team (9)
Difference is: Brand, Country (2)
Spanish word is: copa, English word is: cup
Spanish Description is: Title, Type, Sponsor, Series, Sports, Sport, Club, Location, Category, Football
English Description is: Title, Type, Cup, Series, Sport, Sponsor, Sports, Location, Category, Match
Overlap is: Category, Location, Series, Sponsor, Sport, Sports, Title, Type (8)
Difference is: Club, Cup, Football, Match (4)
Spanish word is: tiempo, English word is: long
Spanish Description is: Time, Age, Duration, Weight, Type, Game, Size, Speed, Year, Sport
English Description is: Type, Age, Sport, Year, Sports, Game, Title, Location, Duration, Time
Overlap is: Age, Duration, Game, Sport, Time, Type, Year (7)
Difference is: Location, Size, Speed, Sports, Title, Weight (6)
Spanish word is: futbol, English word is: Football
Spanish Description is: Sports, Sport, Football, Type, Game, Style, Title, Theme, Religion, Organization
English Description is: Sports, Sport, Football, Type, Style, Game, Title, Religion, Series, Category
Overlap is: Football, Game, Religion, Sport, Sports, Style, Title, Type (8)
Difference is: Category, Organization, Series, Theme (4)
This model has not collapsed, unlike the last one. The Malibu evaluation is not great, as the model suggests the same ten words for both instances. The Football evaluation is much improved, with distinct suggestions for the different words.
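One way to put a number on "the same output" is the Jaccard similarity between two description lists: the size of the overlap divided by the size of the union. The sketch below applies it to the Malibu and bass descriptions printed above. The `jaccard` helper is my own addition for illustration and is not part of the evaluation script.

```python
from typing import Sequence


def jaccard(first: Sequence[str], second: Sequence[str]) -> float:
    """Overlap of two word lists as a fraction of their union."""
    first_set, second_set = set(first), set(second)
    if not first_set and not second_set:
        return 1.0
    return len(first_set & second_set) / len(first_set | second_set)


# Descriptions copied from the evaluation output above.
malibu_car = ["Location", "Country", "City", "Land", "Place", "Region", "Area", "State", "Local", "Address"]
malibu_drink = ["Location", "Country", "City", "Land", "Place", "Region", "Area", "State", "Address", "Local"]
bass_fish = ["Material", "Area", "Type", "Location", "Source", "Description", "Category", "Site", "Name", "Surface"]
bass_music = ["Instrument", "Material", "Type", "Music", "Style", "System", "Player", "Guitar", "Track", "Description"]

print(f"Malibu car vs drink: {jaccard(malibu_car, malibu_drink):.2f}")
print(f"Bass fish vs music:  {jaccard(bass_fish, bass_music):.2f}")
```

A score of 1 means the two senses are indistinguishable by their top ten words, which is the case for Malibu. The bass senses share only three of seventeen distinct words.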
Overall I think this is an improvement over the previous approach. The feature generation still needs work. Applying softmax to the values during feature generation is premature and makes aggregation trickier. If I stop doing this then I have a way to provide a value for every missing entry: I can use a fixed index (like 0) to hold the mean of all of the unindexed values, and use that to fill them in.
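A rough sketch of how that could look at training time, under my own assumptions about the storage format: the teacher rows keep raw (pre-softmax) values for the indexed entries plus a single fill value standing in for everything else, and the softmax is only applied after the dense row has been rebuilt. The `densify_logits` helper, the toy vocabulary size and the numbers are all hypothetical.

```python
import torch


def densify_logits(
    indices: torch.Tensor,
    values: torch.Tensor,
    fill_value: float,
    vocab_size: int,
) -> torch.Tensor:
    """Rebuild a dense row, using fill_value for every unindexed entry."""
    dense = torch.full((vocab_size,), fill_value, dtype=values.dtype)
    dense[indices] = values
    return dense


# Hypothetical sparse row: two indexed raw values and the mean of the rest.
indices = torch.tensor([2, 5])
values = torch.tensor([4.0, 3.0])
fill_value = -1.0

dense = densify_logits(indices, values, fill_value, vocab_size=8)
# softmax is applied last, so every token ends up with a non-zero probability
target = torch.softmax(dense, dim=-1)
print(target)
```

Because the softmax happens after densifying, the target is a proper distribution over the whole vocabulary instead of one with hard zeros for the missing entries.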