Cross Language Prompt Internalization - Double Head with Language Modelling

Trying to reduce model collapse by simultaneous language modelling
prompt internalization
multilingual prompt internalization
cross language word sense induction
Published

July 12, 2022

The XLM-RoBERTa model appears to be a good candidate for the multilingual prompt internalization task. When trained, however, the model frequently loses the ability to differentiate between different words or phrases in the text. This leads to it returning extremely similar predictions for different tokens.

I believe that the training set is the source of this problem. Because I was unable to reliably associate nouns between the English sentence and its translation, each training row has only a single noun in it. This means that the training process does not punish the model for producing the same output for every token, as the loss only checks a single token.
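
To make the failure mode concrete, here is a minimal sketch with made-up tensors showing that a loss computed at a single position cannot punish a model that predicts the same distribution at every position:

Code
import torch
import torch.nn.functional as F

vocab_size = 10
sequence_length = 5

# a collapsed model produces identical logits at every position
collapsed_logits = torch.randn(vocab_size).repeat(sequence_length, 1)

# the teacher distribution for the single marked noun, here at position 2
teacher_logits = torch.randn(vocab_size)

# the loss only inspects position 2, so the identical predictions at
# every other position go unpunished
loss = F.kl_div(
    input=F.log_softmax(collapsed_logits[2], dim=-1),
    target=F.softmax(teacher_logits, dim=-1),
    reduction="sum",
)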

To fix this, I can think of the following options:

Ideally, a better training set could be found that has noun markup and association. There are several datasets available on the Statistical Machine Translation website, so reviewing them would be productive.

Until then, I am going to overcomplicate the model to fix a deficiency in the dataset.

Code
import blog.transformers_logging

Multi Head Model

The conversational model that I recently retrained used the GPT2DoubleHeadsModel, which has a classification head and a language modelling head. By reviewing that code I should be able to split the head of the XLM-RoBERTa model so that one head can be used for language describing and the other for masked language modelling. It's likely that the training process will involve submitting the input twice, as masked language modelling requires masked inputs while the language describing process needs all of the input to provide context.

Code
# from src/main/python/blog/prompt_internalization/multilingual/double_head/model.py
import logging
from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
from transformers import RobertaConfig, RobertaPreTrainedModel
from transformers.models.roberta.modeling_roberta import RobertaLMHead, RobertaModel

logger = logging.getLogger(__name__)


@dataclass
class RobertaDoubleHeadsOutput:
    ld_logits: torch.FloatTensor
    lm_logits: torch.FloatTensor
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None


# Heavily copied from RobertaForMaskedLM
class RobertaDoubleHeadsModel(RobertaPreTrainedModel):
    def __init__(self, config: RobertaConfig) -> None:
        super().__init__(config)

        if config.is_decoder:
            logger.warning(
                "If you want to use `RobertaDoubleHeadsModel` make sure "
                "`config.is_decoder=False` for bi-directional self-attention."
            )

        self.roberta = RobertaModel(config, add_pooling_layer=False)
        self.lm_head = RobertaLMHead(config)
        self.ld_head = RobertaLMHead(config)

        # The LM head weights require special treatment only when they are tied
        # with the word embeddings
        self.update_keys_to_ignore(config, ["lm_head.decoder.weight"])

        # Initialize weights and apply final processing
        self.post_init()

    def copy_lm_weights_to_ld(self) -> None:
        # copy the language modelling weights to the language describing head
        print("initializing language describing head with language modelling weights")
        self.ld_head.load_state_dict(self.lm_head.state_dict())

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        *,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], RobertaDoubleHeadsOutput]:
        # all loss calculations will be in the trainer
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]

        ld_output = self.ld_head(sequence_output)
        lm_output = self.lm_head(sequence_output)

        if not return_dict:
            return (ld_output, lm_output) + outputs[2:]

        return RobertaDoubleHeadsOutput(
            ld_logits=ld_output,
            lm_logits=lm_output,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
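
Before wiring this into the trainer it's worth a quick smoke test of the forward pass. This is a minimal sketch that assumes the model code above has been imported; the phrase is made up:

Code
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = RobertaDoubleHeadsModel.from_pretrained("xlm-roberta-base")
model.copy_lm_weights_to_ld()

tokens = tokenizer("I like to fish.", return_tensors="pt")
output = model(**tokens)

# both heads return one set of vocabulary logits per input token
assert output.ld_logits.shape == output.lm_logits.shape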

Training Code

The trainer now needs to perform the language modelling task as well as the language describing task. Evaluation also needs to change, as the model output uses different fields.

Code
# from src/main/python/blog/prompt_internalization/multilingual/double_head/trainer.py
from itertools import starmap
from typing import Any, Dict, Tuple, Union

import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments

from .model import RobertaDoubleHeadsOutput


class DoubleHeadsPromptInternalizationTrainingArguments(TrainingArguments):
    def __init__(
        self,
        *args,
        temperature: float = 2.0,
        mean_prediction: bool = True,
        alpha: float = 0.5,  # loss is alpha * lm_loss + (1 - alpha) * ld_loss
        **kwargs,
    ) -> None:
        assert 0 <= alpha <= 1
        super().__init__(*args, **kwargs)
        self.temperature = temperature
        self.mean_prediction = mean_prediction
        self.alpha = alpha


class DoubleHeadsPromptInternalizationTrainer(Trainer):
    def __init__(
        self,
        *args,
        teacher_model: AutoModelForMaskedLM = None,
        tokenizer: AutoTokenizer = None,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()
        self.mask_token_id = tokenizer.mask_token_id
        self.vocab_size = tokenizer.vocab_size

    def compute_loss(
        self,
        model: AutoModelForMaskedLM,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        outputs: RobertaDoubleHeadsOutput = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        if self.args.mean_prediction:
            ld_predictions = self._student_predictions_mean(
                outputs=outputs, labels=inputs["ld_labels"]
            )
        else:
            ld_predictions = self._student_predictions_first(
                outputs=outputs, labels=inputs["ld_labels"]
            )
        ld_targets = self._teacher_predictions(
            input_ids=inputs["teacher_input_ids"],
            attention_mask=inputs["teacher_attention_mask"],
        )
        ld_loss = self._ld_loss(predictions=ld_predictions, targets=ld_targets)
        lm_loss = self._lm_loss(
            predictions=outputs.lm_logits, targets=inputs["lm_labels"]
        )

        loss = self.args.alpha * lm_loss + (1 - self.args.alpha) * ld_loss

        if not return_outputs:
            return loss

        # This directly calculates the kl_div and overlap metrics.
        # It's much faster to do this using CUDA operations instead of waiting for cpu numpy.
        with torch.inference_mode():
            kl_div = F.kl_div(
                input=F.log_softmax(ld_predictions.to(torch.float32), dim=-1),
                target=F.softmax(ld_targets.to(torch.float32), dim=-1),
                reduction="none",
                log_target=False,
            )
            kl_div = kl_div.sum(dim=1)

            overlap = starmap(
                torch.isin,
                zip(
                    ld_predictions.argsort(descending=True)[:, :10],
                    ld_targets.argsort(descending=True)[:, :10],
                ),
            )
            overlap = map(torch.sum, overlap)
            overlap = torch.tensor(list(overlap), device=self.model.device)
            overlap = overlap / 10

            # calculating exact cross entropy values per row is hard with unattended tokens
            # I can just repeat the rows to get it working with the mean in compute_metrics

        # This will reshape the metrics to be [batch_size, 3] which will then
        # get correctly passed to the metric calculation
        metric_output = torch.cat(
            [
                kl_div[:, None],
                overlap[:, None],
                lm_loss.broadcast_to(inputs["input_ids"].shape[0], 1),
            ],
            dim=1,
        )
        return loss, metric_output

    @torch.inference_mode()
    def _teacher_predictions(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        outputs_teacher = self.teacher(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        mask_indices = input_ids == self.mask_token_id
        teacher_predictions = outputs_teacher.logits[mask_indices]
        return teacher_predictions

    def _student_predictions_mean(
        self, outputs: RobertaDoubleHeadsOutput, labels: torch.Tensor
    ) -> torch.Tensor:
        # When calculating this it is very important to avoid breaking back propagation.
        # torch.cat will break back propagation, so the prediction is added per row to a holder
        logits = outputs.ld_logits
        predictions = torch.zeros(
            logits.shape[0], logits.shape[-1], device=logits.device
        )
        for index, (start, length) in enumerate(labels):
            prediction = logits[index, start : start + length]
            prediction = prediction.mean(dim=0)
            predictions[index] += prediction
        return predictions

    def _student_predictions_first(
        self,
        outputs: RobertaDoubleHeadsOutput,
        labels: torch.Tensor,
    ) -> torch.Tensor:
        return outputs.ld_logits[range(outputs.ld_logits.shape[0]), labels[:, 0]]

    def _ld_loss(
        self, predictions: torch.Tensor, targets: torch.Tensor
    ) -> torch.Tensor:
        # soft target distillation loss: both distributions are softened with
        # the temperature, and the result is scaled by temperature**2 to keep
        # the gradient magnitude comparable across temperature settings
        predictions = F.log_softmax(
            predictions.to(torch.float32) / self.args.temperature, dim=-1
        )
        targets = F.softmax(targets.to(torch.float32) / self.args.temperature, dim=-1)
        loss = F.kl_div(
            input=predictions,
            target=targets,
            reduction="batchmean",
            log_target=False,
        )
        return loss * (self.args.temperature**2)

    def _lm_loss(
        self, predictions: torch.Tensor, targets: torch.Tensor
    ) -> torch.Tensor:
        # masked language modelling loss: unmasked positions carry the -100
        # label, which cross_entropy ignores by default
        return F.cross_entropy(
            predictions.view(-1, self.vocab_size),
            targets.view(-1),
        )
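

# A minimal sketch, not part of the original trainer file: the overlap metric
# above is the fraction of the student's ten highest scoring tokens that also
# appear in the teacher's ten highest scoring tokens
def demonstrate_overlap() -> None:
    student = torch.arange(100.0)[None, :]  # rates token 99 highest
    teacher = torch.arange(100.0).flip(0)[None, :]  # rates token 0 highest
    student_top = student.argsort(descending=True)[:, :10]
    teacher_top = teacher.argsort(descending=True)[:, :10]
    overlap = torch.isin(student_top[0], teacher_top[0]).sum() / 10
    print(overlap)  # tensor(0.) as the two top ten sets are disjoint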



# from src/main/python/blog/prompt_internalization/multilingual/double_head/collator.py
from typing import Any, Dict, List

import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling


class DoubleHeadsCollator:
    """
    The teacher inputs need to be padded and have an associated attention mask.
    The student inputs need to be masked.
    The student needs two labels -
        lm_labels for language modelling, and
        ld_labels for language describing
    """

    def __init__(self, tokenizer: AutoTokenizer) -> None:
        self.tokenizer = tokenizer
        self.collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=True,
            mlm_probability=0.15,
            return_tensors="pt",
        )

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        teacher_inputs = self._teacher_inputs(features)
        student_inputs = self._student_inputs(features)
        batch = {**teacher_inputs, **student_inputs}

        return batch

    def _teacher_inputs(self, features: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
        teacher_inputs = [{"input_ids": row["teacher_input_ids"]} for row in features]
        teacher_batch = self.tokenizer.pad(
            teacher_inputs,
            padding=True,
            return_tensors="pt",
        )
        return {
            "teacher_input_ids": teacher_batch["input_ids"],
            "teacher_attention_mask": teacher_batch["attention_mask"],
        }

    def _student_inputs(self, features: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
        student_inputs = [{"input_ids": row["input_ids"]} for row in features]
        student_labels = torch.tensor(
            [row["labels"][0] for row in features],
            dtype=torch.long,
        )
        lm_inputs = self.collator(student_inputs)
        return {
            "input_ids": lm_inputs["input_ids"],
            "attention_mask": lm_inputs["attention_mask"],
            "ld_labels": student_labels,
            "lm_labels": lm_inputs["labels"],
        }
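

# A minimal sketch, not part of the original collator file: around 15% of the
# student tokens are replaced by DataCollatorForLanguageModeling (mostly with
# the mask token) and the returned labels are -100 everywhere else, which is
# the ignore_index that F.cross_entropy skips in the trainer _lm_loss.
def demonstrate_masking(model_name: str = "xlm-roberta-base") -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    collator = DoubleHeadsCollator(tokenizer=tokenizer)
    batch = collator.collator(
        [{"input_ids": tokenizer("I like to fish.")["input_ids"]}]
    )
    print(batch["input_ids"])  # masked positions become tokenizer.mask_token_id
    print(batch["labels"])  # -100 except at the masked positions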



# from src/main/python/blog/prompt_internalization/multilingual/double_head/metrics.py
from typing import Dict

from transformers import EvalPrediction


def compute_metrics(model_output: EvalPrediction) -> Dict[str, float]:
    kl_div = model_output.predictions[:, 0].mean()
    overlap = model_output.predictions[:, 1].mean()
    cross_entropy = model_output.predictions[:, 2].mean()
    return {
        "kl_div": kl_div,
        "overlap": overlap,
        "cross_entropy": cross_entropy,
    }



# from src/main/python/blog/prompt_internalization/multilingual/double_head/train.py
from pathlib import Path
from typing import Optional

import datasets
from transformers import AutoModelForMaskedLM, AutoTokenizer

from .collator import DoubleHeadsCollator
from .metrics import compute_metrics
from .model import RobertaDoubleHeadsModel
from .trainer import (
    DoubleHeadsPromptInternalizationTrainer,
    DoubleHeadsPromptInternalizationTrainingArguments,
)

DATASET_FOLDER = Path("/data/tatoeba/2022-06-18/dataset/")
MODEL_FOLDER = Path("/data/prompt-internalization/multilingual/")
RUN_FOLDER = Path("/tmp/runs")

MODEL_FOLDER.mkdir(parents=True, exist_ok=True)
RUN_FOLDER.mkdir(parents=True, exist_ok=True)


def train(
    *,
    model_name: str = "xlm-roberta-base",
    dataset_name: str = "xlm-roberta",
    batch_size: int = 64,
    learning_rate: float = 1e-4,
    temperature: float = 2,
    alpha: float = 0.5,
    fp16: bool = False,
    mean_prediction: bool = False,
    epochs: Optional[float] = 2,
    max_steps: int = -1,
    evaluation_steps: int = 500,
) -> Path:
    run_name = "-".join(
        [
            f"{model_name}",
            f"e{epochs}" if max_steps == -1 else f"ms{max_steps}",
            f"bs{batch_size}",
            f"lr{learning_rate}",
            f"t{temperature}",
            f"a{alpha}",
        ]
        + (["fp16"] if fp16 else [])
        + (["mean"] if mean_prediction else [])
    )
    print(f"Starting {run_name}")
    train_ds = datasets.load_from_disk(DATASET_FOLDER / f"{dataset_name}-train.dataset")
    test_ds = datasets.load_from_disk(DATASET_FOLDER / f"{dataset_name}-test.dataset")

    training_args = DoubleHeadsPromptInternalizationTrainingArguments(
        report_to="none",
        output_dir=RUN_FOLDER,
        num_train_epochs=epochs,
        max_steps=max_steps,
        seed=33,
        # number of steps before moving evaluation results from GPU to CPU see
        # https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
        eval_accumulation_steps=5,
        #
        # hyperparameters
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=fp16,
        temperature=temperature,
        alpha=alpha,
        mean_prediction=mean_prediction,
        learning_rate=learning_rate,
        #
        # evaluation settings
        evaluation_strategy="steps",
        logging_steps=evaluation_steps,
        eval_steps=evaluation_steps,
        save_steps=evaluation_steps,
        #
        # checkpoint settings
        logging_dir=RUN_FOLDER / "logs",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="overlap",
        greater_is_better=True,
        remove_unused_columns=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    teacher_model = AutoModelForMaskedLM.from_pretrained(model_name)
    student_model = RobertaDoubleHeadsModel.from_pretrained(model_name)
    student_model.copy_lm_weights_to_ld()
    data_collator = DoubleHeadsCollator(tokenizer=tokenizer)

    trainer = DoubleHeadsPromptInternalizationTrainer(
        model=student_model,
        args=training_args,
        teacher_model=teacher_model,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    student_model.save_pretrained(MODEL_FOLDER / run_name)

    return MODEL_FOLDER / run_name



# from src/main/python/blog/prompt_internalization/multilingual/double_head/evaluate.py
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import torch
from transformers import AutoTokenizer

from .model import RobertaDoubleHeadsModel


def evaluate(
    model_name: str,
    model_path: Path,
    ignore_tokens: Optional[Dict[str, List[int]]] = None,
) -> None:
    if ignore_tokens is None:
        ignore_tokens = {}

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = RobertaDoubleHeadsModel.from_pretrained(model_path)
    model.eval()

    bass_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    friday_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    malibu_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)
    football_evaluation(model=model, tokenizer=tokenizer, ignore_tokens=ignore_tokens)


def bass_evaluation(
    model: RobertaDoubleHeadsModel,
    tokenizer: AutoTokenizer,
    ignore_tokens: Dict[str, List[int]],
) -> None:
    first_phrase = "We spotted a large bass in the ocean."
    second_phrase = "The bass player did not receive the acknowledgment she deserves."
    third_phrase = "The black sea bass, is a member of the wreckfish family."

    first_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=first_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    second_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=second_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )
    third_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=third_phrase,
        noun="bass",
        ignore_tokens=ignore_tokens,
    )

    print("=== BASS EVALUATION ===")
    print(f"First Phrase is: {first_phrase} Target is: bass")
    print(f"Second Phrase is: {second_phrase} Target is: bass")
    print(f"Third Phrase is: {third_phrase} Target is: bass")
    print()

    for key in first_description:
        first_words = first_description[key]
        second_words = second_description[key]
        third_words = third_description[key]

        first_second_overlap = set(first_words) & set(second_words)
        first_third_overlap = set(first_words) & set(third_words)
        second_third_overlap = set(second_words) & set(third_words)

        print(f"First {key} description is:  {', '.join(first_words)}")
        print(f"Second {key} description is: {', '.join(second_words)}")
        print(f"Third {key} description is:  {', '.join(third_words)}")

        print(
            f"First & Second: {sorted(first_second_overlap)} ({len(first_second_overlap)})"
        )
        print(
            f"First & Third:  {sorted(first_third_overlap)} ({len(first_third_overlap)})"
        )
        print(
            f"Second & Third: {sorted(second_third_overlap)} ({len(second_third_overlap)})"
        )
        print()


def friday_evaluation(
    model: RobertaDoubleHeadsModel,
    tokenizer: AutoTokenizer,
    ignore_tokens: Dict[str, List[int]],
) -> None:
    spanish_text = "Friday es mi canción favorita."
    english_text = "Friday is my favourite song."

    spanish_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=spanish_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )
    english_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=english_text,
        noun="Friday",
        ignore_tokens=ignore_tokens,
    )

    print("=== FRIDAY EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_text}")
    print(f"English Phrase is: {english_text}")
    print()

    for key in spanish_description:
        spanish_words = spanish_description[key]
        english_words = english_description[key]
        overlap = set(spanish_words) & set(english_words)
        difference = set(spanish_words) ^ set(english_words)

        print(f"Spanish {key} description is: {', '.join(spanish_words)}")
        print(f"English {key} description is: {', '.join(english_words)}")
        print(f"Overlap is:    {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()


def malibu_evaluation(
    model: RobertaDoubleHeadsModel,
    tokenizer: AutoTokenizer,
    ignore_tokens: Dict[str, List[int]],
) -> None:
    text = "I like to drive my Malibu while drinking Malibu."

    first_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        ignore_tokens=ignore_tokens,
    )
    second_description = get_predictions(
        model=model,
        tokenizer=tokenizer,
        text=text,
        noun="Malibu",
        index=1,
        ignore_tokens=ignore_tokens,
    )

    print("=== MALIBU EVALUATION ===")
    print(f"Phrase is: {text}")
    print()

    for key in first_description:
        first_words = first_description[key]
        second_words = second_description[key]
        overlap = set(first_words) & set(second_words)
        difference = set(first_words) ^ set(second_words)
        print(f"First Malibu (car) {key} description is:    {', '.join(first_words)}")
        print(f"Second Malibu (drink) {key} description is: {', '.join(second_words)}")
        print(f"Overlap is:    {', '.join(sorted(overlap))} ({len(overlap)})")
        print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()


def football_evaluation(
    model: RobertaDoubleHeadsModel,
    tokenizer: AutoTokenizer,
    ignore_tokens: Dict[str, List[int]],
) -> None:
    spanish_phrase = (
        "Retiremos el equipo de la cancha, "
        "Boca no merece jugar esta copa que "
        "hace tiempo viene siendo desprestigiada.\n"
        "Ya no se juega al futbol."
    )

    english_phrase = (
        "Let's remove the team from the field, "
        "Boca does not deserve to play this cup that "
        "has long been discredited. "
        "Football is no longer played."
    )

    print("=== FOOTBALL EVALUATION ===")
    print(f"Spanish Phrase is: {spanish_phrase}")
    print(f"English Phrase is: {english_phrase}")
    print()

    for spanish_noun, english_noun in [
        ["equipo", "team"],
        ["Boca", "Boca"],
        ["copa", "cup"],
        ["tiempo", "long"],
        ["futbol", "Football"],
    ]:
        spanish_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=spanish_phrase,
            noun=spanish_noun,
            ignore_tokens=ignore_tokens,
        )
        english_description = get_predictions(
            model=model,
            tokenizer=tokenizer,
            text=english_phrase,
            noun=english_noun,
            ignore_tokens=ignore_tokens,
        )

        print(f"Spanish word is: {spanish_noun}, English word is: {english_noun}")
        for key in spanish_description:
            spanish_words = spanish_description[key]
            english_words = english_description[key]
            overlap = set(spanish_words) & set(english_words)
            difference = set(spanish_words) ^ set(english_words)

            print(f"Spanish {key} description is: {', '.join(spanish_words)}")
            print(f"English {key} description is: {', '.join(english_words)}")
            print(f"Overlap is:    {', '.join(sorted(overlap))} ({len(overlap)})")
            print(f"Difference is: {', '.join(sorted(difference))} ({len(difference)})")
        print()


@torch.inference_mode()
def get_predictions(
    *,
    model: RobertaDoubleHeadsModel,
    tokenizer: AutoTokenizer,
    text: str,
    noun: str,
    ignore_tokens: Dict[str, List[int]],
    index: int = 0,
) -> Dict[str, List[str]]:
    tokens = tokenizer(text, return_tensors="pt")
    start, _end = get_noun(
        tokenizer=tokenizer, tokens=tokens.input_ids[0], noun=noun, index=index
    )

    output = model(**tokens)
    predictions = output.ld_logits[0, start]
    return get_filtered_tokens(
        tokenizer=tokenizer, predictions=predictions, ignore_tokens=ignore_tokens
    )


def get_noun(
    tokenizer: AutoTokenizer, tokens: torch.Tensor, noun: str, index: int
) -> Tuple[int, int]:
    # find the token span of the index-th occurrence of the noun by growing a
    # candidate window until the decoded text matches the noun exactly
    length = tokens.shape[0]
    current_index = index
    for start_index in range(length):
        word = tokenizer.decode(tokens[start_index]).strip()
        if not noun.startswith(word):
            continue
        for end_index in range(start_index + 1, length):
            word = tokenizer.decode(tokens[start_index:end_index]).strip()
            if not noun == word:
                continue
            if current_index > 0:
                current_index -= 1
            else:
                return start_index, end_index
    raise AssertionError(f"Did not find {noun}[{index}] in {tokenizer.decode(tokens)}")


def get_filtered_tokens(
    tokenizer: AutoTokenizer,
    predictions: torch.Tensor,
    ignore_tokens: Dict[str, List[int]],
) -> Dict[str, List[str]]:
    return {
        "none": get_tokens(
            tokenizer=tokenizer, predictions=predictions, ignore_tokens=[]
        ),
        **{
            key: get_tokens(
                tokenizer=tokenizer, predictions=predictions, ignore_tokens=tokens
            )
            for key, tokens in ignore_tokens.items()
        },
    }


def get_tokens(
    tokenizer: AutoTokenizer, predictions: torch.Tensor, ignore_tokens: List[int]
) -> List[str]:
    predictions = predictions.clone()
    predictions[ignore_tokens] = predictions.min()
    tokens = predictions.argsort(descending=True)[:10]
    return [word.strip() for word in tokenizer.batch_decode(tokens)]

Training - XLM-RoBERTa Base

Now to try training a model with this. As before, we will start with XLM-RoBERTa-base and then move on to XLM-RoBERTa-large.

Code
MODEL_NAME = "xlm-roberta-base"
Code
model_path = train(
    model_name=MODEL_NAME,
    batch_size=32,
    learning_rate=1e-4,
    temperature=2,
    alpha=0.5,
    mean_prediction=False,
    epochs=2,
    evaluation_steps=1_000,
)
Starting xlm-roberta-base-e2-bs32-lr0.0001-t2-a0.5
initializing language describing head with language modelling weights
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[6110/6110 50:21, Epoch 2/2]
Step Training Loss Validation Loss Kl Div Overlap Cross Entropy
1000 1.511200 2.791588 0.438695 0.597767 5.104256
2000 1.265800 2.933053 0.375079 0.622868 5.453454
3000 1.165200 3.201085 0.359807 0.631752 6.016028
4000 1.037100 3.280622 0.339257 0.644167 6.192915
5000 0.981800 3.002947 0.329574 0.650252 5.650171
6000 0.944000 3.063303 0.319838 0.654045 5.778174

The default value for alpha (the ratio between the language modelling loss and the language describing loss) may well be off. The cross entropy loss over the validation set appears to be around 10x the language describing loss, so I may want to reduce alpha to ~0.1. Since the combined loss is alpha * lm_loss + (1 - alpha) * ld_loss, an alpha of around 0.1 would bring the two contributions to a similar magnitude.
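
A quick back of the envelope check using the step 6000 validation numbers from the table above shows how lopsided the current combination is. This is a sketch with rounded values:

Code
# approximate validation losses from the final evaluation step above
lm_loss = 5.78  # cross entropy of the language modelling head
ld_loss = 0.32  # kl divergence of the language describing head

for alpha in [0.5, 0.1]:
    combined = alpha * lm_loss + (1 - alpha) * ld_loss
    lm_share = alpha * lm_loss / combined
    print(f"alpha={alpha}: combined={combined:.3f} lm share={lm_share:.0%}")

# alpha=0.5: combined=3.050 lm share=95%
# alpha=0.1: combined=0.866 lm share=67%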

Code
model_path = "/data/prompt-internalization/multilingual/xlm-roberta-base-e2-bs32-lr0.0001-t2-a0.5/"
Code
import numpy as np

# per-token weights of the model predictions over the dataset, used to
# filter out the most generic predicted tokens during evaluation
token_weights = np.load("/data/tatoeba/2022-06-18/dataset/xlm-roberta-base-tokens.npy", allow_pickle=True)

top_n = np.argsort(token_weights)[::-1]
top_10 = top_n[:10]  # the ten most heavily weighted tokens
top_p50 = top_n[token_weights[top_n].cumsum() <= 0.5]  # tokens covering the top half of the cumulative weight
Code
evaluate(
    model_name=MODEL_NAME,
    model_path=model_path,
    ignore_tokens={
        "top 10": top_10.tolist(),
        "top .5": top_p50.tolist(),
    }
)
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass

First none description is:  Location, Type, Description, Name, Color, Status, Owner, Material, Area, Size
Second none description is: Description, Type, Name, Status, Position, Title, Owner, Language, Location, Rating
Third none description is:  Owner, Name, Type, Color, Description, Status, Country, Race, Age, Animal
First & Second: ['Description', 'Location', 'Name', 'Owner', 'Status', 'Type'] (6)
First & Third:  ['Color', 'Description', 'Name', 'Owner', 'Status', 'Type'] (6)
Second & Third: ['Description', 'Name', 'Owner', 'Status', 'Type'] (5)

First top 10 description is:  Color, Area, Size, View, Title, Feature, Application, Cat, Position, Weight
Second top 10 description is: Position, Title, Rating, Feature, Model, Color, Application, Instrument, Details, Motor
Third top 10 description is:  Color, Race, Animal, Cat, Weight, Title, Model, Size, Family, Food
First & Second: ['Application', 'Color', 'Feature', 'Position', 'Title'] (5)
First & Third:  ['Cat', 'Color', 'Size', 'Title', 'Weight'] (5)
Second & Third: ['Color', 'Model', 'Title'] (3)

First top .5 description is:  Area, View, Feature, Application, Cat, Theme, Subject, Category, Source, Weather
Second top .5 description is: Rating, Feature, Model, Application, Instrument, Motor, Level, Style, Control, Item
Third top .5 description is:  Race, Animal, Cat, Model, Family, Food, Gen, Subject, Feature, Profile
First & Second: ['Application', 'Feature'] (2)
First & Third:  ['Cat', 'Feature', 'Subject'] (3)
Second & Third: ['Feature', 'Model'] (2)

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
English Phrase is: Friday is my favourite song.

Spanish none description is: Name, Description, Owner, Location, Comment, Title, Color, Status, Country, Tags
English none description is: Description, Tag, Tags, Name, Album, Country, Title, Status, Language, Theme
Overlap is:    Country, Description, Name, Status, Tags, Title (6)
Difference is: Album, Color, Comment, Language, Location, Owner, Tag, Theme (8)

Spanish top 10 description is: Comment, Title, Color, Tags, Details, Photo, Date, Text, Album, Keyword
English top 10 description is: Tag, Tags, Album, Title, Theme, Keyword, Details, Music, Label, Labels
Overlap is:    Album, Details, Keyword, Tags, Title (5)
Difference is: Color, Comment, Date, Label, Labels, Music, Photo, Tag, Text, Theme (10)

Spanish top .5 description is: Comment, Text, Album, Video, Subject, Year, Birthday, Motor, Credit, Home
English top .5 description is: Album, Theme, Music, Label, Labels, Motto, Song, Comments, Video, Category
Overlap is:    Album, Video (2)
Difference is: Birthday, Category, Comment, Comments, Credit, Home, Label, Labels, Motor, Motto, Music, Song, Subject, Text, Theme, Year (16)

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.

First Malibu (car) none description is:    Country, Location, Land, Language, Status, Type, Region, Area, Name, City
Second Malibu (drink) none description is: Country, Location, Land, Language, Region, Status, Type, Area, Name, City
Overlap is:    Area, City, Country, Land, Language, Location, Name, Region, Status, Type (10)
Difference is:  (0)

First Malibu (car) top 10 description is:    Land, Region, Area, City, Service, Address, State, Keyword, Style, Nation
Second Malibu (drink) top 10 description is: Land, Region, Area, City, Service, Food, Market, State, Address, Nation
Overlap is:    Address, Area, City, Land, Nation, Region, Service, State (8)
Difference is: Food, Keyword, Market, Style (4)

First Malibu (car) top .5 description is:    Region, Area, City, Service, State, Style, Nation, Food, Culture, Theme
Second Malibu (drink) top .5 description is: Region, Area, City, Service, Food, Market, State, Nation, Culture, Style
Overlap is:    Area, City, Culture, Food, Nation, Region, Service, State, Style (9)
Difference is: Market, Theme (2)

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish none description is: Description, Type, Location, Name, Title, Owner, Game, Status, Sport, Position
English none description is: Type, Description, Location, Name, Title, Owner, Game, Material, Status, Position
Overlap is:    Description, Game, Location, Name, Owner, Position, Status, Title, Type (9)
Difference is: Material, Sport (2)
Spanish top 10 description is: Title, Game, Sport, Position, Style, Theme, Rating, Category, Application, Size
English top 10 description is: Title, Game, Position, Size, Color, Team, Category, Sport, Rating, Brand
Overlap is:    Category, Game, Position, Rating, Size, Sport, Title (7)
Difference is: Application, Brand, Color, Style, Team, Theme (6)
Spanish top .5 description is: Game, Sport, Style, Theme, Rating, Category, Application, Sports, Organization, Team
English top .5 description is: Game, Team, Category, Sport, Rating, Group, Organization, Style, Theme, Application
Overlap is:    Application, Category, Game, Organization, Rating, Sport, Style, Team, Theme (9)
Difference is: Group, Sports (2)

Spanish word is: Boca, English word is: Boca
Spanish none description is: Owner, Name, Comment, Photo, Location, Brand, Color, Description, Author, ID
English none description is: Owner, Name, Location, Company, Color, Author, Photo, Family, Address, Comment
Overlap is:    Author, Color, Comment, Location, Name, Owner, Photo (7)
Difference is: Address, Brand, Company, Description, Family, ID (6)
Spanish top 10 description is: Comment, Photo, Brand, Color, Author, ID, Details, Title, Nick, Song
English top 10 description is: Company, Color, Author, Photo, Family, Address, Comment, Motor, Person, User
Overlap is:    Author, Color, Comment, Photo (4)
Difference is: Address, Brand, Company, Details, Family, ID, Motor, Nick, Person, Song, Title, User (12)
Spanish top .5 description is: Comment, Author, Nick, Song, Singer, Motor, Car, Company, About, Family
English top .5 description is: Company, Author, Family, Comment, Motor, Person, User, Member, Nick, Vehicle
Overlap is:    Author, Comment, Company, Family, Motor, Nick (6)
Difference is: About, Car, Member, Person, Singer, Song, User, Vehicle (8)

Spanish word is: copa, English word is: cup
Spanish none description is: Type, Title, Game, Description, Sport, Sports, Location, Style, Status, Name
English none description is: Type, Description, Title, Game, Sport, Material, Sports, Style, Name, Status
Overlap is:    Description, Game, Name, Sport, Sports, Status, Style, Title, Type (9)
Difference is: Location, Material (2)
Spanish top 10 description is: Title, Game, Sport, Sports, Style, Category, Series, Keyword, Theme, Color
English top 10 description is: Title, Game, Sport, Sports, Style, Series, Color, Category, Size, Theme
Overlap is:    Category, Color, Game, Series, Sport, Sports, Style, Theme, Title (9)
Difference is: Keyword, Size (2)
Spanish top .5 description is: Game, Sport, Sports, Style, Category, Series, Theme, Games, Tip, Application
English top .5 description is: Game, Sport, Sports, Style, Series, Category, Theme, Application, Item, Feature
Overlap is:    Application, Category, Game, Series, Sport, Sports, Style, Theme (8)
Difference is: Feature, Games, Item, Tip (4)

Spanish word is: tiempo, English word is: long
Spanish none description is: Age, Duration, Description, Time, Weight, Size, Rating, Year, Location, Date
English none description is: Age, Long, long, Weight, Status, Year, Life, Rating, Description, age
Overlap is:    Age, Description, Rating, Weight, Year (5)
Difference is: Date, Duration, Life, Location, Long, Size, Status, Time, age, long (10)
Spanish top 10 description is: Duration, Time, Weight, Size, Rating, Year, Date, Range, Alter, Life
English top 10 description is: Long, long, Weight, Year, Life, Rating, age, Duration, Title, Race
Overlap is:    Duration, Life, Rating, Weight, Year (5)
Difference is: Alter, Date, Long, Race, Range, Size, Time, Title, age, long (10)
Spanish top .5 description is: Duration, Rating, Year, Range, Alter, Life, Price, Game, Period, Race
English top .5 description is: Long, long, Year, Life, Rating, age, Duration, Race, Price, Quality
Overlap is:    Duration, Life, Price, Race, Rating, Year (6)
Difference is: Alter, Game, Long, Period, Quality, Range, age, long (8)

Spanish word is: futbol, English word is: Football
Spanish none description is: Sport, Sports, Game, Type, Style, Hobby, Football, Description, Games, Country
English none description is: Sport, Sports, Type, Game, Description, Football, Style, Hobby, Title, Category
Overlap is:    Description, Football, Game, Hobby, Sport, Sports, Style, Type (8)
Difference is: Category, Country, Games, Title (4)
Spanish top 10 description is: Sport, Sports, Game, Style, Hobby, Football, Games, Theme, Application, Series
English top 10 description is: Sport, Sports, Game, Football, Style, Hobby, Title, Category, Theme, Games
Overlap is:    Football, Game, Games, Hobby, Sport, Sports, Style, Theme (8)
Difference is: Application, Category, Series, Title (4)
Spanish top .5 description is: Sport, Sports, Game, Style, Hobby, Football, Games, Theme, Application, Series
English top .5 description is: Sport, Sports, Game, Football, Style, Hobby, Category, Theme, Games, Application
Overlap is:    Application, Football, Game, Games, Hobby, Sport, Sports, Style, Theme (9)
Difference is: Category, Series (2)

Training - XLM-RoBERTa Large

Code
model_path = train(
    model_name="xlm-roberta-large",
    batch_size=8,
    learning_rate=1e-4,
    temperature=2,
    alpha=0.5,
    mean_prediction=False,
    epochs=2,
    evaluation_steps=1_000,
)
Starting xlm-roberta-large-e2-bs8-lr0.0001-t2-a0.5
initializing language describing head with language modelling weights
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[24438/24438 3:48:41, Epoch 2/2]
Step Training Loss Validation Loss Kl Div Overlap Cross Entropy
1000 4.485700 nan 1.408290 0.383782 nan
2000 4.765600 5.876565 1.386561 0.399743 10.665840
3000 4.725900 6.066762 1.386035 0.392967 11.045374
4000 4.655300 6.264757 1.401647 0.378980 11.443541
5000 4.663600 6.235996 1.399746 0.361958 11.384928
6000 4.618500 6.168671 1.380730 0.380718 11.256695
7000 4.573900 6.172040 1.381824 0.366568 11.266456
8000 4.593900 6.212361 1.382200 0.380718 11.342365
9000 4.533700 6.132387 1.382104 0.380718 11.183510
10000 4.573600 6.185318 1.373810 0.366568 11.290386
11000 4.520700 6.170875 1.390312 0.388165 11.261891
12000 4.564600 nan 1.379688 0.380718 nan
13000 4.494800 6.219527 1.394951 0.387430 11.360277
14000 4.483600 6.078000 1.388268 0.380718 11.073941
15000 4.480900 6.189854 1.384314 0.382721 11.301337
16000 4.417300 6.161162 1.378112 0.399743 11.245730
17000 4.422100 6.119100 1.384519 0.376409 11.160753
18000 4.454800 6.121497 1.380128 0.383782 11.166265
19000 4.441200 6.126425 1.390149 0.378980 11.174656
20000 4.441400 6.090848 1.388660 0.399743 11.104525
21000 4.442400 6.121112 1.377978 0.383782 11.165845
22000 4.450800 6.118627 1.385170 0.399743 11.159862
23000 4.468500 6.133310 1.378750 0.380718 11.189851
24000 4.451300 6.110619 1.376852 0.399743 11.145279

Code
import numpy as np

token_weights = np.load("/data/tatoeba/2022-06-18/dataset/xlm-roberta-large-tokens.npy", allow_pickle=True)

top_n = np.argsort(token_weights)[::-1]
top_10 = top_n[:10]
top_p50 = top_n[token_weights[top_n].cumsum() <= 0.5]
Code
model_path = "/data/prompt-internalization/multilingual/xlm-roberta-large-e2-bs8-lr0.0001-t2-a0.5/"
Code
evaluate(
    model_name="xlm-roberta-large",
    model_path=model_path,
    ignore_tokens={
        "top 10": top_10.tolist(),
        "top .5": top_p50.tolist(),
    }
)
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass

First none description is:  Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Second none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Third none description is:  Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
First & Second: ['Age', 'Author', 'Category', 'Description', 'Location', 'Name', 'Other', 'Owner', 'Title', 'Type'] (10)
First & Third:  ['Age', 'Author', 'Category', 'Description', 'Location', 'Name', 'Other', 'Owner', 'Title', 'Type'] (10)
Second & Third: ['Age', 'Author', 'Category', 'Description', 'Location', 'Name', 'Other', 'Owner', 'Title', 'Type'] (10)

First top 10 description is:  Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Second top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Third top 10 description is:  Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
First & Second: ['Author', 'Color', 'Country', 'Job', 'Language', 'Other', 'Place', 'Product', 'Service', 'Tags'] (10)
First & Third:  ['Author', 'Color', 'Country', 'Job', 'Language', 'Other', 'Place', 'Product', 'Service', 'Tags'] (10)
Second & Third: ['Author', 'Color', 'Country', 'Job', 'Language', 'Other', 'Place', 'Product', 'Service', 'Tags'] (10)

First top .5 description is:  Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Second top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Third top .5 description is:  Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
First & Second: ['Color', 'Company', 'Family', 'Model', 'Photo', 'Position', 'Race', 'Size', 'Style', 'Subject'] (10)
First & Third:  ['Color', 'Company', 'Family', 'Model', 'Photo', 'Position', 'Race', 'Size', 'Style', 'Subject'] (10)
Second & Third: ['Color', 'Company', 'Family', 'Model', 'Photo', 'Position', 'Race', 'Size', 'Style', 'Subject'] (10)

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
English Phrase is: Friday is my favourite song.

Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)

Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)

Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.

First Malibu (car) none description is:    Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Second Malibu (drink) none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)

First Malibu (car) top 10 description is:    Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Second Malibu (drink) top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)

First Malibu (car) top .5 description is:    Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Second Malibu (drink) top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)
Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)
Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

Spanish word is: Boca, English word is: Boca
Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)
Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)
Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

Spanish word is: copa, English word is: cup
Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)
Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)
Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

Spanish word is: tiempo, English word is: long
Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)
Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)
Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

Spanish word is: futbol, English word is: Football
Spanish none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
English none description is: Name, Type, Location, Owner, Description, Age, Title, Category, Other, Author
Overlap is:    Age, Author, Category, Description, Location, Name, Other, Owner, Title, Type (10)
Difference is:  (0)
Spanish top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
English top 10 description is: Other, Author, Product, Color, Place, Service, Tags, Country, Job, Language
Overlap is:    Author, Color, Country, Job, Language, Other, Place, Product, Service, Tags (10)
Difference is:  (0)
Spanish top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
English top .5 description is: Color, Family, Subject, Size, Model, Company, Position, Style, Race, Photo
Overlap is:    Color, Company, Family, Model, Photo, Position, Race, Size, Style, Subject (10)
Difference is:  (0)

The large model has collapsed terribly. Every evaluation above returns the same ten tokens regardless of the input or the target noun.

Reducing LM Contribution

Code
model_path = train(
    model_name=MODEL_NAME,
    batch_size=32,
    learning_rate=1e-4,
    temperature=2,
    alpha=0.1,
    mean_prediction=False,
    epochs=2,
    evaluation_steps=1_000,
)
Starting xlm-roberta-base-e2-bs32-lr0.0001-t2-a0.1
initializing language describing head with language modelling weights
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[6110/6110 49:12, Epoch 2/2]
Step Training Loss Validation Loss Kl Div Overlap Cross Entropy
1000 0.701300 0.856570 0.335526 0.646788 5.372933
2000 0.538200 0.878939 0.303974 0.664985 5.863934
3000 0.494800 0.856900 0.291485 0.675407 5.824556
4000 0.440000 0.870678 0.282658 0.683110 5.997384
5000 0.420700 0.854028 0.277266 0.687389 5.925179
6000 0.405200 0.836515 0.266587 0.693247 5.836579

Code
import numpy as np

token_weights = np.load("/data/tatoeba/2022-06-18/dataset/xlm-roberta-base-tokens.npy", allow_pickle=True)

top_n = np.argsort(token_weights)[::-1]
top_10 = top_n[:10]
top_p50 = top_n[token_weights[top_n].cumsum() <= 0.5]
Code
evaluate(
    model_name=MODEL_NAME,
    model_path=model_path,
    ignore_tokens={
        "top 10": top_10.tolist(),
        "top .5": top_p50.tolist(),
    }
)
Could not locate the tokenizer configuration file, will try to use the model config instead.
=== BASS EVALUATION ===
First Phrase is: We spotted a large bass in the ocean. Target is: bass
Second Phrase is: The bass player did not receive the acknowledgment she deserves. Target is: bass
Third Phrase is: The black sea bass, is a member of the wreckfish family. Target is: bass

First none description is:  Location, Description, Type, Name, Area, Status, Color, Material, View, Country
Second none description is: Description, Name, Type, Status, Title, Owner, Location, Details, Position, Material
Third none description is:  Type, Name, Description, Color, Owner, Weight, Status, Size, Country, Age
First & Second: ['Description', 'Location', 'Material', 'Name', 'Status', 'Type'] (6)
First & Third:  ['Color', 'Country', 'Description', 'Name', 'Status', 'Type'] (6)
Second & Third: ['Description', 'Name', 'Owner', 'Status', 'Type'] (5)

First top 10 description is:  Area, Color, View, Size, Title, Position, Category, Application, Cat, Land
Second top 10 description is: Title, Details, Position, Rating, Application, Contact, Item, Information, Service, Model
Third top 10 description is:  Color, Weight, Size, Race, Food, Model, Cat, Style, Profile, Feature
First & Second: ['Application', 'Position', 'Title'] (3)
First & Third:  ['Cat', 'Color', 'Size'] (3)
Second & Third: ['Model'] (1)

First top .5 description is:  Area, View, Category, Application, Cat, Feature, Subject, Views, Theme, Source
Second top .5 description is: Rating, Application, Item, Information, Service, Model, Feature, Image, Comment, Driver
Third top .5 description is:  Race, Food, Model, Cat, Style, Profile, Feature, Animal, Gene, Subject
First & Second: ['Application', 'Feature'] (2)
First & Third:  ['Cat', 'Feature', 'Subject'] (3)
Second & Third: ['Feature', 'Model'] (2)

=== FRIDAY EVALUATION ===
Spanish Phrase is: Friday es mi canción favorita.
English Phrase is: Friday is my favourite song.

Spanish none description is: Name, Comment, Owner, Description, Title, Country, Photo, Tags, Color, Location
English none description is: Description, Tags, Title, Tag, Name, Comment, Album, Comments, Country, Status
Overlap is:    Comment, Country, Description, Name, Tags, Title (6)
Difference is: Album, Color, Comments, Location, Owner, Photo, Status, Tag (8)

Spanish top 10 description is: Comment, Title, Photo, Tags, Color, Keyword, Video, Address, Details, Date
English top 10 description is: Tags, Title, Tag, Comment, Album, Comments, Details, Keyword, Labels, Photo
Overlap is:    Comment, Details, Keyword, Photo, Tags, Title (6)
Difference is: Address, Album, Color, Comments, Date, Labels, Tag, Video (8)

Spanish top .5 description is: Comment, Video, Album, Text, Subject, Source, Home, Comments, Cat, Email
English top .5 description is: Comment, Album, Comments, Labels, Theme, Video, ..., Label, More, Text
Overlap is:    Album, Comment, Comments, Text, Video (5)
Difference is: ..., Cat, Email, Home, Label, Labels, More, Source, Subject, Theme (10)

=== MALIBU EVALUATION ===
Phrase is: I like to drive my Malibu while drinking Malibu.

First Malibu (car) none description is:    Name, Type, Owner, Description, Color, Material, Product, Brand, Country, Food
Second Malibu (drink) none description is: Name, Type, Country, Owner, Food, Description, Product, Language, Color, Brand
Overlap is:    Brand, Color, Country, Description, Food, Name, Owner, Product, Type (9)
Difference is: Language, Material (2)

First Malibu (car) top 10 description is:    Color, Product, Brand, Food, Style, Application, Cat, Keyword, Model, Theme
Second Malibu (drink) top 10 description is: Food, Product, Color, Brand, Application, Keyword, Land, Style, Theme, Cat
Overlap is:    Application, Brand, Cat, Color, Food, Keyword, Product, Style, Theme (9)
Difference is: Land, Model (2)

First Malibu (car) top .5 description is:    Food, Style, Application, Cat, Model, Theme, Tip, Motor, Category, Animal
Second Malibu (drink) top .5 description is: Food, Application, Style, Theme, Cat, Service, Model, Animal, Category, Root
Overlap is:    Animal, Application, Cat, Category, Food, Model, Style, Theme (8)
Difference is: Motor, Root, Service, Tip (4)

=== FOOTBALL EVALUATION ===
Spanish Phrase is: Retiremos el equipo de la cancha, Boca no merece jugar esta copa que hace tiempo viene siendo desprestigiada.
Ya no se juega al futbol.
English Phrase is: Let's remove the team from the field, Boca does not deserve to play this cup that has long been discredited. Football is no longer played.

Spanish word is: equipo, English word is: team
Spanish none description is: Description, Type, Name, Title, Location, Owner, Game, Status, Age, Country
English none description is: Type, Description, Name, Title, Game, Location, Sport, Owner, Sports, Status
Overlap is:    Description, Game, Location, Name, Owner, Status, Title, Type (8)
Difference is: Age, Country, Sport, Sports (4)
Spanish top 10 description is: Title, Game, Organization, Rating, Category, Team, Theme, Color, Brand, Sport
English top 10 description is: Title, Game, Sport, Sports, Category, Theme, Application, Style, Team, Organization
Overlap is:    Category, Game, Organization, Sport, Team, Theme, Title (7)
Difference is: Application, Brand, Color, Rating, Sports, Style (6)
Spanish top .5 description is: Game, Organization, Rating, Category, Team, Theme, Sport, Application, Style, Sports
English top .5 description is: Game, Sport, Sports, Category, Theme, Application, Style, Team, Organization, Sponsor
Overlap is:    Application, Category, Game, Organization, Sport, Sports, Style, Team, Theme (9)
Difference is: Rating, Sponsor (2)

Spanish word is: Boca, English word is: Boca
Spanish none description is: Owner, Name, Color, Location, Song, Comment, Nick, Title, Author, ID
English none description is: Owner, Name, Color, Author, Family, Details, Location, Comment, Company, Photo
Overlap is:    Author, Color, Comment, Location, Name, Owner (6)
Difference is: Company, Details, Family, ID, Nick, Photo, Song, Title (8)
Spanish top 10 description is: Color, Song, Comment, Nick, Title, Author, ID, Photo, Details, Content
English top 10 description is: Color, Author, Family, Details, Comment, Company, Photo, Police, Nick, ID
Overlap is:    Author, Color, Comment, Details, ID, Nick, Photo (7)
Difference is: Company, Content, Family, Police, Song, Title (6)
Spanish top .5 description is: Song, Comment, Nick, Author, Content, Singer, Logo, Family, Car, Service
English top .5 description is: Author, Family, Comment, Company, Police, Nick, User, Person, Car, Customer
Overlap is:    Author, Car, Comment, Family, Nick (5)
Difference is: Company, Content, Customer, Logo, Person, Police, Service, Singer, Song, User (10)

Spanish word is: copa, English word is: cup
Spanish none description is: Type, Game, Title, Description, Sport, Status, Sports, Name, Location, Age
English none description is: Type, Game, Sport, Sports, Title, Description, Style, Application, Category, Theme
Overlap is:    Description, Game, Sport, Sports, Title, Type (6)
Difference is: Age, Application, Category, Location, Name, Status, Style, Theme (8)
Spanish top 10 description is: Game, Title, Sport, Sports, Category, Application, Theme, Series, Games, Style
English top 10 description is: Game, Sport, Sports, Title, Style, Application, Category, Theme, Series, Games
Overlap is:    Application, Category, Game, Games, Series, Sport, Sports, Style, Theme, Title (10)
Difference is:  (0)
Spanish top .5 description is: Game, Sport, Sports, Category, Application, Theme, Series, Games, Style, Rating
English top .5 description is: Game, Sport, Sports, Style, Application, Category, Theme, Series, Games, Football
Overlap is:    Application, Category, Game, Games, Series, Sport, Sports, Style, Theme (9)
Difference is: Football, Rating (2)

Spanish word is: tiempo, English word is: long
Spanish none description is: Age, Description, Duration, Year, Game, Time, Location, Status, Date, Rating
English none description is: long, Long, Age, age, Year, Country, Price, Title, Far, Racing
Overlap is:    Age, Year (2)
Difference is: Country, Date, Description, Duration, Far, Game, Location, Long, Price, Racing, Rating, Status, Time, Title, age, long (16)
Spanish top 10 description is: Duration, Year, Game, Time, Date, Rating, Size, Season, Weight, Race
English top 10 description is: long, Long, age, Year, Price, Title, Far, Racing, Weight, Contract
Overlap is:    Weight, Year (2)
Difference is: Contract, Date, Duration, Far, Game, Long, Price, Race, Racing, Rating, Season, Size, Time, Title, age, long (16)
Spanish top .5 description is: Duration, Year, Game, Rating, Season, Race, Level, Application, Price, Games
English top .5 description is: long, Long, age, Year, Price, Far, Racing, Contract, Brown, Song
Overlap is:    Price, Year (2)
Difference is: Application, Brown, Contract, Duration, Far, Game, Games, Level, Long, Race, Racing, Rating, Season, Song, age, long (16)

Spanish word is: futbol, English word is: Football
Spanish none description is: Sport, Sports, Game, Type, Style, Hobby, Football, Games, Application, Country
English none description is: Sport, Sports, Game, Type, Style, Football, Hobby, Theme, Application, Games
Overlap is:    Application, Football, Game, Games, Hobby, Sport, Sports, Style, Type (9)
Difference is: Country, Theme (2)
Spanish top 10 description is: Sport, Sports, Game, Style, Hobby, Football, Games, Application, Theme, Title
English top 10 description is: Sport, Sports, Game, Style, Football, Hobby, Theme, Application, Games, Category
Overlap is:    Application, Football, Game, Games, Hobby, Sport, Sports, Style, Theme (9)
Difference is: Category, Title (2)
Spanish top .5 description is: Sport, Sports, Game, Style, Hobby, Football, Games, Application, Theme, Series
English top .5 description is: Sport, Sports, Game, Style, Football, Hobby, Theme, Application, Games, Category
Overlap is:    Application, Football, Game, Games, Hobby, Sport, Sports, Style, Theme (9)
Difference is: Category, Series (2)

Reviewing these evaluations is becoming quite difficult. Each one contains things that I like and things that I do not, and weighing them against each other by eye is not reliable. A more objective way to assess the quality of the model is required.
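One candidate would be a set-overlap score such as Jaccard similarity over the top description tokens, which could then be averaged across all evaluation pairs. The sketch below is only an illustration of the idea, not part of the existing evaluation code; the jaccard helper is hypothetical, and the description lists are copied from the futbol and bass evaluations above, where a high score should mean the same sense and a low score a different one.

Code
def jaccard(first: list, second: list) -> float:
    """Set overlap between two description lists, in the range [0, 1]."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b)

# same sense across languages (futbol / Football, top 10) -> high score
print(jaccard(
    "Sport Sports Game Style Hobby Football Games Application Theme Title".split(),
    "Sport Sports Game Style Football Hobby Theme Application Games Category".split(),
))  # 9 shared of 11 distinct tokens, ~0.82

# different senses of bass (fish vs player, top 10) -> low score
print(jaccard(
    "Area Color View Size Title Position Category Application Cat Land".split(),
    "Title Details Position Rating Application Contact Item Information Service Model".split(),
))  # 3 shared of 17 distinct tokens, ~0.18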