import blog.transformers_logging
May 2, 2022
Word sense induction is the process of clustering words according to their specific meaning. As an example we can consider the word bass, which can be used in several different ways:

1. We spotted a large bass in the ocean.
2. The bass player did not receive the acknowledgment she deserves.
3. The black sea bass is a member of the wreckfish family.

The word is the same but the meaning differs. Word sense induction is the process of clustering these uses such that 1. and 3. end up in the same cluster and 2. in a different cluster.
This process has been studied before and Word Sense Induction with Neural biLM and Symmetric Patterns {% cite amrami-goldberg-2018-word %} used prompting to extract vectors that could be clustered. Since I’ve investigated prompts before I’m quite keen to try this approach. I’m hopeful that this could be applied to my wikipedia clustering attempt from last year.
Since the aim is to internalize a prompt that relates to a word, I'm actually interested in just taking random words from an utterance and then trying to work out the prompted word vector from them. Since the prompt can only refer to the word and not its position in the text, I need the word to be unique within the utterance. It would also be good to restrict the word to a noun, as that should provide a more interesting vector to cluster. All of this can be done using spaCy.
[house, cat]
With this we should be able to pull out the tokens associated with the nouns.
[790]
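A sketch of how that lookup could work for a single-token noun, using the tokenizer's offset mapping (the variable names are illustrative; `noun` stands for one of the spaCy spans from above and `tokenizer` for a roberta-base tokenizer):

# Hypothetical sketch: find the input id whose character offsets
# exactly cover the noun span.
tokens = tokenizer(text, return_offsets_mapping=True)
[
    input_id
    for input_id, (start, end) in zip(tokens.input_ids, tokens.offset_mapping)
    if start == noun.start_char and end == noun.end_char
]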
A little bit of care is needed to get multi-token spans.
[sand, sand castle, castle, shells]
from itertools import takewhile

# Character span of the noun of interest ("sand castle" in the example above).
start, end = words[1].start_char, words[1].end_char

tokens = tokenizer(text, return_offsets_mapping=True)
[
    [
        # Starting at the token that begins the noun, keep taking tokens
        # until one ends beyond the noun's character span.
        token
        for token, _ in takewhile(
            lambda pair: pair[1][1] <= end,
            zip(tokens.input_ids[index:], tokens.offset_mapping[index:])
        )
    ]
    for index, (token_start, _) in enumerate(tokens.offset_mapping)
    if token_start == start
]
[[6255, 22637]]
To be able to work with this I need the start and end indices of the tokens for each word of interest. With that I can create the prompt as well as measure the output from the unprompted model.
from typing import List
import en_core_web_sm
from spacy.tokens import Span
from spacy.matcher import Matcher
class NounExtractor:
def __init__(self) -> None:
self.nlp = en_core_web_sm.load()
self.matcher = Matcher(self.nlp.vocab)
self.matcher.add("nouns", [
[{"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"}],
])
def get_nouns(self, text: str) -> List[Span]:
doc = self.nlp(text)
nouns = self.matcher(doc, as_spans=True)
return self.unique(text, nouns=nouns)
def unique(self, text: str, nouns: List[Span]) -> List[Span]:
text = text.casefold()
return [
noun
for noun in nouns
if text.count(noun.text.casefold()) == 1
]
from typing import Any, Dict, Tuple
from spacy.tokens import Span
from transformers import AutoTokenizer
class Encoder:
def __init__(self, name: str = "roberta-base") -> None:
self.tokenizer = AutoTokenizer.from_pretrained(name)
self.extractor = NounExtractor()
def encode(self, text: str) -> Dict[str, Any]:
nouns = self.extractor.get_nouns(text)
tokens = self.tokenizer(text, truncation=True, return_offsets_mapping=True)
labels = self.find(tokens.offset_mapping, nouns)
return {
"input_ids": tokens.input_ids,
"attention_mask": tokens.attention_mask,
"labels": labels
}
def find(self, offsets: List[Tuple[int, int]], nouns: List[Span]) -> List[Tuple[int, int]]:
starts = {
start: index
for index, (start, end) in enumerate(offsets)
if start != end
}
ends = {
end: index
for index, (start, end) in enumerate(offsets)
if start != end
}
return [
(starts[noun.start_char], 1 + ends[noun.end_char] - starts[noun.start_char])
for noun in nouns
if noun.start_char in starts and noun.end_char in ends
]
{'input_ids': [0, 133, 6255, 22637, 34, 23647, 2],
'attention_mask': [1, 1, 1, 1, 1, 1, 1],
'labels': [(2, 1), (2, 2), (3, 1), (5, 1)]}
 | sentiment | text
---|---|---
0 | negative | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
1 | negative | is upset that he can't update his Facebook by ... |
2 | negative | @Kenichan I dived many times for the ball. Man... |
3 | negative | my whole body feels itchy and like its on fire |
4 | negative | @nationwideclass no, it's not behaving at all.... |
... | ... | ... |
1599995 | positive | Just woke up. Having no school is the best fee... |
1599996 | positive | TheWDB.com - Very cool to hear old Walt interv... |
1599997 | positive | Are you ready for your MoJo Makeover? Ask me f... |
1599998 | positive | Happy 38th Birthday to my boo of alll time!!! ... |
1599999 | positive | happy #charitytuesday @theNSPCC @SparksCharity... |
1600000 rows × 2 columns
%%time
# Assumes `import pandas as pd` and `encoder = Encoder()` from earlier cells.
sentiment_df = pd.read_parquet("/data/sentiment/sentiment140/sentiment.gz.parquet")
sentiment_df = sentiment_df.drop(columns="sentiment")
sentiment_df = sentiment_df.sample(n=100_000)
sentiment_df = sentiment_df.merge(
pd.DataFrame(
sentiment_df.text.apply(encoder.encode).tolist(),
index=sentiment_df.index
),
left_index=True,
right_index=True,
)
sentiment_df = sentiment_df.explode("labels")
sentiment_df = sentiment_df.dropna()
CPU times: user 5min 11s, sys: 124 ms, total: 5min 11s
Wall time: 5min 11s
Dataset({
features: ['text', 'input_ids', 'attention_mask', 'labels', '__index_level_0__'],
num_rows: 425149
})
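The exploded frame is then turned into a Hugging Face dataset, which is what the repr above shows. A sketch of how that, and the train/test split the trainer uses later, could look (the split size is an assumption):

from datasets import Dataset

# Sketch: build a datasets.Dataset from the pandas frame.
# from_pandas keeps the old index around as __index_level_0__.
tweet_ds = Dataset.from_pandas(sentiment_df)

# Split into the train and test sets that the trainer consumes.
tweet_split_ds = tweet_ds.train_test_split(test_size=0.1)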
The next thing is to be able to add the prompt to the end of the text. RoBERTa adds special tokens around the encoded output.
{0: '<s>', 42891: 'hello', 2: '</s>'}
It should be easy enough to add the prompt, I just need to ensure that the prompt doesn’t have the starting special token.
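For example, one way to splice a prompt onto already tokenized text is to encode the prompt without special tokens and insert it before the closing token. This is just a sketch (assuming a roberta-base `tokenizer`); the trainer below takes the simpler route of decoding the text, appending the prompt, and encoding the whole thing again.

# Sketch: append a prompt to already tokenized text without repeating <s>.
text_ids = tokenizer("We spotted a large bass in the ocean.").input_ids
prompt_ids = tokenizer(" The <mask> is a bass.", add_special_tokens=False).input_ids

# Drop the trailing </s>, add the prompt tokens, then close with </s> again.
combined = text_ids[:-1] + prompt_ids + [tokenizer.eos_token_id]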
The next thing to do is to come up with a suitable prompt. It would be much easier to work with this if I add a prompt to the end. Since the existing prompt from the paper is trying to elicit an alternative description of the word maybe I can come up with a similar approach.
The RoBERTa model is a masked language model, so I can come up with two appropriate forms: `The <mask> is a WORD.` and `The WORD is a <mask>.`
We can try this out with the three examples from before.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
@torch.no_grad()
def get_word_signature(
tokenizer: AutoTokenizer,
model: AutoModelForMaskedLM,
text: str,
word: str,
prompt: str
) -> List[str]:
prompt = prompt.format(word)
text = text.strip()
if not text.endswith("."):
text = text + "."
text = f"{text} {prompt.strip()}"
tokens = tokenizer(text, return_tensors="pt")
tokens = tokens.to(model.device)
mask_index = tokens.input_ids[0] == tokenizer.mask_token_id
output = model(**tokens)
predictions = output.logits[0, mask_index].mean(dim=0)
predicted_tokens = predictions.argsort(descending=True)[:10]
return tokenizer.batch_decode(predicted_tokens[:, None])
The next thing is to work out a good prompt. A good prompt will trigger the model to describe the word. The description is the distribution of the prediction for the masked token. Clustering this description is how we will perform word sense induction.
A good prompt will produce predicted tokens which describe the word well.
The first form is `The <mask> is a WORD.` For example: "We spotted a large bass in the ocean. The <mask> is a bass." I'm hopeful that this will produce words like fish or creature.
for text in [
"We spotted a large bass in the ocean.",
"The bass player did not receive the acknowledgment she deserves.",
"The black sea bass, is a member of the wreckfish family.",
]:
print(text)
print(get_word_signature(
tokenizer=tokenizer,
model=model,
text=text,
word="bass",
prompt="The <mask> is a {}.",
))
print()
We spotted a large bass in the ocean.
[' fish', ' animal', ' photo', ' picture', ' species', ' creature', ' bird', ' shark', ' bass', ' specimen']
The bass player did not receive the acknowledgment she deserves.
[' player', ' woman', ' bass', ' girl', ' singer', ' man', ' culprit', ' song', ' musician', ' guitar']
The black sea bass, is a member of the wreckfish family.
[' fish', ' male', ' female', ' shark', ' adult', ' second', ' other', ' species', ' bottom', ' head']
The second form is `The WORD is a <mask>.` For example: "We spotted a large bass in the ocean. The bass is a <mask>." Again fish and creature would fit fine here.
for text in [
"We spotted a large bass in the ocean.",
"The bass player did not receive the acknowledgment she deserves.",
"The black sea bass, is a member of the wreckfish family.",
]:
print(text)
print(get_word_signature(
tokenizer=tokenizer,
model=model,
text=text,
word="bass",
prompt="The {} is a <mask>.",
))
print()
We spotted a large bass in the ocean.
[' male', ' female', ' juvenile', ' fish', ' dolphin', ' shark', ' salmon', ' rainbow', ' bass', ' sub']
The bass player did not receive the acknowledgment she deserves.
[' legend', ' genius', ' monster', ' bitch', ' joke', ' god', ' hero', ' liability', ' star', ' failure']
The black sea bass, is a member of the wreckfish family.
[' juvenile', ' male', ' female', ' sub', ' hybrid', ' fish', ' species', ' bass', ' rainbow', ' carp']
The first prompt (`The <mask> is a bass`) seems to work well. The first word of each set of predictions is highly relevant to the different groups (`fish` for the first and third sentences vs `player` for the second).

The second prompt (`The bass is a <mask>`) appears worse than the first. The predictions for the second sentence seem poorly related to the actual class, seemingly more related to emotive aspects of being a musician. The first and third sentences also get mixed up, as they share the same top 3 predictions, just in a different order.
There are other prompts that could be used. Coming up with a good prompt is tricky and I am more interested in seeing if the prompt internalization technique works at all.
Now we can work on internalizing the prompt. There isn't really a better metric to track during training than the loss function itself. The KL divergence that forms the loss measures the difference between the prompted teacher output and the unprompted student output.
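Stripped of the batching details handled by the trainer below, the core of that loss looks roughly like this (a sketch):

import torch
import torch.nn.functional as F

def distillation_loss(
    teacher_logits: torch.Tensor,  # teacher logits at the <mask> position
    student_logits: torch.Tensor,  # mean of the student logits over the noun tokens
    temperature: float = 2.0,
) -> torch.Tensor:
    # Soften both distributions with the temperature before comparing them.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    loss = F.kl_div(
        input=student_log_probs[None, :],
        target=teacher_probs[None, :],
        reduction="batchmean",
    )
    # Scale by temperature squared so the gradient magnitude is comparable
    # across different temperature settings.
    return loss * temperature**2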
from pathlib import Path
RUN_DIRECTORY = Path("/data/blog/2022-05-02-prompt-internalization-word-sense/runs")
RUN_DIRECTORY.mkdir(parents=True, exist_ok=True)
MODEL_NAME = "roberta-base"
BATCH_SIZE = 16 # 64
LEARNING_RATE = 1e-4
TEMPERATURE = 2
EPOCHS = 2
# MAX_STEPS = 5_000
# MAX_STEPS = 50
EVALUATION_STEPS = 1_000
# EVALUATION_STEPS = 10
Loading cached split indices for dataset at /data/blog/2022-05-02-prompt-internalization-word-sense/tweets.dataset/cache-dd3837117c50dc2a.arrow and /data/blog/2022-05-02-prompt-internalization-word-sense/tweets.dataset/cache-41027af85f6371d6.arrow
# from src/main/python/blog/prompt_internalization/roberta/trainer.py
import string
from typing import Any, Dict, Tuple, Union
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments
from transformers.modeling_outputs import MaskedLMOutput
from transformers.tokenization_utils_base import BatchEncoding
class MaskedPromptInternalizationTrainingArguments(TrainingArguments):
def __init__(self, *args, temperature: float = 2.0, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.temperature = temperature
class MaskedPromptInternalizationTrainer(Trainer):
def __init__(
self,
*args,
teacher_model: AutoModelForMaskedLM = None,
prompt: str = "The <mask> is a {}.",
tokenizer: AutoTokenizer = None,
**kwargs,
) -> None:
super().__init__(*args, **kwargs)
self.teacher = teacher_model
self._move_model_to_device(self.teacher, self.model.device)
self.teacher.eval()
self.prompt = prompt
self.tokenizer = tokenizer
self.punctuation = set(string.punctuation)
def compute_loss(
self,
model: AutoModelForMaskedLM,
inputs: Dict[str, Union[torch.Tensor, Any]],
return_outputs: bool = False,
) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
student_output = model(
input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
teacher_input = self._prompted_text(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
labels=inputs["labels"],
)
teacher_predictions = self._teacher_predictions(teacher_input)
loss = self._student_loss(
student_output=student_output,
teacher_predictions=teacher_predictions,
labels=inputs["labels"],
)
return (loss, student_output) if return_outputs else loss
def _prompted_text(
self,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
labels: torch.Tensor,
) -> BatchEncoding:
"""
This decodes all of the inputs, adds the prompt to the end, and then encodes them again.
This is very inefficient.
"""
prompted_text = [
self._add_prompt(
input_ids=input_ids,
attention_mask=attention_mask,
start=start,
length=length,
)
for input_ids, attention_mask, (start, length) in zip(
input_ids,
attention_mask,
labels,
)
]
prompted_tokens = self.tokenizer(
prompted_text, return_tensors="pt", padding=True
)
return prompted_tokens.to(input_ids.device)
def _add_prompt(
self,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
start: int,
length: int,
) -> torch.Tensor:
input_length = attention_mask.sum()
text = self.tokenizer.decode(input_ids[:input_length], skip_special_tokens=True)
text = text.strip()
if text[-1] not in self.punctuation:
text = text + "."
word = self.tokenizer.decode(input_ids[start : start + length])
word = word.strip()
return text + " " + self.prompt.format(word)
@torch.no_grad()
def _teacher_predictions(self, inputs: BatchEncoding) -> torch.Tensor:
outputs_teacher = self.teacher(**inputs)
mask_indices = inputs.input_ids == self.tokenizer.mask_token_id
return outputs_teacher.logits[mask_indices]
def _student_loss(
self,
student_output: MaskedLMOutput,
teacher_predictions: torch.Tensor,
labels: torch.Tensor,
) -> torch.Tensor:
# Calculating the student prediction is tricky.
# Is the output for a multi token target the mean of the output for each token?
# Should the loss instead be measured per token?
# When calculating this it is very important to avoid breaking back propagation.
# torch.cat will break back propagation, so the loss is calculated per row.
losses = []
for target, output, (start, length) in zip(
teacher_predictions, student_output.logits, labels
):
prediction = output[start : start + length]
prediction = prediction.mean(dim=0)
prediction = F.log_softmax(prediction / self.args.temperature, dim=-1)
target = F.softmax(target / self.args.temperature, dim=-1)
loss = F.kl_div(
input=prediction[None, :],
target=target[None, :],
reduction="batchmean",
log_target=False,
)
loss = loss * (self.args.temperature ** 2)
losses.append(loss)
return sum(losses) / len(losses)
from transformers import AutoModelForMaskedLM, DataCollatorWithPadding, AutoTokenizer
training_args = MaskedPromptInternalizationTrainingArguments(
report_to="none",
output_dir=RUN_DIRECTORY,
num_train_epochs=EPOCHS,
# max_steps=MAX_STEPS,
evaluation_strategy="steps",
logging_steps=EVALUATION_STEPS,
eval_steps=EVALUATION_STEPS,
save_steps=EVALUATION_STEPS,
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
fp16=False,
learning_rate=LEARNING_RATE,
seed=33,
logging_dir=RUN_DIRECTORY / "logs",
save_total_limit=2,
load_best_model_at_end=True,
metric_for_best_model="loss",
greater_is_better=False,
temperature=TEMPERATURE,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
teacher_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
student_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = MaskedPromptInternalizationTrainer(
model=student_model,
args=training_args,
teacher_model=teacher_model,
train_dataset=tweet_split_ds["train"],
eval_dataset=tweet_split_ds["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
trainer.train()
Step | Training Loss | Validation Loss |
---|---|---|
1000 | 0.816900 | 0.653776 |
2000 | 0.666800 | 0.592557 |
3000 | 0.624500 | 0.555341 |
4000 | 0.602500 | 0.539190 |
5000 | 0.571100 | 0.521284 |
6000 | 0.552200 | 0.504296 |
7000 | 0.542400 | 0.488887 |
8000 | 0.527300 | 0.483320 |
9000 | 0.511100 | 0.468313 |
10000 | 0.504900 | 0.458751 |
11000 | 0.498700 | 0.451415 |
12000 | 0.484500 | 0.439299 |
13000 | 0.478300 | 0.442260 |
14000 | 0.466700 | 0.439036 |
15000 | 0.462900 | 0.419386 |
16000 | 0.457700 | 0.415736 |
17000 | 0.451500 | 0.414230 |
18000 | 0.442000 | 0.404812 |
19000 | 0.435000 | 0.394529 |
20000 | 0.425800 | 0.389718 |
21000 | 0.422400 | 0.390572 |
22000 | 0.422700 | 0.382137 |
23000 | 0.412800 | 0.382509 |
24000 | 0.411900 | 0.377646 |
25000 | 0.406900 | 0.363590 |
26000 | 0.397500 | 0.365397 |
27000 | 0.353800 | 0.358085 |
28000 | 0.354000 | 0.354499 |
29000 | 0.348200 | 0.349619 |
30000 | 0.348600 | 0.346917 |
31000 | 0.343600 | 0.346731 |
32000 | 0.340500 | 0.342117 |
33000 | 0.336800 | 0.339344 |
34000 | 0.336500 | 0.337132 |
35000 | 0.333000 | 0.331645 |
36000 | 0.329300 | 0.326415 |
37000 | 0.328100 | 0.329346 |
38000 | 0.328400 | 0.323757 |
39000 | 0.322600 | 0.323505 |
40000 | 0.317800 | 0.313984 |
41000 | 0.320900 | 0.312605 |
42000 | 0.316600 | 0.311725 |
43000 | 0.311300 | 0.308437 |
44000 | 0.307500 | 0.306327 |
45000 | 0.304900 | 0.303245 |
46000 | 0.303800 | 0.301074 |
47000 | 0.301700 | 0.297794 |
48000 | 0.298000 | 0.294003 |
49000 | 0.298100 | 0.293498 |
50000 | 0.295400 | 0.291945 |
51000 | 0.294500 | 0.290203 |
TrainOutput(global_step=51894, training_loss=0.4110746412703209, metrics={'train_runtime': 9900.5662, 'train_samples_per_second': 83.864, 'train_steps_per_second': 5.242, 'total_flos': 2.4962527857415496e+16, 'train_loss': 0.4110746412703209, 'epoch': 2.0})
With a trained model we can review some of the outputs that it produces for different nouns.
Let’s see how it does on the sample sentences from earlier:
Could not locate the tokenizer configuration file, will try to use the model config instead.
We spotted a large bass in the ocean.
[' fish', ' animal', ' photo', ' picture', ' species', ' creature', ' bird', ' shark', ' bass', ' specimen']
The bass player did not receive the acknowledgment she deserves.
[' player', ' woman', ' bass', ' girl', ' singer', ' man', ' culprit', ' song', ' musician', ' guitar']
The black sea bass, is a member of the wreckfish family.
[' fish', ' male', ' female', ' shark', ' adult', ' second', ' other', ' species', ' bottom', ' head']
import torch
@torch.no_grad()
def get_predictions(text: str, start: int, end: int):
print(f"Phrase is: {text}")
tokens = tokenizer(text, return_tensors="pt")
print(f"Target is: {tokenizer.decode(tokens.input_ids[0, start:end])}")
output = student_model(**tokens)
predictions = output.logits[0, start:end].mean(dim=0)
predicted_tokens = predictions.argsort(descending=True)[:10]
predicted_words = tokenizer.batch_decode(predicted_tokens)
print(f"Description is: {predicted_words}")
Phrase is: We spotted a large bass in the ocean.
Target is: bass
Description is: [' fish', ' picture', ' bass', ' photo', ' animal', ' image', ' creature', ' bird', ' species', ' culprit']
Phrase is: The bass player did not receive the acknowledgment she deserves.
Target is: bass
Description is: [' player', ' singer', ' woman', ' bass', ' musician', ' man', ' guitarist', ' drummer', ' guitar', ' girl']
Phrase is: The black sea bass, is a member of the wreckfish family.
Target is: bass
Description is: [' fish', ' bass', ' shark', ' species', ' lobster', ' following', ' crab', ' animal', ' photo', ' picture']
The output of the prompted model for these inputs was:
We spotted a large bass in the ocean.
[' fish', ' animal', ' photo', ' picture', ' species', ' creature', ' bird', ' shark', ' bass', ' specimen']
The bass player did not receive the acknowledgment she deserves.
[' player', ' woman', ' bass', ' girl', ' singer', ' man', ' culprit', ' song', ' musician', ' guitar']
The black sea bass, is a member of the wreckfish family.
[' fish', ' male', ' female', ' shark', ' adult', ' second', ' other', ' species', ' bottom', ' head']
We can compare these to the predictions to get a rough idea of how accurate the model is.
predictions_1 = [' fish', ' picture', ' bass', ' photo', ' animal', ' image', ' creature', ' bird', ' species', ' culprit']
original_1 = [' fish', ' animal', ' photo', ' picture', ' species', ' creature', ' bird', ' shark', ' bass', ' specimen']
predictions_2 = [' player', ' singer', ' woman', ' bass', ' musician', ' man', ' guitarist', ' drummer', ' guitar', ' girl']
original_2 = [' player', ' woman', ' bass', ' girl', ' singer', ' man', ' culprit', ' song', ' musician', ' guitar']
predictions_3 = [' fish', ' bass', ' shark', ' species', ' lobster', ' following', ' crab', ' animal', ' photo', ' picture']
original_3 = [' fish', ' male', ' female', ' shark', ' adult', ' second', ' other', ' species', ' bottom', ' head']
{' animal', ' bass', ' fish', ' photo', ' picture', ' species'}
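Overlaps like the one above come from simple set intersections; a minimal sketch:

# Compare the unprompted student output with the prompted output for the
# same sentence, and the student outputs of different sentences with each other.
print(set(predictions_1) & set(original_1))     # fish sentence: student vs prompted
print(set(predictions_1) & set(predictions_3))  # the two fish sentences
print(set(predictions_1) & set(predictions_2))  # fish sentence vs musician sentence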
This certainly is not working perfectly, but I do think it is a strong performance, with two of the sentences matching the prompted output in 8 of the top 10 tokens. Furthermore the dissimilar sentences (1 and 2) only overlap in a single word, bass, while the similar sentences (1 and 3) share 6 of their top 10 tokens.
Even though I trained this for two epochs it could be trained more - the validation loss was still dropping. The settings that I used were also just the ones from the previous post, a hyperparameter search may well find better performing ones.
One of the tasks from SemEval 2013 is Word Sense Induction {% cite Jurgens_semeval-2013task %}. The original source for this dataset seems to be down but copies of it are available. I have found this one which I will be using for this evaluation.
Part of the problem with using this data is that the sentences are encoded in XML and the labels for the data are stored in a separate file. As such there will be a bit of data wrangling to do first.
 | id | lemma | partOfSpeech | token | tokenEnd | tokenStart | instance
---|---|---|---|---|---|---|---
0 | board.n.1 | board | n | board | 191 | 186 | Field Comm's two main functions were to relay ... |
1 | board.n.2 | board | n | board | 47 | 42 | . Sadly, he was forced to retire from the boar... |
2 | board.n.3 | board | n | board | 171 | 166 | Refinements of both activities have been made ... |
3 | board.n.4 | board | n | board | 17 | 12 | The library board had conducted a national sea... |
4 | board.n.5 | board | n | board | 87 | 82 | oh yes but but what i meant was why do people ... |
... | ... | ... | ... | ... | ... | ... | ... |
90 | board.n.94 | board | n | board | 115 | 110 | uh the uh achievement for the let's see how ar... |
91 | board.n.95 | board | n | boards | 288 | 282 | This discussion was co-facilitated by Colleen ... |
92 | board.n.96 | board | n | boards | 55 | 49 | Most agencies in Texas are run by commissions ... |
93 | board.n.97 | board | n | board | 232 | 227 | Understanding how CIOs of leading organization... |
94 | board.n.98 | board | n | board | 283 | 278 | Having been through those huge postwar sci-fi ... |
95 rows × 7 columns
There is the more compact senseval2 format, however that is more deeply nested so pandas does not read it nicely. The plain xml format has everything that is required.
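Loading the plain xml for a single lemma is then straightforward with pandas; a sketch (the path is a placeholder for wherever the copy of the dataset lives, and the read_xml options may need adjusting):

import pandas as pd

# Sketch: each lemma has its own xml file of <instance> elements whose
# attributes (id, lemma, tokenStart, ...) and text become columns.
board_df = pd.read_xml("path/to/semeval-2013-task-13/board.n.xml")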
The labels are stored in a space-delimited file with a variable number of columns. As the column count varies and the cluster labels are rich, I have decided to write a custom bit of code to read it in.
The source data looks like:
board.n board.n.1 board%1:06:00::/4 board%1:06:03::/4
board.n board.n.2 board%1:14:00::/4
board.n board.n.3 board%1:06:00::/4
The columns are the word, the instance id, and then one or more word senses. I want to map those word sense labels to integer ids as I am just going to cluster the outputs of the model as the evaluation.
from typing import Dict, List, Union
from pathlib import Path
import pandas as pd
class LabelReader:
def __init__(self) -> None:
self.next_id = 0
self.id_map = {}
def to_row(self, line: str) -> Dict[str, Union[str, List[int]]]:
_, row_id, *senses = line.split()
return {
"id": row_id,
"senses": list(map(self.to_id, senses))
}
def to_id(self, sense: str) -> int:
if sense not in self.id_map:
self.id_map[sense] = self.next_id
self.next_id += 1
return self.id_map[sense]
LABELS_FILE = Path("/data/word_sense_induction/semeval/2013/13/raw/keys/gold/nouns.key")
label_reader = LabelReader()
semeval_labels = pd.DataFrame(
map(
label_reader.to_row,
LABELS_FILE.read_text().splitlines()
)
)
semeval_labels
 | id | senses
---|---|---
0 | board.n.1 | [0, 1] |
1 | board.n.2 | [2] |
2 | board.n.3 | [0] |
3 | board.n.4 | [2] |
4 | board.n.5 | [2] |
... | ... | ... |
1843 | window.n.96 | [223] |
1844 | window.n.97 | [223] |
1845 | window.n.98 | [234] |
1846 | window.n.99 | [223] |
1847 | window.n.100 | [223] |
1848 rows × 2 columns
We can see that the label distribution is quite wide, with 234 distinct sense labels, and the most common label is used more than 80 times.
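The senses can then be attached to the contexts with a merge on the shared id column, producing the combined frame below; a sketch (the dataframe names are assumptions):

# Sketch: inspect the sense label spread, then join the senses onto the contexts.
sense_counts = semeval_labels.explode("senses").senses.value_counts()
print(len(sense_counts), sense_counts.max())

semeval_df = semeval_df.merge(semeval_labels, on="id")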
 | id | lemma | partOfSpeech | token | tokenEnd | tokenStart | instance | senses
---|---|---|---|---|---|---|---|---
0 | board.n.1 | board | n | board | 191 | 186 | Field Comm's two main functions were to relay ... | [0, 1] |
1 | board.n.2 | board | n | board | 47 | 42 | . Sadly, he was forced to retire from the boar... | [2] |
2 | board.n.3 | board | n | board | 171 | 166 | Refinements of both activities have been made ... | [0] |
3 | board.n.4 | board | n | board | 17 | 12 | The library board had conducted a national sea... | [2] |
4 | board.n.5 | board | n | board | 87 | 82 | oh yes but but what i meant was why do people ... | [2] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1843 | window.n.96 | window | n | windows | 73 | 66 | [When Violet and her sisters appeared] they th... | [223] |
1844 | window.n.97 | window | n | windows | 89 | 82 | This is certainly a challenging task, but the ... | [223] |
1845 | window.n.98 | window | n | window | 135 | 129 | The rationale for interventions in the emergen... | [234] |
1846 | window.n.99 | window | n | window | 128 | 122 | The bleak black and white of the scene in the ... | [223] |
1847 | window.n.100 | window | n | window | 116 | 110 | like i said i'm just a balcony kind of gardene... | [223] |
1848 rows × 8 columns
It looks like the text can be a bit mangled. There is a bracketed section in `[When Violet and her sisters appeared] they` and one sentence starts badly with `. Sadly, he was forced to retire from`.
The text must be encoded and the lemma turned into a token index to work with the model.
from typing import Any, Dict, Tuple
from spacy.tokens import Span
from transformers import AutoTokenizer
class SemEvalEncoder:
def __init__(self, name: str = "roberta-base") -> None:
self.tokenizer = AutoTokenizer.from_pretrained(name)
def __call__(self, row: pd.Series) -> Dict[str, Any]:
return self.encode(
text=row.instance,
start=row.tokenStart,
end=row.tokenEnd,
)
def encode(self, text: str, start: int, end: int) -> Dict[str, Any]:
tokens = self.tokenizer(text, truncation=True, return_offsets_mapping=True)
labels = self.find(tokens.offset_mapping, start=start, end=end)
return {
"input_ids": tokens.input_ids,
"attention_mask": tokens.attention_mask,
"labels": labels
}
def find(self, offsets: List[Tuple[int, int]], start: int, end: int) -> List[Tuple[int, int]]:
starts = {
start: index
for index, (start, end) in enumerate(offsets)
if start != end
}
ends = {
end: index
for index, (start, end) in enumerate(offsets)
if start != end
}
return starts[start], 1 + ends[end] - starts[start]
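Since the encoder accepts a dataframe row, it can be applied across the merged frame directly; a sketch (assuming the merged frame is called `semeval_df`):

# Sketch: encode every row, keeping the index aligned with the original frame.
semeval_encoder = SemEvalEncoder()
encoded_df = pd.DataFrame(
    semeval_df.apply(semeval_encoder, axis=1).tolist(),
    index=semeval_df.index,
)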
 | input_ids | attention_mask | labels
---|---|---|---
0 | [0, 27780, 9204, 18, 80, 1049, 8047, 58, 7, 12... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (34, 1) |
1 | [0, 4, 18810, 6, 37, 21, 1654, 7, 7865, 31, 5,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (11, 1) |
2 | [0, 31842, 179, 11217, 9, 258, 1713, 33, 57, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (42, 1) |
3 | [0, 133, 5560, 792, 56, 2964, 10, 632, 1707, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (3, 1) |
4 | [0, 2678, 4420, 53, 53, 99, 939, 2425, 21, 596... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (20, 1) |
... | ... | ... | ... |
1843 | [0, 10975, 1779, 31310, 8, 69, 7502, 1382, 742... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (15, 1) |
1844 | [0, 713, 16, 1819, 10, 4087, 3685, 6, 53, 5, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (17, 1) |
1845 | [0, 133, 23437, 13, 15985, 11, 5, 1923, 2749, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (21, 1) |
1846 | [0, 133, 23530, 909, 8, 1104, 9, 5, 1310, 11, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (26, 1) |
1847 | [0, 3341, 939, 26, 939, 437, 95, 10, 19988, 76... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | (26, 1) |
1848 rows × 3 columns
This is aligned to the original semeval dataframe so the predictions can be joined back to the original.
What we want now is to run each row through the model to get the signature for it. Then we can plot the signatures after dimensionality reduction (PCA or similar) to see if the model output clusters by sense.
import torch
from transformers import AutoModelForMaskedLM
@torch.no_grad()
def get_predictions(
model: AutoModelForMaskedLM,
input_ids: List[int],
attention_mask: List[int],
labels: Tuple[int, int]
) -> List[float]:
start, length = labels
input_ids = torch.tensor(input_ids, dtype=torch.long, device=model.device)
attention_mask = torch.tensor(attention_mask, dtype=torch.long, device=model.device)
output = model(
input_ids=input_ids[None, :],
attention_mask=attention_mask[None, :]
).logits
prediction = output[0, start:start+length].mean(dim=0)
prediction = prediction.softmax(dim=0)
return prediction.tolist()
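Applying this to every encoded row and attaching the result to the SemEval frame gives the table below; a sketch (the dataframe names are assumptions):

# Sketch: compute the vocabulary distribution for every row and attach it.
semeval_df["prediction"] = encoded_df.apply(
    lambda row: get_predictions(
        model=student_model,
        input_ids=row.input_ids,
        attention_mask=row.attention_mask,
        labels=row.labels,
    ),
    axis=1,
)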
 | id | lemma | partOfSpeech | token | tokenEnd | tokenStart | instance | senses | prediction
---|---|---|---|---|---|---|---|---|---
0 | board.n.1 | board | n | board | 191 | 186 | Field Comm's two main functions were to relay ... | [0, 1] | [1.2305063634698854e-08, 5.040778461307127e-09... |
1 | board.n.2 | board | n | board | 47 | 42 | . Sadly, he was forced to retire from the boar... | [2] | [2.0327217686855192e-09, 2.6563975463744782e-0... |
2 | board.n.3 | board | n | board | 171 | 166 | Refinements of both activities have been made ... | [0] | [2.0365833464097705e-08, 4.597210168100219e-09... |
3 | board.n.4 | board | n | board | 17 | 12 | The library board had conducted a national sea... | [2] | [2.2280894906856474e-09, 2.0251571530849333e-0... |
4 | board.n.5 | board | n | board | 87 | 82 | oh yes but but what i meant was why do people ... | [2] | [2.2438069180452658e-09, 1.0182827914206882e-0... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1843 | window.n.96 | window | n | windows | 73 | 66 | [When Violet and her sisters appeared] they th... | [223] | [1.2446962571743825e-08, 4.409422160733811e-09... |
1844 | window.n.97 | window | n | windows | 89 | 82 | This is certainly a challenging task, but the ... | [223] | [1.8099743925859002e-08, 6.042446543119695e-09... |
1845 | window.n.98 | window | n | window | 135 | 129 | The rationale for interventions in the emergen... | [234] | [5.634586841551936e-08, 5.296822092049069e-09,... |
1846 | window.n.99 | window | n | window | 128 | 122 | The bleak black and white of the scene in the ... | [223] | [1.991478537632929e-09, 7.5159434231864e-10, 1... |
1847 | window.n.100 | window | n | window | 116 | 110 | like i said i'm just a balcony kind of gardene... | [223] | [6.050367318266581e-09, 2.603556259472839e-09,... |
1848 rows × 9 columns
To cluster this I am going to use dimensionality reduction and then plot the predictions. By showing each lemma separately and coloring the points by sense, a good result would show the different senses as separate clusters.
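A sketch of the reduction step, assuming scikit-learn; the pca/tsne column names match what `plot_lemma` expects below:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Sketch: project the vocabulary-sized prediction vectors down to 2d.
predictions = np.array(semeval_df.prediction.tolist())

semeval_df[["pca_x", "pca_y"]] = PCA(n_components=2).fit_transform(predictions)
semeval_df[["tsne_x", "tsne_y"]] = TSNE(n_components=2).fit_transform(predictions)

# The plotting code below refers to this frame as df.
df = semeval_df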
This isn’t a great start, as the output seems to be quite grouped already. Getting clear clusters out of this will be difficult.
This visualization is slightly better. T-SNE can manipulate dimensions with the aim of keeping close points together and spreading out distant points. I can see some separate clusters in this output.
The next thing will be to show the clusters for each separate lemma.
def plot_lemma(df: pd.DataFrame, lemma: str, method: str) -> None:
plot_df = df.copy()
plot_df = plot_df[
(plot_df.lemma == lemma)
& (plot_df.senses.str.len() == 1)
]
plot_df = plot_df.explode("senses")
plot_df["senses"] -= plot_df.senses.min()
plot_df.plot.scatter(
x=f"{method}_x",
y=f"{method}_y",
c="senses",
colormap="viridis",
title=lemma
)
for label in sorted(df.lemma.unique()):
plot_lemma(df, lemma=label, method="tsne")
This is a bit disappointing. The model does not distinguish well between the different senses of the words. Using the PCA visualization instead does not significantly change this.

I wonder how much this has to do with the prompt that was chosen. At least the internalization of the prompt has worked to a degree.