MNLI vs XNLI

How Multilingual is Multi-Genre Natural Language Inference
Published February 7, 2024

I want to use an MNLI model to do sentiment analysis. It’s an odd situation as you could just train a sentiment classifier; however, in this case I want to train on a very small amount of data. Using a larger model trained on a more complex task should give it the context to do well.
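
To make the end goal concrete, this is the kind of zero-shot use I have in mind. The example text and candidate labels below are made up for illustration; the zero-shot pipeline turns each candidate label into an NLI hypothesis and scores it with the MNLI head.

Code
from transformers import pipeline

# zero-shot classification reuses the NLI model: each candidate label becomes a hypothesis
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I loved every minute of this film.",  # hypothetical example text
    candidate_labels=["positive", "negative"],
)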

BART large MNLI (Lewis et al. 2019) is an English model trained on the English-only Multi-Genre Natural Language Inference corpus (Williams, Nangia, and Bowman 2018). There is a multilingual version of MNLI called XNLI (Conneau et al. 2018), which is described as a subset of a few thousand examples from MNLI that has been translated into 14 different languages.

Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension.” CoRR abs/1910.13461. http://arxiv.org/abs/1910.13461.
Williams, Adina, Nikita Nangia, and Samuel Bowman. 2018. “A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112–22. New Orleans, Louisiana: Association for Computational Linguistics. http://aclweb.org/anthology/N18-1101.
Conneau, Alexis, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. “XNLI: Evaluating Cross-Lingual Sentence Representations.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics.

How well will BART Large MNLI do on this dataset? I think it will be fun to see. Training these large language models, even nominally monolingual ones, can impart some understanding of other languages, as the training data often includes some text in other languages. BART large has enough capacity to pick up some idea of other languages from that.

To make this evaluation more interesting I can test BART Large MNLI on the MNLI dataset first, to establish a baseline. Then I can test it against the XNLI dataset where both the premise and hypothesis are in a non-English language. Finally I can test it against the XNLI dataset where the premise is in a non-English language but the hypothesis is in English.

There are XNLI models as well, so a comparison to a similarly sized model that is explicitly trained to do this would be good.

Dataset

As always we will start with the dataset. Getting the data is easy as both datasets are available on Hugging Face.

Let’s start with the mnli dataset, which has two validation splits (matched and mismatched). The labels are entailment (0), neutral (1), contradiction (2).

Code
from datasets import load_dataset
import pandas as pd

mnli_ds = load_dataset("multi_nli")

print("validation matched")
display(
    pd.DataFrame(mnli_ds["validation_matched"])
        [["premise", "hypothesis", "label"]]
        .head()
)

print("validation mismatched")
display(
    pd.DataFrame(mnli_ds["validation_mismatched"])
        [["premise", "hypothesis", "label"]]
        .head()
)

mnli_df = pd.concat([
    pd.DataFrame(mnli_ds["validation_matched"]),
    pd.DataFrame(mnli_ds["validation_mismatched"]),
])[["premise", "hypothesis", "label"]]
validation matched
premise hypothesis label
0 The new rights are nice enough Everyone really likes the newest benefits 1
1 This site includes a list of all award winners... The Government Executive articles housed on th... 2
2 uh i don't know i i have mixed emotions about ... I like him for the most part, but would still ... 0
3 yeah i i think my favorite restaurant is alway... My favorite restaurants are always at least a ... 2
4 i don't know um do you do a lot of camping I know exactly. 2
validation mismatched
premise hypothesis label
0 Your contribution helped make it possible for ... Your contributions were of no help with our st... 2
1 The answer has nothing to do with their cause,... Dictionaries are indeed exercises in bi-unique... 2
2 We serve a classic Tuscan meal that includes ... We serve a meal of Florentine terrine. 0
3 A few months ago, Carl Newton and I wrote a le... Carl Newton and I have never had any other pre... 2
4 I was on this earth you know, I've lived on th... I don't yet know the reason why I have lived o... 0
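
As a sanity check on the 0/1/2 mapping described above, the label names can be read directly from the dataset features (the label column is a ClassLabel, so the id-to-name mapping travels with the dataset):

Code
# confirm the label order straight from the dataset features
mnli_ds["validation_matched"].features["label"].names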

The validation data is approximately evenly split:

Code
import pandas as pd

pd.DataFrame({
    "validation_matched": pd.DataFrame(mnli_ds["validation_matched"]).label.value_counts(),
    "validation_mismatched": pd.DataFrame(mnli_ds["validation_mismatched"]).label.value_counts(),
})
validation_matched validation_mismatched
label
0 3479 3463
2 3213 3240
1 3123 3129

Now we can consider the xnli dataset. This is split by language; since I want an effective test, I am going to use all languages.
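
As an aside, the xnli loader also exposes per-language configurations that return plain-string premise and hypothesis columns. If I only needed one language that would be simpler, but it would not let me mix the premise and hypothesis languages. A quick sketch, assuming the configurations are named after their ISO codes:

Code
from datasets import load_dataset

# a single-language configuration has plain string columns instead of per-language dictionaries
xnli_fr_ds = load_dataset("xnli", "fr")
xnli_fr_ds["test"][0]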

Code
from datasets import load_dataset
import pandas as pd

xnli_ds = load_dataset("xnli", "all_languages")
xnli_df = pd.DataFrame(xnli_ds["test"])
xnli_df.head()
premise hypothesis label
0 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 2
1 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 0
2 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 1
3 {'ar': 'واعتقدت أن ذلك شرف لي ، ولا يزال ، ولا... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 1
4 {'ar': 'واعتقدت أن ذلك شرف لي ، ولا يزال ، ولا... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 0

This is trickier to deal with. If we wanted a French premise with an English hypothesis, could we do that?

Code
df = xnli_df.copy()
df["premise"] = df.premise.apply(lambda premise: premise["fr"])
df["hypothesis"] = df.hypothesis.apply(lambda hypothesis: hypothesis["translation"][hypothesis["language"].index("en")])
df.head()
premise hypothesis label
0 Eh bien, je ne pensais même pas à cela, mais j... I havent spoken to him again. 2
1 Eh bien, je ne pensais même pas à cela, mais j... I was so upset that I just started talking to ... 0
2 Eh bien, je ne pensais même pas à cela, mais j... We had a great talk. 1
3 Et je pensais que c'était un privilège, et ça ... I was not aware that I was not the only person... 1
4 Et je pensais que c'était un privilège, et ça ... I was under the impression that I was the only... 0

The premise column is a dictionary of language code to premise text. In comparison, the hypothesis column has language and translation lists that are aligned with each other. Extracting from the hypothesis is needlessly complicated compared to the premise. Otherwise this is fine.
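
If I were doing more of this I would normalize the hypothesis into the same shape as the premise. A small hypothetical helper, relying on the fact that the language and translation lists are aligned:

Code
def hypothesis_to_dict(hypothesis: dict[str, list[str]]) -> dict[str, str]:
    # turn the aligned language/translation lists into a language code -> text mapping
    return dict(zip(hypothesis["language"], hypothesis["translation"]))

hypothesis_to_dict(xnli_df.hypothesis.iloc[0])["en"]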

Code
xnli_df.label.value_counts()
label
2    1670
0    1670
1    1670
Name: count, dtype: int64

Evaluation

To evaluate the models I will start with a known-good test, which is to evaluate bart-large-mnli on the mnli dataset. This is what it was trained on and it should perform well.

Code
MNLI_MODEL_NAME = "facebook/bart-large-mnli"
Code
from typing import Callable
from transformers import AutoTokenizer

DatasetRows = dict[str, list[str]]
InputIds = list[int]
RowInputIds = list[InputIds]
EncodedRows = dict[str, RowInputIds]

def encode_mnli(model_name: str) -> Callable[[DatasetRows], EncodedRows]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    def encode(rows: DatasetRows) -> EncodedRows:
        encoded = tokenizer(
            rows["premise"],
            rows["hypothesis"],
            truncation="only_first",
            return_attention_mask=False,
        )
        return {"input_ids": encoded.input_ids}
    return encode
Code
encoded_mnli_ds = mnli_ds.map(encode_mnli(MNLI_MODEL_NAME), batched=True)
encoded_mnli_ds = encoded_mnli_ds.select_columns(["input_ids", "label"])

I’ve encoded the mnli dataset to pair the hypothesis with the premise and convert them to tokenized form. Doing this in advance makes inference faster as the evaluation loop can just move from batch to batch.
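
Since each batch is padded to the length of its longest sequence, one optional tweak (which I am not using here) is to sort the encoded dataset by token count so that batches are more uniform in length and waste less compute on padding. A sketch of what that could look like:

Code
# hypothetical tweak: sort by tokenized length; labels move with their rows so evaluation still lines up
sorted_ds = (
    encoded_mnli_ds["validation_matched"]
    .map(lambda row: {"length": len(row["input_ids"])})
    .sort("length")
)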

It’s now time to evaluate this. The model is both large and trained on this specific task so I expect it to do well.

Code
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import datasets
from sklearn.metrics import classification_report
import torch
from tqdm.auto import tqdm
import numpy as np

# the mnli dataset has labels entailment (0), neutral (1), contradiction (2)
# the bart-large-mnli has outputs contradiction (0), neutral (1), entailment (2)

@torch.inference_mode()
def evaluate(
    model_name: str,
    ds: datasets.Dataset,
    batch_size: int = 64,
    label_map: dict = {0: 2, 2: 0}, # FROM model TO dataset
    detail: bool = True,
) -> float:
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model = model.cuda()
    model = model.eval()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    predictions = []

    for index in tqdm(range(0, len(ds), batch_size)):
        batch = ds.select_columns("input_ids")[index:index+batch_size]
        encoded = tokenizer.pad(batch, return_tensors="pt")
        encoded = encoded.to(model.device)
        output = model(**encoded)
        output = output.logits.argmax(dim=-1).cpu().tolist()
        if label_map:
            output = [label_map.get(value, value) for value in output]
        predictions.extend(output)

    predictions = np.array(predictions)

    if detail:
        report = classification_report(
            y_true=ds["label"],
            y_pred=predictions,
            target_names=["entailment", "neutral", "contradiction"],
        )
        print(report)

    return (predictions == ds["label"]).mean()
Code
evaluate(MNLI_MODEL_NAME, ds=encoded_mnli_ds["validation_matched"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.92      0.92      0.92      3479
      neutral       0.88      0.85      0.86      3123
contradiction       0.91      0.94      0.92      3213

     accuracy                           0.90      9815
    macro avg       0.90      0.90      0.90      9815
 weighted avg       0.90      0.90      0.90      9815
Code
evaluate(MNLI_MODEL_NAME, ds=encoded_mnli_ds["validation_mismatched"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.90      0.92      0.91      3463
      neutral       0.87      0.84      0.86      3129
contradiction       0.92      0.93      0.93      3240

     accuracy                           0.90      9832
    macro avg       0.90      0.90      0.90      9832
 weighted avg       0.90      0.90      0.90      9832

90% overall accuracy looks good to me. I do question the wisdom of ordering the model outputs so they don’t align with the dataset labels, hence the label_map above.
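
The output order does not have to be taken on trust; it can be read from the model configuration:

Code
from transformers import AutoConfig

# bart-large-mnli reports contradiction (0), neutral (1), entailment (2)
AutoConfig.from_pretrained("facebook/bart-large-mnli").id2label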

Anyway, now we should be able to evaluate this model on the various forms of the xnli dataset.

Code
from typing import Callable
from transformers import AutoTokenizer

DatasetRows = dict[str, list[str]]
InputIds = list[int]
RowInputIds = list[InputIds]
EncodedRows = dict[str, RowInputIds]

def encode_xnli(
    model_name: str,
    *,
    premise_language: str,
    hypothesis_language: str,
) -> Callable[[DatasetRows], EncodedRows]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def get_premise(rows: DatasetRows) -> list[str]:
        return [row[premise_language] for row in rows["premise"]]
    def get_hypothesis(rows: DatasetRows) -> list[str]:
        hypotheses = rows["hypothesis"]
        return [
            row["translation"][row["language"].index(hypothesis_language)]
            for row in hypotheses
        ]

    def encode(rows: DatasetRows) -> EncodedRows:
        premise = get_premise(rows)
        hypothesis = get_hypothesis(rows)
        encoded = tokenizer(
            premise,
            hypothesis,
            truncation="only_first",
            return_attention_mask=False,
        )
        return {"input_ids": encoded.input_ids}
    return encode
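
Before mapping the whole dataset it is worth a quick spot check that the premise and hypothesis really do end up in the requested languages. Decoding one encoded row makes that visible (this is just a check, not part of the main flow):

Code
from transformers import AutoTokenizer

# encode a single French-premise / English-hypothesis row and decode it back to text
spot_check = encode_xnli(
    MNLI_MODEL_NAME,
    premise_language="fr",
    hypothesis_language="en",
)(xnli_ds["test"][:1])
print(AutoTokenizer.from_pretrained(MNLI_MODEL_NAME).decode(spot_check["input_ids"][0]))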
Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="en",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.92      0.91      0.91      1670
      neutral       0.88      0.88      0.88      1670
contradiction       0.93      0.94      0.94      1670

     accuracy                           0.91      5010
    macro avg       0.91      0.91      0.91      5010
 weighted avg       0.91      0.91      0.91      5010

This first evaluation is a sanity check: here both the premise and the hypothesis are in English. The score of 91% is very similar to what was observed on the MNLI dataset.

This is great as the XNLI dataset is a subset of the MNLI dataset that has been translated. Let’s try it with a French premise and an English hypothesis.

Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="fr",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.80      0.71      0.75      1670
      neutral       0.71      0.78      0.74      1670
contradiction       0.80      0.81      0.81      1670

     accuracy                           0.77      5010
    macro avg       0.77      0.77      0.77      5010
 weighted avg       0.77      0.77      0.77      5010

The performance against the French premises is actually reasonable. It’s much worse than against English; however, it is much better than random chance. I suspect that the underlying bart-large model learnt some French during pre-training and the fine-tuned model is able to harness this. French and English also share a common root (Indo-European) so the model may be able to exploit that relatedness as well.

To test this we can run the same evaluation against Arabic, which is an Afro-Asiatic language written in a different script.

Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="ar",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.79      0.02      0.04      1670
      neutral       0.41      0.91      0.56      1670
contradiction       0.74      0.54      0.62      1670

     accuracy                           0.49      5010
    macro avg       0.65      0.49      0.41      5010
 weighted avg       0.65      0.49      0.41      5010

Here the simple accuracy appears to be better than random chance (around 33% with three balanced classes); however, the model no longer meaningfully predicts entailment. This looks like a broken model to me.

To fully evaluate the model we can compute the accuracy score for every language in the dataset. Remember that the hypothesis will remain in English for all of these.

Code
import datasets
import pandas as pd
from tqdm.auto import tqdm

ALL_XNLI_LANGUAGES = [
    "ar",
    "bg",
    "de",
    "el",
    "en",
    "es",
    "fr",
    "hi",
    "ru",
    "sw",
    "th",
    "tr",
    "ur",
    "vi",
    "zh",
]
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "bg": "Bulgarian",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "hi": "Hindi",
    "ru": "Russian",
    "sw": "Swahili",
    "th": "Thai",
    "tr": "Turkish",
    "ur": "Urdu",
    "vi": "Vietnamese",
    "zh": "Chinese",
}

def xnli_report(
    model_name: str,
    ds: datasets.Dataset,
    batch_size: int = 64,
    label_map: dict = {0: 2, 2: 0}, # FROM model TO dataset
    premise_languages: list[str] = ALL_XNLI_LANGUAGES,
    hypothesis_language: str = "en",
) -> pd.DataFrame:
    results = []
    for premise_language in tqdm(premise_languages):
        language = LANGUAGE_NAMES[premise_language]
        encoded_ds = ds.map(
            encode_xnli(
                model_name,
                premise_language=premise_language,
                hypothesis_language=hypothesis_language,
            ),
            batched=True,
        )
        encoded_ds = encoded_ds.select_columns(["input_ids", "label"])
        accuracy = evaluate(
            model_name=model_name,
            ds=encoded_ds,
            batch_size=batch_size,
            label_map=label_map,
            detail=False,
        )
        results.append({
            "language": language,
            "iso-language-code": premise_language,
            "accuracy": accuracy,
        })
        print(f"{language}: {accuracy:0.5f}")
    return pd.DataFrame(results)
Code
mnli_results_df = xnli_report(
    MNLI_MODEL_NAME,
    ds=xnli_ds["test"],
)
Code
mnli_results_df
language iso-language-code accuracy
0 Arabic ar 0.489621
1 Bulgarian bg 0.512774
2 German de 0.755689
3 Greek el 0.507385
4 English en 0.909581
5 Spanish es 0.818363
6 French fr 0.766866
7 Hindi hi 0.487824
8 Russian ru 0.516966
9 Swahili sw 0.515968
10 Thai th 0.491218
11 Turkish tr 0.525150
12 Urdu ur 0.495010
13 Vietnamese vi 0.528543
14 Chinese zh 0.494212
Code
(
    mnli_results_df.sort_values(by="accuracy", ascending=False)
        .plot.bar(
            x="language",
            y="accuracy",
            legend=False,
            title="bart-large-mnli accuracy by premise language",
            xlabel="premise language",
            ylabel="accuracy",
        )
) ; None

This clearly shows that the model performance for non-English languages is dramatically worse. That’s not a surprise as the XNLI dataset contains a lot of non-European languages.
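
As a single summary number, the mean accuracy over the fourteen non-English premise languages works out to roughly 0.56, against 0.91 for English:

Code
# mean accuracy over the non-English premise languages
mnli_results_df[mnli_results_df["iso-language-code"] != "en"].accuracy.mean()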

We can compare the performance of bart-large-mnli to MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, which is a multilingual model trained on the XNLI dataset.

Code
XNLI_MODEL_NAME = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
Code
# the MoritzLaurer/mDeBERTa-v3-base-mnli-xnli has outputs entailment (0), neutral (1), contradiction (2)

xnli_results_df = xnli_report(
    XNLI_MODEL_NAME,
    ds=xnli_ds["test"],
    label_map={}, # model matches dataset
)
Code
xnli_results_df
language iso-language-code accuracy
0 Arabic ar 0.823154
1 Bulgarian bg 0.852695
2 German de 0.851497
3 Greek el 0.844910
4 English en 0.883234
5 Spanish es 0.853693
6 French fr 0.848902
7 Hindi hi 0.804990
8 Russian ru 0.849301
9 Swahili sw 0.792415
10 Thai th 0.828543
11 Turkish tr 0.827345
12 Urdu ur 0.787425
13 Vietnamese vi 0.824950
14 Chinese zh 0.831537
Code
(
    xnli_results_df.sort_values(by="accuracy", ascending=False)
        .plot.bar(
            x="language",
            y="accuracy",
            legend=False,
            title=f"{XNLI_MODEL_NAME} accuracy by premise language",
            xlabel="premise language",
            ylabel="accuracy",
        )
) ; None

The performance of this model is far more consistent. Unfortunately, the performance in English has suffered a little (from 0.91 to 0.88). Moritz Laurer does note the following:

multilingual models tend to be less good than English-only models. For maximum performance, it can be better to first machine translate texts to English and then use an English-only model for zeroshot classification. See the other English-only models in this collection. For free open-source machine translation, I recommend https://github.com/UKPLab/EasyNMT.

For simplicity I will use the multilingual model, as machine translation has its own risks.
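
For completeness, the two result tables built above can be merged to compare per-language accuracy between the two models directly:

Code
import pandas as pd

# side-by-side accuracy per premise language; the column names here are my own
comparison_df = pd.merge(
    mnli_results_df[["language", "accuracy"]].rename(columns={"accuracy": "bart-large-mnli"}),
    xnli_results_df[["language", "accuracy"]].rename(columns={"accuracy": "mdeberta-v3-base-mnli-xnli"}),
    on="language",
)
comparison_df.sort_values(by="bart-large-mnli", ascending=False)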