MNLI vs XNLI

How Multilingual is Multi-Genre Natural Language Inference
Published February 7, 2024

I want to use an MNLI model to do sentiment analysis. It’s an odd situation as you could just train a sentiment classifier; however, in this case I want to train on a very small amount of data. Using a larger model trained on a more complex task should give it the context to do well.
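
To make the end goal concrete, this is the kind of zero-shot use I have in mind. The example text and candidate labels below are made up for illustration; the zero-shot pipeline turns each candidate label into an NLI hypothesis and scores it with the MNLI head.

Code
from transformers import pipeline

# zero-shot classification reuses the NLI model: each candidate label becomes a hypothesis
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I loved every minute of this film.",  # hypothetical example text
    candidate_labels=["positive", "negative"],
)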

BART large MNLI (Lewis et al. 2019) is an English model trained on the English-only Multi-Genre Natural Language Inference corpus (Williams, Nangia, and Bowman 2018). There is a multilingual version of MNLI called XNLI (Conneau et al. 2018), which is described as a subset of a few thousand examples from MNLI that has been translated into 14 different languages.

Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension.” CoRR abs/1910.13461. http://arxiv.org/abs/1910.13461.
Williams, Adina, Nikita Nangia, and Samuel Bowman. 2018. “A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112–22. New Orleans, Louisiana: Association for Computational Linguistics. http://aclweb.org/anthology/N18-1101.
Conneau, Alexis, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. “XNLI: Evaluating Cross-Lingual Sentence Representations.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics.

How well will BART Large MNLI do on this dataset? I think it will be fun to see. Training these large language models, even nominally monolingual ones, can impart some understanding of other languages, as the training data often includes some text in other languages. BART large has enough capacity to pick up some idea of other languages from that.

To make this evaluation more interesting I can test BART Large MNLI on the MNLI dataset first, to establish a baseline. Then I can test it against the XNLI dataset where both the premise and hypothesis are in a non-English language. Finally I can test it against the XNLI dataset where the premise is in a non-English language but the hypothesis is in English.

There are XNLI models as well, so a comparison to a similarly sized model that is explicitly trained to do this would be good.

Dataset

As always we will start with the dataset. Getting the data is easy as both datasets are available on Hugging Face.

Let’s start with the mnli dataset, which has two validation splits (matched and mismatched). The labels are entailment (0), neutral (1), contradiction (2).

Code
from datasets import load_dataset
import pandas as pd

mnli_ds = load_dataset("multi_nli")

print("validation matched")
display(
    pd.DataFrame(mnli_ds["validation_matched"])
        [["premise", "hypothesis", "label"]]
        .head()
)

print("validation mismatched")
display(
    pd.DataFrame(mnli_ds["validation_mismatched"])
        [["premise", "hypothesis", "label"]]
        .head()
)

mnli_df = pd.concat([
    pd.DataFrame(mnli_ds["validation_matched"]),
    pd.DataFrame(mnli_ds["validation_mismatched"]),
])[["premise", "hypothesis", "label"]]
validation matched
premise hypothesis label
0 The new rights are nice enough Everyone really likes the newest benefits 1
1 This site includes a list of all award winners... The Government Executive articles housed on th... 2
2 uh i don't know i i have mixed emotions about ... I like him for the most part, but would still ... 0
3 yeah i i think my favorite restaurant is alway... My favorite restaurants are always at least a ... 2
4 i don't know um do you do a lot of camping I know exactly. 2
validation mismatched
premise hypothesis label
0 Your contribution helped make it possible for ... Your contributions were of no help with our st... 2
1 The answer has nothing to do with their cause,... Dictionaries are indeed exercises in bi-unique... 2
2 We serve a classic Tuscan meal that includes ... We serve a meal of Florentine terrine. 0
3 A few months ago, Carl Newton and I wrote a le... Carl Newton and I have never had any other pre... 2
4 I was on this earth you know, I've lived on th... I don't yet know the reason why I have lived o... 0
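
As a sanity check on the 0/1/2 mapping described above, the label names can be read directly from the dataset features (the label column is a ClassLabel, so the id-to-name mapping travels with the dataset):

Code
# confirm the label order straight from the dataset features
mnli_ds["validation_matched"].features["label"].names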

The validation data is approximately evenly split:

Code
import pandas as pd

pd.DataFrame({
    "validation_matched": pd.DataFrame(mnli_ds["validation_matched"]).label.value_counts(),
    "validation_mismatched": pd.DataFrame(mnli_ds["validation_mismatched"]).label.value_counts(),
})
validation_matched validation_mismatched
label
0 3479 3463
2 3213 3240
1 3123 3129

Now we can consider the xnli dataset. This is split by language; since I want an effective test, I am going to use all languages.
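
As an aside, the xnli loader also exposes per-language configurations that return plain-string premise and hypothesis columns. If I only needed one language that would be simpler, but it would not let me mix the premise and hypothesis languages. A quick sketch, assuming the configurations are named after their ISO codes:

Code
from datasets import load_dataset

# a single-language configuration has plain string columns instead of per-language dictionaries
xnli_fr_ds = load_dataset("xnli", "fr")
xnli_fr_ds["test"][0]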

Code
from datasets import load_dataset
import pandas as pd

xnli_ds = load_dataset("xnli", "all_languages")
xnli_df = pd.DataFrame(xnli_ds["test"])
xnli_df.head()
premise hypothesis label
0 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 2
1 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 0
2 {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن ك... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 1
3 {'ar': 'واعتقدت أن ذلك شرف لي ، ولا يزال ، ولا... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 1
4 {'ar': 'واعتقدت أن ذلك شرف لي ، ولا يزال ، ولا... {'language': ['ar', 'bg', 'de', 'el', 'en', 'e... 0

This is trickier to deal with. If we wanted a French premise with an English hypothesis, could we do that?

Code
df = xnli_df.copy()
df["premise"] = df.premise.apply(lambda premise: premise["fr"])
df["hypothesis"] = df.hypothesis.apply(lambda hypothesis: hypothesis["translation"][hypothesis["language"].index("en")])
df.head()
premise hypothesis label
0 Eh bien, je ne pensais même pas à cela, mais j... I havent spoken to him again. 2
1 Eh bien, je ne pensais même pas à cela, mais j... I was so upset that I just started talking to ... 0
2 Eh bien, je ne pensais même pas à cela, mais j... We had a great talk. 1
3 Et je pensais que c'était un privilège, et ça ... I was not aware that I was not the only person... 1
4 Et je pensais que c'était un privilège, et ça ... I was under the impression that I was the only... 0

The premise column is a dictionary of language code to premise text. In comparison, the hypothesis column has language and translation lists that are aligned with each other. Extracting from the hypothesis is needlessly complicated compared to the premise. Otherwise this is fine.
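
If I were doing more of this I would normalize the hypothesis into the same shape as the premise. A small hypothetical helper, relying on the fact that the language and translation lists are aligned:

Code
def hypothesis_to_dict(hypothesis: dict[str, list[str]]) -> dict[str, str]:
    # turn the aligned language/translation lists into a language code -> text mapping
    return dict(zip(hypothesis["language"], hypothesis["translation"]))

hypothesis_to_dict(xnli_df.hypothesis.iloc[0])["en"]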

Code
xnli_df.label.value_counts()
label
2    1670
0    1670
1    1670
Name: count, dtype: int64

Evaluation

To evaluate the models I will start with a known-good test, which is to evaluate bart-large-mnli on the mnli dataset. This is what it was trained on and it should perform well.

Code
MNLI_MODEL_NAME = "facebook/bart-large-mnli"
Code
from typing import Callable
from transformers import AutoTokenizer

DatasetRows = dict[str, list[str]]
InputIds = list[int]
RowInputIds = list[InputIds]
EncodedRows = dict[str, RowInputIds]

def encode_mnli(model_name: str) -> Callable[[DatasetRows], EncodedRows]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    def encode(rows: DatasetRows) -> EncodedRows:
        encoded = tokenizer(
            rows["premise"],
            rows["hypothesis"],
            truncation="only_first",
            return_attention_mask=False,
        )
        return {"input_ids": encoded.input_ids}
    return encode
Code
encoded_mnli_ds = mnli_ds.map(encode_mnli(MNLI_MODEL_NAME), batched=True)
encoded_mnli_ds = encoded_mnli_ds.select_columns(["input_ids", "label"])

I’ve encoded the mnli dataset to pair the hypothesis with the premise and convert them to tokenized form. Doing this in advance makes inference faster as the evaluation loop can just move from batch to batch.
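
Since each batch is padded to the length of its longest sequence, one optional tweak (which I am not using here) is to sort the encoded dataset by token count so that batches are more uniform in length and waste less compute on padding. A sketch of what that could look like:

Code
# hypothetical tweak: sort by tokenized length; labels move with their rows so evaluation still lines up
sorted_ds = (
    encoded_mnli_ds["validation_matched"]
    .map(lambda row: {"length": len(row["input_ids"])})
    .sort("length")
)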

It’s now time to evaluate this. The model is both large and trained on this specific task so I expect it to do well.

Code
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import datasets
from sklearn.metrics import classification_report
import torch
from tqdm.auto import tqdm
import numpy as np

# the mnli dataset has labels entailment (0), neutral (1), contradiction (2)
# the bart-large-mnli has outputs contradiction (0), neutral (1), entailment (2)

@torch.inference_mode()
def evaluate(
    model_name: str,
    ds: datasets.Dataset,
    batch_size: int = 64,
    label_map: dict = {0: 2, 2: 0}, # FROM model TO dataset
    detail: bool = True,
) -> float:
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model = model.cuda()
    model = model.eval()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    predictions = []

    for index in tqdm(range(0, len(ds), batch_size)):
        batch = ds.select_columns("input_ids")[index:index+batch_size]
        encoded = tokenizer.pad(batch, return_tensors="pt")
        encoded = encoded.to(model.device)
        output = model(**encoded)
        output = output.logits.argmax(dim=-1).cpu().tolist()
        if label_map:
            output = [label_map.get(value, value) for value in output]
        predictions.extend(output)

    predictions = np.array(predictions)

    if detail:
        report = classification_report(
            y_true=ds["label"],
            y_pred=predictions,
            target_names=["entailment", "neutral", "contradiction"],
        )
        print(report)

    return (predictions == ds["label"]).mean()
Code
evaluate(MNLI_MODEL_NAME, ds=encoded_mnli_ds["validation_matched"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.92      0.92      0.92      3479
      neutral       0.88      0.85      0.86      3123
contradiction       0.91      0.94      0.92      3213

     accuracy                           0.90      9815
    macro avg       0.90      0.90      0.90      9815
 weighted avg       0.90      0.90      0.90      9815
Code
evaluate(MNLI_MODEL_NAME, ds=encoded_mnli_ds["validation_mismatched"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.90      0.92      0.91      3463
      neutral       0.87      0.84      0.86      3129
contradiction       0.92      0.93      0.93      3240

     accuracy                           0.90      9832
    macro avg       0.90      0.90      0.90      9832
 weighted avg       0.90      0.90      0.90      9832

90% overall accuracy looks good to me. I do question the wisdom of ordering the model outputs so they don’t align with the dataset labels, hence the label_map above.
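
The output order does not have to be taken on trust; it can be read from the model configuration:

Code
from transformers import AutoConfig

# bart-large-mnli reports contradiction (0), neutral (1), entailment (2)
AutoConfig.from_pretrained("facebook/bart-large-mnli").id2label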

Anyway, now we should be able to evaluate this model on the various forms of the xnli dataset.

Code
from typing import Callable
from transformers import AutoTokenizer

DatasetRows = dict[str, list[str]]
InputIds = list[int]
RowInputIds = list[InputIds]
EncodedRows = dict[str, RowInputIds]

def encode_xnli(
    model_name: str,
    *,
    premise_language: str,
    hypothesis_language: str,
) -> Callable[[DatasetRows], EncodedRows]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def get_premise(rows: DatasetRows) -> list[str]:
        return [row[premise_language] for row in rows["premise"]]
    def get_hypothesis(rows: DatasetRows) -> list[str]:
        hypotheses = rows["hypothesis"]
        return [
            row["translation"][row["language"].index(hypothesis_language)]
            for row in hypotheses
        ]

    def encode(rows: DatasetRows) -> EncodedRows:
        premise = get_premise(rows)
        hypothesis = get_hypothesis(rows)
        encoded = tokenizer(
            premise,
            hypothesis,
            truncation="only_first",
            return_attention_mask=False,
        )
        return {"input_ids": encoded.input_ids}
    return encode
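
Before mapping the whole dataset it is worth a quick spot check that the premise and hypothesis really do end up in the requested languages. Decoding one encoded row makes that visible (this is just a check, not part of the main flow):

Code
from transformers import AutoTokenizer

# encode a single French-premise / English-hypothesis row and decode it back to text
spot_check = encode_xnli(
    MNLI_MODEL_NAME,
    premise_language="fr",
    hypothesis_language="en",
)(xnli_ds["test"][:1])
print(AutoTokenizer.from_pretrained(MNLI_MODEL_NAME).decode(spot_check["input_ids"][0]))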
Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="en",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.92      0.91      0.91      1670
      neutral       0.88      0.88      0.88      1670
contradiction       0.93      0.94      0.94      1670

     accuracy                           0.91      5010
    macro avg       0.91      0.91      0.91      5010
 weighted avg       0.91      0.91      0.91      5010

This first evaluation is a sanity check: here both the premise and the hypothesis are in English. The score of 91% is very similar to what was observed on the MNLI dataset.

This is great as the XNLI dataset is a subset of the MNLI dataset that has been translated. Let’s try it with a French premise and an English hypothesis.

Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="fr",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.80      0.71      0.75      1670
      neutral       0.71      0.78      0.74      1670
contradiction       0.80      0.81      0.81      1670

     accuracy                           0.77      5010
    macro avg       0.77      0.77      0.77      5010
 weighted avg       0.77      0.77      0.77      5010

The performance against the French premises is actually reasonable. It’s much worse than against English; however, it is much better than random chance. I suspect that the underlying bart-large model learnt some French during pre-training and the fine-tuned model is able to harness this. French and English also share a common root (Indo-European) so the model may be able to exploit that relatedness as well.

To test this we can run the same evaluation against Arabic, which is an Afro-Asiatic language written in a different script.

Code
encoded_xnli_ds = xnli_ds.map(
    encode_xnli(
        MNLI_MODEL_NAME,
        premise_language="ar",
        hypothesis_language="en",
    ),
    batched=True,
)
encoded_xnli_ds = encoded_xnli_ds.select_columns(["input_ids", "label"])

evaluate(MNLI_MODEL_NAME, ds=encoded_xnli_ds["test"], batch_size=64) ; None
               precision    recall  f1-score   support

   entailment       0.79      0.02      0.04      1670
      neutral       0.41      0.91      0.56      1670
contradiction       0.74      0.54      0.62      1670

     accuracy                           0.49      5010
    macro avg       0.65      0.49      0.41      5010
 weighted avg       0.65      0.49      0.41      5010

Here the simple accuracy appears to be better than random chance (around 33% with three balanced classes); however, the model no longer meaningfully predicts entailment. This looks like a broken model to me.

To fully evaluate the model we can compute the accuracy score for every language in the dataset. Remember that the hypothesis will remain in English for all of these.

Code
import datasets
import pandas as pd
from tqdm.auto import tqdm

ALL_XNLI_LANGUAGES = [
    "ar",
    "bg",
    "de",
    "el",
    "en",
    "es",
    "fr",
    "hi",
    "ru",
    "sw",
    "th",
    "tr",
    "ur",
    "vi",
    "zh",
]
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "bg": "Bulgarian",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "hi": "Hindi",
    "ru": "Russian",
    "sw": "Swahili",
    "th": "Thai",
    "tr": "Turkish",
    "ur": "Urdu",
    "vi": "Vietnamese",
    "zh": "Chinese",
}

def xnli_report(
    model_name: str,
    ds: datasets.Dataset,
    batch_size: int = 64,
    label_map: dict = {0: 2, 2: 0}, # FROM model TO dataset
    premise_languages: list[str] = ALL_XNLI_LANGUAGES,
    hypothesis_language: str = "en",
) -> pd.DataFrame:
    results = []
    for premise_language in tqdm(premise_languages):
        language = LANGUAGE_NAMES[premise_language]
        encoded_ds = ds.map(
            encode_xnli(
                model_name,
                premise_language=premise_language,
                hypothesis_language=hypothesis_language,
            ),
            batched=True,
        )
        encoded_ds = encoded_ds.select_columns(["input_ids", "label"])
        accuracy = evaluate(
            model_name=model_name,
            ds=encoded_ds,
            batch_size=batch_size,
            label_map=label_map,
            detail=False,
        )
        results.append({
            "language": language,
            "iso-language-code": premise_language,
            "accuracy": accuracy,
        })
        print(f"{language}: {accuracy:0.5f}")
    return pd.DataFrame(results)
Code
mnli_results_df = xnli_report(
    MNLI_MODEL_NAME,
    ds=xnli_ds["test"],
)
Code
mnli_results_df
language iso-language-code accuracy
0 Arabic ar 0.489621
1 Bulgarian bg 0.512774
2 German de 0.755689
3 Greek el 0.507385
4 English en 0.909581
5 Spanish es 0.818363
6 French fr 0.766866
7 Hindi hi 0.487824
8 Russian ru 0.516966
9 Swahili sw 0.515968
10 Thai th 0.491218
11 Turkish tr 0.525150
12 Urdu ur 0.495010
13 Vietnamese vi 0.528543
14 Chinese zh 0.494212
Code
(
    mnli_results_df.sort_values(by="accuracy", ascending=False)
        .plot.bar(
            x="language",
            y="accuracy",
            legend=False,
            title="bart-large-mnli accuracy by premise language",
            xlabel="premise language",
            ylabel="accuracy",
        )
) ; None

This clearly shows that the model performance for non-English languages is dramatically worse. That’s not a surprise as the XNLI dataset contains a lot of non-European languages.
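
As a single summary number, the mean accuracy over the fourteen non-English premise languages works out to roughly 0.56, against 0.91 for English:

Code
# mean accuracy over the non-English premise languages
mnli_results_df[mnli_results_df["iso-language-code"] != "en"].accuracy.mean()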

We can compare the performance of bart-large-mnli to MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, which is a multilingual model trained on the XNLI dataset.

Code
XNLI_MODEL_NAME = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
Code
# the MoritzLaurer/mDeBERTa-v3-base-mnli-xnli has outputs entailment (0), neutral (1), contradiction (2)

xnli_results_df = xnli_report(
    XNLI_MODEL_NAME,
    ds=xnli_ds["test"],
    label_map={}, # model matches dataset
)
Code
xnli_results_df
language iso-language-code accuracy
0 Arabic ar 0.823154
1 Bulgarian bg 0.852695
2 German de 0.851497
3 Greek el 0.844910
4 English en 0.883234
5 Spanish es 0.853693
6 French fr 0.848902
7 Hindi hi 0.804990
8 Russian ru 0.849301
9 Swahili sw 0.792415
10 Thai th 0.828543
11 Turkish tr 0.827345
12 Urdu ur 0.787425
13 Vietnamese vi 0.824950
14 Chinese zh 0.831537
Code
(
    xnli_results_df.sort_values(by="accuracy", ascending=False)
        .plot.bar(
            x="language",
            y="accuracy",
            legend=False,
            title=f"{XNLI_MODEL_NAME} accuracy by premise language",
            xlabel="premise language",
            ylabel="accuracy",
        )
) ; None

The performance of this model is far more consistent. Unfortunately, the performance in English has suffered a little (from 0.91 to 0.88). Moritz Laurer does note the following:

multilingual models tend to be less good than English-only models. For maximum performance, it can be better to first machine translate texts to English and then use an English-only model for zeroshot classification. See the other English-only models in this collection. For free open-source machine translation, I recommend https://github.com/UKPLab/EasyNMT.

For simplicity I will use the multilingual model, as machine translation has its own risks.
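
For completeness, the two result tables built above can be merged to compare per-language accuracy between the two models directly:

Code
import pandas as pd

# side-by-side accuracy per premise language; the column names here are my own
comparison_df = pd.merge(
    mnli_results_df[["language", "accuracy"]].rename(columns={"accuracy": "bart-large-mnli"}),
    xnli_results_df[["language", "accuracy"]].rename(columns={"accuracy": "mdeberta-v3-base-mnli-xnli"}),
    on="language",
)
comparison_df.sort_values(by="bart-large-mnli", ascending=False)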