Matthew’s Blog - CMU - Machine Translation

I’m watching the CMU Low Resource lectures. To absorb more of the information I want to apply some of the techniques that are introduced in the lessons.

The current one is about machine translation. It starts by suggesting that text can be translated between languages purely through substitution. So you can take a sentence in English, replace each word with its French equivalent, and end up with a valid, equivalent French sentence.
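To make the substitution idea concrete, here is a minimal sketch. The three-word dictionary is a hypothetical toy that I made up for illustration; it is not something from the lecture, and a real system would have to learn these mappings from data.

```python
# Hypothetical toy dictionary, invented for illustration only.
EN_TO_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}


def substitute(sentence, dictionary):
    """Translate word by word, leaving unknown words unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


translated = substitute("The cat sleeps", EN_TO_FR)
print(translated)
```

This only works when the two languages use the same word order and every word has exactly one equivalent, which is why the rest of the lecture is needed.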

Parallel documents are equivalent documents in different languages. The Rosetta Stone is an example of a parallel document. There are contemporary examples too, like the documents that the United Nations or the EU produce. These are legally required to be parallel, as the documents must be understandable by all participants. Parallel documents can be very useful for working out how to translate text from one language to another.

I can get such a document to try out some of the translation techniques.

I’ve looked at the GitHub for the lecturer and found this parallel corpus. Unfortunately I do not understand either language. Let’s see how much that matters heh.

I’ve looked some more and the lecture talks about fast_align, which is available on GitHub. Unfortunately that appears to be a C++ repo, so it would be good to find an equivalent in Python (I found systran-align). With that installed, the next thing is to get a parallel dataset. I’ve chosen the Europarl v7 one from here. It takes quite a while to download though.

Alignment

So the problem of finding the equivalent word isn’t that bad. You can take parallel sentences and count how often each source word co-occurs with each target word. Words that are translations of each other tend to appear together.
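As a rough sketch of the co-occurrence idea, the code below counts how often words appear together in a few made-up sentence pairs. Raw counts favour very frequent words (here "la" co-occurs with everything), so I normalise with the Dice coefficient. That normalisation is my own choice for this toy, not necessarily what the lecture or fast_align uses, and the sentences are invented.

```python
from collections import Counter
from itertools import product

# Invented toy corpus of (English, French) sentence pairs.
toy_pairs = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

source_counts = Counter()  # number of sentences containing each English word
target_counts = Counter()  # number of sentences containing each French word
pair_counts = Counter()    # number of sentence pairs containing both words

for source, target in toy_pairs:
    source_words = set(source.split())
    target_words = set(target.split())
    source_counts.update(source_words)
    target_counts.update(target_words)
    pair_counts.update(product(source_words, target_words))


def dice(source_word, target_word):
    """Dice coefficient: 2 * joint count / (source count + target count)."""
    both = pair_counts[(source_word, target_word)]
    return 2 * both / (source_counts[source_word] + target_counts[target_word])


def best_translation(source_word):
    """Return the French word with the highest Dice score for source_word."""
    return max(target_counts, key=lambda target_word: dice(source_word, target_word))


for word in ["the", "house", "blue", "flower"]:
    print(word, "->", best_translation(word))
```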

It turns out that working out which word in one sentence corresponds to which word in the other is a more significant problem, as words can be reordered during translation. Finding this word-to-word mapping is referred to as alignment.
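One way to picture an alignment is as a list of (source index, target index) pairs. The sketch below uses a hand-written alignment for an invented sentence pair where the adjective and noun swap places. I am assuming the same pair convention as the alignment output that appears later in this post.

```python
source = "the blue house".split()
target = "la maison bleue".split()

# Hand-written alignment for illustration: (source index, target index).
alignment = [(0, 0), (1, 2), (2, 1)]

# Look up the words that each index pair refers to.
aligned_words = [(source[i], target[j]) for i, j in alignment]

# An alignment is monotone when the target indices never go backwards.
is_monotone = all(
    current[1] <= following[1]
    for current, following in zip(alignment, alignment[1:])
)

print(aligned_words)
print("monotone:", is_monotone)
```

Here the alignment is not monotone because "blue" maps to the third French word while "house" maps to the second.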

Code

from pathlib import Path

Code

EUROPARL_FOLDER = Path(
    "data/2021-01-14-Machine-Translation/training"
)

sorted(EUROPARL_FOLDER.glob("*"))

[PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.cs-en.cs'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.cs-en.en'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.de-en.de'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.de-en.en'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.es-en.en'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.es-en.es'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.fr-en.en'),
 PosixPath('data/2021-01-14-Machine-Translation/training/europarl-v7.fr-en.fr')]

So I have the data now and the files are plain text. Each file name contains the language pair (e.g. fr-en) and is then suffixed with the language of that file (e.g. fr).
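The naming scheme can be pulled apart with a couple of string splits. This is just a sketch based on the file names listed above; the helper name is made up.

```python
from pathlib import Path


def describe_europarl_file(path):
    """Split a name like europarl-v7.fr-en.fr into its language pair and language."""
    _, language_pair, language = Path(path).name.rsplit(".", 2)
    first, second = language_pair.split("-")
    return {"pair": (first, second), "language": language}


info = describe_europarl_file("training/europarl-v7.fr-en.fr")
print(info)
```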

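So I can create the pairs by zipping the two files together line by line. The sorted glob puts the .en file before the .fr file, so each pair is (English, French).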

Code

# Read both files as UTF-8 and zip them together line by line
pairs = list(zip(*[
    path.read_text(encoding="utf-8").splitlines()
    for path in sorted(EUROPARL_FOLDER.glob("*.fr-en.*"))
]))
pairs[:3]

[('Resumption of the session', 'Reprise de la session'),
 ('I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.',
  'Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.'),
 ("Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.",
  'Comme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit. En revanche, les citoyens d\'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.')]
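One thing to be aware of is that zip stops at the shortest input, so if the two files had different line counts the extra lines would be dropped without any warning. A small guard like the sketch below (the function name is hypothetical, and it runs on toy data here) would make that failure visible.

```python
def zip_parallel(source_lines, target_lines):
    """Zip two lists of sentences, refusing to silently drop unmatched lines."""
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line counts differ: {len(source_lines)} vs {len(target_lines)}"
        )
    return list(zip(source_lines, target_lines))


toy_english = ["Resumption of the session", "Good morning"]
toy_french = ["Reprise de la session", "Bonjour"]
toy_zipped = zip_parallel(toy_english, toy_french)
print(toy_zipped)
```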

Wow, this data seems quite old now heh. Anyway, this looks reasonable. Now I need to apply systran-align to it.

Systran Align

As this is based on fast_align I need to reformat the data to match the requirements of fast_align, which expects one sentence pair per line with the source and target separated by |||. fast_align provides this example of aligned sentences:

doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .

Converting the pairs to this form is straightforward. I can’t just do it in memory as the library expects a path to a file.

Code

ENGLISH_FRENCH_ALIGNED_SENTENCES_FILE = Path(
    "data/2021-01-14-Machine-Translation/en-fr.sentences.txt"
)

# write_text returns the number of characters written
ENGLISH_FRENCH_ALIGNED_SENTENCES_FILE.write_text(
    "\n".join(f"{en} ||| {fr}" for en, fr in pairs),
    encoding="utf-8",
)

644948389
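The fast_align example sentences are tokenized, with punctuation separated from the words, while the raw Europarl lines are not. A possible refinement is sketched below. The regex tokenizer is a crude stand-in of my own rather than a proper tokenizer, and dropping pairs where either side is empty is a precaution I am assuming is sensible, not a documented requirement.

```python
import re


def tokenize(text):
    """Crude tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_fast_align_lines(sentence_pairs):
    """Format (source, target) pairs as 'source ||| target', skipping empty sides."""
    lines = []
    for source, target in sentence_pairs:
        source_tokens = tokenize(source)
        target_tokens = tokenize(target)
        if not source_tokens or not target_tokens:
            continue
        lines.append(f"{' '.join(source_tokens)} ||| {' '.join(target_tokens)}")
    return lines


toy_lines = to_fast_align_lines([
    ("Hello, world.", "Bonjour, monde."),
    ("", "vide"),
])
print(toy_lines)
```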

Code

import systran_align

ENGLISH_FRENCH_FORWARD_PROBABILITIES_FILE = Path(
    "data/2021-01-14-Machine-Translation/en-fr.forward-probabilities"
)
ENGLISH_FRENCH_BACKWARD_PROBABILITIES_FILE = Path(
    "data/2021-01-14-Machine-Translation/en-fr.backward-probabilities"
)

systran_align.generate_alignment_probabilities(
    input_path=str(ENGLISH_FRENCH_ALIGNED_SENTENCES_FILE),
    forward_probs_path=str(ENGLISH_FRENCH_FORWARD_PROBABILITIES_FILE),
    backward_probs_path=str(ENGLISH_FRENCH_BACKWARD_PROBABILITIES_FILE),
    # Optional parameters from the signature, left at their defaults:
#     verbose: bool = False,
#     iterations: int = 5,
#     favor_diagonal: bool = False,
#     beam_threshold: float = -4,
#     diagonal_tension: float = 4,
#     optimize_tension: bool = False,
#     variational_bayes: bool = False,
#     alpha: float = 0.01,
#     no_null_word: bool = False,
#     prob_align_null: float = 0.08,
#     thread_buffer_size: int = 10000
)

This took a few minutes to run. Now I would like to see some of the alignments.

Code

aligner = systran_align.Aligner(
    forward_probs_path=str(ENGLISH_FRENCH_FORWARD_PROBABILITIES_FILE),
    backward_probs_path=str(ENGLISH_FRENCH_BACKWARD_PROBABILITIES_FILE)
)

Code

aligner.align(
    source="Resumption of the session".split(),
    target="Reprise de la session".split()
)

{'alignments': [(0, 0), (1, 1), (2, 2), (3, 3)],
 'forward_log_prob': -7.635498771289478,
 'backward_log_prob': -7.6322351710899}

Code

aligner.align(
    source="Good morning".split(),
    target="Bonjour".split()
)

{'alignments': [(0, 0), (1, 0)],
 'forward_log_prob': -9.971272888429588,
 'backward_log_prob': -24.83337181125963}

Code

aligner.align(
    source="Good morning".split(),
    target="Reprise de la session".split()
)

{'alignments': [(0, 0), (1, 3)],
 'forward_log_prob': -52.15923717158851,
 'backward_log_prob': -43.39224509221587}

Code

aligner.align(
    source="Reprise de la session".split(),
    target="Resumption of the session".split(),
)

{'alignments': [(0, 0), (1, 1), (2, 2), (3, 3)],
 'forward_log_prob': -64.95893935235576,
 'backward_log_prob': -65.32427141154398}

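I don’t really have the context to judge whether these scores are reasonable in absolute terms. Relative to each other they behave as I would hope: the matching pair scores around -7.6, while the mismatched pair (-52 and -43) scores far worse. Swapping the source and target also gives a far worse score (around -65), so I’m confident that I am providing the source and target in the correct order.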
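Longer sentences will naturally get lower log probabilities, so one way to make the scores a bit more comparable is to divide by the number of target tokens. This is my own normalisation choice, not something systran-align does for you. The numbers below are copied from the forward_log_prob values above.

```python
# (forward_log_prob, number of target tokens), copied from the outputs above.
results = {
    "matching": (-7.635498771289478, 4),
    "mismatched": (-52.15923717158851, 4),
    "reversed": (-64.95893935235576, 4),
}


def per_token_log_prob(log_prob, token_count):
    """Average log probability per target token."""
    return log_prob / token_count


normalised = {
    name: per_token_log_prob(log_prob, token_count)
    for name, (log_prob, token_count) in results.items()
}

for name, value in normalised.items():
    print(f"{name}: {value:.2f} per token")
```

All three examples here happen to have four target tokens, so the ranking does not change, but the per-token value should be easier to compare across sentences of different lengths.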