I’m returning to the CMU Low Resource lectures. The last post covered word alignment between parallel text in two different languages.
Let’s see how they build on this today.
They are currently covering how the alignment between words is computed, using these sentence pairs:
- la maison -> the house
- la maison blue -> the blue house
- la fleur -> the flower
To determine the alignment for a word like `la`, you can count how many sentence pairs it shares with each English word (the co-occurrence frequency):
french | english | frequency |
---|---|---|
la | the | 3 |
la | house | 2 |
la | blue | 1 |
la | flower | 1 |
So if you align `la` with `the`, you can set both aside and compute the frequencies for the remaining words:
french | english | frequency |
---|---|---|
maison | house | 2 |
maison | blue | 1 |
So you can repeat this process to determine which words are aligned with each other.
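The counting and pairing-off steps above can be sketched in a few lines of Python. This is a minimal sketch rather than the lecture's code: the names `cooccurrence_counts` and `greedy_align` are my own, and the greedy rule (take the most frequent remaining pair, then remove both words) is an assumption about how the process gets repeated.

```python
from collections import Counter

# Toy parallel corpus, spelled as in the example sentences above.
corpus = [
    ("la maison", "the house"),
    ("la maison blue", "the blue house"),
    ("la fleur", "the flower"),
]

def cooccurrence_counts(pairs):
    """Count the sentence pairs in which each (french, english) word pair co-occurs."""
    counts = Counter()
    for french, english in pairs:
        for f in set(french.split()):
            for e in set(english.split()):
                counts[(f, e)] += 1
    return counts

def greedy_align(pairs):
    """Repeatedly link the most frequent remaining pair, then drop both words.

    Hypothetical helper: a simplification, not a full alignment model.
    """
    counts = cooccurrence_counts(pairs)
    alignment = {}
    while counts:
        (f, e), _ = max(counts.items(), key=lambda item: item[1])
        alignment[f] = e
        # Remove every remaining pair that involves either aligned word.
        counts = Counter(
            {k: v for k, v in counts.items() if k[0] != f and k[1] != e}
        )
    return alignment

print(greedy_align(corpus))
```

On the three sentence pairs this links `la` with `the` first (3 co-occurrences), then `maison` with `house` (2). The last two links, `blue` with `blue` and `fleur` with `flower`, only fall out by elimination, which hints at how fragile raw counts are on a tiny corpus.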
The problem with a word-based approach is that a single word on one side could translate to multiple words on the other side. Seen from the other direction, some of those words then look like they translate to no word at all.
You can generalize this to phrases, where a sequence of words on one side can translate to a different sequence on the other side. These sequences may not correlate exactly with the word-level alignment that would be performed.
Since we have now moved beyond word-to-word alignment, we can generate multiple different sequences for the destination sentence. When I was playing with the alignments last week I did not immediately see a way to generate the corresponding sentence given some input, but it must be possible.
Because different phrase groupings can be applied to the same sentence, choosing the right grouping becomes a search problem. Beam search is suitable for finding a target sentence which has high probability.
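To make the search problem concrete, here is a small beam search over phrase groupings. It is a sketch under several assumptions: the phrase table and its probabilities are invented for illustration, phrases are translated strictly left to right with no reordering between phrases, and the score is just the product of phrase probabilities with no language model. `PHRASE_TABLE` and `beam_search` are hypothetical names.

```python
import math

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
# The probabilities are made up for this example.
PHRASE_TABLE = {
    ("la",): [("the", 0.9), ("it", 0.1)],
    ("maison",): [("house", 0.6), ("home", 0.4)],
    ("blue",): [("blue", 1.0)],
    ("maison", "blue"): [("blue house", 0.9), ("house blue", 0.1)],
}

def beam_search(source, table, beam_width=2, max_phrase_len=2):
    """Return the highest scoring translation found, or None if none exists."""
    words = tuple(source.split())
    # beams[i] holds (log_prob, target_phrases) hypotheses covering words[:i].
    beams = {0: [(0.0, [])]}
    for start in range(len(words)):
        # Prune: only the best beam_width hypotheses at this position expand.
        hypotheses = sorted(
            beams.get(start, []), key=lambda h: h[0], reverse=True
        )[:beam_width]
        for log_prob, target in hypotheses:
            for length in range(1, max_phrase_len + 1):
                end = start + length
                if end > len(words):
                    break
                for translation, prob in table.get(words[start:end], []):
                    beams.setdefault(end, []).append(
                        (log_prob + math.log(prob), target + [translation])
                    )
    finished = beams.get(len(words), [])
    if not finished:
        return None
    return " ".join(max(finished, key=lambda h: h[0])[1])

print(beam_search("la maison blue", PHRASE_TABLE))
```

With these invented numbers the grouping `la` + `maison blue` scores higher than translating the three words one at a time, so the search returns "the blue house". A source word missing from the table, such as `fleur` here, leaves no complete hypothesis and the function returns `None`.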
Improving alignment
You can try to estimate a probable alignment for each word from all of the sentence pairs in the corpus. Once estimated, this can be used as a prior for the alignment of a new sentence.
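These notes don't record how the lecture estimates that prior. One standard option, which I am assuming here, is expectation-maximisation in the style of IBM Model 1: start with uniform translation probabilities, softly assign each English word to the French words in its sentence, and re-estimate. The sketch below omits the NULL word, and `train_translation_prior` and `align_with_prior` are hypothetical names.

```python
from collections import defaultdict

corpus = [
    ("la maison", "the house"),
    ("la maison blue", "the blue house"),
    ("la fleur", "the flower"),
]

def train_translation_prior(pairs, iterations=20):
    """Estimate t[(e, f)], roughly p(english word | french word), with EM."""
    pairs = [(f.split(), e.split()) for f, e in pairs]
    english_vocab = {e for _, es in pairs for e in es}
    # Start from a uniform distribution over the English vocabulary.
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each English word among the French words it appears with.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: renormalise the expected counts per French word.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t

def align_with_prior(french, english, t):
    """Link each English word to the French word the prior favours most."""
    fs = french.split()
    return {e: max(fs, key=lambda f: t[(e, f)]) for e in english.split()}

prior = train_translation_prior(corpus)
print(align_with_prior("la fleur blue", "the blue flower", prior))
```

The sentence pair in the usage line is not in the training corpus, but the learned probabilities are enough to pair each English word with a French word. Word pairs never seen together get probability zero here, so a real system would need smoothing or much more data.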
It is also possible to do something like POS tagging or to generate a parse tree to use the structural features of the sentences to align them.
The lecture has become bogged down with debugging problems. They are trying to run the alignment code, which I already ran last week.
I’m going to stop at this point and see if I can get any further with the fastai lesson as I want to discuss that later today.