July 15, 2022

from pathlib import Path

ENWIKI_FOLDER = Path("/data/wikipedia/external/enwiki/20220701/")

# small file to work with
ENWIKI_FILE = ENWIKI_FOLDER / "enwiki-20220701-pages-articles11.xml-p6899367p7054859.bz2"
Predicting the features associated with a given word or phrase is all very nice, but without a way to resolve those features to a specific word sense it lacks value. Wikipedia pages can be used as targets for words, as each noun should have a page.
To work out a feature signature for a page I can take the links to that page and work out how the teacher would describe each linking sentence. The average of those descriptions should serve as an ideal set of features. The features that the student predicts for a noun can then be measured against the Wikipedia page features to find out how strongly they match.
I'm downloading the Wikipedia data dump for 2022-07-01 (this month) to have something recent to work with. The download is quite slow, so this post will really be about setting up the code to process it correctly; the real work will happen in a separate repo. This is because the Wikipedia data dump is quite large and processing it in a reasonable amount of time requires efficient parallelized code.
Processing a single file in this post will be enough to show how.
I would describe myself as a shockingly attractive and intelligent individual :wink:. Other people would describe me as an insufferable know-it-all.
When I was trying this before I was using a page to describe itself. The problem with this self-description is that I then tried to use it to understand the description given by someone else. What I need to do instead is find all of the links to a page and use the text around those links as the input to the teacher.
Another benefit of using these links is that we will find the words or phrases that people use to refer to the page. When we find an unknown noun we can check it against all of the links and see which pages could possibly describe it. This should reduce the search space quite substantially: while it was easy to search over all 6 million Wikipedia pages before, the results of doing so often didn't make much sense.
So this is the task: find all the sentences (or possibly paragraphs) that refer to another page, extract the linked text and the title of the target page from them, and then use the teacher to describe each one.
It's easiest to start with a single example. This will be quite involved as I need to extract the page text from the compressed XML documents and then pass it through a wiki markup parser to get at the text and links. The code must use streaming processing as the compressed files are large.
# from src/main/python/blog/wikipedia/extractor.py
import bz2
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, List, Optional, Set, Tuple
import regex as re
import wikitextparser
from lxml.etree import Element, iterparse
TABLE_PATTERN = re.compile(r"{{[^}]*}}", flags=re.MULTILINE)
TITLE_PATTERN = re.compile(r"^==[^=]*?==$", flags=re.MULTILINE)
LEADING_COLON_OR_HASH = re.compile(r"^[:#]", flags=re.MULTILINE)
HTML_TAG = re.compile(r"<[^>]+>")
MANY_BLANK_LINES = re.compile(r"(\n\s*)+\n+")
MANY_SPACES = re.compile(r" +")
SENTENCE_SEPARATOR = re.compile(r"(?<=[.!?])")
MARKUP = re.compile(r"[']+")
MEDIAWIKI_NAMESPACE = "http://www.mediawiki.org/xml/export-0.10/"
@dataclass
class Article:
    title: str
    raw: str
    body: wikitextparser.WikiText

    def text(self) -> str:
        return self.body.plain_text(replace_wikilinks=False)

    def lines(self) -> List[str]:
        text = self.text()
        for pattern in [
            TABLE_PATTERN,
            TITLE_PATTERN,
            LEADING_COLON_OR_HASH,
            HTML_TAG,
            MARKUP,
            MANY_BLANK_LINES,
            MANY_SPACES,
        ]:
            text = pattern.sub(" ", text)
        return [sentence.strip() for sentence in SENTENCE_SEPARATOR.split(text)]

    def link_text(self) -> List[Tuple[str, str]]:
        def text_and_links(line: str) -> Tuple[str, List[str]]:
            parsed = wikitextparser.parse(line)
            text = parsed.plain_text()
            links = [link.target.casefold().strip() for link in parsed.wikilinks]
            return (text, links)

        return [
            (link, text)
            for text, links in [text_and_links(line) for line in self.lines()]
            for link in links
            if not any(":" in link for link in links)
        ]

    def link_synonyms(self) -> Dict[str, Set[str]]:
        def clean(text: str) -> str:
            return text.casefold().strip()

        synonyms = defaultdict(set)
        for link in self.body.wikilinks:
            target = clean(link.target)
            if ":" in target:
                continue
            synonyms[target].add(target)
            if not link.text:
                continue
            synonym = clean(link.text)
            synonyms[target].add(synonym)
        return dict(synonyms)
def read_articles(file: Path) -> Iterator[Article]:
    for page in read_pages(file):
        try:
            article = _get_article(page)
            if article:
                yield article
        except Exception as e:
            print(e)

def _get_article(element: Element) -> Optional[Article]:
    namespace = _get("mw:ns", element)
    if namespace is None or namespace.text != "0":
        return None
    redirect = _get("mw:redirect", element)
    if redirect is not None:
        return None
    text_element = _get("mw:revision/mw:text", element)
    if text_element is None:
        return None
    title = _get("mw:title", element)
    if title is None:
        return None
    raw = text_element.text
    body = wikitextparser.parse(text_element.text)
    return Article(title=title.text, raw=raw, body=body)

def _get(path: str, element: Element) -> Optional[Element]:
    elements = element.xpath(path, namespaces={"mw": MEDIAWIKI_NAMESPACE})
    if elements:
        return elements[0]
    return None

def read_pages(file: Path) -> Iterator[Element]:
    with bz2.BZ2File(file, "rb") as handle:
        for _event, element in iterparse(
            handle, tag=f"{{{MEDIAWIKI_NAMESPACE}}}page", events=("end",)
        ):
            yield element
            _clear_memory(element)

def _clear_memory(element: Element) -> None:
    element.clear()
    for ancestor in element.xpath("ancestor-or-self::*"):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]
articles = [
    article
    for _, article in zip(range(10), read_articles(ENWIKI_FILE))
]
article = articles[3]

print(f"The article title is:\n\t{article.title}")
print()
print("The links in the article are:\n\t", "\n\t".join(map(str, article.body.wikilinks)))
print()
print(f"The body of the article is:\n\t{article.text()}")
The article title is:
Scolomys ucayalensis
The links in the article are:
[[Nocturnality|nocturnal]]
[[rodent]]
[[species]]
[[South America]]
[[Scolomys]]
[[Oryzomyini]]
[[Brazil]]
[[Colombia]]
[[Ecuador]]
[[Peru]]
[[Amazon rainforest]]
[[Hypothenar eminence|hypothenar]]
[[Scolomys melanops ]]
[[karyotype]]
[[Diploid|2n]]
[[Fundamental number|FN]]
[[Andes]]
[[moss]]
[[Bromeliaceae|bromeliads]]
[[Category:Scolomys]]
[[Category:Mammals described in 1991]]
The body of the article is:
Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a [[Nocturnality|nocturnal]] [[rodent]] [[species]] from [[South America]]. It is part of the genus [[Scolomys]] within the tribe [[Oryzomyini]]. It is found in [[Brazil]], [[Colombia]], [[Ecuador]] and [[Peru]] in various different habitats in the [[Amazon rainforest]].
==Description==
Scolomys ucayalensis has a head-and-body length of between and a tail around 83% of this. The head is small but broad with a pointed snout and small rounded ears. The fur is a mixture of fine hairs and thicker, flattened spines. The dorsal surface is some shade of reddish-brown to reddish-black, sometimes grizzled or streaked with black, and the underparts are grey. The tail is nearly naked, and the hind feet are small but broad. The [[Hypothenar eminence|hypothenar]] pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar [[Scolomys melanops ]] which has well-developed hypothenar pads. The [[karyotype]] of S. ucayalensis has [[Diploid|2n]] = 50 and [[Fundamental number|FN]] = 68, while that of S. melanops has 2n = 60, FN = 78.
==Distribution and habitat==
S. ucayalensis is found on the eastern side of the [[Andes]] in South America. Its range extends from southern Colombia and southern Ecuador, through western Brazil to northern Peru, and completely surrounds the range of S. melanops. Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in [[moss]]es and [[Bromeliaceae|bromeliads]]. Its altitudinal range is between .
==References==
==Literature cited==
*
*
[[Category:Scolomys]]
[[Category:Mammals described in 1991]]
The target of each link is the title of the page it points to. Targets are treated as case insensitive and get stripped. There is a link with trailing whitespace (Scolomys melanops) which resolves to the page without it. If I change the case of the title in a link then it redirects to the “correctly” cased version, so titles really are case insensitive.
The text is organized into sections which have markup all over them. What would be nice is to identify the links present at each point in the text and then process them with their surrounding context. The model can only take so much text at any one time, so splitting the text into sentences and associating each link with its sentence would be ideal.
To achieve this I need to split up the text appropriately. There are two levels of split that I can think of: paragraphs and sentences.
It might be worth just dealing with sentences. My concern is a sentence like this:
I said “Hello. I am interesting”
where there are sentence delimiters within the quoted speech. For now I think it's better to continue with just splitting by sentence and dealing with problems as they arise.
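The lines method defined above performs this split. A call like the following (filtering out any empty entries) produces the sentences below:

[line for line in article.lines() if line]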
['Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a [[Nocturnality|nocturnal]] [[rodent]] [[species]] from [[South America]].',
'It is part of the genus [[Scolomys]] within the tribe [[Oryzomyini]].',
'It is found in [[Brazil]], [[Colombia]], [[Ecuador]] and [[Peru]] in various different habitats in the [[Amazon rainforest]].',
'Scolomys ucayalensis has a head-and-body length of between and a tail around 83% of this.',
'The head is small but broad with a pointed snout and small rounded ears.',
'The fur is a mixture of fine hairs and thicker, flattened spines.',
'The dorsal surface is some shade of reddish-brown to reddish-black, sometimes grizzled or streaked with black, and the underparts are grey.',
'The tail is nearly naked, and the hind feet are small but broad.',
'The [[Hypothenar eminence|hypothenar]] pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar [[Scolomys melanops ]] which has well-developed hypothenar pads.',
'The [[karyotype]] of S.',
'ucayalensis has [[Diploid|2n]] = 50 and [[Fundamental number|FN]] = 68, while that of S.',
'melanops has 2n = 60, FN = 78.',
'S.',
'ucayalensis is found on the eastern side of the [[Andes]] in South America.',
'Its range extends from southern Colombia and southern Ecuador, through western Brazil to northern Peru, and completely surrounds the range of S.',
'melanops.',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in [[moss]]es and [[Bromeliaceae|bromeliads]].',
'Its altitudinal range is between .',
'* \n* [[Category:Scolomys]]\n[[Category:Mammals described in 1991]]']
This looks close enough. The first thing to do is to get the synonyms for each link so that all the ways of referring to a page can be collected.
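The link_synonyms method collects these for the example article:

article.link_synonyms()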
{'nocturnality': {'nocturnal', 'nocturnality'},
'rodent': {'rodent'},
'species': {'species'},
'south america': {'south america'},
'scolomys': {'scolomys'},
'oryzomyini': {'oryzomyini'},
'brazil': {'brazil'},
'colombia': {'colombia'},
'ecuador': {'ecuador'},
'peru': {'peru'},
'amazon rainforest': {'amazon rainforest'},
'hypothenar eminence': {'hypothenar', 'hypothenar eminence'},
'scolomys melanops': {'scolomys melanops'},
'karyotype': {'karyotype'},
'diploid': {'2n', 'diploid'},
'fundamental number': {'fn', 'fundamental number'},
'andes': {'andes'},
'moss': {'moss'},
'bromeliaceae': {'bromeliaceae', 'bromeliads'}}
Now I need to get the links for each line and organize the output around them. Ultimately I want as many sentences around each link as will fit within the model input.
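The link_text method pairs each link with the plain text of its sentence:

article.link_text()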
[('nocturnality',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('rodent',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('species',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('south america',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('scolomys', 'It is part of the genus Scolomys within the tribe Oryzomyini.'),
('oryzomyini',
'It is part of the genus Scolomys within the tribe Oryzomyini.'),
('brazil',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('colombia',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('ecuador',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('peru',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('amazon rainforest',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('hypothenar eminence',
'The hypothenar pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar Scolomys melanops which has well-developed hypothenar pads.'),
('scolomys melanops',
'The hypothenar pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar Scolomys melanops which has well-developed hypothenar pads.'),
('karyotype', 'The karyotype of S.'),
('diploid', 'ucayalensis has 2n = 50 and FN = 68, while that of S.'),
('fundamental number',
'ucayalensis has 2n = 50 and FN = 68, while that of S.'),
('andes',
'ucayalensis is found on the eastern side of the Andes in South America.'),
('moss',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in mosses and bromeliads.'),
('bromeliaceae',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in mosses and bromeliads.')]
import numpy as np
import torch

@torch.inference_mode()
def get_features(
    text: str,
    link: str,
    prompt: str = " Pet: Dog, Color: Yellow, Vehicle: Tractor, Fruit: Banana,<mask>: {}",
) -> np.ndarray:
    # model and tokenizer are the xlm-roberta-base model and tokenizer, defined outside this snippet
    capitalized = link[0].upper() + link[1:]
    prompted = text.strip() + prompt.format(capitalized)
    tokens = tokenizer(
        prompted,
        return_tensors="pt",
        return_attention_mask=False,
    ).input_ids
    tokens = tokens.to(model.device)
    mask_index = tokens == tokenizer.mask_token_id
    output = model(tokens)
    # return the logits for the masked position only
    return output.logits[mask_index][0].cpu().numpy()
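The raw logits can be turned into readable features by applying a softmax and decoding the most likely tokens. A sketch of the kind of call that produces the output below (the pair index here just picks out the Brazil sentence from the list above):

link, text = article.link_text()[6]  # the ("brazil", ...) pair shown earlier

logits = torch.from_numpy(get_features(text, link))
probabilities = logits.softmax(dim=-1)
top_10 = probabilities.topk(k=10).indices

(text, link, [tokenizer.decode(int(index)).strip() for index in top_10])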
('It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.',
'brazil',
['Country',
'Land',
'Origin',
'Location',
'Nation',
'Language',
'Region',
'State',
'Source',
'Culture'])
This is pretty good. For some of the more technical terms it has more trouble, but this should be good enough to process the Wikipedia data.
The aim, once the dump is processed, is to aggregate these features into clusters in various ways. Processing will be tricky as I have to ensure that there is enough context to allow the model to produce meaningful output. I also have to be careful as the raw output is 250,002 values per link (one per token in the vocabulary), which takes up a chunk of memory.
Writing the output to files and doing the aggregation later will be very important.
I now need to run this over the whole of Wikipedia. Once that has been done I can try generating the list of different words or phrases that link to each page.
I did start this by running each link/sentence pair through it and recording every single token probability. This resulted in files that were ~800MB for 1,000 pairs. Given that there are millions of articles and many links per page this is infeasible.
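Storing only the strongest predictions keeps the files manageable. The parquet data shown later in this post has aligned probability and index columns in descending order, so the idea is something like this (the cutoff of 100 tokens here is just an assumption):

from typing import List, Tuple

import numpy as np
import torch

def top_features(logits: np.ndarray, k: int = 100) -> Tuple[List[float], List[int]]:
    # keep only the k most likely tokens instead of all 250,002 probabilities
    probabilities = torch.from_numpy(logits).softmax(dim=-1)
    top = probabilities.topk(k=k)
    return top.values.tolist(), top.indices.tolist()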
The next problem is that doing this one row at a time is very slow, even with CUDA. I need to be able to prepare the data quickly and then run batches through the model. Normally I would do this with ray, but I have upgraded the Python that runs this blog and ray does not yet support it.
That is why I have to process this in a separate repository.
The repository is available here. I’ve processed the synonyms and about 2% of the features.
We can have a quick look at the processed data now. Trying to cluster them will be another post.
from pathlib import Path

import pandas as pd

DATA_FOLDER = Path("/data/prompt-internalization/multilingual/wikipedia/enwiki/20220701/")

synonyms_df = pd.read_parquet(DATA_FOLDER / "synonyms.gz.parquet")
features_df = pd.concat([
    pd.read_parquet(file)
    for file in sorted((DATA_FOLDER / "features").glob("*.gz.parquet"))
])
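Something like the following cells produces the views below: the full synonym table ordered by count, the synonyms that differ from the title of their target page, and the number of synonyms per page.

# every synonym, ordered by how often it links to its target
synonyms_df.sort_values(by="count", ascending=False)

# synonyms that differ from the title of the page they point to
synonyms_df[synonyms_df.target != synonyms_df.synonym].sort_values(by="count", ascending=False)

# how many different synonyms each page has
synonyms_df.target.value_counts()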
| | target | synonym | count |
---|---|---|---|
31478537 | united states | united states | 275981 |
5118094 | association football | association football | 208462 |
33043071 | world war ii | world war ii | 151965 |
12447034 | france | france | 142281 |
30181088 | the new york times | the new york times | 134794 |
... | ... | ... | ... |
8160695 | cho chong-kil | cho chong-kil | 1 |
8160696 | cho choong-hoon | cho choong-hoon | 1 |
19889493 | madhav gopal naseri | madhav gopal naseri | 1 |
19889492 | madhav godbole | madhav godbole | 1 |
6194003 | bhandasur | bhandasur | 1 |
33691008 rows × 3 columns
| | target | synonym | count |
---|---|---|---|
5118191 | association football | footballer | 117471 |
19202682 | list of sovereign states | country | 109280 |
5118139 | association football | football | 71084 |
31297808 | u.s. state | state | 69034 |
9057592 | countries of the world | country | 67898 |
... | ... | ... | ... |
12468103 | francis egerton, 7th duke of sutherland | the 7th duke of sutherland | 1 |
12468105 | francis egerton, 8th earl of bridgewater | 8th earl | 1 |
12468107 | francis egerton, 8th earl of bridgewater | duke of bridgewater (1756-1829) | 1 |
12468108 | francis egerton, 8th earl of bridgewater | earl of bridgewater | 1 |
33691004 | 🮽 | ❎︎ | 1 |
16643648 rows × 3 columns
roman numerals 1609
list of moths of north america 1250
billboard charts 1070
u.s. cellular field 1036
postal codes in canada 1014
...
intertidal chalk 1
intertidal ecosystem 1
intertidal fish 1
intertidal flat 1
🯅 1
Name: target, Length: 17047360, dtype: int64
There are about 34 million synonyms that have been found for 17 million articles. Some of the article titles are single Unicode characters which do not render well in this notebook.
| | target | probability | index |
---|---|---|---|
0 | goodwood festival of speed | [0.13652878, 0.09209253, 0.08164861, 0.0800229... | [70643, 15757, 48962, 90788, 131899, 60457, 32... |
1 | glorious goodwood | [0.18024956, 0.09665708, 0.06521337, 0.0638825... | [90788, 70643, 48962, 15757, 220197, 60457, 20... |
2 | goodwood revival | [0.116899185, 0.115187414, 0.10732716, 0.08845... | [70643, 90788, 48962, 15757, 60457, 74831, 499... |
3 | goodwood, south australia | [0.4112736, 0.16649252, 0.08323654, 0.04542979... | [90788, 74831, 6406, 6557, 41076, 49990, 79200... |
4 | electoral district of goodwood | [0.117247075, 0.10608322, 0.08746431, 0.064992... | [150533, 60457, 70643, 15757, 90788, 23994, 22... |
... | ... | ... | ... |
15291 | jerome kern | [0.18191482, 0.1264575, 0.1066387, 0.09394692,... | [83358, 13703, 69891, 220197, 61804, 15757, 31... |
15292 | guy bolton | [0.19696012, 0.12988408, 0.12146077, 0.0872782... | [61804, 83358, 13703, 31068, 69891, 220197, 15... |
15293 | bohemianism | [0.64192796, 0.0672424, 0.029274452, 0.0283800... | [98148, 105141, 83658, 74831, 104384, 13703, 5... |
15294 | basil rathbone | [0.23146918, 0.15656792, 0.09903663, 0.0662543... | [61804, 15757, 220197, 69891, 33734, 31068, 48... |
15295 | tallulah bankhead | [0.12158471, 0.07587505, 0.06905762, 0.060985,... | [15757, 48962, 33734, 220197, 10348, 61804, 69... |
6155296 rows × 3 columns
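Similar cells summarize the processed features: how many links point at each target article, and how often each token index appears across all of the stored feature lists (again a sketch of the kind of cell involved).

# number of processed links per target article
features_df.target.value_counts()

# how often each token index appears across all stored feature lists
features_df["index"].explode().value_counts()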
united states 6744
world war ii 6194
new york city 4369
france 3639
england 3499
...
18th army (german empire) 1
preston brown (general) 1
vivières 1
lizy-sur-ourcq 1
george hendric houghton 1
Name: target, Length: 1793996, dtype: int64
15757 5987901
70643 5920934
77641 5708869
90788 5668998
60457 5518613
...
70548 1
90519 1
52428 1
213625 1
239593 1
Name: index, Length: 18318, dtype: int64
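Decoding the most common indices shows which tokens dominate. Assuming the xlm-roberta-base tokenizer is loaded, something like this produces the list below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
[
    tokenizer.decode(int(index)).strip()
    for index in features_df["index"].explode().value_counts().index[:5]
]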
['Name', 'Description', 'Source', 'Location', 'Type']
About 6 million links, associated with 1.8 million articles, have been processed. The probability and index columns are aligned and sorted in descending order of probability. The index column selects tokens out of the 250,002 that form the xlm-roberta-base vocabulary.
You can see that the model output has not been filtered as the most common tokens make a strong appearance again. If we compare those tokens to the top 5 from our previous investigation we can see a \(\frac{4}{5}\) overlap:
Token | Probability |
---|---|
Owner | 0.107 |
Name | 0.065 |
Description | 0.048 |
Type | 0.035 |
Location | 0.029 |
When working with this to try to calculate clusters it may be appropriate to exclude these tokens.
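A simple way to do that would be to drop the over-represented token indices from each row before clustering. A sketch, assuming the five indices decoded above are the ones to exclude:

import pandas as pd

EXCLUDED_TOKENS = {15757, 70643, 77641, 90788, 60457}

def filter_row(row: pd.Series) -> pd.Series:
    # drop the over-represented tokens, keeping the probability and index columns aligned
    kept = [
        (probability, index)
        for probability, index in zip(row["probability"], row["index"])
        if index not in EXCLUDED_TOKENS
    ]
    row = row.copy()
    row["probability"] = [probability for probability, _ in kept]
    row["index"] = [index for _, index in kept]
    return row

filtered_df = features_df.apply(filter_row, axis=1)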