July 15, 2022

from pathlib import Path

ENWIKI_FOLDER = Path("/data/wikipedia/external/enwiki/20220701/")

# small file to work with
ENWIKI_FILE = ENWIKI_FOLDER / "enwiki-20220701-pages-articles11.xml-p6899367p7054859.bz2"
Predicting the features associated with a given word or phrase is all very nice, but without a way to resolve those features to a specific word sense it lacks value. Wikipedia pages can be used as targets for words, as each noun should have a page.
To work out a feature signature for a page I can take the links to that page and work out how the teacher would describe each linking sentence. The average of those descriptions should serve as an ideal set of features. The features that the student predicts for a noun can then be measured against the Wikipedia page features to find out how strongly they match.
I'm downloading the Wikipedia data dump for 2022-07-01 (this month) to have something recent to work with. The download is quite slow, so this post will really be about setting up the code to process it correctly; the real work will happen in a separate repo. This is because the Wikipedia data dump is quite large and processing it in a reasonable amount of time requires efficient parallelized code.
Processing a single file in this post will be enough to show how.
I would describe myself as a shockingly attractive and intelligent individual :wink:. Other people would describe me as an insufferable know-it-all.
When I was trying this before I was using a page to describe itself. The problem with this self-description is that I then tried to use it to understand the description given by someone else. What I need to do instead is find all of the links to a page and use the text around those links as the input to the teacher.
Another benefit of using these links is that we will find the words or phrases that people use to refer to the page. When we find an unknown noun we can check it against all of the links and see which pages could possibly describe it. This should reduce the search space quite substantially: while it was easy to search over all 6 million Wikipedia pages before, the results of doing so often didn't make much sense.
So this is the task: find all the sentences (or possibly paragraphs) that refer to another page, extract the linked text and the title of the target page from them, and then use the teacher to describe each one.
It's easiest to start with a single example. This will be quite involved as I need to extract the page text from the compressed XML documents and then pass it through a wiki markup parser to get at the text and links. The code must use streaming processing as the compressed files are large.
# from src/main/python/blog/wikipedia/extractor.py
import bz2
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, List, Optional, Set, Tuple
import regex as re
import wikitextparser
from lxml.etree import Element, iterparse
TABLE_PATTERN = re.compile(r"{{[^}]*}}", flags=re.MULTILINE)
TITLE_PATTERN = re.compile(r"^==[^=]*?==$", flags=re.MULTILINE)
LEADING_COLON_OR_HASH = re.compile(r"^[:#]", flags=re.MULTILINE)
HTML_TAG = re.compile(r"<[^>]+>")
MANY_BLANK_LINES = re.compile(r"(\n\s*)+\n+")
MANY_SPACES = re.compile(r" +")
SENTENCE_SEPARATOR = re.compile(r"(?<=[.!?])")
MARKUP = re.compile(r"[']+")
MEDIAWIKI_NAMESPACE = "http://www.mediawiki.org/xml/export-0.10/"
@dataclass
class Article:
    title: str
    raw: str
    body: wikitextparser.WikiText

    def text(self) -> str:
        return self.body.plain_text(replace_wikilinks=False)

    def lines(self) -> List[str]:
        text = self.text()
        for pattern in [
            TABLE_PATTERN,
            TITLE_PATTERN,
            LEADING_COLON_OR_HASH,
            HTML_TAG,
            MARKUP,
            MANY_BLANK_LINES,
            MANY_SPACES,
        ]:
            text = pattern.sub(" ", text)
        return [sentence.strip() for sentence in SENTENCE_SEPARATOR.split(text)]

    def link_text(self) -> List[Tuple[str, str]]:
        def text_and_links(line: str) -> Tuple[str, List[str]]:
            parsed = wikitextparser.parse(line)
            text = parsed.plain_text()
            links = [link.target.casefold().strip() for link in parsed.wikilinks]
            return (text, links)

        return [
            (link, text)
            for text, links in [text_and_links(line) for line in self.lines()]
            for link in links
            if not any(":" in link for link in links)
        ]

    def link_synonyms(self) -> Dict[str, Set[str]]:
        def clean(text: str) -> str:
            return text.casefold().strip()

        synonyms = defaultdict(set)
        for link in self.body.wikilinks:
            target = clean(link.target)
            if ":" in target:
                continue
            synonyms[target].add(target)
            if not link.text:
                continue
            synonym = clean(link.text)
            synonyms[target].add(synonym)
        return dict(synonyms)
def read_articles(file: Path) -> Iterator[Article]:
    for page in read_pages(file):
        try:
            article = _get_article(page)
            if article:
                yield article
        except Exception as e:
            print(e)

def _get_article(element: Element) -> Optional[Article]:
    namespace = _get("mw:ns", element)
    if namespace is None or namespace.text != "0":
        return None
    redirect = _get("mw:redirect", element)
    if redirect is not None:
        return None
    text_element = _get("mw:revision/mw:text", element)
    if text_element is None:
        return None
    title = _get("mw:title", element)
    if title is None:
        return None
    raw = text_element.text
    body = wikitextparser.parse(text_element.text)
    return Article(title=title.text, raw=raw, body=body)

def _get(path: str, element: Element) -> Optional[Element]:
    elements = element.xpath(path, namespaces={"mw": MEDIAWIKI_NAMESPACE})
    if elements:
        return elements[0]
    return None

def read_pages(file: Path) -> Iterator[Element]:
    with bz2.BZ2File(file, "rb") as handle:
        for _event, element in iterparse(
            handle, tag=f"{{{MEDIAWIKI_NAMESPACE}}}page", events=("end",)
        ):
            yield element
            _clear_memory(element)

def _clear_memory(element: Element) -> None:
    element.clear()
    for ancestor in element.xpath("ancestor-or-self::*"):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]
articles = [
    article
    for _, article in zip(range(10), read_articles(ENWIKI_FILE))
]
article = articles[3]

print(f"The article title is:\n\t{article.title}")
print()
print("The links in the article are:\n\t", "\n\t".join(map(str, article.body.wikilinks)))
print()
print(f"The body of the article is:\n\t{article.text()}")
The article title is:
Scolomys ucayalensis
The links in the article are:
[[Nocturnality|nocturnal]]
[[rodent]]
[[species]]
[[South America]]
[[Scolomys]]
[[Oryzomyini]]
[[Brazil]]
[[Colombia]]
[[Ecuador]]
[[Peru]]
[[Amazon rainforest]]
[[Hypothenar eminence|hypothenar]]
[[Scolomys melanops ]]
[[karyotype]]
[[Diploid|2n]]
[[Fundamental number|FN]]
[[Andes]]
[[moss]]
[[Bromeliaceae|bromeliads]]
[[Category:Scolomys]]
[[Category:Mammals described in 1991]]
The body of the article is:
Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a [[Nocturnality|nocturnal]] [[rodent]] [[species]] from [[South America]]. It is part of the genus [[Scolomys]] within the tribe [[Oryzomyini]]. It is found in [[Brazil]], [[Colombia]], [[Ecuador]] and [[Peru]] in various different habitats in the [[Amazon rainforest]].
==Description==
Scolomys ucayalensis has a head-and-body length of between and a tail around 83% of this. The head is small but broad with a pointed snout and small rounded ears. The fur is a mixture of fine hairs and thicker, flattened spines. The dorsal surface is some shade of reddish-brown to reddish-black, sometimes grizzled or streaked with black, and the underparts are grey. The tail is nearly naked, and the hind feet are small but broad. The [[Hypothenar eminence|hypothenar]] pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar [[Scolomys melanops ]] which has well-developed hypothenar pads. The [[karyotype]] of S. ucayalensis has [[Diploid|2n]] = 50 and [[Fundamental number|FN]] = 68, while that of S. melanops has 2n = 60, FN = 78.
==Distribution and habitat==
S. ucayalensis is found on the eastern side of the [[Andes]] in South America. Its range extends from southern Colombia and southern Ecuador, through western Brazil to northern Peru, and completely surrounds the range of S. melanops. Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in [[moss]]es and [[Bromeliaceae|bromeliads]]. Its altitudinal range is between .
==References==
==Literature cited==
*
*
[[Category:Scolomys]]
[[Category:Mammals described in 1991]]
The target of each link is the title of the page it points to. Targets are treated as case insensitive and get stripped. There is a link with trailing whitespace (Scolomys melanops) which resolves to the page without it. If I change the case of the title in a link then it redirects to the “correctly” cased version, so titles really are case insensitive.
The text is organized into sections which have markup all over them. What would be nice is to identify the links present at each point in the text and then process them with their surrounding context. The model can only take so much text at any one time, so splitting the text into sentences and associating each link with its sentence would be ideal.
To achieve this I need to split up the text appropriately. There are two levels of split that I can think of: paragraphs and sentences.
It might be worth just dealing with sentences. My concern is a sentence like this:
I said “Hello. I am interesting”
where there are sentence delimiters within the quoted speech. For now I think it's better to continue with just splitting by sentence and dealing with problems as they arise.
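The lines method defined above performs this split. A call like the following (filtering out any empty entries) produces the sentences below:

[line for line in article.lines() if line]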
['Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a [[Nocturnality|nocturnal]] [[rodent]] [[species]] from [[South America]].',
'It is part of the genus [[Scolomys]] within the tribe [[Oryzomyini]].',
'It is found in [[Brazil]], [[Colombia]], [[Ecuador]] and [[Peru]] in various different habitats in the [[Amazon rainforest]].',
'Scolomys ucayalensis has a head-and-body length of between and a tail around 83% of this.',
'The head is small but broad with a pointed snout and small rounded ears.',
'The fur is a mixture of fine hairs and thicker, flattened spines.',
'The dorsal surface is some shade of reddish-brown to reddish-black, sometimes grizzled or streaked with black, and the underparts are grey.',
'The tail is nearly naked, and the hind feet are small but broad.',
'The [[Hypothenar eminence|hypothenar]] pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar [[Scolomys melanops ]] which has well-developed hypothenar pads.',
'The [[karyotype]] of S.',
'ucayalensis has [[Diploid|2n]] = 50 and [[Fundamental number|FN]] = 68, while that of S.',
'melanops has 2n = 60, FN = 78.',
'S.',
'ucayalensis is found on the eastern side of the [[Andes]] in South America.',
'Its range extends from southern Colombia and southern Ecuador, through western Brazil to northern Peru, and completely surrounds the range of S.',
'melanops.',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in [[moss]]es and [[Bromeliaceae|bromeliads]].',
'Its altitudinal range is between .',
'* \n* [[Category:Scolomys]]\n[[Category:Mammals described in 1991]]']
This looks close enough. The first thing to do is to get the synonyms for each link so that all the ways of referring to a page can be collected.
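The link_synonyms method collects these for the example article:

article.link_synonyms()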
{'nocturnality': {'nocturnal', 'nocturnality'},
'rodent': {'rodent'},
'species': {'species'},
'south america': {'south america'},
'scolomys': {'scolomys'},
'oryzomyini': {'oryzomyini'},
'brazil': {'brazil'},
'colombia': {'colombia'},
'ecuador': {'ecuador'},
'peru': {'peru'},
'amazon rainforest': {'amazon rainforest'},
'hypothenar eminence': {'hypothenar', 'hypothenar eminence'},
'scolomys melanops': {'scolomys melanops'},
'karyotype': {'karyotype'},
'diploid': {'2n', 'diploid'},
'fundamental number': {'fn', 'fundamental number'},
'andes': {'andes'},
'moss': {'moss'},
'bromeliaceae': {'bromeliaceae', 'bromeliads'}}
Now I need to get the links for each line and organize the output around them. Ultimately I want as many sentences around each link as will fit within the model input.
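The link_text method pairs each link with the plain text of its sentence:

article.link_text()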
[('nocturnality',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('rodent',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('species',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('south america',
'Scolomys ucayalensis, also known as the long-nosed scolomysMusser and Carleton, 2005 or Ucayali spiny mouse is a nocturnal rodent species from South America.'),
('scolomys', 'It is part of the genus Scolomys within the tribe Oryzomyini.'),
('oryzomyini',
'It is part of the genus Scolomys within the tribe Oryzomyini.'),
('brazil',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('colombia',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('ecuador',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('peru',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('amazon rainforest',
'It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.'),
('hypothenar eminence',
'The hypothenar pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar Scolomys melanops which has well-developed hypothenar pads.'),
('scolomys melanops',
'The hypothenar pad (next to the outer digit on the sole of the foot) is either absent or reduced in size on the hind feet, and this contrasts with the otherwise similar Scolomys melanops which has well-developed hypothenar pads.'),
('karyotype', 'The karyotype of S.'),
('diploid', 'ucayalensis has 2n = 50 and FN = 68, while that of S.'),
('fundamental number',
'ucayalensis has 2n = 50 and FN = 68, while that of S.'),
('andes',
'ucayalensis is found on the eastern side of the Andes in South America.'),
('moss',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in mosses and bromeliads.'),
('bromeliaceae',
'Its habitat varies, with specimens being found in primary terra firme (non-flooded) lowland humid forest in Brazil, in undergrowth growing where primary forest had been cut back, and in cloud forest where the trees are clad in mosses and bromeliads.')]
import numpy as np
import torch

@torch.inference_mode()
def get_features(
    text: str,
    link: str,
    prompt: str = " Pet: Dog, Color: Yellow, Vehicle: Tractor, Fruit: Banana,<mask>: {}",
) -> np.ndarray:
    # model and tokenizer are the xlm-roberta-base model and tokenizer, defined outside this snippet
    capitalized = link[0].upper() + link[1:]
    prompted = text.strip() + prompt.format(capitalized)
    tokens = tokenizer(
        prompted,
        return_tensors="pt",
        return_attention_mask=False,
    ).input_ids
    tokens = tokens.to(model.device)
    mask_index = tokens == tokenizer.mask_token_id
    output = model(tokens)
    # return the logits for the masked position only
    return output.logits[mask_index][0].cpu().numpy()
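The raw logits can be turned into readable features by applying a softmax and decoding the most likely tokens. A sketch of the kind of call that produces the output below (the pair index here just picks out the Brazil sentence from the list above):

link, text = article.link_text()[6]  # the ("brazil", ...) pair shown earlier

logits = torch.from_numpy(get_features(text, link))
probabilities = logits.softmax(dim=-1)
top_10 = probabilities.topk(k=10).indices

(text, link, [tokenizer.decode(int(index)).strip() for index in top_10])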
('It is found in Brazil, Colombia, Ecuador and Peru in various different habitats in the Amazon rainforest.',
'brazil',
['Country',
'Land',
'Origin',
'Location',
'Nation',
'Language',
'Region',
'State',
'Source',
'Culture'])
This is pretty good. For some of the more technical terms it has more trouble, but this should be good enough to process the Wikipedia data.
The aim, once the dump is processed, is to aggregate these features into clusters in various ways. Processing will be tricky as I have to ensure that there is enough context to allow the model to produce meaningful output. I also have to be careful as the raw output is 250,002 values per link (one per token in the vocabulary), which takes up a chunk of memory.
Writing the output to files and doing the aggregation later will be very important.
I now need to run this over the whole of Wikipedia. Once that has been done I can try generating the list of different words or phrases that link to each page.
I did start this by running each link/sentence pair through it and recording every single token probability. This resulted in files that were ~800MB for 1,000 pairs. Given that there are millions of articles and many links per page this is infeasible.
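Storing only the strongest predictions keeps the files manageable. The parquet data shown later in this post has aligned probability and index columns in descending order, so the idea is something like this (the cutoff of 100 tokens here is just an assumption):

from typing import List, Tuple

import numpy as np
import torch

def top_features(logits: np.ndarray, k: int = 100) -> Tuple[List[float], List[int]]:
    # keep only the k most likely tokens instead of all 250,002 probabilities
    probabilities = torch.from_numpy(logits).softmax(dim=-1)
    top = probabilities.topk(k=k)
    return top.values.tolist(), top.indices.tolist()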
The next problem is that doing this one row at a time is very slow, even with CUDA. I need to be able to prepare the data quickly and then run batches through the model. Normally I would do this with ray, but I have upgraded the Python that runs this blog and ray does not yet support it.
That is why I have to process this in a separate repository.
The repository is available here. I’ve processed the synonyms and about 2% of the features.
We can have a quick look at the processed data now. Trying to cluster them will be another post.
from pathlib import Path

import pandas as pd

DATA_FOLDER = Path("/data/prompt-internalization/multilingual/wikipedia/enwiki/20220701/")

synonyms_df = pd.read_parquet(DATA_FOLDER / "synonyms.gz.parquet")
features_df = pd.concat([
    pd.read_parquet(file)
    for file in sorted((DATA_FOLDER / "features").glob("*.gz.parquet"))
])
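Something like the following cells produces the views below: the full synonym table ordered by count, the synonyms that differ from the title of their target page, and the number of synonyms per page.

# every synonym, ordered by how often it links to its target
synonyms_df.sort_values(by="count", ascending=False)

# synonyms that differ from the title of the page they point to
synonyms_df[synonyms_df.target != synonyms_df.synonym].sort_values(by="count", ascending=False)

# how many different synonyms each page has
synonyms_df.target.value_counts()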
| | target | synonym | count |
---|---|---|---|
31478537 | united states | united states | 275981 |
5118094 | association football | association football | 208462 |
33043071 | world war ii | world war ii | 151965 |
12447034 | france | france | 142281 |
30181088 | the new york times | the new york times | 134794 |
... | ... | ... | ... |
8160695 | cho chong-kil | cho chong-kil | 1 |
8160696 | cho choong-hoon | cho choong-hoon | 1 |
19889493 | madhav gopal naseri | madhav gopal naseri | 1 |
19889492 | madhav godbole | madhav godbole | 1 |
6194003 | bhandasur | bhandasur | 1 |
33691008 rows × 3 columns
| | target | synonym | count |
---|---|---|---|
5118191 | association football | footballer | 117471 |
19202682 | list of sovereign states | country | 109280 |
5118139 | association football | football | 71084 |
31297808 | u.s. state | state | 69034 |
9057592 | countries of the world | country | 67898 |
... | ... | ... | ... |
12468103 | francis egerton, 7th duke of sutherland | the 7th duke of sutherland | 1 |
12468105 | francis egerton, 8th earl of bridgewater | 8th earl | 1 |
12468107 | francis egerton, 8th earl of bridgewater | duke of bridgewater (1756-1829) | 1 |
12468108 | francis egerton, 8th earl of bridgewater | earl of bridgewater | 1 |
33691004 | 🮽 | ❎︎ | 1 |
16643648 rows × 3 columns
roman numerals 1609
list of moths of north america 1250
billboard charts 1070
u.s. cellular field 1036
postal codes in canada 1014
...
intertidal chalk 1
intertidal ecosystem 1
intertidal fish 1
intertidal flat 1
🯅 1
Name: target, Length: 17047360, dtype: int64
There are about 34 million synonyms that have been found for 17 million articles. Some of the article titles are single Unicode characters which do not render well in this notebook.
| | target | probability | index |
---|---|---|---|
0 | goodwood festival of speed | [0.13652878, 0.09209253, 0.08164861, 0.0800229... | [70643, 15757, 48962, 90788, 131899, 60457, 32... |
1 | glorious goodwood | [0.18024956, 0.09665708, 0.06521337, 0.0638825... | [90788, 70643, 48962, 15757, 220197, 60457, 20... |
2 | goodwood revival | [0.116899185, 0.115187414, 0.10732716, 0.08845... | [70643, 90788, 48962, 15757, 60457, 74831, 499... |
3 | goodwood, south australia | [0.4112736, 0.16649252, 0.08323654, 0.04542979... | [90788, 74831, 6406, 6557, 41076, 49990, 79200... |
4 | electoral district of goodwood | [0.117247075, 0.10608322, 0.08746431, 0.064992... | [150533, 60457, 70643, 15757, 90788, 23994, 22... |
... | ... | ... | ... |
15291 | jerome kern | [0.18191482, 0.1264575, 0.1066387, 0.09394692,... | [83358, 13703, 69891, 220197, 61804, 15757, 31... |
15292 | guy bolton | [0.19696012, 0.12988408, 0.12146077, 0.0872782... | [61804, 83358, 13703, 31068, 69891, 220197, 15... |
15293 | bohemianism | [0.64192796, 0.0672424, 0.029274452, 0.0283800... | [98148, 105141, 83658, 74831, 104384, 13703, 5... |
15294 | basil rathbone | [0.23146918, 0.15656792, 0.09903663, 0.0662543... | [61804, 15757, 220197, 69891, 33734, 31068, 48... |
15295 | tallulah bankhead | [0.12158471, 0.07587505, 0.06905762, 0.060985,... | [15757, 48962, 33734, 220197, 10348, 61804, 69... |
6155296 rows × 3 columns
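Similar cells summarize the processed features: how many links point at each target article, and how often each token index appears across all of the stored feature lists (again a sketch of the kind of cell involved).

# number of processed links per target article
features_df.target.value_counts()

# how often each token index appears across all stored feature lists
features_df["index"].explode().value_counts()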
united states 6744
world war ii 6194
new york city 4369
france 3639
england 3499
...
18th army (german empire) 1
preston brown (general) 1
vivières 1
lizy-sur-ourcq 1
george hendric houghton 1
Name: target, Length: 1793996, dtype: int64
15757 5987901
70643 5920934
77641 5708869
90788 5668998
60457 5518613
...
70548 1
90519 1
52428 1
213625 1
239593 1
Name: index, Length: 18318, dtype: int64
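Decoding the most common indices shows which tokens dominate. Assuming the xlm-roberta-base tokenizer is loaded, something like this produces the list below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
[
    tokenizer.decode(int(index)).strip()
    for index in features_df["index"].explode().value_counts().index[:5]
]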
['Name', 'Description', 'Source', 'Location', 'Type']
About 6 million links, associated with 1.8 million articles, have been processed. The probability and index columns are aligned and sorted in descending order of probability. The index column selects tokens out of the 250,002 that form the xlm-roberta-base vocabulary.
You can see that the model output has not been filtered as the most common tokens make a strong appearance again. If we compare those tokens to the top 5 from our previous investigation we can see a \(\frac{4}{5}\) overlap:
Token | Probability |
---|---|
Owner | 0.107 |
Name | 0.065 |
Description | 0.048 |
Type | 0.035 |
Location | 0.029 |
When working with this to try to calculate clusters it may be appropriate to exclude these tokens.
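A simple way to do that would be to drop the over-represented token indices from each row before clustering. A sketch, assuming the five indices decoded above are the ones to exclude:

import pandas as pd

EXCLUDED_TOKENS = {15757, 70643, 77641, 90788, 60457}

def filter_row(row: pd.Series) -> pd.Series:
    # drop the over-represented tokens, keeping the probability and index columns aligned
    kept = [
        (probability, index)
        for probability, index in zip(row["probability"], row["index"])
        if index not in EXCLUDED_TOKENS
    ]
    row = row.copy()
    row["probability"] = [probability for probability, _ in kept]
    row["index"] = [index for _, index in kept]
    return row

filtered_df = features_df.apply(filter_row, axis=1)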