from pathlib import Path

ENWIKI_FILES = sorted(Path("/data/wikipedia/external/enwiki/20210701").glob("*.bz2"))
July 28, 2021
Wikipedia has text with hyperlinks. Each hyperlink leads to the page for the linked term, which disambiguates it. For example, the iPhone page has the text "designed and marketed by Apple Inc." while the fruit page has the text "such as the apple and the pomegranate". Both of these pages contain the term apple, but the pages the two links point to differ.
In the previous posts I have explored using the output of a language model to predict the sentiment of a piece of text. What if I took the raw output of the model and clustered it to find out which page a given term refers to?
This is training without a target, so the aim would be to take short utterances that have links within them, strip the Wikipedia markup, and then extract the model outputs for the link tokens. These outputs can then be passed to a cosine similarity loss, which will draw together the outputs for the same link and drive apart those for different links. This allows the outputs for the same link in different texts to cluster together without knowing in advance where they will cluster.
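As a rough sketch of what such a loss might look like, the snippet below scores every pair of link outputs: pairs with the same link target are penalised by one minus their cosine similarity, and pairs with different targets are penalised only when their similarity exceeds a margin. The function names, the margin parameter and the toy two-dimensional vectors are all assumptions made for illustration; a real training loop would use a tensor library rather than plain Python lists.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_cosine_loss(
    outputs: List[Sequence[float]],
    links: List[str],
    margin: float = 0.0,  # hypothetical parameter: similarity tolerated between different links
) -> float:
    """Mean loss over all pairs of outputs.

    Same link:      1 - cos(a, b)            (pull together)
    Different link: max(0, cos(a, b) - margin) (push apart)
    """
    losses = []
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            similarity = cosine_similarity(outputs[i], outputs[j])
            if links[i] == links[j]:
                losses.append(1.0 - similarity)
            else:
                losses.append(max(0.0, similarity - margin))
    if not losses:
        return 0.0
    return sum(losses) / len(losses)


# Toy outputs: two mentions of the company and one of the fruit.
outputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
links = ["Apple Inc.", "Apple Inc.", "Apple"]
print(pairwise_cosine_loss(outputs, links))
```

The loss is zero when mentions of the same link coincide and mentions of different links are orthogonal, which is the clustering behaviour described above.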
I’ve used this approach for the prompt training, and it did not work well then. Hopefully it will work better this time. It will involve some complex Wikipedia processing, so I’ll have to get started on that.
I need to get a Wikipedia dump that I can process to extract the linked text and the links. As far as possible I need to get the clean text out of each page as well.
#collapse
from typing import Iterator, Optional, Tuple
from pathlib import Path
import string
import bz2

import regex as re
from lxml.etree import Element, iterparse
import wikitextparser as wtp

MEDIAWIKI_NAMESPACE = "http://www.mediawiki.org/xml/export-0.10/"


def read_articles(file: Path) -> Iterator[Tuple[str, wtp.WikiText]]:
    # yields (title, parsed wikitext) for every article page in the dump file
    for page in read_pages(file):
        try:
            article = _get_article(page)
            if article:
                yield article
        except Exception as e:
            # skip pages that fail to parse, but report why
            print(e)


def _get_article(element: Element) -> Optional[Tuple[str, wtp.WikiText]]:
    # namespace 0 holds the articles; talk pages, templates etc. are skipped
    namespace = _get("mw:ns", element)
    if namespace is None or namespace.text != "0":
        return None

    redirect = _get("mw:redirect", element)
    if redirect is not None:
        return None

    title = _get("mw:title", element)
    if title is None:
        return None
    title = title.text

    text_element = _get("mw:revision/mw:text", element)
    if text_element is None:
        return None
    text = text_element.text
    parsed = wtp.parse(text)

    return title, parsed


def _get(path: str, element: Element) -> Optional[Element]:
    elements = element.xpath(path, namespaces={"mw": MEDIAWIKI_NAMESPACE})
    if elements:
        return elements[0]
    return None


TITLE_PATTERN = re.compile(r"^=+ .* =+$", flags=re.MULTILINE)
CATEGORY_PATTERN = re.compile(r"^Categoría:.*$", flags=re.MULTILINE)
LEADING_COLON_OR_HASH = re.compile(r"^[:#]", flags=re.MULTILINE)
MANY_BLANK_LINES = re.compile(r"(\n\s*)+\n+")


def _clean_text(parsed: wtp.WikiText) -> str:
    text = parsed.plain_text()
    for to_remove in [*parsed.get_lists(), *parsed.get_tables(), *parsed.get_tags()]:
        text = text.replace(to_remove.plain_text(), "")
    for pattern in [TITLE_PATTERN, CATEGORY_PATTERN, LEADING_COLON_OR_HASH, MANY_BLANK_LINES]:
        text = pattern.sub("", text)
    text = text.strip(string.whitespace + "\n\r")
    return text


def read_pages(file: Path) -> Iterator[Element]:
    # stream the compressed dump one <page> element at a time
    with bz2.open(file, "rb") as handle:
        for _event, element in iterparse(
            handle,
            tag=f"{{{MEDIAWIKI_NAMESPACE}}}page",
            events=("end",)
        ):
            yield element
            _clear_memory(element)


def _clear_memory(element: Element) -> None:
    # drop the processed element and its earlier siblings so memory stays flat
    element.clear()
    for ancestor in element.xpath("ancestor-or-self::*"):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]
[WikiLink('[[Emir]]'),
WikiLink('[[Mosul]]'),
WikiLink('[[Dirham]]'),
WikiLink('[[Qutb al-Din Mawdud]]'),
WikiLink("[[Izz ad-Din Mas'ud]]")]
'\n\n\nSayf al-Din Ghazi (II) ibn Mawdud (; full name: Sayf al-Din Ghazi II ibn Mawdud ibn Zengi; died 1180) was a Zangid Emir of Mosul, the nephew of Nur ad-Din Zengi. \n\nHe became Emir of Mosul in 1170 after the death of his father Qutb ad-Din Mawdud. S'
'{{Multiple issues|\n{{Refimprove|date=March 2016}}\n{{More footnotes|date=March 2016}}\n}}\n{{Infobox royalty\n| type =\n| title = [[Emir]] of [[Mosul]]\n| name = Sayf al-Din Ghazi II\n| more = \n| image = Dirham of Saif al-Din Ghazi II, 1171-1172.jpg\n| capti'
(WikiLink('[[Qutb al-Din Mawdud|Mawdud]]'), 'Qutb al-Din Mawdud', 'Mawdud')
So it looks like I need to handle this manually a little. I can take all of the wikilinks from the text, search for the plain_text of each one, and associate that with its target.
A thought I have had is that the language model output (the tokens) could also be constrained. If the target page is tokenized and the tokens that occur distinctively often on that page (by TF-IDF) are retained, then the model could also be trained to prefer them.
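To make the TF-IDF idea a little more concrete, here is a minimal sketch that keeps the highest scoring tokens for each page. It assumes the pages have already been tokenized (plain lowercase words stand in for real tokenizer output), and the function name, the unsmoothed scoring formula and the toy pages are all made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def top_tfidf_tokens(pages: Dict[str, List[str]], k: int = 3) -> Dict[str, List[str]]:
    """Return the k tokens with the highest TF-IDF score for each page."""
    document_count = len(pages)

    # number of pages each token appears on
    document_frequency: Counter = Counter()
    for tokens in pages.values():
        document_frequency.update(set(tokens))

    result = {}
    for title, tokens in pages.items():
        counts = Counter(tokens)
        scores = {
            # term frequency on this page times log inverse document frequency
            token: (count / len(tokens))
            * math.log(document_count / document_frequency[token])
            for token, count in counts.items()
        }
        # highest score first, ties broken alphabetically for stable output
        ranked = sorted(scores, key=lambda token: (-scores[token], token))
        result[title] = ranked[:k]
    return result


pages = {
    "Apple Inc.": ["apple", "designs", "the", "iphone", "the", "iphone"],
    "Apple": ["the", "apple", "is", "a", "fruit", "fruit"],
}
print(top_tfidf_tokens(pages, k=1))
```

Tokens that appear on every page (here "the" and "apple") score zero, so only the tokens that distinguish a page from the others are retained.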
'\n\n\nSayf al-Din Ghazi (II) ibn Mawdud (; full name: Sayf al-Din Ghazi II ibn [[Qutb al-Din Mawdud|Mawdud]] ibn [[Imad al-Din Z'
Turns out I don’t need to do this at all: with replace_wikilinks=False the plain text keeps the links in place. I will need to extract the links from the text and then be able to spot them in the tokenized form. They will also need filtering, as I don’t think the Category links help.
#collapse
from typing import Any, Dict

import pandas as pd
import wikitextparser as wtp
import regex as re

CATEGORY_LINK_PATTERN = re.compile(r"\[\[[^]]+:[^]]+\]\]")
TITLE_PATTERN = re.compile(r"(=+)[^=]+\1")
WHITESPACE_PATTERN = re.compile(r"\s+")


def extract_links(article: wtp.WikiText) -> Dict[str, Any]:
    # keep the [[...]] markup so the links can still be located
    text = (
        article
            .plain_text(replace_wikilinks=False)
            .strip()
    )
    text = CATEGORY_LINK_PATTERN.sub("", text)
    text = TITLE_PATTERN.sub("", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    parsed = wtp.parse(text)

    links = []
    starts = []
    ends = []
    # offset is the number of markup characters removed before the current position
    offset = 0
    for link in parsed.wikilinks:
        target = link.target
        start, end = link.span
        length = end - start
        start -= offset
        # the markup of this link shrinks to its visible text
        offset += length - len(link.plain_text())
        end -= offset

        links.append(target)
        starts.append(start)
        ends.append(end)

    return {
        "text": parsed.plain_text(),
        "link": links,
        "start": starts,
        "end": ends,
    }
text Sayf al-Din Ghazi (II) ibn Mawdud (; full name...
start [73, 84, 108, 115, 144, 172, 224, 452, 555, 67...
end [79, 89, 114, 128, 160, 185, 242, 458, 561, 68...
Name: 0, dtype: object
So this seems to be working. I need to convert all of the articles this way. The result will need post-processing, as the different page titles need to be categorized, and then I can work on the different ideas for training the model.
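The offset arithmetic is easier to follow on a single sentence. The sketch below repeats the same bookkeeping using only the standard library re module in place of wikitextparser; the simplified link pattern (no nested links or templates) and the function name are assumptions made for this illustration.

```python
import re
from typing import Any, Dict

# Simplified pattern for [[target]] and [[target|text]]; nested links are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def strip_links(markup: str) -> Dict[str, Any]:
    links, starts, ends = [], [], []
    offset = 0  # markup characters removed so far
    for match in WIKILINK.finditer(markup):
        target = match.group(1)
        label = match.group(2) or target  # visible text of the link
        start, end = match.span()
        start -= offset                        # position in the plain text
        offset += (match.end() - match.start()) - len(label)
        end -= offset                          # end shifts by this link's markup too
        links.append(target)
        starts.append(start)
        ends.append(end)
    text = WIKILINK.sub(lambda m: m.group(2) or m.group(1), markup)
    return {"text": text, "link": links, "start": starts, "end": ends}


row = strip_links("Emir of [[Mosul]], son of [[Qutb al-Din Mawdud|Mawdud]].")
for target, start, end in zip(row["link"], row["start"], row["end"]):
    # slicing the plain text with the adjusted span recovers the visible text
    print(target, "->", row["text"][start:end])
```

Each start is shifted by the markup removed before the link, and each end by the markup removed up to and including the link, so the spans index directly into the stripped text.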
My current thoughts are that spotting the entities would be nice. Apparently it’s best to use two classifiers for that, one that marks the start of an entity and one that marks whether a token is within or outside an entity, as this captures entity boundaries and makes it slightly easier to see where the entities are.
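As a sketch of how those two labels could be derived from the link spans, the snippet below marks each token with a "starts an entity" flag and an "inside an entity" flag. Splitting on whitespace stands in for a real tokenizer, and the function name and output shape are assumptions for illustration only.

```python
import re
from typing import List, Tuple


def label_tokens(
    text: str, spans: List[Tuple[int, int]]
) -> List[Tuple[str, bool, bool]]:
    """Return (token, starts_entity, inside_entity) for each whitespace token."""
    labelled = []
    for match in re.finditer(r"\S+", text):
        token_start, token_end = match.span()
        # spans that overlap this token
        covering = [
            (start, end)
            for start, end in spans
            if token_start < end and token_end > start
        ]
        inside = bool(covering)
        # the token starts an entity when a span begins at or inside it
        starts = any(token_start <= start for start, _end in covering)
        labelled.append((match.group(), starts, inside))
    return labelled


text = "his father Qutb al-Din Mawdud died"
begin = text.index("Qutb")
spans = [(begin, begin + len("Qutb al-Din Mawdud"))]
for token, starts, inside in label_tokens(text, spans):
    print(f"{token:10} start={starts} inside={inside}")
```

Only "Qutb" is flagged as a start while "Qutb", "al-Din" and "Mawdud" are all flagged as inside, so two adjacent entities would still be distinguishable by their start flags.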
I need to be able to convert all this data. Hopefully I can write a dataframe that has a dataframe as a cell value heh. Turns out that didn’t work, so I’ve converted the link / start / end values into lists instead.
from tqdm.auto import tqdm
from pathlib import Path
import pandas as pd

DATA_FOLDER = Path("/data/blog/2021-07-28-wikipedia-link-recognition/")
DATA_FOLDER.mkdir(exist_ok=True, parents=True)

for path in tqdm(ENWIKI_FILES):
    destination = DATA_FOLDER / f"{path.stem}.gz.parquet"
    # if destination.exists():
    #     continue

    rows = []
    failed = 0
    for title, article in read_articles(path):
        try:
            rows.append({
                "title": title,
                **extract_links(article)
            })
        except Exception:
            # count articles that cannot be converted instead of stopping the run
            failed += 1
    df = pd.DataFrame(rows)
    df.to_parquet(destination)
    print(f"Written {len(df):,} articles to {destination}. {failed:,} articles could not be converted")
Written 21,078 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles1.xml-p1p41242.gz.parquet. 3 articles could not be converted
Written 169,553 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles10.xml-p4045403p5399366.gz.parquet. 1 articles could not be converted
Written 179,682 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles11.xml-p5399367p6899366.gz.parquet. 1 articles could not be converted
Written 16,346 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles11.xml-p6899367p7054859.gz.parquet. 0 articles could not be converted
Written 143,743 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles12.xml-p7054860p8554859.gz.parquet. 1 articles could not be converted
Written 56,813 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles12.xml-p8554860p9172788.gz.parquet. 0 articles could not be converted
Written 84,568 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles13.xml-p10672789p11659682.gz.parquet. 0 articles could not be converted
Written 120,029 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles13.xml-p9172789p10672788.gz.parquet. 1 articles could not be converted
Written 168,883 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles14.xml-p11659683p13159682.gz.parquet. 0 articles could not be converted
Written 101,040 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles14.xml-p13159683p14324602.gz.parquet. 0 articles could not be converted
Written 144,741 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles15.xml-p14324603p15824602.gz.parquet. 0 articles could not be converted
Written 117,697 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles15.xml-p15824603p17324602.gz.parquet. 0 articles could not be converted
Written 10,976 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles15.xml-p17324603p17460152.gz.parquet. 0 articles could not be converted
Written 143,385 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles16.xml-p17460153p18960152.gz.parquet. 1 articles could not be converted
Written 130,359 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles16.xml-p18960153p20460152.gz.parquet. 1 articles could not be converted
Written 8,809 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles16.xml-p20460153p20570392.gz.parquet. 0 articles could not be converted
Written 148,661 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles17.xml-p20570393p22070392.gz.parquet. 1 articles could not be converted
Written 142,442 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles17.xml-p22070393p23570392.gz.parquet. 0 articles could not be converted
Written 19,514 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles17.xml-p23570393p23716197.gz.parquet. 0 articles could not be converted
Written 143,022 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles18.xml-p23716198p25216197.gz.parquet. 1 articles could not be converted
Written 128,863 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles18.xml-p25216198p26716197.gz.parquet. 0 articles could not be converted
Written 40,797 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles18.xml-p26716198p27121850.gz.parquet. 0 articles could not be converted
Written 128,463 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles19.xml-p27121851p28621850.gz.parquet. 0 articles could not be converted
Written 116,959 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles19.xml-p28621851p30121850.gz.parquet. 1 articles could not be converted
Written 95,154 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles19.xml-p30121851p31308442.gz.parquet. 0 articles could not be converted
Written 66,705 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles2.xml-p41243p151573.gz.parquet. 2 articles could not be converted
Written 126,813 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles20.xml-p31308443p32808442.gz.parquet. 1 articles could not be converted
Written 130,112 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles20.xml-p32808443p34308442.gz.parquet. 1 articles could not be converted
Written 88,631 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles20.xml-p34308443p35522432.gz.parquet. 0 articles could not be converted
Written 149,321 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles21.xml-p35522433p37022432.gz.parquet. 1 articles could not be converted
Written 126,428 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles21.xml-p37022433p38522432.gz.parquet. 2 articles could not be converted
Written 127,916 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles21.xml-p38522433p39996245.gz.parquet. 1 articles could not be converted
Written 138,490 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles22.xml-p39996246p41496245.gz.parquet. 0 articles could not be converted
Written 133,085 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles22.xml-p41496246p42996245.gz.parquet. 1 articles could not be converted
Written 140,593 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles22.xml-p42996246p44496245.gz.parquet. 1 articles could not be converted
Written 21,534 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles22.xml-p44496246p44788941.gz.parquet. 0 articles could not be converted
Written 85,430 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles23.xml-p44788942p46288941.gz.parquet. 1 articles could not be converted
Written 132,003 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles23.xml-p46288942p47788941.gz.parquet. 2 articles could not be converted
Written 108,389 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles23.xml-p47788942p49288941.gz.parquet. 0 articles could not be converted
Written 84,273 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles23.xml-p49288942p50564553.gz.parquet. 1 articles could not be converted
Written 117,048 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles24.xml-p50564554p52064553.gz.parquet. 0 articles could not be converted
Written 118,427 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles24.xml-p52064554p53564553.gz.parquet. 0 articles could not be converted
Written 110,100 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles24.xml-p53564554p55064553.gz.parquet. 1 articles could not be converted
Written 113,552 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles24.xml-p55064554p56564553.gz.parquet. 0 articles could not be converted
Written 36,474 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles24.xml-p56564554p57025655.gz.parquet. 0 articles could not be converted
Written 124,248 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles25.xml-p57025656p58525655.gz.parquet. 2 articles could not be converted
Written 99,364 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles25.xml-p58525656p60025655.gz.parquet. 1 articles could not be converted
Written 113,551 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles25.xml-p60025656p61525655.gz.parquet. 0 articles could not be converted
Written 80,868 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles25.xml-p61525656p62585850.gz.parquet. 2 articles could not be converted
Written 105,752 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles26.xml-p62585851p63975909.gz.parquet. 4 articles could not be converted
Written 98,352 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles27.xml-p63975910p65475909.gz.parquet. 1 articles could not be converted
Written 105,971 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles27.xml-p65475910p66975909.gz.parquet. 0 articles could not be converted
Written 85,744 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles27.xml-p66975910p68108549.gz.parquet. 0 articles could not be converted
Written 54,167 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles3.xml-p151574p311329.gz.parquet. 1 articles could not be converted
Written 76,425 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles4.xml-p311330p558391.gz.parquet. 1 articles could not be converted
Written 95,129 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles5.xml-p558392p958045.gz.parquet. 1 articles could not be converted
Written 116,929 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles6.xml-p958046p1483661.gz.parquet. 0 articles could not be converted
Written 126,645 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles7.xml-p1483662p2134111.gz.parquet. 2 articles could not be converted
Written 143,575 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles8.xml-p2134112p2936260.gz.parquet. 0 articles could not be converted
Written 164,857 articles to /data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles9.xml-p2936261p4045402.gz.parquet. 3 articles could not be converted
This is clearly a monstrous task so further data processing and model training will be done in a separate post.