Wikipedia Data Generation

This seems to be bigger than I anticipated - let’s further investigate and prepare the data
Published

July 30, 2021

I want to be able to take some text and work out what it is talking about. Wikipedia looks like a good dataset for this as it has a consistent way to refer to things within text (the link system) and a way to find out about the thing being referred to (the linked page). I've already processed the data dump to some degree, and now I need to process it further to make it suitable for a model to work with.

The preprocessing relates to the two aspects of interest - identifying the linked text and its target, and describing each page as a target itself. To start with, the titles need to be categorized. Links that are not available as titles should not be categorized, as there is no way to describe them.
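
To make the last point concrete, here is a minimal sketch of the filtering I have in mind. The small available_titles set is a made-up stand-in for the full title index that gets built later in this post.

Code
# a sketch of dropping links whose target page is not available as a title,
# using a made-up title set purely for illustration
available_titles = {"Anarchism", "Autism", "Albedo"}
links = ["Anarchism", "Political movement", "Albedo"]

categorizable = [link for link in links if link in available_titles]
skipped = [link for link in links if link not in available_titles]

print(categorizable)  # ['Anarchism', 'Albedo']
print(skipped)        # ['Political movement'] - no page text, so no way to describe it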


Tokenization

The first part of all of this is to tokenize the text, because the tokens will be linked to the target pages and will also be used to describe those pages. I'm going to practice by processing a single file and then extend the process to all pages.

Code
import pandas as pd

df = pd.read_parquet("/data/blog/2021-07-28-wikipedia-link-recognition/enwiki-20210701-pages-articles1.xml-p1p41242.gz.parquet")
df
title text link start end
0 Anarchism Anarchism is a political philosophy and moveme... [political philosophy, Political movement, aut... [15, 40, 70, 127, 179, 264, 317, 344, 362, 392... [35, 48, 79, 136, 184, 272, 336, 355, 383, 410...
1 Autism Autism is a developmental disorder characteriz... [developmental disorder, Regressive autism, de... [12, 308, 375, 461, 473, 562, 588, 612, 621, 6... [34, 318, 399, 468, 494, 569, 601, 619, 631, 6...
2 Albedo sunlight relative to various surface conditio... [sunlight, diffuse reflection, sunlight, solar... [1, 117, 139, 172, 239, 397, 417, 820, 865, 14... [9, 135, 154, 187, 249, 406, 427, 839, 876, 14...
3 A A, or a, is the first letter and the first vow... [Letter (alphabet), vowel letter, English alph... [22, 43, 63, 95, 144, 168, 203, 224, 258, 598,... [28, 55, 86, 119, 145, 171, 223, 229, 267, 609...
4 Alabama Alabama () is a state in the Southeastern regi... [Southeastern United States, United States, Te... [29, 56, 83, 107, 128, 144, 177, 217, 246, 272... [41, 69, 92, 114, 135, 158, 188, 237, 264, 283...
... ... ... ... ... ...
21073 Heuristic routing Heuristic routing is a system used to describe... [network topology, Heuristic, Routing, telecom... [90, 114, 212, 325, 357, 435, 1843, 1848, 2037... [106, 123, 219, 351, 374, 444, 1847, 1853, 204...
21074 Hierarchical routing Hierarchical routing is a method of routing in... [routing, network address, Transmission Contro... [36, 86, 103, 133, 152, 228, 254, 276, 340, 35... [43, 96, 132, 150, 158, 235, 261, 280, 348, 36...
21075 High-performance equipment High-performance equipment describes telecommu... [telecommunications, electromagnetic interfere... [37, 249, 309] [55, 277, 316]
21076 Hop A hop is a type of jump. Hop or hops may also ... [Jumping, Hop (film), Hop! Channel, House of P... [19, 58, 84, 122, 167, 217, 274, 405, 432, 512... [23, 68, 96, 136, 176, 225, 286, 429, 444, 522...
21077 Horn Horn most often refers to: *Horn (acoustic), a... [Horn (acoustic), Horn (instrument), Horn (ana... [28, 102, 179, 340, 495, 514, 530, 543, 560, 6... [43, 119, 193, 347, 511, 526, 540, 557, 566, 6...

21078 rows × 5 columns

Code
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
Code
row = df.iloc[0]
tokenized_text = tokenizer(row.text, return_attention_mask=False, return_offsets_mapping=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (6536 > 1024). Running this sequence through the model will result in indexing errors
Code
tokenized_text.keys()
dict_keys(['input_ids', 'offset_mapping'])
Code
token_starts = {start for start, _ in tokenized_text["offset_mapping"]}
token_ends = {end for _, end in tokenized_text["offset_mapping"]}

all(
    start in token_starts
    for start in row.start
), all(
    end in token_ends
    for end in row.end
)
(True, False)
Code
for index in [
    index
    for index, end in enumerate(row.end)
    if end not in token_ends
]:
    start, end, link = row.start[index], row.end[index], row.link[index]
    surrounding_text = row.text[start-3:end+3]
    print(f"text[{start: 6d}:{end: 6d}] = '{link}', surround is '{surrounding_text}'")
text[   632:   637] = 'realm', surround is 's, realms o'
text[   642:   648] = 'empire', surround is 'or empires. '
text[  4851:  4862] = 'institution', surround is 'ed institutions) '
text[  8745:  8751] = 'reason', surround is 'nd reasoning'
text[ 14445: 14459] = 'peace movement', surround is 'nd peace movements, '
text[ 15586: 15600] = 'affinity group', surround is 'de affinity groups, '
text[ 20165: 20186] = 'voluntary association', surround is 'of voluntary associations, '
text[ 20189: 20205] = 'workers' council', surround is 's, workers' councils a'
text[ 20211: 20229] = 'worker cooperative', surround is 'nd worker cooperatives, '
text[ 20769: 20785] = 'labour syndicate', surround is 'ws labour syndicates a'
text[ 25454: 25464] = 'revolution', surround is 'in revolutions. '
text[ 26594: 26607] = 'strike action', surround is 'in strike actions, '
text[ 29306: 29320] = 'affinity group', surround is 'st affinity groups p'
text[ 30895: 30920] = 'Temporary Autonomous Zone', surround is 'ed Temporary Autonomous Zones ('

The problem here appears to be that the tokenized text includes a suffix (a plural 's', '-ing' and so on) while the linked text does not. I feel that the solution is to include any token that intersects a link. On this page all of the problems are at the end of the linked text, however I can believe that this could also occur at the start, so the solution should be symmetric.
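
To make the intersection rule concrete, here is a minimal check of the symmetric overlap test I intend to use. Only the link span [4851, 4862) for 'institution' comes from the output above; the token end offset is assumed for illustration.

Code
# a token [token_start, token_end) is part of a link [link_start, link_end)
# if the two half-open spans intersect at all
def overlaps(token_start: int, token_end: int, link_start: int, link_end: int) -> bool:
    return token_start < link_end and token_end > link_start

# 'institutions' as one token against the 'institution' link - still counts
print(overlaps(4851, 4863, 4851, 4862))  # True
# a token that stops exactly where the link starts does not count
print(overlaps(4840, 4851, 4851, 4862))  # False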

When marking up the tokens there are three outputs - whether the token is the start of an entity, whether the token is within an entity, and the specific entity that the token relates to. I can start by checking the boundary code.

Code
from typing import *

def to_boundaries(
    token_offsets: List[Tuple[int, int]],
    link_starts: List[int],
    link_ends: List[int],
    link_targets: List[str],
) -> List[Tuple[bool, bool, Optional[str]]]:
    boundaries = []
    link_iter = zip(link_starts, link_ends, link_targets)
    try:
        link_start, link_end, link_target = next(link_iter)

        within = False
        for token_start, token_end in token_offsets:
            if token_start == token_end: # zero width token
                boundaries.append((False, within, None))
                continue

            # advance past any links that end before this token starts
            while token_start >= link_end:
                link_start, link_end, link_target = next(link_iter)
                within = False

            if token_start < link_end and token_end > link_start:
                # the token intersects the current link, even partially
                boundaries.append((not within, True, link_target))
                within = True
            else:
                boundaries.append((False, False, None))
    except StopIteration:
        boundaries += [(False, False, None)] * (len(token_offsets) - len(boundaries))

    return boundaries
Code
list(
    zip(
        [
            tokenizer.decode(token)
            for token in tokenized_text["input_ids"]
        ],
        to_boundaries(
            token_offsets=tokenized_text["offset_mapping"],
            link_starts=row.start,
            link_ends=row.end,
            link_targets=row.link,
        )
    )
)[:25]
[('<s>', (False, False, None)),
 ('An', (False, False, None)),
 ('arch', (False, False, None)),
 ('ism', (False, False, None)),
 (' is', (False, False, None)),
 (' a', (False, False, None)),
 (' political', (True, True, 'political philosophy')),
 (' philosophy', (False, True, 'political philosophy')),
 (' and', (False, False, None)),
 (' movement', (True, True, 'Political movement')),
 (' that', (False, False, None)),
 (' is', (False, False, None)),
 (' scept', (False, False, None)),
 ('ical', (False, False, None)),
 (' of', (False, False, None)),
 (' authority', (True, True, 'authority')),
 (' and', (False, False, None)),
 (' rejects', (False, False, None)),
 (' all', (False, False, None)),
 (' involuntary', (False, False, None)),
 (',', (False, False, None)),
 (' coercive', (False, False, None)),
 (' forms', (False, False, None)),
 (' of', (False, False, None)),
 (' hierarchy', (True, True, 'hierarchy'))]

This looks good. The next thing is to collect the complete list of titles so that they can be categorized.

Code
from pathlib import Path

ENWIKI_FILES = sorted(Path("/data/blog/2021-07-28-wikipedia-link-recognition/").glob("*.gz.parquet"))
DATA_FOLDER = Path("/data/blog/2021-07-30-wikipedia-data-generation/")
DATA_FOLDER.mkdir(exist_ok=True, parents=True)
Code
import pandas as pd
from typing import *
from tqdm.auto import tqdm

def categorize_titles(paths: List[Path]) -> pd.DataFrame:
    titles = pd.Series(dtype="object")
    for path in tqdm(paths):
        df = pd.read_parquet(path)
        titles = pd.concat([
            titles,
            df.title
        ])
    titles = (
        titles.drop_duplicates()
            .sort_values()
            .reset_index(drop=True)
    )
    df = pd.DataFrame(data=titles.index, index=titles.values)
    return df.rename(columns={0: "index"})
Code
titles = categorize_titles(ENWIKI_FILES)
titles.to_parquet(DATA_FOLDER / "title-to-index.gz.parquet", compression="gzip")
Code
len(titles)
6328478

Creating a 6 million class classifier is silly. There is no reason to believe it would work well, and what happens when the next Wikipedia dump adds new pages?

The next thing to do is to tokenize the text for each page and work out what the dominant tokens are. These tokens should be the ones that appear more frequently in the page than in Wikipedia as a whole. Once again this needs to be done incrementally so that it fits in memory. I can combine this with the categorization of the titles by creating an appropriate dataframe.
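
The actual dominant-token calculation will have to wait until the counts exist, but here is a rough sketch of what I have in mind, assuming a dataframe of per-page token counts with one column per token id. The smoothing constant and top_n are purely illustrative choices, not the final approach.

Code
from typing import List

import pandas as pd

def dominant_tokens(page_counts: pd.DataFrame, top_n: int = 50) -> pd.Series:
    # corpus frequency of each token, summed over every page
    corpus_counts = page_counts.sum(axis="index")
    corpus_frequency = corpus_counts / corpus_counts.sum()

    def page_dominant(row: pd.Series) -> List[str]:
        page_frequency = row / row.sum()
        # the small constant avoids dividing by zero for tokens the corpus never saw
        ratio = page_frequency / (corpus_frequency + 1e-12)
        return ratio.nlargest(top_n).index.tolist()

    # one list of over-represented token ids per page
    return page_counts.apply(page_dominant, axis="columns")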

Code
from typing import *
from collections import Counter
from transformers import AutoTokenizer
from tqdm.auto import tqdm
import numpy as np

COLUMNS = [str(index) for index in range(tokenizer.vocab_size)]
ZERO = pd.Series(
    data=0,
    index=COLUMNS
).astype(int)

def all_token_counts(df: pd.DataFrame, tokenizer: AutoTokenizer, title_to_index: pd.DataFrame) -> pd.DataFrame:
    metadata_df = pd.DataFrame([
        {
            "title": title,
            "index": title_to_index.loc[title].item(),
        }
        for title in df.title
    ])
    token_df = token_counts(
        tokenizer=tokenizer, text=df.text
    )
    df = pd.merge(
        left=metadata_df,
        right=token_df,
        left_index=True,
        right_index=True,
    ).set_index("index")

    return df

def token_counts(tokenizer: AutoTokenizer, text: pd.Series) -> pd.DataFrame:
    text_tokens = tokenizer(text.tolist(), return_attention_mask=False)["input_ids"]
    return pd.DataFrame([
        _bincount(tokenizer, tokens)
        for tokens in text_tokens
    ], columns=COLUMNS)

def _bincount(tokenizer: AutoTokenizer, tokens: List[int]) -> np.ndarray:
    # count each token id and pad the histogram out to the full vocabulary size
    array = np.bincount(np.array(tokens))
    array = np.append(
        array,
        np.zeros(tokenizer.vocab_size - array.shape[0], dtype=int)
    )
    return array
Code
from pathlib import Path
from tqdm.auto import tqdm

DATA_FOLDER = Path("/data/blog/2021-07-30-wikipedia-data-generation/")
DATA_FOLDER.mkdir(exist_ok=True, parents=True)
count = 0

for path in tqdm(ENWIKI_FILES):
    df = pd.read_parquet(path)
    
    for index in tqdm(range(0, len(df), 1_000), leave=False):
        destination = DATA_FOLDER / f"{count:08d}-token-counts.gz.parquet"
        count += 1
        if destination.exists():
            continue

        token_count_df = all_token_counts(
            df[index: index+1_000],
            tokenizer=tokenizer,
            title_to_index=titles
        )
        token_count_df.to_parquet(
            destination,
            compression="gzip"
        )

So here we are with another monstrous data processing task. This will take ~30 hours to complete. I’ll have to continue this in another post.