Wikipedia Page Mutual Information

Trying to extract meaningful tokens from a page using mutual information
Published

August 1, 2021

I’ve started collecting the token counts per page. I’m going to use the distinctive tokens of a page to identify it when it is linked to. To find those distinctive tokens I want to use the Pointwise Mutual Information (PMI) measure.


Pointwise Mutual Information

The equation for PMI is

\[ \text{PMI} = \log \left( \frac{ p(x,y) }{ p(x)p(y) } \right) \]

This is made up of three parts: the independent probabilities (\(p(x)\) and \(p(y)\)) and the joint probability (\(p(x,y)\)). The intuition is that if the joint probability equals the product of the independent probabilities (i.e. the two variables are independent) then the PMI will be \(\log(1) = 0\). Deviations from this lead to either a positive PMI, if the joint probability is higher than expected, or a negative PMI, if the joint probability is lower than expected.
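As a quick worked example with made-up numbers: suppose a token makes up 1% of the corpus overall (\(p(x) = 0.01\)), a page accounts for \(p(y) = 0.001\) of the corpus, and the token makes up 10% of that page, so \(p(x,y) = 0.001 \times 0.1\). Then

\[ \text{PMI} = \log \left( \frac{0.001 \times 0.1}{0.01 \times 0.001} \right) = \log(10) \approx 2.3 \]

The token is ten times more frequent on that page than in the corpus overall, so the PMI is positive.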

I’m looking for tokens that the model should predict, so that means looking for the high PMI tokens.

The first thing to consider is the practicality of this. I can’t hold the entire dataset in memory at once - there are over 50 thousand different tokens in the vocabulary, and there are over 6 million rows. So I need to be able to compute this incrementally.
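As a rough back-of-the-envelope (assuming a dense float64 matrix, which is pessimistic since the counts are mostly zero):

\[ 50{,}000 \times 6{,}000{,}000 \times 8 \text{ bytes} \approx 2.4 \text{ TB} \]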

To calculate the PMI for a single token I need to know the independent probability for each individual token and for each individual row. So given the matrix:

\[ \begin{array}{ccc} X_{1,1} & \cdots & X_{1,vocab} \\ \vdots & \ddots & \vdots \\ X_{rows,1} & \cdots & X_{rows,vocab} \end{array} \]

We want two sums, one for the columns and one for the rows:

\[ \begin{array}{cccccc} & & & & & p(y) = \\ & X_{1,1} & \cdots & X_{1,vocab} & \rightarrow & \sum_{i=1}^{vocab} X_{1,i} \\ & \vdots & \ddots & \vdots & \rightarrow & \vdots \\ & X_{rows,1} & \cdots & X_{rows,vocab} & \rightarrow & \sum_{i=1}^{vocab} X_{rows,i} \\ & \downarrow & \downarrow & \downarrow & & \\ p(x) = & \sum_{i=1}^{rows} X_{i,1} & \cdots & \sum_{i=1}^{rows} X_{i,vocab} & & \end{array} \]

It is possible to hold a number of rows in memory, and that is why I have written out the rows in blocks of one thousand. So to calculate \(p(x)\) I can go through all of the files one by one and keep a running total. Calculating \(p(y)\) is easy as it is a constant - the rows will be normalized so that they sum to one and thus the probability of any one row being selected is \(\frac{1}{rows}\).
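A minimal sketch of that running total (the helper name accumulate_px is just for illustration), assuming the token-count parquet layout shown in the evaluation below - a title column plus one count column per token id:

Code
from pathlib import Path
from typing import List

import pandas as pd

def accumulate_px(paths: List[Path]) -> pd.Series:
    """Stream the token-count files, keeping a running total of the column sums."""
    totals, rows = None, 0
    for path in paths:
        df = pd.read_parquet(path).drop(columns="title")
        # normalize each row to sum to 1 so every page contributes equally
        df = df.div(df.sum(axis="columns"), axis="rows")
        column_sums = df.sum(axis="rows")
        totals = column_sums if totals is None else totals + column_sums
        rows += len(df)
    # divide by the row count so the result is a probability distribution
    return totals / rows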


Evaluation

To check if this is going to work at all I am going to evaluate this on a single file and treat that as if it were the entire dataset. This will allow me to easily implement the computations and extract the significant tokens without having to wait for the data preparation from the previous post to complete (it still has around 12 hours to go at time of writing).

Code
from pathlib import Path
import pandas as pd

TOKEN_COUNT_FOLDER = Path("/data/blog/2021-07-30-wikipedia-data-generation/")
TOKEN_COUNT_FILES = sorted(TOKEN_COUNT_FOLDER.glob("*-token-counts.gz.parquet"))

df = pd.read_parquet(TOKEN_COUNT_FILES[0])
df.head()
title 0 1 10 100 1000 10000 10001 10002 10003 ... 9990 9991 9992 9993 9994 9995 9996 9997 9998 9999
index
582886 Anarchism 1 0 112 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
766997 Autism 1 0 107 2 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
480347 Albedo 1 0 22 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
343406 A 1 0 42 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
471190 Alabama 1 0 66 3 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 50266 columns

This shows the token counts for each page, heavily truncated. As you can see, token 10 occurs quite a lot while the other tokens are far rarer. To calculate PMI we can start by calculating the independent factors - \(p(x)\) and \(p(y)\) for the table.

Code
# This makes each row sum to 1
normalized_df = df.drop(columns="title")
normalized_df = normalized_df.div(
    normalized_df.sum(axis="columns"),
    axis="rows",
)

# p_x is then the probability distribution of the tokens
p_x = normalized_df.sum(axis="rows")
p_x = p_x / p_x.sum()

# then p_y will be 1 / 1000 for this evaluation
p_y = 1 / len(df)

p_x.head()
0       0.001097
1       0.000000
10      0.012672
100     0.000239
1000    0.000102
dtype: float64

Here I implement the PMI equation. I’m actually excluding \(p(y)\) from the divisor because I am making each row sum to 1, which means that my \(p(x,y)\) actually sums to 1,000 rather than 1. The way to fix this would be to divide my calculation of \(p(x,y)\) by the number of rows - but that factor is exactly \(p(y)\), so it cancels with the \(p(y)\) in the divisor:

\[ \begin{aligned} p_{true}(x,y) &= p_{bad}(x,y)p(y) \\ \frac{p_{bad}(x,y)p(y)}{p(x)p(y)} &= \frac{p_{bad}(x,y)}{p(x)} \end{aligned} \]

So I can skip \(p(y)\) completely in my calculation. I’ve probably made a mistake with this heh.

We can find the most informative tokens for the first row (the Anarchism page) by sorting the tokens.

Code
import numpy as np

pmi = np.log(normalized_df.div(p_x, axis="columns"))
pmi = pmi.fillna(-np.inf)

pmi.iloc[0].sort_values(ascending=False).head(10)
46179    6.907755
43718    6.907755
40874    6.907755
40939    6.907755
43830    6.907755
37255    6.714953
34936    6.688564
44771    6.674962
45232    6.664664
36816    6.663923
Name: 582886, dtype: float64

So what are these tokens? Do they relate to anarchism? Let’s find out.

Code
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
tokenizer.batch_decode(
    np.array(
        pmi.iloc[0]
            .sort_values(ascending=False)
            .head(10)
            .index
    )[:, None].astype(int)
)
[' Chomsky',
 ' fascists',
 ' leftists',
 ' patriarchy',
 'java',
 ' racists',
 ' gays',
 ' anarchist',
 ' anarchists',
 ' decentral']

This might be legit? Most of these tokens describe Anarchism, if only by contrast.

Let’s have a look at a few rows.

Code
for i in range(25):
    title = df.iloc[i].title
    tokens = (
        tokenizer.batch_decode(
            np.array(
                pmi.iloc[i]
                    .sort_values(ascending=False)
                    .head(10)
                    .index
            )[:, None].astype(int)
        )
    )
    # truncated to display a bit better
    print(f"{title[:25]: <25} - {', '.join(tokens)}"[:100])
Anarchism                 -  Chomsky,  fascists,  leftists,  patriarchy, java,  racists,  gays,  ana
Autism                    -  autistic,  mindfulness, abbling, CHAT, rette,  ADHD, Thousands,  waving
Albedo                    - 434, 088, 782, herical, parent, 442,  MOD,  darkest, chart,  Leadership
A                         - Ã, �, rounded, cial, Â, ursive,  comma,  shoe, anting,  Ital
Alabama                   - DH,  Trash, funding, mingham, juries, LIST,  tremendously, 565,  Mississ
Achilles                  - adiator, redients, comed, ulnerability, 474,  heel, Tro, tis, rha,  heel
Abraham Lincoln           - itures, Honest,  Spot,  Passed, braska, Click,  scrutin,  Fill,  Lincoln
Aristotle                 -  salient, olphin,  stares, DJ, reason, none,  disbel, eor,  unab,  brood
An American in Paris      -  stroll,  edits, ilk, walking, haps,  sluggish,  listens,  contr,  dragg
Academy Award for Best Pr -  Desire, aryl, ampoo,  Guys,  Elven,  Rollins, EB,  Would,  Ced,  Stub
Academy Awards            -  Kimmel,  advertise,  whisper,  Removed,  Costume,  Makes, Meet,  Castin
Actrius                   -  refreshing,  bitch,  Ventura,  unimagin, esome, ufficient,  Liz,  tant,
Animalia (book)           -  Puzzle, igsaw, iddles,  airs,  Abrams,  ET,  butterfly, Pod,  jacket,  
International Atomic Time -  JD,  TA,  ticking,  clocks, AI,  BI, Circ, UTC,  Measures,  drifted
Altruism                  -  atroc,  rethink,  slime,  empath,  Harbaugh,  weeping, izons, eree,  pr
Ayn Rand                  - deals,  Essence, opping,  quaint,  Medicare,  Mavericks, doctor, Force, 
Alain Connes              -  inject,  Vanderbilt,  Triangle,  Lich, Crit,  acad, functional, comm,  
Allan Dwan                - Stage, handled, elcome, Around, pering,  Swanson,  Enemies, endez,  Para
Algeria                   -  wat,  dont,  Flake, Insert,  RAD, enaries, ndum, HCR, ERE, gui
List of Atlas Shrugged ch -  hires, iddy,  dispatcher, Mayor,  applicant,  Loot,  cabal,  Kinn,  mys
Anthropology              -  locale, ritional,  unfolding, nsic, focus,  exper,  Anthropology,  anth
Agricultural science      -  Regener, RIC,  WE, ertation, ussions,  Agricultural,  Immunity, Range, 
Alchemy                   -  ath, piring,  illuminate, mage,  Bonus,  Khe, berus, CG, ixir,  sacrifi
Alien                     -  Konami, Alien,  Aliens,  Predator,  maze, Tank,  Alien,  Balls,  Spears
Astronomer                -  snapshots,  outreach,  moons, ategories,  maths,  educators,  billions,

Generally this seems good. The tokens are very different across the different pages and quite a few of the tokens are justified.

This isn’t perfect though. For example:

  • The Lich token from Alain Connes comes from the start of a name in a book list.
  • Most of the interesting tokens for Allan Dwan come from a list of films.
  • The List of Atlas Shrugged characters should ultimately be many different pages.

I’m pleased that stopwords are not prevalent, which is something that PMI is good at excluding. When the data processing has completed I’ll be able to apply this to extract the tokens for every page. I’ll take the top 50 as that should give the model quite a bit of flexibility in its optimization.


P(x) for all rows

The processing of the data from the previous post has finally completed, so the true \(p(x)\) can now be calculated. Then we can run the evaluation again with this new baseline.

This calculation was going to take almost 40 hours to complete! It’s crazy how long this stuff is taking - this is dealing with the processed data after all. I guess the large matrices would’ve been better stored as raw numpy arrays.
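For what it’s worth, a minimal sketch of what the raw-array route might have looked like - the shapes, dtype and file name here are hypothetical, just to illustrate the idea:

Code
import numpy as np

# Hypothetical: store each block of 1,000 rows as a dense uint32 array instead
# of a gzipped parquet file with ~50k columns. Reading back is then a single
# (optionally memory-mapped) load rather than a columnar decode.
counts = np.random.randint(0, 5, size=(1_000, 50_265), dtype=np.uint32)  # stand-in data
np.save("token-counts-000.npy", counts)

loaded = np.load("token-counts-000.npy", mmap_mode="r")
column_sums = np.asarray(loaded).sum(axis=0)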

I can speed it up by using ray.

Code
#collapse
from pathlib import Path
from typing import *
import pandas as pd
from tqdm.auto import tqdm
import ray

def calculate_px(paths: List[Path]) -> pd.Series:
    try:
        ray.init()

        tasks = [
            _px.remote(path)
            for path in paths
        ]
        with tqdm(total=len(paths)) as progress:
            p_x, count = None, 0
            while tasks:
                ready, tasks = ray.wait(tasks, num_returns=min(10, len(tasks)), fetch_local=True)  # num_returns must not exceed the remaining tasks
                current_p_x, current_count = _sum_px(ray.get(ready))
                if p_x is None:
                    p_x = current_p_x
                    count = current_count
                else:
                    p_x += current_p_x
                    count += current_count
                progress.update(len(ready))
            return p_x / count
    finally:
        ray.shutdown()

def _sum_px(results: List[Tuple[pd.Series, int]]) -> Tuple[pd.Series, int]:
    return sum(px for px, _ in results), sum(count for _, count in results)

@ray.remote # (num_cpus=3) # limited to control memory use
def _px(path: Path) -> Tuple[pd.Series, int]:
    df = pd.read_parquet(path)

    # This makes each row sum to 1
    df = df.drop(columns="title")
    df = df.div(
        df.sum(axis="columns"),
        axis="rows",
    )

    # p_x is then the probability distribution of the tokens
    p_x = df.sum(axis="rows")
    p_x = p_x / p_x.sum()

    return p_x, len(df)
Code
#hide_output
full_p_x = calculate_px(TOKEN_COUNT_FILES)
2021-08-02 09:45:28,760 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
KeyboardInterrupt: 

This operation was going to take some 70 hours to complete, so I cancelled it. This is much too slow.

I’ve been investigating how to optimize this and it might well be faster to recompute the token counts from the original text files rather than read the precomputed count files at all. Looking at the code in a separate notebook I’ve found that it should be possible to compute p_x across the entire dataset in significantly less time than this calculation is taking.

Code
#collapse
from pathlib import Path
from typing import *

from transformers import AutoTokenizer
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def calculate_px(tokenizer: AutoTokenizer, paths: List[Path]) -> np.array:
    total_token_counts, total_rows = None, 0
    for path in tqdm(paths):
        token_counts, rows = _px(tokenizer, path)
        if total_token_counts is None:
            total_token_counts = token_counts
            total_rows = rows
        else:
            total_token_counts += token_counts
            total_rows += rows
    return total_token_counts / total_rows

def _px(tokenizer: AutoTokenizer, path: Path) -> Tuple[np.array, int]:
    vocab_size = tokenizer.vocab_size
    def _block(text: List[str]) -> np.array:
        # tokenize the block and count how often each token id occurs in each row
        tokens = tokenizer(text, return_attention_mask=False)["input_ids"]
        counts = np.concatenate([
            np.bincount(ids, minlength=vocab_size)[None, :]
            for ids in tokens
        ])
        # normalize each row to sum to 1, then add the rows together
        counts = counts / counts.sum(axis=1)[:, None]
        return counts.sum(axis=0)

    text = pd.read_parquet(path).text.tolist()
    token_counts = sum(
        _block(text[idx:idx+1_000])
        for idx in range(0, len(text), 1_000)
    )
    return token_counts, len(text)
Code
%%time
from pathlib import Path

ENWIKI_FILES = sorted(Path("/data/blog/2021-07-28-wikipedia-link-recognition/").glob("*.gz.parquet"))
token_counts = calculate_px(tokenizer=tokenizer, paths=ENWIKI_FILES)
Token indices sequence length is longer than the specified maximum sequence length for this model (6536 > 1024). Running this sequence through the model will result in indexing errors

CPU times: user 4h 56min 46s, sys: 8min 36s, total: 5h 5min 23s
Wall time: 47min 57s
Code
#hide
from pathlib import Path

DATA_FOLDER = Path("/data/blog/2021-08-01-wikipedia-page-pmi/")
DATA_FOLDER.mkdir(exist_ok=True, parents=True)

(
    pd.DataFrame(token_counts)
        .rename(columns={0: "px"})
        .to_parquet(DATA_FOLDER / "px.gz.parquet", compression="gzip")
)

It feels crazy that it’s faster to tokenize all of the text and count the tokens than to read some files - presumably the very wide gzipped parquet files are expensive to decode, while the fast tokenizer can chew through the raw text quickly. Still, this makes it fast enough to calculate the top 50 tokens tonight - way better than waiting almost two days.

Code
#collapse
from pathlib import Path
import pandas as pd

ENWIKI_FILES = sorted(Path("/data/blog/2021-07-28-wikipedia-link-recognition/").glob("*.gz.parquet"))
DATA_FOLDER = Path("/data/blog/2021-08-01-wikipedia-page-pmi/")

token_counts = pd.read_parquet(DATA_FOLDER / "px.gz.parquet").px.to_numpy()
Code
#collapse
from pathlib import Path
from typing import *

from transformers import AutoTokenizer
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def calculate_pmi(tokenizer: AutoTokenizer, paths: List[Path], p_x: np.array) -> None:
    for path in tqdm(paths):
        text_df = pd.read_parquet(path)
        token_df = _pmi(
            tokenizer=tokenizer,
            text=text_df.text.tolist(),
            p_x=p_x,
            index=text_df.index,
        )
        df = pd.merge(
            text_df[["title"]],
            token_df,
            left_index=True,
            right_index=True,
        )
        df.to_parquet(DATA_FOLDER / f"{path.stem}-pmi.gz.parquet")

def _pmi(tokenizer: AutoTokenizer, text: List[str], p_x: np.array, index: pd.Series) -> pd.DataFrame:
    vocab_size = tokenizer.vocab_size
    def _block(text: List[str]) -> np.array:
        # token counts per row, normalized so each row sums to 1
        tokens = tokenizer(text, return_attention_mask=False)["input_ids"]
        counts = np.concatenate([
            np.bincount(ids, minlength=vocab_size)[None, :]
            for ids in tokens
        ])
        counts = counts / counts.sum(axis=1)[:, None]
        # PMI with p(y) cancelled out, as derived earlier
        pmi = np.log(counts / p_x)
        # replace the NaNs from 0/0 so they sort below every real value
        pmi = np.nan_to_num(
            pmi,
            copy=False,
            nan=-1e300,
        )
        # keep the ids of the 50 highest-PMI tokens (ascending order, highest last)
        return pmi.argsort()[:, -50:].copy()

    all_pmi = np.concatenate([
        _block(text[idx:idx+1_000])
        for idx in range(0, len(text), 1_000)
    ])
    return (
        pd.Series(all_pmi.tolist(), index=index)
            .to_frame()
            .rename(columns={0: "tokens"})
    )
Code
#hide_output
calculate_pmi(tokenizer=tokenizer, paths=ENWIKI_FILES, p_x=token_counts)
Token indices sequence length is longer than the specified maximum sequence length for this model (6536 > 1024). Running this sequence through the model will result in indexing errors
<ipython-input-7-d65885d89661>:37: RuntimeWarning: invalid value encountered in true_divide
  pmi = np.log(counts / p_x)
<ipython-input-7-d65885d89661>:37: RuntimeWarning: divide by zero encountered in log
  pmi = np.log(counts / p_x)

So the processing now took about 2 hours 30 mins instead of almost 70 hours. This is a vast improvement. The resulting data is also significantly smaller. It should now finally be possible to generate the target data.


Evaluation for All Rows

Now we can repeat the evaluation over the same rows that we looked at earlier and see how the top tokens have changed.

Code
from pathlib import Path

DATA_FOLDER = Path("/data/blog/2021-08-01-wikipedia-page-pmi/")
TITLE_TOKENS = sorted(DATA_FOLDER.glob("*-pmi.gz.parquet"))

df = pd.read_parquet(TITLE_TOKENS[0])
df
title tokens
0 Anarchism [36047, 25539, 43053, 32257, 32574, 34936, 450...
1 Autism [18297, 42500, 31136, 18477, 47788, 46912, 441...
2 Albedo [24985, 30578, 11840, 27643, 28280, 34241, 146...
3 A [31128, 34739, 30712, 35993, 11173, 46098, 171...
4 Alabama [16145, 27318, 16407, 22774, 13573, 38773, 153...
... ... ...
21073 Heuristic routing [6381, 21395, 16311, 15601, 25212, 28617, 1932...
21074 Hierarchical routing [7655, 6105, 1701, 7145, 3665, 2643, 7089, 348...
21075 High-performance equipment [0, 1437, 11, 8, 35, 12, 43, 102, 14, 29, 13, ...
21076 Hop [14472, 13170, 3886, 9532, 20496, 6674, 28530,...
21077 Horn [19108, 35850, 17408, 34293, 41392, 19831, 657...

21078 rows × 2 columns

Now that we have calculated the top tokens for all of these pages we can compare them to the partial view we looked at before. How will they change?

Code
for i in range(25):
    title = df.iloc[i].title
    tokens = (
        tokenizer.batch_decode(
            df.iloc[i].tokens[:, None][::-1]
        )
    )
    # truncated to display a bit better
    print(f"{title[:25]: <25} - {', '.join(tokens)}"[:100])
Anarchism                 -  anarchism,  anarchists,  Anarch,  anarch,  anarchist,  anarchy,  racist
Autism                    -  autistic,  ASD, CHAT,  autism,  preval,  Autism,  gluten,  deficits, ab
Albedo                    - }", herical,  directional, iosity,  MOD,  darkest,  absorbs,  Unless,  {
A                         -  curs, rounded,  vowel, ishable,  subscript, Ã,  handwriting,  Alphabet,
Alabama                   -  disenfranch,  Outbreak,  Hunts,  Restrict, represented, adoes,  tremend
Achilles                  - redients,  Achilles,  Pyrrha,  Trojan,  heel, tis,  Hera, ulnerability, 
Abraham Lincoln           -  bolst,  slavery,  reassured, isively, Honest,  cursor,  chores,  Lincol
Aristotle                 -  Aristotle,  disbel,  Pyrrha,  impressions,  stares,  gestation,  waking
An American in Paris      -  Include,  arous,  stroll,  sluggish, awaited,  horns, haps,  ordinarily
Academy Award for Best Pr - EB,  Elven,  Ced, ampoo, bons, terson,  Absent,  Gib,  Wheeler,  Willis,
Academy Awards            -  Oscars,  whisper,  Removed,  stat,  viewership,  gif,  nominees,  justi
Actrius                   -  unimagin,  bitch, ufficient, esome,  deserve,  refreshing,  stray, rums
Animalia (book)           -  Frie, igsaw,  Abrams,  Animal,  Puzzle,  iPad,  aids, iddles, ROM, Anim
International Atomic Time -  clocks,  ticking, AI,  hindsight,  calibr,  calibrated,  atomic, mean, 
Altruism                  -  altru,  karma,  reciproc,  bystanders,  counterproductive,  kindness, W
Ayn Rand                  - rugged,  Rand,  libertarian,  altru,  anarchism,  libertarians, ethical,
Alain Connes              - comm, functional,  coh, Crit,  embed,  inject,  bos,  Convers,  download
Allan Dwan                - handled, lightly, Getting, elcome,  Swanson, rozen,  Forbidden,  compass
Algeria                   -  OPEC,  cous, Insert,  Algeria,  Alger,  dont,  unequ,  empires,  civili
List of Atlas Shrugged ch - rugged,  cabal,  loot,  incompetent,  apologizing,  disgusting,  pedd,  
Anthropology              -  anthropology,  Anthrop,  Anthropology,  scrim,  anthrop, capitalist,  C
Agricultural science      -  Immunity,  expenditures,  fertilizer, consumer, ultural, lihood,  Agric
Alchemy                   - chemy,  Alchemy, chemical, metic,  transm, ixir,  sacrific, chem, oteric
Alien                     - Alien,  Aliens,  Alien,  Predator,  extrater, restrial,  Ridley,  Invade
Astronomer                -  astronomers,  snapshots,  observational,  astronomy, ategories,  maths,

This looks quite a bit better. While it could be cleaned up by removing the tokens which are suffixes, or restricting the tokens to those that have more meaning, I think that these are a solid start.
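As a rough sketch of that cleanup - this relies on the fact that BART’s BPE marks word-initial tokens with a leading “Ġ” (they decode with a leading space), and the helper name and threshold are just illustrative:

Code
def keep_word_initial(token_ids, tokenizer, top_k=10):
    """Keep only tokens that start a word, dropping subword suffixes."""
    kept = []
    for token_id in token_ids:
        token = tokenizer.convert_ids_to_tokens(int(token_id))
        if token.startswith("Ġ"):  # BPE marker for a leading space
            kept.append(tokenizer.decode([int(token_id)]).strip())
        if len(kept) >= top_k:
            break
    return kept

# e.g. keep_word_initial(df.iloc[0].tokens[::-1], tokenizer) for the Anarchism row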

The next post will be about generating the final dataset - the tokenized inputs and target values per token.

For comparison, this is the previous run:

Anarchism                 -  Chomsky,  fascists,  leftists,  patriarchy, java,  racists,  gays,  ana
Autism                    -  autistic,  mindfulness, abbling, CHAT, rette,  ADHD, Thousands,  waving
Albedo                    - 434, 088, 782, herical, parent, 442,  MOD,  darkest, chart,  Leadership
A                         - Ã, �, rounded, cial, Â, ursive,  comma,  shoe, anting,  Ital
Alabama                   - DH,  Trash, funding, mingham, juries, LIST,  tremendously, 565,  Mississ
Achilles                  - adiator, redients, comed, ulnerability, 474,  heel, Tro, tis, rha,  heel
Abraham Lincoln           - itures, Honest,  Spot,  Passed, braska, Click,  scrutin,  Fill,  Lincoln
Aristotle                 -  salient, olphin,  stares, DJ, reason, none,  disbel, eor,  unab,  brood
An American in Paris      -  stroll,  edits, ilk, walking, haps,  sluggish,  listens,  contr,  dragg
Academy Award for Best Pr -  Desire, aryl, ampoo,  Guys,  Elven,  Rollins, EB,  Would,  Ced,  Stub
Academy Awards            -  Kimmel,  advertise,  whisper,  Removed,  Costume,  Makes, Meet,  Castin
Actrius                   -  refreshing,  bitch,  Ventura,  unimagin, esome, ufficient,  Liz,  tant,
Animalia (book)           -  Puzzle, igsaw, iddles,  airs,  Abrams,  ET,  butterfly, Pod,  jacket,  
International Atomic Time -  JD,  TA,  ticking,  clocks, AI,  BI, Circ, UTC,  Measures,  drifted
Altruism                  -  atroc,  rethink,  slime,  empath,  Harbaugh,  weeping, izons, eree,  pr
Ayn Rand                  - deals,  Essence, opping,  quaint,  Medicare,  Mavericks, doctor, Force, 
Alain Connes              -  inject,  Vanderbilt,  Triangle,  Lich, Crit,  acad, functional, comm,  
Allan Dwan                - Stage, handled, elcome, Around, pering,  Swanson,  Enemies, endez,  Para
Algeria                   -  wat,  dont,  Flake, Insert,  RAD, enaries, ndum, HCR, ERE, gui
List of Atlas Shrugged ch -  hires, iddy,  dispatcher, Mayor,  applicant,  Loot,  cabal,  Kinn,  mys
Anthropology              -  locale, ritional,  unfolding, nsic, focus,  exper,  Anthropology,  anth
Agricultural science      -  Regener, RIC,  WE, ertation, ussions,  Agricultural,  Immunity, Range, 
Alchemy                   -  ath, piring,  illuminate, mage,  Bonus,  Khe, berus, CG, ixir,  sacrifi
Alien                     -  Konami, Alien,  Aliens,  Predator,  maze, Tank,  Alien,  Balls,  Spears
Astronomer                -  snapshots,  outreach,  moons, ategories,  maths,  educators,  billions,

When I look at this list I have at least a passing familiarity with Anarchism, Autism, Achilles, Aristotle, International Atomic Time, Altruism… and when I compare the new top tokens to the old ones I tend to think that the new ones better capture properties of the subject matter. In many cases they are the term itself; that is to be expected, though, as a page will tend to refer to itself more than other pages do.

All in all I think this is a strong start.