Trying to extract meaningful tokens from a page using mutual information
Published
August 1, 2021
I’ve started collecting the token counts per page, and I’m going to use the distinctive tokens to identify a page when it is linked to. To find the tokens that are distinctive for each page I want to use Pointwise Mutual Information (PMI).
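For a token \(x\) and a page \(y\), PMI compares the probability of seeing them together with the probability expected if they were independent:

\[
\operatorname{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}
\]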
This is made up of three parts: the independent probabilities (\(p(x)\) and \(p(y)\)) and the joint probability (\(p(x,y)\)). The intuition is that if the joint probability equals the product of the independent probabilities (i.e. the two variables are independent) then the PMI will be \(\log(1) = 0\). Deviations from this lead either to a positive PMI, if the joint probability is higher than expected, or a negative PMI, if it is lower than expected.
I’m looking for tokens that the model should predict, so that means looking for the high PMI tokens.
The first thing to consider is the practicality of this. I can’t hold the entire dataset in memory at once - there are over 50 thousand different tokens in the vocabulary, and there are over 6 million rows. So I need to be able to compute this incrementally.
To calculate the PMI for a single token I need to know the independent probability for each individual token and for each individual row. So given the matrix:
\[
\begin{array}{ccc}
X_{1,1} & \cdots & X_{1,\text{vocab}} \\
\vdots & \ddots & \vdots \\
X_{\text{rows},1} & \cdots & X_{\text{rows},\text{vocab}}
\end{array}
\]
We want two sums, one for the columns and one for the rows.
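In terms of the matrix above these are the column totals, which feed into \(p(x)\), and the row totals, which are used to normalize each row:

\[
\sum_{i=1}^{\text{rows}} X_{i,j} \quad \text{for each token } j
\qquad\qquad
\sum_{j=1}^{\text{vocab}} X_{i,j} \quad \text{for each page } i
\]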
It is possible to hold a number of rows in memory, and that is why I have written out the rows in blocks of one thousand. So to calculate \(p(x)\) I can go through all of the files one by one and keep a running total. Calculating \(p(y)\) is easy as it is a constant - the rows will be normalized so that they sum to one and thus the probability of any one row being selected is \(\frac{1}{rows}\).
Evaluation
To check if this is going to work at all I am going to evaluate this on a single file and treat that as if it were the entire dataset. This will allow me to easily implement the computations and extract the significant tokens without having to wait for the data preparation from the previous post to complete (it still has around 12 hours to go at time of writing).
Code
from pathlib import Path

import pandas as pd

TOKEN_COUNT_FOLDER = Path("/data/blog/2021-07-30-wikipedia-data-generation/")
TOKEN_COUNT_FILES = sorted(TOKEN_COUNT_FOLDER.glob("*-token-counts.gz.parquet"))

df = pd.read_parquet(TOKEN_COUNT_FILES[0])
df.head()
index   title        0  1   10  100  1000  10000  10001  10002  10003  ...  9990  9991  9992  9993  9994  9995  9996  9997  9998  9999
582886  Anarchism    1  0  112    1     0      0      0      0      0  ...     0     0     0     0     0     0     0     0     1     0
766997  Autism       1  0  107    2     0      0      0      0      0  ...     0     0     0     0     0     0     0     0     0     0
480347  Albedo       1  0   22    0     0      0      0      0      0  ...     0     0     0     0     0     0     0     0     0     0
343406  A            1  0   42    0     0      0      0      0      0  ...     0     0     0     1     0     0     0     0     0     0
471190  Alabama      1  0   66    3     0      0      0      0      0  ...     0     0     0     0     0     0     0     0     0     0

5 rows × 50266 columns
This shows the token counts for each page, heavily truncated. As you can see token 10 occurs quite a lot while the other tokens are much rarer. To calculate PMI we can start by calculating the independent factors, \(p(x)\) and \(p(y)\), for the table.
Code
# This makes each row sum to 1
normalized_df = df.drop(columns="title")
normalized_df = normalized_df.div(
    normalized_df.sum(axis="columns"),
    axis="rows",
)

# p_x is then the probability distribution of the tokens
p_x = normalized_df.sum(axis="rows")
p_x = p_x / p_x.sum()

# then p_y will be 1 / 1000 for this evaluation
p_y = 1 / len(df)

p_x.head()
Here I implement the PMI equation. I’m actually excluding \(p(y)\) from the divisor because I am making each row sum to 1, which means that my \(p(x,y)\) actually sums to 1,000 across the table rather than to 1. The proper fix would be to divide my \(p(x,y)\) by the number of rows and keep \(p(y) = \frac{1}{\text{rows}}\) in the divisor, but those two factors cancel out:
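\[
\operatorname{PMI}(x, y)
= \log \frac{p(x, y)}{p(x)\,p(y)}
= \log \frac{\tilde{X}_{y,x} / \text{rows}}{p(x) \cdot \frac{1}{\text{rows}}}
= \log \frac{\tilde{X}_{y,x}}{p(x)}
\]

where \(\tilde{X}\) is the row-normalized count matrix. The cell that does this is not shown here, but the simplified calculation boils down to something like this sketch:

import numpy as np

# each row of normalized_df sums to 1, so dividing by p_x alone gives the PMI ratio;
# tokens that never occur on a page produce log(0) = -inf and sink to the bottom of any ranking
pmi = np.log(normalized_df / p_x)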
Looking at the top PMI tokens for the first page, this might be legit? Most of these tokens describe Anarchism, if only by contrast.
Let’s have a look at a few rows.
Code
import numpy as np

# tokenizer is assumed to be the Hugging Face tokenizer used to produce the
# token counts, loaded in a cell that is not shown here
for i in range(25):
    title = df.iloc[i].title
    tokens = tokenizer.batch_decode(
        np.array(
            pmi.iloc[i]
            .sort_values(ascending=False)
            .head(10)
            .index
        )[:, None].astype(int)
    )
    # truncated to display a bit better
    print(f"{title[:25]: <25} - {', '.join(tokens)}"[:100])
Generally this seems good. The tokens are very different across the different pages and quite a few of the tokens are justified.
This isn’t perfect though. For example:
The Lich token from Alain Connes comes from the start of a name in a book list.
Most of the interesting tokens for Allan Dwan come from a list of films.
The List of Atlas Shrugged characters should ultimately be many different pages.
I’m pleased that stopwords are not prevalent, which is something that PMI is good at excluding. When the data processing has completed I’ll be able to apply this to extract the tokens for every page. I’ll take the top 50 as that should give the model quite a bit of flexibility in its optimization.
P(x) for All Rows
The processing of the data from the previous post has finally completed so the true \(p(x)\) can now be calculated. Then we can run the evaluation again with this new baseline.
This calculation was going to take almost 40 hours to complete! It’s crazy how long this stuff is taking - this is dealing with the processed data after all. I guess the large matrices would’ve been better stored as raw numpy arrays.
I can speed it up by using ray.
Code
#collapse
from pathlib import Path
from typing import List, Tuple

import pandas as pd
import ray
from tqdm.auto import tqdm


def calculate_px(paths: List[Path]) -> pd.Series:
    try:
        ray.init()
        tasks = [
            _px.remote(path)
            for path in paths
        ]
        with tqdm(total=len(paths)) as progress:
            p_x, count = None, 0
            while tasks:
                ready, tasks = ray.wait(
                    tasks,
                    # num_returns cannot exceed the number of outstanding tasks
                    num_returns=min(10, len(tasks)),
                    fetch_local=True,
                )
                current_p_x, current_count = _sum_px(ray.get(ready))
                if p_x is None:
                    p_x = current_p_x
                    count = current_count
                else:
                    p_x += current_p_x
                    count += current_count
                progress.update(len(ready))
        return p_x / count
    finally:
        ray.shutdown()


def _sum_px(results: List[Tuple[pd.Series, int]]) -> Tuple[pd.Series, int]:
    return (
        sum(px for px, _ in results),
        sum(count for _, count in results),
    )


@ray.remote  # (num_cpus=3) # limited to control memory use
def _px(path: Path) -> Tuple[pd.Series, int]:
    df = pd.read_parquet(path)
    # This makes each row sum to 1
    df = df.drop(columns="title")
    df = df.div(
        df.sum(axis="columns"),
        axis="rows",
    )
    # p_x is then the probability distribution of the tokens
    p_x = df.sum(axis="rows")
    p_x = p_x / p_x.sum()
    return p_x, len(df)
2021-08-02 09:45:28,760 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
KeyboardInterrupt:
This operation was going to take some 70 hours to complete, so I cancelled it. This is much too slow.
I’ve been investigating how to optimize this and it may well be faster to recompute the token counts from the original text than to read these wide parquet files back in. Looking at the code in a separate notebook I’ve found that it should be possible to compute the p_x across the entire dataset in significantly less time than this calculation is taking.
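The cell that does this is not shown here, but the idea is to stream over the original pages, tokenize the text again and accumulate the normalized counts directly. A rough sketch, assuming the previous post wrote out parquet files of pages with a text column (the file layout, column name and accumulate_px helper are illustrative, not the actual code):

from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

def accumulate_px(paths: List[Path]) -> pd.Series:
    # tokenizer: the Hugging Face tokenizer used throughout this post (loaded elsewhere)
    totals = np.zeros(len(tokenizer))
    rows = 0
    for path in paths:
        # assumed layout: one parquet file of pages with a "text" column
        for text in pd.read_parquet(path).text:
            token_ids = tokenizer(text, add_special_tokens=False).input_ids
            if not token_ids:
                continue
            counts = np.bincount(token_ids, minlength=len(tokenizer))
            # each page contributes a distribution that sums to one
            totals += counts / counts.sum()
            rows += 1
    return pd.Series(totals) / rows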
Token indices sequence length is longer than the specified maximum sequence length for this model (6536 > 1024). Running this sequence through the model will result in indexing errors
CPU times: user 4h 56min 46s, sys: 8min 36s, total: 5h 5min 23s
Wall time: 47min 57s
It feels crazy that it’s faster to tokenize all of the text and count the tokens than to read some files. Still, this makes it fast enough to calculate the top 50 tonight. Way better than waiting almost two days.
Token indices sequence length is longer than the specified maximum sequence length for this model (6536 > 1024). Running this sequence through the model will result in indexing errors
<ipython-input-7-d65885d89661>:37: RuntimeWarning: invalid value encountered in true_divide
pmi = np.log(counts / p_x)
<ipython-input-7-d65885d89661>:37: RuntimeWarning: divide by zero encountered in log
pmi = np.log(counts / p_x)
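The warnings come from tokens with a zero count: where a token just doesn’t occur on a page the ratio is \(0\) and the log is \(-\infty\), and where a token never occurs anywhere at all the ratio is \(0/0\) and comes out as NaN. Neither ever reaches the top of the ranking, so they are harmless. The full cell behind that warning line isn’t shown, but the per-page step amounts to something like this sketch (the top_tokens name and the explicit NaN handling are just for illustration):

import numpy as np

def top_tokens(counts: np.ndarray, p_x: np.ndarray, k: int = 50) -> np.ndarray:
    # counts: one page's row-normalized token counts indexed by token id (e.g. from np.bincount)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / p_x)
    # tokens missing from the page come out as -inf or NaN; push them all to the bottom
    pmi = np.nan_to_num(pmi, nan=-np.inf)
    return np.argsort(pmi)[::-1][:k]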
So the processing now took about 2 hours 30 mins instead of almost 70 hours. This is a vast improvement. The resulting data is also significantly smaller. It should now finally be possible to generate the target data.
Evaluation for All Rows
Now we can repeat the evaluation over the same rows that we looked at earlier and see how the top tokens have changed.
Code
from pathlib import Path

DATA_FOLDER = Path("/data/blog/2021-08-01-wikipedia-page-pmi/")
TITLE_TOKENS = sorted(DATA_FOLDER.glob("*-pmi.gz.parquet"))

df = pd.read_parquet(TITLE_TOKENS[0])
df
       title                       tokens
0      Anarchism                   [36047, 25539, 43053, 32257, 32574, 34936, 450...
1      Autism                      [18297, 42500, 31136, 18477, 47788, 46912, 441...
2      Albedo                      [24985, 30578, 11840, 27643, 28280, 34241, 146...
3      A                           [31128, 34739, 30712, 35993, 11173, 46098, 171...
4      Alabama                     [16145, 27318, 16407, 22774, 13573, 38773, 153...
...    ...                         ...
21073  Heuristic routing           [6381, 21395, 16311, 15601, 25212, 28617, 1932...
21074  Hierarchical routing        [7655, 6105, 1701, 7145, 3665, 2643, 7089, 348...
21075  High-performance equipment  [0, 1437, 11, 8, 35, 12, 43, 102, 14, 29, 13, ...
21076  Hop                         [14472, 13170, 3886, 9532, 20496, 6674, 28530,...
21077  Horn                        [19108, 35850, 17408, 34293, 41392, 19831, 657...

21078 rows × 2 columns
Now that we have calculated the top tokens for all of these pages we can compare them to the partial view we looked at before. How will they change?
Code
for i in range(25):
    title = df.iloc[i].title
    tokens = tokenizer.batch_decode(
        df.iloc[i].tokens[:, None][::-1]
    )
    # truncated to display a bit better
    print(f"{title[:25]: <25} - {', '.join(tokens)}"[:100])
This looks quite a bit better. While it could be cleaned up by removing the tokens which are suffixes, or restricting the tokens to those that have more meaning, I think that these are a solid start.
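A sketch of that suffix clean-up, assuming a RoBERTa-style byte-level BPE tokenizer where word-initial tokens carry a leading "Ġ" marker (the word_start_tokens helper is illustrative and I haven’t applied it here):

from typing import List

def word_start_tokens(token_ids: List[int]) -> List[int]:
    # keep only tokens that begin a new word, dropping the word-suffix pieces
    tokens = tokenizer.convert_ids_to_tokens(list(token_ids))
    return [
        token_id
        for token_id, token in zip(token_ids, tokens)
        if token.startswith("Ġ")
    ]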
When I look at this I have at least a passing familiarity with Anarchism, Autism, Achilles, Aristotle, International Atomic Time, Altruism… and when I compare the new top tokens to the old I tend to think that the new ones better capture properties of the subject matter. In many cases they are the term itself, though that is to be expected as a page will tend to refer to itself more than other pages do.
The next post will be about generating the final dataset - the tokenized inputs and target values per token.