Wikipedia Link Recognition - Paper

Arranging the recent work on Wikipedia Link recognition into a paper
Published

September 4, 2021

I still feel bad that I didn’t write a paper about prompt training after coming up with the idea and implementation well within time. To help address this I am going to restructure this blog so that it is arranged around papers, with the different sections fleshed out further in individual posts. I’m not sure how well this will work; one way to find out!


Overview

To achieve this, I need to run through the steps from the manual.

Market Need

The first step is to identify the Market Need: who wants this information? Normally this blog is aimed at a past version of me, a reasonably competent programmer who is learning data science. A paper is instead aimed at a scientifically informed audience, which means the explanations can be pitched at a higher level than before.

Concept

The core concept here is to work out what the subject of a statement is. There are statements like “I like my apple” where the subject, apple, is ambiguous. Is it the fruit or the phone?

The manual has a concept sheet as part of this; however, I am now going to move to the new blog structure. What I will do is create the broad sections of the paper - introduction, method, results - and then start linking in the blog posts. Each linked blog post will then have a backlink to this post to show the bigger picture.


Introduction

Need to outline the following:

  • What is the subject of a statement
  • What is an ambiguous subject
  • How humans are good at resolving the ambiguity
  • How computers are bad at it

This can then be compared to the Facebook GENRE approach and other attempts to resolve the ambiguity.

The Wikipedia approach relies on the fact that each page describes a single thing. Links to a page can then be used to disambiguate the linked text.


Method

As described in the introduction, the Wikipedia approach relies on the fact that each page describes a single thing, so links to a page can be used to disambiguate the linked text.

How do we describe the linked text though? What can we use?

The GENRE approach relies upon a sequence-to-sequence model generating the title of the page after detecting that a term is linked, as is shown in this gif:

[gif: GENRE link generation]

Using a sequence-to-sequence model in this way is inefficient. At every step the input must be fed through the model again, so for the short sequence “The hero saved Metropolis .” the model must be run at least 5 times, and when the Metropolis term is linked that rises to 10 times (the additional tokens being the three delimiters and the two title tokens). Since GENRE uses beam search this can get even worse, as a number of beams must be maintained while traversing the sequence.
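To make that count concrete, here is a rough sketch of the step counting, assuming one decoder pass per generated token and a hypothetical one-token-per-word view of the GENRE markup:

```python
# Rough illustration of the decoding cost; the markup and tokenisation are
# hypothetical (one token per word, delimiter or title word).
plain = ["The", "hero", "saved", "Metropolis", "."]
annotated = ["The", "hero", "saved",
             "[", "Metropolis", "](",   # mention delimiters
             "Metropolis", "(comics)",  # generated page title
             ")",                       # closing delimiter
             "."]
print(len(plain), len(annotated))  # 5 decoder passes vs. 10 with the link annotation
```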

Most of the time the entire model is being run just to make a binary choice: start a link or not? finish a link or not? The only significant choice comes when the Wikipedia page itself is being generated, and that relies on a consistent naming convention for the titles.

More broadly, the GENRE approach requires that the Wikipedia title of a page be a transparent means of representing the entity. The authors claim that new entities can be handled by just adding their unambiguous title to the trie that is used for title generation.
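As a sketch of that trie idea (the structure, titles and token ids here are illustrative, not GENRE’s actual implementation): titles are stored as token-id sequences, at each decoding step only the children of the current prefix are allowed, and adding a new entity is just another insert.

```python
# A minimal prefix trie over title token sequences; the token ids are made up.
from collections import defaultdict

class TitleTrie:
    def __init__(self):
        self.children = defaultdict(TitleTrie)
        self.is_title = False

    def add(self, token_ids):
        node = self
        for token in token_ids:
            node = node.children[token]
        node.is_title = True

    def allowed_next(self, prefix):
        node = self
        for token in prefix:
            if token not in node.children:
                return set()
            node = node.children[token]
        return set(node.children)

trie = TitleTrie()
trie.add([1001, 1002])  # e.g. "Metropolis (comics)"
trie.add([1001, 1003])  # e.g. "Metropolis (1927 film)"
trie.add([1004])        # a new entity is just another insert

print(trie.allowed_next([1001]))  # {1002, 1003}
```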

Is there a better way?

Transformer Output

The transformer model produces output for every token. If we can train this output to describe the input then we can process the sequence in a single invocation of the model.
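A minimal sketch of that idea, assuming a HuggingFace encoder; the model name and the 50k-way description head are illustrative rather than the exact setup used in these posts:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
description_head = torch.nn.Linear(encoder.config.hidden_size, 50_000)

tokens = tokenizer("The hero saved Metropolis.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**tokens).last_hidden_state  # (1, sequence_length, hidden_size)
    description = description_head(hidden)        # (1, sequence_length, 50_000)
# A single forward pass produces a description vector for every token.
```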

Page Descriptions

To get this working there has to be a way to describe each page in Wikipedia, and the model then has to produce this description as the output for the linked tokens.
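One plausible way to do this is to describe a page by the tokens that are unusually common on it, which is what PMI measures: pmi(token, page) = log(p(token | page) / p(token)). A toy sketch with made-up counts:

```python
import math
from collections import Counter

# Token counts for a single page and for the whole corpus (toy values).
page_tokens = Counter({"metropolis": 40, "superman": 25, "the": 300})
corpus_tokens = Counter({"metropolis": 200, "superman": 120, "the": 500_000})
page_total = sum(page_tokens.values())
corpus_total = sum(corpus_tokens.values())

pmi = {
    token: math.log((count / page_total) / (corpus_tokens[token] / corpus_total))
    for token, count in page_tokens.items()
}
# Common tokens like "the" score low; distinctive tokens score high.
print(sorted(pmi, key=pmi.get, reverse=True))  # ['superman', 'metropolis', 'the']
```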

Data Processing

Quite a lot of work went into preprocessing the data. The stages are covered by the posts listed below.

The labels are rather complex, as the page PMI needs to be expanded to a full 50k-element vector. Limited memory prevents the full vivification of the label dataset, so I need a custom dataloader that creates the target matrix at the point that the data is being passed to the model.
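A minimal sketch of that on-the-fly expansion, assuming the labels are stored as sparse PMI token indices per example (the field names and shapes are illustrative, not the actual code from the posts):

```python
import torch
from torch.utils.data import DataLoader, Dataset

VOCAB_SIZE = 50_000

class SparsePMIDataset(Dataset):
    """Expands sparse PMI indices into a dense target only when an item is fetched."""

    def __init__(self, rows):
        # each row: {"input_ids": [...], "pmi_indices": [...]}
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]
        target = torch.zeros(VOCAB_SIZE)
        target[row["pmi_indices"]] = 1.0  # dense vector created per item, not up front
        return torch.tensor(row["input_ids"]), target

rows = [{"input_ids": [101, 2023, 102], "pmi_indices": [12, 345, 6789]}]
loader = DataLoader(SparsePMIDataset(rows), batch_size=1)
```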

I still feel that I haven’t completely processed the data. The current approach of binary cross entropy over the top 50 PMI tokens feels like it needs more work.
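For reference, the current loss looks roughly like this: the top-PMI tokens of the linked page form a multi-hot target and binary cross entropy scores the model output against it (the shapes and token ids are illustrative):

```python
import torch

logits = torch.randn(1, 50_000)   # model output for one linked token
target = torch.zeros(1, 50_000)
target[0, [12, 345, 6789]] = 1.0  # top-PMI token ids for the linked page
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
```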

Data Processing Posts

Wikipedia Link Recognition This is the very start of the data processing, in which the Wikipedia article data is lightly inspected. The article text and the link boundaries are written to parquet files.

Wikipedia Data Generation The association of the tokens from the tokenized text with the link boundaries.

Wikipedia Page Mutual Information First pass at PMI calculation by counting the different tokens produced by the tokenization of the article.

Wikipedia Preprocessing Performance At this point the performance of the PMI calculation was a problem; this post goes over the performance improvements.

Wikipedia IOB Link Boundaries Moving from start/within boundaries to inside/outside/beginning (IOB).


Results

The results came after the data processing. The metrics were produced quite late, so the first few runs only have manual evaluations.

Wikipedia Link Model First go at training the model. This one used within/start boundary labels. The model was evaluated by reviewing some of the tokens that were produced for the recognized entities.

Wikipedia Preprocessing Performance This involved improving the performance of the data preprocessing and the model. There was an evaluation of the trained model in this post which reviews the produced tokens as before. This time an attempt was made to map the produced tokens to Wikipedia pages by ordering the pages by the sum of the scores of their selected tokens, and the predictions across the tokens of an entity were also combined. This showed encouraging performance, as the correct result was often in the top 10 and once was the top result (across the whole of Wikipedia).
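A rough sketch of that page-ranking evaluation, assuming each page is represented by its top-PMI token ids (the table below is made up):

```python
import torch

token_scores = torch.rand(50_000)  # combined model output for one entity span
page_pmi_tokens = {
    "Metropolis (comics)": [12, 345, 6789],
    "Metropolis (1927 film)": [12, 400, 9999],
}
# Score each page by summing the model's scores for that page's PMI tokens.
page_scores = {
    page: token_scores[ids].sum().item()
    for page, ids in page_pmi_tokens.items()
}
top_10 = sorted(page_scores, key=page_scores.get, reverse=True)[:10]
```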

Wikipedia IOB Link Boundaries The use of within/start as two binary classifiers means an illegal prediction is possible (is the start of an entity, yet is not within an entity). The standard classification scheme for this is IOB, which is a three-class classifier and so cannot predict the illegal state. This post tries training a model with the IOB boundary classifier alongside the regular link classifier. The manual evaluation shows weaker performance than the previous model that used within/start; I did not investigate the weaker performance further.
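A small sketch of the difference between the two label schemes (the labels for “The hero saved Metropolis .” are illustrative):

```python
# The two binary classifiers can emit the contradictory pair (start=1, within=0);
# a single three-class IOB classifier has no label for that state.
def to_iob(start, within):
    if start and not within:
        raise ValueError("illegal state: start of an entity but not within one")
    return "B" if start else ("I" if within else "O")

start = [0, 0, 0, 1, 0]
within = [0, 0, 0, 1, 0]
print([to_iob(s, w) for s, w in zip(start, within)])  # ['O', 'O', 'O', 'B', 'O']
```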

Wikipedia Boundary Metrics Part of the problem with training the model is that the accuracy of the different parts has not been established. This post starts by measuring the accuracy of the within/start and IOB boundary predictions. IOB gets a marginally better score, although that may be down to random chance.

Wikipedia Link Metrics This measures the performance of the link prediction using Jaccard similarity and perplexity. Both metrics show very poor performance, much worse than I was expecting given that the manual evaluations had seemed quite good. This suggests to me that the training target is quite hard and that a more appropriate loss function should be used.
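A sketch of the Jaccard measurement, comparing the predicted top-k tokens for a link against the top-PMI tokens of the target page (k and the token ids are illustrative):

```python
import torch

def jaccard(predicted, expected):
    predicted, expected = set(predicted), set(expected)
    return len(predicted & expected) / len(predicted | expected)

logits = torch.randn(50_000)      # model output for one linked token
predicted_top_k = logits.topk(50).indices.tolist()
expected_top_k = [12, 345, 6789]  # top-PMI token ids for the target page
print(jaccard(predicted_top_k, expected_top_k))
```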

Wikipedia Perplexity Train In an attempt to adjust the training to perform better on the metrics, this moves to a perplexity (i.e. cross entropy) training scheme. I think the implementation is fundamentally bugged, as cross entropy takes a single target to predict, which is incompatible with the current multi-token target. The performance of the trained model is terrible and it trains very slowly anyway. Until I find a mathematically sound approach, I feel that a mean squared error loss between the softmaxed output and the PMI scores could be better than this.

Wikipedia PMI Values MSE Train To provide more information to the loss function, the PMI values are retained and used as the targets to train against. To keep the model output and the PMI values in the same range, both are softmaxed before comparison. Mean squared error is then used to move the model towards these continuous targets (as cross entropy / binary cross entropy are no longer suitable). While it is possible to get good results, the mean squared error loss value drops off quickly and this seems to hurt training performance. A different loss may provide more for the model to train with.
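A minimal sketch of that loss, assuming the PMI values for the page have been expanded into a dense vector (zero everywhere else); the values and token ids are illustrative:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 50_000)  # model output for one linked token
pmi_values = torch.zeros(1, 50_000)
pmi_values[0, [12, 345, 6789]] = torch.tensor([4.2, 3.1, 2.7])  # made-up PMI scores

# Softmax both so they lie in the same range, then compare with MSE.
loss = F.mse_loss(F.softmax(logits, dim=-1), F.softmax(pmi_values, dim=-1))
```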