Part of Speech Tagging

Checking out spaCy part of speech tagging
Published

January 8, 2021

I need to evaluate the spaCy part-of-speech (POS) tagger for German, Spanish, French and Italian. To test it I want to be able to recover the Wikipedia links. Wikipedia marks links to other pages with double square brackets, which look like [[this]] or [[this link|this text]].

It should be quite easy to strip the special syntax from the page, and then I can test whether I can recover the links using specific tags.

To get started with this I need to download some Wikipedia data. I’m going to get the English Wikipedia dump and then the language-specific versions. That’s currently destroying my internet.

While I wait I can try out some of the POS taggers on the text of a single page. I should also get familiar with spaCy, so I’ll start with the tutorial.

Code
import spacy
! python -m spacy download en_core_web_sm
Code
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])
[('This', 'DET'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]

That’s pretty easy. Let’s try it with some French.

Code
! python -m spacy download fr_core_news_sm
Code
nlp = spacy.load("fr_core_news_sm")
doc = nlp("C'est une phrase.")
print([(w.text, w.pos_) for w in doc])
[("C'", 'PRON'), ('est', 'AUX'), ('une', 'DET'), ('phrase', 'NOUN'), ('.', 'PUNCT')]

Now what I want to do is take some Wikipedia text and strip out the links. I like Moby Dick, so let’s start there…

Code
call_me_ishmael = """
'''''Moby-Dick; or, The Whale''''' is an 1851 novel by American writer [[Herman Melville]].
The book is the sailor [[Ishmael (Moby-Dick)|Ishmael]]'s narrative of the obsessive quest of
[[Captain Ahab|Ahab]], captain of the [[whaler|whaling ship]] ''[[Pequod (Moby-Dick)|Pequod]]'',
for revenge on [[Moby Dick (whale)|Moby Dick]], the giant white [[sperm whale]] that on the
ship's previous voyage bit off Ahab's leg at the knee. A contribution to the literature of
the [[American Renaissance (literature)|American Renaissance]], ''Moby-Dick'' was published
to mixed reviews, was a commercial failure, and was out of print at the time of the author's
death in 1891. Its reputation as a "[[Great American Novel]]" was established only in the 20th
century, after the centennial of its author's birth. [[William Faulkner]] said he wished he
had written the book himself,<ref>Faulkner (1927)</ref> and [[D. H. Lawrence]] called it "one
of the strangest and most wonderful books in the world" and "the greatest book of the sea ever
written".<ref>Lawrence (1923), 168</ref> Its [[opening sentence]], "Call me Ishmael", is among
world literature's most famous.<ref> Buell (2014), 362 note.</ref>
"""

The changes that I need to make to this seem pretty straightforward:

Code
import regex as re

# In wiki markup ''''' marks bold italic text and '' marks italic;
# stripping the five-quote form first stops the two-quote replacement
# from leaving stray quotes behind.
def remove_bold(text: str) -> str:
    return text.replace("'''''", "")

def remove_italic(text: str) -> str:
    return text.replace("''", "")

# Non-greedy so each <ref>...</ref> is removed separately.
REFERENCE_PATTERN = re.compile(r"<ref>.*?</ref>")
def remove_references(text: str) -> str:
    return REFERENCE_PATTERN.sub("", text)

# [[target]] keeps target, [[target|display]] keeps only display.
LINK_PATTERN = re.compile(r"\[\[(?:.*?\|)?([^|\]]+)]]")
def expand_links(text: str) -> str:
    return LINK_PATTERN.sub(r"\1", text)

WHITESPACE_PATTERN = re.compile(r"\s+")
def normalize_whitespace(text: str) -> str:
    return WHITESPACE_PATTERN.sub(" ", text)
Code
print(remove_bold("'''''Moby-Dick; or, The Whale'''''"))
Moby-Dick; or, The Whale
Code
print(remove_italic("''Moby-Dick''"))
Moby-Dick
Code
print(remove_references("himself,<ref>Faulkner (1927)</ref> and"))
himself, and
Code
print(expand_links("American writer [[Herman Melville]]"))
print(expand_links("captain of the [[whaler|whaling ship]]"))
American writer Herman Melville
captain of the whaling ship
Code
def clean_text(text: str) -> str:
    text = remove_bold(text)
    text = remove_italic(text)
    text = remove_references(text)
    text = expand_links(text)
    text = normalize_whitespace(text)
    return text.strip()
Code
print(clean_text(call_me_ishmael))
Moby-Dick; or, The Whale is an 1851 novel by American writer Herman Melville. The book is the sailor Ishmael's narrative of the obsessive quest of Ahab, captain of the whaling ship Pequod, for revenge on Moby Dick, the giant white sperm whale that on the ship's previous voyage bit off Ahab's leg at the knee. A contribution to the literature of the American Renaissance, Moby-Dick was published to mixed reviews, was a commercial failure, and was out of print at the time of the author's death in 1891. Its reputation as a "Great American Novel" was established only in the 20th century, after the centennial of its author's birth. William Faulkner said he wished he had written the book himself, and D. H. Lawrence called it "one of the strangest and most wonderful books in the world" and "the greatest book of the sea ever written". Its opening sentence, "Call me Ishmael", is among world literature's most famous.

Now I can try out the POS tagging to see what corresponds to the links.

Code
from typing import List, Tuple

WordAndTag = Tuple[str, str]

def tag_english(text: str) -> List[WordAndTag]:
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [(w.text, w.pos_) for w in doc]
Code
tag_english(clean_text(call_me_ishmael))[:10]
[('Moby', 'PROPN'),
 ('-', 'PUNCT'),
 ('Dick', 'PROPN'),
 (';', 'PUNCT'),
 ('or', 'CCONJ'),
 (',', 'PUNCT'),
 ('The', 'DET'),
 ('Whale', 'PROPN'),
 ('is', 'AUX'),
 ('an', 'DET')]

I was half hoping that the POS tags would be some simpler set. I remember working with these before, and there are quite a few distinct tags.
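For reference, the coarse `pos_` values come from the Universal Dependencies UPOS tagset, which is a fixed set of 17 tags (this list comes from the UD documentation rather than from the notebook itself):

```python
# The coarse-grained token.pos_ values are the Universal Dependencies
# UPOS tagset -- a fixed inventory of 17 tags.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
print(len(UPOS_TAGS))  # 17
```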

Since this has to work with multi-word phrases (e.g. Herman Melville) I will need to do some sort of grouping.

Let’s start by finding the POS tags for the linked terms.

Code
def find_links(text: str) -> List[str]:
    return LINK_PATTERN.findall(text)
Code
find_links(call_me_ishmael)
['Herman Melville',
 'Ishmael',
 'Ahab',
 'whaling ship',
 'Pequod',
 'Moby Dick',
 'sperm whale',
 'American Renaissance',
 'Great American Novel',
 'William Faulkner',
 'D. H. Lawrence',
 'opening sentence']
Code
list(nlp('Herman Melville'))
[Herman, Melville]

These aren’t really strings; they are Token objects with an overridden string representation. I should be able to find the words that match in sequence in the given text easily enough. It also looks like my downloads are going to finish soon.

Code
def find_links(text: str) -> List[List[str]]:
    nlp = spacy.load("en_core_web_sm")
    return [
        [word.text for word in nlp(link)]
        for link in LINK_PATTERN.findall(text)
    ]
Code
from typing import Iterator

def find_all_sequences(text: List[WordAndTag], patterns: List[List[str]]) -> List[List[WordAndTag]]:
    return sum(
        (list(_find_sequence(text, pattern)) for pattern in patterns),
        []
    )

def _find_sequence(text: List[WordAndTag], pattern: List[str]) -> Iterator[List[WordAndTag]]:
    # Stop early enough that the pattern cannot run off the end of the
    # text, which would raise an IndexError in _matches.
    for index in range(len(text) - len(pattern) + 1):
        if _matches(text, pattern, index):
            yield text[index:index+len(pattern)]

def _matches(text: List[WordAndTag], pattern: List[str], index: int) -> bool:
    return all(
        text[index + pattern_index][0] == pattern[pattern_index]
        for pattern_index in range(len(pattern))
    )
Code
find_all_sequences(
    text=tag_english(clean_text(call_me_ishmael)),
    patterns=find_links(call_me_ishmael)
)
[[('Herman', 'PROPN'), ('Melville', 'PROPN')],
 [('Ishmael', 'PROPN')],
 [('Ishmael', 'PROPN')],
 [('Ahab', 'PROPN')],
 [('Ahab', 'PROPN')],
 [('whaling', 'VERB'), ('ship', 'NOUN')],
 [('Pequod', 'PROPN')],
 [('Moby', 'PROPN'), ('Dick', 'PROPN')],
 [('sperm', 'PROPN'), ('whale', 'NOUN')],
 [('American', 'PROPN'), ('Renaissance', 'PROPN')],
 [('Great', 'PROPN'), ('American', 'PROPN'), ('Novel', 'PROPN')],
 [('William', 'PROPN'), ('Faulkner', 'PROPN')],
 [('D.', 'PROPN'), ('H.', 'PROPN'), ('Lawrence', 'PROPN')],
 [('opening', 'NOUN'), ('sentence', 'NOUN')]]

If I just look for PROPN and NOUN, how close do I get?

Code
[word for word, tag in tag_english(clean_text(call_me_ishmael)) if tag in ("PROPN", "NOUN")][:10]
['Moby',
 'Dick',
 'Whale',
 'novel',
 'writer',
 'Herman',
 'Melville',
 'book',
 'sailor',
 'Ishmael']

So there are a lot more matches for the noun tags than there are links.

It looks like the initial investigation has found that recovering the links requires more than word-by-word POS tags. At the very least I would have to consider several words at a time to recover the multi-word links.
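To put a rough number on that gap, here is a minimal sketch (`noun_link_overlap` is my own helper, not part of the notebook) that counts how many noun-tagged tokens actually belong to a link, shown on a toy example with hand-written tags rather than real spaCy output:

```python
from typing import List, Tuple

WordAndTag = Tuple[str, str]

def noun_link_overlap(
    tagged: List[WordAndTag], links: List[List[str]]
) -> Tuple[int, int]:
    """Return (noun tokens that appear in a link, total noun tokens)."""
    link_words = {word for link in links for word in link}
    nouns = [word for word, tag in tagged if tag in ("PROPN", "NOUN")]
    return sum(word in link_words for word in nouns), len(nouns)

# Toy illustration with hand-written tags (not real spaCy output):
tagged = [
    ("Herman", "PROPN"), ("Melville", "PROPN"), ("wrote", "VERB"),
    ("a", "DET"), ("novel", "NOUN"), (".", "PUNCT"),
]
print(noun_link_overlap(tagged, links=[["Herman", "Melville"]]))  # (2, 3)
```

Running it over the full tagged page and the tokenised links from above would quantify how many noun-tagged tokens are false positives for link recovery.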

Code
type(nlp(clean_text(call_me_ishmael)))
spacy.tokens.doc.Doc

I’ve looked at the documentation for this type and it looks like there are a few interesting methods available. So far I’ve just been using the suggested code to process the text; now I should investigate these additional methods.

The first is noun_chunks (docs), which might give me something more reliable than the noun-tagged words.

Code
doc = nlp(clean_text(call_me_ishmael))
noun_chunks = list(doc.noun_chunks)
noun_chunks
[Moby-Dick;,
 by American writer Herman Melville,
 Pequod,
 white,
 previous voyage bit,
 A contribution to the literature of the,
 was published to mixed,
 was,
 Its reputation as a "Great American Novel" was established only in the 20th century, after the centennial of its author's birth.,
 William Faulkner said he wished,
 written,
 called,
 one of the,
 ever written,
 Its opening sentence, "Call me Ishmael", is among world,
 literature',
 s most famous.]

So these are spacy.tokens.span.Span objects. Again these have promising properties, like ents, which should return the named entities.

Code
[chunk.ents for chunk in noun_chunks]
[[Dick],
 [American writer, Herman Melville],
 [],
 [],
 [],
 [],
 [],
 [],
 [Great American Novel],
 [William Faulkner],
 [],
 [],
 [],
 [],
 [Call me Ishmael],
 [],
 []]
Code
doc.ents
(Dick,
 The Whale is,
 American writer,
 Herman Melville,
 Pequod, for revenge on Moby Dick,
 Moby-Dick,
 Great American Novel,
 William Faulkner,
 D. H. Lawrence,
 Call me Ishmael)

This certainly seems very hit-and-miss. Perhaps it is overemphasizing the capitalization of some of the words?

Why is it just Dick at the start but Moby-Dick later?
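One way to dig into this would be to look at the entity labels, not just the spans, since each entity carries a `label_` attribute. A sketch (`entity_labels` is my own helper; it assumes en_core_web_sm is installed and returns an empty list otherwise):

```python
from typing import List, Tuple

def entity_labels(text: str) -> List[Tuple[str, str]]:
    """Return (entity text, entity label) pairs, or [] if the
    model is not available."""
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
    except Exception:
        return []  # spaCy or the model is missing
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

# e.g. entity_labels("Moby-Dick is an 1851 novel by Herman Melville.")
# might label Herman Melville as PERSON and 1851 as DATE, which would
# show whether "Dick" alone is being picked up as a person name.
```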

Code
spacy.__version__
'2.3.5'

spaCy 3 is coming out soon, so it might be good to run this again when it does.