import spacy
! python -m spacy download en_core_web_sm
January 8, 2021
I need to evaluate the spacy part of speech (POS) tagger for German, Spanish, French and Italian. To test it I want to be able to recover the Wikipedia links. These are the delimiters which indicate that some text links to another page. They look like [[this]], or [[this link|this text]].
It should be quite easy to strip the special syntax from the page, and then I can test whether I can recover the links using specific tags.
To get started with this I need to download some Wikipedia data. I’m going to get the English Wikipedia dump and then the language specific versions. That’s currently destroying my internet.
While I wait I can try out some of the POS taggers on the text of a single page. I should also get familiar with spacy, so I’ll start with the tutorial.
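Something along these lines, using the small English model downloaded at the top, gives the tags below (a minimal sketch of the tutorial code, since the original cell isn’t shown):

import spacy

# the small English model, downloaded at the top of the post
nlp = spacy.load("en_core_web_sm")

def pos_tags(nlp, text: str) -> list:
    # pair each token with its coarse part-of-speech tag
    return [(token.text, token.pos_) for token in nlp(text)]

pos_tags(nlp, "This is a sentence.")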
[('This', 'DET'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]
That’s pretty easy. Let’s try with some French.
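I’m assuming the small French model was the one used here; the equivalent sketch would be:

import spacy

# assuming fr_core_news_sm, fetched with: python -m spacy download fr_core_news_sm
nlp_fr = spacy.load("fr_core_news_sm")
[(token.text, token.pos_) for token in nlp_fr("C'est une phrase.")]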
[("C'", 'PRON'), ('est', 'AUX'), ('une', 'DET'), ('phrase', 'NOUN'), ('.', 'PUNCT')]
Now what I want to do is take some Wikipedia text and strip out the links. I like Moby Dick so let’s start there…
call_me_ishmael = """
'''''Moby-Dick; or, The Whale''''' is an 1851 novel by American writer [[Herman Melville]].
The book is the sailor [[Ishmael (Moby-Dick)|Ishmael]]'s narrative of the obsessive quest of
[[Captain Ahab|Ahab]], captain of the [[whaler|whaling ship]] ''[[Pequod (Moby-Dick)|Pequod]]'',
for revenge on [[Moby Dick (whale)|Moby Dick]], the giant white [[sperm whale]] that on the
ship's previous voyage bit off Ahab's leg at the knee. A contribution to the literature of
the [[American Renaissance (literature)|American Renaissance]], ''Moby-Dick'' was published
to mixed reviews, was a commercial failure, and was out of print at the time of the author's
death in 1891. Its reputation as a "[[Great American Novel]]" was established only in the 20th
century, after the centennial of its author's birth. [[William Faulkner]] said he wished he
had written the book himself,<ref>Faulkner (1927)</ref> and [[D. H. Lawrence]] called it "one
of the strangest and most wonderful books in the world" and "the greatest book of the sea ever
written".<ref>Lawrence (1923), 168</ref> Its [[opening sentence]], "Call me Ishmael", is among
world literature's most famous.<ref> Buell (2014), 362 note.</ref>
"""
The changes that I need to make to this seem pretty straightforward:
- remove the ''''' sequences (bold)
- remove the '' sequences (italic)
- remove the <ref>TEXT</ref> sequences (footnotes)
- where [[word]] appears, replace it with word
- where [[link|word]] appears, replace it with word
import regex as re

def remove_bold(text: str) -> str:
    return text.replace("'''''", "")

def remove_italic(text: str) -> str:
    return text.replace("''", "")

REFERENCE_PATTERN = re.compile(r"<ref>.*?</ref>")

def remove_references(text: str) -> str:
    return REFERENCE_PATTERN.sub("", text)

LINK_PATTERN = re.compile(r"\[\[(?:.*?\|)?([^|\]]+)]]")

def expand_links(text: str) -> str:
    return LINK_PATTERN.sub(r"\1", text)

WHITESPACE_PATTERN = re.compile(r"\s+")

def normalize_whitespace(text: str) -> str:
    return WHITESPACE_PATTERN.sub(" ", text)
American writer Herman Melville
captain of the whaling ship
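The two fragments above look like spot checks of expand_links. Chaining all of the helpers together over the whole article gives the cleaned version below (a sketch, since the exact composition isn’t shown; I’m calling the result clean_text so later cells can refer to it):

def clean_wikitext(text: str) -> str:
    # apply the helpers in order: markup, references, links, then whitespace
    for step in (remove_bold, remove_italic, remove_references, expand_links, normalize_whitespace):
        text = step(text)
    return text.strip()

clean_text = clean_wikitext(call_me_ishmael)
clean_text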
Moby-Dick; or, The Whale is an 1851 novel by American writer Herman Melville. The book is the sailor Ishmael's narrative of the obsessive quest of Ahab, captain of the whaling ship Pequod, for revenge on Moby Dick, the giant white sperm whale that on the ship's previous voyage bit off Ahab's leg at the knee. A contribution to the literature of the American Renaissance, Moby-Dick was published to mixed reviews, was a commercial failure, and was out of print at the time of the author's death in 1891. Its reputation as a "Great American Novel" was established only in the 20th century, after the centennial of its author's birth. William Faulkner said he wished he had written the book himself, and D. H. Lawrence called it "one of the strangest and most wonderful books in the world" and "the greatest book of the sea ever written". Its opening sentence, "Call me Ishmael", is among world literature's most famous.
Now I can try out the POS tagging to see what corresponds to the links.
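Presumably this reuses the nlp model and the pos_tags helper sketched earlier over the cleaned text, looking at the first ten tokens:

tagged = pos_tags(nlp, clean_text)
tagged[:10]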
[('Moby', 'PROPN'),
('-', 'PUNCT'),
('Dick', 'PROPN'),
(';', 'PUNCT'),
('or', 'CCONJ'),
(',', 'PUNCT'),
('The', 'DET'),
('Whale', 'PROPN'),
('is', 'AUX'),
('an', 'DET')]
I was half hoping that the POS tags would be a simpler set. I remember working with these before and there are quite a lot of distinct tags.
Since this has to work with multi-word phrases (e.g. Herman Melville) I will need to do some sort of grouping.
Let’s start by finding the POS tags for the linked terms.
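The LINK_PATTERN regex from earlier already captures the display text of each link, so one way to collect the linked terms (a sketch of what the hidden cell probably does) is:

# findall returns the capture group, i.e. the text displayed for each link
links = LINK_PATTERN.findall(call_me_ishmael)
links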
['Herman Melville',
'Ishmael',
'Ahab',
'whaling ship',
'Pequod',
'Moby Dick',
'sperm whale',
'American Renaissance',
'Great American Novel',
'William Faulkner',
'D. H. Lawrence',
'opening sentence']
These aren’t really words; they have an overridden string representation. I should be able to find the words that match in sequence in the given text easily enough. It also looks like my downloads are going to finish soon.
from typing import Iterator, List, Tuple

# a tagged word is a (word, part-of-speech) pair
WordAndTag = Tuple[str, str]

def find_all_sequences(text: List[WordAndTag], patterns: List[List[str]]) -> List[List[WordAndTag]]:
    return sum(
        (list(_find_sequence(text, pattern)) for pattern in patterns),
        []
    )

def _find_sequence(text: List[WordAndTag], pattern: List[str]) -> Iterator[List[WordAndTag]]:
    # only start from positions where the whole pattern can still fit
    for index in range(len(text) - len(pattern) + 1):
        if _matches(text, pattern, index):
            yield text[index:index + len(pattern)]

def _matches(text: List[WordAndTag], pattern: List[str], index: int) -> bool:
    return all(
        text[index + pattern_index][0] == pattern[pattern_index]
        for pattern_index in range(len(pattern))
    )
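The call that produced the output below isn’t shown; presumably each linked term gets split into its words first, something like this (tagged and links are the names from the sketches above):

find_all_sequences(tagged, [term.split() for term in links])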
[[('Herman', 'PROPN'), ('Melville', 'PROPN')],
[('Ishmael', 'PROPN')],
[('Ishmael', 'PROPN')],
[('Ahab', 'PROPN')],
[('Ahab', 'PROPN')],
[('whaling', 'VERB'), ('ship', 'NOUN')],
[('Pequod', 'PROPN')],
[('Moby', 'PROPN'), ('Dick', 'PROPN')],
[('sperm', 'PROPN'), ('whale', 'NOUN')],
[('American', 'PROPN'), ('Renaissance', 'PROPN')],
[('Great', 'PROPN'), ('American', 'PROPN'), ('Novel', 'PROPN')],
[('William', 'PROPN'), ('Faulkner', 'PROPN')],
[('D.', 'PROPN'), ('H.', 'PROPN'), ('Lawrence', 'PROPN')],
[('opening', 'NOUN'), ('sentence', 'NOUN')]]
If I just look for PROPN and NOUN, how close do I get?
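A sketch of that filter, again just showing the first ten hits:

[word for word, tag in tagged if tag in ("PROPN", "NOUN")][:10]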
['Moby',
'Dick',
'Whale',
'novel',
'writer',
'Herman',
'Melville',
'book',
'sailor',
'Ishmael']
So there are a lot more matches for the noun tags than there are links.
So the initial investigation shows that recovering the links takes more than the word-by-word POS tags. At the very least I would have to look at several words together to recover the multi-word links.
I’ve looked at the documentation for the Doc type and it looks like there are a few interesting methods available. So far I’ve just been using the suggested code to process the text; now I should investigate these additional methods.
The first is noun_chunks (docs), which might give me something more reliable than the noun tagged words.
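Getting at them is just a matter of iterating over the Doc. I’m not certain exactly which text the original cell ran over, so this is only a sketch and may not reproduce the output below exactly:

doc = nlp(clean_text)
# noun_chunks yields spacy Span objects
chunks = list(doc.noun_chunks)
chunks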
[Moby-Dick;,
by American writer Herman Melville,
Pequod,
white,
previous voyage bit,
A contribution to the literature of the,
was published to mixed,
was,
Its reputation as a "Great American Novel" was established only in the 20th century, after the centennial of its author's birth.,
William Faulkner said he wished,
written,
called,
one of the,
ever written,
Its opening sentence, "Call me Ishmael", is among world,
literature',
s most famous.]
So these are spacy.tokens.span.Span objects. Again these have promising methods on them, like ents which should return the named entities.
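Presumably the next cell maps ents over those spans (using the chunks list from the sketch above), along the lines of:

# each Span exposes the named entities that fall inside it
[list(span.ents) for span in chunks]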
[[Dick],
[American writer, Herman Melville],
[],
[],
[],
[],
[],
[],
[Great American Novel],
[William Faulkner],
[],
[],
[],
[],
[Call me Ishmael],
[],
[]]
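The final tuple below looks like the entities over the whole document, which would just be ents on the doc from the noun_chunks sketch:

doc.ents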
(Dick,
The Whale is,
American writer,
Herman Melville,
Pequod, for revenge on Moby Dick,
Moby-Dick,
Great American Novel,
William Faulkner,
D. H. Lawrence,
Call me Ishmael)
This certainly seems very hit and miss. Perhaps it is over-emphasizing the capitalization of some of the words?
Why is it just Dick at the start but Moby-Dick later?
Spacy 3 is coming out soon, so it might be good to run this again when it does.