Matthew’s Blog - Aspect Sentiment Dataset Review

I’ve been thinking about aspect sentiment recently, and about how a model could be trained to predict it. The token classification technique that I have been investigating has worked well for entity extraction, and now it is time to apply it to the real problem. How well will it work?

The first thing to do is to find a dataset that is suitable for this approach. I want something that marks the entity spans in the text and associates each one with a sentiment. A quick search finds the MAMS dataset from {% cite jiang-etal-2019-challenge %}.

The MAMS dataset provides two different representations of the problem. One (MAMS-ACSA) pairs aspect categories with a sentiment, while the other (MAMS-ATSA) marks the specific terms in the text, which is the one I want. The raw data is XML, which is reasonably easy to understand:

    <sentence>
        <text>The decor is not special at all but their food and amazing prices make up for it.</text>
        <aspectTerms>
            <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
            <aspectTerm from="42" polarity="positive" term="food" to="46"/>
            <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
        </aspectTerms>
    </sentence>

There is a GitHub repo that hosts the raw XML, so I can try loading it directly with pandas.


import pandas as pd

# new in pandas 1.3.0
pd.read_xml("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")

	text	aspectTerms
0	The decor is not special at all but their food...	NaN
1	when tables opened up, the manager sat another...	NaN
2	Though the menu includes some unorthodox offer...	NaN
3	service is good although a bit in your face, w...	NaN
4	PS- I just went for brunch on Saturday and the...	NaN
...	...	...
4292	For dinner, I love the churrasco and halibut w...	NaN
4293	Was there for dinner last night, and the food ...	NaN
4294	The menu sounded good but the grilled eggplant...	NaN
4295	Service is coddling and correct and there's no...	NaN
4296	USC has a cold smoker and smoked the avocado i...	NaN

4297 rows × 2 columns

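Looks like I’ll have to do some preprocessing, as the aspect terms haven’t been parsed. I’m guessing this is because read_xml is designed for shallow documents: by default it reads only the immediate children and attributes of each row element, and here the terms sit one level deeper, as attributes of the aspectTerm elements nested inside aspectTerms.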
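One way to probe this is to point `read_xml` at the nested elements with its `xpath` argument. The sketch below runs on the single sentence shown earlier instead of the downloaded file; the `<sentences>` root tag is my assumption about the file layout, and `parser="etree"` is used so that lxml is not required. The attributes are read without trouble, but each row is now an aspect term with no link back to its sentence, so parsing the file by hand is still the more useful route.

```python
from io import StringIO

import pandas as pd

# The example sentence from above, inlined so the sketch runs without a download.
# The <sentences> root element is an assumption made for this illustration.
xml = """<sentences>
    <sentence>
        <text>The decor is not special at all but their food and amazing prices make up for it.</text>
        <aspectTerms>
            <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
            <aspectTerm from="42" polarity="positive" term="food" to="46"/>
            <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
        </aspectTerms>
    </sentence>
</sentences>"""

# Select the nested aspectTerm elements directly instead of the default "./*".
terms = pd.read_xml(StringIO(xml), xpath=".//aspectTerm", parser="etree")
print(terms)
```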


import requests
import xml.etree.ElementTree as ET

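def to_df(url: str) -> pd.DataFrame:
    """Download a MAMS ATSA XML file and return one row per sentence."""
    data = requests.get(url)
    data.raise_for_status()  # fail on a bad download instead of parsing an error page
    tree = ET.fromstring(data.content)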
    as_json = [
        {
            "text": sentence.find("text").text,
            "entities": [
                {
                    "sentiment": entity.attrib["polarity"],
                    "text": entity.attrib["term"],
                    "start": int(entity.attrib["from"]),
                    "end": int(entity.attrib["to"]),
                }
                for entity in sentence.find("aspectTerms").findall("aspectTerm")
            ]
        }
        for sentence in tree.findall("sentence")
    ]
    return pd.DataFrame(as_json)

train_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")
validation_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/val.xml")
test_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/test.xml")

train_df

	text	entities
0	The decor is not special at all but their food...	[{'sentiment': 'negative', 'text': 'decor', 's...
1	when tables opened up, the manager sat another...	[{'sentiment': 'neutral', 'text': 'tables', 's...
2	Though the menu includes some unorthodox offer...	[{'sentiment': 'neutral', 'text': 'menu', 'sta...
3	service is good although a bit in your face, w...	[{'sentiment': 'positive', 'text': 'service', ...
4	PS- I just went for brunch on Saturday and the...	[{'sentiment': 'neutral', 'text': 'brunch', 's...
...	...	...
4292	For dinner, I love the churrasco and halibut w...	[{'sentiment': 'neutral', 'text': 'dinner', 's...
4293	Was there for dinner last night, and the food ...	[{'sentiment': 'neutral', 'text': 'dinner', 's...
4294	The menu sounded good but the grilled eggplant...	[{'sentiment': 'neutral', 'text': 'menu', 'sta...
4295	Service is coddling and correct and there's no...	[{'sentiment': 'positive', 'text': 'Service', ...
4296	USC has a cold smoker and smoked the avocado i...	[{'sentiment': 'neutral', 'text': 'avocado', '...

4297 rows × 2 columns


train_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/train.gz.parquet", compression="gzip")
validation_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/validation.gz.parquet", compression="gzip")
test_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/test.gz.parquet", compression="gzip")

The last thing to do is to check that the start and end offsets of each marked-up entity actually select the entity text when used to slice the sentence.


train_df.iloc[0].entities

[{'sentiment': 'negative', 'text': 'decor', 'start': 4, 'end': 9},
 {'sentiment': 'positive', 'text': 'food', 'start': 42, 'end': 46},
 {'sentiment': 'positive', 'text': 'prices', 'start': 59, 'end': 65}]


text = train_df.iloc[0].text
text[4:9], text[42:46], text[59:65]

('decor', 'food', 'prices')
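That only checks the first row. A sketch of the same check over every sentence might look like the following. `find_offset_mismatches` is a name I made up for illustration, and it expects records shaped like the rows of `train_df` (for example from `train_df.to_dict("records")`). The two records here are toy data, the second with a deliberately wrong offset.

```python
def find_offset_mismatches(records):
    """Return (row, entity) pairs where text[start:end] differs from the entity text."""
    mismatches = []
    for row, record in enumerate(records):
        for entity in record["entities"]:
            span = record["text"][entity["start"]:entity["end"]]
            if span != entity["text"]:
                mismatches.append((row, entity))
    return mismatches


records = [
    {
        "text": "The decor is not special at all but their food and amazing prices make up for it.",
        "entities": [
            {"sentiment": "negative", "text": "decor", "start": 4, "end": 9},
            {"sentiment": "positive", "text": "food", "start": 42, "end": 46},
            {"sentiment": "positive", "text": "prices", "start": 59, "end": 65},
        ],
    },
    {
        # Toy record with a deliberately wrong offset, to show what a failure looks like.
        "text": "service is good",
        "entities": [{"sentiment": "positive", "text": "service", "start": 1, "end": 8}],
    },
]

mismatches = find_offset_mismatches(records)
print(mismatches)
```

An empty list from the real training data would mean every annotation can be recovered by slicing.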

This all looks very hopeful. Does the tokenizer agree? For token classification, each entity needs to start and end on a token boundary.


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")


offsets = tokenizer(
    text,
    return_offsets_mapping=True,
)["offset_mapping"]

(4, 9) in offsets, (42, 46) in offsets, (59, 65) in offsets

(True, True, True)
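The membership test above only succeeds when an entity is exactly one token. A longer term, or a word split into several subword pieces, would fail it even if its boundaries are fine. A more general check, sketched here, asks whether the span starts where some token starts and ends where some token ends. `span_is_aligned` is a hypothetical helper, and the offsets are hand-written for illustration, not produced by the tokenizer.

```python
def span_is_aligned(start, end, offsets):
    """True if the character span [start, end) begins and ends on token boundaries."""
    # Special tokens are usually reported with the empty offset (0, 0), so skip empty spans.
    real = [(s, e) for s, e in offsets if e > s]
    starts = {s for s, _ in real}
    ends = {e for _, e in real}
    return start in starts and end in ends


# Hand-written offsets for "grilled eggplant", pretending "eggplant" splits into two pieces.
offsets = [(0, 0), (0, 7), (8, 11), (11, 16), (0, 0)]

print(span_is_aligned(8, 16, offsets))  # entity covering two tokens
print(span_is_aligned(8, 14, offsets))  # span ending in the middle of a token
```

With real data, the `offsets` argument would be the `offset_mapping` list returned by the tokenizer call above.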

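Great stuff. The annotated offsets line up with the tokenizer for this sentence, so I think this is an excellent dataset for the token classification approach.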