Aspect Sentiment Dataset Review

Can I find a dataset that can work with my approach?
Published July 18, 2021

I’ve been thinking about aspect sentiment recently, and how a model could be trained to address it. The technique that I have been investigating has worked well for entity extraction, so now it is time to apply it to the real problem. How well will my token classification approach work?

The first thing to do is to find a dataset that suits this approach. I want one that delimits entities in the text and associates each of them with a sentiment. A quick search finds the MAMS dataset from {% cite jiang-etal-2019-challenge %}.

The MAMS dataset provides two different representations of the problem. One covers categories and their associated sentiment, while the other covers the specific entities in the text. The raw data is XML, which is reasonably easy to understand:

    <sentence>
        <text>The decor is not special at all but their food and amazing prices make up for it.</text>
        <aspectTerms>
            <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
            <aspectTerm from="42" polarity="positive" term="food" to="46"/>
            <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
        </aspectTerms>
    </sentence>
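To make the offset convention concrete, here is a minimal sketch that parses the record above with the standard library. It assumes that "from" is an inclusive character offset and "to" an exclusive one, which is what the sample suggests; the variable names are mine, not the dataset's.

```python
import xml.etree.ElementTree as ET

# The sample record shown above, copied verbatim.
SAMPLE = """
<sentence>
    <text>The decor is not special at all but their food and amazing prices make up for it.</text>
    <aspectTerms>
        <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
        <aspectTerm from="42" polarity="positive" term="food" to="46"/>
        <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
    </aspectTerms>
</sentence>
"""

sentence = ET.fromstring(SAMPLE)
text = sentence.find("text").text

# Each aspectTerm keeps everything in attributes: the term, its polarity,
# and the character offsets of the term within the sentence text.
aspects = [
    (term.get("term"), term.get("polarity"), int(term.get("from")), int(term.get("to")))
    for term in sentence.iter("aspectTerm")
]

for term, polarity, start, end in aspects:
    # Assumption: "to" is exclusive, so a plain slice should recover the term.
    print(f"{term!r} ({polarity}): text[{start}:{end}] == {text[start:end]!r}")
```

If the assumption holds, every slice printed on the right matches the term on the left.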

There is a GitHub repo that has the raw XML. I can try loading this directly.

Code
import pandas as pd

# new in pandas 1.3.0
pd.read_xml("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")
text aspectTerms
0 The decor is not special at all but their food... NaN
1 when tables opened up, the manager sat another... NaN
2 Though the menu includes some unorthodox offer... NaN
3 service is good although a bit in your face, w... NaN
4 PS- I just went for brunch on Saturday and the... NaN
... ... ...
4292 For dinner, I love the churrasco and halibut w... NaN
4293 Was there for dinner last night, and the food ... NaN
4294 The menu sounded good but the grilled eggplant... NaN
4295 Service is coddling and correct and there's no... NaN
4296 USC has a cold smoker and smoked the avocado i... NaN

4297 rows × 2 columns

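Looks like I’ll have to do some preprocessing of this data, as the aspect terms haven’t been parsed. I’m guessing it’s because read_xml is aimed at shallow documents: it only reads the direct children of each sentence, and the aspectTerms element is just a container whose real content lives in the attributes of its nested aspectTerm children.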
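One way to see that the attributes themselves are readable is to point read_xml at the aspectTerm elements directly. This is only a sketch on an inline sample: the wrapping sentences root element is my assumption about the file layout, and the etree parser is chosen just to avoid needing lxml. The catch is that each row loses its link to the sentence it came from, so it does not replace parsing the file properly.

```python
from io import StringIO

import pandas as pd

# Inline sample; the <sentences> root element is an assumption for illustration.
xml = """<sentences>
    <sentence>
        <text>The decor is not special at all but their food and amazing prices make up for it.</text>
        <aspectTerms>
            <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
            <aspectTerm from="42" polarity="positive" term="food" to="46"/>
            <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
        </aspectTerms>
    </sentence>
</sentences>"""

# Select the nested aspectTerm elements as rows; their attributes become columns.
terms = pd.read_xml(StringIO(xml), xpath=".//aspectTerm", parser="etree")
print(terms)
```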

Code
import requests
import xml.etree.ElementTree as ET

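def to_df(url: str) -> pd.DataFrame:
    """Download a MAMS-ATSA XML file and return one row per sentence."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    tree = ET.fromstring(response.content)
    rows = [
        {
            "text": sentence.find("text").text,
            # each aspect term is described entirely by its attributes
            "entities": [
                {
                    "sentiment": entity.attrib["polarity"],
                    "text": entity.attrib["term"],
                    # character offsets into the sentence text
                    "start": int(entity.attrib["from"]),
                    "end": int(entity.attrib["to"]),
                }
                for entity in sentence.find("aspectTerms").findall("aspectTerm")
            ]
        }
        for sentence in tree.findall("sentence")
    ]
    return pd.DataFrame(rows)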

train_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")
validation_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/val.xml")
test_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/test.xml")

train_df
text entities
0 The decor is not special at all but their food... [{'sentiment': 'negative', 'text': 'decor', 's...
1 when tables opened up, the manager sat another... [{'sentiment': 'neutral', 'text': 'tables', 's...
2 Though the menu includes some unorthodox offer... [{'sentiment': 'neutral', 'text': 'menu', 'sta...
3 service is good although a bit in your face, w... [{'sentiment': 'positive', 'text': 'service', ...
4 PS- I just went for brunch on Saturday and the... [{'sentiment': 'neutral', 'text': 'brunch', 's...
... ... ...
4292 For dinner, I love the churrasco and halibut w... [{'sentiment': 'neutral', 'text': 'dinner', 's...
4293 Was there for dinner last night, and the food ... [{'sentiment': 'neutral', 'text': 'dinner', 's...
4294 The menu sounded good but the grilled eggplant... [{'sentiment': 'neutral', 'text': 'menu', 'sta...
4295 Service is coddling and correct and there's no... [{'sentiment': 'positive', 'text': 'Service', ...
4296 USC has a cold smoker and smoked the avocado i... [{'sentiment': 'neutral', 'text': 'avocado', '...

4297 rows × 2 columns

Code
train_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/train.gz.parquet", compression="gzip")
validation_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/validation.gz.parquet", compression="gzip")
test_df.to_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/test.gz.parquet", compression="gzip")

The last thing to do is to check that the marked up entities are actually what I can extract from the text.

Code
train_df.iloc[0].entities
[{'sentiment': 'negative', 'text': 'decor', 'start': 4, 'end': 9},
 {'sentiment': 'positive', 'text': 'food', 'start': 42, 'end': 46},
 {'sentiment': 'positive', 'text': 'prices', 'start': 59, 'end': 65}]
Code
text = train_df.iloc[0].text
text[4:9], text[42:46], text[59:65]
('decor', 'food', 'prices')
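Checking one row by hand does not say much about the other rows, so the same comparison can be run over every entity. The following is a minimal sketch; misaligned_entities is a hypothetical helper of mine, and the second sample row is invented with a deliberately wrong offset to show what a failure looks like.

```python
def misaligned_entities(rows):
    """Yield (row_index, entity) pairs whose offsets do not reproduce the entity text."""
    for index, row in enumerate(rows):
        for entity in row["entities"]:
            if row["text"][entity["start"]:entity["end"]] != entity["text"]:
                yield index, entity


rows = [
    {
        "text": "The decor is not special at all but their food and amazing prices make up for it.",
        "entities": [
            {"sentiment": "negative", "text": "decor", "start": 4, "end": 9},
            {"sentiment": "positive", "text": "food", "start": 42, "end": 46},
            {"sentiment": "positive", "text": "prices", "start": 59, "end": 65},
        ],
    },
    {
        # Invented row: the offsets are shifted by one on purpose.
        "text": "service is good",
        "entities": [{"sentiment": "positive", "text": "service", "start": 1, "end": 8}],
    },
]

problems = list(misaligned_entities(rows))
print(problems)
```

On the real data the rows could be supplied with train_df.to_dict("records"); an empty result would mean every annotated span matches its text.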

This all looks very hopeful. Does the tokenizer agree?

Code
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
Code
offsets = tokenizer(
    text,
    return_offsets_mapping=True,
)["offset_mapping"]

(4, 9) in offsets, (42, 46) in offsets, (59, 65) in offsets
(True, True, True)
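Since the entity boundaries coincide with token boundaries, the offsets can be turned into per-token labels for token classification. Here is a rough sketch of that step under my own assumptions: the label scheme (the sentiment name for tokens inside an entity, "O" elsewhere) is invented for illustration, label_tokens is a hypothetical helper, and the offsets are written by hand in the shape the tokenizer returns rather than produced by it.

```python
def label_tokens(offsets, entities, outside="O"):
    """Give each token the sentiment of the entity span that fully contains it."""
    labels = []
    for token_start, token_end in offsets:
        label = outside
        # Special tokens are reported with an empty (0, 0) span, so skip them.
        if token_end > token_start:
            for entity in entities:
                if entity["start"] <= token_start and token_end <= entity["end"]:
                    label = entity["sentiment"]
                    break
        labels.append(label)
    return labels


# Hand-written offsets for "The decor is not special", with an empty span at
# each end standing in for the special tokens (an assumption for illustration).
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (13, 16), (17, 24), (0, 0)]
entities = [{"sentiment": "negative", "text": "decor", "start": 4, "end": 9}]

labels = label_tokens(offsets, entities)
print(labels)
```

Requiring a token to sit fully inside the entity span keeps the sketch strict; a token that straddles an entity boundary would be left as "O", which is the case the membership check above is meant to rule out.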

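Great stuff. For this first example the annotated spans line up with both the raw text and the tokenizer offsets, so I think this is an excellent dataset for my approach.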