Can I find a dataset that can work with my approach?
Published
July 18, 2021
I’ve been thinking about aspect sentiment recently, and how a model could be trained to address this problem. The technique that I have been investigating has worked well for entity extraction, and now it is time to apply it to the real problem. How well will my token classification approach work?
The first thing to do is to find a dataset that is suitable for this approach. I want something that delimits entities in the text and associates each one with a sentiment. A quick search finds the MAMS dataset from {% cite jiang-etal-2019-challenge %}.
The MAMS dataset provides two different representations of the problem. One is for categories and associated sentiment, while the other is for the specific entities in the text. The raw data is XML which is reasonably easy to understand:
<sentence>
  <text>The decor is not special at all but their food and amazing prices make up for it.</text>
  <aspectTerms>
    <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
    <aspectTerm from="42" polarity="positive" term="food" to="46"/>
    <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
  </aspectTerms>
</sentence>
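The from and to attributes look like character offsets into the sentence, with to being exclusive. Here is a minimal sketch that checks this on the example above using only the standard library. The variable names are my own and are not part of any dataset tooling.

```python
import xml.etree.ElementTree as ET

# The example sentence from above, copied verbatim.
SAMPLE = """
<sentence>
  <text>The decor is not special at all but their food and amazing prices make up for it.</text>
  <aspectTerms>
    <aspectTerm from="4" polarity="negative" term="decor" to="9"/>
    <aspectTerm from="42" polarity="positive" term="food" to="46"/>
    <aspectTerm from="59" polarity="positive" term="prices" to="65"/>
  </aspectTerms>
</sentence>
""".strip()

sentence = ET.fromstring(SAMPLE)
text = sentence.find("text").text

# Slice the sentence with each (from, to) pair and keep it next to the stated term.
spans = []
for term in sentence.iter("aspectTerm"):
    start, end = int(term.attrib["from"]), int(term.attrib["to"])
    spans.append((text[start:end], term.attrib["term"], term.attrib["polarity"]))

for sliced, expected, polarity in spans:
    print(f"{sliced!r} vs {expected!r} ({polarity})")
```

If every slice matches its term, the offsets can be used directly to mark which characters belong to each entity.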
There is a GitHub repo that has the raw XML, so I can try loading it directly.
Code
import pandas as pd

# new in pandas 1.3.0
pd.read_xml("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")
      text                                                  aspectTerms
0     The decor is not special at all but their food...    NaN
1     when tables opened up, the manager sat another...    NaN
2     Though the menu includes some unorthodox offer...    NaN
3     service is good although a bit in your face, w...    NaN
4     PS- I just went for brunch on Saturday and the...    NaN
...   ...                                                   ...
4292  For dinner, I love the churrasco and halibut w...    NaN
4293  Was there for dinner last night, and the food ...    NaN
4294  The menu sounded good but the grilled eggplant...    NaN
4295  Service is coddling and correct and there's no...    NaN
4296  USC has a cold smoker and smoked the avocado i...    NaN

4297 rows × 2 columns
Looks like I’ll have to do some preprocessing on this data, as the aspect terms haven’t been parsed. I’m guessing it’s because they are nested elements whose values are all stored in attributes.
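To illustrate that guess, here is a small sketch using the standard library parser on a trimmed copy of the earlier example. The aspectTerms element has no text of its own, and everything useful sits in the attributes of its children, which is presumably why that column came back empty. The snippet and its names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example sentence, with a single aspect term.
SNIPPET = (
    "<sentence>"
    "<text>The decor is not special at all but their food and amazing prices make up for it.</text>"
    "<aspectTerms>"
    '<aspectTerm from="4" polarity="negative" term="decor" to="9"/>'
    "</aspectTerms>"
    "</sentence>"
)

aspect_terms = ET.fromstring(SNIPPET).find("aspectTerms")

# The container element carries no text at all.
element_text = aspect_terms.text

# The actual data is held in the attributes of the child elements.
child_attributes = [dict(child.attrib) for child in aspect_terms]

print(element_text)
print(child_attributes)
```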
Code
import xml.etree.ElementTree as ET

import pandas as pd
import requests


def to_df(url: str) -> pd.DataFrame:
    # Download the raw XML and fail early if the request did not succeed.
    data = requests.get(url)
    data.raise_for_status()
    tree = ET.fromstring(data.content)

    # One record per sentence; the aspect terms are read from the element attributes.
    as_json = [
        {
            "text": sentence.find("text").text,
            "entities": [
                {
                    "sentiment": entity.attrib["polarity"],
                    "text": entity.attrib["term"],
                    "start": int(entity.attrib["from"]),
                    "end": int(entity.attrib["to"]),
                }
                for entity in sentence.find("aspectTerms").findall("aspectTerm")
            ],
        }
        for sentence in tree.findall("sentence")
    ]
    return pd.DataFrame(as_json)


train_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/train.xml")
validation_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/val.xml")
test_df = to_df("https://raw.githubusercontent.com/siat-nlp/MAMS-for-ABSA/master/data/MAMS-ATSA/raw/test.xml")
train_df