Creating an Aspect Sentiment Dataset

Uploading some Aspect Sentiment datasets to the Huggingface hub
Published

October 2, 2023

I’ve been looking into aspect sentiment again and I’ve found several interesting datasets. Unfortunately most of these datasets are missing from the huggingface hub or are in a format that is difficult to work with. Since I find the hub very useful I would like to remedy this.

This post will be an exploration of the datasets, a review of how we could use them to train a model, and a conversion of the original data into a huggingface dataset. Thinking about how the data will be used should help create a useful dataset.

The Dataset

The dataset that I am going to use for this example is the Aspect Sentiment Triplet Extraction Task data. Aspect Sentiment Triplet Extraction (ASTE) is the task of extracting triplets of target entities, their associated sentiment, and the opinion spans explaining the reason for that sentiment.

An example of this kind of data is the statement “the food was awful, but the view was great”. Here we have two target entities, food and view, and we want to extract the associated sentiments and opinion spans. The resulting data would look like:

  • entity: food, sentiment: negative, opinion: awful
  • entity: view, sentiment: positive, opinion: great

When identifying the entity and opinion in the text it is important to have the exact position (character index, token index etc). This is because a given term can be used more than once in the text with different associated opinions and sentiments. The positional information can also feed into downstream tasks such as coreference resolution for ambiguous terms.
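
As a contrived example (my own, not from the dataset), the same word can appear twice and only the position distinguishes the mentions:

text = "The screen looks great but the screen flickers ."

# word index 1 and word index 6 are both "screen";
# without positions there is no way to tell which mention
# the opinion "flickers" is attached to
words = text.split()
assert words[1] == words[6] == "screen"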

The linked dataset provides this information in the following format:

I charge it at night and skip taking the cord with me because of the good battery life .####[([16, 17], [15], 'POS')]

Here the word indices are provided, so translated that becomes:

  • entity: battery life, sentiment: positive, opinion: good

There can be multiple relationships in a sentence:

The speed is incredible and I am more than satisfied .####[([1], [3], 'POS'), ([1], [9], 'POS')]
  • entity: speed, sentiment: positive, opinion: incredible
  • entity: speed, sentiment: positive, opinion: satisfied
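
To make the format concrete, here is a minimal sketch of decoding one of these raw lines; the decode_line helper is my own, not part of the dataset tooling:

import ast

def decode_line(line: str) -> list[tuple[str, str, str]]:
    # the sentence and the annotations are separated by ####
    text, raw_triples = line.split("####")
    words = text.split()
    sentiments = {"POS": "positive", "NEG": "negative", "NEU": "neutral"}
    return [
        (
            " ".join(words[index] for index in aspect_span),
            sentiments[sentiment],
            " ".join(words[index] for index in opinion_span),
        )
        for aspect_span, opinion_span, sentiment in ast.literal_eval(raw_triples)
    ]

decode_line(
    "The speed is incredible and I am more than satisfied ."
    "####[([1], [3], 'POS'), ([1], [9], 'POS')]"
)
# [('speed', 'positive', 'incredible'), ('speed', 'positive', 'satisfied')]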

We can see here that the text has been preprocessed and there can be multiple entries per line. How can this be turned into a good dataset?

Data Origin

This dataset is an annotated version of another dataset, Sem Eval 2014 Task 4: Aspect Based Sentiment Analysis. The Sem Eval task is split into several different parts, none of which fully reproduces the triplet of entity, sentiment and opinion. Finding the source is useful as the original text can be recovered from it.

Let’s get this data and then try joining the two together. I’ve had to do a little fiddling to get things working nicely - the preprocessing that produces the ASTE dataset also includes some spelling correction. For example, here the first line is the corrected ASTE text and the second is the original Sem Eval text:

Comfortable to use light easy to transport .
Comfterbale to use light easy to transport.

Code
import ast
import Levenshtein
import pandas as pd

def read_sem_eval_file(file: str) -> pd.DataFrame:
    df = pd.read_xml(file)[["text"]]
    return df

def read_aste_file(file: str) -> pd.DataFrame:

    def triple_to_hashable(
        triple: tuple[list[int], list[int], str]
    ) -> tuple[tuple[int], tuple[int], str]:
        aspect_span, opinion_span, sentiment = triple
        return tuple(aspect_span), tuple(opinion_span), sentiment

    df = pd.read_csv(
        file,
        sep="####",
        header=None,
        names=["text", "triples"],
        engine="python",
    )

    # There are duplicate rows, some of which have the same triples and some don't
    # This deals with that by
    # * first dropping the pure duplicates,
    # * then parsing the triples and exploding them to one per row
    # * then dropping the exploded duplicates (have to convert triples back to string for this)
    # * then grouping the triples up again
    # * finally sorting the distinct triples

    df = df.drop_duplicates()
    df["triples"] = df.triples.apply(ast.literal_eval)
    df = df.explode("triples")
    df["triples"] = df.triples.apply(triple_to_hashable)
    df = df.drop_duplicates()
    df = df.groupby("text").agg(list)
    df = df.reset_index(drop=False)
    df["triples"] = df.triples.apply(set).apply(sorted)

    return df
    

def get_original_text(
    aste_file: str,
    sem_eval_file: str,
    debug: bool = False,
) -> pd.DataFrame:
    approximate_matches = 0

    def best_match(text: str) -> str:
        comparison = text.replace(" ", "")
        if comparison in comparison_to_text:
            return comparison_to_text[comparison]

        nonlocal approximate_matches
        approximate_matches += 1
        distances = sem_eval_comparison.apply(
            lambda se_comparison: Levenshtein.distance(comparison, se_comparison)
        )
        best = sem_eval_df.iloc[distances.argmin()].text
        return best

    def triple_to_hashable(triple: tuple[list[int], list[int], str]) -> tuple[tuple[int], tuple[int], str]:
        aspect_span, opinion_span, sentiment = triple
        return tuple(aspect_span), tuple(opinion_span), sentiment

    sem_eval_df = read_sem_eval_file(sem_eval_file)
    sem_eval_comparison = sem_eval_df.text.str.replace(" ", "")
    comparison_to_text = dict(zip(sem_eval_comparison, sem_eval_df.text))

    aste_df = read_aste_file(aste_file)
    aste_df = aste_df.rename(columns={"text": "preprocessed_text"})
    aste_df["text"] = aste_df.preprocessed_text.apply(best_match)
    if debug:
        print(f"Read {len(aste_df):,} rows")
        print(f"Had to use {approximate_matches:,} approximate matches")
    return aste_df[["text", "preprocessed_text", "triples"]]
Code
df = get_original_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/train_triplets.txt",
    sem_eval_file="/data/aspect-sentiment/raw/sem-eval/2014/Laptop_Train_v2.xml",
    debug=True,
)
df
Read 900 rows
Had to use 35 approximate matches
text preprocessed_text triples
0 (No problem with the ordering or shipping by t... ( No problem with the ordering or shipping by ... [((7,), (1, 2), POS)]
1 ) And printing from either word processor is a... ) And printing from either word processor is a... [((5, 6), (9,), NEG)]
2 -Called headquarters again, they report that T... -Called headquarters again , they report that ... [((7, 8), (10,), NEG)]
3 -Computer crashed frequently and battery life ... -Computer crashed frequently and battery life ... [((4, 5), (6, 7, 8), NEG)]
4 -I propose that they can just swap the hard dr... -I propose that they can just swap the hard dr... [((8, 9), (6,), NEU)]
... ... ... ...
895 this computer will last you at least 7 years, ... this computer will last you at least 7 years ,... [((13,), (12,), POS)]
896 this is my second one and the same problem, ba... this is my second one and the same problem , b... [((11, 12), (10,), NEG), ((11, 12), (13,), NEG)]
897 very convenient when you travel and the batter... very convenient when you travel and the batter... [((7, 8), (10,), POS)]
898 while the keyboard itself is alright, the plat... while the keyboard itself is alright , the pla... [((2,), (5,), POS), ((8,), (12,), NEG), ((22, ...
899 wonderful features. wonderful features . [((1,), (0,), POS)]

900 rows × 3 columns

The original text would be a better target for training as it is more like the style that people use when writing normally. Changing the text to the original isn’t without problems though, as the word indices have now changed.

What I need to do is to match each whitespace separated word in the preprocessed text to the original. The transformation from the preprocessed text to the original can join two “words” that were previously separate (e.g. display . \(\rightarrow\) display.). As such it would be good to move from word indices to character indices.
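
To make the target representation concrete, here is a small sketch using the “wonderful features” example from the table above; the character indices are inclusive at both ends, which matches how I use them later:

preprocessed = "wonderful features ."  # ASTE text, whitespace separated tokens
original = "wonderful features."      # Sem Eval text

# word indices are positions in the whitespace separated preprocessed text
aspect_word_span = [1]                 # the token "features"

# character indices are positions in the original text (inclusive end)
aspect_start_index, aspect_end_index = 10, 17
assert original[aspect_start_index:aspect_end_index + 1] == "features"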

Doing this only for the specific words that are marked up in the triplets might seem easier than doing it for every word. I feel that mapping every word is actually easier, as that allows some constraints to be added to the processing:

  • each word of the preprocessed text is a disjoint section of the original text
  • each word of the preprocessed text immediately follows the preceding word
  • most words are spelled identically in both versions

Using Levenshtein to Map Characters

I want a mapping of the preprocessed text to the original text.

Most of the letters are exact matches. Sometimes there are extra letters on either side. Sometimes there are letters that are different on either side.

I can start by just lining up the two texts. Then I can go to each misalignment and determine whether more is fixed by deleting a character on one side or the other, or by marking the pair as a mismatch. The window to search when deleting is bounded by the difference in length between the two texts.

It occurs to me that this is just edit distance, reframed as an alignment problem. Since I’ve already got the Levenshtein package installed I can use the editops function to find the list of alterations required to turn one string into another. This will give me the alignment if I use those operations to update the offsets of the matching letters.

Let’s try it out.

The documentation states that it produces a list of operations that are “replace”, “insert” or “delete”, each with a source position and a destination position. I should be able to map these to operations over a list of indices. Then, when something is inserted or replaced, I can use None as a sentinel to indicate that the character has no mapping.

Code
import Levenshtein
from typing import Optional

def edit(original: str, preprocessed: str) -> list[Optional[int]]:
    # start with the identity mapping over the preprocessed characters, then
    # replay the operations that turn the preprocessed text into the original,
    # keeping the list aligned with the original text as it changes
    indices = list(range(len(preprocessed)))
    for operation, source_position, destination_position in Levenshtein.editops(preprocessed, original):
        if operation == "replace":
            # the original character differs from the preprocessed one, so no mapping
            indices[destination_position] = None
        elif operation == "insert":
            # the original has a character with no preprocessed counterpart
            indices.insert(destination_position, None)
        elif operation == "delete":
            # the preprocessed character does not appear in the original
            del indices[destination_position]
    return indices
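
Before applying this to the dataframe, a quick toy check (my own example, not a row from the dataset) shows what the mapping looks like when the original text joins the full stop onto the previous word:

edit(original="wonderful features.", preprocessed="wonderful features .")
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19]
# each entry is an index into the preprocessed text for the corresponding
# original character; the space before the full stop has no counterpart,
# so preprocessed index 18 simply never appears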
Code
df["text_indices"] = df.apply(
    lambda row: edit(original=row.text, preprocessed=row.preprocessed_text),
    axis="columns",
)

This has applied the changes very quickly. Since there were 35 cases where the remove-the-spaces approach did not produce an exact match, I should expect a comparable number of rows where at least one letter is unmapped. We can find these by looking for None entries:

Code
from typing import Optional

def has_unmapped(indices: list[Optional[int]]) -> bool:
    return any(
        index is None
        for index in indices
    )

df.text_indices.apply(has_unmapped).sum()
43

There are now 43 cases where some text does not exactly match. Given that the original transformation for finding the best match removed all of the spaces, are there any cases where the unmapped characters are just spaces?

Code
from typing import Optional
import pandas as pd

def has_unmapped_non_space(row: pd.Series) -> bool:
    letter_and_index: list[tuple[str, Optional[int]]] = list(zip(row.text, row.text_indices))
    return any(
        index is None
        for letter, index in letter_and_index
        if letter != " "
    )

df.apply(has_unmapped_non_space, axis="columns").sum()
27

This initially seems surprising; however, remember that there are manipulations of the text which would not result in a None appearing in the text_indices column. For example, if the original text has a spelling mistake where a letter is missing from a word, then mapping to the corrected preprocessed text just drops an index rather than introducing a None.
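
A toy pair (again my own, not from the data) shows this: a letter that only exists in the preprocessed text is dropped from the mapping rather than becoming None:

edit(original="transprt", preprocessed="transport")
# [0, 1, 2, 3, 4, 5, 7, 8]
# the "o" that the spelling correction added is deleted from the mapping,
# so every original character still has a real index and no None appears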

Mapping Word Indices to Character Indices

With this mapping I can substitute the original text for the preprocessed text and provide a character level mapping of the spans. This helps as the original text joins together words that were separate in the preprocessed text, most notably the punctuation with the preceding word. Doing this mapping should help any tokenizer identify the correct tokens to work with.

Code
from dataclasses import dataclass
from typing import TypedDict, Optional
import re

@dataclass(frozen=True)
class WordSpan:
    start_index: int
    end_index: int # this is the letter after the end

class CharacterIndices(TypedDict):
    aspect_start_index: int
    aspect_end_index: int
    aspect_term: str
    opinion_start_index: int
    opinion_end_index: int
    opinion_term: str
    sentiment: str

word_pattern = re.compile(r"\S+")

def row_to_character_indices(row: pd.Series) -> pd.Series:
    try:
        return pd.Series(
            to_character_indices(
                triplet=row.triples,
                preprocessed=row.preprocessed_text,
                text=row.text,
                text_indices=row.text_indices,
            )
        )
    except:
        print(f"failed to process row {row.name}")
        display(row)
        raise

def to_character_indices(
    *,
    triplet: tuple[tuple[int], tuple[int], str],
    preprocessed: str,
    text: str,
    text_indices: list[Optional[int]],
) -> CharacterIndices:
    def is_sequential(span: list[int]) -> bool:
        return all(
            span[index + 1] - span[index] == 1
            for index in range(len(span) - 1)
        )

    def find_start_index(span: WordSpan) -> int:
        # the starting letter in the lookup can be missing or None
        # this would cause a lookup failure
        # to recover from this we can find the following letter index and backtrack
        for index in range(span.start_index, span.end_index):
            try:
                text_index = text_indices.index(index)
                for _ in range(index - span.start_index):
                    if text_index - 1 <= 0:
                        break
                    if text_indices[text_index - 1] is not None:
                        break
                    text_index -= 1
                return text_index
            except ValueError:
                pass
                # not present in list
        raise ValueError(f"cannot find any part of {span}")

    def find_end_index(span: WordSpan) -> int:
        # the ending letter in the lookup can be missing or None
        # this would cause a lookup failure
        # to recover from this we can find the preceding letter index and backtrack
        for index in range(span.end_index - 1, span.start_index -1, -1):
            try:
                text_index = text_indices.index(index)
                for _ in range(span.end_index - index):
                    if text_index + 1 >= len(text_indices):
                        break
                    if text_indices[text_index + 1] is not None:
                        break
                    text_index += 1
                return text_index
            except ValueError:
                pass
                # not present in list
        raise ValueError(f"cannot find any part of {span}")

    def to_indices(span: list[int]) -> tuple[int, int]:
        word_start = span[0]
        word_start_span = word_indices[word_start]

        word_end = span[-1]
        word_end_span = word_indices[word_end]

        start_index = find_start_index(word_start_span)
        end_index = find_end_index(word_end_span)
        return start_index, end_index
    
    aspect_span, opinion_span, sentiment = triplet
    assert is_sequential(aspect_span), f"aspect span not sequential: {aspect_span}"
    assert is_sequential(opinion_span), f"opinion span not sequential: {opinion_span}"
    assert sentiment in {"POS", "NEG", "NEU"}, f"unknown sentiment: {sentiment}"

    word_indices = [
        WordSpan(start_index=match.start(), end_index=match.end())
        for match in word_pattern.finditer(preprocessed)
    ]

    aspect_start_index, aspect_end_index = to_indices(aspect_span)
    aspect_term = text[aspect_start_index:aspect_end_index+1]
    opinion_start_index, opinion_end_index = to_indices(opinion_span)
    opinion_term = text[opinion_start_index:opinion_end_index+1]

    nice_sentiment = {
        "POS": "positive",
        "NEG": "negative",
        "NEU": "neutral",
    }[sentiment]

    return {
        "aspect_start_index": aspect_start_index,
        "aspect_end_index": aspect_end_index,
        "aspect_term": aspect_term,
        "opinion_start_index": opinion_start_index,
        "opinion_end_index": opinion_end_index,
        "opinion_term": opinion_term,
        "sentiment": nice_sentiment,
    }
Code
df.explode("triples").apply(row_to_character_indices, axis="columns")
aspect_start_index aspect_end_index aspect_term opinion_start_index opinion_end_index opinion_term sentiment
0 33 40 shipping 1 10 No problem positive
1 27 40 word processor 48 56 adventure negative
2 45 53 TFT panel 58 63 broken negative
3 33 44 battery life 46 67 decreased very quickly negative
4 39 49 hard drives 30 33 swap neutral
... ... ... ... ... ... ... ...
897 40 51 battery life 56 64 excellent positive
898 10 17 keyboard 29 35 alright positive
898 42 46 plate 61 65 cheap negative
898 115 135 mouse command buttons 87 98 hollow sound negative
899 10 17 features 0 8 wonderful positive

1450 rows × 7 columns

This looks good to me. There were some issues where misspellings caused the start or end index to be unavailable; the backtracking in find_start_index and find_end_index is there to recover from that. For example, the preprocessed text had the opinion term easy to use but the original text was wasy to use (e->w, which is a believable mistake as the two keys are next to each other on the keyboard).

The final step is to produce the complete dataframes.

Code
import pandas as pd

def convert_sem_eval_text(
    aste_file: str,
    sem_eval_file: str,
    debug: bool = False,
) -> pd.DataFrame:
    df = get_original_text(
        aste_file=aste_file,
        sem_eval_file=sem_eval_file,
        debug=debug,
    )
    df = df.explode("triples")
    df = df.reset_index(drop=False)
    df["text_indices"] = df.apply(
        lambda row: edit(original=row.text, preprocessed=row.preprocessed_text),
        axis="columns",
    )
    df = df.merge(
        df.apply(row_to_character_indices, axis="columns"),
        left_index=True,
        right_index=True,
    )
    df = df.drop(columns=["preprocessed_text", "triples", "text_indices"])
    return df
Code
sem_eval_train_df = convert_sem_eval_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/train_triplets.txt",
    sem_eval_file="/data/aspect-sentiment/raw/sem-eval/2014/Laptop_Train_v2.xml",
    debug=True,
)

sem_eval_valid_df = convert_sem_eval_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/dev_triplets.txt",
    sem_eval_file="/data/aspect-sentiment/raw/sem-eval/2014/Laptop_Train_v2.xml",
    debug=True,
)

sem_eval_test_df = convert_sem_eval_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/test_triplets.txt",
    sem_eval_file="/data/aspect-sentiment/raw/sem-eval/2014/Laptops_Test_Data_PhaseA.xml",
    debug=True,
)
Read 900 rows
Had to use 35 approximate matches
Read 219 rows
Had to use 19 approximate matches
Read 328 rows
Had to use 11 approximate matches

It is not a given that the original Sem Eval text is better. As such it would be good to also create a version where the text is the preprocessed ASTE text and the same character level conversion has been done.

Code
import pandas as pd

def convert_aste_text(
    aste_file: str,
    debug: bool = False,
) -> pd.DataFrame:
    df = read_aste_file(aste_file)
    df = df.explode("triples")
    df = df.reset_index(drop=False)
    df = df.merge(
        df.apply(aste_row_to_character_indices, axis="columns"),
        left_index=True,
        right_index=True,
    )
    df = df.drop(columns=["triples"])
    return df

def aste_row_to_character_indices(row: pd.Series) -> pd.Series:
    try:
        return pd.Series(
            aste_to_character_indices(
                triplet=row.triples,
                text=row.text,
            )
        )
    except:
        print(f"failed to process row {row.name}")
        display(row)
        raise

def aste_to_character_indices(
    *,
    triplet: tuple[tuple[int], tuple[int], str],
    text: str,
) -> CharacterIndices:
    def is_sequential(span: list[int]) -> bool:
        return all(
            span[index + 1] - span[index] == 1
            for index in range(len(span) - 1)
        )

    def to_indices(span: list[int]) -> tuple[int, int]:
        word_start = span[0]
        word_start_span = word_indices[word_start]

        word_end = span[-1]
        word_end_span = word_indices[word_end]

        return word_start_span.start_index, word_end_span.end_index - 1
    
    aspect_span, opinion_span, sentiment = triplet
    assert is_sequential(aspect_span), f"aspect span not sequential: {aspect_span}"
    assert is_sequential(opinion_span), f"opinion span not sequential: {opinion_span}"
    assert sentiment in {"POS", "NEG", "NEU"}, f"unknown sentiment: {sentiment}"

    word_indices = [
        WordSpan(start_index=match.start(), end_index=match.end())
        for match in word_pattern.finditer(text)
    ]

    aspect_start_index, aspect_end_index = to_indices(aspect_span)
    aspect_term = text[aspect_start_index:aspect_end_index+1]
    opinion_start_index, opinion_end_index = to_indices(opinion_span)
    opinion_term = text[opinion_start_index:opinion_end_index+1]

    nice_sentiment = {
        "POS": "positive",
        "NEG": "negative",
        "NEU": "neutral",
    }[sentiment]

    return {
        "aspect_start_index": aspect_start_index,
        "aspect_end_index": aspect_end_index,
        "aspect_term": aspect_term,
        "opinion_start_index": opinion_start_index,
        "opinion_end_index": opinion_end_index,
        "opinion_term": opinion_term,
        "sentiment": nice_sentiment,
    }
Code
convert_aste_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/train_triplets.txt",
    debug=True,
)
index text aspect_start_index aspect_end_index aspect_term opinion_start_index opinion_end_index opinion_term sentiment
0 0 ( No problem with the ordering or shipping by ... 34 41 shipping 2 11 No problem positive
1 1 ) And printing from either word processor is a... 27 40 word processor 48 56 adventure negative
2 2 -Called headquarters again , they report that ... 46 54 TFT panel 59 64 broken negative
3 3 -Computer crashed frequently and battery life ... 33 44 battery life 46 67 decreased very quickly negative
4 4 -I propose that they can just swap the hard dr... 39 49 hard drives 30 33 swap neutral
... ... ... ... ... ... ... ... ... ...
1445 897 very convenient when you travel and the batter... 40 51 battery life 56 64 excellent positive
1446 898 while the keyboard itself is alright , the pla... 10 17 keyboard 29 35 alright positive
1447 898 while the keyboard itself is alright , the pla... 43 47 plate 62 66 cheap negative
1448 898 while the keyboard itself is alright , the pla... 116 136 mouse command buttons 88 99 hollow sound negative
1449 899 wonderful features . 10 17 features 0 8 wonderful positive

1450 rows × 9 columns

Code
aste_train_df = convert_aste_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/train_triplets.txt",
    debug=True,
)

aste_valid_df = convert_aste_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/dev_triplets.txt",
    debug=True,
)

aste_test_df = convert_aste_text(
    aste_file="/data/aspect-sentiment/raw/aspect-sentiment-triplet-extraction/laptop-2014/test_triplets.txt",
    debug=True,
)

With these I can create the dataset. Reading through the instructions at Share a dataset I can see that writing these files out with some additional metadata files should be sufficient.

Code
sem_eval_test_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/sem-eval/test.gz.parquet", compression="gzip")
sem_eval_valid_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/sem-eval/valid.gz.parquet", compression="gzip")
sem_eval_train_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/sem-eval/train.gz.parquet", compression="gzip")

aste_test_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/aste/test.gz.parquet", compression="gzip")
aste_valid_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/aste/valid.gz.parquet", compression="gzip")
aste_train_df.to_parquet("/data/aspect-sentiment/hub/aste-v2/data/2014/laptop/aste/train.gz.parquet", compression="gzip")

Huggingface Dataset Repository

I’ve had to create a repository to hold the dataset. Since there was a little work involved in this I am going to go over the steps here. You can see the dataset here, and the files of the repository can be viewed with the Files and versions tab.

You need to have a huggingface account; if you do not have one you can sign up here.

SSH Access to Huggingface

You need to add an SSH key to your profile. It’s sensible to generate a dedicated key for this purpose, which you can do with the command:

ssh-keygen -t ed25519 -f id_huggingface

This will generate the files id_huggingface and id_huggingface.pub. Put these in the .ssh folder in your home directory and add the following to the .ssh/config file:

Host hf.co
    User git
    IdentityFile ~/.ssh/id_huggingface

Then copy the content of the id_huggingface.pub file, which will be a single line. This is the public key and it is what you give to huggingface. Do not give out the content of the file that contains OPENSSH PRIVATE KEY.

To follow the rest of this process you need to have huggingface-hub and datasets installed. I created a poetry environment in the dataset project to manage this.

You can test that this has been done successfully using this command:

➜ ssh -T git@hf.co
Hi matthewfranglen, welcome to Hugging Face.

If you see your username in the greeting then it worked. If you see Hi anonymous, welcome to Hugging Face. then you need to check your configuration and the key you registered on huggingface.

Creating the Dataset

In theory you can create the repository using the huggingface cli. First check that you are logged in:

➜ huggingface-cli whoami
matthewfranglen

(I prefix this command with poetry run as this is within a poetry managed virtual environment, ymmv).

If you do not see your username then log in with:

➜ huggingface-cli login

Once you are logged in you can try to create the dataset with:

➜ huggingface-cli repo create NAME --type dataset

When I did this I got the following messages:

You are about to create datasets/matthewfranglen/aste-v2
Proceed? [Y/n]
403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create (Request ID: ...)

You don't have the rights to create a dataset under this namespace

It looks like my barely active account isn’t permitted to create datasets from the command line. Instead I used the web interface which was fine.

Once this is set up you can either clone the repository that has been created with:

➜ git clone git@hf.co:datasets/USERNAME/DATASET_NAME

Or add it as a remote with:

➜ git remote add huggingface git@hf.co:datasets/USERNAME/DATASET_NAME

There will already be a commit in the huggingface repository. If you already have a local repository then you can do the following to merge that commit in and set the upstream:

➜ git fetch huggingface
➜ git merge --no-ff --allow-unrelated-histories huggingface/main
➜ git branch --set-upstream-to=huggingface/main

It’s best to use SSH for the repository instead of https: the SSH key is already set up as your identity, so you won’t have to write your username and password to your .netrc file.

Dataset Data

Write the data to the folder, separating the data by subset (sem-eval and aste in this blog post) and split (train, valid and test in this blog post). I wrote the files above and this created the structure:

data/
  - aste/
    - train.gz.parquet
    - valid.gz.parquet
    - test.gz.parquet
  - sem-eval/
    - train.gz.parquet
    - valid.gz.parquet
    - test.gz.parquet

To add these to the repository you need to set up git lfs. The installation instructions are available here and there are apt repositories available for linux. Once you have installed it then you can register the use of lfs with the repository:

git lfs install

And then mark the files as tracked with:

git lfs track "data/*/*.gz.parquet"

This will alter the .gitattributes file in the root of the repository. You must do this before adding the files:

git add data/*/*.gz.parquet

Finally you can create the commit:

git commit

Dataset Metadata

The README.md file is both the dataset card and the metadata. The repository metadata is a block at the start of the file that begins and ends with a line containing only ---.

There are several parts to the metadata, the first we will cover is the definition of the data files. The documentation for this is available here.

For this dataset we have two configurations, one for the ASTE preprocessed text and one for the original Sem Eval text. This means our metadata starts as:

---
configs:
- config_name: sem-eval-2014
  data_files:
  - split: train
    path: "data/sem-eval/train.gz.parquet"
  - split: valid
    path: "data/sem-eval/valid.gz.parquet"
  - split: test
    path: "data/sem-eval/test.gz.parquet"
- config_name: aste-v2
  data_files:
  - split: train
    path: "data/aste/train.gz.parquet"
  - split: valid
    path: "data/aste/valid.gz.parquet"
  - split: test
    path: "data/aste/test.gz.parquet"
---
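
Once the data and metadata are pushed, each config should be loadable by name with the datasets library; the repository id below is the one created later in this post and the config name comes from the block above:

from datasets import load_dataset

# "sem-eval-2014" is the config_name defined in the metadata above
dataset = load_dataset("matthewfranglen/aste-v2", name="sem-eval-2014")
print(dataset["train"][0])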

The metadata has far more fields than the configs block shown above; they are documented here. Important fields are the pretty name, the languages of the dataset, the tasks the dataset can be used for, and the size:

pretty_name: "Aspect Sentiment Triplet Extraction v2"
language:
- en
size_categories:
- 1K<n<10K
task_categories:
- token-classification
- text-classification

The language is an ISO 639-1 code, which is the two letter version.

The size categories and task categories come from lists that are curated by huggingface. For these it’s best to use the web interface to edit the file, which will create a commit to the repository.

Dataset Description

The description is an important part of the dataset. If people don’t have a good idea what it is for then it won’t get used.

When writing technical documentation it can be helpful to review an existing high quality example. In my case I looked at the readme for the glue dataset. This gave me an idea of the sections that would be good, and I can look at the raw README to determine what markup was used.

With all of this I was able to create a reasonable dataset card.

The Other Datasets

I’ve only converted a single dataset here (there are three more restaurant datasets). Furthermore, it is not easy for others to verify that this conversion is reproducible.

If I put the conversion code in the repo then others would be able to use it to check my work. I could also make it easy to apply so that I can use it to convert the remaining datasets. Fetching the restaurant 2015 and 2016 data might be tricky though.