July 19, 2021

MODEL_NAME = "facebook/bart-base"
MAXIMUM_TOKEN_LENGTH = 64
BATCH_SIZE = 64
Aspect sentiment is the sentiment expressed towards specific entities in a text. For example, "The decor is not special at all but their food and amazing prices make up for it." expresses a different sentiment for the decor, the food and the prices.
I’ve been investigating an approach to aspect sentiment that is made of two parts: the first is to find the entities, and the second is to classify the sentiment towards each of them. Entity extraction already works well, so now it is time to try classifying the sentiment. Since the two tasks are so closely related, this will involve training both abilities in tandem. We can still evaluate them separately.
So let's get started. The first thing to do is to create the dataset. Here we want a set of labels that combines the two tasks.
The first two indexes will mark whether a token starts or ends an entity. The final index will be the sentiment (negative: 0, neutral: 1, positive: 2), corresponding to the three sentiment classes. The sentiment index is only valid on tokens that end an entity, so the last token of a positive entity gets a label like (0, 1, 2).
import pandas as pd
train_df = pd.read_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/train.gz.parquet")
validation_df = pd.read_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/validation.gz.parquet")
test_df = pd.read_parquet("/data/blog/2021-07-18-aspect-sentiment-dataset/test.gz.parquet")
#collapse
from typing import *
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
sentiment_index = {
"negative": 0,
"neutral": 1,
"positive": 2,
}
def encode(row: Dict[str, Any]) -> Dict[str, Any]:
text = row["text"]
entities = row["entities"]
span_starts = {entity["start"] for entity in entities}
span_ends = {entity["end"] for entity in entities}
end_sentiments = {
entity["end"]: sentiment_index[entity["sentiment"]]
for entity in entities
}
tokenized_text = tokenizer(
text,
return_offsets_mapping=True,
max_length=MAXIMUM_TOKEN_LENGTH,
truncation=True,
padding="max_length"
)
offset_mapping = tokenized_text["offset_mapping"]
boundaries = [
(
int(start in span_starts and start != end),
int(end in span_ends and start != end),
end_sentiments.get(end, 0)
)
for start, end in offset_mapping
]
return {
"input_ids": tokenized_text["input_ids"],
"attention_mask": tokenized_text["attention_mask"],
"label": boundaries,
}
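The dataframes then need to become datasets with this encoding applied. The exact conversion isn't shown here, but a minimal sketch, assuming the Hugging Face datasets library, would be:
import datasets

# turn each dataframe into a Dataset and add the input_ids, attention_mask and label columns
train_ds = datasets.Dataset.from_pandas(train_df).map(encode)
validation_ds = datasets.Dataset.from_pandas(validation_df).map(encode)
test_ds = datasets.Dataset.from_pandas(test_df).map(encode)
train_ds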
Dataset({
features: ['attention_mask', 'entities', 'input_ids', 'label', 'text'],
num_rows: 4297
})
Now we can adjust the entity extraction model that we previously used to predict sentiment as well.
Calculating the loss for the sentiment is slightly tricky. We have to extract the predictions from the end of the entities, which will vary per row, so it has to be flattened into the individual token predictions first.
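As a toy illustration of that flattening (made-up tensors, not the real batch), reshaping the labels to one row per token and masking on the entity-end column picks out the sentiment targets:
import torch

# two sequences of three tokens, each label is (start, end, sentiment)
labels = torch.tensor([
    [[0, 0, 0], [1, 0, 0], [0, 1, 2]],  # one entity spanning tokens 1-2 with positive sentiment
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],  # no entity
])
flat_labels = labels.reshape(-1, 3)  # one row per token
end_mask = flat_labels[:, 1] > 0     # tokens that end an entity
flat_labels[end_mask, 2]             # tensor([2]) - the sentiment class at the entity end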
from typing import *
from transformers import BartModel, AutoConfig
import torch
class EntitySentimentSequenceClassifier(BartModel):
def __init__(self, config: AutoConfig) -> None:
config.num_labels = 5 # start and copy, end and copy, negative, neutral, positive
super().__init__(config)
# bart model for sequence classification actually has a more complex classification head
self.score = torch.nn.Linear(
in_features=config.d_model,
out_features=config.num_labels,
bias=False,
)
def forward(
self,
input_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, ...]:
outputs = super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
)
hidden_states = outputs[0] # last hidden state
predictions = self.score(hidden_states)
if labels is not None:
entity_loss = torch.nn.functional.binary_cross_entropy_with_logits(
predictions[:, :, :2],
labels[:, :, :2].float(),
)
flat_predictions = predictions.reshape(-1, 5)
flat_labels = labels.reshape(-1, 3)
end_mask = flat_labels[:, 1] > 0
sentiment_predictions = flat_predictions[end_mask, 2:]
sentiment_targets = flat_labels[end_mask, 2]
sentiment_loss = torch.nn.functional.cross_entropy(
sentiment_predictions,
sentiment_targets
)
loss = entity_loss + sentiment_loss
return (loss, predictions)
return (predictions,)
We have our dataset and the model, so let's try training it. At some point I really should write some metrics for this.
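The model is presumably created from the pretrained checkpoint along these lines, which is what produces the warning below:
model = EntitySentimentSequenceClassifier.from_pretrained(MODEL_NAME)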
Some weights of EntitySentimentSequenceClassifier were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['model.score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from pathlib import Path
from transformers import Trainer, TrainingArguments
MODEL_RUN_FOLDER = Path("/data/blog/2021-07-19-aspect-sentiment-training/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)
training_args = TrainingArguments(
report_to=[],
output_dir=MODEL_RUN_FOLDER / "output",
overwrite_output_dir=True,
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
learning_rate=5e-5,
num_train_epochs=5,
evaluation_strategy="epoch",
logging_dir=MODEL_RUN_FOLDER / "output",
logging_steps=100,
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_ds,
eval_dataset=validation_ds,
tokenizer=tokenizer,
# compute_metrics=compute_metrics,
)
trainer.train()
Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
---|---|---|---|---|
1 | No log | 0.575488 | 0.818000 | 611.282000 |
2 | 0.691700 | 0.537444 | 0.825500 | 605.696000 |
3 | 0.451700 | 0.535601 | 0.864200 | 578.592000 |
4 | 0.451700 | 0.550816 | 0.841800 | 593.981000 |
5 | 0.336500 | 0.583033 | 0.844600 | 592.006000 |
TrainOutput(global_step=340, training_loss=0.4691026210784912, metrics={'train_runtime': 162.3442, 'train_samples_per_second': 2.094, 'total_flos': 1150283573821440.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 2136219648, 'init_mem_gpu_alloc_delta': 558472192, 'init_mem_cpu_peaked_delta': 380579840, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 409477120, 'train_mem_gpu_alloc_delta': 2341379072, 'train_mem_cpu_peaked_delta': 720564224, 'train_mem_gpu_peaked_delta': 3380646400})
A proper evaluation would be nice. For now let's see what entity sentiment the model can extract from an example sentence.
sentiment_names = ["negative", "neutral", "positive"]
def aspect_sentiment(text: str) -> List[Tuple[str, str]]:
tokenized_text = tokenizer(text, return_tensors="pt")
with torch.no_grad():
input_ids = tokenized_text["input_ids"].to(model.device)
output = model(input_ids=input_ids)[0]
entity_boundaries = output[:, :, :2] > 0.
entity_mask = (output[:, :, 1] > 0.).flatten()
entity_sentiment = (
output.reshape(-1, 5)
[entity_mask]
[:, 2:]
.argmax(dim=-1)
)
entities = tokenizer.batch_decode([
[input_id]
for input_id, boundaries in zip(tokenized_text["input_ids"][0], entity_boundaries[0])
if True in boundaries
])
return [
(entity, sentiment_names[sentiment])
for entity, sentiment in zip(entities, entity_sentiment.tolist())
]
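For example, the model could be run over the review sentence from the start of the post:
aspect_sentiment(
    "The decor is not special at all but their food and amazing prices make up for it."
)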