from pathlib import Path

PROJECT_NAME = "domain-shift"
DATA_FOLDER = Path("/data/blog/2022-01-28-domain-shift-by-embedding-replacement")
DATA_FOLDER.mkdir(parents=True, exist_ok=True)

January 28, 2022
An NLP model performs a specific task. To do this it has expectations about the words and phrases it will encounter. This is why training a task-specific model from a general language model works so well - the general language model has already learned those expectations, and the task-specific model can then use them to perform the task.
This all becomes a problem when the expectations are wrong. Consider science and cooking: the word chemical has very different meanings in each. In science it is merely a descriptive word, while describing food as chemical has strong negative connotations. A model whose expectations are appropriate for scientific writing will not perform well when used on an article by a food critic.
The word chemical hasn't changed its strict meaning - it still means a basic substance, and cooking can even be considered chemistry. Words are more than their dictionary definitions though. The difference in meaning arises because scientists and cooks are in two different domains (a domain being a sphere of activity, influence, or knowledge).
If we want to take a model that works well in one domain and use it in another, we have to shift domains. This post is an exploration of that process.
My idea is that a general-purpose sentiment model trained without altering the embedding layer can be shifted to a specific domain by retraining only the embeddings of the original language model.
The task will be sentiment analysis. The general-purpose sentiment model will be trained using the Sentiment140 dataset (Go, Bhayani, and Huang 2009). This trained model will then be transformed into a domain-specific model using the Multi-Domain Sentiment dataset (Blitzer, Dredze, and Pereira 2007). These datasets only contain positive and negative sentiment text.
This experiment will evaluate several different models to see how they perform on the domain-specific datasets with and without retraining.
The datasets need to be restructured to have the text to classify and the target sentiment.
The Sentiment140 dataset has sentiment as an integer with values 0 (negative) and 4 (positive), along with several bits of metadata that are not interesting for this task.
# from src/main/python/blog/domain_shift/data/sentiment140.py
from pathlib import Path
import pandas as pd
def load_sentiment140(path: Path) -> pd.DataFrame:
df = pd.read_csv(
path,
names=["sentiment", "id", "date", "query", "user", "text"],
encoding="ISO-8859-1",
)
df = df[["sentiment", "text"]].copy()
# The sentiment column contains two values, 0 and 4.
    # There are 800,000 rows of each.
# Example sentiment 0 row: my whole body feels itchy and like its on fire
# Example sentiment 4 row: Happy 38th Birthday to my boo of alll time!!!
df["sentiment"] = df.sentiment.map({0: "negative", 4: "positive"})
return df
#collapse
from pathlib import Path
GENERAL_DATASET = Path("/data/sentiment/sentiment140/sentiment140.zip")
general_df = load_sentiment140(GENERAL_DATASET)
general_df.to_parquet(
"/data/sentiment/sentiment140/sentiment.gz.parquet",
compression="gzip"
)
display(
general_df.sentiment
.value_counts()
.to_frame()
)
general_df
| sentiment | count |
|---|---|
| negative | 800000 |
| positive | 800000 |
|  | sentiment | text |
|---|---|---|
| 0 | negative | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | negative | is upset that he can't update his Facebook by ... |
| 2 | negative | @Kenichan I dived many times for the ball. Man... |
| 3 | negative | my whole body feels itchy and like its on fire |
| 4 | negative | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... |
| 1599995 | positive | Just woke up. Having no school is the best fee... |
| 1599996 | positive | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | positive | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | positive | Happy 38th Birthday to my boo of alll time!!! ... |
| 1599999 | positive | happy #charitytuesday @theNSPCC @SparksCharity... |
1600000 rows × 2 columns
The Multi-Domain Sentiment dataset is encoded in an almost-XML file that needs some preprocessing. The file mixes the raw `&` character, which is illegal in XML, with the correctly encoded `&amp;` entity, there are characters that are out of range for the default pandas XML parser, and finally there is no root node for the file.
Once all of these have been fixed the dataset is quite rich with \(rating \in \{ 1, 2, 4, 5 \}\). I’m going to consider \(negative \in \{ 1, 2 \}\) and \(positive \in \{ 4, 5 \}\). I’m only taking the text of the review, not the title, and some of them have no text.
Finally the domains are not evenly distributed. To ensure that there is enough training and evaluation data the top 5 domains are being used.
# from src/main/python/blog/domain_shift/data/multi_domain_sentiment.py
from pathlib import Path
from typing import Tuple, Union
import pandas as pd
from lxml import etree
def load_multi_domain_sentiment(folder: Path) -> pd.DataFrame:
# Loading these files requires quite a lot of preprocessing. This is split
# into individual cleaning methods which are composed in the load_file
# method below. The last section reads all of the different reviews and
# filters them to the top 5 by domain volume.
def load_all_files(folder: Path) -> pd.DataFrame:
files = sorted(folder.glob("*/all.review"))
df = pd.concat([load_file(path) for path in files])
df = filter_to_top_5_domains(df)
return df
def filter_to_top_5_domains(df: pd.DataFrame) -> pd.DataFrame:
# The number of reviews for each domain vary, from a few hundred to over
# ten thousand. To ensure that there is enough data to train a model we
# will take the top 5 domains by volume.
top_5_domains = df.domain.value_counts()[:5].index
df = df[df.domain.isin(top_5_domains)]
return df
def load_file(path: Path) -> pd.DataFrame:
df = read_file(path)
df = split_helpful_column(df)
df = clean_rating(df)
df = clean_text(df)
df = parse_date(df)
df = drop_unrelated_columns(df)
df = rating_to_sentiment(df)
return df
def read_file(path: Path) -> pd.DataFrame:
# This reads the file from disk. There are three problems with the data
# that need to be addressed before it can be loaded:
# The data is stored in an xml-like structure where each data row is contained in a node.
# There is no root node so the document is not valid xml.
xml = path.read_text(encoding="ISO-8859-1")
xml = f"<node>{xml}</node>"
        # Secondly the ampersand symbol is not consistently escaped, sometimes
        # appearing as a raw & and sometimes as the &amp; entity. First decode
        # any &amp; entities, then re-encode every ampersand consistently.
        xml = xml.replace("&amp;", "&").replace("&", "&amp;")
# Finally there are invalid characters in the document as the document
# seems to lack a consistent encoding. It is possible that the xml like
# structure is in ISO-8859-1 and the contents of each field are in
# UTF-8?
parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.fromstring(xml, parser=parser)
xml = etree.tostring(tree, encoding="utf-8")
# Pandas can load dataframes from xml. In order to get the character
# re-encoding to work we must use the etree parser that we used to
# reencode the xml, instead of the libxml parser (which is faster and
# is the default).
return pd.read_xml(xml, parser="etree")
def split_helpful_column(df: pd.DataFrame) -> pd.DataFrame:
# There is a "helpful" column which is a review of the review by other
# users. A review that is more consistently marked as helpful may be
# higher quality.
# If this is present then it is a string of the form "N of M" where N
# is the number of users that considered the review helpful. If no-one
# has reviewed the column then this value is missing.
def parse_helpful(value: Union[str, float]) -> Tuple[int, int]:
if not isinstance(value, str):
return (0, 0)
helpful, total = value.split(" of ")
return int(helpful), int(total)
def get_helpful(row: Tuple[int, int]) -> int:
return row[0]
def get_unhelpful(row: Tuple[int, int]) -> int:
helpful, total = row
return total - helpful
helpful_total = df.helpful.apply(parse_helpful)
df["helpful"] = helpful_total.apply(get_helpful)
df["unhelpful"] = helpful_total.apply(get_unhelpful)
return df
def clean_rating(df: pd.DataFrame) -> pd.DataFrame:
# The rating column is the sentiment proxy for the text. Some rows are
# missing a value for this, and so cannot be used. Once they have been
# dropped the column can be converted from a float to an int.
df = df.dropna(subset=["rating"]).copy()
df["rating"] = df.rating.astype(int)
return df
def clean_text(df: pd.DataFrame) -> pd.DataFrame:
# Some reviews only have a title and no text body. We are not
# considering the title for this so any rows that are considered blank
# or too short have to be dropped.
df = df[df.review_text.str.len() > 10]
return df
def parse_date(df: pd.DataFrame) -> pd.DataFrame:
df["date"] = pd.to_datetime(df.date)
return df
def drop_unrelated_columns(df: pd.DataFrame) -> pd.DataFrame:
# The product_type column is the domain, the rating is the sentiment
# and the review_text is the text.
df = df[["product_type", "rating", "review_text"]].copy()
df = df.rename(
columns={
"product_type": "domain",
"review_text": "text",
}
)
return df
def rating_to_sentiment(df: pd.DataFrame) -> pd.DataFrame:
# The rating contains values 1, 2, 4, and 5, being the number of stars
# assigned to the review by the reviewer.
df["sentiment"] = df.rating.map(
{1: "negative", 2: "negative", 4: "positive", 5: "positive"}
)
df = df.drop(columns=["rating"])
return df
return load_all_files(folder)
#collapse
from pathlib import Path
DOMAIN_DATASET_FOLDER = Path("/data/sentiment/multi-domain-sentiment/sorted_data")
domain_df = load_multi_domain_sentiment(DOMAIN_DATASET_FOLDER)
domain_df.to_parquet(
"/data/sentiment/multi-domain-sentiment/sentiment-top-5.gz.parquet",
compression="gzip"
)
display(
domain_df[["domain", "sentiment"]]
.value_counts()
.to_frame()
.rename(columns={0: "count"})
.reset_index()
.sort_values(by=["domain", "sentiment"], ascending=[True, True])
.set_index(["domain", "sentiment"])
)
domain_df
| domain | sentiment | count |
|---|---|---|
| electronics | negative | 5048 |
| electronics | positive | 17959 |
| kitchen & housewares | negative | 4119 |
| kitchen & housewares | positive | 15737 |
| music | negative | 2441 |
| music | positive | 14587 |
| toys & games | negative | 2568 |
| toys & games | positive | 10579 |
| video | negative | 2587 |
| video | positive | 12764 |
|  | domain | text | sentiment |
|---|---|---|---|
| 0 | electronics | I have bought and returned three of these unit... | negative |
| 1 | electronics | I used a 25 pack of these doing DVD backups, a... | negative |
| 2 | electronics | I bought these discs at CompUSA because I need... | negative |
| 3 | electronics | The DVDs I burned successfully showed the movi... | negative |
| 4 | electronics | Please don't expect to get the cash back from ... | negative |
| ... | ... | ... | ... |
| 15346 | video | After watching this documentary, I was left th... | positive |
| 15347 | video | I finally made my first purchase from Amazon's... | positive |
| 15348 | video | Don't buy this disc unless you are a real Jack... | negative |
| 15349 | video | Oh my goodness, they've outlawed sex! That is ... | positive |
| 15350 | video | In this erotic science fiction film from the f... | positive |
88389 rows × 3 columns
# from src/main/python/blog/domain_shift/data/balance_domain.py
import datasets
import pandas as pd
def make_product_dataset(domain_df: pd.DataFrame, domain: str) -> datasets.DatasetDict:
"""
This creates a balanced dataset that is limited to the specified domain.
"""
df = domain_df[domain_df.domain == domain]
# sample the dataframe to balance the sentiment classes
positive_df = df[df.sentiment == "positive"]
negative_df = df[df.sentiment == "negative"]
smaller_size = min(len(positive_df), len(negative_df))
positive_df = positive_df.sample(n=smaller_size)
negative_df = negative_df.sample(n=smaller_size)
# recombine and shuffle
df = pd.concat([positive_df, negative_df]).sample(frac=1)
test_size = min(1_000, len(df) // 4)
return datasets.Dataset.from_pandas(df).train_test_split(test_size=test_size)
#collapse
electronics_ds = make_product_dataset(domain_df, "electronics")
electronics_ds.save_to_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-electronics.dataset"
)
kitchen_ds = make_product_dataset(domain_df, "kitchen & housewares")
kitchen_ds.save_to_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-kitchen.dataset"
)
music_ds = make_product_dataset(domain_df, "music")
music_ds.save_to_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-music.dataset"
)
toys_ds = make_product_dataset(domain_df, "toys & games")
toys_ds.save_to_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-toys.dataset"
)
video_ds = make_product_dataset(domain_df, "video")
video_ds.save_to_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-video.dataset"
)
This step is mostly to make it easier to work with this notebook - by saving the datasets and reloading them here I can rerun later parts without redoing the earlier processing.
#collapse
import datasets
general_ds = datasets.load_from_disk(
"/data/sentiment/sentiment140/sentiment.dataset"
)
electronics_ds = datasets.load_from_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-electronics.dataset"
)
kitchen_ds = datasets.load_from_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-kitchen.dataset"
)
music_ds = datasets.load_from_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-music.dataset"
)
toys_ds = datasets.load_from_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-toys.dataset"
)
video_ds = datasets.load_from_disk(
"/data/sentiment/multi-domain-sentiment/sentiment-video.dataset"
)
Training the different models needs to be consistent and the easiest way to produce consistency is to use the same code. Here I am defining the different methods that are required to train and evaluate the models.
To consistently train the different models we have a set of methods:
train_classifier_full
This trains a normal classifier on the dataset. The classifier can adjust any parameters in the entire model. This provides a baseline to measure against as this domain specific classifier should be the best achievable performance.
train_classifier_base
This trains a classifier with a frozen embedding layer. The classifier can be adjusted to become domain specific by swapping out the embedding layer. This provides the base for the domain specific classifier.
train_language_model_embedding
This trains an embedding by language model pretraining. The embedding layer is the only part of the model that can be adjusted. This can be swapped into the base classifier to make it domain specific.
get_embedding_parameters_bert
This method returns all of the parameters in the model that form the embedding layer. The train_classifier_base and train_language_model_embedding methods use this to either freeze the embedding layer, or freeze the model and unfreeze the embedding layer.
# from src/main/python/blog/domain_shift/model/train_classifier.py
from pathlib import Path
from typing import Callable, Dict, List, Optional
import datasets
import torch
import wandb
from transformers import (
AutoModel,
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments,
)
from transformers.trainer_utils import EvalPrediction
def train_classifier_full(
ds: datasets.Dataset,
*,
project_name: str,
model_name: str,
dataset_name: str,
data_folder: Path,
metric: Callable[[EvalPrediction], Dict[str, float]],
batch_size: int,
epochs: float = 5,
**settings,
) -> None:
"""
This trains the classifier for the single purpose of classifying this
dataset. The training process has full freedom to alter any and all
parameters in the model. This should produce a model with the best
performance possible.
"""
train_classifier(
ds=ds,
train_name="full",
project_name=project_name,
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
metric=metric,
batch_size=batch_size,
epochs=epochs,
**settings,
)
def train_classifier_base(
ds: datasets.Dataset,
*,
project_name: str,
model_name: str,
embedding_accessor: Callable[[AutoModel], List[torch.nn.Parameter]],
dataset_name: str,
data_folder: Path,
metric: Callable[[EvalPrediction], Dict[str, float]],
batch_size: int,
epochs: float = 5,
**settings,
) -> None:
"""
This trains the classifier for the single purpose of classifying this
dataset. The training process can alter any parameters except for the
initial embedding layer. This should produce a model with good
performance which is compatible with a retrained embedding layer.
"""
def model_preparation(model: AutoModelForSequenceClassification) -> None:
for parameter in embedding_accessor(model):
parameter.requires_grad_(False)
train_classifier(
ds=ds,
train_name="no-embedding",
project_name=project_name,
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
metric=metric,
batch_size=batch_size,
epochs=epochs,
model_preparation=model_preparation,
**settings,
)
def train_classifier(
ds: datasets.Dataset,
*,
train_name: str,
project_name: str,
model_name: str,
dataset_name: str,
data_folder: Path,
metric: Callable[[EvalPrediction], Dict[str, float]],
batch_size: int,
epochs: float = 5,
model_preparation: Optional[
Callable[[AutoModelForSequenceClassification], None]
] = None,
**settings,
) -> None:
"""
This trains the classifier for the single purpose of classifying this
dataset. The model_preparation function, if provided, can alter the model
to freeze or alter layers as appropriate.
"""
# Set default values for training, which can be overridden with the settings
training_arguments = {
"per_device_train_batch_size": batch_size,
"per_device_eval_batch_size": batch_size,
"num_train_epochs": epochs,
"learning_rate": 5e-5,
"warmup_ratio": 0.06,
"logging_steps": 1_000,
"save_steps": 1_000,
"eval_steps": 1_000,
"metric_for_best_model": "accuracy",
"greater_is_better": True,
} | settings
run_name = f"{train_name}-{model_name}-{dataset_name}-{batch_size}bs-{epochs}e"
model_run_folder = data_folder / "runs" / run_name
model_run_folder.mkdir(parents=True, exist_ok=True)
best_model_folder = data_folder / "best-model" / run_name
best_model_folder.mkdir(parents=True, exist_ok=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
if model_preparation is not None:
model_preparation(model)
tokenizer = AutoTokenizer.from_pretrained(model_name)
with wandb.init(
project=project_name,
name=run_name,
mode="online",
):
training_args = TrainingArguments(
report_to=["wandb"],
output_dir=model_run_folder / "output",
logging_dir=model_run_folder / "output",
overwrite_output_dir=True,
evaluation_strategy="steps",
load_best_model_at_end=True,
**training_arguments,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds["train"],
eval_dataset=ds["test"],
tokenizer=tokenizer,
compute_metrics=metric,
)
trainer.train()
model.save_pretrained(best_model_folder)
# from src/main/python/blog/domain_shift/model/train_language_model.py
from pathlib import Path
from typing import Callable, Dict, List, Optional
import datasets
import torch
import wandb
from transformers import (
AutoModel,
AutoModelForMaskedLM,
AutoTokenizer,
DataCollatorForLanguageModeling,
Trainer,
TrainingArguments,
)
from transformers.trainer_utils import EvalPrediction
def train_language_model_embedding(
ds: datasets.Dataset,
*,
project_name: str,
model_name: str,
embedding_accessor: Callable[[AutoModel], List[torch.nn.Parameter]],
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
metric: Optional[Callable[[EvalPrediction], Dict[str, float]]] = None,
**settings,
) -> None:
"""
This trains the embedding layer of the language model using language model
pretraining. This involves adjusting the model to better match the domain
specific language use.
"""
def model_preparation(model: AutoModelForMaskedLM) -> None:
# disable gradient updates on the model
model.requires_grad_(False)
# enable gradient updates on the embedding
for parameter in embedding_accessor(model):
parameter.requires_grad_(True)
train_language_model(
ds=ds,
train_name="embedding",
project_name=project_name,
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
batch_size=batch_size,
epochs=epochs,
metric=metric,
model_preparation=model_preparation,
**settings,
)
def train_language_model(
ds: datasets.Dataset,
*,
train_name: str,
project_name: str,
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
metric: Optional[Callable[[EvalPrediction], Dict[str, float]]] = None,
model_preparation: Optional[Callable[[AutoModelForMaskedLM], None]] = None,
save_preparation: Optional[Callable[[AutoModelForMaskedLM], None]] = None,
**settings,
) -> None:
# Set default values for training, which can be overridden with the settings
training_arguments = {
"per_device_train_batch_size": batch_size,
"per_device_eval_batch_size": batch_size,
"num_train_epochs": epochs,
"learning_rate": 5e-5,
"warmup_ratio": 0.06,
"logging_steps": 1_000,
"save_steps": 1_000,
"eval_steps": 1_000,
"metric_for_best_model": "loss",
"greater_is_better": False,
} | settings
run_name = f"{train_name}-{model_name}-{dataset_name}-{batch_size}bs-{epochs}e"
model_run_folder = data_folder / "runs" / run_name
model_run_folder.mkdir(parents=True, exist_ok=True)
best_model_folder = data_folder / "best-model" / run_name
best_model_folder.mkdir(parents=True, exist_ok=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)
if model_preparation is not None:
model_preparation(model)
tokenizer = AutoTokenizer.from_pretrained(model_name)
non_text_columns = set(ds["train"].column_names) - set(["input_ids"])
ds = ds.remove_columns(non_text_columns)
# there is a problem running the evaluation over more than 100 rows
test_ds = datasets.Dataset.from_dict(ds["test"][:100])
with wandb.init(
project=project_name,
name=run_name,
mode="online",
):
training_args = TrainingArguments(
report_to=["wandb"],
output_dir=model_run_folder / "output",
logging_dir=model_run_folder / "output",
overwrite_output_dir=True,
evaluation_strategy="steps",
load_best_model_at_end=True,
**training_arguments,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds["train"],
eval_dataset=test_ds,
data_collator=DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True,
),
tokenizer=tokenizer,
compute_metrics=metric,
)
trainer.train()
if save_preparation is not None:
save_preparation(model)
model.save_pretrained(best_model_folder)
# from src/main/python/blog/domain_shift/model/embedding.py
from typing import List
import torch
from transformers import BertModel
def get_embedding_parameters_bert(model: BertModel) -> List[torch.nn.Parameter]:
# Given a classification model, base_model returns the core bert model
# without the classification head. Given the core bert model, base_model
# returns the core bert model again! This means this approach works with
# any kind of bert model.
return list(model.base_model.embeddings.parameters())
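As a quick check of the accessor, it should return the same five parameter tensors (the word, position and token type embeddings plus the LayerNorm weight and bias) whether it is given a bare BERT model or one with a classification head. A minimal sketch, assuming the bert-base-uncased checkpoint:
# A quick check of the accessor, assuming the bert-base-uncased checkpoint
# (the exact checkpoint used for the runs is not restated here).
from transformers import AutoModel, AutoModelForSequenceClassification

bare_model = AutoModel.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# base_model strips the classification head when there is one, so the same
# accessor works for both kinds of model.
print(len(get_embedding_parameters_bert(bare_model)))   # 5 parameter tensors
print(len(get_embedding_parameters_bert(classifier)))   # 5 parameter tensors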
After training the model we need a way to evaluate it.
evaluate_classifier
This evaluates a classification model without altering it.
evaluate_combined_classifier
This evaluates a classification model made from a base model combined with the embedding layer of a pretrained language model.
# from src/main/python/blog/domain_shift/model/evaluate.py
from pathlib import Path
from typing import Callable, Dict, List, Optional
import datasets
import torch
from transformers import (
AutoModel,
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments,
)
from transformers.trainer_utils import EvalPrediction
@torch.no_grad()
def evaluate_classifier(
ds: datasets.Dataset,
*,
model_name: str,
model: AutoModelForSequenceClassification,
batch_size: int,
data_folder: Path,
metric: Optional[Callable[[EvalPrediction], Dict[str, float]]] = None,
) -> Dict[str, float]:
model_run_folder = data_folder / "evaluation"
model_run_folder.mkdir(parents=True, exist_ok=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
training_args = TrainingArguments(
report_to=[],
output_dir=model_run_folder / "output",
logging_dir=model_run_folder / "output",
overwrite_output_dir=True,
num_train_epochs=1,
per_device_eval_batch_size=batch_size,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds["train"],
eval_dataset=ds["test"],
tokenizer=tokenizer,
compute_metrics=metric,
)
return trainer.evaluate()
@torch.no_grad()
def evaluate_combined_classifier(
ds: datasets.Dataset,
*,
model_name: str,
base_model: AutoModelForSequenceClassification,
embedding_model: AutoModel,
embedding_accessor: Callable[[AutoModel], List[torch.nn.Parameter]],
batch_size: int,
data_folder: Path,
metric: Optional[Callable[[EvalPrediction], Dict[str, float]]] = None,
) -> Dict[str, float]:
model_run_folder = data_folder / "evaluation"
model_run_folder.mkdir(parents=True, exist_ok=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for model_parameter, embedding_parameter in zip(
embedding_accessor(base_model),
embedding_accessor(embedding_model),
):
model_parameter.data = embedding_parameter.data
base_model.eval()
training_args = TrainingArguments(
report_to=[],
output_dir=model_run_folder / "output",
logging_dir=model_run_folder / "output",
overwrite_output_dir=True,
num_train_epochs=1,
per_device_eval_batch_size=batch_size,
)
trainer = Trainer(
model=base_model,
args=training_args,
train_dataset=ds["train"],
eval_dataset=ds["test"],
tokenizer=tokenizer,
compute_metrics=metric,
)
return trainer.evaluate()
To load the models we also have:
load_classifier_full
This loads a classifier model created by train_classifier_full.
load_classifier_base
This loads a classifier model created by train_classifier_base.
load_language_model_embedding
This loads a language model created by train_language_model_embedding.
# from src/main/python/blog/domain_shift/model/load.py
from pathlib import Path
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification
def load_classifier_full(
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForSequenceClassification:
return load_classifier(
train_name="full",
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
batch_size=batch_size,
epochs=epochs,
)
def load_classifier_base(
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForSequenceClassification:
return load_classifier(
train_name="no-embedding",
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
batch_size=batch_size,
epochs=epochs,
)
def load_classifier(
*,
train_name: str,
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForSequenceClassification:
run_name = f"{train_name}-{model_name}-{dataset_name}-{batch_size}bs-{epochs}e"
best_model_folder = data_folder / "best-model" / run_name
return AutoModelForSequenceClassification.from_pretrained(best_model_folder)
def load_language_model_embedding(
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForSequenceClassification:
return load_language_model(
train_name="embedding",
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
batch_size=batch_size,
epochs=epochs,
)
def load_language_model_embedding_overlay(
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForSequenceClassification:
return load_language_model(
train_name="embedding-overlay",
model_name=model_name,
dataset_name=dataset_name,
data_folder=data_folder,
batch_size=batch_size,
epochs=epochs,
)
def load_language_model(
*,
train_name: str,
model_name: str,
dataset_name: str,
data_folder: Path,
batch_size: int,
epochs: float = 5,
) -> AutoModelForMaskedLM:
run_name = f"{train_name}-{model_name}-{dataset_name}-{batch_size}bs-{epochs}e"
best_model_folder = data_folder / "best-model" / run_name
return AutoModelForMaskedLM.from_pretrained(best_model_folder)
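To show how these pieces fit together, the sketch below loads the frozen-embedding base classifier and a domain-retrained embedding model, then evaluates the combination on that domain's dataset. The dataset names, batch sizes and epochs here are assumptions for illustration rather than a record of the actual runs.
# Sketch only: dataset names and hyperparameters are assumed values.
# MODEL_NAME is the checkpoint name from the setup cell (assumed to be a BERT
# checkpoint such as bert-base-uncased).
base_classifier = load_classifier_base(
    model_name=MODEL_NAME,
    dataset_name="sentiment140",
    data_folder=DATA_FOLDER,
    batch_size=64,
    epochs=5,
)
electronics_embedding = load_language_model_embedding(
    model_name=MODEL_NAME,
    dataset_name="electronics",
    data_folder=DATA_FOLDER,
    batch_size=16,
    epochs=5,
)
evaluate_combined_classifier(
    electronics_ds,
    model_name=MODEL_NAME,
    base_model=base_classifier,
    embedding_model=electronics_embedding,
    embedding_accessor=get_embedding_parameters_bert,
    batch_size=64,
    data_folder=DATA_FOLDER,
    metric=metric_accuracy,  # defined in the metrics section below
)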
# from src/main/python/blog/domain_shift/model/layer/embedding_overlay.py
import torch
from torch import nn
from transformers.models.bert.modeling_bert import BertEmbeddings, BertModel
class EmbeddingOverlay(nn.Module):
def __init__(self, embedding: BertEmbeddings, device: str) -> None:
super().__init__()
self.embedding = embedding
self.base = embedding.word_embeddings.weight
self.overlay = torch.zeros_like(self.base, device=device)
embedding.word_embeddings.weight = nn.Parameter(
torch.zeros_like(self.base, device=device)
)
@classmethod
def update_model(cls, model: BertModel) -> None:
embedding = cls(
model.base_model.embeddings,
device="cuda" if torch.cuda.is_available() else "cpu",
)
model.base_model.embeddings = embedding
model.requires_grad_(False)
embedding.overlay.requires_grad_(True)
@staticmethod
def restore_model(model: BertModel) -> None:
embedding = model.base_model.embeddings.to_embedding()
model.base_model.embeddings = embedding
def forward(self, *args, **kwargs) -> torch.Tensor:
return self.to_embedding().forward(*args, **kwargs)
def to_embedding(self) -> BertEmbeddings:
self.embedding.word_embeddings.weight = nn.Parameter(self.base + self.overlay)
return self.embedding
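The EmbeddingOverlay keeps the original word embeddings fixed and trains only an additive overlay on top of them; restore_model then folds the overlay back into a normal BertEmbeddings module. A minimal sketch of the intended usage, assuming the bert-base-uncased checkpoint:
# A minimal sketch of the intended usage, assuming the bert-base-uncased
# checkpoint; the actual runs go through the train_language_model helpers.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Swap the embeddings for base + trainable overlay, freezing everything else.
EmbeddingOverlay.update_model(model)

# ... masked language model training would go here, updating only the overlay ...

# Fold the overlay back into a regular BertEmbeddings module before saving.
EmbeddingOverlay.restore_model(model)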
First we need to define the code to train and evaluate the models, then we can run the experiments.
We need a way to measure the performance of the model. Since this is a two class problem accuracy is a sufficient metric.
# from src/main/python/blog/metrics/accuracy.py
from typing import Dict
from sklearn.metrics import accuracy_score
from transformers.trainer_utils import EvalPrediction
def metric_accuracy(model_output: EvalPrediction) -> Dict[str, float]:
predictions = model_output.predictions.argmax(axis=1)
targets = model_output.label_ids
accuracy = accuracy_score(targets, predictions)
return {"accuracy": accuracy}
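As a quick sanity check the metric can be called directly with a hand-built EvalPrediction (the values below are made up):
# Made-up predictions for a quick check of metric_accuracy.
import numpy as np
from transformers.trainer_utils import EvalPrediction

fake_output = EvalPrediction(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),  # class logits
    label_ids=np.array([1, 0, 0]),
)
print(metric_accuracy(fake_output))  # {'accuracy': 0.666...}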
# from src/main/python/blog/metrics/perplexity.py
import torch
import torch.nn.functional as F
from transformers.trainer_utils import EvalPrediction
def metric_perplexity_bert(model_output: EvalPrediction, vocab_size: int = 30_522):
# This loss calculation comes directly from the BERT forward method
labels = torch.tensor(model_output.label_ids)
lm_logits = torch.tensor(model_output.predictions)
loss = F.cross_entropy(lm_logits.view(-1, vocab_size), labels.view(-1))
perplexity = torch.exp(loss)
return {"perplexity": perplexity.item()}
def metric_perplexity_gpt2(model_output: EvalPrediction):
# This loss calculation comes directly from the GPT2 forward method
# that handles correctly offsetting the labels to match the positions that are predicting
labels = torch.tensor(model_output.label_ids)
lm_logits = torch.tensor(model_output.predictions)
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss = F.cross_entropy(
shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
)
perplexity = torch.exp(loss)
return {"perplexity": perplexity.item()}
For the language model pretraining we need a perplexity measure.
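The perplexity metric can be sanity checked in the same way with made-up model output; masked language model labels use -100 for the positions that should be ignored, which cross_entropy skips by default:
# Made-up masked language model output for a quick check of metric_perplexity_bert.
import numpy as np
from transformers.trainer_utils import EvalPrediction

vocab_size = 10  # tiny fake vocabulary, the real BERT vocabulary is 30,522
logits = np.random.randn(2, 8, vocab_size).astype("float32")
labels = np.full((2, 8), -100)  # -100 marks unmasked positions, ignored by the loss
labels[:, 3] = 4                # pretend one token per row was masked, true id 4

print(metric_perplexity_bert(
    EvalPrediction(predictions=logits, label_ids=labels),
    vocab_size=vocab_size,
))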
This is going to evaluate BERT for this task.
The first stage will be to evaluate a pure classifier trained on the general dataset and on each domain dataset. This will establish a baseline.
Then the base sentiment model will be trained with a frozen embedding layer. After that the masked language model can be pretrained on the domain-specific text, training only the embedding layer. I can also try restricting that further so that weight decay only applies to the alteration on top of the base embeddings.
The raw datasets need to be encoded using the BERT tokenizer.
#collapse
#hide_output
from typing import Any, Dict
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
sentiment_index = {
"negative": 0,
"positive": 1,
}
def encode(row: Dict[str, Any]) -> Dict[str, Any]:
return {
"input_ids": tokenizer(row["text"], truncation=True).input_ids,
"label": sentiment_index[row["sentiment"]],
}
general_ds = general_ds.map(encode)
electronics_ds = electronics_ds.map(encode)
kitchen_ds = kitchen_ds.map(encode)
music_ds = music_ds.map(encode)
toys_ds = toys_ds.map(encode)
video_ds = video_ds.map(encode)
This trains a separate model for each task to provide a baseline for comparison.
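The training cell itself is collapsed; it amounts to calling train_classifier_full once per dataset, roughly as sketched below. The checkpoint, dataset names, batch sizes and epochs are assumptions rather than a verbatim copy of the hidden cell.
# Sketch of the collapsed training cell. MODEL_NAME is assumed to be a BERT
# checkpoint such as bert-base-uncased, and the hyperparameters are assumed.
train_classifier_full(
    general_ds,
    project_name=PROJECT_NAME,
    model_name=MODEL_NAME,
    dataset_name="sentiment140",
    data_folder=DATA_FOLDER,
    metric=metric_accuracy,
    batch_size=64,
    epochs=5,
)
for name, ds in [
    ("electronics", electronics_ds),
    ("kitchen", kitchen_ds),
    ("music", music_ds),
    ("toys", toys_ds),
    ("video", video_ds),
]:
    train_classifier_full(
        ds,
        project_name=PROJECT_NAME,
        model_name=MODEL_NAME,
        dataset_name=name,  # dataset names here are illustrative
        data_folder=DATA_FOLDER,
        metric=metric_accuracy,
        batch_size=16,
        epochs=5,
    )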
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.513200 | 0.403615 | 0.824300 |
2000 | 0.386900 | 0.369579 | 0.838800 |
3000 | 0.362800 | 0.352850 | 0.849500 |
4000 | 0.353500 | 0.352404 | 0.850100 |
5000 | 0.351500 | 0.345889 | 0.853700 |
6000 | 0.349400 | 0.340666 | 0.854200 |
7000 | 0.345300 | 0.344296 | 0.854400 |
8000 | 0.340700 | 0.352917 | 0.850300 |
9000 | 0.341600 | 0.341545 | 0.852500 |
10000 | 0.336900 | 0.332418 | 0.858500 |
11000 | 0.337300 | 0.334889 | 0.857700 |
12000 | 0.328600 | 0.343519 | 0.853100 |
13000 | 0.330900 | 0.335796 | 0.862200 |
14000 | 0.323400 | 0.325259 | 0.861300 |
15000 | 0.327800 | 0.337436 | 0.855200 |
16000 | 0.325600 | 0.340725 | 0.861600 |
17000 | 0.324700 | 0.329951 | 0.866400 |
18000 | 0.324500 | 0.316455 | 0.868000 |
19000 | 0.317800 | 0.314791 | 0.867000 |
20000 | 0.317200 | 0.312795 | 0.868500 |
21000 | 0.321100 | 0.312556 | 0.868100 |
22000 | 0.318400 | 0.315015 | 0.868300 |
23000 | 0.319700 | 0.313522 | 0.867100 |
24000 | 0.315900 | 0.311461 | 0.869200 |
25000 | 0.311100 | 0.322506 | 0.873000 |
26000 | 0.277000 | 0.310173 | 0.870000 |
27000 | 0.280600 | 0.319567 | 0.867000 |
28000 | 0.277400 | 0.314650 | 0.870900 |
29000 | 0.278500 | 0.311275 | 0.870000 |
30000 | 0.278100 | 0.318782 | 0.871400 |
31000 | 0.277300 | 0.306667 | 0.870800 |
32000 | 0.276500 | 0.309154 | 0.870300 |
33000 | 0.277400 | 0.320674 | 0.868100 |
34000 | 0.279300 | 0.318263 | 0.872000 |
35000 | 0.282800 | 0.307125 | 0.873200 |
36000 | 0.277600 | 0.320166 | 0.873300 |
37000 | 0.278900 | 0.309924 | 0.870800 |
38000 | 0.280500 | 0.312997 | 0.870400 |
39000 | 0.275800 | 0.317793 | 0.868100 |
40000 | 0.280100 | 0.305172 | 0.871300 |
41000 | 0.282300 | 0.312909 | 0.871100 |
42000 | 0.282300 | 0.305909 | 0.873700 |
43000 | 0.277500 | 0.310162 | 0.869600 |
44000 | 0.277100 | 0.313971 | 0.869600 |
45000 | 0.275900 | 0.319665 | 0.872500 |
46000 | 0.277900 | 0.318283 | 0.868900 |
47000 | 0.281000 | 0.313041 | 0.872800 |
48000 | 0.276200 | 0.300781 | 0.877500 |
49000 | 0.281300 | 0.307026 | 0.875800 |
50000 | 0.259300 | 0.325227 | 0.871600 |
51000 | 0.221100 | 0.325990 | 0.871500 |
52000 | 0.223000 | 0.337404 | 0.874800 |
53000 | 0.215900 | 0.340205 | 0.872400 |
54000 | 0.217100 | 0.325959 | 0.873400 |
55000 | 0.219700 | 0.329393 | 0.872000 |
56000 | 0.221900 | 0.325739 | 0.871300 |
57000 | 0.224400 | 0.327636 | 0.870800 |
58000 | 0.222300 | 0.341080 | 0.872400 |
59000 | 0.226100 | 0.332482 | 0.869100 |
60000 | 0.223500 | 0.323489 | 0.874300 |
61000 | 0.226700 | 0.316245 | 0.872000 |
62000 | 0.222300 | 0.325050 | 0.875100 |
63000 | 0.226700 | 0.325248 | 0.874400 |
64000 | 0.225100 | 0.322689 | 0.871700 |
65000 | 0.225300 | 0.321286 | 0.871200 |
66000 | 0.224800 | 0.334543 | 0.870800 |
67000 | 0.226700 | 0.326299 | 0.869100 |
68000 | 0.224900 | 0.329174 | 0.870800 |
69000 | 0.221800 | 0.323690 | 0.871900 |
70000 | 0.225800 | 0.318156 | 0.872000 |
71000 | 0.224500 | 0.326445 | 0.873700 |
72000 | 0.226000 | 0.328679 | 0.869900 |
73000 | 0.224500 | 0.316885 | 0.876000 |
74000 | 0.222900 | 0.316419 | 0.873000 |
75000 | 0.196700 | 0.378959 | 0.870200 |
76000 | 0.159600 | 0.375573 | 0.873600 |
77000 | 0.155600 | 0.379308 | 0.872300 |
78000 | 0.159400 | 0.365117 | 0.872600 |
79000 | 0.159700 | 0.387099 | 0.870200 |
80000 | 0.163400 | 0.381868 | 0.868100 |
81000 | 0.163000 | 0.369293 | 0.871300 |
82000 | 0.161500 | 0.361120 | 0.866600 |
83000 | 0.161300 | 0.381293 | 0.869200 |
84000 | 0.161400 | 0.381637 | 0.869200 |
85000 | 0.163700 | 0.378771 | 0.867900 |
86000 | 0.165200 | 0.372763 | 0.868500 |
87000 | 0.164500 | 0.372205 | 0.869500 |
88000 | 0.164000 | 0.387928 | 0.869300 |
89000 | 0.163700 | 0.366503 | 0.871200 |
90000 | 0.165600 | 0.371311 | 0.870200 |
91000 | 0.165700 | 0.368546 | 0.870400 |
92000 | 0.161900 | 0.390258 | 0.866100 |
93000 | 0.160700 | 0.373525 | 0.868200 |
94000 | 0.162000 | 0.359105 | 0.869300 |
95000 | 0.164800 | 0.380203 | 0.868400 |
96000 | 0.161400 | 0.366745 | 0.871500 |
97000 | 0.160300 | 0.379058 | 0.873300 |
98000 | 0.163600 | 0.377647 | 0.869500 |
99000 | 0.162400 | 0.376305 | 0.872400 |
100000 | 0.133200 | 0.441574 | 0.870600 |
101000 | 0.112200 | 0.450519 | 0.868300 |
102000 | 0.111800 | 0.461471 | 0.867600 |
103000 | 0.111000 | 0.453476 | 0.868500 |
104000 | 0.112100 | 0.464585 | 0.870000 |
105000 | 0.108200 | 0.461612 | 0.869700 |
106000 | 0.115700 | 0.450644 | 0.870300 |
107000 | 0.112700 | 0.463014 | 0.868500 |
108000 | 0.112600 | 0.454774 | 0.870100 |
109000 | 0.113800 | 0.455071 | 0.869800 |
110000 | 0.113400 | 0.474106 | 0.866800 |
111000 | 0.113100 | 0.450578 | 0.868300 |
112000 | 0.113700 | 0.453555 | 0.868100 |
113000 | 0.109200 | 0.453771 | 0.869300 |
114000 | 0.114500 | 0.449492 | 0.869100 |
115000 | 0.109200 | 0.450538 | 0.867900 |
116000 | 0.110700 | 0.463325 | 0.870000 |
117000 | 0.109900 | 0.464494 | 0.869500 |
118000 | 0.104600 | 0.462927 | 0.867900 |
119000 | 0.109000 | 0.455559 | 0.870600 |
120000 | 0.112200 | 0.450070 | 0.869400 |
121000 | 0.109800 | 0.447597 | 0.869500 |
122000 | 0.109600 | 0.451558 | 0.869200 |
123000 | 0.109600 | 0.451785 | 0.869400 |
124000 | 0.108400 | 0.454002 | 0.869700 |
train/loss | 0.1084 |
train/learning_rate | 0.0 |
train/epoch | 5.0 |
train/global_step | 124220 |
_runtime | 27799 |
_timestamp | 1644721999 |
_step | 248 |
eval/loss | 0.454 |
eval/accuracy | 0.8697 |
eval/runtime | 10.5481 |
eval/samples_per_second | 948.04 |
eval/steps_per_second | 14.884 |
train/train_runtime | 27796.6567 |
train/train_samples_per_second | 286.006 |
train/train_steps_per_second | 4.469 |
train/total_flos | 2.0687274460977792e+17 |
train/train_loss | 0.2234 |
train/loss | █▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ |
train/learning_rate | ▂▅████▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁ |
train/epoch | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
_runtime | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
_timestamp | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
eval/loss | ▅▃▃▂▂▃▂▂▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂▄▅▄▄▄▅▄▄▇██▇▇█▇█ |
eval/accuracy | ▁▄▅▆▆▆▇▇▇▇▇▇▇█▇█▇▇▇█▇▇▇█▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ |
eval/runtime | ▁▅▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇█▇██▇▇█▇█▇█▇███▇▇█▇██▇ |
eval/samples_per_second | █▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▂▁▂▁▂▁▁▁▂▂▁▁▁▁▂ |
eval/steps_per_second | █▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▂▁▂▁▂▁▁▁▂▂▁▁▁▁▂ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.258500 | 0.168365 | 0.950000 |
2000 | 0.068000 | 0.226074 | 0.958000 |
train/loss | 0.068 |
train/learning_rate | 2e-05 |
train/epoch | 5.0 |
train/global_step | 2845 |
_runtime | 1226 |
_timestamp | 1644724739 |
_step | 4 |
eval/loss | 0.22607 |
eval/accuracy | 0.958 |
eval/runtime | 8.4727 |
eval/samples_per_second | 118.026 |
eval/steps_per_second | 7.436 |
train/train_runtime | 1225.0684 |
train/train_samples_per_second | 37.124 |
train/train_steps_per_second | 2.322 |
train/total_flos | 9167977279601760.0 |
train/train_loss | 0.1197 |
train/loss | █▁ |
train/learning_rate | █▁ |
train/epoch | ▁▁▅▅█ |
train/global_step | ▁▁▅▅█ |
_runtime | ▁▁▅▅█ |
_timestamp | ▁▁▅▅█ |
_step | ▁▃▅▆█ |
eval/loss | ▁█ |
eval/accuracy | ▁█ |
eval/runtime | ▁█ |
eval/samples_per_second | █▁ |
eval/steps_per_second | █▁ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.208400 | 0.314208 | 0.928000 |
2000 | 0.040300 | 0.361787 | 0.940000 |
train/loss | 0.0403 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 2265 |
_runtime | 846 |
_timestamp | 1644725611 |
_step | 4 |
eval/loss | 0.36179 |
eval/accuracy | 0.94 |
eval/runtime | 7.5391 |
eval/samples_per_second | 132.642 |
eval/steps_per_second | 8.356 |
train/train_runtime | 845.3493 |
train/train_samples_per_second | 42.811 |
train/train_steps_per_second | 2.679 |
train/total_flos | 6108018316871280.0 |
train/train_loss | 0.11069 |
train/loss | █▁ |
train/learning_rate | █▁ |
train/epoch | ▁▁▇▇█ |
train/global_step | ▁▁▇▇█ |
_runtime | ▁▁▆▇█ |
_timestamp | ▁▁▆▇█ |
_step | ▁▃▅▆█ |
eval/loss | ▁█ |
eval/accuracy | ▁█ |
eval/runtime | █▁ |
eval/samples_per_second | ▁█ |
eval/steps_per_second | ▁█ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.165200 | 0.389923 | 0.929000 |
train/loss | 0.1652 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1215 |
_runtime | 642 |
_timestamp | 1644726277 |
_step | 2 |
eval/loss | 0.38992 |
eval/accuracy | 0.929 |
eval/runtime | 11.275 |
eval/samples_per_second | 88.692 |
eval/steps_per_second | 5.588 |
train/train_runtime | 641.3766 |
train/train_samples_per_second | 30.263 |
train/train_steps_per_second | 1.894 |
train/total_flos | 4571028364769280.0 |
train/train_loss | 0.1399 |
train/loss | ▁ |
train/learning_rate | ▁ |
train/epoch | ▁▁█ |
train/global_step | ▁▁█ |
_runtime | ▁▂█ |
_timestamp | ▁▂█ |
_step | ▁▅█ |
eval/loss | ▁ |
eval/accuracy | ▁ |
eval/runtime | ▁ |
eval/samples_per_second | ▁ |
eval/steps_per_second | ▁ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.178000 | 0.477703 | 0.910000 |
train/loss | 0.178 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1295 |
_runtime | 450 |
_timestamp | 1644726745 |
_step | 2 |
eval/loss | 0.4777 |
eval/accuracy | 0.91 |
eval/runtime | 7.1618 |
eval/samples_per_second | 139.631 |
eval/steps_per_second | 8.797 |
train/train_runtime | 449.3178 |
train/train_samples_per_second | 46.025 |
train/train_steps_per_second | 2.882 |
train/total_flos | 3259571864878560.0 |
train/train_loss | 0.14495 |
train/loss | ▁ |
train/learning_rate | ▁ |
train/epoch | ▁▁█ |
train/global_step | ▁▁█ |
_runtime | ▁▁█ |
_timestamp | ▁▁█ |
_step | ▁▅█ |
eval/loss | ▁ |
eval/accuracy | ▁ |
eval/runtime | ▁ |
eval/samples_per_second | ▁ |
eval/steps_per_second | ▁ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.143700 | 0.310546 | 0.944000 |
train/loss | 0.1437 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1305 |
_runtime | 721 |
_timestamp | 1644727485 |
_step | 2 |
eval/loss | 0.31055 |
eval/accuracy | 0.944 |
eval/runtime | 11.8683 |
eval/samples_per_second | 84.258 |
eval/steps_per_second | 5.308 |
train/train_runtime | 720.419 |
train/train_samples_per_second | 28.969 |
train/train_steps_per_second | 1.811 |
train/total_flos | 5153342463603840.0 |
train/train_loss | 0.11579 |
train/loss | ▁ |
train/learning_rate | ▁ |
train/epoch | ▁▁█ |
train/global_step | ▁▁█ |
_runtime | ▁▁█ |
_timestamp | ▁▁█ |
_step | ▁▅█ |
eval/loss | ▁ |
eval/accuracy | ▁ |
eval/runtime | ▁ |
eval/samples_per_second | ▁ |
eval/steps_per_second | ▁ |
train/train_runtime | ▁ |
train/train_samples_per_second | ▁ |
train/train_steps_per_second | ▁ |
train/total_flos | ▁ |
train/train_loss | ▁ |
This trains a model for general sentiment classification with a frozen embedding layer. The resulting model will be used as the base for the domain adjusted models.
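The training cell here is collapsed as well; it is roughly the following, with the checkpoint and hyperparameters assumed:
# Sketch of the collapsed cell: the general dataset again, but with the
# embedding layer frozen via the accessor. Hyperparameters are assumed.
train_classifier_base(
    general_ds,
    project_name=PROJECT_NAME,
    model_name=MODEL_NAME,
    embedding_accessor=get_embedding_parameters_bert,
    dataset_name="sentiment140",
    data_folder=DATA_FOLDER,
    metric=metric_accuracy,
    batch_size=64,
    epochs=5,
)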
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1000 | 0.511300 | 0.406170 | 0.822400 |
2000 | 0.388700 | 0.371869 | 0.837000 |
3000 | 0.364600 | 0.354540 | 0.848000 |
4000 | 0.355000 | 0.356526 | 0.849200 |
5000 | 0.353200 | 0.346191 | 0.852400 |
6000 | 0.350800 | 0.341020 | 0.855100 |
7000 | 0.345700 | 0.343301 | 0.853700 |
8000 | 0.341700 | 0.352299 | 0.853900 |
9000 | 0.341200 | 0.339812 | 0.855800 |
10000 | 0.337600 | 0.333934 | 0.858800 |
11000 | 0.337800 | 0.333922 | 0.858500 |
12000 | 0.328800 | 0.339086 | 0.856300 |
13000 | 0.331700 | 0.329456 | 0.863300 |
14000 | 0.322400 | 0.325021 | 0.860800 |
15000 | 0.326400 | 0.341783 | 0.855300 |
16000 | 0.326400 | 0.338934 | 0.862700 |
17000 | 0.323400 | 0.331051 | 0.866300 |
18000 | 0.325000 | 0.319770 | 0.865600 |
19000 | 0.315500 | 0.309761 | 0.870200 |
20000 | 0.318200 | 0.311332 | 0.869200 |
21000 | 0.320600 | 0.314540 | 0.866800 |
22000 | 0.317600 | 0.313203 | 0.869900 |
23000 | 0.319400 | 0.311689 | 0.869800 |
24000 | 0.315400 | 0.306948 | 0.870800 |
25000 | 0.310800 | 0.316544 | 0.871000 |
26000 | 0.280500 | 0.305228 | 0.873900 |
27000 | 0.284600 | 0.315119 | 0.869700 |
28000 | 0.282100 | 0.316112 | 0.871200 |
29000 | 0.281500 | 0.311524 | 0.870400 |
30000 | 0.280600 | 0.318128 | 0.872500 |
31000 | 0.280600 | 0.310752 | 0.871400 |
32000 | 0.278800 | 0.309930 | 0.873400 |
33000 | 0.281400 | 0.318493 | 0.872800 |
34000 | 0.283400 | 0.321827 | 0.872100 |
35000 | 0.286400 | 0.307348 | 0.872800 |
36000 | 0.280200 | 0.323695 | 0.870700 |
37000 | 0.282700 | 0.306045 | 0.871800 |
38000 | 0.284300 | 0.306637 | 0.869500 |
39000 | 0.279700 | 0.313403 | 0.869800 |
40000 | 0.282500 | 0.301105 | 0.873000 |
41000 | 0.283600 | 0.306601 | 0.874800 |
42000 | 0.285000 | 0.307430 | 0.873400 |
43000 | 0.279200 | 0.305863 | 0.871100 |
44000 | 0.277800 | 0.307858 | 0.871100 |
45000 | 0.279200 | 0.306109 | 0.871500 |
46000 | 0.280300 | 0.312159 | 0.870600 |
47000 | 0.282800 | 0.310434 | 0.875600 |
48000 | 0.281400 | 0.302623 | 0.878500 |
49000 | 0.282900 | 0.302531 | 0.880000 |
50000 | 0.264000 | 0.309589 | 0.873400 |
51000 | 0.229000 | 0.321463 | 0.873500 |
52000 | 0.230600 | 0.341251 | 0.871700 |
53000 | 0.227100 | 0.330379 | 0.872000 |
54000 | 0.227500 | 0.321092 | 0.871200 |
55000 | 0.229700 | 0.319525 | 0.873400 |
56000 | 0.231000 | 0.317524 | 0.872500 |
57000 | 0.234000 | 0.315405 | 0.871000 |
58000 | 0.230900 | 0.317642 | 0.871800 |
59000 | 0.233800 | 0.321154 | 0.872600 |
60000 | 0.231900 | 0.318218 | 0.874600 |
61000 | 0.233800 | 0.316618 | 0.875300 |
62000 | 0.232700 | 0.325526 | 0.875100 |
63000 | 0.235600 | 0.320183 | 0.873000 |
64000 | 0.235500 | 0.319495 | 0.871600 |
65000 | 0.234800 | 0.314187 | 0.873500 |
66000 | 0.232800 | 0.325829 | 0.873500 |
67000 | 0.234400 | 0.311478 | 0.871800 |
68000 | 0.232200 | 0.317831 | 0.873600 |
69000 | 0.232300 | 0.311483 | 0.874300 |
70000 | 0.235600 | 0.305324 | 0.877200 |
71000 | 0.234200 | 0.315522 | 0.872300 |
72000 | 0.232900 | 0.319899 | 0.868900 |
73000 | 0.234100 | 0.308927 | 0.874900 |
74000 | 0.230300 | 0.308656 | 0.872400 |
75000 | 0.208900 | 0.360322 | 0.872700 |
76000 | 0.175900 | 0.357876 | 0.870900 |
77000 | 0.172800 | 0.349695 | 0.872500 |
78000 | 0.174800 | 0.351596 | 0.874000 |
79000 | 0.173700 | 0.359032 | 0.873700 |
80000 | 0.177600 | 0.355052 | 0.872600 |
81000 | 0.177800 | 0.353396 | 0.869900 |
82000 | 0.177200 | 0.346814 | 0.870100 |
83000 | 0.174200 | 0.354160 | 0.874200 |
84000 | 0.177300 | 0.355552 | 0.868400 |
85000 | 0.180100 | 0.355578 | 0.869500 |
86000 | 0.178800 | 0.358164 | 0.869500 |
87000 | 0.177000 | 0.355007 | 0.871100 |
88000 | 0.180300 | 0.348180 | 0.872800 |
89000 | 0.176000 | 0.346942 | 0.869900 |
90000 | 0.179300 | 0.344493 | 0.871400 |
91000 | 0.181100 | 0.344836 | 0.870600 |
92000 | 0.180300 | 0.353251 | 0.872400 |
93000 | 0.175700 | 0.354549 | 0.870000 |
94000 | 0.176000 | 0.353133 | 0.870100 |
95000 | 0.179900 | 0.374774 | 0.868700 |
96000 | 0.173200 | 0.355910 | 0.870800 |
97000 | 0.176100 | 0.361503 | 0.872800 |
98000 | 0.176800 | 0.356372 | 0.869200 |
99000 | 0.176300 | 0.360096 | 0.874800 |
100000 | 0.148100 | 0.418238 | 0.870900 |
101000 | 0.129100 | 0.416172 | 0.870400 |
102000 | 0.131000 | 0.419957 | 0.868900 |
103000 | 0.129700 | 0.409167 | 0.869400 |
104000 | 0.127100 | 0.430586 | 0.872300 |
105000 | 0.126700 | 0.428001 | 0.867700 |
106000 | 0.130300 | 0.415673 | 0.868400 |
107000 | 0.132300 | 0.414793 | 0.868600 |
108000 | 0.128100 | 0.438585 | 0.869700 |
109000 | 0.129900 | 0.424922 | 0.869000 |
110000 | 0.128400 | 0.429560 | 0.868200 |
111000 | 0.129300 | 0.420508 | 0.869400 |
112000 | 0.128200 | 0.426265 | 0.869900 |
113000 | 0.126300 | 0.419459 | 0.868500 |
114000 | 0.130100 | 0.412851 | 0.869100 |
115000 | 0.127900 | 0.417601 | 0.869100 |
116000 | 0.129400 | 0.421481 | 0.869300 |
117000 | 0.126300 | 0.432261 | 0.868900 |
118000 | 0.122100 | 0.426714 | 0.868600 |
119000 | 0.127300 | 0.419353 | 0.869800 |
120000 | 0.127800 | 0.418641 | 0.868600 |
121000 | 0.123400 | 0.421648 | 0.867300 |
122000 | 0.127500 | 0.419938 | 0.867600 |
123000 | 0.125900 | 0.421461 | 0.867100 |
124000 | 0.123900 | 0.422832 | 0.867600 |
train/loss | 0.1239 |
train/learning_rate | 0.0 |
train/epoch | 5.0 |
train/global_step | 124220 |
_runtime | 26994 |
_timestamp | 1644754497 |
_step | 248 |
eval/loss | 0.42283 |
eval/accuracy | 0.8676 |
eval/runtime | 10.5191 |
eval/samples_per_second | 950.653 |
eval/steps_per_second | 14.925 |
train/train_runtime | 26993.4931 |
train/train_samples_per_second | 294.515 |
train/train_steps_per_second | 4.602 |
train/total_flos | 2.0687274460977792e+17 |
train/train_loss | 0.2321 |
Since this is the model that will be adjusted, we can check that the evaluate method produces consistent results. The best evaluation was this one:
Step | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
49000 | 0.282900 | 0.302531 | 0.880000 |
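The check can be reproduced with the same helpers used in the cross evaluation further down. This is only a sketch: `general_test_ds` is a placeholder name for the held-out general sentiment split, and the actual cell may differ slightly.

# a sketch of re-running the evaluation on the saved general classifier;
# general_test_ds is a placeholder name for the held-out general split
evaluate_classifier(
    ds=general_test_ds,
    model_name=MODEL_NAME,
    model=load_classifier_full(
        model_name=MODEL_NAME,
        dataset_name="general",
        data_folder=DATA_FOLDER,
        batch_size=64,
        epochs=5,
    ),
    batch_size=64,
    data_folder=DATA_FOLDER,
    metric=metric_accuracy,
)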
{'eval_loss': 0.30253100395202637,
'eval_accuracy': 0.88,
'eval_runtime': 9.1462,
'eval_samples_per_second': 1093.35,
'eval_steps_per_second': 17.166}
We haven’t run the training here so the training loss metric is not produced. The validation loss and accuracy match, accounting for rounding, so I am satisfied that the evaluation code works.
This adjusts the embeddings of the model to match the domain distribution by retraining them as a language model over the domain text.
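A rough sketch of the idea, assuming a BERT style masked language model, is to freeze every parameter except the word embeddings before training. This is an illustration of the approach rather than the exact helper used here.

from transformers import PreTrainedModel

# a sketch of embedding-only fine tuning: freeze everything, then make the
# word embeddings trainable again (the masked language modelling head shares
# this weight, so the language modelling loss still updates them)
def freeze_all_but_word_embeddings(model: PreTrainedModel) -> PreTrainedModel:
    for parameter in model.parameters():
        parameter.requires_grad = False
    model.get_input_embeddings().weight.requires_grad = True
    return model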
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.525900 | 2.186024 | 9.104887 |
2000 | 2.339300 | 2.050233 | 7.756434 |
train/loss | 2.3393 |
train/learning_rate | 2e-05 |
train/epoch | 5.0 |
train/global_step | 2845 |
_runtime | 1152 |
_timestamp | 1644757196 |
_step | 4 |
eval/loss | 2.05023 |
eval/perplexity | 7.75643 |
eval/runtime | 4.6223 |
eval/samples_per_second | 21.634 |
eval/steps_per_second | 1.514 |
train/train_runtime | 1151.0397 |
train/train_samples_per_second | 39.512 |
train/train_steps_per_second | 2.472 |
train/total_flos | 9171244212184800.0 |
train/train_loss | 2.39209 |
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.462700 | 2.134396 | 8.371171 |
2000 | 2.259400 | 2.161706 | 8.692562 |
train/loss | 2.2594 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 2265 |
_runtime | 785 |
_timestamp | 1644758006 |
_step | 4 |
eval/loss | 2.16171 |
eval/perplexity | 8.69256 |
eval/runtime | 4.6087 |
eval/samples_per_second | 21.698 |
eval/steps_per_second | 1.519 |
train/train_runtime | 784.1761 |
train/train_samples_per_second | 46.15 |
train/train_steps_per_second | 2.888 |
train/total_flos | 6110194858484400.0 |
train/train_loss | 2.34641 |
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.479700 | 2.364102 | 10.497963 |
train/loss | 2.4797 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1215 |
_runtime | 605 |
_timestamp | 1644758633 |
_step | 2 |
eval/loss | 2.3641 |
eval/perplexity | 10.49796 |
eval/runtime | 5.052 |
eval/samples_per_second | 19.794 |
eval/steps_per_second | 1.386 |
train/train_runtime | 604.3056 |
train/train_samples_per_second | 32.12 |
train/train_steps_per_second | 2.011 |
train/total_flos | 4572657212774400.0 |
train/train_loss | 2.46521 |
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.387300 | 2.285632 | 9.708434 |
train/loss | 2.3873 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1295 |
_runtime | 418 |
_timestamp | 1644759069 |
_step | 2 |
eval/loss | 2.28563 |
eval/perplexity | 9.70843 |
eval/runtime | 4.7261 |
eval/samples_per_second | 21.159 |
eval/steps_per_second | 1.481 |
train/train_runtime | 416.9974 |
train/train_samples_per_second | 49.593 |
train/train_steps_per_second | 3.106 |
train/total_flos | 3260733386248800.0 |
train/train_loss | 2.35955 |
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.390500 | 2.200149 | 8.946708 |
train/loss | 2.3905 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1305 |
_runtime | 684 |
_timestamp | 1644759770 |
_step | 2 |
eval/loss | 2.20015 |
eval/perplexity | 8.94671 |
eval/runtime | 5.1369 |
eval/samples_per_second | 19.467 |
eval/steps_per_second | 1.363 |
train/train_runtime | 683.0861 |
train/train_samples_per_second | 30.553 |
train/train_steps_per_second | 1.91 |
train/total_flos | 5155178814403200.0 |
train/train_loss | 2.36121 |
This uses an overlay on the word embeddings to adjust them to the domain distribution. Because the overlay is a separate set of weights added on top of the frozen pretrained embeddings, weight decay acts only on the adjustment rather than on the pretrained embedding itself.
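As a sketch of what such an overlay could look like (the actual EmbeddingOverlay used below may be implemented differently), the word embedding module can be wrapped so that a zero initialized, trainable delta is added to the frozen pretrained weights:

import torch
from torch import nn

class OverlayEmbedding(nn.Module):
    """Frozen pretrained embedding plus a trainable adjustment (illustrative)."""

    def __init__(self, embedding: nn.Embedding) -> None:
        super().__init__()
        self.embedding = embedding
        self.embedding.weight.requires_grad = False
        # zero initialized so training starts from the pretrained embedding,
        # and weight decay pulls only this adjustment towards zero
        self.delta = nn.Parameter(torch.zeros_like(embedding.weight))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        weight = self.embedding.weight + self.delta
        return nn.functional.embedding(
            input_ids, weight, padding_idx=self.embedding.padding_idx
        )

With a structure like this, update_model could swap the overlay in before training and restore_model could fold the delta back into a plain embedding before saving, so the saved checkpoint keeps the standard architecture.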
#hide_output
train_language_model(
ds=electronics_ds,
train_name="embedding-overlay",
project_name=PROJECT_NAME,
model_name=MODEL_NAME,
dataset_name="domain-electronics",
data_folder=DATA_FOLDER,
batch_size=16,
epochs=5,
metric=metric_perplexity_bert,
model_preparation=EmbeddingOverlay.update_model,
save_preparation=EmbeddingOverlay.restore_model,
)
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.791700 | 2.655851 | 14.515713 |
2000 | 2.784700 | 2.461272 | 11.607203 |
train/loss | 2.7847 |
train/learning_rate | 2e-05 |
train/epoch | 5.0 |
train/global_step | 2845 |
_runtime | 1075 |
_timestamp | 1645049168 |
_step | 4 |
eval/loss | 2.46127 |
eval/perplexity | 11.6072 |
eval/runtime | 4.5188 |
eval/samples_per_second | 22.13 |
eval/steps_per_second | 1.549 |
train/train_runtime | 1073.6786 |
train/train_samples_per_second | 42.359 |
train/train_steps_per_second | 2.65 |
train/total_flos | 1.1680412853012192e+16 |
train/train_loss | 2.7854 |
#hide_output
train_language_model(
ds=kitchen_ds,
train_name="embedding-overlay",
project_name=PROJECT_NAME,
model_name=MODEL_NAME,
dataset_name="domain-kitchen",
data_folder=DATA_FOLDER,
batch_size=16,
epochs=5,
metric=metric_perplexity_bert,
model_preparation=EmbeddingOverlay.update_model,
save_preparation=EmbeddingOverlay.restore_model,
)
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.751600 | 2.558401 | 12.753925 |
2000 | 2.722900 | 2.673736 | 14.132344 |
train/loss | 2.7229 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 2265 |
_runtime | 730 |
_timestamp | 1645049968 |
_step | 4 |
eval/loss | 2.67374 |
eval/perplexity | 14.13234 |
eval/runtime | 4.4861 |
eval/samples_per_second | 22.291 |
eval/steps_per_second | 1.56 |
train/train_runtime | 728.6837 |
train/train_samples_per_second | 49.665 |
train/train_steps_per_second | 3.108 |
train/total_flos | 7781888357593776.0 |
train/train_loss | 2.73636 |
#hide_output
train_language_model(
ds=music_ds,
train_name="embedding-overlay",
project_name=PROJECT_NAME,
model_name=MODEL_NAME,
dataset_name="domain-music",
data_folder=DATA_FOLDER,
batch_size=16,
epochs=5,
metric=metric_perplexity_bert,
model_preparation=EmbeddingOverlay.update_model,
save_preparation=EmbeddingOverlay.restore_model,
)
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.698700 | 2.667158 | 14.097964 |
train/loss | 2.6987 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1215 |
_runtime | 570 |
_timestamp | 1645050574 |
_step | 2 |
eval/loss | 2.66716 |
eval/perplexity | 14.09796 |
eval/runtime | 5.3364 |
eval/samples_per_second | 18.739 |
eval/steps_per_second | 1.312 |
train/train_runtime | 568.3681 |
train/train_samples_per_second | 34.15 |
train/train_steps_per_second | 2.138 |
train/total_flos | 5823694456805376.0 |
train/train_loss | 2.69958 |
#hide_output
train_language_model(
ds=toys_ds,
train_name="embedding-overlay",
project_name=PROJECT_NAME,
model_name=MODEL_NAME,
dataset_name="domain-toys",
data_folder=DATA_FOLDER,
batch_size=16,
epochs=5,
metric=metric_perplexity_bert,
model_preparation=EmbeddingOverlay.update_model,
save_preparation=EmbeddingOverlay.restore_model,
)
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.684900 | 2.626420 | 13.626650 |
train/loss | 2.6849 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1295 |
_runtime | 393 |
_timestamp | 1645050991 |
_step | 2 |
eval/loss | 2.62642 |
eval/perplexity | 13.62665 |
eval/runtime | 4.6422 |
eval/samples_per_second | 21.542 |
eval/steps_per_second | 1.508 |
train/train_runtime | 392.1879 |
train/train_samples_per_second | 52.73 |
train/train_steps_per_second | 3.302 |
train/total_flos | 4152840255238752.0 |
train/train_loss | 2.68775 |
#hide_output
train_language_model(
ds=video_ds,
train_name="embedding-overlay",
project_name=PROJECT_NAME,
model_name=MODEL_NAME,
dataset_name="domain-video",
data_folder=DATA_FOLDER,
batch_size=16,
epochs=5,
metric=metric_perplexity_bert,
model_preparation=EmbeddingOverlay.update_model,
save_preparation=EmbeddingOverlay.restore_model,
)
Step | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
1000 | 2.609000 | 2.516007 | 12.213467 |
train/loss | 2.609 |
train/learning_rate | 1e-05 |
train/epoch | 5.0 |
train/global_step | 1305 |
_runtime | 646 |
_timestamp | 1645051659 |
_step | 2 |
eval/loss | 2.51601 |
eval/perplexity | 12.21347 |
eval/runtime | 5.1202 |
eval/samples_per_second | 19.53 |
eval/steps_per_second | 1.367 |
train/train_runtime | 643.9198 |
train/train_samples_per_second | 32.411 |
train/train_steps_per_second | 2.027 |
train/total_flos | 6565588647539328.0 |
train/train_loss | 2.60432 |
Now that the models have been trained we can evaluate the different models against each dataset.
#collapse
from typing import Dict, Union
import datasets
def cross_evaluation(
ds: datasets.Dataset,
model_name: str,
domain: str,
classifier_batch_size: int,
domain_batch_size: int,
lm_batch_size: int,
epochs: int
) -> Dict[str, Union[str, float]]:
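# Evaluate nine variants against the domain test set:
# - specific: the classifier fine tuned on the domain reviews
# - full: the classifier fine tuned on the general sentiment data
# - base: the general classifier trained with a frozen embedding layer
# Each is evaluated as-is, with the embedding layer replaced by the domain
# retrained embeddings (combined), and with the embedding overlay (overlay).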
specific_name = f"full-{model_name}-domain-{domain}-{domain_batch_size}bs-{epochs}e"
full_name = f"full-{model_name}-general-{classifier_batch_size}bs-{epochs}e"
no_embedding_name = f"no-embedding-{model_name}-general-{classifier_batch_size}bs-{epochs}e"
embedding_name = f"embedding-{model_name}-domain-{domain}-{lm_batch_size}bs-{epochs}e"
specific_results = evaluate_classifier(
ds=ds,
model_name=model_name,
model=load_classifier_full(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=domain_batch_size,
epochs=epochs,
),
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
specific_combined_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_full(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=domain_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
specific_overlay_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_full(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=domain_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding_overlay(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
full_results = evaluate_classifier(
ds=ds,
model_name=model_name,
model=load_classifier_full(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
full_combined_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_full(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
full_overlay_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_full(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding_overlay(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
no_embedding_results = evaluate_classifier(
ds=ds,
model_name=model_name,
model=load_classifier_base(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
base_combined_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_base(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
base_overlay_results = evaluate_combined_classifier(
ds=ds,
model_name=model_name,
base_model=load_classifier_base(
model_name=model_name,
dataset_name="general",
data_folder=DATA_FOLDER,
batch_size=classifier_batch_size,
epochs=epochs,
),
embedding_model=load_language_model_embedding_overlay(
model_name=model_name,
dataset_name=f"domain-{domain}",
data_folder=DATA_FOLDER,
batch_size=lm_batch_size,
epochs=epochs,
),
embedding_accessor=get_embedding_parameters_bert,
batch_size=64,
data_folder=DATA_FOLDER,
metric=metric_accuracy,
)
return {
"domain": domain,
"specific_accuracy": specific_results["eval_accuracy"],
"specific_combined_accuracy": specific_combined_results["eval_accuracy"],
"specific_overlay_accuracy": specific_overlay_results["eval_accuracy"],
"full_accuracy": full_results["eval_accuracy"],
"full_combined_accuracy": full_combined_results["eval_accuracy"],
"full_overlay_accuracy": full_overlay_results["eval_accuracy"],
"base_accuracy": no_embedding_results["eval_accuracy"],
"base_combined_accuracy": base_combined_results["eval_accuracy"],
"base_overlay_accuracy": base_overlay_results["eval_accuracy"],
}
tensor(True)
(output truncated: the full BertForSequenceClassification repr - a 30522 x 768 word embedding with padding_idx=0, position and token type embeddings, 12 BertLayer encoder blocks, a BertPooler, dropout and a two class classifier head)
#hide_output
import pandas as pd
result_df = pd.DataFrame([
cross_evaluation(
ds=ds,
model_name=MODEL_NAME,
domain=domain,
classifier_batch_size=64,
domain_batch_size=16,
lm_batch_size=16,
epochs=5,
)
for ds, domain in [
(electronics_ds, "electronics"),
(kitchen_ds, "kitchen"),
(music_ds, "music"),
(toys_ds, "toys"),
(video_ds, "video"),
]
])
domain | specific_accuracy | specific_combined_accuracy | specific_overlay_accuracy | full_accuracy | full_combined_accuracy | full_overlay_accuracy | base_accuracy | base_combined_accuracy | base_overlay_accuracy | |
---|---|---|---|---|---|---|---|---|---|---|
0 | electronics | 0.958 | 0.952 | 0.957 | 0.781 | 0.795 | 0.787 | 0.768 | 0.774 | 0.768 |
1 | kitchen | 0.940 | 0.944 | 0.941 | 0.800 | 0.797 | 0.804 | 0.794 | 0.796 | 0.794 |
2 | music | 0.929 | 0.925 | 0.927 | 0.773 | 0.771 | 0.777 | 0.759 | 0.757 | 0.759 |
3 | toys | 0.910 | 0.904 | 0.908 | 0.828 | 0.833 | 0.827 | 0.836 | 0.837 | 0.836 |
4 | video | 0.944 | 0.949 | 0.947 | 0.745 | 0.722 | 0.733 | 0.737 | 0.721 | 0.737 |
Here we review the domain specific model and evaluate how replacing the embedding layer affects performance.
Domain | Domain Sentiment Model | with Replacement Embeddings | |
---|---|---|---|
0 | electronics | 0.958 | 0.952 |
1 | kitchen | 0.94 | 0.944 |
2 | music | 0.929 | 0.925 |
3 | toys | 0.91 | 0.904 |
4 | video | 0.944 | 0.949 |
We can see that the domain specific model has high accuracy to start with. Replacing the embedding layer harms performance slightly more often than it improves it, and in every case the difference is small.
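This review table is just a renamed slice of result_df; a sketch of the selection, with column names taken from the cross_evaluation output:

# the review table above is a renamed slice of the cross evaluation results
result_df[
    ["domain", "specific_accuracy", "specific_combined_accuracy"]
].rename(columns={
    "domain": "Domain",
    "specific_accuracy": "Domain Sentiment Model",
    "specific_combined_accuracy": "with Replacement Embeddings",
})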
Here we review the general sentiment model and evaluate how replacing the embedding layer affects performance.
Domain | General Sentiment Model | with Replacement Embeddings | |
---|---|---|---|
0 | electronics | 0.781 | 0.795 |
1 | kitchen | 0.8 | 0.797 |
2 | music | 0.773 | 0.771 |
3 | toys | 0.828 | 0.833 |
4 | video | 0.745 | 0.722 |
Replacing the embedding layer again produces only a small change, and it harms performance more often than it improves it.
Here we review the general sentiment model that was trained with a frozen embedding layer. The embeddings were frozen so that the classifier learns to work with the unmodified base embeddings, which should make swapping in the domain retrained embeddings a consistent operation.
Domain | General Sentiment Model with Frozen Embeddings | with Replacement Embeddings | |
---|---|---|---|
0 | electronics | 0.768 | 0.774 |
1 | kitchen | 0.794 | 0.796 |
2 | music | 0.759 | 0.757 |
3 | toys | 0.836 | 0.837 |
4 | video | 0.737 | 0.721 |
Once again the performance changes are marginal. It appears that this technique has not worked as I expected.
These results are disappointing. Replacing the embedding layer with one retrained on the domain does not produce a consistent improvement.
They do show that the datasets differ significantly. I wonder if the Amazon review dataset can really be considered a sentiment task.
One thing that I still want to evaluate is the way the domain specific embedding layer is trained. If the adjustment is kept separate from the pretrained weights then weight decay applies strictly to the change to the embeddings, and inspecting that adjustment could provide concrete insight into which tokens have shifted meaning between domains.
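A sketch of that inspection, assuming the trained adjustment is available as a (vocab_size, embedding_dim) tensor called delta; the names here are illustrative rather than part of the code above:

import torch

# hypothetical sketch: rank tokens by how far the domain adjustment moved them
def most_shifted_tokens(delta: torch.Tensor, tokenizer, top_k: int = 20):
    # the L2 norm of each row measures how far that token's embedding has
    # moved away from its pretrained value
    shift = delta.norm(dim=1)
    values, indices = shift.topk(top_k)
    tokens = tokenizer.convert_ids_to_tokens(indices.tolist())
    return list(zip(tokens, values.tolist()))

# e.g. most_shifted_tokens(overlay_delta, AutoTokenizer.from_pretrained(MODEL_NAME))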