How to prepare data, customize a model, and train it
Published
September 12, 2021
I’ve been training a lot of custom huggingface models, so I thought it would be nice to break down the process and show how each stage works. This should help if you want to create your own custom model.
The process can be most easily understood by training an existing model using an existing dataset. Then we can try customizing each step of the process.
First let's define the model and batch size used. When you are working on a custom model it's good to define things like this at the start of the notebook, as it makes them easy to adjust later.
Code
from pathlib import Path

MODEL_NAME = "bert-base-uncased"
BATCH_SIZE = 32  # change this to suit your machine

MODEL_RUN_FOLDER = Path("/data/blog/2021-09-12-train-using-huggingface/runs")
MODEL_RUN_FOLDER.mkdir(parents=True, exist_ok=True)
This next cell turns on the debugger. The pdb debugger will start a REPL which accepts python code as well as some specific commands, which are documented here. The debugging session will block the execution of any further cells in the notebook, so to finish a debugging session use the c (continue) command.
Code
%pdb
Automatic pdb calling has been turned ON
Using the debugger is especially helpful when defining a custom model or custom metrics.
How to Train
Training a model requires a dataset, a model and some metrics to track performance.
We can split these tasks up by the library that provides them. First the datasets library provides datasets and the metrics that can be used to evaluate performance against them. Then the transformers library provides both the models and the training framework.
Dataset and Metric
I’m aiming to show how to do this, so the task can be quite simple. The prepared datasets can be found at huggingface.co/datasets.
Let’s stick with sentiment analysis as it’s a simple task. We can use the GLUE sst2 task, which is limited to positive and negative sentiment. This is a good dataset to use as the evaluation metric is defined as part of the challenge.
The documentation for loading a dataset is here. The documentation for loading a metric is here.
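As a minimal sketch of those two calls, assuming the datasets library's load_dataset and load_metric functions and the sst2_ds and sst2_metric names used throughout the rest of this post:
Code
from datasets import load_dataset, load_metric

# GLUE is a collection of tasks, so both the benchmark and the task are named
sst2_ds = load_dataset("glue", "sst2")
sst2_metric = load_metric("glue", "sst2")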
Loading the dataset shows that it is already split into train, validation and test sets. Performing this split manually will be shown later.
Code
sst2_ds["train"][0]
{'sentence': 'hide new secretions from the parental units ',
'label': 0,
'idx': 0}
Code
sst2_ds["train"][2]
{'sentence': 'that loves its characters and communicates something rather beautiful about human nature ',
'label': 1,
'idx': 2}
Here you can see that a Dataset can be treated as a list of dictionaries. These two rows show that the labels are 0 for negative and 1 for positive.
Metric
How do metrics work? Let’s have a look at the one we just loaded. The core code for it is:
def _compute(self, predictions, references):
    if self.config_name == "cola":
        return {"matthews_correlation": matthews_corrcoef(references, predictions)}
    elif self.config_name == "stsb":
        return pearson_and_spearman(predictions, references)
    elif self.config_name in ["mrpc", "qqp"]:
        return acc_and_f1(predictions, references)
    elif self.config_name in [
        "sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"
    ]:
        return {"accuracy": simple_accuracy(predictions, references)}
    else:
        raise KeyError(
            "You should supply a configuration name selected in "
            '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
            '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
        )
This is a class that extends evaluate.Metric and evaluate.EvaluationModule. There is a lot of code in the EvaluationModule so if you want to look at it then I recommend going to github.
For now I am going to provide the high level view. A metric takes the model predictions and labels and returns a dictionary. The dictionary is used to create a dataframe showing the training progress. The metrics you create can be used to select the best model after training, or to allow for early stopping.
This metric is complex because it is written using the evaluate.Metric class. Alternatively you can write a simple function that takes an EvalPrediction object and returns a dictionary. The equivalent metric written in the simple form would be:
Code
from typing import Dict

from transformers import EvalPrediction


def sst2_metric_simple(results: EvalPrediction) -> Dict[str, float]:
    # convert raw model output into predicted class
    predictions = results.predictions.argmax(axis=1)
    targets = results.label_ids
    accuracy = (predictions == targets).mean().item()
    return {"accuracy": accuracy}
The predictions and label_ids are numpy arrays.
Why does the really complicated version exist? It's because the entire result set is accumulated in memory as the model runs the evaluation, and this can lead to memory errors. The complicated version can receive incremental results, allowing memory to be released. We won't cover the details of that in this post though, not least because it doesn't work with the huggingface trainer 😢.
Model and Trainer
The model and trainer come from the transformers library. There are a lot of different models available and the techniques I am about to show should work with most of them.
The transformers API for them is documented here in the models section. We are classifying text, so we want a ForSequenceClassification model. This example uses BertForSequenceClassification.
Model
All of the pretrained models are at huggingface.co/models. We need to find a BERT model to load. Searching for bert gives me bert-base-uncased, so let's load it:
Code
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(MODEL_NAME)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
It’s quite normal to see some complaints when loading this from huggingface. This is because the model is made of a trained body and an untrained head.
We can see this by reviewing the forward method of the model:
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    labels=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    # minor preparation ...

    # this is where the body output is calculated
    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    # this is the classification head
    pooled_output = outputs[1]
    pooled_output = self.dropout(pooled_output)
    logits = self.classifier(pooled_output)

    # ... loss calculation

    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
Here you can see the input being passed to the transformer body (self.bert), then the output of that is passed through the classification head self.classifier. These logits are the final output of the model.
When we create a custom model we almost always keep the same body and just change the head.
Tokenizer
The input to the model is tokenized text. Each model has a specific tokenizer associated with it. So when loading the tokenizer we must use the same model name:
Code
from transformers import AutoTokenizer

# make sure the pretrained name matches the model!
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
If there are complaints when loading the tokenizer then you have a problem. This training process does not involve creating a new tokenizer, so the tokenizer should be complete and valid.
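As a quick sketch, invoking the tokenizer on a single utterance returns a dictionary of encodings:
Code
# the result is a dictionary with input_ids, token_type_ids and attention_mask keys,
# where input_ids is a plain list of integers
tokenizer("hello world")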
The interesting part here is the input_ids. It’s good to keep the values as a list of integers so that the huggingface trainer can reform them as required. The attention_mask will be generated by the trainer, and token_type_ids are not interesting for this task.
To decode the text you can call tokenizer.decode:
Code
tokenizer.decode([101, 7592, 2088, 102])
'[CLS] hello world [SEP]'
This is the tokenizer for the uncased model, so it has lowercased the text. It also adds special tokens that let the model work with the boundaries of the utterance.
Trainer
With all of these parts we can train the model.
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # very short as this is a demonstration
    evaluation_strategy="steps",
    max_steps=50,
    logging_steps=10,
    eval_steps=10,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sst2_ds["train"],
    eval_dataset=sst2_ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=sst2_metric,
)

trainer.train()
max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx.
***** Running training *****
Num examples = 67349
Num Epochs = 1
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 50
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
I’m showing this error as it’s quite important to be aware of it. The problem here is that the data has not been tokenized, and the Trainer is expecting tokenized input.
input_ids
The input to the model is the input_ids field. If these are lists of integers then the trainer will correctly form these into batches, creating the appropriate attention mask.
However if you use tensors then it will attempt to concatenate them, which only works if they are all the same size. Making them the same size would also require creating the attention_mask as you would almost certainly be padding some inputs.
It’s much easier to create the input_ids as a list of integers.
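As a sketch of that encoding step (the exact arguments in the original run may have differed), the tokenizer can be applied to every row with map, producing the encoded_sst2_ds used by the trainer below:
Code
from typing import Any, Dict

def encode(row: Dict[str, Any]) -> Dict[str, Any]:
    return tokenizer(
        row["sentence"],
        truncation=True,              # keep every row within the model's maximum length
        return_attention_mask=False,  # the trainer recreates this when batching
        return_token_type_ids=False,  # not needed for this task
    )

encoded_sst2_ds = sst2_ds.map(encode)
encoded_sst2_ds["train"][0]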
{'idx': 0,
'label': 0,
'sentence': 'hide new secretions from the parental units ',
'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102]}
As you can see encoding the dataset applies the function to all rows in the different splits. The map function is a bit odd - if you list the columns to transform then it will drop the others. Calling map produces a new dataset, leaving the original unaltered.
We can drop everything except the label and the input_ids now.
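A sketch of that, using the remove_columns method of the dataset:
Code
encoded_sst2_ds = encoded_sst2_ds.remove_columns(["sentence", "idx"])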
Now we can train a model with this dataset. The training configuration here is identical to before.
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction


def compute_metrics(results: EvalPrediction) -> Dict[str, Any]:
    return sst2_metric.compute(
        # convert raw model output into predicted class
        predictions=results.predictions.argmax(axis=1),
        references=results.label_ids,
    )


training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # very short as this is a demonstration
    evaluation_strategy="steps",
    max_steps=50,
    logging_steps=10,
    eval_steps=10,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_sst2_ds["train"],
    eval_dataset=encoded_sst2_ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
PyTorch: setting up devices
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 67349
Num Epochs = 1
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 50
[50/50 03:30, Epoch 0/1]
Step | Training Loss | Validation Loss | Accuracy
10 | 0.239100 | 0.338536 | 0.871560
20 | 0.155000 | 0.359864 | 0.861239
30 | 0.159900 | 0.394075 | 0.857798
40 | 0.219400 | 0.335876 | 0.869266
50 | 0.371100 | 0.313348 | 0.879587
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50 (score: 0.8795871559633027).
Having to wrap the metric in a compute_metrics function shows that the GLUE datasets.Metric object isn't currently designed for use with the huggingface trainer. There are a few rough edges like this between the two. I'm hopeful that the integration will improve in time, as it would be nice for the trainer to use the incremental metric calculation - I've had memory issues with large validation datasets in the past and incremental metric calculation would address these. For now, if you run into memory issues when evaluating during training, you can either reduce your validation dataset size or use the eval_accumulation_steps option of TrainingArguments.
eval_accumulation_steps (int, optional) — Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but requires more memory).
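As a small sketch, it is just another field on the TrainingArguments:
Code
training_args = TrainingArguments(
    per_device_eval_batch_size=BATCH_SIZE,
    # move accumulated predictions from the GPU to the CPU every 10 evaluation steps
    eval_accumulation_steps=10,
    output_dir=MODEL_RUN_FOLDER / "output",
)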
Let’s see how this would be with the rewritten form of the metric:
Code
from typing import Dict

from transformers import EvalPrediction


def sst2_metric_simple(results: EvalPrediction) -> Dict[str, float]:
    # convert raw model output into predicted class
    predictions = results.predictions.argmax(axis=1)
    targets = results.label_ids
    accuracy = (predictions == targets).mean().item()
    return {"accuracy": accuracy}
Code
from transformers import BertForSequenceClassification

# reload the model to start from scratch again
model = BertForSequenceClassification.from_pretrained(MODEL_NAME)
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # very short as this is a demonstration
    evaluation_strategy="steps",
    max_steps=50,
    logging_steps=10,
    eval_steps=10,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_sst2_ds["train"],
    eval_dataset=encoded_sst2_ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=sst2_metric_simple,
)

trainer.train()
PyTorch: setting up devices
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 67349
Num Epochs = 1
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 50
[50/50 03:32, Epoch 0/1]
Step | Training Loss | Validation Loss | Accuracy
10 | 0.742300 | 0.614478 | 0.731651
20 | 0.577600 | 0.473612 | 0.840596
30 | 0.388400 | 0.352919 | 0.860092
40 | 0.330900 | 0.341197 | 0.865826
50 | 0.415000 | 0.322916 | 0.875000
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/special_tokens_map.json
***** Running Evaluation *****
Num examples = 872
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50 (score: 0.875).
Once you have created your own dataset you have to split it into test and train, which is easy enough. After this we will have something extremely similar to what we loaded from the GLUE benchmark.
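As a rough sketch of building and splitting your own dataset, assuming a hypothetical pandas dataframe df with input_ids and label columns (producing the split_ds used by the trainers below):
Code
import pandas as pd
from datasets import Dataset

# df is a hypothetical dataframe with "input_ids" and "label" columns
ds = Dataset.from_pandas(df)

# train_test_split returns a DatasetDict with "train" and "test" splits
split_ds = ds.train_test_split(test_size=200)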
That is how you can create your own dataset. The dataset class has a caching mechanism so if you run the same transformation over the same dataset then it can use the cache.
In some edge cases the creation of this cache can take a significant amount of time, so if the dataset preprocessing is too slow try moving your transformation to pandas. The cache creation can be slow if saving the data takes a long time or if the transformation function you use is difficult to serialize.
Custom Metric
I'm going to define the custom metric here to keep the order the same. The basic accuracy metric is fine; however, it is quite nice to be able to provide precision and recall figures too. We can easily write such a function by extending the compute_metrics function we defined before. (These metrics are provided in the evaluate library as accuracy, precision and f1, which can be combined using evaluate.combine - a sketch of that follows the custom metric below.)
Remember that the predictions are the raw model outputs, so we will have two values and the predicted class is whichever is bigger.
Code
from typing import Dict

from transformers import EvalPrediction
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def custom_metrics(results: EvalPrediction) -> Dict[str, float]:
    # convert raw model output into predicted class
    predictions = results.predictions.argmax(axis=1)
    targets = results.label_ids

    accuracy = accuracy_score(y_true=targets, y_pred=predictions)
    (
        (negative_precision, positive_precision),
        (negative_recall, positive_recall),
        (negative_fscore, positive_fscore),
        (negative_support, positive_support),
    ) = precision_recall_fscore_support(
        y_true=targets,
        y_pred=predictions,
        # you can often get no predictions for a class
        # providing this argument prevents a lot of warnings
        zero_division=0.,
    )

    return {
        "accuracy": accuracy,
        "negative_precision": negative_precision,
        "negative_recall": negative_recall,
        "negative_fscore": negative_fscore,
        "negative_support": negative_support,
        "positive_precision": positive_precision,
        "positive_recall": positive_recall,
        "positive_fscore": positive_fscore,
        "positive_support": positive_support,
    }
Remember that the EvalPrediction is just the predictions and labels, so writing a metric becomes very easy.
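For the parenthetical above, a sketch of the ready-made alternative using the evaluate library (assuming it is installed):
Code
from typing import Dict

import evaluate
from transformers import EvalPrediction

# combine the ready made metrics instead of writing them by hand
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def combined_metrics(results: EvalPrediction) -> Dict[str, float]:
    return combined.compute(
        predictions=results.predictions.argmax(axis=1),
        references=results.label_ids,
    )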
Custom Model
This gets a bit more involved. You’ve already seen that there are different models for different tasks - we used a BertForSequenceClassification model earlier to classify the text. There are many other models based on Bert such as BertForQuestionAnswering, BertForTokenClassification, BertForNextSentencePrediction…
Each of these distinct model types can change the output. To allow the trainer to train any of these models correctly, a model is responsible for calculating its own loss. If you pass the labels to a model alongside the input then the output will include the loss, based on the distance to those labels.
If you make a model that does this then it will train correctly.
When you look at an actual model it frequently does a lot more than this, and it can be helpful to provide additional details to the caller. For example the BertForSequenceClassification.forward method has the following behaviour:
def forward(
    self,
    # lots of parameters
):
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.bert(
        # lots of arguments
    )

    pooled_output = outputs[1]
    pooled_output = self.dropout(pooled_output)
    logits = self.classifier(pooled_output)

    loss = None
    if labels is not None:
        # calculate loss
        ...

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
Simplest Custom Model
The aim here is not to create something cutting edge, it’s to understand it. Let’s start with the simplest thing. The core behaviour of these NLP models is to pass the input through a transformer stack and then pass it through the task specific head. We can see that in the code snippet above here:
# the transformer stack
outputs = self.bert(
    # lots of arguments
)

# the classifier
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
We can just duplicate that and build upon the base bert model.
Code
from typing import *

from transformers import BertModel, BertPreTrainedModel, BertConfig
import torch
from torch import nn


class BertSimpleClassifier(BertPreTrainedModel):
    def __init__(self, config: BertConfig) -> None:
        super().__init__(config)
        # copied this bit from BertForSequenceClassification
        # often the transformer is created by a parent class
        self.config = config
        self.bert = BertModel(config)
        self.classifier = nn.Linear(
            in_features=config.hidden_size, out_features=2, bias=False
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, ...]:
        output = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        )
        logits = self.classifier(output[1])

        if labels is not None:
            # cross entropy is suitable for single class classifiers
            # it works as you expect when the input is [N, classes] floats and the target is [N] integers
            # so we reshape the input slightly
            # cross entropy applies softmax for you
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, 2), labels.flatten()
            )
            return (loss, logits)

        return (logits,)  # must be a tuple!
We can create a pretrained version of our custom model by using the from_pretrained method:
Code
model = BertSimpleClassifier.from_pretrained(MODEL_NAME) # this works!
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertSimpleClassifier: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertSimpleClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertSimpleClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertSimpleClassifier were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
It’s worth stopping to cover how the from_pretrained method works, as it is quite simple and it works with our custom model. The method first loads the configuration for the model name. It then passes this configuration to the constructor of the class you call from_pretrained on (BertSimpleClassifier in this case). Finally it loads the weights from the files stored on S3 into the created model.
To get this to work with your custom model the __init__ method must take a config object as its only parameter, and it must call the superclass constructor with it:
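A sketch of that shape, with a hypothetical custom option read back out of the config for illustration:
Code
class MyClassifier(BertPreTrainedModel):
    def __init__(self, config: BertConfig) -> None:
        # from_pretrained passes only the config to this constructor
        super().__init__(config)
        self.bert = BertModel(config)
        # hypothetical custom option - read it from the config so that it is
        # saved and restored together with the model weights
        hidden_dropout = getattr(config, "classifier_dropout_prob", 0.1)
        self.dropout = nn.Dropout(hidden_dropout)
        self.classifier = nn.Linear(config.hidden_size, 2, bias=False)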
If you want to provide custom parameters to your constructor then do it through the configuration. The configuration is saved with the model weights so faithfully reproducing your model requires putting all configuration options in the config object.
When writing your own model it’s a good idea to quickly test that it works correctly with the training loop and the metrics. Let’s do that.
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # even shorter as this is testing the model and metrics
    evaluation_strategy="steps",
    max_steps=10,
    logging_steps=5,
    eval_steps=5,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_ds["train"],
    eval_dataset=split_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=custom_metrics,
)

trainer.train()
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 1800
Num Epochs = 1
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 10
[10/10 00:29, Epoch 0/1]
Step | Training Loss | Validation Loss | Accuracy | Negative Precision | Negative Recall | Negative Fscore | Negative Support | Positive Precision | Positive Recall | Positive Fscore | Positive Support
5 | 0.683700 | 0.657098 | 0.675000 | 0.654867 | 0.740000 | 0.694836 | 100 | 0.701149 | 0.610000 | 0.652406 | 100
10 | 0.660300 | 0.644941 | 0.655000 | 0.613139 | 0.840000 | 0.708861 | 100 | 0.746032 | 0.470000 | 0.576687 | 100
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-5 (score: 0.675).
This worked and produced nice metrics. It's a bit early to say if this is better than blind chance, but I feel like this would train just fine.
If you’re anything like me then you didn’t get to this point with your first try. You probably made a typo or something. The cell magic %pdb command at the start of this post is very useful. When you hit an error it produces an interactive debugger. In this you can execute python commands to inspect the current state of variables and find out why the code is broken. My error was that the logits were the classified outputs while the output came from the underlying bert model - and I was returning the wrong one.
Using a debugger is something that I do not do often enough. It can really help.
What Loss Function?
There are many loss functions available in pytorch. Choosing the correct loss function is the single largest difference you can make to the performance of your model. A well chosen loss function lets the model train quickly and become good at the task.
The loss function is an approximation of the way that you measure the quality of your model. Metrics themselves make poor loss functions because they tend to turn predictions into discrete choices. This discretization means that the loss cannot reflect a partial improvement that does not change the final output. As a first approximation, a loss function that closely tracks your core metric is a good choice.
To help with this here are some appropriate loss functions to use for different tasks:
Classification
This is when you have to label some text with a single label. An example of this is sentiment classification.
The metrics for this might be overall accuracy, as well as the per class precision/recall/f1. The Cross Entropy loss function measures the degree to which your model confidently predicts the correct class.
You should not apply softmax to the output of the model before passing it to this loss function.
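A minimal sketch of the shapes involved:
Code
import torch

logits = torch.randn(4, 2)           # [batch, classes] raw model output, no softmax
labels = torch.tensor([0, 1, 1, 0])  # [batch] class indices
loss = torch.nn.functional.cross_entropy(logits, labels)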
Multi Label Classification
This is when you want to label some text with zero or more labels. An example of this is assigning topics to text.
The metrics for this might be per label accuracy. The Binary Cross Entropy loss function treats each label as an independent prediction and measures how well your model does.
It’s best to use the with logits loss function without applying sigmoid to the model output as this is more numerically stable.
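The "with logits" form referred to here is binary_cross_entropy_with_logits (or the nn.BCEWithLogitsLoss module); a minimal sketch:
Code
import torch

logits = torch.randn(4, 3)  # [batch, labels] raw model output, no sigmoid applied
labels = torch.tensor([     # [batch, labels] of 0./1. floats, one column per label
    [1., 0., 1.],
    [0., 0., 0.],
    [1., 1., 0.],
    [0., 1., 1.],
])
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)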
Regression
This is when you want to provide a continuous output or score for the text. An example of this is measuring the grammaticality of the text.
The metrics for this might be mean absolute error. The mean squared error loss function works well for this as it punishes extremely wrong answers and is more gentle with approximately correct answers.
If you are mapping your outputs (e.g. using sigmoid) then you should apply this before using the loss function.
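A minimal sketch, squashing the output with sigmoid before the loss as described:
Code
import torch

predictions = torch.randn(4).sigmoid()        # [batch] continuous scores in 0..1
targets = torch.tensor([0.2, 0.9, 0.5, 0.1])  # [batch] true scores
loss = torch.nn.functional.mse_loss(predictions, targets)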
Similarity or Clustering
This is when you want to measure the similarity or difference between text pairs. An example of this is matching queries to documents.
The metrics for this might be the number of matching documents in the top 10 most similar documents. The Cosine Embedding loss function uses cosine similarity to move the output embeddings closer together for matching queries and documents, and separate the embeddings for non-matching queries and documents.
Cosine similarity does have a problem that it only considers the angle of the vectors. It is scale invariant so points that it considers very similar may be quite distant from each other.
If this is a problem then you should create a custom loss function that works with the euclidean distance. You can calculate pairwise distances with torch.cdist; when used with \(p = 2\) this is the euclidean distance.
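A sketch of that distance calculation:
Code
import torch

queries = torch.randn(5, 768)    # embeddings for 5 queries
documents = torch.randn(7, 768)  # embeddings for 7 documents

# pairwise euclidean distances, shape [5, 7]
distances = torch.cdist(queries, documents, p=2)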
Huggingface Style
The simple model doesn’t match the style of the huggingface models. Wouldn’t it be better to copy them?
The reason that the huggingface models provide all of that output is to give greater flexibility to downstream tasks. For example, by providing the internal state of the model it is possible to preprocess some data and reuse the preprocessed data for subsequent tasks, saving time. If you don’t require these enhancements then you don’t have to use the huggingface style. You can use it if you wish though - predicting future use is hard after all.
Broken Model
Let’s make a mistake in our implementation of the huggingface style. It’s easy to copy the style of a huggingface model slightly wrong and it can lead to some very strange situations with the metrics.
Code
from typing import *

from transformers import BertModel, BertPreTrainedModel, BertConfig
from transformers.modeling_outputs import SequenceClassifierOutput
import torch
from torch import nn


# THIS CODE HAS A BUG DO NOT COPY
class BertBadHuggingfaceClassifier(BertPreTrainedModel):
    def __init__(self, config: BertConfig) -> None:
        super().__init__(config)
        # copied this bit from BertForSequenceClassification
        # often the transformer is created by a parent class
        self.config = config
        self.bert = BertModel(config)
        self.classifier = nn.Linear(
            in_features=config.hidden_size, out_features=2, bias=False
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Tuple[torch.Tensor, ...]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        )
        logits = self.classifier(outputs[1])

        loss = None
        if labels is not None:
            # cross entropy is suitable for single class classifiers
            # it's most predictable when the input is [N, classes] floats and the target is [N] integers
            # cross entropy applies softmax for you
            loss = torch.nn.functional.cross_entropy(logits.reshape(-1, 2), labels.flatten())

        output = (logits,) + outputs[2:]

        # THE ERROR IS HERE
        # output is already a tuple, we should be extending it
        # correct code is:
        # return ((loss,) + output) if loss is not None else output
        return (loss, output) if loss is not None else output
Code
model = BertBadHuggingfaceClassifier.from_pretrained(MODEL_NAME)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertBadHuggingfaceClassifier: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertBadHuggingfaceClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertBadHuggingfaceClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertBadHuggingfaceClassifier were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # even shorter as this is testing the model and metrics
    evaluation_strategy="steps",
    max_steps=10,
    logging_steps=5,
    eval_steps=5,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_ds["train"],
    eval_dataset=split_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=custom_metrics,
)

trainer.train()
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 1800
Num Epochs = 1
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 10
AttributeError: 'tuple' object has no attribute 'argmax'
I used the debugger here to investigate the problem. As you can see, the predictions are no longer a numpy array; instead they have become a tuple. This is because of the way the model incorrectly forms the response.
The core problem is that the response from the model has to be well formed and that it is very easy to make a mistake. When you make a mistake you only find out when the metric code runs - which can be after some time if you are doing a full train.
Run your model and metric code early. You can run your metric code directly by evaluating your model:
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction

# Reduced this to what is required for an evaluation
training_args = TrainingArguments(
    per_device_eval_batch_size=BATCH_SIZE,
    report_to=[],  # you'd use wandb for weights and biases
    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=split_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=custom_metrics,
)

# THIS IS THE CHANGE
trainer.evaluate()
Working Model
This time let's make it correct. It really comes down to always returning a tensor as the model output and loss. I have frequently made the mistake where the model output is a tuple instead, and you only discover that when the metric calculation fails - the training loop does not inspect the output of the model.
Code
from typing import *

from transformers import BertModel, BertPreTrainedModel, BertConfig
from transformers.modeling_outputs import SequenceClassifierOutput
import torch
from torch import nn


class BertHuggingfaceClassifier(BertPreTrainedModel):
    def __init__(self, config: BertConfig) -> None:
        super().__init__(config)
        # copied this bit from BertForSequenceClassification
        # often the transformer is created by a parent class
        self.config = config
        self.bert = BertModel(config)
        self.classifier = nn.Linear(
            in_features=config.hidden_size, out_features=2, bias=False
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Tuple[torch.Tensor, ...]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        )
        logits = self.classifier(outputs[1])

        loss = None
        if labels is not None:
            # cross entropy is suitable for single class classifiers
            # it's most predictable when the input is [N, classes] floats and the target is [N] integers
            # cross entropy applies softmax for you
            loss = torch.nn.functional.cross_entropy(logits.reshape(-1, 2), labels.flatten())

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
Code
model = BertHuggingfaceClassifier.from_pretrained(MODEL_NAME)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertHuggingfaceClassifier: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertHuggingfaceClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertHuggingfaceClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertHuggingfaceClassifier were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Code
from pathlib import Path
from transformers import Trainer, TrainingArguments, EvalPrediction

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,
    report_to=[],  # you'd use wandb for weights and biases

    # even shorter as this is testing the model and metrics
    evaluation_strategy="steps",
    max_steps=100,
    logging_steps=10,
    eval_steps=10,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,

    no_cuda=True,  # let you run it on any machine

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_ds["train"],
    eval_dataset=split_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=custom_metrics,
)

trainer.train()
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 1800
Num Epochs = 2
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 100
[100/100 03:33, Epoch 1/2]
Step | Training Loss | Validation Loss | Accuracy | Negative Precision | Negative Recall | Negative Fscore | Negative Support | Positive Precision | Positive Recall | Positive Fscore | Positive Support
10 | 0.693200 | 0.644840 | 0.690000 | 0.784615 | 0.515152 | 0.621951 | 99 | 0.644444 | 0.861386 | 0.737288 | 101
20 | 0.635700 | 0.605355 | 0.655000 | 0.826087 | 0.383838 | 0.524138 | 99 | 0.603896 | 0.920792 | 0.729412 | 101
30 | 0.577800 | 0.497736 | 0.785000 | 0.833333 | 0.707071 | 0.765027 | 99 | 0.750000 | 0.861386 | 0.801843 | 101
40 | 0.531100 | 0.488882 | 0.770000 | 0.778947 | 0.747475 | 0.762887 | 99 | 0.761905 | 0.792079 | 0.776699 | 101
50 | 0.523500 | 0.489339 | 0.780000 | 0.823529 | 0.707071 | 0.760870 | 99 | 0.747826 | 0.851485 | 0.796296 | 101
60 | 0.428700 | 0.469223 | 0.765000 | 0.745283 | 0.797980 | 0.770732 | 99 | 0.787234 | 0.732673 | 0.758974 | 101
70 | 0.313300 | 0.490093 | 0.785000 | 0.774510 | 0.797980 | 0.786070 | 99 | 0.795918 | 0.772277 | 0.783920 | 101
80 | 0.371500 | 0.506241 | 0.775000 | 0.806818 | 0.717172 | 0.759358 | 99 | 0.750000 | 0.831683 | 0.788732 | 101
90 | 0.325700 | 0.475269 | 0.785000 | 0.804348 | 0.747475 | 0.774869 | 99 | 0.768519 | 0.821782 | 0.794258 | 101
100 | 0.388900 | 0.463786 | 0.795000 | 0.795918 | 0.787879 | 0.791878 | 99 | 0.794118 | 0.801980 | 0.798030 | 101
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-30/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-40/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-60
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-60/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-60/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-60/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-60/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-70
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-70/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-70/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-70/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-70/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-80
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-80/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-80/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-80/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-80/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-90
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-90/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-90/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-90/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-90/special_tokens_map.json
***** Running Evaluation *****
Num examples = 200
Batch size = 32
Saving model checkpoint to /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100/pytorch_model.bin
tokenizer config file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100/tokenizer_config.json
Special tokens file saved in /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from /data/blog/2021-09-12-train-using-huggingface/runs/output/checkpoint-100 (score: 0.795).
After training a model it’s a very good idea to save it.
If you have to restart your notebook or it crashes then you can lose your model. Even worse, when using a GPU if you hit a CUDA error then the graphics card can end up in an error state where you will not be able to perform operations - not even saving your model.
Code
from pathlib import Path

model.save_pretrained(Path("/data/blog/2021-09-12-train-using-huggingface/model"))
Configuration saved in /data/blog/2021-09-12-train-using-huggingface/model/config.json
Model weights saved in /data/blog/2021-09-12-train-using-huggingface/model/pytorch_model.bin
You can load this model by passing the folder to from_pretrained.
Code
from pathlib import Path

model = BertHuggingfaceClassifier.from_pretrained(
    Path("/data/blog/2021-09-12-train-using-huggingface/model")
)
loading configuration file /data/blog/2021-09-12-train-using-huggingface/model/config.json
Model config BertConfig {
"_name_or_path": "bert-base-uncased",
"architectures": [
"BertHuggingfaceClassifier"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.9.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
loading weights file /data/blog/2021-09-12-train-using-huggingface/model/pytorch_model.bin
All model checkpoint weights were used when initializing BertHuggingfaceClassifier.
All the weights of BertHuggingfaceClassifier were initialized from the model checkpoint at /data/blog/2021-09-12-train-using-huggingface/model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertHuggingfaceClassifier for predictions without further training.
Manually Evaluating a Model
We want to be able to spot check the model to make sure that the evaluation is measuring what we intend - it's easy to write a metric that doesn't actually check what you want. Luckily evaluating the model is easy: we just have to invoke it and inspect the output.
Code
from typing import *
from pathlib import Path

import torch
from transformers import AutoTokenizer

# load the saved model
model = BertHuggingfaceClassifier.from_pretrained(
    Path("/data/blog/2021-09-12-train-using-huggingface/model")
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# put the model into evaluation mode to disable things like dropout
model.eval()

@torch.no_grad()  # disable gradient calculation - makes it much faster and use less memory
def evaluate(text: str) -> Dict[str, float]:
    tokens = tokenizer(
        text,
        return_tensors="pt",          # return pytorch tensors
        return_attention_mask=False,  # not needed for a single row
        return_token_type_ids=False,  # not needed for this classification task
    )
    # tokens is a dictionary that matches the kwargs of the model
    # so we can just use it directly
    output = model(**tokens)
    # output is always batched so this gets the only row
    predictions = output.logits[0]
    predictions = predictions.softmax(dim=-1)
    negative, positive = predictions.tolist()
    return {
        "negative": negative,
        "positive": positive,
    }
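We can then run a couple of sentences through it. The particular sentences here are just for illustration:
Code
# each call returns a dict like {"negative": ..., "positive": ...}
evaluate("that loves its characters and communicates something rather beautiful about human nature")
evaluate("I really do not like cheese")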
This model clearly has strong feelings about cheese. How very dare it.
Considering how little this was trained, this is doing ok. An accuracy in the high 70s is believable.
Next Steps
There are several ways that you can extend this approach to improve the model you are creating. I'm going to introduce a few of them here, however this post is already quite long so I will not explore them thoroughly.
Customizing the Optimizer
The approach that I have outlined in this post involves taking a well trained model and replacing its head. This means one part of the model has been trained extensively while another part is freshly initialized.
When training the model the weight updates affect both parts, so the well trained body is changed to compensate for the untrained head, which degrades the body weights.
If the classification head were first trained without altering the body then its weights could become better than random. After that the whole model could be trained, allowing the body to support the classifier more.
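A minimal sketch of this, assuming the pretrained body is exposed as model.bert (as it is for the standard BERT task models):
Code
# freeze every parameter in the pretrained body so that only the new head is updated
for parameter in model.bert.parameters():
    parameter.requires_grad = False

# ...train the classification head for a while...

# then unfreeze the body so the whole model can be fine-tuned together
for parameter in model.bert.parameters():
    parameter.requires_grad = True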
This technique is used in the FastAI course and was introduced in (Howard and Ruder 2018):
Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-Tuning for Text Classification.” https://arxiv.org/abs/1801.06146.
Gradual unfreezing
Rather than fine-tuning all layers at once, which risks catastrophic forgetting, we propose to gradually unfreeze the model starting from the last layer as this contains the least general knowledge (Yosinski et al., 2014): We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we fine-tune all layers until convergence at the last iteration. This is similar to ‘chain-thaw’ (Felbo et al., 2017), except that we add a layer at a time to the set of ‘thawed’ layers, rather than only training a single layer at a time.
Another improvement is to use a learning rate that gradually increases as you move from the start of the body to the end of the classification head. The principle is that the start of the model deals with basic, universal qualities of text, while the end of the model is specific to the task at hand. By reducing the rate at which the start of the model can change you preserve the universal knowledge that the model already encodes. This was introduced in the same paper as discriminative fine-tuning.
Discriminative fine-tuning
As different layers capture different types of information (Yosinski et al., 2014), they should be fine-tuned to different extents. In CV where fine-tuning the entire model is too costly as networks can have more than 100 layers (Huang et al., 2017), a common practice is to fine-tune initially just the last hidden layer, and then unfreeze additional layers for further fine-tuning.
The simplest approach to treating different layers appropriately is to fine-tune the model one layer at a time, analogous to greedy layer-wise training (Bengio et al., 2007) and ‘chain-thaw’ (Felbo et al., 2017). However, this introduces a sequential requirement, hindering parallelism, and requires multiple passes through the dataset, resulting in overfitting for small datasets. For this reason, we propose a more efficient fine-tuning method, discriminative fine-tuning.
Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates
To apply these techniques with the trainer you need to create your own optimizer (and optionally a scheduler). The trainer accepts a single optimizer/scheduler pair, and you can give different parts of the model different learning rates by using parameter groups within that optimizer.
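Here is a minimal sketch of discriminative learning rates. The attribute names model.bert and model.classifier, the learning rates, and the dataset and argument variable names are assumptions - adjust them to match your own model and training setup.
Code
from torch.optim import AdamW
from transformers import Trainer, get_linear_schedule_with_warmup

# the pretrained body gets a small learning rate, the fresh head a larger one
optimizer = AdamW(
    [
        {"params": model.bert.parameters(), "lr": 1e-5},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ]
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=100,  # set this to the total number of optimization steps
)

trainer = Trainer(
    model=model,
    args=training_args,       # the training arguments defined for the earlier run
    train_dataset=train_ds,   # assumed names for the encoded train/validation datasets
    eval_dataset=valid_ds,
    optimizers=(optimizer, scheduler),  # the custom optimizer/scheduler pair
)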
Sequence to Sequence Models
Training a sequence to sequence model involves using a Seq2SeqTrainer and the associated arguments. The key difference is that a sequence to sequence model steps through the input and labels one token at a time, making a prediction at each point, so a single input row is more like a batch as the model has to process it many times. This is different enough from the normal training process that it requires a custom trainer.
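A minimal sketch of what this looks like - the model, tokenizer and dataset variable names here are assumptions, and the output directory is just an example:
Code
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

seq2seq_args = Seq2SeqTrainingArguments(
    output_dir=str(MODEL_RUN_FOLDER / "seq2seq-output"),
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    predict_with_generate=True,  # generate the output token by token during evaluation
)

trainer = Seq2SeqTrainer(
    model=seq2seq_model,              # an encoder-decoder model such as T5 or BART
    args=seq2seq_args,
    train_dataset=seq2seq_train_ds,   # assumed names for the tokenized datasets
    eval_dataset=seq2seq_valid_ds,
    tokenizer=tokenizer,
)
trainer.train()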
Simple Transformers
The simple transformers library builds on the transformers library to simplify the process of training your own models. I’ve used it briefly before and it worked well to train RoBERTa on a regression task. It is good and I do recommend it.
Even though I say all this, I still do not use it often.
Get Familiar with the Existing Code
The huggingface code is well documented and relatively easy to read. If you review the task specific versions of each model you can see the way in which they customize the model to meet the requirements of the task. Going through the models from old to new will also show you how the techniques involved have progressed.
Remember that you can just use ?? in a Jupyter notebook to see the source code for anything.
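For example, to read the source of a task specific model directly in the notebook:
Code
from transformers import BertForSequenceClassification

# show the full source code of the class in the notebook
BertForSequenceClassification??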
Copying an existing task head and adjusting it to your needs is an easy way to get familiar with deep learning model creation.