from pathlib import Path
POST_FOLDER = Path("/data/blog/2025/03/08/hyperparameter-search")
POST_FOLDER.mkdir(parents=True, exist_ok=True)

DATASET_FOLDER = POST_FOLDER / "dataset"
TRAIN_FOLDER = POST_FOLDER / "train"
March 8, 2025
When training a model for a task, there are many settings that can be provided to the trainer. These settings can affect the resulting model, and it is difficult to know which values are best.
Hyperparameter search is a way to determine the best value by repeatedly running the training with different values. It’s an interesting problem space as it involves using previous tests to predict the outcome of as yet unexplored values. These tests can be split into those that explore the space and those that maximize the result (exploration versus exploitation).
The huggingface trainer has support for hyperparameter optimization, and this notebook is an exploration of that.
I want a simple problem to optimize, as this is really an exploration of the tools. As such, a sentiment classifier should work. If I use a small model and a simple dataset, it should be quick and easy to train a model.
The code below defines the dataset, the metrics and the training loop. It’s all standard stuff so I won’t explain it further.
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=2)
if DATASET_FOLDER.exists():
ds = load_from_disk(DATASET_FOLDER)
else:
ds = load_dataset("fancyzhx/amazon_polarity")
def create_documents(row: dict) -> dict[str, str]:
return {"document": f"{row['title']}\n\n{row['content']}"}
def encode(rows, batched=True) -> dict[str, list[int]]:
return tokenizer(
rows["document"],
return_attention_mask=False,
return_token_type_ids=False,
padding=False,
)
ds = ds.map(create_documents)
ds = ds.map(encode, batched=True)
ds.save_to_disk(DATASET_FOLDER)
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from sklearn.metrics import precision_recall_fscore_support
from transformers import EvalPrediction
import numpy as np
def accuracy_precision_and_recall(results: EvalPrediction) -> dict[str, float]:
y_pred = np.argmax(results.predictions, axis=1)
y_true = results.label_ids
accuracy = (y_pred == y_true).mean().item()
precision, recall, fscore, _support = precision_recall_fscore_support(
y_true=y_true,
y_pred=y_pred,
average="macro",
)
return {
"accuracy": accuracy,
"precision": precision,
"recall": recall,
"fscore": fscore,
}
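As a quick sanity check, the metrics function can be called directly on a hand-built EvalPrediction; the logits and labels below are made up purely for illustration.

import numpy as np
from transformers import EvalPrediction

# two-class logits for four examples, with made-up labels
dummy = EvalPrediction(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]),
    label_ids=np.array([1, 0, 0, 0]),
)
accuracy_precision_and_recall(dummy)
# accuracy is 0.75 here; precision, recall and fscore are macro averaged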
from pathlib import Path
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
)
from datasets import Dataset
def train_model(
*,
model: AutoModelForSequenceClassification,
tokenizer: AutoTokenizer,
train_ds: Dataset,
valid_ds: Dataset,
batch_size: int,
learning_rate: float,
epochs: int,
folder: Path = TRAIN_FOLDER,
) -> dict[str, float]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size * 3,
learning_rate=learning_rate,
num_train_epochs=epochs,
warmup_ratio=0.06,
eval_strategy="epoch",
logging_strategy="epoch",
save_strategy="epoch",
save_total_limit=2,
report_to=[],
load_best_model_at_end=True,
metric_for_best_model="fscore",
greater_is_better=True,
# output_dir is compulsory
logging_dir=folder / "train",
output_dir=folder / "train",
overwrite_output_dir=True,
)
trainer = Trainer(
model=model,
processing_class=tokenizer,
args=training_args,
data_collator=collator,
train_dataset=train_ds,
eval_dataset=valid_ds,
compute_metrics=accuracy_precision_and_recall,
)
trainer.train()
metrics = trainer.evaluate()
return metrics
We can now train the model against this Amazon sentiment dataset. It only has positive and negative classes, so it’s a very simple task.
train_model(
model=model,
tokenizer=tokenizer,
train_ds=ds["train"].take(1_000),
valid_ds=ds["test"].take(100),
batch_size=32,
epochs=1,
learning_rate=1e-4,
)
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.527900 | 0.367354 | 0.870000 | 0.871753 | 0.867724 | 0.868938 |
{'eval_loss': 0.36735397577285767,
'eval_accuracy': 0.87,
'eval_precision': 0.8717532467532467,
'eval_recall': 0.8677238057005219,
'eval_fscore': 0.8689384010484928,
'eval_runtime': 0.621,
'eval_samples_per_second': 161.04,
'eval_steps_per_second': 3.221,
'epoch': 1.0}
Here is a problem with this kind of task: the model quality is so high that training over 1,000 examples produces an 87% accurate classifier. Still, this is an exploration of technique, not of task difficulty.
There were three parameters in the original code that could be tuned. The batch_size, epochs and learning_rate were all suitable parameters to tweak. How would we define a search over these?
from pathlib import Path
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
)
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
# loading the base model warns that the classifier head is newly initialized;
# saving the model once means every hyperparameter run starts from the same
# weights and the warning is not repeated for each trial
MODEL_FOLDER = TRAIN_FOLDER / "blank"
if not (MODEL_FOLDER / "model.safetensors").exists():
model = AutoModelForSequenceClassification.from_pretrained(
"answerdotai/ModernBERT-base",
num_labels=2,
)
model.save_pretrained(MODEL_FOLDER)
def model_init(trial):
model = AutoModelForSequenceClassification.from_pretrained(MODEL_FOLDER)
return model
def optuna_hp_space(trial):
learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
batch_size = trial.suggest_categorical("per_device_train_batch_size", [16, 32])
return {
"learning_rate": learning_rate,
"per_device_train_batch_size": batch_size,
"per_device_eval_batch_size": batch_size * 3,
}
def optimize_model(
*,
tokenizer: AutoTokenizer,
train_ds: Dataset,
valid_ds: Dataset,
batch_size: int,
learning_rate: float,
epochs: int,
trials: int,
folder: Path = TRAIN_FOLDER,
) -> dict[str, float]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size * 3,
learning_rate=learning_rate,
num_train_epochs=epochs,
warmup_ratio=0.06,
eval_strategy="epoch",
logging_strategy="epoch",
save_strategy="epoch",
save_total_limit=2,
report_to=[],
load_best_model_at_end=True,
metric_for_best_model="fscore",
greater_is_better=True,
# the huggingface hyperparameter search has memory problems
# see https://github.com/huggingface/setfit/issues/311
# use_cpu=True, # getting cuda oom all the time
# output_dir is compulsory
logging_dir=folder / "train",
output_dir=folder / "train",
overwrite_output_dir=True,
)
trainer = Trainer(
model_init=model_init,
processing_class=tokenizer,
args=training_args,
data_collator=collator,
train_dataset=train_ds,
eval_dataset=valid_ds,
compute_metrics=accuracy_precision_and_recall,
)
best_trials = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
hp_space=optuna_hp_space,
n_trials=trials,
compute_objective=lambda metrics: metrics["eval_accuracy"],
)
return best_trials
optimize_model(
tokenizer=tokenizer,
train_ds=ds["train"].take(1_000),
valid_ds=ds["test"].take(100),
batch_size=32,
epochs=1,
trials=5,
learning_rate=1e-4,
)
[I 2025-03-30 14:42:32,815] A new study created in memory with name: no-name-5ba4c8e3-37ff-4b84-a0e4-d01d232ad306
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.574500 | 0.466148 | 0.810000 | 0.818182 | 0.814733 | 0.809829 |
[I 2025-03-30 14:43:10,973] Trial 0 finished with value: 0.81 and parameters: {'learning_rate': 1.1932317963695397e-05, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.81.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.840700 | 0.699033 | 0.490000 | 0.614583 | 0.517664 | 0.374310 |
[I 2025-03-30 14:43:46,193] Trial 1 finished with value: 0.49 and parameters: {'learning_rate': 0.0008425556143563035, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.81.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.459800 | 0.231394 | 0.900000 | 0.913409 | 0.894821 | 0.898001 |
[I 2025-03-30 14:44:22,498] Trial 2 finished with value: 0.9 and parameters: {'learning_rate': 0.0001090460187424317, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.9.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.465700 | 0.337975 | 0.880000 | 0.879552 | 0.880771 | 0.879808 |
[I 2025-03-30 14:44:58,170] Trial 3 finished with value: 0.88 and parameters: {'learning_rate': 3.107467202967815e-05, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.9.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.637300 | 0.571121 | 0.750000 | 0.757305 | 0.754516 | 0.749775 |
[I 2025-03-30 14:45:33,994] Trial 4 finished with value: 0.75 and parameters: {'learning_rate': 1.0654243244293594e-05, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.9.
BestRun(run_id='2', objective=0.9, hyperparameters={'learning_rate': 0.0001090460187424317, 'per_device_train_batch_size': 32}, run_summary=None)
The hyperparameter optimization was able to find a set of parameters that works better with the model, producing a 90% accurate classifier. Even though this is a toy example, there were several problems: the GPU frequently ran out of memory, and I found the documentation to be very sparse.
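For reference, the winning settings can be fed back into the train_model function from earlier. This is a minimal sketch, assuming the BestRun shown above had been assigned to a variable called best_run.

# best_run = optimize_model(...) from above
train_model(
    model=AutoModelForSequenceClassification.from_pretrained(MODEL_FOLDER),
    tokenizer=tokenizer,
    train_ds=ds["train"].take(1_000),
    valid_ds=ds["test"].take(100),
    batch_size=best_run.hyperparameters["per_device_train_batch_size"],
    epochs=1,
    learning_rate=best_run.hyperparameters["learning_rate"],
)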
Would this be better if I used optuna directly?
This is structured slightly differently. The two concepts are the study, which is an exercise in optimizing the hyperparameters, and the trial, which is a single run with a set of hyperparameters.
Each trial is done by running a function that takes a trial object. The trial object can then produce the values for that trial through its suggest_ methods. The output of the trial function is a value to be minimized.
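Before wiring this up to the classifier, here is a toy objective, just minimizing a quadratic, to show the shape of that interface.

import optuna

# find the x that minimizes (x - 2)^2
def toy_objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

toy_study = optuna.create_study()  # minimization is the default direction
toy_study.optimize(toy_objective, n_trials=20)
toy_study.best_params  # the best x found so far, which should drift toward 2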
This means I can hook up the optuna code to my train_model code from before.
import optuna
from pathlib import Path
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
)
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
# loading the base model warns that the classifier head is newly initialized;
# saving the model once means every hyperparameter run starts from the same
# weights and the warning is not repeated for each trial
MODEL_FOLDER = TRAIN_FOLDER / "blank"
if not (MODEL_FOLDER / "model.safetensors").exists():
model = AutoModelForSequenceClassification.from_pretrained(
"answerdotai/ModernBERT-base",
num_labels=2,
)
model.save_pretrained(MODEL_FOLDER)
# Define an objective function to be minimized.
def objective(trial: optuna.Trial) -> float:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_FOLDER)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
train_ds = ds["train"].take(1_000)
valid_ds = ds["test"].take(100)
epochs = 1
    # note: 'classifier' is a misleading name for this search parameter, it is really the batch size
    batch_size = trial.suggest_categorical('classifier', [16, 32, 64])
learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
metrics = train_model(
model=model,
tokenizer=tokenizer,
train_ds=train_ds,
valid_ds=valid_ds,
batch_size=batch_size,
epochs=epochs,
learning_rate=learning_rate,
)
accuracy = metrics["eval_accuracy"]
error = 1 - accuracy
return error
study = optuna.create_study() # Create a new study.
# gc_after_trial prevents cuda oom
# see https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies
study.optimize(objective, n_trials=5, gc_after_trial=True) # Invoke optimization of the objective function.
[I 2025-03-30 15:03:28,018] A new study created in memory with name: no-name-0fea4ec6-bb73-4a48-ba0a-5bd65f25d91b
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.471100 | 0.203674 | 0.930000 | 0.936371 | 0.926736 | 0.929143 |
[I 2025-03-30 15:04:14,658] Trial 0 finished with value: 0.06999999999999995 and parameters: {'classifier': 32, 'learning_rate': 9.073398965583905e-05}. Best is trial 0 with value: 0.06999999999999995.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.408900 | 0.229655 | 0.920000 | 0.924113 | 0.917302 | 0.919192 |
[I 2025-03-30 15:04:49,514] Trial 1 finished with value: 0.07999999999999996 and parameters: {'classifier': 16, 'learning_rate': 4.4432591516234544e-05}. Best is trial 0 with value: 0.06999999999999995.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.475900 | 0.368769 | 0.860000 | 0.860606 | 0.858290 | 0.859098 |
[I 2025-03-30 15:05:36,117] Trial 2 finished with value: 0.14 and parameters: {'classifier': 16, 'learning_rate': 2.001954927549395e-05}. Best is trial 0 with value: 0.06999999999999995.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.535600 | 0.289646 | 0.880000 | 0.887143 | 0.875953 | 0.878247 |
[I 2025-03-30 15:06:24,540] Trial 3 finished with value: 0.12 and parameters: {'classifier': 16, 'learning_rate': 0.00036024883894480335}. Best is trial 0 with value: 0.06999999999999995.
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | Fscore |
---|---|---|---|---|---|---|
1 | 0.771500 | 0.459973 | 0.750000 | 0.750000 | 0.747290 | 0.747958 |
[I 2025-03-30 15:07:12,629] Trial 4 finished with value: 0.25 and parameters: {'classifier': 16, 'learning_rate': 0.0008112443241609021}. Best is trial 0 with value: 0.06999999999999995.
This has concluded. I feel that using optuna directly was slightly easier than going through the huggingface approach, simply because of the garbage collection concerns, and because most of the output I was seeing was actually coming from optuna all along.
Optuna does come with a visualization library that can show you how the model performance varied. Something to play with another time.
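For reference, a minimal sketch of what inspecting the finished study might look like; the plotting call assumes plotly is installed, which optuna’s visualization module uses.

# inspect the finished study
study.best_params   # the parameter set from the best trial
study.best_value    # the lowest error (1 - accuracy) seen across trials

# plot how the objective changed over the trials (requires plotly)
fig = optuna.visualization.plot_optimization_history(study)
fig.show()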