Hyperparameter Search with the Huggingface Trainer

Optimizing training hyperparameters with various libraries
Published March 8, 2025

When training a model for a task there are many settings that can be provided to the trainer. These settings affect the resulting model, and it is difficult to know in advance which values are best.

Hyperparameter search is a way to determine the best value by repeatedly running the training with different values. It’s an interesting problem space as it involves using previous tests to predict the outcome of as yet unexplored values. These tests can be split into those that explore the space and those that maximize the result (exploration versus exploitation).
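
As a rough illustration of that trade-off, here is a minimal sketch of a search loop over a single value. Everything in it is made up for the example: toy_score is a hypothetical stand-in for a validation metric, and the 30% exploration rate is arbitrary. It is not how Optuna or the Trainer choose values; it only shows the two kinds of test side by side.

```python
import math
import random


def toy_score(log10_lr: float) -> float:
    """Hypothetical stand-in for a validation metric, peaking at a learning rate of 1e-4."""
    return math.exp(-((log10_lr + 4.0) ** 2))


def search(
    trials: int = 30,
    explore_probability: float = 0.3,
    seed: int = 0,
) -> tuple[float, float]:
    rng = random.Random(seed)
    low, high = -5.0, -3.0  # log10 of the learning rate, so 1e-5 to 1e-3

    best_x = rng.uniform(low, high)
    best_score = toy_score(best_x)
    for _ in range(trials - 1):
        if rng.random() < explore_probability:
            # exploration: try a value anywhere in the range
            candidate = rng.uniform(low, high)
        else:
            # exploitation: try a value close to the best one seen so far
            candidate = min(high, max(low, rng.gauss(best_x, 0.1)))
        score = toy_score(candidate)
        if score > best_score:
            best_x, best_score = candidate, score
    return best_x, best_score
```

Calling search() returns the best log10 learning rate found and its toy score. Raising explore_probability spends more trials looking at new regions, and lowering it spends more trials refining the current best.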

The Hugging Face Trainer supports hyperparameter optimization, and this notebook is an exploration of that support.

Optimizable Problem

I want a simple problem to optimize, as this is really an exploration of the tools. As such, a sentiment classifier should work. If I use a small model and a simple dataset then it should be quick to train.

The code below defines the dataset, the metrics and the training loop. It’s all standard stuff so I won’t explain it further.

from pathlib import Path

POST_FOLDER = Path("/data/blog/2025/03/08/hyperparameter-search")
POST_FOLDER.mkdir(parents=True, exist_ok=True)

DATASET_FOLDER = POST_FOLDER / "dataset"
TRAIN_FOLDER = POST_FOLDER / "train"
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=2)

if DATASET_FOLDER.exists():
    ds = load_from_disk(DATASET_FOLDER)
else:
    ds = load_dataset("fancyzhx/amazon_polarity")
    
    def create_documents(row: dict) -> dict[str, str]:
        return {"document": f"{row['title']}\n\n{row['content']}"}
    
    # map() calls this with batched=True, so rows["document"] is a list of strings
    def encode(rows: dict) -> dict[str, list[list[int]]]:
        return tokenizer(
            rows["document"],
            return_attention_mask=False,
            return_token_type_ids=False,
            padding=False,
        )
    
    ds = ds.map(create_documents)
    ds = ds.map(encode, batched=True)
    ds.save_to_disk(DATASET_FOLDER)
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from sklearn.metrics import precision_recall_fscore_support
from transformers import EvalPrediction
import numpy as np

def accuracy_precision_and_recall(results: EvalPrediction) -> dict[str, float]:
    y_pred = np.argmax(results.predictions, axis=1)
    y_true = results.label_ids

    accuracy = (y_pred == y_true).mean().item()
    precision, recall, fscore, _support = precision_recall_fscore_support(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }
from pathlib import Path
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

def train_model(
    *,
    model: AutoModelForSequenceClassification,
    tokenizer: AutoTokenizer,
    train_ds: Dataset,
    valid_ds: Dataset,
    batch_size: int,
    learning_rate: float,
    epochs: int,
    folder: Path = TRAIN_FOLDER,
) -> dict[str, float]:
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    training_args = TrainingArguments(
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 3,
        learning_rate=learning_rate,
        num_train_epochs=epochs,
        warmup_ratio=0.06,

        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        report_to=[],
        load_best_model_at_end=True,
        metric_for_best_model="fscore",
        greater_is_better=True,

        # output_dir is compulsory
        logging_dir=folder / "train",
        output_dir=folder / "train",
        overwrite_output_dir=True,
    )

    trainer = Trainer(
        model=model,
        processing_class=tokenizer,
        args=training_args,
        data_collator=collator,
        train_dataset=train_ds,
        eval_dataset=valid_ds,
        compute_metrics=accuracy_precision_and_recall,
    )
    trainer.train()

    metrics = trainer.evaluate()
    return metrics

We can now train the model against this Amazon sentiment dataset. It only has positive and negative classes, so it’s a very simple task.

train_model(
    model=model,
    tokenizer=tokenizer,
    train_ds=ds["train"].take(1_000),
    valid_ds=ds["test"].take(100),
    batch_size=32,
    epochs=1,
    learning_rate=1e-4,
)
[32/32 00:46, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.527900 0.367354 0.870000 0.871753 0.867724 0.868938

[2/2 00:00]
{'eval_loss': 0.36735397577285767,
 'eval_accuracy': 0.87,
 'eval_precision': 0.8717532467532467,
 'eval_recall': 0.8677238057005219,
 'eval_fscore': 0.8689384010484928,
 'eval_runtime': 0.621,
 'eval_samples_per_second': 161.04,
 'eval_steps_per_second': 3.221,
 'epoch': 1.0}

There is a problem with this kind of task: the pretrained model is so strong that training on 1,000 examples produces an 87% accurate classifier. Still, this is an exploration of technique, not of task difficulty.
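
It is worth keeping in mind how coarse a metric is when it is measured on only 100 validation examples. The following is a small sketch using the normal approximation to the binomial distribution. The helper name accuracy_interval is my own, and the approximation is only a rough guide at this sample size.

```python
import math


def accuracy_interval(
    accuracy: float,
    n_examples: int,
    z: float = 1.96,
) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n_examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * standard_error, accuracy + z * standard_error


low, high = accuracy_interval(0.87, 100)
print(f"{low:.3f} to {high:.3f}")
```

For 87% on 100 examples this gives an interval of roughly 0.80 to 0.94, so differences of a few points between runs are hard to separate from noise.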

Hyperparameter Optimization

There were three parameters in the original code that could be tuned: batch_size, epochs and learning_rate. The search below covers the learning rate and the batch size and keeps the number of epochs fixed. How would we define a search over these?
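
The learning rate range in the search space below is sampled with log=True, which spreads trials evenly across orders of magnitude instead of across raw values. Here is a minimal sketch of that idea in plain Python. The function sample_log_uniform is a hypothetical helper written for this illustration and is not Optuna's implementation.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Sample so that log(value) is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1_000)]

# about half the samples should fall below 1e-4, the midpoint on a log scale
below_1e4 = sum(sample < 1e-4 for sample in samples) / len(samples)
print(f"{below_1e4:.2f} of samples are below 1e-4")
```

With a plain uniform distribution over 1e-5 to 1e-3, only about 9% of samples would land below 1e-4, so the small learning rates would barely be tested.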

from pathlib import Path
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

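# loading the base checkpoint randomly initializes the classifier head and
# logs a warning; saving one initialized copy means every trial starts
# from the same weights and later loads do not repeat the warning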
MODEL_FOLDER = TRAIN_FOLDER / "blank"
if not (MODEL_FOLDER / "model.safetensors").exists():
    model = AutoModelForSequenceClassification.from_pretrained(
        "answerdotai/ModernBERT-base",
        num_labels=2,
    )
    model.save_pretrained(MODEL_FOLDER)

def model_init(trial):
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_FOLDER)
    return model

def optuna_hp_space(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [16, 32])
    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": batch_size,
        "per_device_eval_batch_size": batch_size * 3,
    }

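from transformers.trainer_utils import BestRun

# batch_size and learning_rate are only starting values here,
# each trial overrides them with the values from optuna_hp_space
def optimize_model(
    *,
    tokenizer: AutoTokenizer,
    train_ds: Dataset,
    valid_ds: Dataset,
    batch_size: int,
    learning_rate: float,
    epochs: int,
    trials: int,
    folder: Path = TRAIN_FOLDER,
) -> BestRun: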
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    training_args = TrainingArguments(
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 3,
        learning_rate=learning_rate,
        num_train_epochs=epochs,
        warmup_ratio=0.06,

        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        report_to=[],
        load_best_model_at_end=True,
        metric_for_best_model="fscore",
        greater_is_better=True,
        # the huggingface hyperparameter search has memory problems
        # see https://github.com/huggingface/setfit/issues/311
        # use_cpu=True, # getting cuda oom all the time

        # output_dir is compulsory
        logging_dir=folder / "train",
        output_dir=folder / "train",
        overwrite_output_dir=True,
    )

    trainer = Trainer(
        model_init=model_init,
        processing_class=tokenizer,
        args=training_args,
        data_collator=collator,
        train_dataset=train_ds,
        eval_dataset=valid_ds,
        compute_metrics=accuracy_precision_and_recall,
    )
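    # the search objective is accuracy, while metric_for_best_model
    # above selects checkpoints within a trial by fscore
    best_run = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=optuna_hp_space,
        n_trials=trials,
        compute_objective=lambda metrics: metrics["eval_accuracy"],
    )
    return best_run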
optimize_model(
    tokenizer=tokenizer,
    train_ds=ds["train"].take(1_000),
    valid_ds=ds["test"].take(100),
    batch_size=32,
    epochs=1,
    trials=5,
    learning_rate=1e-4,
)
[I 2025-03-30 14:42:32,815] A new study created in memory with name: no-name-5ba4c8e3-37ff-4b84-a0e4-d01d232ad306
[63/63 00:34, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.574500 0.466148 0.810000 0.818182 0.814733 0.809829

[I 2025-03-30 14:43:10,973] Trial 0 finished with value: 0.81 and parameters: {'learning_rate': 1.1932317963695397e-05, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.81.
[63/63 00:34, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.840700 0.699033 0.490000 0.614583 0.517664 0.374310

[I 2025-03-30 14:43:46,193] Trial 1 finished with value: 0.49 and parameters: {'learning_rate': 0.0008425556143563035, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.81.
[32/32 00:34, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.459800 0.231394 0.900000 0.913409 0.894821 0.898001

[I 2025-03-30 14:44:22,498] Trial 2 finished with value: 0.9 and parameters: {'learning_rate': 0.0001090460187424317, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.9.
[63/63 00:34, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.465700 0.337975 0.880000 0.879552 0.880771 0.879808

[I 2025-03-30 14:44:58,170] Trial 3 finished with value: 0.88 and parameters: {'learning_rate': 3.107467202967815e-05, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.9.
[32/32 00:34, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy Precision Recall Fscore
1 0.637300 0.571121 0.750000 0.757305 0.754516 0.749775

[I 2025-03-30 14:45:33,994] Trial 4 finished with value: 0.75 and parameters: {'learning_rate': 1.0654243244293594e-05, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.9.
BestRun(run_id='2', objective=0.9, hyperparameters={'learning_rate': 0.0001090460187424317, 'per_device_train_batch_size': 32}, run_summary=None)

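The hyperparameter optimization found a set of parameters that works better with the model, producing a 90% accurate classifier compared with 87% before. On 100 validation examples that is a difference of three predictions, and the best trial (a learning rate of about 1.09e-4 with a batch size of 32) is very close to the original settings. Even though this is a toy example there were several problems: the GPU frequently ran out of memory, and I found the documentation to be very sparse.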

Would this be better if I used optuna directly?