GPT-2 Recycle

Actually training GPT-2 for another language properly
Published

April 12, 2021

I’ve been trying to retrain GPT-2 to work as a language model for another language. Although the training itself appears to go well, I have found that downstream task performance is lacking. Evidently perplexity is not the only thing I need to work on.

To this end, one of my co-workers linked a paper by Wietse de Vries and Malvina Nissim which covers how to effectively retrain GPT-2 for a different language. It does this in stages in quite a neat way.

First a new tokenizer is created for the target language, which requires a new embedding layer. That embedding layer is then trained while the rest of the model stays frozen (to prevent catastrophic forgetting). Finally the whole model can be fine-tuned a little.
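The freezing stage can be sketched with a toy torch module (`TinyLM` and `freeze_all_but_embedding` are illustrative names of mine, not part of gpt2-recycle):

```python
from torch import nn

# Toy stand-in for GPT-2: an embedding plus a "body". This is an
# illustrative sketch of the freezing idea, not the gpt2-recycle code.
class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 100, dim: int = 16):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)  # new language embedding
        self.body = nn.Linear(dim, dim)           # stands in for the transformer blocks

def freeze_all_but_embedding(model: nn.Module) -> None:
    # Only the embedding receives gradient updates; the rest is frozen,
    # so the pretrained knowledge in the body cannot be forgotten.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("wte")

model = TinyLM()
freeze_all_but_embedding(model)
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # only the embedding weight remains trainable
```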

One of the neat approaches in the paper is that they train an embedding for GPT-2 small first, and have a way to scale that up to GPT-2 medium. They do this by finding the mapping between the embeddings of GPT-2 small and GPT-2 medium and then applying this mapping to their custom language embedding.
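One simple way to find such a mapping is a least-squares linear fit; the sketch below uses that, though the paper evaluates its own transformation methods, and the dimensions here are toy stand-ins (the real models use 768 for small and 1024 for medium):

```python
import numpy as np

# Sketch of the scale-up trick: learn a linear map W from the English
# GPT-2 small embedding space to the English GPT-2 medium space (both
# models share the same vocabulary), then apply W to the trained
# Spanish small embedding to initialise a Spanish medium embedding.
rng = np.random.default_rng(0)
vocab, d_small, d_medium = 200, 8, 12

eng_small = rng.normal(size=(vocab, d_small))
eng_medium = rng.normal(size=(vocab, d_medium))

# Least-squares fit: eng_small @ W ~ eng_medium
W, *_ = np.linalg.lstsq(eng_small, eng_medium, rcond=None)

# Apply the same map to a (toy) Spanish small embedding
spa_small = rng.normal(size=(300, d_small))
spa_medium_init = spa_small @ W
```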

Let’s see if I can get this working.

To make this library easy to use I have cloned it as a submodule at submodules/gpt2-recycle. I am going to be using a data folder at data/2021-04-12-gpt-2-recycle.

Code
! ls submodules/gpt2-recycle
environment.yml  example.py  HOWTO.md  LICENSE  README.md  src
Code
! ls data/2021-04-12-gpt-2-recycle
data  output  text  tokenizer.json  tokens

Data Preparation

The data preparation requires the following steps:

  • Generate a tokenizer
  • Tokenize the documents
  • Calculate document token lengths
  • Train / Valid split of documents

Create Tokenizer

The first thing to do is to create the custom tokenizer. Tokenization is done using byte pair encoding, which takes common sequences of characters and encodes each of them as a single token. I’ve got two datasets for this work: one is the Spanish Wikipedia and the other is a comparable number of Twitter sentences. Combining Wikipedia and Twitter should prevent overfitting on the more formal Wikipedia register.

Code
from pathlib import Path

SUBMODULE_FOLDER = str(Path("submodules/gpt2-recycle/src").resolve())
DATA_FOLDER = str(Path("data/2021-04-12-gpt-2-recycle").resolve())
Code
PREPARATION_FOLDER = Path(DATA_FOLDER) / "data" / "es" / "preparation"
PREPARATION_FOLDER.mkdir(exist_ok=True, parents=True)
Code
FILES = sorted((PREPARATION_FOLDER / "plaintext").glob("**/*.txt"))
Code
CHARACTERS = {
    character
    for file in FILES
    for character in file.read_text()
    if character != " "
}

(PREPARATION_FOLDER / "charset.txt").write_text("".join(sorted(CHARACTERS)))
len(CHARACTERS)
6140
Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m preparation.1_vocab_0_create es 40000
 > creating vocabulary with vocab size 40000k
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 48, in <module>
    main()
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 43, in main
    tokenizer = train_tokenizer(args.lang, args.size * 1000)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 25, in train_tokenizer
    tokenizer.train(trainer, (Path('data') / lang / 'plaintext').glob('*/*.txt'))
TypeError: Can't convert <tokenizers.trainers.BpeTrainer object at 0x7f191200a450> to Sequence

Here is a small problem with the existing code: the argument order of the tokenizer’s train method has changed in newer versions of the tokenizers library. To fix this I’ll recreate the code, passing the arguments as keyword arguments.

Code
# A discussion of the appropriate vocabulary size is in 2_prepare_2_eval.ipynb in gpt2-recycle
# A larger vocabulary results in rarer tokens.
# They chose 40,000 as a reasonable tradeoff.
VOCAB_SIZE = 40_000
BLOCK_SIZE = 128
BATCH_SIZE = 16
LEARNING_RATE = 1e-5
MAX_STEPS = 15_000_000 // BATCH_SIZE
MODEL_NAME = "gpt2-medium" # use "gpt2" for GPT-2 small
Code
from typing import List
from tokenizers import Tokenizer, trainers, normalizers, pre_tokenizers, decoders, processors, models

def make_tokenizer(lang: str, size: int) -> None:
    base_dir = PREPARATION_FOLDER / 'vocabularies'
    dst_path = base_dir / f'{lang}-{str(size//1_000).zfill(3)}k.tokenizer.json'
    if dst_path.exists():
        print(f' > {dst_path} already exists. skipping')
        return

    print(f' > creating vocabulary with vocab size {size//1000}k')
    tokenizer = train_tokenizer(files=FILES, vocab_size=size, alphabet=sorted(CHARACTERS))
    tokenizer.save(str(dst_path), pretty=True)

def train_tokenizer(files: List[Path], vocab_size: int, alphabet: List[str]) -> Tokenizer:
    tokenizer = Tokenizer(models.BPE())
    tokenizer.normalizer = normalizers.NFC()
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        show_progress=True,
        special_tokens=['<unk>', '<s>', '</s>'],
        initial_alphabet=alphabet,
    )

    files = [str(file.resolve()) for file in files]
    tokenizer.train(files=files, trainer=trainer)
    return tokenizer
Code
make_tokenizer(lang="es", size=VOCAB_SIZE)
 > creating vocabulary with vocab size 40k
Code
tokenizer = train_tokenizer(files=FILES, vocab_size=VOCAB_SIZE, alphabet=sorted(CHARACTERS))
tokenizer.save(str(DATA_FOLDER / "tokenizer.json"), pretty=True)

Tokenize Documents

This involves tokenizing every sentence. Since I have organized the data into files with one sentence per line, I had to double-space the files, because the tokenizer expects a blank line between documents.
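The double-spacing step can be sketched like this (the helper name is mine):

```python
from pathlib import Path

def double_space(path: Path) -> None:
    # One sentence per line in, a blank line after every sentence out,
    # so the tokenizer treats each sentence as a separate document.
    sentences = path.read_text().splitlines()
    path.write_text("\n\n".join(sentences) + "\n")
```

This gets run once over each plaintext file before tokenizing.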

Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m preparation.2_prepare_0_tokens --size 40 --model gpt2-medium es
 > preparing data/es/preparation/prepared/data-040k.pkl
🔥 data/es/preparation/plaintext/spanish-twitter.txt
15580452it [09:51, 26343.56it/s]
 ::: 7,790,226 examples loaded
🔥 data/es/preparation/plaintext/spanish-wikipedia.txt
6543442it [06:03, 17458.70it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1093 > 1024). Running this sequence through the model will result in indexing errors
14803750it [14:05, 17516.41it/s]
 ::: 15,192,101 examples loaded
15,192,101 examples
 > exporting data/es/preparation/prepared/data-040k.pkl
Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m preparation.2_prepare_1_lengths --size 40 es
 > data/es/preparation/prepared/data-040k.pkl
 ::: loading examples
 ::: counting lengths
 ::: saved 15192101 lengths to data/es/preparation/prepared/data-040k.pkl.lengths
 ::: counting coverages
 ::: saved 42042 coverage scores to data/es/preparation/prepared/data-040k.pkl.coverage
Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m preparation.3_finish_data --size 40 es
 > loading data/es/preparation/prepared/data-040k.pkl.lengths
Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m preparation.3_split_train_val es 40 0.1
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/3_split_train_val.py", line 12, in <module>
    n = args.size.zfill(3)
AttributeError: 'int' object has no attribute 'zfill'

This is an odd one: zfill is a string method, so args.size must now be arriving as an int. I think the argparse parameter type was updated.
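The fix is just to convert before padding:

```python
# zfill is defined on str, not int; the --size argument is now parsed
# as an int, so convert it first. 40 (thousand) becomes the "040" used
# in the data file names.
size = 40
n = str(size).zfill(3)
print(n)  # -> 040
```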

Code
import numpy as np
from torch import randperm

def split_data(vocab_size: int, val_ratio: float) -> None:
    # e.g. 40_000 -> "040", matching the file naming used by gpt2-recycle
    n = str(vocab_size // 1_000).zfill(3)

    prep_dir = PREPARATION_FOLDER / 'final'

    src_path = prep_dir / f'index-{n}k.npy'
    tra_dst_path = prep_dir / f'index-train-{n}k.npy'
    val_dst_path = prep_dir / f'index-valid-{n}k.npy'

    dat = np.load(src_path)

    n_val = int(len(dat) * val_ratio)
    n_tra = len(dat) - n_val

    print(f'train={n_tra:,} valid={n_val:,}')

    # shuffle the indices before splitting into train and valid
    ind = randperm(len(dat)).tolist()
    ind_tra = ind[:n_tra]
    ind_val = ind[n_tra:]

    np.save(tra_dst_path, ind_tra, allow_pickle=False)
    np.save(val_dst_path, ind_val, allow_pickle=False)
Code
split_data(VOCAB_SIZE, 0.1)
train=13,672,891 valid=1,519,210

That completes the data preparation. The next step is to train the model embeddings.


Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m training.main --help
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 16, in <module>
    from pytorch_lightning.callbacks import LearningRateLogger, ModelCheckpoint
ImportError: cannot import name 'LearningRateLogger' from 'pytorch_lightning.callbacks' (/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py)

LearningRateLogger was renamed to LearningRateMonitor (https://github.com/PyTorchLightning/pytorch-lightning/pull/3251), so I’ve just changed the import in the file for now.

Code
! cd $DATA_FOLDER ; PYTHONPATH=$SUBMODULE_FOLDER python -m training.main --help
usage: main.py [-h] [--logger [LOGGER]]
               [--checkpoint_callback [CHECKPOINT_CALLBACK]]
               [--default_root_dir DEFAULT_ROOT_DIR]
               [--gradient_clip_val GRADIENT_CLIP_VAL]
               [--process_position PROCESS_POSITION] [--num_nodes NUM_NODES]
               [--num_processes NUM_PROCESSES] [--gpus GPUS]
               [--auto_select_gpus [AUTO_SELECT_GPUS]] [--tpu_cores TPU_CORES]
               [--log_gpu_memory LOG_GPU_MEMORY]
               [--progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE]
               [--overfit_batches OVERFIT_BATCHES]
               [--track_grad_norm TRACK_GRAD_NORM]
               [--check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH]
               [--fast_dev_run [FAST_DEV_RUN]]
               [--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES]
               [--max_epochs MAX_EPOCHS] [--min_epochs MIN_EPOCHS]
               [--max_steps MAX_STEPS] [--min_steps MIN_STEPS]
               [--limit_train_batches LIMIT_TRAIN_BATCHES]
               [--limit_val_batches LIMIT_VAL_BATCHES]
               [--limit_test_batches LIMIT_TEST_BATCHES]
               [--limit_predict_batches LIMIT_PREDICT_BATCHES]
               [--val_check_interval VAL_CHECK_INTERVAL]
               [--flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS]
               [--log_every_n_steps LOG_EVERY_N_STEPS]
               [--accelerator ACCELERATOR] [--sync_batchnorm [SYNC_BATCHNORM]]
               [--precision PRECISION] [--weights_summary WEIGHTS_SUMMARY]
               [--weights_save_path WEIGHTS_SAVE_PATH]
               [--num_sanity_val_steps NUM_SANITY_VAL_STEPS]
               [--truncated_bptt_steps TRUNCATED_BPTT_STEPS]
               [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
               [--profiler [PROFILER]] [--benchmark [BENCHMARK]]
               [--deterministic [DETERMINISTIC]]
               [--reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]]
               [--auto_lr_find [AUTO_LR_FIND]]
               [--replace_sampler_ddp [REPLACE_SAMPLER_DDP]]
               [--terminate_on_nan [TERMINATE_ON_NAN]]
               [--auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]]
               [--prepare_data_per_node [PREPARE_DATA_PER_NODE]]
               [--plugins PLUGINS] [--amp_backend AMP_BACKEND]
               [--amp_level AMP_LEVEL]
               [--distributed_backend DISTRIBUTED_BACKEND]
               [--automatic_optimization [AUTOMATIC_OPTIMIZATION]]
               [--move_metrics_to_cpu [MOVE_METRICS_TO_CPU]]
               [--enable_pl_optimizer [ENABLE_PL_OPTIMIZER]]
               [--multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE]
               [--stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]]
               [--num_workers NUM_WORKERS] [--data_path DATA_PATH]
               [--data_index_path DATA_INDEX_PATH] [--mmap]
               [--max_seq_length MAX_SEQ_LENGTH]
               [--pretrained_path PRETRAINED_PATH] [--vocab_size VOCAB_SIZE]
               [--tokenizer_path TOKENIZER_PATH] [--wte_only] [--unfreeze]
               [--reset_state] [--subset_size SUBSET_SIZE] [--lr LR]
               [--batch_size BATCH_SIZE] [--verbose] [--search] [--seed SEED]
               [--version VERSION] [--name NAME] --lang LANG

optional arguments:
  -h, --help            show this help message and exit
  --logger [LOGGER]     Logger (or iterable collection of loggers) for
                        experiment tracking.
  --checkpoint_callback [CHECKPOINT_CALLBACK]
                        If ``True``, enable checkpointing. It will configure a
                        default ModelCheckpoint callback if there is no user-
                        defined ModelCheckpoint in :paramref:`~pytorch_lightni
                        ng.trainer.trainer.Trainer.callbacks`. Default:
                        ``True``. .. warning:: Passing a ModelCheckpoint
                        instance to this argument is deprecated since v1.1 and
                        will be unsupported from v1.3. Use `callbacks`
                        argument instead.
  --default_root_dir DEFAULT_ROOT_DIR
                        Default path for logs and weights when no
                        logger/ckpt_callback passed. Default: ``os.getcwd()``.
                        Can be remote file paths such as `s3://mybucket/path`
                        or 'hdfs://path/'
  --gradient_clip_val GRADIENT_CLIP_VAL
                        0 means don't clip.
  --process_position PROCESS_POSITION
                        orders the progress bar when running multiple models
                        on same machine.
  --num_nodes NUM_NODES
                        number of GPU nodes for distributed training.
  --num_processes NUM_PROCESSES
                        number of processes for distributed training with
                        distributed_backend="ddp_cpu"
  --gpus GPUS           number of gpus to train on (int) or which GPUs to
                        train on (list or str) applied per node
  --auto_select_gpus [AUTO_SELECT_GPUS]
                        If enabled and `gpus` is an integer, pick available
                        gpus automatically. This is especially useful when
                        GPUs are configured to be in "exclusive mode", such
                        that only one process at a time can access them.
  --tpu_cores TPU_CORES
                        How many TPU cores to train on (1 or 8) / Single TPU
                        to train on [1]
  --log_gpu_memory LOG_GPU_MEMORY
                        None, 'min_max', 'all'. Might slow performance
  --progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE
                        How often to refresh progress bar (in steps). Value
                        ``0`` disables progress bar. Ignored when a custom
                        progress bar is passed to
                        :paramref:`~Trainer.callbacks`. Default: None, means a
                        suitable value will be chosen based on the environment
                        (terminal, Google COLAB, etc.).
  --overfit_batches OVERFIT_BATCHES
                        Overfit a percent of training data (float) or a set
                        number of batches (int). Default: 0.0
  --track_grad_norm TRACK_GRAD_NORM
                        -1 no tracking. Otherwise tracks that p-norm. May be
                        set to 'inf' infinity-norm.
  --check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH
                        Check val every n train epochs.
  --fast_dev_run [FAST_DEV_RUN]
                        runs n if set to ``n`` (int) else 1 if set to ``True``
                        batch(es) of train, val and test to find any bugs (ie:
                        a sort of unit test).
  --accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
                        Accumulates grads every k batches or as set up in the
                        dict.
  --max_epochs MAX_EPOCHS
                        Stop training once this number of epochs is reached.
                        Disabled by default (None). If both max_epochs and
                        max_steps are not specified, defaults to
                        ``max_epochs`` = 1000.
  --min_epochs MIN_EPOCHS
                        Force training for at least these many epochs.
                        Disabled by default (None). If both min_epochs and
                        min_steps are not specified, defaults to
                        ``min_epochs`` = 1.
  --max_steps MAX_STEPS
                        Stop training after this number of steps. Disabled by
                        default (None).
  --min_steps MIN_STEPS
                        Force training for at least these number of steps.
                        Disabled by default (None).
  --limit_train_batches LIMIT_TRAIN_BATCHES
                        How much of training dataset to check (floats =
                        percent, int = num_batches)
  --limit_val_batches LIMIT_VAL_BATCHES
                        How much of validation dataset to check (floats =
                        percent, int = num_batches)
  --limit_test_batches LIMIT_TEST_BATCHES
                        How much of test dataset to check (floats = percent,
                        int = num_batches)
  --limit_predict_batches LIMIT_PREDICT_BATCHES
  --val_check_interval VAL_CHECK_INTERVAL
                        How often to check the validation set. Use float to
                        check within a training epoch, use int to check every
                        n steps (batches).
  --flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS
                        How often to flush logs to disk (defaults to every 100
                        steps).
  --log_every_n_steps LOG_EVERY_N_STEPS
                        How often to log within steps (defaults to every 50
                        steps).
  --accelerator ACCELERATOR
                        Previously known as distributed_backend (dp, ddp,
                        ddp2, etc...). Can also take in an accelerator object
                        for custom hardware.
  --sync_batchnorm [SYNC_BATCHNORM]
                        Synchronize batch norm layers between process
                        groups/whole world.
  --precision PRECISION
                        Full precision (32), half precision (16). Can be used
                        on CPU, GPU or TPUs.
  --weights_summary WEIGHTS_SUMMARY
                        Prints a summary of the weights when training begins.
  --weights_save_path WEIGHTS_SAVE_PATH
                        Where to save weights if specified. Will override
                        default_root_dir for checkpoints only. Use this if for
                        whatever reason you need the checkpoints stored in a
                        different place than the logs written in
                        `default_root_dir`. Can be remote file paths such as
                        `s3://mybucket/path` or 'hdfs://path/' Defaults to
                        `default_root_dir`.
  --num_sanity_val_steps NUM_SANITY_VAL_STEPS
                        Sanity check runs n validation batches before starting
                        the training routine. Set it to `-1` to run all
                        batches in all validation dataloaders. Default: 2
  --truncated_bptt_steps TRUNCATED_BPTT_STEPS
                        Truncated back prop breaks performs backprop every k
                        steps of much longer sequence.
  --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                        Path/URL of the checkpoint from which training is
                        resumed. If there is no checkpoint file at the path,
                        start from scratch. If resuming from mid-epoch
                        checkpoint, training will start from the beginning of
                        the next epoch.
  --profiler [PROFILER]
                        To profile individual steps during training and assist
                        in identifying bottlenecks. Passing bool value is
                        deprecated in v1.1 and will be removed in v1.3.
  --benchmark [BENCHMARK]
                        If true enables cudnn.benchmark.
  --deterministic [DETERMINISTIC]
                        If true enables cudnn.deterministic.
  --reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]
                        Set to True to reload dataloaders every epoch.
  --auto_lr_find [AUTO_LR_FIND]
                        If set to True, will make trainer.tune() run a
                        learning rate finder, trying to optimize initial
                        learning for faster convergence. trainer.tune() method
                        will set the suggested learning rate in self.lr or
                        self.learning_rate in the LightningModule. To use a
                        different key set a string instead of True with the
                        key name.
  --replace_sampler_ddp [REPLACE_SAMPLER_DDP]
                        Explicitly enables or disables sampler replacement. If
                        not specified this will toggled automatically when DDP
                        is used. By default it will add ``shuffle=True`` for
                        train sampler and ``shuffle=False`` for val/test
                        sampler. If you want to customize it, you can set
                        ``replace_sampler_ddp=False`` and add your own
                        distributed sampler.
  --terminate_on_nan [TERMINATE_ON_NAN]
                        If set to True, will terminate training (by raising a
                        `ValueError`) at the end of each training batch, if
                        any of the parameters or the loss are NaN or +/-inf.
  --auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]
                        If set to True, will `initially` run a batch size
                        finder trying to find the largest batch size that fits
                        into memory. The result will be stored in
                        self.batch_size in the LightningModule. Additionally,
                        can be set to either `power` that estimates the batch
                        size through a power search or `binsearch` that
                        estimates the batch size through a binary search.
  --prepare_data_per_node [PREPARE_DATA_PER_NODE]
                        If True, each LOCAL_RANK=0 will call prepare data.
                        Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare
                        data
  --plugins PLUGINS     Plugins allow modification of core behavior like ddp
                        and amp, and enable custom lightning plugins.
  --amp_backend AMP_BACKEND
                        The mixed precision backend to use ("native" or
                        "apex")
  --amp_level AMP_LEVEL
                        The optimization level to use (O1, O2, etc...).
  --distributed_backend DISTRIBUTED_BACKEND
                        deprecated. Please use 'accelerator'
  --automatic_optimization [AUTOMATIC_OPTIMIZATION]
                        If False you are responsible for calling .backward,
                        .step, zero_grad in LightningModule. This argument has
                        been moved to LightningModule. It is deprecated here
                        in v1.1 and will be removed in v1.3.
  --move_metrics_to_cpu [MOVE_METRICS_TO_CPU]
                        Whether to force internal logged metrics to be moved
                        to cpu. This can save some gpu memory, but can make
                        training slower. Use with attention.
  --enable_pl_optimizer [ENABLE_PL_OPTIMIZER]
                        If True, each optimizer will be wrapped by
                        `pytorch_lightning.core.optimizer.LightningOptimizer`.
                        It allows Lightning to handle AMP, TPU,
                        accumulated_gradients, etc. .. warning:: Currently
                        deprecated and it will be removed in v1.3
  --multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE
                        How to loop over the datasets when there are multiple
                        train loaders. In 'max_size_cycle' mode, the trainer
                        ends one epoch when the largest dataset is traversed,
                        and smaller datasets reload when running out of their
                        data. In 'min_size' mode, all the datasets reload when
                        reaching the minimum length of datasets.
  --stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]
                        Whether to use `Stochastic Weight Averaging (SWA)
                        <https://pytorch.org/blog/pytorch-1.6-now-includes-
                        stochastic-weight-averaging/>_`
  --num_workers NUM_WORKERS
  --data_path DATA_PATH
  --data_index_path DATA_INDEX_PATH
  --mmap
  --max_seq_length MAX_SEQ_LENGTH
  --pretrained_path PRETRAINED_PATH
  --vocab_size VOCAB_SIZE
  --tokenizer_path TOKENIZER_PATH
  --wte_only
  --unfreeze
  --reset_state
  --subset_size SUBSET_SIZE
  --lr LR
  --batch_size BATCH_SIZE
  --verbose
  --search
  --seed SEED
  --version VERSION
  --name NAME
  --lang LANG

This is an extensive set of options. I’m keeping the full output here as it is a nice summary of what is available for training.

Code
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
    --accumulate_grad_batches 2000 \
    --max_steps {MAX_STEPS} \
    --limit_val_batches 100 \
    --val_check_interval 10000 \
    --auto_lr_find True \
    --auto_scale_batch_size True \
    --amp_backend amp \
    --amp_level O1 \
    --data_path {DATA_FOLDER} \
    --data_index_path {DATA_FOLDER} \
    --max_seq_length 128 \
    --vocab_size 40 \
    --tokenizer_path {DATA_FOLDER} \
    --lang es
"""

! $command
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
    main()
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 540, in main
    trainer_kwargs = get_trainer_kwargs(args)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 481, in get_trainer_kwargs
    checkpoint_callback = ModelCheckpoint(filepath=None,
TypeError: __init__() got an unexpected keyword argument 'filepath'

I think this argument has been renamed to dirpath. Maybe PyTorch Lightning should put more effort into backward compatibility?

Code
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
    --accumulate_grad_batches 2000 \
    --max_steps {MAX_STEPS} \
    --limit_val_batches 100 \
    --val_check_interval 10000 \
    --auto_lr_find True \
    --auto_scale_batch_size True \
    --amp_backend amp \
    --amp_level O1 \
    --data_path {DATA_FOLDER} \
    --data_index_path {DATA_FOLDER} \
    --max_seq_length 128 \
    --vocab_size 40 \
    --tokenizer_path {DATA_FOLDER} \
    --lang es
"""

! $command
starting: 0
Global seed set to 7649832
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
    main()
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 554, in main
    model = EmbeddingTunerModel(args)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 241, in __init__
    with open(Path('data') / self.hparams.lang / 'config.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/es/config.json'

I’ve just created a JSON file containing an empty dict at that path.
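For reference, this is the sort of one-liner that does it (paths relative to the data folder, as the training script expects):

```python
import json
from pathlib import Path

# The training script looks for a model config at data/<lang>/config.json;
# an empty JSON object is enough to get past the FileNotFoundError.
config_path = Path("data") / "es" / "config.json"
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps({}))
```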

Code
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
    --accumulate_grad_batches 2000 \
    --max_steps {MAX_STEPS} \
    --gpus 1 \
    --limit_val_batches 100 \
    --val_check_interval 10000 \
    --auto_lr_find True \
    --auto_scale_batch_size True \
    --amp_backend amp \
    --amp_level O1 \
    --data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
    --data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
    --max_seq_length 128 \
    --vocab_size 40 \
    --tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
    --lang es
"""

! $command
starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
    main()
  File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 578, in main
    trainer.fit(model)
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self.accelerator.setup(self, model)  # note: this sets up self.lightning_module
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup
    return super().setup(trainer, model)
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in setup
    self.setup_optimizers(trainer)
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 315, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 160, in init_optimizers
    return trainer.init_optimizers(model)
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 83, in init_optimizers
    lr_schedulers = self.configure_schedulers(lr_schedulers, monitor=monitor)
  File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 133, in configure_schedulers
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: `configure_optimizers` must include a monitor when a `ReduceLROnPlateau` scheduler is used. For example: {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "metric_to_track"}

So I need to work out how the monitor is supposed to be configured. I want to get this training quickly, so I’m going to downgrade PyTorch Lightning to 1.0.

I tried downgrading and it still didn’t work. It turns out that the configure_optimizers method of the model needs to change to return the monitor metric, as the error message suggests. This is what I changed it to:

def configure_optimizers(self):
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, self.m.parameters()),
        lr=self.lr,
        amsgrad=True,
    )
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=1)
    # The scheduler goes under the "lr_scheduler" key, and ReduceLROnPlateau
    # needs a "monitor" naming the logged metric it should track.
    return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "loss"}
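Outside of Lightning the same pieces can be wired up directly. This is a minimal sketch (the linear model and the "loss" metric name are placeholders) of the dictionary shape that the error message asks for:

```python
import torch
from torch import nn

# A stand-in for the language model; any nn.Module with parameters works.
model = nn.Linear(4, 2)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    amsgrad=True,
)
# ReduceLROnPlateau lowers the learning rate when the monitored metric
# stops improving, so it has to be told which metric to watch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=1)

config = {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "loss"}
```

Returning this dictionary from configure_optimizers lets Lightning step the scheduler against the logged "loss" value after each validation pass.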
Code
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
    --accumulate_grad_batches 2000 \
    --max_steps {MAX_STEPS} \
    --gpus 1 \
    --limit_val_batches 100 \
    --val_check_interval 10000 \
    --auto_lr_find True \
    --auto_scale_batch_size True \
    --amp_backend amp \
    --amp_level O1 \
    --data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
    --data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
    --max_seq_length 128 \
    --vocab_size 50257 \
    --tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
    --lang es
"""

! $command
wandb: Currently logged in as: matthewfranglen (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.25
wandb: Syncing run accelerator_None_accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_auto_scale_batch_size_True_auto_select_gpus_False_automatic_optimization_None_batch_size_3_benchmark_False_check_val_every_n_epoch_1_checkpoint_callback_True_data_index_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/index-040k.npy_data_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/data-040k.npy_default_root_dir_None_deterministic_False_distributed_backend_None_enable_pl_optimizer_None_fast_dev_run_False_flush_logs_every_n_steps_100_gpus_1_gradient_clip_val_0_lang_es_limit_predict_batches_1.0_limit_test_batches_1.0_limit_train_batches_1.0_limit_val_batches_100_log_every_n_steps_50_log_gpu_memory_None_logger_True_lr_0.001_max_epochs_None_max_seq_length_128_max_steps_937500_min_epochs_None_min_steps_None_mmap_False_move_metrics_to_cpu_False_multiple_trainloader_mode_max_size_cycle_name_None_num_nodes_1_num_processes_1_num_sanity_val_steps_2_num_workers_4_overfit_batches_0.0_plugins_None_precision_32_prepare_data_per_node_True_pretrained_path_gpt2_process_position_0_profiler_None_progress_bar_refresh_rate_None_reload_dataloaders_every_epoch_False_replace_sampler_ddp_True_reset_state_False_resume_from_checkpoint_None_search_False_seed_7649832_stochastic_weight_avg_False_subset_size_1.0_sync_batchnorm_False_terminate_on_nan_False_tokenizer_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/vocabularies/es-040k.tokenizer.json_tpu_cores_<function _gpus_arg_default at 0x7ff22ca611f0>_track_grad_norm_-1_truncated_bptt_steps_None_unfreeze_False_val_check_interval_10000_verbose_False_version_0_vocab_size_50257_weights_save_path_None_weights_summary_top_wte_only_False
wandb: ⭐️ View project at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es
wandb: 🚀 View run at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/3l26l38n
wandb: Run data is saved locally in /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_120325-3l26l38n
wandb: Run `wandb offline` to turn off syncing.

starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type            | Params
-----------------------------------------
0 | m    | GPT2LMHeadModel | 124 M 
-----------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.759   Total estimated model params size (MB)
validation examples: 1,523,919
Validation sanity check:  50%|██████████          | 1/2 [00:00<00:00,  5.76it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
  warnings.warn(*args, **kwargs)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: RuntimeWarning: You are using `LearningRateMonitor` callback with models that have no learning rate schedulers. Please see documentation for `configure_optimizers` method.
  warnings.warn(*args, **kwargs)
training examples: 13,716,233
Epoch 0:   0%|                                      | 0/4617778 [00:00<?, ?it/s]/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The {log:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
Please use self.log(...) inside the lightningModule instead.
# log on a step or aggregate epoch metric to the logger and/or progress bar (inside LightningModule)
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|   | 10000/4617778 [04:07<31:43:35, 40.34it/s, loss=8.24, v_num=0]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:   0%|   | 10429/4617778 [04:17<31:38:29, 40.45it/s, loss=8.24, v_num=0]^C
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
  warnings.warn(*args, **kwargs)
Saving latest checkpoint...
Epoch 0:   0%|   | 10429/4617778 [04:18<31:39:52, 40.42it/s, loss=8.24, v_num=0]

So this is finally training. The one thing I want to do now is get it reporting to Weights &amp; Biases so I can track it.

Code
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
    --accumulate_grad_batches 2000 \
    --max_epochs 1 \
    --gpus 1 \
    --limit_val_batches 100 \
    --val_check_interval 10000 \
    --auto_lr_find True \
    --batch_size 32 \
    --amp_backend amp \
    --amp_level O1 \
    --data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
    --data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
    --max_seq_length 128 \
    --vocab_size 50257 \
    --tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
    --lang es
"""

! $command
starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
wandb: Currently logged in as: matthewfranglen (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.25
wandb: Syncing run accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_auto_scale_batch_size_False_auto_select_gpus_False_batch_size_32_benchmark_False_check_val_every_n_epoch_1_checkpoint_callback_True_data_index_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/index-040k.npy_data_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/data-040k.npy_deterministic_False_fast_dev_run_False_flush_logs_every_n_steps_100_gpus_1_gradient_clip_val_0_lang_es_limit_predict_batches_1.0_limit_test_batches_1.0_limit_train_batches_1.0_limit_val_batches_100_log_every_n_steps_50_logger_True_lr_0.001_max_epochs_1_max_seq_length_128_mmap_False_move_metrics_to_cpu_False_multiple_trainloader_mode_max_size_cycle_num_nodes_1_num_processes_1_num_sanity_val_steps_2_num_workers_4_overfit_batches_0.0_precision_32_prepare_data_per_node_True_pretrained_path_gpt2_process_position_0_reload_dataloaders_every_epoch_False_replace_sampler_ddp_True_reset_state_False_search_False_seed_7649832_stochastic_weight_avg_False_subset_size_1.0_sync_batchnorm_False_terminate_on_nan_False_tokenizer_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/vocabularies/es-040k.tokenizer.json_tpu_cores_<function _gpus_arg_default at 0x7ff2b2d63310>_track_grad_norm_-1_unfreeze_False_val_check_interval_10000_verbose_False_version_0_vocab_size_50257_weights_summary_top_wte_only_False
wandb: ⭐️ View project at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es
wandb: 🚀 View run at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/16p22vt0
wandb: Run data is saved locally in /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0
wandb: Run `wandb offline` to turn off syncing.


  | Name | Type            | Params
-----------------------------------------
0 | m    | GPT2LMHeadModel | 124 M 
-----------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.759   Total estimated model params size (MB)
validation examples: 1,523,919
Validation sanity check:  50%|██████████          | 1/2 [00:00<00:00,  5.29it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
  warnings.warn(*args, **kwargs)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: RuntimeWarning: You are using `LearningRateMonitor` callback with models that have no learning rate schedulers. Please see documentation for `configure_optimizers` method.
  warnings.warn(*args, **kwargs)
training examples: 13,716,233
Epoch 0:   0%|                                       | 0/432833 [00:00<?, ?it/s]/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The {log:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
Please use self.log(...) inside the lightningModule instead.
# log on a step or aggregate epoch metric to the logger and/or progress bar (inside LightningModule)
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
  warnings.warn(*args, **kwargs)
Epoch 0:   2%| | 10001/432833 [18:28<13:01:05,  9.02it/s, loss=8.19, v_num=2vt0]
Epoch 0:   5%| | 20101/432833 [37:13<12:44:25,  9.00it/s, loss=7.32, v_num=2vt0]
Epoch 0:   7%| | 30202/432833 [56:01<12:26:59,  8.98it/s, loss=6.84, v_num=2vt0]
Epoch 0:   9%| | 40303/432833 [1:14:53<12:09:19,  8.97it/s, loss=6.49, v_num=2vt
Epoch 0:  12%| | 50404/432833 [1:33:44<11:51:18,  8.96it/s, loss=5.71, v_num=2vt
Epoch 0:  14%|▏| 60505/432833 [1:52:37<11:33:03,  8.95it/s, loss=5.31, v_num=2vt
Epoch 0:  16%|▎ | 70606/432833 [2:11:29<11:14:34,  8.95it/s, loss=5, v_num=2vt0]
Epoch 0:  19%|▏| 80707/432833 [2:30:21<10:56:00,  8.95it/s, loss=4.77, v_num=2vt
Epoch 0:  21%|▏| 90808/432833 [2:49:16<10:37:32,  8.94it/s, loss=4.58, v_num=2vt
Validating:  33%|█████████▉                    | 33/100 [00:01<00:03, 22.22it/s]
Epoch 0:  21%|▏| 90846/432833 [2:49:17<10:37:18,  8.94it/s, loss=4.58, v_num=2vt
Epoch 0:  21%|▏| 90853/432833 [2:49:17<10:37:15,  8.94it/s, loss=4.58, v_num=2vt
Validating:  45%|█████████████▌                | 45/100 [00:01<00:02, 26.98it/s]
Validating:  48%|██████████████▍               | 48/100 [00:01<00:01, 26.02it/s]
Epoch 0:  21%|▏| 90860/432833 [2:49:18<10:37:12,  8.94it/s, loss=4.58, v_num=2vt
Validating:  55%|████████████████▌             | 55/100 [00:02<00:01, 28.31it/s]
Epoch 0:  21%|▏| 90867/432833 [2:49:18<10:37:09,  8.95it/s, loss=4.58, v_num=2vt
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 26.01it/s]
Epoch 0:  21%|▏| 90874/432833 [2:49:18<10:37:06,  8.95it/s, loss=4.58, v_num=2vt
Epoch 0:  21%|▏| 90881/432833 [2:49:18<10:37:03,  8.95it/s, loss=4.58, v_num=2vt
Validating:  73%|█████████████████████▉        | 73/100 [00:02<00:00, 27.16it/s]
Validating:  76%|██████████████████████▊       | 76/100 [00:02<00:00, 27.53it/s]
Epoch 0:  21%|▏| 90888/432833 [2:49:19<10:37:01,  8.95it/s, loss=4.58, v_num=2vt
Validating:  82%|████████████████████████▌     | 82/100 [00:03<00:00, 23.55it/s]
Epoch 0:  21%|▏| 90895/432833 [2:49:19<10:36:58,  8.95it/s, loss=4.58, v_num=2vt
Validating:  90%|███████████████████████████   | 90/100 [00:03<00:00, 26.11it/s]
Epoch 0:  21%|▏| 90902/432833 [2:49:19<10:36:55,  8.95it/s, loss=4.58, v_num=2vt
Validating:  97%|█████████████████████████████ | 97/100 [00:03<00:00, 27.53it/s]
Epoch 0:  21%|▏| 90909/432833 [2:49:19<10:36:53,  8.95it/s, loss=4.58, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  21%|▏| 90909/432833 [2:49:20<10:36:56,  8.95it/s, loss=4.58, v_num=2vt
Epoch 0:  23%|▏| 100909/432833 [3:08:09<10:18:55,  8.94it/s, loss=4.42, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  23%|▏| 100912/432833 [3:08:10<10:18:55,  8.94it/s, loss=4.42, v_num=2v
Epoch 0:  23%|▏| 100919/432833 [3:08:10<10:18:52,  8.94it/s, loss=4.42, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:13,  6.70it/s]
Epoch 0:  23%|▏| 100926/432833 [3:08:10<10:18:49,  8.94it/s, loss=4.42, v_num=2v
Validating:  17%|█████                         | 17/100 [00:00<00:07, 10.99it/s]
Validating:  20%|██████                        | 20/100 [00:00<00:06, 12.89it/s]
Epoch 0:  23%|▏| 100933/432833 [3:08:10<10:18:47,  8.94it/s, loss=4.42, v_num=2v
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 16.28it/s]
Epoch 0:  23%|▏| 100940/432833 [3:08:11<10:18:45,  8.94it/s, loss=4.42, v_num=2v
Validating:  33%|█████████▉                    | 33/100 [00:01<00:03, 21.98it/s]
Epoch 0:  23%|▏| 100947/432833 [3:08:11<10:18:42,  8.94it/s, loss=4.42, v_num=2v
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 22.96it/s]
Epoch 0:  23%|▏| 100954/432833 [3:08:11<10:18:40,  8.94it/s, loss=4.42, v_num=2v
Validating:  45%|█████████████▌                | 45/100 [00:02<00:02, 18.95it/s]
Epoch 0:  23%|▏| 100961/432833 [3:08:11<10:18:37,  8.94it/s, loss=4.42, v_num=2v
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 25.54it/s]
Epoch 0:  23%|▏| 100968/432833 [3:08:12<10:18:35,  8.94it/s, loss=4.42, v_num=2v
Epoch 0:  23%|▏| 100975/432833 [3:08:12<10:18:32,  8.94it/s, loss=4.42, v_num=2v
Validating:  66%|███████████████████▊          | 66/100 [00:02<00:01, 26.04it/s]
Epoch 0:  23%|▏| 100982/432833 [3:08:12<10:18:30,  8.94it/s, loss=4.42, v_num=2v
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:00, 28.99it/s]
Epoch 0:  23%|▏| 100989/432833 [3:08:12<10:18:27,  8.94it/s, loss=4.42, v_num=2v
Epoch 0:  23%|▏| 100996/432833 [3:08:13<10:18:25,  8.94it/s, loss=4.42, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 26.95it/s]
Epoch 0:  23%|▏| 101003/432833 [3:08:13<10:18:22,  8.94it/s, loss=4.42, v_num=2v
Validating:  94%|████████████████████████████▏ | 94/100 [00:03<00:00, 26.88it/s]
Validating:  97%|█████████████████████████████ | 97/100 [00:03<00:00, 27.55it/s]
Epoch 0:  23%|▏| 101010/432833 [3:08:13<10:18:20,  8.94it/s, loss=4.42, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  23%|▏| 101010/432833 [3:08:14<10:18:22,  8.94it/s, loss=4.42, v_num=2v
Epoch 0:  26%|▎| 111010/432833 [3:27:04<10:00:18,  8.93it/s, loss=4.29, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  26%|▎| 111013/432833 [3:27:04<10:00:17,  8.94it/s, loss=4.29, v_num=2v
Validating:   6%|█▊                             | 6/100 [00:00<00:18,  5.21it/s]
Epoch 0:  26%|▎| 111020/432833 [3:27:04<10:00:15,  8.94it/s, loss=4.29, v_num=2v
Validating:  12%|███▌                          | 12/100 [00:00<00:10,  8.59it/s]
Epoch 0:  26%|▎| 111027/432833 [3:27:05<10:00:13,  8.94it/s, loss=4.29, v_num=2v
Validating:  18%|█████▍                        | 18/100 [00:00<00:06, 13.13it/s]
Epoch 0:  26%|▎| 111034/432833 [3:27:05<10:00:10,  8.94it/s, loss=4.29, v_num=2v
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 18.44it/s]
Epoch 0:  26%|▎| 111041/432833 [3:27:05<10:00:08,  8.94it/s, loss=4.29, v_num=2v
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 22.06it/s]
Epoch 0:  26%|▎| 111048/432833 [3:27:05<10:00:06,  8.94it/s, loss=4.29, v_num=2v
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 23.06it/s]
Validating:  41%|████████████▎                 | 41/100 [00:01<00:02, 24.44it/s]
Epoch 0:  26%|▎| 111055/432833 [3:27:06<10:00:03,  8.94it/s, loss=4.29, v_num=2v
Validating:  48%|██████████████▍               | 48/100 [00:01<00:01, 27.52it/s]
Epoch 0:  26%|▎| 111062/432833 [3:27:06<10:00:01,  8.94it/s, loss=4.29, v_num=2v
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 24.02it/s]
Epoch 0:  26%|▎| 111069/432833 [3:27:06<9:59:59,  8.94it/s, loss=4.29, v_num=2vt
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 21.47it/s]
Epoch 0:  26%|▎| 111076/432833 [3:27:06<9:59:57,  8.94it/s, loss=4.29, v_num=2vt
Validating:  68%|████████████████████▍         | 68/100 [00:02<00:01, 26.27it/s]
Epoch 0:  26%|▎| 111083/432833 [3:27:07<9:59:55,  8.94it/s, loss=4.29, v_num=2vt
Validating:  74%|██████████████████████▏       | 74/100 [00:03<00:01, 24.13it/s]
Epoch 0:  26%|▎| 111090/432833 [3:27:07<9:59:52,  8.94it/s, loss=4.29, v_num=2vt
Validating:  81%|████████████████████████▎     | 81/100 [00:03<00:00, 28.14it/s]
Epoch 0:  26%|▎| 111097/432833 [3:27:07<9:59:50,  8.94it/s, loss=4.29, v_num=2vt
Validating:  89%|██████████████████████████▋   | 89/100 [00:03<00:00, 30.38it/s]
Epoch 0:  26%|▎| 111104/432833 [3:27:07<9:59:47,  8.94it/s, loss=4.29, v_num=2vt
Epoch 0:  26%|▎| 111111/432833 [3:27:08<9:59:45,  8.94it/s, loss=4.29, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  26%|▎| 111111/432833 [3:27:08<9:59:47,  8.94it/s, loss=4.29, v_num=2vt
Epoch 0:  28%|▎| 121111/432833 [3:46:04<9:41:53,  8.93it/s, loss=4.18, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  28%|▎| 121114/432833 [3:46:04<9:41:52,  8.93it/s, loss=4.18, v_num=2vt
Validating:   6%|█▊                             | 6/100 [00:00<00:17,  5.37it/s]
Epoch 0:  28%|▎| 121121/432833 [3:46:05<9:41:50,  8.93it/s, loss=4.18, v_num=2vt
Validating:  13%|███▉                          | 13/100 [00:00<00:09,  9.21it/s]
Epoch 0:  28%|▎| 121128/432833 [3:46:05<9:41:48,  8.93it/s, loss=4.18, v_num=2vt
Validating:  19%|█████▋                        | 19/100 [00:00<00:05, 13.87it/s]
Epoch 0:  28%|▎| 121135/432833 [3:46:05<9:41:46,  8.93it/s, loss=4.18, v_num=2vt
Validating:  25%|███████▌                      | 25/100 [00:01<00:04, 17.36it/s]
Epoch 0:  28%|▎| 121142/432833 [3:46:05<9:41:44,  8.93it/s, loss=4.18, v_num=2vt
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 20.77it/s]
Epoch 0:  28%|▎| 121149/432833 [3:46:06<9:41:42,  8.93it/s, loss=4.18, v_num=2vt
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 21.85it/s]
Epoch 0:  28%|▎| 121156/432833 [3:46:06<9:41:39,  8.93it/s, loss=4.18, v_num=2vt
Validating:  46%|█████████████▊                | 46/100 [00:01<00:02, 24.57it/s]
Epoch 0:  28%|▎| 121163/432833 [3:46:06<9:41:37,  8.93it/s, loss=4.18, v_num=2vt
Validating:  53%|███████████████▉              | 53/100 [00:02<00:01, 25.37it/s]
Epoch 0:  28%|▎| 121170/432833 [3:46:06<9:41:35,  8.93it/s, loss=4.18, v_num=2vt
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 27.06it/s]
Epoch 0:  28%|▎| 121177/432833 [3:46:07<9:41:33,  8.93it/s, loss=4.18, v_num=2vt
Validating:  68%|████████████████████▍         | 68/100 [00:02<00:01, 28.56it/s]
Epoch 0:  28%|▎| 121184/432833 [3:46:07<9:41:31,  8.93it/s, loss=4.18, v_num=2vt
Epoch 0:  28%|▎| 121191/432833 [3:46:07<9:41:28,  8.93it/s, loss=4.18, v_num=2vt
Validating:  80%|████████████████████████      | 80/100 [00:03<00:00, 30.11it/s]
Epoch 0:  28%|▎| 121198/432833 [3:46:07<9:41:26,  8.93it/s, loss=4.18, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 26.62it/s]
Validating:  90%|███████████████████████████   | 90/100 [00:03<00:00, 25.64it/s]
Epoch 0:  28%|▎| 121205/432833 [3:46:08<9:41:24,  8.93it/s, loss=4.18, v_num=2vt
Validating:  97%|█████████████████████████████ | 97/100 [00:03<00:00, 26.49it/s]
Epoch 0:  28%|▎| 121212/432833 [3:46:08<9:41:22,  8.93it/s, loss=4.18, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  28%|▎| 121212/432833 [3:46:09<9:41:24,  8.93it/s, loss=4.18, v_num=2vt
Epoch 0:  30%|▎| 131212/432833 [4:05:06<9:23:25,  8.92it/s, loss=4.09, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  30%|▎| 131215/432833 [4:05:06<9:23:25,  8.92it/s, loss=4.09, v_num=2vt
Epoch 0:  30%|▎| 131222/432833 [4:05:06<9:23:23,  8.92it/s, loss=4.09, v_num=2vt
Validating:  10%|███                           | 10/100 [00:00<00:12,  6.98it/s]
Validating:  12%|███▌                          | 12/100 [00:00<00:10,  8.45it/s]
Epoch 0:  30%|▎| 131229/432833 [4:05:07<9:23:21,  8.92it/s, loss=4.09, v_num=2vt
Validating:  19%|█████▋                        | 19/100 [00:00<00:06, 12.92it/s]
Epoch 0:  30%|▎| 131236/432833 [4:05:07<9:23:19,  8.92it/s, loss=4.09, v_num=2vt
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 18.22it/s]
Epoch 0:  30%|▎| 131243/432833 [4:05:07<9:23:17,  8.92it/s, loss=4.09, v_num=2vt
Epoch 0:  30%|▎| 131250/432833 [4:05:07<9:23:15,  8.92it/s, loss=4.09, v_num=2vt
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 26.28it/s]
Epoch 0:  30%|▎| 131257/432833 [4:05:08<9:23:13,  8.92it/s, loss=4.09, v_num=2vt
Validating:  46%|█████████████▊                | 46/100 [00:01<00:02, 26.49it/s]
Epoch 0:  30%|▎| 131264/432833 [4:05:08<9:23:11,  8.92it/s, loss=4.09, v_num=2vt
Validating:  52%|███████████████▌              | 52/100 [00:02<00:02, 22.31it/s]
Epoch 0:  30%|▎| 131271/432833 [4:05:08<9:23:09,  8.92it/s, loss=4.09, v_num=2vt
Validating:  59%|█████████████████▋            | 59/100 [00:02<00:01, 25.13it/s]
Epoch 0:  30%|▎| 131278/432833 [4:05:08<9:23:07,  8.93it/s, loss=4.09, v_num=2vt
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 27.90it/s]
Epoch 0:  30%|▎| 131285/432833 [4:05:09<9:23:05,  8.93it/s, loss=4.09, v_num=2vt
Validating:  74%|██████████████████████▏       | 74/100 [00:02<00:00, 26.73it/s]
Epoch 0:  30%|▎| 131292/432833 [4:05:09<9:23:03,  8.93it/s, loss=4.09, v_num=2vt
Validating:  80%|████████████████████████      | 80/100 [00:03<00:00, 27.48it/s]
Epoch 0:  30%|▎| 131299/432833 [4:05:09<9:23:01,  8.93it/s, loss=4.09, v_num=2vt
Validating:  89%|██████████████████████████▋   | 89/100 [00:03<00:00, 26.33it/s]
Epoch 0:  30%|▎| 131306/432833 [4:05:09<9:22:59,  8.93it/s, loss=4.09, v_num=2vt
Validating:  96%|████████████████████████████▊ | 96/100 [00:03<00:00, 23.56it/s]
Epoch 0:  30%|▎| 131313/432833 [4:05:10<9:22:57,  8.93it/s, loss=4.09, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  30%|▎| 131313/432833 [4:05:11<9:22:59,  8.93it/s, loss=4.09, v_num=2vt
Epoch 0:  33%|▎| 141313/432833 [4:24:05<9:04:48,  8.92it/s, loss=4.01, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  33%|▎| 141316/432833 [4:24:05<9:04:48,  8.92it/s, loss=4.01, v_num=2vt
Epoch 0:  33%|▎| 141323/432833 [4:24:06<9:04:46,  8.92it/s, loss=4.01, v_num=2vt
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.16it/s]
Epoch 0:  33%|▎| 141330/432833 [4:24:06<9:04:44,  8.92it/s, loss=4.01, v_num=2vt
Validating:  17%|█████                         | 17/100 [00:00<00:07, 11.79it/s]
Validating:  20%|██████                        | 20/100 [00:00<00:05, 14.38it/s]
Epoch 0:  33%|▎| 141337/432833 [4:24:06<9:04:42,  8.92it/s, loss=4.01, v_num=2vt
Validating:  27%|████████                      | 27/100 [00:01<00:03, 19.44it/s]
Epoch 0:  33%|▎| 141344/432833 [4:24:06<9:04:40,  8.92it/s, loss=4.01, v_num=2vt
Epoch 0:  33%|▎| 141351/432833 [4:24:07<9:04:38,  8.92it/s, loss=4.01, v_num=2vt
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 25.40it/s]
Epoch 0:  33%|▎| 141358/432833 [4:24:07<9:04:36,  8.92it/s, loss=4.01, v_num=2vt
Validating:  47%|██████████████                | 47/100 [00:01<00:01, 30.00it/s]
Epoch 0:  33%|▎| 141365/432833 [4:24:07<9:04:34,  8.92it/s, loss=4.01, v_num=2vt
Epoch 0:  33%|▎| 141372/432833 [4:24:07<9:04:32,  8.92it/s, loss=4.01, v_num=2vt
Validating:  59%|█████████████████▋            | 59/100 [00:02<00:01, 31.74it/s]
Epoch 0:  33%|▎| 141379/432833 [4:24:07<9:04:30,  8.92it/s, loss=4.01, v_num=2vt
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 31.47it/s]
Epoch 0:  33%|▎| 141386/432833 [4:24:08<9:04:28,  8.92it/s, loss=4.01, v_num=2vt
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:00, 27.55it/s]
Epoch 0:  33%|▎| 141393/432833 [4:24:08<9:04:26,  8.92it/s, loss=4.01, v_num=2vt
Validating:  81%|████████████████████████▎     | 81/100 [00:02<00:00, 25.83it/s]
Epoch 0:  33%|▎| 141400/432833 [4:24:08<9:04:24,  8.92it/s, loss=4.01, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 24.88it/s]
Validating:  90%|███████████████████████████   | 90/100 [00:03<00:00, 21.98it/s]
Epoch 0:  33%|▎| 141407/432833 [4:24:09<9:04:23,  8.92it/s, loss=4.01, v_num=2vt
Validating:  97%|█████████████████████████████ | 97/100 [00:03<00:00, 19.70it/s]
Epoch 0:  33%|▎| 141414/432833 [4:24:09<9:04:21,  8.92it/s, loss=4.01, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  33%|▎| 141414/432833 [4:24:10<9:04:23,  8.92it/s, loss=4.01, v_num=2vt
Epoch 0:  35%|▎| 151414/432833 [4:43:02<8:46:04,  8.92it/s, loss=3.94, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  35%|▎| 151417/432833 [4:43:03<8:46:04,  8.92it/s, loss=3.94, v_num=2vt
Epoch 0:  35%|▎| 151424/432833 [4:43:03<8:46:02,  8.92it/s, loss=3.94, v_num=2vt
Validating:  11%|███▎                          | 11/100 [00:00<00:12,  7.03it/s]
Epoch 0:  35%|▎| 151431/432833 [4:43:03<8:46:00,  8.92it/s, loss=3.94, v_num=2vt
Validating:  18%|█████▍                        | 18/100 [00:00<00:07, 11.42it/s]
Epoch 0:  35%|▎| 151438/432833 [4:43:03<8:45:58,  8.92it/s, loss=3.94, v_num=2vt
Validating:  25%|███████▌                      | 25/100 [00:01<00:04, 16.07it/s]
Epoch 0:  35%|▎| 151445/432833 [4:43:04<8:45:56,  8.92it/s, loss=3.94, v_num=2vt
Validating:  33%|█████████▉                    | 33/100 [00:01<00:03, 21.98it/s]
Epoch 0:  35%|▎| 151452/432833 [4:43:04<8:45:54,  8.92it/s, loss=3.94, v_num=2vt
Validating:  40%|████████████                  | 40/100 [00:01<00:02, 26.23it/s]
Epoch 0:  35%|▎| 151459/432833 [4:43:04<8:45:53,  8.92it/s, loss=3.94, v_num=2vt
Epoch 0:  35%|▎| 151466/432833 [4:43:04<8:45:51,  8.92it/s, loss=3.94, v_num=2vt
Validating:  52%|███████████████▌              | 52/100 [00:01<00:01, 26.44it/s]
Epoch 0:  35%|▎| 151473/432833 [4:43:05<8:45:49,  8.92it/s, loss=3.94, v_num=2vt
Validating:  59%|█████████████████▋            | 59/100 [00:02<00:01, 28.92it/s]
Epoch 0:  35%|▎| 151480/432833 [4:43:05<8:45:47,  8.92it/s, loss=3.94, v_num=2vt
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 29.52it/s]
Epoch 0:  35%|▎| 151487/432833 [4:43:05<8:45:46,  8.92it/s, loss=3.94, v_num=2vt
Validating:  74%|██████████████████████▏       | 74/100 [00:02<00:01, 24.92it/s]
Epoch 0:  35%|▎| 151494/432833 [4:43:05<8:45:44,  8.92it/s, loss=3.94, v_num=2vt
Validating:  80%|████████████████████████      | 80/100 [00:03<00:00, 24.59it/s]
Epoch 0:  35%|▎| 151501/432833 [4:43:06<8:45:42,  8.92it/s, loss=3.94, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 23.46it/s]
Validating:  90%|███████████████████████████   | 90/100 [00:03<00:00, 23.95it/s]
Epoch 0:  35%|▎| 151508/432833 [4:43:06<8:45:41,  8.92it/s, loss=3.94, v_num=2vt
Epoch 0:  35%|▎| 151515/432833 [4:43:06<8:45:39,  8.92it/s, loss=3.94, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  35%|▎| 151515/432833 [4:43:07<8:45:40,  8.92it/s, loss=3.94, v_num=2vt
Epoch 0:  37%|▎| 161515/432833 [5:02:00<8:27:18,  8.91it/s, loss=3.88, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  37%|▎| 161518/432833 [5:02:00<8:27:18,  8.91it/s, loss=3.88, v_num=2vt
Validating:   7%|██▏                            | 7/100 [00:00<00:17,  5.37it/s]
Epoch 0:  37%|▎| 161525/432833 [5:02:00<8:27:16,  8.91it/s, loss=3.88, v_num=2vt
Validating:  11%|███▎                          | 11/100 [00:00<00:10,  8.32it/s]
Epoch 0:  37%|▎| 161532/432833 [5:02:00<8:27:15,  8.91it/s, loss=3.88, v_num=2vt
Validating:  17%|█████                         | 17/100 [00:00<00:06, 12.31it/s]
Validating:  20%|██████                        | 20/100 [00:01<00:05, 14.41it/s]
Epoch 0:  37%|▎| 161539/432833 [5:02:01<8:27:13,  8.91it/s, loss=3.88, v_num=2vt
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 17.83it/s]
Epoch 0:  37%|▎| 161546/432833 [5:02:01<8:27:12,  8.91it/s, loss=3.88, v_num=2vt
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 18.80it/s]
Epoch 0:  37%|▎| 161553/432833 [5:02:01<8:27:10,  8.91it/s, loss=3.88, v_num=2vt
Validating:  40%|████████████                  | 40/100 [00:01<00:02, 24.33it/s]
Epoch 0:  37%|▎| 161560/432833 [5:02:02<8:27:08,  8.92it/s, loss=3.88, v_num=2vt
Epoch 0:  37%|▎| 161567/432833 [5:02:02<8:27:06,  8.92it/s, loss=3.88, v_num=2vt
Validating:  52%|███████████████▌              | 52/100 [00:02<00:01, 29.65it/s]
Epoch 0:  37%|▎| 161574/432833 [5:02:02<8:27:05,  8.92it/s, loss=3.88, v_num=2vt
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 28.96it/s]
Epoch 0:  37%|▎| 161581/432833 [5:02:02<8:27:03,  8.92it/s, loss=3.88, v_num=2vt
Validating:  68%|████████████████████▍         | 68/100 [00:02<00:00, 32.44it/s]
Epoch 0:  37%|▎| 161588/432833 [5:02:02<8:27:01,  8.92it/s, loss=3.88, v_num=2vt
Epoch 0:  37%|▎| 161595/432833 [5:02:03<8:26:59,  8.92it/s, loss=3.88, v_num=2vt
Validating:  80%|████████████████████████      | 80/100 [00:03<00:00, 26.51it/s]
Epoch 0:  37%|▎| 161602/432833 [5:02:03<8:26:58,  8.92it/s, loss=3.88, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 27.91it/s]
Epoch 0:  37%|▎| 161609/432833 [5:02:03<8:26:56,  8.92it/s, loss=3.88, v_num=2vt
Validating:  94%|████████████████████████████▏ | 94/100 [00:03<00:00, 30.65it/s]
Epoch 0:  37%|▎| 161616/432833 [5:02:04<8:26:54,  8.92it/s, loss=3.88, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  37%|▎| 161616/432833 [5:02:04<8:26:56,  8.92it/s, loss=3.88, v_num=2vt
Epoch 0:  40%|▍| 171616/432833 [5:20:56<8:08:30,  8.91it/s, loss=3.82, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  40%|▍| 171619/432833 [5:20:56<8:08:29,  8.91it/s, loss=3.82, v_num=2vt
Epoch 0:  40%|▍| 171626/432833 [5:20:56<8:08:28,  8.91it/s, loss=3.82, v_num=2vt
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.05it/s]
Validating:  13%|███▉                          | 13/100 [00:00<00:09,  8.85it/s]
Epoch 0:  40%|▍| 171633/432833 [5:20:57<8:08:26,  8.91it/s, loss=3.82, v_num=2vt
Epoch 0:  40%|▍| 171640/432833 [5:20:57<8:08:24,  8.91it/s, loss=3.82, v_num=2vt
Validating:  25%|███████▌                      | 25/100 [00:01<00:04, 16.67it/s]
Epoch 0:  40%|▍| 171647/432833 [5:20:57<8:08:23,  8.91it/s, loss=3.82, v_num=2vt
Validating:  31%|█████████▎                    | 31/100 [00:01<00:03, 19.13it/s]
Validating:  34%|██████████▏                   | 34/100 [00:01<00:03, 20.13it/s]
Epoch 0:  40%|▍| 171654/432833 [5:20:58<8:08:21,  8.91it/s, loss=3.82, v_num=2vt
Validating:  40%|████████████                  | 40/100 [00:01<00:02, 23.20it/s]
Epoch 0:  40%|▍| 171661/432833 [5:20:58<8:08:20,  8.91it/s, loss=3.82, v_num=2vt
Validating:  47%|██████████████                | 47/100 [00:01<00:02, 23.57it/s]
Epoch 0:  40%|▍| 171668/432833 [5:20:58<8:08:18,  8.91it/s, loss=3.82, v_num=2vt
Epoch 0:  40%|▍| 171675/432833 [5:20:58<8:08:17,  8.91it/s, loss=3.82, v_num=2vt
Validating:  59%|█████████████████▋            | 59/100 [00:02<00:01, 29.72it/s]
Epoch 0:  40%|▍| 171682/432833 [5:20:58<8:08:15,  8.91it/s, loss=3.82, v_num=2vt
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 29.09it/s]
Epoch 0:  40%|▍| 171689/432833 [5:20:59<8:08:13,  8.91it/s, loss=3.82, v_num=2vt
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:00, 30.04it/s]
Epoch 0:  40%|▍| 171696/432833 [5:20:59<8:08:12,  8.91it/s, loss=3.82, v_num=2vt
Epoch 0:  40%|▍| 171703/432833 [5:20:59<8:08:10,  8.92it/s, loss=3.82, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 28.83it/s]
Epoch 0:  40%|▍| 171710/432833 [5:20:59<8:08:08,  8.92it/s, loss=3.82, v_num=2vt
Validating:  96%|████████████████████████████▊ | 96/100 [00:03<00:00, 33.10it/s]
Epoch 0:  40%|▍| 171717/432833 [5:21:00<8:08:07,  8.92it/s, loss=3.82, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  40%|▍| 171717/432833 [5:21:01<8:08:08,  8.92it/s, loss=3.82, v_num=2vt
Epoch 0:  42%|▍| 181717/432833 [5:39:50<7:49:37,  8.91it/s, loss=3.77, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  42%|▍| 181720/432833 [5:39:50<7:49:37,  8.91it/s, loss=3.77, v_num=2vt
Validating:   6%|█▊                             | 6/100 [00:00<00:17,  5.36it/s]
Epoch 0:  42%|▍| 181727/432833 [5:39:51<7:49:35,  8.91it/s, loss=3.77, v_num=2vt
Epoch 0:  42%|▍| 181734/432833 [5:39:51<7:49:34,  8.91it/s, loss=3.77, v_num=2vt
Validating:  17%|█████                         | 17/100 [00:00<00:06, 11.94it/s]
Epoch 0:  42%|▍| 181741/432833 [5:39:51<7:49:32,  8.91it/s, loss=3.77, v_num=2vt
Validating:  24%|███████▏                      | 24/100 [00:01<00:04, 15.20it/s]
Epoch 0:  42%|▍| 181748/432833 [5:39:51<7:49:31,  8.91it/s, loss=3.77, v_num=2vt
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 20.16it/s]
Epoch 0:  42%|▍| 181755/432833 [5:39:51<7:49:29,  8.91it/s, loss=3.77, v_num=2vt
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 23.11it/s]
Epoch 0:  42%|▍| 181762/432833 [5:39:52<7:49:28,  8.91it/s, loss=3.77, v_num=2vt
Validating:  46%|█████████████▊                | 46/100 [00:01<00:02, 24.54it/s]
Epoch 0:  42%|▍| 181769/432833 [5:39:52<7:49:26,  8.91it/s, loss=3.77, v_num=2vt
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 26.73it/s]
Epoch 0:  42%|▍| 181776/432833 [5:39:52<7:49:25,  8.91it/s, loss=3.77, v_num=2vt
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 23.69it/s]
Epoch 0:  42%|▍| 181783/432833 [5:39:53<7:49:23,  8.91it/s, loss=3.77, v_num=2vt
Validating:  69%|████████████████████▋         | 69/100 [00:02<00:01, 25.66it/s]
Epoch 0:  42%|▍| 181790/432833 [5:39:53<7:49:22,  8.91it/s, loss=3.77, v_num=2vt
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:01, 22.08it/s]
Epoch 0:  42%|▍| 181797/432833 [5:39:53<7:49:20,  8.91it/s, loss=3.77, v_num=2vt
Epoch 0:  42%|▍| 181804/432833 [5:39:53<7:49:19,  8.91it/s, loss=3.77, v_num=2vt
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 29.42it/s]
Epoch 0:  42%|▍| 181811/432833 [5:39:54<7:49:17,  8.91it/s, loss=3.77, v_num=2vt
Validating:  94%|████████████████████████████▏ | 94/100 [00:03<00:00, 26.36it/s]
Epoch 0:  42%|▍| 181818/432833 [5:39:54<7:49:16,  8.92it/s, loss=3.77, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  42%|▍| 181818/432833 [5:39:55<7:49:17,  8.91it/s, loss=3.77, v_num=2vt
Epoch 0:  43%|▍| 186597/432833 [5:48:54<7:40:25,  8.91it/s, loss=3.75, v_num=2vtwandb: Network error (ConnectTimeout), entering retry loop. See wandb/debug-internal.log for full traceback.
Epoch 0:  43%|▍| 187009/432833 [5:49:40<7:39:38,  8.91it/s, loss=3.75, v_num=2vtwandb: Network error resolved after 0:01:19.467101, resuming normal operation.
Epoch 0:  44%|▍| 191818/432833 [5:58:44<7:30:44,  8.91it/s, loss=3.72, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  44%|▍| 191821/432833 [5:58:44<7:30:44,  8.91it/s, loss=3.72, v_num=2vt
Epoch 0:  44%|▍| 191828/432833 [5:58:44<7:30:42,  8.91it/s, loss=3.72, v_num=2vt
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.10it/s]
Epoch 0:  44%|▍| 191835/432833 [5:58:44<7:30:41,  8.91it/s, loss=3.72, v_num=2vt
Validating:  18%|█████▍                        | 18/100 [00:00<00:06, 11.93it/s]
Epoch 0:  44%|▍| 191842/432833 [5:58:45<7:30:39,  8.91it/s, loss=3.72, v_num=2vt
Validating:  26%|███████▊                      | 26/100 [00:00<00:04, 16.70it/s]
Epoch 0:  44%|▍| 191849/432833 [5:58:45<7:30:38,  8.91it/s, loss=3.72, v_num=2vt
Epoch 0:  44%|▍| 191856/432833 [5:58:45<7:30:36,  8.91it/s, loss=3.72, v_num=2vt
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 21.69it/s]
Epoch 0:  44%|▍| 191863/432833 [5:58:45<7:30:35,  8.91it/s, loss=3.72, v_num=2vt
Validating:  47%|██████████████                | 47/100 [00:01<00:02, 25.53it/s]
Epoch 0:  44%|▍| 191870/432833 [5:58:45<7:30:33,  8.91it/s, loss=3.72, v_num=2vt
Validating:  54%|████████████████▏             | 54/100 [00:01<00:01, 26.53it/s]
Epoch 0:  44%|▍| 191877/432833 [5:58:46<7:30:32,  8.91it/s, loss=3.72, v_num=2vt
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 26.05it/s]
Epoch 0:  44%|▍| 191884/432833 [5:58:46<7:30:30,  8.91it/s, loss=3.72, v_num=2vt
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 26.02it/s]
Epoch 0:  44%|▍| 191891/432833 [5:58:46<7:30:29,  8.91it/s, loss=3.72, v_num=2vt
Validating:  73%|█████████████████████▉        | 73/100 [00:02<00:01, 24.74it/s]
Validating:  76%|██████████████████████▊       | 76/100 [00:02<00:00, 25.91it/s]
Epoch 0:  44%|▍| 191898/432833 [5:58:47<7:30:27,  8.91it/s, loss=3.72, v_num=2vt
Validating:  82%|████████████████████████▌     | 82/100 [00:03<00:00, 24.56it/s]
Epoch 0:  44%|▍| 191905/432833 [5:58:47<7:30:26,  8.91it/s, loss=3.72, v_num=2vt
Validating:  88%|██████████████████████████▍   | 88/100 [00:03<00:00, 23.10it/s]
Epoch 0:  44%|▍| 191912/432833 [5:58:47<7:30:25,  8.91it/s, loss=3.72, v_num=2vt
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 21.49it/s]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  44%|▍| 191919/432833 [5:58:48<7:30:24,  8.91it/s, loss=3.72, v_num=2vt
Epoch 0:  47%|▍| 202020/432833 [6:17:44<7:11:34,  8.91it/s, loss=3.68, v_num=2vt
Epoch 0:  49%|▍| 212121/432833 [6:36:36<6:52:40,  8.91it/s, loss=3.64, v_num=2vt
Epoch 0:  51%|▌| 222222/432833 [6:55:33<6:33:50,  8.91it/s, loss=3.61, v_num=2vt
Epoch 0:  54%|▌| 232323/432833 [7:14:30<6:15:00,  8.91it/s, loss=3.57, v_num=2vt
Epoch 0:  56%|▌| 242424/432833 [7:33:27<5:56:09,  8.91it/s, loss=3.54, v_num=2vt
Epoch 0:  58%|▌| 252525/432833 [7:52:24<5:37:18,  8.91it/s, loss=3.51, v_num=2vt
Epoch 0:  61%|▌| 262626/432833 [8:11:20<5:18:26,  8.91it/s, loss=3.48, v_num=2vt
Epoch 0:  63%|▋| 272727/432833 [8:30:19<4:59:35,  8.91it/s, loss=3.46, v_num=2vt
Epoch 0:  65%|▋| 282828/432833 [8:49:15<4:40:42,  8.91it/s, loss=3.43, v_num=2vt
Epoch 0:  68%|▋| 292929/432833 [9:08:11<4:21:49,  8.91it/s, loss=3.41, v_num=2vt
Epoch 0:  70%|▋| 303030/432833 [9:27:08<4:02:56,  8.91it/s, loss=3.39, v_num=2vt
Epoch 0:  72%|▋| 313131/432833 [9:46:06<3:44:03,  8.90it/s, loss=3.37, v_num=2vt
Epoch 0:  75%|▋| 323232/432833 [10:05:01<3:25:09,  8.90it/s, loss=3.35, v_num=2v
Epoch 0:  77%|▊| 333232/432833 [10:23:52<3:06:28,  8.90it/s, loss=3.33, v_num=2v
Epoch 0:  77%|▊| 333298/432833 [10:23:55<3:06:19,  8.90it/s, loss=3.33, v_num=2v
Validating:  69%|████████████████████▋         | 69/100 [00:02<00:01, 28.64it/s]
Epoch 0:  77%|▊| 333305/432833 [10:23:55<3:06:18,  8.90it/s, loss=3.33, v_num=2v
Validating:  75%|██████████████████████▌       | 75/100 [00:03<00:00, 26.22it/s]
Epoch 0:  77%|▊| 333312/432833 [10:23:55<3:06:17,  8.90it/s, loss=3.33, v_num=2v
Epoch 0:  77%|▊| 333319/432833 [10:23:55<3:06:16,  8.90it/s, loss=3.33, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 29.05it/s]
Epoch 0:  77%|▊| 333326/432833 [10:23:56<3:06:15,  8.90it/s, loss=3.33, v_num=2v
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 29.33it/s]
Epoch 0:  77%|▊| 333333/432833 [10:23:56<3:06:14,  8.90it/s, loss=3.33, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  77%|▊| 333333/432833 [10:23:57<3:06:15,  8.90it/s, loss=3.33, v_num=2v
Epoch 0:  79%|▊| 343333/432833 [10:42:48<2:47:33,  8.90it/s, loss=3.31, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  79%|▊| 343336/432833 [10:42:48<2:47:33,  8.90it/s, loss=3.31, v_num=2v
Epoch 0:  79%|▊| 343343/432833 [10:42:48<2:47:32,  8.90it/s, loss=3.31, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.20it/s]
Validating:  13%|███▉                          | 13/100 [00:00<00:09,  9.33it/s]
Epoch 0:  79%|▊| 343350/432833 [10:42:48<2:47:31,  8.90it/s, loss=3.31, v_num=2v
Validating:  19%|█████▋                        | 19/100 [00:00<00:05, 13.63it/s]
Epoch 0:  79%|▊| 343357/432833 [10:42:49<2:47:30,  8.90it/s, loss=3.31, v_num=2v
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 18.49it/s]
Epoch 0:  79%|▊| 343364/432833 [10:42:49<2:47:29,  8.90it/s, loss=3.31, v_num=2v
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 21.21it/s]
Epoch 0:  79%|▊| 343371/432833 [10:42:49<2:47:28,  8.90it/s, loss=3.31, v_num=2v
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 21.85it/s]
Validating:  41%|████████████▎                 | 41/100 [00:01<00:02, 21.93it/s]
Epoch 0:  79%|▊| 343378/432833 [10:42:50<2:47:28,  8.90it/s, loss=3.31, v_num=2v
Validating:  47%|██████████████                | 47/100 [00:01<00:02, 24.26it/s]
Epoch 0:  79%|▊| 343385/432833 [10:42:50<2:47:27,  8.90it/s, loss=3.31, v_num=2v
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 26.76it/s]
Epoch 0:  79%|▊| 343392/432833 [10:42:50<2:47:26,  8.90it/s, loss=3.31, v_num=2v
Epoch 0:  79%|▊| 343399/432833 [10:42:50<2:47:25,  8.90it/s, loss=3.31, v_num=2v
Validating:  66%|███████████████████▊          | 66/100 [00:02<00:01, 30.04it/s]
Epoch 0:  79%|▊| 343406/432833 [10:42:50<2:47:24,  8.90it/s, loss=3.31, v_num=2v
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:00, 31.17it/s]
Epoch 0:  79%|▊| 343413/432833 [10:42:51<2:47:23,  8.90it/s, loss=3.31, v_num=2v
Epoch 0:  79%|▊| 343420/432833 [10:42:51<2:47:22,  8.90it/s, loss=3.31, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 30.73it/s]
Epoch 0:  79%|▊| 343427/432833 [10:42:51<2:47:21,  8.90it/s, loss=3.31, v_num=2v
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 23.99it/s]
Epoch 0:  79%|▊| 343434/432833 [10:42:52<2:47:20,  8.90it/s, loss=3.31, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  79%|▊| 343434/432833 [10:42:52<2:47:20,  8.90it/s, loss=3.31, v_num=2v
Epoch 0:  82%|▊| 353434/432833 [11:01:42<2:28:39,  8.90it/s, loss=3.29, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  82%|▊| 353437/432833 [11:01:43<2:28:38,  8.90it/s, loss=3.29, v_num=2v
Epoch 0:  82%|▊| 353444/432833 [11:01:43<2:28:37,  8.90it/s, loss=3.29, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.05it/s]
Epoch 0:  82%|▊| 353451/432833 [11:01:43<2:28:37,  8.90it/s, loss=3.29, v_num=2v
Validating:  17%|█████                         | 17/100 [00:00<00:07, 11.31it/s]
Epoch 0:  82%|▊| 353458/432833 [11:01:43<2:28:36,  8.90it/s, loss=3.29, v_num=2v
Validating:  24%|███████▏                      | 24/100 [00:00<00:04, 16.32it/s]
Epoch 0:  82%|▊| 353465/432833 [11:01:44<2:28:35,  8.90it/s, loss=3.29, v_num=2v
Validating:  31%|█████████▎                    | 31/100 [00:01<00:03, 18.87it/s]
Epoch 0:  82%|▊| 353472/432833 [11:01:44<2:28:34,  8.90it/s, loss=3.29, v_num=2v
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 21.35it/s]
Epoch 0:  82%|▊| 353479/432833 [11:01:44<2:28:33,  8.90it/s, loss=3.29, v_num=2v
Validating:  45%|█████████████▌                | 45/100 [00:01<00:02, 22.62it/s]
Epoch 0:  82%|▊| 353486/432833 [11:01:44<2:28:32,  8.90it/s, loss=3.29, v_num=2v
Validating:  53%|███████████████▉              | 53/100 [00:02<00:01, 27.90it/s]
Epoch 0:  82%|▊| 353493/432833 [11:01:45<2:28:31,  8.90it/s, loss=3.29, v_num=2v
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 29.83it/s]
Epoch 0:  82%|▊| 353500/432833 [11:01:45<2:28:30,  8.90it/s, loss=3.29, v_num=2v
Validating:  68%|████████████████████▍         | 68/100 [00:02<00:01, 25.15it/s]
Epoch 0:  82%|▊| 353507/432833 [11:01:45<2:28:29,  8.90it/s, loss=3.29, v_num=2v
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:01, 22.88it/s]
Epoch 0:  82%|▊| 353514/432833 [11:01:45<2:28:28,  8.90it/s, loss=3.29, v_num=2v
Validating:  81%|████████████████████████▎     | 81/100 [00:03<00:00, 23.35it/s]
Epoch 0:  82%|▊| 353521/432833 [11:01:46<2:28:28,  8.90it/s, loss=3.29, v_num=2v
Validating:  88%|██████████████████████████▍   | 88/100 [00:03<00:00, 26.10it/s]
Epoch 0:  82%|▊| 353528/432833 [11:01:46<2:28:27,  8.90it/s, loss=3.29, v_num=2v
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 28.97it/s]
Epoch 0:  82%|▊| 353535/432833 [11:01:46<2:28:26,  8.90it/s, loss=3.29, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  82%|▊| 353535/432833 [11:01:47<2:28:26,  8.90it/s, loss=3.29, v_num=2v
Epoch 0:  84%|▊| 363535/432833 [11:20:39<2:09:45,  8.90it/s, loss=3.28, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  84%|▊| 363538/432833 [11:20:40<2:09:44,  8.90it/s, loss=3.28, v_num=2v
Epoch 0:  84%|▊| 363545/432833 [11:20:40<2:09:43,  8.90it/s, loss=3.28, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:12,  6.96it/s]
Epoch 0:  84%|▊| 363552/432833 [11:20:40<2:09:42,  8.90it/s, loss=3.28, v_num=2v
Validating:  17%|█████                         | 17/100 [00:00<00:07, 11.40it/s]
Validating:  20%|██████                        | 20/100 [00:00<00:05, 13.69it/s]
Epoch 0:  84%|▊| 363559/432833 [11:20:41<2:09:42,  8.90it/s, loss=3.28, v_num=2v
Validating:  26%|███████▊                      | 26/100 [00:01<00:04, 15.48it/s]
Epoch 0:  84%|▊| 363566/432833 [11:20:41<2:09:41,  8.90it/s, loss=3.28, v_num=2v
Validating:  32%|█████████▌                    | 32/100 [00:01<00:04, 16.82it/s]
Epoch 0:  84%|▊| 363573/432833 [11:20:41<2:09:40,  8.90it/s, loss=3.28, v_num=2v
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 21.39it/s]
Epoch 0:  84%|▊| 363580/432833 [11:20:41<2:09:39,  8.90it/s, loss=3.28, v_num=2v
Validating:  45%|█████████████▌                | 45/100 [00:02<00:02, 24.06it/s]
Epoch 0:  84%|▊| 363587/432833 [11:20:42<2:09:38,  8.90it/s, loss=3.28, v_num=2v
Validating:  52%|███████████████▌              | 52/100 [00:02<00:01, 27.26it/s]
Epoch 0:  84%|▊| 363594/432833 [11:20:42<2:09:37,  8.90it/s, loss=3.28, v_num=2v
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 30.15it/s]
Epoch 0:  84%|▊| 363601/432833 [11:20:42<2:09:36,  8.90it/s, loss=3.28, v_num=2v
Validating:  68%|████████████████████▍         | 68/100 [00:02<00:01, 30.13it/s]
Epoch 0:  84%|▊| 363608/432833 [11:20:42<2:09:35,  8.90it/s, loss=3.28, v_num=2v
Epoch 0:  84%|▊| 363615/432833 [11:20:42<2:09:34,  8.90it/s, loss=3.28, v_num=2v
Epoch 0:  84%|▊| 363622/432833 [11:20:43<2:09:33,  8.90it/s, loss=3.28, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 37.38it/s]
Epoch 0:  84%|▊| 363629/432833 [11:20:43<2:09:33,  8.90it/s, loss=3.28, v_num=2v
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 27.07it/s]
Epoch 0:  84%|▊| 363636/432833 [11:20:43<2:09:32,  8.90it/s, loss=3.28, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  84%|▊| 363636/432833 [11:20:44<2:09:32,  8.90it/s, loss=3.28, v_num=2v
Epoch 0:  86%|▊| 373636/432833 [11:39:33<1:50:50,  8.90it/s, loss=3.26, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  86%|▊| 373639/432833 [11:39:33<1:50:49,  8.90it/s, loss=3.26, v_num=2v
Epoch 0:  86%|▊| 373646/432833 [11:39:33<1:50:48,  8.90it/s, loss=3.26, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:12,  6.94it/s]
Validating:  12%|███▌                          | 12/100 [00:00<00:10,  8.61it/s]
Epoch 0:  86%|▊| 373653/432833 [11:39:34<1:50:47,  8.90it/s, loss=3.26, v_num=2v
Validating:  18%|█████▍                        | 18/100 [00:00<00:06, 13.35it/s]
Epoch 0:  86%|▊| 373660/432833 [11:39:34<1:50:47,  8.90it/s, loss=3.26, v_num=2v
Validating:  25%|███████▌                      | 25/100 [00:01<00:04, 18.10it/s]
Epoch 0:  86%|▊| 373667/432833 [11:39:34<1:50:46,  8.90it/s, loss=3.26, v_num=2v
Validating:  31%|█████████▎                    | 31/100 [00:01<00:03, 22.56it/s]
Epoch 0:  86%|▊| 373674/432833 [11:39:34<1:50:45,  8.90it/s, loss=3.26, v_num=2v
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 25.33it/s]
Epoch 0:  86%|▊| 373681/432833 [11:39:35<1:50:44,  8.90it/s, loss=3.26, v_num=2v
Validating:  45%|█████████████▌                | 45/100 [00:01<00:02, 26.90it/s]
Epoch 0:  86%|▊| 373688/432833 [11:39:35<1:50:43,  8.90it/s, loss=3.26, v_num=2v
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 30.72it/s]
Epoch 0:  86%|▊| 373695/432833 [11:39:35<1:50:42,  8.90it/s, loss=3.26, v_num=2v
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 27.17it/s]
Epoch 0:  86%|▊| 373702/432833 [11:39:35<1:50:41,  8.90it/s, loss=3.26, v_num=2v
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 23.00it/s]
Epoch 0:  86%|▊| 373709/432833 [11:39:36<1:50:40,  8.90it/s, loss=3.26, v_num=2v
Validating:  74%|██████████████████████▏       | 74/100 [00:02<00:01, 24.13it/s]
Epoch 0:  86%|▊| 373716/432833 [11:39:36<1:50:40,  8.90it/s, loss=3.26, v_num=2v
Validating:  81%|████████████████████████▎     | 81/100 [00:03<00:00, 24.94it/s]
Epoch 0:  86%|▊| 373723/432833 [11:39:36<1:50:39,  8.90it/s, loss=3.26, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 22.58it/s]
Epoch 0:  86%|▊| 373730/432833 [11:39:36<1:50:38,  8.90it/s, loss=3.26, v_num=2v
Validating:  94%|████████████████████████████▏ | 94/100 [00:03<00:00, 27.24it/s]
Epoch 0:  86%|▊| 373737/432833 [11:39:37<1:50:37,  8.90it/s, loss=3.26, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  86%|▊| 373737/432833 [11:39:37<1:50:37,  8.90it/s, loss=3.26, v_num=2v
Epoch 0:  89%|▉| 383737/432833 [11:58:26<1:31:55,  8.90it/s, loss=3.25, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  89%|▉| 383740/432833 [11:58:27<1:31:54,  8.90it/s, loss=3.25, v_num=2v
Epoch 0:  89%|▉| 383747/432833 [11:58:27<1:31:53,  8.90it/s, loss=3.25, v_num=2v
Validating:  10%|███                           | 10/100 [00:00<00:12,  7.12it/s]
Validating:  13%|███▉                          | 13/100 [00:00<00:09,  9.20it/s]
Epoch 0:  89%|▉| 383754/432833 [11:58:27<1:31:53,  8.90it/s, loss=3.25, v_num=2v
Validating:  20%|██████                        | 20/100 [00:00<00:05, 13.87it/s]
Epoch 0:  89%|▉| 383761/432833 [11:58:27<1:31:52,  8.90it/s, loss=3.25, v_num=2v
Validating:  27%|████████                      | 27/100 [00:01<00:03, 18.53it/s]
Epoch 0:  89%|▉| 383768/432833 [11:58:28<1:31:51,  8.90it/s, loss=3.25, v_num=2v
Validating:  33%|█████████▉                    | 33/100 [00:01<00:03, 21.63it/s]
Epoch 0:  89%|▉| 383775/432833 [11:58:28<1:31:50,  8.90it/s, loss=3.25, v_num=2v
Validating:  40%|████████████                  | 40/100 [00:01<00:02, 26.15it/s]
Epoch 0:  89%|▉| 383782/432833 [11:58:28<1:31:49,  8.90it/s, loss=3.25, v_num=2v
Validating:  48%|██████████████▍               | 48/100 [00:01<00:01, 27.24it/s]
Epoch 0:  89%|▉| 383789/432833 [11:58:28<1:31:48,  8.90it/s, loss=3.25, v_num=2v
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 27.34it/s]
Epoch 0:  89%|▉| 383796/432833 [11:58:29<1:31:47,  8.90it/s, loss=3.25, v_num=2v
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 26.83it/s]
Epoch 0:  89%|▉| 383803/432833 [11:58:29<1:31:47,  8.90it/s, loss=3.25, v_num=2v
Validating:  66%|███████████████████▊          | 66/100 [00:02<00:01, 24.91it/s]
Epoch 0:  89%|▉| 383810/432833 [11:58:29<1:31:46,  8.90it/s, loss=3.25, v_num=2v
Validating:  74%|██████████████████████▏       | 74/100 [00:02<00:00, 29.04it/s]
Epoch 0:  89%|▉| 383817/432833 [11:58:29<1:31:45,  8.90it/s, loss=3.25, v_num=2v
Validating:  82%|████████████████████████▌     | 82/100 [00:03<00:00, 30.07it/s]
Epoch 0:  89%|▉| 383824/432833 [11:58:30<1:31:44,  8.90it/s, loss=3.25, v_num=2v
Validating:  90%|███████████████████████████   | 90/100 [00:03<00:00, 21.41it/s]
Epoch 0:  89%|▉| 383831/432833 [11:58:30<1:31:43,  8.90it/s, loss=3.25, v_num=2v
Validating:  96%|████████████████████████████▊ | 96/100 [00:03<00:00, 22.81it/s]
Epoch 0:  89%|▉| 383838/432833 [11:58:30<1:31:42,  8.90it/s, loss=3.25, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  89%|▉| 383838/432833 [11:58:31<1:31:42,  8.90it/s, loss=3.25, v_num=2v
Epoch 0:  91%|▉| 393838/432833 [12:17:21<1:13:00,  8.90it/s, loss=3.23, v_num=2v
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  91%|▉| 393841/432833 [12:17:21<1:13:00,  8.90it/s, loss=3.23, v_num=2v
Epoch 0:  91%|▉| 393848/432833 [12:17:21<1:12:59,  8.90it/s, loss=3.23, v_num=2v
Validating:  11%|███▎                          | 11/100 [00:00<00:13,  6.82it/s]
Epoch 0:  91%|▉| 393855/432833 [12:17:22<1:12:58,  8.90it/s, loss=3.23, v_num=2v
Validating:  19%|█████▋                        | 19/100 [00:00<00:06, 11.64it/s]
Epoch 0:  91%|▉| 393862/432833 [12:17:22<1:12:57,  8.90it/s, loss=3.23, v_num=2v
Validating:  27%|████████                      | 27/100 [00:01<00:04, 16.19it/s]
Epoch 0:  91%|▉| 393869/432833 [12:17:22<1:12:56,  8.90it/s, loss=3.23, v_num=2v
Validating:  34%|██████████▏                   | 34/100 [00:01<00:03, 21.03it/s]
Epoch 0:  91%|▉| 393876/432833 [12:17:22<1:12:55,  8.90it/s, loss=3.23, v_num=2v
Validating:  41%|████████████▎                 | 41/100 [00:01<00:02, 23.85it/s]
Epoch 0:  91%|▉| 393883/432833 [12:17:23<1:12:55,  8.90it/s, loss=3.23, v_num=2v
Validating:  47%|██████████████                | 47/100 [00:01<00:02, 24.57it/s]
Epoch 0:  91%|▉| 393890/432833 [12:17:23<1:12:54,  8.90it/s, loss=3.23, v_num=2v
Validating:  55%|████████████████▌             | 55/100 [00:02<00:01, 22.92it/s]
Epoch 0:  91%|▉| 393897/432833 [12:17:23<1:12:53,  8.90it/s, loss=3.23, v_num=2v
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 23.39it/s]
Epoch 0:  91%|▉| 393904/432833 [12:17:24<1:12:52,  8.90it/s, loss=3.23, v_num=2v
Validating:  67%|████████████████████          | 67/100 [00:02<00:01, 21.12it/s]
Epoch 0:  91%|▉| 393911/432833 [12:17:24<1:12:51,  8.90it/s, loss=3.23, v_num=2v
Validating:  73%|█████████████████████▉        | 73/100 [00:02<00:01, 23.19it/s]
Epoch 0:  91%|▉| 393918/432833 [12:17:24<1:12:50,  8.90it/s, loss=3.23, v_num=2v
Validating:  81%|████████████████████████▎     | 81/100 [00:03<00:00, 26.25it/s]
Epoch 0:  91%|▉| 393925/432833 [12:17:24<1:12:50,  8.90it/s, loss=3.23, v_num=2v
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 26.24it/s]
Epoch 0:  91%|▉| 393932/432833 [12:17:25<1:12:49,  8.90it/s, loss=3.23, v_num=2v
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 28.72it/s]
Epoch 0:  91%|▉| 393939/432833 [12:17:25<1:12:48,  8.90it/s, loss=3.23, v_num=2vSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  91%|▉| 393939/432833 [12:17:26<1:12:48,  8.90it/s, loss=3.23, v_num=2v
Epoch 0:  93%|▉| 403939/432833 [12:36:14<54:05,  8.90it/s, loss=3.22, v_num=2vt0
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  93%|▉| 403942/432833 [12:36:14<54:05,  8.90it/s, loss=3.22, v_num=2vt0
Validating:   5%|█▌                             | 5/100 [00:00<00:18,  5.18it/s]
Validating:   7%|██▏                            | 7/100 [00:00<00:14,  6.46it/s]
Epoch 0:  93%|▉| 403949/432833 [12:36:15<54:04,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  12%|███▌                          | 12/100 [00:00<00:08,  9.98it/s]
Epoch 0:  93%|▉| 403956/432833 [12:36:15<54:03,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  19%|█████▋                        | 19/100 [00:01<00:05, 14.39it/s]
Epoch 0:  93%|▉| 403963/432833 [12:36:15<54:02,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  25%|███████▌                      | 25/100 [00:01<00:03, 19.10it/s]
Epoch 0:  93%|▉| 403970/432833 [12:36:15<54:02,  8.90it/s, loss=3.22, v_num=2vt0
Epoch 0:  93%|▉| 403977/432833 [12:36:16<54:01,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  38%|███████████▍                  | 38/100 [00:01<00:02, 25.60it/s]
Epoch 0:  93%|▉| 403984/432833 [12:36:16<54:00,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  45%|█████████████▌                | 45/100 [00:01<00:02, 25.65it/s]
Validating:  48%|██████████████▍               | 48/100 [00:02<00:02, 24.78it/s]
Epoch 0:  93%|▉| 403991/432833 [12:36:16<53:59,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  54%|████████████████▏             | 54/100 [00:02<00:02, 22.41it/s]
Epoch 0:  93%|▉| 403998/432833 [12:36:16<53:58,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  61%|██████████████████▎           | 61/100 [00:02<00:01, 24.59it/s]
Epoch 0:  93%|▉| 404005/432833 [12:36:17<53:57,  8.90it/s, loss=3.22, v_num=2vt0
Epoch 0:  93%|▉| 404012/432833 [12:36:17<53:57,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  73%|█████████████████████▉        | 73/100 [00:02<00:00, 30.82it/s]
Epoch 0:  93%|▉| 404019/432833 [12:36:17<53:56,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  81%|████████████████████████▎     | 81/100 [00:03<00:00, 26.76it/s]
Epoch 0:  93%|▉| 404026/432833 [12:36:17<53:55,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 27.11it/s]
Epoch 0:  93%|▉| 404033/432833 [12:36:18<53:54,  8.90it/s, loss=3.22, v_num=2vt0
Validating:  95%|████████████████████████████▌ | 95/100 [00:03<00:00, 29.14it/s]
Epoch 0:  93%|▉| 404040/432833 [12:36:18<53:53,  8.90it/s, loss=3.22, v_num=2vt0Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  93%|▉| 404040/432833 [12:36:19<53:53,  8.90it/s, loss=3.22, v_num=2vt0
Epoch 0:  96%|▉| 414040/432833 [12:55:07<35:10,  8.90it/s, loss=3.2, v_num=2vt0]
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  96%|▉| 414043/432833 [12:55:08<35:10,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:   5%|█▌                             | 5/100 [00:00<00:18,  5.22it/s]
Epoch 0:  96%|▉| 414050/432833 [12:55:08<35:09,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  12%|███▌                          | 12/100 [00:00<00:09,  8.98it/s]
Epoch 0:  96%|▉| 414057/432833 [12:55:08<35:09,  8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0:  96%|▉| 414064/432833 [12:55:08<35:08,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  24%|███████▏                      | 24/100 [00:00<00:04, 17.28it/s]
Epoch 0:  96%|▉| 414071/432833 [12:55:09<35:07,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  32%|█████████▌                    | 32/100 [00:01<00:03, 21.46it/s]
Epoch 0:  96%|▉| 414078/432833 [12:55:09<35:06,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  40%|████████████                  | 40/100 [00:01<00:02, 23.19it/s]
Epoch 0:  96%|▉| 414085/432833 [12:55:09<35:05,  8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0:  96%|▉| 414092/432833 [12:55:09<35:04,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  52%|███████████████▌              | 52/100 [00:01<00:01, 26.58it/s]
Epoch 0:  96%|▉| 414099/432833 [12:55:10<35:04,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  59%|█████████████████▋            | 59/100 [00:02<00:01, 27.73it/s]
Validating:  62%|██████████████████▌           | 62/100 [00:02<00:01, 24.66it/s]
Epoch 0:  96%|▉| 414106/432833 [12:55:10<35:03,  8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0:  96%|▉| 414113/432833 [12:55:10<35:02,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  74%|██████████████████████▏       | 74/100 [00:02<00:01, 25.74it/s]
Epoch 0:  96%|▉| 414120/432833 [12:55:11<35:01,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  80%|████████████████████████      | 80/100 [00:03<00:00, 26.44it/s]
Validating:  83%|████████████████████████▉     | 83/100 [00:03<00:00, 25.04it/s]
Epoch 0:  96%|▉| 414127/432833 [12:55:11<35:00,  8.90it/s, loss=3.2, v_num=2vt0]
Validating:  89%|██████████████████████████▋   | 89/100 [00:03<00:00, 21.49it/s]
Epoch 0:  96%|▉| 414134/432833 [12:55:11<35:00,  8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0:  96%|▉| 414141/432833 [12:55:11<34:59,  8.90it/s, loss=3.2, v_num=2vt0]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  96%|▉| 414141/432833 [12:55:12<34:59,  8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0:  98%|▉| 424141/432833 [13:13:58<16:16,  8.90it/s, loss=3.19, v_num=2vt0
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                       | 0/100 [00:00<?, ?it/s]
Epoch 0:  98%|▉| 424144/432833 [13:13:58<16:15,  8.90it/s, loss=3.19, v_num=2vt0
Validating:   6%|█▊                             | 6/100 [00:00<00:17,  5.27it/s]
Epoch 0:  98%|▉| 424151/432833 [13:13:58<16:15,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  12%|███▌                          | 12/100 [00:00<00:09,  8.84it/s]
Epoch 0:  98%|▉| 424158/432833 [13:13:59<16:14,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  19%|█████▋                        | 19/100 [00:00<00:05, 13.71it/s]
Epoch 0:  98%|▉| 424165/432833 [13:13:59<16:13,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  25%|███████▌                      | 25/100 [00:01<00:04, 16.52it/s]
Epoch 0:  98%|▉| 424172/432833 [13:13:59<16:12,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  33%|█████████▉                    | 33/100 [00:01<00:03, 21.21it/s]
Epoch 0:  98%|▉| 424179/432833 [13:13:59<16:11,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  39%|███████████▋                  | 39/100 [00:01<00:02, 23.00it/s]
Epoch 0:  98%|▉| 424186/432833 [13:14:00<16:11,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  46%|█████████████▊                | 46/100 [00:01<00:02, 26.17it/s]
Epoch 0:  98%|▉| 424193/432833 [13:14:00<16:10,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  54%|████████████████▏             | 54/100 [00:02<00:01, 27.74it/s]
Epoch 0:  98%|▉| 424200/432833 [13:14:00<16:09,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  60%|██████████████████            | 60/100 [00:02<00:01, 24.91it/s]
Epoch 0:  98%|▉| 424207/432833 [13:14:00<16:08,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  66%|███████████████████▊          | 66/100 [00:02<00:01, 23.49it/s]
Epoch 0:  98%|▉| 424214/432833 [13:14:01<16:07,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  75%|██████████████████████▌       | 75/100 [00:02<00:00, 27.86it/s]
Epoch 0:  98%|▉| 424221/432833 [13:14:01<16:07,  8.90it/s, loss=3.19, v_num=2vt0
Epoch 0:  98%|▉| 424228/432833 [13:14:01<16:06,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  87%|██████████████████████████    | 87/100 [00:03<00:00, 30.92it/s]
Epoch 0:  98%|▉| 424235/432833 [13:14:01<16:05,  8.90it/s, loss=3.19, v_num=2vt0
Validating:  94%|████████████████████████████▏ | 94/100 [00:03<00:00, 25.95it/s]
Epoch 0:  98%|▉| 424242/432833 [13:14:02<16:04,  8.90it/s, loss=3.19, v_num=2vt0Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0:  98%|▉| 424242/432833 [13:14:03<16:04,  8.90it/s, loss=3.19, v_num=2vt0
Epoch 0: 100%|█| 432833/432833 [13:30:14<00:00,  8.90it/s, loss=3.18, v_num=2vt0Saving latest checkpoint...
Epoch 0: 100%|█| 432833/432833 [13:30:14<00:00,  8.90it/s, loss=3.18, v_num=2vt0

wandb: Waiting for W&B process to finish, PID 100838
wandb: Program ended successfully.
wandb:                                                                                
wandb: Find user logs for this run at: /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0/logs/debug.log
wandb: Find internal logs for this run at: /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0/logs/debug-internal.log
wandb: Run summary:
wandb:          avg_val_loss 3.04847
wandb:                 epoch 0
wandb:   trainer/global_step 209
wandb:              _runtime 47676
wandb:            _timestamp 1618365233
wandb:                 _step 45
wandb:            train_loss 2.96082
wandb: Run history:
wandb:          avg_val_loss █▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:                 epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:   trainer/global_step ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb:              _runtime ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb:            _timestamp ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb:                 _step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
wandb:            train_loss █▄▃▁
wandb: 
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: 
wandb: Synced accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_auto_scale_batch_size_False_auto_select_gpus_False_batch_size_32_benchmark_False_check_val_every_n_epoch_1_checkpoint_callback_True_data_index_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/index-040k.npy_data_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/data-040k.npy_deterministic_False_fast_dev_run_False_flush_logs_every_n_steps_100_gpus_1_gradient_clip_val_0_lang_es_limit_predict_batches_1.0_limit_test_batches_1.0_limit_train_batches_1.0_limit_val_batches_100_log_every_n_steps_50_logger_True_lr_0.001_max_epochs_1_max_seq_length_128_mmap_False_move_metrics_to_cpu_False_multiple_trainloader_mode_max_size_cycle_num_nodes_1_num_processes_1_num_sanity_val_steps_2_num_workers_4_overfit_batches_0.0_precision_32_prepare_data_per_node_True_pretrained_path_gpt2_process_position_0_reload_dataloaders_every_epoch_False_replace_sampler_ddp_True_reset_state_False_search_False_seed_7649832_stochastic_weight_avg_False_subset_size_1.0_sync_batchnorm_False_terminate_on_nan_False_tokenizer_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/vocabularies/es-040k.tokenizer.json_tpu_cores_<function _gpus_arg_default at 0x7ff2b2d63310>_track_grad_norm_-1_unfreeze_False_val_check_interval_10000_verbose_False_version_0_vocab_size_50257_weights_summary_top_wte_only_False: https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/16p22vt0

So it took a bit of work, but I got it logging to wandb. I had to alter `get_trainer_kwargs` in `main.py` and then disable the `self.logger.experiment.add_text('example', txt)` call, as the wandb logger doesn’t support the `add_text` method (that belongs to the TensorBoard `SummaryWriter` API).
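When using the `WandbLogger`, `self.logger.experiment` is the underlying wandb `Run`, which exposes `log(dict)` rather than TensorBoard’s `add_text`. Rather than disabling the call entirely, one option is a small helper that works with either backend. This is a sketch of my own (the `log_text` name and the fallback behaviour are mine, not from the gpt2-recycle code):

```python
def log_text(logger, key: str, txt: str) -> None:
    """Log a text sample via whichever experiment backend the logger wraps."""
    experiment = logger.experiment
    if hasattr(experiment, "add_text"):
        # TensorBoard SummaryWriter supports add_text directly
        experiment.add_text(key, txt)
    else:
        # the wandb Run object has log(), which accepts plain values
        experiment.log({key: txt})
```

Then the module can call `log_text(self.logger, 'example', txt)` without caring which logger is configured.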

Anyway, it’s running now, and training the embedding for one epoch should take about 14 hours. At the moment it’s reporting a validation loss of 6.562 for the very first round of validation, which corresponds to a perplexity of over 700.
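Perplexity is just the exponential of the cross-entropy loss (in nats), so the conversion from the reported losses is a one-liner:

```python
import math

def perplexity(loss: float) -> float:
    """Convert a cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

print(perplexity(6.562))    # first validation round: just over 700
print(perplexity(3.04847))  # final avg_val_loss from the run summary: about 21
```

Going from a perplexity of over 700 to about 21 over the epoch is a substantial improvement, though how that translates to downstream task performance remains to be seen.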