! ls submodules/gpt2-recycle
environment.yml example.py HOWTO.md LICENSE README.md src
April 12, 2021
I’ve been trying to retrain GPT-2 to work as a language model on another language. Although the training looks like it works well, I have found that the downstream performance on a task is lacking. It seems that perplexity is not the only thing I need to work on.
To this end one of my co-workers linked a paper by Wietse de Vries and Malvina Nissim which covers how to effectively retrain GPT-2 for a different language. It does this in stages in quite a neat way.
First a new tokenizer is created for the target language, which requires a new embedding layer. That embedding layer is then trained while the rest of the model is frozen (to prevent catastrophic forgetting). Finally the whole model can be fine-tuned a little.
One of the neat approaches in the paper is that they train an embedding for GPT-2 small first, and have a way to scale that up to GPT-2 medium. They do this by finding the mapping between the embedding in GPT-2 small and medium and then applying this to their custom language embedding.
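I haven't reproduced their exact procedure here, but the core idea can be sketched with a least-squares fit. Everything below is synthetic for illustration (only the 768 and 1024 embedding dimensions are the real GPT-2 small and medium sizes); the real version fits the mapping between the two English embedding matrices:

```python
import numpy as np

# Synthetic stand-ins: GPT-2 small embeds tokens in 768 dimensions,
# GPT-2 medium in 1024. Rows are tokens, columns are dimensions.
rng = np.random.default_rng(42)
vocab_size = 1_000
emb_small_en = rng.normal(size=(vocab_size, 768))

# Fabricate medium embeddings that really are a linear map of the small
# ones, so the recovered mapping can be checked exactly.
true_map = rng.normal(size=(768, 1024))
emb_medium_en = emb_small_en @ true_map

# Fit the small -> medium mapping on the shared (English) vocabulary...
mapping, *_ = np.linalg.lstsq(emb_small_en, emb_medium_en, rcond=None)

# ...and apply it to an embedding trained for the new language on the
# small model, producing a medium-sized embedding for that language.
emb_small_es = rng.normal(size=(40_000, 768))
emb_medium_es = emb_small_es @ mapping
print(emb_medium_es.shape)  # (40000, 1024)
```

The appeal is that the expensive embedding training only has to happen once, on the small model.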
Let’s see if I can get this working.
To make it easy to use this library I have cloned it as a submodule at submodules/gpt2-recycle. I am going to be using a data folder at data/2021-04-12-gpt-2-recycle.
The data preparation requires the following steps: creating the custom tokenizer, tokenizing the datasets, and splitting the result into train and validation sets.
The first thing to do is to create the custom tokenizer. Tokenization is done using byte pair encoding, which starts from individual bytes and repeatedly merges the most frequent adjacent pairs into single tokens. I've got two datasets for this work: one is the Spanish Wikipedia and the other is a comparable number of Twitter sentences. Combining Wikipedia and Twitter should prevent overfitting on the more formal Wikipedia language.
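As a refresher on how byte pair encoding builds its vocabulary, here is a toy merge loop over a made-up word frequency table. This is just the counting idea, nothing like the real tokenizers trainer:

```python
from collections import Counter

def most_common_pair(words: dict) -> tuple:
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict, pair: tuple) -> dict:
    # Replace every occurrence of the pair with a single merged symbol.
    # (A naive string replace is fine for this toy example.)
    return {
        word.replace(' '.join(pair), ''.join(pair)): freq
        for word, freq in words.items()
    }

# Toy corpus: words split into characters, with their counts.
words = {'h u g': 10, 'p u g': 5, 'h u g s': 5}
merges = []
for _ in range(2):
    pair = most_common_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('u', 'g'), ('h', 'ug')]
print(words)   # {'hug': 10, 'p ug': 5, 'hug s': 5}
```

Each merge adds one new token to the vocabulary, so a 40,000 token vocabulary is roughly 40,000 merges on top of the base alphabet.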
> creating vocabulary with vocab size 40000k
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 48, in <module>
main()
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 43, in main
tokenizer = train_tokenizer(args.lang, args.size * 1000)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/1_vocab_0_create.py", line 25, in train_tokenizer
tokenizer.train(trainer, (Path('data') / lang / 'plaintext').glob('*/*.txt'))
TypeError: Can't convert <tokenizers.trainers.BpeTrainer object at 0x7f191200a450> to Sequence
Here is a small problem with the existing code: the order of the arguments to the tokenizer's train method has changed. To fix this I'll just recreate the code, passing the arguments as kwargs.
# A discussion of the appropriate vocabulary size is in 2_prepare_2_eval.ipynb in gpt2-recycle
# A larger vocabulary results in rarer tokens.
# They chose 40,000 as a reasonable tradeoff.
VOCAB_SIZE = 40_000
BLOCK_SIZE = 128
BATCH_SIZE = 16
LEARNING_RATE = 1e-5
MAX_STEPS = 15_000_000 // BATCH_SIZE
MODEL_NAME = "gpt2-medium" # "gpt2-small"
from pathlib import Path
from typing import List

from tokenizers import (
    Tokenizer, decoders, models, normalizers, pre_tokenizers, processors, trainers,
)

def make_tokenizer(lang: str, size: int) -> None:
    base_dir = PREPARATION_FOLDER / 'vocabularies'
    dst_path = base_dir / f'{lang}-{str(size // 1_000).zfill(3)}k.tokenizer.json'
    if dst_path.exists():
        print(f' > {dst_path} already exists. skipping')
        return

    print(f' > creating vocabulary with vocab size {size // 1_000}k')
    tokenizer = train_tokenizer(files=FILES, vocab_size=size, alphabet=sorted(CHARACTERS))
    tokenizer.save(str(dst_path), pretty=True)

def train_tokenizer(files: List[Path], vocab_size: int, alphabet: List[str]) -> Tokenizer:
    tokenizer = Tokenizer(models.BPE())
    tokenizer.normalizer = normalizers.NFC()
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        show_progress=True,
        special_tokens=['<unk>', '<s>', '</s>'],
        initial_alphabet=alphabet,
    )
    files = [str(file.resolve()) for file in files]
    tokenizer.train(files=files, trainer=trainer)
    return tokenizer
This involves tokenizing every sentence. Since I have organized the data into files with one sentence per line, I had to double space the files, because the tokenizer expects a blank line between each document.
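The double spacing itself is trivial; a sketch of the conversion (the function name and paths are mine, not from the gpt2-recycle scripts):

```python
from pathlib import Path

def double_space(src: Path, dst: Path) -> None:
    # The tokenizer treats a blank line as a document boundary, so write
    # an empty line after every sentence to keep each sentence a document.
    with src.open() as infile, dst.open('w') as outfile:
        for line in infile:
            outfile.write(line.rstrip('\n') + '\n\n')
```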
> preparing data/es/preparation/prepared/data-040k.pkl
🔥 data/es/preparation/plaintext/spanish-twitter.txt
15580452it [09:51, 26343.56it/s]
::: 7,790,226 examples loaded
🔥 data/es/preparation/plaintext/spanish-wikipedia.txt
6543442it [06:03, 17458.70it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1093 > 1024). Running this sequence through the model will result in indexing errors
14803750it [14:05, 17516.41it/s]
::: 15,192,101 examples loaded
15,192,101 examples
> exporting data/es/preparation/prepared/data-040k.pkl
> data/es/preparation/prepared/data-040k.pkl
::: loading examples
::: counting lengths
::: saved 15192101 lengths to data/es/preparation/prepared/data-040k.pkl.lengths
::: counting coverages
::: saved 42042 coverage scores to data/es/preparation/prepared/data-040k.pkl.coverage
> loading data/es/preparation/prepared/data-040k.pkl.lengths
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/preparation/3_split_train_val.py", line 12, in <module>
n = args.size.zfill(3)
AttributeError: 'int' object has no attribute 'zfill'
This is an odd one: zfill is a string method, but here it is being called on an int. I think the argparse parameter type was updated at some point; converting explicitly fixes it.
def split_data(vocab_size: int, val_ratio: float) -> None:
    from torch import randperm
    import numpy as np

    n = str(vocab_size // 1000).zfill(3)
    prep_dir = PREPARATION_FOLDER / 'final'
    src_path = prep_dir / f'index-{n}k.npy'
    tra_dst_path = prep_dir / f'index-train-{n}k.npy'
    val_dst_path = prep_dir / f'index-valid-{n}k.npy'

    dat = np.load(src_path)
    n_val = int(len(dat) * val_ratio)
    n_tra = len(dat) - n_val
    print(f'train={n_tra:,} valid={n_val:,}')

    ind = randperm(len(dat)).tolist()
    ind_tra = ind[:n_tra]
    ind_val = ind[n_tra:]

    np.save(tra_dst_path, ind_tra, allow_pickle=False)
    np.save(val_dst_path, ind_val, allow_pickle=False)
So this has completed the data preparation. The next thing is to train the model embeddings.
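It's worth being clear about what this stage trains: only the token embedding is updated, while every other weight stays frozen. A minimal PyTorch sketch of that freezing pattern (the module below is a made-up stand-in, not the actual EmbeddingTunerModel):

```python
import torch
from torch import nn

class TinyLM(nn.Module):
    # Hypothetical stand-in for GPT-2: a token embedding plus a "body"
    # layer that represents the transformer blocks we want to leave alone.
    def __init__(self, vocab_size: int = 100, dim: int = 16):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)  # new-language embedding
        self.body = nn.Linear(dim, dim)           # stands in for the blocks

    def forward(self, tokens):
        return self.body(self.wte(tokens))

model = TinyLM()

# Freeze everything, then unfreeze only the embedding weights.
for param in model.parameters():
    param.requires_grad = False
model.wte.weight.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['wte.weight']
```

Any optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` will then only ever touch the embedding.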
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 16, in <module>
from pytorch_lightning.callbacks import LearningRateLogger, ModelCheckpoint
ImportError: cannot import name 'LearningRateLogger' from 'pytorch_lightning.callbacks' (/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py)
LearningRateLogger was renamed to LearningRateMonitor in https://github.com/PyTorchLightning/pytorch-lightning/pull/3251, so I've just changed the import in the file for now.
usage: main.py [-h] [--logger [LOGGER]]
[--checkpoint_callback [CHECKPOINT_CALLBACK]]
[--default_root_dir DEFAULT_ROOT_DIR]
[--gradient_clip_val GRADIENT_CLIP_VAL]
[--process_position PROCESS_POSITION] [--num_nodes NUM_NODES]
[--num_processes NUM_PROCESSES] [--gpus GPUS]
[--auto_select_gpus [AUTO_SELECT_GPUS]] [--tpu_cores TPU_CORES]
[--log_gpu_memory LOG_GPU_MEMORY]
[--progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE]
[--overfit_batches OVERFIT_BATCHES]
[--track_grad_norm TRACK_GRAD_NORM]
[--check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH]
[--fast_dev_run [FAST_DEV_RUN]]
[--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES]
[--max_epochs MAX_EPOCHS] [--min_epochs MIN_EPOCHS]
[--max_steps MAX_STEPS] [--min_steps MIN_STEPS]
[--limit_train_batches LIMIT_TRAIN_BATCHES]
[--limit_val_batches LIMIT_VAL_BATCHES]
[--limit_test_batches LIMIT_TEST_BATCHES]
[--limit_predict_batches LIMIT_PREDICT_BATCHES]
[--val_check_interval VAL_CHECK_INTERVAL]
[--flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS]
[--log_every_n_steps LOG_EVERY_N_STEPS]
[--accelerator ACCELERATOR] [--sync_batchnorm [SYNC_BATCHNORM]]
[--precision PRECISION] [--weights_summary WEIGHTS_SUMMARY]
[--weights_save_path WEIGHTS_SAVE_PATH]
[--num_sanity_val_steps NUM_SANITY_VAL_STEPS]
[--truncated_bptt_steps TRUNCATED_BPTT_STEPS]
[--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
[--profiler [PROFILER]] [--benchmark [BENCHMARK]]
[--deterministic [DETERMINISTIC]]
[--reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]]
[--auto_lr_find [AUTO_LR_FIND]]
[--replace_sampler_ddp [REPLACE_SAMPLER_DDP]]
[--terminate_on_nan [TERMINATE_ON_NAN]]
[--auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]]
[--prepare_data_per_node [PREPARE_DATA_PER_NODE]]
[--plugins PLUGINS] [--amp_backend AMP_BACKEND]
[--amp_level AMP_LEVEL]
[--distributed_backend DISTRIBUTED_BACKEND]
[--automatic_optimization [AUTOMATIC_OPTIMIZATION]]
[--move_metrics_to_cpu [MOVE_METRICS_TO_CPU]]
[--enable_pl_optimizer [ENABLE_PL_OPTIMIZER]]
[--multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE]
[--stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]]
[--num_workers NUM_WORKERS] [--data_path DATA_PATH]
[--data_index_path DATA_INDEX_PATH] [--mmap]
[--max_seq_length MAX_SEQ_LENGTH]
[--pretrained_path PRETRAINED_PATH] [--vocab_size VOCAB_SIZE]
[--tokenizer_path TOKENIZER_PATH] [--wte_only] [--unfreeze]
[--reset_state] [--subset_size SUBSET_SIZE] [--lr LR]
[--batch_size BATCH_SIZE] [--verbose] [--search] [--seed SEED]
[--version VERSION] [--name NAME] --lang LANG
optional arguments:
-h, --help show this help message and exit
--logger [LOGGER] Logger (or iterable collection of loggers) for
experiment tracking.
--checkpoint_callback [CHECKPOINT_CALLBACK]
If ``True``, enable checkpointing. It will configure a
default ModelCheckpoint callback if there is no user-
defined ModelCheckpoint in :paramref:`~pytorch_lightni
ng.trainer.trainer.Trainer.callbacks`. Default:
``True``. .. warning:: Passing a ModelCheckpoint
instance to this argument is deprecated since v1.1 and
will be unsupported from v1.3. Use `callbacks`
argument instead.
--default_root_dir DEFAULT_ROOT_DIR
Default path for logs and weights when no
logger/ckpt_callback passed. Default: ``os.getcwd()``.
Can be remote file paths such as `s3://mybucket/path`
or 'hdfs://path/'
--gradient_clip_val GRADIENT_CLIP_VAL
0 means don't clip.
--process_position PROCESS_POSITION
orders the progress bar when running multiple models
on same machine.
--num_nodes NUM_NODES
number of GPU nodes for distributed training.
--num_processes NUM_PROCESSES
number of processes for distributed training with
distributed_backend="ddp_cpu"
--gpus GPUS number of gpus to train on (int) or which GPUs to
train on (list or str) applied per node
--auto_select_gpus [AUTO_SELECT_GPUS]
If enabled and `gpus` is an integer, pick available
gpus automatically. This is especially useful when
GPUs are configured to be in "exclusive mode", such
that only one process at a time can access them.
--tpu_cores TPU_CORES
How many TPU cores to train on (1 or 8) / Single TPU
to train on [1]
--log_gpu_memory LOG_GPU_MEMORY
None, 'min_max', 'all'. Might slow performance
--progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE
How often to refresh progress bar (in steps). Value
``0`` disables progress bar. Ignored when a custom
progress bar is passed to
:paramref:`~Trainer.callbacks`. Default: None, means a
suitable value will be chosen based on the environment
(terminal, Google COLAB, etc.).
--overfit_batches OVERFIT_BATCHES
Overfit a percent of training data (float) or a set
number of batches (int). Default: 0.0
--track_grad_norm TRACK_GRAD_NORM
-1 no tracking. Otherwise tracks that p-norm. May be
set to 'inf' infinity-norm.
--check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH
Check val every n train epochs.
--fast_dev_run [FAST_DEV_RUN]
runs n if set to ``n`` (int) else 1 if set to ``True``
batch(es) of train, val and test to find any bugs (ie:
a sort of unit test).
--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
Accumulates grads every k batches or as set up in the
dict.
--max_epochs MAX_EPOCHS
Stop training once this number of epochs is reached.
Disabled by default (None). If both max_epochs and
max_steps are not specified, defaults to
``max_epochs`` = 1000.
--min_epochs MIN_EPOCHS
Force training for at least these many epochs.
Disabled by default (None). If both min_epochs and
min_steps are not specified, defaults to
``min_epochs`` = 1.
--max_steps MAX_STEPS
Stop training after this number of steps. Disabled by
default (None).
--min_steps MIN_STEPS
Force training for at least these number of steps.
Disabled by default (None).
--limit_train_batches LIMIT_TRAIN_BATCHES
How much of training dataset to check (floats =
percent, int = num_batches)
--limit_val_batches LIMIT_VAL_BATCHES
How much of validation dataset to check (floats =
percent, int = num_batches)
--limit_test_batches LIMIT_TEST_BATCHES
How much of test dataset to check (floats = percent,
int = num_batches)
--limit_predict_batches LIMIT_PREDICT_BATCHES
--val_check_interval VAL_CHECK_INTERVAL
How often to check the validation set. Use float to
check within a training epoch, use int to check every
n steps (batches).
--flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS
How often to flush logs to disk (defaults to every 100
steps).
--log_every_n_steps LOG_EVERY_N_STEPS
How often to log within steps (defaults to every 50
steps).
--accelerator ACCELERATOR
Previously known as distributed_backend (dp, ddp,
ddp2, etc...). Can also take in an accelerator object
for custom hardware.
--sync_batchnorm [SYNC_BATCHNORM]
Synchronize batch norm layers between process
groups/whole world.
--precision PRECISION
Full precision (32), half precision (16). Can be used
on CPU, GPU or TPUs.
--weights_summary WEIGHTS_SUMMARY
Prints a summary of the weights when training begins.
--weights_save_path WEIGHTS_SAVE_PATH
Where to save weights if specified. Will override
default_root_dir for checkpoints only. Use this if for
whatever reason you need the checkpoints stored in a
different place than the logs written in
`default_root_dir`. Can be remote file paths such as
`s3://mybucket/path` or 'hdfs://path/' Defaults to
`default_root_dir`.
--num_sanity_val_steps NUM_SANITY_VAL_STEPS
Sanity check runs n validation batches before starting
the training routine. Set it to `-1` to run all
batches in all validation dataloaders. Default: 2
--truncated_bptt_steps TRUNCATED_BPTT_STEPS
Truncated back prop breaks performs backprop every k
steps of much longer sequence.
--resume_from_checkpoint RESUME_FROM_CHECKPOINT
Path/URL of the checkpoint from which training is
resumed. If there is no checkpoint file at the path,
start from scratch. If resuming from mid-epoch
checkpoint, training will start from the beginning of
the next epoch.
--profiler [PROFILER]
To profile individual steps during training and assist
in identifying bottlenecks. Passing bool value is
deprecated in v1.1 and will be removed in v1.3.
--benchmark [BENCHMARK]
If true enables cudnn.benchmark.
--deterministic [DETERMINISTIC]
If true enables cudnn.deterministic.
--reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]
Set to True to reload dataloaders every epoch.
--auto_lr_find [AUTO_LR_FIND]
If set to True, will make trainer.tune() run a
learning rate finder, trying to optimize initial
learning for faster convergence. trainer.tune() method
will set the suggested learning rate in self.lr or
self.learning_rate in the LightningModule. To use a
different key set a string instead of True with the
key name.
--replace_sampler_ddp [REPLACE_SAMPLER_DDP]
Explicitly enables or disables sampler replacement. If
not specified this will toggled automatically when DDP
is used. By default it will add ``shuffle=True`` for
train sampler and ``shuffle=False`` for val/test
sampler. If you want to customize it, you can set
``replace_sampler_ddp=False`` and add your own
distributed sampler.
--terminate_on_nan [TERMINATE_ON_NAN]
If set to True, will terminate training (by raising a
`ValueError`) at the end of each training batch, if
any of the parameters or the loss are NaN or +/-inf.
--auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]
If set to True, will `initially` run a batch size
finder trying to find the largest batch size that fits
into memory. The result will be stored in
self.batch_size in the LightningModule. Additionally,
can be set to either `power` that estimates the batch
size through a power search or `binsearch` that
estimates the batch size through a binary search.
--prepare_data_per_node [PREPARE_DATA_PER_NODE]
If True, each LOCAL_RANK=0 will call prepare data.
Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare
data
--plugins PLUGINS Plugins allow modification of core behavior like ddp
and amp, and enable custom lightning plugins.
--amp_backend AMP_BACKEND
The mixed precision backend to use ("native" or
"apex")
--amp_level AMP_LEVEL
The optimization level to use (O1, O2, etc...).
--distributed_backend DISTRIBUTED_BACKEND
deprecated. Please use 'accelerator'
--automatic_optimization [AUTOMATIC_OPTIMIZATION]
If False you are responsible for calling .backward,
.step, zero_grad in LightningModule. This argument has
been moved to LightningModule. It is deprecated here
in v1.1 and will be removed in v1.3.
--move_metrics_to_cpu [MOVE_METRICS_TO_CPU]
Whether to force internal logged metrics to be moved
to cpu. This can save some gpu memory, but can make
training slower. Use with attention.
--enable_pl_optimizer [ENABLE_PL_OPTIMIZER]
If True, each optimizer will be wrapped by
`pytorch_lightning.core.optimizer.LightningOptimizer`.
It allows Lightning to handle AMP, TPU,
accumulated_gradients, etc. .. warning:: Currently
deprecated and it will be removed in v1.3
--multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE
How to loop over the datasets when there are multiple
train loaders. In 'max_size_cycle' mode, the trainer
ends one epoch when the largest dataset is traversed,
and smaller datasets reload when running out of their
data. In 'min_size' mode, all the datasets reload when
reaching the minimum length of datasets.
--stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]
Whether to use `Stochastic Weight Averaging (SWA)
<https://pytorch.org/blog/pytorch-1.6-now-includes-
stochastic-weight-averaging/>_`
--num_workers NUM_WORKERS
--data_path DATA_PATH
--data_index_path DATA_INDEX_PATH
--mmap
--max_seq_length MAX_SEQ_LENGTH
--pretrained_path PRETRAINED_PATH
--vocab_size VOCAB_SIZE
--tokenizer_path TOKENIZER_PATH
--wte_only
--unfreeze
--reset_state
--subset_size SUBSET_SIZE
--lr LR
--batch_size BATCH_SIZE
--verbose
--search
--seed SEED
--version VERSION
--name NAME
--lang LANG
This is an extensive set of options. I want to keep this as it is a nice summary of what is available for training.
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
--accumulate_grad_batches 2000 \
--max_steps {MAX_STEPS} \
--limit_val_batches 100 \
--val_check_interval 10000 \
--auto_lr_find True \
--auto_scale_batch_size True \
--amp_backend amp \
--amp_level O1 \
--data_path {DATA_FOLDER} \
--data_index_path {DATA_FOLDER} \
--max_seq_length 128 \
--vocab_size 40 \
--tokenizer_path {DATA_FOLDER} \
--lang es
"""
! $command
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
main()
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 540, in main
trainer_kwargs = get_trainer_kwargs(args)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 481, in get_trainer_kwargs
checkpoint_callback = ModelCheckpoint(filepath=None,
TypeError: __init__() got an unexpected keyword argument 'filepath'
I think this has been renamed to dirpath. Maybe pytorch lightning should put more effort into backward compatibility?
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
--accumulate_grad_batches 2000 \
--max_steps {MAX_STEPS} \
--limit_val_batches 100 \
--val_check_interval 10000 \
--auto_lr_find True \
--auto_scale_batch_size True \
--amp_backend amp \
--amp_level O1 \
--data_path {DATA_FOLDER} \
--data_index_path {DATA_FOLDER} \
--max_seq_length 128 \
--vocab_size 40 \
--tokenizer_path {DATA_FOLDER} \
--lang es
"""
! $command
starting: 0
Global seed set to 7649832
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
main()
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 554, in main
model = EmbeddingTunerModel(args)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 241, in __init__
with open(Path('data') / self.hparams.lang / 'config.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/es/config.json'
I’ve just created a json file with an empty dict in that place.
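For reference, that was nothing more than this (run from the data folder; an empty JSON object is enough to get past the FileNotFoundError):

```python
from pathlib import Path

# The training script expects a per-language config file at
# data/<lang>/config.json; create it with an empty JSON object.
config_path = Path('data') / 'es' / 'config.json'
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text('{}')
```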
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
--accumulate_grad_batches 2000 \
--max_steps {MAX_STEPS} \
--gpus 1 \
--limit_val_batches 100 \
--val_check_interval 10000 \
--auto_lr_find True \
--auto_scale_batch_size True \
--amp_backend amp \
--amp_level O1 \
--data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
--data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
--max_seq_length 128 \
--vocab_size 40 \
--tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
--lang es
"""
! $command
starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/matthew/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 582, in <module>
main()
File "/home/matthew/Programming/Blog/blog/notebooks/submodules/gpt2-recycle/src/training/main.py", line 578, in main
trainer.fit(model)
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self.accelerator.setup(self, model) # note: this sets up self.lightning_module
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup
return super().setup(trainer, model)
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in setup
self.setup_optimizers(trainer)
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 315, in setup_optimizers
optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 160, in init_optimizers
return trainer.init_optimizers(model)
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 83, in init_optimizers
lr_schedulers = self.configure_schedulers(lr_schedulers, monitor=monitor)
File "/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 133, in configure_schedulers
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: `configure_optimizers` must include a monitor when a `ReduceLROnPlateau` scheduler is used. For example: {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "metric_to_track"}
So I wonder how the monitor is supposed to be configured. I want to get this training quickly so I'm going to downgrade pytorch lightning to 1.0. I tried downgrading and it still doesn't work. It turns out that the configure_optimizers method of the model needs to change. This is what I changed it to:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, self.m.parameters()),
        lr=self.lr,
        amsgrad=True,
    )
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=1)
    return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "loss"}
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
--accumulate_grad_batches 2000 \
--max_steps {MAX_STEPS} \
--gpus 1 \
--limit_val_batches 100 \
--val_check_interval 10000 \
--auto_lr_find True \
--auto_scale_batch_size True \
--amp_backend amp \
--amp_level O1 \
--data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
--data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
--max_seq_length 128 \
--vocab_size 50257 \
--tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
--lang es
"""
! $command
wandb: Currently logged in as: matthewfranglen (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.25
wandb: Syncing run accelerator_None_accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_auto_scale_batch_size_True_auto_select_gpus_False_automatic_optimization_None_batch_size_3_benchmark_False_check_val_every_n_epoch_1_checkpoint_callback_True_data_index_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/index-040k.npy_data_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/data-040k.npy_default_root_dir_None_deterministic_False_distributed_backend_None_enable_pl_optimizer_None_fast_dev_run_False_flush_logs_every_n_steps_100_gpus_1_gradient_clip_val_0_lang_es_limit_predict_batches_1.0_limit_test_batches_1.0_limit_train_batches_1.0_limit_val_batches_100_log_every_n_steps_50_log_gpu_memory_None_logger_True_lr_0.001_max_epochs_None_max_seq_length_128_max_steps_937500_min_epochs_None_min_steps_None_mmap_False_move_metrics_to_cpu_False_multiple_trainloader_mode_max_size_cycle_name_None_num_nodes_1_num_processes_1_num_sanity_val_steps_2_num_workers_4_overfit_batches_0.0_plugins_None_precision_32_prepare_data_per_node_True_pretrained_path_gpt2_process_position_0_profiler_None_progress_bar_refresh_rate_None_reload_dataloaders_every_epoch_False_replace_sampler_ddp_True_reset_state_False_resume_from_checkpoint_None_search_False_seed_7649832_stochastic_weight_avg_False_subset_size_1.0_sync_batchnorm_False_terminate_on_nan_False_tokenizer_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/vocabularies/es-040k.tokenizer.json_tpu_cores_<function _gpus_arg_default at 0x7ff22ca611f0>_track_grad_norm_-1_truncated_bptt_steps_None_unfreeze_False_val_check_interval_10000_verbose_False_version_0_vocab_size_50257_weights_save_path_None_weights_summary_top_wte_only_False
wandb: ⭐️ View project at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es
wandb: 🚀 View run at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/3l26l38n
wandb: Run data is saved locally in /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_120325-3l26l38n
wandb: Run `wandb offline` to turn off syncing.
starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------
0 | m | GPT2LMHeadModel | 124 M
-----------------------------------------
124 M Trainable params
0 Non-trainable params
124 M Total params
497.759 Total estimated model params size (MB)
validation examples: 1,523,919
Validation sanity check: 50%|██████████ | 1/2 [00:00<00:00, 5.76it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
warnings.warn(*args, **kwargs)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: RuntimeWarning: You are using `LearningRateMonitor` callback with models that have no learning rate schedulers. Please see documentation for `configure_optimizers` method.
warnings.warn(*args, **kwargs)
training examples: 13,716,233
Epoch 0: 0%| | 0/4617778 [00:00<?, ?it/s]/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The {log:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
Please use self.log(...) inside the lightningModule instead.
# log on a step or aggregate epoch metric to the logger and/or progress bar (inside LightningModule)
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 10000/4617778 [04:07<31:43:35, 40.34it/s, loss=8.24, v_num=0]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 0%| | 10004/4617778 [04:08<31:44:44, 40.32it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10017/4617778 [04:08<31:43:05, 40.35it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10030/4617778 [04:08<31:41:27, 40.39it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10043/4617778 [04:08<31:39:48, 40.42it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10056/4617778 [04:08<31:38:08, 40.46it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10069/4617778 [04:08<31:36:29, 40.49it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10082/4617778 [04:08<31:34:50, 40.53it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10095/4617778 [04:08<31:33:12, 40.56it/s, loss=8.24, v_num=0]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 0%| | 10101/4617778 [04:09<31:38:21, 40.45it/s, loss=8.24, v_num=0]
Epoch 0: 0%| | 10429/4617778 [04:17<31:38:29, 40.45it/s, loss=8.24, v_num=0]^C
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
warnings.warn(*args, **kwargs)
Saving latest checkpoint...
Epoch 0: 0%| | 10429/4617778 [04:18<31:39:52, 40.42it/s, loss=8.24, v_num=0]
So this is finally training. The one thing I still want to do is get it reporting to Weights & Biases so I can track it.
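Before looking at the next run, it's worth checking what the trainer flags in the command below imply. A quick sanity check, using the flag values plus the example count that the run itself reports:

```python
import math

# Sanity-checking the trainer flags: how many batches an epoch takes and
# the effective batch size once gradient accumulation is applied.
examples = 13_716_233        # training examples reported by the run
batch_size = 32              # --batch_size
accumulate = 2_000           # --accumulate_grad_batches
val_interval = 10_000        # --val_check_interval
val_batches = 100            # --limit_val_batches

train_steps = math.ceil(examples / batch_size)
val_steps = (train_steps // val_interval) * val_batches

print(train_steps + val_steps)   # 432,833 -- the total shown in the progress bar
print(batch_size * accumulate)   # 64,000 examples per optimizer step
```

The progress-bar total of 432,833 is the 428,633 training batches plus the 100 validation batches run at every 10,000-step checkpoint, which is why it is slightly larger than examples divided by batch size.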
command = f"""
cd {DATA_FOLDER} ; PYTHONPATH={SUBMODULE_FOLDER} python -m training.main \
--accumulate_grad_batches 2000 \
--max_epochs 1 \
--gpus 1 \
--limit_val_batches 100 \
--val_check_interval 10000 \
--auto_lr_find True \
--batch_size 32 \
--amp_backend amp \
--amp_level O1 \
--data_path {DATA_FOLDER}/data/es/preparation/final/data-040k.npy \
--data_index_path {DATA_FOLDER}/data/es/preparation/final/index-040k.npy \
--max_seq_length 128 \
--vocab_size 50257 \
--tokenizer_path {DATA_FOLDER}/data/es/preparation/vocabularies/es-040k.tokenizer.json \
--lang es
"""
! $command
starting: 0
Global seed set to 7649832
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
wandb: Currently logged in as: matthewfranglen (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.25
wandb: Syncing run accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_... [run name, which encodes every hyperparameter, trimmed]
wandb: ⭐️ View project at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es
wandb: 🚀 View run at https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/16p22vt0
wandb: Run data is saved locally in /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0
wandb: Run `wandb offline` to turn off syncing.
| Name | Type | Params
-----------------------------------------
0 | m | GPT2LMHeadModel | 124 M
-----------------------------------------
124 M Trainable params
0 Non-trainable params
124 M Total params
497.759 Total estimated model params size (MB)
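That 497.759 MB estimate is just the parameter count times four bytes per fp32 weight. A quick check (the exact count of 124,439,808 for GPT-2 small is my assumption here; the summary above only shows it rounded to 124 M):

```python
# 124,439,808 is the parameter count usually quoted for GPT-2 small
# (an assumption -- the Lightning summary only prints "124 M").
params = 124_439_808
fp32_bytes = 4

print(params * fp32_bytes / 1e6)  # 497.759232, matching the MB figure above
```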
validation examples: 1,523,919
Validation sanity check: 50%|██████████ | 1/2 [00:00<00:00, 5.29it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
warnings.warn(*args, **kwargs)
training examples: 13,716,233
Epoch 0: 0%| | 0/432833 [00:00<?, ?it/s]
[per-step progress-bar and validation lines trimmed; the loss after each 10,000-step validation checkpoint:]
Epoch 0: 2%| | 10101/432833 [18:33<12:56:25, 9.07it/s, loss=8.19, v_num=2vt0]
Epoch 0: 5%| | 20202/432833 [37:18<12:41:59, 9.03it/s, loss=7.32, v_num=2vt0]
Epoch 0: 7%| | 30303/432833 [56:06<12:25:19, 9.00it/s, loss=6.84, v_num=2vt0]
Epoch 0: 9%| | 40404/432833 [1:14:57<12:08:04, 8.98it/s, loss=6.49, v_num=2vt
Epoch 0: 12%| | 50505/432833 [1:33:49<11:50:16, 8.97it/s, loss=5.71, v_num=2vt
Epoch 0: 14%|▏| 60606/432833 [1:52:42<11:32:10, 8.96it/s, loss=5.31, v_num=2vt
Epoch 0: 16%|▎ | 70707/432833 [2:11:34<11:13:49, 8.96it/s, loss=5, v_num=2vt0]
Epoch 0: 19%|▏| 80808/432833 [2:30:26<10:55:20, 8.95it/s, loss=4.77, v_num=2vt
Epoch 0: 21%|▏| 90909/432833 [2:49:20<10:36:56, 8.95it/s, loss=4.58, v_num=2vt
Epoch 0: 23%|▏| 101010/432833 [3:08:14<10:18:22, 8.94it/s, loss=4.42, v_num=2v
Epoch 0: 26%|▎| 111111/432833 [3:27:08<9:59:47, 8.94it/s, loss=4.29, v_num=2vt
Epoch 0: 28%|▎| 121212/432833 [3:46:09<9:41:24, 8.93it/s, loss=4.18, v_num=2vt
Epoch 0: 30%|▎| 131264/432833 [4:05:08<9:23:11, 8.92it/s, loss=4.09, v_num=2vt
Validating: 52%|███████████████▌ | 52/100 [00:02<00:02, 22.31it/s]
Epoch 0: 30%|▎| 131271/432833 [4:05:08<9:23:09, 8.92it/s, loss=4.09, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 25.13it/s]
Epoch 0: 30%|▎| 131278/432833 [4:05:08<9:23:07, 8.93it/s, loss=4.09, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 27.90it/s]
Epoch 0: 30%|▎| 131285/432833 [4:05:09<9:23:05, 8.93it/s, loss=4.09, v_num=2vt
Validating: 74%|██████████████████████▏ | 74/100 [00:02<00:00, 26.73it/s]
Epoch 0: 30%|▎| 131292/432833 [4:05:09<9:23:03, 8.93it/s, loss=4.09, v_num=2vt
Validating: 80%|████████████████████████ | 80/100 [00:03<00:00, 27.48it/s]
Epoch 0: 30%|▎| 131299/432833 [4:05:09<9:23:01, 8.93it/s, loss=4.09, v_num=2vt
Validating: 89%|██████████████████████████▋ | 89/100 [00:03<00:00, 26.33it/s]
Epoch 0: 30%|▎| 131306/432833 [4:05:09<9:22:59, 8.93it/s, loss=4.09, v_num=2vt
Validating: 96%|████████████████████████████▊ | 96/100 [00:03<00:00, 23.56it/s]
Epoch 0: 30%|▎| 131313/432833 [4:05:10<9:22:57, 8.93it/s, loss=4.09, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 30%|▎| 131313/432833 [4:05:11<9:22:59, 8.93it/s, loss=4.09, v_num=2vt
Epoch 0: 33%|▎| 141313/432833 [4:24:05<9:04:48, 8.92it/s, loss=4.01, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 33%|▎| 141316/432833 [4:24:05<9:04:48, 8.92it/s, loss=4.01, v_num=2vt
Epoch 0: 33%|▎| 141323/432833 [4:24:06<9:04:46, 8.92it/s, loss=4.01, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:12, 7.16it/s]
Epoch 0: 33%|▎| 141330/432833 [4:24:06<9:04:44, 8.92it/s, loss=4.01, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:07, 11.79it/s]
Validating: 20%|██████ | 20/100 [00:00<00:05, 14.38it/s]
Epoch 0: 33%|▎| 141337/432833 [4:24:06<9:04:42, 8.92it/s, loss=4.01, v_num=2vt
Validating: 27%|████████ | 27/100 [00:01<00:03, 19.44it/s]
Epoch 0: 33%|▎| 141344/432833 [4:24:06<9:04:40, 8.92it/s, loss=4.01, v_num=2vt
Epoch 0: 33%|▎| 141351/432833 [4:24:07<9:04:38, 8.92it/s, loss=4.01, v_num=2vt
Validating: 38%|███████████▍ | 38/100 [00:01<00:02, 25.40it/s]
Epoch 0: 33%|▎| 141358/432833 [4:24:07<9:04:36, 8.92it/s, loss=4.01, v_num=2vt
Validating: 47%|██████████████ | 47/100 [00:01<00:01, 30.00it/s]
Epoch 0: 33%|▎| 141365/432833 [4:24:07<9:04:34, 8.92it/s, loss=4.01, v_num=2vt
Epoch 0: 33%|▎| 141372/432833 [4:24:07<9:04:32, 8.92it/s, loss=4.01, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 31.74it/s]
Epoch 0: 33%|▎| 141379/432833 [4:24:07<9:04:30, 8.92it/s, loss=4.01, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 31.47it/s]
Epoch 0: 33%|▎| 141386/432833 [4:24:08<9:04:28, 8.92it/s, loss=4.01, v_num=2vt
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:00, 27.55it/s]
Epoch 0: 33%|▎| 141393/432833 [4:24:08<9:04:26, 8.92it/s, loss=4.01, v_num=2vt
Validating: 81%|████████████████████████▎ | 81/100 [00:02<00:00, 25.83it/s]
Epoch 0: 33%|▎| 141400/432833 [4:24:08<9:04:24, 8.92it/s, loss=4.01, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 24.88it/s]
Validating: 90%|███████████████████████████ | 90/100 [00:03<00:00, 21.98it/s]
Epoch 0: 33%|▎| 141407/432833 [4:24:09<9:04:23, 8.92it/s, loss=4.01, v_num=2vt
Validating: 97%|█████████████████████████████ | 97/100 [00:03<00:00, 19.70it/s]
Epoch 0: 33%|▎| 141414/432833 [4:24:09<9:04:21, 8.92it/s, loss=4.01, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 33%|▎| 141414/432833 [4:24:10<9:04:23, 8.92it/s, loss=4.01, v_num=2vt
Epoch 0: 35%|▎| 151414/432833 [4:43:02<8:46:04, 8.92it/s, loss=3.94, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 35%|▎| 151417/432833 [4:43:03<8:46:04, 8.92it/s, loss=3.94, v_num=2vt
Epoch 0: 35%|▎| 151424/432833 [4:43:03<8:46:02, 8.92it/s, loss=3.94, v_num=2vt
Validating: 11%|███▎ | 11/100 [00:00<00:12, 7.03it/s]
Epoch 0: 35%|▎| 151431/432833 [4:43:03<8:46:00, 8.92it/s, loss=3.94, v_num=2vt
Validating: 18%|█████▍ | 18/100 [00:00<00:07, 11.42it/s]
Epoch 0: 35%|▎| 151438/432833 [4:43:03<8:45:58, 8.92it/s, loss=3.94, v_num=2vt
Validating: 25%|███████▌ | 25/100 [00:01<00:04, 16.07it/s]
Epoch 0: 35%|▎| 151445/432833 [4:43:04<8:45:56, 8.92it/s, loss=3.94, v_num=2vt
Validating: 33%|█████████▉ | 33/100 [00:01<00:03, 21.98it/s]
Epoch 0: 35%|▎| 151452/432833 [4:43:04<8:45:54, 8.92it/s, loss=3.94, v_num=2vt
Validating: 40%|████████████ | 40/100 [00:01<00:02, 26.23it/s]
Epoch 0: 35%|▎| 151459/432833 [4:43:04<8:45:53, 8.92it/s, loss=3.94, v_num=2vt
Epoch 0: 35%|▎| 151466/432833 [4:43:04<8:45:51, 8.92it/s, loss=3.94, v_num=2vt
Validating: 52%|███████████████▌ | 52/100 [00:01<00:01, 26.44it/s]
Epoch 0: 35%|▎| 151473/432833 [4:43:05<8:45:49, 8.92it/s, loss=3.94, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 28.92it/s]
Epoch 0: 35%|▎| 151480/432833 [4:43:05<8:45:47, 8.92it/s, loss=3.94, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 29.52it/s]
Epoch 0: 35%|▎| 151487/432833 [4:43:05<8:45:46, 8.92it/s, loss=3.94, v_num=2vt
Validating: 74%|██████████████████████▏ | 74/100 [00:02<00:01, 24.92it/s]
Epoch 0: 35%|▎| 151494/432833 [4:43:05<8:45:44, 8.92it/s, loss=3.94, v_num=2vt
Validating: 80%|████████████████████████ | 80/100 [00:03<00:00, 24.59it/s]
Epoch 0: 35%|▎| 151501/432833 [4:43:06<8:45:42, 8.92it/s, loss=3.94, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 23.46it/s]
Validating: 90%|███████████████████████████ | 90/100 [00:03<00:00, 23.95it/s]
Epoch 0: 35%|▎| 151508/432833 [4:43:06<8:45:41, 8.92it/s, loss=3.94, v_num=2vt
Epoch 0: 35%|▎| 151515/432833 [4:43:06<8:45:39, 8.92it/s, loss=3.94, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 35%|▎| 151515/432833 [4:43:07<8:45:40, 8.92it/s, loss=3.94, v_num=2vt
Epoch 0: 37%|▎| 161515/432833 [5:02:00<8:27:18, 8.91it/s, loss=3.88, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 37%|▎| 161518/432833 [5:02:00<8:27:18, 8.91it/s, loss=3.88, v_num=2vt
Validating: 7%|██▏ | 7/100 [00:00<00:17, 5.37it/s]
Epoch 0: 37%|▎| 161525/432833 [5:02:00<8:27:16, 8.91it/s, loss=3.88, v_num=2vt
Validating: 11%|███▎ | 11/100 [00:00<00:10, 8.32it/s]
Epoch 0: 37%|▎| 161532/432833 [5:02:00<8:27:15, 8.91it/s, loss=3.88, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:06, 12.31it/s]
Validating: 20%|██████ | 20/100 [00:01<00:05, 14.41it/s]
Epoch 0: 37%|▎| 161539/432833 [5:02:01<8:27:13, 8.91it/s, loss=3.88, v_num=2vt
Validating: 26%|███████▊ | 26/100 [00:01<00:04, 17.83it/s]
Epoch 0: 37%|▎| 161546/432833 [5:02:01<8:27:12, 8.91it/s, loss=3.88, v_num=2vt
Validating: 32%|█████████▌ | 32/100 [00:01<00:03, 18.80it/s]
Epoch 0: 37%|▎| 161553/432833 [5:02:01<8:27:10, 8.91it/s, loss=3.88, v_num=2vt
Validating: 40%|████████████ | 40/100 [00:01<00:02, 24.33it/s]
Epoch 0: 37%|▎| 161560/432833 [5:02:02<8:27:08, 8.92it/s, loss=3.88, v_num=2vt
Epoch 0: 37%|▎| 161567/432833 [5:02:02<8:27:06, 8.92it/s, loss=3.88, v_num=2vt
Validating: 52%|███████████████▌ | 52/100 [00:02<00:01, 29.65it/s]
Epoch 0: 37%|▎| 161574/432833 [5:02:02<8:27:05, 8.92it/s, loss=3.88, v_num=2vt
Validating: 60%|██████████████████ | 60/100 [00:02<00:01, 28.96it/s]
Epoch 0: 37%|▎| 161581/432833 [5:02:02<8:27:03, 8.92it/s, loss=3.88, v_num=2vt
Validating: 68%|████████████████████▍ | 68/100 [00:02<00:00, 32.44it/s]
Epoch 0: 37%|▎| 161588/432833 [5:02:02<8:27:01, 8.92it/s, loss=3.88, v_num=2vt
Epoch 0: 37%|▎| 161595/432833 [5:02:03<8:26:59, 8.92it/s, loss=3.88, v_num=2vt
Validating: 80%|████████████████████████ | 80/100 [00:03<00:00, 26.51it/s]
Epoch 0: 37%|▎| 161602/432833 [5:02:03<8:26:58, 8.92it/s, loss=3.88, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 27.91it/s]
Epoch 0: 37%|▎| 161609/432833 [5:02:03<8:26:56, 8.92it/s, loss=3.88, v_num=2vt
Validating: 94%|████████████████████████████▏ | 94/100 [00:03<00:00, 30.65it/s]
Epoch 0: 37%|▎| 161616/432833 [5:02:04<8:26:54, 8.92it/s, loss=3.88, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 37%|▎| 161616/432833 [5:02:04<8:26:56, 8.92it/s, loss=3.88, v_num=2vt
Epoch 0: 40%|▍| 171616/432833 [5:20:56<8:08:30, 8.91it/s, loss=3.82, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 40%|▍| 171619/432833 [5:20:56<8:08:29, 8.91it/s, loss=3.82, v_num=2vt
Epoch 0: 40%|▍| 171626/432833 [5:20:56<8:08:28, 8.91it/s, loss=3.82, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:12, 7.05it/s]
Validating: 13%|███▉ | 13/100 [00:00<00:09, 8.85it/s]
Epoch 0: 40%|▍| 171633/432833 [5:20:57<8:08:26, 8.91it/s, loss=3.82, v_num=2vt
Epoch 0: 40%|▍| 171640/432833 [5:20:57<8:08:24, 8.91it/s, loss=3.82, v_num=2vt
Validating: 25%|███████▌ | 25/100 [00:01<00:04, 16.67it/s]
Epoch 0: 40%|▍| 171647/432833 [5:20:57<8:08:23, 8.91it/s, loss=3.82, v_num=2vt
Validating: 31%|█████████▎ | 31/100 [00:01<00:03, 19.13it/s]
Validating: 34%|██████████▏ | 34/100 [00:01<00:03, 20.13it/s]
Epoch 0: 40%|▍| 171654/432833 [5:20:58<8:08:21, 8.91it/s, loss=3.82, v_num=2vt
Validating: 40%|████████████ | 40/100 [00:01<00:02, 23.20it/s]
Epoch 0: 40%|▍| 171661/432833 [5:20:58<8:08:20, 8.91it/s, loss=3.82, v_num=2vt
Validating: 47%|██████████████ | 47/100 [00:01<00:02, 23.57it/s]
Epoch 0: 40%|▍| 171668/432833 [5:20:58<8:08:18, 8.91it/s, loss=3.82, v_num=2vt
Epoch 0: 40%|▍| 171675/432833 [5:20:58<8:08:17, 8.91it/s, loss=3.82, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 29.72it/s]
Epoch 0: 40%|▍| 171682/432833 [5:20:58<8:08:15, 8.91it/s, loss=3.82, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 29.09it/s]
Epoch 0: 40%|▍| 171689/432833 [5:20:59<8:08:13, 8.91it/s, loss=3.82, v_num=2vt
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:00, 30.04it/s]
Epoch 0: 40%|▍| 171696/432833 [5:20:59<8:08:12, 8.91it/s, loss=3.82, v_num=2vt
Epoch 0: 40%|▍| 171703/432833 [5:20:59<8:08:10, 8.92it/s, loss=3.82, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 28.83it/s]
Epoch 0: 40%|▍| 171710/432833 [5:20:59<8:08:08, 8.92it/s, loss=3.82, v_num=2vt
Validating: 96%|████████████████████████████▊ | 96/100 [00:03<00:00, 33.10it/s]
Epoch 0: 40%|▍| 171717/432833 [5:21:00<8:08:07, 8.92it/s, loss=3.82, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 40%|▍| 171717/432833 [5:21:01<8:08:08, 8.92it/s, loss=3.82, v_num=2vt
Epoch 0: 42%|▍| 181717/432833 [5:39:50<7:49:37, 8.91it/s, loss=3.77, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 42%|▍| 181720/432833 [5:39:50<7:49:37, 8.91it/s, loss=3.77, v_num=2vt
Validating: 6%|█▊ | 6/100 [00:00<00:17, 5.36it/s]
Epoch 0: 42%|▍| 181727/432833 [5:39:51<7:49:35, 8.91it/s, loss=3.77, v_num=2vt
Epoch 0: 42%|▍| 181734/432833 [5:39:51<7:49:34, 8.91it/s, loss=3.77, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:06, 11.94it/s]
Epoch 0: 42%|▍| 181741/432833 [5:39:51<7:49:32, 8.91it/s, loss=3.77, v_num=2vt
Validating: 24%|███████▏ | 24/100 [00:01<00:04, 15.20it/s]
Epoch 0: 42%|▍| 181748/432833 [5:39:51<7:49:31, 8.91it/s, loss=3.77, v_num=2vt
Validating: 32%|█████████▌ | 32/100 [00:01<00:03, 20.16it/s]
Epoch 0: 42%|▍| 181755/432833 [5:39:51<7:49:29, 8.91it/s, loss=3.77, v_num=2vt
Validating: 39%|███████████▋ | 39/100 [00:01<00:02, 23.11it/s]
Epoch 0: 42%|▍| 181762/432833 [5:39:52<7:49:28, 8.91it/s, loss=3.77, v_num=2vt
Validating: 46%|█████████████▊ | 46/100 [00:01<00:02, 24.54it/s]
Epoch 0: 42%|▍| 181769/432833 [5:39:52<7:49:26, 8.91it/s, loss=3.77, v_num=2vt
Validating: 54%|████████████████▏ | 54/100 [00:02<00:01, 26.73it/s]
Epoch 0: 42%|▍| 181776/432833 [5:39:52<7:49:25, 8.91it/s, loss=3.77, v_num=2vt
Validating: 61%|██████████████████▎ | 61/100 [00:02<00:01, 23.69it/s]
Epoch 0: 42%|▍| 181783/432833 [5:39:53<7:49:23, 8.91it/s, loss=3.77, v_num=2vt
Validating: 69%|████████████████████▋ | 69/100 [00:02<00:01, 25.66it/s]
Epoch 0: 42%|▍| 181790/432833 [5:39:53<7:49:22, 8.91it/s, loss=3.77, v_num=2vt
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:01, 22.08it/s]
Epoch 0: 42%|▍| 181797/432833 [5:39:53<7:49:20, 8.91it/s, loss=3.77, v_num=2vt
Epoch 0: 42%|▍| 181804/432833 [5:39:53<7:49:19, 8.91it/s, loss=3.77, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 29.42it/s]
Epoch 0: 42%|▍| 181811/432833 [5:39:54<7:49:17, 8.91it/s, loss=3.77, v_num=2vt
Validating: 94%|████████████████████████████▏ | 94/100 [00:03<00:00, 26.36it/s]
Epoch 0: 42%|▍| 181818/432833 [5:39:54<7:49:16, 8.92it/s, loss=3.77, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 42%|▍| 181818/432833 [5:39:55<7:49:17, 8.91it/s, loss=3.77, v_num=2vt
Epoch 0: 43%|▍| 186597/432833 [5:48:54<7:40:25, 8.91it/s, loss=3.75, v_num=2vtwandb: Network error (ConnectTimeout), entering retry loop. See wandb/debug-internal.log for full traceback.
Epoch 0: 43%|▍| 187009/432833 [5:49:40<7:39:38, 8.91it/s, loss=3.75, v_num=2vtwandb: Network error resolved after 0:01:19.467101, resuming normal operation.
Epoch 0: 44%|▍| 191818/432833 [5:58:44<7:30:44, 8.91it/s, loss=3.72, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 44%|▍| 191821/432833 [5:58:44<7:30:44, 8.91it/s, loss=3.72, v_num=2vt
Epoch 0: 44%|▍| 191828/432833 [5:58:44<7:30:42, 8.91it/s, loss=3.72, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:12, 7.10it/s]
Epoch 0: 44%|▍| 191835/432833 [5:58:44<7:30:41, 8.91it/s, loss=3.72, v_num=2vt
Validating: 18%|█████▍ | 18/100 [00:00<00:06, 11.93it/s]
Epoch 0: 44%|▍| 191842/432833 [5:58:45<7:30:39, 8.91it/s, loss=3.72, v_num=2vt
Validating: 26%|███████▊ | 26/100 [00:00<00:04, 16.70it/s]
Epoch 0: 44%|▍| 191849/432833 [5:58:45<7:30:38, 8.91it/s, loss=3.72, v_num=2vt
Epoch 0: 44%|▍| 191856/432833 [5:58:45<7:30:36, 8.91it/s, loss=3.72, v_num=2vt
Validating: 39%|███████████▋ | 39/100 [00:01<00:02, 21.69it/s]
Epoch 0: 44%|▍| 191863/432833 [5:58:45<7:30:35, 8.91it/s, loss=3.72, v_num=2vt
Validating: 47%|██████████████ | 47/100 [00:01<00:02, 25.53it/s]
Epoch 0: 44%|▍| 191870/432833 [5:58:45<7:30:33, 8.91it/s, loss=3.72, v_num=2vt
Validating: 54%|████████████████▏ | 54/100 [00:01<00:01, 26.53it/s]
Epoch 0: 44%|▍| 191877/432833 [5:58:46<7:30:32, 8.91it/s, loss=3.72, v_num=2vt
Validating: 61%|██████████████████▎ | 61/100 [00:02<00:01, 26.05it/s]
Epoch 0: 44%|▍| 191884/432833 [5:58:46<7:30:30, 8.91it/s, loss=3.72, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 26.02it/s]
Epoch 0: 44%|▍| 191891/432833 [5:58:46<7:30:29, 8.91it/s, loss=3.72, v_num=2vt
Validating: 73%|█████████████████████▉ | 73/100 [00:02<00:01, 24.74it/s]
Validating: 76%|██████████████████████▊ | 76/100 [00:02<00:00, 25.91it/s]
Epoch 0: 44%|▍| 191898/432833 [5:58:47<7:30:27, 8.91it/s, loss=3.72, v_num=2vt
Validating: 82%|████████████████████████▌ | 82/100 [00:03<00:00, 24.56it/s]
Epoch 0: 44%|▍| 191905/432833 [5:58:47<7:30:26, 8.91it/s, loss=3.72, v_num=2vt
Validating: 88%|██████████████████████████▍ | 88/100 [00:03<00:00, 23.10it/s]
Epoch 0: 44%|▍| 191912/432833 [5:58:47<7:30:25, 8.91it/s, loss=3.72, v_num=2vt
Validating: 95%|████████████████████████████▌ | 95/100 [00:03<00:00, 21.49it/s]
Epoch 0: 44%|▍| 191919/432833 [5:58:47<7:30:23, 8.91it/s, loss=3.72, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 44%|▍| 191919/432833 [5:58:48<7:30:24, 8.91it/s, loss=3.72, v_num=2vt
Epoch 0: 47%|▍| 201919/432833 [6:17:39<7:11:53, 8.91it/s, loss=3.68, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 47%|▍| 201922/432833 [6:17:39<7:11:52, 8.91it/s, loss=3.68, v_num=2vt
Validating: 7%|██▏ | 7/100 [00:00<00:17, 5.38it/s]
Epoch 0: 47%|▍| 201929/432833 [6:17:39<7:11:51, 8.91it/s, loss=3.68, v_num=2vt
Validating: 12%|███▌ | 12/100 [00:00<00:09, 8.91it/s]
Epoch 0: 47%|▍| 201936/432833 [6:17:40<7:11:50, 8.91it/s, loss=3.68, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:06, 13.02it/s]
Epoch 0: 47%|▍| 201943/432833 [6:17:40<7:11:48, 8.91it/s, loss=3.68, v_num=2vt
Validating: 25%|███████▌ | 25/100 [00:01<00:03, 19.04it/s]
Epoch 0: 47%|▍| 201950/432833 [6:17:40<7:11:47, 8.91it/s, loss=3.68, v_num=2vt
Validating: 31%|█████████▎ | 31/100 [00:01<00:03, 21.69it/s]
Validating: 34%|██████████▏ | 34/100 [00:01<00:03, 19.19it/s]
Epoch 0: 47%|▍| 201957/432833 [6:17:41<7:11:45, 8.91it/s, loss=3.68, v_num=2vt
Validating: 40%|████████████ | 40/100 [00:01<00:02, 21.82it/s]
Epoch 0: 47%|▍| 201964/432833 [6:17:41<7:11:44, 8.91it/s, loss=3.68, v_num=2vt
Validating: 46%|█████████████▊ | 46/100 [00:02<00:02, 21.69it/s]
Epoch 0: 47%|▍| 201971/432833 [6:17:41<7:11:43, 8.91it/s, loss=3.68, v_num=2vt
Validating: 53%|███████████████▉ | 53/100 [00:02<00:01, 25.08it/s]
Epoch 0: 47%|▍| 201978/432833 [6:17:41<7:11:41, 8.91it/s, loss=3.68, v_num=2vt
Validating: 60%|██████████████████ | 60/100 [00:02<00:01, 24.00it/s]
Epoch 0: 47%|▍| 201985/432833 [6:17:42<7:11:40, 8.91it/s, loss=3.68, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 27.16it/s]
Epoch 0: 47%|▍| 201992/432833 [6:17:42<7:11:39, 8.91it/s, loss=3.68, v_num=2vt
Validating: 74%|██████████████████████▏ | 74/100 [00:03<00:00, 26.97it/s]
Epoch 0: 47%|▍| 201999/432833 [6:17:42<7:11:37, 8.91it/s, loss=3.68, v_num=2vt
Epoch 0: 47%|▍| 202006/432833 [6:17:42<7:11:36, 8.91it/s, loss=3.68, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 28.50it/s]
Epoch 0: 47%|▍| 202013/432833 [6:17:43<7:11:34, 8.91it/s, loss=3.68, v_num=2vt
Validating: 95%|████████████████████████████▌ | 95/100 [00:03<00:00, 30.13it/s]
Epoch 0: 47%|▍| 202020/432833 [6:17:43<7:11:33, 8.91it/s, loss=3.68, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 47%|▍| 202020/432833 [6:17:44<7:11:34, 8.91it/s, loss=3.68, v_num=2vt
Epoch 0: 49%|▍| 212020/432833 [6:36:32<6:52:58, 8.91it/s, loss=3.64, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 49%|▍| 212023/432833 [6:36:32<6:52:58, 8.91it/s, loss=3.64, v_num=2vt
Epoch 0: 49%|▍| 212030/432833 [6:36:32<6:52:57, 8.91it/s, loss=3.64, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:13, 6.86it/s]
Epoch 0: 49%|▍| 212037/432833 [6:36:32<6:52:55, 8.91it/s, loss=3.64, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:07, 11.37it/s]
Validating: 20%|██████ | 20/100 [00:00<00:06, 12.40it/s]
Epoch 0: 49%|▍| 212044/432833 [6:36:33<6:52:54, 8.91it/s, loss=3.64, v_num=2vt
Epoch 0: 49%|▍| 212051/432833 [6:36:33<6:52:53, 8.91it/s, loss=3.64, v_num=2vt
Validating: 31%|█████████▎ | 31/100 [00:01<00:03, 20.64it/s]
Validating: 34%|██████████▏ | 34/100 [00:01<00:02, 22.04it/s]
Epoch 0: 49%|▍| 212058/432833 [6:36:33<6:52:51, 8.91it/s, loss=3.64, v_num=2vt
Epoch 0: 49%|▍| 212065/432833 [6:36:33<6:52:50, 8.91it/s, loss=3.64, v_num=2vt
Validating: 45%|█████████████▌ | 45/100 [00:01<00:02, 27.47it/s]
Epoch 0: 49%|▍| 212072/432833 [6:36:34<6:52:49, 8.91it/s, loss=3.64, v_num=2vt
Validating: 52%|███████████████▌ | 52/100 [00:02<00:01, 25.21it/s]
Validating: 55%|████████████████▌ | 55/100 [00:02<00:01, 25.72it/s]
Epoch 0: 49%|▍| 212079/432833 [6:36:34<6:52:47, 8.91it/s, loss=3.64, v_num=2vt
Validating: 61%|██████████████████▎ | 61/100 [00:02<00:01, 24.67it/s]
Epoch 0: 49%|▍| 212086/432833 [6:36:34<6:52:46, 8.91it/s, loss=3.64, v_num=2vt
Validating: 67%|████████████████████ | 67/100 [00:02<00:01, 23.73it/s]
Epoch 0: 49%|▍| 212093/432833 [6:36:35<6:52:45, 8.91it/s, loss=3.64, v_num=2vt
Validating: 73%|█████████████████████▉ | 73/100 [00:02<00:01, 25.38it/s]
Epoch 0: 49%|▍| 212100/432833 [6:36:35<6:52:43, 8.91it/s, loss=3.64, v_num=2vt
Validating: 80%|████████████████████████ | 80/100 [00:03<00:00, 25.51it/s]
Epoch 0: 49%|▍| 212107/432833 [6:36:35<6:52:42, 8.91it/s, loss=3.64, v_num=2vt
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 28.40it/s]
Epoch 0: 49%|▍| 212114/432833 [6:36:35<6:52:41, 8.91it/s, loss=3.64, v_num=2vt
Epoch 0: 49%|▍| 212121/432833 [6:36:36<6:52:39, 8.91it/s, loss=3.64, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 49%|▍| 212121/432833 [6:36:36<6:52:40, 8.91it/s, loss=3.64, v_num=2vt
Epoch 0: 51%|▌| 222121/432833 [6:55:28<6:34:07, 8.91it/s, loss=3.61, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 51%|▌| 222124/432833 [6:55:28<6:34:07, 8.91it/s, loss=3.61, v_num=2vt
Epoch 0: 51%|▌| 222131/432833 [6:55:28<6:34:06, 8.91it/s, loss=3.61, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:12, 7.01it/s]
Epoch 0: 51%|▌| 222138/432833 [6:55:29<6:34:04, 8.91it/s, loss=3.61, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:07, 11.42it/s]
Epoch 0: 51%|▌| 222145/432833 [6:55:29<6:34:03, 8.91it/s, loss=3.61, v_num=2vt
Validating: 24%|███████▏ | 24/100 [00:01<00:04, 15.29it/s]
Epoch 0: 51%|▌| 222152/432833 [6:55:29<6:34:02, 8.91it/s, loss=3.61, v_num=2vt
Validating: 31%|█████████▎ | 31/100 [00:01<00:03, 20.35it/s]
Epoch 0: 51%|▌| 222159/432833 [6:55:29<6:34:00, 8.91it/s, loss=3.61, v_num=2vt
Validating: 39%|███████████▋ | 39/100 [00:01<00:02, 26.53it/s]
Epoch 0: 51%|▌| 222166/432833 [6:55:30<6:33:59, 8.91it/s, loss=3.61, v_num=2vt
Validating: 46%|█████████████▊ | 46/100 [00:01<00:02, 25.46it/s]
Epoch 0: 51%|▌| 222173/432833 [6:55:30<6:33:58, 8.91it/s, loss=3.61, v_num=2vt
Validating: 53%|███████████████▉ | 53/100 [00:02<00:01, 25.30it/s]
Epoch 0: 51%|▌| 222180/432833 [6:55:30<6:33:57, 8.91it/s, loss=3.61, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 25.67it/s]
Epoch 0: 51%|▌| 222187/432833 [6:55:30<6:33:55, 8.91it/s, loss=3.61, v_num=2vt
Validating: 66%|███████████████████▊ | 66/100 [00:02<00:01, 22.07it/s]
Epoch 0: 51%|▌| 222194/432833 [6:55:31<6:33:54, 8.91it/s, loss=3.61, v_num=2vt
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:00, 27.65it/s]
Epoch 0: 51%|▌| 222201/432833 [6:55:31<6:33:53, 8.91it/s, loss=3.61, v_num=2vt
Validating: 82%|████████████████████████▌ | 82/100 [00:03<00:00, 26.66it/s]
Epoch 0: 51%|▌| 222208/432833 [6:55:31<6:33:52, 8.91it/s, loss=3.61, v_num=2vt
Validating: 89%|██████████████████████████▋ | 89/100 [00:03<00:00, 27.87it/s]
Epoch 0: 51%|▌| 222215/432833 [6:55:31<6:33:50, 8.91it/s, loss=3.61, v_num=2vt
Validating: 97%|█████████████████████████████ | 97/100 [00:03<00:00, 27.82it/s]
Epoch 0: 51%|▌| 222222/432833 [6:55:32<6:33:49, 8.91it/s, loss=3.61, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 51%|▌| 222222/432833 [6:55:33<6:33:50, 8.91it/s, loss=3.61, v_num=2vt
Epoch 0: 54%|▌| 232222/432833 [7:14:25<6:15:17, 8.91it/s, loss=3.57, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 54%|▌| 232225/432833 [7:14:26<6:15:17, 8.91it/s, loss=3.57, v_num=2vt
Validating: 6%|█▊ | 6/100 [00:00<00:18, 5.18it/s]
Epoch 0: 54%|▌| 232232/432833 [7:14:26<6:15:16, 8.91it/s, loss=3.57, v_num=2vt
Validating: 13%|███▉ | 13/100 [00:00<00:09, 9.04it/s]
Epoch 0: 54%|▌| 232239/432833 [7:14:26<6:15:14, 8.91it/s, loss=3.57, v_num=2vt
Epoch 0: 54%|▌| 232246/432833 [7:14:26<6:15:13, 8.91it/s, loss=3.57, v_num=2vt
Validating: 24%|███████▏ | 24/100 [00:00<00:04, 16.50it/s]
Epoch 0: 54%|▌| 232253/432833 [7:14:27<6:15:12, 8.91it/s, loss=3.57, v_num=2vt
Validating: 31%|█████████▎ | 31/100 [00:01<00:03, 20.61it/s]
Validating: 34%|██████████▏ | 34/100 [00:01<00:02, 22.34it/s]
Epoch 0: 54%|▌| 232260/432833 [7:14:27<6:15:11, 8.91it/s, loss=3.57, v_num=2vt
Validating: 40%|████████████ | 40/100 [00:01<00:02, 24.46it/s]
Epoch 0: 54%|▌| 232267/432833 [7:14:27<6:15:09, 8.91it/s, loss=3.57, v_num=2vt
Validating: 46%|█████████████▊ | 46/100 [00:01<00:02, 24.45it/s]
Epoch 0: 54%|▌| 232274/432833 [7:14:27<6:15:08, 8.91it/s, loss=3.57, v_num=2vt
Validating: 54%|████████████████▏ | 54/100 [00:02<00:01, 28.51it/s]
Epoch 0: 54%|▌| 232281/432833 [7:14:28<6:15:07, 8.91it/s, loss=3.57, v_num=2vt
Validating: 61%|██████████████████▎ | 61/100 [00:02<00:01, 23.55it/s]
Epoch 0: 54%|▌| 232288/432833 [7:14:28<6:15:06, 8.91it/s, loss=3.57, v_num=2vt
Validating: 68%|████████████████████▍ | 68/100 [00:02<00:01, 25.66it/s]
Epoch 0: 54%|▌| 232295/432833 [7:14:28<6:15:04, 8.91it/s, loss=3.57, v_num=2vt
Validating: 74%|██████████████████████▏ | 74/100 [00:02<00:01, 23.15it/s]
Epoch 0: 54%|▌| 232302/432833 [7:14:29<6:15:03, 8.91it/s, loss=3.57, v_num=2vt
Validating: 81%|████████████████████████▎ | 81/100 [00:03<00:00, 22.83it/s]
Epoch 0: 54%|▌| 232309/432833 [7:14:29<6:15:02, 8.91it/s, loss=3.57, v_num=2vt
Validating: 89%|██████████████████████████▋ | 89/100 [00:03<00:00, 27.14it/s]
Epoch 0: 54%|▌| 232316/432833 [7:14:29<6:15:01, 8.91it/s, loss=3.57, v_num=2vt
Validating: 96%|████████████████████████████▊ | 96/100 [00:03<00:00, 28.06it/s]
Epoch 0: 54%|▌| 232323/432833 [7:14:29<6:14:59, 8.91it/s, loss=3.57, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 54%|▌| 232323/432833 [7:14:30<6:15:00, 8.91it/s, loss=3.57, v_num=2vt
Epoch 0: 56%|▌| 242323/432833 [7:33:22<5:56:26, 8.91it/s, loss=3.54, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 56%|▌| 242326/432833 [7:33:22<5:56:25, 8.91it/s, loss=3.54, v_num=2vt
Epoch 0: 56%|▌| 242333/432833 [7:33:22<5:56:24, 8.91it/s, loss=3.54, v_num=2vt
Validating: 11%|███▎ | 11/100 [00:00<00:12, 7.12it/s]
Epoch 0: 56%|▌| 242340/432833 [7:33:23<5:56:23, 8.91it/s, loss=3.54, v_num=2vt
Validating: 17%|█████ | 17/100 [00:00<00:07, 11.60it/s]
Epoch 0: 56%|▌| 242347/432833 [7:33:23<5:56:21, 8.91it/s, loss=3.54, v_num=2vt
Validating: 25%|███████▌ | 25/100 [00:00<00:04, 17.71it/s]
Epoch 0: 56%|▌| 242354/432833 [7:33:23<5:56:20, 8.91it/s, loss=3.54, v_num=2vt
Validating: 33%|█████████▉ | 33/100 [00:01<00:02, 22.87it/s]
Epoch 0: 56%|▌| 242361/432833 [7:33:23<5:56:19, 8.91it/s, loss=3.54, v_num=2vt
Epoch 0: 56%|▌| 242368/432833 [7:33:23<5:56:18, 8.91it/s, loss=3.54, v_num=2vt
Validating: 45%|█████████████▌ | 45/100 [00:01<00:01, 28.56it/s]
Epoch 0: 56%|▌| 242375/432833 [7:33:24<5:56:17, 8.91it/s, loss=3.54, v_num=2vt
Validating: 52%|███████████████▌ | 52/100 [00:01<00:01, 24.19it/s]
Epoch 0: 56%|▌| 242382/432833 [7:33:24<5:56:15, 8.91it/s, loss=3.54, v_num=2vt
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 22.29it/s]
Validating: 62%|██████████████████▌ | 62/100 [00:02<00:01, 23.67it/s]
Epoch 0: 56%|▌| 242389/432833 [7:33:24<5:56:14, 8.91it/s, loss=3.54, v_num=2vt
Validating: 69%|████████████████████▋ | 69/100 [00:02<00:01, 24.70it/s]
Epoch 0: 56%|▌| 242396/432833 [7:33:25<5:56:13, 8.91it/s, loss=3.54, v_num=2vt
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:01, 18.61it/s]
Epoch 0: 56%|▌| 242403/432833 [7:33:25<5:56:12, 8.91it/s, loss=3.54, v_num=2vt
Validating: 82%|████████████████████████▌ | 82/100 [00:03<00:00, 21.29it/s]
Epoch 0: 56%|▌| 242410/432833 [7:33:25<5:56:11, 8.91it/s, loss=3.54, v_num=2vt
Validating: 88%|██████████████████████████▍ | 88/100 [00:03<00:00, 22.59it/s]
Epoch 0: 56%|▌| 242417/432833 [7:33:26<5:56:10, 8.91it/s, loss=3.54, v_num=2vt
Validating: 95%|████████████████████████████▌ | 95/100 [00:03<00:00, 26.16it/s]
Epoch 0: 56%|▌| 242424/432833 [7:33:26<5:56:08, 8.91it/s, loss=3.54, v_num=2vtSetting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 56%|▌| 242424/432833 [7:33:27<5:56:09, 8.91it/s, loss=3.54, v_num=2vt
Epoch 0: 58%|▌| 252424/432833 [7:52:19<5:37:34, 8.91it/s, loss=3.51, v_num=2vt
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 58%|▌| 252427/432833 [7:52:20<5:37:34, 8.91it/s, loss=3.51, v_num=2vt
Epoch 0: 58%|▌| 252434/432833 [7:52:20<5:37:33, 8.91it/s, loss=3.51, v_num=2vt
Validating: 10%|███ | 10/100 [00:00<00:13, 6.88it/s]
Epoch 0: 58%|▌| 252441/432833 [7:52:20<5:37:31, 8.91it/s, loss=3.51, v_num=2vt
Validating: 18%|█████▍ | 18/100 [00:00<00:07, 11.51it/s]
Epoch 0: 58%|▌| 252448/432833 [7:52:20<5:37:30, 8.91it/s, loss=3.51, v_num=2vt
Epoch 0: 58%|▌| 252525/432833 [7:52:24<5:37:18, 8.91it/s, loss=3.51, v_num=2vt
Epoch 0: 61%|▌| 262626/432833 [8:11:20<5:18:26, 8.91it/s, loss=3.48, v_num=2vt
Epoch 0: 63%|▋| 272727/432833 [8:30:19<4:59:35, 8.91it/s, loss=3.46, v_num=2vt
Epoch 0: 65%|▋| 282828/432833 [8:49:15<4:40:42, 8.91it/s, loss=3.43, v_num=2vt
Epoch 0: 68%|▋| 292929/432833 [9:08:11<4:21:49, 8.91it/s, loss=3.41, v_num=2vt
Epoch 0: 70%|▋| 303030/432833 [9:27:08<4:02:56, 8.91it/s, loss=3.39, v_num=2vt
Epoch 0: 72%|▋| 313131/432833 [9:46:06<3:44:03, 8.90it/s, loss=3.37, v_num=2vt
Epoch 0: 75%|▋| 323232/432833 [10:05:01<3:25:09, 8.90it/s, loss=3.35, v_num=2v
Epoch 0: 77%|▊| 333333/432833 [10:23:57<3:06:15, 8.90it/s, loss=3.33, v_num=2v
Epoch 0: 79%|▊| 343434/432833 [10:42:52<2:47:20, 8.90it/s, loss=3.31, v_num=2v
Epoch 0: 82%|▊| 353535/432833 [11:01:47<2:28:26, 8.90it/s, loss=3.29, v_num=2v
Epoch 0: 84%|▊| 363636/432833 [11:20:44<2:09:32, 8.90it/s, loss=3.28, v_num=2v
Epoch 0: 86%|▊| 373737/432833 [11:39:37<1:50:37, 8.90it/s, loss=3.26, v_num=2v
Epoch 0: 89%|▉| 383838/432833 [11:58:31<1:31:42, 8.90it/s, loss=3.25, v_num=2v
Epoch 0: 91%|▉| 393939/432833 [12:17:26<1:12:48, 8.90it/s, loss=3.23, v_num=2v
Epoch 0: 93%|▉| 403991/432833 [12:36:16<53:59, 8.90it/s, loss=3.22, v_num=2vt0
(interleaved "Validating: …/100" progress bars, redrawn every 10,000 training steps, elided; each validation pass also logged: Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.)
Validating: 54%|████████████████▏ | 54/100 [00:02<00:02, 22.41it/s]
Epoch 0: 93%|▉| 403998/432833 [12:36:16<53:58, 8.90it/s, loss=3.22, v_num=2vt0
Validating: 61%|██████████████████▎ | 61/100 [00:02<00:01, 24.59it/s]
Epoch 0: 93%|▉| 404005/432833 [12:36:17<53:57, 8.90it/s, loss=3.22, v_num=2vt0
Epoch 0: 93%|▉| 404012/432833 [12:36:17<53:57, 8.90it/s, loss=3.22, v_num=2vt0
Validating: 73%|█████████████████████▉ | 73/100 [00:02<00:00, 30.82it/s]
Epoch 0: 93%|▉| 404019/432833 [12:36:17<53:56, 8.90it/s, loss=3.22, v_num=2vt0
Validating: 81%|████████████████████████▎ | 81/100 [00:03<00:00, 26.76it/s]
Epoch 0: 93%|▉| 404026/432833 [12:36:17<53:55, 8.90it/s, loss=3.22, v_num=2vt0
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 27.11it/s]
Epoch 0: 93%|▉| 404033/432833 [12:36:18<53:54, 8.90it/s, loss=3.22, v_num=2vt0
Validating: 95%|████████████████████████████▌ | 95/100 [00:03<00:00, 29.14it/s]
Epoch 0: 93%|▉| 404040/432833 [12:36:18<53:53, 8.90it/s, loss=3.22, v_num=2vt0Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 93%|▉| 404040/432833 [12:36:19<53:53, 8.90it/s, loss=3.22, v_num=2vt0
Epoch 0: 96%|▉| 414040/432833 [12:55:07<35:10, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 96%|▉| 414043/432833 [12:55:08<35:10, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 5%|█▌ | 5/100 [00:00<00:18, 5.22it/s]
Epoch 0: 96%|▉| 414050/432833 [12:55:08<35:09, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 12%|███▌ | 12/100 [00:00<00:09, 8.98it/s]
Epoch 0: 96%|▉| 414057/432833 [12:55:08<35:09, 8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0: 96%|▉| 414064/432833 [12:55:08<35:08, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 24%|███████▏ | 24/100 [00:00<00:04, 17.28it/s]
Epoch 0: 96%|▉| 414071/432833 [12:55:09<35:07, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 32%|█████████▌ | 32/100 [00:01<00:03, 21.46it/s]
Epoch 0: 96%|▉| 414078/432833 [12:55:09<35:06, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 40%|████████████ | 40/100 [00:01<00:02, 23.19it/s]
Epoch 0: 96%|▉| 414085/432833 [12:55:09<35:05, 8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0: 96%|▉| 414092/432833 [12:55:09<35:04, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 52%|███████████████▌ | 52/100 [00:01<00:01, 26.58it/s]
Epoch 0: 96%|▉| 414099/432833 [12:55:10<35:04, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 59%|█████████████████▋ | 59/100 [00:02<00:01, 27.73it/s]
Validating: 62%|██████████████████▌ | 62/100 [00:02<00:01, 24.66it/s]
Epoch 0: 96%|▉| 414106/432833 [12:55:10<35:03, 8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0: 96%|▉| 414113/432833 [12:55:10<35:02, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 74%|██████████████████████▏ | 74/100 [00:02<00:01, 25.74it/s]
Epoch 0: 96%|▉| 414120/432833 [12:55:11<35:01, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 80%|████████████████████████ | 80/100 [00:03<00:00, 26.44it/s]
Validating: 83%|████████████████████████▉ | 83/100 [00:03<00:00, 25.04it/s]
Epoch 0: 96%|▉| 414127/432833 [12:55:11<35:00, 8.90it/s, loss=3.2, v_num=2vt0]
Validating: 89%|██████████████████████████▋ | 89/100 [00:03<00:00, 21.49it/s]
Epoch 0: 96%|▉| 414134/432833 [12:55:11<35:00, 8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0: 96%|▉| 414141/432833 [12:55:11<34:59, 8.90it/s, loss=3.2, v_num=2vt0]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 96%|▉| 414141/432833 [12:55:12<34:59, 8.90it/s, loss=3.2, v_num=2vt0]
Epoch 0: 98%|▉| 424141/432833 [13:13:58<16:16, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/100 [00:00<?, ?it/s]
Epoch 0: 98%|▉| 424144/432833 [13:13:58<16:15, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 6%|█▊ | 6/100 [00:00<00:17, 5.27it/s]
Epoch 0: 98%|▉| 424151/432833 [13:13:58<16:15, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 12%|███▌ | 12/100 [00:00<00:09, 8.84it/s]
Epoch 0: 98%|▉| 424158/432833 [13:13:59<16:14, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 19%|█████▋ | 19/100 [00:00<00:05, 13.71it/s]
Epoch 0: 98%|▉| 424165/432833 [13:13:59<16:13, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 25%|███████▌ | 25/100 [00:01<00:04, 16.52it/s]
Epoch 0: 98%|▉| 424172/432833 [13:13:59<16:12, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 33%|█████████▉ | 33/100 [00:01<00:03, 21.21it/s]
Epoch 0: 98%|▉| 424179/432833 [13:13:59<16:11, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 39%|███████████▋ | 39/100 [00:01<00:02, 23.00it/s]
Epoch 0: 98%|▉| 424186/432833 [13:14:00<16:11, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 46%|█████████████▊ | 46/100 [00:01<00:02, 26.17it/s]
Epoch 0: 98%|▉| 424193/432833 [13:14:00<16:10, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 54%|████████████████▏ | 54/100 [00:02<00:01, 27.74it/s]
Epoch 0: 98%|▉| 424200/432833 [13:14:00<16:09, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 60%|██████████████████ | 60/100 [00:02<00:01, 24.91it/s]
Epoch 0: 98%|▉| 424207/432833 [13:14:00<16:08, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 66%|███████████████████▊ | 66/100 [00:02<00:01, 23.49it/s]
Epoch 0: 98%|▉| 424214/432833 [13:14:01<16:07, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 75%|██████████████████████▌ | 75/100 [00:02<00:00, 27.86it/s]
Epoch 0: 98%|▉| 424221/432833 [13:14:01<16:07, 8.90it/s, loss=3.19, v_num=2vt0
Epoch 0: 98%|▉| 424228/432833 [13:14:01<16:06, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 87%|██████████████████████████ | 87/100 [00:03<00:00, 30.92it/s]
Epoch 0: 98%|▉| 424235/432833 [13:14:01<16:05, 8.90it/s, loss=3.19, v_num=2vt0
Validating: 94%|████████████████████████████▏ | 94/100 [00:03<00:00, 25.95it/s]
Epoch 0: 98%|▉| 424242/432833 [13:14:02<16:04, 8.90it/s, loss=3.19, v_num=2vt0Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Epoch 0: 98%|▉| 424242/432833 [13:14:03<16:04, 8.90it/s, loss=3.19, v_num=2vt0
Epoch 0: 100%|█| 432833/432833 [13:30:14<00:00, 8.90it/s, loss=3.18, v_num=2vt0Saving latest checkpoint...
Epoch 0: 100%|█| 432833/432833 [13:30:14<00:00, 8.90it/s, loss=3.18, v_num=2vt0
wandb: Waiting for W&B process to finish, PID 100838
wandb: Program ended successfully.
wandb:
wandb: Find user logs for this run at: /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0/logs/debug.log
wandb: Find internal logs for this run at: /data/wikipedia/processed/spanish-sentences/wandb/run-20210413_133917-16p22vt0/logs/debug-internal.log
wandb: Run summary:
wandb: avg_val_loss 3.04847
wandb: epoch 0
wandb: trainer/global_step 209
wandb: _runtime 47676
wandb: _timestamp 1618365233
wandb: _step 45
wandb: train_loss 2.96082
wandb: Run history:
wandb: avg_val_loss █▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb: _runtime ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb: _timestamp ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb: _step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
wandb: train_loss █▄▃▁
wandb:
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced accumulate_grad_batches_2000_amp_backend_amp_amp_level_O1_auto_lr_find_True_auto_scale_batch_size_False_auto_select_gpus_False_batch_size_32_benchmark_False_check_val_every_n_epoch_1_checkpoint_callback_True_data_index_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/index-040k.npy_data_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/final/data-040k.npy_deterministic_False_fast_dev_run_False_flush_logs_every_n_steps_100_gpus_1_gradient_clip_val_0_lang_es_limit_predict_batches_1.0_limit_test_batches_1.0_limit_train_batches_1.0_limit_val_batches_100_log_every_n_steps_50_logger_True_lr_0.001_max_epochs_1_max_seq_length_128_mmap_False_move_metrics_to_cpu_False_multiple_trainloader_mode_max_size_cycle_num_nodes_1_num_processes_1_num_sanity_val_steps_2_num_workers_4_overfit_batches_0.0_precision_32_prepare_data_per_node_True_pretrained_path_gpt2_process_position_0_reload_dataloaders_every_epoch_False_replace_sampler_ddp_True_reset_state_False_search_False_seed_7649832_stochastic_weight_avg_False_subset_size_1.0_sync_batchnorm_False_terminate_on_nan_False_tokenizer_path_/data/wikipedia/processed/spanish-sentences/data/es/preparation/vocabularies/es-040k.tokenizer.json_tpu_cores_<function _gpus_arg_default at 0x7ff2b2d63310>_track_grad_norm_-1_unfreeze_False_val_check_interval_10000_verbose_False_version_0_vocab_size_50257_weights_summary_top_wte_only_False: https://wandb.ai/matthewfranglen/mf-blog-recycle-gpt2-es/runs/16p22vt0
It took a bit of work, but I got it logging to wandb. I had to alter get_trainer_kwargs
in main.py and disable the self.logger.experiment.add_text('example', txt) call,
as the wandb logger doesn’t support add_text.
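That add_text call is TensorBoard-specific: Lightning's WandbLogger exposes a wandb Run as its experiment attribute, and a Run logs via log() rather than add_text(). Instead of disabling the call outright, one option would be a small logger-agnostic wrapper (log_example_text is a hypothetical helper, not something in the repo) that dispatches on whichever method the experiment object actually provides:

```python
def log_example_text(experiment, txt: str, tag: str = "example") -> None:
    """Log a text sample against either logger backend (hypothetical helper).

    TensorBoard's SummaryWriter has add_text(tag, text); a wandb Run
    instead has log(), which takes a dict of values. Dispatch on
    whichever method is available rather than assuming one backend.
    """
    if hasattr(experiment, "add_text"):
        experiment.add_text(tag, txt)   # TensorBoard-style backend
    elif hasattr(experiment, "log"):
        experiment.log({tag: txt})      # wandb Run
    else:
        raise TypeError(f"cannot log text with {type(experiment).__name__}")
```

This keeps the example generations visible in whichever dashboard the run is using, rather than losing them entirely.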
Anyway, it’s running now, and training the embedding for one epoch should take about 14 hours. At the moment it’s reporting a validation loss of 6.562 for the very first round of validation, which corresponds to a perplexity of over 700.
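The loss here is the cross-entropy in nats, so perplexity is just its exponential. A quick sanity check on the numbers in the output above (the 6.562 first validation round and the 3.04847 avg_val_loss from the wandb summary):

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is the exponential of the natural-log cross-entropy loss."""
    return math.exp(cross_entropy_loss)

print(perplexity(6.562))    # first validation round: ~708
print(perplexity(3.04847))  # final avg_val_loss: ~21
```

So over the epoch the model went from being as confused as a uniform choice over ~700 tokens to one over ~21, which is a substantial improvement even if the downstream performance still needs checking.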