[2025-04-15 01:40:47,851] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:49,816] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2025-04-15 01:40:49,816] [INFO] [runner.py:605:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed scripts/newzero3.json --seed 42 --model_name_or_path /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl --output_dir /home/stern/GRPO/offline_rl_v2/output --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy no --learning_rate 2e-6 --lr_scheduler_type cosine --save_only_model True --remove_unused_columns False --warmup_ratio 0.03 --num_train_epochs 4 --logging_steps 1 --report_to tensorboard --gradient_checkpointing True --overwrite_output_dir --bf16 True
[2025-04-15 01:40:51,253] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:53,176] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2025-04-15 01:40:53,176] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2025-04-15 01:40:53,176] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2025-04-15 01:40:53,176] [INFO] [launch.py:164:main] dist_world_size=8
[2025-04-15 01:40:53,176] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2025-04-15 01:40:53,177] [INFO] [launch.py:256:main] process 1302307 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,177] [INFO] [launch.py:256:main] process 1302308 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,178] [INFO] [launch.py:256:main] process 1302309 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,178] [INFO] [launch.py:256:main] process 1302310 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,179] [INFO] [launch.py:256:main] process 1302311 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,179] [INFO] [launch.py:256:main] process 1302312 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,179] [INFO] [launch.py:256:main] process 1302313 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:53,180] [INFO] [launch.py:256:main] process 1302314 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-15 01:40:57,224] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:57,773] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:57,823] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:57,901] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:57,956] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:58,062] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:58,065] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-15 01:40:58,068] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:40:59,735] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:40:59,868] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:40:59,876] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:40:59,979] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:41:00,086] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-15 01:41:00,086] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:41:00,155] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:41:00,184] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-15 01:41:00,190] [INFO] [comm.py:658:init_distributed] cdb=None
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1
WARNING:__main__:Process rank: 6, device: cuda:6, n_gpu: 1
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1
[2025-04-15 01:41:01,033] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,035 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-15 01:41:01,053] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[2025-04-15 01:41:01,054] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,056 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:329] 2025-04-15 01:41:01,056 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
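The `--world_info` argument in the launcher command above is just base64-encoded JSON; decoding it reproduces the same mapping the launcher later prints as WORLD INFO DICT:

```python
import base64
import json

# Value copied verbatim from the deepspeed.launcher.launch command line above.
world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"

# Decode base64, then parse the JSON payload: host -> list of local GPU ranks.
decoded = json.loads(base64.b64decode(world_info))
print(decoded)  # -> {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
```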
WARNING:__main__:Process rank: 7, device: cuda:7, n_gpu: 1
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1
INFO:__main__:Training parameters CustomTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=scripts/newzero3.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
kl_coeff=0.0,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/stern/GRPO/offline_rl_v2/output/runs/Apr15_01-41-00_nacamontrealdc1-p2r203n1.enovum.hivecloud.com,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=4.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/home/stern/GRPO/offline_rl_v2/output,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/home/stern/GRPO/offline_rl_v2/output,
save_on_each_node=False,
save_only_model=True,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
)
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,385 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-04-15 01:41:01,386 >> loading file chat_template.jinja
WARNING:__main__:Process rank: 5, device: cuda:5, n_gpu: 1
WARNING:__main__:Process rank: 4, device: cuda:4, n_gpu: 1
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1
[2025-04-15 01:41:01,618] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,620 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|tokenization_utils_base.py:2323] 2025-04-15 01:41:01,660 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-04-15 01:41:01,661 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-04-15 01:41:01,662 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27648,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.50.3",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
[INFO|modeling_utils.py:1151] 2025-04-15 01:41:01,691 >> loading weights file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1225] 2025-04-15 01:41:01,691 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:2170] 2025-04-15 01:41:01,691 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3747] 2025-04-15 01:41:01,691 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-04-15 01:41:01,691] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,693 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1139] 2025-04-15 01:41:01,697 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
[2025-04-15 01:41:01,761] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,764 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-15 01:41:01,769] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,771 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-15 01:41:01,889] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-15 01:41:01,891 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-15 01:41:23,740] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards:   0%|          | 0/17 [00:00<?, ?it/s]
All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4995] 2025-04-15 01:41:35,109 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1092] 2025-04-15 01:41:35,113 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/generation_config.json
[INFO|configuration_utils.py:1139] 2025-04-15 01:41:35,114 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
Using custom data configuration default-e96436450119f8e9
INFO:datasets.builder:Using custom data configuration default-e96436450119f8e9
Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json
INFO:datasets.info:Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
INFO:datasets.builder:Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
INFO:datasets.builder:Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
[INFO|trainer.py:748] 2025-04-15 01:41:35,426 >> Using auto half precision backend
INFO:__main__:*** Train ***
[INFO|deepspeed.py:386] 2025-04-15 01:41:35,634 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2758774757385254 seconds
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.284808874130249 seconds
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.333336114883423 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4223861694335938 seconds
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4353716373443604 seconds
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6901695728302 seconds
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.7156126499176025 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8180086612701416 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
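The cpu_adam builds above are triggered by the DeepSpeed config passed as `scripts/newzero3.json`. That file is not included in the log; the sketch below is a hypothetical reconstruction, keeping only settings the run itself reports (ZeRO stage 3, bf16 enabled, optimizer offloaded to CPU so DeepSpeedCPUAdam is used, reduce and prefetch bucket sizes of 100000000). The `pin_memory` and `"auto"` entries are assumptions, not taken from the log:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 100000000,
    "stage3_prefetch_bucket_size": 100000000
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```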
Config: alpha=0.000002, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-04-15 01:41:40,052] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown
[2025-04-15 01:41:40,052] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[2025-04-15 01:41:40,097] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-15 01:41:40,101] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-15 01:41:40,101] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-15 01:41:40,158] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-04-15 01:41:40,158] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-04-15 01:41:40,158] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-04-15 01:41:40,158] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-04-15 01:41:40,288] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-04-15 01:41:40,288] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 2.9 GB CA 0.0 GB Max_CA 3 GB
[2025-04-15 01:41:40,288] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 142.01 GB, percent = 14.1%
[2025-04-15 01:41:40,291] [INFO] [stage3.py:170:__init__] Reduce bucket size 100000000
[2025-04-15 01:41:40,291] [INFO] [stage3.py:171:__init__] Prefetch bucket size 100000000
[2025-04-15 01:41:40,391] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-15 01:41:40,391] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:41:40,391] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 142.01 GB, percent = 14.1%
Parameter Offload: Total persistent parameters: 1119232 in 321 params
[2025-04-15 01:41:40,561] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-15 01:41:40,562] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:41:40,562] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 142.03 GB, percent = 14.1%
[2025-04-15 01:41:40,672] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-15 01:41:40,672] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:41:40,673] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 142.03 GB, percent = 14.1%
[2025-04-15 01:42:02,581] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 41
[2025-04-15 01:42:02,582] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:42:02,582] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 235.85 GB, percent = 23.4%
[2025-04-15 01:42:02,956] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-15 01:42:02,959] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:42:02,959] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 246.45 GB, percent = 24.5%
[2025-04-15 01:42:06,769] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-15 01:42:06,769] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:42:06,769] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 334.47 GB, percent = 33.2%
[2025-04-15 01:42:06,967] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-15 01:42:06,967] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:42:06,968] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 340.88 GB, percent = 33.8%
[2025-04-15 01:42:20,062] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-15 01:42:20,063] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-15 01:42:20,063] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 517.11 GB, percent = 51.3%
[2025-04-15 01:42:20,063] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow.
Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) /home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) /home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) /home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) [WARNING|logging.py:329] 2025-04-15 01:42:26,927 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-15 01:42:26,928 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-15 01:42:26,929 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. 
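The repeated UserWarning from transformers' data_collator.py above points at a real slowdown: `torch.tensor` over a Python list of numpy arrays converts element by element. A minimal sketch of the fix the warning itself suggests, using toy label rows (not the run's data): stack once with `numpy.array()`, then convert the single contiguous buffer.

```python
# Sketch of the fix suggested by the data_collator.py UserWarning above.
# `labels` is a toy stand-in for the per-example label rows in a batch.
import numpy as np
import torch

labels = [np.full(8, i, dtype=np.int64) for i in range(4)]

# Slow path (what triggers the warning): tensor built from a list of ndarrays.
slow = torch.tensor(labels, dtype=torch.int64)

# Suggested fix: one numpy.array() stack, then a single cheap conversion
# (torch.from_numpy shares the buffer instead of copying row by row).
fast = torch.from_numpy(np.array(labels, dtype=np.int64))

assert torch.equal(slow, fast)
```

The warning is cosmetic for correctness (both paths yield the same tensor), but on large batches the stacked path avoids a Python-level loop per row.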
[2025-04-15 01:42:27,026] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-15 01:42:27,027] [INFO] [utils.py:782:see_memory_usage] MA 0.19 GB Max_MA 3.09 GB CA 3.09 GB Max_CA 3 GB
[2025-04-15 01:42:27,027] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 568.26 GB, percent = 56.4%
[2025-04-15 01:42:27,027] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-04-15 01:42:27,027] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-04-15 01:42:27,027] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-15 01:42:27,027] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-15 01:42:27,029] [INFO] [config.py:1000:print] DeepSpeedEngine configuration:
[2025-04-15 01:42:27,029] [INFO] [config.py:1004:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-04-15 01:42:27,029] [INFO] [config.py:1004:print] aio_config ...................
{'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-04-15 01:42:27,029] [INFO] [config.py:1004:print] amp_enabled .................. False
[2025-04-15 01:42:27,029] [INFO] [config.py:1004:print] amp_params ................... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] bfloat16_enabled ............. True
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] bfloat16_immediate_grad_update True
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] checkpoint_parallel_write_pipeline False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] checkpoint_tag_validation_enabled True
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] checkpoint_tag_validation_fail False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] comms_config .................
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] communication_data_type ...... None
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] curriculum_enabled_legacy .... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] curriculum_params_legacy ..... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] data_efficiency_enabled ...... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] dataloader_drop_last ......... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] disable_allgather ............ False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] dump_state ................... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] dynamic_loss_scale_args ...... None
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_enabled ........... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_gas_boundary_resolution 1
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_layer_num ......... 0
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_max_iter .......... 100
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_stability ......... 1e-06
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_tol ............... 0.01
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] eigenvalue_verbose ........... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] elasticity_enabled ........... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] fp16_auto_cast ............... None
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] fp16_enabled ................. False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] fp16_master_weights_and_gradients False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] global_rank .................. 0
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] grad_accum_dtype ............. None
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] gradient_accumulation_steps .. 4
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] gradient_clipping ............ 1.0
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] gradient_predivide_factor .... 1.0
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] graph_harvesting ............. False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] initial_dynamic_scale ........ 1
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] load_universal_checkpoint .... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] loss_scale ................... 1.0
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] memory_breakdown ............. False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] mics_hierarchial_params_gather False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] mics_shard_size .............. -1
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] optimizer_legacy_fusion ...... False
[2025-04-15 01:42:27,030] [INFO] [config.py:1004:print] optimizer_name ............... None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] optimizer_params ............. None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] pld_enabled .................. False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] pld_params ................... False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] prescale_gradients ........... False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] scheduler_name ............... None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] scheduler_params ............. None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] seq_parallel_communication_data_type torch.float32
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] sparse_attention ............. None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] sparse_gradients_enabled ..... False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] steps_per_print .............. inf
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] timers_config ................ enabled=True synchronized=True
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] train_batch_size ............. 32
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] train_micro_batch_size_per_gpu 1
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] use_data_before_expert_parallel_ False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] use_node_local_storage ....... False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] wall_clock_breakdown ......... False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] weight_quantization_config ... None
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] world_size ................... 8
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] zero_allow_untested_optimizer True
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=100000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=100000000 max_reuse_distance=100000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] zero_enabled ................. True
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] zero_force_ds_cpu_optimizer ..
True
[2025-04-15 01:42:27,031] [INFO] [config.py:1004:print] zero_optimization_stage ...... 3
[2025-04-15 01:42:27,031] [INFO] [config.py:990:print_user_config] json = {
    "fp16": { "enabled": false },
    "bf16": { "enabled": true },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+08,
        "reduce_bucket_size": 1.000000e+08,
        "stage3_prefetch_bucket_size": 1.000000e+08,
        "stage3_param_persistence_threshold": 1.000000e+05,
        "stage3_max_live_parameters": 1.000000e+08,
        "stage3_max_reuse_distance": 1.000000e+08,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false,
    "steps_per_print": inf,
    "zero_allow_untested_optimizer": true
}
[INFO|trainer.py:2409] 2025-04-15 01:42:27,031 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-04-15 01:42:27,031 >> Num examples = 2,688
[INFO|trainer.py:2411] 2025-04-15 01:42:27,031 >> Num Epochs = 4
[INFO|trainer.py:2412] 2025-04-15 01:42:27,031 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2415] 2025-04-15 01:42:27,031 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2416] 2025-04-15 01:42:27,031 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2417] 2025-04-15 01:42:27,031 >> Total optimization steps = 336
[INFO|trainer.py:2418] 2025-04-15 01:42:27,033 >> Number of trainable parameters = 32,763,876,352
  0%|          | 0/336 [00:00<?, ?it/s]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
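The FutureWarning above comes from torch's checkpoint utility still using the deprecated per-device `torch.cpu.amp.autocast` spelling; the warning itself names the replacement. A minimal sketch of the rename in user code (the tensors here are illustrative, not from this run):

```python
# Migration sketch for the deprecated torch.cpu.amp.autocast warning above.
import torch

x = torch.randn(4, 4)

# Old, deprecated spelling (still works, but emits the FutureWarning):
#   with torch.cpu.amp.autocast(dtype=torch.bfloat16): ...
# New device-generic form recommended by the warning:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    y = x @ x  # runs under CPU autocast; eligible ops compute in bf16

assert y.shape == (4, 4)
```

The warning is harmless here since it originates inside torch itself, but the same rename applies to any project code that enters autocast contexts directly.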
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] /home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] 0%| | 1/336 [00:30<2:50:16, 30.50s/it] {'loss': 0.0336, 'grad_norm': 0.556122362613678, 'learning_rate': 1.818181818181818e-07, 'kl': -0.0009, 'entropy': -0.0017, 'ce_loss': 0.0271, 'epoch': 0.01} 0%| | 1/336 [00:30<2:50:16, 30.50s/it] 1%| | 2/336 [00:52<2:22:09, 25.54s/it] {'loss': 0.0306, 'grad_norm': 0.46383345127105713, 'learning_rate': 3.636363636363636e-07, 'kl': 0.0014, 'entropy': 0.0205, 'ce_loss': 0.01, 'epoch': 0.02} 1%| | 2/336 [00:52<2:22:09, 25.54s/it] 1%| | 3/336 [01:12<2:07:05, 22.90s/it] {'loss': 0.0262, 'grad_norm': 0.4181514084339142, 'learning_rate': 5.454545454545454e-07, 'kl': 0.001, 'entropy': 0.0233, 'ce_loss': 0.025, 'epoch': 0.04} 1%| | 3/336 [01:12<2:07:05, 22.90s/it] 1%| | 4/336 [01:36<2:08:37, 23.25s/it] {'loss': 0.0322, 'grad_norm': 0.5271944999694824, 'learning_rate': 7.272727272727272e-07, 'kl': 0.0002, 'entropy': -0.0258, 'ce_loss': 0.0298, 'epoch': 0.05} 1%| | 4/336 [01:36<2:08:37, 23.25s/it] 1%|▏ | 5/336 [01:54<1:58:20, 21.45s/it] {'loss': 0.0363, 'grad_norm': 0.5219911336898804, 'learning_rate': 9.09090909090909e-07, 'kl': 0.0016, 'entropy': -0.0593, 'ce_loss': 0.0205, 'epoch': 0.06} 1%|▏ | 5/336 [01:54<1:58:20, 21.45s/it] 2%|▏ | 6/336 [02:16<1:58:27, 21.54s/it] {'loss': 0.0306, 'grad_norm': 0.3969322144985199, 'learning_rate': 1.0909090909090908e-06, 'kl': 0.0022, 'entropy': 0.0518, 'ce_loss': 0.0105, 'epoch': 0.07} 2%|▏ | 6/336 [02:16<1:58:27, 21.54s/it] 2%|▏ | 7/336 [02:36<1:56:55, 21.32s/it] {'loss': 0.0311, 'grad_norm': 0.29038745164871216, 'learning_rate': 1.2727272727272726e-06, 
'kl': -0.0002, 'entropy': -0.0295, 'ce_loss': 0.0093, 'epoch': 0.08} 2%|▏ | 7/336 [02:36<1:56:55, 21.32s/it] 2%|▏ | 8/336 [02:58<1:57:01, 21.41s/it] {'loss': 0.037, 'grad_norm': 0.35928836464881897, 'learning_rate': 1.4545454545454544e-06, 'kl': 0.0034, 'entropy': 0.0496, 'ce_loss': 0.0164, 'epoch': 0.1} 2%|▏ | 8/336 [02:58<1:57:01, 21.41s/it] 3%|▎ | 9/336 [03:19<1:56:10, 21.32s/it] {'loss': 0.0254, 'grad_norm': 0.1917060762643814, 'learning_rate': 1.6363636363636365e-06, 'kl': 0.0049, 'entropy': -0.0535, 'ce_loss': 0.0217, 'epoch': 0.11} 3%|▎ | 9/336 [03:19<1:56:10, 21.32s/it] 3%|▎ | 10/336 [03:37<1:50:22, 20.31s/it] {'loss': 0.0322, 'grad_norm': 0.23470519483089447, 'learning_rate': 1.818181818181818e-06, 'kl': 0.0074, 'entropy': -0.0132, 'ce_loss': 0.0146, 'epoch': 0.12} 3%|▎ | 10/336 [03:37<1:50:22, 20.31s/it] 3%|▎ | 11/336 [04:00<1:53:29, 20.95s/it] {'loss': 0.0294, 'grad_norm': 0.24164389073848724, 'learning_rate': 2e-06, 'kl': 0.0033, 'entropy': -0.0625, 'ce_loss': 0.008, 'epoch': 0.13} 3%|▎ | 11/336 [04:00<1:53:29, 20.95s/it] 4%|▎ | 12/336 [04:18<1:48:49, 20.15s/it] {'loss': 0.0351, 'grad_norm': 0.2439153492450714, 'learning_rate': 1.999953280342959e-06, 'kl': 0.0117, 'entropy': 0.0209, 'ce_loss': 0.0301, 'epoch': 0.14} 4%|▎ | 12/336 [04:18<1:48:49, 20.15s/it] 4%|▍ | 13/336 [04:40<1:50:50, 20.59s/it] {'loss': 0.0292, 'grad_norm': 0.2523662745952606, 'learning_rate': 1.9998131257372875e-06, 'kl': 0.0044, 'entropy': -0.0124, 'ce_loss': 0.0162, 'epoch': 0.15} 4%|▍ | 13/336 [04:40<1:50:50, 20.59s/it] 4%|▍ | 14/336 [05:01<1:51:49, 20.84s/it] {'loss': 0.038, 'grad_norm': 0.24493643641471863, 'learning_rate': 1.9995795492789365e-06, 'kl': 0.0112, 'entropy': -0.0439, 'ce_loss': 0.0152, 'epoch': 0.17} 4%|▍ | 14/336 [05:01<1:51:49, 20.84s/it] 4%|▍ | 15/336 [05:25<1:57:26, 21.95s/it] {'loss': 0.0335, 'grad_norm': 0.2690785527229309, 'learning_rate': 1.99925257279313e-06, 'kl': 0.0107, 'entropy': -0.0374, 'ce_loss': 0.0152, 'epoch': 0.18} 4%|▍ | 15/336 [05:25<1:57:26, 
21.95s/it] 5%|▍ | 16/336 [05:45<1:53:09, 21.22s/it] {'loss': 0.0288, 'grad_norm': 0.22493208944797516, 'learning_rate': 1.9988322268323264e-06, 'kl': 0.0105, 'entropy': 0.0728, 'ce_loss': 0.03, 'epoch': 0.19} 5%|▍ | 16/336 [05:45<1:53:09, 21.22s/it] 5%|▌ | 17/336 [06:04<1:49:34, 20.61s/it] {'loss': 0.0191, 'grad_norm': 0.20840124785900116, 'learning_rate': 1.998318550673364e-06, 'kl': 0.019, 'entropy': 0.0208, 'ce_loss': 0.0142, 'epoch': 0.2} 5%|▌ | 17/336 [06:04<1:49:34, 20.61s/it] 5%|▌ | 18/336 [06:22<1:45:16, 19.86s/it] {'loss': 0.0337, 'grad_norm': 0.32258686423301697, 'learning_rate': 1.997711592313791e-06, 'kl': 0.0041, 'entropy': -0.0374, 'ce_loss': 0.0166, 'epoch': 0.21} 5%|▌ | 18/336 [06:22<1:45:16, 19.86s/it] 6%|▌ | 19/336 [06:43<1:46:04, 20.08s/it] {'loss': 0.028, 'grad_norm': 0.29989323019981384, 'learning_rate': 1.9970114084673796e-06, 'kl': 0.0044, 'entropy': -0.04, 'ce_loss': 0.03, 'epoch': 0.23} 6%|▌ | 19/336 [06:43<1:46:04, 20.08s/it] 6%|▌ | 20/336 [07:01<1:42:22, 19.44s/it] {'loss': 0.0335, 'grad_norm': 0.3687450885772705, 'learning_rate': 1.9962180645588286e-06, 'kl': 0.0115, 'entropy': -0.1133, 'ce_loss': 0.015, 'epoch': 0.24} 6%|▌ | 20/336 [07:01<1:42:22, 19.44s/it] 6%|▋ | 21/336 [07:19<1:39:49, 19.01s/it] {'loss': 0.0416, 'grad_norm': 0.3423936367034912, 'learning_rate': 1.9953316347176486e-06, 'kl': 0.0106, 'entropy': -0.0605, 'ce_loss': 0.0186, 'epoch': 0.25} 6%|▋ | 21/336 [07:19<1:39:49, 19.01s/it] 7%|▋ | 22/336 [07:37<1:38:42, 18.86s/it] {'loss': 0.0332, 'grad_norm': 0.28213027119636536, 'learning_rate': 1.994352201771236e-06, 'kl': 0.0108, 'entropy': -0.0459, 'ce_loss': 0.0198, 'epoch': 0.26} 7%|▋ | 22/336 [07:37<1:38:42, 18.86s/it] 7%|▋ | 23/336 [07:56<1:37:42, 18.73s/it] {'loss': 0.0279, 'grad_norm': 0.24293898046016693, 'learning_rate': 1.993279857237133e-06, 'kl': 0.0177, 'entropy': -0.0002, 'ce_loss': 0.0243, 'epoch': 0.27} 7%|▋ | 23/336 [07:56<1:37:42, 18.73s/it] 7%|▋ | 24/336 [08:14<1:36:16, 18.51s/it] {'loss': 0.0352, 'grad_norm': 
0.26316407322883606, 'learning_rate': 1.9921147013144777e-06, 'kl': 0.0194, 'entropy': -0.0359, 'ce_loss': 0.0159, 'epoch': 0.29} 7%|▋ | 24/336 [08:14<1:36:16, 18.51s/it] 7%|▋ | 25/336 [08:33<1:36:55, 18.70s/it] {'loss': 0.0276, 'grad_norm': 0.23964284360408783, 'learning_rate': 1.9908568428746405e-06, 'kl': 0.0074, 'entropy': -0.0588, 'ce_loss': 0.0261, 'epoch': 0.3} 7%|▋ | 25/336 [08:33<1:36:55, 18.70s/it] 8%|▊ | 26/336 [08:52<1:36:40, 18.71s/it] {'loss': 0.0262, 'grad_norm': 0.17423106729984283, 'learning_rate': 1.989506399451051e-06, 'kl': 0.0115, 'entropy': -0.0398, 'ce_loss': 0.017, 'epoch': 0.31} 8%|▊ | 26/336 [08:52<1:36:40, 18.71s/it] 8%|▊ | 27/336 [09:10<1:35:37, 18.57s/it] {'loss': 0.0334, 'grad_norm': 0.2422921359539032, 'learning_rate': 1.9880634972282166e-06, 'kl': 0.0095, 'entropy': -0.0625, 'ce_loss': 0.0247, 'epoch': 0.32} 8%|▊ | 27/336 [09:10<1:35:37, 18.57s/it] 8%|▊ | 28/336 [09:31<1:39:14, 19.33s/it] {'loss': 0.0288, 'grad_norm': 0.21724973618984222, 'learning_rate': 1.986528271029931e-06, 'kl': 0.0106, 'entropy': -0.0608, 'ce_loss': 0.0328, 'epoch': 0.33} 8%|▊ | 28/336 [09:31<1:39:14, 19.33s/it] 9%|▊ | 29/336 [09:52<1:41:49, 19.90s/it] {'loss': 0.0289, 'grad_norm': 0.22811733186244965, 'learning_rate': 1.984900864306677e-06, 'kl': 0.0166, 'entropy': -0.0442, 'ce_loss': 0.0173, 'epoch': 0.35} 9%|▊ | 29/336 [09:52<1:41:49, 19.90s/it] 9%|▉ | 30/336 [10:10<1:38:45, 19.37s/it] {'loss': 0.0313, 'grad_norm': 0.2636219561100006, 'learning_rate': 1.9831814291222233e-06, 'kl': 0.021, 'entropy': 0.024, 'ce_loss': 0.0163, 'epoch': 0.36} 9%|▉ | 30/336 [10:10<1:38:45, 19.37s/it] 9%|▉ | 31/336 [10:31<1:40:35, 19.79s/it] {'loss': 0.0305, 'grad_norm': 0.25198692083358765, 'learning_rate': 1.981370126139413e-06, 'kl': 0.0066, 'entropy': -0.0262, 'ce_loss': 0.0094, 'epoch': 0.37} 9%|▉ | 31/336 [10:31<1:40:35, 19.79s/it] 10%|▉ | 32/336 [10:55<1:46:02, 20.93s/it] {'loss': 0.021, 'grad_norm': 0.16534800827503204, 'learning_rate': 1.979467124605156e-06, 'kl': 0.0052, 
'entropy': -0.0437, 'ce_loss': 0.0101, 'epoch': 0.38} 10%|▉ | 32/336 [10:55<1:46:02, 20.93s/it] 10%|▉ | 33/336 [11:13<1:42:17, 20.26s/it] {'loss': 0.0309, 'grad_norm': 0.24338820576667786, 'learning_rate': 1.977472602334609e-06, 'kl': 0.0164, 'entropy': 0.0081, 'ce_loss': 0.0175, 'epoch': 0.39} 10%|▉ | 33/336 [11:13<1:42:17, 20.26s/it] 10%|█ | 34/336 [11:35<1:43:32, 20.57s/it] {'loss': 0.0273, 'grad_norm': 0.20457230508327484, 'learning_rate': 1.975386745694565e-06, 'kl': 0.0247, 'entropy': -0.0388, 'ce_loss': 0.0362, 'epoch': 0.4} 10%|█ | 34/336 [11:35<1:43:32, 20.57s/it] 10%|█ | 35/336 [11:52<1:38:56, 19.72s/it] {'loss': 0.0256, 'grad_norm': 0.19352681934833527, 'learning_rate': 1.9732097495860385e-06, 'kl': 0.0134, 'entropy': -0.0194, 'ce_loss': 0.0126, 'epoch': 0.42} 10%|█ | 35/336 [11:52<1:38:56, 19.72s/it] 11%|█ | 36/336 [12:13<1:40:02, 20.01s/it] {'loss': 0.0316, 'grad_norm': 0.21634066104888916, 'learning_rate': 1.970941817426052e-06, 'kl': 0.0258, 'entropy': -0.062, 'ce_loss': 0.0114, 'epoch': 0.43} 11%|█ | 36/336 [12:13<1:40:02, 20.01s/it] 11%|█ | 37/336 [12:32<1:37:23, 19.54s/it] {'loss': 0.0245, 'grad_norm': 0.1840643286705017, 'learning_rate': 1.968583161128631e-06, 'kl': 0.0083, 'entropy': -0.0198, 'ce_loss': 0.0205, 'epoch': 0.44} 11%|█ | 37/336 [12:32<1:37:23, 19.54s/it] 11%|█▏ | 38/336 [12:50<1:35:57, 19.32s/it] {'loss': 0.0236, 'grad_norm': 0.18451355397701263, 'learning_rate': 1.9661340010850024e-06, 'kl': 0.0181, 'entropy': 0.0021, 'ce_loss': 0.0146, 'epoch': 0.45} 11%|█▏ | 38/336 [12:50<1:35:57, 19.32s/it] 12%|█▏ | 39/336 [13:10<1:35:58, 19.39s/it] {'loss': 0.0295, 'grad_norm': 0.20285271108150482, 'learning_rate': 1.9635945661430005e-06, 'kl': 0.0107, 'entropy': -0.0466, 'ce_loss': 0.0145, 'epoch': 0.46} 12%|█▏ | 39/336 [13:10<1:35:58, 19.39s/it] 12%|█▏ | 40/336 [13:28<1:34:14, 19.10s/it] {'loss': 0.0331, 'grad_norm': 0.2485194355249405, 'learning_rate': 1.960965093585684e-06, 'kl': 0.0076, 'entropy': -0.083, 'ce_loss': 0.0217, 'epoch': 0.48} 
41/336 [13:50<1:37:12, 19.77s/it] {'loss': 0.0315, 'grad_norm': 0.23605366051197052, 'learning_rate': 1.9582458291091663e-06, 'kl': 0.017, 'entropy': 0.015, 'ce_loss': 0.0123, 'epoch': 0.49}
42/336 [14:11<1:38:58, 20.20s/it] {'loss': 0.0255, 'grad_norm': 0.1994330734014511, 'learning_rate': 1.9554370267996535e-06, 'kl': 0.0072, 'entropy': -0.031, 'ce_loss': 0.022, 'epoch': 0.5}
43/336 [14:29<1:36:13, 19.70s/it] {'loss': 0.03, 'grad_norm': 0.2328450083732605, 'learning_rate': 1.952538949109708e-06, 'kl': 0.0063, 'entropy': -0.022, 'ce_loss': 0.0097, 'epoch': 0.51}
44/336 [14:51<1:37:50, 20.11s/it] {'loss': 0.0239, 'grad_norm': 0.16965439915657043, 'learning_rate': 1.94955186683372e-06, 'kl': 0.0212, 'entropy': -0.0162, 'ce_loss': 0.0086, 'epoch': 0.52}
45/336 [15:12<1:39:22, 20.49s/it] {'loss': 0.0294, 'grad_norm': 0.21374227106571198, 'learning_rate': 1.94647605908261e-06, 'kl': 0.0071, 'entropy': -0.0459, 'ce_loss': 0.0095, 'epoch': 0.54}
46/336 [15:30<1:36:07, 19.89s/it] {'loss': 0.0285, 'grad_norm': 0.3241690695285797, 'learning_rate': 1.943311813257743e-06, 'kl': 0.0088, 'entropy': -0.0339, 'ce_loss': 0.022, 'epoch': 0.55}
47/336 [15:51<1:36:09, 19.96s/it] {'loss': 0.0337, 'grad_norm': 0.24899235367774963, 'learning_rate': 1.9400594250240794e-06, 'kl': 0.0123, 'entropy': -0.0476, 'ce_loss': 0.0284, 'epoch': 0.56}
48/336 [16:12<1:38:00, 20.42s/it] {'loss': 0.03, 'grad_norm': 0.21918408572673798, 'learning_rate': 1.9367191982825448e-06, 'kl': 0.0124, 'entropy': -0.064, 'ce_loss': 0.0146, 'epoch': 0.57}
49/336 [16:30<1:34:45, 19.81s/it] {'loss': 0.0264, 'grad_norm': 0.19453562796115875, 'learning_rate': 1.9332914451416345e-06, 'kl': 0.0228, 'entropy': 0.0359, 'ce_loss': 0.0242, 'epoch': 0.58}
50/336 [16:54<1:40:11, 21.02s/it] {'loss': 0.0217, 'grad_norm': 0.16631463170051575, 'learning_rate': 1.929776485888251e-06, 'kl': 0.0092, 'entropy': -0.0309, 'ce_loss': 0.006, 'epoch': 0.6}
51/336 [17:13<1:36:24, 20.30s/it] {'loss': 0.0352, 'grad_norm': 0.2594395875930786, 'learning_rate': 1.9261746489577764e-06, 'kl': 0.0067, 'entropy': -0.0693, 'ce_loss': 0.024, 'epoch': 0.61}
52/336 [17:32<1:34:49, 20.03s/it] {'loss': 0.0315, 'grad_norm': 0.2360682636499405, 'learning_rate': 1.9224862709033824e-06, 'kl': 0.011, 'entropy': 0.0271, 'ce_loss': 0.0106, 'epoch': 0.62}
53/336 [17:51<1:32:16, 19.56s/it] {'loss': 0.039, 'grad_norm': 0.26288822293281555, 'learning_rate': 1.918711696364584e-06, 'kl': 0.0427, 'entropy': -0.1367, 'ce_loss': 0.0266, 'epoch': 0.63}
54/336 [18:09<1:30:21, 19.22s/it] {'loss': 0.0327, 'grad_norm': 0.25616738200187683, 'learning_rate': 1.914851278035038e-06, 'kl': 0.0067, 'entropy': -0.0334, 'ce_loss': 0.0268, 'epoch': 0.64}
55/336 [18:27<1:28:41, 18.94s/it] {'loss': 0.0306, 'grad_norm': 0.22439108788967133, 'learning_rate': 1.910905376629585e-06, 'kl': 0.0315, 'entropy': -0.0532, 'ce_loss': 0.0445, 'epoch': 0.65}
56/336 [18:46<1:27:35, 18.77s/it] {'loss': 0.0303, 'grad_norm': 0.21062982082366943, 'learning_rate': 1.9068743608505452e-06, 'kl': 0.0035, 'entropy': -0.013, 'ce_loss': 0.0076, 'epoch': 0.67}
57/336 [19:09<1:33:49, 20.18s/it] {'loss': 0.0311, 'grad_norm': 0.25195467472076416, 'learning_rate': 1.902758607353269e-06, 'kl': 0.0179, 'entropy': 0.0155, 'ce_loss': 0.0099, 'epoch': 0.68}
58/336 [19:28<1:31:05, 19.66s/it] {'loss': 0.0285, 'grad_norm': 0.21001462638378143, 'learning_rate': 1.8985585007109388e-06, 'kl': 0.011, 'entropy': -0.0408, 'ce_loss': 0.0096, 'epoch': 0.69}
59/336 [19:49<1:32:46, 20.09s/it] {'loss': 0.0276, 'grad_norm': 0.1998775452375412, 'learning_rate': 1.8942744333786395e-06, 'kl': 0.0076, 'entropy': -0.0221, 'ce_loss': 0.0205, 'epoch': 0.7}
60/336 [20:07<1:30:17, 19.63s/it] {'loss': 0.0372, 'grad_norm': 0.23801416158676147, 'learning_rate': 1.8899068056566838e-06, 'kl': 0.0217, 'entropy': 0.0289, 'ce_loss': 0.0264, 'epoch': 0.71}
61/336 [20:30<1:34:25, 20.60s/it] {'loss': 0.028, 'grad_norm': 0.2107585072517395, 'learning_rate': 1.8854560256532098e-06, 'kl': 0.0013, 'entropy': -0.0065, 'ce_loss': 0.0064, 'epoch': 0.73}
62/336 [20:49<1:32:03, 20.16s/it] {'loss': 0.0294, 'grad_norm': 0.2207033932209015, 'learning_rate': 1.8809225092460485e-06, 'kl': 0.0036, 'entropy': 0.025, 'ce_loss': 0.0099, 'epoch': 0.74}
63/336 [21:11<1:33:37, 20.58s/it] {'loss': 0.0265, 'grad_norm': 0.2275032252073288, 'learning_rate': 1.8763066800438634e-06, 'kl': 0.0162, 'entropy': -0.0437, 'ce_loss': 0.0313, 'epoch': 0.75}
64/336 [21:29<1:30:26, 19.95s/it] {'loss': 0.0337, 'grad_norm': 0.2637733221054077, 'learning_rate': 1.8716089693465693e-06, 'kl': 0.0135, 'entropy': -0.0242, 'ce_loss': 0.0157, 'epoch': 0.76}
65/336 [21:48<1:27:59, 19.48s/it] {'loss': 0.0372, 'grad_norm': 0.28374725580215454, 'learning_rate': 1.8668298161050306e-06, 'kl': 0.0159, 'entropy': -0.0206, 'ce_loss': 0.0125, 'epoch': 0.77}
66/336 [22:06<1:25:57, 19.10s/it] {'loss': 0.0299, 'grad_norm': 0.20476216077804565, 'learning_rate': 1.861969666880049e-06, 'kl': 0.0125, 'entropy': -0.0359, 'ce_loss': 0.019, 'epoch': 0.79}
67/336 [22:24<1:24:45, 18.91s/it] {'loss': 0.0284, 'grad_norm': 0.21959146857261658, 'learning_rate': 1.8570289758006343e-06, 'kl': 0.0208, 'entropy': -0.0378, 'ce_loss': 0.0239, 'epoch': 0.8}
68/336 [22:43<1:23:53, 18.78s/it] {'loss': 0.0319, 'grad_norm': 0.22621940076351166, 'learning_rate': 1.8520082045215717e-06, 'kl': 0.0251, 'entropy': 0.0508, 'ce_loss': 0.0309, 'epoch': 0.81}
69/336 [23:04<1:26:15, 19.38s/it] {'loss': 0.0303, 'grad_norm': 0.21314120292663574, 'learning_rate': 1.846907822180286e-06, 'kl': 0.0092, 'entropy': -0.103, 'ce_loss': 0.014, 'epoch': 0.82}
70/336 [23:22<1:24:35, 19.08s/it] {'loss': 0.0263, 'grad_norm': 0.18305498361587524, 'learning_rate': 1.8417283053530043e-06, 'kl': 0.0079, 'entropy': 0.013, 'ce_loss': 0.0128, 'epoch': 0.83}
71/336 [23:43<1:27:05, 19.72s/it] {'loss': 0.0254, 'grad_norm': 0.20499187707901, 'learning_rate': 1.8364701380102264e-06, 'kl': 0.0137, 'entropy': -0.0549, 'ce_loss': 0.0112, 'epoch': 0.85}
72/336 [24:02<1:25:01, 19.32s/it] {'loss': 0.0318, 'grad_norm': 0.22145715355873108, 'learning_rate': 1.8311338114715027e-06, 'kl': 0.0021, 'entropy': 0.0012, 'ce_loss': 0.0074, 'epoch': 0.86}
73/336 [24:20<1:23:43, 19.10s/it] {'loss': 0.0327, 'grad_norm': 0.2394620031118393, 'learning_rate': 1.825719824359524e-06, 'kl': 0.0078, 'entropy': -0.0239, 'ce_loss': 0.015, 'epoch': 0.87}
74/336 [24:41<1:25:03, 19.48s/it] {'loss': 0.027, 'grad_norm': 0.19388984143733978, 'learning_rate': 1.8202286825535329e-06, 'kl': 0.0056, 'entropy': -0.0581, 'ce_loss': 0.0094, 'epoch': 0.88}
75/336 [25:00<1:24:10, 19.35s/it] {'loss': 0.0276, 'grad_norm': 0.26712799072265625, 'learning_rate': 1.814660899142053e-06, 'kl': 0.0166, 'entropy': -0.0698, 'ce_loss': 0.0115, 'epoch': 0.89}
76/336 [25:19<1:23:09, 19.19s/it] {'loss': 0.0266, 'grad_norm': 0.2016630321741104, 'learning_rate': 1.8090169943749474e-06, 'kl': 0.0461, 'entropy': -0.082, 'ce_loss': 0.0215, 'epoch': 0.9}
77/336 [25:42<1:28:13, 20.44s/it] {'loss': 0.0269, 'grad_norm': 0.18199054896831512, 'learning_rate': 1.8032974956148062e-06, 'kl': 0.012, 'entropy': -0.0396, 'ce_loss': 0.0143, 'epoch': 0.92}
78/336 [26:04<1:29:49, 20.89s/it] {'loss': 0.0246, 'grad_norm': 0.18777887523174286, 'learning_rate': 1.7975029372876705e-06, 'kl': 0.0024, 'entropy': -0.0757, 'ce_loss': 0.0093, 'epoch': 0.93}
79/336 [26:22<1:26:10, 20.12s/it] {'loss': 0.0325, 'grad_norm': 0.23307697474956512, 'learning_rate': 1.7916338608330956e-06, 'kl': 0.0021, 'entropy': -0.0183, 'ce_loss': 0.0157, 'epoch': 0.94}
80/336 [26:43<1:27:23, 20.48s/it] {'loss': 0.0357, 'grad_norm': 0.25255751609802246, 'learning_rate': 1.78569081465356e-06, 'kl': 0.0079, 'entropy': -0.0674, 'ce_loss': 0.0146, 'epoch': 0.95}
81/336 [27:02<1:24:00, 19.77s/it] {'loss': 0.0438, 'grad_norm': 0.32335004210472107, 'learning_rate': 1.7796743540632221e-06, 'kl': 0.0107, 'entropy': -0.032, 'ce_loss': 0.0272, 'epoch': 0.96}
82/336 [27:27<1:31:27, 21.61s/it] {'loss': 0.022, 'grad_norm': 0.16930562257766724, 'learning_rate': 1.7735850412360328e-06, 'kl': 0.0155, 'entropy': -0.0518, 'ce_loss': 0.0045, 'epoch': 0.98}
83/336 [27:49<1:30:32, 21.47s/it] {'loss': 0.0262, 'grad_norm': 0.19447658956050873, 'learning_rate': 1.7674234451532063e-06, 'kl': 0.0276, 'entropy': -0.0143, 'ce_loss': 0.0223, 'epoch': 0.99}
84/336 [28:07<1:26:21, 20.56s/it] {'loss': 0.0277, 'grad_norm': 0.19826678931713104, 'learning_rate': 1.7611901415500533e-06, 'kl': 0.0039, 'entropy': -0.0427, 'ce_loss': 0.0146, 'epoch': 1.0}
85/336 [28:26<1:24:22, 20.17s/it] {'loss': 0.0244, 'grad_norm': 0.1927766650915146, 'learning_rate': 1.7548857128621874e-06, 'kl': 0.0228, 'entropy': -0.054, 'ce_loss': 0.0178, 'epoch': 1.01}
86/336 [28:45<1:21:43, 19.62s/it] {'loss': 0.0288, 'grad_norm': 0.22494536638259888, 'learning_rate': 1.748510748171101e-06, 'kl': 0.0026, 'entropy': -0.0059, 'ce_loss': 0.0175, 'epoch': 1.02}
87/336 [29:06<1:23:28, 20.12s/it] {'loss': 0.0226, 'grad_norm': 0.19510535895824432, 'learning_rate': 1.7420658431491222e-06, 'kl': 0.0119, 'entropy': -0.0275, 'ce_loss': 0.0053, 'epoch': 1.04}
88/336 [29:25<1:21:20, 19.68s/it] {'loss': 0.0249, 'grad_norm': 0.1876247376203537, 'learning_rate': 1.735551600003755e-06, 'kl': 0.0221, 'entropy': -0.0325, 'ce_loss': 0.0323, 'epoch': 1.05}
89/336 [29:43<1:19:21, 19.28s/it] {'loss': 0.0193, 'grad_norm': 0.16577742993831635, 'learning_rate': 1.7289686274214115e-06, 'kl': 0.0157, 'entropy': -0.1001, 'ce_loss': 0.0296, 'epoch': 1.06}
90/336 [30:05<1:22:46, 20.19s/it] {'loss': 0.0171, 'grad_norm': 0.12712135910987854, 'learning_rate': 1.722317540510534e-06, 'kl': 0.0195, 'entropy': -0.0742, 'ce_loss': 0.0205, 'epoch': 1.07}
91/336 [30:24<1:20:32, 19.73s/it] {'loss': 0.0217, 'grad_norm': 0.1650465875864029, 'learning_rate': 1.715598960744121e-06, 'kl': 0.0178, 'entropy': -0.0295, 'ce_loss': 0.0124, 'epoch': 1.08}
92/336 [30:45<1:22:17, 20.24s/it] {'loss': 0.0215, 'grad_norm': 0.18182268738746643, 'learning_rate': 1.7088135159016582e-06, 'kl': 0.05, 'entropy': -0.1069, 'ce_loss': 0.006, 'epoch': 1.1}
93/336 [31:04<1:20:04, 19.77s/it] {'loss': 0.0209, 'grad_norm': 0.17362666130065918, 'learning_rate': 1.7019618400104569e-06, 'kl': 0.0159, 'entropy': -0.0435, 'ce_loss': 0.0102, 'epoch': 1.11}
94/336 [31:24<1:20:29, 19.96s/it] {'loss': 0.0204, 'grad_norm': 0.1766798049211502, 'learning_rate': 1.6950445732864126e-06, 'kl': 0.0186, 'entropy': -0.0522, 'ce_loss': 0.0185, 'epoch': 1.12}
95/336 [31:45<1:21:26, 20.28s/it] {'loss': 0.0286, 'grad_norm': 0.2154623419046402, 'learning_rate': 1.688062362074184e-06, 'kl': 0.0037, 'entropy': -0.0217, 'ce_loss': 0.0074, 'epoch': 1.13}
96/336 [32:04<1:19:20, 19.83s/it] {'loss': 0.025, 'grad_norm': 0.27538999915122986, 'learning_rate': 1.681015858786797e-06, 'kl': 0.0149, 'entropy': -0.1328, 'ce_loss': 0.0261, 'epoch': 1.14}
97/336 [32:23<1:17:08, 19.37s/it] {'loss': 0.0197, 'grad_norm': 0.17810164391994476, 'learning_rate': 1.6739057218446857e-06, 'kl': 0.0248, 'entropy': 0.0229, 'ce_loss': 0.0165, 'epoch': 1.15}
98/336 [32:41<1:15:50, 19.12s/it] {'loss': 0.0252, 'grad_norm': 0.2165401428937912, 'learning_rate': 1.666732615614169e-06, 'kl': 0.0212, 'entropy': -0.017, 'ce_loss': 0.0216, 'epoch': 1.17}
99/336 [33:05<1:21:09, 20.55s/it] {'loss': 0.0248, 'grad_norm': 0.21358899772167206, 'learning_rate': 1.6594972103453724e-06, 'kl': 0.0315, 'entropy': -0.062, 'ce_loss': 0.0125, 'epoch': 1.18}
100/336 [33:26<1:21:51, 20.81s/it] {'loss': 0.0196, 'grad_norm': 0.18937267363071442, 'learning_rate': 1.6522001821096019e-06, 'kl': 0.0325, 'entropy': -0.124, 'ce_loss': 0.0157, 'epoch': 1.19}
101/336 [33:45<1:19:09, 20.21s/it] {'loss': 0.0325, 'grad_norm': 0.2945106327533722, 'learning_rate': 1.6448422127361705e-06, 'kl': 0.0178, 'entropy': -0.0006, 'ce_loss': 0.0252, 'epoch': 1.2}
102/336 [34:07<1:20:14, 20.58s/it] {'loss': 0.0199, 'grad_norm': 0.17453603446483612, 'learning_rate': 1.6374239897486897e-06, 'kl': 0.0189, 'entropy': -0.0267, 'ce_loss': 0.0184, 'epoch': 1.21}
103/336 [34:25<1:17:16, 19.90s/it] {'loss': 0.0272, 'grad_norm': 0.22808398306369781, 'learning_rate': 1.6299462063008269e-06, 'kl': 0.0559, 'entropy': -0.0297, 'ce_loss': 0.014, 'epoch': 1.23}
104/336 [34:43<1:14:47, 19.34s/it] {'loss': 0.0245, 'grad_norm': 0.20444047451019287, 'learning_rate': 1.6224095611115383e-06, 'kl': 0.0305, 'entropy': -0.0471, 'ce_loss': 0.0189, 'epoch': 1.24}
105/336 [35:04<1:16:33, 19.88s/it] {'loss': 0.027, 'grad_norm': 0.29930785298347473, 'learning_rate': 1.614814758399781e-06, 'kl': 0.0315, 'entropy': 0.007, 'ce_loss': 0.0059, 'epoch': 1.25}
106/336 [35:22<1:14:04, 19.32s/it] {'loss': 0.025, 'grad_norm': 0.22735632956027985, 'learning_rate': 1.6071625078187112e-06, 'kl': 0.0128, 'entropy': -0.0732, 'ce_loss': 0.0122, 'epoch': 1.26}
107/336 [35:44<1:16:08, 19.95s/it] {'loss': 0.0286, 'grad_norm': 0.2253393679857254, 'learning_rate': 1.599453524389374e-06, 'kl': 0.0466, 'entropy': -0.0811, 'ce_loss': 0.0066, 'epoch': 1.27}
108/336 [36:01<1:13:25, 19.32s/it] {'loss': 0.0267, 'grad_norm': 0.2761515974998474, 'learning_rate': 1.5916885284338935e-06, 'kl': 0.0276, 'entropy': -0.0771, 'ce_loss': 0.0138, 'epoch': 1.29}
109/336 [36:22<1:14:08, 19.60s/it] {'loss': 0.018, 'grad_norm': 0.18209226429462433, 'learning_rate': 1.5838682455081657e-06, 'kl': 0.0075, 'entropy': -0.0322, 'ce_loss': 0.0317, 'epoch': 1.3}
110/336 [36:40<1:12:27, 19.24s/it] {'loss': 0.0257, 'grad_norm': 0.20630405843257904, 'learning_rate': 1.5759934063340624e-06, 'kl': 0.0427, 'entropy': -0.0864, 'ce_loss': 0.0086, 'epoch': 1.31}
111/336 [37:03<1:16:04, 20.28s/it] {'loss': 0.0196, 'grad_norm': 0.19179001450538635, 'learning_rate': 1.5680647467311555e-06, 'kl': 0.0287, 'entropy': -0.0413, 'ce_loss': 0.0079, 'epoch': 1.32}
112/336 [37:26<1:19:36, 21.32s/it] {'loss': 0.0204, 'grad_norm': 0.19506439566612244, 'learning_rate': 1.56008300754796e-06, 'kl': 0.0437, 'entropy': 0.0088, 'ce_loss': 0.0214, 'epoch': 1.33}
113/336 [37:47<1:18:15, 21.06s/it] {'loss': 0.0198, 'grad_norm': 0.17770442366600037, 'learning_rate': 1.5520489345927094e-06, 'kl': 0.032, 'entropy': -0.0217, 'ce_loss': 0.0188, 'epoch': 1.35}
114/336 [38:05<1:15:05, 20.29s/it] {'loss': 0.0239, 'grad_norm': 0.2086181938648224, 'learning_rate': 1.5439632785636705e-06, 'kl': 0.0199, 'entropy': -0.0332, 'ce_loss': 0.0132, 'epoch': 1.36}
115/336 [38:24<1:12:26, 19.67s/it] {'loss': 0.0272, 'grad_norm': 0.22167104482650757, 'learning_rate': 1.5358267949789964e-06, 'kl': 0.0116, 'entropy': -0.033, 'ce_loss': 0.0136, 'epoch': 1.37}
116/336 [38:42<1:10:20, 19.19s/it] {'loss': 0.0265, 'grad_norm': 0.2389765828847885, 'learning_rate': 1.5276402441061327e-06, 'kl': 0.0391, 'entropy': -0.0508, 'ce_loss': 0.0194, 'epoch': 1.38}
117/336 [39:00<1:09:13, 18.97s/it] {'loss': 0.0234, 'grad_norm': 0.23680360615253448, 'learning_rate': 1.5194043908907772e-06, 'kl': 0.0214, 'entropy': 0.0034, 'ce_loss': 0.0253, 'epoch': 1.39}
118/336 [39:18<1:07:57, 18.71s/it] {'loss': 0.0265, 'grad_norm': 0.23225313425064087, 'learning_rate': 1.5111200048854054e-06, 'kl': 0.0067, 'entropy': -0.0102, 'ce_loss': 0.0121, 'epoch': 1.4}
119/336 [39:36<1:07:04, 18.55s/it] {'loss': 0.0294, 'grad_norm': 0.24765115976333618, 'learning_rate': 1.5027878601773632e-06, 'kl': 0.0162, 'entropy': -0.0417, 'ce_loss': 0.0219, 'epoch': 1.42}
120/336 [39:55<1:06:57, 18.60s/it] {'loss': 0.0181, 'grad_norm': 0.1848239302635193, 'learning_rate': 1.494408735316537e-06, 'kl': 0.0228, 'entropy': -0.0105, 'ce_loss': 0.0183, 'epoch': 1.43}
121/336 [40:14<1:06:38, 18.60s/it] {'loss': 0.0275, 'grad_norm': 0.2282910943031311, 'learning_rate': 1.4859834132426058e-06, 'kl': 0.0342, 'entropy': 0.0289, 'ce_loss': 0.0184, 'epoch': 1.44}
122/336 [40:38<1:12:02, 20.20s/it] {'loss': 0.02, 'grad_norm': 0.19180208444595337, 'learning_rate': 1.4775126812118863e-06, 'kl': 0.0247, 'entropy': -0.009, 'ce_loss': 0.0129, 'epoch': 1.45}
123/336 [40:56<1:09:58, 19.71s/it] {'loss': 0.0229, 'grad_norm': 0.19556733965873718, 'learning_rate': 1.4689973307237686e-06, 'kl': 0.0176, 'entropy': -0.0613, 'ce_loss': 0.021, 'epoch': 1.46}
124/336 [41:18<1:11:35, 20.26s/it] {'loss': 0.0199, 'grad_norm': 0.18009300529956818, 'learning_rate': 1.4604381574467614e-06, 'kl': 0.0334, 'entropy': -0.0898, 'ce_loss': 0.0233, 'epoch': 1.48}
125/336 [41:36<1:09:13, 19.69s/it] {'loss': 0.0215, 'grad_norm': 0.18584349751472473, 'learning_rate': 1.451835961144145e-06, 'kl': 0.0356, 'entropy': -0.083, 'ce_loss': 0.0055, 'epoch': 1.49}
126/336 [41:55<1:07:40, 19.34s/it] {'loss': 0.0242, 'grad_norm': 0.19484710693359375, 'learning_rate': 1.4431915455992414e-06, 'kl': 0.0212, 'entropy': 0.0058, 'ce_loss': 0.0112, 'epoch': 1.5}
127/336 [42:19<1:12:13, 20.73s/it] {'loss': 0.0235, 'grad_norm': 0.19529865682125092, 'learning_rate': 1.4345057185403098e-06, 'kl': 0.0272, 'entropy': -0.0415, 'ce_loss': 0.0113, 'epoch': 1.51}
128/336 [42:38<1:10:07, 20.23s/it] {'loss': 0.0309, 'grad_norm': 0.2496039867401123, 'learning_rate': 1.4257792915650725e-06, 'kl': 0.0354, 'entropy': 0.0023, 'ce_loss': 0.0212, 'epoch': 1.52}
129/336 [42:59<1:10:43, 20.50s/it] {'loss': 0.0265, 'grad_norm': 0.22667117416858673, 'learning_rate': 1.4170130800648812e-06, 'kl': 0.0106, 'entropy': -0.0067, 'ce_loss': 0.0084, 'epoch': 1.54}
130/336 [43:20<1:11:06, 20.71s/it] {'loss': 0.0201, 'grad_norm': 0.18162737786769867, 'learning_rate': 1.408207903148525e-06, 'kl': 0.0124, 'entropy': -0.0156, 'ce_loss': 0.0123, 'epoch': 1.55}
131/336 [43:38<1:08:07, 19.94s/it] {'loss': 0.0217, 'grad_norm': 0.18819168210029602, 'learning_rate': 1.3993645835656952e-06, 'kl': 0.0157, 'entropy': -0.0234, 'ce_loss': 0.0244, 'epoch': 1.56}
132/336 [43:59<1:08:12, 20.06s/it] {'loss': 0.0239, 'grad_norm': 0.22174504399299622, 'learning_rate': 1.3904839476301088e-06, 'kl': 0.0342, 'entropy': -0.0033, 'ce_loss': 0.0087, 'epoch': 1.57}
133/336 [44:23<1:12:05, 21.31s/it] {'loss': 0.0164, 'grad_norm': 0.1604408174753189, 'learning_rate': 1.3815668251422953e-06, 'kl': 0.007, 'entropy': -0.0261, 'ce_loss': 0.0245, 'epoch': 1.58}
134/336 [44:44<1:11:09, 21.14s/it] {'loss': 0.0174, 'grad_norm': 0.16344523429870605, 'learning_rate': 1.3726140493120637e-06, 'kl': 0.0383, 'entropy': -0.0708, 'ce_loss': 0.006, 'epoch': 1.6}
135/336 [45:05<1:10:51, 21.15s/it] {'loss': 0.0225, 'grad_norm': 0.19634518027305603, 'learning_rate': 1.363626456680647e-06, 'kl': 0.0074, 'entropy': -0.0488, 'ce_loss': 0.0135, 'epoch': 1.61}
136/336 [45:26<1:10:31, 21.16s/it] {'loss': 0.023, 'grad_norm': 0.19007377326488495, 'learning_rate': 1.3546048870425354e-06, 'kl': 0.0356, 'entropy': -0.0386, 'ce_loss': 0.017, 'epoch': 1.62}
137/336 [45:47<1:10:11, 21.16s/it] {'loss': 0.0224, 'grad_norm': 0.2226763367652893, 'learning_rate': 1.3455501833670087e-06, 'kl': 0.0228, 'entropy': -0.0752, 'ce_loss': 0.014, 'epoch': 1.63}
138/336 [46:06<1:07:19, 20.40s/it] {'loss': 0.025, 'grad_norm': 0.22744449973106384, 'learning_rate': 1.336463191719367e-06, 'kl': 0.0249, 'entropy': -0.0225, 'ce_loss': 0.0119, 'epoch': 1.64}
139/336 [46:24<1:05:10, 19.85s/it] {'loss': 0.0302, 'grad_norm': 0.24581502377986908, 'learning_rate': 1.3273447611818766e-06, 'kl': 0.0186, 'entropy': -0.0747, 'ce_loss': 0.0103, 'epoch': 1.65}
140/336 [46:44<1:04:39, 19.79s/it] {'loss': 0.0249, 'grad_norm': 0.20019924640655518, 'learning_rate': 1.3181957437744332e-06, 'kl': 0.0228, 'entropy': -0.0654, 'ce_loss': 0.023, 'epoch': 1.67}
141/336 [47:05<1:05:08, 20.04s/it] {'loss': 0.0178, 'grad_norm': 0.1666233092546463, 'learning_rate': 1.3090169943749473e-06, 'kl': 0.0457, 'entropy': 0.015, 'ce_loss': 0.0068, 'epoch': 1.68}
142/336 [47:23<1:03:25, 19.62s/it] {'loss': 0.0306, 'grad_norm': 0.2479308843612671, 'learning_rate': 1.2998093706394675e-06, 'kl': 0.0105, 'entropy': -0.0552, 'ce_loss': 0.0139, 'epoch': 1.69}
143/336 [47:43<1:03:43, 19.81s/it] {'loss': 0.018, 'grad_norm': 0.14878065884113312, 'learning_rate': 1.2905737329220392e-06, 'kl': 0.0354, 'entropy': -0.0752, 'ce_loss': 0.0115, 'epoch': 1.7}
144/336 [48:05<1:04:42, 20.22s/it] {'loss': 0.0217, 'grad_norm': 0.21767084300518036, 'learning_rate': 1.2813109441943164e-06, 'kl': 0.012, 'entropy': -0.0591, 'ce_loss': 0.0131, 'epoch': 1.71}
145/336 [48:23<1:02:43, 19.70s/it] {'loss': 0.0272, 'grad_norm': 0.23388831317424774, 'learning_rate': 1.2720218699649241e-06, 'kl': 0.0452, 'entropy': -0.0796, 'ce_loss': 0.0128, 'epoch': 1.73}
146/336 [48:44<1:03:59, 20.21s/it] {'loss': 0.0227, 'grad_norm': 0.2210669368505478, 'learning_rate': 1.262707378198587e-06, 'kl': 0.0159, 'entropy': -0.0476, 'ce_loss': 0.0202, 'epoch': 1.74}
147/336 [49:03<1:01:52, 19.64s/it] {'loss': 0.0194, 'grad_norm': 0.18279339373111725, 'learning_rate': 1.2533683392350262e-06, 'kl': 0.0073, 'entropy': -0.0476, 'ce_loss': 0.0126, 'epoch': 1.75}
148/336 [49:24<1:03:06, 20.14s/it] {'loss': 0.0204, 'grad_norm': 0.18311206996440887, 'learning_rate': 1.2440056257076374e-06, 'kl': 0.0432, 'entropy': -0.0403, 'ce_loss': 0.0107, 'epoch': 1.76}
149/336 [49:47<1:04:55, 20.83s/it] {'loss': 0.0201, 'grad_norm': 0.1995018720626831, 'learning_rate': 1.23462011246195e-06, 'kl': 0.0144, 'entropy': -0.0293, 'ce_loss': 0.0217, 'epoch': 1.77}
150/336 [50:05<1:02:02, 20.01s/it] {'loss': 0.0238, 'grad_norm': 0.19784873723983765, 'learning_rate': 1.2252126764738844e-06, 'kl': 0.0066, 'entropy': -0.0491, 'ce_loss': 0.0214, 'epoch': 1.79}
151/336 [50:28<1:05:07, 21.12s/it] {'loss': 0.02, 'grad_norm': 0.176497682929039, 'learning_rate': 1.2157841967678063e-06, 'kl': 0.0179, 'entropy': -0.0109, 'ce_loss': 0.0034, 'epoch': 1.8}
152/336 [50:47<1:02:22, 20.34s/it] {'loss': 0.0261, 'grad_norm': 0.20791840553283691, 'learning_rate': 1.2063355543343923e-06, 'kl': 0.0371, 'entropy': -0.0493, 'ce_loss': 0.0174, 'epoch': 1.81}
153/336 [51:05<1:00:16, 19.76s/it] {'loss': 0.0296, 'grad_norm': 0.28589048981666565, 'learning_rate': 1.1968676320483101e-06, 'kl': 0.014, 'entropy': -0.0214, 'ce_loss': 0.0109, 'epoch': 1.82}
154/336 [51:26<1:01:06, 20.14s/it] {'loss': 0.0219, 'grad_norm': 0.19946201145648956, 'learning_rate': 1.1873813145857248e-06, 'kl': 0.0322, 'entropy': -0.062, 'ce_loss': 0.0391, 'epoch': 1.83}
155/336 [51:47<1:01:34, 20.41s/it] {'loss': 0.0236, 'grad_norm': 0.2044534832239151, 'learning_rate': 1.1778774883416322e-06, 'kl': 0.0236, 'entropy': -0.0304, 'ce_loss': 0.0205, 'epoch': 1.85}
156/336 [52:09<1:02:18, 20.77s/it] {'loss': 0.0175, 'grad_norm': 0.1728210300207138, 'learning_rate': 1.1683570413470383e-06, 'kl': 0.0189, 'entropy': -0.0354, 'ce_loss': 0.0163, 'epoch': 1.86}
157/336 [52:31<1:03:19, 21.22s/it] {'loss': 0.0229, 'grad_norm': 0.2024766355752945, 'learning_rate': 1.1588208631859807e-06, 'kl': 0.0276, 'entropy': -0.0747, 'ce_loss': 0.015, 'epoch': 1.87}
158/336 [52:52<1:02:55, 21.21s/it] {'loss': 0.022, 'grad_norm': 0.21219158172607422, 'learning_rate': 1.149269844912404e-06, 'kl': 0.0228, 'entropy': -0.0393, 'ce_loss': 0.0059, 'epoch': 1.88}
159/336 [53:13<1:02:15, 21.10s/it] {'loss': 0.0246, 'grad_norm': 0.21696853637695312, 'learning_rate': 1.1397048789669059e-06, 'kl': 0.0243, 'entropy': -0.0806, 'ce_loss': 0.0189, 'epoch': 1.89}
160/336 [53:35<1:02:27, 21.29s/it] {'loss': 0.0223, 'grad_norm': 0.18749357759952545, 'learning_rate': 1.1301268590933434e-06, 'kl': 0.0212, 'entropy': -0.0718, 'ce_loss': 0.0125, 'epoch': 1.9}
161/336 [53:53<59:33, 20.42s/it] {'loss': 0.0259, 'grad_norm': 0.21566730737686157, 'learning_rate': 1.1205366802553228e-06, 'kl': 0.0171, 'entropy': -0.0461, 'ce_loss': 0.0129, 'epoch': 1.92}
162/336 [54:14<59:08, 20.39s/it] {'loss': 0.0245, 'grad_norm': 0.21803289651870728, 'learning_rate': 1.110935238552578e-06, 'kl': 0.0437, 'entropy': -0.0898, 'ce_loss': 0.0114, 'epoch': 1.93}
163/336 [54:32<57:14, 19.85s/it] {'loss': 0.0281, 'grad_norm': 0.21488355100154877, 'learning_rate': 1.1013234311372353e-06, 'kl': 0.033, 'entropy': -0.0757, 'ce_loss': 0.0335, 'epoch': 1.94}
164/336 [54:53<57:59, 20.23s/it] {'loss': 0.0211, 'grad_norm': 0.19921565055847168, 'learning_rate': 1.0917021561299862e-06, 'kl': 0.0295, 'entropy': -0.0312, 'ce_loss': 0.0118, 'epoch': 1.95}
165/336 [55:17<1:00:07, 21.10s/it] {'loss': 0.0209, 'grad_norm': 0.17787693440914154, 'learning_rate': 1.0820723125361684e-06, 'kl': 0.0254, 'entropy': -0.0679, 'ce_loss': 0.0111, 'epoch': 1.96}
166/336 [55:37<59:32, 21.01s/it] {'loss': 0.0196, 'grad_norm': 0.1530429720878601, 'learning_rate': 1.0724348001617625e-06, 'kl': 0.0923, 'entropy': -0.0728, 'ce_loss': 0.0092, 'epoch': 1.98}
167/336 [55:56<57:03, 20.26s/it] {'loss': 0.0222, 'grad_norm': 0.18336762487888336, 'learning_rate': 1.0627905195293135e-06, 'kl': 0.0256, 'entropy': -0.033, 'ce_loss': 0.0128, 'epoch': 1.99}
168/336 [56:14<54:52, 19.60s/it] {'loss': 0.0272, 'grad_norm': 0.235432431101799, 'learning_rate': 1.0531403717937886e-06, 'kl': 0.0028, 'entropy': -0.009, 'ce_loss': 0.0051, 'epoch': 2.0}
169/336 [56:32<53:40, 19.29s/it] {'loss': 0.0161, 'grad_norm': 0.15607398748397827, 'learning_rate': 1.0434852586583737e-06, 'kl': 0.0437, 'entropy': 0.0143, 'ce_loss': 0.0148, 'epoch': 2.01}
170/336 [56:51<53:05, 19.19s/it] {'loss': 0.0146, 'grad_norm': 0.13646626472473145, 'learning_rate': 1.0338260822902165e-06, 'kl': 0.032, 'entropy': -0.0295, 'ce_loss': 0.0205, 'epoch': 2.02}
171/336 [57:11<53:13, 19.35s/it] {'loss': 0.0195, 'grad_norm': 0.20896901190280914, 'learning_rate': 1.0241637452361322e-06, 'kl': 0.0859, 'entropy': -0.082, 'ce_loss': 0.0114, 'epoch': 2.04}
172/336 [57:30<52:17, 19.13s/it] {'loss': 0.0214, 'grad_norm': 0.17581482231616974, 'learning_rate': 1.0144991503382673e-06, 'kl': 0.0574, 'entropy': -0.0771, 'ce_loss': 0.0107, 'epoch': 2.05}
173/336 [57:49<51:46, 19.06s/it] {'loss': 0.0162, 'grad_norm': 0.1555885225534439, 'learning_rate': 1.0048332006497404e-06, 'kl': 0.0396, 'entropy': -0.0649, 'ce_loss': 0.0219, 'epoch': 2.06}
174/336 [58:08<51:28, 19.06s/it] {'loss': 0.0163, 'grad_norm': 0.1954892873764038, 'learning_rate': 9.951667993502597e-07, 'kl': 0.0718, 'entropy': -0.1045, 'ce_loss': 0.0089, 'epoch': 2.07}
175/336 [58:26<50:36, 18.86s/it] {'loss': 0.018, 'grad_norm': 0.17306023836135864, 'learning_rate': 9.855008496617326e-07, 'kl': 0.0403, 'entropy': -0.0239, 'ce_loss': 0.0061, 'epoch': 2.08}
176/336 [58:47<51:46, 19.41s/it] {'loss': 0.0143, 'grad_norm': 0.16467143595218658, 'learning_rate': 9.75836254763868e-07, 'kl': 0.0311, 'entropy': -0.0747, 'ce_loss': 0.0088, 'epoch': 2.1}
177/336 [59:08<52:50, 19.94s/it] {'loss': 0.0209, 'grad_norm': 0.1877674162387848, 'learning_rate': 9.661739177097834e-07, 'kl': 0.0267, 'entropy': -0.0869, 'ce_loss': 0.0052, 'epoch': 2.11}
178/336 [59:26<51:06, 19.41s/it] {'loss': 0.0202, 'grad_norm': 0.21118910610675812, 'learning_rate': 9.565147413416265e-07, 'kl': 0.0535, 'entropy': -0.1187, 'ce_loss': 0.009, 'epoch': 2.12}
179/336 [59:45<50:00, 19.11s/it] {'loss': 0.0117, 'grad_norm': 0.13858912885189056, 'learning_rate': 9.468596282062113e-07, 'kl': 0.0181, 'entropy': -0.0098, 'ce_loss': 0.0136, 'epoch': 2.13}
180/336 [1:00:06<51:37, 19.85s/it] {'loss': 0.0174, 'grad_norm': 0.1708763837814331, 'learning_rate': 9.372094804706866e-07, 'kl': 0.0527, 'entropy': -0.0255, 'ce_loss': 0.0104, 'epoch': 2.14}
181/336 [1:00:27<52:11, 20.20s/it] {'loss': 0.0142, 'grad_norm': 0.15324528515338898, 'learning_rate': 9.275651998382377e-07, 'kl': 0.03, 'entropy': -0.0253, 'ce_loss': 0.0071, 'epoch': 2.15}
182/336 [1:00:48<52:29, 20.45s/it] {'loss': 0.022, 'grad_norm': 0.23443107306957245, 'learning_rate': 9.179276874638314e-07, 'kl': 0.0591, 'entropy': -0.1084, 'ce_loss': 0.0292, 'epoch': 2.17}
183/336 [1:01:07<50:41, 19.88s/it] {'loss': 0.0172, 'grad_norm': 0.18951916694641113, 'learning_rate': 9.082978438700138e-07, 'kl': 0.0527, 'entropy': -0.0693, 'ce_loss': 0.0163, 'epoch': 2.18}
184/336 [1:01:25<49:10, 19.41s/it] {'loss': 0.0168, 'grad_norm': 0.19647429883480072, 'learning_rate': 8.986765688627651e-07, 'kl': 0.0574, 'entropy': -0.0527, 'ce_loss': 0.0095, 'epoch': 2.19}
185/336 [1:01:43<47:56, 19.05s/it]
{'loss': 0.0155, 'grad_norm': 0.1953701674938202, 'learning_rate': 8.890647614474222e-07, 'kl': 0.0223, 'entropy': -0.0425, 'ce_loss': 0.0141, 'epoch': 2.2} 55%|█████▌ | 185/336 [1:01:43<47:56, 19.05s/it] 55%|█████▌ | 186/336 [1:02:07<51:23, 20.55s/it] {'loss': 0.0172, 'grad_norm': 0.1988769918680191, 'learning_rate': 8.79463319744677e-07, 'kl': 0.062, 'entropy': -0.0265, 'ce_loss': 0.0119, 'epoch': 2.21} 55%|█████▌ | 186/336 [1:02:07<51:23, 20.55s/it] 56%|█████▌ | 187/336 [1:02:28<51:00, 20.54s/it] {'loss': 0.0183, 'grad_norm': 0.2061987966299057, 'learning_rate': 8.698731409066568e-07, 'kl': 0.0226, 'entropy': -0.0216, 'ce_loss': 0.01, 'epoch': 2.23} 56%|█████▌ | 187/336 [1:02:28<51:00, 20.54s/it] 56%|█████▌ | 188/336 [1:02:47<49:25, 20.03s/it] {'loss': 0.0214, 'grad_norm': 0.21873442828655243, 'learning_rate': 8.602951210330941e-07, 'kl': 0.0796, 'entropy': -0.054, 'ce_loss': 0.0158, 'epoch': 2.24} 56%|█████▌ | 188/336 [1:02:47<49:25, 20.03s/it] 56%|█████▋ | 189/336 [1:03:07<49:34, 20.23s/it] {'loss': 0.0177, 'grad_norm': 0.2178049087524414, 'learning_rate': 8.507301550875959e-07, 'kl': 0.0031, 'entropy': -0.0019, 'ce_loss': 0.0026, 'epoch': 2.25} 56%|█████▋ | 189/336 [1:03:07<49:34, 20.23s/it] 57%|█████▋ | 190/336 [1:03:30<50:50, 20.89s/it] {'loss': 0.0167, 'grad_norm': 0.2083812952041626, 'learning_rate': 8.411791368140195e-07, 'kl': 0.0262, 'entropy': -0.0486, 'ce_loss': 0.0066, 'epoch': 2.26} 57%|█████▋ | 190/336 [1:03:30<50:50, 20.89s/it] 57%|█████▋ | 191/336 [1:03:48<48:24, 20.03s/it] {'loss': 0.0183, 'grad_norm': 0.20880137383937836, 'learning_rate': 8.316429586529614e-07, 'kl': 0.0569, 'entropy': -0.0381, 'ce_loss': 0.0077, 'epoch': 2.27} 57%|█████▋ | 191/336 [1:03:48<48:24, 20.03s/it] 57%|█████▋ | 192/336 [1:04:09<48:35, 20.25s/it] {'loss': 0.0218, 'grad_norm': 0.2501277029514313, 'learning_rate': 8.221225116583676e-07, 'kl': 0.032, 'entropy': -0.0728, 'ce_loss': 0.0265, 'epoch': 2.29} 57%|█████▋ | 192/336 [1:04:09<48:35, 20.25s/it] 57%|█████▋ | 193/336 
[1:04:27<47:12, 19.81s/it] {'loss': 0.0187, 'grad_norm': 0.19226092100143433, 'learning_rate': 8.126186854142751e-07, 'kl': 0.0255, 'entropy': -0.0596, 'ce_loss': 0.0075, 'epoch': 2.3} 57%|█████▋ | 193/336 [1:04:27<47:12, 19.81s/it] 58%|█████▊ | 194/336 [1:04:46<46:13, 19.53s/it] {'loss': 0.0212, 'grad_norm': 0.34131720662117004, 'learning_rate': 8.031323679516899e-07, 'kl': 0.0184, 'entropy': -0.0869, 'ce_loss': 0.0117, 'epoch': 2.31} 58%|█████▊ | 194/336 [1:04:46<46:13, 19.53s/it] 58%|█████▊ | 195/336 [1:05:08<47:30, 20.22s/it] {'loss': 0.0137, 'grad_norm': 0.18951615691184998, 'learning_rate': 7.936644456656081e-07, 'kl': 0.0625, 'entropy': -0.0806, 'ce_loss': 0.0049, 'epoch': 2.32} 58%|█████▊ | 195/336 [1:05:08<47:30, 20.22s/it] 58%|█████▊ | 196/336 [1:05:27<45:57, 19.70s/it] {'loss': 0.019, 'grad_norm': 0.22135503590106964, 'learning_rate': 7.84215803232194e-07, 'kl': 0.0287, 'entropy': -0.0464, 'ce_loss': 0.0073, 'epoch': 2.33} 58%|█████▊ | 196/336 [1:05:27<45:57, 19.70s/it] 59%|█████▊ | 197/336 [1:05:45<44:48, 19.34s/it] {'loss': 0.0188, 'grad_norm': 0.20496729016304016, 'learning_rate': 7.747873235261156e-07, 'kl': 0.0344, 'entropy': -0.0361, 'ce_loss': 0.0217, 'epoch': 2.35} 59%|█████▊ | 197/336 [1:05:45<44:48, 19.34s/it] 59%|█████▉ | 198/336 [1:06:06<45:50, 19.93s/it] {'loss': 0.0166, 'grad_norm': 0.17408807575702667, 'learning_rate': 7.653798875380499e-07, 'kl': 0.0332, 'entropy': -0.0786, 'ce_loss': 0.0165, 'epoch': 2.36} 59%|█████▉ | 198/336 [1:06:06<45:50, 19.93s/it] 59%|█████▉ | 199/336 [1:06:30<47:48, 20.94s/it] {'loss': 0.0183, 'grad_norm': 0.18544034659862518, 'learning_rate': 7.559943742923625e-07, 'kl': 0.0057, 'entropy': -0.0054, 'ce_loss': 0.0043, 'epoch': 2.37} 59%|█████▉ | 199/336 [1:06:30<47:48, 20.94s/it] 60%|█████▉ | 200/336 [1:06:53<49:02, 21.63s/it] {'loss': 0.0144, 'grad_norm': 0.1591438204050064, 'learning_rate': 7.466316607649736e-07, 'kl': 0.0025, 'entropy': -0.0052, 'ce_loss': 0.0053, 'epoch': 2.38} 60%|█████▉ | 200/336 
[1:06:53<49:02, 21.63s/it] 60%|█████▉ | 201/336 [1:07:13<47:44, 21.22s/it] {'loss': 0.0128, 'grad_norm': 0.16107751429080963, 'learning_rate': 7.372926218014131e-07, 'kl': 0.0581, 'entropy': -0.0255, 'ce_loss': 0.0278, 'epoch': 2.39} 60%|█████▉ | 201/336 [1:07:13<47:44, 21.22s/it] 60%|██████ | 202/336 [1:07:31<45:16, 20.27s/it] {'loss': 0.0223, 'grad_norm': 0.20544856786727905, 'learning_rate': 7.279781300350757e-07, 'kl': 0.1826, 'entropy': -0.0947, 'ce_loss': 0.0102, 'epoch': 2.4} 60%|██████ | 202/336 [1:07:31<45:16, 20.27s/it] 60%|██████ | 203/336 [1:07:50<43:41, 19.71s/it] {'loss': 0.0187, 'grad_norm': 0.20159998536109924, 'learning_rate': 7.186890558056836e-07, 'kl': 0.0139, 'entropy': -0.0359, 'ce_loss': 0.0137, 'epoch': 2.42} 60%|██████ | 203/336 [1:07:50<43:41, 19.71s/it] 61%|██████ | 204/336 [1:08:08<42:19, 19.24s/it] {'loss': 0.0176, 'grad_norm': 0.1796092987060547, 'learning_rate': 7.09426267077961e-07, 'kl': 0.0388, 'entropy': -0.0219, 'ce_loss': 0.0181, 'epoch': 2.43} 61%|██████ | 204/336 [1:08:08<42:19, 19.24s/it] 61%|██████ | 205/336 [1:08:26<41:27, 18.99s/it] {'loss': 0.0164, 'grad_norm': 0.16899417340755463, 'learning_rate': 7.001906293605329e-07, 'kl': 0.0669, 'entropy': -0.0874, 'ce_loss': 0.0111, 'epoch': 2.44} 61%|██████ | 205/336 [1:08:26<41:27, 18.99s/it] 61%|██████▏ | 206/336 [1:08:48<42:43, 19.72s/it] {'loss': 0.0153, 'grad_norm': 0.16512706875801086, 'learning_rate': 6.909830056250526e-07, 'kl': 0.0265, 'entropy': -0.0004, 'ce_loss': 0.004, 'epoch': 2.45} 61%|██████▏ | 206/336 [1:08:48<42:43, 19.72s/it] 62%|██████▏ | 207/336 [1:09:06<41:42, 19.40s/it] {'loss': 0.0201, 'grad_norm': 0.186506986618042, 'learning_rate': 6.81804256225567e-07, 'kl': 0.0571, 'entropy': -0.1001, 'ce_loss': 0.0062, 'epoch': 2.46} 62%|██████▏ | 207/336 [1:09:06<41:42, 19.40s/it] 62%|██████▏ | 208/336 [1:09:33<46:06, 21.61s/it] {'loss': 0.0228, 'grad_norm': 0.1702680140733719, 'learning_rate': 6.726552388181233e-07, 'kl': 0.0232, 'entropy': -0.0005, 'ce_loss': 
0.0116, 'epoch': 2.48} 62%|██████▏ | 208/336 [1:09:33<46:06, 21.61s/it] 62%|██████▏ | 209/336 [1:09:54<45:30, 21.50s/it] {'loss': 0.0168, 'grad_norm': 0.26210325956344604, 'learning_rate': 6.63536808280633e-07, 'kl': 0.0488, 'entropy': 0.0199, 'ce_loss': 0.0123, 'epoch': 2.49} 62%|██████▏ | 209/336 [1:09:54<45:30, 21.50s/it] 62%|██████▎ | 210/336 [1:10:13<43:05, 20.52s/it] {'loss': 0.0179, 'grad_norm': 0.1930081844329834, 'learning_rate': 6.544498166329912e-07, 'kl': 0.004, 'entropy': -0.0391, 'ce_loss': 0.0282, 'epoch': 2.5} 62%|██████▎ | 210/336 [1:10:13<43:05, 20.52s/it] 63%|██████▎ | 211/336 [1:10:31<41:15, 19.80s/it] {'loss': 0.0189, 'grad_norm': 0.1929868459701538, 'learning_rate': 6.453951129574643e-07, 'kl': 0.0864, 'entropy': -0.0947, 'ce_loss': 0.0092, 'epoch': 2.51} 63%|██████▎ | 211/336 [1:10:31<41:15, 19.80s/it] 63%|██████▎ | 212/336 [1:10:50<40:34, 19.63s/it] {'loss': 0.0248, 'grad_norm': 0.23292659223079681, 'learning_rate': 6.363735433193529e-07, 'kl': 0.043, 'entropy': -0.0859, 'ce_loss': 0.0103, 'epoch': 2.52} 63%|██████▎ | 212/336 [1:10:50<40:34, 19.63s/it] 63%|██████▎ | 213/336 [1:11:09<39:44, 19.39s/it] {'loss': 0.0192, 'grad_norm': 0.20558880269527435, 'learning_rate': 6.273859506879364e-07, 'kl': 0.0204, 'entropy': -0.0549, 'ce_loss': 0.0075, 'epoch': 2.54} 63%|██████▎ | 213/336 [1:11:09<39:44, 19.39s/it] 64%|██████▎ | 214/336 [1:11:30<40:45, 20.04s/it] {'loss': 0.0197, 'grad_norm': 0.22434385120868683, 'learning_rate': 6.18433174857705e-07, 'kl': 0.0645, 'entropy': -0.054, 'ce_loss': 0.0103, 'epoch': 2.55} 64%|██████▎ | 214/336 [1:11:30<40:45, 20.04s/it] 64%|██████▍ | 215/336 [1:11:52<41:22, 20.52s/it] {'loss': 0.0173, 'grad_norm': 0.1981441229581833, 'learning_rate': 6.095160523698912e-07, 'kl': 0.0122, 'entropy': -0.1152, 'ce_loss': 0.027, 'epoch': 2.56} 64%|██████▍ | 215/336 [1:11:52<41:22, 20.52s/it] 64%|██████▍ | 216/336 [1:12:16<43:16, 21.63s/it] {'loss': 0.015, 'grad_norm': 0.17937667667865753, 'learning_rate': 6.006354164343046e-07, 
'kl': 0.032, 'entropy': -0.0144, 'ce_loss': 0.0291, 'epoch': 2.57} 64%|██████▍ | 216/336 [1:12:16<43:16, 21.63s/it] 65%|██████▍ | 217/336 [1:12:37<42:36, 21.48s/it] {'loss': 0.0198, 'grad_norm': 0.2788499891757965, 'learning_rate': 5.917920968514751e-07, 'kl': 0.1118, 'entropy': -0.0835, 'ce_loss': 0.0068, 'epoch': 2.58} 65%|██████▍ | 217/336 [1:12:37<42:36, 21.48s/it] 65%|██████▍ | 218/336 [1:12:59<42:06, 21.42s/it] {'loss': 0.0167, 'grad_norm': 0.16743479669094086, 'learning_rate': 5.829869199351187e-07, 'kl': 0.0376, 'entropy': -0.0684, 'ce_loss': 0.0244, 'epoch': 2.6} 65%|██████▍ | 218/336 [1:12:59<42:06, 21.42s/it] 65%|██████▌ | 219/336 [1:13:22<42:51, 21.98s/it] {'loss': 0.0145, 'grad_norm': 0.1557641625404358, 'learning_rate': 5.742207084349273e-07, 'kl': 0.0457, 'entropy': -0.0215, 'ce_loss': 0.013, 'epoch': 2.61} 65%|██████▌ | 219/336 [1:13:22<42:51, 21.98s/it] 65%|██████▌ | 220/336 [1:13:40<40:22, 20.88s/it] {'loss': 0.0167, 'grad_norm': 0.18459084630012512, 'learning_rate': 5.654942814596901e-07, 'kl': 0.0693, 'entropy': -0.0688, 'ce_loss': 0.0218, 'epoch': 2.62} 65%|██████▌ | 220/336 [1:13:40<40:22, 20.88s/it] 66%|██████▌ | 221/336 [1:14:01<39:48, 20.77s/it] {'loss': 0.0171, 'grad_norm': 0.18660375475883484, 'learning_rate': 5.568084544007588e-07, 'kl': 0.0547, 'entropy': -0.0596, 'ce_loss': 0.0097, 'epoch': 2.63} 66%|██████▌ | 221/336 [1:14:01<39:48, 20.77s/it] 66%|██████▌ | 222/336 [1:14:19<38:16, 20.15s/it] {'loss': 0.0155, 'grad_norm': 0.17597365379333496, 'learning_rate': 5.48164038855855e-07, 'kl': 0.0115, 'entropy': 0.0092, 'ce_loss': 0.0086, 'epoch': 2.64} 66%|██████▌ | 222/336 [1:14:19<38:16, 20.15s/it] 66%|██████▋ | 223/336 [1:14:38<37:05, 19.69s/it] {'loss': 0.0152, 'grad_norm': 0.1828441172838211, 'learning_rate': 5.395618425532389e-07, 'kl': 0.0006, 'entropy': -0.0103, 'ce_loss': 0.0103, 'epoch': 2.65} 66%|██████▋ | 223/336 [1:14:38<37:05, 19.69s/it] 67%|██████▋ | 224/336 [1:14:59<37:30, 20.09s/it] {'loss': 0.0139, 'grad_norm': 
0.16057166457176208, 'learning_rate': 5.310026692762314e-07, 'kl': 0.0618, 'entropy': -0.0459, 'ce_loss': 0.0045, 'epoch': 2.67} 67%|██████▋ | 224/336 [1:14:59<37:30, 20.09s/it] 67%|██████▋ | 225/336 [1:15:17<36:13, 19.58s/it] {'loss': 0.0206, 'grad_norm': 0.20677301287651062, 'learning_rate': 5.224873187881136e-07, 'kl': 0.0493, 'entropy': -0.0583, 'ce_loss': 0.022, 'epoch': 2.68} 67%|██████▋ | 225/336 [1:15:17<36:13, 19.58s/it] 67%|██████▋ | 226/336 [1:15:36<35:10, 19.19s/it] {'loss': 0.016, 'grad_norm': 0.17295347154140472, 'learning_rate': 5.140165867573939e-07, 'kl': 0.0496, 'entropy': -0.0613, 'ce_loss': 0.006, 'epoch': 2.69} 67%|██████▋ | 226/336 [1:15:36<35:10, 19.19s/it] 68%|██████▊ | 227/336 [1:15:54<34:35, 19.04s/it] {'loss': 0.0171, 'grad_norm': 0.18624082207679749, 'learning_rate': 5.055912646834635e-07, 'kl': 0.0215, 'entropy': -0.0209, 'ce_loss': 0.0121, 'epoch': 2.7} 68%|██████▊ | 227/336 [1:15:54<34:35, 19.04s/it] 68%|██████▊ | 228/336 [1:16:17<36:25, 20.23s/it] {'loss': 0.0139, 'grad_norm': 0.1567946821451187, 'learning_rate': 4.972121398226371e-07, 'kl': 0.033, 'entropy': -0.0391, 'ce_loss': 0.011, 'epoch': 2.71} 68%|██████▊ | 228/336 [1:16:17<36:25, 20.23s/it] 68%|██████▊ | 229/336 [1:16:36<35:01, 19.64s/it] {'loss': 0.0199, 'grad_norm': 0.1982155740261078, 'learning_rate': 4.888799951145947e-07, 'kl': 0.0299, 'entropy': -0.0845, 'ce_loss': 0.0137, 'epoch': 2.73} 68%|██████▊ | 229/336 [1:16:36<35:01, 19.64s/it] 68%|██████▊ | 230/336 [1:16:54<34:09, 19.34s/it] {'loss': 0.0174, 'grad_norm': 0.19088280200958252, 'learning_rate': 4.805956091092227e-07, 'kl': 0.0203, 'entropy': -0.0518, 'ce_loss': 0.0135, 'epoch': 2.74} 68%|██████▊ | 230/336 [1:16:54<34:09, 19.34s/it] 69%|██████▉ | 231/336 [1:17:12<33:15, 19.00s/it] {'loss': 0.023, 'grad_norm': 0.2482292652130127, 'learning_rate': 4.7235975589386713e-07, 'kl': 0.0508, 'entropy': -0.083, 'ce_loss': 0.0163, 'epoch': 2.75} 69%|██████▉ | 231/336 [1:17:12<33:15, 19.00s/it] 69%|██████▉ | 232/336 
[1:17:31<32:45, 18.90s/it] {'loss': 0.0176, 'grad_norm': 0.18963736295700073, 'learning_rate': 4.641732050210031e-07, 'kl': 0.0703, 'entropy': -0.0378, 'ce_loss': 0.0164, 'epoch': 2.76} 69%|██████▉ | 232/336 [1:17:31<32:45, 18.90s/it] 69%|██████▉ | 233/336 [1:17:50<32:19, 18.83s/it] {'loss': 0.0159, 'grad_norm': 0.19031205773353577, 'learning_rate': 4.5603672143632945e-07, 'kl': 0.0151, 'entropy': -0.0479, 'ce_loss': 0.0189, 'epoch': 2.77} 69%|██████▉ | 233/336 [1:17:50<32:19, 18.83s/it] 70%|██████▉ | 234/336 [1:18:13<33:58, 19.99s/it] {'loss': 0.0165, 'grad_norm': 0.18193629384040833, 'learning_rate': 4.479510654072909e-07, 'kl': 0.0452, 'entropy': -0.0354, 'ce_loss': 0.0045, 'epoch': 2.79} 70%|██████▉ | 234/336 [1:18:13<33:58, 19.99s/it] 70%|██████▉ | 235/336 [1:18:34<34:20, 20.40s/it] {'loss': 0.0179, 'grad_norm': 0.18861278891563416, 'learning_rate': 4.399169924520403e-07, 'kl': 0.0359, 'entropy': -0.0457, 'ce_loss': 0.0173, 'epoch': 2.8} 70%|██████▉ | 235/336 [1:18:34<34:20, 20.40s/it] 70%|███████ | 236/336 [1:18:52<33:04, 19.84s/it] {'loss': 0.016, 'grad_norm': 0.3272123634815216, 'learning_rate': 4.3193525326884426e-07, 'kl': 0.0173, 'entropy': -0.0293, 'ce_loss': 0.0122, 'epoch': 2.81} 70%|███████ | 236/336 [1:18:52<33:04, 19.84s/it] 71%|███████ | 237/336 [1:19:11<32:06, 19.46s/it] {'loss': 0.0228, 'grad_norm': 0.22482118010520935, 'learning_rate': 4.240065936659374e-07, 'kl': 0.0444, 'entropy': -0.0942, 'ce_loss': 0.0173, 'epoch': 2.82} 71%|███████ | 237/336 [1:19:11<32:06, 19.46s/it] 71%|███████ | 238/336 [1:19:30<31:24, 19.23s/it] {'loss': 0.0176, 'grad_norm': 0.19276946783065796, 'learning_rate': 4.1613175449183446e-07, 'kl': 0.0496, 'entropy': -0.0786, 'ce_loss': 0.0106, 'epoch': 2.83} 71%|███████ | 238/336 [1:19:30<31:24, 19.23s/it] 71%|███████ | 239/336 [1:19:54<33:25, 20.67s/it] {'loss': 0.0165, 'grad_norm': 0.18611279129981995, 'learning_rate': 4.0831147156610676e-07, 'kl': 0.0339, 'entropy': -0.0674, 'ce_loss': 0.0079, 'epoch': 2.85} 71%|███████ | 
239/336 [1:19:54<33:25, 20.67s/it] 71%|███████▏ | 240/336 [1:20:17<34:21, 21.47s/it] {'loss': 0.0141, 'grad_norm': 0.17015516757965088, 'learning_rate': 4.0054647561062615e-07, 'kl': 0.0693, 'entropy': 0.0118, 'ce_loss': 0.0197, 'epoch': 2.86} 71%|███████▏ | 240/336 [1:20:17<34:21, 21.47s/it] 72%|███████▏ | 241/336 [1:20:35<32:31, 20.54s/it] {'loss': 0.0196, 'grad_norm': 0.19862358272075653, 'learning_rate': 3.928374921812888e-07, 'kl': 0.0454, 'entropy': -0.0625, 'ce_loss': 0.0053, 'epoch': 2.87} 72%|███████▏ | 241/336 [1:20:35<32:31, 20.54s/it] 72%|███████▏ | 242/336 [1:20:57<32:29, 20.74s/it] {'loss': 0.0185, 'grad_norm': 0.1786581426858902, 'learning_rate': 3.851852416002187e-07, 'kl': 0.0216, 'entropy': -0.0398, 'ce_loss': 0.0205, 'epoch': 2.88} 72%|███████▏ | 242/336 [1:20:57<32:29, 20.74s/it] 72%|███████▏ | 243/336 [1:21:18<32:13, 20.79s/it] {'loss': 0.0161, 'grad_norm': 0.19768285751342773, 'learning_rate': 3.7759043888846173e-07, 'kl': 0.0332, 'entropy': -0.0491, 'ce_loss': 0.0096, 'epoch': 2.89} 72%|███████▏ | 243/336 [1:21:18<32:13, 20.79s/it] 73%|███████▎ | 244/336 [1:21:39<32:06, 20.94s/it] {'loss': 0.0164, 'grad_norm': 0.18463560938835144, 'learning_rate': 3.7005379369917324e-07, 'kl': 0.0253, 'entropy': -0.0294, 'ce_loss': 0.0119, 'epoch': 2.9} 73%|███████▎ | 244/336 [1:21:39<32:06, 20.94s/it] 73%|███████▎ | 245/336 [1:21:57<30:34, 20.16s/it] {'loss': 0.0187, 'grad_norm': 0.19589222967624664, 'learning_rate': 3.625760102513102e-07, 'kl': 0.0559, 'entropy': -0.0254, 'ce_loss': 0.0084, 'epoch': 2.92} 73%|███████▎ | 245/336 [1:21:57<30:34, 20.16s/it] 73%|███████▎ | 246/336 [1:22:18<30:22, 20.25s/it] {'loss': 0.0167, 'grad_norm': 0.2014390528202057, 'learning_rate': 3.551577872638296e-07, 'kl': 0.085, 'entropy': -0.0869, 'ce_loss': 0.0135, 'epoch': 2.93} 73%|███████▎ | 246/336 [1:22:18<30:22, 20.25s/it] 74%|███████▎ | 247/336 [1:22:36<29:11, 19.68s/it] {'loss': 0.0222, 'grad_norm': 0.2570900619029999, 'learning_rate': 3.477998178903981e-07, 'kl': 0.0254, 
'entropy': -0.0072, 'ce_loss': 0.0083, 'epoch': 2.94} 74%|███████▎ | 247/336 [1:22:36<29:11, 19.68s/it] 74%|███████▍ | 248/336 [1:22:54<28:13, 19.24s/it] {'loss': 0.0202, 'grad_norm': 0.22248730063438416, 'learning_rate': 3.4050278965462763e-07, 'kl': 0.0192, 'entropy': -0.0113, 'ce_loss': 0.0097, 'epoch': 2.95} 74%|███████▍ | 248/336 [1:22:54<28:13, 19.24s/it] 74%|███████▍ | 249/336 [1:23:17<29:36, 20.42s/it] {'loss': 0.0149, 'grad_norm': 0.15645594894886017, 'learning_rate': 3.3326738438583114e-07, 'kl': 0.0292, 'entropy': -0.0679, 'ce_loss': 0.0154, 'epoch': 2.96} 74%|███████▍ | 249/336 [1:23:17<29:36, 20.42s/it] 74%|███████▍ | 250/336 [1:23:35<28:14, 19.71s/it] {'loss': 0.0184, 'grad_norm': 0.2130623161792755, 'learning_rate': 3.260942781553142e-07, 'kl': 0.0413, 'entropy': -0.0559, 'ce_loss': 0.0085, 'epoch': 2.98} 74%|███████▍ | 250/336 [1:23:35<28:14, 19.71s/it] 75%|███████▍ | 251/336 [1:23:54<27:18, 19.28s/it] {'loss': 0.0188, 'grad_norm': 0.20211167633533478, 'learning_rate': 3.189841412132027e-07, 'kl': 0.0143, 'entropy': -0.0359, 'ce_loss': 0.0103, 'epoch': 2.99} 75%|███████▍ | 251/336 [1:23:54<27:18, 19.28s/it] 75%|███████▌ | 252/336 [1:24:15<27:43, 19.81s/it] {'loss': 0.0175, 'grad_norm': 0.21536721289157867, 'learning_rate': 3.1193763792581594e-07, 'kl': 0.0267, 'entropy': -0.0317, 'ce_loss': 0.0074, 'epoch': 3.0} 75%|███████▌ | 252/336 [1:24:15<27:43, 19.81s/it] 75%|███████▌ | 253/336 [1:24:36<27:49, 20.11s/it] {'loss': 0.0118, 'grad_norm': 0.14527536928653717, 'learning_rate': 3.0495542671358744e-07, 'kl': 0.0297, 'entropy': -0.1079, 'ce_loss': 0.0125, 'epoch': 3.01} 75%|███████▌ | 253/336 [1:24:36<27:49, 20.11s/it] 76%|███████▌ | 254/336 [1:24:54<26:52, 19.66s/it] {'loss': 0.0189, 'grad_norm': 0.17946068942546844, 'learning_rate': 2.980381599895433e-07, 'kl': 0.0413, 'entropy': -0.042, 'ce_loss': 0.006, 'epoch': 3.02} 76%|███████▌ | 254/336 [1:24:54<26:52, 19.66s/it] 76%|███████▌ | 255/336 [1:25:13<26:19, 19.50s/it] {'loss': 0.016, 'grad_norm': 
0.17064058780670166, 'learning_rate': 2.91186484098342e-07, 'kl': 0.0454, 'entropy': -0.0742, 'ce_loss': 0.0195, 'epoch': 3.04} 76%|███████▌ | 255/336 [1:25:13<26:19, 19.50s/it] 76%|███████▌ | 256/336 [1:25:32<25:40, 19.26s/it] {'loss': 0.0173, 'grad_norm': 0.18997728824615479, 'learning_rate': 2.84401039255879e-07, 'kl': 0.106, 'entropy': -0.1602, 'ce_loss': 0.01, 'epoch': 3.05} 76%|███████▌ | 256/336 [1:25:32<25:40, 19.26s/it] 76%|███████▋ | 257/336 [1:25:53<26:08, 19.86s/it] {'loss': 0.0137, 'grad_norm': 0.16442851722240448, 'learning_rate': 2.776824594894661e-07, 'kl': 0.0289, 'entropy': -0.063, 'ce_loss': 0.0069, 'epoch': 3.06} 76%|███████▋ | 257/336 [1:25:53<26:08, 19.86s/it] 77%|███████▋ | 258/336 [1:26:14<25:59, 20.00s/it] {'loss': 0.0139, 'grad_norm': 0.16608139872550964, 'learning_rate': 2.7103137257858863e-07, 'kl': 0.0045, 'entropy': -0.0039, 'ce_loss': 0.007, 'epoch': 3.07} 77%|███████▋ | 258/336 [1:26:14<25:59, 20.00s/it] 77%|███████▋ | 259/336 [1:26:35<26:12, 20.42s/it] {'loss': 0.0139, 'grad_norm': 0.13842828571796417, 'learning_rate': 2.644483999962449e-07, 'kl': -0.0026, 'entropy': -0.0325, 'ce_loss': 0.0261, 'epoch': 3.08} 77%|███████▋ | 259/336 [1:26:35<26:12, 20.42s/it] 77%|███████▋ | 260/336 [1:26:53<25:04, 19.79s/it] {'loss': 0.0142, 'grad_norm': 0.15147317945957184, 'learning_rate': 2.579341568508779e-07, 'kl': 0.0221, 'entropy': -0.0396, 'ce_loss': 0.016, 'epoch': 3.1} 77%|███████▋ | 260/336 [1:26:53<25:04, 19.79s/it] 78%|███████▊ | 261/336 [1:27:12<24:13, 19.37s/it] {'loss': 0.0178, 'grad_norm': 0.1974438577890396, 'learning_rate': 2.514892518288988e-07, 'kl': 0.041, 'entropy': -0.0135, 'ce_loss': 0.0069, 'epoch': 3.11} 78%|███████▊ | 261/336 [1:27:12<24:13, 19.37s/it] 78%|███████▊ | 262/336 [1:27:30<23:39, 19.18s/it] {'loss': 0.0181, 'grad_norm': 0.317388653755188, 'learning_rate': 2.4511428713781236e-07, 'kl': -0.0018, 'entropy': -0.0569, 'ce_loss': 0.0194, 'epoch': 3.12} 78%|███████▊ | 262/336 [1:27:30<23:39, 19.18s/it] 78%|███████▊ | 
263/336 [1:27:52<24:12, 19.90s/it] {'loss': 0.0137, 'grad_norm': 0.16860726475715637, 'learning_rate': 2.3880985844994673e-07, 'kl': 0.061, 'entropy': -0.0957, 'ce_loss': 0.0109, 'epoch': 3.13} 78%|███████▊ | 263/336 [1:27:52<24:12, 19.90s/it] 79%|███████▊ | 264/336 [1:28:13<24:22, 20.32s/it] {'loss': 0.013, 'grad_norm': 0.14538967609405518, 'learning_rate': 2.3257655484679372e-07, 'kl': 0.0212, 'entropy': -0.0178, 'ce_loss': 0.0202, 'epoch': 3.14} 79%|███████▊ | 264/336 [1:28:13<24:22, 20.32s/it] 79%|███████▉ | 265/336 [1:28:32<23:26, 19.81s/it] {'loss': 0.0149, 'grad_norm': 0.16616979241371155, 'learning_rate': 2.264149587639671e-07, 'kl': 0.0432, 'entropy': -0.0864, 'ce_loss': 0.0044, 'epoch': 3.15} 79%|███████▉ | 265/336 [1:28:32<23:26, 19.81s/it] 79%|███████▉ | 266/336 [1:28:51<22:42, 19.47s/it] {'loss': 0.0135, 'grad_norm': 0.15596073865890503, 'learning_rate': 2.2032564593677772e-07, 'kl': 0.0522, 'entropy': -0.0242, 'ce_loss': 0.0035, 'epoch': 3.17} 79%|███████▉ | 266/336 [1:28:51<22:42, 19.47s/it] 79%|███████▉ | 267/336 [1:29:09<22:10, 19.29s/it] {'loss': 0.0169, 'grad_norm': 0.1640929877758026, 'learning_rate': 2.1430918534643994e-07, 'kl': 0.0481, 'entropy': -0.033, 'ce_loss': 0.0063, 'epoch': 3.18} 79%|███████▉ | 267/336 [1:29:09<22:10, 19.29s/it] 80%|███████▉ | 268/336 [1:29:31<22:28, 19.83s/it] {'loss': 0.0131, 'grad_norm': 0.1770205944776535, 'learning_rate': 2.0836613916690427e-07, 'kl': 0.0884, 'entropy': -0.1172, 'ce_loss': 0.006, 'epoch': 3.19} 80%|███████▉ | 268/336 [1:29:31<22:28, 19.83s/it] 80%|████████ | 269/336 [1:29:51<22:30, 20.16s/it] {'loss': 0.0148, 'grad_norm': 0.1883751004934311, 'learning_rate': 2.0249706271232946e-07, 'kl': 0.0233, 'entropy': -0.0005, 'ce_loss': 0.0048, 'epoch': 3.2} 80%|████████ | 269/336 [1:29:51<22:30, 20.16s/it] 80%|████████ | 270/336 [1:30:12<22:20, 20.30s/it] {'loss': 0.0139, 'grad_norm': 0.18343131244182587, 'learning_rate': 1.9670250438519386e-07, 'kl': 0.0457, 'entropy': -0.0236, 'ce_loss': 0.023, 'epoch': 
3.21} 80%|████████ | 270/336 [1:30:12<22:20, 20.30s/it] 81%|████████ | 271/336 [1:30:33<22:19, 20.60s/it] {'loss': 0.0155, 'grad_norm': 0.16689811646938324, 'learning_rate': 1.9098300562505264e-07, 'kl': 0.0674, 'entropy': -0.0442, 'ce_loss': 0.0139, 'epoch': 3.23} 81%|████████ | 271/336 [1:30:33<22:19, 20.60s/it] 81%|████████ | 272/336 [1:30:55<22:08, 20.76s/it] {'loss': 0.0138, 'grad_norm': 0.16900287568569183, 'learning_rate': 1.8533910085794713e-07, 'kl': 0.0576, 'entropy': -0.0806, 'ce_loss': 0.0091, 'epoch': 3.24} 81%|████████ | 272/336 [1:30:55<22:08, 20.76s/it] 81%|████████▏ | 273/336 [1:31:18<22:46, 21.69s/it] {'loss': 0.0128, 'grad_norm': 0.15000148117542267, 'learning_rate': 1.7977131744646724e-07, 'kl': 0.0669, 'entropy': -0.0605, 'ce_loss': 0.0046, 'epoch': 3.25} 81%|████████▏ | 273/336 [1:31:18<22:46, 21.69s/it] 82%|████████▏ | 274/336 [1:31:37<21:27, 20.76s/it] {'loss': 0.0166, 'grad_norm': 0.1860027015209198, 'learning_rate': 1.742801756404759e-07, 'kl': 0.0786, 'entropy': -0.1001, 'ce_loss': 0.0023, 'epoch': 3.26} 82%|████████▏ | 274/336 [1:31:37<21:27, 20.76s/it] 82%|████████▏ | 275/336 [1:31:55<20:22, 20.03s/it] {'loss': 0.016, 'grad_norm': 0.17286522686481476, 'learning_rate': 1.688661885284972e-07, 'kl': 0.0486, 'entropy': -0.0186, 'ce_loss': 0.0045, 'epoch': 3.27} 82%|████████▏ | 275/336 [1:31:55<20:22, 20.03s/it] 82%|████████▏ | 276/336 [1:32:14<19:36, 19.61s/it] {'loss': 0.0151, 'grad_norm': 0.17063429951667786, 'learning_rate': 1.6352986198977325e-07, 'kl': 0.0374, 'entropy': -0.0718, 'ce_loss': 0.0241, 'epoch': 3.29} 82%|████████▏ | 276/336 [1:32:14<19:36, 19.61s/it] 82%|████████▏ | 277/336 [1:32:32<18:56, 19.26s/it] {'loss': 0.0135, 'grad_norm': 0.16551153361797333, 'learning_rate': 1.5827169464699575e-07, 'kl': 0.0112, 'entropy': -0.0457, 'ce_loss': 0.0107, 'epoch': 3.3} 82%|████████▏ | 277/336 [1:32:32<18:56, 19.26s/it] 83%|████████▎ | 278/336 [1:32:54<19:09, 19.83s/it] {'loss': 0.0122, 'grad_norm': 0.15877796709537506, 'learning_rate': 
1.5309217781971416e-07, 'kl': 0.0569, 'entropy': -0.0522, 'ce_loss': 0.0122, 'epoch': 3.31} 83%|████████▎ | 278/336 [1:32:54<19:09, 19.83s/it] 83%|████████▎ | 279/336 [1:33:20<20:39, 21.75s/it] {'loss': 0.0117, 'grad_norm': 0.13788071274757385, 'learning_rate': 1.479917954784282e-07, 'kl': 0.0525, 'entropy': -0.1108, 'ce_loss': 0.0119, 'epoch': 3.32} 83%|████████▎ | 279/336 [1:33:20<20:39, 21.75s/it] 83%|████████▎ | 280/336 [1:33:38<19:23, 20.77s/it] {'loss': 0.0151, 'grad_norm': 0.19752632081508636, 'learning_rate': 1.429710241993656e-07, 'kl': 0.063, 'entropy': 0.003, 'ce_loss': 0.0132, 'epoch': 3.33} 83%|████████▎ | 280/336 [1:33:38<19:23, 20.77s/it] 84%|████████▎ | 281/336 [1:33:58<18:43, 20.43s/it] {'loss': 0.0148, 'grad_norm': 0.16931654512882233, 'learning_rate': 1.380303331199507e-07, 'kl': 0.0425, 'entropy': -0.0645, 'ce_loss': 0.0144, 'epoch': 3.35} 84%|████████▎ | 281/336 [1:33:58<18:43, 20.43s/it] 84%|████████▍ | 282/336 [1:34:19<18:37, 20.70s/it] {'loss': 0.0124, 'grad_norm': 0.1510416567325592, 'learning_rate': 1.3317018389496926e-07, 'kl': 0.0396, 'entropy': -0.0649, 'ce_loss': 0.0107, 'epoch': 3.36} 84%|████████▍ | 282/336 [1:34:19<18:37, 20.70s/it] 84%|████████▍ | 283/336 [1:34:39<17:58, 20.36s/it] {'loss': 0.0139, 'grad_norm': 0.17398256063461304, 'learning_rate': 1.283910306534308e-07, 'kl': -0.0009, 'entropy': -0.0188, 'ce_loss': 0.0253, 'epoch': 3.37} 84%|████████▍ | 283/336 [1:34:39<17:58, 20.36s/it] 85%|████████▍ | 284/336 [1:34:57<17:07, 19.75s/it] {'loss': 0.0147, 'grad_norm': 0.15433430671691895, 'learning_rate': 1.2369331995613663e-07, 'kl': 0.0018, 'entropy': 0.0128, 'ce_loss': 0.0137, 'epoch': 3.38} 85%|████████▍ | 284/336 [1:34:57<17:07, 19.75s/it] 85%|████████▍ | 285/336 [1:35:15<16:26, 19.34s/it] {'loss': 0.0186, 'grad_norm': 0.20888136327266693, 'learning_rate': 1.1907749075395146e-07, 'kl': 0.0659, 'entropy': -0.064, 'ce_loss': 0.0134, 'epoch': 3.39} 85%|████████▍ | 285/336 [1:35:15<16:26, 19.34s/it] 85%|████████▌ | 286/336 
[1:35:39<17:04, 20.50s/it] {'loss': 0.0126, 'grad_norm': 0.14132261276245117, 'learning_rate': 1.145439743467902e-07, 'kl': 0.0747, 'entropy': -0.0144, 'ce_loss': 0.0046, 'epoch': 3.4} 85%|████████▌ | 286/336 [1:35:39<17:04, 20.50s/it] 85%|████████▌ | 287/336 [1:36:00<16:57, 20.76s/it] {'loss': 0.0153, 'grad_norm': 0.22063836455345154, 'learning_rate': 1.1009319434331621e-07, 'kl': 0.0261, 'entropy': -0.0386, 'ce_loss': 0.0075, 'epoch': 3.42} 85%|████████▌ | 287/336 [1:36:00<16:57, 20.76s/it] 86%|████████▌ | 288/336 [1:36:22<16:46, 20.98s/it] {'loss': 0.014, 'grad_norm': 0.17454616725444794, 'learning_rate': 1.0572556662136035e-07, 'kl': 0.0221, 'entropy': -0.0469, 'ce_loss': 0.0083, 'epoch': 3.43} 86%|████████▌ | 288/336 [1:36:22<16:46, 20.98s/it] 86%|████████▌ | 289/336 [1:36:40<15:47, 20.16s/it] {'loss': 0.0191, 'grad_norm': 0.20339645445346832, 'learning_rate': 1.014414992890611e-07, 'kl': 0.0859, 'entropy': -0.0376, 'ce_loss': 0.0065, 'epoch': 3.44} 86%|████████▌ | 289/336 [1:36:40<15:47, 20.16s/it] 86%|████████▋ | 290/336 [1:37:01<15:39, 20.43s/it] {'loss': 0.0158, 'grad_norm': 0.1846119910478592, 'learning_rate': 9.724139264673114e-08, 'kl': 0.0325, 'entropy': -0.0801, 'ce_loss': 0.001, 'epoch': 3.45} 86%|████████▋ | 290/336 [1:37:01<15:39, 20.43s/it] 87%|████████▋ | 291/336 [1:37:22<15:29, 20.65s/it] {'loss': 0.0113, 'grad_norm': 0.14021579921245575, 'learning_rate': 9.312563914945459e-08, 'kl': 0.0723, 'entropy': -0.0723, 'ce_loss': 0.0065, 'epoch': 3.46} 87%|████████▋ | 291/336 [1:37:22<15:29, 20.65s/it] 87%|████████▋ | 292/336 [1:37:43<15:06, 20.60s/it] {'loss': 0.0178, 'grad_norm': 0.1994849294424057, 'learning_rate': 8.909462337041507e-08, 'kl': 0.0359, 'entropy': -0.019, 'ce_loss': 0.0083, 'epoch': 3.48} 87%|████████▋ | 292/336 [1:37:43<15:06, 20.60s/it] 87%|████████▋ | 293/336 [1:38:04<14:55, 20.82s/it] {'loss': 0.014, 'grad_norm': 0.15486009418964386, 'learning_rate': 8.514872196496181e-08, 'kl': 0.042, 'entropy': -0.0503, 'ce_loss': 0.0015, 
'epoch': 3.49} 293/336 [1:38:04<14:55, 20.82s/it]
294/336 [1:38:22<14:03, 20.09s/it] {'loss': 0.0162, 'grad_norm': 0.18833573162555695, 'learning_rate': 8.128830363541572e-08, 'kl': 0.0591, 'entropy': -0.0469, 'ce_loss': 0.0057, 'epoch': 3.5}
295/336 [1:38:43<13:54, 20.36s/it] {'loss': 0.0161, 'grad_norm': 0.18126247823238373, 'learning_rate': 7.751372909661768e-08, 'kl': 0.0039, 'entropy': -0.0086, 'ce_loss': 0.007, 'epoch': 3.51}
296/336 [1:39:02<13:13, 19.83s/it] {'loss': 0.0116, 'grad_norm': 0.14697256684303284, 'learning_rate': 7.382535104222364e-08, 'kl': -0.0009, 'entropy': -0.0466, 'ce_loss': 0.0171, 'epoch': 3.52}
297/336 [1:39:22<12:59, 19.98s/it] {'loss': 0.0138, 'grad_norm': 0.1563737690448761, 'learning_rate': 7.022351411174865e-08, 'kl': -0.0063, 'entropy': -0.0347, 'ce_loss': 0.0151, 'epoch': 3.54}
298/336 [1:39:44<12:59, 20.52s/it] {'loss': 0.0127, 'grad_norm': 0.1611286848783493, 'learning_rate': 6.670855485836524e-08, 'kl': 0.1201, 'entropy': -0.1152, 'ce_loss': 0.0081, 'epoch': 3.55}
299/336 [1:40:06<12:55, 20.96s/it] {'loss': 0.0142, 'grad_norm': 0.1811935156583786, 'learning_rate': 6.328080171745509e-08, 'kl': 0.0234, 'entropy': -0.0415, 'ce_loss': 0.0107, 'epoch': 3.56}
300/336 [1:40:25<12:09, 20.25s/it] {'loss': 0.0159, 'grad_norm': 0.1792113482952118, 'learning_rate': 5.994057497592031e-08, 'kl': 0.0845, 'entropy': -0.052, 'ce_loss': 0.0054, 'epoch': 3.57}
301/336 [1:40:43<11:31, 19.75s/it] {'loss': 0.0158, 'grad_norm': 0.18118223547935486, 'learning_rate': 5.6688186742256835e-08, 'kl': 0.0525, 'entropy': -0.0776, 'ce_loss': 0.0216, 'epoch': 3.58}
302/336 [1:41:02<11:02, 19.50s/it] {'loss': 0.019, 'grad_norm': 0.15603701770305634, 'learning_rate': 5.352394091739021e-08, 'kl': 0.0442, 'entropy': -0.104, 'ce_loss': 0.009, 'epoch': 3.6}
303/336 [1:41:23<10:58, 19.95s/it] {'loss': 0.0154, 'grad_norm': 0.1694929301738739, 'learning_rate': 5.0448133166279935e-08, 'kl': 0.0334, 'entropy': -0.0791, 'ce_loss': 0.0075, 'epoch': 3.61}
304/336 [1:41:42<10:25, 19.55s/it] {'loss': 0.0161, 'grad_norm': 0.1850585788488388, 'learning_rate': 4.746105089029229e-08, 'kl': 0.0605, 'entropy': -0.0148, 'ce_loss': 0.0057, 'epoch': 3.62}
305/336 [1:42:03<10:21, 20.04s/it] {'loss': 0.014, 'grad_norm': 0.1686019003391266, 'learning_rate': 4.456297320034641e-08, 'kl': 0.0544, 'entropy': -0.05, 'ce_loss': 0.0074, 'epoch': 3.63}
306/336 [1:42:21<09:45, 19.52s/it] {'loss': 0.0142, 'grad_norm': 0.17044714093208313, 'learning_rate': 4.1754170890833774e-08, 'kl': 0.033, 'entropy': -0.0557, 'ce_loss': 0.0175, 'epoch': 3.64}
307/336 [1:42:43<09:44, 20.14s/it] {'loss': 0.0104, 'grad_norm': 0.131026491522789, 'learning_rate': 3.9034906414315725e-08, 'kl': 0.017, 'entropy': -0.0297, 'ce_loss': 0.0245, 'epoch': 3.65}
308/336 [1:43:02<09:13, 19.75s/it] {'loss': 0.0151, 'grad_norm': 0.18852005898952484, 'learning_rate': 3.6405433856999676e-08, 'kl': 0.0635, 'entropy': -0.0364, 'ce_loss': 0.0115, 'epoch': 3.67}
309/336 [1:43:22<09:01, 20.07s/it] {'loss': 0.0168, 'grad_norm': 0.20448844134807587, 'learning_rate': 3.386599891499764e-08, 'kl': 0.061, 'entropy': -0.0703, 'ce_loss': 0.0112, 'epoch': 3.68}
310/336 [1:43:44<08:52, 20.50s/it] {'loss': 0.0119, 'grad_norm': 0.14956925809383392, 'learning_rate': 3.141683887136892e-08, 'kl': 0.0535, 'entropy': -0.041, 'ce_loss': 0.0092, 'epoch': 3.69}
311/336 [1:44:05<08:35, 20.62s/it] {'loss': 0.0152, 'grad_norm': 0.1799231916666031, 'learning_rate': 2.9058182573947986e-08, 'kl': 0.1021, 'entropy': -0.0461, 'ce_loss': 0.0214, 'epoch': 3.7}
312/336 [1:44:26<08:19, 20.82s/it] {'loss': 0.0151, 'grad_norm': 0.19520780444145203, 'learning_rate': 2.6790250413961546e-08, 'kl': 0.0334, 'entropy': -0.0417, 'ce_loss': 0.0158, 'epoch': 3.71}
313/336 [1:44:45<07:42, 20.12s/it] {'loss': 0.0193, 'grad_norm': 0.21787181496620178, 'learning_rate': 2.4613254305434815e-08, 'kl': 0.0217, 'entropy': -0.0391, 'ce_loss': 0.0141, 'epoch': 3.73}
314/336 [1:45:06<07:31, 20.50s/it] {'loss': 0.0134, 'grad_norm': 0.1614626944065094, 'learning_rate': 2.2527397665391024e-08, 'kl': 0.0659, 'entropy': -0.0776, 'ce_loss': 0.019, 'epoch': 3.74}
315/336 [1:45:24<06:57, 19.87s/it] {'loss': 0.0142, 'grad_norm': 0.17268434166908264, 'learning_rate': 2.053287539484405e-08, 'kl': 0.0317, 'entropy': -0.0193, 'ce_loss': 0.0077, 'epoch': 3.75}
316/336 [1:45:43<06:28, 19.44s/it] {'loss': 0.0167, 'grad_norm': 0.18626344203948975, 'learning_rate': 1.8629873860586564e-08, 'kl': 0.0183, 'entropy': -0.0559, 'ce_loss': 0.0449, 'epoch': 3.76}
317/336 [1:46:05<06:22, 20.15s/it] {'loss': 0.0125, 'grad_norm': 0.17596346139907837, 'learning_rate': 1.6818570877776718e-08, 'kl': 0.03, 'entropy': 0.0457, 'ce_loss': 0.0167, 'epoch': 3.77}
318/336 [1:46:26<06:11, 20.61s/it] {'loss': 0.0159, 'grad_norm': 0.17278118431568146, 'learning_rate': 1.5099135693322773e-08, 'kl': 0.0776, 'entropy': -0.0947, 'ce_loss': 0.0141, 'epoch': 3.79}
319/336 [1:46:45<05:39, 19.96s/it] {'loss': 0.019, 'grad_norm': 0.22613902390003204, 'learning_rate': 1.3471728970068985e-08, 'kl': 0.0708, 'entropy': -0.0552, 'ce_loss': 0.0094, 'epoch': 3.8}
320/336 [1:47:03<05:12, 19.52s/it] {'loss': 0.0147, 'grad_norm': 0.16876791417598724, 'learning_rate': 1.1936502771783486e-08, 'kl': 0.0645, 'entropy': -0.0427, 'ce_loss': 0.0086, 'epoch': 3.81}
321/336 [1:47:22<04:48, 19.27s/it] {'loss': 0.0141, 'grad_norm': 0.15639908611774445, 'learning_rate': 1.0493600548948877e-08, 'kl': 0.0295, 'entropy': -0.0139, 'ce_loss': 0.006, 'epoch': 3.82}
322/336 [1:47:40<04:26, 19.01s/it] {'loss': 0.0184, 'grad_norm': 0.20723900198936462, 'learning_rate': 9.143157125359513e-09, 'kl': 0.0996, 'entropy': -0.0776, 'ce_loss': 0.0067, 'epoch': 3.83}
323/336 [1:48:01<04:13, 19.48s/it] {'loss': 0.0155, 'grad_norm': 0.18778249621391296, 'learning_rate': 7.885298685522235e-09, 'kl': 0.0332, 'entropy': -0.0199, 'ce_loss': 0.0108, 'epoch': 3.85}
324/336 [1:48:20<03:51, 19.31s/it] {'loss': 0.0145, 'grad_norm': 0.16465556621551514, 'learning_rate': 6.720142762867032e-09, 'kl': 0.1641, 'entropy': -0.1025, 'ce_loss': 0.0075, 'epoch': 3.86}
325/336 [1:48:44<03:47, 20.69s/it] {'loss': 0.0106, 'grad_norm': 0.13748055696487427, 'learning_rate': 5.647798228764156e-09, 'kl': 0.0117, 'entropy': -0.0942, 'ce_loss': 0.0244, 'epoch': 3.87}
326/336 [1:49:02<03:20, 20.03s/it] {'loss': 0.0143, 'grad_norm': 0.16196677088737488, 'learning_rate': 4.668365282351372e-09, 'kl': 0.0273, 'entropy': -0.0723, 'ce_loss': 0.0153, 'epoch': 3.88}
327/336 [1:49:21<02:56, 19.56s/it] {'loss': 0.0152, 'grad_norm': 0.1792673021554947, 'learning_rate': 3.7819354411713355e-09, 'kl': 0.0854, 'entropy': -0.0586, 'ce_loss': 0.0078, 'epoch': 3.89}
328/336 [1:49:42<02:40, 20.12s/it] {'loss': 0.0143, 'grad_norm': 0.16059669852256775, 'learning_rate': 2.9885915326203216e-09, 'kl': -0.006, 'entropy': -0.0378, 'ce_loss': 0.0077, 'epoch': 3.9}
329/336 [1:50:00<02:17, 19.58s/it] {'loss': 0.0152, 'grad_norm': 0.16452479362487793, 'learning_rate': 2.2884076862089707e-09, 'kl': 0.0427, 'entropy': 0.0209, 'ce_loss': 0.0159, 'epoch': 3.92}
330/336 [1:50:22<02:01, 20.30s/it] {'loss': 0.0151, 'grad_norm': 0.1804715096950531, 'learning_rate': 1.6814493266357199e-09, 'kl': 0.0205, 'entropy': -0.0138, 'ce_loss': 0.022, 'epoch': 3.93}
331/336 [1:50:40<01:38, 19.64s/it] {'loss': 0.0152, 'grad_norm': 0.1733705848455429, 'learning_rate': 1.1677731676733581e-09, 'kl': 0.0366, 'entropy': -0.0176, 'ce_loss': 0.0107, 'epoch': 3.94}
332/336 [1:50:59<01:17, 19.26s/it] {'loss': 0.0142, 'grad_norm': 0.1726919561624527, 'learning_rate': 7.474272068698217e-10, 'kl': 0.0527, 'entropy': -0.0289, 'ce_loss': 0.002, 'epoch': 3.95}
333/336 [1:51:23<01:02, 20.75s/it] {'loss': 0.011, 'grad_norm': 0.14010635018348694, 'learning_rate': 4.204507210633368e-10, 'kl': 0.0703, 'entropy': -0.1045, 'ce_loss': 0.0008, 'epoch': 3.96}
334/336 [1:51:44<00:41, 20.89s/it] {'loss': 0.014, 'grad_norm': 0.1840401440858841, 'learning_rate': 1.8687426271246642e-10, 'kl': 0.0693, 'entropy': -0.0396, 'ce_loss': 0.011, 'epoch': 3.98}
335/336 [1:52:05<00:20, 20.96s/it] {'loss': 0.0146, 'grad_norm': 0.17463603615760803, 'learning_rate': 4.6719657041283115e-11, 'kl': 0.0386, 'entropy': -0.0884, 'ce_loss': 0.0102, 'epoch': 3.99}
336/336 [1:52:24<00:00, 20.17s/it] {'loss': 0.0141, 'grad_norm': 0.16217519342899323, 'learning_rate': 0.0, 'kl': 0.0505, 'entropy': -0.0674, 'ce_loss': 0.0039, 'epoch': 4.0}
[INFO|trainer.py:2665] 2025-04-15 03:34:51,297 >> Training completed.
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 6744.2643, 'train_samples_per_second': 1.594, 'train_steps_per_second': 0.05, 'train_loss': 0.02149970159821567, 'epoch': 4.0}
336/336 [1:52:24<00:00, 20.07s/it]
[INFO|trainer.py:3966] 2025-04-15 03:35:38,558 >> Saving model checkpoint to /home/stern/GRPO/offline_rl_v2/output
[INFO|configuration_utils.py:423] 2025-04-15 03:35:38,563 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/config.json
[INFO|configuration_utils.py:908] 2025-04-15 03:35:38,563 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/generation_config.json
[2025-04-15 03:35:52,229] [INFO] [launch.py:351:main] Process 1302309 exits successfully.
[2025-04-15 03:35:56,234] [INFO] [launch.py:351:main] Process 1302314 exits successfully.
[2025-04-15 03:36:00,238] [INFO] [launch.py:351:main] Process 1302312 exits successfully.
[2025-04-15 03:36:05,244] [INFO] [launch.py:351:main] Process 1302311 exits successfully.
[2025-04-15 03:36:10,249] [INFO] [launch.py:351:main] Process 1302310 exits successfully.
[2025-04-15 03:36:14,254] [INFO] [launch.py:351:main] Process 1302308 exits successfully.
[2025-04-15 03:36:19,259] [INFO] [launch.py:351:main] Process 1302313 exits successfully.
[INFO|modeling_utils.py:3594] 2025-04-15 03:37:22,907 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 14 checkpoint shards. You can find where each parameter has been saved in the index located at /home/stern/GRPO/offline_rl_v2/output/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-04-15 03:37:22,908 >> tokenizer config file saved in /home/stern/GRPO/offline_rl_v2/output/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-04-15 03:37:22,909 >> Special tokens file saved in /home/stern/GRPO/offline_rl_v2/output/special_tokens_map.json
***** train metrics *****
  epoch                    =        4.0
  total_flos               =    86240GF
  train_loss               =     0.0215
  train_runtime            = 1:52:24.26
  train_samples            =       2688
  train_samples_per_second =      1.594
  train_steps_per_second   =       0.05
[2025-04-15 03:37:37,339] [INFO] [launch.py:351:main] Process 1302307 exits successfully.
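For post-hoc analysis, the per-step metric dicts printed above can be pulled out of the raw console output with a short script. This is a minimal sketch, not part of the training run: the helper name `parse_metrics` and the regex are assumptions, and it relies only on the fact that HF Trainer logs each step as a flat Python dict literal starting with `'loss'`.

```python
import ast
import re

# Each Trainer step prints a flat Python dict literal such as
# {'loss': 0.0141, 'grad_norm': ..., 'learning_rate': 0.0, 'epoch': 4.0}.
# There are no nested braces, so matching up to the first '}' is enough.
METRICS_RE = re.compile(r"\{'loss':[^{}]*\}")

def parse_metrics(raw_log: str) -> list[dict]:
    """Return every per-step metrics dict found in raw console output."""
    return [ast.literal_eval(m.group(0)) for m in METRICS_RE.finditer(raw_log)]

# Example on a fragment of this log (dict copied from the final step):
log_fragment = (
    "336/336 [1:52:24<00:00, 20.17s/it] "
    "{'loss': 0.0141, 'grad_norm': 0.16217519342899323, 'learning_rate': 0.0, "
    "'kl': 0.0505, 'entropy': -0.0674, 'ce_loss': 0.0039, 'epoch': 4.0}"
)
steps = parse_metrics(log_fragment)
print(steps[0]["loss"], steps[0]["epoch"])  # 0.0141 4.0
```

The resulting list of dicts can be fed straight into a DataFrame or a plotting library to inspect `loss`, `kl`, and `entropy` curves without re-reading the TensorBoard event files.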