TFPI
Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
Note: Evaluation datasets used in the TFPI paper.
xx18/TFPI-DeepSeek-Qwen-1.5B-Stage1
Text Generation • 2B • Note: Initial Model: DeepSeek-Qwen-1.5B; TFPI Stage 1: train for 1K steps with response length 2048
xx18/TFPI-DeepSeek-Qwen-1.5B-Stage2
Text Generation • 2B • Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage1; TFPI Stage 2: train for 440 steps with response length 4096
xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3
Text Generation • 2B • Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage2; TFPI Stage 3: train for 440 steps with response length 8192
xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL
Text Generation • 2B • Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage3; Normal DAPO: train for 472 steps with response length 16K
xx18/DirectRL_DeepSeek-Qwen-1.5B_baseline1
Text Generation • 2B • Note: Initial Model: DeepSeek-Qwen-1.5B; Normal RLVR with DAPO: train for 456 steps with response length 16K
xx18/DirectRL_DeepSeek-Qwen-1.5B_baseline2
Text Generation • 2B • Note: Initial Model: DeepSeek-Qwen-1.5B; Normal RLVR with DAPO: train for 896 steps with response length 16K
xx18/TFPI-Qwen3-4B-Stage1
Text Generation • 4B • Note: Initial Model: Qwen3-4B; TFPI Stage 1: train for 100 steps with response length 4096
xx18/TFPI-Qwen3-4B-Stage2
Text Generation • 4B • Note: Initial Model: TFPI-Qwen3-4B-Stage1; TFPI Stage 2: train for 56 steps with response length 8K
xx18/TFPI-Qwen3-4B-Stage3
Text Generation • 4B • Note: Initial Model: TFPI-Qwen3-4B-Stage2; TFPI Stage 3: train for 64 steps with response length 16K
xx18/TFPI-Qwen3-4B-Stage3_then_RL
Text Generation • 4B • Note: Initial Model: TFPI-Qwen3-4B-Stage3; Normal DAPO: train for 192 steps with response length 32K
xx18/DirectRL_Qwen3-4B_baseline1
Text Generation • 4B • Note: Initial Model: Qwen3-4B; Normal RLVR with DAPO: train for 20 steps with response length 32K
xx18/DirectRL_Qwen3-4B_baseline2
Text Generation • 4B • Note: Initial Model: Qwen3-4B; Normal RLVR with DAPO: train for 216 steps with response length 32K
xx18/TFPI-Qwen3-4B-Thinking-2507-Stage3
Text Generation • 4B • Note: Initial Model: Qwen3-4B-Thinking-2507 after 2 stages of TFPI; TFPI Stage 3: train with response length 16K
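
The notes above describe each checkpoint as a stage that starts from the previous one. As a rough illustration only, the two fully specified chains (DeepSeek-Qwen-1.5B and Qwen3-4B) can be written down as plain data and checked for consistency. This is a minimal sketch, not the authors' training configuration: the `Stage` class, the `total_steps` and `is_chain` helpers, and the list names are hypothetical, and it assumes that "1K steps" means 1000 steps and that a response length of "16K" means 16 * 1024 tokens.

```python
from dataclasses import dataclass

K = 1024  # assumption: "8K", "16K", "32K" in the notes mean multiples of 1024 tokens


@dataclass(frozen=True)
class Stage:
    """One checkpoint from the listing (hypothetical helper type)."""
    repo: str     # Hub repository id, copied from the listing
    init: str     # initial model named in the note
    steps: int    # training steps named in the note
    max_len: int  # response length in tokens


# DeepSeek-Qwen-1.5B chain: three TFPI stages, then normal DAPO.
DEEPSEEK_1_5B = [
    Stage("xx18/TFPI-DeepSeek-Qwen-1.5B-Stage1", "DeepSeek-Qwen-1.5B", 1000, 2048),  # assumption: 1K = 1000
    Stage("xx18/TFPI-DeepSeek-Qwen-1.5B-Stage2", "TFPI-DeepSeek-Qwen-1.5B-Stage1", 440, 4096),
    Stage("xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3", "TFPI-DeepSeek-Qwen-1.5B-Stage2", 440, 8192),
    Stage("xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL", "TFPI-DeepSeek-Qwen-1.5B-Stage3", 472, 16 * K),
]

# Qwen3-4B chain: three TFPI stages, then normal DAPO.
QWEN3_4B = [
    Stage("xx18/TFPI-Qwen3-4B-Stage1", "Qwen3-4B", 100, 4096),
    Stage("xx18/TFPI-Qwen3-4B-Stage2", "TFPI-Qwen3-4B-Stage1", 56, 8 * K),
    Stage("xx18/TFPI-Qwen3-4B-Stage3", "TFPI-Qwen3-4B-Stage2", 64, 16 * K),
    Stage("xx18/TFPI-Qwen3-4B-Stage3_then_RL", "TFPI-Qwen3-4B-Stage3", 192, 32 * K),
]


def total_steps(stages):
    """Sum the training steps over a list of stages."""
    return sum(s.steps for s in stages)


def is_chain(stages):
    """True if every stage starts from the checkpoint produced by the previous stage."""
    return all(
        cur.init == prev.repo.split("/", 1)[1]
        for prev, cur in zip(stages, stages[1:])
    )


if __name__ == "__main__":
    for name, chain in [("DeepSeek-Qwen-1.5B", DEEPSEEK_1_5B), ("Qwen3-4B", QWEN3_4B)]:
        print(name, "steps:", total_steps(chain), "chained:", is_chain(chain))
```

Under these assumptions, slicing a chain (for example `QWEN3_4B[:3]`) gives the step count of the TFPI stages alone, which can be set beside the step counts listed for the DirectRL baselines.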