TFPI - a xx18 Collection

updated Nov 7

Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

xx18/TFPI-EVA

Preview • Updated Sep 28 • 74

Note: Evaluation datasets used in the TFPI paper.

xx18/TFPI-DeepSeek-Qwen-1.5B-Stage1

Text Generation • 2B • Updated Nov 7 • 3

Note: Initial Model: DeepSeek-Qwen-1.5B; TFPI Stage 1: train for 1K steps with response length 2048

xx18/TFPI-DeepSeek-Qwen-1.5B-Stage2

Text Generation • 2B • Updated Nov 7 • 3

Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage1; TFPI Stage 2: train for 440 steps with response length 4096

xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3

Text Generation • 2B • Updated Nov 7 • 8

Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage2; TFPI Stage 3: train for 440 steps with response length 8192

xx18/TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL

Text Generation • 2B • Updated Nov 7 • 2

Note: Initial Model: TFPI-DeepSeek-Qwen-1.5B-Stage3; Normal DAPO: train for 472 steps with response length 16K

xx18/DirectRL_DeepSeek-Qwen-1.5B_baseline1

Text Generation • 2B • Updated Nov 7 • 5

Note: Initial Model: DeepSeek-Qwen-1.5B; Normal RLVR with DAPO: train for 456 steps with response length 16K

xx18/DirectRL_DeepSeek-Qwen-1.5B_baseline2

Text Generation • 2B • Updated Nov 7 • 1

Note: Initial Model: DeepSeek-Qwen-1.5B; Normal RLVR with DAPO: train for 896 steps with response length 16K

xx18/TFPI-Qwen3-4B-Stage1

Text Generation • 4B • Updated Nov 7 • 2

Note: Initial Model: Qwen3-4B; TFPI Stage 1: train for 100 steps with response length 4096

xx18/TFPI-Qwen3-4B-Stage2

Text Generation • 4B • Updated Nov 7 • 6

Note: Initial Model: TFPI-Qwen3-4B-Stage1; TFPI Stage 2: train for 56 steps with response length 8K

xx18/TFPI-Qwen3-4B-Stage3

Text Generation • 4B • Updated Nov 7 • 6

Note: Initial Model: TFPI-Qwen3-4B-Stage2; TFPI Stage 3: train for 64 steps with response length 16K

xx18/TFPI-Qwen3-4B-Stage3_then_RL

Text Generation • 4B • Updated Nov 7 • 9

Note: Initial Model: TFPI-Qwen3-4B-Stage3; Normal DAPO: train for 192 steps with response length 32K

xx18/DirectRL_Qwen3-4B_baseline1

Text Generation • 4B • Updated Nov 7 • 7

Note: Initial Model: Qwen3-4B; Normal RLVR with DAPO: train for 20 steps with response length 32K

xx18/DirectRL_Qwen3-4B_baseline2

Text Generation • 4B • Updated Nov 7 • 2

Note: Initial Model: Qwen3-4B; Normal RLVR with DAPO: train for 216 steps with response length 32K

xx18/TFPI-Qwen3-4B-Thinking-2507-Stage3

Text Generation • 4B • Updated Nov 7 • 4

Note: Initial Model: Qwen3-4B-Thinking-2507 after 2 stages of TFPI; TFPI Stage 3: response length 16K
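
The step counts and response lengths in the notes above can be tallied to compare training schedules. The following is a minimal sketch, not part of the TFPI release: the `Stage` class and both helper functions are hypothetical names introduced here, "1K steps" is read as 1000, and "8K"/"16K" are assumed to mean multiples of 1024 tokens. The length-weighted total is only a rough proxy, since the response length is a cap rather than the number of tokens actually generated, and batch sizes are not stated in the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage as described in a collection note (hypothetical helper)."""
    name: str
    steps: int
    max_response_len: int  # maximum response length in tokens


K = 1024  # assumption: "8K", "16K", "32K" in the notes denote multiples of 1024

# Values copied from the notes for the DeepSeek-Qwen-1.5B models.
TFPI_1P5B = [
    Stage("Stage1", 1000, 2048),  # "1K steps" read as 1000 (assumption)
    Stage("Stage2", 440, 4096),
    Stage("Stage3", 440, 8192),
]
DIRECT_RL_1P5B_BASELINE1 = [Stage("DAPO", 456, 16 * K)]

# Values copied from the notes for the Qwen3-4B models.
TFPI_4B = [
    Stage("Stage1", 100, 4096),
    Stage("Stage2", 56, 8 * K),
    Stage("Stage3", 64, 16 * K),
]


def total_steps(stages):
    """Sum of optimizer steps across all stages."""
    return sum(s.steps for s in stages)


def length_weighted_steps(stages):
    """Steps multiplied by the response-length cap, summed over stages.

    A crude proxy only: it ignores batch size and actual generated lengths.
    """
    return sum(s.steps * s.max_response_len for s in stages)


if __name__ == "__main__":
    print(total_steps(TFPI_1P5B))                           # 1880
    print(total_steps(TFPI_4B))                             # 220
    print(length_weighted_steps(TFPI_1P5B))                 # 7454720
    print(length_weighted_steps(DIRECT_RL_1P5B_BASELINE1))  # 7471104
```

Other entries from the notes, such as the `Stage3_then_RL` runs, can be appended to the lists in the same way to extend the comparison.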