Need guidance in reproducing AIME25 score
The AIME25 score is reported to be over 84.00, and I have been trying to reproduce that result.
I am using the recommended Docker image on 8×H100 GPUs, with exactly the recommended vLLM options.
A request example is:
{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': "Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{ANSWER}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}
where the prompt is Llama3-like, and another example is:
{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': 'Solve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}.'}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}
where the prompt is from ArtificialAnalysis.ai.
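For reference, here is a minimal sketch of how I send these requests, assuming a local vLLM OpenAI-compatible server at http://localhost:8000/v1 (the base URL, placeholder prompt, and empty model name are just stand-ins matching the dumps above):

```python
# Minimal sketch: send one request to the local OpenAI-compatible endpoint.
# base_url, api_key, and the prompt below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="",  # served model name; empty here, as in the request dumps above
    temperature=0.8,
    top_p=0.95,
    max_tokens=120000,
    messages=[{"role": "user", "content": "..."}],  # one of the prompts above
    extra_body={
        "skip_special_tokens": False,
        "chat_template_kwargs": {"default_system_prompt": False},
    },
)
print(resp.choices[0].message.content)
```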
In both cases, I tried default_system_prompt both on and off, giving 4 trials in total; in each trial, 8 answers were generated for each question.
The pass@1 scores are between 72 and 74, significantly lower than the reported score, so I wonder what else I should adjust to get the model's full capability.
Predicted answers are extracted from \boxed{}, and Hugging Face's math_verify is used to decide correctness.
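For clarity, this is roughly how I score a single completion; the brace-matching extractor and helper names are my own, not from any shared harness:

```python
# Sketch of the per-sample check: extract the last \boxed{...} and compare it
# to the ground truth with math_verify (helper names are mine).
from math_verify import parse, verify

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...}, matching nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces

def is_correct(completion: str, truth: str) -> bool:
    pred = extract_boxed(completion)
    if pred is None:
        return False
    gold = parse(f"${truth}$")
    answer = parse(f"${pred}$")
    return verify(gold, answer)
```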
Here is the full truth/pred/correctness table for the run with default_system_prompt=False and the ArtificialAnalysis.ai prompt.
| truth | pred | correct |
|---|---|---|
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 70 | 70 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 588 | 588 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 16 | 16 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 117 | 117 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 279 | 279 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 504 | 504 | True |
| 821 | 271 | False |
| 821 | 271 | False |
| 821 | 821 | True |
| 821 | 821 | True |
| 821 | 821 | True |
| 821 | 821 | True |
| 821 | 701 | False |
| 821 | 821 | True |
| 77 | 77 | True |
| 77 | 143 | False |
| 77 | 77 | True |
| 77 | 77 | True |
| 77 | 77 | True |
| 77 | 77 | True |
| 77 | 77 | True |
| 77 | 77 | True |
| 62 | 62 | True |
| 62 | 119 | False |
| 62 | 87 | False |
| 62 | 62 | True |
| 62 | 62 | True |
| 62 | 60 | False |
| 62 | 62 | True |
| 62 | 73 | False |
| 81 | 81 | True |
| 81 | 973520 | False |
| 81 | 81 | True |
| 81 | 62 | False |
| 81 | 81 | True |
| 81 | 95 | False |
| 81 | 57 | False |
| 81 | 81 | True |
| 259 | 259 | True |
| 259 | 259 | True |
| 259 | 4152 | False |
| 259 | 22 | False |
| 259 | 259 | True |
| 259 | 259 | True |
| 259 | 259 | True |
| 259 | 42 | False |
| 510 | 510 | True |
| 510 | 510 | True |
| 510 | 761308 | False |
| 510 | 510 | True |
| 510 | 510 | True |
| 510 | 510 | True |
| 510 | 510 | True |
| 510 | 303 | False |
| 204 | \dfrac{559}{4} | False |
| 204 | 128 | False |
| 204 | \dfrac{593}{6} | False |
| 204 | \displaystyle \frac{787}{3} | False |
| 204 | \dfrac{115}{3} | False |
| 204 | 529 | False |
| 204 | \displaystyle \frac{187}{3}+300\cdot\frac{5}{12}= \frac{561}{4} | False |
| 204 | \displaystyle \frac{399}{6} | False |
| 60 | 30 | False |
| 60 | 78 | False |
| 60 | 106 | False |
| 60 | 62 | False |
| 60 | 429 | False |
| 60 | 194 | False |
| 60 | 235 | False |
| 60 | 3521 | False |
| 735 | 273 | False |
| 735 | 197 | False |
| 735 | 461 | False |
| 735 | 729 | False |
| 735 | 147 | False |
| 735 | 999 | False |
| 735 | 499 | False |
| 735 | 479 | False |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 468 | 468 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 49 | 49 | True |
| 82 | 82 | True |
| 82 | 82 | True |
| 82 | 100 | False |
| 82 | 82 | True |
| 82 | 82 | True |
| 82 | 82 | True |
| 82 | 82 | True |
| 82 | 82 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 106 | 106 | True |
| 336 | 378 | False |
| 336 | 336 | True |
| 336 | 336 | True |
| 336 | 168 | False |
| 336 | 768 | False |
| 336 | 552^{\circ} | False |
| 336 | 336 | True |
| 336 | 672 | False |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 293 | 293 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 237 | 237 | True |
| 610 | 610 | True |
| 610 | 610 | True |
| 610 | 610 | True |
| 610 | 200 | False |
| 610 | 805 | False |
| 610 | 610 | True |
| 610 | 610 | True |
| 610 | 610 | True |
| 149 | 149 | True |
| 149 | 149 | True |
| 149 | 149 | True |
| 149 | 149 | True |
| 149 | 143 | False |
| 149 | 149 | True |
| 149 | 149 | True |
| 149 | 149 | True |
| 907 | 907 | True |
| 907 | 907 | True |
| 907 | 907 | True |
| 907 | 907 | True |
| 907 | 907 | True |
| 907 | 2907 | False |
| 907 | 907 | True |
| 907 | 907 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 113 | 113 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 19 | 19 | True |
| 248 | 736 | False |
| 248 | 933 | False |
| 248 | 162 | False |
| 248 | 177 | False |
| 248 | 3 | False |
| 248 | 501 | False |
| 248 | 627 | False |
| 248 | 6 | False |
| 104 | 104 | True |
| 104 | 104 | True |
| 104 | 104 | True |
| 104 | n=104 | True |
| 104 | 104 | True |
| 104 | 104 | True |
| 104 | n=104 | True |
| 104 | 104 | True |
| 240 | \frac{182579}{1000} | False |
| 240 | 240 | True |
| 240 | 4146 | False |
| 240 | 3162 | False |
| 240 | 472 | False |
| 240 | 8+32+200=240 | True |
| 240 | 564 | False |
| 240 | 264 | False |
If it helps, the maximum generation length in my trials was under 40k tokens.
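For completeness, the pass@1 numbers above are simply the average per-sample correctness over the 8 generations per question; a minimal sketch of the aggregation I use (the row layout is hypothetical):

```python
# Sketch: pass@1 as the mean per-question accuracy over the 8 samples each.
# `rows` is assumed to be a list of (question_id, truth, pred, correct) tuples.
from collections import defaultdict

def pass_at_1(rows) -> float:
    per_question = defaultdict(list)
    for qid, _truth, _pred, correct in rows:
        per_question[qid].append(bool(correct))
    per_q_acc = [sum(v) / len(v) for v in per_question.values()]
    return 100.0 * sum(per_q_acc) / len(per_q_acc)
```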
Hi @se-ok, thanks for the report. We're preparing guidance for you. And yes, the more details you can share about what you've tried, the more helpful it is for us.
@se-ok here is the summary:
- make sure you use the latest config we provide
- make sure you set `reasoning_effort=high`
- the provided Docker image is configured for efficient serving; update these environment variables for the benchmark:
  - `SOLAR_REASONING_BUDGET_HIGH_MAX=131072`
  - `SOLAR_REASONING_BUDGET_HIGH_RATIO=100`
e.g., with Docker:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e SOLAR_REASONING_BUDGET_HIGH_MAX=131072 \
-e SOLAR_REASONING_BUDGET_HIGH_RATIO=100 \
upstage/vllm-solar-open:latest \
upstage/Solar-Open-100B \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser solar_open \
--reasoning-parser solar_open \
--logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
--logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
--tensor-parallel-size 8
or, with vLLM directly:
SOLAR_REASONING_BUDGET_HIGH_MAX=131072 SOLAR_REASONING_BUDGET_HIGH_RATIO=100 vllm serve upstage/Solar-Open-100B \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser solar_open \
--reasoning-parser solar_open \
--logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
--logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
--tensor-parallel-size 8
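If you want to set reasoning_effort per request from the client side, a sketch with the OpenAI Python client could look like this (passing it through extra_body is an assumption here; adjust to match how your serving setup expects the field):

```python
# Sketch only: request-level reasoning_effort via the OpenAI Python client.
# Passing it through extra_body is an assumption; check the serving docs for
# the exact field the server expects.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="upstage/Solar-Open-100B",
    temperature=0.8,
    top_p=0.95,
    max_tokens=120000,
    messages=[{"role": "user", "content": "..."}],  # benchmark prompt goes here
    extra_body={"reasoning_effort": "high"},
)
print(resp.choices[0].message.content)
```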
@keunwooupstage
With these environment variables I have successfully reproduced the reported AIME25 score: pass@1 of 85.00 over 8 runs.
From the provided settings I now understand that the custom logits processor was, by default, cutting off the thinking part at 32k tokens, which is an interesting tweak.
Thank you for helping me out, and congratulations on your achievement!
Very happy that you could reproduce the results. Thank you for your work.
Closing this issue now.
(We'll keep the original score in the benchmark table.)