AutoBench Run 4 is out with Gemini 3 Pro, GPT-5.1, Grok 4.1, and more. And the winner is not who you expect.

Published November 28, 2025


The need for dynamic, granular benchmarks such as AutoBench has never been higher. As the AI landscape accelerates with the release of heavy hitters like GPT-5.1, Gemini 3 Pro, and Grok 4.1, we are thrilled to announce the completion of the 4th run of AutoBench, our latest evaluation to date.

This run evaluated 33 models across over 300 iterations (generated questions) using 21 ranking models and generating over 220,000 individual rankings. The results are in, and they paint a picture of a rapidly maturing ecosystem where proprietary giants are pushing boundaries, but efficient open-weight models are nipping at their heels.
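As a rough sanity check on those numbers, the sketch below assumes, for simplicity, that every ranking model grades every answering model's answer in every iteration; this is an approximation of the actual pipeline, not its exact behavior.

```python
# Back-of-envelope check of the Run 4 scale. Assumes every ranking model
# grades every answering model's answer in every iteration (an approximation).
iterations = 300        # "over 300" generated questions
answering_models = 33   # models under evaluation
ranking_models = 21     # models acting as judges

total_rankings = iterations * answering_models * ranking_models
print(f"{total_rankings:,}")  # 207,900 -- consistent with the "over 220,000"
                              # reported once the iteration count exceeds 300
```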

You can explore the full interactive data on the AutoBench Leaderboard or directly at autobench.org.

The Headlines

  • New King of the Hill: GPT-5.1 has established a new State-of-the-Art (SOTA) with an AutoBench score of 4.49, edging out its predecessor GPT-5.
  • Gemini 3 Pro enters the arena at 3rd place. Despite dominating almost every external academic benchmark (MMLU, MathArena, etc.), it struggles to dethrone OpenAI's flagship models in the "Collective-LLM-as-a-Judge" arena.
  • The Double Champion: GPT-OSS-120b is the breakout star of this run. It claimed the titles of Efficiency Champion ($0.0006/answer) and Speed Demon (17.14s average latency), all while securing a top-tier score of 4.37.
  • Validated Accuracy: AutoBench continues to align with the community consensus, showing an 87.08% correlation with the Artificial Analysis Intelligence Index.

๐Ÿ† The Leaderboard

The top of the leaderboard has shifted significantly. OpenAI maintains its lead, but Google has firmly planted its flag on the podium.


Note: The "AutoBench Score" is our aggregate quality metric (1-5 scale) derived from the "Collective-LLM-as-a-Judge" methodology.
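For readers new to the methodology: the exact aggregation pipeline is described in the AutoBench documentation. The snippet below is only a minimal, simplified illustration of how grades from many LLM judges collapse into a single 1-5 score per model; the record format and the plain averaging are assumptions for illustration, not the actual implementation.

```python
from collections import defaultdict
from statistics import mean

# Illustrative only: a minimal "Collective-LLM-as-a-Judge" aggregation.
# Each record is (question_id, answering_model, ranking_model, grade 1-5).
# Field names and the plain averaging are assumptions, not AutoBench's pipeline.
rankings = [
    ("q1", "model-a", "judge-1", 4.5),
    ("q1", "model-a", "judge-2", 4.0),
    ("q1", "model-b", "judge-1", 3.5),
    ("q1", "model-b", "judge-2", 4.0),
]

grades_by_model = defaultdict(list)
for _question, answerer, _judge, grade in rankings:
    grades_by_model[answerer].append(grade)

leaderboard = {model: round(mean(grades), 2) for model, grades in grades_by_model.items()}
print(leaderboard)  # e.g. {'model-a': 4.25, 'model-b': 3.75}
```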

๐Ÿง The Gemini 3 Pro Story

The most fascinating insight from Run 4 is the performance of Gemini 3 Pro. In traditional benchmarks like MMLU-Pro (where it scores a massive 91%), Gemini 3 Pro is often the undisputed number one.

However, in AutoBench, which relies on a jury of diverse LLMs ranking each other's answers, it lands in 3rd place (Score: 4.39), unable to surpass GPT-5.1 or GPT-5.

Why the discrepancy? Our analysis suggests that less capable models, acting as rankers, struggle to identify the subtleties of more advanced models, especially in less "hard" domains. This run used 21 ranking models, ranging from low-performance models such as Phi-4 and Amazon Nova up to mid-to-high performers such as Gemini 2.5 Flash, GPT-OSS-120b, and Grok 4.1 Fast (non-thinking). For cost reasons, we avoid using high-end SOTA models such as GPT-5 or Gemini 2.5 Pro as rankers. Our analysis shows that, if we remove the low-end rankers, the scores of top-tier SOTA models align slightly better with other benchmarks. This strongly suggests that adding high-end models to the ranking pool would significantly improve AutoBench's ability to discriminate among the very best models.

That said, the current gap between GPT-5.1 and Gemini 3 Pro is quite large, and we doubt it could be closed even with higher-performing models as rankers. This will be an active area of investigation for us in the coming weeks.
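One way to probe this, sketched below with hypothetical ranker names, made-up grades, and a deliberately simplified scoring function, is to recompute a model's score after excluding the low-tier rankers and see how the ordering at the top shifts.

```python
from statistics import mean

# Hypothetical sketch: recompute a model's score using only a subset of rankers.
# Ranker names and grades are made up for illustration.
LOW_TIER_RANKERS = {"phi-4", "amazon-nova"}

rankings = [
    ("gemini-3-pro", "phi-4", 4.1),
    ("gemini-3-pro", "gemini-2.5-flash", 4.6),
    ("gemini-3-pro", "grok-4.1-fast", 4.5),
]

def score(model, records, exclude=frozenset()):
    grades = [g for m, judge, g in records if m == model and judge not in exclude]
    return round(mean(grades), 2)

print(score("gemini-3-pro", rankings))                            # all rankers
print(score("gemini-3-pro", rankings, exclude=LOW_TIER_RANKERS))  # low-tier removed
```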

💰 Cost vs. Performance: The "Smart Shopper" Chart

One of the most critical insights from AutoBench is the trade-off between quality and cost.

While GPT-5.1 provides the absolute best answers, it comes at a premium (~$0.075 per answer). GPT-OSS-120b, on the other hand, confirms its position as one of the most cost-efficient models (as in our previous Run 3). With a score of 4.37, it rivals the top proprietary models released just a few days ago while costing nearly 125x less than the flagship GPT-5 series.

For developers building at scale, models like GPT-5-Nano and DeepSeek-v3.2-exp also offer a compelling "sweet spot": high intelligence for most tasks at a negligible cost.
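To make the trade-off concrete, here is a quick quality-per-dollar comparison using the figures quoted above; "score points per dollar" is an illustrative metric, not an official AutoBench statistic.

```python
# Quality per dollar using the numbers quoted in this post (score / cost per answer).
# An illustrative comparison only, not an official AutoBench metric.
models = {
    "GPT-5.1":      {"score": 4.49, "cost_per_answer": 0.075},
    "GPT-OSS-120b": {"score": 4.37, "cost_per_answer": 0.0006},
}

for name, m in models.items():
    print(f"{name}: {m['score'] / m['cost_per_answer']:.0f} score points per dollar")
# GPT-5.1:      ~60 score points per dollar
# GPT-OSS-120b: ~7283 score points per dollar
```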


🧠 Domain Mastery

AutoBench doesn't just give a single number; we break down performance by domain. Here is who conquered the key categories in Run 4:

  • 🧮 Math: A surprise upset! The older Gemini-2.5-Pro takes the crown (4.31), edging out its successor Gemini-3-Pro (4.30) and GPT-5.1 (4.29).

  • 🧠 Logic: GPT-5.1 (4.37) demonstrates superior reasoning capabilities, maintaining a comfortable lead over GPT-5 and Gemini-3-Pro.

  • 🎨 Creative Writing: Gemini-3-Pro shines here, securing the top spot (4.46) and narrowly beating GPT-5.1 (4.45), showcasing its nuanced and high-quality text generation capabilities.

🔗 Correlations & Methodology

AutoBench Run 4 continues to show incredibly strong alignment with other gold-standard benchmarks, proving that our "Collective-LLM-as-a-Judge" system captures the LLM consensus on model capability well.

  • 87.08% correlation with Artificial Analysis Intelligence Index
  • 77.16% correlation with LMArena (Chatbot Arena)
  • 80.68% correlation with MMLU

This high correlation validates that AutoBench accurately reflects the ground truth of model quality, while doing so faster and cheaper than human evaluation.
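For reference, such correlation figures are straightforward to reproduce once two aligned score lists are available. Below is a minimal sketch using placeholder values rather than the actual Run 4 data.

```python
from scipy.stats import pearsonr

# Placeholder scores for the same set of models on two benchmarks
# (illustrative values only, not the actual Run 4 data).
autobench_scores = [4.49, 4.46, 4.39, 4.37, 4.10]
other_index      = [70.1, 69.0, 71.5, 61.2, 55.0]

r, p_value = pearsonr(autobench_scores, other_index)
print(f"Pearson correlation: {r:.2%} (p={p_value:.3f})")
```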

What's Next?

Run 4 has set a new baseline. We are now working on using AutoBench to evaluate reasoning and agentic capabilities as well. We are also using it to dig into more specific domains, to prove its validity in exploring LLM performance in areas that have never been examined to date.

Stay tuned for Run 5!
