"Entities must not be multiplied beyond necessity." β The Principle of Ockham's Razor
Large Language Models (LLMs) like GPT-5, Gemini 3, and Claude serve as the frontier of automated intelligence, fueled by reasoning techniques like Chain of Thought (CoT). As the field embraces test-time compute scaling, models are increasingly trained to generate extensive token chains to tackle tougher problems. However, this massive inflation of decoding tokens introduces a critical bottleneck: solving just six problems in the International Olympiad in Informatics can now take over ten hours, and complex mathematical challenges frequently explode into millions of tokens.
While the community celebrates gains in reasoning capability, prevailing benchmarks like HELM and Chatbot Arena focus almost exclusively on output quality, ignoring this token efficiency crisis. In reality, many models consume vastly more tokens than necessary to reach the correct answer. Models of identical size (7B) achieving similar accuracy can differ by over 3.4× in token consumption and 5.0× in end-to-end latency. As accuracy on standard tasks approaches saturation, tokens must be treated as a cost, not a free resource.
We introduce OckBench, the first model- and hardware-agnostic benchmark that jointly measures accuracy and token efficiency.
Through extensive evaluations on 47 frontier and open-source models, we find that top open-source models have nearly closed the accuracy gap but consume up to 5.1× more tokens than commercial counterparts for comparable accuracy. Meanwhile, frontier commercial models are rapidly co-optimizing both dimensions, validating Per-Token Intelligence as the next key axis of LLM evaluation.
Performance of various LLMs on OckBench-Math (200 questions). Models are ranked by OckScore = Accuracy − 10 × log₁₀(1 + Tokens / 10,000); higher is better. All models were evaluated with single-shot prompts and greedy decoding (temperature = 0).
| # | Model | License | Avg Tokens | Accuracy (%) | OckScore ↑ |
|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | Commercial | 24,765 | 75.5 | 70.09 |
| 2 | gemini-3-pro-preview | Commercial | 20,154 | 72.0 | 67.21 |
| 3 | gemini-3-flash-preview | Commercial | 36,212 | 71.0 | 64.35 |
| 4 | Qwen3.5-397B-A17B-FP8 | Open-Source | 33,994 | 68.5 | 62.07 |
| 5 | gpt-5.2 (high) | Commercial | 19,541 | 66.0 | 61.30 |
| 6 | gpt-5.2 (medium) | Commercial | 15,683 | 63.0 | 58.90 |
| 7 | gpt-5-mini (high) | Commercial | 29,297 | 52.0 | 46.06 |
| 8 | Kimi-K2-Thinking | Open-Source | 41,746 | 52.5 | 45.36 |
| 9 | gpt-5.2 (low) | Commercial | 5,003 | 41.5 | 39.74 |
| 10 | DeepSeek-V3.2-Thinking | Open-Source | 25,492 | 43.5 | 38.00 |
| 11 | o4-mini (high) | Commercial | 25,677 | 43.5 | 37.98 |
| 12 | gpt-5-mini (medium) | Commercial | 8,831 | 38.5 | 35.75 |
| 13 | gemini-2.5-pro | Commercial | 30,622 | 41.0 | 34.91 |
| 14 | o4-mini (medium) | Commercial | 6,514 | 36.0 | 33.82 |
| 15 | Qwen3-235B-A22B-Thinking-2507 | Open-Source | 28,558 | 38.5 | 32.64 |
| 16 | o3-mini (high) | Commercial | 21,623 | 37.0 | 32.00 |
| 17 | Qwen3-30B-A3B-Thinking-2507 | Open-Source | 30,513 | 37.0 | 30.92 |
| 18 | Qwen3-32B-Thinking | Open-Source | 19,098 | 31.0 | 26.36 |
| 19 | o3-mini (medium) | Commercial | 7,251 | 27.5 | 25.13 |
| 20 | Qwen3-235B-A22B-Instruct-2507 | Open-Source | 7,707 | 27.0 | 24.52 |
| 21 | DeepSeek-R1 | Open-Source | 24,685 | 28.5 | 23.10 |
| 22 | AReaL-boba-2-32B | Open-Source | 23,327 | 27.0 | 21.77 |
| 23 | AceReason-Nemotron-14B | Open-Source | 19,424 | 25.0 | 20.31 |
| 24 | AReaL-boba-2-14B | Open-Source | 21,309 | 25.0 | 20.04 |
| 25 | gpt-5-mini (low) | Commercial | 2,148 | 20.5 | 19.65 |
| 26 | DeepSeek-V3.2-Instruct | Open-Source | 9,379 | 21.0 | 18.13 |
| 27 | o3-mini (low) | Commercial | 1,898 | 18.5 | 17.75 |
| 28 | Kimi-K2-Instruct-0905 | Open-Source | 3,823 | 18.5 | 17.09 |
| 29 | AceReason-Nemotron-7B | Open-Source | 12,315 | 20.5 | 17.01 |
| 30 | o4-mini (low) | Commercial | 2,022 | 17.5 | 16.70 |
| 31 | Qwen3-14B-Thinking | Open-Source | 19,381 | 21.0 | 16.32 |
| 32 | Qwen3-4B-Thinking-2507 | Open-Source | 27,238 | 22.0 | 16.29 |
| 33 | Qwen3-30B-A3B-Instruct-2507 | Open-Source | 12,475 | 18.5 | 14.98 |
| 34 | AReaL-boba-2-8B | Open-Source | 24,127 | 20.0 | 14.67 |
| 35 | DeepSeek-R1-Distill-Qwen-32B | Open-Source | 12,895 | 15.5 | 11.90 |
| 36 | DeepSeek-R1-Distill-Qwen-14B | Open-Source | 13,211 | 15.0 | 11.34 |
| 37 | Qwen3-8B-Thinking | Open-Source | 24,207 | 16.5 | 11.16 |
| 38 | Qwen3-4B-Instruct-2507 | Open-Source | 11,859 | 13.5 | 10.10 |
| 39 | Qwen3-14B-Instruct | Open-Source | 8,205 | 10.0 | 7.40 |
| 40 | DeepSeek-R1-Distill-Qwen-7B | Open-Source | 41,415 | 14.5 | 7.39 |
| 41 | gpt-4.1 | Commercial | 4,668 | 8.5 | 6.84 |
| 42 | gemini-2.5-flash | Commercial | 59,447 | 13.0 | 4.58 |
| 43 | Qwen3-8B-Instruct | Open-Source | 10,655 | 7.5 | 4.35 |
| 44 | Qwen3-32B-Instruct | Open-Source | 9,233 | 6.0 | 3.16 |
| 45 | gpt-4o | Commercial | 947 | 3.5 | 3.11 |
| 46 | gemini-2.5-flash-lite | Commercial | 61,040 | 1.0 | -7.52 |
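For reference, the ranking metric from the caption is a one-liner; the `1 +` inside the logarithm keeps the length penalty near zero for very short outputs. A minimal sketch that reproduces the scores above:

```python
import math

def ockscore(accuracy_pct: float, avg_tokens: float) -> float:
    # OckScore = Accuracy - 10 * log10(1 + Tokens / 10,000)
    return accuracy_pct - 10 * math.log10(1 + avg_tokens / 10_000)

print(round(ockscore(75.5, 24_765), 2))  # 70.09, matches gemini-3.1-pro-preview
print(round(ockscore(1.0, 61_040), 2))   # -7.52, matches gemini-2.5-flash-lite
```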
OckBench provides comprehensive evaluation across multiple dimensions
OckBench aggregates tasks across three complementary reasoning domains. Rather than sampling at random, we apply the Differentiation Filter: we keep problems whose cross-model accuracy falls within 10%–90% (avoiding floor and ceiling effects) and whose token variance is maximized, isolating instances that reveal intrinsic efficiency differences, as sketched below.
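A minimal sketch of this selection step; the per-problem data layout and the top-k cutoff are illustrative assumptions, while the 10%–90% accuracy band is the criterion stated above:

```python
import statistics

def differentiation_filter(results, lo=0.10, hi=0.90, k=200):
    # results: {problem_id: [(correct: bool, tokens: int), ...]}, one entry per model
    kept = []
    for pid, runs in results.items():
        acc = sum(correct for correct, _ in runs) / len(runs)
        if lo <= acc <= hi:  # drop floor/ceiling problems
            token_var = statistics.pvariance([tokens for _, tokens in runs])
            kept.append((token_var, pid))
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most token-discriminative first
    return [pid for _, pid in kept[:k]]
```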
Sample problems from OckBench's three domains
These examples illustrate the types of problems where token efficiency varies significantly across models.
Question: A store sells notebooks for $3 each. If you buy more than 10, you get a 20% discount on the total price. How much would it cost to buy 15 notebooks?
Domain: Mathematics | Source: GSM8K
$3 × 15 × 0.8 = $36, a 3-second mental calculation. Yet some reasoning models spend 2,000+ tokens setting up formal equations, double-checking edge cases, and re-reading the problem before arriving at the obvious answer.
Question: Find the number of ordered pairs (a, b) of integers such that a² + b² = 2024 and both a and b are positive.
Domain: Mathematics | Source: AIME 2024
Efficient models notice 2024 = 4 × 506 and apply modular arithmetic to eliminate large search spaces in a few steps. Verbose models enumerate all 44² candidate pairs one by one: correct eventually, but at 10× the token cost.
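To make the shortcut concrete, here is one way the modular argument can run (our worked sketch, not the benchmark's reference solution):

```latex
% Squares are 0 or 1 (mod 4), so a^2 + b^2 = 2024 \equiv 0 \pmod{4}
% forces both a and b to be even; write a = 2a', b = 2b':
a'^2 + b'^2 = 2024/4 = 506 = 2 \cdot 11 \cdot 23
% 11 and 23 are primes \equiv 3 \pmod{4} occurring to odd powers, so by the
% sum-of-two-squares theorem 506 has no representation: the answer is 0.
```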
Task: Write a function to find the longest common subsequence of two strings. For example, lcs("ABCDGH", "AEDFHR") should return 3 (the LCS is "ADH").
Domain: Coding | Source: MBPP variant
A clean DP solution is ~10 lines. But many models first write a recursive solution, identify the redundancy, rewrite with memoization, then pivot to bottom-up, producing correct code buried under 800 tokens of self-tutoring.
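For instance, one of many ~10-line bottom-up versions, here in Python with a rolling row:

```python
def lcs(a: str, b: str) -> int:
    # prev[j] = LCS length of the prefix of `a` seen so far vs b[:j]
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            # extend the diagonal on a match, else take the best neighbor
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

assert lcs("ABCDGH", "AEDFHR") == 3  # LCS is "ADH"
```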
Question: A molecule undergoes a photochemical reaction in which it absorbs a photon and transitions to an excited state. If the excited state has a lifetime of 10 ns, what is the natural linewidth (in Hz) of the corresponding spectral line?
Domain: Scientific Reasoning | Source: GPQA-Diamond
Efficient models recall Δν = 1/(2πτ) and plug in τ = 10 ns for an instant answer. Overthinking models re-derive the time-energy uncertainty relation from first principles: impressively thorough, but it's a textbook formula lookup.
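Concretely, the one-line calculation:

```latex
\Delta\nu = \frac{1}{2\pi\tau}
          = \frac{1}{2\pi \times 10 \times 10^{-9}\,\mathrm{s}}
          \approx 1.6 \times 10^{7}\,\mathrm{Hz} \approx 16\,\mathrm{MHz}
```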
If you find OckBench useful for your research, please cite our work:

```bibtex
@article{du2025ockbench,
  title   = {OckBench: Measuring the Efficiency of LLM Reasoning},
  author  = {Du, Zheng and Kang, Hao and Han, Song and Krishna, Tushar and Zhu, Ligeng},
  journal = {arXiv preprint arXiv:2511.05722},
  year    = {2025}
}
```