OckBench

Tokens are Not to Be Multiplied without Necessity

*Equal contribution
¹Georgia Institute of Technology  ²MIT  ³NVIDIA

Introduction

"Entities must not be multiplied beyond necessity." β€” The Principle of Ockham's Razor

Large Language Models (LLMs) like GPT-5, Gemini 3, and Claude serve as the frontier of automated intelligence, fueled by reasoning techniques like Chain of Thought (CoT). As the field embraces test-time compute scaling, models are increasingly trained to generate extensive token chains to tackle tougher problems. However, this massive inflation of decoding tokens introduces a critical bottleneck: solving just six problems in the International Olympiad in Informatics can now take over ten hours, and complex mathematical challenges frequently explode into millions of tokens.

While the community celebrates gains in reasoning capability, prevailing benchmarks like HELM and Chatbot Arena focus almost exclusively on output quality, ignoring this token-efficiency crisis. In reality, many models consume vastly more tokens than necessary to reach the correct answer. Models of identical size (7B) achieving similar accuracy can differ by over 3.4× in token consumption and 5.0× in end-to-end latency. As accuracy on standard tasks approaches saturation, tokens must be treated as a cost, not a free resource.

We introduce OckBench, the first model- and hardware-agnostic benchmark that jointly measures accuracy and token efficiency. Our key contributions include:

  • Per-Token Intelligence: We introduce a new evaluation dimension in which a superior model must not only achieve high accuracy but do so with minimal token consumption
  • OckBench & OckScore: A benchmark with a novel Differentiation Filter that isolates tasks exposing the efficiency gap, paired with a unified metric that rewards high accuracy achieved with fewer tokens
  • The Overthinking Tax: We formally quantify how smaller models often incur paradoxically higher deployment costs due to excessively verbose reasoning chains
  • Optimization Pathways: We demonstrate that efficiency is tractable; both training-free model interpolation and difficulty-aware RL significantly improve OckScore

Through extensive evaluations on 47 frontier and open-source models, we find that top open-source models have nearly closed the accuracy gap but consume up to 5.1× more tokens than commercial counterparts for comparable accuracy. Meanwhile, frontier commercial models are rapidly co-optimizing both dimensions, validating Per-Token Intelligence as the next key axis of LLM evaluation.

Figure: Accuracy vs. Average Tokens across 47 evaluated models. Models in the upper-left corner are ideal. The Pareto frontier is dominated by commercial models; open-source models cluster to the right.
Figure: Same Scale, 5× Faster. At an identical 7B parameter count, AceReason-7B is 5.0× faster than DeepSeek-R1-7B because it uses 3.4× fewer tokens.
Figure: Open vs. Closed Gap. DeepSeek-V3.2 matches GPT-5.2 in accuracy but consumes 5.1× more tokens; open-source models lag in per-token intelligence.
Figure: The Overthinking Tax. In the Qwen3 family, larger models are paradoxically more token-efficient; smaller models over-generate to compensate for lower capacity.

Leaderboard

Performance of various LLMs on OckBench-Math (200 questions). Models are ranked by OckScore = Accuracy − 10 × log(Tokens / 10,000); higher is better. All models are evaluated with single-shot prompts and greedy decoding (temperature = 0).
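As a reference point, here is a minimal sketch of the ranking metric defined above. The logarithm base is not specified on this page, so base 10 is assumed here; the official leaderboard values may be computed slightly differently.

import math

def ockscore(accuracy_pct: float, avg_tokens: float) -> float:
    # OckScore = Accuracy - 10 * log(Tokens / 10,000), as stated in the caption above.
    # The base-10 logarithm is an assumption of this sketch.
    return accuracy_pct - 10 * math.log10(avg_tokens / 10_000)

# Illustrative call (accuracy in percent, average output tokens per question):
print(round(ockscore(accuracy_pct=60.0, avg_tokens=20_000), 2))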

| # | Model | License | Avg Tokens | Accuracy (%) | OckScore ↑ |
|---|-------|---------|------------|--------------|------------|
| 1 | gemini-3.1-pro-preview | Commercial | 24,765 | 75.5 | 70.09 |
| 2 | gemini-3-pro-preview | Commercial | 20,154 | 72.0 | 67.21 |
| 3 | gemini-3-flash-preview | Commercial | 36,212 | 71.0 | 64.35 |
| 4 | Qwen3.5-397B-A17B-FP8 | Open-Source | 33,994 | 68.5 | 62.07 |
| 5 | gpt-5.2 (high) | Commercial | 19,541 | 66.0 | 61.30 |
| 6 | gpt-5.2 (medium) | Commercial | 15,683 | 63.0 | 58.90 |
| 7 | gpt-5-mini (high) | Commercial | 29,297 | 52.0 | 46.06 |
| 8 | Kimi-K2-Thinking | Open-Source | 41,746 | 52.5 | 45.36 |
| 9 | gpt-5.2 (low) | Commercial | 5,003 | 41.5 | 39.74 |
| 10 | DeepSeek-V3.2-Thinking | Open-Source | 25,492 | 43.5 | 38.00 |
| 11 | o4-mini (high) | Commercial | 25,677 | 43.5 | 37.98 |
| 12 | gpt-5-mini (medium) | Commercial | 8,831 | 38.5 | 35.75 |
| 13 | gemini-2.5-pro | Commercial | 30,622 | 41.0 | 34.91 |
| 14 | o4-mini (medium) | Commercial | 6,514 | 36.0 | 33.82 |
| 15 | Qwen3-235B-A22B-Thinking-2507 | Open-Source | 28,558 | 38.5 | 32.64 |
| 16 | o3-mini (high) | Commercial | 21,623 | 37.0 | 32.00 |
| 17 | Qwen3-30B-A3B-Thinking-2507 | Open-Source | 30,513 | 37.0 | 30.92 |
| 18 | Qwen3-32B-Thinking | Open-Source | 19,098 | 31.0 | 26.36 |
| 19 | o3-mini (medium) | Commercial | 7,251 | 27.5 | 25.13 |
| 20 | Qwen3-235B-A22B-Instruct-2507 | Open-Source | 7,707 | 27.0 | 24.52 |
| 21 | DeepSeek-R1 | Open-Source | 24,685 | 28.5 | 23.10 |
| 22 | AReaL-boba-2-32B | Open-Source | 23,327 | 27.0 | 21.77 |
| 23 | AceReason-Nemotron-14B | Open-Source | 19,424 | 25.0 | 20.31 |
| 24 | AReaL-boba-2-14B | Open-Source | 21,309 | 25.0 | 20.04 |
| 25 | gpt-5-mini (low) | Commercial | 2,148 | 20.5 | 19.65 |
| 26 | DeepSeek-V3.2-Instruct | Open-Source | 9,379 | 21.0 | 18.13 |
| 27 | o3-mini (low) | Commercial | 1,898 | 18.5 | 17.75 |
| 28 | Kimi-K2-Instruct-0905 | Open-Source | 3,823 | 18.5 | 17.09 |
| 29 | AceReason-Nemotron-7B | Open-Source | 12,315 | 20.5 | 17.01 |
| 30 | o4-mini (low) | Commercial | 2,022 | 17.5 | 16.70 |
| 31 | Qwen3-14B-Thinking | Open-Source | 19,381 | 21.0 | 16.32 |
| 32 | Qwen3-4B-Thinking-2507 | Open-Source | 27,238 | 22.0 | 16.29 |
| 33 | Qwen3-30B-A3B-Instruct-2507 | Open-Source | 12,475 | 18.5 | 14.98 |
| 34 | AReaL-boba-2-8B | Open-Source | 24,127 | 20.0 | 14.67 |
| 35 | DeepSeek-R1-Distill-Qwen-32B | Open-Source | 12,895 | 15.5 | 11.90 |
| 36 | DeepSeek-R1-Distill-Qwen-14B | Open-Source | 13,211 | 15.0 | 11.34 |
| 37 | Qwen3-8B-Thinking | Open-Source | 24,207 | 16.5 | 11.16 |
| 38 | Qwen3-4B-Instruct-2507 | Open-Source | 11,859 | 13.5 | 10.10 |
| 39 | Qwen3-14B-Instruct | Open-Source | 8,205 | 10.0 | 7.40 |
| 40 | DeepSeek-R1-Distill-Qwen-7B | Open-Source | 41,415 | 14.5 | 7.39 |
| 41 | gpt-4.1 | Commercial | 4,668 | 8.5 | 6.84 |
| 42 | gemini-2.5-flash | Commercial | 59,447 | 13.0 | 4.58 |
| 43 | Qwen3-8B-Instruct | Open-Source | 10,655 | 7.5 | 4.35 |
| 44 | Qwen3-32B-Instruct | Open-Source | 9,233 | 6.0 | 3.16 |
| 45 | gpt-4o | Commercial | 947 | 3.5 | 3.11 |
| 46 | gemini-2.5-flash-lite | Commercial | 61,040 | 1.0 | -7.52 |

Benchmark Overview

OckBench provides comprehensive evaluation across multiple dimensions

  • Questions per Domain: 200
  • Task Domains: 3
  • Source Datasets: 8+
  • Models Evaluated: 47

Benchmark Composition

OckBench aggregates tasks across three complementary reasoning domains. Rather than sampling at random, we apply the Differentiation Filter: we keep problems on which cross-model accuracy falls between 10% and 90% (avoiding floor and ceiling effects) and cross-model token variance is maximized, isolating the instances that reveal intrinsic efficiency differences. A minimal sketch of this filter follows the domain list below.

  • Mathematics & Reasoning: GSM8K, AIME 2024/2025, OlympiadBench, MATH500, AMO-Bench, and the mathematics subset of Humanity's Last Exam, spanning grade-school arithmetic to competition-level number theory.
  • Software Engineering: A lightweight variant of MBPP and LiveCodeBench, assessing practical code generation and planning skills verified via unit test execution.
  • Scientific Reasoning: ScienceQA, MMLU (STEM subsets), and GPQA-Diamond, testing knowledge-constrained reasoning and concision under technical load.
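As referenced above, the Differentiation Filter can be summarized in a few lines. This is a minimal sketch assuming per-problem accuracy rates and per-model token counts have already been collected; the data layout and the keep_fraction cutoff are illustrative assumptions, not OckBench's exact implementation.

import statistics

def differentiation_filter(problems, accuracy, tokens, keep_fraction=0.5):
    # accuracy[p]: fraction of evaluated models that solve problem p.
    # tokens[p]: list of per-model token counts on problem p.
    # Step 1: drop floor/ceiling problems -- keep those solved by 10%-90% of models.
    candidates = [p for p in problems if 0.10 <= accuracy[p] <= 0.90]
    # Step 2: rank survivors by cross-model token variance, keeping the instances
    # on which models disagree most about how many tokens to spend.
    candidates.sort(key=lambda p: statistics.variance(tokens[p]), reverse=True)
    return candidates[: max(1, int(len(candidates) * keep_fraction))]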

Example Tasks

Sample problems from OckBench-Math and OckBench-Coding

These examples illustrate the types of problems where token efficiency varies significantly across models.

Math Problem (GSM8K)

Question: A store sells notebooks for $3 each. If you buy more than 10, you get a 20% discount on the total price. How much would it cost to buy 15 notebooks?

Domain: Mathematics  |  Source: GSM8K

$3 × 15 × 0.8 = $36: a three-second mental calculation. Yet some reasoning models spend 2,000+ tokens setting up formal equations, double-checking edge cases, and re-reading the problem before arriving at the obvious answer.

Math Problem (AIME)

Question: Find the number of ordered pairs (a,b) of integers such that aΒ² + bΒ² = 2024 and both a and b are positive.

Domain: Mathematics  |  Source: AIME 2024

Efficient models notice 2024 = 4 × 506 and apply modular arithmetic to eliminate large search spaces in a few steps. Verbose models enumerate all 44² candidate pairs one by one, arriving at the correct answer eventually but at roughly 10× the token cost.
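For concreteness, the exhaustive route a verbose model narrates token by token fits in two lines of Python, while the number-theoretic shortcut (2024 = 2³ × 11 × 23, with 11 and 23 ≡ 3 mod 4 appearing to odd powers) reaches the same count without any enumeration.

# Brute force: a and b are positive with a^2, b^2 <= 2024, so each is at most 44.
count = sum(1 for a in range(1, 45) for b in range(1, 45) if a * a + b * b == 2024)
print(count)  # 0 ordered pairs -- 2024 is not a sum of two squares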

Coding Problem (MBPP)

Task: Write a function to find the longest common subsequence of two strings. For example, lcs("ABCDGH", "AEDFHR") should return 3 (the LCS is "ADH").

Domain: Coding  |  Source: MBPP variant

A clean DP solution is ~10 lines. But many models first write a recursive solution, identify the redundancy, rewrite with memoization, then pivot to bottom-up, producing correct code buried under 800 tokens of self-tutoring.
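For reference, a sketch of the direct bottom-up solution an efficient model can emit in a single pass (roughly the ~10 lines mentioned above):

def lcs(a: str, b: str) -> int:
    # Bottom-up DP: dp[i][j] is the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

assert lcs("ABCDGH", "AEDFHR") == 3  # the LCS is "ADH"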

Science Problem (GPQA-Diamond)

Question: A molecule undergoes a photochemical reaction in which it absorbs a photon and transitions to an excited state. If the excited state has a lifetime of 10 ns, what is the natural linewidth (in Hz) of the corresponding spectral line?

Domain: Scientific Reasoning  |  Source: GPQA-Diamond

Efficient models recall Δν = 1/(2πτ) and plug in τ = 10 ns for an instant answer. Overthinking models re-derive the time-energy uncertainty relation from first principles: impressively thorough, but it's a textbook formula lookup.
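Worked out, the lookup is a single line:

Δν = 1/(2πτ) = 1/(2π × 10 × 10⁻⁹ s) ≈ 1.6 × 10⁷ Hz (about 16 MHz)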

Key Findings

  • 5× latency gap at the same scale: Among 7B models, AceReason-7B uses 12,315 tokens vs DeepSeek-R1-Distill-7B's 41,415, a 3.4× token gap that translates into a 5.0× end-to-end latency difference despite identical parameter counts.
  • Open-source accuracy ≈ commercial, but efficiency lags: Kimi-K2-Thinking matches GPT-5-mini-high in accuracy (52.5% vs 52.0%) but consumes 43% more tokens (41,746 vs 29,297). Open-source models have closed the capability gap but trail significantly in per-token intelligence.
  • The Overthinking Tax: In the Qwen3 family, larger models are paradoxically more token-efficient: Qwen3-32B-Thinking uses 19,098 tokens at 31.0% accuracy, while Qwen3-4B-Thinking burns 27,238 tokens at just 22.0%. Smaller models compensate for lower capacity with verbose, inefficient reasoning chains.
  • Cheaper per-token ≠ cheaper per-query: DeepSeek-R1-Distill-7B generates 3.13× more tokens than the 14B model (41,415 vs 13,211) yet achieves lower accuracy (14.5% vs 15.0%). Despite costing 50% less per token, the 7B model ends up roughly 57% more expensive per query in practice (see the worked arithmetic after this list).
  • Gemini 3.1 Pro leads the frontier: The top OckScore of 70.09, with 75.5% accuracy at 24,765 average tokens. It resolves complex reasoning paths more succinctly than Gemini 3 Flash (36,212 tokens, OckScore 64.35), demonstrating that efficiency and accuracy are co-optimized at the frontier.
  • Frontier models are rapidly converging on both axes: Comparing successive model generations reveals simultaneous improvement in accuracy and token efficiency, validating Per-Token Intelligence as the next primary optimization objective for the community.
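A quick check of the per-query arithmetic behind the "cheaper per-token" finding above, assuming (as stated) that the 7B model's per-token price is half the 14B model's:

token ratio: 41,415 / 13,211 ≈ 3.13
per-query cost ratio: 3.13 × 0.5 ≈ 1.57, i.e. the 7B model is roughly 57% more expensive per query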

Citation

If you find OckBench useful for your research, please cite our work:

@article{du2025ockbench,
  title={OckBench: Measuring the Efficiency of LLM Reasoning},
  author={Du, Zheng and Kang, Hao and Han, Song and Krishna, Tushar and Zhu, Ligeng},
  journal={arXiv preprint arXiv:2511.05722},
  year={2025}
}