OckBench: Tokens are Not to Be Multiplied without Necessity | NeurIPS 2025 Workshop on Efficient Reasoning

Introduction

"Entities must not be multiplied beyond necessity." — The Principle of Ockham's Razor

Large Language Models (LLMs) such as GPT-4, Claude 3, and Gemini have demonstrated remarkable capabilities in complex problem-solving, largely attributed to their advanced reasoning abilities. Techniques like Chain of Thought (CoT) prompting and self-reflection have become central to this success, enabling models to perform step-by-step deductions for tasks requiring deep knowledge and logical rigor. However, as the industry increasingly emphasizes this "long decoding" mode, the computational cost associated with these reasoning processes has grown significantly.

While LLM evaluation and comparison have become increasingly important, most evaluations focus primarily on accuracy while the efficiency of generation is less discussed. For example, HELM, LM-Eval, and the LMSYS Chatbot Arena rank models almost entirely on task accuracy. Yet in real systems, the difference between generating 10K tokens vs 100K tokens is non-trivial in latency, cost, and energy.

We introduce OckBench, the first model-agnostic, hardware-agnostic benchmark that jointly measures accuracy and decoding token count for reasoning and coding tasks. Our key contributions include:

Model-Agnostic Efficiency Metric: We formalize decoding token count as an intrinsic, hardware- and system-independent efficiency metric
Efficiency-Accuracy Aware Benchmark: The first unified benchmark specifically designed to evaluate LLM reasoning efficiency by measuring token consumption alongside accuracy
Empirical Trade-offs: We conduct experiments across multiple open- and closed-source models, revealing substantial practical trade-offs on the accuracy–efficiency Pareto frontier

Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption. For instance, among commercial models, one high-accuracy model required over 2× the tokens of another to achieve similar accuracy. This reveals that efficiency variance is a neglected but significant axis of differentiation in LLM evaluation.

Model Performance Comparison across all evaluated models showing accuracy vs token efficiency trade-offs — Similar model size but takes drastically different time in reasoning tasks

Performance comparison of 7B parameter models highlighting efficiency differences — Commercial models with similar performance level but spent token budget differently

Benchmark Overview

OckBench provides comprehensive evaluation across multiple dimensions

Total Questions

400

Task Domains

2

Questions per Domain

200

Models Evaluated

18

Benchmark Composition

OckBench is structured to test LLMs' reasoning efficiency across two complementary domains: mathematical problem solving and coding skills. To better expose token-efficiency differences, we select questions that exhibit high variance in decoding token usage among baseline models.

Mathematics and Reasoning Tasks (200 questions): We adopt GSM8K, AIME24, and AIME25 as core reasoning benchmarks. We select the top 200 questions that exhibit high variance in decoding token usage among baseline models, ensuring the benchmark emphasizes efficiency contrast rather than merely ranking by accuracy.
Software Engineering Tasks (200 questions): For the coding domain, we build a lightweight variant of MBPP, supplemented by 200 carefully curated real-world coding problems. These coding tasks cover algorithmic challenges, code transformation, debugging, and small-scale project tasks.

Evaluation Metrics

OckBench uses decoding token count as the core efficiency metric—a model- and hardware-agnostic measure that captures the intrinsic reasoning efficiency of models.

Decoding Token Count (#Tokens): Total number of tokens generated during reasoning (model-agnostic, hardware-agnostic)
Accuracy (%): Percentage of correctly solved problems
Reasoning Efficiency: Computed as #Tokens / Accuracy, measuring cost per unit of correctness

By selecting questions with high variance in token consumption across models, OckBench reveals efficiency differences that traditional accuracy-only benchmarks miss.

Leaderboard

Performance of various LLMs on OckBench. Models are ranked by their reasoning efficiency, which is computed as #Tokens / Accuracy (lower is better). Click on the tabs to switch between Math and Coding domains.

#Tokens: Average decoding token count Accuracy (%): Percentage of correctly solved problems Reasoning Efficiency: #Tokens / Accuracy (lower is better)

OckBench-Math
OckBench-Coding

#	Model	Category	#Tokens	Accuracy (%)	Reasoning Efficiency
1	GPT-4o OpenAI	Commercial	495	35	14.1
2	GPT-4.1 OpenAI	Commercial	872	59	14.9
3	Sky-T1-7B NovaSky-AI	Open-Source	556	33	17.1
4	GPT-5 OpenAI	Commercial	2,336	73	32.2
5	GPT-o3 OpenAI	Commercial	2,347	64	36.8
6	Gemini-2.5 Flash Google	Commercial	4,777	66	72.6
7	Gemini-2.5 Pro Google	Commercial	5,198	68	76.2
8	Qwen3-14B (non-thinking) Alibaba	Open-Source	3,010	33	92.0
9	Qwen3-4B (non-thinking) Alibaba	Open-Source	3,494	30	118.4
10	Qwen3-8B (non-thinking) Alibaba	Open-Source	3,692	30	124.1
11	Nemotron-14B NVIDIA	Open-Source	5,540	40	139.4
12	Sky-T1-mini NovaSky-AI	Open-Source	6,657	33	204.8
13	Qwen3-14B (thinking) Alibaba	Open-Source	8,190	40	206.0
14	Nemotron-7B NVIDIA	Open-Source	8,895	35	254.2
15	AReaL-boba-2-14B inclusionAI	Open-Source	10,439	38	278.4
16	AReaL-boba-2-8B inclusionAI	Open-Source	17,038	37	457.4
17	Qwen3-8B (thinking) Alibaba	Open-Source	20,440	38	541.5
18	Qwen3-4B (thinking) Alibaba	Open-Source	24,025	37	649.3

#	Model	Category	#Tokens	Accuracy (%)	Reasoning Efficiency
1	GPT-4o OpenAI	Commercial	491	38	12.9
2	Sky-T1-7B NovaSky-AI	Open-Source	348	23	15.1
3	GPT-4.1 OpenAI	Commercial	782	47	16.6
4	GPT-5 OpenAI	Commercial	1,436	75	19.1
5	Gemini 2.5 Pro Google	Commercial	1,798	77	23.4
6	Gemini 2.5 Flash Google	Commercial	2,346	60	39.1
7	GPT-o3 OpenAI	Commercial	3,001	71	42.3
8	Qwen3-4B (non-thinking) Alibaba	Open-Source	1,700	28	60.7
9	Qwen3-14B (non-thinking) Alibaba	Open-Source	2,413	35	68.9
10	Qwen3-8B (non-thinking) Alibaba	Open-Source	2,098	27	77.7
11	Nemotron-14B NVIDIA	Open-Source	9,840	46	213.9
12	Qwen3-14B (thinking) Alibaba	Open-Source	10,498	48	218.7
13	Sky-T1-mini NovaSky-AI	Open-Source	5,603	24	233.5
14	Qwen3-8B (thinking) Alibaba	Open-Source	11,738	41	286.3
15	Qwen3-4B (thinking) Alibaba	Open-Source	12,563	39	322.1
16	Nemotron-7B NVIDIA	Open-Source	12,895	40	322.4
17	AReaL-boba-2-14B inclusionAI	Open-Source	12,648	32	395.3
18	AReaL-boba-2-8B inclusionAI	Open-Source	14,537	31	468.9

Example Tasks

Sample problems from OckBench-Math and OckBench-Coding

These examples illustrate the types of problems where token efficiency varies significantly across models.

Math Problem (GSM8K)

Question: A store sells notebooks for $3 each. If you buy more than 10, you get a 20% discount on the total price. How much would it cost to buy 15 notebooks?

Domain: Mathematics | Source: GSM8K

Token variance: Some models use 200 tokens, others use 2,000+ for the same answer.

Math Problem (AIME)

Question: Find the number of ordered pairs (a,b) of integers such that a² + b² = 2024 and both a and b are positive.

Domain: Mathematics | Source: AIME 2024

Token variance: High variance across models due to different reasoning approaches.

Coding Problem (MBPP)

Task: Write a function to find the longest common subsequence of two strings. For example, lcs("ABCDGH", "AEDFHR") should return 3 (the LCS is "ADH").

Domain: Coding | Source: MBPP variant

Token variance: Efficient models write concise code with brief explanations.

Future Work: More Domains

Planned Extension: We plan to extend OckBench to additional domains such as algorithmic challenges, debugging tasks, and code transformation problems to provide more comprehensive token efficiency evaluation.

Status: Coming in next version | Focus: Broader coverage

Stay tuned for expanded benchmark coverage across more reasoning domains!

Key Findings

2× token variance among top models: Gemini-2.5 Pro used 2× more tokens than GPT-5 for similar accuracy (5,198 vs 2,336 tokens).
GPT-4o is most token-efficient: Best reasoning efficiency of 14.1 (math) and 12.9 (coding) despite lower accuracy.
"Thinking" modes costly: Qwen3-14B thinking used 2.7× more tokens than non-thinking (8,190 vs 3,010) without proportional gains.
Commercial models lead: 60.8% average accuracy vs 35.3% for open-source on math tasks.
10-18× efficiency differences: Models with comparable accuracy differ wildly in token consumption.
Size doesn't guarantee efficiency: Smaller models sometimes use more tokens than larger ones for the same tasks.

Citation

If you find OckBench useful for your research, please cite our work

@inproceedings{du2025ockbench,
  author    = {Du, Zheng and Kang, Hao and Zhu, Ligeng and Han, Song and Krishna, Tushar},
  title     = {OckBench: Tokens are Not to Be Multiplied without Necessity},
  booktitle = {NeurIPS 2025 Workshop on Efficient Reasoning},
  year      = {2025}
}