"Entities must not be multiplied beyond necessity." — The Principle of Ockham's Razor
Large Language Models (LLMs) such as GPT-4, Claude 3, and Gemini have demonstrated remarkable capabilities in complex problem-solving, largely attributed to their advanced reasoning abilities. Techniques like Chain of Thought (CoT) prompting and self-reflection have become central to this success, enabling models to perform step-by-step deductions for tasks requiring deep knowledge and logical rigor. However, as the industry increasingly emphasizes this "long decoding" mode, the computational cost associated with these reasoning processes has grown significantly.
While LLM evaluation and comparison have become increasingly important, most evaluations focus primarily on accuracy, and the efficiency of generation receives far less attention. For example, HELM, LM-Eval, and the LMSYS Chatbot Arena rank models almost entirely by task accuracy. Yet in real systems, the difference between generating 10K and 100K tokens translates directly into latency, cost, and energy.
We introduce OckBench, the first model-agnostic, hardware-agnostic benchmark that jointly measures accuracy and decoding token count for reasoning and coding tasks. Our key contributions are the benchmark itself and the empirical study built on it.
Through experiments comparing multiple open- and closed-source models, we find that models with comparable accuracy can differ wildly in token consumption. Among commercial models, for instance, one high-accuracy model required over 2× the tokens of another to reach similar accuracy. This reveals that efficiency variance is a neglected but significant axis of differentiation in LLM evaluation.
OckBench provides comprehensive evaluation across multiple dimensions
OckBench is structured to test LLMs' reasoning efficiency across two complementary domains: mathematical problem solving and coding. To better expose token-efficiency differences, we select questions that exhibit high variance in decoding token usage among baseline models.
OckBench uses decoding token count as the core efficiency metric—a model- and hardware-agnostic measure that captures the intrinsic reasoning efficiency of models.
By selecting questions with high variance in token consumption across models, OckBench reveals efficiency differences that traditional accuracy-only benchmarks miss.
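The selection procedure is not spelled out in detail here, but the idea reduces to ranking candidate questions by how widely baseline models disagree on decoding length. Below is a minimal sketch of one plausible criterion; the function name, `top_k`, and the token-count dictionary are illustrative assumptions, not OckBench's actual pipeline.

```python
import statistics

def select_high_variance_questions(token_counts: dict[str, list[int]], top_k: int) -> list[str]:
    """Rank candidate questions by the spread of decoding token counts across baseline models.

    token_counts maps a question id to the token counts each baseline model used on it;
    the top_k questions with the largest spread are kept for the benchmark.
    """
    spread = {qid: statistics.pstdev(counts) for qid, counts in token_counts.items()}
    return sorted(spread, key=spread.get, reverse=True)[:top_k]
```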
Performance of various LLMs on OckBench. Models are ranked by reasoning efficiency, computed as #Tokens / Accuracy (lower is better). Results are reported separately for the Math and Coding domains.
- #Tokens: average decoding token count
- Accuracy (%): percentage of correctly solved problems
- Reasoning Efficiency: #Tokens / Accuracy (lower is better)
| # | Model | Category | #Tokens | Accuracy (%) | Reasoning Efficiency |
|---|---|---|---|---|---|
| 1 | GPT-4o (OpenAI) | Commercial | 495 | 35 | 14.1 |
| 2 | GPT-4.1 (OpenAI) | Commercial | 872 | 59 | 14.9 |
| 3 | Sky-T1-7B (NovaSky-AI) | Open-Source | 556 | 33 | 17.1 |
| 4 | GPT-5 (OpenAI) | Commercial | 2,336 | 73 | 32.2 |
| 5 | GPT-o3 (OpenAI) | Commercial | 2,347 | 64 | 36.8 |
| 6 | Gemini-2.5 Flash (Google) | Commercial | 4,777 | 66 | 72.6 |
| 7 | Gemini-2.5 Pro (Google) | Commercial | 5,198 | 68 | 76.2 |
| 8 | Qwen3-14B (non-thinking, Alibaba) | Open-Source | 3,010 | 33 | 92.0 |
| 9 | Qwen3-4B (non-thinking, Alibaba) | Open-Source | 3,494 | 30 | 118.4 |
| 10 | Qwen3-8B (non-thinking, Alibaba) | Open-Source | 3,692 | 30 | 124.1 |
| 11 | Nemotron-14B (NVIDIA) | Open-Source | 5,540 | 40 | 139.4 |
| 12 | Sky-T1-mini (NovaSky-AI) | Open-Source | 6,657 | 33 | 204.8 |
| 13 | Qwen3-14B (thinking, Alibaba) | Open-Source | 8,190 | 40 | 206.0 |
| 14 | Nemotron-7B (NVIDIA) | Open-Source | 8,895 | 35 | 254.2 |
| 15 | AReaL-boba-2-14B (inclusionAI) | Open-Source | 10,439 | 38 | 278.4 |
| 16 | AReaL-boba-2-8B (inclusionAI) | Open-Source | 17,038 | 37 | 457.4 |
| 17 | Qwen3-8B (thinking, Alibaba) | Open-Source | 20,440 | 38 | 541.5 |
| 18 | Qwen3-4B (thinking, Alibaba) | Open-Source | 24,025 | 37 | 649.3 |
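The efficiency column is simply the ratio of the two preceding columns; a tiny helper (hypothetical, written only to illustrate the metric) reproduces the reported values:

```python
def reasoning_efficiency(avg_tokens: float, accuracy_pct: float) -> float:
    # Average decoding tokens spent per percentage point of accuracy; lower is better.
    return avg_tokens / accuracy_pct

# Example from the leaderboard: GPT-4o averages 495 tokens at 35% accuracy.
print(round(reasoning_efficiency(495, 35), 1))  # 14.1
```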
Sample problems from OckBench-Math and OckBench-Coding
These examples illustrate the types of problems where token efficiency varies significantly across models.
Question: A store sells notebooks for $3 each. If you buy more than 10, you get a 20% discount on the total price. How much would it cost to buy 15 notebooks?
Domain: Mathematics | Source: GSM8K
Token variance: Some models use 200 tokens, others use 2,000+ for the same answer.
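For reference, the arithmetic fits in one line: 15 × $3 = $45, and applying the 20% discount gives $45 × 0.8 = $36. A terse model can state exactly that, while a verbose one narrates every intermediate step.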
Question: Find the number of ordered pairs (a,b) of integers such that a² + b² = 2024 and both a and b are positive.
Domain: Mathematics | Source: AIME 2024
Token variance: High variance across models due to different reasoning approaches.
Task: Write a function to find the longest common subsequence of two strings. For example, lcs("ABCDGH", "AEDFHR") should return 3 (the LCS is "ADH").
Domain: Coding | Source: MBPP variant
Token variance: Efficient models write concise code with brief explanations.
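To make the contrast concrete, here is what a token-frugal answer can look like: a minimal dynamic-programming sketch written for this page (an illustration, not a reference solution shipped with the benchmark).

```python
def lcs(a: str, b: str) -> int:
    # Classic LCS dynamic program with a rolling row:
    # prev[j] holds the LCS length of the processed prefix of `a` against b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

assert lcs("ABCDGH", "AEDFHR") == 3  # the LCS is "ADH"
```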
Planned Extension: We plan to extend OckBench to additional domains such as algorithmic challenges, debugging tasks, and code transformation problems, providing a more comprehensive evaluation of token efficiency.
Status: Coming in next version | Focus: Broader coverage
Stay tuned for expanded benchmark coverage across more reasoning domains!
If you find OckBench useful for your research, please cite our work:
@inproceedings{du2025ockbench,
author = {Du, Zheng and Kang, Hao and Zhu, Ligeng and Han, Song and Krishna, Tushar},
title = {OckBench: Tokens are Not to Be Multiplied without Necessity},
booktitle = {NeurIPS 2025 Workshop on Efficient Reasoning},
year = {2025}
}