Performance & Cost Analysis - November 2025
Comprehensive comparison of the latest reasoning models including KIMI K2, Grok 4, DeepSeek R1, OpenAI o3, and Claude Sonnet 4.5
Key figures at a glance:

- KIMI K2 Thinking: 68.70 score per dollar (best value)
- Grok 4 Fast: $0.28 per 1M tokens (lowest blended cost)
- Grok 4: 87.5%
- KIMI K2 Thinking: 44.9% on HLE (with tools)
| Model | Blended Cost ($/1M tokens) | Input Cost | Output Cost | GPQA Diamond | AIME | HLE (Tools) | BrowseComp | SWE-Bench | LiveCodeBench | MATH-500 | Context (K tokens) | Score/$ (Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
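The last two columns are derived rather than measured. Below is a minimal sketch of how a blended cost and a score-per-dollar value figure can be computed; the 3:1 input-to-output token mix, the `ModelPricing` and `score_per_dollar` names, and the example prices are illustrative assumptions, not the article's published methodology.

```python
from dataclasses import dataclass

@dataclass
class ModelPricing:
    name: str
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens

def blended_cost(p: ModelPricing, input_ratio: float = 0.75) -> float:
    """Blend input and output prices into a single $/1M-token figure.

    Assumes a 3:1 input-to-output token mix (input_ratio = 0.75); the
    article does not state the weighting behind its blended column.
    """
    return input_ratio * p.input_per_1m + (1 - input_ratio) * p.output_per_1m

def score_per_dollar(avg_benchmark_score: float, blended: float) -> float:
    """Value metric: average benchmark score divided by blended $/1M tokens."""
    return avg_benchmark_score / blended

# Hypothetical prices and score, for illustration only.
model = ModelPricing("example-model", input_per_1m=0.60, output_per_1m=2.50)
cost = blended_cost(model)
print(f"blended ${cost:.2f}/1M tokens -> {score_per_dollar(82.0, cost):.1f} score per dollar")
```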
Projected cost by usage volume:

| Model | 1M Tokens | 10M Tokens | 100M Tokens |
|---|---|---|---|
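The volume columns above scale linearly from the blended per-1M-token rate. A short sketch of that projection, assuming simple linear extrapolation (the `project_costs` helper is hypothetical) and using the $1.23/1M-token rate cited below for KIMI K2 Thinking (base):

```python
def project_costs(blended_per_1m: float, volumes_m=(1, 10, 100)) -> dict:
    """Project total spend (USD) at several token volumes, given in millions."""
    return {f"{v}M tokens": round(blended_per_1m * v, 2) for v in volumes_m}

print(project_costs(1.23))
# {'1M tokens': 1.23, '10M tokens': 12.3, '100M tokens': 123.0}
```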
Benchmark descriptions:

- GPQA Diamond: graduate-level science questions across physics, chemistry, and biology
- AIME: American Invitational Mathematics Examination; Olympiad-level problem solving
- HLE (Tools): Humanity's Last Exam; multi-step reasoning across 2,500 questions spanning dozens of domains, with tool use
- BrowseComp: continuous web browsing, search, and reasoning over hard-to-find information
- SWE-Bench: real GitHub issues requiring software engineering and debugging
- LiveCodeBench: live coding competitions and algorithm challenges
- MATH-500: high-school level mathematical problems requiring detailed reasoning
Key findings:

- Best value: KIMI K2 Thinking (base), at $1.23 per 1M tokens with near-frontier performance (68.70 score per dollar).
- Cheapest: Grok 4 Fast, at $0.28 per 1M tokens, though with less comprehensive benchmark coverage.
- Agentic reasoning: KIMI K2 Thinking leads with 44.9% on HLE and 60.2% on BrowseComp, making it the best choice for autonomous multi-step reasoning.
- Coding: KIMI K2 Thinking posts the highest SWE-Bench (71.3%) and LiveCodeBench (83.1%) scores.
- Math: DeepSeek R1 leads MATH-500 with 97.3%, though KIMI K2 leads AIME with 94.5%.
- Long context: Grok 4 Fast offers a 2M-token context window vs. K2's 256K for processing entire books or codebases.
- Open source: KIMI K2 Thinking or DeepSeek R1; both are MIT licensed for self-hosting, with K2 offering superior performance.