Frontline AI Reasoning Models Comparison

Best Value Leader

KIMI K2 Thinking (score/dollar)

Most Cost-Effective

Grok 4 Fast ($0.28 per 1M tokens)

Top GPQA Score

Grok 4 (87.5%)

Best Agentic Reasoning

KIMI K2 Thinking (44.9% on HLE with tools)

Model Comparison

Model	Blended Cost ($/1M tokens)	Input Cost	Output Cost	GPQA Diamond	AIME	HLE (Tools)	BrowseComp	SWE-Bench	LiveCodeBench	MATH-500	Context (K tokens)	Score/$ (Value)
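The page does not state how the Score/$ (Value) column is computed. A minimal sketch, assuming it is a benchmark score divided by the blended cost per 1M tokens, and using the $1.23 and 68.70 figures quoted for KIMI K2 Thinking elsewhere on this page. The function names are hypothetical, and which benchmark feeds the score is not specified in the source.

```python
def score_per_dollar(score, blended_cost_per_million):
    """Assumed value metric: benchmark score / blended cost (USD per 1M tokens)."""
    if blended_cost_per_million <= 0:
        raise ValueError("blended cost must be positive")
    return score / blended_cost_per_million


def implied_score(value_ratio, blended_cost_per_million):
    """Invert the assumed metric to recover the score behind a quoted ratio."""
    return value_ratio * blended_cost_per_million


# Figures quoted on this page for KIMI K2 Thinking.
kimi_cost = 1.23    # USD per 1M tokens (blended)
kimi_value = 68.70  # score per dollar

# Under the assumed formula, the quoted ratio implies this underlying score.
kimi_score = implied_score(kimi_value, kimi_cost)

if __name__ == "__main__":
    print(f"implied score: {kimi_score:.1f}")
    print(f"round trip:    {score_per_dollar(kimi_score, kimi_cost):.2f}")
```

If the dashboard uses a different formula (for example, an average across several benchmarks), only `score_per_dollar` would need to change.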

Visual Analysis

[Charts not reproduced: Cost Comparison, Performance Overview, Value Analysis]

Detailed Insights

Pricing Breakdown

Cost Scenarios

Model	1M Tokens	10M Tokens	100M Tokens
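The scenario columns above can be derived from a blended price by simple multiplication. A minimal sketch, assuming pricing is linear in token volume with no volume discounts or caching (the page does not say either way), and using only the two blended prices quoted elsewhere on this page. The names `BLENDED_PRICE_PER_MILLION` and `cost_scenarios` are hypothetical.

```python
# Blended prices in USD per 1M tokens, as quoted elsewhere on this page.
BLENDED_PRICE_PER_MILLION = {
    "KIMI K2 Thinking": 1.23,
    "Grok 4 Fast": 0.28,
}

# Scenario sizes from the table header, in millions of tokens.
VOLUMES_IN_MILLIONS = (1, 10, 100)


def cost_scenarios(price_per_million, volumes=VOLUMES_IN_MILLIONS):
    """Return {volume_in_millions: cost_in_usd}, assuming linear pricing."""
    return {v: round(price_per_million * v, 2) for v in volumes}


if __name__ == "__main__":
    for model, price in BLENDED_PRICE_PER_MILLION.items():
        row = cost_scenarios(price)
        cells = "\t".join(f"${row[v]:,.2f}" for v in VOLUMES_IN_MILLIONS)
        print(f"{model}\t{cells}")
```

Real bills depend on the input/output token mix, so a blended price is only a planning estimate.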

Performance Deep Dive

GPQA Diamond

Graduate-level science questions across physics, chemistry, and biology

AIME 2024/2025

American Invitational Mathematics Examination: competition-level problem solving (a qualifier for the USA Mathematical Olympiad)

Humanity's Last Exam (HLE)

Multi-step reasoning across 2,500 questions spanning dozens of domains; scores here are reported with tool use enabled

BrowseComp

Continuous web browsing, search, and reasoning over hard-to-find information

SWE-Bench Verified

Real GitHub issues requiring software engineering and debugging

LiveCodeBench

Live coding competitions and algorithm challenges

MATH-500

High-school competition mathematics problems requiring detailed reasoning

Use Case Recommendations

Cost-Conscious Production

KIMI K2 Thinking (base)

Best value at $1.23 per 1M tokens with near-frontier performance (68.70 score/dollar)

Maximum Cost Optimization

Grok 4 Fast

Cheapest at $0.28 per 1M tokens, though fewer benchmark results are reported for it

Agentic & Research Workflows

KIMI K2 Thinking

Leads with 44.9% on HLE and 60.2% on BrowseComp, making it the strongest choice for autonomous multi-step reasoning

Coding & Software Engineering

KIMI K2 Thinking

Highest SWE-Bench (71.3%) and LiveCodeBench (83.1%) scores

Mathematical Excellence

DeepSeek R1

Leads MATH-500 with 97.3%, though KIMI K2 Thinking leads AIME with 94.5%

Extremely Long Context

Grok 4 Fast

2M-token context window (vs. 256K for KIMI K2 Thinking), enough to process entire books or codebases

Open Source Requirements

KIMI K2 Thinking or DeepSeek R1

Both can be self-hosted under MIT-style licenses (MIT for DeepSeek R1, a modified MIT license for KIMI K2 Thinking), with K2 Thinking offering stronger performance