AI Model Benchmark

Independent head-to-head comparison of 7 frontier models across 27 real-world tasks in code generation, UI design, and research reasoning.

February 17, 2026 27 Tasks 7 Models 189 API Calls Temp = 0

At a Glance

Performance across all 27 tasks. Sorted by average response time.

Qwen 3.5 397B (Open)Qwen 3.5 Plus (Hosted)Claude Sonnet 4.5Claude Sonnet 4.6Claude Opus 4.5Claude Opus 4.6Gemini 2.5 Pro
Qwen 3.5 Plus (Hosted)
45.5savg response
3679avg tokens
81/sthroughput
$0.07total cost
Gemini 2.5 Pro
60.3savg response
7123avg tokens
118/sthroughput
$1.93total cost
Claude Opus 4.5
79.1savg response
6805avg tokens
86/sthroughput
$13.89total cost
Claude Sonnet 4.5
80.9savg response
5827avg tokens
72/sthroughput
$2.38total cost
Claude Sonnet 4.6
100.1savg response
8417avg tokens
84/sthroughput
$3.43total cost
Claude Opus 4.6
104.5savg response
7876avg tokens
75/sthroughput
$16.06total cost
Qwen 3.5 397B (Open)
138.6savg response
5466avg tokens
39/sthroughput
$0.11total cost
Qwen 3.5 Plus
Fastest Avg Response
45.5s per task
Gemini 2.5 Pro
Highest Throughput
118 tok/s
Qwen 3.5 Plus
Lowest Cost
$0.0728 total
Claude Sonnet 4.6
Most Verbose
8417 avg tokens

Model Outputs

Select a task from the sidebar, then click a model tab to see its raw output. Design tasks can be rendered as live HTML.

CODE
REST API with JWT Auth
React Data Table Component
Dynamic Programming - Longest Common Subsequence
Debug Broken Code
Refactor Spaghetti Code
Complex SQL Queries
WebSocket Chat Server
CLI Tool with Argument Parsing
Write Comprehensive Tests
Callback to Async/Await Conversion
DESIGN
SaaS Landing Page
Analytics Dashboard
Mobile Navigation Menu
Settings/Preferences Page
Multi-Step Onboarding Flow
Chat Interface
Advanced Data Table
RESEARCH
Personality Cloning State of Research
Data Anomaly Analysis
Architecture Comparison
Research Methodology Critique
Experiment Design
Audience-Adapted Explanation
Second-Order Effects Analysis
Logical Argument Analysis
Conflicting Data Inference
Literature Review Outline
Select a task from the sidebar to view outputs

Response Speed

Time to complete each task (seconds). Lower is better.

Average by Category

Cost Analysis

Total API cost for all 27 tasks per model. OpenRouter pricing as of Feb 17, 2026.

Output Verbosity

Token output per task.

Average by Category

All Tasks

Click any row to jump to its output.

TaskCatQwen 3.5 PlusGemini 2.5 ProClaude Opus 4.5Claude Sonnet 4.5Claude Sonnet 4.6Claude Opus 4.6Qwen 3.5 397B
TimeTokensTimeTokensTimeTokensTimeTokensTimeTokensTimeTokensTimeTokens
REST API with JWT Authcode27.7s2,21841.9s5,07235.9s3,53927.7s2,72838.0s3,18336.1s2,930100.0s3,925
React Data Table Componentcode47.9s4,09646.3s6,03372.9s8,19242.2s4,09684.5s8,19286.8s8,192122.3s4,433
Dynamic Programming - Longest Common Subsequencecode27.0s1,93952.7s6,82643.4s3,85550.2s4,09644.2s4,25360.9s4,710136.2s8,360
Debug Broken Codecode21.1s1,44830.9s3,59516.0s1,07414.7s1,01725.6s1,58819.2s1,16248.7s1,969
Refactor Spaghetti Codecode17.9s1,42440.4s4,65616.3s1,26916.4s1,38219.1s1,44413.8s98695.6s4,766
Complex SQL Queriescode15.5s1,22340.3s5,26221.6s1,79621.7s1,58634.1s2,30319.2s1,40187.3s4,768
WebSocket Chat Servercode37.9s3,34960.2s6,869111.4s11,98891.0s8,512122.6s13,496135.1s12,27248.0s4,427
CLI Tool with Argument Parsingcode51.8s4,09667.2s7,83976.7s8,19245.0s4,09687.0s8,19293.5s8,181181.0s6,640
Write Comprehensive Testscode40.2s2,80459.4s7,68759.7s6,09144.6s4,09671.5s7,71286.0s7,514144.9s6,064
Callback to Async/Await Conversioncode21.9s1,69538.0s4,1379.3s91014.7s1,12333.0s1,93215.0s1,06596.6s5,706
SaaS Landing Pagedesign56.8s7,42691.8s11,060189.5s20,981178.6s18,038192.5s21,498230.1s24,55530.2s2,870
Analytics Dashboarddesign103.4s10,10795.8s11,854118.4s14,486107.9s10,902211.4s22,729144.8s15,446296.5s7,288
Mobile Navigation Menudesign46.3s4,89494.9s10,95074.3s10,07987.8s8,174174.2s18,267108.6s10,984316.4s5,650
Settings/Preferences Pagedesign103.7s8,15885.5s10,969119.5s14,493127.8s12,467202.6s20,859165.5s19,223129.6s9,346
Multi-Step Onboarding Flowdesign75.3s5,57185.0s11,039104.6s11,336125.7s11,272128.3s11,570131.0s14,478443.3s9,573
Chat Interfacedesign72.3s7,53181.6s10,35089.0s9,014109.2s9,508140.0s12,965144.4s13,04982.5s8,392
Advanced Data Tabledesign73.4s7,86990.5s11,51387.1s9,887101.8s10,03899.3s12,329107.2s11,009273.0s7,647
Personality Cloning State of Researchresearch79.2s2,57257.1s5,908130.5s6,803175.9s7,32077.0s3,801167.2s6,377143.3s5,640
Data Anomaly Analysisresearch36.7s2,76756.9s6,64963.0s3,86442.8s1,987125.9s8,192123.0s6,945149.5s9,973
Architecture Comparisonresearch35.5s2,76170.0s7,922119.8s8,192158.4s8,192122.3s8,192123.6s8,192138.6s5,078
Research Methodology Critiqueresearch26.5s1,73841.9s4,34338.1s1,64057.9s2,27974.9s3,10087.9s3,47889.4s2,899
Experiment Designresearch40.2s2,58460.2s6,198166.8s8,192130.7s6,070134.4s7,968168.7s8,192109.1s4,205
Audience-Adapted Explanationresearch29.3s2,25449.5s5,71592.0s4,30379.1s2,999115.6s5,380135.7s5,51292.6s3,285
Second-Order Effects Analysisresearch33.3s1,95040.0s4,26382.9s3,62366.7s2,52795.6s4,635120.2s4,07888.6s3,232
Logical Argument Analysisresearch35.6s2,39449.9s5,20036.4s1,72354.6s2,34250.2s2,55876.4s2,91299.8s3,548
Conflicting Data Inferenceresearch36.6s1,85652.7s4,86553.8s2,13560.1s2,27774.3s3,55999.3s3,36593.3s3,796
Literature Review Outlineresearch36.3s2,61846.7s5,534105.6s6,087152.8s8,192124.1s7,364121.4s6,432106.1s4,100

Methodology

How this benchmark was conducted.

Models

Qwen 3.5 397B (Open)
Qwen 3.5 Plus (Hosted)
Claude Sonnet 4.5
Claude Sonnet 4.6
Claude Opus 4.5
Claude Opus 4.6
Gemini 2.5 Pro

Setup

All via OpenRouter API. Temp=0. Max 8192 tokens. No system prompt. Identical prompts.

Tasks

10 Code: APIs, algorithms, debugging, refactoring, SQL, WebSockets, CLI, tests, async.
7 Design: Landing pages, dashboards, nav, settings, onboarding, chat, tables.
10 Research: Analysis, reasoning, critique, experiment design, inference.

Reproducibility

189 API calls, zero errors. All raw outputs saved. Checkpoint system for resumable runs. Full code on GitHub.