Independent head-to-head comparison of 7 frontier models across 27 real-world tasks in code generation, UI design, and research reasoning.
Performance across all 27 tasks. Sorted by average response time.
Select a task from the sidebar, then click a model tab to see its raw output. Design tasks can be rendered as live HTML.
Time to complete each task (seconds). Lower is better.
Total API cost for all 27 tasks per model. OpenRouter pricing as of Feb 17, 2026.
Token output per task.
Click any row to jump to its output.
| Task | Cat | Qwen 3.5 Plus | Gemini 2.5 Pro | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Sonnet 4.6 | Claude Opus 4.6 | Qwen 3.5 397B | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Time | Tokens | Time | Tokens | Time | Tokens | Time | Tokens | Time | Tokens | Time | Tokens | Time | Tokens | ||
| REST API with JWT Auth | code | 27.7s | 2,218 | 41.9s | 5,072 | 35.9s | 3,539 | 27.7s | 2,728 | 38.0s | 3,183 | 36.1s | 2,930 | 100.0s | 3,925 |
| React Data Table Component | code | 47.9s | 4,096 | 46.3s | 6,033 | 72.9s | 8,192 | 42.2s | 4,096 | 84.5s | 8,192 | 86.8s | 8,192 | 122.3s | 4,433 |
| Dynamic Programming - Longest Common Subsequence | code | 27.0s | 1,939 | 52.7s | 6,826 | 43.4s | 3,855 | 50.2s | 4,096 | 44.2s | 4,253 | 60.9s | 4,710 | 136.2s | 8,360 |
| Debug Broken Code | code | 21.1s | 1,448 | 30.9s | 3,595 | 16.0s | 1,074 | 14.7s | 1,017 | 25.6s | 1,588 | 19.2s | 1,162 | 48.7s | 1,969 |
| Refactor Spaghetti Code | code | 17.9s | 1,424 | 40.4s | 4,656 | 16.3s | 1,269 | 16.4s | 1,382 | 19.1s | 1,444 | 13.8s | 986 | 95.6s | 4,766 |
| Complex SQL Queries | code | 15.5s | 1,223 | 40.3s | 5,262 | 21.6s | 1,796 | 21.7s | 1,586 | 34.1s | 2,303 | 19.2s | 1,401 | 87.3s | 4,768 |
| WebSocket Chat Server | code | 37.9s | 3,349 | 60.2s | 6,869 | 111.4s | 11,988 | 91.0s | 8,512 | 122.6s | 13,496 | 135.1s | 12,272 | 48.0s | 4,427 |
| CLI Tool with Argument Parsing | code | 51.8s | 4,096 | 67.2s | 7,839 | 76.7s | 8,192 | 45.0s | 4,096 | 87.0s | 8,192 | 93.5s | 8,181 | 181.0s | 6,640 |
| Write Comprehensive Tests | code | 40.2s | 2,804 | 59.4s | 7,687 | 59.7s | 6,091 | 44.6s | 4,096 | 71.5s | 7,712 | 86.0s | 7,514 | 144.9s | 6,064 |
| Callback to Async/Await Conversion | code | 21.9s | 1,695 | 38.0s | 4,137 | 9.3s | 910 | 14.7s | 1,123 | 33.0s | 1,932 | 15.0s | 1,065 | 96.6s | 5,706 |
| SaaS Landing Page | design | 56.8s | 7,426 | 91.8s | 11,060 | 189.5s | 20,981 | 178.6s | 18,038 | 192.5s | 21,498 | 230.1s | 24,555 | 30.2s | 2,870 |
| Analytics Dashboard | design | 103.4s | 10,107 | 95.8s | 11,854 | 118.4s | 14,486 | 107.9s | 10,902 | 211.4s | 22,729 | 144.8s | 15,446 | 296.5s | 7,288 |
| Mobile Navigation Menu | design | 46.3s | 4,894 | 94.9s | 10,950 | 74.3s | 10,079 | 87.8s | 8,174 | 174.2s | 18,267 | 108.6s | 10,984 | 316.4s | 5,650 |
| Settings/Preferences Page | design | 103.7s | 8,158 | 85.5s | 10,969 | 119.5s | 14,493 | 127.8s | 12,467 | 202.6s | 20,859 | 165.5s | 19,223 | 129.6s | 9,346 |
| Multi-Step Onboarding Flow | design | 75.3s | 5,571 | 85.0s | 11,039 | 104.6s | 11,336 | 125.7s | 11,272 | 128.3s | 11,570 | 131.0s | 14,478 | 443.3s | 9,573 |
| Chat Interface | design | 72.3s | 7,531 | 81.6s | 10,350 | 89.0s | 9,014 | 109.2s | 9,508 | 140.0s | 12,965 | 144.4s | 13,049 | 82.5s | 8,392 |
| Advanced Data Table | design | 73.4s | 7,869 | 90.5s | 11,513 | 87.1s | 9,887 | 101.8s | 10,038 | 99.3s | 12,329 | 107.2s | 11,009 | 273.0s | 7,647 |
| Personality Cloning State of Research | research | 79.2s | 2,572 | 57.1s | 5,908 | 130.5s | 6,803 | 175.9s | 7,320 | 77.0s | 3,801 | 167.2s | 6,377 | 143.3s | 5,640 |
| Data Anomaly Analysis | research | 36.7s | 2,767 | 56.9s | 6,649 | 63.0s | 3,864 | 42.8s | 1,987 | 125.9s | 8,192 | 123.0s | 6,945 | 149.5s | 9,973 |
| Architecture Comparison | research | 35.5s | 2,761 | 70.0s | 7,922 | 119.8s | 8,192 | 158.4s | 8,192 | 122.3s | 8,192 | 123.6s | 8,192 | 138.6s | 5,078 |
| Research Methodology Critique | research | 26.5s | 1,738 | 41.9s | 4,343 | 38.1s | 1,640 | 57.9s | 2,279 | 74.9s | 3,100 | 87.9s | 3,478 | 89.4s | 2,899 |
| Experiment Design | research | 40.2s | 2,584 | 60.2s | 6,198 | 166.8s | 8,192 | 130.7s | 6,070 | 134.4s | 7,968 | 168.7s | 8,192 | 109.1s | 4,205 |
| Audience-Adapted Explanation | research | 29.3s | 2,254 | 49.5s | 5,715 | 92.0s | 4,303 | 79.1s | 2,999 | 115.6s | 5,380 | 135.7s | 5,512 | 92.6s | 3,285 |
| Second-Order Effects Analysis | research | 33.3s | 1,950 | 40.0s | 4,263 | 82.9s | 3,623 | 66.7s | 2,527 | 95.6s | 4,635 | 120.2s | 4,078 | 88.6s | 3,232 |
| Logical Argument Analysis | research | 35.6s | 2,394 | 49.9s | 5,200 | 36.4s | 1,723 | 54.6s | 2,342 | 50.2s | 2,558 | 76.4s | 2,912 | 99.8s | 3,548 |
| Conflicting Data Inference | research | 36.6s | 1,856 | 52.7s | 4,865 | 53.8s | 2,135 | 60.1s | 2,277 | 74.3s | 3,559 | 99.3s | 3,365 | 93.3s | 3,796 |
| Literature Review Outline | research | 36.3s | 2,618 | 46.7s | 5,534 | 105.6s | 6,087 | 152.8s | 8,192 | 124.1s | 7,364 | 121.4s | 6,432 | 106.1s | 4,100 |
How this benchmark was conducted.
Qwen 3.5 397B (Open)
Qwen 3.5 Plus (Hosted)
Claude Sonnet 4.5
Claude Sonnet 4.6
Claude Opus 4.5
Claude Opus 4.6
Gemini 2.5 Pro
All via OpenRouter API. Temp=0. Max 8192 tokens. No system prompt. Identical prompts.
10 Code: APIs, algorithms, debugging, refactoring, SQL, WebSockets, CLI, tests, async.
7 Design: Landing pages, dashboards, nav, settings, onboarding, chat, tables.
10 Research: Analysis, reasoning, critique, experiment design, inference.
189 API calls, zero errors. All raw outputs saved. Checkpoint system for resumable runs. Full code on GitHub.