UI-Bench

Can a model rebuild a screen from words alone?

A vision model describes a mobile screenshot in text. A coding agent rebuilds the screen from that text only — never seeing the original — and we render it on an iOS simulator and score the match.

screenshot → vision model → text → harness + model → iOS render → score

2
screens
2
vision models
3
harnesses
12/12
cells scored
Browse →
gallery

Vision models

Decomposition — marginal over harness × model

# Vision model n Score
1 anthropic/claude-opus-4.5 6 52.4
2 google/gemini-2.5-pro 6 49.9

Coding harnesses

Reconstruction — marginal over vision × model

# Harness n Score
1 codex 4 53.5
2 claude-code 4 52.3
3 opencode 4 47.7

Coding models

Marginal over vision × harness

# Coding model n Score
1 gpt-5.5 4 53.5
2 sonnet 4 52.3
3 deepseek/deepseek-v4-pro 4 47.7

Full pipeline

Vision · harness / model, end to end

# Pipeline Vision Harness n Score
1 anthropic/claude-opus-4.5 · claude-code/sonnet 2 54.7
2 google/gemini-2.5-pro · codex/gpt-5.5 2 54.4
3 anthropic/claude-opus-4.5 · codex/gpt-5.5 2 52.6
4 google/gemini-2.5-pro · claude-code/sonnet 2 49.8
5 anthropic/claude-opus-4.5 · opencode/deepseek/deepseek-v4-pro 2 49.8
6 google/gemini-2.5-pro · opencode/deepseek/deepseek-v4-pro 2 45.6