UI-Bench

Can a model rebuild a screen from words alone?

A vision model describes a mobile screenshot in text. A coding agent rebuilds the screen from that text only — never seeing the original — and we render it on an iOS simulator and score the match.

screenshot → vision model → text → harness + model → iOS render → score

2
screens
2
vision models
3
harnesses
12/12
cells scored
Browse →
gallery

Vision models

Decomposition — marginal over harness × model

# Vision model n Score
1 google/gemini-2.5-pro 6 69.7
2 anthropic/claude-opus-4.5 6 66.5

Coding harnesses

Reconstruction — marginal over vision × model

# Harness n Score
1 codex 4 69.3
2 claude-code 4 69.3
3 opencode 4 65.8

Coding models

Marginal over vision × harness

# Coding model n Score
1 gpt-5.5 4 69.3
2 sonnet 4 69.3
3 deepseek/deepseek-v4-pro 4 65.8

Full pipeline

Vision · harness / model, end to end

# Pipeline Cost n Score
1 google/gemini-2.5-pro · claude-code/sonnet 2 71.7
2 google/gemini-2.5-pro · codex/gpt-5.5 2 71.0
3 anthropic/claude-opus-4.5 · codex/gpt-5.5 2 67.6
4 anthropic/claude-opus-4.5 · claude-code/sonnet 2 66.9
5 google/gemini-2.5-pro · opencode/deepseek/deepseek-v4-pro 2 66.6
6 anthropic/claude-opus-4.5 · opencode/deepseek/deepseek-v4-pro 2 65.0