Can a model rebuild a screen from words alone?

A vision model describes a mobile screenshot in text. A coding agent rebuilds the screen from that text only — never seeing the original — and we render it on an iOS simulator and score the match.

screenshot → vision model → text → harness + model → iOS render → score

screens

vision models

harnesses

12/12

cells scored

Browse →

gallery

Vision models

Decomposition — marginal over harness × model

#	Vision model		n	Score
1	anthropic/claude-opus-4.5		6	52.4
2	google/gemini-2.5-pro		6	49.9

Coding harnesses

Reconstruction — marginal over vision × model

#	Harness	n	Score
1	codex	4	53.5
2	claude-code	4	52.3
3	opencode	4	47.7

Coding models

Marginal over vision × harness

#	Coding model	n	Score
1	gpt-5.5	4	53.5
2	sonnet	4	52.3
3	deepseek/deepseek-v4-pro	4	47.7

Full pipeline

Vision · harness / model, end to end

#	Pipeline	Vision	Harness	n	Score
1	anthropic/claude-opus-4.5 · claude-code/sonnet	—	—	2	54.7
2	google/gemini-2.5-pro · codex/gpt-5.5	—	—	2	54.4
3	anthropic/claude-opus-4.5 · codex/gpt-5.5	—	—	2	52.6
4	google/gemini-2.5-pro · claude-code/sonnet	—	—	2	49.8
5	anthropic/claude-opus-4.5 · opencode/deepseek/deepseek-v4-pro	—	—	2	49.8
6	google/gemini-2.5-pro · opencode/deepseek/deepseek-v4-pro	—	—	2	45.6