Methodology

How the benchmark works

UI-Bench runs two complementary benchmarks. The primary reconstruction benchmark isolates how well a model can describe and rebuild an existing screen. A second one-shot benchmark measures how well a coding harness builds a whole app from a written product spec.

Reconstruction pipeline

Every cell is one end-to-end run keyed by (vision, model, harness, screenshot):

Decompose. A vision model sees a mobile screenshot and writes a precise text description — layout, components, copy, colours, spacing.
Rebuild. A coding harness running a coding model is given only that text and edits screens/Target.tsx in a pinned Expo scaffold. It never sees the original image.
Render. The rebuilt screen is loaded in Expo Go on a booted iOS simulator and screenshotted.
Score. The render is compared against the original, objectively and by an AI judge.

The text description is a deliberate bottleneck. Because the rebuild step is blind to the image, the benchmark cleanly separates decomposition quality (vision) from reconstruction quality (harness + model).
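The blind hand-off can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `CellKey`, `run_cell`, and the four stage callables are hypothetical names, and the stages are passed in as plain functions so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class CellKey:
    """Identifies one end-to-end run (field names are illustrative)."""
    vision: str
    model: str
    harness: str
    screenshot: str


def run_cell(
    key: CellKey,
    original_png: bytes,
    decompose: Callable[[bytes], str],
    rebuild: Callable[[str], str],
    render: Callable[[str], bytes],
    score: Callable[[bytes, bytes], float],
) -> Tuple[CellKey, float]:
    """Run the four stages for one cell and return (key, score)."""
    description = decompose(original_png)  # vision model: image -> text
    source = rebuild(description)          # harness: receives text only, never the image
    rendered_png = render(source)          # screenshot of the rebuilt screen
    return key, score(original_png, rendered_png)
```

The point of the signature is that `rebuild` takes a `str` and nothing else, so the original image cannot leak into the reconstruction step.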

Scoring

Each rebuild gets a composite headline score from 0 to 100, a configurable blend of two score families (currently 0.7 judge / 0.3 objective). The scale is deliberately harsh and wide: a near-perfect rebuild scores in the 80s–90s, a decent but clearly flawed one lands in the 50s–60s, a render with a primary element missing or replaced by a placeholder box drops into the 30s–40s, and a crash or blank render scores near zero. Obvious quality gaps produce obvious score gaps.
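As a rough sketch of the blend, assuming it is a plain weighted mean and that both family scores have already been normalized to 0–100 (neither detail is stated above; `composite_score` is a hypothetical name):

```python
def composite_score(judge: float, objective: float, judge_weight: float = 0.7) -> float:
    """Blend two 0-100 family scores into one 0-100 headline score.

    Assumption: a linear weighted mean; the objective weight is 1 - judge_weight.
    """
    if not 0.0 <= judge_weight <= 1.0:
        raise ValueError("judge_weight must be in [0, 1]")
    return judge_weight * judge + (1.0 - judge_weight) * objective


# A judge score of 80 and an objective score of 60 blend to about 74.
headline = round(composite_score(80.0, 60.0), 1)
```

With the current 0.7 / 0.3 weights, a 10-point change in the judge score moves the headline by 7 points, while the same change in the objective score moves it by 3.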

AI judge

A vision model sees original and rebuild side by side and scores five criteria on an anchored 0–10 scale (0 = absent/crash, 5–6 = roughly right with visible flaws, 9–10 = near-indistinguishable) at temperature 0, returning structured JSON:

layout & hierarchy
component correctness
content / text / illustration fidelity
colour & style
overall gestalt

Completeness rule: if a primary element is missing, blank, or shown as a placeholder rectangle, the component correctness, content fidelity, and overall gestalt criteria are each capped at 2. There is no credit for "the box is in the right place".
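The cap can be expressed as a small post-processing step. This is only a sketch: whether the rule is enforced in the judge prompt or applied afterwards is not specified here, and the short criterion keys and the function name are hypothetical.

```python
from typing import Dict

# Criteria affected by the completeness rule (short keys are illustrative).
CAPPED_CRITERIA = ("components", "content", "gestalt")
CAP = 2


def apply_completeness_cap(scores: Dict[str, int], primary_missing: bool) -> Dict[str, int]:
    """Return a copy of the 0-10 criterion scores with the cap applied if needed."""
    if not primary_missing:
        return dict(scores)
    return {
        name: min(value, CAP) if name in CAPPED_CRITERIA else value
        for name, value in scores.items()
    }


raw = {"layout": 8, "components": 7, "content": 6, "colour": 9, "gestalt": 7}
capped = apply_completeness_cap(raw, primary_missing=True)
# layout and colour keep their scores; the other three drop to 2.
```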

Objective metrics

Mechanical comparisons on the normalized image pair:

SSIM on RGB — structure and colour
spatial colour match (per-region, not just palette)
edge-IoU layout — do components line up
OCR text match
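To make the layout metric concrete, here is a minimal intersection-over-union over two binary edge masks. It assumes edge detection has already produced same-sized masks of 0/1 values; the detector, any tolerance for small offsets, and the name `edge_iou` are not taken from the benchmark.

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 values, 1 = edge pixel


def edge_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two same-sized binary edge masks."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            intersection += bool(pa and pb)
            union += bool(pa or pb)
    # Assumption: two edge-free images count as a perfect match.
    return intersection / union if union else 1.0


target = [[1, 1], [0, 0]]
rebuild = [[1, 0], [1, 0]]
overlap = edge_iou(target, rebuild)  # 1 shared edge pixel out of 3 -> one third
```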

A completeness factor scales the objective score down sharply when the rebuild is far sparser than the target (blank / placeholder renders).
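One way such a factor could behave is sketched below. Everything specific is an assumption made for illustration: measuring sparsity as an edge-pixel density, the 0.5 threshold, the squared falloff, and the name `completeness_factor` are not the benchmark's documented formula.

```python
def completeness_factor(
    rebuild_density: float,
    target_density: float,
    floor_ratio: float = 0.5,
) -> float:
    """Multiplier in [0, 1] that shrinks as the rebuild gets sparser than the target.

    Densities are fractions of pixels that carry content (for example, edge pixels).
    """
    if target_density <= 0.0:
        return 1.0  # an empty target gives nothing to be sparser than
    ratio = rebuild_density / target_density
    if ratio >= floor_ratio:
        return 1.0  # dense enough: no penalty
    # Quadratic falloff so that near-blank renders are penalized sharply.
    return (ratio / floor_ratio) ** 2
```

Under these assumed parameters, a rebuild with a quarter of the target's density keeps roughly 25% of its objective score, and a blank render keeps none.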

All sub-scores are stored. Scoring is decoupled from rendering: uibench-rescore recomputes scores on saved renders without re-rendering anything. The objective metrics are free to recompute, while the judge is re-run against the original.

Cost capture

Every cell records the USD cost of each model call it makes — the vision description, the harness rebuild, and the judge evaluation — plus a total. Any leg may be unknown for a given provider; unknown values are reported as — and are never assumed to be zero. The gallery shows cost-to-create per rebuild, and the full-pipeline leaderboard reports mean cost per configuration so quality can be read against price.
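The "unknown is not zero" rule maps naturally onto optional values. The sketch below uses hypothetical names (`CellCost`, `format_usd`) and makes one assumption not stated above: the total is treated as unknown whenever any leg is unknown.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "\u2014"  # the dash placeholder shown for unknown costs


@dataclass(frozen=True)
class CellCost:
    """USD cost per leg; None means the provider did not report it."""
    vision_usd: Optional[float]
    harness_usd: Optional[float]
    judge_usd: Optional[float]

    @property
    def total_usd(self) -> Optional[float]:
        legs = (self.vision_usd, self.harness_usd, self.judge_usd)
        # Assumption: never sum around a gap, since that would treat it as zero.
        if any(leg is None for leg in legs):
            return None
        return sum(legs)


def format_usd(value: Optional[float]) -> str:
    """Render a cost for display, keeping unknown distinct from $0."""
    return UNKNOWN if value is None else f"${value:.4f}"
```

For example, `CellCost(0.01, 0.20, None).total_usd` is `None` and displays as the dash placeholder, rather than as a misleading `$0.2100`.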

One-shot from a PRD

The second benchmark removes the screenshot entirely. A coding harness is handed a written product requirements document and asked to build the whole app in one shot. The render is then scored by an AI judge on PRD adherence across six criteria:

feature coverage — are the specified features actually present?
visual fidelity — does it look like a finished, intentional product?
layout — structure and hierarchy of the screen.
interactions — interactive affordances called for by the spec.
polish — spacing, states, and finishing details.
justifications present — did the harness explain its design decisions?

See the one-shot leaderboard for the latest PRD run, including per-criterion scores, cost, and the generated source for each app.

Fairness

Every harness starts from an identical pinned Expo scaffold with a frozen dependency set and local placeholder assets, copied fresh per run.
One identical task prompt is injected into every harness via its non-interactive CLI; only the harness and model vary.
Bounds are identical: every run gets the same wall-clock timeout, and any added dependency or non-compiling output is recorded as a deviation or failure, never silently ignored.
A run can fail at three points — harness error, build/render failure, or a blank render — and each is a measured outcome scored 0 that counts toward the model's average (and is shown as a fail count on the leaderboard), not a skip. A pure infrastructure flake (Metro / simulator) is the one exception: it is retried and excluded, never blamed on the model.
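The averaging rule can be sketched as follows. The `Outcome` enum and `summarize` are hypothetical names; the logic mirrors the description above, where the three failure types score 0 and count, while an infrastructure flake is dropped from the average.

```python
from enum import Enum
from statistics import mean
from typing import List, Optional, Tuple


class Outcome(Enum):
    OK = "ok"
    HARNESS_ERROR = "harness_error"
    BUILD_FAILURE = "build_failure"
    BLANK_RENDER = "blank_render"
    INFRA_FLAKE = "infra_flake"  # Metro / simulator problem, not the model's fault


def summarize(runs: List[Tuple[Outcome, float]]) -> Tuple[Optional[float], int]:
    """Return (mean score, fail count) for one configuration."""
    scores: List[float] = []
    fails = 0
    for outcome, score in runs:
        if outcome is Outcome.INFRA_FLAKE:
            continue  # excluded; only the retried attempt would be recorded
        if outcome is Outcome.OK:
            scores.append(score)
        else:
            scores.append(0.0)  # a measured failure, not a skip
            fails += 1
    return (mean(scores) if scores else None, fails)


runs = [
    (Outcome.OK, 80.0),
    (Outcome.BLANK_RENDER, 12.0),  # any partial score is overridden by 0
    (Outcome.INFRA_FLAKE, 0.0),
    (Outcome.OK, 70.0),
]
average, fail_count = summarize(runs)  # (80 + 0 + 70) / 3 = 50.0, with 1 fail
```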

This run

run: v1
dataset: v1
vision: anthropic/claude-opus-4.5, google/gemini-2.5-pro
opencode: deepseek/deepseek-v4-pro
codex: gpt-5.5
claude-code: sonnet
judge: openai/gpt-4.1
blend: judge 0.7 · objective 0.3

Interactive demos

Every rebuild card in the gallery has a segmented Image / Live / Code control. Live runs the rebuild in your browser via react-native-web; Code opens the source in an Expo Snack editor. Clicking a card opens a full-size comparison with the same controls beside the original and the scores.

Live demos use react-native-web, not the iOS simulator the benchmark scores against. Fonts, shadows, and native-only APIs can diverge from the scored render — and a few rebuilds that depend on unsupported native modules or bundled assets have no Live/Code demo. For the true scored look, use the Image view (the scored PNG).

Known limitations (v1)

Rebuilds render inside Expo Go, which overlays a small developer "Tools" button in the top-right corner of every render — a uniform artifact present on rebuilds but not originals.
Small rosters and a single repeat per cell: this is a proof of the full pipeline, not a high-statistics ranking.
Objective metrics are heuristics: edge-IoU and OCR are sensitive to font and anti-aliasing differences, so the AI judge carries most of the signal.
Cost is only as accurate as each provider's reported usage; some legs may be unattributed.