Methodology
How the benchmark works
UI-Bench runs two complementary benchmarks. The primary reconstruction benchmark isolates how well a model can describe and rebuild an existing screen. A second, one-shot benchmark measures how well a coding harness builds a whole app from a written product spec.
Reconstruction pipeline
Every cell is one end-to-end run keyed by (vision, model, harness, screenshot):
- Decompose. A vision model sees a mobile screenshot and writes a precise text description — layout, components, copy, colours, spacing.
- Rebuild. A coding harness running a coding model is given only that text and edits screens/Target.tsx in a pinned Expo scaffold. It never sees the original image.
- Render. The rebuilt screen is loaded in Expo Go on a booted iOS simulator and screenshotted.
- Score. The render is compared against the original, both by objective metrics and by an AI judge.
The text description is a deliberate bottleneck. Because the rebuild step is blind to the image, the benchmark cleanly separates decomposition quality (vision) from reconstruction quality (harness + model).
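The data flow of a single cell can be sketched in TypeScript as below. This is a minimal illustration, not the benchmark's actual code: `CellKey`, `Stages`, and `runCell` are hypothetical names, and the stages are written as synchronous functions for brevity where a real pipeline would likely be asynchronous. The point it shows is that the rebuild stage's signature accepts only the text description, so the screenshot cannot reach it.

```typescript
// Hypothetical names throughout: a sketch of the data flow, not the benchmark's code.
interface CellKey {
  vision: string;     // vision model that writes the description
  model: string;      // coding model that performs the rebuild
  harness: string;    // coding harness that drives the coding model
  screenshot: string; // path or id of the original screenshot
}

interface Stages {
  // Image in, text description out.
  decompose(screenshotPath: string, vision: string): string;
  // Text in, source for screens/Target.tsx out. No image parameter exists.
  rebuild(description: string, harness: string, model: string): string;
  // Source in, path of the rendered screenshot out.
  render(source: string): string;
  // Original and render in, headline score out.
  score(originalPath: string, renderPath: string): number;
}

function runCell(key: CellKey, stages: Stages): number {
  const description = stages.decompose(key.screenshot, key.vision);
  // Only the description crosses this boundary; the rebuild is blind to the image.
  const source = stages.rebuild(description, key.harness, key.model);
  const renderPath = stages.render(source);
  return stages.score(key.screenshot, renderPath);
}
```

Because the bottleneck is expressed in the types, swapping the vision model changes only what `decompose` returns, while swapping the harness or coding model changes only what `rebuild` does with that text.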
Scoring
Each rebuild gets a composite headline score from 0 to 100, a configurable blend of two score families (currently weighted 0.7 judge / 0.3 objective):
AI judge
A strong vision model sees original and rebuild side by side and scores a fixed rubric at temperature 0, returning structured JSON:
- layout & hierarchy
- component correctness
- content / text fidelity
- colour & style
- overall gestalt
Objective metrics
Mechanical comparisons on the normalized image pair:
- SSIM — structural similarity
- colour histogram distance
- layout IoU — content-region overlap
- OCR text match
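As an illustration of the simplest of these metrics, layout IoU can be sketched as intersection-over-union of two foreground masks. The sketch assumes each image has already been reduced to a flat, same-sized boolean mask (true marks a content pixel); how the benchmark derives its masks is not described here, and `layoutIoU` is a hypothetical name.

```typescript
// Assumption: both masks are flat, row-major, and cover the same normalized image size.
function layoutIoU(a: readonly boolean[], b: readonly boolean[]): number {
  if (a.length !== b.length) {
    throw new Error("masks must have the same size");
  }
  let intersection = 0;
  let union = 0;
  for (let i = 0; i < a.length; i++) {
    const inA = a[i] === true;
    const inB = b[i] === true;
    if (inA && inB) intersection++;
    if (inA || inB) union++;
  }
  // Two empty masks are treated as a perfect match by convention (an assumption).
  return union === 0 ? 1 : intersection / union;
}
```

A metric of this shape rewards putting content in the right regions but says nothing about what the content is, which is why it is paired with the other metrics above.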
All sub-scores are stored, so leaderboards can be re-weighted after the fact without re-running anything.
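The blend and the after-the-fact re-weighting can be sketched as follows. This assumes each family has already been reduced to a single 0 to 100 score; how the individual rubric items and metrics are aggregated within a family is not specified here, and the names `FamilyScores`, `Weights`, and `composite` are hypothetical.

```typescript
// Assumption: each family score is already on a 0-100 scale.
interface FamilyScores {
  judge: number;
  objective: number;
}

interface Weights {
  judge: number;
  objective: number;
}

// The weighting stated in the text: 0.7 judge / 0.3 objective.
const CURRENT_WEIGHTS: Weights = { judge: 0.7, objective: 0.3 };

function composite(scores: FamilyScores, weights: Weights = CURRENT_WEIGHTS): number {
  const total = weights.judge + weights.objective;
  if (total <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  // Normalizing by the total keeps the result on the 0-100 scale for any weights.
  return (weights.judge * scores.judge + weights.objective * scores.objective) / total;
}

// Re-weighting reuses stored scores; nothing is re-rendered or re-judged.
const stored: FamilyScores = { judge: 80, objective: 60 };
const headline = composite(stored);                                  // about 74
const evenSplit = composite(stored, { judge: 0.5, objective: 0.5 }); // about 70
```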
Cost capture
Every cell records the USD cost of each model call it makes (the vision description, the harness rebuild, and the judge evaluation) plus a total. Any leg may be unknown for a given provider; unknown values are displayed as "—" and are never assumed to be zero. The gallery shows cost-to-create per rebuild, and the full-pipeline leaderboard reports mean cost per configuration so quality can be read against price.
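One way to keep unknown costs from being mistaken for zero is to model them as `null` rather than `0`, as in the sketch below. The names are hypothetical, and the rule that a total is unknown whenever any leg is unknown is an assumption made for illustration; the benchmark may instead report a partial total.

```typescript
// null means "the provider did not report usage", which is different from a cost of 0.
type Cost = number | null;

interface CellCost {
  vision: Cost;  // vision description call
  harness: Cost; // harness rebuild
  judge: Cost;   // judge evaluation
}

// Assumption: if any leg is unknown, the total is unknown too.
function totalCost(cost: CellCost): Cost {
  let sum = 0;
  for (const leg of [cost.vision, cost.harness, cost.judge]) {
    if (leg === null) return null;
    sum += leg;
  }
  return sum;
}

// Placeholder glyph for unknown values, written as a Unicode escape (U+2014).
const UNKNOWN_LABEL = "\u2014";

function formatCost(cost: Cost): string {
  return cost === null ? UNKNOWN_LABEL : `$${cost.toFixed(4)}`;
}
```

Keeping `null` distinct from `0` at the type level means any code that sums or averages costs has to decide explicitly what to do with missing legs.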
One-shot from a PRD
The second benchmark removes the screenshot entirely. A coding harness is handed a written product requirements document and asked to build the whole app in one shot. The render is then scored by an AI judge on PRD adherence across six criteria:
- feature coverage — are the specified features actually present?
- visual fidelity — does it look like a finished, intentional product?
- layout — structure and hierarchy of the screen.
- interactions — interactive affordances called for by the spec.
- polish — spacing, states, and finishing details.
- justifications present — did the harness explain its design decisions?
See the one-shot leaderboard for the latest PRD run, including per-criterion scores, cost, and the generated source for each app.
Fairness
- Every harness starts from an identical pinned Expo scaffold with a frozen dependency set and local placeholder assets, copied fresh per run.
- One identical task prompt is injected into every harness via its non-interactive CLI; only the harness and model vary.
- Bounds are identical: every run gets the same wall-clock timeout, and any added dependency or non-compiling output is recorded as a deviation or failure, never silently ignored.
- A run can fail at three points — harness error, build/render failure, or a blank render — and each is a measured outcome with a penalty, not a skip.
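Treating failures as measured outcomes rather than skips can be sketched with a discriminated union. The outcome names and the penalty value below are hypothetical; the text states that failures carry a penalty but not its size, so `FAILURE_SCORE = 0` is only a placeholder assumption.

```typescript
// The three failure points named above, plus the success case.
type Outcome =
  | { kind: "ok"; score: number }
  | { kind: "harness_error" }
  | { kind: "build_or_render_failure" }
  | { kind: "blank_render" };

// Assumption: failed runs score 0. The actual penalty is not stated in the text.
const FAILURE_SCORE = 0;

function headlineScore(outcome: Outcome): number {
  return outcome.kind === "ok" ? outcome.score : FAILURE_SCORE;
}

// Failures stay in the denominator, so they lower the mean instead of vanishing.
function meanHeadline(outcomes: readonly Outcome[]): number {
  if (outcomes.length === 0) {
    throw new Error("no outcomes to average");
  }
  let sum = 0;
  for (const outcome of outcomes) {
    sum += headlineScore(outcome);
  }
  return sum / outcomes.length;
}
```

With this shape, a configuration that scores 80 on one run and produces a blank render on another averages 40 under the assumed penalty, not 80.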
Known limitations (v1)
- Rebuilds render inside Expo Go, which overlays a small developer "Tools" button in the top-right corner of every render — a uniform artifact present on rebuilds but not originals.
- Small rosters and a single repeat per cell: this is a proof of the full pipeline, not a high-statistics ranking.
- Layout IoU is a coarse foreground-overlap metric.
- Cost is only as accurate as each provider's reported usage; some legs may be unattributed.