UI-Bench

Methodology

How the benchmark works

UI-Bench runs two complementary benchmarks. The primary reconstruction benchmark isolates how well a model can describe and rebuild an existing screen. A second one-shot benchmark measures how well a coding harness builds a whole app from a written product spec.

Reconstruction pipeline

Every cell is one end-to-end run keyed by (vision, model, harness, screenshot):

  1. Decompose. A vision model sees a mobile screenshot and writes a precise text description — layout, components, copy, colours, spacing.
  2. Rebuild. A coding harness running a coding model is given only that text and edits screens/Target.tsx in a pinned Expo scaffold. It never sees the original image.
  3. Render. The rebuilt screen is loaded in Expo Go on a booted iOS simulator and screenshotted.
  4. Score. The render is compared against the original using both objective metrics and an AI judge.

The text description is a deliberate bottleneck. Because the rebuild step is blind to the image, the benchmark cleanly separates decomposition quality (vision) from reconstruction quality (harness + model).
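The four steps can be sketched as a single function over pluggable stages. This is only an illustration: the names `CellKey`, `Stages`, and `runCell` are hypothetical and not part of UI-Bench, and the stages are kept synchronous for brevity where a real pipeline would likely be asynchronous. The point it shows is the bottleneck: the rebuild stage receives the description, the harness, and the model, but never the screenshot.

```typescript
// Hypothetical types for illustration; not UI-Bench's actual API.
interface CellKey {
  vision: string;     // vision model that writes the description
  model: string;      // coding model driven by the harness
  harness: string;    // coding harness
  screenshot: string; // identifier of the original screenshot
}

interface Stages {
  decompose(screenshot: string, vision: string): string;               // image -> text
  rebuild(description: string, harness: string, model: string): string; // text -> screen source
  render(source: string): string;                                      // source -> rendered image id
  score(original: string, rendered: string): number;                   // image pair -> score
}

function runCell(key: CellKey, stages: Stages): number {
  const description = stages.decompose(key.screenshot, key.vision);
  // Deliberate bottleneck: only the text description crosses into the rebuild step.
  const source = stages.rebuild(description, key.harness, key.model);
  const rendered = stages.render(source);
  return stages.score(key.screenshot, rendered);
}
```

Because `rebuild` has no screenshot parameter, the separation between decomposition and reconstruction is enforced by the function signature rather than by convention.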

Scoring

Each rebuild gets a composite headline score from 0 to 100: a configurable blend of two score families, currently weighted 0.7 judge / 0.3 objective.

AI judge

A strong vision model sees original and rebuild side by side and scores a fixed rubric at temperature 0, returning structured JSON:

  • layout & hierarchy
  • component correctness
  • content / text fidelity
  • colour & style
  • overall gestalt
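A structured judge reply could be validated along these lines before it is stored. This is a sketch under assumptions: the key names in `RUBRIC` and the function `parseJudgeReply` are hypothetical, and the source does not specify the JSON schema or the score range, so the check only requires one finite number per rubric item.

```typescript
// Hypothetical key names for the five rubric items; the real schema is not given in the source.
const RUBRIC = ["layout", "components", "content", "style", "gestalt"] as const;
type Criterion = (typeof RUBRIC)[number];
type JudgeScores = Record<Criterion, number>;

function parseJudgeReply(raw: string): JudgeScores {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("judge reply is not a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const scores = {} as JudgeScores;
  for (const criterion of RUBRIC) {
    const value = obj[criterion];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new Error(`missing or non-numeric score: ${criterion}`);
    }
    scores[criterion] = value;
  }
  return scores;
}
```

Rejecting a reply with a missing criterion, rather than defaulting it, keeps an incomplete judgement from silently lowering or raising a composite.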

Objective metrics

Mechanical comparisons on the normalized image pair:

  • SSIM — structural similarity
  • colour histogram distance
  • layout IoU — content-region overlap
  • OCR text match
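As one concrete instance of these metrics, layout IoU can be sketched as intersection over union of two content masks. The representation is an assumption: here each image is reduced to a flat boolean array of equal length in which `true` marks a content pixel. How UI-Bench actually extracts content regions is not described in the source, and `layoutIoU` is a hypothetical name.

```typescript
// Intersection over union of two equally sized content masks (true = content pixel).
function layoutIoU(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) {
    throw new Error("masks must have the same size");
  }
  let intersection = 0;
  let union = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] && b[i]) intersection++;
    if (a[i] || b[i]) union++;
  }
  // Two empty masks are treated as a perfect match (a choice made for this sketch).
  return union === 0 ? 1 : intersection / union;
}

// One shared content pixel out of three content pixels overall.
console.log(layoutIoU([true, true, false, false], [true, false, true, false]));
```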

All sub-scores are stored, so leaderboards can be re-weighted after the fact without re-running anything.
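Re-weighting from stored sub-scores could look like the sketch below. It makes two assumptions that the source does not confirm: every sub-score is already normalized to 0–100, and each family score is the unweighted mean of its sub-scores. The names `SubScores` and `composite` are hypothetical, and the numbers are made up for illustration.

```typescript
// Stored sub-scores for one rebuild; values assumed to be on a 0..100 scale.
interface SubScores {
  judge: number[];     // one entry per rubric item
  objective: number[]; // one entry per objective metric
}

function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("no sub-scores to average");
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Blend the two families; the default mirrors the current 0.7 judge / 0.3 objective split.
function composite(s: SubScores, judgeWeight = 0.7): number {
  if (judgeWeight < 0 || judgeWeight > 1) {
    throw new Error("judgeWeight must be in [0, 1]");
  }
  return judgeWeight * mean(s.judge) + (1 - judgeWeight) * mean(s.objective);
}

// Illustrative values only, not benchmark results.
const stored: SubScores = { judge: [80, 90, 70, 85, 75], objective: [60, 50, 40, 50] };
console.log(composite(stored));      // current blend
console.log(composite(stored, 0.5)); // alternative weighting from the same stored data
```

Changing the weight only re-runs the arithmetic, which is why no model call or render needs to be repeated.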

Cost capture

Every cell records the USD cost of each model call it makes (the vision description, the harness rebuild, and the judge evaluation) plus a total. Any leg may be unknown for a given provider; unknown values are reported as unknown and are never assumed to be zero. The gallery shows cost-to-create per rebuild, and the full-pipeline leaderboard reports mean cost per configuration so quality can be read against price.
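One way to keep unknown costs from collapsing to zero is to model each leg as `number | null`. The sketch below is an assumption about the bookkeeping, not the benchmark's actual code: it treats a total as unknown whenever any leg is unknown, and averages only over cells with a known total. The names `Usd`, `CellCost`, `totalCost`, and `meanCost` are hypothetical.

```typescript
type Usd = number | null; // null means "unknown", which is distinct from 0

interface CellCost {
  vision: Usd;  // vision description call
  harness: Usd; // harness rebuild
  judge: Usd;   // judge evaluation
}

// Assumed rule: the total is unknown if any leg is unknown.
function totalCost(c: CellCost): Usd {
  let sum = 0;
  for (const leg of [c.vision, c.harness, c.judge]) {
    if (leg === null) return null;
    sum += leg;
  }
  return sum;
}

// Assumed rule: mean cost per configuration uses only cells with a known total.
function meanCost(cells: CellCost[]): Usd {
  const totals = cells.map(totalCost).filter((t): t is number => t !== null);
  if (totals.length === 0) return null;
  return totals.reduce((a, b) => a + b, 0) / totals.length;
}
```

Using `null` instead of `0` means an unknown leg cannot make a configuration look cheaper than it is; a stricter variant could also report how many cells were excluded from the mean.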

One-shot from a PRD

The second benchmark removes the screenshot entirely. A coding harness is handed a written product requirements document (PRD) and asked to build the whole app in one shot. The render is then scored by an AI judge on PRD adherence across six criteria.

See the one-shot leaderboard for the latest PRD run, including per-criterion scores, cost, and the generated source for each app.

Fairness

This run

run          v1
dataset      v1
vision       anthropic/claude-opus-4.5, google/gemini-2.5-pro
opencode     deepseek/deepseek-v4-pro
codex        gpt-5.5
claude-code  sonnet
judge        openai/gpt-4.1
blend        judge 0.7 · objective 0.3

Known limitations (v1)