ARC-AGI Solver
A hybrid program synthesis system for solving ARC-AGI puzzles. No LLM required for core solving. Currently at 130/400 on ARC-1.
ARC (Abstraction and Reasoning Corpus) is a benchmark for measuring general intelligence. Each task shows 2-3 input/output grid examples that demonstrate a transformation rule. The solver must infer the rule and apply it to an unseen test input.
Grids are 2D arrays of integers 0-9 (colors), up to 30x30. Transformations range from simple rotations to complex multi-step reasoning about objects, patterns, and spatial relationships.
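For concreteness, a minimal sketch of the data (the `rotate90` primitive is just an illustration, not the repo's API):

```python
import numpy as np

# A grid: 2D array of color indices 0-9 (0 is conventionally background).
grid = np.array([
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
])

def rotate90(g: np.ndarray) -> np.ndarray:
    """One of the simplest transformation rules: rotate 90 degrees clockwise."""
    return np.rot90(g, k=-1)
```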
The hard part: every task has a different rule. There's no training set that teaches you "how to solve ARC." You need genuine abstraction.
Six solver layers, tried in priority order. Each layer is a different strategy. The first one that succeeds wins.
1. Object-centric solver: decompose the grid into connected components, match input objects to output objects, learn per-object transforms (sketched after this list)
2. Grid decomposition: detect separator lines dividing grids into cells, solve overlay/boolean/stamp relationships between cells. Meta-grid mode: summarize cells as colors, run inference engines at cell level, reconstruct
3. Inference engines: 25 specialized sub-engines for specific pattern families (color mapping, gravity, tiling, object extraction, enclosed fill, position-aware pixel rules, etc.)
4. DSL search: weighted A* over a space of 50 primitives. Composes multi-step programs (depth 3-4) mapping input to output
5. LLM synthesis: ask an LLM to generate DSL programs, verify against training pairs
6. LLM Python fallback: last resort, the LLM writes arbitrary Python, executed in a sandboxed subprocess
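A minimal sketch of the first layer's decomposition step, assuming 4-connected components per color; `extract_objects` and its return shape are illustrative, not the repo's actual API:

```python
import numpy as np
from scipy import ndimage

def extract_objects(grid: np.ndarray, background: int = 0):
    """Split a grid into connected components, one per same-color region."""
    objects = []
    for color in np.unique(grid):
        if color == background:
            continue
        # Label 4-connected components of this color (scipy's default structure).
        labeled, n = ndimage.label(grid == color)
        for i in range(1, n + 1):
            ys, xs = np.where(labeled == i)
            objects.append({
                "color": int(color),
                "bbox": (ys.min(), xs.min(), ys.max(), xs.max()),
                "cells": set(zip(ys.tolist(), xs.tolist())),
            })
    return objects
```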
ARC-1 benchmark (400 tasks), depth 3 search, 10s timeout per task. No LLM.
Object-centric, grid decomposition, and inference engines are fast (< 100ms) and handle tasks that would take the DSL search minutes to find. They now account for the majority of solves.
Restrict recolor destinations, color maps, and fill operations to colors present in the target outputs. This cuts the search branching factor dramatically.
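One way the restriction can be implemented; `candidate_recolor_targets` is a hypothetical helper:

```python
import numpy as np

def candidate_recolor_targets(train_pairs):
    """Enumerate recolor/fill destinations only from colors that actually
    appear in some training output, instead of all 10 colors."""
    palette = set()
    for _, out in train_pairs:
        palette |= set(np.unique(out).tolist())
    return sorted(palette)
```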
At search depth >= 2, reject candidates whose pixel diff on the secondary training pair is worse than their parent's. Kills dead-end branches early without losing solutions.
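A sketch of the check, assuming each search node caches its parent's diff against the secondary pair's output:

```python
import numpy as np

def pixel_diff(a: np.ndarray, b: np.ndarray) -> int:
    """Mismatched cell count; a shape mismatch counts as maximally wrong."""
    if a.shape != b.shape:
        return a.size + b.size
    return int((a != b).sum())

def should_prune(child_grid, parent_diff, target_out, depth) -> bool:
    """At depth >= 2, reject children that moved further from the secondary
    training pair's output than their parent was."""
    return depth >= 2 and pixel_diff(child_grid, target_out) > parent_diff
```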
When a program matches the first training pair but fails multi-pair verification, remove its hash from the visited set. Prevents one false positive from blocking all programs that produce the same intermediate grid.
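Roughly how the fix might look; `verify_or_unblock` and the hashing scheme are assumptions:

```python
import numpy as np

def verify_or_unblock(program, state_hash, visited, train_pairs):
    """Called once a program has matched the first training pair."""
    if all(np.array_equal(program.execute(i), o) for i, o in train_pairs):
        return True  # verified on every pair: a real solution
    # False positive: remove the hash so other programs that produce the
    # same intermediate grid aren't silently skipped.
    visited.discard(state_hash)
    return False
```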
Each one is a hand-crafted detector for a specific ARC pattern family (gravity fill, stamp templates, diagonal tiling, color ranking, enclosed fill, etc.). Individually narrow, collectively powerful.
If input and output have the same dimensions, exclude shape-changing primitives. If colors are preserved, exclude color changers. Simple heuristic, massive search space reduction.
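A sketch of the heuristic; the `changes_shape`/`changes_colors` flags are assumed primitive metadata, not the repo's actual fields:

```python
import numpy as np

def filter_primitives(primitives, train_pairs):
    """Exclude primitives that can't possibly be needed for these pairs."""
    same_shape = all(i.shape == o.shape for i, o in train_pairs)
    same_colors = all(set(np.unique(i).tolist()) == set(np.unique(o).tolist())
                      for i, o in train_pairs)
    keep = []
    for p in primitives:
        if same_shape and p.changes_shape:    # e.g. crop, tile, scale
            continue
        if same_colors and p.changes_colors:  # e.g. recolor, color map
            continue
        keep.append(p)
    return keep
```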
After A* exhausts its node budget, beam search (width=100) continues from the best frontier nodes. Catches solutions that A* couldn't reach within budget.
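A sketch of the handoff, assuming the frontier holds (score, node) pairs with lower scores more promising:

```python
import heapq

def beam_continue(frontier, expand, is_solution, width=100, max_rounds=50):
    """Hand off from A* to beam search: keep the best `width` frontier
    entries, expand them, re-truncate each round. `expand(node)` is assumed
    to yield (score, child) pairs."""
    beam = heapq.nsmallest(width, frontier, key=lambda sc: sc[0])
    for _ in range(max_rounds):
        children = [sc for _, node in beam for sc in expand(node)]
        for _, child in children:
            if is_solution(child):
                return child
        if not children:
            return None
        beam = heapq.nsmallest(width, children, key=lambda sc: sc[0])
    return None
```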
Inverse (backward) search was planned but never implemented. Most ARC primitives aren't cleanly invertible (crop loses size info, gravity loses positions). The forward-only A* with good pruning was sufficient.
Going from depth 3 to depth 4 explodes the search space (50^4 = 6.25M). Smarter pruning and analytical solvers were always a better investment than deeper search.
The transformer router (~27K params) needs a rule-based fallback. With only 400 labeled tasks, the model isn't reliable enough on its own. Rules + model beats model-only.
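A sketch of one plausible rules-plus-model arrangement; the threshold, `model.predict`, and layer names are all assumptions:

```python
def route(task, model, rules, threshold=0.6):
    """Trust the transformer router only when it's confident; otherwise
    fall back to hand-written rules."""
    probs = model.predict(task)        # assumed: dict of layer -> probability
    best = max(probs, key=probs.get)
    if probs[best] >= threshold:
        return best
    for rule in rules:                 # e.g. "separator lines present ->
        layer = rule(task)             #  grid decomposition"
        if layer is not None:
            return layer
    return "dsl_search"                # safe default when nothing fires
```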
Early object solver used greedy matching without shape awareness. Adding shape-based selectors, broadcast selectors, and compound transforms (rotation + recolor) was necessary to handle real tasks.
Programs are sequences of (primitive, params) tuples applied in order. No branching, no data flow graphs. Simpler search space, sufficient for depth 3-4.
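The whole representation fits in a few lines; `run_program` and the example step names are illustrative:

```python
import numpy as np
from typing import Any, Callable

# A program is an ordered sequence of (primitive, params) steps. No branching,
# no data-flow graph: each step's output is the next step's input.
Step = tuple[Callable[..., np.ndarray], dict[str, Any]]

def run_program(program: list[Step], grid: np.ndarray) -> np.ndarray:
    for primitive, params in program:
        grid = primitive(grid, **params)
    return grid

# e.g. run_program([(rotate90, {}), (recolor, {"src": 3, "dst": 5})], grid)
```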
Every component works without API calls. The LLM layers are optional amplifiers, not crutches. The core 32.5% solve rate is pure compute.
Enough expressiveness to cover common transformations without drowning the search. 7 geometric, 7 color, 8 spatial, 10 object, 8 pattern, 10 higher-order.
ObjectProgram, GridProgram, InferenceProgram all duck-type the Program interface (execute + verify). The hybrid solver doesn't care which layer produced the solution.
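Sketched as a `typing.Protocol`; the class names come from the repo, the exact signatures are assumptions:

```python
from typing import Protocol
import numpy as np

Pair = tuple[np.ndarray, np.ndarray]

class Program(Protocol):
    """Anything the hybrid solver can hand back as a solution."""
    def execute(self, grid: np.ndarray) -> np.ndarray: ...
    def verify(self, train_pairs: list[Pair]) -> bool: ...

def solve(task, layers):
    """Try each layer in priority order; any Program-shaped result works,
    whether it's an ObjectProgram, GridProgram, or InferenceProgram."""
    for layer in layers:
        program = layer.search(task.train_pairs)
        if program is not None and program.verify(task.train_pairs):
            return program.execute(task.test_input)
    return None
```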
The remaining 270 unsolved tasks fall into a few categories: multi-step reasoning that exceeds depth 3, spatial relationships the DSL can't express, and pattern types none of the 25 inference engines recognize.
The plan:
- Targeted sub-engines: analyze unsolved tasks, identify common pattern families, build new detectors. Each new engine is cheap to add and can unlock 5-15 tasks.
- Better search guidance: the CNN policy (~60K params) guides beam search by predicting which primitive to try next. Train it on more solved programs and better synthetic data (scoring sketched below).
- Turn-based grid games (launching March 2026): requires a fundamentally different approach, with exploration, causal discovery, and planning on 64x64 grids.
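A sketch of how the policy could rank beam candidates; the `policy(grid)` signature and node fields are assumptions:

```python
def beam_score(node, policy, alpha=1.0):
    """Blend the search cost with the CNN policy's prior over primitives:
    nodes reached via a primitive the policy favors rank higher in the beam.
    policy(grid) is assumed to return a 50-way distribution over primitives."""
    probs = policy(node.grid)
    prior = probs[node.last_primitive_id]
    return node.cost - alpha * prior  # lower is better
```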
- Enclosed region fill engine: detects border-enclosed areas and fills with learned color (00d62c1b task class)
- Extended pixel rules: position-aware features (border distance, parity, diagonal neighbors) for spatial-context transforms
- Grid cell majority fill: meta-grid summarizer picks most common non-bg color per cell, handles noisy/mixed cells
- 25 inference engines (was 23), 1,242 tests (was 1,214), 4,713 statements (was 4,584), 100% coverage maintained