ARC-AGI Solver

A hybrid program synthesis system for solving ARC-AGI puzzles. No LLM required for core solving. Currently at 130/400 on ARC-1.

The Problem

ARC (Abstraction and Reasoning Corpus) is a benchmark for measuring general intelligence. Each task shows 2-3 input/output grid examples that demonstrate a transformation rule. The solver must infer the rule and apply it to an unseen test input.

Grids are 2D arrays of integers 0-9 (colors), up to 30x30. Transformations range from simple rotations to complex multi-step reasoning about objects, patterns, and spatial relationships.

The hard part: every task has a different rule. There's no training set that teaches you "how to solve ARC." You need genuine abstraction.
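
For concreteness, a minimal sketch of a task as data. The dict layout mirrors the public ARC JSON format; the rotate-90 rule is an invented toy example.

```python
import numpy as np

# Toy task whose hidden rule is "rotate the grid 90 degrees clockwise".
# Integers 0-9 are colors.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[0, 0], [3, 0]]}],
}

def rule(grid):
    return np.rot90(np.asarray(grid), k=-1)  # k=-1 rotates clockwise

# An inferred rule must reproduce every training pair...
assert all(np.array_equal(rule(p["input"]), p["output"]) for p in task["train"])
# ...before it is trusted on the unseen test input.
print(rule(task["test"][0]["input"]))  # [[3 0], [0 0]]
```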

Architecture

Six solver layers, tried in priority order. Each layer is a different strategy; the first one that succeeds wins. A sketch of the dispatch loop follows the layer list.

L0 Object-Centric (analytical)

Decompose the grid into connected components, match input objects to output objects, and learn per-object transforms.

L0.5 Grid Decomposition (analytical)

Detect separator lines dividing grids into cells; solve overlay/boolean/stamp relationships between cells. Meta-grid mode: summarize cells as colors, run the inference engines at cell level, then reconstruct.

L0.75 Inference Engine (analytical)

25 specialized sub-engines for specific pattern families: color mapping, gravity, tiling, object extraction, enclosed fill, position-aware pixel rules, etc.

L1 DSL Search (search)

Weighted A* over a space of 50 primitives. Composes multi-step programs (depth 3-4) mapping input to output.

L2 LLM DSL (llm)

Ask an LLM to generate DSL programs, then verify them against the training pairs.

L3 LLM Python (llm)

Last resort: the LLM writes arbitrary Python, executed in a sandboxed subprocess.
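
A minimal sketch of that dispatch loop, assuming each layer is a callable that returns a program object (exposing execute and verify) or None. Names and signatures here are illustrative, not the actual module API.

```python
from typing import Any, Callable, Optional

# Hypothetical layer signature: task dict in, Program-like object (or None) out.
Layer = Callable[[dict], Optional[Any]]

def solve(task: dict, layers: list[Layer]) -> Optional[Any]:
    """Try layers in priority order; the first verified program wins."""
    for layer in layers:  # L0, L0.5, L0.75, L1, L2, L3
        program = layer(task)
        if program is not None and program.verify(task["train"]):
            # Cheap analytical layers run first, so the expensive DSL
            # search and optional LLM layers only ever see leftovers.
            return program.execute(task["test"][0]["input"])
    return None  # fell through all six layers: task unsolved
```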

Progress

ARC-1 benchmark (400 tasks), depth 3 search, 10s timeout per task. No LLM.

Solve count by version: v1 76 → v2 103 → v3 108 → v4 110 → v5 118 → v6 125 → v7 130
130/400 solved (32.5%)
1,242 tests · 100% coverage · 4,713 statements
What Worked
+ Analytical solvers before search

Object-centric, grid decomposition, and inference engines are fast (< 100ms) and handle tasks that would take the DSL search minutes to find. They now account for the majority of solves.

+ Target-informed parameter pruning

Restrict recolor destinations, color maps, and fill operations to only the colors present in the target output; this cuts the search branching factor dramatically.
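
A minimal sketch of the cut for recolor parameters; the helper name and signature are hypothetical.

```python
import numpy as np

def candidate_recolor_params(input_grid, target_grid):
    """Enumerate (src, dst) recolor parameters, pruned by the target.

    A recolor whose destination never appears in the target output is
    dropped before the search ever expands it, shrinking the branching
    factor from all 10 colors to only those the output actually uses.
    """
    src_colors = {int(c) for c in np.unique(input_grid)}
    dst_colors = {int(c) for c in np.unique(target_grid)}  # the cut
    return [(s, d) for s in src_colors for d in dst_colors if s != d]
```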

+ Multi-pair early rejection

At search depth >= 2, reject candidates whose pixel diff on the secondary training pair worsens versus their parent's. Kills dead-end branches early without losing solutions.
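
Sketched below, assuming each search node caches its output grid on the secondary pair; names are illustrative.

```python
import numpy as np

def pixel_diff(a, b):
    """Mismatched-cell count; a shape mismatch counts as maximally wrong."""
    if a.shape != b.shape:
        return a.size + b.size
    return int(np.count_nonzero(a != b))

def reject_early(candidate_out, parent_out, target_out, depth):
    """Prune at depth >= 2 when the candidate's result on the secondary
    pair drifts further from that pair's target than its parent's did."""
    if depth < 2:
        return False
    return pixel_diff(candidate_out, target_out) > pixel_diff(parent_out, target_out)
```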

+ False-positive hash unblocking

When a program matches the first training pair but fails multi-pair verification, remove its hash from the visited set. Prevents one false positive from blocking all programs that produce the same intermediate grid.
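
A sketch of the eviction, assuming the visited set is keyed by a fingerprint of the grid a program produces on the first training pair; program.matches is a hypothetical per-pair check.

```python
def check_candidate(program, visited: set, train_pairs) -> bool:
    out1 = program.execute(train_pairs[0]["input"])
    key = hash(out1.tobytes())       # fingerprint of the intermediate grid
    if key in visited:
        return False                 # this grid state was already explored
    visited.add(key)
    if all(program.matches(p) for p in train_pairs):
        return True                  # verified on every pair
    if program.matches(train_pairs[0]):
        visited.discard(key)         # false positive on pair 1: unblock it
    return False
```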

+ 25 inference sub-engines

Each one is a hand-crafted detector for a specific ARC pattern family (gravity fill, stamp templates, diagonal tiling, color ranking, enclosed fill, etc.). Individually narrow, collectively powerful.
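
As an example of how narrow one can be, a gravity detector might look like this sketch (not the actual engine):

```python
import numpy as np

def apply_gravity(grid: np.ndarray, bg: int = 0) -> np.ndarray:
    """Drop every non-background cell to the bottom of its column."""
    out = np.full_like(grid, bg)
    for c in range(grid.shape[1]):
        col = grid[:, c][grid[:, c] != bg]   # non-bg cells, top to bottom
        if col.size:
            out[-col.size:, c] = col         # restack them at the bottom
    return out

def gravity_engine(train_pairs):
    """Fires only if gravity alone explains every training pair."""
    if all(np.array_equal(apply_gravity(np.asarray(p["input"])),
                          p["output"]) for p in train_pairs):
        return apply_gravity                 # the learned "program"
    return None                              # not this pattern family
```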

+ Feature-based pruning

If input/output have same dimensions, exclude shape-changing primitives. If colors are preserved, exclude color changers. Simple heuristic, massive search space reduction.
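
Sketched below, assuming each primitive carries two boolean capability flags (the attribute names are illustrative):

```python
def prune_primitives(primitives, train_pairs):
    """Drop primitives that cheap pair features rule out."""
    same_shape = all(len(p["input"]) == len(p["output"]) and
                     len(p["input"][0]) == len(p["output"][0])
                     for p in train_pairs)
    same_colors = all({v for row in p["input"] for v in row} ==
                      {v for row in p["output"] for v in row}
                      for p in train_pairs)
    keep = []
    for prim in primitives:
        if same_shape and prim.changes_shape:
            continue   # same-size pairs: skip crop/tile/scale
        if same_colors and prim.changes_colors:
            continue   # palette preserved: skip recolor/fill
        keep.append(prim)
    return keep
```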

+ Beam search fallback

After A* exhausts its node budget, beam search (width=100) continues from the best frontier nodes. Catches solutions that A* couldn't reach within budget.
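
A sketch of the handoff: frontier entries are assumed to be (score, node) pairs with lower scores better, mirroring the A* priority.

```python
import heapq

def beam_fallback(frontier, expand, is_solution, width=100, max_depth=4):
    """Continue from the best leftover A* nodes with a fixed-width beam."""
    beam = heapq.nsmallest(width, frontier, key=lambda t: t[0])
    for _ in range(max_depth):
        children = [child for _, node in beam for child in expand(node)]
        for _, node in children:
            if is_solution(node):
                return node
        if not children:
            return None
        # keep only the `width` best-scoring children for the next step
        beam = heapq.nsmallest(width, children, key=lambda t: t[0])
    return None
```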

What Didn't Work
- Bidirectional search

Planned but never implemented. Most ARC primitives aren't cleanly invertible (crop loses size info, gravity loses positions). The forward-only A* with good pruning was sufficient.

- Brute-force depth increases

Going from depth 3 to depth 4 explodes the search space (50^4 = 6.25M). Smarter pruning and analytical solvers were always a better investment than deeper search.

- Pure ML routing without rules

The transformer router (~27K params) needs a rule-based fallback. With only 400 labeled tasks, the model isn't reliable enough on its own. Rules + model beats model-only.

- Generic object matching

The early object solver used greedy matching without shape awareness. Adding shape-based selectors, broadcast selectors, and compound transforms (rotation + recolor) was necessary to handle real tasks.

Key Decisions
+ Linear programs, not DAGs

Programs are sequences of (primitive, params) tuples applied in order. No branching, no data-flow graphs. Simpler search space, sufficient for depth 3-4.
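
A sketch of the representation; field and primitive names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass(frozen=True)
class Program:
    # Ordered (primitive, params) steps: no branching, no data-flow graph.
    steps: tuple[tuple[Callable, dict], ...]

    def execute(self, grid: np.ndarray) -> np.ndarray:
        for primitive, params in self.steps:
            grid = primitive(grid, **params)   # each step feeds the next
        return grid

    def verify(self, pairs) -> bool:
        return all(np.array_equal(self.execute(np.asarray(p["input"])),
                                  p["output"]) for p in pairs)

# e.g. a depth-2 program: Program(((rotate90, {}), (recolor, {"src": 1, "dst": 2})))
```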

+ LLM-free first

Every component works without API calls. The LLM layers are optional amplifiers, not crutches. The core 32.5% solve rate is pure compute.

+ 50 primitives, not 200

Enough expressiveness to cover common transformations without drowning the search. 7 geometric, 7 color, 8 spatial, 10 object, 8 pattern, 10 higher-order.

+ Duck-typed program objects

ObjectProgram, GridProgram, InferenceProgram all duck-type the Program interface (execute + verify). The hybrid solver doesn't care which layer produced the solution.
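
In Python terms the shared surface is small enough to state as a Protocol; this is a structural sketch, the codebase may spell it differently.

```python
from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class ProgramLike(Protocol):
    """Anything with these two methods is a Program to the hybrid solver."""
    def execute(self, grid: np.ndarray) -> np.ndarray: ...
    def verify(self, pairs: list[dict]) -> bool: ...
```

Structural typing keeps the layers decoupled: a new solver layer only has to return something with these two methods.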

What's Next

The remaining 270 unsolved tasks fall into a few categories: multi-step reasoning that exceeds depth 3, spatial relationships the DSL can't express, and pattern types none of the 25 inference engines recognize.

The plan:

More inference engines

Analyze unsolved tasks, identify common pattern families, build targeted sub-engines. Each new engine is cheap to add and can unlock 5-15 tasks.

Policy network refinement

The CNN policy (~60K params) guides beam search by predicting which primitive to try next. The next step is training it on more solved programs and better synthetic data.

ARC-AGI-3 (interactive games)

Turn-based grid games launching March 2026. Requires a fundamentally different approach: exploration, causal discovery, and planning on 64x64 grids.

Stack
Python 3.12 · NumPy · PyTorch (optional) · pytest · uv
Changelog
2026-02-20
Sprint 4: Cross-Dataset Improvements
  • Enclosed region fill engine: detects border-enclosed areas and fills them with a learned color (00d62c1b task class; sketched after this list)
  • Extended pixel rules: position-aware features (border distance, parity, diagonal neighbors) for spatial-context transforms
  • Grid cell majority fill: meta-grid summarizer picks most common non-bg color per cell, handles noisy/mixed cells
  • 25 inference engines (was 23), 1,242 tests (was 1,214), 4,713 statements (was 4,584), 100% coverage maintained
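
The core test behind the enclosed-fill engine, sketched under assumed mechanics: flood the background from the border; whatever background remains unreached is enclosed.

```python
from collections import deque

import numpy as np

def enclosed_cells(grid: np.ndarray, bg: int = 0) -> np.ndarray:
    """Mask of background cells that no border-connected path reaches."""
    h, w = grid.shape
    reachable = np.zeros((h, w), dtype=bool)
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if grid[r, c] == bg and (r in (0, h - 1) or c in (0, w - 1)))
    for r, c in queue:                      # seed: background on the border
        reachable[r, c] = True
    while queue:                            # 4-connected flood fill inward
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and grid[nr, nc] == bg \
                    and not reachable[nr, nc]:
                reachable[nr, nc] = True
                queue.append((nr, nc))
    return (grid == bg) & ~reachable        # enclosed = bg but unreachable

# The engine then fills: grid[enclosed_cells(grid)] = learned_color
```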