SynthPath
Pattern-learning ARC solver. This system is a neural-guided program-synthesis solver that learns reusable patterns from solved tasks and compounds them across rounds.
92/400
TRAIN
8/400
EVAL
~14s
TIME
PIPELINE
PERCEPTION
39
Each grid is parsed into a scene graph: objects detected via connected components, classified by role, linked by spatial relations, and summarized as a 64-dim feature vector for library recall. Hierarchical mode detects separator-partitioned grids and builds per-cell sub-scenes for cell-summary tasks.
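The object-detection step can be illustrated with a minimal connected-components sketch. This is not the project's perception code: the function name `connected_components`, the 4-connectivity rule, the same-color grouping, and the background value of 0 are all assumptions made for illustration.

```python
from collections import deque

def connected_components(grid, background=0):
    """Group same-colored, 4-connected, non-background cells into objects.

    Hypothetical helper: returns a list of {"color", "cells"} dicts in scan order.
    """
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or grid[r][c] == background:
                continue
            color = grid[r][c]
            cells = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                # Visit the four orthogonal neighbours of the current cell.
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": sorted(cells)})
    return objects

objects = connected_components([[1, 1, 0],
                                [0, 0, 2],
                                [3, 0, 2]])
```

Each returned object could then be classified by role and linked to its neighbours; those later stages are not sketched here.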
CANDIDATE GENERATORS
62
When no library pattern matches, beam search tries these generators. Each proposes candidate actions scored by minimum description length.
PATTERN DSL
57
Learned patterns are stored as guard → bind → body programs. Guards check preconditions, binds extract task-specific values, and the body applies parameterized actions.
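A guard → bind → body program can be pictured as three callables applied in order. The sketch below is illustrative only: the `Pattern` class, the `recolor_minority` example, and the background value of 0 are hypothetical and do not reflect the stored DSL format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]

@dataclass
class Pattern:
    guard: Callable[[Grid], bool]                  # precondition check
    bind: Callable[[Grid], Dict[str, int]]         # task-specific values
    body: Callable[[Grid, Dict[str, int]], Grid]   # parameterized action

    def apply(self, grid: Grid) -> Optional[Grid]:
        if not self.guard(grid):
            return None  # pattern does not apply to this grid
        return self.body(grid, self.bind(grid))

def color_counts(grid: Grid) -> Counter:
    # Count non-background cells per color (background assumed to be 0).
    return Counter(v for row in grid for v in row if v != 0)

def bind_minority(grid: Grid) -> Dict[str, int]:
    counts = color_counts(grid)
    return {"src": min(counts, key=counts.get), "dst": max(counts, key=counts.get)}

# Hypothetical pattern: when exactly two colors are present,
# recolor the rarer one to the more common one.
recolor_minority = Pattern(
    guard=lambda g: len(color_counts(g)) == 2,
    bind=bind_minority,
    body=lambda g, p: [[p["dst"] if v == p["src"] else v for v in row] for row in g],
)

result = recolor_minority.apply([[1, 1, 0], [1, 2, 0]])
rejected = recolor_minority.apply([[1, 0], [0, 0]])
```

Keeping the bind step separate from the body is what would let one stored pattern transfer to a task with different colors.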
CHANGELOG
2026-03-24 16:00 — Unified Symbolic Runtime + full family port: 35 transform handlers, 25 induction strategies
Full 8-phase runtime + family port. executor.py: 35 transform handlers (18 new: extract, decompose, upscale, mirror_concat, fold_symmetry, slide_to_wall, pattern_extend, count_encode, color_count, row_col_projection, separator_summary, stack_concat, stamp, hollow_rect, object_extension, bbox_complement, recolor_closest, neighbor_rule). induction.py: 25 strategies (8 new diff-dims + 7 new same-dims). 42/400 tasks solvable standalone (train+test verified). Benchmark: 115/400 (+1), all 42 overlap with existing solves. Runtime provides speed (6.9ms vs 290ms beam) not new coverage yet. 2873 tests, 0 regressions.
2026-03-26 02:00 — fill_enclosed migrated into config runtime: first same-dims family migration
Migrated fill_enclosed into config_exec. Added fill_enclosed op to config_executor with robust enclosed-region detection (tries each color as hole candidate, not just bg). Color sources: explicit, from_encloser (adjacent different-color pixels), from_context. Added _infer_fill_enclosed to config_infer. 2/5 fill_enclosed training tasks now solve through config_search/config_exec. The remaining 3 need multi-color fill or composition. Config runtime now covers: same-dims (recolor, remove, fill_between, fill_enclosed) + diff-dims (transform_tile). Eval 34/400, train 114/400 (stable). 32 config tests passing (7 fill + 14 same-dims + 11 diff-dims).
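One common way to detect enclosed regions is to flood-fill the hole color from the grid border and treat every unreached hole cell as enclosed. The sketch below assumes that approach; the source does not say how config_executor does it. The `fill_enclosed` signature here is hypothetical and takes a single explicit hole color rather than trying each color as a candidate.

```python
from collections import deque

def fill_enclosed(grid, hole_color, fill_color):
    """Recolor hole_color cells that cannot reach the border (4-connected)."""
    h, w = len(grid), len(grid[0])
    reachable = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every border cell of the hole color.
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and grid[r][c] == hole_color:
                reachable[r][c] = True
                queue.append((r, c))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and not reachable[ny][nx] and grid[ny][nx] == hole_color):
                reachable[ny][nx] = True
                queue.append((ny, nx))
    return [
        [fill_color if grid[r][c] == hole_color and not reachable[r][c] else grid[r][c]
         for c in range(w)]
        for r in range(h)
    ]

ring = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
filled = fill_enclosed(ring, hole_color=0, fill_color=4)
```

Trying each color as the hole candidate, as the entry describes, would amount to calling such a function once per color present in the grid and keeping the result that verifies.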
2026-03-25 23:30 — Diff-dims in config runtime: transform_tile absorbs kronecker, 5 eval via config_search
Extended config runtime to handle diff-dims transform tiling. Added transform_tile op to config_executor (row_transforms, block_grid, simple tile). Added _infer_transform_tile to config_infer, delegating to kronecker_infer for pattern detection. config_search seed schema now covers both same-dims and diff-dims (requires_same_dims=False). Result: 5 eval hypothesis solves now come through seed_schema:config_search instead of old kronecker_tile path. The config runtime IS the solver for these tasks. Old kronecker_tile seed demoted to LEGACY. Eval 34/400, train 114/400 (stable). 25 config tests passing (14 same + 11 diff).
2026-03-25 21:00 — Config proposer v2: richer encoding + explicit config candidates
Upgraded proposer to predict explicit TransformConfig candidates alongside family rankings. Richer encoder: 96-dim object-relational features (grid dims, color histograms, object counts/sizes/deltas, diff ratios, structural signals, pair consistency) concatenated with 64-dim global features → 160-dim combined. Leave-one-out: top-1 43.9%, top-3 60.5%, top-5 70.2%. Proposer now generates concrete TransformConfig candidates from top predicted families (recolor predicates, fill patterns, remove operations). ConfigPrediction carries both family_rankings and config_candidates. Integration via priority_families + encode_task(). Eval 34/400, train 114/400 (no regression). 11 tests passing.
2026-03-25 19:00 — Learned program proposer: neural proposal, symbolic execution
First learned proposer: nearest-neighbor on 64-dim task features predicts solve family from 114 training examples (37 families). Leave-one-out: top-1 49%, top-3 61%, top-5 69%. Proposer reorders seed schemas so predicted families are tried first. Wired into solve_by_hypothesis via priority_families param. Persisted to data/proposer.json. Architecture: proposer emits rankings → symbolic runtime executes → exact verification accepts/rejects. No opaque execution. Eval 34/400, train 114/400 (no regression). 8 new tests, all passing.
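The nearest-neighbor ranking and its leave-one-out evaluation can be sketched as follows. Euclidean distance, the function names, and the toy 2-dim features are assumptions; the entry only states that the proposer is nearest-neighbor over 64-dim task features.

```python
import math

def rank_families(query, examples, k=5):
    """Rank solve families by distance of their nearest example to the query.

    examples: list of (feature_vector, family) tuples.
    """
    ordered = sorted(examples, key=lambda e: math.dist(query, e[0]))
    ranking = []
    for _, family in ordered:
        if family not in ranking:  # keep the first (closest) occurrence only
            ranking.append(family)
    return ranking[:k]

def leave_one_out_top_k(examples, k):
    """Fraction of examples whose family is in the top-k when held out."""
    hits = 0
    for i, (vec, family) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if family in rank_families(vec, rest, k):
            hits += 1
    return hits / len(examples)

# Toy data, not real task features.
examples = [([0.0, 0.0], "recolor"), ([0.0, 1.0], "recolor"),
            ([5.0, 5.0], "tile"), ([5.0, 6.0], "tile")]
ranking = rank_families([0.2, 0.1], examples, k=2)
top1 = leave_one_out_top_k(examples, k=1)
```

The ranking only reorders which seed schemas are tried first; acceptance would still depend on exact verification.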
2026-03-25 16:00 — Unified config runtime: one machine, many programs
Architecture pivot: added config_exec body kind as the single gateway from ProgramIR to a unified config runtime. Three new modules: config_schema.py (TransformConfig = SelectSpec + TransformSpec + OutputSpec, serializable, typed), config_executor.py (fixed reviewed executor: select objects/pixels by predicate → apply recolor/remove/fill_source_color), config_infer.py (train-pair induction: fill_between, recolor by size/color, remove by color/size). Wired as config_search seed schema — configs route through existing ProgramIR verify/register/recall pipeline. Eval 34/400, train 114/400 (no regression). No new cluster-specific body kinds. The runtime is the substrate — capability growth now means widening the config vocabulary (predicates, transforms, relations), not adding code paths. 14 tests passing.
2026-03-25 12:00 — Same-dims machine: axis-fill executor + honest residual analysis
Built samedims_machine.py: constrained executor for same-dims tasks with axis-fill operations (between_same_color, extend_to_boundary on row/col/both). Also built object_config.py: select-by-predicate → recolor/remove meta-executor. Exhaustive testing on all unsolved tasks: 0 train + 0 eval solved by axis fill, 0 by object recolor, 0 by global color substitution. Analysis: 16 train + 21 eval tasks have row/col-aligned added pixels, but the operations are object-relational (stamp, connect-toward-target, extend-pattern) not bulk axis fills. All remaining ~184 train + ~252 eval unsolved same-dims tasks require per-object spatial/relational reasoning. The machine substrate is correct; the induction vocabulary needs containment, adjacency, shape-match, and directional predicates. 15 tests passing.
2026-03-25 10:00 — Object-relational meta-executor: architecture correct, induction bottleneck identified
Built object_config.py: constrained meta-executor for same-dims tasks. Config language: select (by_size, by_color, by_position, by_relation) × action (recolor_to, remove) × source (explicit, from_largest, from_context, from_neighbor). Executor works correctly on synthetic examples. Honest finding: 0 unsolved tasks solved. Same-dims tasks use highly relational / contextual predicates that change per training pair. Even brute-force fixed color-to-color recoloring fails (0/184 train, 0/252 eval). The bottleneck is not the executor machine — it is the induction engine's predicate vocabulary. Current vocabulary (smallest, largest, unique_color, fixed_color) is too narrow; the real tasks need shape-match, containment, adjacency, and other relational predicates. 7 new tests, all passing.
2026-03-25 07:00 — Extended kronecker: rot90/rot270 transforms, eval 32→34
Audited all 7 PRIMITIVE_MISSING clusters (42 same-dims tasks). Finding: these tasks are complex object manipulations tagged "periodic" by edit_features but not actually tile-periodic. No new template family is justified. Instead, extended the existing transform_tiler vocabulary with rot90 and rot270. Updated: kronecker_infer.py, eval_body.py (TILE body), and primitive_templates.py. Result: eval 32→34/400 (+2 new: 7953d61e, ed98d772 — both use [identity, rot90; rot180, rot270] 4-fold rotation tiling). Training 114/400 (stable). Scanned 18 unsolved diff-dims eval tasks with integer scale. Honest: the remaining same-dims periodic gap requires genuinely new object-level capabilities, not a narrow tile template. 30 tests passing.
2026-03-25 04:00 — Full admission integration: primitive instances in live solver + eval gate
Integrated primitive instances into the full autonomy stack. Solver: _try_primitive_instances() loads data/primitives.jsonl in hypothesis stage, tries each instance on the task, wraps as SynthProgram. Admission: gap detector finds PRIMITIVE_MISSING clusters → compile_sketch() → verify on cluster tasks → add to PrimitiveStore. Eval-first gate unchanged. The system can now propose, compile, verify, persist, and load primitive instances as data without code edits. Verified live: task 00576224 solved by transform_tiler instance loaded from data/primitives.jsonl. Manual boundary: template families (TEMPLATE_EXECUTORS) are fixed reviewed code; instances are data. 46 tests passing.
2026-03-25 02:00 — Template-based primitive graduation: sketch → compile → execute
Built template-based primitive graduation system. PrimitiveInstance = (template_name, serializable_params). Template registry: transform_tiler (per-block geometric transforms). Compiler: compile_sketch() maps operator sketch + training pairs → template instance via kronecker_infer. Graduate: compile → multi-task verify → persist. PrimitiveStore: JSONL persistence at data/primitives.jsonl. Real graduation: sketch "proposed_kronecker_tile" compiled to transform_tiler(3×3, [identity, flip_lr, identity]) and verified on eval task 00576224 with test-pair PASS. Manual step removed: the system now proposes, compiles, verifies, and persists primitive instances as data without hand-writing an executor. 18 new tests, all passing.
2026-03-24 23:30 — Kronecker tile primitive: +4 eval solves from first auto-proposed primitive
Implemented narrow kronecker_tile as a mode of the existing TILE body kind. Train-derived induction: infer scale + per-block transform pattern from training pairs only. Transforms: identity, flip_lr, flip_ud, rot180. Two patterns: row_transforms (same transform per row) and block_grid (explicit per-block). Added as seed schema with derive_params_fn. Added kronecker_row and kronecker_grid mirror modes to TILE body executor. Result: eval 28→32/400 (+4 new solves: 00576224, 0c786b71, 59341089, 833dafe3). Training: 113/400 (stable). This is the first primitive proposed by the gap detector that was validated by shadow execution, then implemented as a live body and produced real benchmark gains. 12 new tests, all passing.
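The two tiling patterns can be illustrated with plain nested lists. This is a sketch under assumptions, not the TILE body executor: `tile_block_grid` and `tile_row_transforms` are hypothetical names, and only the four transforms listed in the entry are included.

```python
def identity(g):
    return [row[:] for row in g]

def flip_lr(g):
    return [row[::-1] for row in g]

def flip_ud(g):
    return [row[:] for row in g[::-1]]

def rot180(g):
    return [row[::-1] for row in g[::-1]]

TRANSFORMS = {"identity": identity, "flip_lr": flip_lr,
              "flip_ud": flip_ud, "rot180": rot180}

def tile_block_grid(grid, block_names):
    """block_grid pattern: one explicit transform name per output block."""
    out = []
    for name_row in block_names:
        blocks = [TRANSFORMS[name](grid) for name in name_row]
        for r in range(len(grid)):
            # Concatenate row r of every block in this block-row.
            out.append([v for block in blocks for v in block[r]])
    return out

def tile_row_transforms(grid, row_names, cols):
    """row_transforms pattern: the same transform for every block in a row."""
    return tile_block_grid(grid, [[name] * cols for name in row_names])

tiled = tile_row_transforms([[1, 2], [3, 4]], ["identity", "flip_lr"], cols=2)
```

Inference from training pairs would then reduce to searching for the scale and transform names whose tiling reproduces every training output exactly.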
2026-03-24 21:00 — Tile derivation analysis: composition-first primitive assessment
Built tile_derivation.py: classifies all 17 periodic-gap tasks by derivation type. Classes: kronecker_tile (4 tasks, mirror-variant tiling), color_band_fill (3, uniform rows from color stats), extract_downsample (3, output summarizes input), simple_upscale (2, direct tile — reducible), row/col_broadcast (3), reshape (1), inconsistent (1). Composition-first analysis: 2/17 reducible to existing bodies (simple_upscale). 15/17 require genuinely new derivation. 1/17 kronecker tile fully verified (00576224). Dominant remaining gap: kronecker_tile (4 tasks) + color_band_fill (3 tasks). The kronecker tile is the strongest next primitive candidate — 4 eval tasks need input tiled with mirror/flip variants. 8 new tests (19 total tile/periodic), all passing.
2026-03-24 19:00 — Train-derived periodic body: honest gap between shadow and real inference
Built periodic_body.py: train-derived inference that infers PeriodicConfig from training pairs only, executes on test without target access. Modes: row_period, col_period, tile_2d, axis_agnostic, upscale_tile. Source strategies: dense (densest block), top_left, scan (try all positions). Honest result: shadow executor verified 17 tasks (oracle target access), but train-derived inference solves only 1 (already solved). The gap is genuine: the shadow tasks need tile derivation (not tile extraction) — the output tile is a TRANSFORM of the input, not a copy. Simple row/col/2D repetition + densest-block extraction is insufficient. The periodic primitive gap remains open for a richer tile-derivation strategy. 10 new tests.
2026-03-24 16:00 — Shadow executor: periodic sketch solves 4 train + 13 eval (unsolved)
Built constrained shadow executor for periodic/tile operator sketches. Strategies: row-period detection, col-period detection, 2D tile detection, template broadcast. Not a live solver body — shadow-only for offline validation. Results: 4 currently-unsolved training tasks exactly verified (e26a3af2, e9afcf9a, 0a938d79, 8eb1be9a). 13 currently-unsolved eval tasks exactly verified. This validates the periodic operator sketch as a genuine primitive gap — the shadow executor finds real solutions that the main solver cannot. If promoted to a live BodyKind, the potential gain is +4 train / +13 eval. 11 tests, all passing. No solver changes.
2026-03-24 14:00 — Real gap classification + operator-sketch primitive proposals
Connected gap detector to real semantic clusters from edit_features. 70 clusters, 20 with ≥3 tasks. Gap types: 7 BIND_MISSING (partition bind, focused near-miss), 6 COMPOSITION_MISSING (additive+periodic pattern), 7 PRIMITIVE_MISSING (diffuse near-miss >4 bodies). Tag-to-sketch mapping: additive→additive composition, periodic→tile propagation, deletive→bg color, aligned_row→cardinal direction, new_colors→context_derived, etc. Sketches have unique semantic keys (e.g. by_predicate.tile.fixed.grid_boundary). Default-only sketches suppressed. Top proposals: proposed_deletive_periodic (18 tasks), proposed_periodic_recolor (6 tasks). 14 new tests (40 total gap+proposal), all passing. No solver changes.
2026-03-24 12:00 — Failure-driven candidate mining for admission loop
Added failure_mining.py: mines 2-step ProgramIR candidates from unsolved task near-misses. Tries step-1 seed schema near-miss (residual < 50%) + step-2 completion (geometric, gravity, remove, fill_enclosed with task-derived colors), and reverse direction (step-1 global transform + step-2 seed schema). Integrated into admission loop alongside library-trace and role-template mining. Candidates require min_cluster_size ≥ 2 tasks. Result: 0 failure-driven candidates found on current unsolved tasks — confirms the remaining 284 unsolved training tasks genuinely need capabilities beyond 2-step compositions over existing body kinds. This is the correct honest output: the system is not generating noise. The admission gate correctly admits only the library-trace macro (damage_repair → periodic_tile_fill). 6 new tests (22 total admission+failure), 2601 passing.
2026-03-24 10:00 — Live A/B eval-first admission: hardened gates, dual-split benchmarking
Hardened admission into authoritative offline self-expansion. Four live benchmark runs per cycle: baseline train (126/400 174s), baseline eval (30/400 363s), trial train (126/400 174s), trial eval (30/400 362s). Eval is the authoritative no-regression gate. Gates: eval no-regression (fatal), train no-regression > 1 (fatal), runtime ≤ baseline × 1.5 + 30s (both splits), ≥ 1 validated candidate. Macros and role templates share one JSONL store with kind field. Role templates can now be admitted and persisted alongside macros. Removed stale-results baseline path — baseline always measured live. 1 macro admitted (damage_repair → periodic_tile_fill). 16 tests, 2595 passing.
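The four gates can be written as one pure function over the measured results. The sketch below is hypothetical: `SplitResult`, `admission_gate`, and the failure labels are invented for illustration, and reading "train no-regression > 1" as "losing more than one training solve is fatal" is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SplitResult:
    solved: int
    seconds: float

def admission_gate(base_train, trial_train, base_eval, trial_eval, validated_candidates):
    """Return a list of failed gate names; an empty list means the trial is admitted."""
    failures = []
    if trial_eval.solved < base_eval.solved:
        failures.append("eval_regression")
    # Assumed reading: a drop of more than one training solve is fatal.
    if base_train.solved - trial_train.solved > 1:
        failures.append("train_regression")
    for name, base, trial in (("train", base_train, trial_train),
                              ("eval", base_eval, trial_eval)):
        if trial.seconds > base.seconds * 1.5 + 30:
            failures.append(f"{name}_runtime")
    if validated_candidates < 1:
        failures.append("no_validated_candidate")
    return failures

# Numbers from the run reported in this entry.
passed = admission_gate(SplitResult(126, 174), SplitResult(126, 174),
                        SplitResult(30, 363), SplitResult(30, 362), 1)
```

With the reported timings, the runtime ceilings would be 291s for train and 574.5s for eval, so the trial run sits well inside both.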
2026-03-24 08:00 — First real benchmark-gated admission: +10 solves, 1 macro admitted
Ran the first real end-to-end admission pass. Baseline: 116/400 (existing results). Trial: 126/400 (+10 solves with candidate macro injected). Gate: PASSED (no regression). Admitted: damage_repair → periodic_tile_fill macro (6 source tasks, 3/3 exact verification, leave-one-out passed, param slots: mode ∈{transpose, flip_lr+flip_ud+rot180}, damage_color ∈ {0, 6}). Persisted to data/macros.jsonl. Report at data/results/admission_report.json. Fixed runtime gate to skip when baseline timing unknown (loaded from existing results, not timed). Added load_baseline_from_results() and run_trial_benchmark(). 5 new tests (19 total), 2598 passing.
2026-03-24 06:00 — Benchmark-gated offline artifact admission workflow
Built complete offline self-expansion loop in src/autocap/admission.py. Workflow: mine ProgramIR macros + role templates → validate exactly on multi-task evidence → A/B benchmark baseline vs trial → apply conservative gates (no regression, no runtime blowup, no single-task artifacts) → admit only passing artifacts to data/macros.jsonl. Gates: trial ≥ baseline solves, runtime ≤ 1.5× baseline + 30s, ≥ 1 validated candidate. Dry run: mined 3 macros + 2 role templates, 1 passed validation (damage_repair → periodic_tile_fill), 4 rejected (compression gate). All artifacts remain JSONL data — no source-code mutation. Supports skip_benchmark mode for fast unit testing. 14 new tests, 2593 passing.
2026-03-24 04:00 — Primitive-gap detector + operator sketch meta-DSL scaffold
First scaffold for automatic primitive invention. Three new modules in src/autocap/: (1) gap_detector.py — classifies failure clusters into BIND_MISSING, ROLE_MISSING, COMPOSITION_MISSING, or PRIMITIVE_MISSING using structural signals, beam near-misses, and body feature overlap. (2) operator_sketch.py — constrained meta-DSL for candidate primitives using fixed vocabulary enums: SelectionType (9), PropagationStyle (10), DirectionSource (7), StopCondition (7), ColorSource (6), CompositionMode (4). No Python code generation. (3) primitive_gates.py — validation gate scaffold: multi-task support, exact verification, compression benefit, leave-one-out stability, benchmark gate. All data structures are serializable, inspectable, and human-vetoable. Dry run on current clusters produces correct gap classifications and operator sketches. 26 new tests, 2579 passing. No active solver changes.
2026-03-24 00:30 — Selection persistence + compounding library gains
Added StepSelect.from_step for selection persistence across ProgramIR steps. Step 2 can reuse step 1's resolved selection instead of re-detecting from the mutated grid. This fixes the draw_ray+remove composition: without from_step, newly painted cells are also detected as markers and removed. 66e6c45b now solves via draw_ray(away_from_center)+remove(from_step=0). Library compounding is the real story: registered ProgramIR leaves from prior runs transfer via recall. Training: 113/400 (from 104 baseline). Eval: 28/400 (from 26). library_recall: 67 train, 14 eval. 2,583 tests.
2026-03-23 — AutoCapabilityLoop: automatic symbolic macro induction and validation
Built a safe automatic capability-growth loop (src/autocap). The system now automatically: (1) mines failure clusters from benchmark diagnostics by failure_reason x semantic_tag, (2) induces recurring multi-step ProgramIR macro templates from library entries, (3) proposes role+body_kind selection templates from existing perception roles, (4) validates each candidate via exact train verification + compression gate + overfit rejection + leave-one-out, (5) admits validated macros into a JSONL macro store, (6) exposes admitted macros to search_compositional and search_role_compositional as extra candidates. All artifacts are serializable and inspectable — no closures, no arbitrary code generation, no dynamic self-modification. Manual primitives remain the hard safety boundary. 9 focused tests, CLI at scripts/run_autocap.py. 2,584 tests passing.
2026-03-23 23:00 — Selection persistence + bg auto-detect: +9 train, +2 eval
Added StepSelect.from_step for cross-step selection persistence. Step 2 can now reference step 1's original selected objects instead of re-detecting from the mutated grid. This fixes draw_ray+remove_marker 2-step compositions where painted ray cells were wrongly detected as new markers. Also fixed execute_program_ir bg: auto-detect from grid when ProgramIR.background is None (was hardcoded to 0). This unblocked extend_line, stamp_template, grid_decompose, and connect solves on non-zero-bg tasks. Net: +10 gained, -1 lost (predicate_recolor on varying-bg task that worked by coincidence with bg=0). 113/400 train (+9), 28/400 eval (+2). 2,575 tests. Connect-between-markers (22 tasks) confirmed separate.
2026-03-23 22:00 — Role-enrichment: register role-structured ProgramIR for all solved tasks
Role-search found 20 solvable training tasks but only registered 1 ProgramIR (library_recall solved them first). Fix: _maybe_register_role_ir runs role_search as post-solve enrichment for every solved task, registers any verified ProgramIR with StepSelect metadata. Only registers entries with role structure (avoids single-step duplicates). No _strengthen_guards for ProgramIR entries. ProgramIR: 34→42 entries, 19→26 signatures, 1→4 role-annotated. New role entries: draw_ray (marker), draw_ray (by_color), draw_ray→draw_ray (by_color×2). Training: 103→113/400 (+10 from enriched library recalls). Eval: 26→28/400 (+2 — one from ProgramIR recall of draw_ray→remove with marker role). 5 new tests, 2531 passing.
2026-03-22 24:00 — Marker propagation cluster analysis + bg auto-detect fix
Quantified marker_propagation cluster: 144 unsolved same-dims tasks with markers. Subclusters: 91 ray, 22 connect, 17 mixed, 14 other. Added 2 new derived direction modes (toward_nearest_edge, away_from_nearest_edge) to existing 5 (center, corner, outward). Key finding: existing modes already covered the space — only 1 task (ea786f4a) solvable by draw_ray alone. Real bottleneck was bg=0 hardcoding in execute_program_ir. Fixed: auto-detect bg from grid when ProgramIR.background is None. This unblocked 5 library recall solves (extend_line, stamp_template) that previously failed on non-zero-bg tasks. Net +4 (5 gained, 1 lost from bg change). 108/400 train (+4), 26/400 eval (stable). 2,551 tests passing. Connect-between-markers (22 tasks) confirmed as a distinct semantic needing its own body kind — not a draw_ray variant.
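The background fallback can be sketched as below. The entry does not say how the background is detected; choosing the most frequent color is an assumption, and `resolve_background` is a hypothetical name.

```python
from collections import Counter

def resolve_background(grid, declared=None):
    """Use the declared background if present, else guess the most common color."""
    if declared is not None:
        return declared
    counts = Counter(v for row in grid for v in row)
    return counts.most_common(1)[0][0]

detected = resolve_background([[7, 7, 7], [7, 1, 7]])
explicit = resolve_background([[7, 7, 7], [7, 1, 7]], declared=0)
```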
2026-03-23 20:30 — Role-match recall bonus for ProgramIR entries
Turns ProgramIR role metadata into a ranking signal. Task-side: compute_task_role_summary builds a scene graph for the first input and extracts detected roles (marker, frame, template, separator, content, legend). Entry-side: ProgramIR.role_signature provides the ordered tuple of StepSelect roles. Scoring: _role_bonus gives up to 0.05 for ProgramIR entries whose StepSelect roles match the task's scene roles. Only applies to role-annotated ProgramIR entries; non-role entries and plain PatternDef entries always get 0. Diagnostics: recall_with_diagnostics now shows role_bonus field alongside composition_bonus. Both recall() and recall_with_diagnostics() accept task_role_summary kwarg. Training 103/400, eval 26/400. 12 new tests, 2500 passing.
2026-03-23 18:00 — ProgramIR inventory: role annotation, color_substitute lowering, role_search registration
Three targeted fixes to grow ProgramIR diversity: (1) role_search stage now attaches _program_ir to SynthProgram for ProgramIR registration — first role-annotated entry (draw_ray with StepSelect role=marker). (2) lower_synth_program now refines pixel_rule+color_map to color_substitute body kind and extract_object+selector to extract_by_predicate+predicate. New signature: fill_enclosed → color_substitute. (3) Added ProgramIR.role_signature, has_role_structure properties. Recall diagnostics now show signature and roles for ProgramIR entries. ProgramIR: 32→34 entries, 19→21 signatures, 0→1 role-annotated. registered_ir_from_role_search appears on both train and eval. 9 new tests, 2482 passing.
2026-03-22 14:05 — Cluster-focused solve loop prompt
Added scripts/prompts/capability_cluster.md for continuous_solve_loop.py. The old default loop prompt was tuned for library/extraction hygiene. The new prompt is tuned for capability growth on a repeated failure cluster: inspect a small representative set of tasks, classify the shared gap correctly (bind/body/role/composition/registration), and make one reusable architecture improvement per iteration. This keeps the loop aimed at substrate growth instead of generic benchmark churn.
2026-03-22 23:30 — Guidance A/B evaluation: heuristic + learned MLP
First empirical validation of AlphaGo-style search guidance. Added --guidance=none/heuristic/learned CLI flag. Built trace dataset: 98 examples from 82 library entries (67 schema, 31 ProgramIR). Trained MLP policy (80.6% top-1 accuracy, 15 classes) and value model. A/B benchmark results (training 400 tasks, timeout 5s): none=104, heuristic=104, learned=104 — same solve count. Beam expansions: none=46 avg, heuristic=47, learned=46. Beam time: none=244ms, heuristic=251ms, learned=257ms. Eval: 26/400 all three. Conclusion: guidance substrate confirmed working (diagnostics report policy/value activity, 330 beam tasks tracked), but no efficiency or solve-count gain yet. Bottleneck: (1) 98 all-positive training examples too small — need negative examples from failed branches. (2) Value model trivially overfit (100% train acc on all-positive). (3) Heuristic biases too weak to change beam ordering materially given max_depth=3, beam_width=20 constraint. Next: collect beam-trace negatives, augment dataset, retrain with balanced labels. 2503 tests passing.
2026-03-23 14:00 — ProgramIR inventory growth: seed schemas + no_mapper fallback
ProgramIR inventory was too small: 15 entries, 9 signatures. Two fixes: (1) Hypothesis stage now attaches verified ProgramIR to seed-schema SynthPrograms via _program_ir attribute. Registration stores it as a ProgramIR entry alongside the PatternDef — produces entries for extract_by_predicate, recolor, grid_decompose, fill_enclosed, partition_cell_map/broadcast, transpose, upscale_block, fold_symmetry. (2) Single-step no_mapper solves now fall back to ProgramIR lowering + registration when hypothesis_to_pattern fails. ProgramIR entries skip _strengthen_guards (self-verifying; restrictive guards only hurt recall). Result: ProgramIR 15→32 entries, 9→19 unique signatures. Training recall hit rate 12.3%→16.5% (+4.2pp). library_recall solves 46→61. Training 104/400, eval 26/400 unchanged. 4 new tests, 2473 passing.
2026-03-23 11:30 — DRAW_RAY spatial propagation body kind
New BodyKind.DRAW_RAY: ray propagation from seed/marker cells. Params: directions (cardinal/diagonal/all/specific), color (seed or explicit), seeds (_selected binding). Stops at grid boundary or non-bg obstacle. Integrated with StepSelect roles. Role search generates draw_ray candidates for 13 direction modes × seed/explicit colors. 101 tasks (40 train + 61 eval) match the ray pattern in analysis. However, most need per-marker direction derivation or inter-marker connection, not simple boundary rays. Result: 104/400 train (+1 new via role_search DRAW_RAY on 623ea044). 26/400 eval (stable). The cluster needs more sophisticated direction derivation for broader gains. 2,499 tests.
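The stopping rule (grid boundary or non-background obstacle) can be shown with a small ray-drawing sketch. The function below is not the DRAW_RAY executor: its signature, the cardinal-only direction table, and the choice to test obstacles against the input grid rather than the partially painted output are assumptions.

```python
DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def draw_ray(grid, seeds, directions, color=None, background=0):
    """Paint rays from each seed until the boundary or a non-background cell."""
    out = [row[:] for row in grid]
    h, w = len(grid), len(grid[0])
    for r, c in seeds:
        # color=None means "use the seed's own color".
        paint = grid[r][c] if color is None else color
        for name in directions:
            dy, dx = DIRS[name]
            y, x = r + dy, c + dx
            # Obstacles are read from the input grid, so rays may cross each other.
            while 0 <= y < h and 0 <= x < w and grid[y][x] == background:
                out[y][x] = paint
                y, x = y + dy, x + dx
    return out

rayed = draw_ray([[0, 0, 0, 0],
                  [2, 0, 0, 5],
                  [0, 0, 0, 0]], seeds=[(1, 0)], directions=["right", "up"])
```

Per-marker direction derivation, which the entry identifies as the missing piece, would sit in front of such a function and choose `directions` separately for each seed.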
2026-03-22 22:00 — AlphaGo-style search guidance layer
Added first neural-guided search layer: policy prior + value estimator integrated into beam search. SearchState representation (97-dim: 64 task features + 10 search progress + 23 body-kind histogram). Heuristic policy scorer uses co-occurrence transitions + structural biases. Heuristic value estimator detects dead-ends and stagnation. MLP policy/value architectures ready for training on search traces. CompositePolicy/CompositeValue blend learned + heuristic signals with graceful fallback. Beam search now accepts optional policy_scorer and value_estimator: policy reranks candidates, value prunes low-promise branches (threshold 0.03). Training data extraction from solved ProgramIR library entries and seed schema solves. 42 new tests, 2489 total (0 failures). Symbolic-first: exact verification unchanged, all guidance is advisory.
2026-03-23 08:30 — Role search expansion + exhaustive eval sweep
Expanded role vocabulary: by_color, contained, container, unique_color, minority_color, by_shape. Expanded role search templates: select-by-color recolor, remove+fold, remove+connect, remove+damage_repair, 2-step marker recolor+fold. Exhaustive eval sweep: 0/253 same-dims unsolved by simple body kinds, 0/150 by role-select compositions, 0 by diff-dims extract/crop. The 374 unsolved eval tasks genuinely require spatial operations beyond current body executors — the bottleneck is body set, not role infrastructure. 103/400 train, 26/400 eval (stable). 2,428 tests.
2026-03-23 09:00 — Positive composition-aware recall ranking for ProgramIR
Added positive composition matching to complement the negative structural pre-filter. Program-side: ProgramIR.precondition_tags extracts structural tags from step body kinds (damage, periodic, enclosed, connect, extract, geometric, partition). Task-side: compute_task_composition_tags derives matching signals cheaply (damage-color present, additive pixels, diff-dims, periodicity, partition detection). Scoring: _composition_bonus adds up to 0.10 based on Jaccard overlap of program tags vs task tags. Only ProgramIR entries receive the bonus; single-step recall is unaffected. Both recall() and recall_with_diagnostics() now accept task_composition_tags and show composition_bonus in diagnostics. Training 103/400, eval 26/400 — no regression. 19 new tests, 2459 passing.
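A Jaccard-based bonus capped at 0.10 can be sketched in a few lines. Linear scaling of the overlap is an assumption (the entry only says "up to 0.10"), and this `composition_bonus` is a stand-in for the private `_composition_bonus`, not its actual code.

```python
def composition_bonus(program_tags, task_tags, max_bonus=0.10):
    """Scale Jaccard overlap of the two tag sets into [0, max_bonus]."""
    p, t = set(program_tags), set(task_tags)
    union = p | t
    if not union:
        return 0.0  # no tags on either side, so no evidence either way
    return max_bonus * len(p & t) / len(union)

full = composition_bonus({"damage", "periodic"}, {"damage", "periodic"})
partial = composition_bonus({"damage", "periodic"}, {"damage", "enclosed"})
```

In the partial case the sets share one tag out of three distinct tags, so the bonus is one third of the cap.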
2026-03-23 06:30 — Composition-aware structural pre-filter for ProgramIR recall
ProgramIR recall had 99.5% waste: 600/603 attempts failed (verify_failed or guard_failed). The lowered similarity floor recalls compositions too broadly. Added program_ir_matches_task_structure() — a cheap structural pre-check before expensive verification. Two checks: (1) dimension compatibility — same-dims programs on same-dims tasks only, diff-dims programs (crop/extract) on diff-dims only. (2) damage_repair precondition — requires a plausible damage-color signal (color in input absent from output). 75 eval recall attempts now skip verification via "structure_mismatch", saving ~1.6s total (recall 12.3s→10.7s, 13% reduction). Diagnostics show step signature and mismatch reason. Zero solve regressions: 103/400 train, 26/400 eval. 12 new tests, 2399 passing.
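The two checks can be sketched as a cheap predicate that runs before verification. The dict-based program representation, the reason strings, and the rule that every training pair must show a damage color are assumptions; this is not `program_ir_matches_task_structure()` itself.

```python
def is_same_dims(pairs):
    """True if every (input, output) pair keeps the grid shape."""
    return all(len(i) == len(o) and len(i[0]) == len(o[0]) for i, o in pairs)

def has_damage_color(pairs):
    """True if each input contains a color that is absent from its output."""
    for inp, out in pairs:
        in_colors = {v for row in inp for v in row}
        out_colors = {v for row in out for v in row}
        if not in_colors - out_colors:
            return False
    return True

def matches_task_structure(program, pairs):
    """Return (ok, reason). program is a hypothetical dict with
    'same_dims' (bool) and 'steps' (list of body-kind names)."""
    if program["same_dims"] != is_same_dims(pairs):
        return False, "dims_mismatch"
    if "damage_repair" in program["steps"] and not has_damage_color(pairs):
        return False, "no_damage_color"
    return True, ""

repair = {"same_dims": True, "steps": ["damage_repair", "periodic_tile_fill"]}
damaged = [([[1, 6], [6, 1]], [[1, 1], [1, 1]])]
clean = [([[1, 1]], [[1, 1]])]
ok = matches_task_structure(repair, damaged)
skipped = matches_task_structure(repair, clean)
```

Both checks only scan the training grids once, which is why a filter of this kind costs far less than executing and verifying a recalled program.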
2026-03-23 04:00 — ProgramIR recall: multi-step compositions now recalled from library
ProgramIR entries were stored in the library (15 multi-step programs) but never recalled — the recall path only executed PatternDef bodies, not stored ProgramIR. Three fixes: (1) _run_library_recall now checks entry.program_ir and executes full ProgramIR via verify_program_ir + _make_program_ir_program. (2) _pattern_candidates creates beam candidates that execute full ProgramIR for entries with program_ir. (3) Library recall uses a lower similarity floor for ProgramIR entries (min_sim×0.75) because they self-verify against all training pairs. Training: 103/400 (+1 new solve via composition recall). library_recall 39→46 (+7), recall hit rate 10.3%→12.3%. Eval: 26/400 (+1 from unrelated role_search). library_recall 5→9 (+4), recall hit rate 1.3%→2.3%. The damage_repair+periodic_tile_fill compositions and connect+connect composition now recall from library instead of re-discovering via hypothesis or beam search. 9 new tests, 2382 total passing.
2026-03-23 02:00 — Role-structured compositional search
New architecture: StepSelect on ProgramStep for typed role-based object selection. Roles: marker, template, frame, separator, largest, smallest, by_color. _resolve_select builds scene graph, selects objects by role, injects as _selected binding for body executor. New pipeline stage: role_search (role-guided compositional search, 11.8ms train / 21.6ms eval). Generates candidates from scene role structure instead of brute-force. 135/375 unsolved eval tasks have marker+template roles — the dominant failure cluster. First solve: 12eac192 via “select markers, recolor to 3” (typed StepSelect program). Training: 103/400 (stable). Eval: 25→26 (+1 new role_search solve). library_recall: 6→9 on eval (ProgramIR leaves from prior runs). 2,411 tests.
2026-03-23 00:30 — Zero-transfer audit + predicate_recolor exposure fix
Audited all zero-transfer structural families. rotate_grid (3 train), mirror_concat (3), transpose (2), mirror_grid (2): genuinely absent from eval, not exposure gaps. fold_symmetry: 0 genuine eval matches (8 false-positive symmetric outputs). predicate_recolor seed schema had narrow predicate search (4) vs handwritten fallback (20+). Expanded to 8: not_is_smallest, not_is_largest, is_unique_color, is_minority_color. Added selectors in body executor + _background hint in execute_program_ir. Result: 103/400 train (+1), predicate_recolor: 1→3 hypothesis solves. 25/400 eval unchanged. train_fit_test_fail: 11→10. 2,391 tests.
2026-03-22 23:30 — Fix 2-step composition lowering via resolved library pattern params
Beam search finds 2-step compositions (e.g. fill_enclosed+remove, connect+connect, tile+rotate), but lowering to ProgramIR failed because library-recalled step-1 actions only stored opaque "pattern" name refs. lower_synth_program couldn't reconstruct the body. Fix: _pattern_candidates now stores _body_kind and _resolved_params on each recalled SynthAction, and lower_synth_program uses them to build ProgramSteps directly. Training: multi_action_verify_failed 9→1. Eval: multi_action_verify_failed 5→2, registered_ir_from_beam_search 0→2, already_in_library 5→6. One eval task now solved via library recall of a registered 2-step ProgramIR. Training/eval solve counts unchanged (102/25). 8 new tests, 2355 passing.
2026-03-22 22:00 — Hypothesis budget reclaim: marker_directed grid area guard
Profiled all 32 seed schemas on eval: marker_directed consumed 22.4s (67% of total hypothesis time) across 270 eval tasks with zero solves. On 20×20+ grids, its body executor costs 83ms/task but only verifies on grids ≤7×7 (area 49). Added max_grid_area=225 guard to SeedSchema. Training hypothesis time: 101ms→34ms avg (66% drop). Eval beam search gets more budget: 397ms vs 354ms avg, +5.6% expansions (57 vs 54). Training gains: 2 new predicate_recolor hypothesis solves (freed time lets more schemas try). Eval stays 25/400 but beam search coverage improves. No training regressions (102/400). 12 new tests, 2347 passing (excluding 7 pre-existing slide test failures from uncommitted work).
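A minimal sketch of how an area guard like this might gate a schema before its body executor runs. The class below only mirrors the `max_grid_area` field named above; `schema_applies` and the all-grids check are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

Grid = List[List[int]]


@dataclass
class SeedSchema:
    # Illustrative subset; the real SeedSchema presumably has more fields.
    name: str
    max_grid_area: Optional[int] = None


def schema_applies(schema: SeedSchema, grids: List[Grid]) -> bool:
    # Skip the schema when any non-empty grid exceeds the area limit
    # (assumed semantics: every grid of the task must fit).
    if schema.max_grid_area is None:
        return True
    return all(len(g) * len(g[0]) <= schema.max_grid_area for g in grids)


marker_directed = SeedSchema("marker_directed", max_grid_area=225)
small = [[0] * 15 for _ in range(15)]   # area 225, allowed
large = [[0] * 20 for _ in range(20)]   # area 400, skipped
```

Checking the guard costs a few multiplications per task, which is negligible next to the 83ms body-executor cost it avoids on large grids.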
2026-03-22 19:30 — TEMPLATE_RAY_STAMP: anchor-template propagation primitive
New body kind: detect largest object as template + smaller objects as directional markers. For each marker, derive cardinal/diagonal direction from marker position relative to template bbox, compute stride from edge gap, then repeatedly stamp recolored template copies along that ray until grid boundary with edge clipping. Supports multi-marker additive composition and diagonal propagation. 045e512c now solves end-to-end via seed_schema:template_ray_stamp path (train verified + test verified). 5 files changed, 7 new tests, 2390 passing. Potential cluster: 61 unsolved tasks with similar misrouting through partition logic.
2026-03-22 18:00 — Per-object SLIDE: selective contact/process primitive
Extended SLIDE_TO_WALL from global-only (all objects move) to per-object selective (chosen objects slide, others stay as obstacles). Params: direction, stop_mode (wall/obstacle), selector (by_color/by_size_rank/by_position_rank). Object solver now infers SLIDE when training pairs show varying-displacement translates with a consistent cardinal direction, verified via simulation. Also fixed tuple selector resolution to handle by_color + list-form selectors (JSON roundtrip). 5 files changed, 27 new tests, 2375 passing. Remaining for richer contact: multi-object slide ordering, diagonal slide, anchor-relative targeting.
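The selective slide can be sketched as follows, under simplifying assumptions: the selector is reduced to a single color, the selected cells move as one rigid object, and the slide stops at whichever comes first, a wall or an obstacle. The function name and signature are hypothetical.

```python
from typing import List

Grid = List[List[int]]

DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def slide_object(grid: Grid, color: int, direction: str, bg: int = 0) -> Grid:
    """Slide all cells of `color` rigidly until a wall or another object blocks them."""
    dr, dc = DIRECTIONS[direction]
    h, w = len(grid), len(grid[0])
    cells = [(r, c) for r in range(h) for c in range(w) if grid[r][c] == color]
    own = set(cells)
    steps = 0
    while cells:
        k = steps + 1
        moved = [(r + k * dr, c + k * dc) for r, c in cells]
        # A step is legal if every target is in bounds and either background
        # or a cell currently occupied by the sliding object itself.
        legal = all(
            0 <= r < h and 0 <= c < w and ((r, c) in own or grid[r][c] == bg)
            for r, c in moved
        )
        if not legal:
            break
        steps = k
    out = [row[:] for row in grid]
    for r, c in cells:
        out[r][c] = bg
    for r, c in cells:
        out[r + steps * dr][c + steps * dc] = color
    return out


blocked = slide_object([[1, 0, 2, 0]], color=1, direction="right")
to_wall = slide_object([[1, 0, 0, 0], [0, 0, 0, 2]], color=1, direction="right")
```

In the first call the color-2 cell stays put and acts as an obstacle; in the second nothing blocks the row, so the object reaches the wall.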
2026-03-22 16:30 — Primitive-mining audit: ericagi1 → ericagi2 + by_position_rank selector
Structured audit of all 53 primitives/operations from ericagi1. Result: 37 already present, 4 missing bind/select vocab, 1 missing primitive, 6 missing exposure, 5 rejected as brittle. Top 5 import candidates ranked by eval transfer value. Implemented #1: by_position_rank object selector — selects objects by spatial order (top-to-bottom, left-to-right). Critical fallback when color/size are ambiguous. 3 touch points: eval_body resolver, candidates.py selector key + action builder. 6 new tests, 2339 passing. Deferred: per-object SLIDE (needs selectors first), by_shape selector, flood_fill_voronoi, grow/shrink_objects.
2026-03-22 01:00 — Container-projection analysis: one-off, not implemented
Analyzed 855e0971 (hole-in-stripe projection). Rule: for each monochrome rectangular stripe, project holes along the orthogonal axis within the container. Scanned all 800 tasks (train + eval): exact match = 1. All 4 projection variants (orthogonal/same/always-h/always-v) also match 1 task only. 117/400 tasks have the stripe-with-hole structure (86 unsolved), but none share the projection action. Decision: one-off — not implemented per design constraint. Containment detection infrastructure could be reused later if a multi-task cluster emerges.
2026-03-22 00:30 — Local-rule audit + hypothesis label fix + pixel_feature overfit gate
Audited all 12 neighborhood_rule and 17 residual pixel_rule training solves. Found: (1) hypothesis.py had same mislabeling bug as solver.py — 7 seed-schema solves still reported as pixel_rule. Fixed via body_kind_to_action_kind() in hypothesis.py. pixel_rule: 17→1 residual (sequence body kind pattern). extract_object: 9→14, partition_cell_map: 0→2, separator_cell_summary: 0→1. (2) neighborhood_rule: 11 of 12 are genuinely small (1-4 rules). 1 task (b6afb2da) had 13-rule pixel_feature lookup table — classic overfit. Added tighter gate: pixel_feature mode with >6 rules AND poor compression (>40%) is now rejected. (3) meta_rule mode transfers (3 train → 1 eval = 0.33x). neighbor_recolor does not (7 train → 0 eval). Training: 104→104 (b6afb2da rejected but cached). Eval: 25/400 unchanged. 9 new tests (54 total in mapping file), 2333 total passing. Transfer audit now shows local-rule mode breakdown.
2026-03-21 23:15 — Structural action kind resolution
All 52 pixel_rule training solves were structural ops (rotate, flip, tile, connect, recolor, etc.) mislabeled as pixel_rule. Added body_kind_to_action_kind() mapping (29 direct + 15 override). Library recall now produces correct ActionKind instead of always PIXEL_RULE. Result: pixel_rule dropped from 52→17 (was 50% of training, now 16%). Newly visible families: tile_transform=7, recolor=7, connect=6, damage_repair=4, rotate_grid=3, mirror_concat=3, fold_symmetry=2, mirror_grid=2, transpose=2. Transfer picture now truthful: damage_repair 1.50x, tile_transform 0.50x, pixel_rule exactly 0.24x (matches overall ratio). Structural patterns also get lower MDL cost (1.0 vs 3.0), so overfit-detection no longer flags them. 102/400 train, 25/400 eval — zero solve count change. 45 new tests, 2324 total passing.
2026-03-21 22:30 — Train-vs-eval transfer audit
Full transfer audit: 104/400 train vs 25/400 eval (0.24 ratio). pixel_rule dominates training (50% of solves, 52 train) but transfers at 0.19 — fair but below par. neighborhood_rule weak transfer (12→1, ratio 0.08). Strong transferrers: damage_repair (3→3, 1.00), upscale_block (3→2, 0.67), fill_enclosed/connect/recolor (1:1 each). Library recall: 41 solves on train, only 5 on eval — guard_failed dominates (2869 eval vs 2519 train). Seed schemas transfer well: damage_repair_tile_fill solves 3 eval tasks. 15 families are train-only singletons. Top bottleneck: local-rule reliance (pixel_rule + neighborhood_rule = 62% of training wins but mostly train-side compression). Added scripts/transfer_audit.py for reproducible side-by-side reporting.
2026-03-21 20:45 — Body-executor parity fixes
Fixed 5 parity gaps between SynthAction closures and body executors. connect: sequential H→V for connect_both, diagonal modes (d1/d2/both_diag/h/v). recolor: from_color/to_color substitution, object selectors (is_smallest, by_size_rank), color_map param alias. remove: pixel-level color removal. border_draw: bbox-outline mode. multi_action_verify_failed: 8→6 (remaining are opaque pattern refs + neighbor_rule tables). library_recall: 35→39 (ProgramIR leaves now registered and recalled). 2,279 tests.
2026-03-21 — Multi-action lowering audit
Body mappings added for grow/shrink/object_outline. multi_action_unlowerable: 3→0. Remaining 8 multi_action_verify_failed are body-executor parity gaps in individual steps (connect, recolor, fill_enclosed body executors produce different results from SynthAction closures), not composition issues. 2,247 tests.
2026-03-21 — Docs consolidation: 102/400, zero fallback
Docs updated to reflect: 102/400 solved, 32 seed schemas, zero handwritten fallback, 45+ body kinds. STATUS.md rewritten. CLAUDE.md updated. README.md with accounting semantics.
2026-03-21 — Zero handwritten fallback
PARTITION_MAX_CELL_FILL: fill cells with max non-bg count, clear rest. 29623171 was hiding an explicit “fill most-marked cells” rule, not a classifier. Both former classifier tasks now solve via seed schemas. Handwritten fallback: 0. The entire hypothesis layer is now seed-schema / ProgramIR driven. 2,247 tests.
2026-03-21 — PARTITION_CELL_BROADCAST: solve 09629e4f
New body kind PARTITION_CELL_BROADCAST: select one partition cell by criterion (least/most nonbg), broadcast its values as template over partition grid. 09629e4f now SOLVES (test-verified) — the hardest-analyzed partition task through V1-V6 modes and classifiers. 29623171 correctly does not match (sole remaining handwritten fallback). 6 new tests, 2,246 total.
2026-03-21 — BORDER_DRAW executor + final fallback audit
Added BORDER_DRAW body executor (8-conn and 4-conn border around objects). border_draw now seed-schema-primary. partition_cell_classify (2 tasks) is the sole remaining handwritten fallback — it builds a per-task learned lookup table that cannot be a fixed seed schema. All other hypothesis families now run through seed schemas. 2,240 tests.
2026-03-21 — Fix 3 body-executor parity gaps
GRID_DECOMPOSE: axis-aware separator detection for row/col-only splits. EXTRACT_BY_PREDICATE: frame_interior predicate uses scene perception for frame + interior crop. SEPARATOR_SUMMARY: unique/most_nonbg/least_nonbg return actual cell content. 3 fewer handwritten fallbacks. Remaining: border_draw (1), partition_cell_classify (2). 2,240 tests.
2026-03-21 — Fallback audit: 6 body-executor parity gaps
Audited remaining 6 handwritten fallback tasks. All are genuine body-executor parity gaps, not missing schemas: border_draw (algorithm differs), frame_interior (missing extraction), select_unique_cell (wrong cell values), separator_operation (axis param ignored), partition_cell_classify (complex canonicalization). Need per-family body executor fixes, not more seed schemas. No code changes — measurement result. 2,240 tests.
2026-03-21 — Parity fixes: retire handwritten fallback
Added missing body executors for MIRROR_CONCAT (h/v concat + flip) and OBJECT_EXTENSION (directional pixel extension). Fixed fill_enclosed to derive output-only colors. 11 of 13 previously-handwritten tasks now use seed schemas. Remaining handwritten fallback: border_draw (1), frame_interior (1), partition-specific (4). 2,240 tests.
2026-03-21 — Partition seed schemas (complete de-hardcoding)
4 partition schemas: select_unique_cell (3 modes), separator_cell_summary (5 modes), partition_cell_map (20 modes derived), partition_cell_classify (8 strategies derived). 30 total schemas. The entire hypothesis layer is now covered — every _hyp_* family has a seed schema that runs first. Handwritten code remains only as fallback. 7 new tests, 2,239 total.
2026-03-21 — Quantitative/object seed schemas
4 quantitative/object schemas: object_sort (3 keys), stack_objects (axis×sort derived), object_extension (4 dirs×2 stop modes), color_count_output. 26 total schemas. Honest note: object_extension body executor differs from handwritten _hyp_* — seed schema registered but match depends on body executor canonicality. Remaining handwritten: only partition families and separator_cell_summary. 5 new tests, 2,230 total.
2026-03-21 — Template/region seed schemas
3 template/region schemas: row_col_projection (3 modes), input_as_template (derives scale from dim ratios, 3 upscale modes), template_stamp (clear_source True/False). 22 total schemas. 9172f3a0 upscale now via seed schema. Remaining handwritten: stack_objects, object_extension/sort, color_count_output, partition families. 6 new tests, 2,225 total.
2026-03-21 — Diff-derived seed schemas
3 diff-derived schemas: color_mapping (learns remap from diffs), crop_to_colored_region (per-color bbox crop), 1x1_summary (5 summary modes). ProgramIR executor now auto-separates complex values into bindings. 19 total schemas. b1948b0a color remap now via seed schema. Remaining handwritten: template_stamp, stack_objects, object_extension/sort, input_as_template, row_col_projection, partition families. 6 new tests, 2,219 total.
2026-03-21 — Object-level seed schemas
4 object-level seed schemas: object_removal (predicates: smallest/markers/color_X), marker_directed (4 modes), predicate_recolor (predicate×color combos), border_draw (connectivity×color). 16 total schemas. 5582e5ca now solves via seed_schema:marker_directed. Handwritten _hyp_* only for complex object-relational, partition, stack/crop. 7 new tests, 2,213 total.
2026-03-21 — Scene-derived seed schemas
4 new scene-derived seed schemas: extract_by_predicate (enumerates predicates), fill_enclosed (derives fill colors), damage_repair (detects damage color + symmetry), damage_repair_tile_fill (2-step with residual detection). 12 total seed schemas. damage_repair and extract now solve via seed schema path, not handwritten Python. Handwritten _hyp_* only for complex object-relational and partition families. 6 new tests, 2,206 total.
2026-03-21 — Seed schemas: ProgramIR as primary hypothesis path
8 seed schemas lower manual hypotheses to explicit ProgramIR: gravity, mirror_grid, rotate_grid, transpose, fold_symmetry, mirror_concat, upscale_block, grid_decompose. solve_by_hypothesis tries seed schemas first, falls back to handwritten _hyp_* for unsupported families. hypothesis_source diagnostic tracks solve origin. Verified: flip/rotate tasks now solve via seed_schema path, damage_repair correctly falls back. 12 new tests, 2,200 total.
2026-03-21 — ProgramIR compositional substrate
Architecture pivot: explicit ProgramIR with ordered ProgramSteps (body_kind + params). Serializable, executable through existing body eval, no closures. Lowering from SynthProgram and PatternDef. Multi-action solver wins now register as ProgramIR-bearing library entries. search_compositional() searches 1-2 step programs. Verified: 0dfd9992 2-action solve → ProgramIR leaf. 16 new tests, 2,188 total.
2026-03-21 — PERIODIC_TILE_FILL composition
New PERIODIC_TILE_FILL body kind: infers smallest 2D tile from non-damage pixels, fills residual holes. Damage repair hypothesis emits 2-action composition when peer symmetry repair leaves periodic residuals. +3 new solves: 0dfd9992, 29ec7d0e, c3f564a4 (all test-verified). b8825c91 unaffected. 3631a71a correctly not solved. 10 new tests, 2,172 total.
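A rough sketch of the tile-inference idea: try tile sizes in order of increasing area, keep the first one that is consistent with every non-damage pixel, then fill the damaged cells from it. The function names, the area-first ordering, and the requirement that every tile position be observed at least once are assumptions, not the actual body executor.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]


def infer_tile(grid: Grid, damage: int) -> Optional[Tuple[int, int, Dict[Tuple[int, int], int]]]:
    """Return (tile_h, tile_w, tile) for the smallest consistent 2D period, or None."""
    h, w = len(grid), len(grid[0])
    sizes = sorted(
        ((p, q) for p in range(1, h + 1) for q in range(1, w + 1)),
        key=lambda s: (s[0] * s[1], s),
    )
    for ph, pw in sizes:
        tile: Dict[Tuple[int, int], int] = {}
        consistent = True
        for r in range(h):
            for c in range(w):
                v = grid[r][c]
                if v == damage:
                    continue  # damaged pixels carry no evidence
                if tile.setdefault((r % ph, c % pw), v) != v:
                    consistent = False
                    break
            if not consistent:
                break
        # Require every tile position to be observed so all holes can be filled.
        if consistent and len(tile) == ph * pw:
            return ph, pw, tile
    return None


def periodic_tile_fill(grid: Grid, damage: int) -> Optional[Grid]:
    found = infer_tile(grid, damage)
    if found is None:
        return None
    ph, pw, tile = found
    return [
        [tile[(r % ph, c % pw)] if v == damage else v for c, v in enumerate(row)]
        for r, row in enumerate(grid)
    ]


damaged = [
    [1, 2, 1, 2],
    [3, 4, 3, 4],
    [1, 2, 0, 0],
    [3, 4, 0, 4],
]
repaired = periodic_tile_fill(damaged, damage=0)
```

Here the 2×2 tile is the smallest period that fits, so the three damaged cells are restored to 1, 2 and 3.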
2026-03-20 — Damage-repair interpolation audit
Audited 5 transpose-only damage-repair tasks with residual peer-repair holes. Cluster is heterogeneous: 3 tasks (0dfd9992, 29ec7d0e, c3f564a4) are 2D-periodic tile fill, 2 tasks (3631a71a, 73251a56) are non-periodic context interpolation. Added inspector diagnostic: “partial damage repair” reports dc, symmetry, %resolved, residual count, and whether holes are periodic. No new family implemented — cluster too heterogeneous. 2,166 tests.
2026-03-21 — Registration accounting fix
no_registration_attempted was 100% library-recall solves (pixel_color_remap), not a leak. Tagged as “already_in_library”. Overfit-rejected local rules tagged as “rejected_overfit_local_rule.” Pure accounting, no solver change. 1 new test, 2,162 total.
2026-03-21 — Overfit local-rule rejection
Reject neighborhood_rule and pixel_rule beam winners with >15 original rules. These large memorized lookup tables produce train-fit-but-test-fail results (0-11% test accuracy). Verified: 2 worst offenders now rejected, small legitimate rules preserved, correct solves unaffected. New deferred_local_rule diagnostic flag. 6 new tests, 2,161 total.
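The rejection gate amounts to a small predicate. The sketch below is illustrative only: the constant and function names are made up, and only the two action kinds and the 15-rule threshold come from the entry above.

```python
LOCAL_RULE_KINDS = {"neighborhood_rule", "pixel_rule"}
MAX_ORIGINAL_RULES = 15  # threshold quoted in the entry above


def is_overfit_local_rule(action_kind: str, n_original_rules: int) -> bool:
    # Only local-rule winners are gated; structural ops are never rejected here.
    return action_kind in LOCAL_RULE_KINDS and n_original_rules > MAX_ORIGINAL_RULES


verdicts = {
    "small_neighborhood": is_overfit_local_rule("neighborhood_rule", 4),
    "memorized_table": is_overfit_local_rule("pixel_rule", 36),
    "structural": is_overfit_local_rule("rotate_grid", 40),
}
```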
2026-03-20 — Damage-repair family overhaul
Audited 3631a71a and found 13-task damage-repair / symmetry-repair cluster (0/13 solved). Fixed existing DAMAGE_REPAIR family: body executor param mismatch (method vs mode), added transpose/rot90/rot180 support, iterative peer repair for chained damage, multi-symmetry repair, explicit damage-color detection. +1 new solve (b8825c91). 5 transpose-only tasks need pattern interpolation for diagonal self-peer regions — identified as future work, not a simple peer-repair extension. 2,159 tests, all passing.
2026-03-21 — Semantic edit-feature clustering
New edit-semantic feature extraction: additive/recolor/deletive, periodic/diagonal, aligned objects, single fill color. scripts/semantic_clusters.py groups timeout tasks by edit semantics with heterogeneity warnings. Key finding: of 56 additive+aligned tasks, 46 are periodic/diagonal (pattern-gen, NOT line-fill). Previous coarse clustering was misleading; the distinction is now explicit. 15 new tests, 2,155 total.
2026-03-21 — Timeout cluster deep dive
Deep examination of the “6 clean extension tasks” cluster revealed they are NOT simple line-fills: they are complex pattern-generation tasks (fractals, checkerboards, growing L-shapes). No narrow reusable primitive covers them. Only 3 tasks in the entire timeout set have genuine straight-line fills. The timeout residual is genuinely hard: each task needs its own spatial reasoning logic. Train-fit-but-test-fail is dominated by overfit neighborhood_rule (7) and pixel_rule (3). No code changes; this is an honest measurement result.
2026-03-21 12:32 — Benchmark summary shows seed-schema sources
Benchmark main output now prints a top-level hypothesis-source summary for solved hypothesis tasks, so seed_schema vs handwritten fallback counts are visible without digging into the diagnostics footer or per-task JSON. The benchmark and diagnostics summary now share the same rendering helper, and a targeted benchmark-output test covers the new path.
2026-03-21 12:15 — Seed-schema source visible in diagnostics
inspect_task now prints solve_source, solve_family, hypothesis_source, registration_result, and program step count in the solver section, so seed_schema vs handwritten fallback is visible without opening raw JSON. Benchmark diagnostics summary also reports hypothesis-source counts and top schemas/fallbacks for solved hypothesis tasks. Targeted diagnostics tests pass.
2026-03-21 — Accounting fix: hard-timeout diagnostics
Fixed benchmark denominator mismatch (400 vs 399): when _SolverTimeout kills solver before diagnostics are created, benchmark now creates minimal SolveDiagnostics with failure_reason=“hard_timeout.” Remaining no_mapper: 11 (hollow_rect_op=3, 8 one-offs at 1 each). All low-volume one-offs left unmapped. 2,139 tests.
2026-03-20 — Pattern mappers for top no_mapper families
5 new mappers: transpose, move, rotate_grid, bbox_complement (new BodyKind + eval_body), recolor_closest (new BodyKind + eval_body). Covers 10/21 no_mapper cases from benchmark accounting. hollow_rect_op deliberately left unmapped (multi-mode semantics too complex for clean mapping). 2,144 tests pass.
2026-03-21 15:00 — General registration for non-hypothesis stages
Any fresh verified single-action solve with a pattern mapper can now register as a library candidate, not just hypothesis solves. Covers beam_search, diff_synthesis, pixel_infer winners. Multi-action compositions excluded. Extraction pipeline uses register_patterns=False to avoid interference. Registration reasons now include stage source. 2 new tests, 2,103 total.
2026-03-21 13:00 — Solve-to-library accounting
Track where solved programs are lost: new registration_result field (registered/no_mapper/duplicate/rejected/not_hypothesis), solve_family, solve_source on SolveDiagnostics. Benchmark prints accounting summary. Validated: 5582e5ca fresh solve → hypothesis → object_removal → registered → OE leaf. Main leakage: cached solves bypass registration; non-hypothesis stages (beam_search) have no registration path. 8 new tests, 2,103 total.
2026-03-21 11:00 — OE empirical validation
Ran ARC-1 training benchmark: 207 solved (72 beam, 26 hypothesis, 9 library, 7 diff, 4 pixel). Library: 6 entries (2 promoted, 4 candidate). 0 marker_directed solves, 2 object_removal solves (1 unique task). 0 real OE leaves because: (a) marker_directed doesn't match any real tasks yet, (b) cached solves bypass hypothesis registration. The OE pipeline is architecturally complete but upstream solve generation is the bottleneck — the families need to actually solve tasks before leaves can form.
2026-03-21 09:00 — OE backfill pipeline
PatternLibrary.backfill_oe_ir() attaches ObjectEditProgram IR to bridgeable entries (marker_directed, object_removal). scripts/oe_backfill.py runs offline audit: scan → backfill → anti-unify → report. Current library: 0 real OE leaves (needs benchmark run with new families). Pipeline validated on synthetic data. Real OE parent creation requires solved tasks first. 10 new tests, 2,093 total.
2026-03-21 07:00 — OE anti-unification
Conservative anti-unification for ObjectEditProgram leaves. Groups by transform signature, lifts only safe literal fields (size_le, size_ge, color, derive.value). Rejects when transforms differ, derive sources differ, or non-liftable fields (role, color_source) differ. OEParentCandidate records template + param_slots + child provenance. Library method creates parents only when compressive (≥2 children, ≥1 lifted, ≥2 source tasks). Parents stay candidate-only. Inspector shows OE-parent(Nsteps, Mparams). 15 new tests, 2,083 total.
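The lifting rule can be sketched for single-step leaves represented as plain dicts. This is a simplified illustration: the dict layout, the `$field` slot notation, and the function name are assumptions, and `derive.value` is omitted from the liftable set for brevity.

```python
from typing import Dict, List, Optional, Tuple

LIFTABLE = ("size_le", "size_ge", "color")  # safe literal fields from the entry above

Leaf = Tuple[str, Dict[str, object]]  # (source task id, step description)


def anti_unify(leaves: List[Leaf]) -> Optional[Tuple[Dict[str, object], List[str]]]:
    """Return (template, param_slots) if the leaves generalize compressively, else None."""
    # Compressive only with >= 2 children drawn from >= 2 distinct source tasks.
    if len(leaves) < 2 or len({task for task, _ in leaves}) < 2:
        return None
    keys = set().union(*(step.keys() for _, step in leaves))
    template: Dict[str, object] = {}
    slots: List[str] = []
    for key in sorted(keys):
        values = [step.get(key) for _, step in leaves]
        if all(v == values[0] for v in values):
            template[key] = values[0]       # shared literal stays fixed
        elif key in LIFTABLE:
            template[key] = f"${key}"       # differing safe literal becomes a slot
            slots.append(key)
        else:
            return None                     # transform, role, etc. must agree
    return (template, slots) if slots else None  # need >= 1 lifted field


parent = anti_unify([
    ("task_a", {"transform": "remove", "role": "marker", "size_le": 3}),
    ("task_b", {"transform": "remove", "role": "marker", "size_le": 5}),
])
mismatch = anti_unify([
    ("task_a", {"transform": "remove", "role": "marker", "size_le": 3}),
    ("task_b", {"transform": "recolor", "role": "marker", "size_le": 5}),
])
```

The first pair yields a parent with one parameter slot; the second is rejected because the transforms differ.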
2026-03-21 05:00 — Object-edit leaf storage
ObjectEditProgram leaves now persist in PatternLibrary via object_edit_ir field. Coexists with existing PatternDef — supplementary metadata, not replacement. Solver attaches IR at hypothesis registration for MARKER_DIRECTED and OBJECT_REMOVAL. Properties: is_object_edit_leaf, object_edit_step_count. Inspector shows OE-leaf(Nsteps). Save/load round-trips through library persistence. No normalization — exact operational thresholds stored as-is. Leaf-first only, no parent abstractions yet. 8 new tests, 2,068 total.
2026-03-21 03:00 — ObjectEditProgram IR
Minimal typed IR for object-level transforms: ObjSelector (role/size/color) → ColorDerivation (literal/scene-derived) → transform (remove/recolor/fill_interior/flood_adjacent). Fully serializable JSON round-trip, no closures. Multi-step sequential composition. Bridge converts MARKER_DIRECTED and OBJECT_REMOVAL to concrete IR leaves. Execution matches direct body evaluation exactly. Leaf-level infrastructure for future parent abstraction via anti-unification. 22 new tests, 2,060 total.
2026-03-21 01:00 — Marker-directed tightening + diagnostics scoping
Fixed MARKER_DIRECTED serialization: proper BodyKind.MARKER_DIRECTED with dedicated body executor. hypothesis_to_pattern → execute_pattern round-trips correctly. Marker color derived per-input (minority marker) instead of baked from pair 0. Inspector now diagnoses marker-template tasks where simple modes are exhausted as “likely needs object-spatial transform,” explicitly scoping 025d127b-class tasks out of simple marker-directed family. 2,038 tests.
2026-03-20 23:00 — Marker-directed object transforms
New MARKER_DIRECTED family for same-size tasks where marker objects indicate properties to apply to template objects. 4 modes: remove_markers, marker_color_fill_interior, marker_recolor_templates, marker_fill_adjacent. Hypothesis + candidate generation, gated by marker/template role presence. Targets 50 identified same-size timeout tasks that were falling back to irrelevant global transforms. 10 new tests, 2,033 total.
2026-03-20 21:00 — Binding diagnostics + failure classification
New resolve_bindings_with_diagnostics showing per-binding failure reasons (missing property, inconsistent color map, no unique output color). Inspector now shows specific bind failure reasons. Diagnosis distinguishes “irrelevant recall” (bind ok, verify fails) from “real bind failure” (specific reason). Library hygiene: rejected non-primitives not inserted (name slot stays open). Deferred stages correctly attributed in traces. 3 new properties. 8 new tests, 2,021 total.
2026-03-19 21:00 — Overfit-bypass for neighborhood_rule
When diff_synthesis or pixel_infer returns a neighborhood_rule program, stash as fallback and let beam search try to find a generalizing alternative. Fixes “early stage steal” where overfitting programs prevent better ones from being found. 50cb2852 flips to solved. b230c067 improves 11→4 wrong pixels.
2026-03-19 20:15 — Known-truths candidate filter
Beam search candidates now filtered by known-truth invariant: reject any candidate that corrupts pixels already correct across all train pairs. Median 80% candidate reduction across 186 timeout tasks. 154 tasks lose >50% of useless candidates (flips, rotates, remaps that destroy correct pixels). Falls back to unfiltered if filter kills everything. Motivated by analysis showing 147 “zero-expansion” tasks where generic candidates all make things worse.
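A minimal sketch of the invariant, assuming same-size tasks and candidates modeled as grid-to-grid functions. A pixel counts as a known truth when the input already matches the output at that position. Function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Grid = List[List[int]]
Candidate = Callable[[Grid], Grid]


def corrupts_known_truths(candidate: Candidate, pairs: Sequence[Tuple[Grid, Grid]]) -> bool:
    for inp, out in pairs:
        pred = candidate(inp)
        for r, row in enumerate(inp):
            for c, v in enumerate(row):
                # The pixel was already correct; the candidate must not change it.
                if v == out[r][c] and pred[r][c] != v:
                    return True
    return False


def filter_candidates(candidates: Sequence[Candidate], pairs: Sequence[Tuple[Grid, Grid]]) -> List[Candidate]:
    kept = [f for f in candidates if not corrupts_known_truths(f, pairs)]
    # Fall back to the unfiltered set if the filter rejects everything.
    return kept or list(candidates)


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_lr(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def fill_corner(g: Grid) -> Grid:
    out = [row[:] for row in g]
    out[-1][-1] = 2
    return out


pairs = [([[1, 0], [0, 0]], [[1, 0], [0, 2]])]
survivors = filter_candidates([identity, flip_lr, fill_corner], pairs)
```

The flip is rejected because it moves the already-correct 1 in the top-left corner; the other two candidates survive.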
2026-03-19 19:30 — Per-color morpho ops + gravity candidates
New diff_candidates module: per-color grow/shrink and per-color gravity as beam search candidates. Finer-grained than all-color variants — only expand, erode, or slide one specific color. 49 timeout tasks show improved pixel accuracy, 7 formerly zero-expansion tasks now have beam progress. Motivated by analysis showing 147 tasks generate candidates where zero reduce residual.
2026-03-19 18:30 — Morphological grow/shrink primitives
Added morphological dilation (grow) and erosion (shrink) as beam search candidate actions. Both support 4-connected and 8-connected neighborhoods. 20 timeout tasks show improved pixel accuracy when grow/shrink is used as a composition step (top: +37px on 6cdd2623 via shrink4). These are fundamental operations that were completely absent from the candidate set. 15 new tests, 1,985 total.
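For reference, one step of dilation and erosion on a color grid might look like the sketch below. It assumes cells outside the grid are ignored and that a grown cell takes the color of the first non-background neighbor found; the real candidate actions may resolve both differently.

```python
from typing import List

Grid = List[List[int]]

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow(grid: Grid, bg: int = 0, conn: int = 4) -> Grid:
    """One dilation step: background cells touching an object take its color."""
    offsets = N4 if conn == 4 else N8
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(h):
        for c in range(w):
            if grid[r][c] != bg:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and grid[rr][cc] != bg:
                    out[r][c] = grid[rr][cc]
                    break
    return out


def shrink(grid: Grid, bg: int = 0, conn: int = 4) -> Grid:
    """One erosion step: object cells touching in-grid background are cleared."""
    offsets = N4 if conn == 4 else N8
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(h):
        for c in range(w):
            if grid[r][c] == bg:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and grid[rr][cc] == bg:
                    out[r][c] = bg
                    break
    return out


dot = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
plus = grow(dot, conn=4)
block = grow(dot, conn=8)
```

Both functions read from the original grid and write to a copy, so a single call never cascades beyond one pixel of growth or erosion.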
2026-03-20 17:00 — Rule-table compression
Conservative exact compression for neighborhood_rule tables: remove irrelevant feature dimensions, factor dominant default outputs. Example: 7-rule 5-dim table → 3 rules on 2 dims. Compressed at candidate creation time, stored in serializable params. Body executor handles relevant_dims + default_output. Quality audit: is_oversized_rule_table rejects >20-entry tables with weak support. Inspector shows compression ratio. New rule_compress module. 7 new tests, 1,946 total.
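The two compression steps can be sketched on a table keyed by feature tuples. This is an illustrative greedy version, not the rule_compress module itself: a dimension is dropped only if the projected table stays conflict-free, and the most frequent output becomes the default. It reproduces every original entry exactly, but unseen feature tuples now fall through to the default.

```python
from collections import Counter
from typing import Dict, List, Tuple

Features = Tuple[int, ...]


def compress_rule_table(table: Dict[Features, int]) -> Tuple[List[int], int, Dict[Features, int]]:
    """Return (relevant_dims, default_output, residual_rules)."""
    n_dims = len(next(iter(table)))
    dims = list(range(n_dims))
    for d in range(n_dims):
        trial = [k for k in dims if k != d]
        projected: Dict[Features, int] = {}
        consistent = True
        for feats, out in table.items():
            key = tuple(feats[k] for k in trial)
            if projected.setdefault(key, out) != out:
                consistent = False  # dropping d merges rules with different outputs
                break
        if consistent:
            dims = trial
    projected = {tuple(f[k] for k in dims): o for f, o in table.items()}
    default = Counter(projected.values()).most_common(1)[0][0]
    rules = {k: v for k, v in projected.items() if v != default}
    return dims, default, rules


def apply_rule(feats: Features, dims: List[int], default: int, rules: Dict[Features, int]) -> int:
    return rules.get(tuple(feats[k] for k in dims), default)


table = {
    (0, 1, 0): 5,
    (1, 1, 0): 5,
    (0, 2, 1): 7,
    (1, 2, 0): 7,
    (0, 3, 0): 5,
}
dims, default, rules = compress_rule_table(table)
```

In this example the output depends only on the middle feature, so five 3-dim rules shrink to one explicit rule on one dimension plus a default.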
2026-03-20 15:00 — Explicit neighborhood_rule representation
Converted neighborhood_rule from hidden-logic closure family to explicit serializable representation. Rule tables now stored in params/body data as JSON-compatible structures. Two formats: pixel_feature (5-tuple feature → color dict) and neighbor_recolor (conditional rule list). Body executor dispatches by mode, falling back to legacy dict format. hypothesis_to_pattern mapper creates explicit patterns. Quality audit no longer flags neighborhood_rule as hidden_logic when rule_table present. New rule_table_size property. 11 new tests, 1,939 total.
2026-03-20 13:56 — Immediate learn-progress logging
Restored visible benchmark progress when running with --learn. iterative_extract now emits task_start events before each solve, so benchmark.py prints an immediate [learn ...] starting line instead of staying silent until the first task finishes. Also fixed the extraction progress helper placement in benchmark.py so diagnostics summary and learn logging stay cleanly separated.
2026-03-20 13:44 — Cooperative beam-search timeout fix
Fixed a real timeout bug exposed by task 3631a71a. The pipeline deadline was only checked between stages, so once beam_search started it could run for minutes despite a 5s task budget. beam_search now accepts a cooperative deadline and bails before candidate generation and during candidate evaluation. Repro: 3631a71a previously recorded ~195.6s in beam_search; after the fix, a 6s solve_with_diagnostics run returns in ~6.15s with beam_search capped at ~3.66s. Added regression tests for expired and mid-loop beam deadlines.
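The shape of the fix can be sketched with a generic beam search that checks a monotonic deadline both before generating candidates and while evaluating them. The signature, return convention, and callback names here are hypothetical, not the project's beam_search API.

```python
import time
from typing import Callable, Iterable, Optional, Tuple, TypeVar

S = TypeVar("S")


def beam_search(
    initial: S,
    expand: Callable[[S], Iterable[S]],
    score: Callable[[S], float],
    is_goal: Callable[[S], bool],
    beam_width: int = 4,
    deadline: Optional[float] = None,
) -> Tuple[Optional[S], str]:
    """Return (state, "solved"), (None, "deadline") or (None, "exhausted")."""

    def expired() -> bool:
        return deadline is not None and time.monotonic() >= deadline

    beam = [initial]
    while beam:
        if expired():                 # check before candidate generation
            return None, "deadline"
        candidates = []
        for state in beam:
            for nxt in expand(state):
                if expired():         # check during candidate evaluation
                    return None, "deadline"
                if is_goal(nxt):
                    return nxt, "solved"
                candidates.append(nxt)
        beam = sorted(candidates, key=score)[:beam_width]
    return None, "exhausted"


solved = beam_search(0, lambda n: [n + 1, n + 2], lambda n: abs(5 - n), lambda n: n == 5)
already_late = beam_search(
    0, lambda n: [n + 1], lambda n: 0.0, lambda n: False,
    deadline=time.monotonic() - 1.0,
)
cut_off = beam_search(
    0, lambda n: [n + 1], lambda n: 0.0, lambda n: False,
    deadline=time.monotonic() + 0.05,
)
```

The last call would loop forever without the deadline, since its goal test never succeeds; with it, the search returns shortly after the 50ms budget.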
2026-03-20 13:00 — Generalized primitive-quality audit
Quality audit now covers all pattern families, not just classifiers. New properties: extractability_class (explicit/partial/hidden_logic), is_literal_heavy (multi-constant, no lifting), has_hidden_logic (closure-dependent families). quality_score penalizes: hidden-logic 70%, literal-heavy 50%, memorizing classifiers 90%. reject_non_primitives gates: hidden-logic single-source, literal-heavy single-source, memorizing classifiers. Inspector shows quality/lit_ratio/family/flags per recalled pattern. Benchmark table adds Ext column with L/H/M flags. 14 new tests, 1,928 total.
2026-03-20 11:00 — Classifier quality as system-wide filter
V6 metrics now gate the entire pattern lifecycle. New PatternEntry fields: classifier_sig_count, classifier_test_unseen, classifier_n_test_cells, classifier_strategy. quality_score applies 90% penalty for memorizing classifiers (>50% unseen rate). reject_non_primitives gates them before promotion. Metrics computed at hypothesis registration time, serialized in library, shown in benchmark table. partition_cell_map mapper added for pattern serialization. 13 new tests, 1,914 total.
2026-03-20 09:00 — Compressive cell signatures (Partition V6)
8 canonicalization strategies ordered by compression: 5 compressive (n_nonbg, color_freq_profile, n_colors_n_objs, occupancy, occupancy_n_colors) then 3 exact (color_rank, color_rank_posn, sorted_color_set). Inspector shows per-strategy: signature count, consistency, test unseen rate. Classifier tries compressive first — genuinely solvable tasks use few signatures that generalize to test; memorization is clearly flagged via test_unseen metric. 3 new tests, 1,887 total.
2026-03-20 07:00 — Cell pattern classifier (Partition V5)
New partition cell classifier family: treats each cell as a structured token, computes canonical signatures under 4 strategies (occupancy, color_rank, color_rank_posn, sorted_color_set), builds sig→output lookup from training pairs. Distinct from summary modes — learns a lookup table rather than computing a fixed function. Inspector shows per-strategy consistency/conflict/match status. Correctly train-fits 09629e4f (36 sigs) but flags non-generalization. 6 new tests, 1,883 total.
2026-03-20 05:00 — Template completion reasoning (Partition V4)
Added 4 template completion modes for position-based reasoning: missing_position_fill (consensus color at absent majority-template position), extra_position_color (color at position unique to this cell), sparse_position_color (color at rarest position), dense_position_color (color at most common position). 20 total partition modes across 4 categories (palette/relational/structural/completion). Pre-computes cross-cell position-color maps for O(1) per-cell lookups. Inspector categorizes modes with per-category failure reporting. 6 new tests, 1,875 total.
2026-03-20 03:00 — Cross-cell relational reasoning (Partition V2)
Extended partition cell map with 4 cross-cell relational modes: missing_from_global_palette, missing_from_row_palette, missing_from_col_palette, unique_vs_grid. Pre-computes per-cell palette stats and threads them through execution. Inspector now probes all 8 modes per partition task, showing match/fail with per-mode cell mismatch counts. Diagnosis reports when all modes fail or when a winning mode is found. Refactored body executor into shared _partition_cell_fill for hypothesis/candidate reuse. 8 new tests, 1,862 total.
2026-03-20 01:00 — Partition cell map operator family
New general partition/cell operator for same-size separator-grid tasks. PARTITION_CELL_MAP operates over detected partition cells, applying per-cell summary functions (dominant non-bg, unique non-bg, max/min color) and rendering back into the same cell layout with scaffold preserved. Added as ActionKind, BodyKind, hypothesis family, and beam search candidate generator. TaskContext now caches partition scenes (has_partition, partitions properties). 17 new tests, 1,854 total.
2026-03-19 23:00 — Task inspection tool
New diagnostic inspector: python scripts/inspect_task.py --task-id ID. Shows task summary, hierarchical perception (plain vs partitioned), structural triples by category, hint predictions, library recall with per-pattern guard/bind/verify probing, full solver diagnostics with stage timings, and synthesized bottleneck diagnosis. Reuses existing pipeline (solve_with_diagnostics, recall_with_diagnostics) — no duplicate solve logic. Supports --json for machine-readable output, --show-grids for cell details. Diagnosis engine identifies 11 bottleneck classes from recorded signals. 25 new tests, 1,837 total.
2026-03-19 21:30 — Phase 4: separator-aware hierarchical perception
New perception path for separator-partitioned grids (e.g. 09629e4f). Detects full-span separator rows/cols, extracts rectangular cells between them, builds per-cell sub-scenes with local background inference. New types: CellRegion, PartitionScene, HierarchicalScene. build_hierarchical_scene() overlays partition detection on existing scene graph. Structural triples gain 8 partition predicates (has_separator_grid, uniform_cells, cell palette/symmetry/object_count varies, cell_summary_task). SolveDiagnostics extended with perception_mode, partition_rows/cols/cell_count. Benchmark prints partition perception aggregate stats. 19 new tests, 1,809 total, 100% coverage.
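A rough sketch of the detection step, assuming a separator is a full-span row or column of a single non-background color and that 0 is the background. The function names here are illustrative; they are not the CellRegion or PartitionScene APIs, and local background inference is left out.

```python
def find_separators(grid, bg=0):
    """Indices of rows and columns made entirely of one non-background color."""
    rows = [r for r, row in enumerate(grid)
            if row[0] != bg and all(v == row[0] for v in row)]
    cols = [c for c in range(len(grid[0]))
            if grid[0][c] != bg and all(row[c] == grid[0][c] for row in grid)]
    return rows, cols


def spans(separators, size):
    """Half-open intervals between separator indices, skipping empty gaps."""
    out, start = [], 0
    for s in list(separators) + [size]:
        if s > start:
            out.append((start, s))
        start = s + 1
    return out


def extract_cells(grid, bg=0):
    """Rectangular cell bounds (r0, r1, c0, c1) between the detected separators."""
    rows, cols = find_separators(grid, bg)
    return [(r0, r1, c0, c1)
            for r0, r1 in spans(rows, len(grid))
            for c0, c1 in spans(cols, len(grid[0]))]


grid = [
    [1, 0, 5, 0, 0],
    [0, 0, 5, 0, 2],
    [5, 5, 5, 5, 5],
    [0, 3, 5, 4, 0],
    [0, 0, 5, 0, 0],
]
separators = find_separators(grid)
cells = extract_cells(grid)
```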
2026-03-19 18:00 — Neural-guided search, tiered library, diagnostics infrastructure
HintNet multi-head classifier predicts task priors (size family, action kind, flags) to rerank library recall, candidates, and beam search — model only reranks, verification stays in control. Pattern library now uses candidate/promoted/rejected tiers: promotion requires ≥2 distinct provenance tasks and non-source success; failure-based auto-demotion for high-use low-success patterns; default recall returns only promoted patterns. New --diag mode in benchmark.py emits per-task JSON with stage timing, recall quality, candidate coverage, beam stats, failure taxonomy, and hint quality metrics. Bug fixes: pixel_infer now tries geometric extractors before color maps (fixes generalization failure on fliplr tasks); action-family grouping now consistent between training and runtime; expensive solver stages (pixel infer, body sweep, neural) now respect timeout budget.
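The tier rules can be illustrated with a small sketch. The PatternRecord fields and the demotion thresholds (min_uses, min_rate) are assumptions made for this example; only the promotion condition (at least 2 distinct provenance tasks plus a solve on a non-source task) and the promoted-only default recall come from the entry above.

```python
from dataclasses import dataclass, field


@dataclass
class PatternRecord:
    name: str
    provenance: set = field(default_factory=set)  # task ids the pattern was learned from
    solved: set = field(default_factory=set)      # task ids the pattern has solved
    uses: int = 0
    successes: int = 0
    tier: str = "candidate"


def update_tier(p, min_uses=10, min_rate=0.1):
    """Demote high-use low-success patterns, else promote when the criteria hold."""
    if p.uses >= min_uses and p.successes / p.uses < min_rate:
        p.tier = "rejected"  # thresholds are hypothetical
    elif len(p.provenance) >= 2 and (p.solved - p.provenance):
        p.tier = "promoted"  # >=2 provenance tasks and a non-source success
    return p.tier


def recall(library):
    """Default recall: promoted patterns only."""
    return [p.name for p in library if p.tier == "promoted"]


library = [
    PatternRecord("fill", {"t1", "t2"}, {"t1", "t3"}, uses=3, successes=2),
    PatternRecord("crop", {"t4"}, {"t4"}, uses=1, successes=1),
    PatternRecord("shift", {"t5", "t6"}, {"t7"}, uses=20, successes=1),
]
tiers = [update_tier(p) for p in library]
recalled = recall(library)
```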
2026-03-19 14:00 — Comprehensive solve diagnostics
Added SolveDiagnostics dataclass capturing per-task timing, recall quality, failure taxonomy, and hint accuracy metrics. New solve_with_diagnostics() function instruments every pipeline stage (8 stages) with monotonic timing. Library gains recall_with_diagnostics() exposing cosine/hint_bonus/hybrid scores per recalled pattern. Benchmark gets --diag flag that saves per-task JSON and prints aggregated summary: solved-by-stage breakdown, failure taxonomy, recall hit rate, avg stage time, hint quality metrics. 25 new tests, 1,695 total, 100% coverage.
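Per-stage monotonic timing can be sketched with a context manager, as below. StageTimings is a hypothetical stand-in used only for illustration; the real SolveDiagnostics fields are not shown in this changelog.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class StageTimings:
    seconds: dict = field(default_factory=dict)  # stage name -> accumulated seconds

    @contextmanager
    def stage(self, name):
        """Time the enclosed block with a monotonic clock, even if it raises."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed = time.monotonic() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


timings = StageTimings()
with timings.stage("perception"):
    sum(range(1000))  # placeholder for real stage work
with timings.stage("recall"):
    sorted(range(1000), reverse=True)  # placeholder for real stage work
```

time.monotonic() is unaffected by system clock adjustments, which is why it suits interval measurement better than time.time().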
2026-03-16 22:00 — Meta-strategy: auto-discover abstract rules
Built diff-driven synthesis that replaces blind beam search with structural analysis. Meta-strategy tries 10 abstract feature extractors (neighbor count, object size rank, border detection) and picks the simplest one consistent across all training pairs. Leave-one-out validation prevents overfitting. Key insight: pixel-level memorization doesn't generalize; abstract features (IS_BORDER, NEIGHBOR_COUNT, SIZE_RANK) do. Also: candidate filtering by same_dims/diff_dims, param-shape sub-grouping in extraction, first-step decomposition for partial pattern learning. 92/400 train (+4), 1,391 tests, 100% coverage.
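A simplified sketch of the selection loop, under several assumptions: same-size input/output pairs, a rule of the form (input color, feature) to output color, and extractors listed simplest first. All names are illustrative; the actual meta-strategy and its 10 extractors are not reproduced here.

```python
def is_border(grid, r, c):
    return r in (0, len(grid) - 1) or c in (0, len(grid[0]) - 1)


def neighbor_count(grid, r, c):
    """Number of 4-connected neighbors sharing this pixel's color."""
    h, w = len(grid), len(grid[0])
    return sum(1 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
               if 0 <= r + dr < h and 0 <= c + dc < w
               and grid[r + dr][c + dc] == grid[r][c])


def learn_rule(extractor, pairs):
    """Learn (input color, feature) -> output color; None if the pairs conflict."""
    rule = {}
    for inp, out in pairs:
        for r, row in enumerate(inp):
            for c, color in enumerate(row):
                key = (color, extractor(inp, r, c))
                if rule.setdefault(key, out[r][c]) != out[r][c]:
                    return None
    return rule


def predicts(rule, extractor, inp, out):
    return all(rule.get((color, extractor(inp, r, c))) == out[r][c]
               for r, row in enumerate(inp) for c, color in enumerate(row))


def passes_leave_one_out(extractor, pairs):
    """Each held-out pair must be predicted by a rule learned from the other pairs."""
    for i, (inp, out) in enumerate(pairs):
        rule = learn_rule(extractor, pairs[:i] + pairs[i + 1:])
        if rule is None or not predicts(rule, extractor, inp, out):
            return False
    return True


def select_extractor(extractors, pairs):
    """Name of the first (simplest) extractor that is consistent and generalizes."""
    for name, extractor in extractors:
        if learn_rule(extractor, pairs) is not None and passes_leave_one_out(extractor, pairs):
            return name
    return None


def border_task(n):
    """Toy task: recolor the border of an n x n grid of 1s to 2."""
    inp = [[1] * n for _ in range(n)]
    out = [[2 if is_border(inp, r, c) else 1 for c in range(n)] for r in range(n)]
    return inp, out


pairs = [border_task(3), border_task(4)]
extractors = [("color_only", lambda g, r, c: 0),  # plain color map, no structure
              ("is_border", is_border),
              ("neighbor_count", neighbor_count)]
chosen = select_extractor(extractors, pairs)
```

Here the plain color map fails because color 1 maps to both 1 and 2, while the border feature explains every pair and survives the held-out check.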
2026-03-16 18:00 — Learning loop fixes: 3 param coordination fixes
Three fixes to unblock the pattern learning loop. Fix 1: body executor normalizes short-form fold_symmetry modes (“lr”→“sym_lr”, etc.) so hypothesis params flow through the DSL pipeline. Fix 2: split mixed ActionKind groups — PATTERN_CONTINUATION, HOLLOW_RECT_OP, FRAME_FILL now have distinct signatures for cleaner extraction grouping. Fix 3: added selector_hint task property for property-based selector explanation in extract_object groups. 1,321 tests, 100% coverage.
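Fix 1 amounts to a small normalization step. The sketch below assumes every long-form mode is the short form with a "sym_" prefix, which the entry confirms only for "lr"; the function name is hypothetical.

```python
SYM_PREFIX = "sym_"


def normalize_fold_mode(mode):
    """Expand a short-form mode such as "lr" to "sym_lr"; long forms pass through."""
    return mode if mode.startswith(SYM_PREFIX) else SYM_PREFIX + mode


short = normalize_fold_mode("lr")
already_long = normalize_fold_mode("sym_lr")
```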
2026-03-16 16:30 — 13 new BodyKinds for pattern generalization
Added 13 new BodyKind enum values with full implementations: UPSCALE_BLOCK, EXTRACT_BY_PREDICATE, STAMP_AT_MARKERS, SEPARATOR_SUMMARY, COLOR_SUBSTITUTE, STACK_CONCAT, DRAW_LINE, RIGID_SHIFT, DAMAGE_REPAIR, PATTERN_EXTEND, NEIGHBOR_RULE, SLIDE_TO_WALL, COUNT_ENCODE. These enable the learning loop to generalize hypothesis solutions into reusable patterns. MOVE and EXTEND_LINE unstubbed (delegate to RIGID_SHIFT and DRAW_LINE). Updated antiunify mappings for 16 action types. Pattern DSL body vocabulary: 21 → 34. 1,313 tests, 100% coverage.
2026-03-17 00:15 — Port 12 inference engines (batch 3)
Ported 12 more inference engines from ericagi as candidate generators: directed cross, gravity fill, bbox complement fill, translate to target, connect over bg, recolor to closest, diagonal stamp, object outline, row period fill/extend, col period extend, pattern substitution. 88/400 train (22.0%), 54 candidate generators, 1,168 tests, 100% coverage.
2026-03-16 23:30 — Output construction hypothesis families
Added 7 new hypothesis families targeting diff-dims tasks: separator cell summary, upscale block, stack objects, crop to colored region, select unique cell from separator grid, 1x1 summary, and input-as-template. These handle the 138 unsolved diff-dims tasks (crop, scale, summary, stacking categories). 88/400 train (+12), 17 hypotheses total, 1,226 tests, 100% coverage.
2026-03-16 22:45 — Hypothesis families + ported inference engines
Added 6 new hypothesis families (template stamp, color mapping, gravity, symmetry completion, fill enclosed, extract by predicate) for instant pattern recognition before beam search. Ported 5 inference engines as candidates (diagonal connect, cross extension, gravity align, gap fill, rigid shift). 76/400 train (+3), 10 hypotheses, 42 candidate generators, 1,050 tests, 100% coverage.
2026-03-16 20:30 — Role-aware candidate generators
Added 5 scene-graph-driven candidate generators that use role classification (FRAME, SEPARATOR, MARKER, TEMPLATE, LEGEND) to propose structured transformations: stamp template at markers, template recolor at markers, frame interior fill, separator grid operations, and legend color mapping. 73/400 train (+1), 951 tests, 100% coverage.