Verified Autonomous Engineering
Nine headline use cases, one engine. Real metrics, real artifacts, documented plateaus where they exist.
Nine frontiers. One engine.
What Remoroo runs in production today — 3 solved, 4 iterating (with documented plateaus and honest experiment logs), and 2 open frontiers with locked harnesses, baselines pending. Every metric below is sourced from a real run, not a marketing target.
From CSV statistics to async refactors, race-condition fixes and Fibonacci runtime guards — same engine, same loop, 47 verifiable tasks across four difficulty tiers.
Locked validation set, anti-gaming by construction. The engine plateaued at ~47 mm for 27 experiments, then broke through with depth-corrected PnP + global bundle adjustment — final trans_std landed at 0.166 mm, ~6× under the 1 mm target.
ClinVar missense pathogenicity classification on a locked 2024+ time-holdout. The engine started from a 0.70 biochem-only baseline, widened the schema to ensemble REVEL + AlphaMissense + gnomAD constraint + per-gene priors, and landed at 0.9838 ROC AUC on the locked 47,254-variant test set: past REVEL alone (0.9716) on the same split, and well past the 0.97 target. Single CPU core, 4 GB RAM, ten minutes per iteration.
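To make the "locked 2024+ time-holdout" idea concrete, here is a minimal sketch of a date-based split. The `Variant` record, its field names, and the exact cutoff date are assumptions made for illustration; they are not taken from the Remoroo harness.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Variant:
    variant_id: str    # hypothetical identifier
    first_seen: date   # hypothetical field: date the record first appeared
    pathogenic: bool   # label


def time_holdout(records, cutoff=date(2024, 1, 1)):
    """Split so that every test record is dated on or after the cutoff."""
    train = [r for r in records if r.first_seen < cutoff]
    test = [r for r in records if r.first_seen >= cutoff]
    return train, test


records = [
    Variant("v1", date(2022, 5, 1), True),
    Variant("v2", date(2023, 12, 31), False),
    Variant("v3", date(2024, 1, 1), True),
    Variant("v4", date(2024, 8, 15), False),
]
train, test = time_holdout(records)
print([r.variant_id for r in train])  # ['v1', 'v2']
print([r.variant_id for r in test])   # ['v3', 'v4']
```

Splitting by date rather than at random keeps newer records out of training, which is what makes the held-out score a forward-looking estimate.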
25 logged experiments — including the failures and one regression of −89 points — driving toward a clean Stage-1 → Stage-2 curriculum on hardcore terrain.
Stage-1 → Stage-2 curriculum on a 38-DoF quadruped with reward shaping. The baseline PPO has been measured; the engine is starting to iterate toward the 700 reward target.
Push CIFAR-10 top-1 accuracy to ≥ 95 % under a hard budget: ≤ 1 M parameters, ≤ 15 minutes wall-clock on Apple Silicon (MPS), seed locked. 142 logged experiments through the constraint envelope; current best 93.62 %, with the top three runs all clustered in the 93.5 %+ band. Less than 1.4 percentage points from the bar.
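As a rough illustration of how a parameter budget like this can be checked, the sketch below counts weights for a few convolution and linear layers and compares the total against 1 M. The layer sizes are hypothetical and are not the architecture used in the runs; normalization layers are ignored for brevity.

```python
PARAM_BUDGET = 1_000_000  # the "<= 1 M parameters" constraint


def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Weights of a square 2D convolution: in * out * k * k, plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Weights of a fully connected layer, plus one bias per output."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical toy network, for illustration only.
layers = [
    conv2d_params(3, 64, 3),     # 1,792
    conv2d_params(64, 128, 3),   # 73,856
    conv2d_params(128, 256, 3),  # 295,168
    linear_params(256, 10),      # 2,570
]
total = sum(layers)
print(total, total <= PARAM_BUDGET)  # 373386 True

# Gap between the target and the current best, in percentage points.
gap = 95.0 - 93.62
print(round(gap, 2))  # 1.38
```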
Particle-physics tabular benchmark from the Baldi et al. 2014 Nature Communications paper. The engine has to go from the shallow baseline (AUC 0.733) past the deep+features baseline (AUC 0.880) on the canonical 500 K test split, using a single CPU core and 4 GB of RAM. 34 logged experiments through the constraint envelope; current best 0.8747, expected to clear 0.880 in the next 1–2 days.
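For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a quadratic-time illustration on toy data, not the evaluation code used for the benchmark; a 500 K split would call for a sort-based implementation instead.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```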
Headline metric is mel-reconstruction loss against a public locked eval set, with a phoneme-error proxy as the secondary cross-check. Harness ready, engine training next.
Word Error Rate on a frozen public eval set. Engine boots into a real ASR codebase; first iteration cycle starts next sprint.
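Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the hypothesis, divided by the number of reference words. A minimal sketch of that definition follows; the whitespace tokenization is an assumption, and real harnesses usually normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 4))  # 0.3333
```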
Full SWE suite
The full catalog of software-engineering tasks behind the showcase's SWE tile — searchable, with each task's metric and outcome.
Contribute to the Standard
The benchmark system is open-source. Help build the standard for autonomous engineering.