Verified Autonomous Engineering

Nine headline use cases, one engine. Real metrics, real artifacts, documented plateaus where they exist.

Use case showcase

Nine frontiers. One engine.

What Remoroo runs in production today — 3 solved, 4 iterating (with documented plateaus and honest experiment logs), and 2 open frontiers with locked harnesses, baselines pending. Every metric below is sourced from a real run, not a marketing target.

[Card artwork: scrolling code diffs showing a mean-calculation fix, a Fibonacci runtime test, sentinel-value cleanup, an async fetch refactor, and a lock-guarded counter]
Software · Eval · Verified
47-Task Software Engineering Suite
Autonomous code · multi-tier benchmark

From CSV statistics to async refactors, race-condition fixes and Fibonacci runtime guards — same engine, same loop, 47 verifiable tasks across four difficulty tiers.

Easy 100% · Hard 88.9% · Multi-Metric 82.4%
Baseline: 0 % pass · Best: 95.2 % pass
47 logged experiments · View evidence
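
As an illustration of the kind of task this card names (CSV statistics and race-condition fixes), here is a minimal sketch of two such fixes and a check that verifies them. The names `calculate_mean`, `SafeCounter`, and `run_workers` are hypothetical and chosen for this example; they are not taken from the suite itself.

```python
import threading


def calculate_mean(rows):
    # True division keeps the fractional part; floor division (//) would drop it.
    return sum(rows) / len(rows)


class SafeCounter:
    """Counter whose increment is guarded by a lock (hypothetical example class)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write happens while holding the lock,
        # so concurrent increments cannot overwrite each other.
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads=8, n_increments=10_000):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

A verifiable task in this style would pass only if the final count equals `n_threads * n_increments` and the mean keeps its fractional part.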
Robotics · Perception · Verified
Eye-in-Hand Hand-Eye Calibration
Robotics perception · MuJoCo

Locked validation set, anti-gaming by construction. The engine plateaued at ~47 mm for 27 experiments, then broke through with depth-corrected PnP + global bundle adjustment — final trans_std landed at 0.166 mm, ~6× under the 1 mm target.

Plateau broken at exp 34 · depth-corrected PnP + global BA · cross-checked vs GT
Baseline: 55.66 mm · Best: 0.17 mm
35 logged experiments · View evidence
Scientific ML · Genomics · Verified
Variant Triage
Clinical genomics · 1 CPU · 4 GB · 10 min

ClinVar missense pathogenicity classification on a locked 2024+ time-holdout. The engine started from a 0.70 biochem-only baseline, widened the schema to ensemble REVEL + AlphaMissense + gnomAD constraint + per-gene priors, and landed at 0.9838 ROC AUC on the locked 47 254-variant test set — past REVEL alone (0.9716) on the same split, and well past the 0.97 target. Single CPU core, 4 GB RAM, ten minutes per iteration.

0.70 baseline → 0.9838 final · beats REVEL-alone (0.9716) on same split
Baseline: 0.70 ROC AUC · Best: 0.98 ROC AUC
8 logged experiments · View evidence
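
For readers unfamiliar with the headline metric, ROC AUC can be computed from ranks alone: it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The sketch below is a generic pure-Python version of that definition, not the evaluation code behind this card; the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC for binary labels (1 = positive, 0 = negative)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items share the mean of their 1-based ranks (i+1 .. j).
        avg_rank = (i + 1 + j) / 2
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

Comparing two scorers fairly, as the card does for the ensemble and REVEL alone, means calling such a function on the same locked split for both.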
RL · Control · Iterating
BipedalWalkerHardcore · PPO
Continuous control · Box2D

25 logged experiments — including the failures and one regression of −89 points — driving toward a clean Stage-1 → Stage-2 curriculum on hardcore terrain.

Stage-1 nailed · Stage-2 climbing · honest failure log
Baseline: -100 avg reward · Best: 166.6 avg reward · Target: >= 300 avg reward
Progress to target: 67%
25 logged experiments · View evidence
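
The percentages shown on the iterating cards are consistent with a simple "fraction of the baseline-to-target gap closed" formula. The sketch below reproduces the displayed values from the published baseline, best, and target numbers. The formula is inferred from those numbers, not confirmed by the source, and the name `progress_pct` is hypothetical.

```python
def progress_pct(baseline, best, target):
    """Share of the baseline-to-target gap closed so far, as a whole percent.

    Assumed formula: (best - baseline) / (target - baseline), clamped to [0, 1].
    It also works for lower-is-better metrics, where both differences are negative.
    """
    span = target - baseline
    if span == 0:
        raise ValueError("target must differ from baseline")
    fraction = (best - baseline) / span
    return round(100 * max(0.0, min(1.0, fraction)))
```

With the numbers on this page, `progress_pct(-100, 166.6, 300)` gives 67, `progress_pct(5, 169.3, 700)` gives 24, `progress_pct(86, 93.62, 95)` gives 85, and `progress_pct(0.733, 0.8747, 0.88)` gives 96.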
RL · Locomotion · Iterating
Quadruped Locomotion · dog_run
RL · dm_control · MuJoCo

Stage-1 → Stage-2 curriculum on a 38-DoF quadruped with reward shaping. Baseline PPO has been measured; engine is starting iteration toward the 700 reward target.

Curriculum learning · baseline measured
Baseline: 5 avg reward · Best: 169.3 avg reward · Target: >= 700 avg reward
Progress to target: 24%
1 logged experiment · View evidence
Vision · Constrained · Iterating
CIFAR-10 Speedrun
Vision · constrained · Apple Silicon

Push CIFAR-10 top-1 accuracy to ≥ 95 % under a hard budget: ≤ 1 M parameters, ≤ 15 minutes wall-clock on Apple Silicon (MPS), seed locked. Across 142 logged experiments inside the constraint envelope, the current best is 93.62 %, with the top three runs all clustered in the 0.935+ band. That is less than 1.4 points from the bar.

142 logged experiments · top-3 in 0.935+ band · OneCycleLR + mixup + RandomErasing + EMA
Baseline: 86 % top-1 · Best: 93.62 % top-1 · Target: >= 95 % top-1
Progress to target: 85%
142 logged experiments · View evidence
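
One way to read the "constraint envelope" is as a gate applied before accuracy is even compared: a run counts only if it stays within the parameter and wall-clock budget. The sketch below encodes the limits stated on this card. The `RunResult` record and the helper names are hypothetical, and this is not the harness used for the logged experiments.

```python
from dataclasses import dataclass

# Limits as stated on the card.
MAX_PARAMS = 1_000_000        # <= 1 M parameters
MAX_WALL_CLOCK_S = 15 * 60    # <= 15 minutes wall-clock
TARGET_TOP1 = 0.95            # >= 95 % top-1


@dataclass
class RunResult:
    """Hypothetical summary of one logged experiment."""
    n_params: int
    wall_clock_s: float
    top1: float  # top-1 accuracy as a fraction, e.g. 0.9362


def within_budget(run):
    """True if the run respects both hard limits."""
    return run.n_params <= MAX_PARAMS and run.wall_clock_s <= MAX_WALL_CLOCK_S


def best_valid(runs):
    """Highest-accuracy run among those inside the budget, or None."""
    valid = [r for r in runs if within_budget(r)]
    return max(valid, key=lambda r: r.top1, default=None)


def gap_in_points(run):
    """Percentage points still missing to reach the target."""
    return (TARGET_TOP1 - run.top1) * 100
```

Under this reading, a run that reaches 96 % with 1.2 M parameters is discarded rather than reported as the best result.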
Scientific ML · Tabular · Iterating
Higgs Boost
Scientific ML · 1 CPU · 4 GB · particle-physics tabular

Particle-physics tabular benchmark from the Baldi et al. 2014 Nature Communications paper. The engine has to climb from AUC 0.733 (shallow baseline) to 0.880 (deep+features baseline) on the canonical 500 K test split, using a single CPU core and 4 GB of RAM. So far there are 34 logged experiments inside the constraint envelope; the current best is 0.8747, expected to clear 0.880 in the next 1–2 days.

1-thread pin · 4 GB RLIMIT · 22 keep / 9 crash / 2 time_exceeded · canonical Baldi split
Baseline: 0.73 ROC AUC · Best: 0.87 ROC AUC · Target: >= 0.88 ROC AUC
Progress to target: 96%
34 logged experiments · View evidence
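
A "1-thread pin" and a "4 GB RLIMIT" can be approximated on Unix-like systems with standard-library calls. The sketch below is one plausible setup, not the harness described on this card: it assumes the limit is applied to address space (`RLIMIT_AS`) and that threading is pinned through the usual OpenMP, OpenBLAS, and MKL environment variables, which must be set before the numeric libraries are imported. Enforcement of `RLIMIT_AS` varies by platform.

```python
import os
import resource  # Unix-only standard-library module

# Environment variables commonly read by OpenMP, OpenBLAS, and MKL at load time.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
FOUR_GB = 4 * 1024 ** 3


def pin_single_thread():
    """Ask numeric libraries to use one thread; call before importing them."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = "1"


def cap_address_space(max_bytes=FOUR_GB):
    """Lower the soft address-space limit of the current process.

    Assumption: the card's "4 GB RLIMIT" refers to RLIMIT_AS.
    Returns the soft limit actually applied.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may not exceed the hard limit.
    new_soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return new_soft
```

With a cap like this in place, an experiment that over-allocates fails with `MemoryError` inside its own process, which is one way outcomes such as "crash" could end up as distinct entries in an experiment log.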
Speech · TTS · Open frontier
Neural Voice Synthesis
Speech · TTS · open frontier

Headline metric is mel-reconstruction loss against a public locked eval set, with a phoneme-error proxy as the secondary cross-check. Harness ready, engine training next.

Eval set locked · baseline pending
Target: TBD
Awaiting first run · View evidence
Speech · STT · Open frontier
Automatic Speech Recognition
Speech · STT · open frontier

Word Error Rate on a frozen public eval set. Engine boots into a real ASR codebase; first iteration cycle starts next sprint.

Frozen test set · iteration begins next sprint
Target: <= 5 % WER
Awaiting first run · View evidence
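
Word Error Rate is the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a generic pure-Python version that splits on whitespace and does no text normalization; it is not the scoring code of this harness, and the name `word_error_rate` is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from the output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, the card's target corresponds to `word_error_rate(...) <= 0.05`, aggregated over the frozen eval set.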

Full SWE suite

The full catalog of software-engineering tasks behind the showcase's SWE tile — searchable, with each task's metric and outcome.

Contribute to the Standard

The benchmark system is open-source. Help build the standard for autonomous engineering.

View GitHub Repo