Verified Autonomous Engineering

Nine headline use cases, one engine. Real metrics, real artifacts, documented plateaus where they exist.

Use case showcase

Nine frontiers. One engine.

What Remoroo runs in production today — 3 solved, 4 iterating (with documented plateaus and honest experiment logs), and 2 open frontiers with locked harnesses, baselines pending. Every metric below is sourced from a real run, not a marketing target.

+def calculate_mean(rows):
-    return sum(rows) // len(rows)
+    return sum(rows) / len(rows)

 @pytest.mark.timeout(2)
+def test_fib_runtime():
+    assert fib(35) == 9227465

-    return data.dropna().mean()
+    return data.replace(-999, np.nan).dropna().mean()

+async def fetch_all(urls):
-    return [requests.get(u) for u in urls]
+    return await asyncio.gather(*[get(u) for u in urls])

-counter = counter + 1
+with lock:
+    counter += 1

Software · Eval · Verified

47-Task Software Engineering Suite

Autonomous code · multi-tier benchmark

From CSV statistics to async refactors, race-condition fixes and Fibonacci runtime guards — same engine, same loop, 47 verifiable tasks across four difficulty tiers.

Easy 100% · Hard 88.9% · Multi-Metric 82.4%

Baseline: 0 % pass · Best: 95.2 % pass

47 logged experiments
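The Fibonacci runtime guard mentioned in this tile can be illustrated with a small sketch. It assumes a hypothetical task in which a slow recursive `fib` is replaced by an iterative one so that a timed test passes. The 2-second budget mirrors the `@pytest.mark.timeout(2)` marker in the preview, but `runs_within` is an illustrative helper and is not taken from the suite.

```python
import time


def fib(n: int) -> int:
    """Iterative Fibonacci: O(n) time, O(1) extra space."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def runs_within(budget_s: float, func, *args):
    """Return (result, ok); ok is True if func finished within budget_s seconds.

    Hypothetical stand-in for a timeout marker: it measures elapsed time
    after the call returns instead of interrupting a slow call.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed <= budget_s


# Same check as the preview's test, expressed without pytest.
result, ok = runs_within(2.0, fib, 35)
```

A real guard would more likely rely on a test-runner timeout so that a slow implementation is interrupted rather than merely reported.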

Robotics · Perception · Verified

Eye-in-Hand Hand-Eye Calibration

Robotics perception · MuJoCo

Locked validation set, anti-gaming by construction. The engine plateaued at ~47 mm for 27 experiments, then broke through with depth-corrected PnP + global bundle adjustment — final trans_std landed at 0.166 mm, ~6× under the 1 mm target.

Plateau broken at exp 34 · depth-corrected PnP + global BA · cross-checked vs GT

Baseline: 55.66 mm · Best: 0.17 mm

35 logged experiments

Scientific ML · Genomics · Verified

Variant Triage

Clinical genomics · 1 CPU · 4 GB · 10 min

ClinVar missense pathogenicity classification on a locked 2024+ time-holdout. The engine started from a 0.70 biochem-only baseline, widened the schema to ensemble REVEL + AlphaMissense + gnomAD constraint + per-gene priors, and landed at 0.9838 ROC AUC on the locked 47 254-variant test set — past REVEL alone (0.9716) on the same split, and well past the 0.97 target. Single CPU core, 4 GB RAM, ten minutes per iteration.

0.70 baseline → 0.9838 final · beats REVEL-alone (0.9716) on same split

Baseline: 0.70 ROC AUC · Best: 0.98 ROC AUC

8 logged experiments
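For readers unfamiliar with the headline metric, ROC AUC can be computed from ranks alone: it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below is a generic, dependency-free version of that definition. It is not the harness's scoring code, and the sample labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of predicted scores, higher = more likely positive.
    Tied scores share their average rank, so a tie counts as half a win.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Illustrative toy data: 1 = pathogenic, 0 = benign.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```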

RL · Control · Iterating

BipedalWalkerHardcore · PPO

Continuous control · Box2D

25 logged experiments — including the failures and one regression of −89 points — driving toward a clean Stage-1 → Stage-2 curriculum on hardcore terrain.

Stage-1 nailed · Stage-2 climbing · honest failure log

Baseline: -100 avg reward · Best: 166.6 avg reward · Target: ≥ 300 avg reward

Progress to target: 67%

25 logged experiments
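The percentage shown on each iterating tile appears consistent with the fraction of the baseline-to-target gap closed so far. That reading is an inference from the published numbers, not a documented formula, and `progress_to_target` is a hypothetical name. A minimal sketch under that assumption:

```python
def progress_to_target(baseline: float, best: float, target: float) -> float:
    """Fraction of the baseline-to-target gap closed, clamped to [0, 1].

    Assumed formula: (best - baseline) / (target - baseline). It also works
    for lower-is-better metrics, where target < baseline.
    """
    span = target - baseline
    if span == 0:
        raise ValueError("target must differ from baseline")
    fraction = (best - baseline) / span
    return max(0.0, min(1.0, fraction))


# Values taken from the iterating tiles on this page.
bipedal_pct = round(100 * progress_to_target(-100, 166.6, 300))
quadruped_pct = round(100 * progress_to_target(5, 169.3, 700))
cifar_pct = round(100 * progress_to_target(86, 93.62, 95))
higgs_pct = round(100 * progress_to_target(0.733, 0.8747, 0.880))
```

With the tile values, this reproduces the displayed 67%, 24%, 85% and 96%. The Higgs Boost figure only matches when the unrounded 0.733 and 0.8747 are used.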

RL · Locomotion · Iterating

Quadruped Locomotion · dog_run

RL · dm_control · MuJoCo

Stage-1 → Stage-2 curriculum on a 38-DoF quadruped with reward shaping. Baseline PPO has been measured; engine is starting iteration toward the 700 reward target.

Curriculum learning · baseline measured

Baseline: 5 avg reward · Best: 169.3 avg reward · Target: ≥ 700 avg reward

Progress to target: 24%

1 logged experiment

Vision · Constrained · Iterating

CIFAR-10 Speedrun

Vision · constrained · Apple Silicon

Push CIFAR-10 top-1 accuracy to ≥ 95 % under a hard budget: ≤ 1 M parameters, ≤ 15 minutes wall-clock on Apple Silicon (MPS), seed locked. 142 logged experiments through the constraint envelope; current best 93.62 %, with the top three runs all clustered in the 0.935+ band. Less than 1.4 points from the bar.

142 logged experiments · top-3 in 0.935+ band · OneCycleLR + mixup + RandomErasing + EMA

Baseline: 86 % top-1 · Best: 93.62 % top-1 · Target: ≥ 95 % top-1

Progress to target: 85%

142 logged experiments
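The ≤ 1 M parameter cap is the kind of constraint that can be checked before any training starts. The sketch below counts learnable parameters for a small convolutional stack using the standard per-layer formulas. The architecture is purely hypothetical and is not the network used in these runs.

```python
PARAM_BUDGET = 1_000_000  # the tile's stated cap


def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights of a square 2-D convolution, plus an optional bias per output channel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def batchnorm_params(channels: int) -> int:
    """Learnable scale and shift per channel (running statistics are not counted)."""
    return 2 * channels


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights of a fully connected layer, plus an optional bias per output."""
    return n_in * n_out + (n_out if bias else 0)


# Hypothetical 4-block ConvNet: conv (no bias) + batch norm per block,
# followed by a 10-way classifier on globally pooled features.
layer_counts = [
    conv2d_params(3, 32, 3, bias=False), batchnorm_params(32),
    conv2d_params(32, 64, 3, bias=False), batchnorm_params(64),
    conv2d_params(64, 128, 3, bias=False), batchnorm_params(128),
    conv2d_params(128, 256, 3, bias=False), batchnorm_params(256),
    linear_params(256, 10),
]
total_params = sum(layer_counts)
within_budget = total_params <= PARAM_BUDGET
```

This toy stack comes to 391,466 parameters, well inside the cap, which leaves room to widen or deepen it before the budget binds.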

Scientific ML · Tabular · Iterating

Higgs Boost

Scientific ML · 1 CPU · 4 GB · particle-physics tabular

Particle-physics tabular benchmark from the Baldi et al. 2014 Nature Communications paper. The engine starts from the shallow baseline (AUC 0.733) and has to clear the deep+features baseline (AUC 0.880) on the canonical 500 K test split, using a single CPU core and 4 GB of RAM. 34 logged experiments through the constraint envelope; current best 0.8747, expected to clear 0.880 in the next 1–2 days.

1-thread pin · 4 GB RLIMIT · 22 keep / 9 crash / 2 time_exceeded · canonical Baldi split

Baseline: 0.73 ROC AUC · Best: 0.87 ROC AUC · Target: ≥ 0.88 ROC AUC

Progress to target: 96%

34 logged experiments
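One way a "1-thread pin" and a "4 GB RLIMIT" could be enforced around a training script is sketched below. The page does not say which limit or which thread controls the harness actually uses, so the choice of `RLIMIT_AS`, the environment variable list, and the helper names are all assumptions. The sketch is POSIX-only.

```python
import os
import resource
import subprocess

# Assumption: "4 GB" is enforced as a 4 GiB cap on the address space.
MEMORY_CAP_BYTES = 4 * 1024**3

# Common thread-count knobs read by OpenMP, OpenBLAS and MKL.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def single_thread_env(base=None):
    """Copy an environment mapping and force numeric libraries to one thread."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_ENV_VARS:
        env[name] = "1"
    return env


def _cap_memory():
    """Runs in the child process just before exec.

    Raises ValueError if the existing hard limit is already below the cap.
    """
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_CAP_BYTES, MEMORY_CAP_BYTES))


def run_constrained(cmd, timeout_s=None):
    """Run cmd single-threaded under the memory cap and return the CompletedProcess."""
    return subprocess.run(
        cmd,
        env=single_thread_env(),
        preexec_fn=_cap_memory,
        timeout=timeout_s,
        capture_output=True,
        text=True,
    )
```

A call such as `run_constrained(["python", "train.py"], timeout_s=600)` (the script name is hypothetical) raises `subprocess.TimeoutExpired` when the wall-clock budget is exceeded. A memory-cap violation would typically surface as a non-zero exit from the child, so a harness built this way could plausibly map those outcomes onto categories like the crash and time_exceeded counts above.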

/aɪ/ /θ/ /ʃ/ /ə/ /r/ /iː/ /oʊ/ /n/

Speech · TTS · Open frontier

Neural Voice Synthesis

Speech · TTS · open frontier

Headline metric is mel-reconstruction loss against a public locked eval set, with a phoneme-error proxy as the secondary cross-check. Harness ready, engine training next.

Eval set locked · baseline pending

Target: TBD

Awaiting first run

the quick brown fox jumps over the lazy dog · she sells seashells by the seashore · two roads diverged in a yellow wood

Speech · STT · Open frontier

Automatic Speech Recognition

Speech · STT · open frontier

Word Error Rate on a frozen public eval set. Engine boots into a real ASR codebase; first iteration cycle starts next sprint.

Frozen test set · iteration begins next sprint

Target: ≤ 5 % WER

Awaiting first run
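Word Error Rate is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below implements that standard definition with plain whitespace tokenization. Any text normalization the actual harness applies (casing, punctuation) is unknown and therefore omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of nine reference words.
sample_wer = word_error_rate(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",
)
```

Under this definition the ≤ 5 % target means at most one word-level error per twenty reference words on average. WER can exceed 100 % when the hypothesis contains many insertions.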

Full SWE suite

The full catalog of software-engineering tasks behind the showcase's SWE tile — searchable, with each task's metric and outcome.

Contribute to the Standard

The benchmark system is open-source. Help build the standard for autonomous engineering.

View GitHub Repo