Full Benchmark Explorer

Verifiable Autonomous Engineering

Explore the comprehensive suite of tests that power Remoroo. From environment healing to high-fidelity artifact replays, we measure everything to ensure reliability at scale.

Foundation Tests

Basic functionality and regression tests to ensure core system stability.

8 Benchmarks
ID: calculate-mean-from-csv

Calculate Mean from CSV

Write a script to calculate the mean of values in a CSV file

Target Metrics
mean_value is correct
Verified
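As an illustration of the calculate-mean-from-csv task, a minimal sketch might look like the following. The column name `value` and the inline sample data are assumptions; the benchmark's actual file layout is not shown on this page.

```python
import csv
import io
import statistics


def mean_of_column(csv_text: str, column: str) -> float:
    """Return the arithmetic mean of a numeric column, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [
        float(row[column])
        for row in reader
        if (row.get(column) or "").strip()
    ]
    return statistics.fmean(values)


# Hypothetical sample data standing in for the benchmark's CSV file.
sample = "id,value\n1,10\n2,20\n3,30\n"
mean_value = mean_of_column(sample, "value")
```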
ID: eda-blind-spot-corrupted-data

EDA Blind Spot (Corrupted Data)

A Data Science task where the agent must perform EDA to discover and filter out specific error codes (encoded as -999) before training a model. Without specific cleaning, the model will fail to meet the R^2 threshold.

Target Metrics
r2_score >= 0.9
Verified
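The cleaning step that eda-blind-spot-corrupted-data hinges on can be sketched roughly as below. This assumes rows are plain dictionaries and that the error code appears as the literal number -999; the feature names are hypothetical.

```python
SENTINEL = -999  # error code named in the task description


def drop_sentinel_rows(rows, sentinel=SENTINEL):
    """Keep only rows in which no field equals the sentinel error code."""
    return [row for row in rows if sentinel not in row.values()]


# Hypothetical rows; two of them carry the error code in different columns.
rows = [
    {"temperature": 21.5, "pressure": 1.01, "target": 3.2},
    {"temperature": -999, "pressure": 1.02, "target": 3.1},
    {"temperature": 22.0, "pressure": -999, "target": 3.3},
    {"temperature": 20.8, "pressure": 0.99, "target": 3.0},
]
clean_rows = drop_sentinel_rows(rows)
```

Filtering before fitting matters because a handful of -999 values can dominate a least-squares fit and pull the R^2 score well below the threshold.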
ID: environment-recovery-env-doctor

Environment Recovery (Env Doctor)

A repository with missing dependencies. The Environment Doctor must detect the missing package (requests) and install it to make the environment healthy.

Target Metrics
exit_code == 0
Verified
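One way the detection half of environment-recovery-env-doctor could work is sketched below. The helper names are hypothetical and this is not Remoroo's actual Environment Doctor; it only shows how a missing top-level package can be detected with the standard library, and what an install command for it might look like.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_command(package):
    """Build (but do not run) a pip command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


# "json" ships with Python; the second name is deliberately nonexistent.
missing = missing_packages(["json", "surely_not_installed_pkg_0x1"])
commands = [install_command(name) for name in missing]
```

Passing each command to `subprocess.run(..., check=True)` would perform the install; the sketch stops short of that so it has no side effects.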
ID: fix-fibonacci-off-by-one

Fix Fibonacci Off-by-One

Fix an off-by-one error in a fibonacci sequence generator

Target Metrics
fib_10 == 55
Verified
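For fix-fibonacci-off-by-one, the target metric pins down the indexing convention: with fib(0) = 0 and fib(1) = 1, the tenth value is 55. A minimal sketch of a generator that satisfies it (the benchmark's own buggy code is not shown here):

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    # Looping n - 1 or n + 1 times instead of n is the typical off-by-one:
    # every result shifts by one position in the sequence.
    for _ in range(n):
        a, b = b, a + b
    return a


fib_10 = fib(10)
```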
ID: fix-string-reversal-bug

Fix String Reversal Bug

Fix a bug in a string reversal function that incorrectly handles special characters

Target Metrics
reversed_output == '!dlroW olleH'
Verified
ID: patch-brittle-repeated-blocks

Patch Brittle Repeated Blocks

A repo with repeated code blocks; stresses patch precision and the ability to avoid editing the wrong block.

Target Metrics
code_runs == true
Verified
ID: regression-guardrails-fibonacci

Regression Guardrails (Fibonacci)

Implement a Fibonacci calculator. The system must reject inefficient (recursive) implementations that violate the runtime threshold.

Target Metrics
runtime_s < 0.5
Verified
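A runtime guardrail like the one in regression-guardrails-fibonacci can be approximated by timing the candidate implementation and comparing against the threshold. The sketch below is an assumption about how such a check might look, not the harness's actual code; the input size 30 is arbitrary.

```python
import time

RUNTIME_LIMIT_S = 0.5  # threshold from the target metric


def fib_iterative(n: int) -> int:
    """Linear-time Fibonacci; a naive recursive version is exponential."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


result, runtime_s = timed(fib_iterative, 30)
passes_guardrail = runtime_s < RUNTIME_LIMIT_S
```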
ID: simpson-s-paradox-causal-inference

Simpson's Paradox (Causal Inference)

A Data Science task where the global correlation is the inverse of the local correlation. The agent must discover that the 'group' feature is a confounder and include it in the model to achieve the target accuracy.

Target Metrics
mse < 50.0
Verified
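The reversal described in simpson-s-paradox-causal-inference can be reproduced with a tiny made-up dataset: within each group the slope is negative, yet pooling the groups yields a positive slope. The numbers below are invented for illustration and are unrelated to the benchmark's data.

```python
def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Hypothetical data: group B sits higher on both axes than group A.
groups = {
    "A": [(1, 10), (2, 9), (3, 8)],
    "B": [(6, 20), (7, 19), (8, 18)],
}
pooled_slope = slope([p for pts in groups.values() for p in pts])
group_slopes = {name: slope(pts) for name, pts in groups.items()}
```

A model that ignores the group sees only the pooled trend; adding the group as a feature lets it recover the within-group relationship.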

Integration Challenges

Intermediate scenarios testing component interaction and state management.

1 Benchmark
ID: targeted-ambiguous-date-parsing-locales

Targeted Ambiguous Date Parsing (Locales)

A Data Processing task where 'date' formats vary by 'region' (US=MM/DD/YYYY, UK=DD/MM/YYYY). The agent must calculate total sales for '2023-01' (January). Naive parsing will misinterpret any date whose day is <= 12, because day and month can be swapped without producing an error.

Target Metrics
accuracy_score == 1.0
Verified
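The region-aware parsing this task calls for can be sketched as follows. The two format strings come from the task description; the row layout and the `amount` field are assumptions.

```python
from datetime import datetime

# Formats stated in the task description.
FORMATS = {"US": "%m/%d/%Y", "UK": "%d/%m/%Y"}


def parse_regional_date(raw: str, region: str):
    """Parse a date string using the convention of its region."""
    return datetime.strptime(raw, FORMATS[region]).date()


def january_2023_sales(rows) -> float:
    """Sum amounts for rows whose region-aware date falls in January 2023."""
    total = 0.0
    for row in rows:
        day = parse_regional_date(row["date"], row["region"])
        if (day.year, day.month) == (2023, 1):
            total += row["amount"]
    return total


# Hypothetical rows: identical strings mean different dates per region.
rows = [
    {"region": "US", "date": "01/02/2023", "amount": 100.0},  # 2 January
    {"region": "UK", "date": "01/02/2023", "amount": 50.0},   # 1 February
    {"region": "UK", "date": "05/01/2023", "amount": 25.0},   # 5 January
    {"region": "US", "date": "05/01/2023", "amount": 10.0},   # 1 May
]
total_sales = january_2023_sales(rows)
```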

Complex Engineering

Difficult bug fixes and refactors requiring deep codebase understanding.

18 Benchmarks
ID: fix-division-bug

Fix Division Bug

Fix a division by zero bug in a calculator module

Target Metrics
code_runs == true
Verified
ID: complex-refactor

Complex Refactor

Refactor a monolithic legacy processor into modular classes.

Target Metrics
tests_passed == true
Verified
ID: fix-cycle-detection-in-directed-graph

Fix Cycle Detection in Directed Graph

Fix broken cycle detection that gives false positives due to incorrect visited state tracking

Target Metrics
all_tests_pass == true
Verified
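False positives of the kind described in fix-cycle-detection-in-directed-graph typically arise when a single "visited" set is used both for nodes on the current DFS path and for nodes already fully explored. A common remedy, sketched here under the assumption that the graph is an adjacency dictionary, is three-state colouring:

```python
def has_cycle(graph) -> bool:
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    color = {node: WHITE for node in nodes}

    def visit(node) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:  # back edge to the current path: a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # finished nodes are safe to reach again
        return False

    return any(color[node] == WHITE and visit(node) for node in nodes)


# A diamond has two paths to "d" but no cycle.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
loop = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

The diamond is the classic false-positive case: "d" is reached twice, but the second visit finds it BLACK rather than GRAY, so no cycle is reported.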
ID: fix-data-pipeline-aggregation-bug

Fix Data Pipeline Aggregation Bug

Fix a subtle bug in a multi-file data pipeline that incorrectly aggregates time-series data

Target Metrics
aggregated_total == 12500
Verified
ID: fix-memory-leak-in-cache

Fix Memory Leak in Cache

Fix unbounded cache growth that causes memory issues

Target Metrics
memory_stable == true
Verified
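Unbounded cache growth is usually fixed by capping the number of entries and evicting according to some policy. The sketch below uses least-recently-used eviction, which is an assumption; the benchmark does not state which policy it expects.

```python
from collections import OrderedDict


class BoundedCache:
    """A size-capped cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used key
cache.put("c", 3)   # evicts "b"
```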
ID: fix-numerical-instability-in-statistics

Fix Numerical Instability in Statistics

Fix catastrophic cancellation in variance calculation that gives wrong results for large values

Target Metrics
variance_correct == true
Verified
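Catastrophic cancellation in variance usually comes from the one-pass formula E[x^2] - (E[x])^2, which subtracts two nearly equal large numbers. Welford's online algorithm is one standard alternative; whether the benchmark expects this particular fix is an assumption.

```python
def sample_variance(values) -> float:
    """Sample variance via Welford's online algorithm (numerically stable)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        raise ValueError("need at least two values")
    return m2 / (count - 1)


# Small spread on top of a large offset: the hard case for the naive formula.
large_values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
variance = sample_variance(large_values)
```

The offset of 1e9 does not change the true variance, which equals that of [4, 7, 13, 16].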
ID: fix-race-condition-in-counter

Fix Race Condition in Counter

Fix a race condition in a multi-threaded counter that causes incorrect final counts

Target Metrics
final_count == 10000
Verified
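A read-modify-write such as `value += 1` is not atomic across threads, so increments can be lost. Guarding it with a lock is the most direct fix. In the sketch below, the split into 10 threads of 1000 increments is an assumption chosen to match the target of 10000.

```python
import threading


class SafeCounter:
    """A counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # serialise the read-modify-write
            self.value += 1


def run(n_threads: int = 10, increments: int = 1000) -> int:
    counter = SafeCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


final_count = run()
```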
ID: fix-state-machine-transition-bug

Fix State Machine Transition Bug

Fix subtle bugs in an order processing state machine that allows invalid state transitions

Target Metrics
all_tests_pass == true
Verified
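Invalid transitions are easiest to rule out when the allowed ones are listed explicitly in a table and everything else is rejected. The states below are hypothetical; the benchmark's actual order lifecycle is not given on this page.

```python
# Hypothetical order lifecycle: each state maps to its permitted successors.
ALLOWED_TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}


class Order:
    def __init__(self):
        self.state = "created"

    def transition(self, new_state: str) -> None:
        """Move to new_state, or raise if the table does not allow it."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"invalid transition: {self.state} -> {new_state}")
        self.state = new_state


order = Order()
order.transition("paid")
order.transition("shipped")
```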
ID: hard-discovery-force-search

Hard Discovery (Force Search)

A massive repository where the target code is hidden deep in the directory structure, truncated from the initial index summary.

Target Metrics
tests_passed == true
Verified
ID: implement-binary-search-tree

Implement Binary Search Tree

Complete the BST implementation with insert, search, and in-order traversal

Target Metrics
traversal_correct == true
Verified
ID: implement-retry-logic-with-exponential-backoff

Implement Retry Logic with Exponential Backoff

Add retry logic to an API client that fails on transient errors

Target Metrics
retry_works == true
Verified
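Retry with exponential backoff doubles the wait after each failed attempt and gives up after a fixed number of tries. The sketch below is a generic version; `TransientError`, the default parameters, and the injected `sleep` function are assumptions made so the example runs without real delays.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable API failure."""


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on TransientError with delays of base_delay * 2**k."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: fail twice, then succeed; record delays instead of sleeping.
delays = []
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)
```

Production clients often add random jitter to each delay so that many clients do not retry in lockstep; that is omitted here for brevity.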
ID: large-repo-navigation

Large Repo Navigation

Navigate a large, noisy repository to fix a specific bug in a deep module.

Target Metrics
tests_passed == true
Verified
ID: mnist-training-with-pytorch

MNIST Training with PyTorch

Download the MNIST dataset and train a neural network with torch to classify the digits, reporting validation accuracy as the metric

Target Metrics
validation_accuracy >= 0.97
Verified
ID: performance-optimization

Performance Optimization

Optimize a data processing pipeline for both runtime and memory usage.

Target Metrics
no metric
Verified
ID: readiness-check

Readiness Check

Verify the engine can perform a simple plan-patch-test loop.

Target Metrics
tests_passed == true
Verified
ID: refactor-callback-hell-to-async-await

Refactor Callback Hell to Async/Await

Convert deeply nested callback-based code to clean async/await pattern

Target Metrics
refactor_complete == true
Verified
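The target shape of a callback-to-async refactor is a flat sequence of awaited steps. The benchmark's language is not stated on this page; the sketch below uses Python's asyncio purely for illustration, and the function names and data are hypothetical.

```python
import asyncio


async def fetch_user(user_id):
    await asyncio.sleep(0)  # stands in for real I/O
    return {"id": user_id, "name": "example"}


async def fetch_orders(user):
    await asyncio.sleep(0)  # stands in for real I/O
    return [{"user_id": user["id"], "total": 42}]


async def load_order_total(user_id):
    # Each step reads top to bottom instead of nesting inside a callback.
    user = await fetch_user(user_id)
    orders = await fetch_orders(user)
    return sum(order["total"] for order in orders)


order_total = asyncio.run(load_order_total(7))
```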
ID: refactor-circular-import

Refactor Circular Import

Fix circular import error by refactoring code across multiple files

Target Metrics
code_runs == true
Verified
ID: fix-deadlock-in-resource-manager

Fix Deadlock in Resource Manager

Fix a deadlock caused by inconsistent lock ordering in a multi-resource system

Target Metrics
completed_transfers == 1000
PARTIAL_SUCCESS
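Deadlocks from inconsistent lock ordering disappear when every thread acquires locks in the same global order. The sketch below orders by a hypothetical account id and runs transfers in both directions at once; the account model and the 500-per-direction split are assumptions.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    # Always lock the lower id first, regardless of transfer direction.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def run_transfers(n_per_direction: int = 500):
    a, b = Account(1, 1000), Account(2, 1000)
    completed = [0, 0]  # one slot per worker, so no shared writes

    def worker(slot, src, dst):
        for _ in range(n_per_direction):
            transfer(src, dst, 1)
            completed[slot] += 1

    threads = [
        threading.Thread(target=worker, args=(0, a, b)),
        threading.Thread(target=worker, args=(1, b, a)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(completed), a.balance + b.balance


completed_transfers, total_balance = run_transfers()
```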

Advanced Optimization

High-stakes multi-objective optimization problems with strict constraints.

14 Benchmarks
ID: impossible-compression-task

Impossible Compression Task

Compress any file to 50% size with no loss, always. This is an impossible task that should fail.

Target Metrics
compressed_size / original_size <= 0.5 for 100% of files
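Why this task cannot succeed follows from a counting argument: a lossless compressor must map distinct inputs to distinct outputs, but there are far fewer short bit strings than long ones. The helper names below are illustrative only.

```python
def count_inputs(n_bits: int) -> int:
    """Number of distinct bit strings of exactly n_bits bits."""
    return 2 ** n_bits


def count_outputs_up_to(max_bits: int) -> int:
    """Number of distinct bit strings of length 0 through max_bits."""
    return 2 ** (max_bits + 1) - 1


n = 8
inputs = count_inputs(n)                  # all 8-bit files
outputs = count_outputs_up_to(n // 2)     # all files of at most 4 bits
lossless_possible = outputs >= inputs
```

With 256 possible inputs and only 31 possible outputs, at least two inputs must share an output, so they cannot both be recovered.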
ID: baseline-hidden-metrics

baseline_hidden_metrics

Experimental engineering scenario designed to validate specific engine capabilities.

Target Metrics
runtime_s <= 2.0, accuracy >= 0.92
Verified
ID: baseline-log-aliases

baseline_log_aliases

Experimental engineering scenario designed to validate specific engine capabilities.

Target Metrics
runtime_s <= 2.0, accuracy >= 0.90
Verified
ID: cache-system-optimization-multi-metric

Cache System Optimization (Multi-Metric)

Optimize a caching system to simultaneously achieve high hit rate, low memory usage, fast access time, and correct eviction behavior

Target Metrics
hit_rate >= 0.80, peak_memory_mb <= 128, avg_access_time_ms <= 5, eviction_correct == true
Verified
ID: data-pipeline-with-schema-validation

Data Pipeline with Schema Validation

Build ETL pipeline that processes CSV to JSON with strict validation: correct schema, no nulls, minimum record count

Target Metrics
records_processed >= 100, null_count == 0, schema_valid == true
Verified
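A CSV-to-JSON step with the three checks named in the metrics might be sketched as below. The schema, field names, and sample rows are hypothetical, and the sample is far smaller than the 100 records the benchmark requires.

```python
import csv
import io
import json

# Hypothetical schema: field name -> type used to cast the CSV string.
SCHEMA = {"id": int, "name": str, "amount": float}


def csv_to_json(csv_text: str, schema=SCHEMA):
    """Convert CSV text to a JSON array, dropping rows with empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    schema_valid = set(header) == set(schema)
    records = []
    if schema_valid:
        for row in reader:
            # Skip rows with missing, empty, or surplus (non-string) cells.
            if any(not isinstance(v, str) or not v.strip() for v in row.values()):
                continue
            # A cast failure raises ValueError; a real pipeline would log it.
            records.append({field: cast(row[field]) for field, cast in schema.items()})
    metrics = {
        "records_processed": len(records),
        "null_count": sum(v is None for r in records for v in r.values()),
        "schema_valid": schema_valid,
    }
    return json.dumps(records), metrics


sample = "id,name,amount\n1,alice,9.5\n2,,3.0\n3,carol,1.25\n"
json_text, metrics = csv_to_json(sample)
```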
ID: fair-fast-classifier-multi-metric-multi-file

Fair + Fast Classifier (Multi-metric, Multi-file)

Fix correctness and performance across a small ML codebase: improve accuracy & loss, reduce training time vs baseline, and reduce group fairness gap.

Target Metrics
accuracy >= 0.9, fairness_gap <= 0.10, training_time < baseline training_time
Verified
ID: hard-multi-stage-hidden-metrics

Hard Multi-stage Hidden Metrics

A multi-stage training/evaluation repo that requires multi-file optimization plus careful behavior preservation.

Target Metrics
runtime_s <= 2.0, validation_accuracy >= 0.93
Verified
ID: ml-training-with-multiple-metrics

ML Training with Multiple Metrics

Train a classifier that must satisfy multiple constraints: accuracy >= 0.85, loss <= 0.5, and training time < 30 seconds

Target Metrics
accuracy >= 0.85, loss <= 0.5, training_time < 30
Verified
ID: mnist-training-with-pytorch

MNIST Training with PyTorch

Download the MNIST dataset and train a neural network with torch to classify the digits, reporting validation accuracy as the metric

Target Metrics
validation_accuracy >= 0.97, inference_time <= 100ms
Verified
ID: noisy-metrics-stabilization-determinism-threshold

Noisy Metrics Stabilization (Determinism + Threshold)

A small repo where metrics are noisy due to missing seeding. The engine must make it deterministic and meet an accuracy threshold.

Target Metrics
accuracy >= 0.90, deterministic == true
Verified
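Making a run deterministic generally means routing every source of randomness through an explicitly seeded generator. The sketch below shows the idea for a train/validation split using only the standard library; the function name and split fraction are assumptions, and a real ML repo would also need to seed its numerical libraries.

```python
import random


def shuffled_split(n_items: int, seed: int, train_fraction: float = 0.8):
    """Return (train_indices, validation_indices) from a seeded shuffle."""
    rng = random.Random(seed)  # local generator: no hidden global state
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


first = shuffled_split(100, seed=0)
second = shuffled_split(100, seed=0)
deterministic = first == second
```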
ID: planner-suite-multi-runtime-baseline-relative

Planner Suite Multi-Runtime (Baseline-Relative)

A deterministic suite that runs several planners and measures per-planner runtime plus total runtime. Requires improvements across multiple files.

Target Metrics
planner_a_runtime_s < baseline planner_a_runtime_s, planner_b_runtime_s < baseline planner_b_runtime_s, planner_c_runtime_s < baseline planner_c_runtime_s, runtime_total_s < baseline runtime_total_s
Verified
ID: refactor-for-test-coverage

Refactor for Test Coverage

Fix bugs AND achieve test coverage >= 80% by adding tests and making code testable

Target Metrics
tests_pass == true, coverage >= 80
Verified
ID: slow-train-with-streaming-logs

Slow Train With Streaming Logs

A long-running training loop that prints progress; stresses command execution, timeouts, and log handling.

Target Metrics
code_runs == true, runtime_s >= 3.0
Verified
ID: fair-fast-classifier-multi-metric-multi-file

Fair + Fast Classifier (Multi-metric, Multi-file)

Fix correctness and performance across a small ML codebase: improve accuracy & loss, reduce training time vs baseline, and reduce group fairness gap.

Target Metrics
accuracy >= 0.93, loss <= 0.25, fairness_gap <= 0.10, training_time < baseline training_time
PARTIAL_SUCCESS

Concurrency

1 Benchmark
ID: large-repo-pipeline-optimization-multi-file

Large Repo Pipeline Optimization (Multi-file)

A larger Python codebase with an ETL-style pipeline that is correct but slow due to inefficient tokenization and feature building. Requires optimization across multiple modules.

Target Metrics
runtime_s <= 2.0, correctness == true
PARTIAL_SUCCESS

Contribute to the Standard

The Remoroo Benchmark system is open-source. Help us build the ultimate harness for autonomous engineering by proposing new scenarios or optimizing existing ones.