Verifiable Autonomous Engineering
Explore the comprehensive suite of tests that power Remoroo. From environment healing to high-fidelity artifact replays, we measure everything to ensure reliability at scale.
Foundation Tests
Basic functionality and regression tests to ensure core system stability.
Calculate Mean from CSV
Write a script to calculate the mean of values in a CSV file
mean_value is correct
EDA Blind Spot (Corrupted Data)
A Data Science task where the agent must perform EDA to discover and filter out specific error codes (encoded as -999) before training a model. Without specific cleaning, the model will fail to meet the R^2 threshold.
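As a rough illustration of the cleaning step this scenario expects, the sketch below drops rows that contain the -999 error code before any statistics are computed. The column names `x` and `y`, the toy rows, and the helper `drop_sentinel_rows` are hypothetical; the actual dataset and model used by the test are not shown here.

```python
import statistics

SENTINEL = -999  # error code named in the task description


def drop_sentinel_rows(rows, columns, sentinel=SENTINEL):
    """Return only the rows in which none of the given columns holds the sentinel."""
    return [row for row in rows if all(row[c] != sentinel for c in columns)]


# Hypothetical toy data; 'x' and 'y' are made-up column names.
rows = [
    {"x": 1.0, "y": 2.1},
    {"x": 2.0, "y": -999},
    {"x": -999, "y": 6.2},
    {"x": 4.0, "y": 8.0},
]

clean = drop_sentinel_rows(rows, ["x", "y"])
mean_raw = statistics.mean(r["y"] for r in rows)      # dragged down by the -999 entry
mean_clean = statistics.mean(r["y"] for r in clean)   # computed on valid rows only
```

Left in place, the sentinel values act as extreme outliers, which is why a model trained on the raw data would be expected to miss the R^2 threshold.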
r2_score >= 0.9
Environment Recovery (Env Doctor)
A repository with missing dependencies. The Environment Doctor must detect the missing package (requests) and install it to make the environment healthy.
exit_code == 0
Fix Fibonacci Off-by-One
Fix an off-by-one error in a fibonacci sequence generator
fib_10 == 55
Fix String Reversal Bug
Fix a bug in a string reversal function that incorrectly handles special characters
reversed_output == '!dlroW olleH'
Patch Brittle Repeated Blocks
A repo with repeated code blocks; stresses patch precision and the ability to avoid wrong-block edits.
code_runs == true
Regression Guardrails (Fibonacci)
Implement a Fibonacci calculator. The system must reject inefficient (recursive) implementations that violate the runtime threshold.
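For contrast with the naive recursive version, a linear-time iterative implementation might look like the sketch below. The indexing convention (fib(0) = 0, fib(1) = 1) and the function name `fib` are assumptions; the test's own interface is not shown here.

```python
def fib(n):
    """Return the n-th Fibonacci number iteratively, assuming fib(0) = 0 and fib(1) = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # advance one step along the sequence
    return a


fib_10 = fib(10)
```

Naive double recursion repeats the same subproblems and takes exponential time, while this loop performs n additions, so it stays well inside a sub-second runtime budget for moderately large n.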
runtime_s < 0.5
Simpson's Paradox (Causal Inference)
A Data Science task where the global correlation is the inverse of the local correlation. The agent must discover that the 'group' feature is a confounder and include it in the model to achieve the target accuracy.
mse < 50.0
Integration Challenges
Intermediate scenarios testing component interaction and state management.
Targeted Ambiguous Date Parsing (Locales)
A Data Processing task where 'date' formats vary by 'region' (US=MM/DD/YYYY, UK=DD/MM/YYYY). The agent must calculate total sales for '2023-01' (January). Naive parsing will misinterpret days <= 12.
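A minimal sketch of region-aware parsing is shown below, using the two formats named in the description. The record layout (`region`, `date`, `sales` keys) and the sample values are hypothetical.

```python
from datetime import datetime

# Formats taken from the task description, keyed by region code.
REGION_FORMATS = {"US": "%m/%d/%Y", "UK": "%d/%m/%Y"}


def parse_sale_date(date_text, region):
    """Parse a date string using the format that belongs to its region."""
    return datetime.strptime(date_text, REGION_FORMATS[region]).date()


def total_sales_for_month(records, year, month):
    """Sum the sales of all records whose region-aware date falls in the given month."""
    total = 0.0
    for rec in records:
        d = parse_sale_date(rec["date"], rec["region"])
        if d.year == year and d.month == month:
            total += rec["sales"]
    return total


# Hypothetical records: the same string means different days in each region.
records = [
    {"region": "US", "date": "01/05/2023", "sales": 100.0},  # 5 January
    {"region": "UK", "date": "01/05/2023", "sales": 40.0},   # 1 May
    {"region": "UK", "date": "05/01/2023", "sales": 60.0},   # 5 January
]

january_total = total_sales_for_month(records, 2023, 1)
```

A parser that applied a single format to every row would count the UK row for 1 May as a January sale, which is the misinterpretation the scenario targets.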
accuracy_score == 1.0
Complex Engineering
Difficult bug fixes and refactors requiring deep codebase understanding.
Fix Division Bug
Fix a division by zero bug in a calculator module
code_runs == true
Complex Refactor
Refactor a monolithic legacy processor into modular classes.
tests_passed == true
Fix Cycle Detection in Directed Graph
Fix broken cycle detection that gives false positives due to incorrect visited state tracking
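One common way to avoid this kind of false positive is to keep three node states instead of a single visited flag, so that a node reached twice along different paths is not mistaken for a back edge. The sketch below assumes the graph is a dict mapping each node to a list of successors; the repository's actual data structure is not shown here.

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    color = {n: WHITE for n in nodes}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:
                return True  # back edge to a node on the current path: a real cycle
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # finished; reaching this node again is not a cycle
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


# A diamond (two paths to 'd') has no cycle; a single visited set would flag it.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
loop = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

The recursive form is kept for brevity; very deep graphs would need an explicit stack to stay under Python's recursion limit.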
all_tests_pass == true
Fix Data Pipeline Aggregation Bug
Fix a subtle bug in a multi-file data pipeline that incorrectly aggregates time-series data
aggregated_total == 12500
Fix Memory Leak in Cache
Fix unbounded cache growth that causes memory issues
memory_stable == true
Fix Numerical Instability in Statistics
Fix catastrophic cancellation in variance calculation that gives wrong results for large values
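Catastrophic cancellation typically appears when variance is computed as the difference of two large, nearly equal sums. One standard remedy is Welford's single-pass update, sketched below. Whether the test expects sample or population variance is not stated, so the n - 1 denominator here is an assumption, as are the function names and the sample data.

```python
def naive_variance(values):
    """Textbook 'sum of squares minus square of sum' formula; unstable for large values."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(values):
    """Sample variance via Welford's running update, which avoids subtracting huge sums."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        raise ValueError("need at least two values")
    return m2 / (count - 1)


# Large offset, small spread: the deviations from the mean are -6, -3, 3, 6.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
stable = welford_variance(data)
unstable = naive_variance(data)  # may differ noticeably from the stable result
```

The squared deviations of this data sum to 90, so the sample variance is 30; the Welford version recovers that value, while the naive formula loses digits because its intermediate sums are on the order of 1e18.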
variance_correct == true
Fix Race Condition in Counter
Fix a race condition in a multi-threaded counter that causes incorrect final counts
final_count == 10000
Fix State Machine Transition Bug
Fix subtle bugs in an order-processing state machine that allow invalid state transitions
all_tests_pass == true
Hard Discovery (Force Search)
A massive repository where the target code is hidden deep in the directory structure, truncated from the initial index summary.
tests_passed == true
Implement Binary Search Tree
Complete the BST implementation with insert, search, and in-order traversal
traversal_correct == true
Implement Retry Logic with Exponential Backoff
Add retry logic to an API client that fails on transient errors
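The general shape of such retry logic can be sketched as follows. `TransientError`, `call_with_retry`, and all parameter defaults are hypothetical names and values chosen for illustration; the API client under test is not shown here. The `sleep` argument is injectable so the behaviour can be checked without real waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(func, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call func, retrying on TransientError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each retry and is capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Usage: a fake call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
```

Production clients often add random jitter to each delay so that many clients do not retry in lockstep; it is omitted here to keep the example deterministic.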
retry_works == true
Large Repo Navigation
Navigate a large, noisy repository to fix a specific bug in a deep module.
tests_passed == true
MNIST Training with PyTorch
Download MNIST and train a neural network with torch to classify the digits, using validation accuracy as the metric
validation_accuracy >= 0.97
Performance Optimization
Optimize a data processing pipeline for both runtime and memory usage.
no metric
Readiness Check
Verify the engine can perform a simple plan-patch-test loop.
tests_passed == true
Refactor Callback Hell to Async/Await
Convert deeply nested callback-based code to clean async/await pattern
refactor_complete == true
Refactor Circular Import
Fix circular import error by refactoring code across multiple files
code_runs == true
Fix Deadlock in Resource Manager
Fix a deadlock caused by inconsistent lock ordering in a multi-resource system
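The usual fix for this class of deadlock is to acquire locks in one global order regardless of the direction of the operation. The sketch below illustrates the idea with a hypothetical `Account` class and `transfer` function; the resource manager in the test repository may be structured differently.

```python
import threading


class Account:
    """Hypothetical resource guarded by its own lock."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move amount from src to dst, always locking the lower id first."""
    if src is dst:
        return  # nothing to do, and locking the same lock twice would block forever
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def worker(src, dst, times):
    for _ in range(times):
        transfer(src, dst, 1)


a = Account(1, 1000)
b = Account(2, 1000)

# Two threads transfer in opposite directions, 1000 transfers in total.
threads = [
    threading.Thread(target=worker, args=(a, b, 500)),
    threading.Thread(target=worker, args=(b, a, 500)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If each thread instead locked its own source first, the two threads could each hold one lock while waiting for the other, which is the inconsistent ordering the scenario describes.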
completed_transfers == 1000
Advanced Optimization
High-stakes multi-objective optimization problems with strict constraints.
Impossible Compression Task
Compress any file to 50% size with no loss, always. This is an impossible task that should fail.
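The impossibility follows from a counting (pigeonhole) argument: there are more distinct files of a given length than there are distinct outputs of at most half that length, so no lossless scheme can map every input to a unique shorter output. The sketch below just does that count for a small, arbitrarily chosen length of 4 bytes; the helper names are hypothetical.

```python
def count_files(length_bytes):
    """Number of distinct files that are exactly length_bytes long."""
    return 256 ** length_bytes


def count_outputs_up_to(max_length_bytes):
    """Number of distinct byte strings of length 0 through max_length_bytes."""
    return sum(256 ** k for k in range(max_length_bytes + 1))


n = 4                                   # illustrative input length in bytes
inputs = count_files(n)                 # every possible 4-byte file
outputs = count_outputs_up_to(n // 2)   # every possible output of at most 2 bytes

# Fewer outputs than inputs means two inputs must share an output,
# so at least one of them cannot be decompressed correctly.
collision_forced = outputs < inputs
```

The same inequality holds for every length n of at least 1 byte, which is why a correct engine should report this task as infeasible rather than claim success.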
compressed_size / original_size <= 0.5 for 100% of files
baseline_hidden_metrics
Experimental engineering scenario designed to validate specific engine capabilities.
runtime_s <= 2.0, accuracy >= 0.92
baseline_log_aliases
Experimental engineering scenario designed to validate specific engine capabilities.
runtime_s <= 2.0, accuracy >= 0.90
Cache System Optimization (Multi-Metric)
Optimize a caching system to simultaneously achieve high hit rate, low memory usage, fast access time, and correct eviction behavior
hit_rate >= 0.80, peak_memory_mb <= 128, avg_access_time_ms <= 5, eviction_correct == true
Data Pipeline with Schema Validation
Build ETL pipeline that processes CSV to JSON with strict validation: correct schema, no nulls, minimum record count
records_processed >= 100, null_count == 0, schema_valid == true
Fair + Fast Classifier (Multi-metric, Multi-file)
Fix correctness and performance across a small ML codebase: improve accuracy & loss, reduce training time vs baseline, and reduce group fairness gap.
accuracy >= 0.9, fairness_gap <= 0.10, training_time < baseline training_time
Hard Multi-stage Hidden Metrics
A multi-stage training/evaluation repo that requires multi-file optimization plus careful behavior preservation.
runtime_s <= 2.0, validation_accuracy >= 0.93
ML Training with Multiple Metrics
Train a classifier that must satisfy multiple constraints: accuracy >= 0.85, loss <= 0.5, and training time < 30 seconds
accuracy >= 0.85, loss <= 0.5, training_time < 30
MNIST Training with PyTorch
Download MNIST and train a neural network with torch to classify the digits, using validation accuracy as the metric
validation_accuracy >= 0.97, inference_time <= 100ms
Noisy Metrics Stabilization (Determinism + Threshold)
A small repo where metrics are noisy due to missing seeding. The engine must make it deterministic and meet an accuracy threshold.
accuracy >= 0.90, deterministic == true
Planner Suite Multi-Runtime (Baseline-Relative)
A deterministic suite that runs several planners and measures per-planner runtime plus total runtime. Requires improvements across multiple files.
planner_a_runtime_s < baseline planner_a_runtime_s, planner_b_runtime_s < baseline planner_b_runtime_s, planner_c_runtime_s < baseline planner_c_runtime_s, runtime_total_s < baseline runtime_total_s
Refactor for Test Coverage
Fix bugs AND achieve test coverage >= 80% by adding tests and making code testable
tests_pass == true, coverage >= 80
Slow Train With Streaming Logs
A long-running training loop that prints progress; stresses command execution, timeouts, and log handling.
code_runs == true, runtime_s >= 3.0
Fair + Fast Classifier (Multi-metric, Multi-file)
Fix correctness and performance across a small ML codebase: improve accuracy & loss, reduce training time vs baseline, and reduce group fairness gap.
accuracy >= 0.93, loss <= 0.25, fairness_gap <= 0.10, training_time < baseline training_time
Concurrency
Large Repo Pipeline Optimization (Multi-file)
A larger Python codebase with an ETL-style pipeline that is correct but slow due to inefficient tokenization and feature building. Requires optimization across multiple modules.
runtime_s <= 2.0, correctness == true