You write a spec. You point Remoroo at your repo. You go to sleep. You wake up to a verified improvement — or a clear explanation of why the metric didn't move. That's the pitch. This post is about what actually happens under the hood.

❯ remoroo run --local \
    --goal "Follow program.md: improve val_bpb on the Shakespeare dataset" \
    --budget 4

▸ Authenticated as adham@remoroo.com
▸ Run rmo-a8f3 created · Haiku · 4.0h wall time
▸ Brain connected · AgentLoop v2 starting
▸ Reading program.md… ✓
▸ Baseline: val_bpb = 2.2396 (commit 9138841)
▸ Time budget: 20 min per experiment
  ...
▸ 30 experiments · 8 kept · 22 discarded
▸ val_bpb: 2.2396 → 1.5484 (31% lower)
▸ Verdict: VERIFIED · REPRODUCIBLE

This post breaks down every layer — from the spec format to the memory system that keeps the agent coherent across hundreds of tool calls.

The Spec: `program.md`

Every Remoroo run starts with a spec. For the autoresearch workflow, that's a program.md file in your repo:

# Autoresearch Program

## Objective
Minimize val_bpb (validation bits-per-byte) on the Shakespeare char-level
language model. Lower is better.

## Rules
- Only edit train.py. Do not touch prepare.py or the eval harness.
- Each experiment gets a fixed TIME_BUDGET of 1200 seconds (20 minutes).
- Run prepare.py once at the start to build the dataset.
- After each experiment, run the eval: `python prepare.py --eval`
- Log every experiment to results.tsv: experiment_id, description, val_bpb, status

## Evaluation
The eval harness in prepare.py computes evaluate_bpb() on the validation set.
This function is fixed and must not be modified. It is the ground truth.

## Baseline
Run train.py unmodified to establish the baseline val_bpb before making changes.

Why a written spec matters: it's the contract. The agent reads it, pins it permanently in memory, and references it throughout the run. It's also what makes the run reproducible — same spec, same repo state, same results.

Memory: How Remoroo Stays Coherent Across Hundreds of Turns

This is the hardest problem in long-running agent systems, and the part we've thought about the most.

A 4-hour autoresearch run produces 200+ tool calls. That's 200+ assistant reasoning blocks, 200+ tool results (file contents, bash output, training logs), plus system messages. No context window can hold all of that: windows for models in the GPT-4 and Claude class typically run 128K-200K tokens, and a single training run's stdout can be 50K tokens.

Naive truncation — dropping the oldest messages — is catastrophic. The agent forgets the spec. It forgets the baseline. It re-reads files it already read. It re-tries approaches that already failed.

Remoroo solves this with a demand-paging memory system inspired by OS virtual memory.

Retention Hints: pin, page, ephemeral

Every tool result in Remoroo carries a retention hint — a signal from the agent (or the engine) about how important this result is for future turns:

Hint	Meaning	What happens at compression
`pin`	Keep full content in context	Stays in the window until decayed
`pin_permanent`	Critical for the entire run	Never decayed by age or cap — only replaced by a fresher read
`page`	Safe to evict	Replaced with a one-line retrieval handle
`ephemeral`	Throwaway	Dropped entirely
`uncertain`	Might be important	Treated like `pin` but a candidate for decay later

When the agent reads program.md at the start of a run, it marks it pin_permanent. That spec stays in the context window for the entire 4-hour run. When it reads a training log from experiment #3, it marks it page — useful for the current turn, but safe to evict once the agent has extracted the metrics.

Retrieval Handles and Page Faults

When a tool result is paged out, it doesn't vanish. It becomes a compact stub:

[Paged out: bash — val_bpb: 1.8834 | Epoch 9/10 complete | ... (387 lines)]

The agent can still see that a bash command ran and produced 387 lines of output with val_bpb in it. If it needs the full content, it re-reads or re-runs the command — a "page fault" that brings the content back and re-pins it.
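A minimal sketch of how a compression pass might act on retention hints and build a handle like the one above. The `ToolResult` class, the function names, and the preview heuristic (the last two non-empty lines) are assumptions made for illustration, not Remoroo's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    tool: str     # e.g. "bash" or "read_file"
    content: str  # full tool output
    hint: str     # "pin", "pin_permanent", "page", "ephemeral", or "uncertain"


def make_handle(result: ToolResult, preview_lines: int = 2) -> str:
    """Build a one-line stub saying what was paged out and how long it was."""
    all_lines = result.content.splitlines()
    non_empty = [ln.strip() for ln in all_lines if ln.strip()]
    preview = " | ".join(non_empty[-preview_lines:])  # assumed preview heuristic
    return f"[Paged out: {result.tool} - {preview} | ... ({len(all_lines)} lines)]"


def compress(results: list[ToolResult]) -> list[str]:
    """Return what stays in context after one compression pass."""
    kept = []
    for r in results:
        if r.hint == "ephemeral":
            continue                     # dropped entirely
        if r.hint == "page":
            kept.append(make_handle(r))  # replaced by a retrieval handle
        else:
            kept.append(r.content)       # pin, pin_permanent, uncertain: full content stays
    return kept
```

In this sketch a page fault is simply the agent re-running the tool and appending a fresh `ToolResult` with a `pin` hint.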

This is inspired by Pichay et al. (arXiv:2603.09023) on demand paging for LLM context. In practice, the fault rate is extremely low — the agent almost never needs to re-fetch paged content, because the things it needs stay pinned.

Pin Decay

Pins can't accumulate forever. Remoroo runs pin decay with two rules:

Max age: A normal pin is demoted to page after 25 actions without being referenced. The spec file (pin_permanent) is exempt.
Max cap: At most 8 normal pins active at once. When a new pin would exceed the cap, the oldest is demoted.

This keeps the working set bounded. In a 30-experiment run, the agent might pin a training log and extract the metrics; 25 actions later the pin decays naturally because the agent has moved on to a new experiment.
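The two decay rules can be sketched as follows. The `Pin` class and `decay` function are hypothetical, and reading "oldest" as "least recently referenced" is an assumption.

```python
from dataclasses import dataclass

MAX_AGE = 25  # actions without a reference before a normal pin is demoted
MAX_PINS = 8  # cap on simultaneously active normal pins


@dataclass
class Pin:
    name: str
    last_ref: int            # action index of the most recent reference
    permanent: bool = False  # pin_permanent entries are exempt from both rules


def decay(pins: list[Pin], now: int) -> tuple[list[Pin], list[Pin]]:
    """Split pins into (still pinned, demoted to page) at action index `now`."""
    permanent = [p for p in pins if p.permanent]
    normal = [p for p in pins if not p.permanent]
    # Rule 1: max age.
    fresh = [p for p in normal if now - p.last_ref < MAX_AGE]
    demoted = [p for p in normal if now - p.last_ref >= MAX_AGE]
    # Rule 2: max cap. Demote the least recently referenced pins first.
    fresh.sort(key=lambda p: p.last_ref)
    overflow = max(0, len(fresh) - MAX_PINS)
    demoted += fresh[:overflow]
    return permanent + fresh[overflow:], demoted
```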

Working Memory: The `note` Tool

Context compaction is lossy by design — that's the whole point. But some information must never be lost: the best val_bpb so far, the experiments already tried, error patterns discovered.

The note tool is a persistent key-value scratchpad:

note(key="best_val_bpb", value="1.5484 (experiment #12, SSSL attention)")
note(key="failed_approaches", value="depth 6 → worse; pure cosine LR → unstable")
note(key="step_3", value="done: architecture search complete, moving to LR refinement")

Notes survive all compaction. They're injected into every turn's context automatically. Unlike conversation history, they're never truncated, never summarized, never paged out. They're the agent's ground truth about its own progress.
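As a rough illustration, a scratchpad with these properties could be as small as the class below. `Scratchpad`, `render`, and the overwrite-on-same-key behaviour are assumptions, not the real tool's internals.

```python
class Scratchpad:
    """Key-value notes that are re-injected into every turn."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}

    def note(self, key: str, value: str) -> None:
        # Assumption: writing an existing key overwrites the old value.
        self._notes[key] = value

    def render(self) -> str:
        """Text block prepended to each turn's context, never compacted."""
        lines = [f"- {k}: {v}" for k, v in self._notes.items()]
        return "NOTES\n" + "\n".join(lines)
```

Because the notes live outside the conversation history, a compression pass has nothing to evict: the rendered block is rebuilt from the dictionary on every turn.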

Cross-Run Memory: `recall` and `store`

Memory doesn't stop at the run boundary. Before starting work, the agent calls recall("train.py optimization strategies") and gets back lessons from previous runs:

Found 3 memory entries:
- [SUCCESS] [train.py, lr_schedule] warmdown_frac below 0.25 causes instability
  with current batch_size. Stay in 0.28-0.35 range.
- [PARTIAL] [train.py, attention] SSSL banded attention works but requires
  depth >= 4 to see gains.
- [FACT] [prepare.py] evaluate_bpb reads from val_data.bin — file must exist
  before eval.

Before calling done(), the agent calls record_lesson to persist what it learned:

record_lesson(
  lesson="SSSL banded attention with final_lr_frac=0.10 is the current best
          architecture. Increasing depth beyond 6 hurts val_bpb at this
          model size.",
  tags=["train.py", "attention", "architecture", "val_bpb"]
)

Multi-strategy retrieval means recall works by tag matching (exact lookups like file names) and semantic similarity (abstract concepts like "attention patterns"). The agent builds institutional knowledge about your repo across runs.
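A toy version of two-strategy retrieval is sketched below. It is an assumption-heavy stand-in: the `Lesson` class and the `min_overlap` threshold are made up, and plain word overlap (Jaccard) replaces real semantic similarity, which would normally need embeddings.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    text: str
    tags: list[str]


def _tokens(s: str) -> set[str]:
    return set(s.lower().replace(",", " ").split())


def recall(query: str, lessons: list[Lesson], min_overlap: float = 0.2) -> list[Lesson]:
    """Return exact tag matches first, then lessons whose text overlaps the query."""
    q = _tokens(query)
    by_tag = [l for l in lessons if q & {t.lower() for t in l.tags}]
    by_text = []
    for l in lessons:
        if l in by_tag:
            continue
        union = q | _tokens(l.text)
        score = len(q & _tokens(l.text)) / len(union) if union else 0.0
        if score >= min_overlap:  # word overlap stands in for semantic similarity
            by_text.append(l)
    return by_tag + by_text
```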

The Context Compressor

Behind the retention hints, there's an engine-managed safety net. The ContextCompressor monitors context utilization as a fraction of the window size. When it hits the threshold, it runs a compression pass:

Non-pinned tool results are replaced with retrieval handles.
Ephemeral injections (internal system messages) are dropped.
Assistant messages with tool calls but no prose are compacted (tool calls preserved, text cleared).
A garbage collector removes the oldest fully-stubbed turns when the stub count exceeds 30.

The result is a sawtooth pattern: context fills up, compression runs, context drops, fills up again. The agent can work indefinitely — 4 hours, 10 hours, overnight — without hitting the context wall.
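The sawtooth is easy to reproduce in a toy simulation. The 0.8 threshold and the 20-token stub size are assumed values (the post does not state them), and the sketch ignores pins, so every result is stubbed on compression.

```python
def simulate(turn_tokens: list[int], window: int,
             threshold: float = 0.8, stub_tokens: int = 20) -> list[float]:
    """Return context utilization after each turn, compressing at the threshold."""
    context: list[int] = []  # token size of each tool result currently in context
    trace = []
    for tokens in turn_tokens:
        context.append(tokens)
        if sum(context) / window >= threshold:
            # Compression pass: every result shrinks to a retrieval handle.
            context = [min(c, stub_tokens) for c in context]
        trace.append(sum(context) / window)
    return trace
```

With six 300-token results in a 1,000-token window, utilization climbs to 0.6, drops after the first compression, climbs again, and drops again.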

Why This Matters: A Concrete Example

In a 30-experiment autoresearch run:

Turn 1: Agent reads program.md → pin_permanent. Reads train.py → pin. Reads prepare.py → pin.
Turn 5: Baseline captured. Agent starts experiment #1. Stores the edit plan in a note.
Turn 15: Experiment #3 completes. Training logs from experiments #1 and #2 have long since been paged out. The agent only needs their val_bpb values, which it stored in note("results", "exp1: 1.996, exp2: 1.823, exp3: 1.689").
Turn 80: Experiment #12 discovers SSSL attention. The agent pins the diff, notes the result, and pages everything else.
Turn 190: Experiment #29. program.md is still pinned. The notes contain the full experimental history. The training logs from earlier experiments survive only as paged handles. Context is at 45%.
Turn 200: Final experiment. Agent calls done(verdict="success") with full awareness of the trajectory from 2.24 to 1.55.

Without memory management, the agent would forget the baseline by experiment #8 and start re-reading program.md every 5 turns, wasting budget and losing coherence.

What Happens When You Run `remoroo run --local`

Here's the data flow from the moment you hit enter:

1. CLI authenticates against the Control Plane. Your session token (Clerk JWT or API key) is verified. The CLI resolves your local repo path — no packing, no uploading. Your code stays on disk.

2. CLI creates a run on the Control Plane. A POST /runs request sends your goal, metric, time budget (--budget in hours), model tier, and the local repo path. The Control Plane creates a run record and enqueues it.

3. The Brain picks up the run. The Brain process dequeues the run and initializes the AgentLoop — the core planning and reasoning engine. It knows the repo path and will issue execution requests against it.

4. The Worker starts polling. Your local CLI spawns a Worker process that polls the Control Plane for execution requests. When the Brain decides to run a command (e.g., python train.py), the request flows through the Control Plane to your local Worker, which executes it in a sandboxed environment (Docker container or isolated venv) on your machine.

5. Events stream back to the TUI. Every agent action — assistant reasoning, tool calls, bash output, metric snapshots — streams back via SSE (Server-Sent Events) to the full-screen terminal UI on your machine.

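The important thing: your code runs on your machine. The Brain (the LLM reasoning loop) runs remotely, but all execution (every python train.py, every file edit, every git diff) happens locally in your sandbox. Your repo is never packed or uploaded; the Brain sees only the tool results the Worker sends back, such as the contents of files it asked to read and command output.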

Architecture

Three logical components, one transport layer:

┌─────────────────────────────────────────────────┐
│                  YOUR MACHINE                    │
│                                                  │
│  ┌──────────┐    ┌───────────────────────────┐   │
│  │   TUI    │    │        Worker             │   │
│  │          │    │  ┌─────────────────────┐   │   │
│  │ Timeline │    │  │ Docker / venv       │   │   │
│  │ Asst Log │    │  │ sandbox             │   │   │
│  │ Tool Log │    │  │                     │   │   │
│  │ Budget   │    │  │ python train.py     │   │   │
│  │          │    │  │ pytest tests/       │   │   │
│  └──────────┘    │  │ git diff            │   │   │
│       ▲          │  └─────────────────────┘   │   │
│       │ SSE      │         ▲  execute         │   │
│       │          │         │  locally          │   │
└───────┼──────────┼─────────┼──────────────────┘   │
        │          │         │                       │
────────┼──────────┼─────────┼───── network ─────────┘
        │          │         │
┌───────┼──────────┼─────────┼──────────────────────┐
│       │   Control Plane    │                       │
│       │          │         │                       │
│  SSE stream    poll      result                    │
│       │       /workers   /jobs                     │
│       │       /poll      /result                   │
│       │          │         │                       │
│       │      ┌───┴─────────┴───┐                   │
│       │      │     Redis       │                   │
│       │      │  outbox / inbox │                   │
│       │      │  heartbeat      │                   │
│       │      │  events         │                   │
│       │      └───┬─────────────┘                   │
│       │          │                                 │
└───────┼──────────┼─────────────────────────────────┘
        │          │
┌───────┼──────────┼─────────────────────────────────┐
│       │     Brain│                                  │
│       │          ▼                                  │
│  ┌────────────────────────────┐                     │
│  │      AgentLoop v2          │                     │
│  │                            │                     │
│  │  LLM call → tool calls    │                     │
│  │  plan / edit / metric_gate │                     │
│  │  note / recall / store     │                     │
│  │  context compressor        │                     │
│  │  budget tracker            │                     │
│  └────────────────────────────┘                     │
└─────────────────────────────────────────────────────┘

The Control Plane is the bridge. It never touches your code. It routes execution requests from Brain to Worker via Redis queues (outbox for Brain-to-Worker, inbox for Worker-to-Brain) and relays events back for the TUI. A heartbeat keeps the connection alive — if the Worker disappears, the Brain knows within seconds.
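The routing can be pictured with two in-memory queues. In this sketch `queue.Queue` objects stand in for the Redis outbox and inbox, and the function names, message fields, and 10-second heartbeat timeout are all hypothetical.

```python
import queue
from typing import Callable

outbox: queue.Queue = queue.Queue()  # Brain -> Worker: execution requests
inbox: queue.Queue = queue.Queue()   # Worker -> Brain: results


def brain_request(job_id: str, command: str) -> None:
    """Brain side: enqueue a command for the local Worker."""
    outbox.put({"job_id": job_id, "command": command})


def worker_poll_once(execute: Callable[[str], str]) -> bool:
    """Worker side: take one request if any, run it locally, post the result."""
    try:
        job = outbox.get_nowait()
    except queue.Empty:
        return False
    inbox.put({"job_id": job["job_id"], "output": execute(job["command"])})
    return True


def worker_alive(last_heartbeat_s: float, now_s: float, timeout_s: float = 10.0) -> bool:
    """Heartbeat check: the Worker counts as gone once it has been silent too long."""
    return now_s - last_heartbeat_s <= timeout_s
```

Only the command string and its output cross the queues here; the `execute` callback is where the sandboxed local run would happen.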

The TUI: What You Actually See

When you run remoroo run --local, you get a full-screen terminal UI built with Textual. It's a GitHub-dark themed split-pane layout:

┌─ remoroo ──────────────────────────────────────────────────────────┐
│ Run rmo-a8f3 · autoresearch · Haiku                                │
│ Goal: Follow program.md: improve val_bpb on Shakespeare dataset    │
├────────────────────┬───────────────────────────────────────────────┤
│  TIMELINE          │  ASSISTANT                                    │
│                    │  I'll start by reading the program spec and   │
│  ▸ Turn 1          │  understanding the baseline. Let me read     │
│    read_file ✓     │  program.md and list the repo structure.     │
│    list_repo ✓     │                                               │
│  ▸ Turn 2          ├───────────────────────────────────────────────┤
│    metric_gate ✓   │  TOOL OUTPUT                                  │
│    baseline 2.2396 │  ▸ metric_gate (phase=baseline)               │
│  ▸ Turn 3          │  Exit code: 0 | Elapsed: 847s                │
│    edit_file ✓     │  Metrics: {"val_bpb": 2.2396}                │
│    metric_gate ◉   │                                               │
│    WITHIN 3.8h     │  ▸ edit_file train.py                        │
│  ▸ Turn 4          │  - ATTN_PATTERN = "L" * DEPTH                │
│    edit_file ✓     │  + ATTN_PATTERN = "SSSL"                     │
│    metric_gate ✓   │                                               │
│    val_bpb 1.689   │  ▸ metric_gate (phase=current)               │
│                    │  Elapsed: 1141s                               │
│                    │  Metrics: {"val_bpb": 1.6890}                │
│                    │  Comparison: val_bpb 2.2396 -> 1.689 ^       │
├────────────────────┴───────────────────────────────────────────────┤
│ ▶ OK  3.8h cap  ·  Haiku 1×                                       │
│  p pause · ctrl+d detach · q quit · r raw                         │
└────────────────────────────────────────────────────────────────────┘

The left pane is a scrollable timeline — one entry per turn, with tool names, status icons, and time budget badges. The right pane splits into assistant reasoning (top) and tool output (bottom). A budget strip at the bottom shows remaining wall time.

The three main key bindings:

p (pause): Cooperative pause. The Brain stops planning new turns. The Worker finishes its current job. Press p again to resume. Useful when you want to inspect intermediate results.
Ctrl+d (detach): tmux-style detach. The TUI closes, but the run stays alive. The Brain keeps working. Reattach later with remoroo run --local --resume rmo-a8f3.
q (quit): Kills the Worker, aborts the run. The run is marked FAILED.

The Agent Loop: Plan, Edit, Train, Evaluate

The AgentLoop v2 is a single LLM tool-calling loop with a strict protocol:

1. UNDERSTAND. The agent starts every run by calling recall(query) to check for past lessons, then list_repo and read_file to understand the codebase. It reads the spec (program.md) and pins it permanently.

2. BASELINE. Before making any changes, the agent captures baseline metrics using metric_gate(phase="baseline"). This runs the eval harness and records the starting point. For autoresearch, that's val_bpb = 2.2396.

3. PLAN. For any non-trivial task, the agent must call plan() before writing code. The plan tool makes a dedicated LLM call to decompose the goal into ordered sub-steps with estimated action counts and risk levels. The plan is stored in working memory so the agent always knows where it is.

4. EDIT. The agent modifies code using edit_file — targeted, minimal patches. It can create files too. Every edit goes through the Worker and executes in the sandbox.

5. VERIFY. After editing, the agent calls metric_gate(phase="current") to re-run the eval. The tool automatically compares current metrics to the baseline and reports the delta.

6. ITERATE or COMPLETE. If the metric improved, the agent either moves to the next experiment or calls done(verdict="success"). If it regressed, the agent analyzes why, reverts, and tries a different approach.
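Steps 4 to 6 reduce to a keep-or-revert loop over a lower-is-better metric. The sketch below is a simplification under stated assumptions: `iterate` is a hypothetical helper, `evaluate` stands in for edit_file followed by metric_gate(phase="current"), and the decision looks only at the metric, whereas the real agent also weighs signals such as training stability.

```python
from typing import Callable


def iterate(baseline: float, experiments: list[str],
            evaluate: Callable[[str], float]) -> tuple[float, list[tuple[str, float, str]]]:
    """Keep an experiment only if it lowers the metric; otherwise discard it."""
    best = baseline
    log = []
    for name in experiments:
        value = evaluate(name)  # edit + re-run the eval
        status = "keep" if value < best else "discard"
        if status == "keep":
            best = value        # the new best becomes the bar to beat
        log.append((name, value, status))
    return best, log
```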

The `metric_gate` Tool

This is the workhorse of verification. It runs any command, extracts metrics from stdout, and compares to baseline. The extraction is multi-strategy:

Explicit format: Lines matching REMOROO_METRIC val_bpb = 1.5484 are parsed directly.
JSON objects: {"val_bpb": 1.5484} found in stdout.
Pytest summaries: 8 passed, 2 failed is automatically captured as tests_passed, tests_failed, test_pass_rate.
Key-value lines: val_bpb: 1.5484 as plain text.
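The four strategies above might look like this in code. This is a sketch, not the tool's parser: `extract_metrics`, the regular expressions, and the per-line precedence order are assumptions.

```python
import json
import re

NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"


def extract_metrics(stdout: str) -> dict[str, float]:
    """Scan stdout line by line, trying each extraction strategy in turn."""
    metrics: dict[str, float] = {}
    for raw in stdout.splitlines():
        line = raw.strip()
        # 1. Explicit format: REMOROO_METRIC name = value
        m = re.match(rf"REMOROO_METRIC\s+(\w+)\s*=\s*({NUM})$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
            continue
        # 2. A JSON object on its own line
        if line.startswith("{") and line.endswith("}"):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                obj = None
            if isinstance(obj, dict):
                for k, v in obj.items():
                    if isinstance(v, (int, float)) and not isinstance(v, bool):
                        metrics[k] = float(v)
                continue
        # 3. Pytest summary: "8 passed, 2 failed"
        passed = re.search(r"(\d+) passed", line)
        failed = re.search(r"(\d+) failed", line)
        if passed or failed:
            p = int(passed.group(1)) if passed else 0
            f = int(failed.group(1)) if failed else 0
            rate = p / (p + f) if p + f else 0.0
            metrics.update(tests_passed=p, tests_failed=f, test_pass_rate=rate)
            continue
        # 4. Plain key-value line: name: value
        m = re.match(rf"(\w+)\s*:\s*({NUM})$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```

Trying the explicit format first means a training script can always override looser matches by printing a REMOROO_METRIC line.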

Process Supervision

Long-running commands (like training) are monitored by a process supervisor that emits structured events:

ERROR_SIGNATURE: Detects catastrophic failures — OOM, NaN loss, segfault, missing dependency. High-confidence (threshold ≥0.7) pattern matching on stderr/stdout. When detected, the agent is woken immediately instead of waiting for the command to time out.
SILENT_TIMEOUT: If the agent sets max_silent_s=120 and the command produces zero stdout for 120 seconds, the supervisor wakes the agent with a SILENT_TIMEOUT event. This catches hung processes.
METRIC_TARGET_REACHED: If the agent sets target metrics (e.g., val_bpb < 2.0), the supervisor watches the output stream and wakes the agent the moment the target is hit — no need to wait for training to finish.
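To make the three events concrete, here is a sketch that replays timestamped output lines and returns the first wake-up event. It is illustrative only: `supervise`, the signature patterns, and the tuple return format are assumptions, and it skips the confidence scoring mentioned above. A real supervisor would watch the clock rather than replay a log.

```python
import re
from typing import Iterable, Optional

ERROR_PATTERNS = {  # hypothetical signatures
    "OOM": re.compile(r"out of memory", re.I),
    "NAN_LOSS": re.compile(r"loss[^\n]*\bnan\b", re.I),
    "MISSING_DEP": re.compile(r"ModuleNotFoundError"),
}


def supervise(events: Iterable[tuple[float, str]], max_silent_s: float,
              target: Optional[tuple[str, float]] = None,
              end_s: Optional[float] = None) -> Optional[tuple]:
    """events: (timestamp_s, line) pairs in order. Returns the first wake-up event."""
    last = 0.0  # time of the most recent output (job starts at t=0)
    for ts, line in events:
        if ts - last > max_silent_s:
            return ("SILENT_TIMEOUT", last + max_silent_s)
        last = ts
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                return ("ERROR_SIGNATURE", name)
        if target is not None:
            metric, limit = target  # wake when metric < limit
            m = re.search(rf"{re.escape(metric)}\s*[:=]\s*([-+]?\d+(?:\.\d+)?)", line)
            if m and float(m.group(1)) < limit:
                return ("METRIC_TARGET_REACHED", float(m.group(1)))
    if end_s is not None and end_s - last > max_silent_s:
        return ("SILENT_TIMEOUT", last + max_silent_s)
    return None
```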

Time Budgeting: The System Respects Your Clock

When you say --budget 4, you mean 4 hours. Not 4.5. Not "until the agent decides it's done." The wall clock is the contract.

How It Works

The --budget flag sets the wall-clock time limit in hours (default: 10). The BudgetTracker converts this to seconds and checks elapsed time on every action:

max_wall_time_s = 14400  (--budget 4 → 4 hours)

When the time limit is reached, the run stops cleanly. The agent isn't killed mid-sentence — it gets a chance to wrap up.
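A wall-clock tracker of this kind fits in a few lines. The class below is a sketch that borrows the BudgetTracker name from the text; its methods and the injectable `clock` parameter are assumptions added so the logic can be tested without waiting.

```python
import time
from typing import Callable


class BudgetTracker:
    """Convert a budget in hours to seconds and compare it with elapsed wall time."""

    def __init__(self, budget_hours: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_wall_time_s = budget_hours * 3600
        self._clock = clock
        self._start = clock()

    def elapsed_s(self) -> float:
        return self._clock() - self._start

    def remaining_s(self) -> float:
        return max(0.0, self.max_wall_time_s - self.elapsed_s())

    def exhausted(self) -> bool:
        """Checked before every action; True means the run must stop."""
        return self.elapsed_s() >= self.max_wall_time_s
```

A monotonic clock is used by default so that system clock adjustments during an overnight run cannot stretch or shrink the budget.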

Soft Finalize

When 90% of the time budget is consumed, the agent receives a "wrap it up" signal. It has a buffer — 10% of total time, clamped between 5 seconds and 30 seconds — to:

Write final notes summarizing what it learned.
Record lessons for future runs via record_lesson.
Call done() with a verdict and evidence.

This is the difference between a run that ends with "BUDGET EXHAUSTED — run terminated" and one that ends with a clean done(verdict="success", evidence="val_bpb improved from 2.24 to 1.55 across 30 experiments").
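The soft-finalize numbers can be worked through for --budget 4. Both helper functions are hypothetical; they only encode the two figures stated above (the 90% signal and the 10% buffer clamped to 5-30 seconds), and how the engine combines the two is not something this sketch claims to know.

```python
def soft_finalize_at_s(max_wall_time_s: float, fraction: float = 0.9) -> float:
    """Elapsed time at which the wrap-up signal is sent."""
    return fraction * max_wall_time_s


def wrap_up_buffer_s(max_wall_time_s: float) -> float:
    """10% of total time, clamped to the range [5 s, 30 s]."""
    return min(30.0, max(5.0, 0.10 * max_wall_time_s))


budget_s = 4 * 3600                     # --budget 4 -> 14400 s
signal_s = soft_finalize_at_s(budget_s)  # about 12960 s, i.e. 3.6 h into the run
buffer_s = wrap_up_buffer_s(budget_s)    # 1440 s before clamping, 30 s after
```

For any run longer than five minutes, 10% of the budget exceeds 30 seconds, so the clamp's upper bound is what applies.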

Per-Experiment Time Budgets

Inside a run, each experiment has its own time contract via max_silent_s on metric_gate. If training should produce output every ~30 seconds (loss updates, epoch markers), the agent sets max_silent_s=120. If 2 minutes pass with zero output, the supervisor wakes the agent:

SILENT_TIMEOUT: No output for 120s. Job v2-train-07 may be hung.

The agent can then kill the hung job and move on to the next experiment, instead of burning 20 minutes of wall time on a process that's stuck.

Budget Strip in the TUI

The bottom bar of the TUI shows real-time budget state:

▶ OK  3.2h cap  ·  Haiku 1×

Or, if time is running low:

▶ CLAMPED  0.4h remaining  ·  wrapping up

You can glance at the terminal at any point and know exactly how much time is left and whether the agent is on track.

A Real Example: autoresearch

Here's an end-to-end run using Karpathy's autoresearch — a character-level language model on Shakespeare, optimizing val_bpb (validation bits-per-byte, lower is better).

Setup

git clone https://github.com/Remoroo/autoresearch
cd autoresearch
pip install remoroo
remoroo login

Launch

remoroo run --local \
  --goal "Follow program.md: minimize val_bpb. Only edit train.py. \
          Log experiments to results.tsv." \
  --metrics "val_bpb < 2.0" \
  --budget 6

What the Agent Does

Minutes 0-15: Reads program.md, train.py, prepare.py. Runs python prepare.py to build the dataset. Captures baseline: val_bpb = 2.2396.

Minutes 15-45: Experiment #1: adjusts LR schedule (warmdown 35% → 30%, final LR 8%). Trains for 20 minutes. Result: val_bpb = 1.9960. KEEP — 11% improvement. Notes the result.

Minutes 45-70: Experiment #2: increases model depth from 4 to 6. Trains for 20 minutes. Result: val_bpb = 1.6890. Better on paper, but not trustworthy: the number beats the depth-4 result, yet the training log shows loss spikes. DISCARD. Records a lesson about depth instability.

Hours 1-5: Experiments #3 through #28. Explores attention patterns, batch sizes, optimizer configs. Some keep, most discard. Uses note to track the Pareto frontier. The standout is experiment #12: SSSL banded attention with final_lr_frac 0.10 reaches val_bpb = 1.5484. KEEP, a 31% improvement from baseline. Records lesson.

Hour 5-6: The last experiments try to refine the winner. Experiment #30 (SSSL + warmdown 0.28) reaches 1.5512, just short of #12's 1.5484. DISCARD.

Hour 6: Soft finalize. Agent writes final notes, records lessons for future runs, calls done(verdict="success").

The Output

results.tsv:
exp  description                                    val_bpb  status
1    LR warmdown 35→30%, final LR 8%                1.9960   keep
2    Depth 4→6                                      1.6890   discard
3    Batch size 2^15                                 2.1102   discard
...
12   SSSL attention + final_lr_frac 0.10             1.5484   keep
...
30   Ensemble best: SSSL + warmdown 0.28             1.5512   discard

git diff --stat:
 train.py | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

One file changed. 30 experiments run. Metric improved 31%. All verified against the fixed eval harness. Reproducible by checking out the branch and re-running the eval.
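If you want to check the headline number yourself, a few lines over results.tsv are enough. The sketch assumes a tab-separated file whose header row uses the column names from program.md (experiment_id, description, val_bpb, status); `summarize` is a hypothetical helper.

```python
import csv
import io


def summarize(tsv_text: str, baseline: float) -> tuple[str, float, float]:
    """Return (experiment_id, val_bpb, percent improvement) of the best kept run."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [r for r in rows if r["status"] == "keep"]
    best = min(kept, key=lambda r: float(r["val_bpb"]))  # lower is better
    value = float(best["val_bpb"])
    pct = 100 * (baseline - value) / baseline
    return best["experiment_id"], value, round(pct, 1)
```

With the rows shown above and the 2.2396 baseline, the best kept experiment is #12 at 1.5484, which rounds to the 31% quoted in the summary.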

What You Wake Up To

After a Remoroo run completes, you have:

A verified patch. The git diff is clean and minimal. The agent only touched what needed changing.
A results log. Every experiment, its description, the metric, and whether it was kept or discarded.
A trace. Full JSONL trace of every agent action in .remoroo/runs/<run-id>/trace.jsonl. You can replay it, audit it, or feed it into your own analysis.
Lessons in memory. The agent's insights persist. The next run on this repo starts with knowledge of what worked and what didn't.

The agent didn't guess. It proved.

pip install remoroo
