Letting an Agent Run for a Day — Operations Infra for Long-Horizon Coding Agents

The Problem

For a long time, using a coding agent meant short round-trips. Fix this function, look at this test, with a human grabbing and releasing the keyboard every few minutes. Then OpenAI’s Codex ran for roughly 25 hours with no human intervention while building a design tool from an empty repo — burning 13M tokens and producing 30k lines. Agents can now be let loose by the hour, even by the day, not the minute. The catch is that the moment you let one run that long, everything hidden in short tasks erupts at once. The context window fills and compacts repeatedly, and in that churn the agent quietly drifts from its original spec. Token cost piles up linearly with no governor; one practitioner warns the long-running mode “consumes truly absurd amounts of tokens,” and a five-hour plan limit expires before a big job is finished. Worst of all, a 25-hour run that collapses at 3am goes unnoticed until morning. You can launch the agent — but the operating layer that lets you watch whether it’s actually running, stop it when cost leaks, catch it when it drifts, and page a human when it’s stuck is missing entirely.

Why Now

Several signals landed at once that long-horizon work is crossing from lab demo into real workflow. OpenAI shipped compaction in GPT-5.1-Codex-Max: as the context approaches its limit, the model prunes history while preserving essential state into a fresh window and keeps executing — and internal evals report it running a single task solo for more than 24 hours. Codex CLI added the /goal command in v0.128.0+, handling persistent objectives that survive interruptions, session breaks, and token limits. Instead of one instruction at a time, you set a goal with a verifiable stopping condition and let the agent pursue it across many turns. Model-side efficiency followed too: reports of 30% fewer thinking tokens at the same reasoning effort, with higher accuracy, are exactly what makes day-long autonomy economically viable. So the models and runtimes are already sprinting toward long-horizon. Yet the panel that supervises many runs at once, watches cost and drift in real time, and pages a human at a blocked step is still something each team hand-rolls from markdown files and shell scripts. A textbook gap: standard runtime, no standard ops tooling.

How to Build It

Three on-ramps. One, a checkpoint-and-recovery layer. What actually kept the agent on track over 25 hours was a set of spec, plan, and status markdown files committed to the repo. Lift that externalized-state pattern out of hand-written files and into a product: snapshot runs, revive from the last verified checkpoint when one collapses, and auto-generate an audit trail of what decisions were made and how far the run got. Two, a drift-and-cost control dashboard. See many agent runs on one screen — which milestone each is on, how many tokens it has burned, how far it has strayed from spec (did it touch a file it was told not to change?), where verification broke. Set budget caps so a token spike overnight halts automatically. Three, human-in-the-loop design. The agent pauses at a judgment call, pages a human, and resumes in place once it gets an answer; a policy layer manages the verifiable stopping condition and the “don’t change this” constraints in one place. Monetize as a team subscription — concurrent runs, retained checkpoint storage, number of supervised agents. Rather than chase every model and runtime from day one, attach deeply to one runtime pushing long-horizon first (Codex /goal style) and become the control tower for that ecosystem.

flowchart LR
  A[Long-Running Agent] -->|progress state| B[Checkpoint Layer]
  A -->|token & drift metrics| C[Control Dashboard]
  C -->|over budget| D[Auto-Halt]
  C -->|needs judgment| E[Human-in-the-Loop]
  E -->|answer| A
  B -->|recover on collapse| A

Success Criteria

This market turns on how fast you earn trust. First, the operating layer itself has to be more stable than the agent it supervises. If the control tower wobbles, every run beneath it wobbles too — and the first time a checkpoint recovery misfires and discards good work, the customer walks. Second, cost guardrails have to be real-time blocks, not after-the-fact invoices. “You spent this many tokens yesterday” is already too late; the value is stopping just before the cap. Third, drift detection is where the data lock-in lives. Which signals actually predict real drift — which file edits, which verification-failure patterns — only sharpens as runs accumulate, and that compounding edge is what a fast follower can’t replicate. The most common failure is trying to support every runtime at once and attaching deeply to none.