Enterprise IT Agents Miss More Than Half: The Empty Seat Called the Verification Layer

The Problem

Enterprises are told from every direction to hand IT operations to autonomous agents. Yet even frontier models clear less than half of SRE incident-resolution benchmarks. The more troubling part is the pattern: running one more investigation turn doesn’t move you closer to the answer; it piles on false positives. Candidate causes accumulate, and which one to trust gets murkier, not clearer.

On the ground this isn’t a tidy accuracy stat. The moment you grant an agent execute authority, a coin-flip judgment lands straight in production. Anyone who has carried an on-call pager knows what one wrong rollback at 3 a.m. costs. So plenty of teams chant “adopt agents” while never actually wiring them in.

Why Now

The fact that agents fail is itself the market. If model quality were going to be flawless within a year, this empty seat would be a stopgap, but on long-context, high-blast-radius work like incident resolution, clearing the 50% line overnight is unlikely. In the meantime enterprises sit in an awkward middle: they want agents but can’t grant execute rights. That gap is exactly where demand forms.

How to Build It

The core move is to pin the agent to “propose,” not “execute.” When the agent emits root-cause candidates, a verification layer cross-checks them against logs, metrics, and past incidents, scores confidence, prunes the clearly wrong ones, and narrows the set to a short list a human can decide on. Human-in-the-loop is designed as a safety mechanism, not a bottleneck: what the human sees is already verified and narrowed, not raw logs.

flowchart LR
  A[Incident signal] --> B[Agent: generate candidates]
  B --> C[Verification: cross-check logs/metrics]
  C --> D[Drop false positives, score confidence]
  D --> E[Narrowed candidate list]
  E --> F[Human: decide and execute]
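
To make the verify, score, and prune steps concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Candidate fields, the evidence weights, the threshold, and the shortlist size are hypothetical, and a real layer would fill the evidence flags by querying actual logs, metrics, and incident history.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A root-cause candidate proposed by the agent (hypothetical schema)."""
    cause: str
    log_match: bool = False              # a matching error appears in the logs
    metric_anomaly: bool = False         # a related metric deviated in the window
    seen_in_past_incident: bool = False  # a similar cause exists in past incidents


# Assumed evidence weights in points out of 10, chosen only for illustration.
WEIGHTS = {"log_match": 5, "metric_anomaly": 3, "seen_in_past_incident": 2}


def confidence(candidate: Candidate) -> int:
    """Sum the points for every evidence check the candidate passed."""
    return sum(points for name, points in WEIGHTS.items() if getattr(candidate, name))


def shortlist(candidates, min_confidence=5, max_items=3):
    """Drop low-confidence candidates and return the top few, best first."""
    scored = [(confidence(c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_confidence]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_items]


proposals = [
    Candidate("bad deploy of checkout service", log_match=True, metric_anomaly=True),
    Candidate("DNS outage", seen_in_past_incident=True),
    Candidate("node disk pressure", log_match=True),
    Candidate("cosmic rays"),
]

# The human only ever sees this narrowed list, never the raw proposals.
for score, candidate in shortlist(proposals):
    print(score, candidate.cause)
```

The function returns a list and triggers no action, which keeps the agent pinned to "propose": execution stays with the person reading the shortlist.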

An MVP starts narrow: one stack (say Kubernetes plus a single observability tool) with verification rules and an eval pipeline bolted on. Proving you actually cut false positives in one environment beats chasing generality first.
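
One way to check the "actually cut false positives" claim is a small eval over labeled past incidents. The sketch below is an assumption, not a prescribed pipeline: the incident schema (truth, raw, verified) and the metric names are hypothetical. It compares the share of wrong candidates before and after verification, and also tracks how often the true cause survives pruning.

```python
def false_positive_rate(proposed, true_cause):
    """Share of proposed causes that are not the labeled root cause."""
    if not proposed:
        return 0.0
    wrong = sum(1 for cause in proposed if cause != true_cause)
    return wrong / len(proposed)


def evaluate(incidents):
    """Average per-incident metrics over a labeled set (hypothetical schema).

    Each incident is a dict with:
      'truth'    - the labeled root cause
      'raw'      - candidates the agent proposed
      'verified' - candidates left after the verification layer
    """
    if not incidents:
        raise ValueError("need at least one labeled incident")
    n = len(incidents)
    raw_fp = sum(false_positive_rate(i["raw"], i["truth"]) for i in incidents) / n
    verified_fp = sum(false_positive_rate(i["verified"], i["truth"]) for i in incidents) / n
    # Pruning is only useful if the true cause is still on the shortlist.
    recall = sum(1 for i in incidents if i["truth"] in i["verified"]) / n
    return {"raw_fp": raw_fp, "verified_fp": verified_fp, "recall": recall}


# Made-up example data, for illustration only.
labeled = [
    {"truth": "a", "raw": ["a", "b", "c", "d"], "verified": ["a", "b"]},
    {"truth": "x", "raw": ["x", "y"], "verified": ["x"]},
]
print(evaluate(labeled))
```

Reporting recall next to the false-positive rate matters: a layer that prunes everything would score a perfect false-positive rate while hiding the real cause from the human.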