Move debugging into the system: cheaper monitoring, less ops time

Recently I was hunting a memory leak in our product. It took a few iterations to find it, and each iteration moved more of the interpretive work into the running process. By the last one I wasn’t really debugging — the system was telling me what had happened.

That move — pushing interpretation inside, instead of out to dashboards — is an old pattern. The payoff lands on two axes: less operator time spent chasing incidents, and a smaller monitoring bill on the happy path. Agents on both ends of the loop are what finally make it tractable.

The leak hunt, in four rungs

Rung 1 — pprof snapshots, diffed by hand. The starting point: take a Rust pprof snapshot, wait, take another, diff. The output points at a handful of suspect call paths but doesn’t say why anything stuck around. The “why” step — reading code, forming a hypothesis — is mine to do.
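To make the diffing step concrete, here is a minimal sketch, assuming each snapshot has already been reduced to a map from call path to bytes in use. The `Snapshot` type and the `growth` function are hypothetical names for illustration, not pprof's actual format or API.

```rust
use std::collections::HashMap;

/// Hypothetical reduced snapshot: bytes in use, keyed by call path.
/// This is an assumed shape, not the pprof profile format.
pub type Snapshot = HashMap<String, i64>;

/// Returns the call paths whose usage grew between two snapshots,
/// largest growth first (ties broken by path name).
pub fn growth(before: &Snapshot, after: &Snapshot) -> Vec<(String, i64)> {
    let mut deltas: Vec<(String, i64)> = after
        .iter()
        .map(|(path, bytes)| {
            // A path missing from the first snapshot counts as zero bytes.
            let earlier = before.get(path).copied().unwrap_or(0);
            (path.clone(), bytes - earlier)
        })
        .filter(|(_, delta)| *delta > 0)
        .collect();
    deltas.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    deltas
}
```

The result is a ranked list of suspects and nothing more, which is the limit of this rung: it says where memory grew, not why.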

Rung 2 — metrics and log lines around acquire/release. I instrumented the lifecycle. Counters on each side of acquire and release, log lines on the interesting transitions, names chosen carefully enough that the dashboards would still make sense in a week. The flow: form a hypothesis in staging, watch the metrics confirm it in production, ship a fix, watch them again. The data was flowing, but the analysis still lived in my head — stitching one answer out of three dashboards. And the counters could tell me a number went down; they couldn’t tell me whether I’d found all the places.
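A minimal sketch of what counters on each side of acquire and release might look like, using only the standard library. The `LifecycleCounters` type and its method names are assumptions for illustration; a real setup would export these values to whatever metrics system is in use.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical pair of counters for one resource type.
pub struct LifecycleCounters {
    acquired: AtomicU64,
    released: AtomicU64,
}

impl LifecycleCounters {
    pub const fn new() -> Self {
        Self {
            acquired: AtomicU64::new(0),
            released: AtomicU64::new(0),
        }
    }

    /// Call at every acquire site.
    pub fn on_acquire(&self) {
        self.acquired.fetch_add(1, Ordering::Relaxed);
    }

    /// Call at every release site.
    pub fn on_release(&self) {
        self.released.fetch_add(1, Ordering::Relaxed);
    }

    /// Acquires minus releases. Reading `released` first keeps the
    /// difference from going negative under concurrent updates.
    pub fn outstanding(&self) -> u64 {
        let released = self.released.load(Ordering::Relaxed);
        let acquired = self.acquired.load(Ordering::Relaxed);
        acquired.saturating_sub(released)
    }
}
```

Note what `outstanding()` returns: a count. It can show that something is still held, but not which thing, which is the gap the next rung closes.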

Rung 3 — an in-process lifecycle tracker. So I built one inside the process. A small map of what’s currently held, updated on acquire and release. Every so often, walk it and flag anything that’s been there too long. The question — “what’s still held?” — moved out of dashboards and into the program. The answer came back as a list, not a chart.
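A minimal sketch of such a tracker, assuming resources can be identified by a numeric ID and that the caller passes in the current time. `LifecycleTracker`, `overdue`, and the ID scheme are hypothetical names chosen for the example, not the code from the actual hunt.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-process map of what is currently held.
pub struct LifecycleTracker {
    held: HashMap<u64, Instant>,
    next_id: u64,
}

impl LifecycleTracker {
    pub fn new() -> Self {
        Self { held: HashMap::new(), next_id: 0 }
    }

    /// Records an acquisition and returns the ID to release later.
    pub fn acquire(&mut self, now: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.held.insert(id, now);
        id
    }

    /// A clean release removes the record entirely.
    pub fn release(&mut self, id: u64) {
        self.held.remove(&id);
    }

    /// Walks the map and returns the IDs held longer than `max_age`,
    /// in ascending order.
    pub fn overdue(&self, now: Instant, max_age: Duration) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .held
            .iter()
            .filter(|(_, since)| now.saturating_duration_since(**since) > max_age)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

Calling `overdue` on a timer gives the list the paragraph describes. Passing `now` in as a parameter is a design choice for the sketch: it keeps the walk deterministic and easy to test.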

Rung 4 — capture the callsite at acquire time. The last move: at acquire time, record the source location. The tracker already knew what was dangling; now it also knew where each thing had originated. When something didn’t release, the report named the file and line where its life began — the question I’d actually been asking, finally answered without me leaving the process.
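In Rust, one way to record the source location at acquire time is the standard `#[track_caller]` attribute together with `std::panic::Location::caller()`. The sketch below assumes that approach; the `CallsiteTracker` type and the report format are hypothetical, and the original tracker may have captured callsites differently.

```rust
use std::collections::HashMap;
use std::panic::Location;

/// Hypothetical tracker that remembers where each resource was acquired.
pub struct CallsiteTracker {
    held: HashMap<u64, &'static Location<'static>>,
    next_id: u64,
}

impl CallsiteTracker {
    pub fn new() -> Self {
        Self { held: HashMap::new(), next_id: 0 }
    }

    /// `#[track_caller]` makes `Location::caller()` report the file and
    /// line of the code that called `acquire`, not this function body.
    #[track_caller]
    pub fn acquire(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.held.insert(id, Location::caller());
        id
    }

    pub fn release(&mut self, id: u64) {
        self.held.remove(&id);
    }

    /// One line per resource still held, naming where its life began.
    pub fn report(&self) -> Vec<String> {
        let mut lines: Vec<String> = self
            .held
            .iter()
            .map(|(id, loc)| {
                format!("resource {} acquired at {}:{}", id, loc.file(), loc.line())
            })
            .collect();
        lines.sort();
        lines
    }
}
```

A `Location` is a static reference, so storing it per resource costs one pointer. That is cheap enough to leave on permanently rather than enabling it only during an incident.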

What the ladder is really showing

Look at the rungs together. Across them, the raw signal didn’t really grow — the interpretation moved. By the last rung, work I used to do in my head as an operator (correlate, judge anomalous, locate the origin) was code the process ran on itself.

That’s the pattern: the debugging logic we’d otherwise apply by hand, after the fact, can be encoded in the program — root cause analysis shifted left, into the producer. And once it is, most of the intermediate noise stops being worth emitting. We’re not filtering harder; the events that would have demanded interpretation simply aren’t produced — the program already drew the conclusion.

Why this didn’t pay off before

The cost lands on both ends of the pipeline — what you can afford to emit, and what someone can usefully read.

Emission side. The dominant pattern is to ship everything and filter on the other end. The bill scales with raw volume, not with what’s useful, and the filter has to decide what matters without knowing what you’ll be looking for. Teams that try to control cost by pruning at the source trade one failure mode for another: now the incident arrives in a corner you didn’t predict, and the data isn’t there.

Consumption side. Even if the bytes were free, the reader isn’t. A human can’t usefully read kilobytes of structured context per error. Dashboards and aggregates exist because raw context has to be reduced to something a person can scan — which means it has to be shaped to be reducible: bounded cardinality, regular fields, no surprises. A fat, irregular bundle per event has nowhere to land in that pipeline.

Accumulate, discard, emit

The move that breaks both sides of that cost is simpler than it sounds: keep the context in-process, throw it away on success, ship it only on error.

RAM is essentially free at the granularity of one request or one transaction. Monitoring bytes aren’t. So the bookkeeping happens locally (lineage, decisions, intermediate state) and is either discarded the moment the work completes or emitted as a single bundle when it doesn’t. The happy path adds nothing to the monitoring bill; its only cost is a little memory and CPU inside the process. The failures arrive pre-contextualized, so the operator’s first move isn’t stitching a story across dashboards; it’s reading the bundle.
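The accumulate, discard, emit cycle can be sketched in a few lines. This is an illustration under assumptions: `RequestContext`, `note`, and `finish` are hypothetical names, and the bundle is a plain string where a real system would likely emit a structured record.

```rust
use std::fmt::Display;

/// Hypothetical per-request context that lives only in process memory.
pub struct RequestContext {
    request_id: u64,
    notes: Vec<String>,
}

impl RequestContext {
    pub fn new(request_id: u64) -> Self {
        Self { request_id, notes: Vec::new() }
    }

    /// Accumulate: record a decision or intermediate state locally.
    pub fn note(&mut self, msg: impl Into<String>) {
        self.notes.push(msg.into());
    }

    /// Discard or emit: on success the context is dropped and nothing
    /// is returned; on error the notes come back as a single bundle.
    pub fn finish<T, E: Display>(self, outcome: &Result<T, E>) -> Option<String> {
        match outcome {
            Ok(_) => None,
            Err(e) => Some(format!(
                "request {} failed: {}\n{}",
                self.request_id,
                e,
                self.notes.join("\n")
            )),
        }
    }
}
```

Because `finish` takes `self` by value, the context cannot outlive the request: it is either dropped on success or turned into the bundle that gets shipped.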

The leak tracker from rung 3 was already this pattern in miniature. Every acquired resource lived in an in-process record from the moment it was created; on a clean release, the record vanished — free. Only the unreleased ones survived to become a report. The shape isn’t new; what’s been missing is a reason for it to be the default. That’s the next piece.

Why agents change the math

Both sides of that cost flip.

On the production side, the in-process logic is small, pattern-y, and previously the kind of thing that got skipped because nobody had time to write it. Lifecycle trackers, callsite captures, lineage records — mechanical to implement and tedious in volume. LLMs write exactly this kind of code well. The marginal cost of adding self-explanation to a system has dropped to roughly the cost of asking for it.

On the consumption side, the fat bundle that was unreadable to a human is exactly the right input for an agent. A few kilobytes of structured context per error is nothing to a model; if anything, more would help. Dashboard-shaped distillation isn’t the only landing pad anymore. An agent reads the bundle directly: it classifies the failure, points at the relevant code, and sometimes proposes the fix.

The frontier: lineage as the unit of context

The leak hunt tracked resources — pooled handles, buffers, whatever got acquired and released. But the same pattern generalizes to the things the system actually cares about: not software objects but conceptual ones. An order, a session, a batch job, an incoming request.

Each has a lineage — where it originated, what touched it, what decisions changed it as it moved through the system. That lineage is the kind of context that should accumulate in-process and only travel to the monitor when something fails. On error, what lands on the other end isn’t a stack trace plus correlated log lines; it’s the full history of the conceptual object that failed, packaged together.

Most systems already do a partial version of this. We emit logs with the IDs of what we’re working with (order ID, request ID, parent job ID) and reach for those IDs after the fact to reconstruct what happened. Keeping the whole lineage as a first-class object instead is harder, but it doesn’t have to live in the hot path. It can live in a sidecar, a sibling pod, somewhere adjacent: accumulate traces there in a sliding window, discard whatever ages out without an error, and ship the lineage only for the things that errored. The routine traces never leave the host, and what does leave carries the full context. The monitoring bill stops paying for routine work, and the operator’s time stops paying for stitching context after the fact.
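The sliding-window idea can be sketched as a small buffer keyed by the ID of the conceptual object. This is a minimal illustration, assuming events arrive in time order; `LineageWindow`, `record`, `evict`, and `lineage_of` are hypothetical names, and a real sidecar would also need transport and a storage bound.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical sliding window of lineage events, oldest first.
pub struct LineageWindow {
    window: Duration,
    events: VecDeque<(Instant, u64, String)>,
}

impl LineageWindow {
    pub fn new(window: Duration) -> Self {
        Self { window, events: VecDeque::new() }
    }

    /// Accumulate: append one step in the life of an entity
    /// (an order, a session, a batch job).
    pub fn record(&mut self, at: Instant, entity_id: u64, step: impl Into<String>) {
        self.events.push_back((at, entity_id, step.into()));
    }

    /// Discard: drop events older than the window. Nothing is shipped.
    pub fn evict(&mut self, now: Instant) {
        let window = self.window;
        while self
            .events
            .front()
            .map_or(false, |(at, _, _)| now.saturating_duration_since(*at) > window)
        {
            self.events.pop_front();
        }
    }

    /// Emit: on error, collect every retained step for the failed entity.
    pub fn lineage_of(&self, entity_id: u64) -> Vec<String> {
        self.events
            .iter()
            .filter(|(_, id, _)| *id == entity_id)
            .map(|(_, _, step)| step.clone())
            .collect()
    }
}
```

One trade-off is visible even in the sketch: a purely time-based window can evict the early steps of a long-lived entity before it fails, so the window length has to be chosen against how long the tracked objects actually live.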

An agent ingesting that artifact starts from a different place. It doesn’t have to reconstruct context from scattered signals — the context arrives as a single coherent thing, designed for the consumer.

Old pattern, new math

The primitives have been in shipped tooling for years — Sentry-style breadcrumbs, JFR, tail-based sampling, the various continuous profilers. What’s been rare is doing this at the application level, for the specific things a particular business cares about, and keeping it on as a standing capability instead of wheeling it out during an incident and turning it off again. There are real reasons. The bookkeeping has to coexist with the application’s domain logic. The schema has to evolve as the application does. A sidecar story has its own operational cost. Rich bundles raise real questions about what data should travel at all.

What changed is the activation energy. Writing the in-process logic is cheap because LLMs are good at exactly this kind of code. Consuming the rich output is cheap because agents read fat structured artifacts the way they’re meant to be read. Both ends got cheaper, simultaneously. The math becomes simple: the monitoring bill drops because routine events never leave the host, and operator time drops because the events that do leave arrive already triaged.

None of this is solved by agents on their own — there are holes in the theory I haven’t worked through. But the practice is now possible in a way it wasn’t.

Move the hunting logic into the system. The routine signal stops traveling, and the monitoring bill drops. What does travel is what the system flagged — and the signal-to-noise ratio improves with it.
