How an operations swarm finds root cause.
The common way to build an "AI SRE" is to hand the model a shell and let it poke at production until it forms a theory. That is fast to demo and dangerous to run: it is non-deterministic, it mutates the system it is supposed to diagnose, and the cost is unbounded. The operations swarm is built the opposite way — it gathers the evidence first, then reasons over a frozen snapshot.
The engine has three stages: prep, box-in, and orchestrate. The order is the whole point.
Stage 1 — Prep: assemble the evidence before reasoning
When an alarm fires, the first thing that runs is not the model. It is a set of extraction scripts that pull a fixed evidence set into a package — one file per source, plus an index of references the model can cite. For a Kubernetes/cloud incident that includes:
- Live and previous pod state — describe, logs, events, the node it landed on, and the owner chain up to the controller.
- Cloud facts — the underlying instance description and the IAM role attached to it.
- History — prior occurrences of this alarm, co-firing bursts around the same timestamp, and past root causes for the same signal.
- Topology — the one-hop graph neighborhood of the affected node across Kubernetes, cloud, and service tiers.
Because the evidence is captured up front, the model reasons over data, not over live commands it issues itself. The same incident produces the same evidence set every time, which is what makes the analysis reviewable and repeatable instead of a different adventure on each run.
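To make the "one file per source, plus an index" idea concrete, here is a minimal sketch of how a package could be assembled. It is an illustration under assumptions, not the engine's actual code: the `build_package` function, the `ev:<source>` reference format, the `.txt` file naming, and the content hash are all hypothetical choices made for this example.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical extractor type: takes the alarm payload, returns raw evidence text.
Extractor = Callable[[dict], str]


def build_package(alarm: dict, extractors: Dict[str, Extractor], out_dir: Path) -> dict:
    """Write one file per evidence source plus an index.json of citable references."""
    out_dir.mkdir(parents=True, exist_ok=True)
    index = {"alarm_id": alarm["id"], "refs": {}}

    # Iterate in sorted order so the index layout does not depend on dict ordering.
    for name in sorted(extractors):
        text = extractors[name](alarm)
        path = out_dir / f"{name}.txt"
        path.write_text(text, encoding="utf-8")

        # Assumed reference format: "ev:<source>". The hash lets a reviewer
        # confirm that the cited file is the one the model actually saw.
        index["refs"][f"ev:{name}"] = {
            "file": path.name,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }

    (out_dir / "index.json").write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )
    return index
```

In this sketch, running the same extractors over the same captured data yields an identical index, which is the property that makes a run reviewable after the fact.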
Evidence collection is cost- and shape-gated: each source declares whether it is on by default, how expensive it is, and which incident shapes it applies to. A noisy, cheap signal and an expensive deep pull are not treated the same.
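The gating described above can be sketched as a small declaration per source and a selection function. Everything named here is an assumption for illustration: the `EvidenceSource` fields, the integer cost scale, the budget, the opt-in set, and the example source and shape names are hypothetical, and the real engine may gate differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class EvidenceSource:
    name: str
    default_on: bool          # collected without anyone asking for it
    cost: int                 # relative cost units (hypothetical scale)
    shapes: FrozenSet[str]    # incident shapes this source applies to


def select_sources(
    sources: Iterable[EvidenceSource],
    shape: str,
    budget: int,
    opted_in: FrozenSet[str] = frozenset(),
) -> List[EvidenceSource]:
    """Pick applicable sources, cheapest first, without exceeding the budget."""
    chosen: List[EvidenceSource] = []
    spent = 0
    for src in sorted(sources, key=lambda s: (s.cost, s.name)):
        if shape not in src.shapes:
            continue  # shape gate: source is irrelevant to this incident
        if not (src.default_on or src.name in opted_in):
            continue  # off by default and nobody opted in
        if spent + src.cost > budget:
            continue  # cost gate: would exceed the budget
        chosen.append(src)
        spent += src.cost
    return chosen


# Hypothetical declarations for a Kubernetes/cloud incident.
EXAMPLE_SOURCES = [
    EvidenceSource("pod_logs", True, 1, frozenset({"pod_crash", "node_pressure"})),
    EvidenceSource("heap_dump", False, 8, frozenset({"pod_crash"})),
    EvidenceSource("iam_role", True, 2, frozenset({"access_denied"})),
]
```

With these declarations, a `pod_crash` incident with a budget of 10 collects only `pod_logs` by default. The expensive `heap_dump` joins only when it is explicitly opted in and still fits the budget.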
Stage 2 — Box-in: harden the scope
Given the evidence, the second stage constrains the problem before the expensive reasoning runs — narrowing to the entities, the window, and the failure mode that actually matter. This is the step that keeps the final analysis from wandering: the model is handed a bounded problem with the relevant facts already attached, not an open-ended "figure out production."
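One way to picture the box-in step is as a scope object applied as a filter over the collected evidence. This is a rough sketch under assumptions: the `Scope` and `EvidenceItem` types and their fields are hypothetical, and how the real stage derives the scope is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class EvidenceItem:
    ref: str              # citable reference, e.g. "ev:pod_logs" (assumed format)
    entity: str           # the pod, node, or service the item is about
    timestamp: datetime


@dataclass(frozen=True)
class Scope:
    entities: FrozenSet[str]  # entities that matter for this incident
    start: datetime           # beginning of the window (inclusive)
    end: datetime             # end of the window (inclusive)
    failure_mode: str         # suspected failure mode, carried along as context


def box_in(items: Iterable[EvidenceItem], scope: Scope) -> List[EvidenceItem]:
    """Keep only evidence about in-scope entities inside the time window."""
    return [
        item
        for item in items
        if item.entity in scope.entities and scope.start <= item.timestamp <= scope.end
    ]
```

The reasoning stage would then receive the filtered items together with the scope itself, so the bounded problem and its supporting facts travel as one unit.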
Stage 3 — Orchestrate: produce a root cause you can act on
Only now does the orchestration stage run the analysis, and its output is structured, not a paragraph of speculation: a root cause, an impact assessment, remediation steps, and a rollback plan — each tied back to the evidence references gathered in stage one. That traceability is what lets a human trust it at 3 a.m., and what lets the routine cases be remediated automatically within policy while the genuinely novel ones escalate with their homework already done.
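The shape of that output, and the check that every part of it cites real evidence, can be sketched as follows. The type names, the `cause_code` field, and the allowlist-style policy are assumptions made for this example rather than the engine's actual schema.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_refs: List[str]  # references from the stage-one index


@dataclass(frozen=True)
class Finding:
    cause_code: str           # hypothetical short label used for the policy lookup
    root_cause: Claim
    impact: Claim
    remediation: List[Claim]
    rollback: List[Claim]


def citation_problems(finding: Finding, known_refs: Set[str]) -> List[str]:
    """List every claim that cites nothing or cites a reference not in the index."""
    problems: List[str] = []
    claims = [finding.root_cause, finding.impact, *finding.remediation, *finding.rollback]
    for claim in claims:
        if not claim.evidence_refs:
            problems.append(f"no evidence cited: {claim.text}")
        for ref in claim.evidence_refs:
            if ref not in known_refs:
                problems.append(f"unknown reference {ref}: {claim.text}")
    return problems


def route(finding: Finding, known_refs: Set[str], auto_policy: Set[str]) -> str:
    """Auto-remediate only fully cited findings whose cause the policy allows."""
    if citation_problems(finding, known_refs):
        return "escalate"
    return "auto_remediate" if finding.cause_code in auto_policy else "escalate"
```

In this sketch a finding with any untraceable claim always escalates, regardless of policy, so automation is only ever applied to conclusions a reviewer could trace back to the package.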
Why it is built on the connector runtime
The alarms that start this loop are pulled from your existing tools — ServiceNow, Splunk, and observability platforms — through the same per-user brokered connectors the rest of the platform uses, on a schedule or on demand. There is no global admin credential and no separate push pipeline to keep alive. Correlation and topology run on the platform's graph tier, scoped per caller. The result is an operations capability that sits on the same rails — isolation, cost capture, review gates — as every other application.
Honest status. The three-stage root-cause engine, the operator toolchain, and the underlying connectors are real and proven assets. The always-on, unified operations app — pull-based intake across normalized connectors, the Dynatrace/Datadog observability connectors, and the operations cockpit — is in active build (architecture decided, ADR-069). The hard part exists; the productization for mid-market deployment is the current work.