
Watching the Watcher

Observability when your system thinks for itself

Cleo's Team · 3 min read

In traditional software, observability answers a straightforward question: what happened? A request came in, these functions executed, this data was read and written, this response went out. The system is deterministic. Given the same inputs, you get the same trace.

AI systems break this model. The same input can produce different outputs. The system makes decisions that are not encoded in your source code. Tool calls are selected at runtime based on reasoning you did not write. The trace of "what happened" is no longer sufficient - you need to understand what the system decided and why.

What we instrument

We instrument at three levels. First, the conversation level: what the user said, what stage the conversation is in, what context was assembled. Second, the model level: which model was called, how many tokens were consumed, how long inference took, what tool calls were selected. Third, the tool level: which service was invoked, what parameters were passed, whether it succeeded, how long it took.
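To make the three levels concrete, here is a minimal sketch of what each level's event record might look like. Every class and field name is an illustrative assumption, not Cleo's actual schema:

```python
# Hypothetical event records, one per instrumentation level. All names
# here are assumptions for illustration, not Cleo's real schema.
from dataclasses import dataclass


@dataclass
class ConversationEvent:
    """Conversation level: stage and context shape, never user content."""
    conversation_id: str
    stage: str                  # e.g. "planning", "execution"
    context_sources: list[str]  # names of assembled sources only


@dataclass
class ModelEvent:
    """Model level: which model ran, what it cost, what it chose."""
    conversation_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    inference_ms: float
    tool_calls: list[str]       # tool names, in the order selected


@dataclass
class ToolEvent:
    """Tool level: which service was invoked and how it behaved."""
    conversation_id: str
    tool: str
    param_keys: list[str]       # parameter names, never values
    success: bool
    duration_ms: float
```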

None of these levels capture sensitive content. We log the shape of the interaction without logging the substance. We know the AI called the email composition service with certain parameters and that it took three hundred milliseconds. We do not log the email body or the subscriber list.
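One way to log the shape without the substance is to reduce every payload to types and sizes before it reaches the log pipeline. A sketch, with an invented helper name and output format:

```python
# Reduce tool-call parameters to a loggable "shape": types and sizes,
# never values. Helper name and output format are illustrative.
def shape_of(params: dict) -> dict:
    return {
        key: {"type": type(value).__name__, "chars": len(str(value))}
        for key, value in params.items()
    }


# An email composition call would log something like:
#   {"subject": {"type": "str", "chars": 11},
#    "body": {"type": "str", "chars": 14}}
# The subject line and body text never leave the service.
print(shape_of({"subject": "Launch day!", "body": "Hi everyone..."}))
```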

The decision trace

The most valuable observability data in an AI system is the decision trace - the sequence of tool calls the model chose and in what order. This tells us how the AI interpreted the user's request. If a user asks to "set up the product launch" and the AI calls content creation, then email composition, then social scheduling, that sequence reveals the AI's understanding of the task.
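A decision trace can be as simple as an ordered list of tool names attached to the request that produced them. A sketch using the product-launch example above; the tool names come from the prose, the class itself is invented:

```python
# Sketch of a decision trace: the ordered tool calls the model chose
# for one request. The class and method names are illustrative.
from dataclasses import dataclass, field


@dataclass
class DecisionTrace:
    request: str
    steps: list[str] = field(default_factory=list)

    def record(self, tool: str) -> None:
        self.steps.append(tool)


trace = DecisionTrace(request="set up the product launch")
trace.record("content_creation")
trace.record("email_composition")
trace.record("social_scheduling")

# The sequence itself is the signal: it shows how the model decomposed
# the task, independent of what any single tool returned.
print(" -> ".join(trace.steps))
```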

When the sequence is wrong - when the AI skips a step or takes an unexpected path - the decision trace is what lets us diagnose the issue. Usually the problem is not in the model's reasoning but in the context that was assembled. The AI made a reasonable decision given what it knew; it just did not know the right things.

Latency as quality signal

In AI systems, latency is more than a performance metric. It is a quality signal. If context assembly takes longer than expected, the retrieval system may be pulling too much data. If a tool call takes unusually long, the underlying service may be under strain. If first-token latency spikes, the prompt may be oversized.

We track latency at every boundary and alert on anomalies - not because users notice a two-hundred-millisecond regression, but because latency changes often indicate architectural issues that will affect output quality before they affect perceived speed.
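A sketch of what "alert on anomalies" could mean at each boundary: keep a rolling window of samples and flag anything several standard deviations above the mean. The window size, baseline minimum, and threshold are invented defaults, not Cleo's tuning:

```python
# Rolling-window latency anomaly detection per boundary. The constants
# below are illustrative defaults, not tuned production values.
import random
import statistics
from collections import defaultdict, deque

WINDOW = 200        # samples kept per boundary
SIGMA = 3.0         # alert threshold in standard deviations
MIN_BASELINE = 30   # samples required before judging anything

_samples: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))


def observe(boundary: str, latency_ms: float) -> bool:
    """Record one latency sample; return True if it looks anomalous."""
    history = _samples[boundary]
    anomalous = False
    if len(history) >= MIN_BASELINE:
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        anomalous = stdev > 0 and latency_ms > mean + SIGMA * stdev
    history.append(latency_ms)
    return anomalous


# Warm the baseline with typical samples, then feed an outlier.
random.seed(0)
for _ in range(50):
    observe("context_assembly", random.gauss(120.0, 10.0))
if observe("context_assembly", 840.0):
    print("alert: context_assembly latency anomaly")
```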

The philosophical shift

Traditional observability helps you fix systems. AI observability helps you understand systems. The distinction is subtle but important. You are not just watching for errors - you are watching for decisions that indicate the system's understanding has drifted from your intentions.

When your system thinks for itself, watching it think is not optional.

- Cleo's Team


Written by Cleo's Team

Building Cleo, an AI marketing operating system. These posts cover the architecture decisions, technical challenges, and lessons learned along the way.
