Operating Active Graph¶
This document is for operators: people responsible for running an Active Graph runtime as part of a system other people depend on. The README is for developers writing behaviors. The audience is different and so is this document.
If you are evaluating Active Graph, read the README first. If you have a behavior that doesn't run on your machine, the README will help. If you have a behavior that runs fine on your machine but you need to put it somewhere a team can rely on it, you are in the right place.
The companion example is examples/operate_a_run.py.
Read it alongside this guide — every CLI command and library call shown
here appears there. If the two ever disagree, the example is right.
The operator surface¶
The framework treats the boundary between itself and the world it runs in as a load-bearing contract. Five primitives compose that surface; together they make a run inspectable, observable, and recoverable without reading source code:
- Postgres as a second
EventStore, behind the same protocol as SQLite. Same schema, same semantics, different driver. - Structured logging with a documented JSON schema. One log line
per event, every line carries
run_id/event_idwhen applicable. - Metrics: a three-method
Metricsprotocol with aNoOpMetricsdefault and a referencePrometheusMetricsimplementation. The runtime emits a fixed, documented set of counters, histograms, and gauges. Custom backends (OpenTelemetry, Datadog, statsd) implement the protocol — three methods. activegraphCLI:inspect,replay,fork,diff,export-trace,migrate,pack,quickstart. The CLI is a thin wrapper around library APIs; anything it does, programmatic callers can do too.- Runtime introspection:
runtime.status(recent=N)returns a frozen snapshot of queue depth, budget remaining, registered behaviors, recent events, and current frame. The CLI'sinspectcommand sits on top of this primitive.
The operator surface introduced in v0.8 has been extended additively
since: v0.9 added the pack format (and activegraph pack for
listing/scaffolding); v1.0 added per-error reference pages
(Reference: Errors)
that every error message links to, plus operator-targeted CLI
follow-on flags (inspect --event <id> for divergence triage,
inspect --behaviors for replay length mismatches, inspect
--pack-version for prompt-hash audits, migrate --skip-corrupted
for corrupted-payload recovery, fork --record for intentional
re-recording).
What the framework deliberately does not ship: a web UI, an HTTP server, a distributed runtime, real-time subscriptions, multi-model LLM routing, or streaming LLM responses. The framework is small, sharp, and operable. Plug in adapters at the boundaries where you need them.
Persistence: SQLite vs Postgres¶
SQLite is the default and the right answer for solo work, demos, ephemeral runs, and most single-machine production cases. The event log fits in one file, WAL mode gives you crash-safe writes, and you have no operational dependencies.
Postgres is the right answer when:
- More than one process or machine needs to inspect a run (the operator on a laptop, a dashboard, a CLI on a jump box, a CI job).
- You already operate Postgres and want one fewer storage system.
- You want to put the JSONB column behind a read replica or pipe it into your data warehouse.
Both stores conform to the same EventStore protocol. The runtime,
the CLI, and every library API treat them identically. Migration is
one-directional and explicit (see below).
Connection URLs¶
Stores are addressed by URL throughout the framework — runtime, CLI, library APIs. The schemes follow the SQLAlchemy convention:
sqlite:///relative/path.db(three slashes — relative path)sqlite:////absolute/path/to/run.db(four slashes — absolute path; the leading/of the absolute path adds the fourth slash)postgres://user:password@host:port/dbnamepostgresql://user:password@host:port/dbname(same scheme)
A path with no scheme is an error. The framework will not guess.
activegraph inspect run.db will fail with a message pointing here.
Use sqlite:///run.db (relative) or sqlite:////tmp/run.db (absolute).
Postgres setup¶
# Postgres 16 or newer, anywhere reachable from the runtime.
createdb activegraph_prod
# Schema is created lazily on first connection. No migration step.
pip install 'activegraph[postgres]' # pulls psycopg>=3.1,<4
The first time the runtime opens a Postgres URL it issues
CREATE TABLE IF NOT EXISTS for events, runs, and meta,
mirroring the SQLite schema with Postgres-native types
(BIGSERIAL, JSONB, TIMESTAMPTZ). Schema version is stored in
meta and verified on every open. A schema version mismatch is a
hard error — the runtime refuses to operate on a log it does not
understand.
Connection management¶
PostgresEventStore accepts:
- A URL string. The store opens a single dedicated connection.
- A
psycopg.Connectionyou already have. The store does not own its lifecycle — you must close it. - A
psycopg_pool.ConnectionPool. The store willgetconn()/putconn()around each operation.
For production, pass a pool. The framework does not ship its own pool because we are not in a position to make tuning decisions for your deployment.
import psycopg_pool
from activegraph.store.postgres import PostgresEventStore
pool = psycopg_pool.ConnectionPool(
conninfo="postgres://localhost/activegraph_prod",
min_size=2,
max_size=10,
)
store = PostgresEventStore(pool, run_id="run_01J...")
Migration (transaction-per-run)¶
Migration semantics:
- Each run in the source migrates in a single transaction against the destination. If a run fails partway, that run's destination state is unchanged (Postgres rolls back).
- Migration is idempotent at the event level: writes use
INSERT ... ON CONFLICT DO NOTHINGagainst theUNIQUE(id, run_id)index. Re-running migration after a partial failure resumes safely. - Runs are migrated independently. A bad run does not block the others.
- The default migrates all runs in the source. To pick one:
--run-id <id>. - A per-run report is printed at the end (machine-readable with
--json). Each entry is{run_id, status, events_migrated, error?}. The CLI exit code is non-zero iff any run failed. - Migration is not bidirectional. There is no
syncmode and no rollback. To go back, migrate the other direction.
When migration is the right tool: you are graduating a run from a laptop SQLite file to a shared Postgres database, or moving a historical archive between Postgres instances. When it is the wrong tool: you are trying to keep two stores in sync. Don't.
Structured logging¶
The framework emits structured logs through stdlib logging. It does
not auto-configure logging on import — a library that does is hostile
to operators who have already configured their own. By default the
framework logs to logging.getLogger("activegraph") and lets your
config handle the rest.
If you want the opinionated setup:
from activegraph.observability import configure_logging
configure_logging(level="INFO", json_output=True)
That installs a JSON formatter on the activegraph logger hierarchy.
Every log line becomes one JSON object on one line, suitable for
ingestion by Loki, Splunk, BigQuery, Cloud Logging, or any other line-
oriented log aggregator.
Log schema¶
Every line is a JSON object. These fields appear when applicable. Fields that don't apply are omitted, not nulled:
| Field | Type | When |
|---|---|---|
timestamp |
string | always (ISO 8601, UTC) |
level |
string | always (DEBUG / INFO / WARNING / ERROR / CRITICAL) |
logger |
string | always (e.g. activegraph.runtime) |
message |
string | always |
run_id |
string | any log line associated with a specific run |
event_id |
string | log lines about a specific event |
behavior |
string | log lines about a specific behavior invocation |
tool |
string | log lines about a tool invocation |
model |
string | log lines about an LLM call |
cache_hit |
bool | LLM/tool calls; true if served from cache |
cost_usd |
string | LLM calls that incurred cost (Decimal-as-string) |
latency_seconds |
number | LLM/tool/behavior calls with measured latency |
reason |
string | failure log lines (see reason taxonomy) |
error_type |
string | failure log lines |
error_message |
string | failure log lines |
The schema is the operator contract. Dashboards built against these field names will keep working across framework versions. Breaking the schema is a breaking change.
Level discipline¶
| Level | What |
|---|---|
| DEBUG | View construction, prompt assembly, cache lookup, queue ops |
| INFO | Every event emitted, every behavior invoked, every tool call |
| WARNING | Budget approaching limits, retries, pattern eval slowness |
| ERROR | behavior.failed with non-budget reasons |
| CRITICAL | Event log inconsistency, schema mismatch, replay divergence |
INFO is a high-volume stream in any active run. Operators typically filter at WARNING for production dashboards and crank to DEBUG when debugging.
There are no print() calls anywhere in the framework. The trace
printer (runtime.print_trace()) is a developer tool, not an
operator tool — it prints to stdout, does not log, and is independent
of the logging configuration.
Payload redaction¶
LLM behaviors include rendered prompts in DEBUG logs. Tool responses include their full payloads. Goals can contain anything the user typed. If your environment requires redaction (PII, secrets, customer data):
def redact(payload: dict) -> dict:
return {k: ("<redacted>" if k == "email" else v) for k, v in payload.items()}
configure_logging(level="INFO", json_output=True, payload_redactor=redact)
The redactor runs on every payload that would otherwise appear in a log message. It does not affect the event log itself — the source of truth keeps the original. Redaction is a logging concern.
Metrics¶
The framework emits metrics through a three-method Metrics protocol:
class Metrics(Protocol):
def counter(self, name: str, tags: dict[str, str], value: float = 1.0) -> None: ...
def histogram(self, name: str, tags: dict[str, str], value: float) -> None: ...
def gauge(self, name: str, tags: dict[str, str], value: float) -> None: ...
That's it. Three methods. No timers (use a histogram with a latency value). No summaries (Prometheus-specific). No custom types.
from activegraph.observability import PrometheusMetrics
rt = Runtime(graph, metrics=PrometheusMetrics())
The default is NoOpMetrics, which does nothing. The runtime is
fully functional with no metrics configured.
PrometheusMetrics lazy-imports prometheus_client. Install with
pip install 'activegraph[prometheus]'.
For OpenTelemetry, Datadog, statsd, or anything else: write a class with three methods. We do not ship adapters.
Standard metrics¶
These metrics are emitted by the runtime. Names follow Prometheus
conventions (snake_case with underscores, _total for counters,
_seconds for duration histograms, _usd for cost histograms). They
are the operator contract: dashboards built against these names
keep working across framework versions.
| Name | Type | Tags |
|---|---|---|
activegraph_events_emitted_total |
counter | event_type |
activegraph_behaviors_invoked_total |
counter | behavior |
activegraph_behaviors_failed_total |
counter | behavior, reason |
activegraph_behaviors_duration_seconds |
histogram | behavior |
activegraph_llm_calls_total |
counter | model |
activegraph_llm_cache_hits_total |
counter | model |
activegraph_llm_failed_total |
counter | model, reason |
activegraph_llm_tokens_in |
histogram | model |
activegraph_llm_tokens_out |
histogram | model |
activegraph_llm_cost_usd |
histogram | model |
activegraph_tools_calls_total |
counter | tool |
activegraph_tools_cache_hits_total |
counter | tool |
activegraph_tools_failed_total |
counter | tool, reason |
activegraph_tools_duration_seconds |
histogram | tool |
activegraph_queue_depth |
gauge | (none) |
activegraph_budget_cost_remaining_usd |
gauge | run_id |
activegraph_budget_events_remaining |
gauge | run_id |
activegraph_patterns_evaluated_total |
counter | (none) |
activegraph_patterns_evaluation_duration_seconds |
histogram | (none) |
activegraph_replay_divergence_detected_total |
counter | reason |
Adding a metric is a public API change. The list is documented and test-pinned. New metrics get added in named releases, not silently.
Cardinality rule (locked)¶
run_idMAY appear as a tag on gauges of active state (where cardinality is bounded by the number of concurrently active runs).run_idMUST NOT appear as a tag on counters or histograms.
This rule prevents the most common Prometheus operational disaster: unbounded cardinality from per-run labels accumulating forever. The budget gauges are the only exception, and they live only for the duration of a run.
The conformance suite enforces this rule against the standard metric
list. If you implement a custom Metrics backend, do the same.
Tag conventions¶
Standard tag keys are: event_type, behavior, tool, model,
reason, run_id (gauges only). Boolean tags (cache_hit is
modeled as a separate counter rather than a tag — see
activegraph_llm_cache_hits_total). If your backend distinguishes
booleans from strings, you won't have to special-case.
Custom tags beyond the standard set are fine but may explode cardinality. The cardinality rule above is your guide.
Runtime introspection¶
runtime.status(recent: int = 20) returns a RuntimeStatus — a
frozen dataclass. Calling it is cheap: no graph traversal, no event
log scan. It is safe to call from any thread.
status = rt.status()
print(status.run_id, status.state, status.queue_depth)
for ev in status.recent_events:
print(ev.id, ev.type)
Shape:
@dataclass(frozen=True)
class RuntimeStatus:
run_id: str
state: Literal["idle", "running", "stopped", "exhausted"]
queue_depth: int
events_processed: int
budget: BudgetSnapshot
frame: FrameSnapshot | None
registered_behaviors: list[BehaviorInfo]
recent_events: list[EventSummary]
recent_events length is recent (default 20). The CLI's
inspect --tail N passes through to this argument.
There is no last_error field. Errors are events. Filter
recent_events for type behavior.failed, or query the event store
directly for a window-independent view.
CLI¶
The activegraph binary is a thin wrapper around library APIs. Every
subcommand calls into Python; nothing is implemented in the CLI itself.
A programmatic user can do everything the CLI does.
activegraph inspect <url> [--run-id <id>] [--tail N] [--json]
[--event <id> | --behaviors | --pack-version]
activegraph replay <url> --run-id <id> [--json]
activegraph fork <url> --run-id <id> --at-event <id> --label <label>
[--to <url>] [--record] [--json]
activegraph diff <url> --run-a <id> --run-b <id> [--json]
activegraph export-trace <url> --run-id <id> [--format text|jsonl] [-o PATH]
activegraph migrate --from <url> --to <url> [--run-id <id>] [--skip-corrupted] [--json]
activegraph pack list
activegraph pack new <name>
--event, --behaviors, and --pack-version on inspect are
mutually-exclusive selectors that focus the output on one section
instead of the full snapshot. See the
CLI reference for the full surface; the
debugging cookbook walks through
diagnostic workflows that build on these flags.
Exit codes¶
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Generic error |
| 2 | Usage error (bad arguments, missing options) |
| 3 | Not found (run id does not exist, file does not exist) |
| 4 | Corruption (schema version mismatch, event log inconsistency) |
| 5 | Divergence (replay-strict failure) |
These are documented contract. Wrap the CLI in shell scripts or CI jobs against these codes.
inspect¶
Default-mode prints a human-readable snapshot of the most recent run
in the store (or --run-id <id> for a specific run). --json prints
the same data as a single JSON object — the same shape as the
RuntimeStatus returned by the library.
activegraph inspect sqlite:///run.db
activegraph inspect postgres://localhost/agdb --run-id run_01J... --tail 50 --json
Three v1.0 selectors focus the output on one section instead of the full snapshot:
--event <id>prints the full payload of one event. Used when an error message names an event id —ReplayDivergenceErroralways does.--behaviorsprints only the registered-behaviors section. Used when diagnosing a replay length mismatch: compare which behaviors fire now against which fired in the recorded run.--pack-versionprints everypack.loadedevent in the run with prompt content-hash summaries. Used to confirm the pack version (or prompt drift) responsible for a divergence.
The three are mutually exclusive — they're selectors, not filters.
replay¶
Opens the store, rebuilds the graph by replaying the log (no behaviors fire), and prints a summary: event count, object count, relation count. Useful for sanity-checking a run after a crash or after a migration.
fork¶
Creates a new run by copying events from --run-id up to and
including --at-event. --to <url> defaults to the source store;
pass a different URL to fork across stores. Prints the new run id
and the number of events copied.
activegraph fork sqlite:///run.db \
--run-id run_01J... \
--at-event evt_42 \
--label investigate-alternative-thesis
The forked run is dormant — nothing is running it. To continue from
the fork point, load it with Runtime.load(url, run_id=<new_run_id>)
and call run_until_idle().
Pass --record to mark the fork as an intentional re-recording
(used after a ReplayDivergenceError when the divergence was
intentional). The flag appends -recording to the label and prints
follow-on guidance. The fork-with-pack-setting-override workflow
(the canonical recipe behind --set, landing in v1.1) is documented
under Cookbook: common patterns — Fork with a pack-setting
override.
diff¶
Structural diff between two runs in the same store. Prints shared and
divergent event counts, divergent objects, and divergent relations.
The library equivalent is parent.diff(other).
export-trace¶
Dump a run's event log to a file or stdout.
--format text(default) — the human-readable trace printer output.--format jsonl— one JSON event per line. Suitable for ingestion by any log aggregator.
migrate¶
See Migration above. The v1.0
--skip-corrupted flag lets a migration recover the readable subset
when source events have corrupted JSON payloads instead of failing
the whole run; the skipped event ids appear in the per-run report's
skipped_events. The resulting destination run is partial — the
operator is on notice.
Runbook¶
A run is stuck¶
Call runtime.status() (or activegraph inspect). Check state:
idle— the queue is empty, the budget is fine, the run is waiting for new input. This is the normal terminal state for a goal-driven run. Not stuck.exhausted— the run hit a budget limit. Thebudgetfield shows which dimension. Raise the limit or accept the partial result.running— the run is actually working.queue_depthshould be decreasing. If it's increasing or steady, a behavior is producing events faster than the runtime processes them. Check the trace.stopped— the runtime is loaded but norun_until_idle()call is in progress. Call it.
A run is over budget¶
runtime.status().budget shows used vs. limits across dimensions
(events, behavior calls, LLM calls, tool calls, cost, depth, seconds).
The activegraph_budget_* gauges expose the same data. Set up an
alert on cost_remaining_usd < threshold to catch runs before they
exhaust.
To resume a budget-exhausted run with a higher limit:
Replay diverges¶
You called Runtime.load(..., replay_strict=True) and got
ReplayDivergenceError. The runtime's re-execution of recorded
behaviors produced different events than the log. Causes, in order
of likelihood:
- A behavior reads from a non-deterministic source (clock,
random, network) it didn't read on the original run. - A behavior depends on a value (an LLM response, a tool result) that was cached on the original run but no longer is.
- The framework version changed and an event payload shape changed. Check the v0.8 schema-mismatch guard caught this; if not, file an issue.
The error pins the offending event id. Look at it. The fix is in your behavior, not the framework.
Postgres connection saturated¶
You are passing a psycopg.Connection per request. Use a
psycopg_pool.ConnectionPool and pass that instead. The framework
calls getconn() / putconn() around each operation.
Trace lines do not appear in my log aggregator¶
runtime.print_trace() prints to stdout. It is not a log. To get
events into your aggregator:
- Use
activegraph export-trace --format jsonlad-hoc. - Or write a behavior that subscribes to the event types of interest and emits a structured log record. The framework's logging will carry it through.
Capacity planning¶
These are reference numbers from a single Postgres 16 instance on commodity hardware. They are not benchmarks; they are the order of magnitude an operator should expect.
- Event log writes: a single connection sustains a few thousand events per second. With a pool of 10 and writes spread across runs, tens of thousands per second is achievable.
- Event log reads: replay of a 100k-event run on a warm cache takes single-digit seconds. Plan for that on a cold start.
- Storage: ~1-2 KB per event in JSONB form (including indexes). A million-event run is around 1.5 GB.
- Run concurrency: bounded by your connection pool size, not by the framework. The runtime itself is single-threaded.
If your runs are big enough that any of this is a concern, the framework's single-process design is the next constraint you will hit. That is a v1.0+ conversation.
What this guide is not¶
This guide will not tell you how to set up Postgres, configure Prometheus, or operate Grafana. Those are well-documented elsewhere and the framework's integration with them is intentionally generic.
This guide will not recommend SLOs, alerts, or dashboards. Your business context determines those. The metrics list above is the foundation; what you build on top is yours.
This guide will not stay current with every release. The locked contracts — log schema, metric names, exit codes, status shape — will. Examples may drift; the contracts will not.