Observability
Production telemetry runs on three legs: Sentry for errors, OpenTelemetry traces for the hot path, Horizon for queue health. The admin Site Health pill (see Site health & failed jobs) summarizes them at a glance.
Sentry
Set SENTRY_DSN and unhandled exceptions across the app
flow into Sentry with stack traces, request context, and user/workspace
metadata. The breadcrumb trail captures the last 100 log lines for
every error. Integrated via sentry-laravel.
Useful filters in Sentry:
- Tag workspace_id to scope errors to a tenant.
- Tag agent_id when the error originates in a widget request.
- Release tags match the deploy commit SHA, which makes it easy to bisect when a regression appears.
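How these tags get attached is not shown here. One plausible approach, sketched below, is a middleware that sets them on the Sentry scope through the SDK's configureScope helper. The class name TagSentryScope and the workspace_id / agent_id request attributes are hypothetical; the app may resolve the tenant and agent differently.

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Sentry\State\Scope;

use function Sentry\configureScope;

// Hypothetical middleware: attaches tenant and agent tags to the Sentry
// scope so every event captured during this request carries them.
class TagSentryScope
{
    public function handle(Request $request, Closure $next)
    {
        // Assumption: earlier middleware stored these IDs as request attributes.
        $workspaceId = $request->attributes->get('workspace_id');
        $agentId = $request->attributes->get('agent_id');

        configureScope(function (Scope $scope) use ($workspaceId, $agentId): void {
            // Sentry tag values must be strings, so cast explicitly.
            if ($workspaceId !== null) {
                $scope->setTag('workspace_id', (string) $workspaceId);
            }
            if ($agentId !== null) {
                $scope->setTag('agent_id', (string) $agentId);
            }
        });

        return $next($request);
    }
}
```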
OpenTelemetry traces
The OTEL exporter ships traces to OTEL_EXPORTER_OTLP_ENDPOINT
— typically Honeycomb or Grafana Cloud Tempo. Spans wrap the hot path:
- widget.message.receive: incoming HTTP, validation, JWT verify.
- rag.curated.match: short-circuit check.
- rag.embed: query embedding call.
- rag.vector.search: ANN search.
- rag.rerank: cross-encoder.
- rag.prompt.assemble: local CPU work.
- rag.llm.first_token: time-to-first-token (the headline metric).
- rag.llm.stream: full stream duration.
- rag.persist.async: post-stream save.
Each span is tagged with workspace_id, agent_id, conversation_id,
provider (cloudflare / openai), and any cache-hit flags. The big one
is p95 of rag.llm.first_token — that's
your hot-path SLO.
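A span around one of the hot-path steps might look like the sketch below, which uses the OpenTelemetry PHP API. The traced() helper, the tracer name app.rag, and the attribute values are illustrative assumptions; the span and tag names come from this page, but the instrumentation code itself is not shown in the source.

```php
<?php

use OpenTelemetry\API\Globals;
use OpenTelemetry\API\Trace\StatusCode;

/**
 * Hypothetical helper: runs $work inside a named span and tags it.
 *
 * @param array<string, string|int|bool> $attributes
 */
function traced(string $name, array $attributes, callable $work): mixed
{
    // 'app.rag' is an assumed instrumentation name.
    $tracer = Globals::tracerProvider()->getTracer('app.rag');
    $span = $tracer->spanBuilder($name)->startSpan();
    $scope = $span->activate();

    try {
        foreach ($attributes as $key => $value) {
            $span->setAttribute($key, $value);
        }

        return $work();
    } catch (Throwable $e) {
        // Record the failure on the span before rethrowing.
        $span->recordException($e);
        $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());

        throw $e;
    } finally {
        // Always detach the scope and end the span, even on failure.
        $scope->detach();
        $span->end();
    }
}

// Usage: wrap the vector search step. The IDs are placeholders.
$hits = traced('rag.vector.search', [
    'workspace_id' => 'ws_123',
    'agent_id' => 'agent_456',
    'conversation_id' => 'conv_789',
    'provider' => 'cloudflare',
], fn () => []); // the closure would run the real ANN search
```

Ending the span in a finally block keeps durations honest when a step throws, which matters for a latency SLO.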
Horizon
/horizon is the queue dashboard. Required for production —
without it, you're blind to backlogs. Watch:
- Wait time: how long jobs sit before being picked up. Healthy is < 1s; investigate > 10s.
- Throughput: jobs/min by queue.
- Failed jobs: anything that lands in failed_jobs shows here too.
Queues to monitor:
| Queue | What's on it |
|---|---|
| default | Misc: usage events, gap detection, audit logs, webhook deliveries. |
| crawl | CrawlSourceJob, CrawlPageJob, IngestNotionPageJob, IngestGoogleDocJob. Tends to have the longest queue depth. |
| index | IndexDocumentJob, IndexTextSourceJob. Embedding-heavy. |
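Horizon can flag slow queues on its own: the waits option in config/horizon.php sets, per connection and queue, the number of seconds after which Horizon reports a long wait. The fragment below is a sketch that assumes the Redis connection is named redis and reuses the 10-second "investigate" threshold for all three queues; the crawl queue may warrant a looser value.

```php
<?php

// Fragment of config/horizon.php (other keys omitted).
return [
    // Keys are "connection:queue"; values are seconds.
    'waits' => [
        'redis:default' => 10,
        'redis:crawl' => 10,
        'redis:index' => 10,
    ],
];
```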
Logs
Standard Laravel logging. Default channels:
- stdout: captured by Laravel Cloud / Docker.
- sentry: error level and above.
- slack: critical level, posts to the ops channel.
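Channels like these are usually combined in a stack in config/logging.php. The fragment below is a sketch, not the app's actual file: the stack layout, the stdout channel definition, and the LOG_SLACK_WEBHOOK_URL variable follow Laravel's conventions and are assumptions here, while the sentry driver is the one registered by sentry-laravel.

```php
<?php

use Monolog\Handler\StreamHandler;

// Fragment of config/logging.php (other keys omitted).
return [
    'default' => env('LOG_CHANNEL', 'stack'),

    'channels' => [
        'stack' => [
            'driver' => 'stack',
            'channels' => ['stdout', 'sentry', 'slack'],
            'ignore_exceptions' => false,
        ],

        // Everything goes to stdout for the platform to capture.
        'stdout' => [
            'driver' => 'monolog',
            'handler' => StreamHandler::class,
            'with' => ['stream' => 'php://stdout'],
            'level' => 'debug',
        ],

        // Error level and above become Sentry events.
        'sentry' => [
            'driver' => 'sentry',
            'level' => 'error',
        ],

        // Critical level and above post to the ops channel.
        'slack' => [
            'driver' => 'slack',
            'url' => env('LOG_SLACK_WEBHOOK_URL'),
            'level' => 'critical',
        ],
    ],
];
```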
Tail logs locally with php artisan pail.
Health endpoint
GET /up is the readiness probe — returns 200 with a small
JSON body if the app boots. Use it for load balancer health checks.
For deeper checks, App\Support\PlatformAdminHeader
runs the multi-step health check and exposes the result via the
Inertia shared prop on every admin page.
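For a deploy script or smoke test, the readiness probe can be exercised with a few lines of PHP. The sketch below checks only the status code and needs the curl extension; isReady() and the local URL are illustrative and not part of the app.

```php
<?php

declare(strict_types=1);

/**
 * Returns true when GET $url answers with HTTP 200 within the timeout.
 */
function isReady(string $url, int $timeoutSeconds = 2): bool
{
    $handle = curl_init($url);
    if ($handle === false) {
        return false;
    }

    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // keep the body out of stdout
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT => $timeoutSeconds,
    ]);

    $body = curl_exec($handle);
    if ($body === false) {
        return false; // connection refused, DNS failure, or timeout
    }

    return curl_getinfo($handle, CURLINFO_RESPONSE_CODE) === 200;
}

// Assumed local address; substitute the real host.
echo isReady('http://127.0.0.1:8000/up') ? "ready\n" : "not ready\n";
```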
Metrics to watch
The handful of metrics that matter most:
- p95 first-token latency — < 1s.
- p95 full-response latency — < 5s for short answers.
- Crawl queue depth — should drain within minutes.
- Index queue depth — should drain within minutes.
- Failed-jobs count — 0 in steady state. Anything > 50 is alarm-worthy.
- LLM provider error rate — < 1% of streams.
- Vector store query latency — p95 < 100ms.
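Several of these targets are p95 values. As a reminder of what that means, here is a minimal nearest-rank percentile in plain PHP, checked against the 1s first-token target. In practice the tracing backend computes this; the percentile() helper and the sample latencies are made up for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample that is greater than or
 * equal to at least $percentile percent of all samples.
 *
 * @param list<int|float> $samples
 */
function percentile(array $samples, int $percentile): float
{
    if ($samples === [] || $percentile < 1 || $percentile > 100) {
        throw new InvalidArgumentException('Need samples and a percentile in 1..100.');
    }

    sort($samples);
    // Integer ceiling of (percentile / 100) * count, avoiding float rounding.
    $rank = intdiv($percentile * count($samples) + 99, 100);

    return (float) $samples[$rank - 1];
}

// Made-up first-token latencies in milliseconds (20 samples).
$firstTokenMs = [
    220, 310, 180, 450, 260, 940, 300, 240, 1600, 350,
    280, 500, 210, 780, 330, 400, 250, 620, 380, 420,
];

$p95 = percentile($firstTokenMs, 95); // 940.0: the 19th of 20 sorted samples

echo $p95 < 1000 ? "within target\n" : "over target\n";
```

The single 1600 ms outlier does not breach the target here, because p95 ignores the slowest 5% of requests.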
Alerts
Recommended PagerDuty / Slack alerts:
- Sentry — new release-blocking error.
- Honeycomb — first-token p95 > 1.5s for 5 minutes.
- Horizon — failed-jobs delta > 10 in 5 minutes.
- Stripe webhook — > 5 consecutive verification failures (signing key mismatch).
- Reverb — process down.
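These rules would normally be configured in the alerting tools themselves. To make the two windowed conditions concrete, the sketch below evaluates "above threshold for N minutes" and "delta over a window" on per-minute readings. The function names and sample data are illustrative only.

```php
<?php

declare(strict_types=1);

/**
 * True when each of the last $minutes per-minute values exceeds $threshold.
 *
 * @param list<int|float> $perMinute oldest first, one value per minute
 */
function breachedFor(array $perMinute, float $threshold, int $minutes): bool
{
    if ($minutes < 1 || count($perMinute) < $minutes) {
        return false; // not enough data to call it sustained
    }

    foreach (array_slice($perMinute, -$minutes) as $value) {
        if ($value <= $threshold) {
            return false;
        }
    }

    return true;
}

/**
 * Growth of a cumulative counter across the samples given.
 *
 * @param list<int> $counts oldest first
 */
function delta(array $counts): int
{
    return $counts === [] ? 0 : $counts[array_key_last($counts)] - $counts[0];
}

// Made-up readings, one per minute. Six counter samples span five minutes.
$firstTokenP95Ms = [900, 1700, 1650, 1800, 1550, 1900];
$failedJobsTotal = [3, 3, 4, 9, 12, 15];

$alerts = [
    'first-token p95 > 1.5s for 5 minutes' => breachedFor($firstTokenP95Ms, 1500.0, 5),
    'failed-jobs delta > 10 in 5 minutes' => delta($failedJobsTotal) > 10,
];

foreach ($alerts as $rule => $firing) {
    echo ($firing ? 'FIRING ' : 'ok     ') . $rule . "\n";
}
```

Requiring every minute in the window to breach keeps a single slow minute from paging anyone.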
Site Health pill
The header pill in the admin panel is a quick visual check that everything is configured. Green is the steady state; if it goes amber, the dropdown tells you exactly which check failed and links to the settings page to fix it. See Site health & failed jobs.