Architecture
Hot path & latency
The hot path is the visitor-message → first-token pipeline. It has a hard 1-second p95 contract — beyond that, the perceived "is this thing alive?" tension breaks. Everything that doesn't have to happen on the hot path is pushed off it.
Latency budget
Target breakdown for first-token at p95:
| Phase | p95 target |
|---|---|
| HTTP receive + auth | 30 ms |
| Curated short-circuit check | 5 ms |
| Embed query | 120 ms |
| Vector search (ANN) | 80 ms |
| Rerank | 120 ms |
| Prompt assembly | 10 ms |
| LLM time-to-first-token | 500 ms |
| Total | ~865 ms |
That leaves under 150 ms of headroom against the 1-second contract. Everything after the first token streams out incrementally; full-response time is bounded by token rate, not by this budget.
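One way to keep this table actionable is to encode it as per-phase budget constants that alerting can compare span p95s against. A minimal TypeScript sketch; the constant names, the mapping of table rows to span names, and the startup sanity check are illustrative, not taken from the codebase:

```ts
// Hypothetical per-phase p95 budgets (ms), mirroring the table above.
export const P95_BUDGET_MS = {
  "widget.message.receive": 30,
  "rag.curated.match": 5,
  "rag.embed": 120,
  "rag.vector.search": 80,
  "rag.rerank": 120,
  "rag.prompt.assemble": 10,
  "rag.llm.first_token": 500,
} as const;

export const FIRST_TOKEN_CONTRACT_MS = 1_000;

// Sanity check at startup: the phase budgets must leave headroom under the contract.
const total = Object.values(P95_BUDGET_MS).reduce((a, b) => a + b, 0); // 865
if (total >= FIRST_TOKEN_CONTRACT_MS) {
  throw new Error(`phase budgets (${total} ms) exceed the first-token contract`);
}
```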
Hard rules
These are enforced by code review and by tests:
- No DB writes on the hot path. Persistence is async after the stream ends.
- No synchronous webhooks. Outgoing webhooks are dispatched as queue jobs.
- No retries. If a provider fails mid-stream, the user sees a graceful error and the widget auto-retries client-side. Server doesn't loop.
- No N+1 queries. All reads are batched. Recent history comes from Redis (conv:{id}:history), not Postgres (see the sketch after this list).
- One LLM call per turn. No multi-step agent reasoning that fans out into multiple model calls.
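A minimal sketch of the history rule, assuming the recent turns live in a Redis list under conv:{id}:history and ioredis as the client; both the encoding and the client choice are assumptions:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

type Turn = { role: "user" | "assistant"; content: string };

// One Redis round trip per turn; no Postgres read on the hot path, no N+1.
// Assumes each list element is a JSON-encoded message and the list is capped at 12.
async function recentHistory(conversationId: string): Promise<Turn[]> {
  const raw = await redis.lrange(`conv:${conversationId}:history`, 0, 11);
  return raw.map((entry) => JSON.parse(entry) as Turn);
}
```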
What's off the hot path
Everything below is dispatched after the stream completes. None of it blocks the visitor:
- PersistTurnJob — save user + assistant messages.
- IncrementUsageJob — bump the workspace's monthly conversation counter.
- DetectGapJob — cluster low-confidence questions for the gap report.
- DispatchWebhookJob — fan out to subscribed customer endpoints (workflow side; the lead-captured webhook fires inline via SignedDispatcher when a visitor submits the lead form).
- AutoIndexPageVisit — synchronous service triggered at /init time that queues a CrawlPageJob for the visited URL when auto-indexing is enabled. Not a hot-path job, but worth knowing where auto-index runs.
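For illustration, the post-stream dispatch might look like the sketch below; the Queue interface, context shape, and payloads are assumptions, not the real job API:

```ts
interface Queue {
  enqueue(job: string, payload: unknown): Promise<void>;
}

interface TurnContext {
  conversationId: string;
  workspaceId: string;
  question: string;
  confidence: number;
  turns: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical hook that runs once the SSE stream has closed,
// so none of this work adds latency the visitor can see.
async function afterStreamEnds(ctx: TurnContext, queue: Queue): Promise<void> {
  await Promise.all([
    queue.enqueue("PersistTurnJob", { conversationId: ctx.conversationId, turns: ctx.turns }),
    queue.enqueue("IncrementUsageJob", { workspaceId: ctx.workspaceId }),
    queue.enqueue("DetectGapJob", { question: ctx.question, confidence: ctx.confidence }),
    queue.enqueue("DispatchWebhookJob", { conversationId: ctx.conversationId }),
  ]);
}
```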
Caching
Two caches keep the hot path tight:
- Retrieval cache — Redis, rag:retrieve:{agentId}:{hash(query|currentPageUrl)}, 30-minute TTL. Same question on the same page hits cache (get-or-compute sketched below). Invalidated when sources change.
- Conversation history cache — Redis, conv:{convId}:history, 2-hour TTL, capped at 12 messages (6 turns). Reads from this on every turn instead of Postgres.
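A get-or-compute sketch of the retrieval cache, assuming ioredis and a SHA-256 hash over query|currentPageUrl. The key shape follows the bullet above; the function names are illustrative, and invalidation on source changes is left out:

```ts
import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const RETRIEVAL_TTL_SECONDS = 30 * 60; // 30-minute TTL

type Chunk = { sourceId: string; text: string; score: number };

async function cachedRetrieve(
  agentId: string,
  query: string,
  currentPageUrl: string,
  retrieve: () => Promise<Chunk[]>, // the real embed + ANN + rerank pipeline
): Promise<Chunk[]> {
  const hash = createHash("sha256").update(`${query}|${currentPageUrl}`).digest("hex");
  const key = `rag:retrieve:${agentId}:${hash}`;

  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit) as Chunk[]; // same question on the same page: skip embed/search/rerank

  const chunks = await retrieve();
  await redis.set(key, JSON.stringify(chunks), "EX", RETRIEVAL_TTL_SECONDS);
  return chunks;
}
```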
Streaming mechanics
SSE is dead simple — keep-alive HTTP, write data: {...}\n\n per token, flush. The widget reads via EventSource, falling back to fetch plus a stream reader where EventSource can't be used (it only supports GET, so POST requests need the fallback).
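On the widget side, the fetch fallback is roughly a ReadableStream loop over SSE frames. A sketch that assumes each event is a single data: line carrying a JSON object with a token field (the payload shape is an assumption):

```ts
// Minimal widget-side fallback: POST the message, then read SSE frames off the body stream.
// Assumes one `data: {...}` line per event; a production parser also handles split frames.
async function streamReply(url: string, body: unknown, onToken: (t: string) => void) {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep the trailing partial frame
    for (const frame of frames) {
      if (frame.startsWith("data: ")) onToken(JSON.parse(frame.slice(6)).token);
    }
  }
}
```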
Critically, the SSE response is constructed before any RAG work runs. We start writing headers immediately on request receipt so any proxy in front of us (Cloudflare, load balancer) commits to streaming early. By the time tokens arrive, the connection is already open.
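Server-side, the headers-first rule looks roughly like the sketch below (Express-style handler; generateTokens stands in for the streaming LLM call and is not a real pipeline function):

```ts
import type { Request, Response } from "express";

// Stand-in for the streaming LLM call; assumed, not the real pipeline function.
declare function generateTokens(body: unknown): AsyncIterable<string>;

async function handleMessage(req: Request, res: Response): Promise<void> {
  // Commit to the SSE stream before any RAG work runs, so proxies start streaming early.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no", // ask intermediaries not to buffer
  });
  res.flushHeaders();

  // ...curated check, embed, vector search, rerank, prompt assembly happen here...

  for await (const token of generateTokens(req.body)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`); // one SSE frame per token
  }
  res.end();
}
```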
Where the spans live
OpenTelemetry spans wrap each phase:
- widget.message.receive
- rag.curated.match
- rag.embed
- rag.vector.search
- rag.rerank
- rag.prompt.assemble
- rag.llm.first_token
- rag.llm.stream
- rag.persist.async
Honeycomb / Grafana shows the p95 of each. When the budget breaks, the span heatmap usually points right at the offender.
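A sketch of how one phase gets wrapped, using the OpenTelemetry JS API; the tracer name and the helper are illustrative:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("widget-pipeline");

// Wrap a single pipeline phase in a span named after the phase, e.g. "rag.embed".
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn();
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Usage: const vector = await withSpan("rag.embed", () => embed(query));
```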
Failure modes
| Failure | Behavior |
|---|---|
| LLM provider 5xx mid-stream | Stream emits an error event. Widget auto-retries up to 3 times. |
| Vector store unreachable | Pipeline returns the question with no grounding. Confidence is 0.3 → low_confidence flag → "I don't know" answer. |
| Embed call times out | Same — proceed with no grounding, flag low_confidence. |
| Quota exceeded | Caught at /init, never reaches messages. 429 returned. |
The principle: the visitor always gets a response, even if it's "I'm not sure". The agent is allowed to be ignorant; it isn't allowed to silently break.
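A sketch of that degradation path for the two retrieval failures above; the confidence formula and threshold are assumptions, and only the 0.3 floor and the low_confidence flag come from the table:

```ts
type Chunk = { sourceId: string; text: string; score: number };
type Grounding = { chunks: Chunk[]; confidence: number; lowConfidence: boolean };

const LOW_CONFIDENCE_THRESHOLD = 0.5; // illustrative threshold, not from the source

// If embedding or the vector store fails, answer anyway with no grounding
// and a forced low confidence, rather than failing the turn.
async function groundOrDegrade(retrieve: () => Promise<Chunk[]>): Promise<Grounding> {
  try {
    const chunks = await retrieve();
    const confidence = chunks.length ? Math.max(...chunks.map((c) => c.score)) : 0.3;
    return { chunks, confidence, lowConfidence: confidence < LOW_CONFIDENCE_THRESHOLD };
  } catch {
    // Vector store unreachable or embed timeout: proceed with no grounding.
    return { chunks: [], confidence: 0.3, lowConfidence: true };
  }
}
```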