Architecture
Hot path & latency
The hot path is the visitor-message → first-token pipeline. It has a hard 1-second p95 contract — beyond that, the perceived "is this thing alive?" tension breaks. Everything that doesn't have to happen on the hot path is pushed off it.
Latency budget
Target breakdown for first-token at p95:
| Phase | p95 target |
|---|---|
| HTTP receive + auth | 30 ms |
| Curated short-circuit check | 5 ms |
| Embed query | 120 ms |
| Vector search (ANN) | 80 ms |
| Rerank | 120 ms |
| Prompt assembly | 10 ms |
| LLM time-to-first-token | 500 ms |
| Total | ~865 ms |
That leaves under 150 ms of headroom against the 1-second contract. Everything after the first token streams out incrementally; full-response time is bounded by token rate, not by this budget.
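One way to keep this table actionable is to encode it as per-phase budget constants that alerting can compare span p95s against. A minimal TypeScript sketch; the constant names, the mapping of table rows to span names, and the startup sanity check are illustrative, not taken from the codebase:

```ts
// Hypothetical per-phase p95 budgets (ms), mirroring the table above.
export const P95_BUDGET_MS = {
  "widget.message.receive": 30,
  "rag.curated.match": 5,
  "rag.embed": 120,
  "rag.vector.search": 80,
  "rag.rerank": 120,
  "rag.prompt.assemble": 10,
  "rag.llm.first_token": 500,
} as const;

export const FIRST_TOKEN_CONTRACT_MS = 1_000;

// Sanity check at startup: the phase budgets must leave headroom under the contract.
const total = Object.values(P95_BUDGET_MS).reduce((a, b) => a + b, 0); // 865
if (total >= FIRST_TOKEN_CONTRACT_MS) {
  throw new Error(`phase budgets (${total} ms) exceed the first-token contract`);
}
```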
Hard rules
These are enforced by code review and by tests:
- No DB writes on the hot path. Persistence is async after the stream ends.
- No synchronous webhooks. Outgoing webhooks are dispatched as queue jobs.
- No retries. If a provider fails mid-stream, the user sees a graceful error and the widget auto-retries client-side. Server doesn't loop.
- No N+1 queries. All reads are batched. Recent history comes from Redis (conv:{id}:history), not Postgres (see the sketch after this list).
- One LLM call per turn. No multi-step agent reasoning that fans out into multiple model calls.
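A minimal sketch of the history rule, assuming the recent turns live in a Redis list under conv:{id}:history and ioredis as the client; both the encoding and the client choice are assumptions:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

type Turn = { role: "user" | "assistant"; content: string };

// One Redis round trip per turn; no Postgres read on the hot path, no N+1.
// Assumes each list element is a JSON-encoded message and the list is capped at 12.
async function recentHistory(conversationId: string): Promise<Turn[]> {
  const raw = await redis.lrange(`conv:${conversationId}:history`, 0, 11);
  return raw.map((entry) => JSON.parse(entry) as Turn);
}
```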
What's off the hot path
Everything below is dispatched after the stream completes. None of it blocks the visitor:
- PersistTurnJob — save user + assistant messages.
- IncrementUsageJob — bump the workspace's monthly conversation counter.
- DetectGapJob — cluster low-confidence questions for the gap report.
- DispatchWebhookJob — fan out to subscribed customer endpoints (workflow side; the lead-captured webhook fires inline via SignedDispatcher when a visitor submits the lead form).
- AutoIndexPageVisit — synchronous service triggered at /init time that queues a CrawlPageJob for the visited URL when auto-indexing is enabled. Not a hot-path job, but worth knowing where auto-index runs.
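For illustration, the post-stream dispatch might look like the sketch below; the Queue interface, context shape, and payloads are assumptions, not the real job API:

```ts
interface Queue {
  enqueue(job: string, payload: unknown): Promise<void>;
}

interface TurnContext {
  conversationId: string;
  workspaceId: string;
  question: string;
  confidence: number;
  turns: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical hook that runs once the SSE stream has closed,
// so none of this work adds latency the visitor can see.
async function afterStreamEnds(ctx: TurnContext, queue: Queue): Promise<void> {
  await Promise.all([
    queue.enqueue("PersistTurnJob", { conversationId: ctx.conversationId, turns: ctx.turns }),
    queue.enqueue("IncrementUsageJob", { workspaceId: ctx.workspaceId }),
    queue.enqueue("DetectGapJob", { question: ctx.question, confidence: ctx.confidence }),
    queue.enqueue("DispatchWebhookJob", { conversationId: ctx.conversationId }),
  ]);
}
```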
Caching
Two caches keep the hot path tight:
- Retrieval cache — Redis, rag:retrieve:{agentId}:{hash(query|currentPageUrl)}, 30-minute TTL. Same question on the same page hits cache (get-or-compute sketched below). Invalidated when sources change.
- Conversation history cache — Redis, conv:{convId}:history, 2-hour TTL, capped at 12 messages (6 turns). Reads from this on every turn instead of Postgres.
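A get-or-compute sketch of the retrieval cache, assuming ioredis and a SHA-256 hash over query|currentPageUrl. The key shape follows the bullet above; the function names are illustrative, and invalidation on source changes is left out:

```ts
import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const RETRIEVAL_TTL_SECONDS = 30 * 60; // 30-minute TTL

type Chunk = { sourceId: string; text: string; score: number };

async function cachedRetrieve(
  agentId: string,
  query: string,
  currentPageUrl: string,
  retrieve: () => Promise<Chunk[]>, // the real embed + ANN + rerank pipeline
): Promise<Chunk[]> {
  const hash = createHash("sha256").update(`${query}|${currentPageUrl}`).digest("hex");
  const key = `rag:retrieve:${agentId}:${hash}`;

  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit) as Chunk[]; // same question on the same page: skip embed/search/rerank

  const chunks = await retrieve();
  await redis.set(key, JSON.stringify(chunks), "EX", RETRIEVAL_TTL_SECONDS);
  return chunks;
}
```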
Streaming mechanics
SSE is dead simple — keep-alive HTTP, write data: {...}\n\n per token, flush. The widget reads via EventSource, falling back to fetch plus a stream reader where EventSource can't be used (it only supports GET, so POST requests need the fallback).
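On the widget side, the fetch fallback is roughly a ReadableStream loop over SSE frames. A sketch that assumes each event is a single data: line carrying a JSON object with a token field (the payload shape is an assumption):

```ts
// Minimal widget-side fallback: POST the message, then read SSE frames off the body stream.
// Assumes one `data: {...}` line per event; a production parser also handles split frames.
async function streamReply(url: string, body: unknown, onToken: (t: string) => void) {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep the trailing partial frame
    for (const frame of frames) {
      if (frame.startsWith("data: ")) onToken(JSON.parse(frame.slice(6)).token);
    }
  }
}
```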
Critically, the SSE response is constructed before any RAG work runs. We start writing headers immediately on request receipt so any proxy in front of us (Cloudflare, load balancer) commits to streaming early. By the time tokens arrive, the connection is already open.
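Server-side, the headers-first rule looks roughly like the sketch below (Express-style handler; generateTokens stands in for the streaming LLM call and is not a real pipeline function):

```ts
import type { Request, Response } from "express";

// Stand-in for the streaming LLM call; assumed, not the real pipeline function.
declare function generateTokens(body: unknown): AsyncIterable<string>;

async function handleMessage(req: Request, res: Response): Promise<void> {
  // Commit to the SSE stream before any RAG work runs, so proxies start streaming early.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no", // ask intermediaries not to buffer
  });
  res.flushHeaders();

  // ...curated check, embed, vector search, rerank, prompt assembly happen here...

  for await (const token of generateTokens(req.body)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`); // one SSE frame per token
  }
  res.end();
}
```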
Where the spans live
OpenTelemetry spans wrap each phase:
- widget.message.receive
- rag.curated.match
- rag.embed
- rag.vector.search
- rag.rerank
- rag.prompt.assemble
- rag.llm.first_token
- rag.llm.stream
- rag.persist.async
Honeycomb / Grafana shows the p95 of each. When the budget breaks, the span heatmap usually points right at the offender.
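A sketch of how one phase gets wrapped, using the OpenTelemetry JS API; the tracer name and the helper are illustrative:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("widget-pipeline");

// Wrap a single pipeline phase in a span named after the phase, e.g. "rag.embed".
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn();
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Usage: const vector = await withSpan("rag.embed", () => embed(query));
```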
Failure modes
| Failure | Behavior |
|---|---|
| LLM provider 5xx mid-stream | Stream emits an error event. Widget auto-retries up to 3 times. |
| Vector store unreachable | Pipeline returns the question with no grounding. Confidence is 0.3 → low_confidence flag → "I don't know" answer. |
| Embed call times out | Same — proceed with no grounding, flag low_confidence. |
| Quota exceeded | Caught at /init, never reaches messages. 429 returned. |
The principle: the visitor always gets a response, even if it's "I'm not sure". The agent is allowed to be ignorant; it isn't allowed to silently break.
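A sketch of that degradation path for the two retrieval failures above; the confidence formula and threshold are assumptions, and only the 0.3 floor and the low_confidence flag come from the table:

```ts
type Chunk = { sourceId: string; text: string; score: number };
type Grounding = { chunks: Chunk[]; confidence: number; lowConfidence: boolean };

const LOW_CONFIDENCE_THRESHOLD = 0.5; // illustrative threshold, not from the source

// If embedding or the vector store fails, answer anyway with no grounding
// and a forced low confidence, rather than failing the turn.
async function groundOrDegrade(retrieve: () => Promise<Chunk[]>): Promise<Grounding> {
  try {
    const chunks = await retrieve();
    const confidence = chunks.length ? Math.max(...chunks.map((c) => c.score)) : 0.3;
    return { chunks, confidence, lowConfidence: confidence < LOW_CONFIDENCE_THRESHOLD };
  } catch {
    // Vector store unreachable or embed timeout: proceed with no grounding.
    return { chunks: [], confidence: 0.3, lowConfidence: true };
  }
}
```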