Knowledge sources

Sources are how an agent learns about your business. This page covers every kind of source, the ingestion pipeline, and what to expect after you click "Add".

Source types

Type       | Use it for                             | What we ingest
url        | One specific page                      | Crawl + extract main content + chunk + embed
sitemap    | A whole site at once                   | Read the sitemap, fan out to one CrawlPageJob per URL
feed       | RSS / Atom blogs                       | Same as sitemap, but reads <item> entries
text       | FAQs, snippets, anything you can paste | Skip the crawl; chunk + embed directly
notion     | Notion pages or databases              | OAuth into Notion, fetch via the API, treat each page as a document
google_doc | Google Docs (Workspace)                | OAuth, fetch via the Drive API, ingest as a document
auto       | Pages visitors land on                 | Auto-queued by AutoIndexPageVisit from /v1/widget/init

Add a source

Open /app/agents/{id}/sources. The Add source modal handles all types in one form. Behind the scenes:

  1. Validate. URLs must be http/https; private hosts (10.x, 192.168.x, 127.x, ::1) are blocked to prevent SSRF (a sketch follows this list).
  2. Create the source row with status = pending.
  3. Dispatch a job: CrawlSourceJob for url/sitemap/feed; IngestNotionPageJob / IngestGoogleDocJob for connected sources; IndexTextSourceJob for pasted text.
  4. The job runs on the crawl queue, fetches content, creates Document rows, then dispatches IndexDocumentJob on the index queue.
  5. The status flips from pending → crawling → done (or failed with an error message you can read in the UI).
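The private-host check in step 1 might look roughly like this. A minimal sketch; the function name and exact rules are illustrative, not the shipped validator:

```ts
// Hypothetical sketch of the SSRF guard in step 1; the real validator may differ.
function isAllowedSourceUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a URL at all
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname;
  // Block loopback and private ranges listed above.
  if (host === "localhost" || host === "[::1]") return false; // WHATWG URL keeps the brackets
  if (/^127\./.test(host)) return false;
  if (/^10\./.test(host)) return false;
  if (/^192\.168\./.test(host)) return false;
  return true;
}
```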

Auto-discovery

On the sources page, the Discover button takes a domain and probes it for crawlable pages without you having to list them. We:

  • Read robots.txt for sitemap declarations.
  • Probe the sitemap directly when one is present.
  • Try a small set of common paths: /about, /pricing, /features, /products, /faq, /docs, /help, /support, /contact.
  • Return a checkable list. Tick which to ingest, hit Add selected.
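The probe order above can be approximated like this. An illustrative sketch; the helper names and the HEAD-request strategy are assumptions:

```ts
// Illustrative sketch of the discovery probe; not the shipped implementation.
const COMMON_PATHS = ["/about", "/pricing", "/features", "/products",
                      "/faq", "/docs", "/help", "/support", "/contact"];

async function headOk(url: string): Promise<boolean> {
  try { return (await fetch(url, { method: "HEAD" })).ok; } catch { return false; }
}

async function discoverPages(domain: string): Promise<string[]> {
  const base = `https://${domain}`;
  const found = new Set<string>();

  // 1. robots.txt may declare one or more sitemaps.
  const robots = await fetch(`${base}/robots.txt`)
    .then(r => (r.ok ? r.text() : ""))
    .catch(() => "");
  for (const line of robots.split("\n")) {
    const m = line.trim().match(/^sitemap:\s*(\S+)/i);
    if (m) found.add(m[1]);
  }

  // 2. Probe the conventional sitemap location directly.
  if (await headOk(`${base}/sitemap.xml`)) found.add(`${base}/sitemap.xml`);

  // 3. Try a small set of common paths.
  for (const path of COMMON_PATHS) {
    if (await headOk(`${base}${path}`)) found.add(`${base}${path}`);
  }
  return [...found]; // rendered as the checkable list in the UI
}
```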

Crawler strategies

The crawler is provider-driven. In order of preference:

  1. Cloudflare Browser Rendering — preferred. Full JS rendering, fast, no SSRF risk because egress is on Cloudflare. Used when CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_API_TOKEN are set.
  2. Browserless — fallback when BROWSERLESS_TOKEN is set. Same headless-Chrome behavior on a different vendor.
  3. Plain HTTP — last resort for server-rendered sites. No JS execution. Free.
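The selection reduces to a preference chain keyed off environment variables. A minimal sketch, assuming a pickCrawlProvider helper; the env var names come from the list above, everything else is illustrative:

```ts
// Sketch of the provider preference order described above.
type Provider = "cloudflare" | "browserless" | "http";

function pickCrawlProvider(env: Record<string, string | undefined>): Provider {
  if (env.CLOUDFLARE_ACCOUNT_ID && env.CLOUDFLARE_API_TOKEN) return "cloudflare";
  if (env.BROWSERLESS_TOKEN) return "browserless";
  return "http"; // server-rendered sites only; no JS execution
}
```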

Once HTML is in hand, ReadabilityExtractor strips nav, footer, ads, etc., leaving the article body. Pages under 200 chars or detected as 404s are dropped.
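The drop rule is a small filter on the extractor's output. A sketch, assuming a keepPage helper and a regex-based soft-404 check (both illustrative):

```ts
// Hypothetical post-extraction filter; the real heuristics may differ.
function keepPage(extractedText: string, httpStatus: number, title: string): boolean {
  if (httpStatus === 404) return false;                 // hard 404
  if (/page not found|404/i.test(title)) return false;  // soft-404 heuristic (assumption)
  return extractedText.trim().length >= 200;            // too little body text to be useful
}
```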

Chunking and embedding

The extractor's text goes into Chunker, a recursive splitter that prefers semantic boundaries:

  1. Split on markdown headings, then blank lines (paragraphs).
  2. Pack paragraphs greedily up to a target size (~2000 chars / ~500 tokens).
  3. If a paragraph is too big, fall back to sentence boundaries.
  4. Char-window as the absolute last resort.
  5. Add a small overlap between chunks so cross-chunk facts stay linkable.
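A condensed sketch of that strategy; the real Chunker's API, regexes, and tuning will differ:

```ts
// Sketch of the recursive splitter described above.
const TARGET = 2000;  // ~500 tokens
const OVERLAP = 200;  // chars carried between adjacent chunks

function chunk(text: string): string[] {
  // 1. Split on markdown headings, then blank lines (paragraphs).
  const paragraphs = text
    .split(/\n(?=#{1,6}\s)|\n\s*\n/)
    .map(p => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let buf = "";
  for (const para of paragraphs) {
    // 3/4. Oversized paragraph: fall back to sentences, then a char window.
    const pieces = para.length <= TARGET ? [para] : splitOversized(para);
    for (const piece of pieces) {
      if (buf && buf.length + piece.length + 1 > TARGET) {
        chunks.push(buf);
        // 5. Seed the next chunk with a small overlap from the last one
        //    (so the result may run slightly over TARGET).
        buf = buf.slice(-OVERLAP) + "\n" + piece;
      } else {
        buf = buf ? buf + "\n" + piece : piece; // 2. greedy packing
      }
    }
  }
  if (buf) chunks.push(buf);
  return chunks;
}

function splitOversized(para: string): string[] {
  const sentences = para.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [para];
  const out: string[] = [];
  for (const s of sentences) {
    if (s.length <= TARGET) out.push(s.trim());
    else for (let i = 0; i < s.length; i += TARGET) out.push(s.slice(i, i + TARGET)); // char window
  }
  return out;
}
```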

Each chunk is embedded in a batch (default 100 chunks per call) and upserted into the vector store with metadata: agent_id, document_id, chunk_id, url, workspace_id, source_id, lang.
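In code, the batching loop amounts to something like this. Here embed() and vectorStore.upsert() are stand-ins for whatever clients the pipeline actually uses:

```ts
// Illustrative batching sketch; types and client names are assumptions.
interface Chunk { id: string; text: string }
interface DocMeta {
  agentId: string; documentId: string; url: string;
  workspaceId: string; sourceId: string; lang: string;
}
declare function embed(texts: string[]): Promise<number[][]>;
declare const vectorStore: {
  upsert(points: { id: string; values: number[]; metadata: Record<string, string> }[]): Promise<void>;
};

const BATCH_SIZE = 100; // default chunks per embedding call

async function indexChunks(chunks: Chunk[], meta: DocMeta): Promise<void> {
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const vectors = await embed(batch.map(c => c.text)); // one API call per batch
    await vectorStore.upsert(
      batch.map((c, j) => ({
        id: c.id,
        values: vectors[j],
        metadata: {
          agent_id: meta.agentId,
          document_id: meta.documentId,
          chunk_id: c.id,
          url: meta.url,
          workspace_id: meta.workspaceId,
          source_id: meta.sourceId,
          lang: meta.lang,
        },
      })),
    );
  }
}
```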

Reindex and preview

From the sources list, each row has:

  • Reindex — re-runs the crawl + chunk + embed pipeline.
  • Preview — shows the extracted documents and a sample of chunks so you can spot bad extraction (e.g. nav bar polluting the text).
  • Delete — removes the source, its documents, its chunks, and the corresponding vector points.

Notion and Google Docs

Both use OAuth. Connect once from /app/integrations; the token is encrypted at rest. After connecting, the source modal lets you pick pages or documents directly.

Re-syncs are manual (per-source Reindex button) — we don't poll your Notion / Drive on a schedule. If you change a Notion page, click Reindex on that source.

"My agent doesn't know about the file I just uploaded"

Cloudflare Vectorize has eventual consistency on metadata-filtered queries — even after an upsert returns 200 OK, an agent_id-filtered query against that vector typically returns 0 hits for the first 30 to 60 seconds while the metadata index propagates across edge regions.

Practical consequence: a freshly uploaded file shows up as status=indexed in the Sources page immediately, but the agent won't be able to answer questions about it until the propagation window closes. The upload-success banner reminds the admin of this. If the agent still doesn't return relevant chunks after a minute, open the source's Preview to confirm the extracted text isn't empty — that's a parser-side issue, not a vector-side one.

The same gotcha applies to the very first upload after a Cloudflare Vectorize index is created: the index itself has a ~2 minute provisioning lag before any queries return results, even unfiltered ones.
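If you script against the API and want to know when a fresh upsert is actually queryable, polling with a deadline is the safe pattern. A sketch, assuming a queryByAgentId client method (not a real API):

```ts
// Illustrative poll-until-visible helper for the propagation window above;
// queryByAgentId() is a stand-in, not a real client method.
declare function queryByAgentId(agentId: string, text: string): Promise<unknown[]>;

async function waitUntilQueryable(agentId: string, probeText: string): Promise<boolean> {
  const deadline = Date.now() + 120_000; // cover the worst-case provisioning lag
  while (Date.now() < deadline) {
    const hits = await queryByAgentId(agentId, probeText);
    if (hits.length > 0) return true;
    await new Promise(r => setTimeout(r, 5_000)); // metadata index is eventually consistent
  }
  return false;
}
```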

Storage and retention

  • Postgres — sources, documents, chunks (text + metadata).
  • Vector store — embeddings. Cloudflare Vectorize when configured, Qdrant otherwise.
  • R2 / object storage — original artifacts (PDFs, images) when uploaded.

Deleting a source cascades: documents, chunks, and vector points all go in one transaction. There's no soft-delete on sources.
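Shape-wise, the cascade looks like this. A rough sketch: the db and vectorStore clients are assumptions, and the vector-side delete is shown after the database commit purely for illustration:

```ts
// Hypothetical delete cascade; client names and ordering are assumptions.
declare const db: {
  transaction<T>(fn: () => Promise<T>): Promise<T>;
  delete(table: string, where: Record<string, string>): Promise<void>;
};
declare const vectorStore: {
  deleteByFilter(filter: Record<string, string>): Promise<void>;
};

async function deleteSource(sourceId: string): Promise<void> {
  await db.transaction(async () => {
    await db.delete("chunks", { source_id: sourceId });
    await db.delete("documents", { source_id: sourceId });
    await db.delete("sources", { id: sourceId }); // hard delete; no soft-delete on sources
  });
  // Vector points are removed by metadata filter.
  await vectorStore.deleteByFilter({ source_id: sourceId });
}
```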