Ask AI in Docs

Ask AI is the lightweight assistant built into the documentation site header. It lets someone ask a question in plain English, searches the published docs, and streams back an answer with source links.

It looks like chat, but it is not a full conversational workspace. Each question stands on its own. The UI keeps a browser session ID so feedback and analytics can be tied back to the same visitor, but the backend does not reuse earlier questions as conversation memory.

What users actually get

From the docs site, a visitor can:

Open the modal from the Ask AI button beside search
Use Ctrl + . on Windows/Linux or Cmd + . on macOS to open it from the keyboard
Start from suggested prompts or type a custom question
Watch the answer stream in as it is generated
Open linked source pages at the end of the answer
Send thumbs up or thumbs down feedback on completed answers

The modal also sets expectations right in the UI:

Answers may be wrong
Each question is processed independently
The assistant only knows what it can find in the docs index

What the frontend does

The docs app wires the feature in through two main files:

apps/docs/src/components/Header.astro
apps/docs/src/components/AskAiChat.astro

Header.astro injects the Ask AI trigger beside the Starlight search box. Clicking that button opens a custom web component named <ask-ai-chat>.

AskAiChat.astro handles the rest:

Renders the modal UI
Creates or reuses an anonymous browser session ID in localStorage
Sends the question to the docs AI API
Reads Server-Sent Events from the response stream
Renders markdown with marked
Sanitizes the rendered HTML with DOMPurify
Builds source links with safe DOM APIs instead of string interpolation
Shows feedback controls only after a successful answer finishes streaming

There are a few small implementation details worth knowing:

Questions must be between 3 and 500 characters
The markdown render is throttled during streaming so the modal does not feel jumpy
The modal appends itself to document.body so it can sit above the sticky header cleanly
Retrieved source pages are deduplicated by slug before they are shown

End-to-end flow

This is the request path from a user’s question to the final answer:

The visitor opens Ask AI and submits a question.
The docs app sends POST /api/docs-ai/ask with the question and the anonymous session ID.
The backend rate-limits the request by client IP.
The backend embeds the question with the configured embedding model.
It searches the docs_embeddings table for the closest matching chunks.
It stores an initial row in docs_ai_interactions with retrieval metrics and request metadata.
It calls the chat model with the retrieved excerpts and starts streaming tokens back over SSE.
The frontend renders tokens as they arrive, then adds source links when the stream ends.
The backend stores the finished answer and marks the request as completed or error.
If self-rating is enabled, the backend runs a second pass that scores the answer for quality.

Where the content comes from

Ask AI does not scrape the live DOM of the docs site. It indexes a static export generated by the docs app:

apps/docs/src/pages/docs-export.json.ts

That route is prerendered at build time and exports every non-draft document in the docs collection. Each exported record includes:

slug
title
description
body
lastUpdated

Draft pages are filtered out before export, so unpublished content does not end up in the search index.

This setup keeps indexing predictable. The backend reads one clean JSON file instead of trying to reconstruct content from HTML after the fact.

How indexing works

The shared indexing pipeline lives here:

apps/backend/src/scripts/services/docs_ai/indexer.py
apps/backend/src/scripts/services/docs_ai/chunker.py
apps/backend/src/scripts/services/docs_ai/embedder.py
apps/backend/cli_tools/index_docs.py

The pipeline does the following:

Load docs-export.json from a URL or local file.
Split each page into chunks.
Compare chunk checksums with what is already in the database.
Skip chunks that have not changed.
Delete orphaned chunks for pages or sections that no longer exist.
Generate embeddings only for new or changed chunks.
Upsert the final rows into docs_embeddings.

That checksum step matters. It keeps reindexing cheap because unchanged content does not get re-embedded.

Chunking rules

Chunking is simple on purpose:

Split on ## and ### headings
Prefix each chunk with the page title
Prefix section chunks with the section heading too
Aim for about 400 tokens per chunk
Carry about 50 tokens of overlap from the previous chunk
Fall back from paragraph splitting to sentence splitting when a section is too long

The overlap helps when an important idea straddles two chunks. It is a small detail, but it improves answer quality more than people usually expect.

How retrieval works

Vector retrieval happens in:

apps/backend/src/scripts/services/docs_ai/retriever.py

Current retrieval settings in code:

Similarity metric: cosine similarity through pgvector
Minimum similarity threshold: 0.3
Maximum chunks returned per question: 5
Customer scope: fixed to the public docs scope

The embeddings are stored in PostgreSQL with pgvector in the docs_embeddings table. Each row keeps:

The doc slug and title
The chunk text
The chunk index
The embedding vector
Token count
Checksum

The current embedding column is vector(4096). Because of that size, the implementation uses an exact scan instead of an HNSW index. That is a practical tradeoff for the current corpus size. The code comments call this out directly.

Low-match detection

Retrieval uses two different thresholds:

0.3 decides whether a chunk is eligible to be returned at all
DOCS_AI_LOW_MATCH_THRESHOLD, which defaults to 0.45, marks whether the best match looked weak

That second threshold does not block the answer. It feeds telemetry so the team can find questions the docs do not support well yet.

How answer generation works

Answer generation lives in:

apps/backend/src/scripts/services/docs_ai/generator.py
apps/backend/src/scripts/services/docs_ai/client.py

The backend sends the retrieved chunks to the chat model with a system prompt that tells Eric to:

Answer only from the provided documentation excerpts
Say so clearly when the docs do not contain the answer
Cite pages in [Page Title](/slug/) format
Keep the tone warm but restrained

The default model settings are currently:

Embeddings: qwen/qwen3-embedding-8b
Chat: minimax/minimax-m2.5

Both go through OpenRouter using the OpenAI-compatible SDK.

Streaming behavior

The response is streamed as Server-Sent Events. The frontend handles these event types:

Event type	What it contains
`meta`	A JSON payload with the `response_id`
`token`	One chunk of answer text
`sources`	A JSON-encoded list of retrieved source docs
`done`	End-of-stream marker
`error`	A user-facing error message

The frontend appends token text as it arrives, then does one final markdown render at the end so the last chunk is not left half-parsed.

Feedback, telemetry, and quality signals

Ask AI records more than the final answer. It also keeps enough context to tell whether the feature is helping or drifting.

The telemetry table is:

docs_ai_interactions

That row stores:

The question and final answer
The anonymous session_id
A hash of the client IP, not the raw IP
The user agent
The chat and embedding model names
Retrieved source metadata
Retrieval scores
Feedback state
Self-rating state
Failure signals
Request status: pending, completed, or error

Feedback flow

Once a streamed answer finishes successfully, the frontend shows:

Thumbs up
Thumbs down

If the user picks thumbs down, they can choose one reason:

incorrect
missing_docs
outdated
unclear
other

They can also add an optional note up to 500 characters.

Feedback is posted to POST /api/docs-ai/feedback. The backend only accepts it if the session_id matches the original ask request. That keeps one browser session from rating another session’s answer.

Self-rating

If DOCS_AI_SELF_RATING_ENABLED is on, the backend runs a follow-up scoring pass after the answer is stored.

That pass:

Grades the answer from 1 to 5
Stores a short reason
Falls back to heuristics if the model returns empty or malformed output

The self-rating logic is opinionated in one important way: if the answer says the docs do not cover the question, the score is forced low. That keeps a polite non-answer from being counted as a success.

Failure signals

The backend rolls several signals into a single failure classification:

low_match
thumbs_down
self_low_score

An interaction is marked as failed when one or more of those signals are present.

Safety and guardrails

This feature is public, so the guardrails are straightforward and mostly practical.

API-level guardrails

POST /api/docs-ai/ask is public, but rate-limited per IP
POST /api/docs-ai/feedback is public, and separately rate-limited
POST /api/docs-ai/reindex requires X-Reindex-Secret
Swagger and OpenAPI routes are only available from localhost
CORS is limited to the docs site origin, plus local dev origins in local environments

Frontend guardrails

Answer markdown is sanitized before insertion into the page
Source links are built from encoded path segments
Malformed SSE lines are logged and ignored instead of crashing the modal
Missing response bodies are treated as errors instead of assumed to be streamable

Data handling

The feature stores a small amount of request metadata for debugging and quality tracking:

Anonymous browser session_id
Hashed client IP
User agent
Question text
Final answer
Feedback and scoring metadata

The raw client IP is not written to the interaction table. It is hashed first with EXTERNAL_API_SALT.

API surface

The docs AI backend lives in apps/backend/src/docs_api.py.

Public endpoints:

Endpoint	Purpose
`GET /api/docs-ai/health`	Health check
`POST /api/docs-ai/ask`	Ask a question and receive a streamed answer
`POST /api/docs-ai/feedback`	Submit thumbs up or thumbs down feedback

Protected endpoint:

Endpoint	Purpose
`POST /api/docs-ai/reindex`	Rebuild the docs embedding index

The FastAPI docs and OpenAPI schema are also exposed, but only to localhost:

/api/docs-ai/docs
/api/docs-ai/openapi.json

Configuration

The main backend settings live in apps/backend/settings.py.

The most important ones are:

OPENROUTER_API_KEY
DOCS_AI_EMBEDDING_MODEL
DOCS_AI_CHAT_MODEL
DOCS_AI_RATE_LIMIT_RPM
DOCS_AI_FEEDBACK_RATE_LIMIT_RPM
DOCS_AI_LOW_MATCH_THRESHOLD
DOCS_AI_SELF_RATING_ENABLED
DOCS_AI_SELF_RATING_MODEL
DOCS_AI_SELF_RATING_FAILURE_THRESHOLD
DOCS_AI_TRUSTED_PROXY_IPS
DOCS_SITE_URL
DOCS_AI_REINDEX_SECRET

The frontend uses:

PUBLIC_DOCS_AI_API_URL

If that frontend value is not set, the component falls back to http://localhost:8042.

Local development

If you want to run the whole feature locally, the usual flow is:

Start the docs API from apps/backend/
Start the docs site from apps/docs/
Build or serve the docs site so docs-export.json exists
Run the indexer against that export

Example commands:

# Start the docs AI API
cd apps/backend
PYTHONPATH=.:src POSTGRESQL_HOST=localhost uv run fastapi dev src/docs_api.py --host 0.0.0.0 --port 8042

# Start the docs frontend
npx nx serve docs

# Index the docs export into pgvector
cd apps/backend
POSTGRESQL_HOST=localhost uv run python -m cli_tools.index_docs --source ../docs/dist/docs-export.json --clear

You can also trigger reindexing through the protected API endpoint if DOCS_AI_REINDEX_SECRET is configured.

Implementation map

If you need to debug or extend the feature, these are the main files:

Area	Files
Docs UI	`apps/docs/src/components/Header.astro`, `apps/docs/src/components/AskAiChat.astro`
Exported corpus	`apps/docs/src/pages/docs-export.json.ts`
API server	`apps/backend/src/docs_api.py`
OpenRouter client	`apps/backend/src/scripts/services/docs_ai/client.py`
Embeddings	`apps/backend/src/scripts/services/docs_ai/embedder.py`
Retrieval	`apps/backend/src/scripts/services/docs_ai/retriever.py`
Answer generation	`apps/backend/src/scripts/services/docs_ai/generator.py`
Chunking	`apps/backend/src/scripts/services/docs_ai/chunker.py`
Indexing	`apps/backend/src/scripts/services/docs_ai/indexer.py`, `apps/backend/cli_tools/index_docs.py`
Feedback and scoring	`apps/backend/src/scripts/services/docs_ai/feedback.py`
Data models	`apps/backend/src/scripts/models/docs_ai_models.py`, `apps/backend/src/scripts/models/models.py`
Database migrations	`apps/backend/migrations/versions/cc2d0adef6af_add_docs_embeddings_table.py`, `apps/backend/migrations/versions/d4e2f6a19c41_add_docs_ai_interactions.py`

One last clarification

Ask AI is intentionally narrow. It answers questions about the published docs. It does not have the full session memory, permissions model, tool access, or long-running workflow support that the in-app Eric assistant has.

That narrower shape is a good thing here. The docs-site assistant is fast, simple, public, and easy to reason about. It only works if the documentation is actually there. When it fails, the team can usually trace the problem to one of four places: the page was never documented, the index is stale, retrieval picked weak chunks, or the answer model could not turn good context into a clean response.