Overview
Chunking and embedding is the process of converting raw document text into searchable vector representations. Text is split into overlapping chunks using a smart break point algorithm that preserves semantic coherence, then each chunk is embedded via an external API and stored in the vector store for retrieval.
The system handles both batch embedding (for document ingestion) and cached query embedding (for search), with separate code paths optimised for each use case.
Key Concepts
- Smart break points — Chunks split at natural language boundaries (paragraph, sentence, line, word) rather than arbitrary character positions, preserving meaning within each chunk.
- Overlap — Adjacent chunks share 200 characters of overlap so that sentences spanning a boundary appear in full in at least one chunk.
- Deterministic point IDs — Each chunk gets a UUID v5 derived from `${documentId}:${chunkIndex}`, making upserts idempotent.
- Batch processing — Embeddings are generated in batches that respect both an item-count limit (max 2048) and a token budget (default 100,000 tokens).
- Cached query embeddings — Search queries are embedded once and cached for 24 hours using a SHA-256 hash of the model and text as the cache key.
- Multi-model support — Documents track their embedding provider, model, and dimensions, so the platform can run multiple embedding models simultaneously.
How It Works
Text Splitting
- Input text is normalised (CRLF to LF).
- The splitter walks the text in windows of `CHUNK_SIZE` (default 1000 characters).
- For each window, it searches for the best break point using a priority order: paragraph break (double newline), sentence break (`.`, `!`, or `?` followed by a space or newline), line break, word break (space).
- A break point is only accepted if it falls past 30-50% of the chunk size, preventing tiny fragments.
- Each resulting chunk records: `content`, `index`, `startChar`, `endChar`.
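The splitting loop above can be sketched as follows. This is a minimal illustration: `splitText` and `findBreakPoint` mirror the names in `text-splitter.ts`, but the single 30% minimum-break threshold and the exact regexes are assumptions, not the real implementation.

```typescript
const CHUNK_SIZE = 1000;
const OVERLAP = 200;

interface Chunk {
  content: string;
  index: number;
  startChar: number;
  endChar: number;
}

// Try break patterns in priority order; return the latest acceptable break,
// or the full window length if no natural boundary is found.
function findBreakPoint(window: string): number {
  const minPos = Math.floor(window.length * 0.3); // reject tiny fragments (assumed threshold)
  const candidates = [
    /\n\n/g,       // paragraph break
    /[.!?][ \n]/g, // sentence break
    /\n/g,         // line break
    / /g,          // word break
  ];
  for (const re of candidates) {
    let best = -1;
    let m: RegExpExecArray | null;
    while ((m = re.exec(window)) !== null) {
      const end = m.index + m[0].length;
      if (end > minPos) best = Math.max(best, end);
    }
    if (best !== -1) return best;
  }
  return window.length; // no natural boundary: hard split
}

function splitText(text: string, chunkSize = CHUNK_SIZE, overlap = OVERLAP): Chunk[] {
  const normalised = text.replace(/\r\n/g, "\n"); // CRLF -> LF
  const chunks: Chunk[] = [];
  let start = 0;
  let index = 0;
  while (start < normalised.length) {
    const window = normalised.slice(start, start + chunkSize);
    // The final window is kept whole; earlier windows split at a break point.
    const breakAt =
      start + chunkSize >= normalised.length ? window.length : findBreakPoint(window);
    const end = start + breakAt;
    chunks.push({
      content: normalised.slice(start, end),
      index: index++,
      startChar: start,
      endChar: end,
    });
    if (end >= normalised.length) break;
    start = Math.max(end - overlap, start + 1); // step back to create the overlap
  }
  return chunks;
}
```

Note the design point: because `start` steps back by `overlap` after each chunk, a sentence that straddles one chunk's end reappears whole at the start of the next.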
Embedding Generation
- Chunks are grouped into batches. Each batch stays within `MAX_ITEMS_PER_BATCH` (2048) and `DEFAULT_MAX_TOKENS_PER_BATCH` (100,000 tokens, estimated at ~4 characters per token).
- Batches are sent to the embedding API with concurrency control (default 2 concurrent requests, configurable via `EMBEDDING_API_CONCURRENCY`).
- Transient errors trigger retry logic; permanent failures propagate immediately.
- Embedded chunks are upserted into the appropriate model-specific Qdrant collection using deterministic point IDs.
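A minimal sketch of the batching step, assuming the `buildBatches` name from `embedding.ts`, this signature, and the ~4-characters-per-token estimate; retry logic and concurrency control are omitted:

```typescript
const MAX_ITEMS_PER_BATCH = 2048;
const DEFAULT_MAX_TOKENS_PER_BATCH = 100_000;

// Rough token estimate: ~4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function buildBatches(
  texts: string[],
  maxItems = MAX_ITEMS_PER_BATCH,
  maxTokens = DEFAULT_MAX_TOKENS_PER_BATCH,
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const text of texts) {
    const tokens = estimateTokens(text);
    // Start a new batch when adding this text would exceed either limit.
    // A single oversized text still gets its own batch (current is empty).
    if (
      current.length > 0 &&
      (current.length + 1 > maxItems || currentTokens + tokens > maxTokens)
    ) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(text);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```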
Cached Query Embedding
- A SHA-256 hash is computed from the model name and query text.
- If a cached embedding exists and is less than 24 hours old, it is returned immediately.
- Otherwise, the embedding is generated, cached, and returned.
- Cached entries are immutable — the same model + text always produces the same embedding, so no invalidation logic is needed.
Why It Works This Way
Overlap Prevents Lost Context at Boundaries
The 200-character overlap between adjacent chunks ensures that sentences straddling a chunk boundary appear in full in at least one chunk. Without overlap, a retrieval query matching a split sentence would return a partial, misleading result.
Smart Break Points Preserve Semantic Coherence
Splitting at paragraph, sentence, or line boundaries rather than arbitrary character positions keeps complete thoughts within a single chunk. This directly improves retrieval relevance because the embedding captures a coherent meaning rather than a fragment.
Deterministic Point IDs Enable Idempotent Upserts
UUID v5 generated from documentId + chunkIndex means re-processing a document replaces its existing vectors in Qdrant rather than creating duplicates. This makes the ingestion pipeline safely re-runnable without cleanup steps.
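A sketch of the deterministic ID generation. Node.js has no built-in UUID v5 helper, so this hand-rolls the RFC 4122 construction (SHA-1 of namespace bytes + name, with version and variant bits set); the namespace UUID here is illustrative, not necessarily the one `vector-store.ts` uses:

```typescript
import { createHash } from "node:crypto";

// Example namespace only: this is the RFC 4122 DNS namespace UUID.
const NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

function uuidBytes(uuid: string): Buffer {
  return Buffer.from(uuid.replace(/-/g, ""), "hex");
}

function uuidV5(name: string, namespace: string = NAMESPACE): string {
  const hash = createHash("sha1")
    .update(uuidBytes(namespace))
    .update(name, "utf8")
    .digest()
    .subarray(0, 16);
  hash[6] = (hash[6] & 0x0f) | 0x50; // version 5
  hash[8] = (hash[8] & 0x3f) | 0x80; // RFC 4122 variant
  const hex = hash.toString("hex");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

// Same document + chunk index always maps to the same point ID,
// so re-ingesting a document overwrites its points instead of appending.
function pointId(documentId: string, chunkIndex: number): string {
  return uuidV5(`${documentId}:${chunkIndex}`);
}
```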
Cached Query Embeddings Avoid Redundant API Calls
Repeated or similar search queries (common in agent tool loops) hit the cache instead of the embedding API. The 24-hour TTL is generous because embeddings for the same model and text are deterministic — the cache never returns stale data.
Configuration
| Env Var | Description |
|---|---|
| `EMBEDDING_API_CONCURRENCY` | Max concurrent embedding API requests (default 2) |
| `EMBEDDING_MAX_TOKENS_PER_BATCH` | Max tokens per embedding batch (default 100000) |
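For illustration, these variables might be read with fallbacks along these lines; `intFromEnv` is a hypothetical helper, not the actual config code:

```typescript
// Parse an integer env var, falling back to the default when unset or invalid.
const intFromEnv = (name: string, fallback: number): number => {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
};

const concurrency = intFromEnv("EMBEDDING_API_CONCURRENCY", 2);
const maxTokensPerBatch = intFromEnv("EMBEDDING_MAX_TOKENS_PER_BATCH", 100_000);
```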
Code Reference
| File | Description |
|---|---|
| `apps/data-plane/src/lib/text-splitter.ts` | `splitText()`, `findBreakPoint()`, `DEFAULT_CHUNK_SIZE=1000`, `DEFAULT_OVERLAP=200` |
| `apps/data-plane/src/services/embedding.ts` | `generateEmbeddings()`, `generateCachedQueryEmbedding()`, `buildBatches()` |
| `apps/data-plane/src/lib/vector-store.ts` | `pointId()` for deterministic UUID v5 generation |
| `apps/data-plane/src/lib/cache.ts` | Caching layer for query embeddings |
Relationships
- Ingestion Pipeline — Chunking and embedding are stages 1 and 2 of the ingestion pipeline
- Vector Store & Search — Embedded chunks are stored in Qdrant and retrieved during search
- Metadata Extraction — Runs after embedding as stage 3, enriching documents with structured metadata
- Spaces — Each chunk carries a `space_id` for RLS-enforced isolation