Overview
The Record Manager handles duplicate detection and document-source linking. Before any document enters the ingestion pipeline, the Record Manager checks whether the same content already exists — either in the target source or elsewhere in the space. This prevents duplicate chunks from appearing in Qdrant and avoids wasting compute on redundant embedding and metadata extraction.
When the same file is uploaded to multiple sources, the Record Manager creates a junction table link instead of duplicating the document. This means a single set of chunks and embeddings serves all sources that reference the same content.
Key Concepts
- SHA-256 content hashing — A hash is computed from the raw file content (`ArrayBuffer`) before any processing, using the Web Crypto API (`crypto.subtle.digest`). The hash is stored in `documents.contentHash` and indexed for fast lookups.
- 3-tier scope hierarchy — Deduplication checks proceed from narrowest to broadest scope: within the target source, then across all sources in the space, then create a new document.
- Duplicate modes — `skip` (the default) avoids re-processing duplicates; `overwrite` forces re-processing and replaces existing content.
- Document-source junction — The `document_sources` table allows one document to appear in multiple sources without duplicating its chunks or embeddings.
Data Model
document_sources junction table:
| Column | Type | Notes |
|---|---|---|
| documentId | uuid (PK, FK) | References documents.id |
| sourceId | uuid (PK, FK) | References sources.id |
| folderId | uuid (nullable) | Optional folder within the source |
| createdAt | timestamp | |
documents (relevant columns):
| Column | Type | Notes |
|---|---|---|
| contentHash | text | SHA-256 hash of raw file content |
| filename | text | Original filename |
Indexes:
| Index | Columns | Notes |
|---|---|---|
| Hash lookup | contentHash | Fast duplicate detection |
| Composite PK | (documentId, sourceId) | On document_sources |
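The junction table above can be declared roughly as follows — a sketch assuming Drizzle ORM (which the packages/db layout suggests) and assumed snake_case column names and module paths; the actual definitions live in packages/db/src/schema/documents.ts:

```typescript
import { pgTable, uuid, timestamp, primaryKey } from "drizzle-orm/pg-core";
import { documents, sources } from "./documents"; // assumed module layout

// Sketch only: the composite primary key makes (documentId, sourceId)
// unique, so one document can be linked to many sources but never
// twice to the same source.
export const documentSources = pgTable(
  "document_sources",
  {
    documentId: uuid("document_id").notNull().references(() => documents.id),
    sourceId: uuid("source_id").notNull().references(() => sources.id),
    folderId: uuid("folder_id"), // nullable: optional folder within the source
    createdAt: timestamp("created_at").defaultNow().notNull(),
  },
  (t) => ({
    pk: primaryKey({ columns: [t.documentId, t.sourceId] }),
  }),
);
```
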
How It Works
1. Hash the file — Compute a SHA-256 hash from the raw file content using `crypto.subtle.digest`.
2. Check within the source — Query for a document with the same `contentHash` in the target source.
   - Match by hash — In `skip` mode, return a `skip` action; in `overwrite` mode, return an `update` action.
   - Match by filename — If there is no hash match but a document with the same filename exists with different content, return an `update` action (the file has been revised).
3. Check across the space — Query for a document with the same `contentHash` anywhere in the user's space.
   - Match found — Create a `document_sources` entry linking the existing document to the new source, and return a `link` action.
4. No match — Create a new document record and return a `create` action, which triggers the full ingestion pipeline.
Return actions:
| Action | Meaning |
|---|---|
| create | New document — full pipeline processing |
| skip | Exact duplicate exists in the same source — no action |
| update | Same filename with different content, or overwrite mode — re-process |
| link | Content exists in another source — junction table entry created |
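The decision flow behind these return actions can be sketched as a pure function. The finder callbacks stand in for the indexed Postgres queries the real RecordManager runs; every name here is illustrative, not the actual API:

```typescript
// Hypothetical sketch of the 3-tier deduplication decision.
type Action = "create" | "skip" | "update" | "link";
type Mode = "skip" | "overwrite";

interface DocRef {
  id: string;
  contentHash: string;
}

interface Finders {
  // Tier 1: the target source
  byHashInSource(hash: string): DocRef | undefined;
  byFilenameInSource(filename: string): DocRef | undefined;
  // Tier 2: any source in the user's space
  byHashInSpace(hash: string): DocRef | undefined;
}

function decideAction(hash: string, filename: string, mode: Mode, db: Finders): Action {
  // Tier 1: exact content already in this source
  const sameContent = db.byHashInSource(hash);
  if (sameContent) return mode === "overwrite" ? "update" : "skip";

  // Tier 1: same filename, different content -> the file was revised
  const sameName = db.byFilenameInSource(filename);
  if (sameName && sameName.contentHash !== hash) return "update";

  // Tier 2: content exists elsewhere in the space -> link, reuse chunks
  if (db.byHashInSpace(hash)) return "link";

  // Tier 3: genuinely new -> run the full ingestion pipeline
  return "create";
}
```

Note that the cheapest outcome wins at each step: a hash hit in the target source short-circuits before any cross-source query runs.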
Why It Works This Way
Content Hashing Prevents Duplicate Search Results
SHA-256 content hashing means re-uploading the same file does not create duplicate chunks in Qdrant. Without deduplication, agents would see doubled results for every duplicated document, degrading answer quality and wasting context window tokens.
Cross-Source Junction Avoids Re-Embedding
The document_sources junction table means the same content can appear in multiple sources without generating a second set of embeddings. Since embedding is one of the most expensive stages (both in API cost and latency), this saves significant compute for users who organize the same documents across multiple sources.
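In SQL terms, the `link` action amounts to a single row insert — a sketch assuming snake_case column equivalents of the schema above; no chunk or embedding tables are touched:

```typescript
// Illustrative only: the effect of a "link" action, assuming snake_case
// column names. Chunks and embeddings are reused, not rewritten.
const linkDocumentSql = `
  INSERT INTO document_sources (document_id, source_id, folder_id)
  VALUES ($1, $2, $3)
  ON CONFLICT (document_id, source_id) DO NOTHING
`;
```

`ON CONFLICT ... DO NOTHING` (keyed on the composite primary key) would keep linking idempotent, so repeating a link of the same document to the same source is harmless.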
3-Tier Scope Hierarchy Minimises Waste
The check order — within-source first, then space-wide, then create — ensures the cheapest check runs first. A within-source hash lookup is a single indexed query. The space-wide check only runs if the first check finds no match, and a new document is created only when both find nothing. This minimises both storage waste and processing cost.
Configuration
| Env Var | Description |
|---|---|
| DATABASE_URL | Postgres connection string for the app_user role (RLS enforced) |
Code Reference
| File | Description |
|---|---|
| apps/data-plane/src/services/record-manager.ts | RecordManager class with the 3-tier deduplication logic |
| apps/data-plane/src/services/upload.ts | Upload flow that runs the record manager before pipeline submission |
| packages/db/src/schema/documents.ts | documents and document_sources table definitions |
Relationships
- Ingestion Pipeline — The record manager runs before ingestion; its return action determines whether the pipeline runs at all
- Sources — Document-source junction enables one document to belong to multiple sources
- Spaces — Cross-source deduplication is scoped to a single space; RLS enforces isolation