Overview
The Record Manager handles duplicate detection and document-source linking. Before any document enters the ingestion pipeline, the Record Manager checks whether the same content already exists — either in the target source or elsewhere in the space. This prevents duplicate chunks from appearing in Qdrant and avoids wasting compute on redundant embedding and metadata extraction.
When the same file is uploaded to multiple sources, the Record Manager creates a junction table link instead of duplicating the document. This means a single set of chunks and embeddings serves all sources that reference the same content.
Key Concepts
- SHA-256 content hashing — A hash is computed from the raw file content (`ArrayBuffer`) before any processing, using the Web Crypto API (`crypto.subtle.digest`). The hash is stored in `documents.contentHash` and indexed for fast lookups.
- 3-tier scope hierarchy — Deduplication checks proceed from narrowest to broadest scope: within the target source, then across all sources in the space, then create a new document.
- Duplicate modes — `skip` (the default) avoids re-processing duplicates; `overwrite` forces re-processing and replaces existing content.
- Document-source junction — The `document_sources` table allows one document to appear in multiple sources without duplicating its chunks or embeddings.
Data Model
document_sources junction table:
| Column | Type | Notes |
|---|---|---|
| documentId | uuid (PK, FK) | References documents.id |
| sourceId | uuid (PK, FK) | References sources.id |
| folderId | uuid (nullable) | Optional folder within the source |
| createdAt | timestamp | |
documents (relevant columns):
| Column | Type | Notes |
|---|---|---|
| contentHash | text | SHA-256 hash of raw file content |
| filename | text | Original filename |
Indexes:
| Index | Columns | Notes |
|---|---|---|
| Hash lookup | contentHash | Fast duplicate detection |
| Composite PK | (documentId, sourceId) | On document_sources |
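The junction table above can be declared roughly as follows — a sketch assuming Drizzle ORM (which the packages/db layout suggests) and assumed snake_case column names and module paths; the actual definitions live in packages/db/src/schema/documents.ts:

```typescript
import { pgTable, uuid, timestamp, primaryKey } from "drizzle-orm/pg-core";
import { documents, sources } from "./documents"; // assumed module layout

// Sketch only: the composite primary key makes (documentId, sourceId)
// unique, so one document can be linked to many sources but never
// twice to the same source.
export const documentSources = pgTable(
  "document_sources",
  {
    documentId: uuid("document_id").notNull().references(() => documents.id),
    sourceId: uuid("source_id").notNull().references(() => sources.id),
    folderId: uuid("folder_id"), // nullable: optional folder within the source
    createdAt: timestamp("created_at").defaultNow().notNull(),
  },
  (t) => ({
    pk: primaryKey({ columns: [t.documentId, t.sourceId] }),
  }),
);
```
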
How It Works
1. Hash the file — Compute a SHA-256 hash from the raw file content using `crypto.subtle.digest`.
2. Check within the source — Query for a document with the same `contentHash` in the target source.
   - Match by hash — In `skip` mode, return a `skip` action; in `overwrite` mode, return an `update` action.
   - Match by filename — If there is no hash match but a document with the same filename exists with different content, return an `update` action (the file has been revised).
3. Check across the space — Query for a document with the same `contentHash` anywhere in the user's space.
   - Match found — Create a `document_sources` entry linking the existing document to the new source, and return a `link` action.
4. No match — Create a new document record and return a `create` action, which triggers the full ingestion pipeline.
Return actions:
| Action | Meaning |
|---|---|
| create | New document — full pipeline processing |
| skip | Exact duplicate exists in the same source — no action |
| update | Same filename with different content, or overwrite mode — re-process |
| link | Content exists in another source — junction table entry created |
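The decision flow behind these return actions can be sketched as a pure function. The finder callbacks stand in for the indexed Postgres queries the real RecordManager runs; every name here is illustrative, not the actual API:

```typescript
// Hypothetical sketch of the 3-tier deduplication decision.
type Action = "create" | "skip" | "update" | "link";
type Mode = "skip" | "overwrite";

interface DocRef {
  id: string;
  contentHash: string;
}

interface Finders {
  // Tier 1: the target source
  byHashInSource(hash: string): DocRef | undefined;
  byFilenameInSource(filename: string): DocRef | undefined;
  // Tier 2: any source in the user's space
  byHashInSpace(hash: string): DocRef | undefined;
}

function decideAction(hash: string, filename: string, mode: Mode, db: Finders): Action {
  // Tier 1: exact content already in this source
  const sameContent = db.byHashInSource(hash);
  if (sameContent) return mode === "overwrite" ? "update" : "skip";

  // Tier 1: same filename, different content -> the file was revised
  const sameName = db.byFilenameInSource(filename);
  if (sameName && sameName.contentHash !== hash) return "update";

  // Tier 2: content exists elsewhere in the space -> link, reuse chunks
  if (db.byHashInSpace(hash)) return "link";

  // Tier 3: genuinely new -> run the full ingestion pipeline
  return "create";
}
```

Note that the cheapest outcome wins at each step: a hash hit in the target source short-circuits before any cross-source query runs.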
Why It Works This Way
Content Hashing Prevents Duplicate Search Results
SHA-256 content hashing means re-uploading the same file does not create duplicate chunks in Qdrant. Without deduplication, agents would see doubled results for every duplicated document, degrading answer quality and wasting context window tokens.
Cross-Source Junction Avoids Re-Embedding
The document_sources junction table means the same content can appear in multiple sources without generating a second set of embeddings. Since embedding is one of the most expensive stages (both in API cost and latency), this saves significant compute for users who organize the same documents across multiple sources.
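In SQL terms, the `link` action amounts to a single row insert — a sketch assuming snake_case column equivalents of the schema above; no chunk or embedding tables are touched:

```typescript
// Illustrative only: the effect of a "link" action, assuming snake_case
// column names. Chunks and embeddings are reused, not rewritten.
const linkDocumentSql = `
  INSERT INTO document_sources (document_id, source_id, folder_id)
  VALUES ($1, $2, $3)
  ON CONFLICT (document_id, source_id) DO NOTHING
`;
```

`ON CONFLICT ... DO NOTHING` (keyed on the composite primary key) would keep linking idempotent, so repeating a link of the same document to the same source is harmless.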
3-Tier Scope Hierarchy Minimises Waste
The check order — within-source first, then space-wide, then create — ensures the cheapest check runs first. A within-source hash lookup is a single indexed query. The space-wide check only runs if the first check finds no match, and a new document is created only when both find nothing. This minimises both storage waste and processing cost.
Configuration
| Env Var | Description |
|---|---|
| DATABASE_URL | Postgres connection string for the app_user role (RLS enforced) |
Code Reference
| File | Description |
|---|---|
| apps/data-plane/src/services/record-manager.ts | RecordManager class with the 3-tier deduplication logic |
| apps/data-plane/src/services/upload.ts | Upload flow that runs the record manager before pipeline submission |
| packages/db/src/schema/documents.ts | documents and document_sources table definitions |
Relationships
- Ingestion Pipeline — The record manager runs before ingestion; its return action determines whether the pipeline runs at all
- Sources — Document-source junction enables one document to belong to multiple sources
- Spaces — Cross-source deduplication is scoped to a single space; RLS enforces isolation