Record Manager

Duplicate detection via SHA-256 content hashing. Resolution modes (skip/overwrite), a scope hierarchy (within-source → cross-space), and a document-source junction table for cross-source linking.

Overview

The Record Manager handles duplicate detection and document-source linking. Before any document enters the ingestion pipeline, the Record Manager checks whether the same content already exists — either in the target source or elsewhere in the space. This prevents duplicate chunks from appearing in Qdrant and avoids wasting compute on redundant embedding and metadata extraction.

When the same file is uploaded to multiple sources, the Record Manager creates a junction table link instead of duplicating the document. This means a single set of chunks and embeddings serves all sources that reference the same content.

Key Concepts

  • SHA-256 content hashing — A hash is computed from the raw file content (ArrayBuffer) before any processing, using the Web Crypto API (crypto.subtle.digest). This hash is stored in documents.contentHash and indexed for fast lookups.
  • 3-tier scope hierarchy — Deduplication proceeds from narrowest to broadest scope: check within the target source first, then across the space, and only create a new document if neither check finds a match.
  • Duplicate modes — skip (default) avoids re-processing duplicates; overwrite forces re-processing and replaces existing content.
  • Document-source junction — The document_sources table allows one document to appear in multiple sources without duplicating its chunks or embeddings.

Data Model

document_sources junction table:

| Column | Type | Notes |
| --- | --- | --- |
| documentId | uuid (PK, FK) | References documents.id |
| sourceId | uuid (PK, FK) | References sources.id |
| folderId | uuid (nullable) | Optional folder within the source |
| createdAt | timestamp | |

documents (relevant columns):

| Column | Type | Notes |
| --- | --- | --- |
| contentHash | text | SHA-256 hash of raw file content |
| filename | text | Original filename |

Indexes:

| Index | Columns | Notes |
| --- | --- | --- |
| Hash lookup | contentHash | Fast duplicate detection |
| Composite PK | (documentId, sourceId) | On document_sources |
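
The two tables can be mirrored as plain TypeScript row types. This is an illustrative sketch only; the real schema definitions live in packages/db/src/schema/documents.ts and the type names here are assumptions.

```typescript
// Illustrative row shapes for the tables above (names are assumptions).
interface DocumentRow {
  id: string;              // uuid, primary key
  contentHash: string;     // SHA-256 hex of raw file content (indexed)
  filename: string;        // original filename
}

interface DocumentSourceRow {
  documentId: string;      // uuid, part of composite PK, FK -> documents.id
  sourceId: string;        // uuid, part of composite PK, FK -> sources.id
  folderId: string | null; // optional folder within the source
  createdAt: Date;
}

// A "link" action just inserts one of these rows; the document row,
// its chunks, and its embeddings are untouched.
const link: DocumentSourceRow = {
  documentId: "11111111-1111-1111-1111-111111111111",
  sourceId: "22222222-2222-2222-2222-222222222222",
  folderId: null,
  createdAt: new Date(),
};
```

The composite primary key (documentId, sourceId) guarantees a document is linked to a given source at most once.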

How It Works

  1. Hash the file — Compute SHA-256 from the raw file content using crypto.subtle.digest.
  2. Check within source — Query for a document with the same contentHash in the target source.
    • Match by hash — In skip mode: return skip action. In overwrite mode: return update action.
    • Match by filename — If no hash match but a document with the same filename exists and has different content: return update action (the file has been revised).
  3. Check cross-space — Query for a document with the same contentHash anywhere in the user's space.
    • Match found — Create a document_sources entry linking the existing document to the new source. Return link action.
  4. No match — Create a new document record. Return create action, which triggers the full ingestion pipeline.
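
The four steps above can be sketched as a single decision function. This is a hedged sketch: the store method names (findByHashInSource, findByFilenameInSource, findByHashInSpace) are illustrative stand-ins, not the real RecordManager API.

```typescript
type Action = "create" | "skip" | "update" | "link";

interface Doc { id: string; contentHash: string; filename: string }

// Illustrative lookup interface; method names are assumptions.
interface Store {
  findByHashInSource(hash: string, sourceId: string): Doc | undefined;
  findByFilenameInSource(filename: string, sourceId: string): Doc | undefined;
  findByHashInSpace(hash: string, spaceId: string): Doc | undefined;
}

function decideAction(
  store: Store,
  hash: string,
  filename: string,
  sourceId: string,
  spaceId: string,
  mode: "skip" | "overwrite" = "skip"
): Action {
  // Step 2: within-source check (cheapest — one indexed lookup).
  const sameContent = store.findByHashInSource(hash, sourceId);
  if (sameContent) return mode === "overwrite" ? "update" : "skip";

  // Step 2b: same filename, different content — the file was revised.
  const sameName = store.findByFilenameInSource(filename, sourceId);
  if (sameName && sameName.contentHash !== hash) return "update";

  // Step 3: cross-space check — link instead of re-embedding.
  if (store.findByHashInSpace(hash, spaceId)) return "link";

  // Step 4: genuinely new content — run the full pipeline.
  return "create";
}
```

Note how the broader (cross-space) query only runs after the narrower checks miss, matching the 3-tier scope hierarchy.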

Return actions:

| Action | Meaning |
| --- | --- |
| create | New document — full pipeline processing |
| skip | Exact duplicate exists in the same source — no action |
| update | Same filename with different content, or overwrite mode — re-process |
| link | Content exists in another source — junction table entry created |
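
A caller such as the upload flow branches on the returned action. The handler descriptions below restate the table; the function name is a hypothetical illustration, not the real upload.ts code.

```typescript
type Action = "create" | "skip" | "update" | "link";

// Illustrative mapping from action to follow-up work (names are assumptions).
function followUp(action: Action): string {
  switch (action) {
    case "create":
      return "run full ingestion pipeline";
    case "skip":
      return "no-op: exact duplicate already in this source";
    case "update":
      return "re-process: replace existing chunks and embeddings";
    case "link":
      return "insert document_sources row; reuse existing chunks";
  }
}
```

The exhaustive switch over the union type means the compiler flags any new action that lacks a handler.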

Why It Works This Way

Content Hashing Prevents Duplicate Search Results

SHA-256 content hashing means re-uploading the same file does not create duplicate chunks in Qdrant. Without deduplication, agents would see doubled results for every duplicated document, degrading answer quality and wasting context window tokens.

Cross-Source Junction Avoids Re-Embedding

The document_sources junction table means the same content can appear in multiple sources without generating a second set of embeddings. Since embedding is one of the most expensive stages (both in API cost and latency), this saves significant compute for users who organize the same documents across multiple sources.

3-Tier Scope Hierarchy Minimises Waste

The check order — within-source first, then cross-space, then create — ensures the cheapest check runs first. A within-source hash lookup is a single indexed query. Cross-space checks only run if the first check finds no match. This minimises both storage waste and processing cost.

Configuration

| Env Var | Description |
| --- | --- |
| DATABASE_URL | Postgres connection string for the app_user role (RLS enforced) |

Code Reference

| File | Description |
| --- | --- |
| apps/data-plane/src/services/record-manager.ts | RecordManager class with 3-tier deduplication logic |
| apps/data-plane/src/services/upload.ts | Upload flow integrating the record manager before pipeline submission |
| packages/db/src/schema/documents.ts | documents and document_sources table definitions |

Relationships

  • Ingestion Pipeline — The record manager runs before ingestion; its return action determines whether the pipeline runs at all
  • Sources — Document-source junction enables one document to belong to multiple sources
  • Spaces — Cross-source deduplication is scoped to a single space; RLS enforces isolation
