Ingestion Pipeline

End-to-end overview of the 4-stage pipeline: conversion (Docling) → chunking → embedding → metadata extraction. Queue-driven architecture, concurrency control, status tracking.

Overview

The ingestion pipeline is a 4-stage document processing system that converts raw files into searchable, queryable knowledge. Each stage — conversion, chunking, embedding, and metadata extraction — is tracked independently per document, so a failure in one stage does not block progress in others. A document becomes searchable as soon as its chunks and embeddings are stored, even if metadata extraction has not yet completed.

The pipeline is queue-driven via BullMQ and Redis. A document worker pulls jobs from the document-processing queue and runs each stage sequentially with per-stage concurrency limiters to prevent any single bottleneck from overwhelming the system.

Key Concepts

  • 4-stage pipeline — Conversion, chunking, embedding, and metadata extraction run as distinct stages with independent status tracking.
  • Stage-level status — Each stage has its own status field (stageConversion, stageChunking, stageEmbedding, stageMetadata) with values: pending, processing, completed, failed, skipped.
  • Overall status derivation — A document's overall status is derived from its stage statuses: pending (all stages pending), processing (any stage in progress), completed (all stages completed/skipped), failed (any stage failed).
  • Concurrency limiters — Each stage has its own concurrency cap to balance throughput and resource usage.
  • Fire-and-forget tasks — After all stages complete, background tasks (feed suggestion, wiki update, agent triggering) run without blocking the response.
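The overall-status derivation described above can be sketched as a small pure function. The field and value names come from this page; the precedence order (failed first, then processing) is an assumption about how ties are resolved, not a statement of the actual implementation.

```typescript
type StageStatus = "pending" | "processing" | "completed" | "failed" | "skipped";

interface StageStatuses {
  stageConversion: StageStatus;
  stageChunking: StageStatus;
  stageEmbedding: StageStatus;
  stageMetadata: StageStatus;
}

type OverallStatus = "pending" | "processing" | "completed" | "failed";

// Derive the document-level status from the four stage statuses:
// any failure wins, then any in-progress stage, then "completed"
// once every stage is completed or skipped; otherwise "pending".
function deriveOverallStatus(s: StageStatuses): OverallStatus {
  const stages = [s.stageConversion, s.stageChunking, s.stageEmbedding, s.stageMetadata];
  if (stages.some((st) => st === "failed")) return "failed";
  if (stages.some((st) => st === "processing")) return "processing";
  if (stages.every((st) => st === "completed" || st === "skipped")) return "completed";
  return "pending";
}
```

Note that "skipped" counts toward completion, which is what lets text documents (which skip conversion) still reach an overall "completed" status.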

How It Works

  1. Conversion (Stage 0) — Binary formats (PDF, DOCX, images) are converted to text via the Docling REST API. Text formats (.txt, .md, .csv, .json) skip this stage entirely. Legacy formats (.doc, .rtf, .odt) are pre-converted via LibreOffice to .docx before Docling processes them.
  2. Chunking (Stage 1) — The extracted text is split into chunks using smart break points (1000 characters default, 200 character overlap). Any existing chunks for the document in Qdrant are deleted before new ones are created.
  3. Embedding (Stage 2) — Vector embeddings are generated for each chunk via the OpenAI or OpenRouter API. Chunks are batched with concurrency control and upserted into a model-specific Qdrant collection.
  4. Metadata Extraction (Stage 3) — An LLM extracts structured metadata from the document: title, summary, category, keywords, document type, and data structure. For long documents, multi-region sampling is used to ensure representative coverage.
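The chunking stage (Stage 1) can be illustrated with a simplified splitter using the defaults above (1000 characters, 200-character overlap). The "smart break point" heuristic here — preferring the last paragraph or sentence boundary inside the window — is a sketch of the idea, not the actual implementation.

```typescript
// Split text into overlapping chunks. When a chunk boundary would fall
// mid-text, prefer the last paragraph break ("\n\n") or sentence end
// (". ") in the window, provided it is past the halfway point.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      const breakAt = Math.max(window.lastIndexOf("\n\n"), window.lastIndexOf(". "));
      if (breakAt > chunkSize / 2) end = start + breakAt + 1;
    }
    chunks.push(text.slice(start, end).trim());
    if (end >= text.length) break;
    start = end - overlap; // step back so adjacent chunks share context
  }
  return chunks;
}
```

The overlap means a sentence near a chunk boundary appears in both neighboring chunks, so a search hit is not lost to an unlucky split.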

After all four stages complete successfully, the following fire-and-forget tasks are triggered:

  1. Feed suggestion — generateFeeds evaluates whether the new document should be associated with any existing feeds.
  2. Wiki incremental update — triggerWikiIncremental updates the space's wiki with knowledge from the new document.
  3. Agent triggering — triggerDocumentEventAgents fires any agents configured with on_document triggers.

Why It Works This Way

Stage Separation Preserves Searchability

The 4-stage separation means a metadata extraction failure does not block the document from being searchable. Once chunks and embeddings are stored (stages 1 and 2), the document appears in search results. Metadata extraction can be retried independently without re-processing earlier stages.

Fire-and-Forget for Responsiveness

Background tasks like feed suggestion, wiki generation, and agent triggering use a fire-and-forget pattern (fn().catch(err => logger.warn(...))). This preserves API responsiveness — the upload request returns immediately after queueing, and downstream tasks run asynchronously.
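A minimal sketch of the pattern follows. The `logger` object and task names are illustrative stand-ins, not the real identifiers; the point is that the promise is started, any rejection is logged rather than propagated, and the caller returns immediately.

```typescript
// Hypothetical logger standing in for the real one.
const logger = { warn: (msg: string) => console.error(`[warn] ${msg}`) };

// Start an async task without awaiting it; failures are logged, never thrown.
function fireAndForget(name: string, fn: () => Promise<void>): void {
  fn().catch((err) =>
    logger.warn(`${name} failed: ${err instanceof Error ? err.message : String(err)}`),
  );
}

// Usage after all four stages complete (function names from this page):
// fireAndForget("feed-suggestion", () => generateFeeds(documentId));
// fireAndForget("wiki-incremental", () => triggerWikiIncremental(spaceId));
// fireAndForget("agent-triggering", () => triggerDocumentEventAgents(documentId));
```

The trade-off is that failures are only visible in logs, which is acceptable here because each task is non-critical and can be re-triggered later.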

Per-Stage Concurrency Limiters

Each stage has its own concurrency cap (e.g., conversion: 5, embedding: 3). This prevents a burst of large PDFs from saturating the Docling service while embedding jobs starve, or vice versa. Binary and text document concurrency are also tracked separately since text documents skip conversion entirely.
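The idea behind the ConcurrencyLimiter utility can be sketched as a small promise-based semaphore: at most `limit` tasks run at once, and extra tasks queue until a slot frees. This is a minimal illustration, not the actual apps/data-plane/src/lib/concurrency.ts code.

```typescript
// At most `limit` tasks execute concurrently; surplus callers park on a
// queue of resolvers and are woken one at a time as tasks finish.
class ConcurrencyLimiter {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake the next queued task, if any
    }
  }
}
```

Each stage would hold its own limiter instance (e.g. conversion at 5, embedding at 3), so pressure on one stage never consumes another stage's slots.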

Stage-Level Status Enables Targeted Retry

Because each stage tracks its own status, a failed metadata extraction can be retried without re-converting, re-chunking, or re-embedding the document. This reduces wasted compute and speeds up recovery from transient failures.
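Targeted retry can be sketched as a runner that walks the stages in order and skips any stage already completed or skipped. The stage names and record shapes are illustrative, assuming sequential stages where a failure stops later stages (since each depends on the previous stage's output).

```typescript
type StageStatus = "pending" | "processing" | "completed" | "failed" | "skipped";
type StageName = "conversion" | "chunking" | "embedding" | "metadata";

// Run (or re-run) a document's pipeline: completed/skipped stages are
// left alone, so retrying after a metadata failure only re-runs metadata.
async function runPipeline(
  stages: Record<StageName, StageStatus>,
  runners: Record<StageName, () => Promise<void>>,
): Promise<Record<StageName, StageStatus>> {
  const order: StageName[] = ["conversion", "chunking", "embedding", "metadata"];
  for (const name of order) {
    if (stages[name] === "completed" || stages[name] === "skipped") continue;
    stages[name] = "processing";
    try {
      await runners[name]();
      stages[name] = "completed";
    } catch {
      stages[name] = "failed";
      break; // later stages depend on this one's output
    }
  }
  return stages;
}
```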

Configuration

  • DOCLING_API_URL — Docling REST API endpoint (default http://localhost:5001)
  • INGESTION_CONCURRENCY_WORKER — Max concurrent document worker jobs (default 6)
  • INGESTION_CONCURRENCY_CONVERSION — Max concurrent conversion tasks (default 5)
  • INGESTION_CONCURRENCY_EMBEDDING — Max concurrent embedding tasks (default 3)
  • INGESTION_CONCURRENCY_METADATA — Max concurrent metadata extraction tasks (default 3)
  • INGESTION_CONCURRENCY_BINARY_DOCS — Max concurrent binary document jobs (default 4)
  • INGESTION_CONCURRENCY_TEXT_DOCS — Max concurrent text document jobs (default 8)
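Loading these settings with their defaults might look like the following. The variable names and defaults come from the table above; the parsing helper and the config object shape are illustrative.

```typescript
// Parse an integer env var, falling back to the documented default when
// the variable is unset or not a valid number.
function intEnv(env: Record<string, string | undefined>, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

function loadIngestionConfig(env: Record<string, string | undefined>) {
  return {
    doclingApiUrl: env.DOCLING_API_URL ?? "http://localhost:5001",
    worker: intEnv(env, "INGESTION_CONCURRENCY_WORKER", 6),
    conversion: intEnv(env, "INGESTION_CONCURRENCY_CONVERSION", 5),
    embedding: intEnv(env, "INGESTION_CONCURRENCY_EMBEDDING", 3),
    metadata: intEnv(env, "INGESTION_CONCURRENCY_METADATA", 3),
    binaryDocs: intEnv(env, "INGESTION_CONCURRENCY_BINARY_DOCS", 4),
    textDocs: intEnv(env, "INGESTION_CONCURRENCY_TEXT_DOCS", 8),
  };
}
```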

Code Reference

  • apps/data-plane/src/services/ingestion.ts — Pipeline orchestration and all 4 stages
  • apps/data-plane/src/workers/document.ts — BullMQ worker that processes document jobs
  • apps/data-plane/src/services/upload.ts — Upload handling, duplicate detection, queue submission
  • apps/data-plane/src/lib/queue.ts — BullMQ queue configuration
  • apps/data-plane/src/lib/concurrency.ts — ConcurrencyLimiter utility

Relationships

  • Record Manager — Deduplication and document-source linking runs before the pipeline starts
  • Tabular Data & SQL — CSV and structured files detected during ingestion are routed to tabular storage
  • Spaces — Every document belongs to a space; RLS enforces isolation throughout the pipeline
  • Feeds — Feed suggestion runs as a fire-and-forget task after ingestion completes
  • Agents — on_document agents are triggered after ingestion completes
