Metadata Extraction

LLM-based extraction of title, summary, category, keywords, document type, and data structure. Filename analysis, multi-region sampling, and tabular detection.

Overview

Metadata extraction uses an LLM to classify and enrich documents with structured fields: title, summary, category, keywords, language, document type, and data structure. These fields power filtered search, document routing, and feed organisation across the platform.

Extraction combines three layers of analysis — filename parsing, structural context detection, and LLM-based content analysis — to produce high-quality metadata even for documents with ambiguous or missing titles.

Key Concepts

  • Three enrichment layers — Filename analysis, structural context extraction, and LLM content analysis work together to maximise metadata quality.
  • Multi-region sampling — Long documents are sampled from the start, middle, and end rather than truncated, giving the LLM a representative view.
  • Structured output — Zod schemas enforce the shape and allowed values of extracted metadata via zodResponseFormat.
  • Keyword quality filtering — Generic single-word terms are filtered out post-extraction, keeping only distinctive terms useful for retrieval.
  • Data structure classification — Each document is tagged as prose, tabular, mixed, or structured, enabling downstream routing decisions.

Data Model

Extracted Fields

  • title — Inferred from content, not the filename
  • summary — 1-3 sentences, specific; includes names, numbers, dates
  • category — One of: technical, legal, marketing, research, financial, sales, tutorial, report, correspondence, other
  • keywords — 5-10 specific, distinctive terms (multi-word phrases preferred)
  • language — ISO 639-1 code
  • documentType — One of 30+ types: user_manual, product_guide, specification_sheet, installation_guide, troubleshooting_guide, release_notes, whitepaper, case_study, proposal, policy, sop, meeting_notes, research_paper, brochure, memo, transcript, spreadsheet, data_export, ledger, invoice, article, report, email, code, notes, documentation, presentation, contract, other
  • dataStructure — One of: prose, tabular, mixed, structured

How It Works

Layer 1: Filename Analysis

The filename is parsed to extract:

  • Brand names — Matched against a list of 40+ known brands (Bosch, Samsung, Apple, Siemens, etc.)
  • Model numbers — Detected via regex patterns for alphanumeric model identifiers
  • Document type hints — Matched against 15+ patterns (manual, spec, whitepaper, datasheet, etc.)

These hints are passed to the LLM as prior context, improving extraction accuracy for files with descriptive names.
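A filename parser along these lines could produce those hints. This is an illustrative sketch, not the real parseFilenameHints: the brand list, type patterns, and model-number regex are small stand-ins for the actual 40+ brands and 15+ type patterns.

```typescript
// Illustrative subsets; the production lists are much larger.
const KNOWN_BRANDS = ["Bosch", "Samsung", "Apple", "Siemens"];
const TYPE_HINTS = ["manual", "spec", "whitepaper", "datasheet"];
// One plausible pattern for alphanumeric model identifiers (assumed, not actual).
const MODEL_PATTERN = /\b[A-Z]{2,}[- ]?\d+[A-Z0-9-]*\b/;

interface FilenameHints {
  brand?: string;
  model?: string;
  typeHint?: string;
}

function parseFilenameHints(filename: string): FilenameHints {
  // Drop the extension and normalise separators to spaces.
  const base = filename.replace(/\.[a-z0-9]+$/i, "").replace(/[-_]+/g, " ");
  const lower = base.toLowerCase();
  const hints: FilenameHints = {};
  hints.brand = KNOWN_BRANDS.find((b) => lower.includes(b.toLowerCase()));
  const model = base.match(MODEL_PATTERN);
  if (model) hints.model = model[0];
  hints.typeHint = TYPE_HINTS.find((t) => lower.includes(t));
  return hints;
}
```

For a name like Bosch-GWS-18V-manual.pdf, this yields a brand, a model string, and a "manual" type hint to seed the LLM prompt.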

Layer 2: Structural Context Extraction

The first 200 lines of text are parsed for:

  • Markdown headings and ALL-CAPS headings (potential titles and section names)
  • Model numbers appearing anywhere in the full text
  • Product manual indicators: safety warnings, regulatory marks (CE, FCC, UL), warranty information
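A minimal sketch of that structural scan, assuming simple heuristics (the real extractStructuralContext may use different rules and thresholds):

```typescript
interface StructuralContext {
  headings: string[];        // markdown and ALL-CAPS headings from the first 200 lines
  regulatoryMarks: string[]; // CE / FCC / UL marks found anywhere in the text
  looksLikeManual: boolean;  // product-manual indicator
}

function extractStructuralContext(text: string): StructuralContext {
  const lines = text.split("\n").slice(0, 200);
  const headings: string[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (/^#{1,6}\s+\S/.test(trimmed)) {
      headings.push(trimmed.replace(/^#+\s+/, "")); // markdown heading
    } else if (/^[A-Z][A-Z0-9 ]{3,60}$/.test(trimmed)) {
      headings.push(trimmed); // ALL-CAPS heading
    }
  }
  const regulatoryMarks = ["CE", "FCC", "UL"].filter((m) =>
    new RegExp(`\\b${m}\\b`).test(text)
  );
  const looksLikeManual =
    regulatoryMarks.length > 0 || /\b(warning|warranty|caution)\b/i.test(text);
  return { headings, regulatoryMarks, looksLikeManual };
}
```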

Layer 3: LLM Content Analysis

  1. Text preparation — For documents over 5000 characters, multi-region sampling selects the first 2000 + middle 1500 + last 1500 characters, separated by markers.
  2. Prompt construction — The system prompt identifies the LLM as a "document metadata extraction specialist". Filename hints and structural context are included as additional signals.
  3. Extraction — The LLM returns structured output matching the Zod schema, at temperature 0.3 for consistency.
  4. Post-processing — Keywords are filtered against a stopword list (removing generic terms like "document", "information", "data", "overview", "details", "guide", "content"). Multi-word phrases and proper nouns are preserved.
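The multi-region sampling in step 1 can be sketched as follows. The thresholds mirror the prose above (5000-character cutoff, first 2000 + middle 1500 + last 1500); the marker strings are assumptions, not the actual ones in prepareTextForExtraction.

```typescript
// Sample start, middle, and end of long documents instead of truncating.
function prepareTextForExtraction(text: string): string {
  if (text.length <= 5000) return text; // short documents pass through whole
  const start = text.slice(0, 2000);
  const midStart = Math.floor(text.length / 2) - 750; // centre a 1500-char window
  const middle = text.slice(midStart, midStart + 1500);
  const end = text.slice(-1500);
  // Separator markers (hypothetical wording) tell the LLM content was elided.
  return [
    start,
    "[...middle of document...]",
    middle,
    "[...end of document...]",
    end,
  ].join("\n");
}
```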

Why It Works This Way

Multi-Region Sampling Beats Truncation

Taking samples from the start, middle, and end of a document gives the LLM exposure to the introduction, core content, and conclusion. Simple truncation would miss key information in longer documents — a financial report's conclusions are at the end, not the beginning.

Filename Analysis Provides Strong Priors

Many enterprise documents follow naming conventions that encode brand, model, and type (e.g., Bosch-GWS-18V-manual.pdf). Extracting these hints before LLM analysis means the model starts with strong priors rather than guessing from content alone.

Data Structure Classification Enables Routing

The dataStructure field (prose, tabular, mixed, structured) lets tools decide how to query a document. Tabular documents can be routed to text-to-SQL queries rather than vector search, which would perform poorly on structured data.
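A routing decision driven by that field might look like the following sketch; the strategy names ("text_to_sql", "vector_search") are hypothetical labels, not actual platform tool names.

```typescript
type DataStructure = "prose" | "tabular" | "mixed" | "structured";

// Pick query strategies for a document based on its dataStructure tag.
function pickQueryStrategies(ds: DataStructure): string[] {
  switch (ds) {
    case "tabular":
      return ["text_to_sql"]; // structured queries beat embeddings on tables
    case "mixed":
      return ["vector_search", "text_to_sql"]; // try both, merge results
    case "prose":
    case "structured":
    default:
      return ["vector_search"];
  }
}
```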

Keyword Quality Filtering Improves Retrieval

Unfiltered LLM-generated keywords often include generic terms that match too broadly in search. Filtering against a stopword list and preferring multi-word phrases ensures that keywords are actually distinctive enough to be useful as search facets.
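The filtering step could be implemented roughly as below. The stopword list is copied from the description in "How It Works"; the keep-multi-word and keep-capitalised heuristics are assumptions about how phrases and proper nouns are preserved.

```typescript
// Generic terms listed in the post-processing step above.
const GENERIC_TERMS = new Set([
  "document", "information", "data", "overview", "details", "guide", "content",
]);

function filterKeywords(keywords: string[]): string[] {
  return keywords.filter((kw) => {
    const trimmed = kw.trim();
    if (trimmed.includes(" ")) return true;           // keep multi-word phrases
    if (/^[A-Z]/.test(trimmed)) return true;          // keep proper nouns
    return !GENERIC_TERMS.has(trimmed.toLowerCase()); // drop generic single words
  });
}
```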

Low Temperature for Consistency

Temperature 0.3 reduces variance in metadata extraction. The same document processed twice should produce the same category, type, and keywords, making the system predictable and debuggable.

Configuration

  • OPENAI_API_KEY — API key for the metadata extraction LLM

Code Reference

  • apps/data-plane/src/services/metadata.ts — extractMetadata(), parseFilenameHints(), extractStructuralContext(), prepareTextForExtraction()
  • apps/data-plane/src/services/tabular-detection.ts — detectTabularStructure() for data structure classification
  • packages/shared/ — DocumentMetadata, DocumentCategory, DocumentType, DataStructure types

Relationships

  • Ingestion Pipeline — Metadata extraction is stage 3 of the ingestion pipeline
  • Chunking & Embedding — Run as stages 1 and 2, before metadata extraction
  • Vector Store & Search — Metadata fields are stored as Qdrant payload and used for filtered search
  • Feeds — Feed suggestion uses extracted metadata to associate documents with feeds
  • Tool Registry — The document_info tool exposes extracted metadata to agents


© 2026 Condelo. All rights reserved.