Overview
Metadata extraction uses an LLM to classify and enrich documents with structured fields: title, summary, category, keywords, language, document type, and data structure. These fields power filtered search, document routing, and feed organisation across the platform.
Extraction combines three layers of analysis — filename parsing, structural context detection, and LLM-based content analysis — to produce high-quality metadata even for documents with ambiguous or missing titles.
Key Concepts
- Three enrichment layers — Filename analysis, structural context extraction, and LLM content analysis work together to maximise metadata quality.
- Multi-region sampling — Long documents are sampled from the start, middle, and end rather than truncated, giving the LLM a representative view.
- Structured output — Zod schemas enforce the shape and allowed values of extracted metadata via `zodResponseFormat`.
- Keyword quality filtering — Generic single-word terms are filtered out post-extraction, keeping only distinctive terms useful for retrieval.
- Data structure classification — Each document is tagged as `prose`, `tabular`, `mixed`, or `structured`, enabling downstream routing decisions.
Data Model
Extracted Fields
| Field | Description |
|---|---|
| `title` | Inferred from content, not the filename |
| `summary` | 1-3 sentences, specific — includes names, numbers, dates |
| `category` | One of: technical, legal, marketing, research, financial, sales, tutorial, report, correspondence, other |
| `keywords` | 5-10 specific, distinctive terms (multi-word phrases preferred) |
| `language` | ISO 639-1 code |
| `documentType` | One of 30+ types: user_manual, product_guide, specification_sheet, installation_guide, troubleshooting_guide, release_notes, whitepaper, case_study, proposal, policy, sop, meeting_notes, research_paper, brochure, memo, transcript, spreadsheet, data_export, ledger, invoice, article, report, email, code, notes, documentation, presentation, contract, other |
| `dataStructure` | One of: prose, tabular, mixed, structured |
How It Works
Layer 1: Filename Analysis
The filename is parsed to extract:
- Brand names — Matched against a list of 40+ known brands (Bosch, Samsung, Apple, Siemens, etc.)
- Model numbers — Detected via regex patterns for alphanumeric model identifiers
- Document type hints — Matched against 15+ patterns (manual, spec, whitepaper, datasheet, etc.)
These hints are passed to the LLM as prior context, improving extraction accuracy for files with descriptive names.
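A minimal sketch of this parsing step is below. The function name matches the code reference, but the body is illustrative: the brand and type-hint lists are trimmed from the real 40+/15+ entries, and the model-number regex is a simplification.

```typescript
// Trimmed illustrative lists; the real implementation has 40+ brands
// and 15+ document type patterns.
const KNOWN_BRANDS = ["bosch", "samsung", "apple", "siemens"];
const TYPE_HINTS = ["manual", "spec", "whitepaper", "datasheet"];

interface FilenameHints {
  brand?: string;
  model?: string;
  typeHint?: string;
}

function parseFilenameHints(filename: string): FilenameHints {
  // Strip the extension and split on common filename separators.
  const base = filename.replace(/\.[a-z0-9]+$/i, "");
  const tokens = base.split(/[-_.\s]+/).filter(Boolean);
  const lower = tokens.map((t) => t.toLowerCase());

  const hints: FilenameHints = {};

  const brandIdx = lower.findIndex((t) => KNOWN_BRANDS.includes(t));
  if (brandIdx >= 0) hints.brand = tokens[brandIdx];

  const typeIdx = lower.findIndex((t) => TYPE_HINTS.includes(t));
  if (typeIdx >= 0) hints.typeHint = lower[typeIdx];

  // Simplified model detection: tokens mixing letters and digits ("18V").
  const modelTokens = tokens.filter((t) => /\d/.test(t) && /[a-z]/i.test(t));
  if (modelTokens.length > 0) hints.model = modelTokens.join("-");

  return hints;
}
```

For `Bosch-GWS-18V-manual.pdf`, this yields a brand of `Bosch`, a type hint of `manual`, and a model fragment of `18V` — enough to bias the LLM towards the right document type before it reads any content.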
Layer 2: Structural Context Extraction
The first 200 lines of text are parsed for:
- Markdown headings and ALL-CAPS headings (potential titles and section names)
- Model numbers appearing anywhere in the full text
- Product manual indicators: safety warnings, regulatory marks (CE, FCC, UL), warranty information
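The detection above can be sketched as follows. The function name matches the code reference, but the regexes and the returned shape are assumptions for illustration:

```typescript
interface StructuralContext {
  headings: string[];
  modelNumbers: string[];
  isLikelyManual: boolean;
}

function extractStructuralContext(text: string): StructuralContext {
  const lines = text.split("\n").slice(0, 200);

  // Markdown headings ("# Title") and short ALL-CAPS lines are treated
  // as candidate titles and section names.
  const headings = lines
    .map((line) => line.trim())
    .filter((line) => /^#{1,6}\s+\S/.test(line) || /^[A-Z][A-Z0-9 ]{3,60}$/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, ""));

  // Model numbers are matched anywhere in the full text, e.g. "GWS-18V".
  const modelNumbers = Array.from(
    new Set(text.match(/\b[A-Z]{2,}[- ]?\d+[A-Z0-9-]*\b/g) ?? [])
  );

  // Product manual indicators: regulatory marks and warranty/safety language.
  const isLikelyManual =
    /\b(CE|FCC|UL)\b/.test(text) || /warranty|safety/i.test(text);

  return { headings, modelNumbers, isLikelyManual };
}
```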
Layer 3: LLM Content Analysis
- Text preparation — For documents over 5000 characters, multi-region sampling selects the first 2000 + middle 1500 + last 1500 characters, separated by markers.
- Prompt construction — The system prompt identifies the LLM as a "document metadata extraction specialist". Filename hints and structural context are included as additional signals.
- Extraction — The LLM returns structured output matching the Zod schema, at temperature 0.3 for consistency.
- Post-processing — Keywords are filtered against a stopword list (removing generic terms like "document", "information", "data", "overview", "details", "guide", "content"). Multi-word phrases and proper nouns are preserved.
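The post-processing step can be sketched as below, using the stopword list quoted above. The function name is an assumption, and proper-noun preservation is omitted for brevity:

```typescript
// Generic terms that match too broadly to be useful as search facets.
const GENERIC_TERMS = new Set([
  "document", "information", "data", "overview",
  "details", "guide", "content",
]);

function filterKeywords(keywords: string[]): string[] {
  return keywords.filter((kw) => {
    const normalized = kw.trim().toLowerCase();
    // Multi-word phrases are kept: they are distinctive by construction.
    if (normalized.includes(" ")) return true;
    // Single words survive only if they are not generic stopwords.
    return !GENERIC_TERMS.has(normalized);
  });
}
```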
Why It Works This Way
Multi-Region Sampling Beats Truncation
Taking samples from the start, middle, and end of a document gives the LLM exposure to the introduction, core content, and conclusion. Simple truncation would miss key information in longer documents — a financial report's conclusions are at the end, not the beginning.
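Using the thresholds stated in the pipeline description (5000-character limit; 2000 + 1500 + 1500 samples), the sampling logic can be sketched as:

```typescript
// Sketch of multi-region sampling; marker strings are illustrative.
function prepareTextForExtraction(text: string): string {
  const LIMIT = 5000;
  if (text.length <= LIMIT) return text;

  const start = text.slice(0, 2000);
  const midStart = Math.floor(text.length / 2) - 750;
  const middle = text.slice(midStart, midStart + 1500);
  const end = text.slice(-1500);

  return [
    start,
    "[... middle of document ...]",
    middle,
    "[... end of document ...]",
    end,
  ].join("\n");
}
```

Short documents pass through untouched; long ones shrink to roughly 5000 characters while still exposing the introduction, core content, and conclusion to the LLM.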
Filename Analysis Provides Strong Priors
Many enterprise documents follow naming conventions that encode brand, model, and type (e.g., Bosch-GWS-18V-manual.pdf). Extracting these hints before LLM analysis means the model starts with strong priors rather than guessing from content alone.
Data Structure Classification Enables Routing
The `dataStructure` field (prose, tabular, mixed, structured) lets tools decide how to query a document. Tabular documents can be routed to text-to-SQL queries rather than vector search, which performs poorly on structured data.
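A routing decision of this kind might look like the sketch below. The strategy names are illustrative — only the tabular-to-SQL routing is stated above; the handling of `mixed` documents is an assumption:

```typescript
type DataStructure = "prose" | "tabular" | "mixed" | "structured";
type QueryStrategy = "vector_search" | "text_to_sql" | "hybrid";

// Hypothetical routing: tabular data goes to SQL-style querying,
// mixed documents (assumed) get both, everything else uses vector search.
function chooseQueryStrategy(ds: DataStructure): QueryStrategy {
  switch (ds) {
    case "tabular":
      return "text_to_sql";
    case "mixed":
      return "hybrid";
    default:
      return "vector_search";
  }
}
```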
Keyword Quality Filtering Improves Retrieval
Unfiltered LLM-generated keywords often include generic terms that match too broadly in search. Filtering against a stopword list and preferring multi-word phrases ensures that keywords are actually distinctive enough to be useful as search facets.
Low Temperature for Consistency
Temperature 0.3 reduces variance in metadata extraction. The same document processed twice should produce the same category, type, and keywords, making the system predictable and debuggable.
Configuration
| Env Var | Description |
|---|---|
| `OPENAI_API_KEY` | API key for the metadata extraction LLM |
Code Reference
| File | Description |
|---|---|
| `apps/data-plane/src/services/metadata.ts` | `extractMetadata()`, `parseFilenameHints()`, `extractStructuralContext()`, `prepareTextForExtraction()` |
| `apps/data-plane/src/services/tabular-detection.ts` | `detectTabularStructure()` for data structure classification |
| `packages/shared/` | `DocumentMetadata`, `DocumentCategory`, `DocumentType`, `DataStructure` types |
Relationships
- Ingestion Pipeline — Metadata extraction is stage 3 of the ingestion pipeline
- Chunking & Embedding — Run before metadata extraction as stages 1 and 2
- Vector Store & Search — Metadata fields are stored as Qdrant payload and used for filtered search
- Feeds — Feed suggestion uses extracted metadata to associate documents with feeds
- Tool Registry — The `document_info` tool exposes extracted metadata to agents