## Overview
Condelo supports structured tabular data alongside unstructured documents. When a CSV file is uploaded or tabular structure is detected during ingestion, rows are parsed and stored in a dedicated `structured_data_rows` table as JSONB. Column metadata — names, types, and row count — is stored alongside the document record.
Text-to-SQL enables natural language querying against this structured data. An LLM generates SQL from a user's question, the query is validated against a strict allowlist of patterns, and results are returned — all without requiring users to learn a query language.
## Key Concepts
- Tabular detection — CSV files are detected by extension. Other file formats undergo tabular structure detection during ingestion to identify embedded tables.
- JSONB row storage — Each row is stored as a JSONB object in `structured_data_rows`, preserving the original column structure without requiring a fixed schema.
- Type inference — Column types are automatically inferred (date, currency, number, percentage, boolean, text) and stored per column for use in query generation.
- Text-to-SQL — An LLM translates natural language questions into validated SQL queries that run against the stored tabular data.
- Data structure classification — Each document is classified as `prose`, `tabular`, `mixed`, or `structured` via the `documents.dataStructure` field.
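To make the row-storage concept concrete, here is a minimal sketch of how a parsed CSV line becomes a JSONB-style object keyed by column name. The function name and handling of missing cells are illustrative assumptions, not the actual ingestion code.

```typescript
// Illustrative sketch: convert one parsed CSV line into the per-row JSON object
// stored in structured_data_rows. Names and details here are assumptions.
function rowToJson(columnNames: string[], values: string[]): Record<string, string> {
  const row: Record<string, string> = {};
  columnNames.forEach((name, i) => {
    row[name] = values[i] ?? ""; // missing trailing cells become empty strings
  });
  return row;
}

// Example: one row of a three-column invoice CSV
const invoiceRow = rowToJson(
  ["date", "customer", "amount"],
  ["2024-01-15", "Acme", "120.50"],
);
// invoiceRow → { date: "2024-01-15", customer: "Acme", amount: "120.50" }
```

Because each row carries its own keys, two CSVs with different columns can coexist in the same table without any schema migration.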
## Data Model

`structured_data_rows` table:
| Column | Type | Notes |
|---|---|---|
| `id` | uuid (PK) | Primary key |
| `documentId` | uuid (FK) | References `documents.id` |
| `userId` | uuid | Owner, used for RLS |
| `spaceId` | uuid (FK) | References `spaces.id` |
| `sourceId` | uuid (FK) | References `sources.id` |
| `rowIndex` | integer | Row position in the original file |
| `data` | jsonb | Row data as key-value pairs |
| `createdAt` | timestamp | |
`documents` (relevant columns):
| Column | Type | Notes |
|---|---|---|
| `tabularInfo` | jsonb | `{ columnNames, rowCount, detectedTypes }` |
| `dataStructure` | text | One of: `prose`, `tabular`, `mixed`, `structured` |
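Based on the columns above, the `tabularInfo` payload can be modelled roughly as follows. Only the three field names and the six detected types come from the source; the exact TypeScript shape (in particular, keying `detectedTypes` by column name) is an assumption.

```typescript
// Illustrative model of documents.tabularInfo as described above.
type DetectedType = "date" | "currency" | "number" | "percentage" | "boolean" | "text";

interface TabularInfo {
  columnNames: string[];
  rowCount: number;
  detectedTypes: Record<string, DetectedType>; // keyed by column name (assumed shape)
}

const exampleInfo: TabularInfo = {
  columnNames: ["date", "customer", "amount"],
  rowCount: 1200,
  detectedTypes: { date: "date", customer: "text", amount: "currency" },
};
```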
## How It Works

- Detection — During ingestion, CSV files are identified by extension. Other files undergo tabular structure detection to find embedded tables.
- Row extraction — Rows are parsed from the file and each row is stored as a JSONB object in `structured_data_rows`, keyed by column name.
- Metadata storage — Column names, row count, and per-column detected types are written to `documents.tabularInfo`. The document's `dataStructure` field is set to `tabular`, `mixed`, or `structured` as appropriate.
- Type inference — Each column is analysed to detect its type: date, currency, number, percentage, boolean, or text. These types are stored in `tabularInfo.detectedTypes`.
- Natural language query — When a user or agent asks a question about tabular data, the text-to-SQL service generates a SQL query.
- Schema description — The LLM receives the table structure, column types, and usage notes via the `DOCUMENTS_SCHEMA` constant, enabling it to generate appropriate WHERE clauses and aggregations.
- Query validation — The generated SQL is validated by `validateSqlQuery`:
  - Must start with `SELECT`
  - Must reference `documents` or `structured_data_rows`
  - Must include a `$USER_ID$` placeholder (replaced with the actual user ID before execution)
  - Blocked patterns: `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `UNION`, SQL comments, multi-statement queries
- Execution — The validated query runs via the `executeSafeQuery` RPC function, with a maximum of 50 rows returned.
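The validation rules above can be sketched as a simple allowlist check. This is a minimal approximation for illustration, not the actual `validateSqlQuery` implementation; the real service may apply additional checks.

```typescript
// Minimal sketch of the validation rules described above (not the shipped code).
const BLOCKED =
  /\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|UNION)\b|--|\/\*/i;

function validateSqlQuery(sql: string): { valid: boolean; reason?: string } {
  const q = sql.trim();
  if (!/^SELECT\b/i.test(q)) return { valid: false, reason: "must start with SELECT" };
  if (!/\b(documents|structured_data_rows)\b/.test(q))
    return { valid: false, reason: "must reference an allowed table" };
  if (!q.includes("$USER_ID$"))
    return { valid: false, reason: "missing $USER_ID$ filter" };
  if (BLOCKED.test(q))
    return { valid: false, reason: "blocked keyword or SQL comment" };
  // Reject multi-statement queries (a single trailing semicolon is tolerated).
  if (q.replace(/;\s*$/, "").includes(";"))
    return { valid: false, reason: "multi-statement queries are blocked" };
  return { valid: true };
}
```

Note that the blocked-keyword regex uses word boundaries, so column names like `createdAt` are not falsely flagged as `CREATE`.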
## Why It Works This Way

### JSONB Storage With Text-to-SQL
JSONB row storage combined with text-to-SQL means agents can query CSV data using natural language. There is no custom query language to learn, and the schema-less JSONB approach handles CSVs with varying column structures without requiring DDL changes.
### Query Validation Prevents Injection
The validation layer (SELECT-only, must filter by user ID, blocked mutation patterns) prevents SQL injection while still allowing flexible analytical queries. The $USER_ID$ placeholder ensures RLS-equivalent filtering even when queries run via an RPC function.
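The placeholder substitution step can be sketched as follows. The function name and the quoted-literal escaping strategy are assumptions for illustration; the real service may bind the user ID as a query parameter instead.

```typescript
// Hypothetical sketch: substitute $USER_ID$ with the authenticated user's ID
// before the query is executed. Name and escaping approach are assumptions.
function bindUserId(sql: string, userId: string): string {
  if (!sql.includes("$USER_ID$")) {
    throw new Error("validated queries must contain the $USER_ID$ placeholder");
  }
  // Inline as a quoted SQL literal, doubling any embedded single quotes.
  return sql.split("$USER_ID$").join(`'${userId.replace(/'/g, "''")}'`);
}
```

Because the placeholder is mandatory at validation time, every executed query is guaranteed to carry a user-ID filter, mirroring what RLS would enforce on a direct table read.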
### Per-Column Type Inference
Storing detected types (date, currency, number, percentage, boolean, text) per column lets the text-to-SQL service generate appropriate query patterns. A currency column gets SUM and AVG aggregations; a date column gets range filters. Without type metadata, the LLM would have to guess column semantics from names alone.
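An inference pass along these lines might look like the sketch below. The six type labels come from the source; the specific heuristics (ISO-only dates, a small set of currency symbols, and so on) are illustrative, not the shipped rules.

```typescript
// Illustrative per-column type inference over a sample of string values.
type ColumnType = "date" | "currency" | "number" | "percentage" | "boolean" | "text";

function inferType(values: string[]): ColumnType {
  const sample = values.filter((v) => v.trim() !== "");
  if (sample.length === 0) return "text";
  const all = (re: RegExp) => sample.every((v) => re.test(v.trim()));
  if (all(/^\d{4}-\d{2}-\d{2}$/)) return "date"; // ISO dates only, for brevity
  if (all(/^[$€£]\s?-?\d[\d,]*(\.\d+)?$/)) return "currency";
  if (all(/^-?\d+(\.\d+)?%$/)) return "percentage";
  if (all(/^(true|false|yes|no)$/i)) return "boolean";
  if (all(/^-?\d[\d,]*(\.\d+)?$/)) return "number";
  return "text"; // fallback when no single type matches every value
}
```

Requiring every sampled value to match keeps the inference conservative: a column mixing numbers and free text falls back to `text` rather than producing misleading aggregations.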
### Separate Storage From Documents

Keeping tabular data in its own `structured_data_rows` table rather than embedding it in the main `documents` table keeps the document model simple. Documents remain focused on text content and metadata, while structured data gets its own optimised storage with per-row indexing.
## Configuration
| Env Var | Description |
|---|---|
| `DATABASE_URL` | Postgres connection string for the `app_user` role (RLS enforced) |
| `OPENAI_API_KEY` | Used for text-to-SQL LLM calls |
| `OPENROUTER_API_KEY` | Alternative LLM provider for text-to-SQL |
## Code Reference
| File | Description |
|---|---|
| `packages/db/src/schema/documents.ts` | `structured_data_rows` table definition |
| `apps/data-plane/src/services/text-to-sql.ts` | SQL generation, `validateSqlQuery`, `DOCUMENTS_SCHEMA` constant, execution |
| `apps/data-plane/src/services/tabular-detection.ts` | Tabular structure detection |
| `apps/data-plane/src/services/structured-extraction.ts` | Structured data extraction and row storage |
| `apps/data-plane/src/tools/definitions/query-documents.ts` | LLM tool definition for text-to-SQL |
| `apps/api/src/services/text-to-sql.ts` | API route for text-to-SQL queries |
## Relationships

- Ingestion Pipeline — Tabular detection and extraction run as part of the ingestion stages
- Record Manager — Deduplication applies to tabular documents the same as to any other file type
- Spaces — All structured data rows are scoped to a space via `spaceId` and RLS
- Agents — Agents use the `query-documents` tool to run natural language queries against tabular data