Tabular Data & SQL

CSV/JSON detection and parsing, structured_data_rows storage (JSONB), text-to-SQL query generation, type inference, query validation and execution.

Overview

Condelo supports structured tabular data alongside unstructured documents. When a CSV file is uploaded or tabular structure is detected during ingestion, rows are parsed and stored in a dedicated structured_data_rows table as JSONB. Column metadata — names, types, and row count — is stored alongside the document record.

Text-to-SQL enables natural language querying against this structured data. An LLM generates SQL from a user's question, the query is validated against a strict allowlist of patterns, and results are returned — all without requiring users to learn a query language.

Key Concepts

  • Tabular detection — CSV files are detected by extension. Other file formats undergo tabular structure detection during ingestion to identify embedded tables.
  • JSONB row storage — Each row is stored as a JSONB object in structured_data_rows, preserving the original column structure without requiring a fixed schema.
  • Type inference — Column types are automatically inferred (date, currency, number, percentage, boolean, text) and stored per column for use in query generation.
  • Text-to-SQL — An LLM translates natural language questions into validated SQL queries that run against the stored tabular data.
  • Data structure classification — Each document is classified as prose, tabular, mixed, or structured via the documents.dataStructure field.

Data Model

structured_data_rows table:

Column      Type        Notes
id          uuid (PK)   Primary key
documentId  uuid (FK)   References documents.id
userId      uuid        Owner, used for RLS
spaceId     uuid (FK)   References spaces.id
sourceId    uuid (FK)   References sources.id
rowIndex    integer     Row position in the original file
data        jsonb       Row data as key-value pairs
createdAt   timestamp
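As an illustration of the row shape, a single CSV row might land in structured_data_rows like this. All IDs and column names here are hypothetical, chosen only to show how the data column mirrors the original CSV header:

```typescript
// Hypothetical structured_data_rows record for one row of an invoices CSV.
// IDs and column names are illustrative, not taken from a real dataset.
const row = {
  id: "row-uuid",
  documentId: "doc-uuid",
  userId: "user-uuid",
  spaceId: "space-uuid",
  sourceId: "source-uuid",
  rowIndex: 0,
  // `data` preserves the original CSV columns as a JSONB key-value object,
  // so no fixed schema is needed per upload.
  data: { invoice_date: "2024-03-01", amount: "1250.00", paid: "true" },
  createdAt: new Date().toISOString(),
};

console.log(Object.keys(row.data)); // the original column names
```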

documents (relevant columns):

Column         Type   Notes
tabularInfo    jsonb  { columnNames, rowCount, detectedTypes }
dataStructure  text   One of: prose, tabular, mixed, structured

How It Works

  1. Detection — During ingestion, CSV files are identified by extension. Other files undergo tabular structure detection to find embedded tables.
  2. Row extraction — Rows are parsed from the file and each row is stored as a JSONB object in structured_data_rows, keyed by column name.
  3. Metadata storage — Column names, row count, and per-column detected types are written to documents.tabularInfo. The document's dataStructure field is set to tabular, mixed, or structured as appropriate.
  4. Type inference — Each column is analysed to detect its type: date, currency, number, percentage, boolean, or text. These types are stored in tabularInfo.detectedTypes.
  5. Natural language query — When a user or agent asks a question about tabular data, the text-to-SQL service generates a SQL query.
  6. Schema description — The LLM receives the table structure, column types, and usage notes via the DOCUMENTS_SCHEMA constant, enabling it to generate appropriate WHERE clauses and aggregations.
  7. Query validation — The generated SQL is validated by validateSqlQuery:
    • Must start with SELECT
    • Must reference documents or structured_data_rows
    • Must include a $USER_ID$ placeholder (replaced with the actual user ID before execution)
    • Blocked patterns: INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, UNION, SQL comments, multi-statement queries
  8. Execution — The validated query runs via the executeSafeQuery RPC function, with a maximum of 50 rows returned.
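The validation rules in step 7 can be sketched as follows. This is a minimal illustration of the documented checks, not the actual validateSqlQuery implementation in apps/data-plane/src/services/text-to-sql.ts, whose signature and rule set may differ:

```typescript
// Sketch of the allowlist-style validation described in step 7.
function validateSqlQuery(sql: string): { valid: boolean; reason?: string } {
  const q = sql.trim();
  // Must start with SELECT — no mutations, no CTE-wrapped writes.
  if (!/^select\b/i.test(q)) return { valid: false, reason: "must start with SELECT" };
  // Must reference one of the two queryable tables.
  if (!/\b(documents|structured_data_rows)\b/i.test(q))
    return { valid: false, reason: "must reference a known table" };
  // Must include the placeholder that is swapped for the real user ID.
  if (!q.includes("$USER_ID$"))
    return { valid: false, reason: "missing $USER_ID$ placeholder" };
  const blocked = [
    /\b(insert|update|delete|drop|alter|create|truncate|union)\b/i, // mutations & UNION
    /--|\/\*/, // SQL comments
    /;.+/s,    // multi-statement queries (content after a semicolon)
  ];
  for (const pattern of blocked) {
    if (pattern.test(q)) return { valid: false, reason: "blocked pattern" };
  }
  return { valid: true };
}
```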

Why It Works This Way

JSONB Storage With Text-to-SQL

JSONB row storage combined with text-to-SQL means agents can query CSV data using natural language. There is no custom query language to learn, and the schema-less JSONB approach handles CSVs with varying column structures without requiring DDL changes.
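For instance, a question like "what is the total amount?" might compile to SQL along these lines. The amount column is hypothetical; the ->> operator extracts a JSONB field as text, which is then cast for aggregation — this is standard Postgres JSONB syntax, though the exact queries the LLM emits will vary:

```typescript
// Illustrative shape of a generated query over JSONB rows.
// "amount" is a hypothetical column name from an uploaded CSV.
const generatedSql = `
  SELECT SUM((data->>'amount')::numeric) AS total
  FROM structured_data_rows
  WHERE "userId" = $USER_ID$
    AND "documentId" = 'doc-uuid'
`;
```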

Query Validation Prevents Injection

The validation layer (SELECT-only, must filter by user ID, blocked mutation patterns) prevents SQL injection while still allowing flexible analytical queries. The $USER_ID$ placeholder ensures RLS-equivalent filtering even when queries run via an RPC function.

Per-Column Type Inference

Storing detected types (date, currency, number, percentage, boolean, text) per column lets the text-to-SQL service generate appropriate query patterns. A currency column gets SUM and AVG aggregations; a date column gets range filters. Without type metadata, the LLM would have to guess column semantics from names alone.
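A type-inference pass of this kind can be sketched as a majority vote over sampled values. The six categories match tabularInfo.detectedTypes; the heuristics below are illustrative, not the production rules:

```typescript
// Sketch of per-column type inference: each sampled value votes on a type,
// and the column takes the majority. Heuristics are illustrative only.
type ColumnType = "date" | "currency" | "number" | "percentage" | "boolean" | "text";

function inferValueType(value: string): ColumnType {
  const v = value.trim();
  if (/^(true|false|yes|no)$/i.test(v)) return "boolean";
  if (/^[$€£]\s?-?[\d,]+(\.\d+)?$/.test(v)) return "currency";
  if (/^-?\d+(\.\d+)?%$/.test(v)) return "percentage";
  if (/^-?[\d,]+(\.\d+)?$/.test(v)) return "number";
  // Parseable as a date and contains a 4-digit year (avoids bare numbers).
  if (!Number.isNaN(Date.parse(v)) && /\d{4}/.test(v)) return "date";
  return "text";
}

function inferColumnType(samples: string[]): ColumnType {
  const counts = new Map<ColumnType, number>();
  for (const s of samples) {
    const t = inferValueType(s);
    counts.set(t, (counts.get(t) ?? 0) + 1);
  }
  // Majority vote; fall back to text when no samples match anything.
  let best: ColumnType = "text";
  let bestCount = 0;
  for (const [t, c] of counts) {
    if (c > bestCount) { best = t; bestCount = c; }
  }
  return best;
}
```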

Separate Storage From Documents

Keeping tabular data in its own structured_data_rows table rather than embedding it in the main documents table keeps the document model simple. Documents remain focused on text content and metadata, while structured data gets its own optimised storage with per-row indexing.

Configuration

Env Var             Description
DATABASE_URL        Postgres connection string for app_user role (RLS enforced)
OPENAI_API_KEY      Used for text-to-SQL LLM calls
OPENROUTER_API_KEY  Alternative LLM provider for text-to-SQL

Code Reference

File                                                      Description
packages/db/src/schema/documents.ts                       structured_data_rows table definition
apps/data-plane/src/services/text-to-sql.ts               SQL generation, validateSqlQuery, DOCUMENTS_SCHEMA constant, execution
apps/data-plane/src/services/tabular-detection.ts         Tabular structure detection
apps/data-plane/src/services/structured-extraction.ts     Structured data extraction and row storage
apps/data-plane/src/tools/definitions/query-documents.ts  LLM tool definition for text-to-SQL
apps/api/src/services/text-to-sql.ts                      API route for text-to-SQL queries

Relationships

  • Ingestion Pipeline — Tabular detection and extraction run as part of the ingestion stages
  • Record Manager — Deduplication applies to tabular documents the same as any other file type
  • Spaces — All structured data rows are scoped to a space via spaceId and RLS
  • Agents — Agents use the query-documents tool to run natural language queries against tabular data


© 2026 Condelo. All rights reserved.