Tabular Data & SQL

CSV/JSON detection and parsing, structured_data_rows storage (JSONB), text-to-SQL query generation, type inference, query validation and execution.

Overview

Condelo supports structured tabular data alongside unstructured documents. When a CSV file is uploaded or tabular structure is detected during ingestion, rows are parsed and stored in a dedicated structured_data_rows table as JSONB. Column metadata — names, types, and row count — is stored alongside the document record.

Text-to-SQL enables natural language querying against this structured data. An LLM generates SQL from a user's question, the query is validated against a strict allowlist of patterns, and results are returned — all without requiring users to learn a query language.

Key Concepts

  • Tabular detection — CSV files are detected by extension. Other file formats undergo tabular structure detection during ingestion to identify embedded tables.
  • JSONB row storage — Each row is stored as a JSONB object in structured_data_rows, preserving the original column structure without requiring a fixed schema.
  • Type inference — Column types are automatically inferred (date, currency, number, percentage, boolean, text) and stored per column for use in query generation.
  • Text-to-SQL — An LLM translates natural language questions into validated SQL queries that run against the stored tabular data.
  • Data structure classification — Each document is classified as prose, tabular, mixed, or structured via the documents.dataStructure field.

Data Model

structured_data_rows table:

Column      Type        Notes
id          uuid (PK)   Primary key
documentId  uuid (FK)   References documents.id
userId      uuid        Owner, used for RLS
spaceId     uuid (FK)   References spaces.id
sourceId    uuid (FK)   References sources.id
rowIndex    integer     Row position in the original file
data        jsonb       Row data as key-value pairs
createdAt   timestamp
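As an illustration of the row shape, a single CSV row might land in structured_data_rows like this. All IDs and column names here are hypothetical, chosen only to show how the data column mirrors the original CSV header:

```typescript
// Hypothetical structured_data_rows record for one row of an invoices CSV.
// IDs and column names are illustrative, not taken from a real dataset.
const row = {
  id: "row-uuid",
  documentId: "doc-uuid",
  userId: "user-uuid",
  spaceId: "space-uuid",
  sourceId: "source-uuid",
  rowIndex: 0,
  // `data` preserves the original CSV columns as a JSONB key-value object,
  // so no fixed schema is needed per upload.
  data: { invoice_date: "2024-03-01", amount: "1250.00", paid: "true" },
  createdAt: new Date().toISOString(),
};

console.log(Object.keys(row.data)); // the original column names
```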

documents (relevant columns):

Column         Type   Notes
tabularInfo    jsonb  { columnNames, rowCount, detectedTypes }
dataStructure  text   One of: prose, tabular, mixed, structured

How It Works

  1. Detection — During ingestion, CSV files are identified by extension. Other files undergo tabular structure detection to find embedded tables.
  2. Row extraction — Rows are parsed from the file and each row is stored as a JSONB object in structured_data_rows, keyed by column name.
  3. Metadata storage — Column names, row count, and per-column detected types are written to documents.tabularInfo. The document's dataStructure field is set to tabular, mixed, or structured as appropriate.
  4. Type inference — Each column is analysed to detect its type: date, currency, number, percentage, boolean, or text. These types are stored in tabularInfo.detectedTypes.
  5. Natural language query — When a user or agent asks a question about tabular data, the text-to-SQL service generates a SQL query.
  6. Schema description — The LLM receives the table structure, column types, and usage notes via the DOCUMENTS_SCHEMA constant, enabling it to generate appropriate WHERE clauses and aggregations.
  7. Query validation — The generated SQL is validated by validateSqlQuery:
    • Must start with SELECT
    • Must reference documents or structured_data_rows
    • Must include a $USER_ID$ placeholder (replaced with the actual user ID before execution)
    • Blocked patterns: INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, UNION, SQL comments, multi-statement queries
  8. Execution — The validated query runs via the executeSafeQuery RPC function, with a maximum of 50 rows returned.
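The validation rules in step 7 can be sketched as follows. This is a minimal illustration of the documented checks, not the actual validateSqlQuery implementation in apps/data-plane/src/services/text-to-sql.ts, whose signature and rule set may differ:

```typescript
// Sketch of the allowlist-style validation described in step 7.
function validateSqlQuery(sql: string): { valid: boolean; reason?: string } {
  const q = sql.trim();
  // Must start with SELECT — no mutations, no CTE-wrapped writes.
  if (!/^select\b/i.test(q)) return { valid: false, reason: "must start with SELECT" };
  // Must reference one of the two queryable tables.
  if (!/\b(documents|structured_data_rows)\b/i.test(q))
    return { valid: false, reason: "must reference a known table" };
  // Must include the placeholder that is swapped for the real user ID.
  if (!q.includes("$USER_ID$"))
    return { valid: false, reason: "missing $USER_ID$ placeholder" };
  const blocked = [
    /\b(insert|update|delete|drop|alter|create|truncate|union)\b/i, // mutations & UNION
    /--|\/\*/, // SQL comments
    /;.+/s,    // multi-statement queries (content after a semicolon)
  ];
  for (const pattern of blocked) {
    if (pattern.test(q)) return { valid: false, reason: "blocked pattern" };
  }
  return { valid: true };
}
```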

Why It Works This Way

JSONB Storage With Text-to-SQL

JSONB row storage combined with text-to-SQL means agents can query CSV data using natural language. There is no custom query language to learn, and the schema-less JSONB approach handles CSVs with varying column structures without requiring DDL changes.
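For instance, a question like "what is the total amount?" might compile to SQL along these lines. The amount column is hypothetical; the ->> operator extracts a JSONB field as text, which is then cast for aggregation — this is standard Postgres JSONB syntax, though the exact queries the LLM emits will vary:

```typescript
// Illustrative shape of a generated query over JSONB rows.
// "amount" is a hypothetical column name from an uploaded CSV.
const generatedSql = `
  SELECT SUM((data->>'amount')::numeric) AS total
  FROM structured_data_rows
  WHERE "userId" = $USER_ID$
    AND "documentId" = 'doc-uuid'
`;
```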

Query Validation Prevents Injection

The validation layer (SELECT-only, must filter by user ID, blocked mutation patterns) prevents SQL injection while still allowing flexible analytical queries. The $USER_ID$ placeholder ensures RLS-equivalent filtering even when queries run via an RPC function.

Per-Column Type Inference

Storing detected types (date, currency, number, percentage, boolean, text) per column lets the text-to-SQL service generate appropriate query patterns. A currency column gets SUM and AVG aggregations; a date column gets range filters. Without type metadata, the LLM would have to guess column semantics from names alone.
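A type-inference pass of this kind can be sketched as a majority vote over sampled values. The six categories match tabularInfo.detectedTypes; the heuristics below are illustrative, not the production rules:

```typescript
// Sketch of per-column type inference: each sampled value votes on a type,
// and the column takes the majority. Heuristics are illustrative only.
type ColumnType = "date" | "currency" | "number" | "percentage" | "boolean" | "text";

function inferValueType(value: string): ColumnType {
  const v = value.trim();
  if (/^(true|false|yes|no)$/i.test(v)) return "boolean";
  if (/^[$€£]\s?-?[\d,]+(\.\d+)?$/.test(v)) return "currency";
  if (/^-?\d+(\.\d+)?%$/.test(v)) return "percentage";
  if (/^-?[\d,]+(\.\d+)?$/.test(v)) return "number";
  // Parseable as a date and contains a 4-digit year (avoids bare numbers).
  if (!Number.isNaN(Date.parse(v)) && /\d{4}/.test(v)) return "date";
  return "text";
}

function inferColumnType(samples: string[]): ColumnType {
  const counts = new Map<ColumnType, number>();
  for (const s of samples) {
    const t = inferValueType(s);
    counts.set(t, (counts.get(t) ?? 0) + 1);
  }
  // Majority vote; fall back to text when no samples match anything.
  let best: ColumnType = "text";
  let bestCount = 0;
  for (const [t, c] of counts) {
    if (c > bestCount) { best = t; bestCount = c; }
  }
  return best;
}
```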

Separate Storage From Documents

Keeping tabular data in its own structured_data_rows table rather than embedding it in the main documents table keeps the document model simple. Documents remain focused on text content and metadata, while structured data gets its own optimised storage with per-row indexing.

Configuration

Env Var             Description
DATABASE_URL        Postgres connection string for app_user role (RLS enforced)
OPENAI_API_KEY      Used for text-to-SQL LLM calls
OPENROUTER_API_KEY  Alternative LLM provider for text-to-SQL

Code Reference

File                                                      Description
packages/db/src/schema/documents.ts                       structured_data_rows table definition
apps/data-plane/src/services/text-to-sql.ts               SQL generation, validateSqlQuery, DOCUMENTS_SCHEMA constant, execution
apps/data-plane/src/services/tabular-detection.ts         Tabular structure detection
apps/data-plane/src/services/structured-extraction.ts     Structured data extraction and row storage
apps/data-plane/src/tools/definitions/query-documents.ts  LLM tool definition for text-to-SQL
apps/api/src/services/text-to-sql.ts                      API route for text-to-SQL queries

Relationships

  • Ingestion Pipeline — Tabular detection and extraction run as part of the ingestion stages
  • Record Manager — Deduplication applies to tabular documents the same as any other file type
  • Spaces — All structured data rows are scoped to a space via spaceId and RLS
  • Agents — Agents use the query-documents tool to run natural language queries against tabular data


© 2026 Condelo. All rights reserved.