Open Specifications
Build your own tools or integrate with existing workflows
Fact Extract uses open, documented formats for both its extraction structures and its output databases. You are free to use these specifications to build your own tools, import data into other systems, or extend the platform in any way you see fit.
This page documents two things: the JSON structure format that defines what data to extract from your documents, and the SQLite database schema used by Fact Extract Desktop to store and query results.
1. JSON Extraction Structures
An extraction structure defines the fields that FactExtract.Net will extract from your documents. Each field is essentially a question that the AI answers for each chunk of text. Structures are defined in JSON and can be imported directly into FactExtract.Net.
Structure Format
The JSON file contains a metadata section and a structures array. Each structure has a name, description, and an array of field definitions.
{
  "metadata": {
    "description": "Brief description of this import file",
    "instructions": "Any notes about the structure"
  },
  "structures": [
    {
      "name": "Your Structure Name",
      "description": "Describe what this structure extracts",
      "overview_prompt": "These documents are [type] from [source]. They contain [topics].",
      "structure_data": [
        {
          "name": "field_name",
          "type": "string",
          "description": "The question/instruction for what to extract",
          "required": true
        }
      ]
    }
  ]
}
Required Properties
Top Level
- `metadata`: Object with `description` and `instructions` (recommended)
- `structures`: Array of structure definitions (required)
For Each Structure
- `name`: Structure name (required, max 100 characters)
- `description`: What this structure extracts (required)
- `overview_prompt`: A Content Description template with bracket placeholders (optional, max 2000 characters). When the structure is selected on the upload page, this text pre-populates the Content Description field, giving users a helpful starting point.
- `structure_data`: Array of field definitions (required, 1-50 fields)
For Each Field
- `name`: Field identifier (required, snake_case, letters/numbers/underscores only, max 64 characters)
- `type`: Data type (required, see Field Types below)
- `description`: The extraction instruction/question (required, max 5000 characters)
- `required`: Boolean (optional, defaults to `true`). FactExtract.Net produces a value for every field regardless of this setting. Recommended: set to `true` for consistency.
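The constraints above are easy to check mechanically before importing a file. The sketch below is an illustrative validator written for this page, not part of FactExtract.Net itself; the regex and limits simply mirror the rules listed.

```python
import re

# Types documented in the Field Types section below.
FIELD_TYPES = {"string", "number", "integer", "boolean",
               "date", "time", "datetime", "array", "records_array"}

def validate_structure(struct: dict) -> list[str]:
    """Return a list of problems found in one structure definition."""
    errors = []
    if not struct.get("name") or len(struct["name"]) > 100:
        errors.append("name is required (max 100 characters)")
    if not struct.get("description"):
        errors.append("description is required")
    fields = struct.get("structure_data", [])
    if not 1 <= len(fields) <= 50:
        errors.append("structure_data must contain 1-50 fields")
    for f in fields:
        name = f.get("name", "")
        # snake_case: letters/numbers/underscores, max 64 characters
        if not re.fullmatch(r"[A-Za-z0-9_]{1,64}", name):
            errors.append(f"bad field name: {name!r}")
        if f.get("type") not in FIELD_TYPES:
            errors.append(f"{name}: unknown type {f.get('type')!r}")
        if not f.get("description") or len(f["description"]) > 5000:
            errors.append(f"{name}: description required (max 5000 chars)")
    return errors
```

Run it over each entry in the `structures` array before import; an empty list means the structure passes these checks.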
Complete Example
A structure for extracting data from historical correspondence:
{
  "metadata": {
    "description": "FactExtract.Net Structure for Historical Correspondence",
    "instructions": "Extracts key information from letters, memos, and other correspondence."
  },
  "structures": [
    {
      "name": "Historical Correspondence",
      "description": "Extract key information from historical letters and memos",
      "overview_prompt": "These documents are [type of correspondence, e.g., letters, memos] from [time period/source]. They involve [parties/organizations] and contain information about [topics/events].",
      "structure_data": [
        {
          "name": "document_date",
          "type": "date",
          "description": "The date the document was written or sent",
          "required": true
        },
        {
          "name": "author",
          "type": "string",
          "description": "Who wrote or sent the document",
          "required": true
        },
        {
          "name": "recipient",
          "type": "string",
          "description": "Who the document was addressed to",
          "required": true
        },
        {
          "name": "subject_summary",
          "type": "string",
          "description": "A one-sentence summary of the document's main topic",
          "required": true
        },
        {
          "name": "people_mentioned",
          "type": "array",
          "description": "List all people mentioned in the document by name",
          "required": true
        },
        {
          "name": "key_facts",
          "type": "string",
          "description": "List the most important facts, events, or claims made",
          "required": true
        }
      ]
    }
  ]
}
2. Field Types
Standard Types
| Type | Use For | Example |
|---|---|---|
| `string` | Text values (names, titles, short descriptions) | Person's name, document title |
| `number` | Decimal numbers | Amounts, prices, measurements |
| `integer` | Whole numbers | Counts, page numbers, years |
| `boolean` | Yes/No values | "Is this a complaint?" |
| `date` | Dates (extracted as YYYY-MM-DD) | Document date, event date |
| `time` | Times (extracted as HH:MM:SS) | Meeting time |
| `datetime` | Combined date and time | Timestamp of event |
| `array` | List of text values (stored as JSON in one cell) | List of names mentioned |
Records Array (Multi-Record Extraction)
The `records_array` type enables multi-record extraction, where multiple similar items from a single page or chunk each become their own row in the output. For example, each day's working hours from a timecard summary can be extracted as a separate row in Excel. Only one `records_array` field is allowed per structure, and its sub-fields can only be simple types (no nested arrays).
{
  "name": "daily_hours",
  "type": "records_array",
  "description": "Extract each day's working hours from the timecard summary",
  "required": true,
  "record_fields": [
    {"name": "work_date", "type": "date", "description": "Date worked"},
    {"name": "hours_regular", "type": "number", "description": "Regular hours worked"},
    {"name": "hours_overtime", "type": "number", "description": "Overtime hours worked"},
    {"name": "notes", "type": "string", "description": "Any notes or exceptions for the day"}
  ]
}
Array vs. Records Array
| Feature | `array` | `records_array` |
|---|---|---|
| Output | One cell with JSON/comma-separated values | Multiple rows, one per item |
| Sub-fields | No (just a list of strings) | Yes (each item has defined fields) |
| Best for | Simple lists (names, keywords) | Structured repeated data (transactions, line items) |
| Limit | Unlimited per structure | Only one per structure |
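To make the difference in output shape concrete, here is a minimal sketch of expanding one `records_array` value into rows, assuming the extracted value arrives as a JSON array of objects. The function name and row layout are illustrative, not FactExtract.Net's internals.

```python
import json

def expand_records(base_row: dict, records_json: str) -> list[dict]:
    """Turn one records_array value (a JSON array of objects) into
    one output row per record, copying page-level context into each."""
    records = json.loads(records_json)
    return [{**base_row, **record} for record in records]
```

With a two-record `daily_hours` value, `expand_records({"filename": "timecard.pdf", "page_number": 3}, value)` yields two rows, each carrying the filename and page number alongside that record's own fields.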
Implementation Notes
records_array is not a standard JSON Schema or LLM tool-use type. Any software that uses it must provide its own methodology for translating between the records array format and what the LLM or other systems require.
Some LLMs only support dates as specially formatted strings rather than distinct types. Software using these structures may need to map date, time, and datetime to string with appropriate instructions. Consult the API documentation for your LLM of choice.
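As a sketch of that mapping, the hypothetical helper below translates one field definition into a JSON-Schema-style property, folding the date/time formats into the description. The exact tool-schema shape your LLM provider expects may differ; treat this as one possible translation, not a prescribed one.

```python
def field_to_json_schema(field: dict) -> dict:
    """Translate one FactExtract field definition into a JSON-Schema-style
    property for an LLM tool call. date/time/datetime become strings with
    a format hint folded into the description."""
    t = field["type"]
    if t in ("date", "time", "datetime"):
        fmt = {"date": "YYYY-MM-DD", "time": "HH:MM:SS",
               "datetime": "YYYY-MM-DD HH:MM:SS"}[t]
        return {"type": "string",
                "description": f"{field['description']} (format: {fmt})"}
    if t == "array":
        # Simple list of strings, matching the array type documented above
        return {"type": "array", "items": {"type": "string"},
                "description": field["description"]}
    return {"type": t, "description": field["description"]}
```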
3. SQLite Database Schema
FactExtract.Net outputs a SQLite database that can be opened with Fact Extract Desktop or any SQLite-compatible tool. The schema below shows all tables, including those added by Desktop for collections, notes, tags, and bookmarks.
Table Overview
Core Tables (FactExtract.Net Output)
- `documents`: One row per logical document or chunk
- `pages`: One row per page, with OCR content
- `ResponseSchema`: Field definitions for the extraction structure, plus any system-added fields
- `LLMResults`: Extracted values linked to pages and fields
- `ChunkMetadata`: Document boundary detection results
- `ProcessLog`: Processing event log
- `pages_fts`: FTS5 full-text search on page content
- `pages_fts_config`, `pages_fts_data`, `pages_fts_docsize`, `pages_fts_idx`: FTS5 internal tables
Desktop Tables (Added by Desktop App)
- `fe_collections`: User-created collections
- `fe_collection_items`: Pages/documents in collections
- `fe_notes`: User notes on pages and documents
- `fe_notes_fts`: FTS5 full-text search on notes
- `fe_notes_fts_config`, `fe_notes_fts_data`, `fe_notes_fts_docsize`, `fe_notes_fts_idx`: FTS5 internal tables
- `fe_tags`: Tag definitions
- `fe_document_tags`: Tag-to-document assignments
- `fe_bookmarks`: Page bookmarks
- `fe_edited_pages`: Tracks pages with corrected OCR content
- `fe_edited_fields`: Tracks summary fields with corrected values
- `fe_custom_fields`: User-defined custom field definitions
- `fe_custom_values`: Custom field values per page
- `fe_vocabulary`: Custom vocabulary / dictionary
- `fe_agent_projects`, `fe_agent_phases`, `fe_agent_jobs`, `fe_agent_progress`, `fe_agent_sessions`, `fe_agent_notes`: MCP agent project planning and notes
- `fe_settings`: Application settings
- `fe_schema_version`: Schema migration tracking
- `content_embeddings`: Per-page content vector chunks
- `field_embeddings`: Per-document summary field vectors
- `embedding_metadata`: Model configuration and settings
Complete Schema
The full SQLite schema, including indexes and full-text search tables:
-- ===========================================================
-- Core Tables (created by FactExtract.Net processing)
-- ===========================================================
-- Document boundary detection results
CREATE TABLE IF NOT EXISTS "ChunkMetadata" (
"ChunkID" INTEGER PRIMARY KEY AUTOINCREMENT,
"ChunkMethod" TEXT NOT NULL,
"ChunkSize" INTEGER,
"ContextToken" TEXT,
"OriginalPrompt" TEXT,
"Title" TEXT,
"StartPage" INTEGER,
"EndPage" INTEGER,
"Confidence" REAL,
"ProcessedAt" TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Extracted field values linked to pages, fields, and chunks
CREATE TABLE IF NOT EXISTS "LLMResults" (
"ResponseID" INTEGER PRIMARY KEY AUTOINCREMENT,
"PageID" INTEGER,
"LLMResponse" TEXT NOT NULL,
"FieldID" INTEGER,
"ChunkID" INTEGER,
"ProcessedAt" TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY("ChunkID") REFERENCES "ChunkMetadata"("ChunkID"),
FOREIGN KEY("FieldID") REFERENCES "ResponseSchema"("FieldID"),
FOREIGN KEY("PageID") REFERENCES "pages"("page_id")
);
-- Processing event log
CREATE TABLE IF NOT EXISTS "ProcessLog" (
"LogID" INTEGER PRIMARY KEY AUTOINCREMENT,
"Timestamp" TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
"EventType" TEXT NOT NULL,
"Message" TEXT NOT NULL,
"Details" TEXT
);
-- Field definitions for the extraction structure
CREATE TABLE IF NOT EXISTS "ResponseSchema" (
"FieldID" INTEGER PRIMARY KEY AUTOINCREMENT,
"FieldName" TEXT NOT NULL,
"FieldDescription" TEXT,
"FieldDataType" TEXT NOT NULL
);
-- One row per logical document or chunk
CREATE TABLE IF NOT EXISTS "documents" (
"document_id" INTEGER PRIMARY KEY AUTOINCREMENT,
"title" TEXT NOT NULL,
"filename" TEXT NOT NULL,
"source_path" TEXT NOT NULL,
"pdf_path" TEXT,
"page_count" INTEGER NOT NULL,
"import_date" TIMESTAMP NOT NULL,
"source_database" TEXT,
"notes" TEXT
);
-- One row per page, with OCR content
CREATE TABLE IF NOT EXISTS "pages" (
"page_id" INTEGER PRIMARY KEY AUTOINCREMENT,
"document_id" INTEGER NOT NULL,
"page_number" INTEGER NOT NULL,
"content" TEXT NOT NULL,
"content_type" TEXT DEFAULT 'text',
"pdf_page_path" TEXT,
"image_page_path" TEXT,
"import_date" TIMESTAMP NOT NULL,
"embedding_status" TEXT,
"embedding_date" TIMESTAMP,
"embedding_model" TEXT,
FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);
-- Full-text search index on page content
CREATE VIRTUAL TABLE pages_fts USING fts5(
content,
content='pages',
content_rowid='page_id'
);
-- FTS5 internal tables (auto-created by pages_fts)
CREATE TABLE 'pages_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TABLE 'pages_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'pages_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'pages_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
-- ===========================================================
-- Desktop Tables (added by Fact Extract Desktop)
-- ===========================================================
-- User-created collections
CREATE TABLE IF NOT EXISTS "fe_collections" (
"collection_id" INTEGER PRIMARY KEY,
"name" TEXT NOT NULL,
"description" TEXT,
"color" TEXT,
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL
);
-- Pages and documents assigned to collections
CREATE TABLE IF NOT EXISTS "fe_collection_items" (
"item_id" INTEGER PRIMARY KEY,
"collection_id" INTEGER NOT NULL,
"document_id" INTEGER NOT NULL,
"page_id" INTEGER,
"added_date" TEXT NOT NULL,
FOREIGN KEY("collection_id") REFERENCES "fe_collections"("collection_id") ON DELETE CASCADE
);
-- User notes on pages and documents
CREATE TABLE IF NOT EXISTS "fe_notes" (
"note_id" INTEGER PRIMARY KEY,
"document_id" INTEGER NOT NULL,
"page_id" INTEGER,
"note_text" TEXT NOT NULL,
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL
);
-- Page bookmarks
CREATE TABLE IF NOT EXISTS "fe_bookmarks" (
"bookmark_id" INTEGER PRIMARY KEY,
"document_id" INTEGER NOT NULL,
"page_id" INTEGER,
"label" TEXT,
"created_date" TEXT NOT NULL
);
-- Tag definitions
CREATE TABLE IF NOT EXISTS "fe_tags" (
"tag_id" INTEGER PRIMARY KEY,
"name" TEXT NOT NULL UNIQUE,
"color" TEXT
);
-- Tag-to-document assignments
CREATE TABLE IF NOT EXISTS "fe_document_tags" (
"document_id" INTEGER NOT NULL,
"tag_id" INTEGER NOT NULL,
PRIMARY KEY("document_id", "tag_id"),
FOREIGN KEY("tag_id") REFERENCES "fe_tags"("tag_id") ON DELETE CASCADE
);
-- Application settings (key-value store)
CREATE TABLE IF NOT EXISTS "fe_settings" (
"key" TEXT PRIMARY KEY,
"value" TEXT
);
-- Schema migration tracking
CREATE TABLE IF NOT EXISTS "fe_schema_version" (
"version" INTEGER PRIMARY KEY,
"applied_date" TEXT NOT NULL,
"description" TEXT
);
-- Tracks pages where OCR content has been corrected
CREATE TABLE IF NOT EXISTS "fe_edited_pages" (
"page_id" INTEGER PRIMARY KEY
);
-- Tracks summary fields where extracted values have been corrected
CREATE TABLE IF NOT EXISTS "fe_edited_fields" (
"page_id" INTEGER NOT NULL,
"field_id" INTEGER NOT NULL,
PRIMARY KEY("page_id", "field_id")
);
-- User-defined custom field definitions
CREATE TABLE IF NOT EXISTS "fe_custom_fields" (
"field_id" INTEGER PRIMARY KEY,
"field_name" TEXT NOT NULL UNIQUE,
"field_description" TEXT,
"field_data_type" TEXT DEFAULT 'string',
"created_date" TEXT NOT NULL
);
-- Custom field values per page
CREATE TABLE IF NOT EXISTS "fe_custom_values" (
"value_id" INTEGER PRIMARY KEY,
"page_id" INTEGER NOT NULL,
"field_id" INTEGER NOT NULL,
"field_value" TEXT NOT NULL,
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL,
UNIQUE("page_id", "field_id"),
FOREIGN KEY("field_id") REFERENCES "fe_custom_fields"("field_id") ON DELETE CASCADE
);
-- Custom vocabulary / dictionary
CREATE TABLE IF NOT EXISTS "fe_vocabulary" (
"word" TEXT PRIMARY KEY
);
-- ===========================================================
-- Agent Tables (MCP chatbot project planning and notes)
-- ===========================================================
-- Agent projects
CREATE TABLE IF NOT EXISTS "fe_agent_projects" (
"project_id" INTEGER PRIMARY KEY,
"name" TEXT NOT NULL,
"goal" TEXT NOT NULL,
"status" TEXT NOT NULL DEFAULT 'active'
CHECK("status" IN ('active', 'paused', 'completed', 'archived')),
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL
);
-- Project phases
CREATE TABLE IF NOT EXISTS "fe_agent_phases" (
"phase_id" INTEGER PRIMARY KEY,
"project_id" INTEGER NOT NULL,
"phase_number" INTEGER NOT NULL,
"name" TEXT NOT NULL,
"description" TEXT,
"status" TEXT NOT NULL DEFAULT 'pending'
CHECK("status" IN ('pending', 'active', 'completed', 'skipped')),
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL,
FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE
);
-- Jobs within phases
CREATE TABLE IF NOT EXISTS "fe_agent_jobs" (
"job_id" INTEGER PRIMARY KEY,
"phase_id" INTEGER NOT NULL,
"job_number" INTEGER NOT NULL,
"name" TEXT NOT NULL,
"description" TEXT,
"status" TEXT NOT NULL DEFAULT 'pending'
CHECK("status" IN ('pending', 'active', 'completed', 'skipped', 'blocked')),
"result_summary" TEXT,
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL,
FOREIGN KEY("phase_id") REFERENCES "fe_agent_phases"("phase_id") ON DELETE CASCADE
);
-- Progress log entries
CREATE TABLE IF NOT EXISTS "fe_agent_progress" (
"entry_id" INTEGER PRIMARY KEY,
"project_id" INTEGER NOT NULL,
"phase_id" INTEGER,
"job_id" INTEGER,
"entry_type" TEXT NOT NULL DEFAULT 'action'
CHECK("entry_type" IN ('action', 'discovery', 'decision', 'observation', 'error', 'milestone')),
"content" TEXT NOT NULL,
"created_date" TEXT NOT NULL,
FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE,
FOREIGN KEY("phase_id") REFERENCES "fe_agent_phases"("phase_id") ON DELETE SET NULL,
FOREIGN KEY("job_id") REFERENCES "fe_agent_jobs"("job_id") ON DELETE SET NULL
);
-- Agent sessions
CREATE TABLE IF NOT EXISTS "fe_agent_sessions" (
"session_id" INTEGER PRIMARY KEY,
"project_id" INTEGER NOT NULL,
"label" TEXT,
"started_date" TEXT NOT NULL,
"ended_date" TEXT,
"summary" TEXT,
"status" TEXT NOT NULL DEFAULT 'active'
CHECK("status" IN ('active', 'completed', 'abandoned')),
FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE
);
-- Agent notes
CREATE TABLE IF NOT EXISTS "fe_agent_notes" (
"note_id" INTEGER PRIMARY KEY,
"project_id" INTEGER,
"title" TEXT NOT NULL,
"content" TEXT NOT NULL,
"tags" TEXT DEFAULT '[]',
"created_date" TEXT NOT NULL,
"modified_date" TEXT NOT NULL,
FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE SET NULL
);
-- Full-text search index on user notes
CREATE VIRTUAL TABLE fe_notes_fts USING fts5(
note_text,
content='fe_notes',
content_rowid='note_id'
);
-- FTS5 internal tables (auto-created by fe_notes_fts)
CREATE TABLE 'fe_notes_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TABLE 'fe_notes_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'fe_notes_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'fe_notes_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
-- ===========================================================
-- Embeddings Tables (added by Fact Extract Desktop)
-- ===========================================================
-- Per-page content embedding chunks
CREATE TABLE IF NOT EXISTS "content_embeddings" (
"chunk_id" INTEGER PRIMARY KEY AUTOINCREMENT,
"page_id" INTEGER NOT NULL,
"document_id" INTEGER NOT NULL,
"chunk_index" INTEGER NOT NULL,
"chunk_text" TEXT NOT NULL,
"embedding" BLOB NOT NULL,
"token_count" INTEGER,
FOREIGN KEY("page_id") REFERENCES "pages"("page_id"),
FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);
-- Per-document summary field embeddings (concatenated field values)
CREATE TABLE IF NOT EXISTS "field_embeddings" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"document_id" INTEGER NOT NULL,
"chunk_page_start" INTEGER NOT NULL,
"chunk_page_end" INTEGER NOT NULL,
"embedding" BLOB NOT NULL,
"composite_text" TEXT NOT NULL,
FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);
-- Embedding model configuration (key-value)
CREATE TABLE IF NOT EXISTS "embedding_metadata" (
"key" TEXT PRIMARY KEY,
"value" TEXT NOT NULL
);
-- ===========================================================
-- Indexes
-- ===========================================================
CREATE INDEX IF NOT EXISTS "idx_pages_documentid"
ON "pages" ("document_id");
CREATE INDEX IF NOT EXISTS "idx_llmresults_pageid"
ON "LLMResults" ("PageID");
CREATE INDEX IF NOT EXISTS "idx_llmresults_fieldid"
ON "LLMResults" ("FieldID");
CREATE INDEX IF NOT EXISTS "idx_llmresults_chunkid"
ON "LLMResults" ("ChunkID");
CREATE INDEX IF NOT EXISTS "idx_fe_bookmarks_document"
ON "fe_bookmarks" ("document_id");
CREATE INDEX IF NOT EXISTS "idx_fe_collection_items_doc"
ON "fe_collection_items" ("document_id");
CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_collection_items_unique"
ON "fe_collection_items" ("collection_id", "document_id", IFNULL("page_id", 0));
CREATE INDEX IF NOT EXISTS "idx_fe_notes_document"
ON "fe_notes" ("document_id");
CREATE INDEX IF NOT EXISTS "idx_fe_notes_page"
ON "fe_notes" ("page_id");
CREATE INDEX IF NOT EXISTS "idx_fe_custom_values_page"
ON "fe_custom_values" ("page_id");
CREATE INDEX IF NOT EXISTS "idx_content_embeddings_page"
ON "content_embeddings" ("page_id");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_projects_status"
ON "fe_agent_projects" ("status");
CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_agent_phases_order"
ON "fe_agent_phases" ("project_id", "phase_number");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_phases_project"
ON "fe_agent_phases" ("project_id");
CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_agent_jobs_order"
ON "fe_agent_jobs" ("phase_id", "job_number");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_jobs_phase"
ON "fe_agent_jobs" ("phase_id");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_progress_project"
ON "fe_agent_progress" ("project_id");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_progress_date"
ON "fe_agent_progress" ("project_id", "created_date");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_sessions_project"
ON "fe_agent_sessions" ("project_id");
CREATE INDEX IF NOT EXISTS "idx_fe_agent_notes_project"
ON "fe_agent_notes" ("project_id");
System-Added Fields
In addition to the fields you define in your extraction structure, FactExtract.Net
adds its own fields to ResponseSchema.
These appear as extra columns in Excel and extra rows in LLMResults.
FactExtract_System_Note: A review note that flags potential issues for human review (ambiguous values, contradictions, OCR artifacts, etc.). The cell is blank when no issues are detected.
When a database is re-processed using Database Upload, each run appends a new
versioned note field: FactExtract_System_Note_1,
FactExtract_System_Note_2, and so on.
Prior notes are preserved.
Software that reads ResponseSchema
dynamically (as recommended in Section 4) will pick these up automatically.
Hard-coded queries should account for field names they don't recognize.
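Reading the schema dynamically might look like this. The helper is illustrative and assumes only the ResponseSchema table documented above; it separates user-defined fields from the system-added note fields by name prefix.

```python
import sqlite3

def load_fields(conn: sqlite3.Connection) -> tuple[list[str], list[str]]:
    """Read field names from ResponseSchema at runtime, splitting
    user-defined fields from system-added note fields."""
    rows = conn.execute(
        "SELECT FieldName FROM ResponseSchema ORDER BY FieldID").fetchall()
    names = [r[0] for r in rows]
    # FactExtract_System_Note, FactExtract_System_Note_1, ...
    system = [n for n in names if n.startswith("FactExtract_System_Note")]
    user = [n for n in names if n not in system]
    return user, system
```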
Key Relationships
- `documents` → `pages`: Each document contains multiple pages, linked by `document_id`
- `LLMResults` → `pages`, `ResponseSchema`, `ChunkMetadata`: Each extracted value links to its source page, field definition, and chunk
- `fe_collections` → `fe_collection_items`: Collections contain document/page references
4. Querying Extracted Data
Because FactExtract.Net supports any number of user-defined fields per structure, the database
cannot use a fixed table with one column per field. Instead, extracted data is stored in a
normalized form across two tables: ResponseSchema
defines the fields, and LLMResults
stores one row per extracted value. Every database will have a different set of fields depending
on the structure that was used to process it. ResponseSchema may also contain system-added
fields (such as FactExtract_System_Note) alongside your extraction fields.
To present extracted data in a usable format (field names as columns, one row per page or document), software must join these tables and pivot the results. This can be done in SQL, in application code, or a combination of both.
Retrieving Extracted Data for a Page
The simplest query joins LLMResults
to ResponseSchema and
pages to get field name / value
pairs for a specific page. This is what Fact Extract Desktop uses in its PDF viewer to display
extracted data alongside the source document:
SELECT rs.FieldName, lr.LLMResponse
FROM LLMResults lr
JOIN pages p ON lr.PageID = p.page_id
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
WHERE p.document_id = ?
AND p.page_number = ?
ORDER BY rs.FieldName;
This returns one row per field, e.g. author | John Smith,
document_date | 1962-03-15, etc.
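From application code, the same query can be wrapped to return a field-name/value dictionary for a page. This sketch uses Python's standard sqlite3 module and the tables exactly as documented:

```python
import sqlite3

def page_fields(conn: sqlite3.Connection, document_id: int,
                page_number: int) -> dict[str, str]:
    """Return {field_name: extracted_value} for one page."""
    rows = conn.execute(
        """SELECT rs.FieldName, lr.LLMResponse
           FROM LLMResults lr
           JOIN pages p ON lr.PageID = p.page_id
           JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
           WHERE p.document_id = ? AND p.page_number = ?
           ORDER BY rs.FieldName""",
        (document_id, page_number))
    return dict(rows.fetchall())
```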
Building the Summary View (Pivot)
For spreadsheet-style output (one row per page, field names as columns), the normalized data
needs to be pivoted. The first step is to query the field names from
ResponseSchema, then
retrieve all values joined across the tables:
-- Step 1: Get the field names (these become your column headers)
SELECT DISTINCT FieldName
FROM ResponseSchema
ORDER BY FieldName;
-- Step 2: Get all extracted values with context
SELECT d.filename, d.document_id, p.page_number,
rs.FieldName, lr.LLMResponse
FROM documents d
JOIN pages p ON d.document_id = p.document_id
JOIN LLMResults lr ON p.page_id = lr.PageID
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
ORDER BY d.filename, p.page_number, rs.FieldName;
The pivot itself is typically done in application code: group the rows by
(filename, document_id, page_number)
and accumulate each FieldName /
LLMResponse pair into a dictionary.
Each group becomes one output row, with the field names from Step 1 as column headers.
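That grouping step can be sketched as follows; the function name and dictionary layout are illustrative:

```python
import sqlite3
from collections import defaultdict

def build_summary_rows(conn: sqlite3.Connection) -> list[dict]:
    """Pivot normalized LLMResults rows into one dict per page,
    keyed by (filename, document_id, page_number)."""
    grouped: dict[tuple, dict] = defaultdict(dict)
    for filename, doc_id, page_no, field, value in conn.execute(
            """SELECT d.filename, d.document_id, p.page_number,
                      rs.FieldName, lr.LLMResponse
               FROM documents d
               JOIN pages p ON d.document_id = p.document_id
               JOIN LLMResults lr ON p.page_id = lr.PageID
               JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
               ORDER BY d.filename, p.page_number, rs.FieldName"""):
        row = grouped[(filename, doc_id, page_no)]
        # Context columns plus one entry per extracted field
        row.update({"filename": filename, "document_id": doc_id,
                    "page_number": page_no, field: value})
    return list(grouped.values())
```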
Alternatively, if you prefer a pure SQL approach, you can use SQLite's
GROUP_CONCAT or
build a dynamic query with CASE WHEN
expressions to create one column per field:
-- Pure SQL pivot (field names must be known in advance)
-- Replace the MAX(CASE...) lines with your actual field names
SELECT
d.filename,
d.document_id,
p.page_number,
MAX(CASE WHEN rs.FieldName = 'author' THEN lr.LLMResponse END) AS author,
MAX(CASE WHEN rs.FieldName = 'document_date' THEN lr.LLMResponse END) AS document_date,
MAX(CASE WHEN rs.FieldName = 'subject_summary' THEN lr.LLMResponse END) AS subject_summary,
MAX(CASE WHEN rs.FieldName = 'key_facts' THEN lr.LLMResponse END) AS key_facts
FROM documents d
JOIN pages p ON d.document_id = p.document_id
JOIN LLMResults lr ON p.page_id = lr.PageID
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
GROUP BY d.filename, d.document_id, p.page_number
ORDER BY d.filename, p.page_number;
Because the field names are user-defined and vary per structure, the pure SQL pivot requires
building the query dynamically from ResponseSchema.
Most applications will find it simpler to use the two-step approach: query the normalized rows,
then pivot in code.
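If you do want the pure SQL route, the query can be generated at runtime from whatever ResponseSchema actually contains. This sketch builds the CASE expressions dynamically, escaping quotes defensively even though valid field names are snake_case:

```python
import sqlite3

def build_pivot_sql(conn: sqlite3.Connection) -> str:
    """Generate the CASE-based pivot query from the fields present
    in ResponseSchema, one output column per field."""
    fields = [r[0] for r in conn.execute(
        "SELECT DISTINCT FieldName FROM ResponseSchema ORDER BY FieldName")]
    cols = ",\n  ".join(
        "MAX(CASE WHEN rs.FieldName = '{0}' THEN lr.LLMResponse END) AS \"{0}\""
        .format(name.replace("'", "''")) for name in fields)
    return f"""SELECT d.filename, d.document_id, p.page_number,
  {cols}
FROM documents d
JOIN pages p ON d.document_id = p.document_id
JOIN LLMResults lr ON p.page_id = lr.PageID
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
GROUP BY d.filename, d.document_id, p.page_number
ORDER BY d.filename, p.page_number"""
```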
5. Embeddings Schema
Fact Extract Desktop can generate vector embeddings for semantic search across page content and summary fields. Embeddings are stored directly in the SQLite database alongside the source data.
Tables
| Table | Purpose |
|---|---|
| `content_embeddings` | One row per text chunk. Pages are split into paragraph-aware chunks (default max 2048 chars, 200-char overlap). Each chunk's vector is stored as a BLOB. |
| `field_embeddings` | One row per document/chunk. The concatenated summary field values for a document are embedded as a single vector, with page range recorded. |
| `embedding_metadata` | Key-value store recording the model, dimensions, pre/post-processing, and chunk strategy used to generate the embeddings. |
| `pages` columns | `embedding_status`, `embedding_date`, `embedding_model` track per-page embedding state. |
Standard Metadata Keys
- `model_name`: e.g. "nomic-embed-text-v1.5"
- `model_file`: e.g. "model_quantized.onnx"
- `embedding_dimensions`: e.g. "768"
- `document_prefix`: prefix prepended for indexing, e.g. "search_document: "
- `query_prefix`: prefix prepended at query time, e.g. "search_query: "
- `postprocessing`: e.g. "mean_pool+layer_norm+l2_norm"
- `chunk_level`: "per_page"
- `chunk_strategy`: e.g. "paragraph_aware|max_chars=2048|overlap_chars=200"
- `total_content_embeddings`
- `total_field_embeddings`
- `creation_date`
- `factextract_version`
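The schema does not specify how vectors are encoded in the BLOB columns. Assuming little-endian float32 values, a common choice that you should verify against `embedding_metadata` and your own data, decoding and comparing vectors might look like this:

```python
import math
import struct

def decode_embedding(blob: bytes, dims: int) -> list[float]:
    """Unpack an embedding BLOB, assuming little-endian float32.
    Verify this against your database; the encoding is not specified."""
    return list(struct.unpack(f"<{dims}f", blob))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Read `embedding_dimensions` from `embedding_metadata` to get the expected vector length, and remember to prepend `query_prefix` to query text before embedding it.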
6. Edit Tracking, Custom Fields & FTS
Edit Tracking
Fact Extract Desktop allows users to correct OCR errors in page content and fix inaccurate summary field values. These tables record which pages and fields have been edited, not the edit history itself.
| Table | Purpose |
|---|---|
| `fe_edited_pages` | Contains the `page_id` of any page whose OCR content has been manually corrected. Presence in this table means the page's `pages.content` differs from the original AI output. |
| `fe_edited_fields` | Contains `(page_id, field_id)` pairs for summary fields whose extracted values have been manually corrected. The corrected value is stored in place in `LLMResults`. |
Custom Fields
Users can define their own fields in Desktop and assign values to individual pages, independent of the extraction structure.
| Table | Purpose |
|---|---|
| `fe_custom_fields` | Defines each custom field: name, description, data type. |
| `fe_custom_values` | Stores the value for each `(page_id, field_id)` pair. A unique constraint prevents duplicate assignments. |
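The unique constraint supports a straightforward upsert. This sketch uses SQLite's ON CONFLICT clause (available in SQLite 3.24+); the ISO timestamp format is our assumption, since the schema only declares the date columns as TEXT:

```python
import sqlite3
from datetime import datetime, timezone

def set_custom_value(conn: sqlite3.Connection, page_id: int,
                     field_id: int, value: str) -> None:
    """Insert or update one custom field value, relying on the
    UNIQUE(page_id, field_id) constraint."""
    now = datetime.now(timezone.utc).isoformat()  # assumed format
    conn.execute(
        """INSERT INTO fe_custom_values
             (page_id, field_id, field_value, created_date, modified_date)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(page_id, field_id) DO UPDATE SET
             field_value = excluded.field_value,
             modified_date = excluded.modified_date""",
        (page_id, field_id, value, now, now))
```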
Full-Text Search & Proximity Queries
Both `pages_fts` and `fe_notes_fts` are SQLite FTS5 external-content tables (declared with the `content=` option), kept in sync with their source tables. They support standard FTS5 query syntax, including `NEAR` proximity searches.
-- Basic full-text search on page content
SELECT p.page_id, p.document_id, p.page_number,
snippet(pages_fts, 0, '<b>', '</b>', '...', 32) AS excerpt
FROM pages_fts
JOIN pages p ON pages_fts.rowid = p.page_id
WHERE pages_fts MATCH 'search terms'
ORDER BY rank;
-- Proximity search: find "witness" within 10 words of "testimony"
SELECT p.page_id, p.page_number
FROM pages_fts
JOIN pages p ON pages_fts.rowid = p.page_id
WHERE pages_fts MATCH 'NEAR(witness testimony, 10)';
-- Search user notes
SELECT n.note_id, n.document_id, n.page_id,
snippet(fe_notes_fts, 0, '<b>', '</b>', '...', 32) AS excerpt
FROM fe_notes_fts
JOIN fe_notes n ON fe_notes_fts.rowid = n.note_id
WHERE fe_notes_fts MATCH 'search terms';
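From code, these searches can be run with the FTS5 query string bound as a parameter rather than spliced into the SQL. This sketch wraps the page-content search above:

```python
import sqlite3

def search_pages(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Full-text search page content, binding the FTS5 query string
    as a parameter. Returns (page_id, document_id, page_number, excerpt)."""
    return conn.execute(
        """SELECT p.page_id, p.document_id, p.page_number,
                  snippet(pages_fts, 0, '<b>', '</b>', '...', 32) AS excerpt
           FROM pages_fts
           JOIN pages p ON pages_fts.rowid = p.page_id
           WHERE pages_fts MATCH ?
           ORDER BY rank""",
        (query,)).fetchall()
```

The same pattern works for `fe_notes_fts`, and the query string may use any FTS5 syntax, e.g. `search_pages(conn, 'NEAR(witness testimony, 10)')`.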
The fe_vocabulary table is
used by Fact Extract Desktop for fuzzy keyword searching.