Open Specifications

Build your own tools or integrate with existing workflows

Fact Extract uses open, documented formats for both its extraction structures and its output databases. You are free to use these specifications to build your own tools, import data into other systems, or extend the platform in any way you see fit.

This page documents two things: the JSON structure format that defines what data to extract from your documents, and the SQLite database schema used by Fact Extract Desktop to store and query results.

1. JSON Extraction Structures

An extraction structure defines the fields that FactExtract.Net will extract from your documents. Each field is essentially a question that the AI answers for each chunk of text. Structures are defined in JSON and can be imported directly into FactExtract.Net.

Structure Format

The JSON file contains a metadata section and a structures array. Each structure has a name, description, and an array of field definitions.

structure-format.json
{
  "metadata": {
    "description": "Brief description of this import file",
    "instructions": "Any notes about the structure"
  },
  "structures": [
    {
      "name": "Your Structure Name",
      "description": "Describe what this structure extracts",
      "overview_prompt": "These documents are [type] from [source]. They contain [topics].",
      "structure_data": [
        {
          "name": "field_name",
          "type": "string",
          "description": "The question/instruction for what to extract",
          "required": true
        }
      ]
    }
  ]
}

Required Properties

Top Level

  • metadata Object with description and instructions (recommended)
  • structures Array of structure definitions (required)

For Each Structure

  • name Structure name (required, max 100 characters)
  • description What this structure extracts (required)
  • overview_prompt A Content Description template with bracket placeholders (optional, max 2000 characters). When the structure is selected on the upload page, this text pre-populates the Content Description field, giving users a helpful starting point.
  • structure_data Array of field definitions (required, 1-50 fields)

For Each Field

  • name Field identifier (required, snake_case, letters/numbers/underscores only, max 64 characters)
  • type Data type (required, see Field Types below)
  • description The extraction instruction/question (required, max 5000 characters)
  • required Boolean (optional, defaults to true). FactExtract.Net produces a value for every field regardless of this setting. Recommended: set to true for consistency.

Complete Example

A structure for extracting data from historical correspondence:

historical-correspondence.json
{
  "metadata": {
    "description": "FactExtract.Net Structure for Historical Correspondence",
    "instructions": "Extracts key information from letters, memos, and other correspondence."
  },
  "structures": [
    {
      "name": "Historical Correspondence",
      "description": "Extract key information from historical letters and memos",
      "overview_prompt": "These documents are [type of correspondence, e.g., letters, memos] from [time period/source]. They involve [parties/organizations] and contain information about [topics/events].",
      "structure_data": [
        {
          "name": "document_date",
          "type": "date",
          "description": "The date the document was written or sent",
          "required": true
        },
        {
          "name": "author",
          "type": "string",
          "description": "Who wrote or sent the document",
          "required": true
        },
        {
          "name": "recipient",
          "type": "string",
          "description": "Who the document was addressed to",
          "required": true
        },
        {
          "name": "subject_summary",
          "type": "string",
          "description": "A one-sentence summary of the document's main topic",
          "required": true
        },
        {
          "name": "people_mentioned",
          "type": "array",
          "description": "List all people mentioned in the document by name",
          "required": true
        },
        {
          "name": "key_facts",
          "type": "string",
          "description": "List the most important facts, events, or claims made",
          "required": true
        }
      ]
    }
  ]
}

2. Field Types

Standard Types

Type Use For Example
string Text values (names, titles, short descriptions) Person's name, document title
number Decimal numbers Amounts, prices, measurements
integer Whole numbers Counts, page numbers, years
boolean Yes/No values "Is this a complaint?"
date Dates (extracted as YYYY-MM-DD) Document date, event date
time Times (extracted as HH:MM:SS) Meeting time
datetime Combined date and time Timestamp of event
array List of text values (stored as JSON in one cell) List of names mentioned

Records Array (Multi-Record Extraction)

The records_array type enables multi-record extraction, where multiple similar items from a single page or chunk each become their own row in the output. For example, extracting each day's working hours from a timecard summary as a separate row in Excel.

Constraints: Only one records_array field is allowed per structure. Sub-fields can only be simple types (no nested arrays).
records-array-example.json
{
  "name": "daily_hours",
  "type": "records_array",
  "description": "Extract each day's working hours from the timecard summary",
  "required": true,
  "record_fields": [
    {"name": "work_date", "type": "date", "description": "Date worked"},
    {"name": "hours_regular", "type": "number", "description": "Regular hours worked"},
    {"name": "hours_overtime", "type": "number", "description": "Overtime hours worked"},
    {"name": "notes", "type": "string", "description": "Any notes or exceptions for the day"}
  ]
}

Array vs. Records Array

Feature array records_array
Output One cell with JSON/comma-separated values Multiple rows, one per item
Sub-fields No (just a list of strings) Yes (each item has defined fields)
Best for Simple lists (names, keywords) Structured repeated data (transactions, line items)
Limit Unlimited per structure Only one per structure

Implementation Notes

records_array is not a standard JSON Schema or LLM tool-use type. Any software that uses it must provide its own methodology for translating between the records array format and what the LLM or other systems require.

Some LLMs only support dates as specially formatted strings rather than distinct types. Software using these structures may need to map date, time, and datetime to string with appropriate instructions. Consult the API documentation for your LLM of choice.

3. SQLite Database Schema

FactExtract.Net outputs a SQLite database that can be opened with Fact Extract Desktop or any SQLite-compatible tool. The schema below shows all tables, including those added by Desktop for collections, notes, tags, and bookmarks.

Table Overview

Core Tables (FactExtract.Net Output)

  • documents - One row per logical document or chunk
  • pages - One row per page, with OCR content
  • ResponseSchema - Field definitions for the extraction structure, plus any system-added fields
  • LLMResults - Extracted values linked to pages and fields
  • ChunkMetadata - Document boundary detection results
  • ProcessLog - Processing event log
  • pages_fts - FTS5 full-text search on page content
  • pages_fts_config, pages_fts_data, pages_fts_docsize, pages_fts_idx - FTS5 internal tables

Desktop Tables (Added by Desktop App)

  • fe_collections - User-created collections
  • fe_collection_items - Pages/documents in collections
  • fe_notes - User notes on pages and documents
  • fe_notes_fts - FTS5 full-text search on notes
  • fe_notes_fts_config, fe_notes_fts_data, fe_notes_fts_docsize, fe_notes_fts_idx - FTS5 internal tables
  • fe_tags - Tag definitions
  • fe_document_tags - Tag-to-document assignments
  • fe_bookmarks - Page bookmarks
  • fe_edited_pages - Tracks pages with corrected OCR content
  • fe_edited_fields - Tracks summary fields with corrected values
  • fe_custom_fields - User-defined custom field definitions
  • fe_custom_values - Custom field values per page
  • fe_vocabulary - Custom vocabulary / dictionary
  • fe_agent_projects, fe_agent_phases, fe_agent_jobs, fe_agent_progress, fe_agent_sessions, fe_agent_notes - MCP agent project planning and notes
  • fe_settings - Application settings
  • fe_schema_version - Schema migration tracking
  • content_embeddings - Per-page content vector chunks
  • field_embeddings - Per-document summary field vectors
  • embedding_metadata - Model configuration and settings

Complete Schema

The full SQLite schema, including indexes and full-text search tables:

factextract-schema.sql
-- ===========================================================
-- Core Tables (created by FactExtract.Net processing)
-- ===========================================================

-- Document boundary detection results
CREATE TABLE IF NOT EXISTS "ChunkMetadata" (
    "ChunkID"       INTEGER PRIMARY KEY AUTOINCREMENT,
    "ChunkMethod"   TEXT NOT NULL,
    "ChunkSize"     INTEGER,
    "ContextToken"  TEXT,
    "OriginalPrompt" TEXT,
    "Title"         TEXT,
    "StartPage"     INTEGER,
    "EndPage"       INTEGER,
    "Confidence"    REAL,
    "ProcessedAt"   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Extracted field values linked to pages, fields, and chunks
CREATE TABLE IF NOT EXISTS "LLMResults" (
    "ResponseID"    INTEGER PRIMARY KEY AUTOINCREMENT,
    "PageID"        INTEGER,
    "LLMResponse"   TEXT NOT NULL,
    "FieldID"       INTEGER,
    "ChunkID"       INTEGER,
    "ProcessedAt"   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY("ChunkID") REFERENCES "ChunkMetadata"("ChunkID"),
    FOREIGN KEY("FieldID") REFERENCES "ResponseSchema"("FieldID"),
    FOREIGN KEY("PageID") REFERENCES "pages"("page_id")
);

-- Processing event log
CREATE TABLE IF NOT EXISTS "ProcessLog" (
    "LogID"         INTEGER PRIMARY KEY AUTOINCREMENT,
    "Timestamp"     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    "EventType"     TEXT NOT NULL,
    "Message"       TEXT NOT NULL,
    "Details"       TEXT
);

-- Field definitions for the extraction structure
CREATE TABLE IF NOT EXISTS "ResponseSchema" (
    "FieldID"           INTEGER PRIMARY KEY AUTOINCREMENT,
    "FieldName"         TEXT NOT NULL,
    "FieldDescription"  TEXT,
    "FieldDataType"     TEXT NOT NULL
);

-- One row per logical document or chunk
CREATE TABLE IF NOT EXISTS "documents" (
    "document_id"       INTEGER PRIMARY KEY AUTOINCREMENT,
    "title"             TEXT NOT NULL,
    "filename"          TEXT NOT NULL,
    "source_path"       TEXT NOT NULL,
    "pdf_path"          TEXT,
    "page_count"        INTEGER NOT NULL,
    "import_date"       TIMESTAMP NOT NULL,
    "source_database"   TEXT,
    "notes"             TEXT
);

-- One row per page, with OCR content
CREATE TABLE IF NOT EXISTS "pages" (
    "page_id"           INTEGER PRIMARY KEY AUTOINCREMENT,
    "document_id"       INTEGER NOT NULL,
    "page_number"       INTEGER NOT NULL,
    "content"           TEXT NOT NULL,
    "content_type"      TEXT DEFAULT 'text',
    "pdf_page_path"     TEXT,
    "image_page_path"   TEXT,
    "import_date"       TIMESTAMP NOT NULL,
    "embedding_status"  TEXT,
    "embedding_date"    TIMESTAMP,
    "embedding_model"   TEXT,
    FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);

-- Full-text search index on page content
CREATE VIRTUAL TABLE pages_fts USING fts5(
    content,
    content='pages',
    content_rowid='page_id'
);

-- FTS5 internal tables (auto-created by pages_fts)
CREATE TABLE 'pages_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TABLE 'pages_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'pages_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'pages_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;

-- ===========================================================
-- Desktop Tables (added by Fact Extract Desktop)
-- ===========================================================

-- User-created collections
CREATE TABLE IF NOT EXISTS "fe_collections" (
    "collection_id"     INTEGER PRIMARY KEY,
    "name"              TEXT NOT NULL,
    "description"       TEXT,
    "color"             TEXT,
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL
);

-- Pages and documents assigned to collections
CREATE TABLE IF NOT EXISTS "fe_collection_items" (
    "item_id"           INTEGER PRIMARY KEY,
    "collection_id"     INTEGER NOT NULL,
    "document_id"       INTEGER NOT NULL,
    "page_id"           INTEGER,
    "added_date"        TEXT NOT NULL,
    FOREIGN KEY("collection_id") REFERENCES "fe_collections"("collection_id") ON DELETE CASCADE
);

-- User notes on pages and documents
CREATE TABLE IF NOT EXISTS "fe_notes" (
    "note_id"           INTEGER PRIMARY KEY,
    "document_id"       INTEGER NOT NULL,
    "page_id"           INTEGER,
    "note_text"         TEXT NOT NULL,
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL
);

-- Page bookmarks
CREATE TABLE IF NOT EXISTS "fe_bookmarks" (
    "bookmark_id"       INTEGER PRIMARY KEY,
    "document_id"       INTEGER NOT NULL,
    "page_id"           INTEGER,
    "label"             TEXT,
    "created_date"      TEXT NOT NULL
);

-- Tag definitions
CREATE TABLE IF NOT EXISTS "fe_tags" (
    "tag_id"            INTEGER PRIMARY KEY,
    "name"              TEXT NOT NULL UNIQUE,
    "color"             TEXT
);

-- Tag-to-document assignments
CREATE TABLE IF NOT EXISTS "fe_document_tags" (
    "document_id"       INTEGER NOT NULL,
    "tag_id"            INTEGER NOT NULL,
    PRIMARY KEY("document_id", "tag_id"),
    FOREIGN KEY("tag_id") REFERENCES "fe_tags"("tag_id") ON DELETE CASCADE
);

-- Application settings (key-value store)
CREATE TABLE IF NOT EXISTS "fe_settings" (
    "key"               TEXT PRIMARY KEY,
    "value"             TEXT
);

-- Schema migration tracking
CREATE TABLE IF NOT EXISTS "fe_schema_version" (
    "version"           INTEGER PRIMARY KEY,
    "applied_date"      TEXT NOT NULL,
    "description"       TEXT
);

-- Tracks pages where OCR content has been corrected
CREATE TABLE IF NOT EXISTS "fe_edited_pages" (
    "page_id"           INTEGER PRIMARY KEY
);

-- Tracks summary fields where extracted values have been corrected
CREATE TABLE IF NOT EXISTS "fe_edited_fields" (
    "page_id"           INTEGER NOT NULL,
    "field_id"          INTEGER NOT NULL,
    PRIMARY KEY("page_id", "field_id")
);

-- User-defined custom field definitions
CREATE TABLE IF NOT EXISTS "fe_custom_fields" (
    "field_id"          INTEGER PRIMARY KEY,
    "field_name"        TEXT NOT NULL UNIQUE,
    "field_description" TEXT,
    "field_data_type"   TEXT DEFAULT 'string',
    "created_date"      TEXT NOT NULL
);

-- Custom field values per page
CREATE TABLE IF NOT EXISTS "fe_custom_values" (
    "value_id"          INTEGER PRIMARY KEY,
    "page_id"           INTEGER NOT NULL,
    "field_id"          INTEGER NOT NULL,
    "field_value"       TEXT NOT NULL,
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL,
    UNIQUE("page_id", "field_id"),
    FOREIGN KEY("field_id") REFERENCES "fe_custom_fields"("field_id") ON DELETE CASCADE
);

-- Custom vocabulary / dictionary
CREATE TABLE IF NOT EXISTS "fe_vocabulary" (
    "word"              TEXT PRIMARY KEY
);

-- ===========================================================
-- Agent Tables (MCP chatbot project planning and notes)
-- ===========================================================

-- Agent projects
CREATE TABLE IF NOT EXISTS "fe_agent_projects" (
    "project_id"        INTEGER PRIMARY KEY,
    "name"              TEXT NOT NULL,
    "goal"              TEXT NOT NULL,
    "status"            TEXT NOT NULL DEFAULT 'active'
        CHECK("status" IN ('active', 'paused', 'completed', 'archived')),
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL
);

-- Project phases
CREATE TABLE IF NOT EXISTS "fe_agent_phases" (
    "phase_id"          INTEGER PRIMARY KEY,
    "project_id"        INTEGER NOT NULL,
    "phase_number"      INTEGER NOT NULL,
    "name"              TEXT NOT NULL,
    "description"       TEXT,
    "status"            TEXT NOT NULL DEFAULT 'pending'
        CHECK("status" IN ('pending', 'active', 'completed', 'skipped')),
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL,
    FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE
);

-- Jobs within phases
CREATE TABLE IF NOT EXISTS "fe_agent_jobs" (
    "job_id"            INTEGER PRIMARY KEY,
    "phase_id"          INTEGER NOT NULL,
    "job_number"        INTEGER NOT NULL,
    "name"              TEXT NOT NULL,
    "description"       TEXT,
    "status"            TEXT NOT NULL DEFAULT 'pending'
        CHECK("status" IN ('pending', 'active', 'completed', 'skipped', 'blocked')),
    "result_summary"    TEXT,
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL,
    FOREIGN KEY("phase_id") REFERENCES "fe_agent_phases"("phase_id") ON DELETE CASCADE
);

-- Progress log entries
CREATE TABLE IF NOT EXISTS "fe_agent_progress" (
    "entry_id"          INTEGER PRIMARY KEY,
    "project_id"        INTEGER NOT NULL,
    "phase_id"          INTEGER,
    "job_id"            INTEGER,
    "entry_type"        TEXT NOT NULL DEFAULT 'action'
        CHECK("entry_type" IN ('action', 'discovery', 'decision', 'observation', 'error', 'milestone')),
    "content"           TEXT NOT NULL,
    "created_date"      TEXT NOT NULL,
    FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE,
    FOREIGN KEY("phase_id") REFERENCES "fe_agent_phases"("phase_id") ON DELETE SET NULL,
    FOREIGN KEY("job_id") REFERENCES "fe_agent_jobs"("job_id") ON DELETE SET NULL
);

-- Agent sessions
CREATE TABLE IF NOT EXISTS "fe_agent_sessions" (
    "session_id"        INTEGER PRIMARY KEY,
    "project_id"        INTEGER NOT NULL,
    "label"             TEXT,
    "started_date"      TEXT NOT NULL,
    "ended_date"        TEXT,
    "summary"           TEXT,
    "status"            TEXT NOT NULL DEFAULT 'active'
        CHECK("status" IN ('active', 'completed', 'abandoned')),
    FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE CASCADE
);

-- Agent notes
CREATE TABLE IF NOT EXISTS "fe_agent_notes" (
    "note_id"           INTEGER PRIMARY KEY,
    "project_id"        INTEGER,
    "title"             TEXT NOT NULL,
    "content"           TEXT NOT NULL,
    "tags"              TEXT DEFAULT '[]',
    "created_date"      TEXT NOT NULL,
    "modified_date"     TEXT NOT NULL,
    FOREIGN KEY("project_id") REFERENCES "fe_agent_projects"("project_id") ON DELETE SET NULL
);

-- Full-text search index on user notes
CREATE VIRTUAL TABLE fe_notes_fts USING fts5(
    note_text,
    content='fe_notes',
    content_rowid='note_id'
);

-- FTS5 internal tables (auto-created by fe_notes_fts)
CREATE TABLE 'fe_notes_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TABLE 'fe_notes_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'fe_notes_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'fe_notes_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;

-- ===========================================================
-- Embeddings Tables (added by Fact Extract Desktop)
-- ===========================================================

-- Per-page content embedding chunks
CREATE TABLE IF NOT EXISTS "content_embeddings" (
    "chunk_id"          INTEGER PRIMARY KEY AUTOINCREMENT,
    "page_id"           INTEGER NOT NULL,
    "document_id"       INTEGER NOT NULL,
    "chunk_index"       INTEGER NOT NULL,
    "chunk_text"        TEXT NOT NULL,
    "embedding"         BLOB NOT NULL,
    "token_count"       INTEGER,
    FOREIGN KEY("page_id") REFERENCES "pages"("page_id"),
    FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);

-- Per-document summary field embeddings (concatenated field values)
CREATE TABLE IF NOT EXISTS "field_embeddings" (
    "id"                INTEGER PRIMARY KEY AUTOINCREMENT,
    "document_id"       INTEGER NOT NULL,
    "chunk_page_start"  INTEGER NOT NULL,
    "chunk_page_end"    INTEGER NOT NULL,
    "embedding"         BLOB NOT NULL,
    "composite_text"    TEXT NOT NULL,
    FOREIGN KEY("document_id") REFERENCES "documents"("document_id")
);

-- Embedding model configuration (key-value)
CREATE TABLE IF NOT EXISTS "embedding_metadata" (
    "key"               TEXT PRIMARY KEY,
    "value"             TEXT NOT NULL
);

-- ===========================================================
-- Indexes
-- ===========================================================

CREATE INDEX IF NOT EXISTS "idx_pages_documentid"
    ON "pages" ("document_id");

CREATE INDEX IF NOT EXISTS "idx_llmresults_pageid"
    ON "LLMResults" ("PageID");

CREATE INDEX IF NOT EXISTS "idx_llmresults_fieldid"
    ON "LLMResults" ("FieldID");

CREATE INDEX IF NOT EXISTS "idx_llmresults_chunkid"
    ON "LLMResults" ("ChunkID");

CREATE INDEX IF NOT EXISTS "idx_fe_bookmarks_document"
    ON "fe_bookmarks" ("document_id");

CREATE INDEX IF NOT EXISTS "idx_fe_collection_items_doc"
    ON "fe_collection_items" ("document_id");

CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_collection_items_unique"
    ON "fe_collection_items" ("collection_id", "document_id", IFNULL("page_id", 0));

CREATE INDEX IF NOT EXISTS "idx_fe_notes_document"
    ON "fe_notes" ("document_id");

CREATE INDEX IF NOT EXISTS "idx_fe_notes_page"
    ON "fe_notes" ("page_id");

CREATE INDEX IF NOT EXISTS "idx_fe_custom_values_page"
    ON "fe_custom_values" ("page_id");

CREATE INDEX IF NOT EXISTS "idx_content_embeddings_page"
    ON "content_embeddings" ("page_id");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_projects_status"
    ON "fe_agent_projects" ("status");

CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_agent_phases_order"
    ON "fe_agent_phases" ("project_id", "phase_number");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_phases_project"
    ON "fe_agent_phases" ("project_id");

CREATE UNIQUE INDEX IF NOT EXISTS "idx_fe_agent_jobs_order"
    ON "fe_agent_jobs" ("phase_id", "job_number");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_jobs_phase"
    ON "fe_agent_jobs" ("phase_id");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_progress_project"
    ON "fe_agent_progress" ("project_id");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_progress_date"
    ON "fe_agent_progress" ("project_id", "created_date");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_sessions_project"
    ON "fe_agent_sessions" ("project_id");

CREATE INDEX IF NOT EXISTS "idx_fe_agent_notes_project"
    ON "fe_agent_notes" ("project_id");

System-Added Fields

In addition to the fields you define in your extraction structure, FactExtract.Net adds its own fields to ResponseSchema. These appear as extra columns in Excel and extra rows in LLMResults.

FactExtract_System_Note: A review note that flags potential issues for human review (ambiguous values, contradictions, OCR artifacts, etc.). The cell is blank when no issues are detected.

When a database is re-processed using Database Upload, each run appends a new versioned note field: FactExtract_System_Note_1, FactExtract_System_Note_2, and so on. Prior notes are preserved.

Software that reads ResponseSchema dynamically (as recommended in Section 4) will pick these up automatically. Hard-coded queries should account for field names they don't recognize.

Key Relationships

  • documents → pages: Each document contains multiple pages, linked by document_id
  • LLMResults → pages, ResponseSchema, ChunkMetadata: Each extracted value links to its source page, field definition, and chunk
  • fe_collections → fe_collection_items: Collections contain document/page references

4. Querying Extracted Data

Because FactExtract.Net supports any number of user-defined fields per structure, the database cannot use a fixed table with one column per field. Instead, extracted data is stored in a normalized form across two tables: ResponseSchema defines the fields, and LLMResults stores one row per extracted value. Every database will have a different set of fields depending on the structure that was used to process it. ResponseSchema may also contain system-added fields (such as FactExtract_System_Note) alongside your extraction fields.

To present extracted data in a usable format (field names as columns, one row per page or document), software must join these tables and pivot the results. This can be done in SQL, in application code, or a combination of both.

Retrieving Extracted Data for a Page

The simplest query joins LLMResults to ResponseSchema and pages to get field name / value pairs for a specific page. This is what Fact Extract Desktop uses in its PDF viewer to display extracted data alongside the source document:

per-page-query.sql
SELECT rs.FieldName, lr.LLMResponse
FROM LLMResults lr
JOIN pages p ON lr.PageID = p.page_id
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
WHERE p.document_id = ?
  AND p.page_number = ?
ORDER BY rs.FieldName;

This returns one row per field, e.g. author | John Smith, document_date | 1962-03-15, etc.

Building the Summary View (Pivot)

For spreadsheet-style output (one row per page, field names as columns), the normalized data needs to be pivoted. The first step is to query the field names from ResponseSchema, then retrieve all values joined across the tables:

step-1-get-field-names.sql
-- Step 1: Get the field names (these become your column headers)
SELECT DISTINCT FieldName
FROM ResponseSchema
ORDER BY FieldName;
step-2-get-all-values.sql
-- Step 2: Get all extracted values with context
SELECT d.filename, d.document_id, p.page_number,
       rs.FieldName, lr.LLMResponse
FROM documents d
JOIN pages p ON d.document_id = p.document_id
JOIN LLMResults lr ON p.page_id = lr.PageID
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
ORDER BY d.filename, p.page_number, rs.FieldName;

The pivot itself is typically done in application code: group the rows by (filename, document_id, page_number) and accumulate each FieldName / LLMResponse pair into a dictionary. Each group becomes one output row, with the field names from Step 1 as column headers.

Alternatively, if you prefer a pure SQL approach, you can use SQLite's GROUP_CONCAT or build a dynamic query with CASE WHEN expressions to create one column per field:

pure-sql-pivot.sql
-- Pure SQL pivot (field names must be known in advance)
-- Replace the MAX(CASE...) lines with your actual field names
SELECT
    d.filename,
    d.document_id,
    p.page_number,
    MAX(CASE WHEN rs.FieldName = 'author' THEN lr.LLMResponse END) AS author,
    MAX(CASE WHEN rs.FieldName = 'document_date' THEN lr.LLMResponse END) AS document_date,
    MAX(CASE WHEN rs.FieldName = 'subject_summary' THEN lr.LLMResponse END) AS subject_summary,
    MAX(CASE WHEN rs.FieldName = 'key_facts' THEN lr.LLMResponse END) AS key_facts
FROM documents d
JOIN pages p ON d.document_id = p.document_id
JOIN LLMResults lr ON p.page_id = lr.PageID
JOIN ResponseSchema rs ON lr.FieldID = rs.FieldID
GROUP BY d.filename, d.document_id, p.page_number
ORDER BY d.filename, p.page_number;

Because the field names are user-defined and vary per structure, the pure SQL pivot requires building the query dynamically from ResponseSchema. Most applications will find it simpler to use the two-step approach: query the normalized rows, then pivot in code.

5. Embeddings Schema

Fact Extract Desktop can generate vector embeddings for semantic search across page content and summary fields. Embeddings are stored directly in the SQLite database alongside the source data.

Tables

Table Purpose
content_embeddings One row per text chunk. Pages are split into paragraph-aware chunks (default max 2048 chars, 200-char overlap). Each chunk's vector is stored as a BLOB.
field_embeddings One row per document/chunk. The concatenated summary field values for a document are embedded as a single vector, with page range recorded.
embedding_metadata Key-value store recording the model, dimensions, pre/post-processing, and chunk strategy used to generate the embeddings.
pages columns embedding_status, embedding_date, embedding_model track per-page embedding state.

Standard Metadata Keys

embedding_metadata keys
model_name            -- e.g. "nomic-embed-text-v1.5"
model_file            -- e.g. "model_quantized.onnx"
embedding_dimensions  -- e.g. "768"
document_prefix       -- prefix prepended for indexing, e.g. "search_document: "
query_prefix          -- prefix prepended at query time, e.g. "search_query: "
postprocessing        -- e.g. "mean_pool+layer_norm+l2_norm"
chunk_level           -- "per_page"
chunk_strategy        -- e.g. "paragraph_aware|max_chars=2048|overlap_chars=200"
total_content_embeddings
total_field_embeddings
creation_date
factextract_version

6. Edit Tracking, Custom Fields & FTS

Edit Tracking

Fact Extract Desktop allows users to correct OCR errors in page content and fix inaccurate summary field values. These tables record which pages and fields have been edited, not the edit history itself.

Table Purpose
fe_edited_pages Contains the page_id of any page whose OCR content has been manually corrected. Presence in this table means the page's pages.content differs from the original AI output.
fe_edited_fields Contains (page_id, field_id) pairs for summary fields whose extracted values have been manually corrected. The corrected value is stored in-place in LLMResults.

Custom Fields

Users can define their own fields in Desktop and assign values to individual pages, independent of the extraction structure.

Table Purpose
fe_custom_fields Defines each custom field: name, description, data type.
fe_custom_values Stores the value for each (page_id, field_id) pair. Unique constraint prevents duplicate assignments.

Full-Text Search & Proximity Queries

Both pages_fts and fe_notes_fts are SQLite FTS5 content-sync tables. They support standard FTS5 query syntax including NEAR proximity searches.

fts5-examples.sql
-- Basic full-text search on page content
SELECT p.page_id, p.document_id, p.page_number,
       snippet(pages_fts, 0, '<b>', '</b>', '...', 32) AS excerpt
FROM pages_fts
JOIN pages p ON pages_fts.rowid = p.page_id
WHERE pages_fts MATCH 'search terms'
ORDER BY rank;

-- Proximity search: find "witness" within 10 words of "testimony"
SELECT p.page_id, p.page_number
FROM pages_fts
JOIN pages p ON pages_fts.rowid = p.page_id
WHERE pages_fts MATCH 'NEAR(witness testimony, 10)';

-- Search user notes
SELECT n.note_id, n.document_id, n.page_id,
       snippet(fe_notes_fts, 0, '<b>', '</b>', '...', 32) AS excerpt
FROM fe_notes_fts
JOIN fe_notes n ON fe_notes_fts.rowid = n.note_id
WHERE fe_notes_fts MATCH 'search terms';

The fe_vocabulary table is used by Fact Extract Desktop for fuzzy keyword searching.