Output Format

Overview

Stratum Chunker produces output in the Canonical v1.2 format, a versioned JSON schema designed for:

  • Consistency across different input formats
  • Rich metadata for filtering and retrieval
  • Forward compatibility via schema versioning

Full Schema

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "string",
    "source_file": "string | null",
    "format": "string | null",
    "title": "string | null",
    "total_pages": "integer | null",
    "enrichments": [
      {
        "name": "string",
        "version": "string",
        "timestamp": "string"
      }
    ]
  },
  "chunks": [
    {
      "id": "string",
      "text": "string",
      "heading_path": ["string"] | null,
      "page_start": "integer | null",
      "page_end": "integer | null",
      "content_flags": {
        "has_table": "boolean",
        "has_image": "boolean",
        "has_code": "boolean",
        "has_formula": "boolean",
        "has_list": "boolean"
      },
      "artifacts": {
        "images": ["string"],
        "tables": ["string"],
        "tables_inline": ["string"]
      },
      "enrichments": {
        "key": "value"
      }
    }
  ],
  "artifacts": {
    "images": ["string"],
    "tables": ["string"],
    "tables_inline": ["string"]
  }
}

Complete Example

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "research_paper",
    "source_file": "papers/research_paper.pdf",
    "format": "pdf",
    "title": "Machine Learning Approaches",
    "total_pages": 12,
    "enrichments": [
      {
        "name": "intra-document-reference",
        "version": "1.0.0",
        "timestamp": "2026-02-05T10:30:00"
      }
    ]
  },
  "chunks": [
    {
      "id": "research_paper_chunk_001",
      "text": "# Introduction\n\n## Background\n\nMachine learning has revolutionized...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_002",
      "text": "# Introduction\n\n## Related Work\n\nPrevious studies have shown...\n\n![Figure 1](images/fig_001.png)\n\nAs illustrated in Figure 1...",
      "heading_path": ["Introduction", "Related Work"],
      "page_start": 2,
      "page_end": 3,
      "content_flags": {
        "has_table": false,
        "has_image": true,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": ["images/research_paper_fig_001.png"],
        "tables": []
      },
      "enrichments": {
        "references": {
          "intra_document": [
            {
              "type": "figure",
              "raw_text": "Figure 1",
              "normalized": "1",
              "position": {"start": 85, "end": 93},
              "confidence": 0.95
            }
          ]
        }
      }
    },
    {
      "id": "research_paper_chunk_003",
      "text": "# Methods\n\n## Data Collection\n\nWe collected data from:\n\n- Source A (n=1000)\n- Source B (n=500)\n- Source C (n=200)",
      "heading_path": ["Methods", "Data Collection"],
      "page_start": 4,
      "page_end": 4,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": true
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_004",
      "text": "# Methods\n\n## Implementation\n\n```python\ndef train_model(data):\n    model = Model()\n    model.fit(data)\n    return model\n```",
      "heading_path": ["Methods", "Implementation"],
      "page_start": 5,
      "page_end": 5,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": true,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_005",
      "text": "# Results\n\n| Model | Accuracy | F1 Score |\n|-------|----------|----------|\n| A     | 0.95     | 0.93     |\n| B     | 0.92     | 0.90     |",
      "heading_path": ["Results"],
      "page_start": 6,
      "page_end": 6,
      "content_flags": {
        "has_table": true,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    }
  ],
  "artifacts": {
    "images": ["images/research_paper_fig_001.png"],
    "tables": []
  }
}

Field Reference

Document Info

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| doc_id | string | Yes | Unique document identifier (usually filename stem) |
| source_file | string | No | Original file path |
| format | string | No | Document format: pdf, markdown, txt, docx |
| title | string | No | Document title (if detected) |
| total_pages | int | No | Total page count (for paginated formats) |
| enrichments | list[dict] | No | List of applied enrichment metadata (name, version, timestamp) |

Chunk

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| id | string | Yes | Unique chunk ID: {doc_id}_chunk_{NNN} |
| text | string | Yes | Chunk text content |
| heading_path | list[string] | No | Heading ancestry: ["H1", "H2", "H3"] |
| page_start | int | No | Starting page (1-indexed, null for non-paginated) |
| page_end | int | No | Ending page |
| content_flags | object | Yes | Boolean content type indicators |
| artifacts | object | Yes | References to extracted artifacts |
| enrichments | dict | No | Dictionary of enrichment data added by enrichment steps |
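
The "Required" column above can be checked mechanically. The following is a minimal sketch that works on plain dicts loaded with the standard `json` module; the helper name `missing_chunk_fields` is hypothetical and not part of Stratum.

```python
from typing import Any

# Required chunk fields, taken from the "Required" column of the Chunk table.
REQUIRED_CHUNK_FIELDS = ("id", "text", "content_flags", "artifacts")


def missing_chunk_fields(chunk: dict[str, Any]) -> list[str]:
    """Return the required field names that are absent from a chunk dict."""
    return [name for name in REQUIRED_CHUNK_FIELDS if name not in chunk]


# Made-up sample chunks for illustration only.
complete = {
    "id": "paper_chunk_001",
    "text": "Some text",
    "content_flags": {"has_table": False},
    "artifacts": {"images": [], "tables": []},
}
incomplete = {"id": "paper_chunk_002", "text": "More text"}

print(missing_chunk_fields(complete))
print(missing_chunk_fields(incomplete))
```

The check only looks for key presence; it does not validate types or nested fields.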

Content Flags

| Flag | Description |
|------|-------------|
| has_table | Chunk contains a table |
| has_image | Chunk contains/references an image |
| has_code | Chunk contains code block(s) |
| has_formula | Chunk contains mathematical formula(s) |
| has_list | Chunk contains bullet/numbered list(s) |

Artifacts

| Field | Type | When present | Description |
|-------|------|--------------|-------------|
| images | list[string] | Always | Paths to extracted image files (PNG) |
| tables | list[string] | --tables-mode file or both | Paths to extracted table files (Markdown .md) |
| tables_inline | list[string] | --tables-mode inline or both | Table content as Markdown strings (omitted when empty) |

The document-level artifacts object aggregates all artifacts across chunks. Paths are relative to the working directory (same as the paths passed to --images-dir / --tables-dir).
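
As an illustration of that aggregation, here is a minimal sketch over plain dicts. The function name `aggregate_artifacts` is hypothetical, and de-duplicating paths while keeping first-seen order is an assumption; the actual Stratum behaviour may differ.

```python
from typing import Any

ARTIFACT_KEYS = ("images", "tables", "tables_inline")


def aggregate_artifacts(chunks: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Merge per-chunk artifact lists, keeping first-seen order without duplicates."""
    merged: dict[str, list[str]] = {key: [] for key in ARTIFACT_KEYS}
    for chunk in chunks:
        artifacts = chunk.get("artifacts", {})
        for key in ARTIFACT_KEYS:
            for item in artifacts.get(key, []):
                if item not in merged[key]:  # assumption: duplicates are dropped
                    merged[key].append(item)
    # Mirror the "omitted when empty" rule documented for tables_inline.
    if not merged["tables_inline"]:
        del merged["tables_inline"]
    return merged


# Made-up sample chunks for illustration only.
chunks = [
    {"artifacts": {"images": ["images/a.png"], "tables": []}},
    {"artifacts": {"images": ["images/a.png", "images/b.png"],
                   "tables": ["tables/t1.md"]}},
]
print(aggregate_artifacts(chunks))
```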

Tables mode controls how table data is stored (set via --tables-mode):

// --tables-mode file (default): file paths only
{
  "artifacts": {
    "images": [],
    "tables": ["tables/paper_table_001.md"]
  }
}

// --tables-mode inline: markdown text, no files
{
  "artifacts": {
    "images": [],
    "tables": [],
    "tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1     | 2     |"]
  }
}

// --tables-mode both: file paths + inline text
{
  "artifacts": {
    "images": [],
    "tables": ["tables/paper_table_001.md"],
    "tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1     | 2     |"]
  }
}
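
Given the three shapes above, downstream code can guess which mode produced an artifacts object. This is a rough sketch with a hypothetical helper name; note that an object with no tables at all looks the same in every mode, so the function reports "unknown" in that case.

```python
def infer_tables_mode(artifacts: dict) -> str:
    """Guess the --tables-mode value from which table fields are populated."""
    has_files = bool(artifacts.get("tables"))
    # tables_inline may be omitted entirely when empty, so use .get().
    has_inline = bool(artifacts.get("tables_inline"))
    if has_files and has_inline:
        return "both"
    if has_inline:
        return "inline"
    if has_files:
        return "file"
    return "unknown"  # no tables present, so the mode cannot be determined


print(infer_tables_mode({"images": [], "tables": ["tables/paper_table_001.md"]}))
print(infer_tables_mode({"images": [], "tables": [], "tables_inline": ["| A |"]}))
```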

Design rationale: tables_inline is intentionally separate from chunk.text. Chunk text is what embedding models index — keeping raw table markdown out of it avoids polluting vector representations. The tables_inline field is for downstream code that needs the table data explicitly (e.g. table-aware enrichers, UI rendering, re-ranking). Images are always file paths — there is no inline/base64 mode.

Enrichment Data Storage

Enrichment data is stored in two locations:

Chunk-level enrichments (chunk.enrichments): Dictionary containing enrichment data specific to each chunk.

chunk.enrichments = {
    'embedding': [0.1, 0.2, 0.3, ...],
    'references': {
        'intra_document': [
            {
                'type': 'table',
                'raw_text': 'Table 1',
                'normalized': '1',
                'position': {'start': 10, 'end': 17},
                'confidence': 0.95,
                'resolved_to': {
                    'artifact_id': 'doc_table_001',
                    'chunk_id': 'doc_chunk_005'
                }
            }
        ]
    },
    'table_summary': 'This table compares model performance...',
    'image_description': 'Architecture diagram showing neural network layers.'
}

Document-level enrichment metadata (document.enrichments): List tracking which enrichments have been applied.

document.enrichments = [
    {
        'name': 'intra-document-reference',
        'version': '1.1.0',
        'timestamp': '2026-02-05T10:30:00',
        'config': {'resolve_references': True, 'languages': ['en']}
    },
    {
        'name': 'table-summarization',
        'version': '1.0.0',
        'timestamp': '2026-02-05T10:30:01',
        'config': {'llm_available': True, 'model': 'gemini-2.0-flash'}
    },
    {
        'name': 'image-description',
        'version': '1.0.0',
        'timestamp': '2026-02-05T10:30:02',
        'config': {'vlm_available': True, 'model': 'gemini-2.0-flash'}
    }
]

The enrichments field is separate from the extra field to maintain clear separation between enrichment pipeline data and other metadata.
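
To make the two storage locations concrete, the sketch below writes per-chunk data and then records the run at document level. It operates on plain dicts; `apply_enrichment` and its parameters are hypothetical names for illustration, not the Stratum enrichment API, and the optional config entry is left out.

```python
from datetime import datetime
from typing import Any


def apply_enrichment(
    doc: dict[str, Any],
    name: str,
    version: str,
    key: str,
    per_chunk: dict[str, Any],
) -> None:
    """Write per-chunk data under chunk['enrichments'][key] and log the run once."""
    for chunk in doc["chunks"]:
        if chunk["id"] in per_chunk:
            # Chunk-level storage: create the dict on first use.
            chunk.setdefault("enrichments", {})[key] = per_chunk[chunk["id"]]
    # Document-level storage: one metadata record per applied enrichment.
    doc["document"].setdefault("enrichments", []).append(
        {
            "name": name,
            "version": version,
            "timestamp": datetime.now().isoformat(timespec="seconds"),
        }
    )


# Made-up sample document for illustration only.
doc = {
    "document": {"doc_id": "doc"},
    "chunks": [
        {"id": "doc_chunk_001", "text": "Intro"},
        {"id": "doc_chunk_002", "text": "| A | B |"},
    ],
}
apply_enrichment(
    doc,
    name="table-summarization",
    version="1.0.0",
    key="table_summary",
    per_chunk={"doc_chunk_002": "A two-column table."},
)
print(doc["chunks"][1]["enrichments"])
print(doc["document"]["enrichments"][0]["name"])
```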

Content Storage

| Content Type | Where Stored | Format |
|--------------|--------------|--------|
| Text | chunk.text | Plain text |
| Tables | chunk.text | Markdown table syntax |
| Tables (file mode) | External file | Markdown .md, path in artifacts.tables |
| Tables (inline mode) | artifacts.tables_inline | Markdown string, separate from chunk.text |
| Code | chunk.text | Markdown code blocks |
| Lists | chunk.text | Markdown list syntax |
| Formulas | chunk.text | LaTeX or text |
| Images | External file | PNG, path in artifacts.images |

Chunk ID Format

Chunk IDs follow the pattern: {doc_id}_chunk_{NNN}

  • doc_id: Document identifier (filename stem or custom)
  • NNN: Zero-padded sequence number (001, 002, etc.)

Examples:

  • paper_chunk_001
  • research_paper_chunk_042
  • my_doc_chunk_123
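
The pattern can be built and parsed with a few lines of Python. This is a sketch with hypothetical helper names; it assumes the sequence number is padded to at least three digits and simply grows wider past 999, which the format description does not state.

```python
import re

# Greedy doc_id so that a doc_id containing "_chunk_" still parses:
# the split happens at the last "_chunk_<digits>" suffix.
CHUNK_ID_RE = re.compile(r"(?P<doc_id>.+)_chunk_(?P<seq>\d+)")


def make_chunk_id(doc_id: str, seq: int) -> str:
    """Format a chunk ID with a zero-padded, three-digit sequence number."""
    return f"{doc_id}_chunk_{seq:03d}"


def parse_chunk_id(chunk_id: str) -> tuple[str, int]:
    """Split a chunk ID into (doc_id, sequence number)."""
    match = CHUNK_ID_RE.fullmatch(chunk_id)
    if match is None:
        raise ValueError(f"not a chunk ID: {chunk_id!r}")
    return match["doc_id"], int(match["seq"])


print(make_chunk_id("paper", 1))
print(parse_chunk_id("research_paper_chunk_042"))
```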

Heading Path

The heading_path array represents the full heading ancestry:

{
  "heading_path": ["Introduction", "Background", "Related Work"]
}

This means the chunk is under:

  • H1: Introduction
  • H2: Background
  • H3: Related Work

When include_heading_in_text=true (default), the heading hierarchy is also prepended to the chunk text:

# Introduction

## Background

### Related Work

The actual content starts here...
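
The relationship between heading_path and the prepended text can be sketched as follows. The function name `heading_prefix` is hypothetical, and the blank-line separator is inferred from the example above rather than taken from the Stratum source.

```python
from typing import Optional


def heading_prefix(heading_path: Optional[list[str]]) -> str:
    """Render a heading path as Markdown headings, one level per entry."""
    if not heading_path:
        return ""  # null or empty path: nothing is prepended
    lines = [
        f"{'#' * level} {title}"
        for level, title in enumerate(heading_path, start=1)
    ]
    # Assumption: headings are separated from each other and from the
    # content by a blank line, as in the example above.
    return "\n\n".join(lines) + "\n\n"


prefix = heading_prefix(["Introduction", "Background", "Related Work"])
print(prefix + "The actual content starts here...")
```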

Page Information

For paginated documents (PDF, DOCX):

  • page_start: First page containing chunk content (1-indexed)
  • page_end: Last page containing chunk content

For non-paginated documents (Markdown):

  • Both fields are null

If a chunk spans multiple pages:

{
  "page_start": 3,
  "page_end": 5
}
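
Because chunks can span pages and both fields can be null, a page lookup should test the whole range rather than page_start alone. A minimal sketch on plain dicts, with a hypothetical helper name; treating a missing page_end as equal to page_start is an assumption.

```python
def chunk_covers_page(chunk: dict, page: int) -> bool:
    """Return True if the 1-indexed page falls within the chunk's page range."""
    start = chunk.get("page_start")
    end = chunk.get("page_end")
    if start is None:
        return False  # non-paginated source: no page information
    if end is None:
        end = start  # assumption: a missing end means a single-page chunk
    return start <= page <= end


spanning = {"page_start": 3, "page_end": 5}
unpaginated = {"page_start": None, "page_end": None}

print(chunk_covers_page(spanning, 4))
print(chunk_covers_page(spanning, 6))
print(chunk_covers_page(unpaginated, 1))
```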

Working with Output

Python

from pathlib import Path

from stratum.models.output import CanonicalDocument

# Load
doc = CanonicalDocument.load(Path("output.json"))

# Access
print(f"Document: {doc.document.doc_id}")
print(f"Chunks: {len(doc.chunks)}")

for chunk in doc.chunks:
    print(f"{chunk.id}: {chunk.heading_path}")
    if chunk.content_flags.has_code:
        print("  Contains code!")

# Filter
code_chunks = [c for c in doc.chunks if c.content_flags.has_code]
image_chunks = [c for c in doc.chunks if c.artifacts.images]

# Access enrichment data
for chunk in doc.chunks:
    if 'embedding' in chunk.enrichments:
        print(f"{chunk.id}: {len(chunk.enrichments['embedding'])} dimensions")
    if 'references' in chunk.enrichments:
        refs = chunk.enrichments['references']['intra_document']
        print(f"{chunk.id}: {len(refs)} references")

# Check applied enrichments
for enrichment in doc.document.enrichments:
    print(f"Applied: {enrichment['name']} v{enrichment['version']}")

# Save modified
doc.save(Path("modified.json"))

jq Examples

# Count chunks
jq '.chunks | length' output.json

# Get all chunk IDs
jq '.chunks[].id' output.json

# Find chunks with code
jq '.chunks[] | select(.content_flags.has_code)' output.json

# Get heading paths
jq '.chunks[].heading_path' output.json

# Get chunks that start on a specific page (here, page 5)
jq '.chunks[] | select(.page_start == 5)' output.json

# Extract all image paths
jq '[.chunks[].artifacts.images[]] | unique' output.json

# Get inline table content for chunks that have it
jq '.chunks[] | select(.artifacts.tables_inline | length > 0) | {id, tables_inline: .artifacts.tables_inline}' output.json

# Check tables mode used (inline present = inline or both)
jq 'if (.artifacts.tables_inline | length) > 0 then "inline or both" else "file" end' output.json

# List applied enrichments
jq '.document.enrichments[] | "\(.name) v\(.version)"' output.json

# Find chunks with embeddings
jq '.chunks[] | select(.enrichments.embedding)' output.json

# Get all detected references
jq '.chunks[].enrichments.references.intra_document[]?' output.json

# Count chunks with enrichment data
jq '[.chunks[] | select(.enrichments | length > 0)] | length' output.json

Schema Versioning

The schema_version field enables forward compatibility:

{
  "schema_version": "v1.2",
  ...
}

Current version: v1.2

Version history:

  • v1.2: Added enrichment system (doc_summary, doc_context, classification, keyword, topic)
  • v1.1: Added artifacts_store, table/image enrichment support, reference resolution
  • v1: Initial release

Major versions (v2, v3, etc.) increment on breaking changes. Minor versions (v1.1, v1.2) increment for non-breaking additions.
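
A reader can use this rule to decide whether it should attempt to load a file. The sketch below shows one possible policy (accept any document with the same major version); the function names are hypothetical, and treating "v1" as minor version 0 is an assumption.

```python
import re


def parse_schema_version(value: str) -> tuple[int, int]:
    """Parse 'v1' or 'v1.2' into a (major, minor) tuple."""
    match = re.fullmatch(r"v(\d+)(?:\.(\d+))?", value)
    if match is None:
        raise ValueError(f"unrecognized schema_version: {value!r}")
    # Assumption: a bare major version such as "v1" means minor version 0.
    return int(match.group(1)), int(match.group(2) or 0)


def is_compatible(document_version: str, reader_version: str = "v1.2") -> bool:
    """Same major version means only non-breaking differences are expected."""
    return (
        parse_schema_version(document_version)[0]
        == parse_schema_version(reader_version)[0]
    )


print(parse_schema_version("v1.2"))
print(is_compatible("v1.1"))
print(is_compatible("v2"))
```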

JSONL Format

When using --format jsonl, each document is written as a single JSON line:

stratum docs/ -o output.jsonl --format jsonl --output-mode combined

Output:

{"schema_version":"v1.2","document":{"doc_id":"doc1",...},"chunks":[...]}
{"schema_version":"v1.2","document":{"doc_id":"doc2",...},"chunks":[...]}

Useful for streaming processing and append operations.
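
Reading such a file one document at a time needs only the standard library. This is a minimal sketch; `iter_documents` is a hypothetical helper, and the in-memory sample stands in for a real output.jsonl file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Made-up two-document sample; with a real file, pass open("output.jsonl").
sample = io.StringIO(
    '{"schema_version": "v1.2", "document": {"doc_id": "doc1"}, "chunks": []}\n'
    '{"schema_version": "v1.2", "document": {"doc_id": "doc2"}, "chunks": []}\n'
)
doc_ids = [doc["document"]["doc_id"] for doc in iter_documents(sample)]
print(doc_ids)
```

Because each line is parsed independently, memory use stays bounded by the largest single document rather than by the whole file.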