Output Format

Overview

Stratum Chunker produces output in the Canonical v1.2 format, a versioned JSON schema designed for:

- Consistency across different input formats
- Rich metadata for filtering and retrieval
- Forward compatibility via schema versioning

Full Schema

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "string",
    "source_file": "string | null",
    "format": "string | null",
    "title": "string | null",
    "total_pages": "integer | null",
    "enrichments": [
      {
        "name": "string",
        "version": "string",
        "timestamp": "string"
      }
    ]
  },
  "chunks": [
    {
      "id": "string",
      "text": "string",
      "heading_path": ["string"] | null,
      "page_start": "integer | null",
      "page_end": "integer | null",
      "content_flags": {
        "has_table": "boolean",
        "has_image": "boolean",
        "has_code": "boolean",
        "has_formula": "boolean",
        "has_list": "boolean"
      },
      "artifacts": {
        "images": ["string"],
        "tables": ["string"],
        "tables_inline": ["string"]
      },
      "enrichments": {
        "key": "value"
      }
    }
  ],
  "artifacts": {
    "images": ["string"],
    "tables": ["string"],
    "tables_inline": ["string"]
  }
}

Complete Example

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "research_paper",
    "source_file": "papers/research_paper.pdf",
    "format": "pdf",
    "title": "Machine Learning Approaches",
    "total_pages": 12,
    "enrichments": [
      {
        "name": "intra-document-reference",
        "version": "1.0.0",
        "timestamp": "2026-02-05T10:30:00"
      }
    ]
  },
  "chunks": [
    {
      "id": "research_paper_chunk_001",
      "text": "# Introduction\n\n## Background\n\nMachine learning has revolutionized...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_002",
      "text": "# Introduction\n\n## Related Work\n\nPrevious studies have shown...\n\n![Figure 1](images/fig_001.png)\n\nAs illustrated in Figure 1...",
      "heading_path": ["Introduction", "Related Work"],
      "page_start": 2,
      "page_end": 3,
      "content_flags": {
        "has_table": false,
        "has_image": true,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": ["images/research_paper_fig_001.png"],
        "tables": []
      },
      "enrichments": {
        "references": {
          "intra_document": [
            {
              "type": "figure",
              "raw_text": "Figure 1",
              "normalized": "1",
              "position": {"start": 85, "end": 93},
              "confidence": 0.95
            }
          ]
        }
      }
    },
    {
      "id": "research_paper_chunk_003",
      "text": "# Methods\n\n## Data Collection\n\nWe collected data from:\n\n- Source A (n=1000)\n- Source B (n=500)\n- Source C (n=200)",
      "heading_path": ["Methods", "Data Collection"],
      "page_start": 4,
      "page_end": 4,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": true
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_004",
      "text": "# Methods\n\n## Implementation\n\n```python\ndef train_model(data):\n    model = Model()\n    model.fit(data)\n    return model\n```",
      "heading_path": ["Methods", "Implementation"],
      "page_start": 5,
      "page_end": 5,
      "content_flags": {
        "has_table": false,
        "has_image": false,
        "has_code": true,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    },
    {
      "id": "research_paper_chunk_005",
      "text": "# Results\n\n| Model | Accuracy | F1 Score |\n|-------|----------|----------|\n| A     | 0.95     | 0.93     |\n| B     | 0.92     | 0.90     |",
      "heading_path": ["Results"],
      "page_start": 6,
      "page_end": 6,
      "content_flags": {
        "has_table": true,
        "has_image": false,
        "has_code": false,
        "has_formula": false,
        "has_list": false
      },
      "artifacts": {
        "images": [],
        "tables": []
      }
    }
  ],
  "artifacts": {
    "images": ["images/research_paper_fig_001.png"],
    "tables": []
  }
}

Field Reference

Document Info

Field	Type	Required	Description
`doc_id`	string	Yes	Unique document identifier (usually filename stem)
`source_file`	string	No	Original file path
`format`	string	No	Document format: `pdf`, `markdown`, `txt`, `docx`
`title`	string	No	Document title (if detected)
`total_pages`	int	No	Total page count (for paginated formats)
`enrichments`	list[dict]	No	List of applied enrichment metadata (name, version, timestamp)

Chunk

Field	Type	Required	Description
`id`	string	Yes	Unique chunk ID: `{doc_id}_chunk_{NNN}`
`text`	string	Yes	Chunk text content
`heading_path`	list[string]	No	Heading ancestry: `["H1", "H2", "H3"]`
`page_start`	int	No	Starting page (1-indexed, null for non-paginated)
`page_end`	int	No	Ending page
`content_flags`	object	Yes	Boolean content type indicators
`artifacts`	object	Yes	References to extracted artifacts
`enrichments`	dict	No	Dictionary of enrichment data added by enrichment steps

Content Flags

Flag	Description
`has_table`	Chunk contains a table
`has_image`	Chunk contains/references an image
`has_code`	Chunk contains code block(s)
`has_formula`	Chunk contains mathematical formula(s)
`has_list`	Chunk contains bullet/numbered list(s)
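The required fields listed above can be checked on raw JSON before the data reaches downstream code. The following is a minimal sketch that works on the plain dictionary returned by `json.load`; the function name `find_schema_problems` is hypothetical, and it assumes that all five content flags are always present as booleans, as the schema suggests.

```python
from typing import Any

# Fields marked "Required: Yes" in the Chunk table.
REQUIRED_CHUNK_FIELDS = ("id", "text", "content_flags", "artifacts")
# Assumption: every chunk carries all five flags as booleans.
CONTENT_FLAGS = ("has_table", "has_image", "has_code", "has_formula", "has_list")


def find_schema_problems(data: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []

    # doc_id is the only required document-level field.
    if not isinstance(data.get("document", {}).get("doc_id"), str):
        problems.append("document.doc_id is missing or not a string")

    for index, chunk in enumerate(data.get("chunks", [])):
        # Fall back to the list position when the chunk has no id.
        label = chunk.get("id", f"chunks[{index}]")
        for field in REQUIRED_CHUNK_FIELDS:
            if field not in chunk:
                problems.append(f"{label}: missing required field '{field}'")
        flags = chunk.get("content_flags", {})
        for flag in CONTENT_FLAGS:
            if not isinstance(flags.get(flag), bool):
                problems.append(f"{label}: content_flags.{flag} is not a boolean")

    return problems
```

A caller would typically pass the result of `json.load` on an output file and treat a non-empty list as a reason to reject the file.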

Artifacts

Field	Type	When present	Description
`images`	list[string]	Always	Paths to extracted image files (PNG)
`tables`	list[string]	`--tables-mode file` or `both`	Paths to extracted table files (Markdown `.md`)
`tables_inline`	list[string]	`--tables-mode inline` or `both`	Table content as Markdown strings (omitted when empty)

The document-level artifacts object aggregates all artifacts across chunks. Paths are relative to the working directory (same as passed to --images-dir / --tables-dir).
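The aggregation can be reproduced on raw JSON, for example to cross-check a file after chunks have been filtered or edited. This is only a sketch: the helper name `aggregate_artifacts` is hypothetical, and the de-duplication and first-seen ordering are assumptions, since the ordering Stratum itself uses is not specified here.

```python
from typing import Any

ARTIFACT_KEYS = ("images", "tables", "tables_inline")


def aggregate_artifacts(chunks: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Collect artifact entries from all chunks, preserving first-seen order."""
    merged: dict[str, list[str]] = {key: [] for key in ARTIFACT_KEYS}
    for chunk in chunks:
        artifacts = chunk.get("artifacts") or {}
        for key in ARTIFACT_KEYS:
            # tables_inline may be omitted when empty, so default to [].
            for entry in artifacts.get(key, []):
                if entry not in merged[key]:  # assumption: duplicates are dropped
                    merged[key].append(entry)
    return merged
```

When comparing the result with the document-level object, comparing each list as a set avoids depending on the assumed ordering.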

Tables mode controls how table data is stored (set via --tables-mode):

// --tables-mode file (default): file paths only
{
  "artifacts": {
    "images": [],
    "tables": ["tables/paper_table_001.md"]
  }
}

// --tables-mode inline: markdown text, no files
{
  "artifacts": {
    "images": [],
    "tables": [],
    "tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1     | 2     |"]
  }
}

// --tables-mode both: file paths + inline text
{
  "artifacts": {
    "images": [],
    "tables": ["tables/paper_table_001.md"],
    "tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1     | 2     |"]
  }
}
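Downstream code sometimes needs to know which mode produced a given artifacts object. The three shapes above can be told apart with a small helper; this is a sketch based only on those examples, and the name `infer_tables_mode` is hypothetical. An object with no tables at all is ambiguous, so the helper returns `None` in that case.

```python
from typing import Any, Optional


def infer_tables_mode(artifacts: dict[str, Any]) -> Optional[str]:
    """Guess the --tables-mode value from which table fields are populated."""
    has_files = bool(artifacts.get("tables"))
    has_inline = bool(artifacts.get("tables_inline"))
    if has_files and has_inline:
        return "both"
    if has_inline:
        return "inline"
    if has_files:
        return "file"
    return None  # no tables present, so the mode cannot be inferred
```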

Design rationale: tables_inline is intentionally separate from chunk.text. Chunk text is what embedding models index — keeping raw table markdown out of it avoids polluting vector representations. The tables_inline field is for downstream code that needs the table data explicitly (e.g. table-aware enrichers, UI rendering, re-ranking). Images are always file paths — there is no inline/base64 mode.

Enrichment Data Storage

Enrichment data is stored in two locations:

Chunk-level enrichments (chunk.enrichments): Dictionary containing enrichment data specific to each chunk.

chunk.enrichments = {
    'embedding': [0.1, 0.2, 0.3, ...],
    'references': {
        'intra_document': [
            {
                'type': 'table',
                'raw_text': 'Table 1',
                'normalized': '1',
                'position': {'start': 10, 'end': 17},
                'confidence': 0.95,
                'resolved_to': {
                    'artifact_id': 'doc_table_001',
                    'chunk_id': 'doc_chunk_005'
                }
            }
        ]
    },
    'table_summary': 'This table compares model performance...',
    'image_description': 'Architecture diagram showing neural network layers.'
}

Document-level enrichment metadata (document.enrichments): List tracking which enrichments have been applied.

document.enrichments = [
    {
        'name': 'intra-document-reference',
        'version': '1.1.0',
        'timestamp': '2026-02-05T10:30:00',
        'config': {'resolve_references': True, 'languages': ['en']}
    },
    {
        'name': 'table-summarization',
        'version': '1.0.0',
        'timestamp': '2026-02-05T10:30:01',
        'config': {'llm_available': True, 'model': 'gemini-2.0-flash'}
    },
    {
        'name': 'image-description',
        'version': '1.0.0',
        'timestamp': '2026-02-05T10:30:02',
        'config': {'vlm_available': True, 'model': 'gemini-2.0-flash'}
    }
]

The enrichments field is separate from the extra field to maintain clear separation between enrichment pipeline data and other metadata.

Content Storage

Content Type	Where Stored	Format
Text	`chunk.text`	Plain text
Tables	`chunk.text`	Markdown table syntax
Tables (file mode)	External file	Markdown `.md`, path in `artifacts.tables`
Tables (inline mode)	`artifacts.tables_inline`	Markdown string — separate from `chunk.text`
Code	`chunk.text`	Markdown code blocks
Lists	`chunk.text`	Markdown list syntax
Formulas	`chunk.text`	LaTeX or text
Images	External file	PNG, path in `artifacts.images`

Chunk ID Format

Chunk IDs follow the pattern: {doc_id}_chunk_{NNN}

doc_id: Document identifier (filename stem or custom)
NNN: Zero-padded sequence number (001, 002, etc.)

Examples:

- paper_chunk_001
- research_paper_chunk_042
- my_doc_chunk_123
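The pattern is easy to build and to take apart again. The helpers below are a sketch, not part of the Stratum API (both function names are hypothetical). The parser splits on the last `_chunk_` so that a `doc_id` that itself contains `_chunk_` still round-trips; how Stratum formats sequence numbers above 999 is not stated here, so the three-digit padding is treated as a minimum width.

```python
import re

# Greedy ".+" makes the match use the LAST "_chunk_" in the ID.
_CHUNK_ID = re.compile(r"^(?P<doc_id>.+)_chunk_(?P<seq>\d+)$")


def make_chunk_id(doc_id: str, sequence: int) -> str:
    """Build an ID such as 'paper_chunk_001' (zero-padded to at least 3 digits)."""
    return f"{doc_id}_chunk_{sequence:03d}"


def parse_chunk_id(chunk_id: str) -> tuple[str, int]:
    """Split a chunk ID into (doc_id, sequence number)."""
    match = _CHUNK_ID.match(chunk_id)
    if match is None:
        raise ValueError(f"not a chunk ID: {chunk_id!r}")
    return match["doc_id"], int(match["seq"])
```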

Heading Path

The heading_path array represents the full heading ancestry:

{
  "heading_path": ["Introduction", "Background", "Related Work"]
}

This means the chunk is under:

- H1: Introduction
- H2: Background
- H3: Related Work

When include_heading_in_text=true (the default), the heading hierarchy is also prepended to the chunk text:

# Introduction

## Background

### Related Work

The actual content starts here...
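The prefix shown above can be rebuilt from `heading_path`, for instance to strip it from `text` before display. This sketch assumes that the position in the path equals the heading level (the first entry is always an H1) and that headings are separated by blank lines, as in the example; `heading_prefix` is a hypothetical helper name.

```python
from typing import Optional


def heading_prefix(heading_path: Optional[list[str]]) -> str:
    """Render a heading path as the Markdown prefix prepended to chunk text."""
    if not heading_path:
        return ""  # null or empty path: nothing is prepended
    lines = [
        f"{'#' * level} {title}"  # level 1 -> "#", level 2 -> "##", ...
        for level, title in enumerate(heading_path, start=1)
    ]
    return "\n\n".join(lines) + "\n\n"
```

If `chunk["text"]` starts with the returned prefix, slicing it off leaves only the body content.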

Page Information

For paginated documents (PDF, DOCX):

- page_start: First page containing chunk content (1-indexed)
- page_end: Last page containing chunk content

For non-paginated documents (Markdown):

- Both fields are null

If a chunk spans multiple pages:

{
  "page_start": 3,
  "page_end": 5
}
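Because a chunk can span several pages, finding everything that touches a given page means testing the whole range rather than only `page_start`. A minimal sketch on raw JSON follows; `chunks_on_page` is a hypothetical helper, and treating a missing `page_end` as equal to `page_start` is an assumption.

```python
from typing import Any


def chunks_on_page(chunks: list[dict[str, Any]], page: int) -> list[dict[str, Any]]:
    """Return the chunks whose page range includes the given 1-indexed page."""
    selected = []
    for chunk in chunks:
        start = chunk.get("page_start")
        end = chunk.get("page_end")
        if start is None:
            continue  # non-paginated source: no page information
        if end is None:
            end = start  # assumption: a missing end means a single page
        if start <= page <= end:
            selected.append(chunk)
    return selected
```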

Working with Output

Python

from pathlib import Path

from stratum.models.output import CanonicalDocument

# Load
doc = CanonicalDocument.load(Path("output.json"))

# Access
print(f"Document: {doc.document.doc_id}")
print(f"Chunks: {len(doc.chunks)}")

for chunk in doc.chunks:
    print(f"{chunk.id}: {chunk.heading_path}")
    if chunk.content_flags.has_code:
        print("  Contains code!")

# Filter
code_chunks = [c for c in doc.chunks if c.content_flags.has_code]
image_chunks = [c for c in doc.chunks if c.artifacts.images]

# Access enrichment data
for chunk in doc.chunks:
    if 'embedding' in chunk.enrichments:
        print(f"{chunk.id}: {len(chunk.enrichments['embedding'])} dimensions")
    if 'references' in chunk.enrichments:
        refs = chunk.enrichments['references']['intra_document']
        print(f"{chunk.id}: {len(refs)} references")

# Check applied enrichments
for enrichment in doc.document.enrichments:
    print(f"Applied: {enrichment['name']} v{enrichment['version']}")

# Save modified
doc.save(Path("modified.json"))

jq Examples

# Count chunks
jq '.chunks | length' output.json

# Get all chunk IDs
jq '.chunks[].id' output.json

# Find chunks with code
jq '.chunks[] | select(.content_flags.has_code)' output.json

# Get heading paths
jq '.chunks[].heading_path' output.json

# Get chunks from specific page
jq '.chunks[] | select(.page_start == 5)' output.json

# Extract all image paths
jq '[.chunks[].artifacts.images[]] | unique' output.json

# Get inline table content for chunks that have it
jq '.chunks[] | select(.artifacts.tables_inline | length > 0) | {id, tables_inline: .artifacts.tables_inline}' output.json

# Check tables mode used (inline present = inline or both)
jq 'if (.artifacts.tables_inline | length) > 0 then "inline or both" else "file" end' output.json

# List applied enrichments
jq '.document.enrichments[] | "\(.name) v\(.version)"' output.json

# Find chunks with embeddings
jq '.chunks[] | select(.enrichments.embedding)' output.json

# Get all detected references
jq '.chunks[].enrichments.references.intra_document[]?' output.json

# Count chunks with enrichment data
jq '[.chunks[] | select(.enrichments | length > 0)] | length' output.json

Schema Versioning

The schema_version field enables forward compatibility:

{
  "schema_version": "v1.2",
  ...
}

Current version: v1.2

Version history:

- v1.2: Added enrichment system (doc_summary, doc_context, classification, keyword, topic)
- v1.1: Added artifacts_store, table/image enrichment support, reference resolution
- v1: Initial release

Major versions (v2, v3, etc.) increment on breaking changes. Minor versions (v1.1, v1.2) increment for non-breaking additions.
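A consumer can apply this rule by comparing only the major component. The sketch below assumes that version strings always look like `v<major>` or `v<major>.<minor>`, as in the history above; the function names and the `SUPPORTED_MAJOR` constant are hypothetical.

```python
import re

SUPPORTED_MAJOR = 1  # the major version this reader was written against


def parse_schema_version(version: str) -> tuple[int, int]:
    """Parse 'v1.2' into (1, 2); a bare 'v1' is treated as (1, 0)."""
    match = re.fullmatch(r"v(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized schema_version: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def is_readable(version: str) -> bool:
    """Minor bumps are non-breaking, so only the major version must match."""
    major, _minor = parse_schema_version(version)
    return major == SUPPORTED_MAJOR
```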

JSONL Format

When using --format jsonl, each document is written as a single JSON line:

stratum docs/ -o output.jsonl --format jsonl --output-mode combined

Output:

{"schema_version":"v1.2","document":{"doc_id":"doc1",...},"chunks":[...]}
{"schema_version":"v1.2","document":{"doc_id":"doc2",...},"chunks":[...]}

This format is useful for streaming processing and append operations.
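Streaming a JSONL file needs nothing beyond the standard library. The generator below is a minimal sketch that yields one document dictionary per line without loading the whole file; `iter_jsonl_documents` is a hypothetical name, and skipping blank lines is a defensive assumption rather than documented behavior.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl_documents(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed document per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Iterating over `iter_jsonl_documents(Path("output.jsonl"))` then gives access to `doc["document"]["doc_id"]` and `doc["chunks"]` for each document in turn.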