Output Format
Overview
Stratum Chunker produces output in the Canonical v1.2 format, a versioned JSON schema designed for:
- Consistency across different input formats
- Rich metadata for filtering and retrieval
- Forward compatibility via schema versioning
Full Schema
{
"schema_version": "v1.2",
"document": {
"doc_id": "string",
"source_file": "string | null",
"format": "string | null",
"title": "string | null",
"total_pages": "integer | null",
"enrichments": [
{
"name": "string",
"version": "string",
"timestamp": "string"
}
]
},
"chunks": [
{
"id": "string",
"text": "string",
"heading_path": ["string"] | null,
"page_start": "integer | null",
"page_end": "integer | null",
"content_flags": {
"has_table": "boolean",
"has_image": "boolean",
"has_code": "boolean",
"has_formula": "boolean",
"has_list": "boolean"
},
"artifacts": {
"images": ["string"],
"tables": ["string"],
"tables_inline": ["string"]
},
"enrichments": {
"key": "value"
}
}
],
"artifacts": {
"images": ["string"],
"tables": ["string"],
"tables_inline": ["string"]
}
}
Complete Example
{
"schema_version": "v1.2",
"document": {
"doc_id": "research_paper",
"source_file": "papers/research_paper.pdf",
"format": "pdf",
"title": "Machine Learning Approaches",
"total_pages": 12,
"enrichments": [
{
"name": "intra-document-reference",
"version": "1.0.0",
"timestamp": "2026-02-05T10:30:00"
}
]
},
"chunks": [
{
"id": "research_paper_chunk_001",
"text": "# Introduction\n\n## Background\n\nMachine learning has revolutionized...",
"heading_path": ["Introduction", "Background"],
"page_start": 1,
"page_end": 2,
"content_flags": {
"has_table": false,
"has_image": false,
"has_code": false,
"has_formula": false,
"has_list": false
},
"artifacts": {
"images": [],
"tables": []
}
},
{
"id": "research_paper_chunk_002",
"text": "# Introduction\n\n## Related Work\n\nPrevious studies have shown...\n\n\n\nAs illustrated in Figure 1...",
"heading_path": ["Introduction", "Related Work"],
"page_start": 2,
"page_end": 3,
"content_flags": {
"has_table": false,
"has_image": true,
"has_code": false,
"has_formula": false,
"has_list": false
},
"artifacts": {
"images": ["images/research_paper_fig_001.png"],
"tables": []
},
"enrichments": {
"references": {
"intra_document": [
{
"type": "figure",
"raw_text": "Figure 1",
"normalized": "1",
"position": {"start": 85, "end": 93},
"confidence": 0.95
}
]
}
}
},
{
"id": "research_paper_chunk_003",
"text": "# Methods\n\n## Data Collection\n\nWe collected data from:\n\n- Source A (n=1000)\n- Source B (n=500)\n- Source C (n=200)",
"heading_path": ["Methods", "Data Collection"],
"page_start": 4,
"page_end": 4,
"content_flags": {
"has_table": false,
"has_image": false,
"has_code": false,
"has_formula": false,
"has_list": true
},
"artifacts": {
"images": [],
"tables": []
}
},
{
"id": "research_paper_chunk_004",
"text": "# Methods\n\n## Implementation\n\n```python\ndef train_model(data):\n model = Model()\n model.fit(data)\n return model\n```",
"heading_path": ["Methods", "Implementation"],
"page_start": 5,
"page_end": 5,
"content_flags": {
"has_table": false,
"has_image": false,
"has_code": true,
"has_formula": false,
"has_list": false
},
"artifacts": {
"images": [],
"tables": []
}
},
{
"id": "research_paper_chunk_005",
"text": "# Results\n\n| Model | Accuracy | F1 Score |\n|-------|----------|----------|\n| A | 0.95 | 0.93 |\n| B | 0.92 | 0.90 |",
"heading_path": ["Results"],
"page_start": 6,
"page_end": 6,
"content_flags": {
"has_table": true,
"has_image": false,
"has_code": false,
"has_formula": false,
"has_list": false
},
"artifacts": {
"images": [],
"tables": []
}
}
],
"artifacts": {
"images": ["images/research_paper_fig_001.png"],
"tables": []
}
}
Field Reference
Document Info
| Field | Type | Required | Description |
|---|---|---|---|
| doc_id | string | Yes | Unique document identifier (usually the filename stem) |
| source_file | string | No | Original file path |
| format | string | No | Document format: pdf, markdown, txt, docx |
| title | string | No | Document title (if detected) |
| total_pages | int | No | Total page count (for paginated formats) |
| enrichments | list[dict] | No | List of applied enrichment metadata (name, version, timestamp) |
Chunk
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique chunk ID: {doc_id}_chunk_{NNN} |
| text | string | Yes | Chunk text content |
| heading_path | list[string] | No | Heading ancestry: ["H1", "H2", "H3"] |
| page_start | int | No | Starting page (1-indexed, null for non-paginated) |
| page_end | int | No | Ending page |
| content_flags | object | Yes | Boolean content type indicators |
| artifacts | object | Yes | References to extracted artifacts |
| enrichments | dict | No | Dictionary of enrichment data added by enrichment steps |
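The Required column in the two tables above can be checked with plain Python before a file is handed to downstream code. The following is only a sketch, not Stratum's own validation: `find_missing_fields` is a hypothetical helper, and it looks only at key presence, not at value types.

```python
import json

# Required keys, copied from the "Required" column of the field tables.
REQUIRED_DOCUMENT_KEYS = ("doc_id",)
REQUIRED_CHUNK_KEYS = ("id", "text", "content_flags", "artifacts")


def find_missing_fields(payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = []
    document = payload.get("document", {})
    for key in REQUIRED_DOCUMENT_KEYS:
        if key not in document:
            problems.append(f"document.{key} is missing")
    for index, chunk in enumerate(payload.get("chunks", [])):
        for key in REQUIRED_CHUNK_KEYS:
            if key not in chunk:
                problems.append(f"chunks[{index}].{key} is missing")
    return problems


# Hypothetical payload: the second chunk lacks content_flags and artifacts.
sample = json.loads("""
{
  "schema_version": "v1.2",
  "document": {"doc_id": "paper"},
  "chunks": [
    {"id": "paper_chunk_001", "text": "Hello",
     "content_flags": {"has_table": false, "has_image": false, "has_code": false,
                       "has_formula": false, "has_list": false},
     "artifacts": {"images": [], "tables": []}},
    {"id": "paper_chunk_002", "text": "No flags here"}
  ]
}
""")

print(find_missing_fields(sample))
# ['chunks[1].content_flags is missing', 'chunks[1].artifacts is missing']
```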
Content Flags
| Flag | Description |
|---|---|
| has_table | Chunk contains a table |
| has_image | Chunk contains/references an image |
| has_code | Chunk contains code block(s) |
| has_formula | Chunk contains mathematical formula(s) |
| has_list | Chunk contains bullet/numbered list(s) |
Artifacts
| Field | Type | When present | Description |
|---|---|---|---|
| images | list[string] | Always | Paths to extracted image files (PNG) |
| tables | list[string] | --tables-mode file or both | Paths to extracted table files (Markdown .md) |
| tables_inline | list[string] | --tables-mode inline or both | Table content as Markdown strings (omitted when empty) |
The document-level artifacts object aggregates the artifacts of all chunks.
Paths are relative to the working directory (same as passed to --images-dir / --tables-dir).
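To make the aggregation concrete, here is a minimal sketch of how a document-level artifacts object could be rebuilt from the chunks. Whether Stratum deduplicates entries or preserves order is not stated here, so both behaviours are assumptions; `aggregate_artifacts` and the file names are hypothetical.

```python
def aggregate_artifacts(chunks: list[dict]) -> dict:
    """Merge per-chunk artifact lists into one object.

    Assumption: entries keep first-seen order and repeats are dropped.
    """
    merged: dict[str, list[str]] = {"images": [], "tables": []}
    for chunk in chunks:
        for key, values in chunk.get("artifacts", {}).items():
            # tables_inline only appears in the result if some chunk carries it.
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged


# Hypothetical chunks: the same figure is referenced twice.
chunks = [
    {"artifacts": {"images": ["images/fig_001.png"], "tables": []}},
    {"artifacts": {"images": ["images/fig_001.png", "images/fig_002.png"],
                   "tables": ["tables/table_001.md"]}},
]

print(aggregate_artifacts(chunks))
# {'images': ['images/fig_001.png', 'images/fig_002.png'], 'tables': ['tables/table_001.md']}
```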
Tables mode controls how table data is stored (set via --tables-mode):
// --tables-mode file (default): file paths only
{
"artifacts": {
"images": [],
"tables": ["tables/paper_table_001.md"]
}
}
// --tables-mode inline: markdown text, no files
{
"artifacts": {
"images": [],
"tables": [],
"tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1 | 2 |"]
}
}
// --tables-mode both: file paths + inline text
{
"artifacts": {
"images": [],
"tables": ["tables/paper_table_001.md"],
"tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1 | 2 |"]
}
}
Design rationale: tables_inline is intentionally separate from chunk.text. Chunk text is what embedding models index — keeping raw table markdown out of it avoids polluting vector representations. The tables_inline field is for downstream code that needs the table data explicitly (e.g. table-aware enrichers, UI rendering, re-ranking). Images are always file paths — there is no inline/base64 mode.
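Downstream code that needs table content regardless of the mode used can prefer tables_inline and fall back to the files listed in tables. The sketch below assumes the paths resolve against a base directory you supply; `table_markdown` is a hypothetical helper, not part of Stratum.

```python
from pathlib import Path


def table_markdown(artifacts: dict, base_dir: Path = Path(".")) -> list[str]:
    """Return the Markdown of every table in one artifacts object."""
    # tables_inline is omitted when empty, so treat a missing key as no content.
    inline = artifacts.get("tables_inline") or []
    if inline:
        return list(inline)
    # Assumption: table paths are relative to base_dir.
    return [
        (base_dir / rel_path).read_text(encoding="utf-8")
        for rel_path in artifacts.get("tables", [])
    ]


inline_mode = {
    "images": [],
    "tables": [],
    "tables_inline": ["| Col A | Col B |\n|-------|-------|\n| 1 | 2 |"],
}
print(table_markdown(inline_mode)[0].splitlines()[0])  # | Col A | Col B |
```

Preferring the inline copy avoids file I/O when both representations are present.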
Enrichment Data Storage
Enrichment data is stored in two locations:
Chunk-level enrichments (chunk.enrichments): Dictionary containing enrichment data specific to each chunk.
chunk.enrichments = {
'embedding': [0.1, 0.2, 0.3, ...],
'references': {
'intra_document': [
{
'type': 'table',
'raw_text': 'Table 1',
'normalized': '1',
'position': {'start': 10, 'end': 17},
'confidence': 0.95,
'resolved_to': {
'artifact_id': 'doc_table_001',
'chunk_id': 'doc_chunk_005'
}
}
]
},
'table_summary': 'This table compares model performance...',
'image_description': 'Architecture diagram showing neural network layers.'
}
Document-level enrichment metadata (document.enrichments): List tracking which enrichments have been applied.
document.enrichments = [
{
'name': 'intra-document-reference',
'version': '1.1.0',
'timestamp': '2026-02-05T10:30:00',
'config': {'resolve_references': True, 'languages': ['en']}
},
{
'name': 'table-summarization',
'version': '1.0.0',
'timestamp': '2026-02-05T10:30:01',
'config': {'llm_available': True, 'model': 'gemini-2.0-flash'}
},
{
'name': 'image-description',
'version': '1.0.0',
'timestamp': '2026-02-05T10:30:02',
'config': {'vlm_available': True, 'model': 'gemini-2.0-flash'}
}
]
The enrichments field is separate from the extra field to maintain clear separation between enrichment pipeline data and other metadata.
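A custom enrichment step working on the raw JSON would touch both locations: it merges its data into each chunk's enrichments dictionary and appends one entry to document.enrichments. The following is a sketch under that reading of the two structures above; `record_enrichment` is a hypothetical function, not a Stratum API.

```python
from datetime import datetime


def record_enrichment(payload: dict, name: str, version: str,
                      per_chunk: dict[str, dict]) -> None:
    """Attach enrichment data to chunks and log the step at document level.

    per_chunk maps a chunk ID to the keys to merge into that chunk's enrichments.
    """
    for chunk in payload["chunks"]:
        data = per_chunk.get(chunk["id"])
        if data:
            # enrichments is optional, so it may be absent or null.
            enrichments = chunk.get("enrichments") or {}
            enrichments.update(data)
            chunk["enrichments"] = enrichments

    applied = payload["document"].get("enrichments") or []
    applied.append({
        "name": name,
        "version": version,
        # Same shape as the timestamps shown above, e.g. 2026-02-05T10:30:00.
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    })
    payload["document"]["enrichments"] = applied


payload = {
    "document": {"doc_id": "doc"},
    "chunks": [{"id": "doc_chunk_001", "text": "..."}],
}
record_enrichment(payload, "table-summarization", "1.0.0",
                  {"doc_chunk_001": {"table_summary": "Compares model accuracy."}})

print(payload["chunks"][0]["enrichments"])            # {'table_summary': 'Compares model accuracy.'}
print(payload["document"]["enrichments"][0]["name"])  # table-summarization
```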
Content Storage
| Content Type | Where Stored | Format |
|---|---|---|
| Text | chunk.text | Plain text |
| Tables | chunk.text | Markdown table syntax |
| Tables (file mode) | External file | Markdown .md, path in artifacts.tables |
| Tables (inline mode) | artifacts.tables_inline | Markdown string, separate from chunk.text |
| Code | chunk.text | Markdown code blocks |
| Lists | chunk.text | Markdown list syntax |
| Formulas | chunk.text | LaTeX or text |
| Images | External file | PNG, path in artifacts.images |
Chunk ID Format
Chunk IDs follow the pattern: {doc_id}_chunk_{NNN}
- doc_id: Document identifier (filename stem or custom)
- NNN: Zero-padded sequence number (001, 002, etc.)
Examples:
- paper_chunk_001
- research_paper_chunk_042
- my_doc_chunk_123
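The pattern is simple enough to build and parse by hand. The helpers below are a hypothetical sketch; they assume three-digit zero padding, as in the examples, and that sequence numbers above 999 simply gain digits.

```python
import re

# Greedy doc_id, so underscores inside the document ID are preserved.
CHUNK_ID_PATTERN = re.compile(r"^(?P<doc_id>.+)_chunk_(?P<seq>\d{3,})$")


def make_chunk_id(doc_id: str, seq: int) -> str:
    """Format a chunk ID with a zero-padded, at-least-three-digit sequence number."""
    return f"{doc_id}_chunk_{seq:03d}"


def parse_chunk_id(chunk_id: str) -> tuple[str, int]:
    """Split a chunk ID into (doc_id, sequence number)."""
    match = CHUNK_ID_PATTERN.match(chunk_id)
    if match is None:
        raise ValueError(f"not a chunk ID: {chunk_id!r}")
    return match["doc_id"], int(match["seq"])


print(make_chunk_id("paper", 1))                   # paper_chunk_001
print(parse_chunk_id("research_paper_chunk_042"))  # ('research_paper', 42)
```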
Heading Path
The heading_path array represents the full heading ancestry:
{
"heading_path": ["Introduction", "Background", "Related Work"]
}
This means the chunk is under:
- H1: Introduction
- H2: Background
- H3: Related Work
When include_heading_in_text=true (default), the heading hierarchy is also prepended to the chunk text:
# Introduction
## Background
### Related Work
The actual content starts here...
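The relationship between heading_path and the prepended headings can be reproduced in a few lines. This is a sketch of the observable format only; `prepend_headings` is a hypothetical helper, and the blank-line separator is an assumption taken from the chunk text shown in the complete example.

```python
from typing import Optional


def prepend_headings(heading_path: Optional[list[str]], body: str) -> str:
    """Render heading_path as Markdown headings and put them in front of body."""
    # Level 1 gets one '#', level 2 gets two, and so on.
    headings = [
        f"{'#' * level} {title}"
        for level, title in enumerate(heading_path or [], start=1)
    ]
    # Assumption: blocks are separated by one blank line.
    return "\n\n".join(headings + [body])


print(repr(prepend_headings(["Methods", "Data Collection"], "We collected data from:")))
# '# Methods\n\n## Data Collection\n\nWe collected data from:'
```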
Page Information
For paginated documents (PDF, DOCX):
- page_start: First page containing chunk content (1-indexed)
- page_end: Last page containing chunk content
For non-paginated documents (Markdown):
- Both fields are null
If a chunk spans multiple pages:
{
"page_start": 3,
"page_end": 5
}
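Because a chunk can span several pages, a page lookup needs a range test rather than an equality check on page_start. A minimal sketch follows, with `covers_page` as a hypothetical helper; treating a missing page_end as a single-page chunk is an assumption.

```python
from typing import Optional


def covers_page(chunk: dict, page: int) -> bool:
    """Return True if the 1-indexed page falls inside the chunk's page range."""
    start: Optional[int] = chunk.get("page_start")
    end: Optional[int] = chunk.get("page_end")
    if start is None:
        # Non-paginated documents carry null in both fields.
        return False
    if end is None:
        # Assumption: no page_end means the chunk sits on a single page.
        end = start
    return start <= page <= end


spanning = {"page_start": 3, "page_end": 5}
print(covers_page(spanning, 4))  # True
print(covers_page(spanning, 6))  # False
print(covers_page({"page_start": None, "page_end": None}, 1))  # False
```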
Working with Output
Python
from pathlib import Path
from stratum.models.output import CanonicalDocument
# Load
doc = CanonicalDocument.load(Path("output.json"))
# Access
print(f"Document: {doc.document.doc_id}")
print(f"Chunks: {len(doc.chunks)}")
for chunk in doc.chunks:
    print(f"{chunk.id}: {chunk.heading_path}")
    if chunk.content_flags.has_code:
        print(" Contains code!")
# Filter
code_chunks = [c for c in doc.chunks if c.content_flags.has_code]
image_chunks = [c for c in doc.chunks if c.artifacts.images]
# Access enrichment data (the field is optional, so fall back to an empty dict)
for chunk in doc.chunks:
    enrichments = chunk.enrichments or {}
    if 'embedding' in enrichments:
        print(f"{chunk.id}: {len(enrichments['embedding'])} dimensions")
    if 'references' in enrichments:
        refs = enrichments['references']['intra_document']
        print(f"{chunk.id}: {len(refs)} references")
# Check applied enrichments (also optional at document level)
for enrichment in doc.document.enrichments or []:
    print(f"Applied: {enrichment['name']} v{enrichment['version']}")
# Save modified
doc.save(Path("modified.json"))
jq Examples
# Count chunks
jq '.chunks | length' output.json
# Get all chunk IDs
jq '.chunks[].id' output.json
# Find chunks with code
jq '.chunks[] | select(.content_flags.has_code)' output.json
# Get heading paths
jq '.chunks[].heading_path' output.json
# Get chunks that start on a specific page
jq '.chunks[] | select(.page_start == 5)' output.json
# Extract all image paths
jq '[.chunks[].artifacts.images[]] | unique' output.json
# Get inline table content for chunks that have it
jq '.chunks[] | select(.artifacts.tables_inline | length > 0) | {id, tables_inline: .artifacts.tables_inline}' output.json
# Check tables mode used (inline present = inline or both)
jq 'if (.artifacts.tables_inline | length) > 0 then "inline or both" else "file" end' output.json
# List applied enrichments
jq '.document.enrichments[] | "\(.name) v\(.version)"' output.json
# Find chunks with embeddings
jq '.chunks[] | select(.enrichments.embedding)' output.json
# Get all detected references
jq '.chunks[].enrichments.references.intra_document[]?' output.json
# Count chunks with enrichment data
jq '[.chunks[] | select(.enrichments | length > 0)] | length' output.json
Schema Versioning
The schema_version field enables forward compatibility:
{
"schema_version": "v1.2",
...
}
Current version: v1.2
Version history:
- v1.2: Added enrichment system (doc_summary, doc_context, classification, keyword, topic)
- v1.1: Added artifacts_store, table/image enrichment support, reference resolution
- v1: Initial release
Major versions (v2, v3, etc.) increment on breaking changes. Minor versions (v1.1, v1.2) increment for non-breaking additions.
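Under this rule, a reader only has to compare major versions, provided it ignores fields it does not recognise. The sketch below encodes that reading; `parse_schema_version` and `can_read` are hypothetical helpers, and the rule that a bare v1 means minor version 0 is an assumption.

```python
import re


def parse_schema_version(value: str) -> tuple[int, int]:
    """Turn 'v1.2' into (1, 2); a bare 'v1' is read as (1, 0)."""
    match = re.fullmatch(r"v(\d+)(?:\.(\d+))?", value)
    if match is None:
        raise ValueError(f"unrecognised schema_version: {value!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def can_read(file_version: str, reader_version: str = "v1.2") -> bool:
    """Minor bumps are additive, so only a major mismatch is treated as breaking."""
    file_major, _ = parse_schema_version(file_version)
    reader_major, _ = parse_schema_version(reader_version)
    return file_major == reader_major


print(parse_schema_version("v1.2"))  # (1, 2)
print(can_read("v1"))                # True
print(can_read("v2"))                # False
```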
JSONL Format
When using --format jsonl, each document is written as a single JSON line:
stratum docs/ -o output.jsonl --format jsonl --output-mode combined
Output:
{"schema_version":"v1.2","document":{"doc_id":"doc1",...},"chunks":[...]}
{"schema_version":"v1.2","document":{"doc_id":"doc2",...},"chunks":[...]}
Useful for streaming processing and append operations.
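Streaming in practice means parsing one line at a time rather than loading the whole file. The sketch below uses only the standard library; `iter_documents` is a hypothetical helper, and the in-memory sample stands in for a file opened with open("output.jsonl", encoding="utf-8").

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-document stream with empty chunk lists.
sample = io.StringIO(
    '{"schema_version": "v1.2", "document": {"doc_id": "doc1"}, "chunks": []}\n'
    '{"schema_version": "v1.2", "document": {"doc_id": "doc2"}, "chunks": []}\n'
)

for doc in iter_documents(sample):
    print(doc["document"]["doc_id"], len(doc["chunks"]))
# doc1 0
# doc2 0
```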