Usage Guide

Table of Contents

  1. Setup and Installation
  2. CLI Reference
  3. Configuration Reference
  4. Enrichers
  5. Reference Detection Enricher
  6. PII Detection
  7. Supported Formats
  8. Output Format Overview
  9. Examples

Setup and Installation

Install

pip install stratum

# With OCR support (image-based documents); quoted so shells like zsh do not expand the brackets
pip install "stratum[ocr]"

Required Environment Variables

Set two environment variables before using Stratum; it will not run without them.

export STRATUM_LICENSE_KEY=<your-key>
export STRATUM_USER_COMPANY_EMAIL=<your-company-email>

Contact your account representative to obtain these credentials.
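
A wrapper script can check for both variables before it calls the CLI. The following is a minimal sketch; the helper name missing_stratum_env is hypothetical and not part of Stratum.

```python
import os

# The two variables named in the section above.
REQUIRED_VARS = ("STRATUM_LICENSE_KEY", "STRATUM_USER_COMPANY_EMAIL")


def missing_stratum_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_stratum_env()
if missing:
    print("Set these variables before running stratum: " + ", ".join(missing))
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests.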

Optional Environment Variables for Enrichment

LLM-based enrichers read API keys from the environment:

Variable            Provider
OPENAI_API_KEY      OpenAI
ANTHROPIC_API_KEY   Anthropic
GOOGLE_API_KEY      Google Gemini
LOCAL_API_KEY       Local / custom OpenAI-compatible endpoint (optional)
LLM_API_KEY         Generic key used by table/image enrichers
LLM_BASE_URL        Base URL for table/image enrichers
LLM_MODEL           Model name for table/image enrichers
VLM_API_KEY         Vision model key for image description
VLM_BASE_URL        Vision model base URL
VLM_MODEL           Vision model name

CLI Reference

Single File

# Auto-detect format, print to stdout
stratum document.md

# Write to file
stratum document.md -o output.json

# Process a PDF
stratum paper.pdf -o chunks.json

# Verbose output
stratum document.md -o output.json -v

Directory Processing

# One output file per document (default)
stratum docs/ -o output/

# All results in one file
stratum docs/ -o output/all.json --output-mode combined

# Both separate files and combined
stratum docs/ -o output/ --output-mode both

# Parallel workers
stratum docs/ -o output/ --workers 4

# Progress bar (requires tqdm)
stratum docs/ -o output/ --progress --workers 4

Output Format

# JSON (default)
stratum document.md -o output.json --format json

# JSONL (one document object per line)
stratum document.md -o output.jsonl --format jsonl
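
Because JSONL output holds one document object per line, it can be read incrementally. A minimal sketch of a reader follows; parse_jsonl is a hypothetical helper name, and skipping blank lines is a defensive assumption rather than documented behaviour.

```python
import json


def parse_jsonl(lines):
    """Yield one parsed document object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # assumption: tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Usage with a file written by --format jsonl:
# with open("output.jsonl", encoding="utf-8") as handle:
#     documents = list(parse_jsonl(handle))
```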

Parser Selection

# Auto-detect (default)
stratum document.pdf -o output.json

# Force specific parser
stratum document.pdf --parser docling -o output.json
stratum document.md  --parser md-parser -o output.json

Image and Table Extraction

# Extract images (PNG, 300 DPI for PDFs)
stratum paper.pdf --images-dir output/images -o output.json

# Extract tables as Markdown files (default mode: file)
stratum paper.pdf --tables-dir output/tables -o output.json

# Both together with custom DPI
stratum paper.pdf \
  --images-dir output/images \
  --tables-dir output/tables \
  --image-dpi 150 \
  -o output.json

Images are named {doc_stem}_img_{NNN}.png and tables {doc_stem}_table_{NNN}.md. Their paths appear in chunk artifacts:

{
  "artifacts": {
    "images": ["output/images/paper_img_001.png"],
    "tables": ["output/tables/paper_table_001.md"]
  }
}
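
Downstream code often needs every extracted file for a document. The sketch below gathers the paths from per-chunk artifacts; collect_artifact_paths is a hypothetical helper, and it assumes each chunk is a dict shaped like the JSON above.

```python
def collect_artifact_paths(chunks, kind):
    """Gather artifact paths of one kind ("images" or "tables") in chunk order."""
    paths = []
    for chunk in chunks:
        for path in chunk.get("artifacts", {}).get(kind, []):
            if path not in paths:  # keep only the first occurrence of each path
                paths.append(path)
    return paths


chunks = [
    {"artifacts": {"images": ["output/images/paper_img_001.png"],
                   "tables": ["output/tables/paper_table_001.md"]}},
    {"artifacts": {"images": ["output/images/paper_img_001.png"]}},
]
image_paths = collect_artifact_paths(chunks, "images")
```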

Tables Mode (--tables-mode)

Controls how extracted tables are stored. Three modes:

Mode            artifacts.tables  artifacts.tables_inline  Description
file (default)  File paths        (absent)                 Tables saved to --tables-dir as .md files
inline          (absent)          Markdown text strings    Table content embedded directly in chunk artifacts, no files written
both            File paths        Markdown text strings    Both file paths and inline text populated simultaneously

# Default: save table files
stratum paper.pdf --tables-dir output/tables -o output.json

# Inline only — no files, table markdown in artifacts.tables_inline
stratum paper.pdf --tables-mode inline -o output.json

# Both: files + inline text
stratum paper.pdf --tables-dir output/tables --tables-mode both -o output.json

Design note: Table content in artifacts.tables_inline is intentionally separate from chunk.text. The chunk text is what embedding models index — keeping raw table markdown out avoids polluting vector representations. The inline field is for downstream code that needs table data explicitly (e.g. rendering, re-ranking, table-aware enrichers).
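
Code that consumes tables can be written to work in all three modes. The sketch below prefers the inline text and falls back to reading the files; table_markdown is a hypothetical helper, and it assumes the stored paths are readable from the current working directory.

```python
from pathlib import Path


def table_markdown(chunk):
    """Return the Markdown text of every table attached to a chunk."""
    artifacts = chunk.get("artifacts", {})
    inline = artifacts.get("tables_inline")
    if inline:  # populated in inline and both modes
        return list(inline)
    # file mode: only paths are present, so read each .md file
    return [Path(path).read_text(encoding="utf-8")
            for path in artifacts.get("tables", [])]
```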

Images: Images are always stored as file paths (artifacts.images). There is no inline/base64 mode — embedding binary image data in JSON is impractical for RAG pipelines. Use the image-description enricher to generate text descriptions instead.

Configuration Files

# Chunker-only YAML config
stratum document.pdf -c config.yaml -o output.json

# Override config with CLI flags
stratum document.md --target-size 300 --max-size 500 -o output.json

# Full modular pipeline config (parser + chunker + enrichment steps)
stratum document.pdf --pipeline-config pipeline.yaml -o output.json

Full Flag Reference

stratum [OPTIONS] INPUT

Arguments:
  INPUT                         Input file or directory to process

Options:
  -o, --output PATH             Output file (JSON) or directory
  -c, --config PATH             Path to YAML configuration file (chunker settings)
  --target-size INT             Target chunk size (words)
  --max-size INT                Maximum chunk size (words)
  --min-size INT                Minimum chunk size (words)
  --format {json,jsonl}         Output format (default: json)
  --output-mode {separate,combined,both}
                                For directories (default: separate)
  --parser {docling,md-parser}  Parser to use (default: auto-detect)
  --images-dir PATH             Directory to save extracted images (PNG)
  --tables-dir PATH             Directory to save extracted tables (Markdown .md)
  --tables-mode {file,inline,both}
                                Table storage mode (default: file).
                                file: paths in artifacts.tables (requires --tables-dir)
                                inline: markdown text in artifacts.tables_inline
                                both: file paths + inline text
  --image-dpi INT               DPI for image rendering (default: 300).
                                Only applies to PDFs when --images-dir is set.
  --pipeline-config PATH        Modular pipeline YAML (parser + chunker + enrichment)
  -w, --workers INT             Parallel workers for directory processing (default: 1)
  --progress                    Show progress bar for directories (requires tqdm)
  -v, --verbose                 Print verbose output
  --version                     Show version
  --help                        Show help
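
When calling the CLI from a script, building the argument list in one place keeps flag values consistent. This is a sketch using only flags from the reference above; the function names are hypothetical, and check=True is only useful if the CLI exits with a non-zero status on failure, which this guide does not state.

```python
import subprocess


def build_stratum_command(input_path, output_path, workers=1,
                          output_mode="separate", fmt="json"):
    """Assemble an argument list from flags in the reference above."""
    if output_mode not in {"separate", "combined", "both"}:
        raise ValueError(f"unknown output mode: {output_mode}")
    if fmt not in {"json", "jsonl"}:
        raise ValueError(f"unknown format: {fmt}")
    return [
        "stratum", str(input_path),
        "-o", str(output_path),
        "--format", fmt,
        "--output-mode", output_mode,
        "--workers", str(workers),
    ]


def run_stratum(*args, **kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_stratum_command(*args, **kwargs), check=True)
```

For example, build_stratum_command("docs/", "output/", workers=4) produces the equivalent of stratum docs/ -o output/ --format json --output-mode separate --workers 4.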

Configuration Reference

Chunker-Only YAML

Use with stratum document.pdf -c config.yaml:

target_size: 500       # soft target size (words)
max_size: 700          # hard maximum
min_size: 50           # smaller chunks fuse with neighbours
heading_split_levels: [1, 2]
preserve_code: true
preserve_tables: true
overlap_size: 0

Option                   Default  Description
target_size              500      Soft target: splits at good boundaries near this value
max_size                 700      Hard limit: always split if exceeded
min_size                 50       Fusion threshold: tiny chunks merge with neighbours
size_unit                words    words or chars
heading_split_levels     [1, 2]   Heading levels that force a split
always_split_on_heading  true     Split on headings even when below target_size
preserve_code            true     Keep code blocks intact
preserve_tables          true     Keep tables intact
include_heading_in_text  true     Prepend heading hierarchy to chunk text
overlap_size             0        Words to overlap with previous chunk

Constraint: min_size ≤ target_size ≤ max_size.
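
The constraint can be checked before a run. The following sketch validates a chunker config that has already been loaded into a dict; check_chunk_sizes is a hypothetical helper, and the fallback values are the defaults from the option table.

```python
def check_chunk_sizes(config):
    """Return a list of problems with the size settings in a chunker config."""
    # Fall back to the documented defaults for any missing key.
    min_size = config.get("min_size", 50)
    target_size = config.get("target_size", 500)
    max_size = config.get("max_size", 700)
    problems = []
    if min_size > target_size:
        problems.append(f"min_size ({min_size}) exceeds target_size ({target_size})")
    if target_size > max_size:
        problems.append(f"target_size ({target_size}) exceeds max_size ({max_size})")
    return problems
```

An empty list means the three values satisfy min_size ≤ target_size ≤ max_size.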

Full Pipeline YAML

Use with stratum document.pdf --pipeline-config pipeline.yaml:

name: my-pipeline
version: "1.0.0"

steps:
  - type: chunker
    target_size: 500
    max_size: 700
    min_size: 50
    heading_split_levels: [1, 2]

  - type: enrichment
    name: intra-document-reference
    languages:
      - en
      - fr

  - type: enrichment
    name: doc-summary
    llm_provider: google
    llm_model: gemini-2.0-flash
    max_doc_chars: 50000
    max_tokens: 512

  - type: enrichment
    name: keyword

  - type: enrichment
    name: topic
    llm_provider: google
    llm_model: gemini-2.0-flash

See examples/*/pipeline.yaml for complete pipeline configurations with real output.


Enrichers

Enrichers run after chunking and add metadata to each chunk under chunk.enrichments[key]. Configure them via --pipeline-config.

Overview

Name                      LLM Required      Output Key         Description
intra-document-reference  No (rule-based)   references         Detect cross-references to sections, figures, tables, equations, etc.
table-summarization       Yes               table_summary      Concise summary of table content
image-description         Yes (VLM)         image_description  Describe image content using a vision model
doc-summary               Yes               doc_summary        Document-level summary stored in every chunk
doc-context               Yes               doc_context        Per-chunk contextual description
cross-doc-context         Yes + embeddings  cross_doc_context  Context from similar chunks across documents
chunk-classification      Yes               classification     Classify chunk by section type (introduction, methods, results, …)
keyword                   No (TF-IDF)       keywords           Top keywords per chunk
topic                     Yes (1 call/doc)  topic              Discover and assign topic labels

LLM Provider Support

Provider        Config value  API key env var
OpenAI          openai        OPENAI_API_KEY
Anthropic       anthropic     ANTHROPIC_API_KEY
Google          google        GOOGLE_API_KEY
Local / custom  local         LOCAL_API_KEY (optional); requires llm_base_url

Table Summarization

Generates a concise LLM-based summary of each table chunk.

- type: enrichment
  name: table-summarization
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_tokens: 300
  fallback_enabled: true   # extract table headers when LLM unavailable

Image Description

Describes image content using a vision model (VLM).

- type: enrichment
  name: image-description
  llm_provider: local
  llm_model: /models/gemma-3-12b-it
  llm_base_url: http://localhost:8000/v1
  fallback_enabled: true   # use parser captions when VLM unavailable

Document Summary

One LLM call per document. The summary is stored in every chunk — useful for retrieval systems that need document context alongside individual chunks.

- type: enrichment
  name: doc-summary
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  max_doc_chars: 50000
  max_tokens: 512
  temperature: 0.3

Document Context

Per-chunk contextual description. Two modes:

Mode                 Description
neighbors (default)  Summarises each chunk in light of its surrounding chunks
document             Explains each chunk in light of the full document text

- type: enrichment
  name: doc-context
  llm_provider: google
  llm_model: gemini-2.0-flash
  mode: neighbors
  n_neighbors: 3
  max_tokens: 256
  temperature: 0.3
  max_workers: 8

To run two doc-context enrichers in the same pipeline without one overwriting the other, use context_key:

- type: enrichment
  name: doc-context
  mode: neighbors
  n_neighbors: 3
  context_key: doc_context_neighbors
  llm_provider: google
  llm_model: gemini-2.0-flash

- type: enrichment
  name: doc-context
  mode: document
  context_key: doc_context_document
  max_doc_chars: 30000
  llm_provider: google
  llm_model: gemini-2.0-flash

Cross-Document Context

Finds similar chunks across documents using a FAISS vector index, then generates contextual descriptions. Use for corpus-level enrichment.

- type: enrichment
  name: cross-doc-context
  vector_store_index_path: db/corpus/faiss.index
  vector_store_metadata_path: db/corpus/metadata.pkl
  embedding_provider: local
  embedding_model: nomic-embed-text
  embedding_base_url: http://localhost:11434/v1
  embedding_batch_size: 32
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  n_neighbors: 5
  max_tokens: 512
  max_workers: 16

Chunk Classification

Classifies each chunk into a section category (introduction, methods, results, discussion, etc.).

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  max_tokens: 50
  temperature: 0.0
  max_workers: 8

Keyword Extraction

TF-IDF-based. No LLM required. Output: list of {"term": str, "score": float}.

- type: enrichment
  name: keyword
  top_n: 8
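
Since each keyword carries a score, downstream code can rank or filter terms. A minimal sketch, assuming the list is stored under chunk.enrichments["keywords"] as the overview table indicates; top_terms is a hypothetical helper.

```python
def top_terms(chunk, n=3):
    """Return the n highest-scoring keyword terms for one chunk."""
    keywords = chunk.get("enrichments", {}).get("keywords", [])
    ranked = sorted(keywords, key=lambda kw: kw["score"], reverse=True)
    return [kw["term"] for kw in ranked[:n]]


# Illustrative data only; real scores come from the keyword enricher.
chunk = {"enrichments": {"keywords": [
    {"term": "lattice", "score": 0.41},
    {"term": "phonon", "score": 0.63},
    {"term": "sample", "score": 0.12},
]}}
best = top_terms(chunk, n=2)
```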

Topic Modeling

Discovers topic labels from chunk keywords using one LLM call per document. Run keyword first.

- type: enrichment
  name: keyword

- type: enrichment
  name: topic
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_keywords: 300
  max_tokens: 400
  temperature: 0.3

Custom Prompt Files

The following LLM-based enrichers accept a prompt_file parameter pointing to a YAML file that overrides the built-in prompt: doc-summary, doc-context, chunk-classification, topic, and cross-doc-context.

YAML format:

# system prompt is optional
system: |
  You are a precise document analyst. Be concise and factual.

# user_template is required — use {variable} placeholders
user_template: |
  Summarize the following document in 3 sentences.

  DOCUMENT:
  {document_text}

Available template variables by enricher:

Enricher                 Variables
doc-summary              {document_text}
doc-context (neighbors)  {previous_chunks}, {current_chunk}, {following_chunks}
doc-context (document)   {document_text}, {current_chunk}
chunk-classification     {chunk_text}, {categories} (taxonomy mode only)
topic                    {keywords_text}
cross-doc-context        {context_block}, {target_text}
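
A prompt file can be checked for misspelled placeholders before a run. The sketch below assumes that {variable} placeholders follow Python's str.format syntax, which this guide does not confirm; unknown_placeholders and the dictionary keys are hypothetical names.

```python
from string import Formatter

# Variables copied from the table above.
ALLOWED_VARIABLES = {
    "doc-summary": {"document_text"},
    "doc-context/neighbors": {"previous_chunks", "current_chunk", "following_chunks"},
    "doc-context/document": {"document_text", "current_chunk"},
    "chunk-classification": {"chunk_text", "categories"},
    "topic": {"keywords_text"},
    "cross-doc-context": {"context_block", "target_text"},
}


def unknown_placeholders(user_template, enricher):
    """Return placeholder names the given enricher does not provide."""
    used = {field for _, field, _, _ in Formatter().parse(user_template) if field}
    return sorted(used - ALLOWED_VARIABLES[enricher])
```

An empty result means every placeholder in user_template is one the enricher supplies.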

Pipeline YAML usage:

- type: enrichment
  name: doc-summary
  llm_provider: google
  llm_model: gemini-2.0-flash
  prompt_file: prompts/my_summary.yaml

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  prompt_file: prompts/my_classification.yaml

DSPy readiness: The YAML format reserves a dspy_signature key (commented out) for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm.

vLLM Priority Scheduling

When using vLLM with priority scheduling, add priority: N to any enrichment step. vLLM handles requests with lower values first, so a lower number means higher priority.

- type: enrichment
  name: doc-context
  llm_provider: local
  llm_model: /models/my-model
  llm_base_url: http://localhost:8000/v1
  max_workers: 8
  priority: 5

Reference Detection Enricher

The intra-document-reference enricher detects internal cross-references in document text (figures, tables, sections, equations, etc.) using rule-based regex patterns. No LLM required.

Supported Reference Types

Type       English Examples
section    Section 3, Section 2.4.3, §5, Sec. 1.2
figure     Figure 1, Fig. 2, Figures 1–3, Fig. 3a
table      Table 1, Tab. 2, Tables 1–3, Table A1
appendix   Appendix A, App. B.1
equation   Equation 1, Eq. (5), Eqn. 3
algorithm  Algorithm 1, Alg. 2
listing    Listing 1
chapter    Chapter 1, Ch. 3

Detection features across all types:

  • Deep hierarchical numbering (Section 1.2.3.4)
  • Alphanumeric suffixes (Section 2a, Figure 3b)
  • Range detection (Sections 2–5, Figures 1–3)
  • List detection (Figures 1, 2, and 3)
  • False positive filtering ("figure of speech" excluded)

Supported Languages

Code  Language    Figure                                   Table           Section
en    English     Figure, Fig.                             Table, Tab.     Section, Sec., §
fr    French      Figure, Schéma, Graphique, Illustration  Tableau, Tab.   Section, Partie, Paragraphe
de    German      Abbildung, Abb.                          Tabelle, Tab.   Abschnitt, Kap.
es    Spanish     Figura, Fig.                             Tabla, Tab.     Sección, Cap.
it    Italian     Figura, Fig.                             Tabella, Tab.   Sezione, Cap.
pt    Portuguese  Figura, Fig.                             Tabela, Tab.    Seção, Cap.
nl    Dutch       Figuur, Fig.                             Tabel, Tab.     Sectie, H.
ru    Russian     Рисунок, Рис.                            Таблица, Табл.  Раздел, Гл.
zh    Chinese                                                              节、章
ar    Arabic      شكل                                      جدول            قسم، فصل

Full language names (english, french, german, etc.) are also accepted.

YAML Configuration

- type: enrichment
  name: intra-document-reference
  confidence_threshold: 0.5
  # Language codes: en, fr, de, es, it, pt, nl, ru, zh, ar
  # Default: [en]. Combine for multilingual documents.
  languages:
    - en
    - fr
  # Optionally restrict to specific types
  enabled_types:
    - section
    - figure
    - table
  # Or disable specific types
  disabled_types:
    - algorithm
  # Domain-specific reference type names (auto-generates patterns)
  # Each name creates a new ref_type (lowercased) in the output
  custom_reference_types:
    - Theorem
    - Lemma
    - Proposition
  # Map custom keywords to existing ref_types for full resolution support
  custom_type_mappings:
    Illustration: figure
    Diagram: figure
  # Load custom type names or mappings from a file
  custom_types_file: math_types.txt
  # Add raw regex patterns for an existing type
  custom_patterns:
    section:
      - "\\bPart\\s+(\\d+)"
  # Resolve references to artifact/chunk IDs
  resolve_references: true

Custom Types File

Load custom type definitions from a plain-text file:

# math_types.txt
# Lines starting with '#' and blank lines are ignored.

Theorem              # creates new "theorem" ref_type
Lemma                # creates new "lemma" ref_type
Illustration:figure  # maps "Illustration" to existing figure type
Diagram:figure       # maps "Diagram" to existing figure type

Lines without a colon create new ref_types. Lines with Name:target_type create alias mappings. File contents are merged with any values provided directly in the YAML.
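
The file format is simple enough to check with a few lines of code. The sketch below mirrors the rules described above; parse_custom_types is a hypothetical helper, and stripping trailing "#" comments is an assumption based on the example file rather than a documented rule.

```python
def parse_custom_types(lines):
    """Split a custom types file into new type names and alias mappings."""
    new_types, mappings = [], {}
    for line in lines:
        entry = line.split("#", 1)[0].strip()  # assumption: trailing comments allowed
        if not entry:
            continue  # blank line or comment-only line
        if ":" in entry:
            name, target = (part.strip() for part in entry.split(":", 1))
            mappings[name] = target  # alias to an existing ref_type
        else:
            new_types.append(entry)  # becomes a new ref_type, lowercased in output
    return new_types, mappings


example = [
    "# math_types.txt",
    "",
    "Theorem              # creates new \"theorem\" ref_type",
    "Lemma",
    "Illustration:figure  # maps to existing figure type",
    "Diagram:figure",
]
new_types, mappings = parse_custom_types(example)
```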

Reference Resolution

When resolve_references: true, detected references are linked to their targets:

Type      Resolves via                     artifact_id
figure    Caption text in image chunks     image artifact ID
table     Caption text in table chunks     table artifact ID
section   Section number in heading path   section_<number>
appendix  Appendix letter in heading path  appendix_<letter>

Output Format

References are stored in chunk.enrichments.references.intra_document as a list:

{
  "type": "section",
  "raw_text": "Section 2.4.3",
  "normalized": "2.4.3",
  "position": { "start": 4, "end": 17 },
  "confidence": 0.95,
  "is_range": false,
  "resolved_to": {
    "artifact_id": "section_2.4.3",
    "chunk_id": "doc_chunk_005"
  }
}

position.start / position.end are character offsets within the chunk's text field.
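
The offsets can be used to recover the matched span from the chunk text. In the sketch below, reference_span is a hypothetical helper and the chunk text is made up. It treats end as exclusive, which matches the example above (17 − 4 = 13, the length of "Section 2.4.3").

```python
def reference_span(chunk_text, reference):
    """Slice the referenced text out of a chunk using its character offsets."""
    position = reference["position"]
    return chunk_text[position["start"]:position["end"]]


chunk_text = "See Section 2.4.3 for details."
reference = {
    "type": "section",
    "raw_text": "Section 2.4.3",
    "position": {"start": 4, "end": 17},
}
span = reference_span(chunk_text, reference)
```

Comparing the slice with raw_text is a quick sanity check that the offsets refer to the same text field.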


PII Detection

The pipeline accepts type: pii steps in pipeline YAML, but concrete detection implementations are still under development. For now the pipeline recognises these steps and skips them cleanly, so a configuration like the one below is accepted but performs no detection yet.

steps:
  - type: pii
    name: regex-pii
    mode: detect
    types:
      - email
      - phone

Supported Formats

Format        Extension(s)      Notes
PDF           .pdf
DOCX / DOC    .docx, .doc
HTML          .html, .htm
Markdown      .md, .markdown
Plain text    .txt
OCR JSON      .json
Images (OCR)  .png, .jpg, etc.  Requires pip install "stratum[ocr]"

Output Format Overview

Stratum produces Canonical v1.2 JSON. Each document has:

  • A document object with doc_id, source_file, format, title, total_pages, and an enrichments list tracking which enrichers ran.
  • A chunks array where each chunk has id, text, heading_path, page_start, page_end, content_flags, artifacts, and enrichments.
  • A top-level artifacts object aggregating all image/table paths.

For the complete schema, field reference, and jq examples, see output-format.md.
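
As an illustration of working with the chunk fields listed above, the sketch below finds the chunks that cover a given page. chunks_on_page is a hypothetical helper. It assumes page_start and page_end are inclusive integers and may be empty for unpaginated formats; output-format.md is the authority on both points.

```python
def chunks_on_page(document, page):
    """Return the ids of chunks whose page range covers the given page."""
    hits = []
    for chunk in document.get("chunks", []):
        start, end = chunk.get("page_start"), chunk.get("page_end")
        if start is None or end is None:
            continue  # assumption: unpaginated formats may leave these empty
        if start <= page <= end:
            hits.append(chunk["id"])
    return hits


# Illustrative data only, using the fields named above.
document = {"chunks": [
    {"id": "doc_chunk_001", "page_start": 1, "page_end": 2},
    {"id": "doc_chunk_002", "page_start": 2, "page_end": 3},
    {"id": "doc_chunk_003", "page_start": None, "page_end": None},
]}
page_two = chunks_on_page(document, 2)
```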


Examples

The examples/ directory contains complete working examples with real documents and enriched output:

  • examples/arxiv_paper/ — ArXiv paper processed with Gemini 2.0 Flash: pipeline.yaml, output.json, extracted images/ and tables/
  • examples/thesis_physics/ — Three physics PhD theses processed with local Qwen3-4B: pipeline.yaml, output/ (one file per thesis), images/, tables/

Each example includes the exact pipeline.yaml used and the resulting output.