Usage Guide¶

Table of Contents¶

Setup and Installation
CLI Reference
Configuration Reference
Enrichers
Reference Detection Enricher
PII Detection
Supported Formats
Output Format Overview
Examples

Setup and Installation¶

Install¶

pip install stratum

# With OCR support (image-based documents)
pip install "stratum[ocr]"

Required Environment Variables¶

Before using Stratum, you must set two environment variables. Stratum will not run without them.

export STRATUM_LICENSE_KEY=<your-key>
export STRATUM_USER_COMPANY_EMAIL=<your-company-email>

Contact your account representative to obtain these credentials.
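
If Stratum is launched from a Python script, a small preflight check can report missing credentials before the CLI is invoked. This is a minimal sketch: only the two variable names come from this guide, and the helper name `missing_stratum_vars` is hypothetical.

```python
import os

# Variable names as listed in this guide.
REQUIRED_VARS = ("STRATUM_LICENSE_KEY", "STRATUM_USER_COMPANY_EMAIL")


def missing_stratum_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_stratum_vars()` with no argument inspects the current process environment; an empty list means both variables are set.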

Optional Environment Variables for Enrichment¶

LLM-based enrichers read API keys from the environment:

Variable	Provider
`OPENAI_API_KEY`	OpenAI
`ANTHROPIC_API_KEY`	Anthropic
`GOOGLE_API_KEY`	Google Gemini
`LOCAL_API_KEY`	Local / custom OpenAI-compatible endpoint (optional)
`LLM_API_KEY`	Generic key used by table/image enrichers
`LLM_BASE_URL`	Base URL for table/image enrichers
`LLM_MODEL`	Model name for table/image enrichers
`VLM_API_KEY`	Vision model key for image description
`VLM_BASE_URL`	Vision model base URL
`VLM_MODEL`	Vision model name

CLI Reference¶

Single File¶

# Auto-detect format, print to stdout
stratum document.md

# Write to file
stratum document.md -o output.json

# Process a PDF
stratum paper.pdf -o chunks.json

# Verbose output
stratum document.md -o output.json -v

Directory Processing¶

# One output file per document (default)
stratum docs/ -o output/

# All results in one file
stratum docs/ -o output/all.json --output-mode combined

# Both separate files and combined
stratum docs/ -o output/ --output-mode both

# Parallel workers
stratum docs/ -o output/ --workers 4

# Progress bar (requires tqdm)
stratum docs/ -o output/ --progress --workers 4

Output Format¶

# JSON (default)
stratum document.md -o output.json --format json

# JSONL (one document object per line)
stratum document.md -o output.jsonl --format jsonl
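
Because JSONL stores one document object per line, it can be read back line by line. The sketch below assumes nothing beyond that layout; `parse_jsonl` is a hypothetical helper, not part of Stratum.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list with one object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For a file on disk, pass its contents, for example `parse_jsonl(open("output.jsonl", encoding="utf-8").read())`.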

Parser Selection¶

# Auto-detect (default)
stratum document.pdf -o output.json

# Force specific parser
stratum document.pdf --parser docling -o output.json
stratum document.md  --parser md-parser -o output.json

Image and Table Extraction¶

# Extract images (PNG, 300 DPI for PDFs)
stratum paper.pdf --images-dir output/images -o output.json

# Extract tables as Markdown files (default mode: file)
stratum paper.pdf --tables-dir output/tables -o output.json

# Both together with custom DPI
stratum paper.pdf \
  --images-dir output/images \
  --tables-dir output/tables \
  --image-dpi 150 \
  -o output.json

Images are named {doc_stem}_img_{NNN}.png and tables {doc_stem}_table_{NNN}.md. Their paths appear in chunk artifacts:

{
  "artifacts": {
    "images": ["output/images/paper_img_001.png"],
    "tables": ["output/tables/paper_table_001.md"]
  }
}

Tables Mode (`--tables-mode`)¶

Controls how extracted tables are stored. Three modes:

Mode	`artifacts.tables`	`artifacts.tables_inline`	Description
`file` (default)	File paths	(absent)	Tables saved to `--tables-dir` as `.md` files
`inline`	(absent)	Markdown text strings	Table content embedded directly in chunk artifacts, no files written
`both`	File paths	Markdown text strings	Both file paths and inline text populated simultaneously

# Default: save table files
stratum paper.pdf --tables-dir output/tables -o output.json

# Inline only — no files, table markdown in artifacts.tables_inline
stratum paper.pdf --tables-mode inline -o output.json

# Both: files + inline text
stratum paper.pdf --tables-dir output/tables --tables-mode both -o output.json

Design note: Table content in artifacts.tables_inline is intentionally separate from chunk.text. The chunk text is what embedding models index — keeping raw table markdown out avoids polluting vector representations. The inline field is for downstream code that needs table data explicitly (e.g. rendering, re-ranking, table-aware enrichers).
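
Downstream code that needs table content can be written so that it works in every tables mode. The following is a sketch under the assumption that each chunk's `artifacts` is a plain dict with the `tables` and `tables_inline` keys described above; the function name and the `base_dir` parameter are hypothetical.

```python
from pathlib import Path


def table_markdown(artifacts, base_dir="."):
    """Return the table Markdown strings for one chunk's artifacts dict.

    Prefers artifacts["tables_inline"] (modes inline and both); otherwise
    reads the files listed in artifacts["tables"] (mode file).
    """
    inline = artifacts.get("tables_inline")
    if inline:
        return list(inline)
    # Relative paths are resolved against base_dir; absolute paths are kept.
    return [
        (Path(base_dir) / path).read_text(encoding="utf-8")
        for path in artifacts.get("tables", [])
    ]
```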

Images: Images are always stored as file paths (artifacts.images). There is no inline/base64 mode — embedding binary image data in JSON is impractical for RAG pipelines. Use the image-description enricher to generate text descriptions instead.

Configuration Files¶

# Chunker-only YAML config
stratum document.pdf -c config.yaml -o output.json

# Override config with CLI flags
stratum document.md --target-size 300 --max-size 500 -o output.json

# Full modular pipeline config (parser + chunker + enrichment steps)
stratum document.pdf --pipeline-config pipeline.yaml -o output.json

Full Flag Reference¶

stratum [OPTIONS] INPUT

Arguments:
  INPUT                         Input file or directory to process

Options:
  -o, --output PATH             Output file (JSON) or directory
  -c, --config PATH             Path to YAML configuration file (chunker settings)
  --target-size INT             Target chunk size (words)
  --max-size INT                Maximum chunk size (words)
  --min-size INT                Minimum chunk size (words)
  --format {json,jsonl}         Output format (default: json)
  --output-mode {separate,combined,both}
                                For directories (default: separate)
  --parser {docling,md-parser}  Parser to use (default: auto-detect)
  --images-dir PATH             Directory to save extracted images (PNG)
  --tables-dir PATH             Directory to save extracted tables (Markdown .md)
  --tables-mode {file,inline,both}
                                Table storage mode (default: file).
                                file: paths in artifacts.tables (requires --tables-dir)
                                inline: markdown text in artifacts.tables_inline
                                both: file paths + inline text
  --image-dpi INT               DPI for image rendering (default: 300).
                                Only applies to PDFs when --images-dir is set.
  --pipeline-config PATH        Modular pipeline YAML (parser + chunker + enrichment)
  -w, --workers INT             Parallel workers for directory processing (default: 1)
  --progress                    Show progress bar for directories (requires tqdm)
  -v, --verbose                 Print verbose output
  --version                     Show version
  --help                        Show help

Configuration Reference¶

Chunker-Only YAML¶

Use with stratum document.pdf -c config.yaml:

target_size: 500       # soft target size (words)
max_size: 700          # hard maximum
min_size: 50           # smaller chunks fuse with neighbours
heading_split_levels: [1, 2]
preserve_code: true
preserve_tables: true
overlap_size: 0

Option	Default	Description
`target_size`	500	Soft target — splits at good boundaries near this value
`max_size`	700	Hard limit — always split if exceeded
`min_size`	50	Fusion threshold — tiny chunks merge with neighbours
`size_unit`	`words`	`words` or `chars`
`heading_split_levels`	[1, 2]	Heading levels that force a split
`always_split_on_heading`	true	Split on headings even when below target_size
`preserve_code`	true	Keep code blocks intact
`preserve_tables`	true	Keep tables intact
`include_heading_in_text`	true	Prepend heading hierarchy to chunk text
`overlap_size`	0	Words to overlap with previous chunk

Constraint: min_size ≤ target_size ≤ max_size.
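
The constraint can be checked before a run. This sketch validates an already-parsed config dict using the defaults from the table above; `check_chunk_sizes` is a hypothetical helper, and Stratum's own validation may behave differently.

```python
def check_chunk_sizes(config):
    """Raise ValueError unless min_size <= target_size <= max_size."""
    # Defaults mirror the option table in this section.
    min_size = config.get("min_size", 50)
    target_size = config.get("target_size", 500)
    max_size = config.get("max_size", 700)
    if not (min_size <= target_size <= max_size):
        raise ValueError(
            f"expected min_size <= target_size <= max_size, "
            f"got {min_size}, {target_size}, {max_size}"
        )
    return min_size, target_size, max_size
```

Note that overriding only `target_size` can violate the constraint against the default `max_size`, which is why the check fills in defaults first.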

Full Pipeline YAML¶

Use with stratum document.pdf --pipeline-config pipeline.yaml:

name: my-pipeline
version: "1.0.0"

steps:
  - type: chunker
    target_size: 500
    max_size: 700
    min_size: 50
    heading_split_levels: [1, 2]

  - type: enrichment
    name: intra-document-reference
    languages:
      - en
      - fr

  - type: enrichment
    name: doc-summary
    llm_provider: google
    llm_model: gemini-2.0-flash
    max_doc_chars: 50000
    max_tokens: 512

  - type: enrichment
    name: keyword

  - type: enrichment
    name: topic
    llm_provider: google
    llm_model: gemini-2.0-flash

See examples/*/pipeline.yaml for complete pipeline configurations with real output.

Enrichers¶

Enrichers run after chunking and add metadata to each chunk under chunk.enrichments[key]. Configure them via --pipeline-config.

Overview¶

Name	LLM Required	Output Key	Description
`intra-document-reference`	No (rule-based)	`references`	Detect cross-references to sections, figures, tables, equations, etc.
`table-summarization`	Yes	`table_summary`	Concise summary of table content
`image-description`	Yes (VLM)	`image_description`	Describe image content using a vision model
`doc-summary`	Yes	`doc_summary`	Document-level summary stored in every chunk
`doc-context`	Yes	`doc_context`	Per-chunk contextual description
`cross-doc-context`	Yes + embeddings	`cross_doc_context`	Context from similar chunks across documents
`chunk-classification`	Yes	`classification`	Classify chunk by section type (introduction, methods, results, …)
`keyword`	No (TF-IDF)	`keywords`	Top keywords per chunk
`topic`	Yes (1 call/doc)	`topic`	Discover and assign topic labels

LLM Provider Support¶

Provider	Config value	API key env var
OpenAI	`openai`	`OPENAI_API_KEY`
Anthropic	`anthropic`	`ANTHROPIC_API_KEY`
Google	`google`	`GOOGLE_API_KEY`
Local / custom	`local`	`LOCAL_API_KEY` (optional) — requires `llm_base_url`

Table Summarization¶

Generates a concise LLM-based summary of each table chunk.

- type: enrichment
  name: table-summarization
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_tokens: 300
  fallback_enabled: true   # extract table headers when LLM unavailable

Image Description¶

Describes image content using a vision model (VLM).

- type: enrichment
  name: image-description
  llm_provider: local
  llm_model: /models/gemma-3-12b-it
  llm_base_url: http://localhost:8000/v1
  fallback_enabled: true   # use parser captions when VLM unavailable

Document Summary¶

One LLM call per document. The summary is stored in every chunk — useful for retrieval systems that need document context alongside individual chunks.

- type: enrichment
  name: doc-summary
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  max_doc_chars: 50000
  max_tokens: 512
  temperature: 0.3

Document Context¶

Per-chunk contextual description. Two modes:

Mode	Description
`neighbors` (default)	Summarises each chunk in light of its surrounding chunks
`document`	Explains each chunk in light of the full document text

- type: enrichment
  name: doc-context
  llm_provider: google
  llm_model: gemini-2.0-flash
  mode: neighbors
  n_neighbors: 3
  max_tokens: 256
  temperature: 0.3
  max_workers: 8

To run two doc-context enrichers in the same pipeline without one overwriting the other, use context_key:

- type: enrichment
  name: doc-context
  mode: neighbors
  n_neighbors: 3
  context_key: doc_context_neighbors
  llm_provider: google
  llm_model: gemini-2.0-flash

- type: enrichment
  name: doc-context
  mode: document
  context_key: doc_context_document
  max_doc_chars: 30000
  llm_provider: google
  llm_model: gemini-2.0-flash

Cross-Document Context¶

Finds similar chunks across documents using a FAISS vector index, then generates contextual descriptions. Use for corpus-level enrichment.

- type: enrichment
  name: cross-doc-context
  vector_store_index_path: db/corpus/faiss.index
  vector_store_metadata_path: db/corpus/metadata.pkl
  embedding_provider: local
  embedding_model: nomic-embed-text
  embedding_base_url: http://localhost:11434/v1
  embedding_batch_size: 32
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  n_neighbors: 5
  max_tokens: 512
  max_workers: 16

Chunk Classification¶

Classifies each chunk into a section category (introduction, methods, results, discussion, etc.).

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  max_tokens: 50
  temperature: 0.0
  max_workers: 8

Keyword Extraction¶

TF-IDF-based. No LLM required. Output: list of {"term": str, "score": float}.

- type: enrichment
  name: keyword
  top_n: 8
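
The keyword list can be consumed directly by downstream code. The sketch below assumes the list sits at `chunk["enrichments"]["keywords"]`, following the output key in the overview table; check output-format.md for the exact nesting. `top_terms` is a hypothetical helper.

```python
def top_terms(chunk, limit=3):
    """Return the highest-scoring keyword terms for one chunk dict."""
    # Assumed location of the keyword list; verify against the real schema.
    keywords = chunk.get("enrichments", {}).get("keywords", [])
    ranked = sorted(keywords, key=lambda kw: kw["score"], reverse=True)
    return [kw["term"] for kw in ranked[:limit]]
```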

Topic Modeling¶

Discovers topic labels from chunk keywords using one LLM call per document. Run keyword first.

- type: enrichment
  name: keyword

- type: enrichment
  name: topic
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_keywords: 300
  max_tokens: 400
  temperature: 0.3
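
Since topic depends on keyword output, the step order in a pipeline can be sanity-checked before running it. This is a sketch that operates on the already-parsed `steps` list (a list of dicts shaped like the YAML above); `topic_without_keyword` is a hypothetical helper, not a Stratum function.

```python
def topic_without_keyword(steps):
    """Return indices of topic steps that have no earlier keyword step."""
    seen_keyword = False
    problems = []
    for index, step in enumerate(steps):
        if step.get("type") != "enrichment":
            continue
        if step.get("name") == "keyword":
            seen_keyword = True
        elif step.get("name") == "topic" and not seen_keyword:
            problems.append(index)
    return problems
```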

Custom Prompt Files¶

The LLM-based enrichers doc-summary, doc-context, chunk-classification, topic, and cross-doc-context accept a prompt_file parameter pointing to a YAML file that overrides the built-in prompt.

YAML format:

# system prompt is optional
system: |
  You are a precise document analyst. Be concise and factual.

# user_template is required — use {variable} placeholders
user_template: |
  Summarize the following document in 3 sentences.

  DOCUMENT:
  {document_text}

Available template variables by enricher:

Enricher	Variables
`doc-summary`	`{document_text}`
`doc-context` (neighbors)	`{previous_chunks}`, `{current_chunk}`, `{following_chunks}`
`doc-context` (document)	`{document_text}`, `{current_chunk}`
`chunk-classification`	`{chunk_text}`, `{categories}` (taxonomy mode only)
`topic`	`{keywords_text}`
`cross-doc-context`	`{context_block}`, `{target_text}`
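
A custom template can be checked against this table before use. The sketch assumes the `{variable}` placeholders follow Python `str.format` syntax, which this guide does not state explicitly; the dictionary keys for the two doc-context modes and the function name are hypothetical labels.

```python
from string import Formatter

# Variables per enricher, copied from the table above.
ALLOWED_VARIABLES = {
    "doc-summary": {"document_text"},
    "doc-context:neighbors": {"previous_chunks", "current_chunk", "following_chunks"},
    "doc-context:document": {"document_text", "current_chunk"},
    "chunk-classification": {"chunk_text", "categories"},
    "topic": {"keywords_text"},
    "cross-doc-context": {"context_block", "target_text"},
}


def unknown_placeholders(enricher, user_template):
    """Return placeholder names the given enricher does not provide."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion).
    used = {field for _, field, _, _ in Formatter().parse(user_template) if field}
    return sorted(used - ALLOWED_VARIABLES[enricher])
```

An empty result means every placeholder in the template is listed for that enricher.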

Pipeline YAML usage:

- type: enrichment
  name: doc-summary
  llm_provider: google
  llm_model: gemini-2.0-flash
  prompt_file: prompts/my_summary.yaml

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  prompt_file: prompts/my_classification.yaml

DSPy readiness: The YAML format reserves a dspy_signature key (commented out) for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm.

vLLM Priority Scheduling¶

When using vLLM with priority scheduling, add priority: N to any enrichment step. vLLM handles requests with lower priority values first, so a smaller N means higher priority.

- type: enrichment
  name: doc-context
  llm_provider: local
  llm_model: /models/my-model
  llm_base_url: http://localhost:8000/v1
  max_workers: 8
  priority: 5

Reference Detection Enricher¶

The intra-document-reference enricher detects internal cross-references in document text (figures, tables, sections, equations, etc.) using rule-based regex patterns. No LLM required.

Supported Reference Types¶

Type	English Examples
`section`	Section 3, Section 2.4.3, §5, Sec. 1.2
`figure`	Figure 1, Fig. 2, Figures 1–3, Fig. 3a
`table`	Table 1, Tab. 2, Tables 1–3, Table A1
`appendix`	Appendix A, App. B.1
`equation`	Equation 1, Eq. (5), Eqn. 3
`algorithm`	Algorithm 1, Alg. 2
`listing`	Listing 1
`chapter`	Chapter 1, Ch. 3

Detection features across all types:

- Deep hierarchical numbering (Section 1.2.3.4)
- Alphanumeric suffixes (Section 2a, Figure 3b)
- Range detection (Sections 2–5, Figures 1–3)
- List detection (Figures 1, 2, and 3)
- False positive filtering ("figure of speech" excluded)
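
To illustrate how rule-based detection of this kind can work, here is a deliberately simplified, English-only sketch covering hierarchical numbering, letter suffixes, and ranges. The pattern and names are illustrative assumptions, not Stratum's actual rules, and list detection and confidence scoring are omitted.

```python
import re

# Simplified illustration only; Stratum's real patterns are not shown here.
REF_PATTERN = re.compile(
    r"\b(?P<kind>Sections?|Sec\.|Figures?|Fig\.|Tables?|Tab\.)\s*"
    r"(?P<first>\d+(?:\.\d+)*[a-z]?)"                 # 3, 2.4.3, 3a
    r"(?:\s*[–-]\s*(?P<last>\d+(?:\.\d+)*[a-z]?))?"   # optional range end
)

KIND_TO_TYPE = {"sec": "section", "fig": "figure", "tab": "table"}


def find_references(text):
    """Return a list of reference dicts found in text."""
    results = []
    for match in REF_PATTERN.finditer(text):
        results.append({
            "type": KIND_TO_TYPE[match.group("kind")[:3].lower()],
            "raw_text": match.group(0),
            "is_range": match.group("last") is not None,
            "position": {"start": match.start(), "end": match.end()},
        })
    return results
```

Requiring a capitalised keyword followed by a number is what keeps phrases such as "figure of speech" from matching in this sketch.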

Supported Languages¶

Code	Language	Figure	Table	Section
`en`	English	Figure, Fig.	Table, Tab.	Section, Sec., §
`fr`	French	Figure, Schéma, Graphique, Illustration	Tableau, Tab.	Section, Partie, Paragraphe
`de`	German	Abbildung, Abb.	Tabelle, Tab.	Abschnitt, Kap.
`es`	Spanish	Figura, Fig.	Tabla, Tab.	Sección, Cap.
`it`	Italian	Figura, Fig.	Tabella, Tab.	Sezione, Cap.
`pt`	Portuguese	Figura, Fig.	Tabela, Tab.	Seção, Cap.
`nl`	Dutch	Figuur, Fig.	Tabel, Tab.	Sectie, H.
`ru`	Russian	Рисунок, Рис.	Таблица, Табл.	Раздел, Гл.
`zh`	Chinese	图	表	节、章
`ar`	Arabic	شكل	جدول	قسم، فصل

Full language names (english, french, german, etc.) are also accepted.

YAML Configuration¶

- type: enrichment
  name: intra-document-reference
  confidence_threshold: 0.5
  # Language codes: en, fr, de, es, it, pt, nl, ru, zh, ar
  # Default: [en]. Combine for multilingual documents.
  languages:
    - en
    - fr
  # Optionally restrict to specific types
  enabled_types:
    - section
    - figure
    - table
  # Or disable specific types
  disabled_types:
    - algorithm
  # Domain-specific reference type names (auto-generates patterns)
  # Each name creates a new ref_type (lowercased) in the output
  custom_reference_types:
    - Theorem
    - Lemma
    - Proposition
  # Map custom keywords to existing ref_types for full resolution support
  custom_type_mappings:
    Illustration: figure
    Diagram: figure
  # Load custom type names or mappings from a file
  custom_types_file: math_types.txt
  # Add raw regex patterns for an existing type
  custom_patterns:
    section:
      - "\\bPart\\s+(\\d+)"
  # Resolve references to artifact/chunk IDs
  resolve_references: true

Custom Types File¶

Load custom type definitions from a plain-text file:

# math_types.txt
# Lines starting with '#' and blank lines are ignored.

Theorem              # creates new "theorem" ref_type
Lemma                # creates new "lemma" ref_type
Illustration:figure  # maps "Illustration" to existing figure type
Diagram:figure       # maps "Diagram" to existing figure type

Lines without a colon create new ref_types. Lines with Name:target_type create alias mappings. File contents are merged with any values provided directly in the YAML.
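
The file format is simple enough to parse by hand, which can help when checking a file before a run. The sketch below follows the rules stated above; treating a trailing `# ...` as an annotation is an assumption based on the example file, and `parse_custom_types` is a hypothetical helper.

```python
def parse_custom_types(text):
    """Split custom types file content into (new_types, mappings)."""
    new_types, mappings = [], {}
    for raw_line in text.splitlines():
        # Assumption: anything after '#' is a comment, as in the example file.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" in line:
            name, target = (part.strip() for part in line.split(":", 1))
            mappings[name] = target
        else:
            new_types.append(line)
    return new_types, mappings
```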

Reference Resolution¶

When resolve_references: true, detected references are linked to their targets:

Type	Resolves via	`artifact_id`
`figure`	Caption text in image chunks	image artifact ID
`table`	Caption text in table chunks	table artifact ID
`section`	Section number in heading path	`section_<number>`
`appendix`	Appendix letter in heading path	`appendix_<letter>`

Output Format¶

References are stored in chunk.enrichments.references.intra_document as a list:

{
  "type": "section",
  "raw_text": "Section 2.4.3",
  "normalized": "2.4.3",
  "position": { "start": 4, "end": 17 },
  "confidence": 0.95,

  "is_range": false,

  "resolved_to": {
    "artifact_id": "section_2.4.3",
    "chunk_id": "doc_chunk_005"
  }
}

position.start / position.end are character offsets within the chunk's text field.
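
The offsets can be verified by slicing the chunk text. This sketch assumes `end` is exclusive, which matches the example above ("Section 2.4.3" is 13 characters, from 4 to 17), and that references live at `chunk["enrichments"]["references"]["intra_document"]`; `mismatched_positions` is a hypothetical helper.

```python
def mismatched_positions(chunk):
    """Return references whose offsets do not reproduce raw_text."""
    text = chunk["text"]
    refs = (
        chunk.get("enrichments", {})
        .get("references", {})
        .get("intra_document", [])
    )
    mismatched = []
    for ref in refs:
        start, end = ref["position"]["start"], ref["position"]["end"]
        # Assumption: end is exclusive, so a plain slice recovers raw_text.
        if text[start:end] != ref["raw_text"]:
            mismatched.append(ref)
    return mismatched
```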

PII Detection¶

PII detection infrastructure is present in the pipeline and accepts type: pii steps in pipeline YAML. The pipeline recognises and skips PII steps cleanly. Concrete detection implementations are under development.

steps:
  - type: pii
    name: regex-pii
    mode: detect
    types:
      - email
      - phone

Supported Formats¶

Format	Extension(s)	Notes
PDF	`.pdf`	—
DOCX / DOC	`.docx`, `.doc`	—
HTML	`.html`, `.htm`	—
Markdown	`.md`, `.markdown`	—
Plain text	`.txt`	—
OCR JSON	`.json`	—
Images (OCR)	`.png`, `.jpg`, etc.	Requires `pip install "stratum[ocr]"`

Output Format Overview¶

Stratum produces Canonical v1.2 JSON. Each document has:

- A document object with doc_id, source_file, format, title, total_pages, and an enrichments list tracking which enrichers ran.
- A chunks array where each chunk has id, text, heading_path, page_start, page_end, content_flags, artifacts, and enrichments.
- A top-level artifacts object aggregating all image/table paths.

For the complete schema, field reference, and jq examples, see output-format.md.
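
As a quick orientation, a parsed output document can be summarised in a few lines of Python. The top-level key names `document` and `chunks` are assumptions inferred from the description above; output-format.md remains the authoritative schema, and `summarize_output` is a hypothetical helper.

```python
def summarize_output(doc):
    """Collect a few headline facts from one parsed output document."""
    # Assumed top-level keys: "document" and "chunks".
    chunks = doc.get("chunks", [])
    pages = [c["page_end"] for c in chunks if c.get("page_end") is not None]
    return {
        "doc_id": doc.get("document", {}).get("doc_id"),
        "chunk_count": len(chunks),
        "last_page": max(pages, default=None),
        "chunks_with_tables": sum(
            1 for c in chunks if c.get("artifacts", {}).get("tables")
        ),
    }
```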

Examples¶

The examples/ directory contains complete working examples with real documents and enriched output:

- examples/arxiv_paper/: arXiv paper processed with Gemini 2.0 Flash (pipeline.yaml, output.json, extracted images/ and tables/)
- examples/thesis_physics/: three physics PhD theses processed with local Qwen3-4B (pipeline.yaml, output/ with one file per thesis, images/, tables/)

Each example includes the exact pipeline.yaml used and the resulting output.
