Usage Guide¶
Table of Contents¶
- Setup and Installation
- CLI Reference
- Configuration Reference
- Enrichers
- Reference Detection Enricher
- PII Detection
- Supported Formats
- Output Format Overview
- Examples
Setup and Installation¶
Install¶
pip install stratum
# With OCR support (image-based documents)
pip install stratum[ocr]
Required Environment Variables¶
Before using stratum you must set two environment variables. Stratum will not run without them.
export STRATUM_LICENSE_KEY=<your-key>
export STRATUM_USER_COMPANY_EMAIL=<your-company-email>
Contact your account representative to obtain these credentials.
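A wrapper script can check for both variables before invoking the CLI. The following is a minimal sketch; the helper name `missing_stratum_env` is hypothetical and not part of stratum.

```python
import os

# The two variables this guide lists as required.
REQUIRED_VARS = ("STRATUM_LICENSE_KEY", "STRATUM_USER_COMPANY_EMAIL")


def missing_stratum_env(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of the stratum package.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"STRATUM_LICENSE_KEY": "abc123"}
print(missing_stratum_env(example_env))  # ['STRATUM_USER_COMPANY_EMAIL']
```

Calling `missing_stratum_env()` with no argument inspects the real process environment.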
Optional Environment Variables for Enrichment¶
LLM-based enrichers read API keys from the environment:
| Variable | Provider / purpose |
|---|---|
| OPENAI_API_KEY | OpenAI |
| ANTHROPIC_API_KEY | Anthropic |
| GOOGLE_API_KEY | Google Gemini |
| LOCAL_API_KEY | Local / custom OpenAI-compatible endpoint (optional) |
| LLM_API_KEY | Generic key used by table/image enrichers |
| LLM_BASE_URL | Base URL for table/image enrichers |
| LLM_MODEL | Model name for table/image enrichers |
| VLM_API_KEY | Vision model key for image description |
| VLM_BASE_URL | Vision model base URL |
| VLM_MODEL | Vision model name |
CLI Reference¶
Single File¶
# Auto-detect format, print to stdout
stratum document.md
# Write to file
stratum document.md -o output.json
# Process a PDF
stratum paper.pdf -o chunks.json
# Verbose output
stratum document.md -o output.json -v
Directory Processing¶
# One output file per document (default)
stratum docs/ -o output/
# All results in one file
stratum docs/ -o output/all.json --output-mode combined
# Both separate files and combined
stratum docs/ -o output/ --output-mode both
# Parallel workers
stratum docs/ -o output/ --workers 4
# Progress bar (requires tqdm)
stratum docs/ -o output/ --progress --workers 4
Output Format¶
# JSON (default)
stratum document.md -o output.json --format json
# JSONL (one document object per line)
stratum document.md -o output.jsonl --format jsonl
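Downstream code can read the JSONL variant one line at a time. The sketch below assumes each line is a single JSON object carrying the `document` and `chunks` keys described in the Output Format Overview; the sample data is made up.

```python
import io
import json


def iter_documents(stream):
    """Yield one parsed document object per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up two-document sample standing in for output.jsonl.
sample = io.StringIO(
    '{"document": {"doc_id": "a"}, "chunks": [{"id": "a_chunk_001"}]}\n'
    '{"document": {"doc_id": "b"}, "chunks": []}\n'
)
docs = list(iter_documents(sample))
print([d["document"]["doc_id"] for d in docs])  # ['a', 'b']
```

With a real file, pass `open("output.jsonl", encoding="utf-8")` in place of the `StringIO` object.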
Parser Selection¶
# Auto-detect (default)
stratum document.pdf -o output.json
# Force specific parser
stratum document.pdf --parser docling -o output.json
stratum document.md --parser md-parser -o output.json
Image and Table Extraction¶
# Extract images (PNG, 300 DPI for PDFs)
stratum paper.pdf --images-dir output/images -o output.json
# Extract tables as Markdown files (default mode: file)
stratum paper.pdf --tables-dir output/tables -o output.json
# Both together with custom DPI
stratum paper.pdf \
  --images-dir output/images \
  --tables-dir output/tables \
  --image-dpi 150 \
  -o output.json
Images are named {doc_stem}_img_{NNN}.png and tables {doc_stem}_table_{NNN}.md. Their paths appear in chunk artifacts:
{
  "artifacts": {
    "images": ["output/images/paper_img_001.png"],
    "tables": ["output/tables/paper_table_001.md"]
  }
}
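To gather every extracted file referenced by a document, a script can walk the chunks and read each `artifacts` object. This is a sketch under the assumption that the output has a top-level `chunks` array whose items carry `artifacts`, as the Output Format Overview describes; the sample document and the helper name `collect_artifact_paths` are made up.

```python
import json

# Made-up output fragment shaped like the artifacts example above.
output = json.loads("""
{
  "chunks": [
    {"id": "c1", "artifacts": {"images": ["output/images/paper_img_001.png"],
                               "tables": ["output/tables/paper_table_001.md"]}},
    {"id": "c2", "artifacts": {}},
    {"id": "c3", "artifacts": {"images": ["output/images/paper_img_002.png"]}}
  ]
}
""")


def collect_artifact_paths(doc, kind):
    """Return all paths of one artifact kind ("images" or "tables") in chunk order."""
    paths = []
    for chunk in doc.get("chunks", []):
        paths.extend(chunk.get("artifacts", {}).get(kind, []))
    return paths


print(collect_artifact_paths(output, "images"))
```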
Tables Mode (--tables-mode)¶
Controls how extracted tables are stored. Three modes:
| Mode | artifacts.tables | artifacts.tables_inline | Description |
|---|---|---|---|
| file (default) | File paths | (absent) | Tables saved to --tables-dir as .md files |
| inline | (absent) | Markdown text strings | Table content embedded directly in chunk artifacts, no files written |
| both | File paths | Markdown text strings | Both file paths and inline text populated simultaneously |
# Default: save table files
stratum paper.pdf --tables-dir output/tables -o output.json
# Inline only — no files, table markdown in artifacts.tables_inline
stratum paper.pdf --tables-mode inline -o output.json
# Both: files + inline text
stratum paper.pdf --tables-dir output/tables --tables-mode both -o output.json
Design note: Table content in artifacts.tables_inline is intentionally separate from chunk.text. The chunk text is what embedding models index — keeping raw table markdown out avoids polluting vector representations. The inline field is for downstream code that needs table data explicitly (e.g. rendering, re-ranking, table-aware enrichers).
Images: Images are always stored as file paths (artifacts.images). There is no inline/base64 mode — embedding binary image data in JSON is impractical for RAG pipelines. Use the image-description enricher to generate text descriptions instead.
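Consumers that should work with any --tables-mode setting can prefer the inline text and fall back to the files. A minimal sketch, assuming the field names from the table above; the helper name `table_markdown` is hypothetical.

```python
from pathlib import Path


def table_markdown(chunk):
    """Return the Markdown text of every table attached to a chunk.

    Prefers artifacts.tables_inline (inline and both modes); otherwise
    reads the files listed in artifacts.tables (file mode).
    """
    artifacts = chunk.get("artifacts", {})
    inline = artifacts.get("tables_inline")
    if inline:
        return list(inline)
    return [Path(p).read_text(encoding="utf-8") for p in artifacts.get("tables", [])]


# Made-up chunk as it might look in inline mode.
chunk = {"artifacts": {"tables_inline": ["| a | b |\n|---|---|\n| 1 | 2 |"]}}
print(table_markdown(chunk)[0].splitlines()[0])  # | a | b |
```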
Configuration Files¶
# Chunker-only YAML config
stratum document.pdf -c config.yaml -o output.json
# Override config with CLI flags
stratum document.md --target-size 300 --max-size 500 -o output.json
# Full modular pipeline config (parser + chunker + enrichment steps)
stratum document.pdf --pipeline-config pipeline.yaml -o output.json
Full Flag Reference¶
stratum [OPTIONS] INPUT

Arguments:
  INPUT                          Input file or directory to process

Options:
  -o, --output PATH              Output file (JSON) or directory
  -c, --config PATH              Path to YAML configuration file (chunker settings)
  --target-size INT              Target chunk size (words)
  --max-size INT                 Maximum chunk size (words)
  --min-size INT                 Minimum chunk size (words)
  --format {json,jsonl}          Output format (default: json)
  --output-mode {separate,combined,both}
                                 For directories (default: separate)
  --parser {docling,md-parser}   Parser to use (default: auto-detect)
  --images-dir PATH              Directory to save extracted images (PNG)
  --tables-dir PATH              Directory to save extracted tables (Markdown .md)
  --tables-mode {file,inline,both}
                                 Table storage mode (default: file).
                                   file: paths in artifacts.tables (requires --tables-dir)
                                   inline: markdown text in artifacts.tables_inline
                                   both: file paths + inline text
  --image-dpi INT                DPI for image rendering (default: 300).
                                 Only applies to PDFs when --images-dir is set.
  --pipeline-config PATH         Modular pipeline YAML (parser + chunker + enrichment)
  -w, --workers INT              Parallel workers for directory processing (default: 1)
  --progress                     Show progress bar for directories (requires tqdm)
  -v, --verbose                  Print verbose output
  --version                      Show version
  --help                         Show help
Configuration Reference¶
Chunker-Only YAML¶
Use with stratum document.pdf -c config.yaml:
target_size: 500 # soft target size (words)
max_size: 700 # hard maximum
min_size: 50 # smaller chunks fuse with neighbours
heading_split_levels: [1, 2]
preserve_code: true
preserve_tables: true
overlap_size: 0
| Option | Default | Description |
|---|---|---|
| target_size | 500 | Soft target: splits at good boundaries near this value |
| max_size | 700 | Hard limit: always split if exceeded |
| min_size | 50 | Fusion threshold: tiny chunks merge with neighbours |
| size_unit | words | words or chars |
| heading_split_levels | [1, 2] | Heading levels that force a split |
| always_split_on_heading | true | Split on headings even when below target_size |
| preserve_code | true | Keep code blocks intact |
| preserve_tables | true | Keep tables intact |
| include_heading_in_text | true | Prepend heading hierarchy to chunk text |
| overlap_size | 0 | Words to overlap with previous chunk |
Constraint: min_size ≤ target_size ≤ max_size.
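The constraint can be checked before a config file is handed to the CLI. The function below is an illustrative sketch, not stratum's own validation; it assumes the config has already been loaded into a dictionary and falls back to the defaults from the table above.

```python
def check_chunk_sizes(config):
    """Raise ValueError unless min_size <= target_size <= max_size.

    Missing keys fall back to the documented defaults (50, 500, 700).
    """
    min_size = config.get("min_size", 50)
    target_size = config.get("target_size", 500)
    max_size = config.get("max_size", 700)
    if not (min_size <= target_size <= max_size):
        raise ValueError(
            "expected min_size <= target_size <= max_size, "
            f"got {min_size}, {target_size}, {max_size}"
        )
    return min_size, target_size, max_size


print(check_chunk_sizes({"target_size": 300, "max_size": 500}))  # (50, 300, 500)
```

For example, `{"target_size": 800}` alone fails the check, because the default max_size of 700 is smaller than the target.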
Full Pipeline YAML¶
Use with stratum document.pdf --pipeline-config pipeline.yaml:
name: my-pipeline
version: "1.0.0"
steps:
  - type: chunker
    target_size: 500
    max_size: 700
    min_size: 50
    heading_split_levels: [1, 2]
  - type: enrichment
    name: intra-document-reference
    languages:
      - en
      - fr
  - type: enrichment
    name: doc-summary
    llm_provider: google
    llm_model: gemini-2.0-flash
    max_doc_chars: 50000
    max_tokens: 512
  - type: enrichment
    name: keyword
  - type: enrichment
    name: topic
    llm_provider: google
    llm_model: gemini-2.0-flash
See examples/*/pipeline.yaml for complete pipeline configurations with real output.
Enrichers¶
Enrichers run after chunking and add metadata to each chunk under chunk.enrichments[key]. Configure them via --pipeline-config.
Overview¶
| Name | LLM Required | Output Key | Description |
|---|---|---|---|
| intra-document-reference | No (rule-based) | references | Detect cross-references to sections, figures, tables, equations, etc. |
| table-summarization | Yes | table_summary | Concise summary of table content |
| image-description | Yes (VLM) | image_description | Describe image content using a vision model |
| doc-summary | Yes | doc_summary | Document-level summary stored in every chunk |
| doc-context | Yes | doc_context | Per-chunk contextual description |
| cross-doc-context | Yes + embeddings | cross_doc_context | Context from similar chunks across documents |
| chunk-classification | Yes | classification | Classify chunk by section type (introduction, methods, results, …) |
| keyword | No (TF-IDF) | keywords | Top keywords per chunk |
| topic | Yes (1 call/doc) | topic | Discover and assign topic labels |
LLM Provider Support¶
| Provider | Config value | API key env var |
|---|---|---|
| OpenAI | openai | OPENAI_API_KEY |
| Anthropic | anthropic | ANTHROPIC_API_KEY |
| Google Gemini | google | GOOGLE_API_KEY |
| Local / custom | local | LOCAL_API_KEY (optional); requires llm_base_url |
Table Summarization¶
Generates a concise LLM-based summary of each table chunk.
- type: enrichment
  name: table-summarization
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_tokens: 300
  fallback_enabled: true  # extract table headers when LLM unavailable
Image Description¶
Describes image content using a vision model (VLM).
- type: enrichment
  name: image-description
  llm_provider: local
  llm_model: /models/gemma-3-12b-it
  llm_base_url: http://localhost:8000/v1
  fallback_enabled: true  # use parser captions when VLM unavailable
Document Summary¶
One LLM call per document. The summary is stored in every chunk — useful for retrieval systems that need document context alongside individual chunks.
- type: enrichment
  name: doc-summary
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  max_doc_chars: 50000
  max_tokens: 512
  temperature: 0.3
Document Context¶
Per-chunk contextual description. Two modes:
| Mode | Description |
|---|---|
| neighbors (default) | Summarises each chunk in light of its surrounding chunks |
| document | Explains each chunk in light of the full document text |
- type: enrichment
  name: doc-context
  llm_provider: google
  llm_model: gemini-2.0-flash
  mode: neighbors
  n_neighbors: 3
  max_tokens: 256
  temperature: 0.3
  max_workers: 8
To run two doc-context enrichers in the same pipeline without one overwriting the other, use context_key:
- type: enrichment
  name: doc-context
  mode: neighbors
  n_neighbors: 3
  context_key: doc_context_neighbors
  llm_provider: google
  llm_model: gemini-2.0-flash
- type: enrichment
  name: doc-context
  mode: document
  context_key: doc_context_document
  max_doc_chars: 30000
  llm_provider: google
  llm_model: gemini-2.0-flash
Cross-Document Context¶
Finds similar chunks across documents using a FAISS vector index, then generates contextual descriptions. Use for corpus-level enrichment.
- type: enrichment
  name: cross-doc-context
  vector_store_index_path: db/corpus/faiss.index
  vector_store_metadata_path: db/corpus/metadata.pkl
  embedding_provider: local
  embedding_model: nomic-embed-text
  embedding_base_url: http://localhost:11434/v1
  embedding_batch_size: 32
  llm_provider: local
  llm_model: gemma3:4b
  llm_base_url: http://localhost:11434/v1
  n_neighbors: 5
  max_tokens: 512
  max_workers: 16
Chunk Classification¶
Classifies each chunk into a section category (introduction, methods, results, discussion, etc.).
- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  max_tokens: 50
  temperature: 0.0
  max_workers: 8
Keyword Extraction¶
TF-IDF-based. No LLM required. Output: list of {"term": str, "score": float}.
- type: enrichment
  name: keyword
  top_n: 8
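Because each keyword carries a score, downstream code can rank or truncate the list itself. The sketch below assumes the list sits under `chunk.enrichments.keywords` (the output key from the Overview table); the sample chunk and the helper name `top_terms` are made up, and the list is sorted explicitly because this guide does not state its order.

```python
def top_terms(chunk, n=3):
    """Return the n highest-scoring keyword terms of a chunk."""
    keywords = chunk.get("enrichments", {}).get("keywords", [])
    ranked = sorted(keywords, key=lambda kw: kw["score"], reverse=True)
    return [kw["term"] for kw in ranked[:n]]


# Made-up chunk with illustrative scores.
chunk = {
    "enrichments": {
        "keywords": [
            {"term": "chunking", "score": 0.41},
            {"term": "pipeline", "score": 0.58},
            {"term": "yaml", "score": 0.12},
        ]
    }
}
print(top_terms(chunk, n=2))  # ['pipeline', 'chunking']
```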
Topic Modeling¶
Discovers topic labels from chunk keywords using one LLM call per document. Run keyword first.
- type: enrichment
  name: keyword
- type: enrichment
  name: topic
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_keywords: 300
  max_tokens: 400
  temperature: 0.3
Custom Prompt Files¶
The LLM-based enrichers doc-summary, doc-context, chunk-classification, topic, and cross-doc-context accept a prompt_file parameter pointing to a YAML file that overrides the built-in prompt.
YAML format:
# system prompt is optional
system: |
  You are a precise document analyst. Be concise and factual.
# user_template is required; use {variable} placeholders
user_template: |
  Summarize the following document in 3 sentences.
  DOCUMENT:
  {document_text}
Available template variables by enricher:
| Enricher | Variables |
|---|---|
| doc-summary | {document_text} |
| doc-context (neighbors) | {previous_chunks}, {current_chunk}, {following_chunks} |
| doc-context (document) | {document_text}, {current_chunk} |
| chunk-classification | {chunk_text}, {categories} (taxonomy mode only) |
| topic | {keywords_text} |
| cross-doc-context | {context_block}, {target_text} |
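A custom template can be checked against this table before use. The sketch below assumes the {variable} placeholders follow Python `str.format` syntax, which this guide does not confirm; the dictionary keys such as `doc-context/neighbors` and the helper name `unknown_placeholders` are hypothetical labels for illustration.

```python
from string import Formatter

# Variables per enricher, copied from the table above.
ALLOWED = {
    "doc-summary": {"document_text"},
    "doc-context/neighbors": {"previous_chunks", "current_chunk", "following_chunks"},
    "doc-context/document": {"document_text", "current_chunk"},
    "chunk-classification": {"chunk_text", "categories"},
    "topic": {"keywords_text"},
    "cross-doc-context": {"context_block", "target_text"},
}


def unknown_placeholders(enricher, user_template):
    """Return placeholder names in the template that the enricher does not provide."""
    used = {field for _, field, _, _ in Formatter().parse(user_template) if field}
    return sorted(used - ALLOWED[enricher])


template = "Summarize:\n{document_text}\nFocus on {topic}."
print(unknown_placeholders("doc-summary", template))  # ['topic']
```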
Pipeline YAML usage:
- type: enrichment
  name: doc-summary
  llm_provider: google
  llm_model: gemini-2.0-flash
  prompt_file: prompts/my_summary.yaml
- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  prompt_file: prompts/my_classification.yaml
DSPy readiness: The YAML format reserves a dspy_signature key (commented out) for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm.
vLLM Priority Scheduling¶
When using vLLM with priority scheduling, add priority: N to any enrichment step. vLLM handles requests with lower priority values first, so a lower N means higher priority.
- type: enrichment
  name: doc-context
  llm_provider: local
  llm_model: /models/my-model
  llm_base_url: http://localhost:8000/v1
  max_workers: 8
  priority: 5
Reference Detection Enricher¶
The intra-document-reference enricher detects internal cross-references in document text (figures, tables, sections, equations, etc.) using rule-based regex patterns. No LLM required.
Supported Reference Types¶
| Type | English Examples |
|---|---|
| section | Section 3, Section 2.4.3, §5, Sec. 1.2 |
| figure | Figure 1, Fig. 2, Figures 1–3, Fig. 3a |
| table | Table 1, Tab. 2, Tables 1–3, Table A1 |
| appendix | Appendix A, App. B.1 |
| equation | Equation 1, Eq. (5), Eqn. 3 |
| algorithm | Algorithm 1, Alg. 2 |
| listing | Listing 1 |
| chapter | Chapter 1, Ch. 3 |
Detection features across all types:
- Deep hierarchical numbering (Section 1.2.3.4)
- Alphanumeric suffixes (Section 2a, Figure 3b)
- Range detection (Sections 2–5, Figures 1–3)
- List detection (Figures 1, 2, and 3)
- False positive filtering ("figure of speech" excluded)
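The numbering, suffix, and range features can be illustrated with a single regular expression. This is a simplified sketch for English section references only; it is not one of stratum's built-in patterns, which this guide does not show, and it omits list detection and confidence scoring.

```python
import re

# Illustrative pattern only, not stratum's actual implementation.
SECTION_RE = re.compile(
    r"\b(?:Sections?|Sec\.)\s+"              # keyword: Section, Sections, Sec.
    r"(\d+(?:\.\d+)*[a-z]?)"                 # number: 2, 2.4.3, 2a
    r"(?:\s*[–-]\s*(\d+(?:\.\d+)*[a-z]?))?"  # optional range end: 2–5
)

text = "See Section 2.4.3 and Sections 2–5, or Sec. 1.2."
for match in SECTION_RE.finditer(text):
    start, end = match.groups()
    print(match.group(0), "->", (start, end))
```

Because the pattern requires a number after the keyword, a phrase such as "this section of the report" produces no match.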
Supported Languages¶
| Code | Language | Figure | Table | Section |
|---|---|---|---|---|
| en | English | Figure, Fig. | Table, Tab. | Section, Sec., § |
| fr | French | Figure, Schéma, Graphique, Illustration | Tableau, Tab. | Section, Partie, Paragraphe |
| de | German | Abbildung, Abb. | Tabelle, Tab. | Abschnitt, Kap. |
| es | Spanish | Figura, Fig. | Tabla, Tab. | Sección, Cap. |
| it | Italian | Figura, Fig. | Tabella, Tab. | Sezione, Cap. |
| pt | Portuguese | Figura, Fig. | Tabela, Tab. | Seção, Cap. |
| nl | Dutch | Figuur, Fig. | Tabel, Tab. | Sectie, H. |
| ru | Russian | Рисунок, Рис. | Таблица, Табл. | Раздел, Гл. |
| zh | Chinese | 图 | 表 | 节、章 |
| ar | Arabic | شكل | جدول | قسم، فصل |
Full language names (english, french, german, etc.) are also accepted.
YAML Configuration¶
- type: enrichment
  name: intra-document-reference
  confidence_threshold: 0.5
  # Language codes: en, fr, de, es, it, pt, nl, ru, zh, ar
  # Default: [en]. Combine for multilingual documents.
  languages:
    - en
    - fr
  # Optionally restrict to specific types
  enabled_types:
    - section
    - figure
    - table
  # Or disable specific types
  disabled_types:
    - algorithm
  # Domain-specific reference type names (auto-generates patterns)
  # Each name creates a new ref_type (lowercased) in the output
  custom_reference_types:
    - Theorem
    - Lemma
    - Proposition
  # Map custom keywords to existing ref_types for full resolution support
  custom_type_mappings:
    Illustration: figure
    Diagram: figure
  # Load custom type names or mappings from a file
  custom_types_file: math_types.txt
  # Add raw regex patterns for an existing type
  custom_patterns:
    section:
      - "\\bPart\\s+(\\d+)"
  # Resolve references to artifact/chunk IDs
  resolve_references: true
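The doubled backslashes in custom_patterns are YAML double-quote escapes, so the entry above decodes to the regex `\bPart\s+(\d+)`. The snippet below tries that regex with Python's `re` module; that stratum compiles patterns with Python-compatible, case-sensitive semantics is an assumption, and the sample sentence is made up.

```python
import re

# YAML "\\bPart\\s+(\\d+)" decodes to the regex \bPart\s+(\d+).
part_re = re.compile(r"\bPart\s+(\d+)")

text = "As argued in Part 2, and again in Part 10, the apartment 3 case differs."
print(part_re.findall(text))  # ['2', '10']
```

The capture group holds the reference number, and "apartment 3" is skipped because the match is case-sensitive.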
Custom Types File¶
Load custom type definitions from a plain-text file:
# math_types.txt
# Lines starting with '#' and blank lines are ignored.
Theorem # creates new "theorem" ref_type
Lemma # creates new "lemma" ref_type
Illustration:figure # maps "Illustration" to existing figure type
Diagram:figure # maps "Diagram" to existing figure type
Lines without a colon create new ref_types. Lines with Name:target_type create alias mappings. File contents are merged with any values provided directly in the YAML.
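The file format is simple enough to parse in a few lines, which can help when checking a types file before a run. This is an illustrative sketch, not stratum's loader; treating trailing `# ...` text as a comment is an assumption based on the example file, and the function name `parse_custom_types` is hypothetical.

```python
def parse_custom_types(text):
    """Split a custom-types file into (new_types, mappings).

    Assumption for this sketch: trailing "# ..." comments are stripped,
    as the example file suggests.
    """
    new_types, mappings = [], {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        if ":" in line:
            name, target = (part.strip() for part in line.split(":", 1))
            mappings[name] = target  # alias for an existing ref_type
        else:
            new_types.append(line)  # new ref_type
    return new_types, mappings


sample = """\
# math_types.txt
Theorem
Lemma

Illustration:figure   # maps to existing figure type
Diagram:figure
"""
print(parse_custom_types(sample))
```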
Reference Resolution¶
When resolve_references: true, detected references are linked to their targets:
| Type | Resolves via | artifact_id |
|---|---|---|
| figure | Caption text in image chunks | image artifact ID |
| table | Caption text in table chunks | table artifact ID |
| section | Section number in heading path | section_<number> |
| appendix | Appendix letter in heading path | appendix_<letter> |
Output Format¶
References are stored in chunk.enrichments.references.intra_document as a list:
{
  "type": "section",
  "raw_text": "Section 2.4.3",
  "normalized": "2.4.3",
  "position": { "start": 4, "end": 17 },
  "confidence": 0.95,
  "is_range": false,
  "resolved_to": {
    "artifact_id": "section_2.4.3",
    "chunk_id": "doc_chunk_005"
  }
}
position.start / position.end are character offsets within the chunk's text field.
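The offsets can be used to slice the matched span back out of the chunk text. The sketch assumes that end is exclusive, which is consistent with the example above (17 − 4 equals the length of "Section 2.4.3"); the chunk text is made up.

```python
# Made-up chunk text in which the reference starts at offset 4.
chunk_text = "See Section 2.4.3 for details."
reference = {"raw_text": "Section 2.4.3", "position": {"start": 4, "end": 17}}


def referenced_span(text, ref):
    """Slice the referenced span out of the chunk text (end treated as exclusive)."""
    pos = ref["position"]
    return text[pos["start"]:pos["end"]]


print(referenced_span(chunk_text, reference))  # Section 2.4.3
```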
PII Detection¶
The pipeline accepts type: pii steps in pipeline YAML, but concrete detection implementations are still under development. For now, the pipeline recognises PII steps and skips them without error, so a step like the one below has no effect yet.
steps:
  - type: pii
    name: regex-pii
    mode: detect
    types:
      - email
      - phone
Supported Formats¶
| Format | Extension(s) | Notes |
|---|---|---|
| PDF | .pdf | — |
| DOCX / DOC | .docx, .doc | — |
| HTML | .html, .htm | — |
| Markdown | .md, .markdown | — |
| Plain text | .txt | — |
| OCR JSON | .json | — |
| Images (OCR) | .png, .jpg, etc. | Requires pip install stratum[ocr] |
Output Format Overview¶
Stratum produces Canonical v1.2 JSON. Each document has:
- A document object with doc_id, source_file, format, title, total_pages, and an enrichments list tracking which enrichers ran.
- A chunks array where each chunk has id, text, heading_path, page_start, page_end, content_flags, artifacts, and enrichments.
- A top-level artifacts object aggregating all image/table paths.
For the complete schema, field reference, and jq examples, see output-format.md.
Examples¶
The examples/ directory contains complete working examples with real documents and enriched output:
- examples/arxiv_paper/: an arXiv paper processed with Gemini 2.0 Flash. Contains pipeline.yaml, output.json, and extracted images/ and tables/.
- examples/thesis_physics/: three physics PhD theses processed with local Qwen3-4B. Contains pipeline.yaml, output/ (one file per thesis), images/, and tables/.
Each example includes the exact pipeline.yaml used and the resulting output.