Architecture
Overview
Stratum is a semantic document chunker for RAG pipelines. It provides:
- Multi-format parsing: PDF, DOCX, HTML, Markdown, TXT, OCR JSON
- Semantic chunking: heading-aware splitting with configurable size limits
- Canonical output format: versioned schema (v1.2) with structured metadata
- Dependency injection: customisable components via protocols
- Modular pipeline: pluggable enrichment steps via --pipeline-config
- Benchmarking: comparison suite vs Docling baselines (see benchmarking.md)
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│                            CLI / API                            │
│                         stratum/cli.py                          │
└───────────────────────────────┬─────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                        ChunkingPipeline                         │
│                  stratum/chunking_pipeline.py                   │
│                                                                 │
│        Orchestrates: Parser → Chunker → Canonical Output        │
│        (Optional: Enrichment → PII via --pipeline-config)       │
└───────────────────────────────┬─────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐      ┌─────────────────┐     ┌─────────────────┐
│ Parsers       │      │ SemanticChunker │     │ CanonicalOutput │
│               │      │                 │     │                 │
│ DoclingParser │      │ HeadingTracker  │     │CanonicalDocument│
│ (PDF, DOCX)   │      │ TextSplitter    │     │ CanonicalChunk  │
│ MarkdownParser│      │ ChunkOptimizer  │     │ DocumentInfo    │
│ (.md)         │      │                 │     │                 │
│ TxtParser     │      │                 │     │                 │
│ (.txt)        │      │                 │     │                 │
└───────┬───────┘      └────────┬────────┘     └────────┬────────┘
        │                       │                       │
        ▼                       ▼                       ▼
┌───────────────┐      ┌─────────────────┐     ┌─────────────────┐
│ Document      │      │ Chunk           │     │ JSON Output     │
│ (ContentBlock)│      │ (with metadata) │     │ (v1 schema)     │
└───────────────┘      └─────────────────┘     └─────────────────┘
Data Flow
  Input File (PDF/MD/TXT/DOCX/HTML/JSON)
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 1. PARSING                              │
│   get_parser(path=path)                 │
│   → Parser.parse_file(path)             │
│   → Document(blocks: list[ContentBlock])│
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 2. CHUNKING                             │
│   SemanticChunker.chunk(document)       │
│                                         │
│   Phase 1: Heading hierarchy assignment │
│   Phase 2: Segment by heading levels    │
│   Phase 3: Split oversized chunks       │
│            (paragraph → sentence → word)│
│   Phase 4: Optimize (fusion, overlap)   │
│                                         │
│   → ChunkingResult(chunks, statistics)  │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 3. CANONICALIZATION                     │
│   Pipeline._to_canonical(result)        │
│   → CanonicalDocument                   │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 4. ENRICHMENT (pluggable enrichers)     │
│   EnrichmentPipeline.enrich(doc)        │
│   → CanonicalDocument (enriched)        │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 5. SERIALIZATION                        │
│   cli.serialize_document(doc)           │
│   → dict (JSON-ready)                   │
└─────────────────────────────────────────┘
                     │
                     ▼
                JSON Output
Package Structure
stratum/
├── __init__.py # Public API exports
├── cli.py # Command-line interface
├── chunking_pipeline.py # ChunkingPipeline composition root
│
├── models/ # Data models
│ ├── block.py # ContentBlock, BlockCategory
│ ├── chunk.py # Chunk, ChunkMetadata, HeadingContext
│ ├── config.py # ChunkerConfig, SizeUnit
│ ├── document.py # Document, DocumentMetadata, DocumentFormat
│ ├── output.py # CanonicalDocument, CanonicalChunk, DocumentInfo
│ └── corpus.py # CorpusIndex (for collections)
│
├── parsers/ # Input parsers
│ ├── base.py # BaseParser (ABC), ParserRegistry
│ ├── registry.py # get_parser(), parse_document()
│ ├── markdown_parser.py # MarkdownParser
│ ├── txt_parser.py # TxtParser (plain text)
│ ├── ocr_json_parser.py # OCRJSONParser
│ └── docling_parser.py # DoclingParser (PDF, DOCX, HTML) — lazy import
│
├── chunking/ # Chunking engine
│ ├── chunker.py # SemanticChunker
│ ├── protocols.py # TextSplitterProtocol, HeadingTrackerProtocol, etc.
│ ├── heading_tracker.py # HeadingTracker
│ ├── text_splitter.py # TextSplitter (hierarchical)
│ ├── optimizer.py # ChunkOptimizer (fusion, overlap)
│ ├── special_content.py # Table/code handlers
│ └── result.py # ChunkingResult, ChunkingStatistics
│
├── export/ # Output formatters
│ └── json_exporter.py # JSONExporter
│
├── pipeline/ # Modular pipeline infrastructure
│ ├── protocols.py # PipelineStep protocol
│ ├── core.py # Pipeline, PipelineBuilder
│ ├── registry.py # StepRegistry
│ ├── config.py # YAML config loading
│ ├── errors.py # Exception hierarchy
│ └── adapters.py # Adapters for existing components
│
├── enrichment/ # Enrichment system
│ ├── base.py # BaseEnrichmentStep (template method)
│ ├── registry.py # EnrichmentRegistry
│ ├── pipeline.py # EnrichmentPipeline
│ ├── llm_client.py # LLM client factory (OpenAI, Anthropic, Google, Local)
│ ├── noop.py # NoOpEnricher (testing/placeholders)
│ ├── reference/ # Intra-document reference detection (rule-based)
│ │ ├── patterns.py # Regex pattern packs (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│ │ ├── detector.py # ReferenceDetector
│ │ ├── block_index.py # BlockIndex (section/appendix/figure/table resolution)
│ │ └── enricher.py # IntraDocumentReferenceEnricher
│ ├── table/ # Table summarization (LLM)
│ ├── image/ # Image description (VLM)
│ ├── doc_summary/ # Document summary (1 LLM call)
│ ├── doc_context/ # Per-chunk contextual descriptions (1 call/chunk)
│ ├── cross_doc_context/ # Context from other documents in corpus (LLM + FAISS)
│ ├── classification/ # Section classification (1 call/chunk)
│ ├── keyword/ # TF-IDF keyword extraction (no LLM)
│ ├── topic/ # Topic discovery from keywords (1 LLM call)
│ ├── llm/ # Multi-provider LLM abstraction
│ │ ├── base.py # LLMClient ABC + LLMResponse dataclass
│ │ ├── factory.py # create_llm() factory
│ │ ├── openai_compatible.py
│ │ ├── anthropic_client.py
│ │ └── google_client.py
│ └── utils/ # Shared utilities (map-reduce, etc.)
│
├── pii/ # PII detection (infrastructure ready)
│ ├── gliner2_detector.py # GLiNER2-based entity detector
│ ├── pipeline.py # PII pipeline step
│ ├── text_index.py # Span-to-chunk mapping
│ ├── types.py # PiiSpan, DEFAULT_LABEL_DESCRIPTIONS
│ └── __init__.py
│
└── utils/ # Utilities
  ├── hashing.py # Deterministic chunk hashing
  └── text.py # Text utilities
The public import path from stratum.pipeline import ChunkingPipeline, PipelineConfig works via module re-exports — the implementation lives in stratum/chunking_pipeline.py.
Core Components
ChunkingPipeline
The composition root that wires all components together. Located in stratum/chunking_pipeline.py, exposed via stratum.pipeline.
from pathlib import Path
from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig
config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700),
    parser_name=None,  # None = auto-detect
    images_dir="output/images",  # extract images (PNG, 300 DPI via PyMuPDF)
    tables_dir="output/tables",  # export tables as .md files
    image_dpi=300,
)
pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))
result.save(Path("output.json"))
When --pipeline-config is provided, the CLI constructs a modular Pipeline from stratum/pipeline/ instead, applying each step (parser, chunker, enrichment, pii) in sequence.
ParserRegistry
Auto-selects a parser by file extension. The registry operates with first-registered-wins semantics.
Registration order:
1. Light parsers (MarkdownParser, OCRJSONParser, TxtParser) register eagerly at import time.
2. DoclingParser is registered lazily — the Docling library is not imported at startup. It is imported on first call to get_parser() for a heavy format (PDF, DOCX, HTML).
3. Custom parsers registered before a lazy load win; use force=True to override any existing entry.
from pathlib import Path
from stratum.parsers import get_parser, ParserRegistry
from stratum.models.document import DocumentFormat
# Auto-detect from the file extension
parser = get_parser(path=Path("doc.pdf"))
# By format
parser = get_parser(format="markdown")
# List formats
formats = ParserRegistry.get_supported_formats()
Note: ParserRegistry.get_for_file() and ParserRegistry.get_for_format() are lower-level calls that do not trigger lazy loading. Use get_parser() or parse_document() for the full lazy-loading behaviour.
| Extension | Parser | Loading |
|---|---|---|
| .md, .markdown | MarkdownParser | Eager |
| .txt | TxtParser | Eager |
| .json | OCRJSONParser | Eager |
| .pdf | DoclingParser | Lazy |
| .docx, .doc | DoclingParser | Lazy |
| .html, .htm | DoclingParser | Lazy |
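The registration rules above (first-registered-wins, force=True overrides, lazy loading only through the high-level lookup) can be illustrated with a toy registry. This is a minimal sketch, not the actual ParserRegistry: the class SketchRegistry, its register_lazy method, and the use of plain strings in place of parser classes are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Optional


class SketchRegistry:
    """Toy registry (hypothetical): file extension -> parser name."""

    def __init__(self) -> None:
        self._parsers: dict[str, str] = {}
        self._lazy_loaders: dict[str, Callable[[], str]] = {}

    def register(self, ext: str, parser: str, force: bool = False) -> bool:
        # First-registered-wins: keep the existing entry unless force=True.
        if ext in self._parsers and not force:
            return False
        self._parsers[ext] = parser
        return True

    def register_lazy(self, ext: str, loader: Callable[[], str]) -> None:
        # Store a loader only; the heavy dependency is not imported yet.
        self._lazy_loaders.setdefault(ext, loader)

    def get_for_file(self, path: Path) -> Optional[str]:
        # Low-level lookup: never triggers a lazy load.
        return self._parsers.get(path.suffix.lower())

    def get_parser(self, path: Path) -> str:
        # High-level lookup: resolve a pending lazy loader on first use.
        ext = path.suffix.lower()
        if ext not in self._parsers and ext in self._lazy_loaders:
            self.register(ext, self._lazy_loaders.pop(ext)())
        if ext not in self._parsers:
            raise ValueError(f"no parser registered for {ext!r}")
        return self._parsers[ext]


registry = SketchRegistry()
registry.register(".md", "MarkdownParser")
registry.register_lazy(".pdf", lambda: "DoclingParser")

before_load = registry.get_for_file(Path("doc.pdf"))   # nothing loaded yet
after_load = registry.get_parser(Path("doc.pdf"))      # triggers the loader
kept = registry.register(".pdf", "MyCustomParser")     # loses: entry exists
forced = registry.register(".pdf", "MyCustomParser", force=True)
```

In this sketch a custom parser registered before the first get_parser() call for .pdf would win, because the lazy loader is consulted only when no entry exists yet.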
DoclingParser
Uses the Docling library for PDF, DOCX, and HTML parsing.
- High-quality PDF text extraction with layout analysis
- Table detection and structure preservation; export as Markdown .md files (tables_dir)
- HD image extraction via PyMuPDF at configurable DPI (images_dir, image_dpi). For PDFs, images are rendered from vector data using PyMuPDF; for non-PDF inputs the image embedded in the Docling result is used as a fallback.
- Image rendering is parallelised with ThreadPoolExecutor; each thread opens its own fitz.Document.
- Image naming: {doc_stem}_img_{NNN}.png; table naming: {doc_stem}_table_{NNN}.md
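The naming patterns can be reproduced with a small helper. This is a sketch only: the function artifact_name is hypothetical, and it assumes that NNN is a three-digit, zero-padded counter, which the patterns suggest but do not state.

```python
from pathlib import Path


def artifact_name(source: Path, kind: str, index: int) -> str:
    """Build an artifact file name following the documented patterns."""
    # Assumption: NNN is a zero-padded, three-digit counter.
    if kind == "img":
        return f"{source.stem}_img_{index:03d}.png"
    if kind == "table":
        return f"{source.stem}_table_{index:03d}.md"
    raise ValueError(f"unknown artifact kind: {kind!r}")


image_file = artifact_name(Path("reports/paper.pdf"), "img", 1)
table_file = artifact_name(Path("reports/paper.pdf"), "table", 12)
```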
DotsOCRParser
Available but not auto-registered. To use it:
from stratum.parsers.base import ParserRegistry
from stratum.parsers.dots_ocr_parser import DotsOCRParser
ParserRegistry.register(DotsOCRParser, force=True)
SemanticChunker
Three-phase chunking algorithm, located in stratum/chunking/chunker.py.
Phase 1 (Semantic Segmentation):
  For each block in document:
    If heading: update HeadingTracker
    Assign current heading context to block
    If heading level in split_levels: mark segment boundary
  → Result: segments with heading context
Phase 2 (Size-Aware Splitting):
  For each segment:
    If size <= max_size: keep as-is
    If size > max_size: split hierarchically
      Try paragraph boundaries (\n\n)
      Try sentence boundaries (.!?)
      Fall back to word boundaries
  → Result: size-compliant chunks
Phase 3 (Optimization):
  Fusion: merge chunks smaller than min_size with neighbours
  Overlap: prepend overlap_size words from previous chunk
  Cleanup: remove empty chunks, normalise whitespace
  → Result: final chunks
Special content handling:
- Code blocks: kept intact when preserve_code=True, may exceed max_size
- Tables: kept intact when preserve_tables=True; split by row only when very large
- Lists: kept together when possible, split at item boundaries if necessary
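The hierarchical fallback in the size-aware splitting phase (paragraph, then sentence, then word boundaries) can be sketched as a short recursive function. This is an illustration under stated assumptions, not the TextSplitter implementation: sizes are counted in whitespace-separated words, the function name split_hierarchical is hypothetical, and small pieces are left as they are (in the design above, fusion would merge them afterwards).

```python
import re


def word_count(text: str) -> int:
    return len(text.split())


def split_hierarchical(text: str, max_size: int) -> list[str]:
    """Split text into pieces of at most max_size words."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if word_count(text) <= max_size:
        return [text.strip()] if text.strip() else []
    # Try paragraph boundaries first, then sentence boundaries.
    for pattern in (r"\n\s*\n", r"(?<=[.!?])\s+"):
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            pieces: list[str] = []
            for part in parts:
                pieces.extend(split_hierarchical(part, max_size))
            return pieces
    # Last resort: cut on word boundaries.
    words = text.split()
    return [" ".join(words[i:i + max_size]) for i in range(0, len(words), max_size)]


sample = (
    "One two three. Four five six seven.\n\n"
    "Eight nine ten eleven twelve thirteen."
)
pieces = split_hierarchical(sample, max_size=4)
```

With this input the first paragraph splits at its sentence boundary, while the second paragraph is a single over-long sentence and falls through to the word-level cut.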
HeadingTracker
Tracks the heading hierarchy as the document is processed.
- Lower level numbers = higher in hierarchy (H1 > H2 > H3)
- Updating with a lower or equal level pops the stack to the correct depth
- Produces heading_path (list of strings) and heading_levels (list of ints)
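The stack behaviour described above can be sketched as follows. The class SketchHeadingTracker is a hypothetical stand-in that borrows the method names listed for HeadingTrackerProtocol; the internal representation (a list of level/text pairs) is an assumption.

```python
class SketchHeadingTracker:
    """Toy heading tracker: keeps the current heading path as a stack."""

    def __init__(self) -> None:
        self._stack: list[tuple[int, str]] = []

    def update(self, level: int, text: str) -> None:
        # A new heading replaces any heading at the same or a deeper level.
        while self._stack and self._stack[-1][0] >= level:
            self._stack.pop()
        self._stack.append((level, text))

    def get_path(self) -> list[str]:
        return [text for _, text in self._stack]

    def get_levels(self) -> list[int]:
        return [level for level, _ in self._stack]

    def reset(self) -> None:
        self._stack.clear()


tracker = SketchHeadingTracker()
tracker.update(1, "Introduction")
tracker.update(2, "Background")
tracker.update(3, "Prior Work")
deep_path = tracker.get_path()

tracker.update(2, "Motivation")  # pops "Prior Work" and "Background"
path_after_sibling = tracker.get_path()
levels_after_sibling = tracker.get_levels()
```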
ChunkOptimizer
Post-processes chunks produced by the splitter:
- Fusion: merges chunks below min_size into a neighbouring chunk
- Overlap: repeats the last overlap_size words from the previous chunk at the start of the current one
- Cleanup: removes empty chunks, normalises whitespace
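A rough sketch of fusion, overlap and cleanup on plain strings is shown below. The function names fuse_small and add_overlap are hypothetical, sizes are counted in words, and the fusion step ignores max_size for brevity, so this is not a description of how ChunkOptimizer is actually written.

```python
def fuse_small(chunks: list[str], min_size: int) -> list[str]:
    """Merge chunks shorter than min_size words into the preceding chunk."""
    fused: list[str] = []
    for chunk in chunks:
        words = chunk.split()
        if not words:
            continue  # cleanup: drop empty chunks
        text = " ".join(words)  # cleanup: normalise whitespace
        previous_is_small = bool(fused) and len(fused[-1].split()) < min_size
        if fused and (len(words) < min_size or previous_is_small):
            fused[-1] = f"{fused[-1]} {text}"
        else:
            fused.append(text)
    return fused


def add_overlap(chunks: list[str], overlap_size: int) -> list[str]:
    """Prefix each chunk with the last overlap_size words of its predecessor."""
    if overlap_size <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        tail = previous.split()[-overlap_size:]
        result.append(" ".join(tail + [current]))
    return result


raw_chunks = ["Alpha beta  gamma delta.", "Tiny", "   ", "Epsilon zeta eta theta."]
fused = fuse_small(raw_chunks, min_size=2)
final = add_overlap(fused, overlap_size=2)
```

The overlap is taken from the un-overlapped predecessor, so repeated text does not accumulate from chunk to chunk.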
Dependency Injection
All three chunking sub-components implement runtime-checkable protocols in stratum/chunking/protocols.py, enabling injection for testing or customisation:
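from stratum.chunking.chunker import SemanticChunker
from stratum.chunking.protocols import (
    TextSplitterProtocol,
    HeadingTrackerProtocol,
    ChunkOptimizerProtocol,
)
from stratum.models.config import ChunkerConfig
class CustomSplitter:
    # Matches TextSplitterProtocol structurally; no base class is required
    def split(self, text: str, max_size: int) -> list[str]:
        return [text]  # custom logic
config = ChunkerConfig(target_size=500, max_size=700)
chunker = SemanticChunker(
    config=config,
    text_splitter=CustomSplitter(),
)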
EnrichmentPipeline
Runs enrichers sequentially on a CanonicalDocument. Located in stratum/enrichment/pipeline.py.
- Each enricher writes to chunk.enrichments[key]
- Document-level metadata is tracked in document.document.enrichments
- Supports fail-fast or continue-on-error modes
- LLM-based enrichers (doc-context, cross-doc-context) run chunk-level calls in parallel via ThreadPoolExecutor (controlled by max_workers)
from stratum.enrichment import EnrichmentRegistry
registry = EnrichmentRegistry.get_global()
enricher = registry.get("doc-summary", {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
})
# doc is an existing CanonicalDocument
enriched_doc = enricher.enrich(doc)
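The two error modes and the parallel chunk-level calls can be sketched with plain dictionaries standing in for CanonicalDocument. Everything named here (run_enrichers, describe_chunks, the "context" key, the fail_fast flag) is hypothetical and only mirrors the behaviour listed above; the real EnrichmentPipeline API may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in: an enricher is a callable from document to document.
Enricher = Callable[[dict], dict]


def run_enrichers(
    doc: dict,
    enrichers: list[tuple[str, Enricher]],
    fail_fast: bool = True,
) -> tuple[dict, list[tuple[str, str]]]:
    """Apply enrichers in order; collect errors when fail_fast is False."""
    errors: list[tuple[str, str]] = []
    for name, enrich in enrichers:
        try:
            doc = enrich(doc)
        except Exception as exc:
            if fail_fast:
                raise  # fail-fast: abort on the first failing enricher
            errors.append((name, str(exc)))  # continue-on-error: record it
    return doc, errors


def describe_chunks(doc: dict, describe: Callable[[str], str], max_workers: int = 4) -> dict:
    """Run one call per chunk in parallel; map() preserves chunk order."""
    texts = [chunk["text"] for chunk in doc["chunks"]]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk, description in zip(doc["chunks"], pool.map(describe, texts)):
            chunk["enrichments"]["context"] = description
    return doc


def broken(doc: dict) -> dict:
    raise RuntimeError("LLM unavailable")


doc = {
    "chunks": [
        {"text": "alpha", "enrichments": {}},
        {"text": "beta gamma", "enrichments": {}},
    ]
}
steps = [
    ("broken", broken),
    ("context", lambda d: describe_chunks(d, str.upper, max_workers=2)),
]
doc, errors = run_enrichers(doc, steps, fail_fast=False)
```

Here str.upper stands in for an LLM call so the example runs without network access.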
BlockIndex
Located in stratum/enrichment/reference/block_index.py. Built during reference resolution to enable O(1) lookup of:
- Figure and table artifacts by caption text
- Sections by section number (extracted from heading_path)
- Appendices by appendix letter (extracted from heading_path)
The index is rebuilt per document during IntraDocumentReferenceEnricher.enrich().
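A dictionary-based sketch shows how section numbers and appendix letters pulled from heading_path give constant-time lookups. The regular expressions, the build_index function, and the choice to map each key to the first chunk that carries it are assumptions for illustration; the actual BlockIndex may extract and store these differently.

```python
import re

# Assumed heading conventions: "2.1 Data" and "Appendix A: Proofs".
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\b")
APPENDIX_RE = re.compile(r"^Appendix\s+([A-Z])\b", re.IGNORECASE)


def build_index(chunks: list[dict]) -> tuple[dict[str, str], dict[str, str]]:
    """Map section numbers and appendix letters to the first matching chunk id."""
    sections: dict[str, str] = {}
    appendices: dict[str, str] = {}
    for chunk in chunks:
        for heading in chunk["heading_path"]:
            section = SECTION_RE.match(heading)
            appendix = APPENDIX_RE.match(heading)
            if section:
                sections.setdefault(section.group(1), chunk["id"])
            elif appendix:
                appendices.setdefault(appendix.group(1).upper(), chunk["id"])
    return sections, appendices


chunks = [
    {"id": "c1", "heading_path": ["1 Introduction"]},
    {"id": "c2", "heading_path": ["2 Methods", "2.1 Data"]},
    {"id": "c3", "heading_path": ["2 Methods", "2.2 Model"]},
    {"id": "c4", "heading_path": ["Appendix A: Proofs"]},
]
sections, appendices = build_index(chunks)
target = sections.get("2.2")  # resolves a reference such as "see Section 2.2"
```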
License / Fingerprint
Stratum verifies STRATUM_LICENSE_KEY and STRATUM_USER_COMPANY_EMAIL at startup. Machine fingerprinting uses a two-tier approach:
- machineid package (primary)
- Pure-Python MAC address + hostname fallback
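The two tiers can be sketched as a try-then-fall-back function. This assumes the primary tier refers to the third-party py-machineid package (imported as machineid, exposing id()); the SHA-256 hashing step and the function names are illustrative choices, not Stratum's actual scheme.

```python
import hashlib
import socket
import uuid


def fallback_identity() -> str:
    """Pure-Python tier: MAC address plus hostname."""
    # Note: uuid.getnode() may return a random value if no MAC can be read.
    return f"{uuid.getnode():012x}-{socket.gethostname()}"


def machine_fingerprint() -> str:
    """Return a hex digest identifying this machine (illustrative only)."""
    try:
        import machineid  # optional dependency; assumed to be py-machineid

        raw = machineid.id()
    except Exception:
        # Package missing or unable to read an id: use the fallback tier.
        raw = fallback_identity()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


fingerprint = machine_fingerprint()
```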
Models
Input Models
| Model | Location | Purpose |
|---|---|---|
| ContentBlock | stratum/models/block.py | Atomic content unit from parser (text, heading, table, code, etc.) |
| BlockCategory | stratum/models/block.py | Enum: TITLE, TEXT, TABLE, CODE, PICTURE, FORMULA, LIST_ITEM, CAPTION, PAGE_HEADER, PAGE_FOOTER, FOOTNOTE, UNKNOWN |
| Document | stratum/models/document.py | Parsed document with blocks and metadata |
| DocumentMetadata | stratum/models/document.py | Source file, format, title, page count |
| DocumentFormat | stratum/models/document.py | Enum: PDF, MARKDOWN, TXT, DOCX, HTML, JSON |
Internal Chunking Models
| Model | Location | Purpose |
|---|---|---|
| Chunk | stratum/models/chunk.py | Internal chunk during processing |
| ChunkMetadata | stratum/models/chunk.py | has_table, has_image, has_code, has_formula, has_list, image_ids, table_ids, categories, is_split |
| HeadingContext | stratum/models/chunk.py | path: list[str], levels: list[int] |
| ChunkingResult | stratum/chunking/result.py | chunks + ChunkingStatistics |
| ChunkingStatistics | stratum/chunking/result.py | total_chunks, avg_size, min_size, max_size, oversized_count, undersized_count |
Output Models (Canonical v1.2)
| Model | Location | Purpose |
|---|---|---|
| CanonicalDocument | stratum/models/output.py | Top-level output container |
| CanonicalChunk | stratum/models/output.py | Single chunk with text, metadata, and enrichments |
| DocumentInfo | stratum/models/output.py | doc_id, source_file, format, title, total_pages, enrichments |
| ContentFlags | stratum/models/output.py | Boolean flags: has_table, has_image, has_code, has_formula, has_list |
| ChunkArtifacts | stratum/models/output.py | images: list[str], tables: list[str] |
Output Format
Canonical v1.2 JSON (abbreviated example):
{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "paper_001",
    "source_file": "paper.pdf",
    "format": "pdf",
    "title": "Research Paper",
    "total_pages": 12,
    "enrichments": [
      {"name": "intra-document-reference", "version": "1.1.0", "timestamp": "…"}
    ]
  },
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {"has_table": false, "has_image": true, "has_code": false, "has_formula": false, "has_list": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []},
      "enrichments": {"references": {"intra_document": [...]}}
    }
  ],
  "artifacts": {"images": ["fig_001.png"], "tables": []}
}
See output-format.md for the complete field reference.
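Downstream consumers only need the standard json module to read this format. The sketch below uses a trimmed sample shaped like the abbreviated example; the second chunk and its values are made up for illustration, and the helper names (load_canonical, chunk_ids_with, breadcrumb) are hypothetical.

```python
import json

# Trimmed, illustrative sample; only a subset of the v1.2 fields is shown.
SAMPLE = """
{
  "schema_version": "v1.2",
  "document": {"doc_id": "paper_001", "source_file": "paper.pdf"},
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "content_flags": {"has_table": false, "has_image": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []}
    },
    {
      "id": "paper_001_chunk_002",
      "text": "Results table...",
      "heading_path": ["Results"],
      "content_flags": {"has_table": true, "has_image": false},
      "artifacts": {"images": [], "tables": ["tab_001.md"]}
    }
  ]
}
"""


def load_canonical(raw: str, expected_version: str = "v1.2") -> dict:
    """Parse canonical JSON and reject unexpected schema versions."""
    doc = json.loads(raw)
    if doc.get("schema_version") != expected_version:
        raise ValueError(f"unexpected schema_version: {doc.get('schema_version')!r}")
    return doc


def chunk_ids_with(doc: dict, flag: str) -> list[str]:
    """Return ids of chunks whose content flag is set."""
    return [c["id"] for c in doc["chunks"] if c["content_flags"].get(flag, False)]


def breadcrumb(chunk: dict) -> str:
    """Join the heading path into a display string."""
    return " > ".join(chunk["heading_path"])


doc = load_canonical(SAMPLE)
table_chunks = chunk_ids_with(doc, "has_table")
first_breadcrumb = breadcrumb(doc["chunks"][0])
```

Checking schema_version up front is a cheap way to fail early if the producer and consumer disagree about the format.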
Extension Points
Custom Parsers
Implement BaseParser and register with ParserRegistry:
from pathlib import Path
from stratum.parsers.base import BaseParser, ParserRegistry
from stratum.models.document import Document, DocumentFormat, DocumentMetadata
from stratum.models.block import ContentBlock, BlockCategory
class MyCustomParser(BaseParser):
    name = "my_parser"
    supported_formats = [DocumentFormat.PDF]
    def parse_file(self, path: Path) -> Document:
        blocks = [
            ContentBlock(
                text="Parsed content",
                category=BlockCategory.TEXT,
                page_number=1,
            )
        ]
        return Document(
            document_id=path.stem,
            metadata=DocumentMetadata(
                source_file=str(path),
                format=DocumentFormat.PDF,
            ),
            blocks=blocks,
        )
# Register; use force=True to override an existing parser for the same format
ParserRegistry.register(MyCustomParser, force=True)
Custom Chunking Components
Implement any of the protocols from stratum/chunking/protocols.py and inject via SemanticChunker.__init__:
- TextSplitterProtocol: split(text: str, max_size: int) -> list[str]
- HeadingTrackerProtocol: update(level, text), get_path(), get_levels(), reset()
- ChunkOptimizerProtocol: optimize(chunks: list, config) -> list
All protocols are decorated with @runtime_checkable, so isinstance() checks work.
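What a runtime-checkable protocol does and does not verify can be seen in a standalone example. SplitterLike below is a hypothetical protocol with the same method name as TextSplitterProtocol; it is defined locally so the snippet runs without Stratum installed.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SplitterLike(Protocol):
    """Hypothetical protocol mirroring the documented split() method."""

    def split(self, text: str, max_size: int) -> list[str]: ...


class WordSplitter:
    # No inheritance: the class matches the protocol structurally.
    def split(self, text: str, max_size: int) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + max_size]) for i in range(0, len(words), max_size)]


class NotASplitter:
    def cut(self, text: str) -> list[str]:
        return [text]


matches = isinstance(WordSplitter(), SplitterLike)
rejected = isinstance(NotASplitter(), SplitterLike)
parts = WordSplitter().split("a b c d e", max_size=2)
```

isinstance() against a runtime-checkable protocol only confirms that the named methods exist; it does not check their signatures or return types, so a static type checker remains useful for custom components.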
Custom Enrichers
Implement the EnrichmentStep protocol and register with EnrichmentRegistry:
from datetime import datetime
from stratum.models.output import CanonicalDocument
class MyEmbeddingEnricher:
    name = "my-embedding"
    version = "1.0.0"
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        for chunk in document.chunks:
            chunk.enrichments["embedding"] = self._embed(chunk.text)
        document.document.enrichments.append({
            "name": self.name,
            "version": self.version,
            "timestamp": datetime.now().isoformat(),
        })
        return document
    def supports_document(self, document: CanonicalDocument) -> bool:
        return True
    def _embed(self, text: str) -> list[float]:
        ...
from stratum.enrichment import EnrichmentRegistry
registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)
The EnrichmentStep protocol:
from typing import Protocol
from stratum.models.output import CanonicalDocument
class EnrichmentStep(Protocol):
    name: str
    version: str
    def enrich(self, document: CanonicalDocument) -> CanonicalDocument: ...
    def supports_document(self, document: CanonicalDocument) -> bool: ...
Related Documentation
- usage.md — CLI and Python API usage, configuration reference, enricher guide
- output-format.md — Canonical v1.2 schema details
- benchmarking.md — Benchmarking suite