Architecture¶

Overview¶

Stratum is a semantic document chunker for RAG pipelines. It provides:

Multi-format parsing: PDF, DOCX, HTML, Markdown, TXT, OCR JSON
Semantic chunking: heading-aware splitting with configurable size limits
Canonical output format: versioned schema (v1.2) with structured metadata
Dependency injection: customisable components via protocols
Modular pipeline: pluggable enrichment steps via --pipeline-config
Benchmarking: comparison suite vs Docling baselines (see benchmarking.md)

High-Level Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                         CLI / API                                │
│                       stratum/cli.py                             │
└───────────────────────────────┬─────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      ChunkingPipeline                            │
│               stratum/chunking_pipeline.py                       │
│                                                                  │
│  Orchestrates: Parser → Chunker → Canonical Output               │
│  (Optional: Enrichment → PII via --pipeline-config)              │
└───────────────────────────────┬─────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Parsers    │     │ SemanticChunker │     │ CanonicalOutput │
│               │     │                 │     │                 │
│ DoclingParser │     │ HeadingTracker  │     │ CanonicalDocument│
│  (PDF, DOCX)  │     │ TextSplitter    │     │ CanonicalChunk  │
│ MarkdownParser│     │ ChunkOptimizer  │     │ DocumentInfo    │
│  (.md)        │     │                 │     │                 │
│ TxtParser     │     │                 │     │                 │
│  (.txt)       │     │                 │     │                 │
└───────┬───────┘     └────────┬────────┘     └────────┬────────┘
        │                      │                       │
        ▼                      ▼                       ▼
┌───────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Document    │     │     Chunk       │     │   JSON Output   │
│ (ContentBlock)│     │ (with metadata) │     │  (v1.2 schema)  │
└───────────────┘     └─────────────────┘     └─────────────────┘

Data Flow¶

Input File (PDF/MD/TXT/DOCX/HTML/JSON)
        │
        ▼
┌─────────────────────────────────────────┐
│ 1. PARSING                               │
│    get_parser(path=path)                 │
│    → Parser.parse_file(path)             │
│    → Document(blocks: list[ContentBlock])│
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 2. CHUNKING                              │
│    SemanticChunker.chunk(document)       │
│                                          │
│    Phase 1: Heading hierarchy assignment │
│    Phase 2: Segment by heading levels    │
│    Phase 3: Split oversized chunks       │
│             (paragraph → sentence → word)│
│    Phase 4: Optimize (fusion, overlap)   │
│                                          │
│    → ChunkingResult(chunks, statistics)  │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 3. CANONICALIZATION                      │
│    Pipeline._to_canonical(result)        │
│    → CanonicalDocument                   │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 4. ENRICHMENT (pluggable enrichers)      │
│    EnrichmentPipeline.enrich(doc)        │
│    → CanonicalDocument (enriched)        │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 5. SERIALIZATION                         │
│    cli.serialize_document(doc)           │
│    → dict (JSON-ready)                   │
└─────────────────────────────────────────┘
        │
        ▼
    JSON Output

Package Structure¶

stratum/
├── __init__.py              # Public API exports
├── cli.py                   # Command-line interface
├── chunking_pipeline.py     # ChunkingPipeline composition root
│
├── models/                  # Data models
│   ├── block.py             # ContentBlock, BlockCategory
│   ├── chunk.py             # Chunk, ChunkMetadata, HeadingContext
│   ├── config.py            # ChunkerConfig, SizeUnit
│   ├── document.py          # Document, DocumentMetadata, DocumentFormat
│   ├── output.py            # CanonicalDocument, CanonicalChunk, DocumentInfo
│   └── corpus.py            # CorpusIndex (for collections)
│
├── parsers/                 # Input parsers
│   ├── base.py              # BaseParser (ABC), ParserRegistry
│   ├── registry.py          # get_parser(), parse_document()
│   ├── markdown_parser.py   # MarkdownParser
│   ├── txt_parser.py        # TxtParser (plain text)
│   ├── ocr_json_parser.py   # OCRJSONParser
│   └── docling_parser.py    # DoclingParser (PDF, DOCX, HTML) — lazy import
│
├── chunking/                # Chunking engine
│   ├── chunker.py           # SemanticChunker
│   ├── protocols.py         # TextSplitterProtocol, HeadingTrackerProtocol, etc.
│   ├── heading_tracker.py   # HeadingTracker
│   ├── text_splitter.py     # TextSplitter (hierarchical)
│   ├── optimizer.py         # ChunkOptimizer (fusion, overlap)
│   ├── special_content.py   # Table/code handlers
│   └── result.py            # ChunkingResult, ChunkingStatistics
│
├── export/                  # Output formatters
│   └── json_exporter.py     # JSONExporter
│
├── pipeline/                # Modular pipeline infrastructure
│   ├── protocols.py         # PipelineStep protocol
│   ├── core.py              # Pipeline, PipelineBuilder
│   ├── registry.py          # StepRegistry
│   ├── config.py            # YAML config loading
│   ├── errors.py            # Exception hierarchy
│   └── adapters.py          # Adapters for existing components
│
├── enrichment/              # Enrichment system
│   ├── base.py              # BaseEnrichmentStep (template method)
│   ├── registry.py          # EnrichmentRegistry
│   ├── pipeline.py          # EnrichmentPipeline
│   ├── llm_client.py        # LLM client factory (OpenAI, Anthropic, Google, Local)
│   ├── noop.py              # NoOpEnricher (testing/placeholders)
│   ├── reference/           # Intra-document reference detection (rule-based)
│   │   ├── patterns.py      # Regex pattern packs (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│   │   ├── detector.py      # ReferenceDetector
│   │   ├── block_index.py   # BlockIndex (section/appendix/figure/table resolution)
│   │   └── enricher.py      # IntraDocumentReferenceEnricher
│   ├── table/               # Table summarization (LLM)
│   ├── image/               # Image description (VLM)
│   ├── doc_summary/         # Document summary (1 LLM call)
│   ├── doc_context/         # Per-chunk contextual descriptions (1 call/chunk)
│   ├── cross_doc_context/   # Context from other documents in corpus (LLM + FAISS)
│   ├── classification/      # Section classification (1 call/chunk)
│   ├── keyword/             # TF-IDF keyword extraction (no LLM)
│   ├── topic/               # Topic discovery from keywords (1 LLM call)
│   ├── llm/                 # Multi-provider LLM abstraction
│   │   ├── base.py          # LLMClient ABC + LLMResponse dataclass
│   │   ├── factory.py       # create_llm() factory
│   │   ├── openai_compatible.py
│   │   ├── anthropic_client.py
│   │   └── google_client.py
│   └── utils/               # Shared utilities (map-reduce, etc.)
│
├── pii/                     # PII detection (infrastructure ready)
│   ├── gliner2_detector.py  # GLiNER2-based entity detector
│   ├── pipeline.py          # PII pipeline step
│   ├── text_index.py        # Span-to-chunk mapping
│   ├── types.py             # PiiSpan, DEFAULT_LABEL_DESCRIPTIONS
│   └── __init__.py
│
└── utils/                   # Utilities
    ├── hashing.py           # Deterministic chunk hashing
    └── text.py              # Text utilities

The public import statement (from stratum.pipeline import ChunkingPipeline, PipelineConfig) works through module re-exports; the implementation itself lives in stratum/chunking_pipeline.py.

Core Components¶

ChunkingPipeline¶

The composition root that wires all components together. Located in stratum/chunking_pipeline.py, exposed via stratum.pipeline.

from pathlib import Path

from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig

config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700),
    parser_name=None,             # None = auto-detect
    images_dir="output/images",   # extract images (PNG, 300 DPI via PyMuPDF)
    tables_dir="output/tables",   # export tables as .md files
    image_dpi=300,
)

pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))
result.save(Path("output.json"))

When --pipeline-config is provided, the CLI constructs a modular Pipeline from stratum/pipeline/ instead, applying each step (parser, chunker, enrichment, pii) in sequence.

ParserRegistry¶

Auto-selects a parser by file extension. The registry operates with first-registered-wins semantics.

Registration order:

1. Light parsers (MarkdownParser, OCRJSONParser, TxtParser) register eagerly at import time.
2. DoclingParser is registered lazily: the Docling library is not imported at startup. It is imported on the first call to get_parser() for a heavy format (PDF, DOCX, HTML).
3. Custom parsers registered before a lazy load win; use force=True to override any existing entry.

from pathlib import Path

from stratum.parsers import get_parser, ParserRegistry
from stratum.models.document import DocumentFormat

# Auto-detect
parser = get_parser(path=Path("doc.pdf"))

# By format
parser = get_parser(format="markdown")

# List formats
formats = ParserRegistry.get_supported_formats()

Note: ParserRegistry.get_for_file() and ParserRegistry.get_for_format() are lower-level calls that do not trigger lazy loading. Use get_parser() or parse_document() for the full lazy-loading behaviour.

Extension	Parser	Loading
`.md`, `.markdown`	MarkdownParser	Eager
`.txt`	TxtParser	Eager
`.json`	OCRJSONParser	Eager
`.pdf`	DoclingParser	Lazy
`.docx`, `.doc`	DoclingParser	Lazy
`.html`, `.htm`	DoclingParser	Lazy
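
The registration rules above (first-registered-wins, force=True override, lazy loading of the heavy parser) can be illustrated with a small standalone sketch. SketchRegistry, HEAVY_EXTENSIONS and the parser names stored as plain strings are hypothetical stand-ins for illustration; this is not Stratum's ParserRegistry implementation.

```python
from pathlib import Path

# Assumption: extensions that would trigger the deferred (lazy) registration.
HEAVY_EXTENSIONS = {".pdf", ".docx", ".html"}


class SketchRegistry:
    """Toy registry mapping file extensions to parser names (hypothetical)."""

    def __init__(self) -> None:
        self._by_ext: dict[str, str] = {}
        self._lazy_loaded = False

    def register(self, name: str, extensions: list[str], force: bool = False) -> None:
        for ext in extensions:
            # First-registered-wins: keep the existing entry unless force=True.
            if force or ext not in self._by_ext:
                self._by_ext[ext] = name

    def _load_heavy_parsers(self) -> None:
        # Stand-in for the deferred import of a heavy dependency.
        if not self._lazy_loaded:
            self._lazy_loaded = True
            self.register("DoclingParser", sorted(HEAVY_EXTENSIONS))

    def get_parser(self, path: Path) -> str:
        ext = path.suffix.lower()
        if ext in HEAVY_EXTENSIONS:
            self._load_heavy_parsers()
        try:
            return self._by_ext[ext]
        except KeyError:
            raise ValueError(f"No parser registered for {ext!r}") from None


registry = SketchRegistry()
registry.register("MarkdownParser", [".md"])
registry.register("MyPdfParser", [".pdf"])      # registered before the lazy load
print(registry.get_parser(Path("doc.pdf")))     # MyPdfParser
print(registry.get_parser(Path("doc.docx")))    # DoclingParser
```

Because the custom entry for .pdf exists before the lazy load runs, the late registration leaves it untouched; only force=True would replace it.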

DoclingParser¶

Uses the Docling library for PDF, DOCX, and HTML parsing.

High-quality PDF text extraction with layout analysis
Table detection and structure preservation; export as Markdown .md files (tables_dir)
HD image extraction via PyMuPDF at configurable DPI (images_dir, image_dpi). For PDFs, images are rendered from vector data using PyMuPDF; for non-PDF inputs the image embedded in the Docling result is used as a fallback.
Image rendering is parallelised with ThreadPoolExecutor — each thread opens its own fitz.Document.
Image naming: {doc_stem}_img_{NNN}.png; table naming: {doc_stem}_table_{NNN}.md
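
The naming scheme and the per-thread rendering pattern can be sketched as follows. This is a minimal illustration under stated assumptions: NNN is taken to be a zero-padded, 1-based counter, and artifact_name, render_one and render_all are hypothetical helpers. The worker only computes the output name; a real worker would open its own fitz.Document at the marked point.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def artifact_name(doc_stem: str, kind: str, index: int) -> str:
    """Build '{doc_stem}_img_{NNN}.png' or '{doc_stem}_table_{NNN}.md'."""
    suffix = {"img": "png", "table": "md"}[kind]
    # Assumption: NNN is a zero-padded, 1-based counter.
    return f"{doc_stem}_{kind}_{index:03d}.{suffix}"


def render_one(job: tuple[Path, int]) -> str:
    source, index = job
    # A real worker would open its own document handle here, so that no
    # handle is shared between threads. This stand-in only builds the name.
    return artifact_name(source.stem, "img", index)


def render_all(source: Path, image_count: int, max_workers: int = 4) -> list[str]:
    jobs = [(source, i) for i in range(1, image_count + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(render_one, jobs))


print(render_all(Path("report.pdf"), 3))
print(artifact_name("report", "table", 12))
```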

DotsOCRParser¶

Available but not auto-registered. To use it:

from stratum.parsers.base import ParserRegistry
from stratum.parsers.dots_ocr_parser import DotsOCRParser

ParserRegistry.register(DotsOCRParser, force=True)

SemanticChunker¶

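Three-phase chunking algorithm located in stratum/chunking/chunker.py. Phase 1 below combines the heading-assignment and segmentation steps that the Data Flow diagram lists separately.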

Phase 1 — Semantic Segmentation:

For each block in document:
    If heading: update HeadingTracker
    Assign current heading context to block
    If heading level in split_levels: mark segment boundary
→ Result: segments with heading context

Phase 2 — Size-Aware Splitting:

For each segment:
    If size <= max_size: keep as-is
    If size > max_size: split hierarchically
        Try paragraph boundaries (\n\n)
        Try sentence boundaries (.!?)
        Fall back to word boundaries
→ Result: size-compliant chunks
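
The hierarchical fallback above can be sketched in a few lines. This is an illustrative version, not the TextSplitter implementation: it assumes sizes are measured in characters (the real unit is configurable), and split_hierarchical and LEVELS are hypothetical names.

```python
import re

# (split pattern, joiner used when packing parts back together)
LEVELS = [
    (r"\n\n+", "\n\n"),        # paragraph boundaries
    (r"(?<=[.!?])\s+", " "),   # sentence boundaries
    (r"\s+", " "),             # word boundaries
]


def split_hierarchical(text: str, max_size: int, level: int = 0) -> list[str]:
    """Split text into pieces of at most max_size characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_size:
        return [text]
    if level >= len(LEVELS):
        # A single token longer than max_size: hard cut as a last resort.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    pattern, joiner = LEVELS[level]
    chunks: list[str] = []
    current = ""
    for part in re.split(pattern, text):
        if not part.strip():
            continue
        candidate = f"{current}{joiner}{part}" if current else part
        if len(candidate) <= max_size:
            current = candidate      # keep packing at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_size:
            current = part
        else:
            # Part is still too large: descend to the next finer boundary.
            chunks.extend(split_hierarchical(part, max_size, level + 1))
    if current:
        chunks.append(current)
    return chunks


sample = "First sentence here. Second one is longer than that.\n\nShort para."
for piece in split_hierarchical(sample, max_size=30):
    print(repr(piece))
```

Packing parts greedily at each level keeps chunks close to max_size instead of emitting one sentence or one word per chunk.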

Phase 3 — Optimization:

Fusion: merge chunks smaller than min_size with neighbours
Overlap: prepend overlap_size words from previous chunk
Cleanup: remove empty chunks, normalise whitespace
→ Result: final chunks

Special content handling:

Code blocks: kept intact when preserve_code=True, may exceed max_size
Tables: kept intact when preserve_tables=True; split by row only when very large
Lists: kept together when possible, split at item boundaries if necessary

HeadingTracker¶

Tracks the heading hierarchy as the document is processed.

Lower level numbers = higher in hierarchy (H1 > H2 > H3)
Updating with a lower or equal level pops the stack to the correct depth
Produces heading_path (list of strings) and heading_levels (list of ints)
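
The stack behaviour described above can be sketched as follows. SketchHeadingTracker is a hypothetical stand-in that borrows the method names listed for HeadingTrackerProtocol (update, get_path, get_levels, reset); the real HeadingTracker may differ in detail.

```python
class SketchHeadingTracker:
    """Minimal heading stack: entries are (level, text) pairs."""

    def __init__(self) -> None:
        self._stack: list[tuple[int, str]] = []

    def update(self, level: int, text: str) -> None:
        # Pop every heading at the same or a deeper level, then push the new one.
        while self._stack and self._stack[-1][0] >= level:
            self._stack.pop()
        self._stack.append((level, text))

    def get_path(self) -> list[str]:
        return [text for _, text in self._stack]

    def get_levels(self) -> list[int]:
        return [level for level, _ in self._stack]

    def reset(self) -> None:
        self._stack.clear()


tracker = SketchHeadingTracker()
tracker.update(1, "Guide")
tracker.update(2, "Install")
tracker.update(3, "Linux")
tracker.update(2, "Usage")       # pops "Linux" and "Install"
print(tracker.get_path())        # ['Guide', 'Usage']
print(tracker.get_levels())      # [1, 2]
```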

ChunkOptimizer¶

Post-processes chunks produced by the splitter:

Fusion: merges chunks below min_size into a neighbouring chunk
Overlap: repeats the last overlap_size words from the previous chunk at the start of the current one
Cleanup: removes empty chunks, normalises whitespace
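
A compact sketch of the three post-processing steps, assuming sizes are counted in words and ignoring max_size during fusion. The function names (cleanup, fuse_small, add_overlap, optimize) are hypothetical and the order of the steps is an assumption; ChunkOptimizer itself may apply them differently.

```python
def cleanup(chunks: list[str]) -> list[str]:
    # Normalise whitespace and drop chunks that end up empty.
    normalised = (" ".join(chunk.split()) for chunk in chunks)
    return [chunk for chunk in normalised if chunk]


def fuse_small(chunks: list[str], min_size: int) -> list[str]:
    """Merge chunks shorter than min_size words into a neighbour."""
    fused: list[str] = []
    for chunk in chunks:
        too_small = len(chunk.split()) < min_size
        prev_too_small = bool(fused) and len(fused[-1].split()) < min_size
        if fused and (too_small or prev_too_small):
            fused[-1] = f"{fused[-1]} {chunk}"
        else:
            fused.append(chunk)
    return fused


def add_overlap(chunks: list[str], overlap_size: int) -> list[str]:
    """Prepend the last overlap_size words of the previous chunk."""
    if overlap_size <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        tail = " ".join(previous.split()[-overlap_size:])
        result.append(f"{tail} {current}")
    return result


def optimize(chunks: list[str], min_size: int, overlap_size: int) -> list[str]:
    return add_overlap(fuse_small(cleanup(chunks), min_size), overlap_size)


raw = ["alpha beta gamma delta", "  ", "tiny", "one two   three four five"]
print(optimize(raw, min_size=2, overlap_size=2))
# ['alpha beta gamma delta tiny', 'delta tiny one two three four five']
```

The overlap is taken from the previous chunk before any overlap was added to it, so repeated words do not accumulate from chunk to chunk.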

Dependency Injection¶

All three chunking sub-components implement runtime-checkable protocols in stratum/chunking/protocols.py, enabling injection for testing or customisation:

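# SemanticChunker import path taken from the package layout above
from stratum.chunking.chunker import SemanticChunker
from stratum.chunking.protocols import (
    TextSplitterProtocol,
    HeadingTrackerProtocol,
    ChunkOptimizerProtocol,
)
from stratum.models.config import ChunkerConfig

class CustomSplitter:
    """Satisfies TextSplitterProtocol structurally; no inheritance needed."""

    def split(self, text: str, max_size: int) -> list[str]:
        return [text]   # custom logic

config = ChunkerConfig(target_size=500, max_size=700)

chunker = SemanticChunker(
    config=config,
    text_splitter=CustomSplitter(),
)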

EnrichmentPipeline¶

Runs enrichers sequentially on a CanonicalDocument. Located in stratum/enrichment/pipeline.py.

Each enricher writes to chunk.enrichments[key]
Document-level metadata is tracked in document.document.enrichments
Supports fail-fast or continue-on-error modes
LLM-based enrichers (doc-context, cross-doc-context) run chunk-level calls in parallel via ThreadPoolExecutor (controlled by max_workers)

from stratum.enrichment import EnrichmentRegistry

registry = EnrichmentRegistry.get_global()
enricher = registry.get("doc-summary", {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
})
enriched_doc = enricher.enrich(doc)
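
The sequential execution, the two error modes and the parallel chunk-level calls can be sketched without any Stratum imports. Everything here is a hypothetical illustration: documents are plain dicts, and run_enrichers, make_chunk_enricher and the "length" key are invented for the example rather than taken from EnrichmentPipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Enricher = Callable[[dict], dict]


def run_enrichers(
    doc: dict, enrichers: dict[str, Enricher], fail_fast: bool = True
) -> tuple[dict, dict[str, str]]:
    """Apply enrichers in order; collect errors when fail_fast is False."""
    errors: dict[str, str] = {}
    for name, enrich in enrichers.items():
        try:
            doc = enrich(doc)
        except Exception as exc:
            if fail_fast:
                raise
            errors[name] = str(exc)   # continue-on-error mode
    return doc, errors


def make_chunk_enricher(
    key: str, describe: Callable[[str], object], max_workers: int = 4
) -> Enricher:
    """Build an enricher that runs one call per chunk in a thread pool."""

    def enrich(doc: dict) -> dict:
        texts = [chunk["text"] for chunk in doc["chunks"]]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            values = list(pool.map(describe, texts))
        for chunk, value in zip(doc["chunks"], values):
            chunk.setdefault("enrichments", {})[key] = value
        return doc

    return enrich


def broken(doc: dict) -> dict:
    raise RuntimeError("boom")


doc = {"chunks": [{"text": "alpha"}, {"text": "beta"}]}
steps = {"length": make_chunk_enricher("length", len), "broken": broken}
enriched, errors = run_enrichers(doc, steps, fail_fast=False)
print(enriched["chunks"][0]["enrichments"])   # {'length': 5}
print(errors)                                 # {'broken': 'boom'}
```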

BlockIndex¶

Located in stratum/enrichment/reference/block_index.py. Built during reference resolution to enable O(1) lookup of:

Figure and table artifacts by caption text
Sections by section number (extracted from heading_path)
Appendices by appendix letter (extracted from heading_path)

The index is rebuilt per document during IntraDocumentReferenceEnricher.enrich().
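
A dictionary-backed index gives the constant-time lookups described above. The sketch below is an assumption-laden illustration: the regular expressions, the build_index function and the choice to keep the first chunk seen for each key are hypothetical, and chunks are plain dicts rather than CanonicalChunk objects.

```python
import re

# Assumed heading conventions: "2.1 Data" and "Appendix A: Proofs".
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\b")
APPENDIX_RE = re.compile(r"^Appendix\s+([A-Z])\b", re.IGNORECASE)


def build_index(chunks: list[dict]) -> dict[str, dict[str, str]]:
    """Map section numbers and appendix letters to the first chunk id seen."""
    index: dict[str, dict[str, str]] = {"section": {}, "appendix": {}}
    for chunk in chunks:
        for heading in chunk["heading_path"]:
            if match := SECTION_RE.match(heading):
                index["section"].setdefault(match.group(1), chunk["id"])
            elif match := APPENDIX_RE.match(heading):
                index["appendix"].setdefault(match.group(1).upper(), chunk["id"])
    return index


chunks = [
    {"id": "c1", "heading_path": ["1 Introduction"]},
    {"id": "c2", "heading_path": ["2 Methods", "2.1 Data"]},
    {"id": "c3", "heading_path": ["2 Methods", "2.2 Model"]},
    {"id": "c4", "heading_path": ["Appendix A: Proofs"]},
]
index = build_index(chunks)
print(index["section"]["2.2"])   # c3
print(index["appendix"]["A"])    # c4
```

Resolving a reference such as "see Section 2.2" then reduces to a single dictionary lookup per reference.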

License / Fingerprint¶

Stratum verifies STRATUM_LICENSE_KEY and STRATUM_USER_COMPANY_EMAIL at startup. Machine fingerprinting uses a 2-tier approach:

machineid package (primary)
Pure-Python MAC address + hostname fallback
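
The two-tier fallback can be sketched with the standard library alone. To avoid guessing at the machineid package's API, the primary tier is passed in as a callable; machine_fingerprint, fallback_fingerprint and the SHA-256 hashing step are assumptions made for this illustration, not Stratum's actual scheme.

```python
import hashlib
import socket
import uuid
from typing import Callable, Optional


def fallback_fingerprint() -> str:
    """Tier 2: hash of the MAC address and hostname (pure Python)."""
    # uuid.getnode() returns the hardware address as a 48-bit integer; if none
    # can be found it returns a random value, which is stable only per process.
    raw = f"{uuid.getnode():012x}|{socket.gethostname()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def machine_fingerprint(primary: Optional[Callable[[], str]] = None) -> str:
    """Try the primary source first, then fall back to MAC + hostname."""
    if primary is not None:
        try:
            value = primary()
            if value:
                return hashlib.sha256(value.encode("utf-8")).hexdigest()
        except Exception:
            pass  # primary tier unavailable: fall through to tier 2
    return fallback_fingerprint()


def unavailable() -> str:
    raise OSError("machine id not readable")


print(machine_fingerprint(lambda: "example-machine-id"))
print(machine_fingerprint(unavailable) == fallback_fingerprint())   # True
```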

Models¶

Input Models¶

Model	Location	Purpose
`ContentBlock`	`stratum/models/block.py`	Atomic content unit from parser (text, heading, table, code, etc.)
`BlockCategory`	`stratum/models/block.py`	Enum: TITLE, TEXT, TABLE, CODE, PICTURE, FORMULA, LIST_ITEM, CAPTION, PAGE_HEADER, PAGE_FOOTER, FOOTNOTE, UNKNOWN
`Document`	`stratum/models/document.py`	Parsed document with blocks and metadata
`DocumentMetadata`	`stratum/models/document.py`	Source file, format, title, page count
`DocumentFormat`	`stratum/models/document.py`	Enum: PDF, MARKDOWN, TXT, DOCX, HTML, JSON

Internal Chunking Models¶

Model	Location	Purpose
`Chunk`	`stratum/models/chunk.py`	Internal chunk during processing
`ChunkMetadata`	`stratum/models/chunk.py`	has_table, has_image, has_code, has_formula, has_list, image_ids, table_ids, categories, is_split
`HeadingContext`	`stratum/models/chunk.py`	`path: list[str]`, `levels: list[int]`
`ChunkingResult`	`stratum/chunking/result.py`	chunks + ChunkingStatistics
`ChunkingStatistics`	`stratum/chunking/result.py`	total_chunks, avg_size, min_size, max_size, oversized_count, undersized_count

Output Models (Canonical v1.2)¶

Model	Location	Purpose
`CanonicalDocument`	`stratum/models/output.py`	Top-level output container
`CanonicalChunk`	`stratum/models/output.py`	Single chunk with text, metadata, and enrichments
`DocumentInfo`	`stratum/models/output.py`	doc_id, source_file, format, title, total_pages, enrichments
`ContentFlags`	`stratum/models/output.py`	Boolean flags: has_table, has_image, has_code, has_formula, has_list
`ChunkArtifacts`	`stratum/models/output.py`	images: list[str], tables: list[str]

Output Format¶

Canonical v1.2 JSON — abbreviated example:

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "paper_001",
    "source_file": "paper.pdf",
    "format": "pdf",
    "title": "Research Paper",
    "total_pages": 12,
    "enrichments": [
      {"name": "intra-document-reference", "version": "1.1.0", "timestamp": "…"}
    ]
  },
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {"has_table": false, "has_image": true, "has_code": false, "has_formula": false, "has_list": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []},
      "enrichments": {"references": {"intra_document": [...]}}
    }
  ],
  "artifacts": {"images": ["fig_001.png"], "tables": []}
}

See output-format.md for the complete field reference.
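
As a usage illustration, the sketch below reads a document shaped like the abbreviated example and selects chunks by content flag. It relies only on the fields shown above; SAMPLE is a trimmed copy of that example and chunks_with_flag is a hypothetical helper, so the full schema may contain fields it ignores.

```python
import json

SAMPLE = """
{
  "schema_version": "v1.2",
  "document": {"doc_id": "paper_001", "source_file": "paper.pdf", "format": "pdf"},
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "content_flags": {"has_table": false, "has_image": true, "has_list": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []}
    }
  ]
}
"""


def chunks_with_flag(doc: dict, flag: str) -> list[str]:
    """Return the ids of chunks whose content_flags[flag] is true."""
    return [
        chunk["id"]
        for chunk in doc["chunks"]
        if chunk.get("content_flags", {}).get(flag, False)
    ]


doc = json.loads(SAMPLE)
print(doc["schema_version"])                 # v1.2
print(chunks_with_flag(doc, "has_image"))    # ['paper_001_chunk_001']
print(chunks_with_flag(doc, "has_table"))    # []
```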

Extension Points¶

Custom Parsers¶

Implement BaseParser and register with ParserRegistry:

from pathlib import Path
from stratum.parsers.base import BaseParser, ParserRegistry
from stratum.models.document import Document, DocumentFormat, DocumentMetadata
from stratum.models.block import ContentBlock, BlockCategory

class MyCustomParser(BaseParser):
    name = "my_parser"
    supported_formats = [DocumentFormat.PDF]

    def parse_file(self, path: Path) -> Document:
        blocks = [
            ContentBlock(
                text="Parsed content",
                category=BlockCategory.TEXT,
                page_number=1,
            )
        ]
        return Document(
            document_id=path.stem,
            metadata=DocumentMetadata(
                source_file=str(path),
                format=DocumentFormat.PDF,
            ),
            blocks=blocks,
        )

# Register — use force=True to override an existing parser for the same format
ParserRegistry.register(MyCustomParser, force=True)

Custom Chunking Components¶

Implement any of the protocols from stratum/chunking/protocols.py and inject via SemanticChunker.__init__:

TextSplitterProtocol — split(text: str, max_size: int) -> list[str]
HeadingTrackerProtocol — update(level, text), get_path(), get_levels(), reset()
ChunkOptimizerProtocol — optimize(chunks: list, config) -> list

All protocols are decorated with @runtime_checkable, so isinstance() checks work.
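
What an isinstance() check against a runtime-checkable protocol does, and does not, verify can be seen in a standalone sketch. SplitterProtocol below is a hypothetical mirror of the split() signature listed above, not an import from stratum.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SplitterProtocol(Protocol):
    def split(self, text: str, max_size: int) -> list[str]: ...


class FixedWidthSplitter:
    """Conforms structurally: it has a split() method, no inheritance needed."""

    def split(self, text: str, max_size: int) -> list[str]:
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]


class NotASplitter:
    def chunk(self, text: str) -> list[str]:
        return [text]


print(isinstance(FixedWidthSplitter(), SplitterProtocol))   # True
print(isinstance(NotASplitter(), SplitterProtocol))         # False
print(FixedWidthSplitter().split("abcdefg", 3))             # ['abc', 'def', 'g']
```

The runtime check only confirms that the required methods exist; it does not validate their signatures or return types, so a static type checker is still useful for custom components.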

Custom Enrichers¶

Implement the EnrichmentStep protocol and register with EnrichmentRegistry:

from datetime import datetime
from stratum.models.output import CanonicalDocument

class MyEmbeddingEnricher:
    name = "my-embedding"
    version = "1.0.0"

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        for chunk in document.chunks:
            chunk.enrichments["embedding"] = self._embed(chunk.text)
        document.document.enrichments.append({
            "name": self.name,
            "version": self.version,
            "timestamp": datetime.now().isoformat(),
        })
        return document

    def supports_document(self, document: CanonicalDocument) -> bool:
        return True

    def _embed(self, text: str) -> list[float]:
        ...

from stratum.enrichment import EnrichmentRegistry

registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)

The EnrichmentStep protocol:

from typing import Protocol
from stratum.models.output import CanonicalDocument

class EnrichmentStep(Protocol):
    name: str
    version: str

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument: ...
    def supports_document(self, document: CanonicalDocument) -> bool: ...

See also:

usage.md: CLI and Python API usage, configuration reference, enricher guide
output-format.md: Canonical v1.2 schema details
benchmarking.md: Benchmarking suite