Architecture

Overview

Stratum is a semantic document chunker for RAG pipelines. It provides:

  • Multi-format parsing: PDF, DOCX, HTML, Markdown, TXT, OCR JSON
  • Semantic chunking: heading-aware splitting with configurable size limits
  • Canonical output format: versioned schema (v1.2) with structured metadata
  • Dependency injection: customisable components via protocols
  • Modular pipeline: pluggable enrichment steps via --pipeline-config
  • Benchmarking: comparison suite vs Docling baselines (see benchmarking.md)

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         CLI / API                                │
│                       stratum/cli.py                             │
└───────────────────────────────┬─────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      ChunkingPipeline                            │
│               stratum/chunking_pipeline.py                       │
│                                                                  │
│  Orchestrates: Parser → Chunker → Canonical Output               │
│  (Optional: Enrichment → PII via --pipeline-config)              │
└───────────────────────────────┬─────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Parsers    │     │ SemanticChunker │     │ CanonicalOutput │
│               │     │                 │     │                 │
│ DoclingParser │     │ HeadingTracker  │     │CanonicalDocument│
│  (PDF, DOCX)  │     │ TextSplitter    │     │ CanonicalChunk  │
│ MarkdownParser│     │ ChunkOptimizer  │     │ DocumentInfo    │
│  (.md)        │     │                 │     │                 │
│ TxtParser     │     │                 │     │                 │
│  (.txt)       │     │                 │     │                 │
└───────┬───────┘     └────────┬────────┘     └────────┬────────┘
        │                      │                       │
        ▼                      ▼                       ▼
┌───────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Document    │     │     Chunk       │     │   JSON Output   │
│ (ContentBlock)│     │ (with metadata) │     │  (v1.2 schema)  │
└───────────────┘     └─────────────────┘     └─────────────────┘

Data Flow

Input File (PDF/MD/TXT/DOCX/HTML/JSON)
        │
        ▼
┌─────────────────────────────────────────┐
│ 1. PARSING                               │
│    get_parser(path=path)                 │
│    → Parser.parse_file(path)             │
│    → Document(blocks: list[ContentBlock])│
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 2. CHUNKING                              │
│    SemanticChunker.chunk(document)       │
│                                          │
│    Phase 1: Heading hierarchy assignment │
│    Phase 2: Segment by heading levels    │
│    Phase 3: Split oversized chunks       │
│             (paragraph → sentence → word)│
│    Phase 4: Optimize (fusion, overlap)   │
│                                          │
│    → ChunkingResult(chunks, statistics)  │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 3. CANONICALIZATION                      │
│    Pipeline._to_canonical(result)        │
│    → CanonicalDocument                   │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 4. ENRICHMENT (pluggable enrichers)      │
│    EnrichmentPipeline.enrich(doc)        │
│    → CanonicalDocument (enriched)        │
└─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────┐
│ 5. SERIALIZATION                         │
│    cli.serialize_document(doc)           │
│    → dict (JSON-ready)                   │
└─────────────────────────────────────────┘
        │
        ▼
    JSON Output

Package Structure

stratum/
├── __init__.py              # Public API exports
├── cli.py                   # Command-line interface
├── chunking_pipeline.py     # ChunkingPipeline composition root
│
├── models/                  # Data models
│   ├── block.py             # ContentBlock, BlockCategory
│   ├── chunk.py             # Chunk, ChunkMetadata, HeadingContext
│   ├── config.py            # ChunkerConfig, SizeUnit
│   ├── document.py          # Document, DocumentMetadata, DocumentFormat
│   ├── output.py            # CanonicalDocument, CanonicalChunk, DocumentInfo
│   └── corpus.py            # CorpusIndex (for collections)
│
├── parsers/                 # Input parsers
│   ├── base.py              # BaseParser (ABC), ParserRegistry
│   ├── registry.py          # get_parser(), parse_document()
│   ├── markdown_parser.py   # MarkdownParser
│   ├── txt_parser.py        # TxtParser (plain text)
│   ├── ocr_json_parser.py   # OCRJSONParser
│   └── docling_parser.py    # DoclingParser (PDF, DOCX, HTML) — lazy import
│
├── chunking/                # Chunking engine
│   ├── chunker.py           # SemanticChunker
│   ├── protocols.py         # TextSplitterProtocol, HeadingTrackerProtocol, etc.
│   ├── heading_tracker.py   # HeadingTracker
│   ├── text_splitter.py     # TextSplitter (hierarchical)
│   ├── optimizer.py         # ChunkOptimizer (fusion, overlap)
│   ├── special_content.py   # Table/code handlers
│   └── result.py            # ChunkingResult, ChunkingStatistics
│
├── export/                  # Output formatters
│   └── json_exporter.py     # JSONExporter
│
├── pipeline/                # Modular pipeline infrastructure
│   ├── protocols.py         # PipelineStep protocol
│   ├── core.py              # Pipeline, PipelineBuilder
│   ├── registry.py          # StepRegistry
│   ├── config.py            # YAML config loading
│   ├── errors.py            # Exception hierarchy
│   └── adapters.py          # Adapters for existing components
│
├── enrichment/              # Enrichment system
│   ├── base.py              # BaseEnrichmentStep (template method)
│   ├── registry.py          # EnrichmentRegistry
│   ├── pipeline.py          # EnrichmentPipeline
│   ├── llm_client.py        # LLM client factory (OpenAI, Anthropic, Google, Local)
│   ├── noop.py              # NoOpEnricher (testing/placeholders)
│   ├── reference/           # Intra-document reference detection (rule-based)
│   │   ├── patterns.py      # Regex pattern packs (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│   │   ├── detector.py      # ReferenceDetector
│   │   ├── block_index.py   # BlockIndex (section/appendix/figure/table resolution)
│   │   └── enricher.py      # IntraDocumentReferenceEnricher
│   ├── table/               # Table summarization (LLM)
│   ├── image/               # Image description (VLM)
│   ├── doc_summary/         # Document summary (1 LLM call)
│   ├── doc_context/         # Per-chunk contextual descriptions (1 call/chunk)
│   ├── cross_doc_context/   # Context from other documents in corpus (LLM + FAISS)
│   ├── classification/      # Section classification (1 call/chunk)
│   ├── keyword/             # TF-IDF keyword extraction (no LLM)
│   ├── topic/               # Topic discovery from keywords (1 LLM call)
│   ├── llm/                 # Multi-provider LLM abstraction
│   │   ├── base.py          # LLMClient ABC + LLMResponse dataclass
│   │   ├── factory.py       # create_llm() factory
│   │   ├── openai_compatible.py
│   │   ├── anthropic_client.py
│   │   └── google_client.py
│   └── utils/               # Shared utilities (map-reduce, etc.)
│
├── pii/                     # PII detection (infrastructure ready)
│   ├── gliner2_detector.py  # GLiNER2-based entity detector
│   ├── pipeline.py          # PII pipeline step
│   ├── text_index.py        # Span-to-chunk mapping
│   ├── types.py             # PiiSpan, DEFAULT_LABEL_DESCRIPTIONS
│   └── __init__.py
│
└── utils/                   # Utilities
    ├── hashing.py           # Deterministic chunk hashing
    └── text.py              # Text utilities

The public import statement "from stratum.pipeline import ChunkingPipeline, PipelineConfig" works via module re-exports; the implementation itself lives in stratum/chunking_pipeline.py.


Core Components

ChunkingPipeline

The composition root that wires all components together. Located in stratum/chunking_pipeline.py, exposed via stratum.pipeline.

from pathlib import Path

from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig

config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700),
    parser_name=None,             # None = auto-detect
    images_dir="output/images",   # extract images (PNG, 300 DPI via PyMuPDF)
    tables_dir="output/tables",   # export tables as .md files
    image_dpi=300,
)

pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))
result.save(Path("output.json"))

When --pipeline-config is provided, the CLI constructs a modular Pipeline from stratum/pipeline/ instead, applying each step (parser, chunker, enrichment, pii) in sequence.

ParserRegistry

Auto-selects a parser by file extension. The registry operates with first-registered-wins semantics.

Registration order:

  1. Light parsers (MarkdownParser, OCRJSONParser, TxtParser) register eagerly at import time.
  2. DoclingParser is registered lazily: the Docling library is not imported at startup, only on the first call to get_parser() for a heavy format (PDF, DOCX, HTML).
  3. Custom parsers registered before a lazy load win; use force=True to override any existing entry.

from pathlib import Path

from stratum.parsers import get_parser, ParserRegistry
from stratum.models.document import DocumentFormat

# Auto-detect
parser = get_parser(path=Path("doc.pdf"))

# By format
parser = get_parser(format="markdown")

# List formats
formats = ParserRegistry.get_supported_formats()

Note: ParserRegistry.get_for_file() and ParserRegistry.get_for_format() are lower-level calls that do not trigger lazy loading. Use get_parser() or parse_document() for the full lazy-loading behaviour.

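| Extension      | Parser         | Loading |
|----------------|----------------|---------|
| .md, .markdown | MarkdownParser | Eager   |
| .txt           | TxtParser      | Eager   |
| .json          | OCRJSONParser  | Eager   |
| .pdf           | DoclingParser  | Lazy    |
| .docx, .doc    | DoclingParser  | Lazy    |
| .html, .htm    | DoclingParser  | Lazy    |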
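
The first-registered-wins rule and the force=True override can be illustrated with a small stand-alone registry. This is a sketch only: SketchRegistry and its string-based register() signature are hypothetical and are not Stratum's ParserRegistry, which registers parser classes.

```python
from pathlib import Path


class SketchRegistry:
    """Toy extension-to-parser map with first-registered-wins semantics."""

    def __init__(self) -> None:
        self._by_ext: dict[str, str] = {}

    def register(self, parser_name: str, extensions: list[str], force: bool = False) -> None:
        for ext in extensions:
            key = ext.lower()
            # Keep the existing entry unless the caller explicitly forces an override.
            if force or key not in self._by_ext:
                self._by_ext[key] = parser_name

    def get_for_file(self, path: Path) -> str:
        ext = path.suffix.lower()
        if ext not in self._by_ext:
            raise ValueError(f"No parser registered for {ext!r}")
        return self._by_ext[ext]


registry = SketchRegistry()
registry.register("MarkdownParser", [".md", ".markdown"])
registry.register("CustomPdfParser", [".pdf"])         # registered first, so it wins
registry.register("DoclingParser", [".pdf", ".docx"])  # later (lazy) registration

print(registry.get_for_file(Path("report.pdf")))   # CustomPdfParser
print(registry.get_for_file(Path("notes.docx")))   # DoclingParser

registry.register("DoclingParser", [".pdf"], force=True)
print(registry.get_for_file(Path("report.pdf")))   # DoclingParser
```

The same ordering explains why a custom PDF parser registered before the first PDF lookup is not displaced by the lazily loaded DoclingParser.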

DoclingParser

Uses the Docling library for PDF, DOCX, and HTML parsing.

  • High-quality PDF text extraction with layout analysis
  • Table detection and structure preservation; export as Markdown .md files (tables_dir)
  • HD image extraction via PyMuPDF at configurable DPI (images_dir, image_dpi). For PDFs, images are rendered from vector data using PyMuPDF; for non-PDF inputs the image embedded in the Docling result is used as a fallback.
  • Image rendering is parallelised with ThreadPoolExecutor — each thread opens its own fitz.Document.
  • Image naming: {doc_stem}_img_{NNN}.png; table naming: {doc_stem}_table_{NNN}.md

DotsOCRParser

Available but not auto-registered. To use it:

from stratum.parsers.base import ParserRegistry
from stratum.parsers.dots_ocr_parser import DotsOCRParser

ParserRegistry.register(DotsOCRParser, force=True)

SemanticChunker

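Three-phase chunking algorithm located in stratum/chunking/chunker.py. Phase 1 below combines the heading-assignment and segmentation steps that the Data Flow diagram lists separately.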

Phase 1 — Semantic Segmentation:

For each block in document:
    If heading: update HeadingTracker
    Assign current heading context to block
    If heading level in split_levels: mark segment boundary
→ Result: segments with heading context

Phase 2 — Size-Aware Splitting:

For each segment:
    If size <= max_size: keep as-is
    If size > max_size: split hierarchically
        Try paragraph boundaries (\n\n)
        Try sentence boundaries (.!?)
        Fall back to word boundaries
→ Result: size-compliant chunks
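
A minimal sketch of the paragraph → sentence → word fallback described above. It assumes sizes are counted in whitespace-separated words and uses hypothetical names (split_hierarchical, _SEPARATORS); the actual TextSplitter may measure size differently and handle edge cases this sketch ignores.

```python
import re

# Separators tried in order: paragraph breaks, sentence ends, then any whitespace.
_SEPARATORS = [r"\n\n+", r"(?<=[.!?])\s+", r"\s+"]


def _size(text: str) -> int:
    return len(text.split())


def _split(text: str, max_size: int, level: int) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if _size(text) <= max_size:
        return [text]
    if level >= len(_SEPARATORS) - 1:
        # Word-level fallback: fixed windows of max_size words.
        words = text.split()
        return [" ".join(words[i:i + max_size]) for i in range(0, len(words), max_size)]
    parts = [p for p in re.split(_SEPARATORS[level], text) if p.strip()]
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return _split(text, max_size, level + 1)
    joiner = "\n\n" if level == 0 else " "
    chunks: list[str] = []
    current: list[str] = []
    for part in parts:
        if _size(part) > max_size:
            # Flush what has been packed so far, then split the oversized part further.
            if current:
                chunks.append(joiner.join(current))
                current = []
            chunks.extend(_split(part, max_size, level + 1))
        elif _size(joiner.join(current + [part])) <= max_size:
            current.append(part)
        else:
            chunks.append(joiner.join(current))
            current = [part]
    if current:
        chunks.append(joiner.join(current))
    return chunks


def split_hierarchical(text: str, max_size: int) -> list[str]:
    """Split text into pieces of at most max_size words."""
    return _split(text, max_size, level=0)


text = "One two three. Four five six seven.\n\nEight nine."
for piece in split_hierarchical(text, max_size=4):
    print(repr(piece))
```

Adjacent small parts are packed greedily, so a coarser boundary is always preferred over a finer one when both fit within max_size.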

Phase 3 — Optimization:

Fusion: merge chunks smaller than min_size with neighbours
Overlap: prepend overlap_size words from previous chunk
Cleanup: remove empty chunks, normalise whitespace
→ Result: final chunks

Special content handling:

  • Code blocks: kept intact when preserve_code=True, may exceed max_size
  • Tables: kept intact when preserve_tables=True; split by row only when very large
  • Lists: kept together when possible, split at item boundaries if necessary

HeadingTracker

Tracks the heading hierarchy as the document is processed.

  • Lower level numbers = higher in hierarchy (H1 > H2 > H3)
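  • Updating with a heading whose level number is lower than or equal to the current one (same depth or shallower) pops the stack back to that depth before the new heading is pushed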
  • Produces heading_path (list of strings) and heading_levels (list of ints)
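
The stack behaviour can be sketched as follows. SketchHeadingTracker is a hypothetical stand-in that borrows the method names listed for HeadingTrackerProtocol (update, get_path, get_levels, reset); the internals of the real HeadingTracker are assumed, not quoted.

```python
class SketchHeadingTracker:
    """Keeps the chain of headings that enclose the current position."""

    def __init__(self) -> None:
        self._stack: list[tuple[int, str]] = []

    def update(self, level: int, text: str) -> None:
        # Drop every heading at the same or a deeper level, then push the new one.
        while self._stack and self._stack[-1][0] >= level:
            self._stack.pop()
        self._stack.append((level, text))

    def get_path(self) -> list[str]:
        return [text for _, text in self._stack]

    def get_levels(self) -> list[int]:
        return [level for level, _ in self._stack]

    def reset(self) -> None:
        self._stack.clear()


tracker = SketchHeadingTracker()
tracker.update(1, "Guide")
tracker.update(2, "Install")
tracker.update(3, "Linux")
print(tracker.get_path())    # ['Guide', 'Install', 'Linux']

tracker.update(2, "Usage")   # pops "Linux" and "Install"
print(tracker.get_path())    # ['Guide', 'Usage']
print(tracker.get_levels())  # [1, 2]
```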

ChunkOptimizer

Post-processes chunks produced by the splitter:

  • Fusion: merges chunks below min_size into a neighbouring chunk
  • Overlap: repeats the last overlap_size words from the previous chunk at the start of the current one
  • Cleanup: removes empty chunks, normalises whitespace
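
The three steps can be sketched on plain strings. This is an illustration under assumptions: sizes are counted in words, undersized chunks are merged into the preceding chunk (or the following one when they come first), and the function names are hypothetical. Unlike a production optimizer, the fusion step here does not check that the merged chunk stays within max_size.

```python
def fuse_small(chunks: list[str], min_size: int) -> list[str]:
    """Merge chunks shorter than min_size words into a neighbour."""
    fused: list[str] = []
    for chunk in chunks:
        too_small = len(chunk.split()) < min_size
        prev_too_small = bool(fused) and len(fused[-1].split()) < min_size
        if fused and (too_small or prev_too_small):
            fused[-1] = fused[-1] + " " + chunk
        else:
            fused.append(chunk)
    return fused


def add_overlap(chunks: list[str], overlap_size: int) -> list[str]:
    """Prepend the last overlap_size words of the previous chunk to each chunk."""
    if overlap_size <= 0:
        return list(chunks)
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = prev.split()[-overlap_size:]
        result.append(" ".join(tail + [cur]))
    return result


def cleanup(chunks: list[str]) -> list[str]:
    """Drop empty chunks and collapse runs of whitespace."""
    return [" ".join(c.split()) for c in chunks if c.strip()]


def optimize(chunks: list[str], min_size: int, overlap_size: int) -> list[str]:
    return cleanup(add_overlap(fuse_small(chunks, min_size), overlap_size))


raw = ["Alpha beta gamma delta", "tiny", "", "Epsilon  zeta eta theta"]
for chunk in optimize(raw, min_size=2, overlap_size=2):
    print(chunk)
# Alpha beta gamma delta tiny
# delta tiny Epsilon zeta eta theta
```

The overlap is taken from the previous chunk as it stood before any overlap was added, so repeated words do not accumulate from chunk to chunk.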

Dependency Injection

All three chunking sub-components implement runtime-checkable protocols in stratum/chunking/protocols.py, enabling injection for testing or customisation:

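from stratum.chunking.chunker import SemanticChunker
from stratum.chunking.protocols import (
    TextSplitterProtocol,
    HeadingTrackerProtocol,
    ChunkOptimizerProtocol,
)
from stratum.models.config import ChunkerConfig

class CustomSplitter:
    # Satisfies TextSplitterProtocol structurally; no inheritance needed
    def split(self, text: str, max_size: int) -> list[str]:
        return [text]   # custom logic

# Same settings as the ChunkingPipeline example above
config = ChunkerConfig(target_size=500, max_size=700)

chunker = SemanticChunker(
    config=config,
    text_splitter=CustomSplitter(),
)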

EnrichmentPipeline

Runs enrichers sequentially on a CanonicalDocument. Located in stratum/enrichment/pipeline.py.

  • Each enricher writes to chunk.enrichments[key]
  • Document-level metadata is tracked in document.document.enrichments
  • Supports fail-fast or continue-on-error modes
  • LLM-based enrichers (doc-context, cross-doc-context) run chunk-level calls in parallel via ThreadPoolExecutor (controlled by max_workers)

from stratum.enrichment import EnrichmentRegistry

registry = EnrichmentRegistry.get_global()
enricher = registry.get("doc-summary", {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
})
# doc is an existing CanonicalDocument
enriched_doc = enricher.enrich(doc)
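
The difference between fail-fast and continue-on-error modes can be sketched with a stand-alone runner. Everything here is hypothetical (SketchDocument, run_enrichers, the fail_fast flag name); it only illustrates sequential execution with the two error policies, not the EnrichmentPipeline API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchDocument:
    chunks: list[dict] = field(default_factory=list)
    applied: list[str] = field(default_factory=list)  # names of enrichers that ran


Enricher = Callable[[SketchDocument], SketchDocument]


def run_enrichers(
    doc: SketchDocument,
    enrichers: list[tuple[str, Enricher]],
    fail_fast: bool = True,
) -> tuple[SketchDocument, dict[str, str]]:
    """Apply enrichers in order; collect errors instead of raising when fail_fast is False."""
    errors: dict[str, str] = {}
    for name, enrich in enrichers:
        try:
            doc = enrich(doc)
        except Exception as exc:
            if fail_fast:
                raise
            errors[name] = str(exc)
        else:
            doc.applied.append(name)
    return doc, errors


def add_word_count(doc: SketchDocument) -> SketchDocument:
    for chunk in doc.chunks:
        chunk.setdefault("enrichments", {})["word_count"] = len(chunk["text"].split())
    return doc


def always_fails(doc: SketchDocument) -> SketchDocument:
    raise RuntimeError("LLM unavailable")


steps = [("broken", always_fails), ("word-count", add_word_count)]
doc, errors = run_enrichers(
    SketchDocument(chunks=[{"text": "three short words"}]), steps, fail_fast=False
)
print(doc.applied)  # ['word-count']
print(errors)       # {'broken': 'LLM unavailable'}
```

With fail_fast=True the same call would re-raise the RuntimeError from the first step and the word-count enricher would never run.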

BlockIndex

Located in stratum/enrichment/reference/block_index.py. Built during reference resolution to enable O(1) lookup of:

  • Figure and table artifacts by caption text
  • Sections by section number (extracted from heading_path)
  • Appendices by appendix letter (extracted from heading_path)

The index is rebuilt per document during IntraDocumentReferenceEnricher.enrich().
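
A dictionary-backed index of this kind might look like the sketch below. The regular expressions, the build_index name, and the choice to map each key to the first matching chunk id are all assumptions made for illustration; the real BlockIndex also covers figure and table captions, which are omitted here.

```python
import re

# Assumed heading shapes: "3.2 Sampling" and "Appendix B: Data".
_SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\b")
_APPENDIX_RE = re.compile(r"^Appendix\s+([A-Z])\b", re.IGNORECASE)


def build_index(chunks: list[dict]) -> dict[str, dict[str, str]]:
    """Map section numbers and appendix letters to the first chunk that carries them."""
    index: dict[str, dict[str, str]] = {"section": {}, "appendix": {}}
    for chunk in chunks:
        for heading in chunk["heading_path"]:
            section = _SECTION_RE.match(heading)
            if section:
                index["section"].setdefault(section.group(1), chunk["id"])
            appendix = _APPENDIX_RE.match(heading)
            if appendix:
                index["appendix"].setdefault(appendix.group(1).upper(), chunk["id"])
    return index


chunks = [
    {"id": "c1", "heading_path": ["1 Introduction"]},
    {"id": "c2", "heading_path": ["3 Methods", "3.2 Sampling"]},
    {"id": "c3", "heading_path": ["3 Methods", "3.2 Sampling"]},
    {"id": "c4", "heading_path": ["Appendix B: Data"]},
]
index = build_index(chunks)
print(index["section"]["3.2"])  # c2
print(index["appendix"]["B"])   # c4
```

Once built, resolving a reference such as "see Section 3.2" is a single dictionary lookup, which is where the O(1) behaviour comes from.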

License / Fingerprint

Stratum verifies STRATUM_LICENSE_KEY and STRATUM_USER_COMPANY_EMAIL at startup. Machine fingerprinting uses a 2-tier approach:

  1. machineid package (primary)
  2. Pure-Python MAC address + hostname fallback
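
The two-tier lookup can be sketched as a try/fallback. This assumes the "machineid" package is py-machineid, whose id() function returns the platform machine ID; the SHA-256 hashing and the exact fallback string format are illustrative choices, not Stratum's actual scheme.

```python
import hashlib
import socket
import uuid


def machine_fingerprint() -> str:
    """Return a hex digest that stays stable for this machine within a session."""
    try:
        # Tier 1: optional third-party package (assumed to be py-machineid).
        import machineid

        raw = machineid.id()
    except Exception:
        # Tier 2: pure-Python fallback from the MAC address and hostname.
        raw = f"{uuid.getnode():012x}|{socket.gethostname()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


print(machine_fingerprint())
```

Note that uuid.getnode() can return a random value when no hardware address is found, so the fallback tier is weaker than the primary one.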

Models

Input Models

| Model            | Location                   | Purpose |
|------------------|----------------------------|---------|
| ContentBlock     | stratum/models/block.py    | Atomic content unit from parser (text, heading, table, code, etc.) |
| BlockCategory    | stratum/models/block.py    | Enum: TITLE, TEXT, TABLE, CODE, PICTURE, FORMULA, LIST_ITEM, CAPTION, PAGE_HEADER, PAGE_FOOTER, FOOTNOTE, UNKNOWN |
| Document         | stratum/models/document.py | Parsed document with blocks and metadata |
| DocumentMetadata | stratum/models/document.py | Source file, format, title, page count |
| DocumentFormat   | stratum/models/document.py | Enum: PDF, MARKDOWN, TXT, DOCX, HTML, JSON |

Internal Chunking Models

| Model              | Location                   | Purpose |
|--------------------|----------------------------|---------|
| Chunk              | stratum/models/chunk.py    | Internal chunk during processing |
| ChunkMetadata      | stratum/models/chunk.py    | has_table, has_image, has_code, has_formula, has_list, image_ids, table_ids, categories, is_split |
| HeadingContext     | stratum/models/chunk.py    | path: list[str], levels: list[int] |
| ChunkingResult     | stratum/chunking/result.py | chunks + ChunkingStatistics |
| ChunkingStatistics | stratum/chunking/result.py | total_chunks, avg_size, min_size, max_size, oversized_count, undersized_count |

Output Models (Canonical v1.2)

| Model             | Location                 | Purpose |
|-------------------|--------------------------|---------|
| CanonicalDocument | stratum/models/output.py | Top-level output container |
| CanonicalChunk    | stratum/models/output.py | Single chunk with text, metadata, and enrichments |
| DocumentInfo      | stratum/models/output.py | doc_id, source_file, format, title, total_pages, enrichments |
| ContentFlags      | stratum/models/output.py | Boolean flags: has_table, has_image, has_code, has_formula, has_list |
| ChunkArtifacts    | stratum/models/output.py | images: list[str], tables: list[str] |

Output Format

Canonical v1.2 JSON — abbreviated example:

{
  "schema_version": "v1.2",
  "document": {
    "doc_id": "paper_001",
    "source_file": "paper.pdf",
    "format": "pdf",
    "title": "Research Paper",
    "total_pages": 12,
    "enrichments": [
      {"name": "intra-document-reference", "version": "1.1.0", "timestamp": "…"}
    ]
  },
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {"has_table": false, "has_image": true, "has_code": false, "has_formula": false, "has_list": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []},
      "enrichments": {"references": {"intra_document": [...]}}
    }
  ],
  "artifacts": {"images": ["fig_001.png"], "tables": []}
}

See output-format.md for the complete field reference.
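
On the consuming side, a downstream script might read the output with nothing more than the standard library. The sketch below uses a trimmed, hypothetical sample shaped like the abbreviated example; the helper names (load_chunks, chunks_with_flag, breadcrumb) are illustrative and not part of Stratum.

```python
import json

# Trimmed, hypothetical sample in the shape of the abbreviated v1.2 example.
SAMPLE = """
{
  "schema_version": "v1.2",
  "document": {"doc_id": "paper_001", "source_file": "paper.pdf", "format": "pdf"},
  "chunks": [
    {
      "id": "paper_001_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "content_flags": {"has_table": false, "has_image": true},
      "artifacts": {"images": ["fig_001.png"], "tables": []}
    },
    {
      "id": "paper_001_chunk_002",
      "text": "Results text...",
      "heading_path": ["Results"],
      "content_flags": {"has_table": true, "has_image": false},
      "artifacts": {"images": [], "tables": ["paper_001_table_001.md"]}
    }
  ]
}
"""


def load_chunks(raw: str, expected_version: str = "v1.2") -> list[dict]:
    """Parse the JSON and refuse documents written with a different schema version."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version != expected_version:
        raise ValueError(f"Unsupported schema_version: {version!r}")
    return doc["chunks"]


def chunks_with_flag(chunks: list[dict], flag: str) -> list[str]:
    """Return the ids of chunks whose content_flags entry for flag is true."""
    return [c["id"] for c in chunks if c.get("content_flags", {}).get(flag, False)]


def breadcrumb(chunk: dict) -> str:
    return " > ".join(chunk["heading_path"])


chunks = load_chunks(SAMPLE)
print(chunks_with_flag(chunks, "has_table"))  # ['paper_001_chunk_002']
print(breadcrumb(chunks[0]))                  # Introduction > Background
```

Checking schema_version up front keeps a consumer from silently misreading output produced under a different schema.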


Extension Points

Custom Parsers

Implement BaseParser and register with ParserRegistry:

from pathlib import Path
from stratum.parsers.base import BaseParser, ParserRegistry
from stratum.models.document import Document, DocumentFormat, DocumentMetadata
from stratum.models.block import ContentBlock, BlockCategory

class MyCustomParser(BaseParser):
    name = "my_parser"
    supported_formats = [DocumentFormat.PDF]

    def parse_file(self, path: Path) -> Document:
        blocks = [
            ContentBlock(
                text="Parsed content",
                category=BlockCategory.TEXT,
                page_number=1,
            )
        ]
        return Document(
            document_id=path.stem,
            metadata=DocumentMetadata(
                source_file=str(path),
                format=DocumentFormat.PDF,
            ),
            blocks=blocks,
        )

# Register — use force=True to override an existing parser for the same format
ParserRegistry.register(MyCustomParser, force=True)

Custom Chunking Components

Implement any of the protocols from stratum/chunking/protocols.py and inject via SemanticChunker.__init__:

  • TextSplitterProtocol: split(text: str, max_size: int) -> list[str]
  • HeadingTrackerProtocol: update(level, text), get_path(), get_levels(), reset()
  • ChunkOptimizerProtocol: optimize(chunks: list, config) -> list

All protocols are decorated with @runtime_checkable, so isinstance() checks work.
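
What a runtime-checkable protocol does and does not verify can be seen in a stand-alone example. SplitterLike is a hypothetical protocol that mirrors the split() signature listed above; it is defined locally so the snippet runs without Stratum installed.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SplitterLike(Protocol):
    def split(self, text: str, max_size: int) -> list[str]: ...


class WholeTextSplitter:
    """Matches the protocol structurally; no inheritance required."""

    def split(self, text: str, max_size: int) -> list[str]:
        return [text]


class NotASplitter:
    def chunk(self, text: str) -> list[str]:
        return [text]


print(isinstance(WholeTextSplitter(), SplitterLike))  # True
print(isinstance(NotASplitter(), SplitterLike))       # False
```

The isinstance() check only confirms that a method named split exists. It does not validate the parameter list or return type, so a static type checker is still the better guard against signature mismatches.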

Custom Enrichers

Implement the EnrichmentStep protocol and register with EnrichmentRegistry:

from datetime import datetime
from stratum.models.output import CanonicalDocument

class MyEmbeddingEnricher:
    name = "my-embedding"
    version = "1.0.0"

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        for chunk in document.chunks:
            chunk.enrichments["embedding"] = self._embed(chunk.text)
        document.document.enrichments.append({
            "name": self.name,
            "version": self.version,
            "timestamp": datetime.now().isoformat(),
        })
        return document

    def supports_document(self, document: CanonicalDocument) -> bool:
        return True

    def _embed(self, text: str) -> list[float]:
        ...

from stratum.enrichment import EnrichmentRegistry

registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)

The EnrichmentStep protocol:

from typing import Protocol
from stratum.models.output import CanonicalDocument

class EnrichmentStep(Protocol):
    name: str
    version: str

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument: ...
    def supports_document(self, document: CanonicalDocument) -> bool: ...