Stratum

Semantic document chunking for RAG pipelines.

Stratum is a two-phase semantic chunker that splits documents into clean, retrieval-optimized chunks while preserving heading hierarchy, structure, and rich metadata.

Why Stratum?

Chunk quality directly impacts retrieval performance. Stratum is built around two ideas:

Structure-aware: splits respect heading boundaries, so each chunk carries its full context (section → subsection → content).
Size-aware: hierarchical fallback splitting (paragraph → sentence → word) ensures no chunk exceeds your size limit.
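
The fallback order above (paragraph → sentence → word) can be sketched as a small recursive splitter. This is an illustrative sketch only, not Stratum's implementation: the function name `split_with_fallback`, the `max_words` parameter, and the regular expressions used to find paragraph and sentence boundaries are all assumptions, and merging small pieces back up toward a target size is left out.

```python
import re

# Separator patterns ordered from coarsest to finest (assumed heuristics).
LEVELS = [
    r"\n\s*\n",           # blank line between paragraphs
    r"(?<=[.!?])\s+",     # whitespace after sentence-ending punctuation
]


def split_with_fallback(text: str, max_words: int, level: int = 0) -> list[str]:
    """Split text into pieces of at most max_words words.

    Paragraph boundaries are tried first, then sentence boundaries,
    and finally plain word boundaries as a last resort.
    """
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if level == len(LEVELS):
        # Last resort: cut on word boundaries.
        words = text.split()
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]
    pieces: list[str] = []
    for part in re.split(LEVELS[level], text):
        # Any part that is still too long falls through to the next level.
        pieces.extend(split_with_fallback(part, max_words, level + 1))
    return pieces


sample = "One two three.\n\nAlpha beta gamma delta epsilon zeta. Eta theta."
for piece in split_with_fallback(sample, max_words=4):
    print(repr(piece))
```

With a four-word limit, the first paragraph is kept whole, the second is split into sentences, and the long sentence is cut on word boundaries.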

Features

| Feature | Details |
| --- | --- |
| Multi-format input | PDF, Markdown, TXT, DOCX |
| Heading hierarchy | Full ancestor path on every chunk |
| Configurable sizes | Target, max, and min sizes in words or characters |
| Enrichment system | LLM-powered summaries, context, classification, keywords, topics |
| PII detection | GLiNER2-based entity redaction |
| Canonical output | Versioned JSON schema (v1.2) with structured metadata |
| Pluggable pipeline | Swap parsers, splitters, and enrichers via config or DI |
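
To make the "full ancestor path" idea concrete, here is a minimal sketch of how a heading path might be tracked while walking a Markdown document. It is not Stratum's parser: the helper name `sections_with_paths` is hypothetical, only ATX headings (`#` to `######`) are recognized, and lines inside fenced code blocks are not treated specially.

```python
import re

# Matches ATX headings such as "## Background" (assumed input convention).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def sections_with_paths(markdown: str) -> list[tuple[list[str], str]]:
    """Return (heading_path, body_text) pairs for a Markdown string."""
    stack: list[tuple[int, str]] = []  # (level, title) of currently open headings
    sections: list[tuple[list[str], str]] = []
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            sections.append(([title for _, title in stack], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Headings at the same or a deeper level are not ancestors of the new one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
    flush()
    return sections


doc = "# Introduction\nIntro text.\n## Background\nBackground text.\n# Methods\nMethod text."
for path, text in sections_with_paths(doc):
    print(" > ".join(path), "|", text)
```

Each section's path lists every enclosing heading, so text under "Background" carries `["Introduction", "Background"]` rather than just its nearest heading.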

Installation

# Core
pip install --index-url https://license:{license}@pypi.pkg.keygen.sh/pleias/simple stratum

Replace "{license}" with your license key.

Quick Start

CLI

```bash
# Process a single file
stratum document.pdf -o output.json

# Process a directory
stratum docs/ -o output/

# Custom chunk sizes
stratum document.md --target-size 300 --max-size 500 -o output.json
```

Python

```python
from pathlib import Path

from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig

# Set the target and maximum chunk sizes
config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700)
)

# Build the pipeline and process one document
pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))

# Print each chunk ID alongside its heading path
for chunk in result.chunks:
    print(chunk.id, "—", " > ".join(chunk.heading_path or []))

# Write the result to disk
result.save(Path("output.json"))
```

Output Format

Every run produces a canonical v1.2 JSON document:

{
  "document": {
    "doc_id": "paper",
    "source_file": "paper.pdf",
    "format": "pdf",
    "total_pages": 12
  },
  "chunks": [
    {
      "id": "paper_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {
        "has_table": false,
        "has_image": true,
        "has_code": false
      }
    }
  ]
}

See Output Format for the full schema reference.
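
As a rough illustration of consuming this output downstream, the sketch below loads a document shaped like the example above and selects chunks by content flag. It relies only on the fields shown in that example. The helper name `chunks_with_flag` is hypothetical, and the JSON is inlined so the snippet runs on its own; in practice you would read the file written by Stratum.

```python
import json

# Trimmed copy of the example output, inlined so the sketch is self-contained.
raw = """
{
  "document": {"doc_id": "paper", "source_file": "paper.pdf", "format": "pdf", "total_pages": 12},
  "chunks": [
    {
      "id": "paper_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {"has_table": false, "has_image": true, "has_code": false}
    }
  ]
}
"""

output = json.loads(raw)


def chunks_with_flag(output: dict, flag: str) -> list[dict]:
    """Return the chunks whose content_flags entry for `flag` is true."""
    return [
        chunk
        for chunk in output["chunks"]
        if chunk.get("content_flags", {}).get(flag, False)
    ]


for chunk in chunks_with_flag(output, "has_image"):
    breadcrumb = " > ".join(chunk.get("heading_path") or [])
    print(f'{chunk["id"]} (pages {chunk["page_start"]}-{chunk["page_end"]}): {breadcrumb}')
# prints: paper_chunk_001 (pages 1-2): Introduction > Background
```

The same pattern applies to `has_table` or `has_code`, for example to route table-bearing chunks to a different indexing step.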

Documentation

Usage Guide: CLI reference and Python API walkthrough.
Configuration: All chunker, parser, and pipeline options.
Architecture: How the pipeline components fit together.
Output Format: Canonical v1.2 schema with field descriptions.
Components: Deep-dives into chunking, parsers, enrichment, and PII.
Integration Guide: Connecting Stratum to your RAG stack.