Stratum
Semantic document chunking for RAG pipelines.
Stratum is a two-phase semantic chunker that splits documents into clean, retrieval-optimized chunks while preserving heading hierarchy, structure, and rich metadata.
Why Stratum?
Chunk quality directly impacts retrieval performance. Stratum is built around two ideas:
- Structure-aware: splits respect heading boundaries, so each chunk carries its full context (section → subsection → content)
- Size-aware: hierarchical fallback splitting (paragraph → sentence → word) ensures no chunk exceeds your maximum size
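To make the size-aware fallback concrete, here is a minimal sketch of paragraph → sentence → word splitting. It is an illustration under simple assumptions (regex-based paragraph and sentence detection, sizes counted in whitespace-separated words), not Stratum's actual implementation. The names `split_with_fallback` and `_split` are hypothetical.

```python
import re


def _paragraphs(text: str) -> list[str]:
    # Paragraphs are separated by one or more blank lines.
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def _sentences(text: str) -> list[str]:
    # Naive sentence boundary: whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _split(text: str, max_words: int, splitters: list) -> list[str]:
    words = text.split()
    if len(words) <= max_words:
        return [text] if text else []
    if not splitters:
        # Last resort: hard split into fixed-size word windows.
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]
    pieces = []
    for piece in splitters[0](text):
        # Oversized pieces fall through to the next, finer splitter.
        pieces.extend(_split(piece.strip(), max_words, splitters[1:]))
    return pieces


def split_with_fallback(text: str, max_words: int) -> list[str]:
    """Split text so that no piece has more than max_words words."""
    return _split(text.strip(), max_words, [_paragraphs, _sentences])
```

This sketch only enforces the upper bound. A production chunker would also merge small neighbouring pieces back toward a target size, which is omitted here for brevity.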
Features
| Feature | Details |
|---|---|
| Multi-format input | PDF, Markdown, TXT, DOCX |
| Heading hierarchy | Full ancestor path on every chunk |
| Configurable sizes | Target, max, and min sizes in words or characters |
| Enrichment system | LLM-powered summaries, context, classification, keywords, topics |
| PII detection | GLiNER2-based entity redaction |
| Canonical output | Versioned JSON schema (v1.2) with structured metadata |
| Pluggable pipeline | Swap parsers, splitters, and enrichers via config or DI |
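The "full ancestor path" idea from the table can be illustrated with a small heading stack. The sketch below assumes Markdown ATX headings (`#` to `######`) and is not taken from Stratum's parsers. `heading_paths` is a hypothetical helper, and it does not skip fenced code blocks, so a `#` comment inside a fence would be misread as a heading.

```python
import re

# Matches ATX headings such as "## Background".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_paths(markdown: str) -> list[tuple[list[str], str]]:
    """Pair every non-blank content line with the headings above it."""
    stack: list[tuple[int, str]] = []  # (level, title) of open ancestors
    out: list[tuple[list[str], str]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        elif line.strip():
            out.append(([title for _, title in stack], line.strip()))
    return out
```

Each content line comes back with its ancestor titles in order, which is the same shape as the `heading_path` field shown in the output format.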
Installation

    # Core
    pip install --index-url https://license:{license}@pypi.pkg.keygen.sh/pleias/simple stratum

Replace `{license}` with your license key.
Quick Start
```bash
# Process a single file
stratum document.pdf -o output.json

# Process a directory
stratum docs/ -o output/

# Custom chunk sizes
stratum document.md --target-size 300 --max-size 500 -o output.json
```
```python
from pathlib import Path

from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig

# Configure the target and maximum chunk sizes.
config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700)
)

# Build the pipeline and process a single document.
pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))

# Print each chunk ID together with its heading path.
for chunk in result.chunks:
    print(chunk.id, "—", " > ".join(chunk.heading_path or []))

# Save the result to output.json.
result.save(Path("output.json"))
```
Output Format
Every run produces a canonical v1.2 JSON document:

    {
      "document": {
        "doc_id": "paper",
        "source_file": "paper.pdf",
        "format": "pdf",
        "total_pages": 12
      },
      "chunks": [
        {
          "id": "paper_chunk_001",
          "text": "Introduction text...",
          "heading_path": ["Introduction", "Background"],
          "page_start": 1,
          "page_end": 2,
          "content_flags": {
            "has_table": false,
            "has_image": true,
            "has_code": false
          }
        }
      ]
    }

See Output Format for the full schema reference.
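As a rough sketch of how this output might be consumed downstream, the snippet below loads a document of the shape shown above with the standard `json` module and flattens each chunk into a record for indexing. The record layout (`id`, `text`, `metadata`) and the `retrieval_records` helper are assumptions made for illustration. They are not part of Stratum or of any particular vector store.

```python
import json

# Sample document copied from the example above.
SAMPLE = """
{
  "document": {"doc_id": "paper", "source_file": "paper.pdf",
               "format": "pdf", "total_pages": 12},
  "chunks": [
    {"id": "paper_chunk_001",
     "text": "Introduction text...",
     "heading_path": ["Introduction", "Background"],
     "page_start": 1,
     "page_end": 2,
     "content_flags": {"has_table": false, "has_image": true, "has_code": false}}
  ]
}
"""


def retrieval_records(doc: dict) -> list[dict]:
    """Flatten chunks into records (hypothetical layout) for indexing."""
    records = []
    for chunk in doc["chunks"]:
        path = " > ".join(chunk.get("heading_path") or [])
        # Prefix the heading path so the embedded text carries its context.
        text = f"{path}\n\n{chunk['text']}" if path else chunk["text"]
        records.append({
            "id": chunk["id"],
            "text": text,
            "metadata": {
                "source_file": doc["document"]["source_file"],
                "page_start": chunk.get("page_start"),
                "page_end": chunk.get("page_end"),
                **chunk.get("content_flags", {}),
            },
        })
    return records


records = retrieval_records(json.loads(SAMPLE))
```

Prefixing the heading path is one possible design choice. Keeping it only in the metadata and embedding the raw chunk text is equally valid, depending on the retriever.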
Documentation
- CLI reference and Python API walkthrough.
- All chunker, parser, and pipeline options.
- How the pipeline components fit together.
- Canonical v1.2 schema with field descriptions.
- Deep-dives into chunking, parsers, enrichment, and PII.
- Connecting Stratum to your RAG stack.