Stratum

Semantic document chunking for RAG pipelines.

Stratum is a two-phase semantic chunker that splits documents into clean, retrieval-optimized chunks while preserving heading hierarchy, structure, and rich metadata.


Why Stratum?

Chunk quality directly impacts retrieval performance. Stratum is built around two ideas:

  • Structure-aware — splits respect heading boundaries so each chunk carries its full context (section → subsection → content)
  • Size-aware — hierarchical fallback splitting (paragraph → sentence → word) ensures no chunk blows past your size limit
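The paragraph → sentence → word fallback described above can be pictured with a small sketch. This is not Stratum's implementation: the function names (`split_to_limit`, `_split`), the regex-based sentence splitter, and the choice to measure size in words are all assumptions made for illustration. It also leaves out any step that packs small pieces back together toward a target size.

```python
import re


def _split(text, max_words, splitters):
    """Recursively split text, moving to a finer splitter only when needed."""
    words = text.split()
    if len(words) <= max_words:
        return [text] if words else []
    if not splitters:
        # Last resort: hard cut on word boundaries.
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]
    out = []
    for piece in splitters[0](text):
        out.extend(_split(piece.strip(), max_words, splitters[1:]))
    return out


def split_to_limit(text: str, max_words: int) -> list[str]:
    """Split text so that no piece exceeds max_words (hypothetical helper)."""
    # Ordered from coarsest to finest: paragraphs first, then sentences.
    splitters = [
        lambda t: [p for p in re.split(r"\n\s*\n", t) if p.strip()],
        lambda t: [s for s in re.split(r"(?<=[.!?])\s+", t) if s.strip()],
    ]
    return _split(text.strip(), max_words, splitters)


chunks = split_to_limit(
    "One two three. Four five six seven.\n\nEight nine.", max_words=3
)
print(chunks)
```

Only the four-word sentence falls through to the word-level cut; pieces that already fit are kept whole at the coarsest boundary available.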

Features

| Feature | Details |
|---|---|
| Multi-format input | PDF, Markdown, TXT, DOCX |
| Heading hierarchy | Full ancestor path on every chunk |
| Configurable sizes | Target, max, and min sizes in words or characters |
| Enrichment system | LLM-powered summaries, context, classification, keywords, topics |
| PII detection | GLiNER2-based entity redaction |
| Canonical output | Versioned JSON schema (v1.2) with structured metadata |
| Pluggable pipeline | Swap parsers, splitters, and enrichers via config or DI |
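
To make "full ancestor path on every chunk" concrete, here is a minimal sketch of how such a path could be derived from Markdown headings with a stack. It is an illustration only, not Stratum's parser: `heading_paths` and the `HEADING` pattern are hypothetical, only `#`-style headings are recognized, text before the first heading is dropped, and `#` lines inside code fences would be misread as headings.

```python
import re

# Matches ATX-style headings such as "## Background" (assumed input style).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_paths(markdown: str) -> list[tuple[list[str], str]]:
    """Return (heading_path, body_text) pairs for each non-empty section."""
    stack: list[tuple[int, str]] = []  # (level, title) of currently open headings
    sections: list[tuple[list[str], list[str]]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes any open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            sections.append(([title for _, title in stack], []))
        elif line.strip() and sections:
            sections[-1][1].append(line.strip())
    return [(path, " ".join(body)) for path, body in sections if body]


doc = "# Introduction\nIntro text.\n## Background\nPrior work.\n# Methods\nSetup."
paths = heading_paths(doc)
for path, body in paths:
    print(" > ".join(path), "|", body)
```

Because the path is rebuilt from the stack at every heading, a chunk taken from "Background" still knows it sits under "Introduction".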

Installation

# Core
pip install --index-url https://license:{license}@pypi.pkg.keygen.sh/pleias/simple stratum
  • Replace "{license}" with your license key.

Quick Start

```bash
# Process a single file
stratum document.pdf -o output.json

# Process a directory
stratum docs/ -o output/

# Custom chunk sizes
stratum document.md --target-size 300 --max-size 500 -o output.json
```

```python
from pathlib import Path

from stratum.pipeline import ChunkingPipeline, PipelineConfig
from stratum.models.config import ChunkerConfig

config = PipelineConfig(
    chunker=ChunkerConfig(target_size=500, max_size=700)
)

pipeline = ChunkingPipeline.create(config)
result = pipeline.process(Path("document.pdf"))

for chunk in result.chunks:
    print(chunk.id, "—", " > ".join(chunk.heading_path or []))

result.save(Path("output.json"))
```


Output Format

Every run produces a canonical v1.2 JSON document:

{
  "document": {
    "doc_id": "paper",
    "source_file": "paper.pdf",
    "format": "pdf",
    "total_pages": 12
  },
  "chunks": [
    {
      "id": "paper_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {
        "has_table": false,
        "has_image": true,
        "has_code": false
      }
    }
  ]
}

See Output Format for the full schema reference.
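
As a rough illustration of consuming this output downstream, the sketch below loads a document shaped like the sample above and filters chunks by their content flags. It relies only on the fields shown in that sample; the helper names `chunks_with_flag` and `breadcrumb` are hypothetical and not part of Stratum. In practice you would read the JSON from the file written by the CLI rather than from an inline string.

```python
import json


def chunks_with_flag(doc: dict, flag: str) -> list[dict]:
    """Return chunks whose content_flags mark `flag` as true."""
    return [
        chunk
        for chunk in doc.get("chunks", [])
        if chunk.get("content_flags", {}).get(flag, False)
    ]


def breadcrumb(chunk: dict) -> str:
    """Join the heading path into a readable 'A > B' string."""
    return " > ".join(chunk.get("heading_path") or [])


# Inline copy of the sample document shown above.
sample = json.loads("""
{
  "document": {"doc_id": "paper", "source_file": "paper.pdf",
               "format": "pdf", "total_pages": 12},
  "chunks": [
    {
      "id": "paper_chunk_001",
      "text": "Introduction text...",
      "heading_path": ["Introduction", "Background"],
      "page_start": 1,
      "page_end": 2,
      "content_flags": {"has_table": false, "has_image": true, "has_code": false}
    }
  ]
}
""")

with_images = chunks_with_flag(sample, "has_image")
for chunk in with_images:
    print(chunk["id"], "|", breadcrumb(chunk))
```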


Documentation