Chunking Component¶
Overview¶
The chunking component (stratum/chunking/) implements a two-phase semantic chunking algorithm, followed by an optimization pass:
- Semantic Segmentation - Split by heading hierarchy
- Size-Aware Splitting - Enforce size limits with intelligent boundaries
- Optimization - Merge undersized chunks and add overlap if configured
Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ SemanticChunker │
│ (chunker.py) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │HeadingTracker│ │TextSplitter │ │ ChunkOptimizer │ │
│ │ │ │ │ │ │ │
│ │ - update() │ │ - split() │ │ - optimize() │ │
│ │ - get_path()│ │ │ │ (fusion, overlap) │ │
│ │ - reset() │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ ▲ ▲ ▲ │
│ │ │ │ │
│ └────────────────┴────────────────────┘ │
│ Protocols (DI) │
└─────────────────────────────────────────────────────────────┘
Components¶
SemanticChunker¶
Main entry point. Orchestrates the chunking process.
from pathlib import Path
from stratum import SemanticChunker, ChunkerConfig
config = ChunkerConfig(target_size=500, max_size=700)
chunker = SemanticChunker(config)
# From Document
result = chunker.chunk(document)
# From file (auto-parses)
result = chunker.chunk_file(Path("document.pdf"))
# From blocks
result = chunker.chunk_blocks(blocks, document_id="my_doc")
Chunking Algorithm:
Phase 1: Semantic Segmentation
├── For each block in document:
│ ├── If heading: update HeadingTracker
│ ├── Assign current heading context to block
│ └── If heading level in split_levels: mark segment boundary
└── Result: segments with heading context
Phase 2: Size-Aware Splitting
├── For each segment:
│ ├── If size <= max_size: keep as-is
│ └── If size > max_size: split hierarchically
│ ├── Try paragraph boundaries
│ ├── Try sentence boundaries
│ └── Fall back to word boundaries
└── Result: size-compliant chunks
Phase 3: Optimization
├── Fusion: merge undersized chunks
├── Overlap: add overlap if configured
└── Result: final chunks
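The Phase 1 loop can be sketched in isolation as follows. This is a minimal illustration rather than stratum's implementation: the `Block` tuple, the `segment_blocks` function, and the `split_levels` default of `(1, 2)` are hypothetical stand-ins chosen for the example.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """Hypothetical block type: heading_level is None for non-heading blocks."""
    text: str
    heading_level: Optional[int] = None


def segment_blocks(blocks, split_levels=(1, 2)):
    """Group blocks into segments; each block carries its heading path."""
    segments = []   # list of segments, each a list of (text, path) pairs
    stack = []      # (level, text) pairs for the current heading ancestry
    for block in blocks:
        is_boundary = False
        if block.heading_level is not None:
            # Pop headings at the same or a deeper level, then push the new one.
            while stack and stack[-1][0] >= block.heading_level:
                stack.pop()
            stack.append((block.heading_level, block.text))
            is_boundary = block.heading_level in split_levels
        if is_boundary or not segments:
            segments.append([])  # start a new segment
        path = [text for _, text in stack]
        segments[-1].append((block.text, path))
    return segments


blocks = [
    Block("Intro", 1), Block("p1"),
    Block("Background", 2), Block("p2"),
    Block("Details", 3), Block("p3"),
]
segments = segment_blocks(blocks)
```

With these inputs the H3 heading "Details" does not open a new segment because level 3 is not in `split_levels`, but it still extends the heading path of the blocks that follow it.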
HeadingTracker¶
Tracks the heading hierarchy as the document is processed.
from stratum.chunking.heading_tracker import HeadingTracker
tracker = HeadingTracker()
# Process headings
tracker.update(level=1, text="Introduction")
tracker.update(level=2, text="Background")
# Get current path
path = tracker.get_path() # ["Introduction", "Background"]
levels = tracker.get_levels() # [1, 2]
# Reset for new document
tracker.reset()
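Behavior:
- Lower level numbers rank higher in the hierarchy (H1 > H2 > H3)
- Updating with a lower or equal level number pops the stack back to that level before the new heading is pushed
- Example: H1 → H2 → H2 replaces the first H2, so the path becomes [H1, second H2]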
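The pop-then-push behavior can be illustrated with a small stack-based class. This is a sketch under the assumption that the tracker keeps one stack of (level, text) pairs; `SketchHeadingTracker` is a hypothetical name and not the class shipped in stratum.

```python
class SketchHeadingTracker:
    """Illustrative stack-based tracker (hypothetical, not stratum's class)."""

    def __init__(self) -> None:
        self._stack: list[tuple[int, str]] = []

    def update(self, level: int, text: str) -> None:
        # Drop every heading at the same or a deeper level before pushing.
        while self._stack and self._stack[-1][0] >= level:
            self._stack.pop()
        self._stack.append((level, text))

    def get_path(self) -> list[str]:
        return [text for _, text in self._stack]

    def get_levels(self) -> list[int]:
        return [level for level, _ in self._stack]

    def reset(self) -> None:
        self._stack.clear()


tracker = SketchHeadingTracker()
tracker.update(1, "Introduction")
tracker.update(2, "Background")
tracker.update(2, "Related Work")    # sibling H2 replaces "Background"
path_after_sibling = tracker.get_path()
tracker.update(1, "Methods")         # new H1 pops everything beneath it
path_after_new_h1 = tracker.get_path()
```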
TextSplitter¶
Hierarchical text splitting with boundary awareness.
from stratum.chunking.text_splitter import TextSplitter
splitter = TextSplitter()
# Split text to max 100 chars
segments = splitter.split(text, max_size=100)
Splitting Hierarchy:
1. Paragraph boundaries (\n\n)
2. Sentence boundaries (.!? followed by space)
3. Word boundaries (spaces)
4. Character boundaries (last resort)
The splitter tries higher-level boundaries first, falling back only when necessary.
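One way to realize this fallback order is a recursive splitter that greedily packs pieces at the current boundary level and descends a level only for pieces that are still too long. The sketch below is an assumption about how such a splitter could work, not the TextSplitter source; `split_hierarchically` and the regular expressions are illustrative choices.

```python
import re

# (split pattern, joiner used when re-packing) from coarsest to finest.
_BOUNDARIES = [
    (r"\n\n+", "\n\n"),          # paragraph boundaries
    (r"(?<=[.!?])\s+", " "),     # sentence boundaries: whitespace after . ! ?
    (r"\s+", " "),               # word boundaries
]


def split_hierarchically(text: str, max_size: int, level: int = 0) -> list[str]:
    """Split text into segments of at most max_size characters."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if level >= len(_BOUNDARIES):
        # Last resort: hard cut at character positions.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    pattern, joiner = _BOUNDARIES[level]
    segments: list[str] = []
    buffer = ""
    for piece in re.split(pattern, text):
        if not piece:
            continue
        candidate = f"{buffer}{joiner}{piece}" if buffer else piece
        if len(candidate) <= max_size:
            buffer = candidate           # keep packing at this level
            continue
        if buffer:
            segments.append(buffer)      # flush what has been packed so far
            buffer = ""
        if len(piece) <= max_size:
            buffer = piece
        else:
            # Piece is too long on its own: try the next finer boundary.
            segments.extend(split_hierarchically(piece, max_size, level + 1))
    if buffer:
        segments.append(buffer)
    return segments


sample = "First sentence here. Second one follows!\n\nSupercalifragilistic"
parts = split_hierarchically(sample, max_size=20)
hard_cut = split_hierarchically("abcdefghij", max_size=4)
```

Here the first paragraph is too long for one segment, so it is divided at the sentence boundary, while the unbroken string in `hard_cut` falls through every level to the character cut.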
ChunkOptimizer¶
Post-processes chunks for quality.
from stratum.chunking.optimizer import ChunkOptimizer
optimizer = ChunkOptimizer()
optimized = optimizer.optimize(chunks, config)
Optimizations:
- Fusion: Merge chunks smaller than min_size with neighbors
- Overlap: Add overlap_size words from previous chunk
- Cleanup: Remove empty chunks, normalize whitespace
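The three optimizations can be sketched on plain strings. This is illustrative only: the functions `fuse_undersized` and `add_overlap` are hypothetical, sizes are assumed to be measured in words, and fusion is assumed to merge into the preceding chunk only when the result stays within `max_size`. The real ChunkOptimizer operates on Chunk objects and may choose neighbors differently.

```python
def word_count(text: str) -> int:
    return len(text.split())


def fuse_undersized(chunks: list[str], min_size: int, max_size: int) -> list[str]:
    """Clean up chunks and merge undersized ones into the previous chunk."""
    fused: list[str] = []
    for chunk in chunks:
        chunk = " ".join(chunk.split())   # cleanup: normalize whitespace
        if not chunk:
            continue                      # cleanup: drop empty chunks
        if fused:
            either_small = min(word_count(chunk), word_count(fused[-1])) < min_size
            fits = word_count(fused[-1]) + word_count(chunk) <= max_size
            if either_small and fits:
                fused[-1] = f"{fused[-1]} {chunk}"
                continue
        fused.append(chunk)
    return fused


def add_overlap(chunks: list[str], overlap_size: int) -> list[str]:
    """Prefix each chunk with the last overlap_size words of its predecessor."""
    if overlap_size <= 0 or not chunks:
        return list(chunks)
    result = [chunks[0]]
    for previous, current in zip(chunks, chunks[1:]):
        tail = " ".join(previous.split()[-overlap_size:])
        result.append(f"{tail} {current}")
    return result


raw = ["alpha beta gamma delta", "", "epsilon", "zeta eta theta iota"]
fused = fuse_undersized(raw, min_size=2, max_size=6)
overlapped = add_overlap(fused, overlap_size=2)
```

Running fusion before overlap, as in this sketch, keeps the overlap words from counting toward the size check that decides whether two chunks may be merged.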
Protocols (DI)¶
All components implement protocols for dependency injection:
from stratum.chunking.protocols import (
TextSplitterProtocol,
HeadingTrackerProtocol,
ChunkOptimizerProtocol,
)
TextSplitterProtocol¶
from typing import Protocol, runtime_checkable

@runtime_checkable
class TextSplitterProtocol(Protocol):
    def split(self, text: str, max_size: int) -> list[str]:
        """
        Split text into segments not exceeding max_size.
        Args:
            text: Text to split
            max_size: Maximum size per segment (characters)
        Returns:
            List of text segments
        """
        ...
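Because the protocol is decorated with `@runtime_checkable`, `isinstance` can confirm that an object exposes the required method without the object inheriting from the protocol. The check covers method presence only, not signatures or return types. The sketch below redefines the protocol locally so it runs without stratum; `FixedWidthSplitter` and `NotASplitter` are hypothetical classes.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TextSplitterProtocol(Protocol):
    # Local copy of the protocol so the snippet is self-contained.
    def split(self, text: str, max_size: int) -> list[str]: ...


class FixedWidthSplitter:
    """Hypothetical splitter; it does not inherit from the protocol."""

    def split(self, text: str, max_size: int) -> list[str]:
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]


class NotASplitter:
    """Has no split() method, so it fails the structural check."""

    def chop(self, text: str) -> list[str]:
        return [text]


conforms = isinstance(FixedWidthSplitter(), TextSplitterProtocol)
rejected = isinstance(NotASplitter(), TextSplitterProtocol)
```

A check like this can be useful as an early guard before injecting a custom component, with a static type checker covering the signature details that the runtime check ignores.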
HeadingTrackerProtocol¶
@runtime_checkable
class HeadingTrackerProtocol(Protocol):
    def update(self, level: int, text: str) -> None:
        """Update with new heading."""
        ...

    def get_path(self) -> list[str]:
        """Get current heading path."""
        ...

    def get_levels(self) -> list[int]:
        """Get heading levels for path."""
        ...

    def reset(self) -> None:
        """Reset to empty state."""
        ...
ChunkOptimizerProtocol¶
@runtime_checkable
class ChunkOptimizerProtocol(Protocol):
    def optimize(self, chunks: list, config) -> list:
        """
        Optimize chunks according to config.
        Args:
            chunks: List of chunks
            config: ChunkerConfig
        Returns:
            Optimized list of chunks
        """
        ...
Dependency Injection¶
Inject custom implementations:
from stratum import SemanticChunker, ChunkerConfig

class CustomSplitter:
    def split(self, text: str, max_size: int) -> list[str]:
        # Custom logic: fixed-width slices, so no segment exceeds max_size
        return [text[i:i + max_size] for i in range(0, len(text), max_size)] or [text]

class CustomTracker:
    # Minimal tracker: records every heading in order and never pops
    def __init__(self):
        self._path = []
        self._levels = []

    def update(self, level: int, text: str) -> None:
        self._path.append(text)
        self._levels.append(level)

    def get_path(self) -> list[str]:
        return self._path.copy()

    def get_levels(self) -> list[int]:
        return self._levels.copy()

    def reset(self) -> None:
        self._path = []
        self._levels = []

# Inject custom components
chunker = SemanticChunker(
    config=ChunkerConfig(max_size=1000),
    text_splitter=CustomSplitter(),
    heading_tracker=CustomTracker(),
)
Special Content Handling¶
Code Blocks¶
When preserve_code=True:
- Code blocks are kept intact (not split)
- May exceed max_size to preserve integrity
- Flagged with has_code=True
Tables¶
When preserve_tables=True:
- Tables are kept intact when under max_size
- Large tables split by row (table_split_strategy=BY_ROW)
- Flagged with has_table=True
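Row-wise splitting can be sketched for a Markdown table as follows. The details are assumptions: `split_table_by_row` is a hypothetical function, sizes are counted in characters, and the header and separator rows are repeated in every part so that each part remains a valid table. How stratum's BY_ROW strategy handles headers is not stated here.

```python
def split_table_by_row(table_md: str, max_size: int) -> list[str]:
    """Split a Markdown table into parts, repeating the header in each part."""
    lines = table_md.strip().splitlines()
    header, rows = lines[:2], lines[2:]   # header row plus separator row
    parts: list[str] = []
    current = list(header)
    for row in rows:
        candidate = current + [row]
        too_big = len("\n".join(candidate)) > max_size
        if too_big and len(current) > len(header):
            # Flush the rows collected so far and start a new part.
            parts.append("\n".join(current))
            current = list(header) + [row]
        else:
            # Either it fits, or this is the first row of a part (kept whole).
            current = candidate
    if len(current) > len(header):
        parts.append("\n".join(current))
    return parts


table = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\n| 5 | 6 |"
table_parts = split_table_by_row(table, max_size=30)
```

In this sketch a single row that exceeds `max_size` together with the header is still emitted whole, since a row is treated as the smallest unit.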
Lists¶
- Lists are kept together when possible
- Split at item boundaries if necessary
- Flagged with has_list=True
ChunkingResult¶
Output from chunk():
result = chunker.chunk(document)
# Access chunks
for chunk in result.chunks:
    print(chunk.id)
    print(chunk.text)
    print(chunk.word_count)
    print(chunk.heading.path)  # Heading ancestry
    print(chunk.metadata.has_code)
# Access statistics
stats = result.statistics
print(f"Total chunks: {stats.total_chunks}")
print(f"Average size: {stats.avg_size}")
print(f"Min/Max: {stats.min_size}/{stats.max_size}")
print(f"Oversized: {stats.oversized_count}")
print(f"Undersized: {stats.undersized_count}")
# Access document info
print(result.document.document_id)
print(result.document.format)
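To make the statistics fields concrete, the sketch below derives the same figures from a list of chunk sizes. It assumes that "oversized" means larger than max_size and "undersized" means smaller than min_size. The `summarize` function is hypothetical and only mirrors the field names shown above.

```python
from statistics import mean


def summarize(sizes: list[int], min_size: int, max_size: int) -> dict:
    """Compute summary figures for a list of chunk sizes."""
    return {
        "total_chunks": len(sizes),
        "avg_size": mean(sizes) if sizes else 0,
        "min_size": min(sizes, default=0),
        "max_size": max(sizes, default=0),
        # Assumed definitions: strictly above or below the configured limits.
        "oversized_count": sum(s > max_size for s in sizes),
        "undersized_count": sum(s < min_size for s in sizes),
    }


summary = summarize([120, 480, 710, 90], min_size=100, max_size=700)
```

A nonzero oversized count is not necessarily an error, since preserved code blocks are allowed to exceed max_size.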
Internal Models¶
Chunk¶
Internal chunk representation during processing:
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    heading: HeadingContext
    page_start: int
    page_end: int
    metadata: ChunkMetadata
    word_count: int
    char_count: int
ChunkMetadata¶
@dataclass
class ChunkMetadata:
    has_table: bool
    has_image: bool
    has_code: bool
    has_formula: bool
    has_list: bool
    image_ids: list[str]  # Paths to extracted image files
    table_ids: list[str]  # Paths to extracted table files (.md)
    categories: list[BlockCategory]
    is_split: bool  # Was this chunk split from larger?
HeadingContext¶
@dataclass
class HeadingContext:
    path: list[str]  # ["Introduction", "Background"]
    levels: list[int]  # [1, 2]

    def to_breadcrumb(self) -> str:
        return " > ".join(self.path)
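As a usage illustration, the breadcrumb can be prepended to a chunk's text so that the heading ancestry travels with the content. The sketch redefines HeadingContext locally so it runs on its own; the `with_breadcrumb` helper is hypothetical and not part of stratum.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingContext:
    # Local copy of the model so the snippet is self-contained.
    path: list[str] = field(default_factory=list)
    levels: list[int] = field(default_factory=list)

    def to_breadcrumb(self) -> str:
        return " > ".join(self.path)


def with_breadcrumb(text: str, heading: HeadingContext) -> str:
    """Prefix text with its heading breadcrumb, if there is one."""
    crumb = heading.to_breadcrumb()
    return f"{crumb}\n\n{text}" if crumb else text


context = HeadingContext(path=["Introduction", "Background"], levels=[1, 2])
labeled = with_breadcrumb("Prior work focused on fixed-size windows.", context)
unlabeled = with_breadcrumb("Text before any heading.", HeadingContext())
```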
Testing¶
Mock implementations for testing:
class RecordingTextSplitter:
    """Records all calls for verification."""

    def __init__(self):
        self.calls = []

    def split(self, text: str, max_size: int) -> list[str]:
        self.calls.append({"text": text, "max_size": max_size})
        return [text[i:i+max_size] for i in range(0, len(text), max_size)]

# Use in tests
splitter = RecordingTextSplitter()
chunker = SemanticChunker(text_splitter=splitter)
chunker.chunk(document)

# Verify calls
assert len(splitter.calls) > 0
See tests/chunking/ for comprehensive test examples.