Chunking Component¶

Overview¶

The chunking component (stratum/chunking/) implements a two-phase semantic chunking algorithm:

Semantic Segmentation - Split by heading hierarchy
Size-Aware Splitting - Enforce size limits with intelligent boundaries

Architecture¶

┌─────────────────────────────────────────────────────────────┐
│                    SemanticChunker                           │
│                   (chunker.py)                               │
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │HeadingTracker│  │TextSplitter │  │  ChunkOptimizer     │  │
│  │             │  │             │  │                     │  │
│  │ - update()  │  │ - split()   │  │ - optimize()        │  │
│  │ - get_path()│  │             │  │   (fusion, overlap) │  │
│  │ - reset()   │  │             │  │                     │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│         ▲                ▲                    ▲              │
│         │                │                    │              │
│         └────────────────┴────────────────────┘              │
│                    Protocols (DI)                            │
└─────────────────────────────────────────────────────────────┘

Components¶

SemanticChunker¶

Main entry point. Orchestrates the chunking process.

from stratum import SemanticChunker, ChunkerConfig

config = ChunkerConfig(target_size=500, max_size=700)
chunker = SemanticChunker(config)

# From Document
result = chunker.chunk(document)

# From file (auto-parses)
result = chunker.chunk_file(Path("document.pdf"))

# From blocks
result = chunker.chunk_blocks(blocks, document_id="my_doc")

Chunking Algorithm:

Phase 1: Semantic Segmentation
├── For each block in document:
│   ├── If heading: update HeadingTracker
│   ├── Assign current heading context to block
│   └── If heading level in split_levels: mark segment boundary
└── Result: segments with heading context

Phase 2: Size-Aware Splitting
├── For each segment:
│   ├── If size <= max_size: keep as-is
│   └── If size > max_size: split hierarchically
│       ├── Try paragraph boundaries
│       ├── Try sentence boundaries
│       └── Fall back to word boundaries
└── Result: size-compliant chunks

Phase 3: Optimization
├── Fusion: merge undersized chunks
├── Overlap: add overlap if configured
└── Result: final chunks

HeadingTracker¶

Tracks heading hierarchy as document is processed.

from stratum.chunking.heading_tracker import HeadingTracker

tracker = HeadingTracker()

# Process headings
tracker.update(level=1, text="Introduction")
tracker.update(level=2, text="Background")

# Get current path
path = tracker.get_path()  # ["Introduction", "Background"]
levels = tracker.get_levels()  # [1, 2]

# Reset for new document
tracker.reset()

Behavior: - Lower level numbers = higher hierarchy (H1 > H2 > H3) - Updating with lower/equal level pops the stack - Example: H1 → H2 → H2 results in path being replaced

TextSplitter¶

Hierarchical text splitting with boundary awareness.

from stratum.chunking.text_splitter import TextSplitter

splitter = TextSplitter()

# Split text to max 100 chars
segments = splitter.split(text, max_size=100)

Splitting Hierarchy: 1. Paragraph boundaries (\n\n) 2. Sentence boundaries (.!? followed by space) 3. Word boundaries (spaces) 4. Character boundaries (last resort)

The splitter tries higher-level boundaries first, falling back only when necessary.

ChunkOptimizer¶

Post-processes chunks for quality.

from stratum.chunking.optimizer import ChunkOptimizer

optimizer = ChunkOptimizer()
optimized = optimizer.optimize(chunks, config)

Optimizations: - Fusion: Merge chunks smaller than min_size with neighbors - Overlap: Add overlap_size words from previous chunk - Cleanup: Remove empty chunks, normalize whitespace

Protocols (DI)¶

All components implement protocols for dependency injection:

from stratum.chunking.protocols import (
    TextSplitterProtocol,
    HeadingTrackerProtocol,
    ChunkOptimizerProtocol,
)

TextSplitterProtocol¶

from typing import Protocol, runtime_checkable

@runtime_checkable
class TextSplitterProtocol(Protocol):
    def split(self, text: str, max_size: int) -> list[str]:
        """
        Split text into segments not exceeding max_size.

        Args:
            text: Text to split
            max_size: Maximum size per segment (characters)

        Returns:
            List of text segments
        """
        ...

HeadingTrackerProtocol¶

@runtime_checkable
class HeadingTrackerProtocol(Protocol):
    def update(self, level: int, text: str) -> None:
        """Update with new heading."""
        ...

    def get_path(self) -> list[str]:
        """Get current heading path."""
        ...

    def get_levels(self) -> list[int]:
        """Get heading levels for path."""
        ...

    def reset(self) -> None:
        """Reset to empty state."""
        ...

ChunkOptimizerProtocol¶

@runtime_checkable
class ChunkOptimizerProtocol(Protocol):
    def optimize(self, chunks: list, config) -> list:
        """
        Optimize chunks according to config.

        Args:
            chunks: List of chunks
            config: ChunkerConfig

        Returns:
            Optimized list of chunks
        """
        ...

Dependency Injection¶

Inject custom implementations:

from stratum import SemanticChunker, ChunkerConfig

class CustomSplitter:
    def split(self, text: str, max_size: int) -> list[str]:
        # Custom logic
        return [text[:max_size], text[max_size:]] if len(text) > max_size else [text]

class CustomTracker:
    def __init__(self):
        self._path = []
        self._levels = []

    def update(self, level: int, text: str) -> None:
        self._path.append(text)
        self._levels.append(level)

    def get_path(self) -> list[str]:
        return self._path.copy()

    def get_levels(self) -> list[int]:
        return self._levels.copy()

    def reset(self) -> None:
        self._path = []
        self._levels = []

# Inject custom components
chunker = SemanticChunker(
    config=ChunkerConfig(max_size=1000),
    text_splitter=CustomSplitter(),
    heading_tracker=CustomTracker(),
)

Special Content Handling¶

Code Blocks¶

When preserve_code=True: - Code blocks are kept intact (not split) - May exceed max_size to preserve integrity - Flagged with has_code=True

Tables¶

When preserve_tables=True: - Tables kept intact when under max_size - Large tables split by row (table_split_strategy=BY_ROW) - Flagged with has_table=True

Lists¶

Lists are kept together when possible
Split at item boundaries if necessary
Flagged with has_list=True

ChunkingResult¶

Output from chunk():

result = chunker.chunk(document)

# Access chunks
for chunk in result.chunks:
    print(chunk.id)
    print(chunk.text)
    print(chunk.word_count)
    print(chunk.heading.path)  # Heading ancestry
    print(chunk.metadata.has_code)

# Access statistics
stats = result.statistics
print(f"Total chunks: {stats.total_chunks}")
print(f"Average size: {stats.avg_size}")
print(f"Min/Max: {stats.min_size}/{stats.max_size}")
print(f"Oversized: {stats.oversized_count}")
print(f"Undersized: {stats.undersized_count}")

# Access document info
print(result.document.document_id)
print(result.document.format)

Internal Models¶

Chunk¶

Internal chunk representation during processing:

@dataclass
class Chunk:
    id: str
    text: str
    heading: HeadingContext
    page_start: int
    page_end: int
    metadata: ChunkMetadata
    word_count: int
    char_count: int

ChunkMetadata¶

@dataclass
class ChunkMetadata:
    has_table: bool
    has_image: bool
    has_code: bool
    has_formula: bool
    has_list: bool
    image_ids: list[str]   # Paths to extracted image files
    table_ids: list[str]   # Paths to extracted table files (.md)
    categories: list[BlockCategory]
    is_split: bool  # Was this chunk split from larger?

HeadingContext¶

@dataclass
class HeadingContext:
    path: list[str]   # ["Introduction", "Background"]
    levels: list[int] # [1, 2]

    def to_breadcrumb(self) -> str:
        return " > ".join(self.path)

Testing¶

Mock implementations for testing:

class RecordingTextSplitter:
    """Records all calls for verification."""

    def __init__(self):
        self.calls = []

    def split(self, text: str, max_size: int) -> list[str]:
        self.calls.append({"text": text, "max_size": max_size})
        return [text[i:i+max_size] for i in range(0, len(text), max_size)]

# Use in tests
splitter = RecordingTextSplitter()
chunker = SemanticChunker(text_splitter=splitter)
chunker.chunk(document)

# Verify calls
assert len(splitter.calls) > 0

See tests/chunking/ for comprehensive test examples.