Benchmarking¶
Benchmarking suite for comparing Stratum Chunker against Docling baselines.
Goal¶
Prove that Stratum Chunker delivers value: 1. Don't break: Stratum chunking + naive summarization >= Docling out-of-box 2. Smart is smart: DocSummaryEnricher + DocContextEnricher measurably improves quality 3. LLM tiers: Works with Gemma/Gemini (lower bound) and Claude/GPT-4o (upper bound) 4. Value: Saves data scientist time without breaking anything
Quick Start¶
# Quick smoke test (cheapest — gemini-flash only, 20 chunks)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier lower --max-chunks 20
# Full run (all tiers, all chunkers, judge evaluation)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf
# Structural metrics only (no LLM judge, fastest)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --skip-judge
# Upper tier only (Claude Sonnet)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier upper
Requires: GEMINI_API_KEY (or GOOGLE_API_KEY) for lower tier, ANTHROPIC_API_KEY for upper tier.
Architecture¶
benchmarks/
├── configs.py # BenchmarkConfig, LLMTier, ChunkerVariant
├── chunkers.py # SharedParse (parse once) + 3 chunker adapters
├── summarizers.py # NaiveSummarizer (control) + SmartEnricher
├── evaluators.py # ChunkingMetrics, SummarizationJudge, CostMetrics
├── report.py # Markdown tables + JSON output
└── run_benchmark.py # CLI entry point
Key design: PDF is parsed once with Docling and shared across all chunkers via SharedParse. This avoids re-parsing (~13s per parse) and ensures identical input.
Experiments¶
Experiment 1: Chunking + Naive Summarization¶
Each chunker gets the identical prompt — the control variable:
"Summarize the following text in 1-2 sentences. Output only the summary."
| Chunker | Method |
|---|---|
| Docling Hierarchical | HierarchicalChunker().chunk(docling_doc) |
| Docling Hybrid (512 tokens) | HybridChunker(tokenizer=cl100k_base, max_tokens=512) |
| Stratum (default) | SemanticChunker(target_size=500, max_size=700) |
Experiment 2: Smart Enrichment (Stratum only)¶
DocSummaryEnricher(1 LLM call) thenDocContextEnricher(N parallel calls)- Compare smart
doc_contextvs naive per-chunk summary
Metrics¶
| Category | Metrics |
|---|---|
| Chunking structure | chunk count, avg/std size, heading preservation % |
| Summary quality | coherence, faithfulness, completeness (1-5, LLM-as-judge) |
| Cost | LLM calls, prompt/completion tokens, estimated USD |
| Speed | parse time, chunk time, summarization time |
CLI Options¶
| Flag | Default | Description |
|---|---|---|
--tier |
all |
lower (gemini-flash), upper (claude-sonnet), or all |
--max-chunks |
all | Limit chunks per chunker (for quick tests) |
--skip-judge |
false | Skip LLM-as-judge, only structural metrics |
--sample-size |
30 | Number of chunks to judge |
--workers |
4 | Parallel LLM workers |
-o |
benchmarks/results/ |
Output directory |
Output¶
Results are saved as:
- results.json — full structured data
- report.md — 4 markdown tables (chunking, quality, smart vs naive, cost & speed)
Smoke Test Results (preliminary)¶
These are test runs to verify the pipeline works, not production benchmarks. Full benchmarks with proper sample sizes and both LLM tiers are pending.
Config: 20 chunks, gemini-flash only, sample_size=5 for judge.
Chunking Comparison¶
| Chunker | Chunks | Avg Size (words) | Std Size | Headings % |
|---|---|---|---|---|
| docling-hierarchical | 20 | 57 | 80 | 100% |
| docling-hybrid | 20 | 232 | 91 | 100% |
| stratum | 20 | 69 | 39 | 85% |
Stratum produces the most consistent chunk sizes (lowest std deviation).
Summarization Quality (Naive, gemini-flash)¶
| Chunker | Coherence | Faithfulness | Completeness |
|---|---|---|---|
| docling-hierarchical | 4.2 | 5.0 | 3.0 |
| docling-hybrid | 5.0 | 5.0 | 4.0 |
| stratum | 5.0 | 5.0 | 4.2 |
Smart vs Naive (Stratum, gemini-flash)¶
| Type | Coherence | Faithfulness | Completeness |
|---|---|---|---|
| Naive | 5.0 | 5.0 | 4.2 |
| Smart | 5.0 | 5.0 | 4.0 |
| Delta | +0.00 | +0.00 | -0.20 |
Note: Smart vs Naive delta is inconclusive on 5 judged samples. Full run needed.