Benchmarking¶

Benchmarking suite for comparing Stratum Chunker against Docling baselines.

Goal¶

Prove that Stratum Chunker delivers value: 1. Don't break: Stratum chunking + naive summarization >= Docling out-of-box 2. Smart is smart: DocSummaryEnricher + DocContextEnricher measurably improves quality 3. LLM tiers: Works with Gemma/Gemini (lower bound) and Claude/GPT-4o (upper bound) 4. Value: Saves data scientist time without breaking anything

Quick Start¶

# Quick smoke test (cheapest — gemini-flash only, 20 chunks)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier lower --max-chunks 20

# Full run (all tiers, all chunkers, judge evaluation)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf

# Structural metrics only (no LLM judge, fastest)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --skip-judge

# Upper tier only (Claude Sonnet)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier upper

Requires: GEMINI_API_KEY (or GOOGLE_API_KEY) for lower tier, ANTHROPIC_API_KEY for upper tier.

Architecture¶

benchmarks/
├── configs.py         # BenchmarkConfig, LLMTier, ChunkerVariant
├── chunkers.py        # SharedParse (parse once) + 3 chunker adapters
├── summarizers.py     # NaiveSummarizer (control) + SmartEnricher
├── evaluators.py      # ChunkingMetrics, SummarizationJudge, CostMetrics
├── report.py          # Markdown tables + JSON output
└── run_benchmark.py   # CLI entry point

Key design: PDF is parsed once with Docling and shared across all chunkers via SharedParse. This avoids re-parsing (~13s per parse) and ensures identical input.

Experiments¶

Experiment 1: Chunking + Naive Summarization¶

Each chunker gets the identical prompt — the control variable:

"Summarize the following text in 1-2 sentences. Output only the summary."

Chunker	Method
Docling Hierarchical	`HierarchicalChunker().chunk(docling_doc)`
Docling Hybrid (512 tokens)	`HybridChunker(tokenizer=cl100k_base, max_tokens=512)`
Stratum (default)	`SemanticChunker(target_size=500, max_size=700)`

Experiment 2: Smart Enrichment (Stratum only)¶

DocSummaryEnricher (1 LLM call) then DocContextEnricher (N parallel calls)
Compare smart doc_context vs naive per-chunk summary

Metrics¶

Category	Metrics
Chunking structure	chunk count, avg/std size, heading preservation %
Summary quality	coherence, faithfulness, completeness (1-5, LLM-as-judge)
Cost	LLM calls, prompt/completion tokens, estimated USD
Speed	parse time, chunk time, summarization time

CLI Options¶

Flag	Default	Description
`--tier`	`all`	`lower` (gemini-flash), `upper` (claude-sonnet), or `all`
`--max-chunks`	all	Limit chunks per chunker (for quick tests)
`--skip-judge`	false	Skip LLM-as-judge, only structural metrics
`--sample-size`	30	Number of chunks to judge
`--workers`	4	Parallel LLM workers
`-o`	`benchmarks/results/`	Output directory

Output¶

Results are saved as: - results.json — full structured data - report.md — 4 markdown tables (chunking, quality, smart vs naive, cost & speed)

Smoke Test Results (preliminary)¶

These are test runs to verify the pipeline works, not production benchmarks. Full benchmarks with proper sample sizes and both LLM tiers are pending.

Config: 20 chunks, gemini-flash only, sample_size=5 for judge.

Chunking Comparison¶

Chunker	Chunks	Avg Size (words)	Std Size	Headings %
docling-hierarchical	20	57	80	100%
docling-hybrid	20	232	91	100%
stratum	20	69	39	85%

Stratum produces the most consistent chunk sizes (lowest std deviation).

Summarization Quality (Naive, gemini-flash)¶

Chunker	Coherence	Faithfulness	Completeness
docling-hierarchical	4.2	5.0	3.0
docling-hybrid	5.0	5.0	4.0
stratum	5.0	5.0	4.2

Smart vs Naive (Stratum, gemini-flash)¶

Type	Coherence	Faithfulness	Completeness
Naive	5.0	5.0	4.2
Smart	5.0	5.0	4.0
Delta	+0.00	+0.00	-0.20

Note: Smart vs Naive delta is inconclusive on 5 judged samples. Full run needed.