Benchmarking

Benchmarking suite for comparing Stratum Chunker against Docling baselines.

Goal

Prove that Stratum Chunker delivers value:

  1. Don't break: Stratum chunking + naive summarization >= Docling out-of-box
  2. Smart is smart: DocSummaryEnricher + DocContextEnricher measurably improves quality
  3. LLM tiers: Works with Gemma/Gemini (lower bound) and Claude/GPT-4o (upper bound)
  4. Value: Saves data scientist time without breaking anything

Quick Start

# Quick smoke test (cheapest — gemini-flash only, 20 chunks)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier lower --max-chunks 20

# Full run (all tiers, all chunkers, judge evaluation)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf

# Structural metrics only (no LLM judge, fastest)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --skip-judge

# Upper tier only (Claude Sonnet)
python benchmarks/run_benchmark.py examples/arxiv_paper/arxiv_example.pdf --tier upper

Requires: GEMINI_API_KEY (or GOOGLE_API_KEY) for lower tier, ANTHROPIC_API_KEY for upper tier.
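
A pre-flight check along these lines can fail fast before any paid calls are made. This is a minimal sketch, not code from run_benchmark.py: the function name `missing_api_keys` and the `REQUIRED_KEYS` mapping are hypothetical and are built only from the requirement stated above.

```python
import os
from typing import Mapping, Optional

# Assumed mapping, derived from the requirement above. Each tuple is a group of
# interchangeable variables; any one of them satisfies that group.
REQUIRED_KEYS = {
    "lower": [("GEMINI_API_KEY", "GOOGLE_API_KEY")],
    "upper": [("ANTHROPIC_API_KEY",)],
}


def missing_api_keys(tier: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return one description per unsatisfied key group for the given tier."""
    env = os.environ if env is None else env
    tiers = ["lower", "upper"] if tier == "all" else [tier]
    missing = []
    for name in tiers:
        for group in REQUIRED_KEYS[name]:
            # A group is satisfied if any of its variables is set and non-empty.
            if not any(env.get(var) for var in group):
                missing.append(" or ".join(group))
    return missing
```

Passing an explicit `env` mapping keeps the check easy to test; calling it with no second argument reads the real environment.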

Architecture

benchmarks/
├── configs.py         # BenchmarkConfig, LLMTier, ChunkerVariant
├── chunkers.py        # SharedParse (parse once) + 3 chunker adapters
├── summarizers.py     # NaiveSummarizer (control) + SmartEnricher
├── evaluators.py      # ChunkingMetrics, SummarizationJudge, CostMetrics
├── report.py          # Markdown tables + JSON output
└── run_benchmark.py   # CLI entry point

Key design: the PDF is parsed once with Docling and the result is shared across all chunkers via SharedParse. This avoids repeated parsing (~13s per parse) and guarantees that every chunker receives identical input.
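
The parse-once idea can be illustrated with a small lazy cache. This is a sketch under assumptions, not the actual SharedParse class: `ParseOnce`, `fake_parse`, and the three toy chunkers are hypothetical stand-ins, and the real Docling parse is replaced by a plain function so the example runs on its own.

```python
from typing import Any, Callable, List


class ParseOnce:
    """Hypothetical stand-in for SharedParse: parse lazily, then reuse the result."""

    def __init__(self, path: str, parse_fn: Callable[[str], Any]):
        self.path = path
        self._parse_fn = parse_fn
        self._parsed = False
        self._doc: Any = None
        self.parse_calls = 0  # counts how often the expensive parse actually ran

    @property
    def document(self) -> Any:
        if not self._parsed:
            self._doc = self._parse_fn(self.path)
            self._parsed = True
            self.parse_calls += 1
        return self._doc


def fake_parse(path: str) -> List[str]:
    # Stand-in for the expensive Docling parse; returns a list of paragraphs.
    return ["Intro paragraph.", "Method paragraph.", "Results paragraph."]


shared = ParseOnce("examples/arxiv_paper/arxiv_example.pdf", fake_parse)

# Three toy "chunkers" that all read the same parsed document.
chunkers = {
    "one-per-paragraph": lambda doc: list(doc),
    "single-chunk": lambda doc: [" ".join(doc)],
    "pairs": lambda doc: [" ".join(doc[i:i + 2]) for i in range(0, len(doc), 2)],
}
results = {name: chunk(shared.document) for name, chunk in chunkers.items()}
```

Because every chunker reads `shared.document`, the parse function runs a single time however many chunkers are compared, and each one sees the same object.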

Experiments

Experiment 1: Chunking + Naive Summarization

The naive summarizer applies the same prompt to every chunker's output, so the prompt is held constant as the control:

"Summarize the following text in 1-2 sentences. Output only the summary."

| Chunker | Method |
|---|---|
| Docling Hierarchical | HierarchicalChunker().chunk(docling_doc) |
| Docling Hybrid (512 tokens) | HybridChunker(tokenizer=cl100k_base, max_tokens=512) |
| Stratum (default) | SemanticChunker(target_size=500, max_size=700) |
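
The control in Experiment 1 can be sketched as a fixed instruction applied to each chunk, whichever chunker produced it. The instruction string is the one quoted above; the helper name `build_naive_prompt` and the way instruction and chunk text are joined are assumptions, not taken from summarizers.py.

```python
NAIVE_INSTRUCTION = (
    "Summarize the following text in 1-2 sentences. Output only the summary."
)


def build_naive_prompt(chunk_text: str) -> str:
    """Prepend the fixed instruction so that only the chunk text varies."""
    # Joining with a blank line is an assumption made for this sketch.
    return f"{NAIVE_INSTRUCTION}\n\n{chunk_text.strip()}"


prompt_a = build_naive_prompt("A chunk from the hierarchical chunker.")
prompt_b = build_naive_prompt("  A chunk from the Stratum chunker.  ")
```

Keeping the instruction in one constant means any difference in summary quality can be attributed to the chunk boundaries rather than to prompt wording.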

Experiment 2: Smart Enrichment (Stratum only)

  • DocSummaryEnricher (1 LLM call) then DocContextEnricher (N parallel calls)
  • Compare smart doc_context vs naive per-chunk summary
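
The call pattern of the two enrichers (one document-level call, then N per-chunk calls in parallel) might look roughly like the following. This is an illustration only: `enrich`, `fake_llm`, and both prompt strings are hypothetical, and the real DocSummaryEnricher and DocContextEnricher interfaces are not shown in this document.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def enrich(chunks: List[str], llm: Callable[[str], str], workers: int = 4) -> List[str]:
    """One document-level call, then one call per chunk, run in parallel."""
    # Stage 1: a single call that summarizes the whole document.
    doc_summary = llm("Summarize this document:\n" + "\n".join(chunks))

    # Stage 2: N independent calls, each situating one chunk in the document.
    def contextualize(chunk: str) -> str:
        return llm(f"Document summary: {doc_summary}\nDescribe this chunk:\n{chunk}")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so contexts[i] belongs to chunks[i].
        return list(pool.map(contextualize, chunks))


calls: List[str] = []
lock = threading.Lock()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: records the prompt and echoes its last line.
    with lock:
        calls.append(prompt)
    return prompt.splitlines()[-1].upper()


chunks = ["chunk a", "chunk b", "chunk c"]
contexts = enrich(chunks, fake_llm)
```

With three chunks the fake model is called four times in total, which matches the 1 + N cost described above.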

Metrics

| Category | Metrics |
|---|---|
| Chunking structure | chunk count, avg/std size, heading preservation % |
| Summary quality | coherence, faithfulness, completeness (1-5, LLM-as-judge) |
| Cost | LLM calls, prompt/completion tokens, estimated USD |
| Speed | parse time, chunk time, summarization time |
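
The chunking-structure metrics can be computed along these lines. This is a sketch under stated assumptions rather than the ChunkingMetrics implementation: the `Chunk` record is hypothetical, size is measured in whitespace-separated words, the standard deviation is the population form, and heading preservation is taken to mean the share of chunks that carry a heading.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Chunk:
    # Hypothetical minimal chunk record; real chunk objects will differ.
    text: str
    heading: Optional[str] = None


def structure_metrics(chunks: List[Chunk]) -> dict:
    """Chunk count, mean/std size in words, and share of chunks with a heading."""
    if not chunks:
        raise ValueError("need at least one chunk")
    sizes = [len(c.text.split()) for c in chunks]
    with_heading = sum(1 for c in chunks if c.heading)
    return {
        "chunks": len(chunks),
        "avg_size": mean(sizes),
        "std_size": pstdev(sizes),  # population std; the suite may use sample std
        "headings_pct": 100.0 * with_heading / len(chunks),
    }


metrics = structure_metrics([
    Chunk("a b c", "Intro"),
    Chunk("a b c d e", "Methods"),
    Chunk("a"),
    Chunk("a b c", "Results"),
])
```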

CLI Options

| Flag | Default | Description |
|---|---|---|
| --tier | all | lower (gemini-flash), upper (claude-sonnet), or all |
| --max-chunks | all | Limit chunks per chunker (for quick tests) |
| --skip-judge | false | Skip LLM-as-judge, only structural metrics |
| --sample-size | 30 | Number of chunks to judge |
| --workers | 4 | Parallel LLM workers |
| -o | benchmarks/results/ | Output directory |
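
For readers who want to see how these flags and defaults fit together, here is one way they could be declared with argparse. It is a sketch that mirrors the table, not the parser in run_benchmark.py: the destination name `output_dir` and the use of `None` to represent the "all" default of --max-chunks are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Stratum vs Docling benchmark")
    parser.add_argument("pdf", help="Path to the PDF to benchmark")
    parser.add_argument("--tier", choices=["lower", "upper", "all"], default="all")
    # None stands for "all chunks" (no limit) in this sketch.
    parser.add_argument("--max-chunks", type=int, default=None)
    parser.add_argument("--skip-judge", action="store_true")
    parser.add_argument("--sample-size", type=int, default=30)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("-o", dest="output_dir", default="benchmarks/results/")
    return parser


# Equivalent to the smoke-test command from Quick Start.
args = build_parser().parse_args(
    ["examples/arxiv_paper/arxiv_example.pdf", "--tier", "lower", "--max-chunks", "20"]
)
```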

Output

Results are saved as:

  • results.json: full structured data
  • report.md: 4 markdown tables (chunking, quality, smart vs naive, cost & speed)

Smoke Test Results (preliminary)

These are test runs to verify the pipeline works, not production benchmarks. Full benchmarks with proper sample sizes and both LLM tiers are pending.

Config: 20 chunks, gemini-flash only, sample_size=5 for judge.

Chunking Comparison

| Chunker | Chunks | Avg Size (words) | Std Size | Headings % |
|---|---|---|---|---|
| docling-hierarchical | 20 | 57 | 80 | 100% |
| docling-hybrid | 20 | 232 | 91 | 100% |
| stratum | 20 | 69 | 39 | 85% |

Stratum has the lowest absolute standard deviation in chunk size in this 20-chunk run (39, versus 80 and 91 for the two Docling chunkers).

Summarization Quality (Naive, gemini-flash)

| Chunker | Coherence | Faithfulness | Completeness |
|---|---|---|---|
| docling-hierarchical | 4.2 | 5.0 | 3.0 |
| docling-hybrid | 5.0 | 5.0 | 4.0 |
| stratum | 5.0 | 5.0 | 4.2 |

Smart vs Naive (Stratum, gemini-flash)

| Type | Coherence | Faithfulness | Completeness |
|---|---|---|---|
| Naive | 5.0 | 5.0 | 4.2 |
| Smart | 5.0 | 5.0 | 4.0 |
| Delta | +0.00 | +0.00 | -0.20 |

Note: With only 5 judged samples, the Smart vs Naive delta is inconclusive. A full run is needed.
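
The Delta row in the Smart vs Naive table is Smart minus Naive on each metric, which the short sketch below reproduces from the reported scores. Assuming the judge assigns integer scores, a mean over 5 samples moves in steps of 0.2, so the -0.20 on completeness corresponds to a one-point difference on a single sample.

```python
naive = {"coherence": 5.0, "faithfulness": 5.0, "completeness": 4.2}
smart = {"coherence": 5.0, "faithfulness": 5.0, "completeness": 4.0}

# Delta is smart minus naive, so a negative value means smart scored lower.
delta = {metric: smart[metric] - naive[metric] for metric in naive}

# Format with an explicit sign and two decimals, as in the table.
formatted = {metric: f"{value:+.2f}" for metric, value in delta.items()}
```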