Skip to content

Enrichment Component

Status: Fully implemented with 10 enrichers (+ 2 placeholders)

Overview

The enrichment component adds metadata to chunks without modifying the canonical document structure. Multiple enrichers are available for reference detection, table/image understanding, document summarization, cross-document context, chunk classification, keyword extraction, and topic discovery.

Current Architecture

┌─────────────────────────────────────────────────────────────┐
│               CLI / Pipeline / Python API                    │
│                                                              │
│  --pipeline-config pipeline.yaml                             │
│  --enrich-doc-summary --enrich-doc-context                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│        EnrichmentRegistry + EnrichmentPipeline               │
│                                                              │
│  - Loads enrichers from global registry                      │
│  - Applies each enricher sequentially                        │
│  - Supports fail-fast or continue-on-error modes             │
└─────────────────────────────────────────────────────────────┘
                              │
    ┌──────────┬──────────┬───┴────┬──────────┬──────────┐
    ▼          ▼          ▼        ▼          ▼          ▼
┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐
│  ref   ││ table  ││ image  ││doc-sum ││doc-ctx ││cross-  │
│detect  ││summary ││describe││  mary  ││  ext   ││doc-ctx │
│        ││        ││        ││        ││        ││        │
│No LLM  ││VLM/LLM││VLM/LLM ││  LLM   ││  LLM   ││LLM+emb│
└────────┘└────────┘└────────┘└────────┘└────────┘└────────┘

Usage

CLI Flags (Quick Start)

# Document summary (requires --llm-provider and --llm-model)
stratum document.pdf \
  --enrich-doc-summary \
  --llm-provider openai --llm-model gpt-4o-mini \
  -o output.json

# Document summary + per-chunk context
stratum document.pdf \
  --enrich-doc-summary --enrich-doc-context \
  --doc-context-mode neighbors --doc-context-n-neighbors 3 \
  --llm-provider local --llm-model gemma3:4b \
  --llm-base-url http://localhost:11434/v1 \
  -o output.json

# Table/image enrichers use environment variables
LLM_API_KEY=your-key stratum document.pdf \
  --pipeline-config examples/arxiv_paper/pipeline.yaml \
  -o output.json -v

CLI with Pipeline Config

# Use pipeline config with enrichment steps
stratum document.pdf --pipeline-config pipeline.yaml -o output.json -v

Pipeline Configuration

# pipeline.yaml
name: enrichment-pipeline
version: "1.0.0"

steps:
  - type: parser
    name: docling

  - type: chunker
    target_size: 1000

  # Reference detection (no LLM needed)
  - type: enrichment
    name: intra-document-reference
    confidence_threshold: 0.5
    enabled_types: [section, figure, table]

  # Document summary (LLM-based)
  - type: enrichment
    name: doc-summary
    llm_provider: openai
    llm_model: gpt-4o-mini
    max_doc_chars: 50000
    max_tokens: 512

  # Per-chunk context (LLM-based, parallel)
  - type: enrichment
    name: doc-context
    llm_provider: openai
    llm_model: gpt-4o-mini
    mode: neighbors
    n_neighbors: 3
    max_workers: 16

  # Table summarization (uses LLM_API_KEY env var)
  - type: enrichment
    name: table-summarization

  # Image description (uses VLM_API_KEY env var)
  - type: enrichment
    name: image-description

See examples/*/pipeline.yaml for complete pipeline configurations with real output.

Python API

from stratum.enrichment import (
    EnrichmentRegistry,
    DocSummaryEnricher,
    DocContextEnricher,
    TableSummarizationEnricher,
    IntraDocumentReferenceEnricher,
)

# Get global registry with all default enrichers
registry = EnrichmentRegistry.get_global()
print(registry.list_enrichers())
# ['chunk-classification', 'cross-doc-context', 'doc-context', 'doc-summary',
#  'embedding', 'entity', 'image-description', 'intra-document-reference',
#  'keyword', 'noop', 'table-summarization', 'topic']

# Via registry (uses config dict)
enricher = registry.get("doc-summary", {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
})
enriched_doc = enricher.enrich(doc)

# Or instantiate directly
enricher = DocSummaryEnricher(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    max_doc_chars=50000,
)
enriched_doc = enricher.enrich(doc)

# Access results — all enrichments live in chunk.enrichments dict
for chunk in enriched_doc.chunks:
    print(chunk.enrichments['doc_summary'])    # Document summary (same for all chunks)
    print(chunk.enrichments['doc_context'])    # Per-chunk context
    print(chunk.enrichments.get('keywords'))   # TF-IDF keywords (if keyword enricher ran)
    print(chunk.enrichments.get('topic'))      # Topic label (if topic enricher ran)

Implementing Custom Enrichers

To implement a real enricher, create a class that satisfies the EnrichmentStep protocol:

from stratum.enrichment import EnrichmentStep, EnrichmentRegistry
from stratum.models.output import CanonicalDocument

class MyEmbeddingEnricher:
    """Real embedding enricher."""

    name = "my-embedding"
    version = "1.0.0"

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        # Initialize embedding model...

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        """Add embeddings to chunks."""
        for chunk in document.chunks:
            # Generate embedding for chunk.text
            embedding = self._generate_embedding(chunk.text)
            # Store in chunk's enrichments field
            chunk.enrichments['embedding'] = embedding

        # Track enrichment metadata at document level
        document.document.enrichments.append({
            'name': self.name,
            'version': self.version,
            'timestamp': datetime.now().isoformat()
        })
        return document

    def supports_document(self, document: CanonicalDocument) -> bool:
        """Support all documents."""
        return True

    def _generate_embedding(self, text: str) -> list[float]:
        # Call embedding API...
        pass

# Register the enricher
registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)

Default Registered Enrichers

The global registry comes with the following enrichers:

Name Description LLM Required Storage
intra-document-reference Internal document reference detection No (rule-based) chunk.enrichments['references']
table-summarization LLM-based table summarization Yes (VLM, with fallback) chunk.enrichments['table_summary']
image-description VLM-based image description Yes (VLM, with fallback) chunk.enrichments['image_description']
doc-summary Document-level summary attached to all chunks Yes chunk.enrichments['doc_summary']
doc-context Per-chunk contextual description Yes chunk.enrichments['doc_context']
cross-doc-context Context from similar chunks in other documents Yes (LLM + embeddings) chunk.enrichments['cross_doc_context']
chunk-classification Per-chunk category label (taxonomy or freeform) Yes chunk.enrichments['classification']
keyword TF-IDF keyword extraction (no LLM, multilingual) No (scikit-learn) chunk.enrichments['keywords']
topic Topic discovery from keywords + assignment by overlap Yes (1 LLM call) chunk.enrichments['topic']
noop No-operation enricher (testing) No metadata only
embedding Vector embeddings Placeholder (no-op) -
entity Named entity extraction Placeholder (no-op) -

Placeholders allow pipeline configs to reference future enrichers without errors.

LLM Provider Support

The doc-level enrichers (doc-summary, doc-context, cross-doc-context, chunk-classification, topic) support multiple LLM backends:

Provider Config Value API Key Env Var Notes
OpenAI openai OPENAI_API_KEY GPT-4o, GPT-4o-mini, etc.
Anthropic anthropic ANTHROPIC_API_KEY Claude models
Google google GOOGLE_API_KEY Gemini models
Local/Custom local LOCAL_API_KEY (optional) Ollama, vLLM, any OpenAI-compatible. Requires llm_base_url.

All enrichers — including table-summarization and image-description — use the same llm_provider / llm_model / llm_base_url configuration as the doc-level enrichers. Configure them the same way in pipeline.yaml:

- type: enrichment
  name: table-summarization
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_tokens: 300

- type: enrichment
  name: image-description
  llm_provider: local
  llm_model: /models/gemma-3-12b-it
  llm_base_url: http://localhost:8000/v1

priority / extra_body

When using vLLM with priority scheduling, add priority: N to any enrichment step. This is shorthand for extra_body: {"priority": N}:

- type: enrichment
  name: doc-context
  llm_provider: local
  llm_model: /models/my-model
  llm_base_url: http://localhost:8000/v1
  max_workers: 8
  priority: 5           # vLLM scheduling priority (lower = less priority)

extra_body can also be set explicitly for any additional vendor-specific fields to merge into every request body.

Intra-Document Reference Enricher

The intra-document-reference enricher detects and extracts internal document references (cross-references) from chunk text.

Supported Reference Types

English Patterns:

Type Examples Patterns
Section Section 3, Section 2.4.3, §5, Sec. 1.2 Deep hierarchical numbering, alphanumeric suffixes
Figure Figure 1, Fig. 2, Figures 1-3 Subfigures (3a, 3b), ranges, lists
Table Table 1, Tab. 2, Tables 1-3 Alphanumeric (A1, B2), ranges
Appendix Appendix A, App. B.1, Appendix C.1.2.3 Deep subsections
Equation Equation 1, Eq. (5), Eqn. 3 Parenthesized, ranges
Algorithm Algorithm 1, Alg. 2
Listing Listing 1
Chapter Chapter 1, Ch. 3 Ranges

French (fr) Patterns:

Type Examples Patterns
Section Section 3, Partie 1, Paragraphe 2 Same as English + Partie, Paragraphe
Figure Figure 1, Schéma 2, Graphique 3, Illustration 1 Schéma, Graphique, Encadré
Table Tableau 1, Tab. 2, Tableaux 1-3 Tableau/Tableaux
Appendix Annexe A, Annexe B.1 Annexe
Equation Équation 1, Formule 2, Théorème 3, Lemme 1 Équation, Formule, Théorème, Lemme, Corollaire
Algorithm Algorithme 1, Alg. 2 Algorithme
Listing Exemple 1, Définition 2 Exemple, Définition
Chapter Chapitre 1, Chap. 3 Chapitre

Other Language Packs:

Code Language Figure keyword Table keyword Section keyword
de German Abbildung, Abb. Tabelle, Tab. Abschnitt, Kap.
es Spanish Figura, Fig. Tabla, Tab. Sección, Cap.
it Italian Figura, Fig. Tabella, Tab. Sezione, Cap.
pt Portuguese Figura, Fig. Tabela, Tab. Seção, Cap.
nl Dutch Figuur, Fig. Tabel, Tab. Sectie, H.
ru Russian Рисунок, Рис. Таблица, Табл. Раздел, Гл.
zh Chinese 节、章
ar Arabic شكل جدول قسم، فصل

Full language names (english, french, german, …) are also accepted as aliases.

Usage

from stratum.enrichment import IntraDocumentReferenceEnricher

# English only (default)
enricher = IntraDocumentReferenceEnricher(
    confidence_threshold=0.5,
    enabled_types=["section", "figure", "table"],
)

# French documents
enricher_fr = IntraDocumentReferenceEnricher(
    languages=["fr"],
    confidence_threshold=0.5,
)

# Multilingual (English + French)
enricher_multi = IntraDocumentReferenceEnricher(
    languages=["en", "fr"],
    confidence_threshold=0.5,
    custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)

enriched_doc = enricher.enrich(document)

# Access detected references
for chunk in enriched_doc.chunks:
    if "references" in chunk.enrichments:
        refs = chunk.enrichments["references"]["intra_document"]
        for ref in refs:
            print(f"{ref['type']}: {ref['normalized']} at {ref['position']}")

# Check enrichment metadata
for metadata in enriched_doc.document.enrichments:
    print(f"Applied: {metadata['name']} v{metadata['version']} at {metadata['timestamp']}")

Pipeline Configuration

steps:
  - type: enrichment
    name: intra-document-reference
    confidence_threshold: 0.5
    # Language support. Supported: en, fr, de, es, it, pt, nl, ru, zh, ar
    # Default: ['en']. Combine for multilingual documents.
    languages:
      - en
      - fr
    # Optionally restrict to specific types
    enabled_types:
      - section
      - figure
      - table
    # Or disable specific types
    disabled_types:
      - algorithm
    # High-level custom reference type names (auto-generates patterns).
    # Creates a new ref_type (lowercased) in the output.
    # Detects "Theorem 3", "Lemma 5.2", "Theorems 1 and 2", etc.
    custom_reference_types:
      - Theorem
      - Lemma
      - Proposition
    # Map custom keywords to existing ref_types so they participate in
    # resolution. E.g. if your document calls figures "Illustration":
    custom_type_mappings:
      Illustration: figure
      Diagram: figure
    # Add raw regex patterns for an existing type
    custom_patterns:
      section:
        - "\\bPart\\s+(\\d+)"

Custom Reference Type Names

For domain-specific references not in the built-in list, use custom_reference_types:

enricher = IntraDocumentReferenceEnricher(
    custom_reference_types=["Theorem", "Lemma", "Proposition"],
)

This auto-generates patterns that match Theorem 3, Theorems 2 and 3, Lemma 5.2, etc. Each name becomes a new ref_type (lowercased) in the output.

Mapping Custom Keywords to Existing Types

If your document uses non-standard names for figures, tables, etc., use custom_type_mappings to map the keyword to an existing ref_type:

enricher = IntraDocumentReferenceEnricher(
    custom_type_mappings={
        "Illustration": "figure",   # "Illustration 3" -> figure ref, gets resolved
        "Diagram": "figure",
        "Spreadsheet": "table",
    },
)

Unlike custom_reference_types, mapped keywords are merged into the target ref_type so they participate fully in resolution (i.e. resolved_to gets populated for figure/table, and for section/appendix the chunk is found via heading_path).

Custom Types File

Use custom_types_file to load type names from a file instead of embedding them in code:

# math_types.txt
Theorem              # creates new "theorem" ref_type
Lemma                # creates new "lemma" ref_type
Illustration:figure  # maps "Illustration" to existing figure type
Diagram:figure       # maps "Diagram" to existing figure type
enricher = IntraDocumentReferenceEnricher(
    custom_types_file="math_types.txt",
)

For raw regex control over patterns, use custom_patterns:

enricher = IntraDocumentReferenceEnricher(
    custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)

Reference Storage Format

References are stored in chunk.enrichments['references']['intra_document']:

{
    "type": "section",           # Reference type (section/figure/table/appendix/…)
    "raw_text": "Section 2.4.3", # Original text matched
    "normalized": "2.4.3",       # Normalized identifier (number, letter, etc.)
    "position": {
        "start": 4,              # Character offset within THIS chunk's text (not the full document)
        "end": 17
    },
    "confidence": 0.95,          # Confidence score (0.0-1.0)
    # For ranges:
    "is_range": true,
    "range_start": "2",
    "range_end": "5",
    # For lists:
    "is_list": true,
    "list_items": ["1", "2", "3"]
}

Note on position: start/end are character offsets within the chunk's own text field, not document-level offsets. Use chunk.text[ref["position"]["start"]:ref["position"]["end"]] to recover the matched text.

Note: Enrichment metadata is tracked at the document level in document.document.enrichments.

Features

  • Deep hierarchical numbering: Section 1.2.3.4.5.6
  • Alphanumeric suffixes: Section 2a, Figure 3b
  • Letter-prefixed sections: Section A.1, Section B.2.3
  • Range detection: Sections 2-5, Figures 1-3
  • List detection: Figures 1, 2, and 3
  • False positive filtering: Ignores "figure of speech", "section of the code"
  • Custom patterns: Add project-specific patterns via configuration
  • Reference resolution: Link references to actual table/figure artifacts
  • LLM validation (planned): Optional validation with Gemini/OpenAI

Reference Resolution

Enable resolve_references=True to link detected references to their targets in the document:

enricher = IntraDocumentReferenceEnricher(resolve_references=True)
result = enricher.enrich(document)

for chunk in result.chunks:
    if "references" in chunk.enrichments:
        for ref in chunk.enrichments["references"]["intra_document"]:
            if "resolved_to" in ref:
                print(f"{ref['raw_text']} -> chunk {ref['resolved_to']['chunk_id']}")

What gets resolved:

Type How it resolves artifact_id
figure Matches caption text in chunks with images Image artifact ID (e.g. doc_image_001)
table Matches caption text in chunks with tables Table artifact ID (e.g. doc_table_001)
section Matches section number in chunk.heading_path section_<number> (synthetic)
appendix Matches appendix letter in chunk.heading_path appendix_<letter> (synthetic)
custom (mapped to figure/table) Same as figure/table Same as figure/table

Section and appendix resolution works via the document's heading structure — each chunk that begins a new section has a heading_path (e.g. ["Chapter 2", "2.3 Experimental Setup"]), and the section number is extracted from there.

Resolved reference format:

{
    "type": "table",
    "raw_text": "Table 1",
    "normalized": "1",
    "resolved_to": {
        "artifact_id": "doc_table_001",   # artifact or synthetic section ID
        "chunk_id": "doc_chunk_005"        # chunk containing the target
    }
}

If a reference cannot be resolved (e.g. the referenced section is in a different document, or its number doesn't match any heading), resolved_to is omitted.

Table Summarization Enricher

The table-summarization enricher uses LLM to generate concise summaries of tables.

Configuration

Requires OpenAI-compatible API. Set environment variables:

# Option 1: Direct LLM config
LLM_API_KEY=your-api-key
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-2.0-flash

# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key

Usage

from stratum.enrichment.table import TableSummarizationEnricher

# Create from environment
enricher = TableSummarizationEnricher.from_env()

# Or with explicit config
from stratum.enrichment.llm_client import LLMConfig

config = LLMConfig(
    api_key="your-key",
    base_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
)
enricher = TableSummarizationEnricher(config=config)

# Enrich document
result = enricher.enrich(document)

# Access summaries
for chunk in result.chunks:
    if "table_summary" in chunk.enrichments:
        print(chunk.enrichments["table_summary"])

Fallback Mode

When LLM is unavailable, falls back to header extraction:

# Fallback only (no LLM)
enricher = TableSummarizationEnricher(config=None, fallback_enabled=True)

Pipeline Configuration

steps:
  - type: enrichment
    name: table-summarization
    fallback_enabled: true

Image Description Enricher

The image-description enricher uses VLM (Vision Language Model) to describe images.

Configuration

# Option 1: Direct VLM config
VLM_API_KEY=your-api-key
VLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
VLM_MODEL=gemini-2.0-flash

# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key

Usage

from stratum.enrichment.image import ImageDescriptionEnricher

# Create from environment
enricher = ImageDescriptionEnricher.from_env()

# Enrich document
result = enricher.enrich(document)

# Access descriptions
for chunk in result.chunks:
    if "image_description" in chunk.enrichments:
        print(chunk.enrichments["image_description"])

Fallback Mode

When VLM is unavailable, uses image captions from parser:

enricher = ImageDescriptionEnricher(config=None, fallback_enabled=True)

Pipeline Configuration

steps:
  - type: enrichment
    name: image-description
    fallback_enabled: true

Document Summary Enricher

The doc-summary enricher generates a document-level summary using an LLM and stores it in every chunk's enrichments['doc_summary']. This allows retrieval systems to include document context alongside individual chunks.

Usage

from stratum.enrichment import DocSummaryEnricher

enricher = DocSummaryEnricher(
    llm_provider="openai",       # or "anthropic", "google", "local"
    llm_model="gpt-4o-mini",
    max_doc_chars=50000,         # truncate long documents
    max_tokens=512,
    temperature=0.3,
)

result = enricher.enrich(document)

# Same summary in every chunk
for chunk in result.chunks:
    print(chunk.enrichments['doc_summary'])

Pipeline Configuration

steps:
  - type: enrichment
    name: doc-summary
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    max_doc_chars: 50000
    max_tokens: 512
    temperature: 0.3

Document Context Enricher

The doc-context enricher generates per-chunk contextual descriptions, explaining each chunk in relation to its surrounding content or the broader document. Results are stored in chunk.enrichments[context_key] (default key: doc_context).

Two Modes

Mode Description When to use
neighbors (default) Summarizes each chunk in light of its surrounding chunks (positional window). Fast. Good for section-level context.
document Explains each chunk in light of the full document text (possibly truncated). Richer context. Higher token cost.

Neighbors mode: context window = n_neighbors chunks on each side of the current chunk.

Document mode: the full document text (concatenated chunks) is passed to the LLM, truncated to max_doc_chars if set.

Multiple doc-context steps

Use context_key to run two doc-context enrichers in the same pipeline without one overwriting the other:

steps:
  - type: enrichment
    name: doc-context
    mode: neighbors
    n_neighbors: 3
    context_key: doc_context_neighbors    # stored in chunk.enrichments['doc_context_neighbors']
    llm_provider: google
    llm_model: gemini-2.0-flash

  - type: enrichment
    name: doc-context
    mode: document
    context_key: doc_context_document     # stored in chunk.enrichments['doc_context_document']
    max_doc_chars: 30000
    llm_provider: google
    llm_model: gemini-2.0-flash

Without context_key, the default key doc_context is used and the second step will overwrite the first.

Usage

from stratum.enrichment import DocContextEnricher

enricher = DocContextEnricher(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    mode="neighbors",          # or "document"
    n_neighbors=3,             # chunks on each side (neighbors mode)
    max_doc_chars=50000,       # truncate in document mode
    max_tokens=512,
    temperature=0.3,
    max_workers=16,            # parallel LLM requests
    context_key="doc_context", # output key in chunk.enrichments
)

result = enricher.enrich(document)

for chunk in result.chunks:
    print(chunk.enrichments['doc_context'])   # unique context per chunk

Pipeline Configuration

steps:
  - type: enrichment
    name: doc-context
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    mode: neighbors
    n_neighbors: 3
    max_tokens: 512
    max_workers: 16
    context_key: doc_context    # optional; default is "doc_context"

Cross-Document Context Enricher

The cross-doc-context enricher finds similar chunks from OTHER documents using a FAISS vector index, then generates contextual descriptions. Useful for corpus-level enrichment.

Workflow

  1. All chunks are embedded and stored in a FAISS vector index
  2. For each chunk, similar chunks from other documents are retrieved
  3. An LLM generates context based on the similar chunks

Usage

from stratum.enrichment import CrossDocContextEnricher

enricher = CrossDocContextEnricher(
    vector_store_index_path="db/corpus/faiss.index",
    vector_store_metadata_path="db/corpus/metadata.pkl",
    embedding_provider="local",
    embedding_model="nomic-embed-text",
    embedding_base_url="http://localhost:11434/v1",
    llm_provider="local",
    llm_model="gemma3:4b",
    llm_base_url="http://localhost:11434/v1",
    n_neighbors=5,
    max_workers=16,
)

# Process multiple documents — index builds incrementally
for doc in documents:
    enriched = enricher.enrich(doc)

# Access results
for chunk in enriched.chunks:
    if "cross_doc_context" in chunk.enrichments:
        ctx = chunk.enrichments["cross_doc_context"]
        print(ctx["context"])       # LLM-generated context
        print(ctx["similar_docs"])  # list of doc_ids

Pipeline Configuration

steps:
  - type: enrichment
    name: cross-doc-context
    vector_store_index_path: db/corpus/faiss.index
    vector_store_metadata_path: db/corpus/metadata.pkl
    embedding_provider: local
    embedding_model: nomic-embed-text
    embedding_base_url: http://localhost:11434/v1
    embedding_batch_size: 32
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    n_neighbors: 5
    max_tokens: 512
    max_workers: 16

Custom Prompt Files

All LLM-based enrichers that use a PromptBuilder class accept a prompt_file parameter. This overrides the built-in hardcoded prompt with a YAML file you control.

YAML format

# system prompt (optional)
system: |
  You are a precise document analyst. Be concise and factual.

# user_template (required) — use {variable} placeholders
user_template: |
  Summarize this document in 3 sentences.

  DOCUMENT:
  {document_text}

# dspy_signature:  # reserved for future DSPy migration

The system key is optional — if absent, no system message is sent. The user_template key is required and must contain all {variable} placeholders expected by the enricher.

Supported enrichers and their template variables

Enricher Template variables
doc-summary {document_text}
doc-context (neighbors mode) {previous_chunks}, {current_chunk}, {following_chunks}
doc-context (document mode) {document_text}, {current_chunk}
chunk-classification (taxonomy) {chunk_text}, {categories}
chunk-classification (freeform) {chunk_text}
topic {keywords_text}
cross-doc-context {context_block}, {target_text}

Pipeline YAML usage

- type: enrichment
  name: doc-summary
  llm_provider: google
  llm_model: gemini-2.0-flash
  prompt_file: prompts/my_summary.yaml

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  prompt_file: prompts/my_classification.yaml

Default YAML prompts

Ready-to-use reference prompts are bundled in stratum/enrichment/prompts/:

stratum/enrichment/prompts/
├── __init__.py                    # load_prompt_file() helper
├── doc_summary.yaml
├── doc_context_neighbors.yaml
├── doc_context_document.yaml
├── classification.yaml
├── classification_freeform.yaml
├── topic.yaml
└── cross_doc_context.yaml

Copy and customise any of these as a starting point.

DSPy readiness

The # dspy_signature: comment in YAML files is reserved for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm, so migration will only require filling in that key — no structural changes.


Module Structure

stratum/enrichment/
├── __init__.py        # Public exports (all enrichers)
├── protocols.py       # EnrichmentStep protocol
├── base.py            # BaseEnrichmentStep abstract class
├── base_prompt.py     # BasePromptBuilder for prompt construction
├── registry.py        # EnrichmentRegistry with global singleton
├── pipeline.py        # EnrichmentPipeline for chaining
├── noop.py            # NoOpEnricher implementation
├── llm_client.py      # OpenAI-compatible LLM/VLM client (table/image)
├── prompts/           # Bundled YAML prompt files + load_prompt_file()
│   ├── __init__.py    # load_prompt_file(path) helper
│   ├── doc_summary.yaml
│   ├── doc_context_neighbors.yaml
│   ├── doc_context_document.yaml
│   ├── classification.yaml
│   ├── classification_freeform.yaml
│   ├── topic.yaml
│   └── cross_doc_context.yaml
├── llm/               # Multi-provider LLM abstraction (doc-level)
│   ├── __init__.py
│   ├── base.py        # LLMClient ABC + LLMResponse dataclass
│   ├── factory.py     # create_llm() factory function
│   ├── openai_compatible.py  # OpenAI/local provider
│   ├── anthropic_client.py   # Anthropic Claude provider
│   └── google_client.py      # Google Gemini provider
├── reference/         # Intra-document reference detection
│   ├── __init__.py
│   ├── patterns.py    # Configurable regex patterns (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│   ├── detector.py    # ReferenceDetector class
│   ├── block_index.py # BlockIndex for reference resolution
│   └── enricher.py    # IntraDocumentReferenceEnricher
├── table/             # Table summarization
│   ├── __init__.py
│   └── summarizer.py  # TableSummarizationEnricher
├── image/             # Image description
│   ├── __init__.py
│   └── describer.py   # ImageDescriptionEnricher
├── doc_summary/       # Document-level summary
│   ├── __init__.py
│   ├── enricher.py    # DocSummaryEnricher
│   └── prompt.py      # DocSummaryPromptBuilder
├── doc_context/       # Per-chunk contextual description
│   ├── __init__.py
│   ├── enricher.py    # DocContextEnricher
│   └── prompt.py      # DocContextPromptBuilder
├── cross_doc_context/ # Cross-document context
│   ├── __init__.py
│   ├── enricher.py    # CrossDocContextEnricher
│   └── prompt.py      # CrossDocContextPromptBuilder
├── classification/    # Chunk classification
│   ├── __init__.py
│   ├── enricher.py    # ChunkClassificationEnricher
│   └── prompt.py      # ClassificationPromptBuilder
├── keyword/           # TF-IDF keyword extraction (no LLM)
│   ├── __init__.py
│   ├── extractor.py   # KeywordExtractor (scikit-learn)
│   └── enricher.py    # KeywordEnricher
├── topic/             # Topic discovery and assignment
│   ├── __init__.py
│   ├── prompt.py      # TopicDiscoveryPromptBuilder
│   └── enricher.py    # TopicEnricher
└── context/           # Vector index for cross-doc context
    ├── __init__.py
    └── index.py       # FAISS vector index

EnrichmentStep Protocol

from typing import Protocol
from stratum.models.output import CanonicalDocument

class EnrichmentStep(Protocol):
    """Protocol for document enrichers."""

    name: str
    version: str

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        """Enrich document with additional metadata."""
        ...

    def supports_document(self, document: CanonicalDocument) -> bool:
        """Check if enricher supports this document."""
        ...

Verbose Mode

With -v flag, CLI shows enrichment progress:

$ stratum doc.pdf --pipeline-config pipeline.yaml -v
Loaded pipeline config: enrichment-pipeline v1.0.0
Applied enricher: embedding
Applied enricher: summary
Processed: doc.pdf
Chunks: 15
  • Architecture - System architecture and pipeline documentation