Enrichment Component¶

Status: Fully implemented with 10 enrichers (+ 2 placeholders)

Overview¶

The enrichment component adds metadata to chunks without modifying the canonical document structure. Multiple enrichers are available for reference detection, table/image understanding, document summarization, cross-document context, chunk classification, keyword extraction, and topic discovery.

Current Architecture¶

┌─────────────────────────────────────────────────────────────┐
│               CLI / Pipeline / Python API                    │
│                                                              │
│  --pipeline-config pipeline.yaml                             │
│  --enrich-doc-summary --enrich-doc-context                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│        EnrichmentRegistry + EnrichmentPipeline               │
│                                                              │
│  - Loads enrichers from global registry                      │
│  - Applies each enricher sequentially                        │
│  - Supports fail-fast or continue-on-error modes             │
└─────────────────────────────────────────────────────────────┘
                              │
    ┌──────────┬──────────┬───┴────┬──────────┬──────────┐
    ▼          ▼          ▼        ▼          ▼          ▼
┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐
│  ref   ││ table  ││ image  ││doc-sum ││doc-ctx ││cross-  │
│detect  ││summary ││describe││  mary  ││  ext   ││doc-ctx │
│        ││        ││        ││        ││        ││        │
│No LLM  ││VLM/LLM││VLM/LLM ││  LLM   ││  LLM   ││LLM+emb│
└────────┘└────────┘└────────┘└────────┘└────────┘└────────┘

Usage¶

CLI Flags (Quick Start)¶

# Document summary (requires --llm-provider and --llm-model)
stratum document.pdf \
  --enrich-doc-summary \
  --llm-provider openai --llm-model gpt-4o-mini \
  -o output.json

# Document summary + per-chunk context
stratum document.pdf \
  --enrich-doc-summary --enrich-doc-context \
  --doc-context-mode neighbors --doc-context-n-neighbors 3 \
  --llm-provider local --llm-model gemma3:4b \
  --llm-base-url http://localhost:11434/v1 \
  -o output.json

# Table/image enrichers use environment variables
LLM_API_KEY=your-key stratum document.pdf \
  --pipeline-config examples/arxiv_paper/pipeline.yaml \
  -o output.json -v

CLI with Pipeline Config¶

# Use pipeline config with enrichment steps
stratum document.pdf --pipeline-config pipeline.yaml -o output.json -v

Pipeline Configuration¶

# pipeline.yaml
name: enrichment-pipeline
version: "1.0.0"

steps:
  - type: parser
    name: docling

  - type: chunker
    target_size: 1000

  # Reference detection (no LLM needed)
  - type: enrichment
    name: intra-document-reference
    confidence_threshold: 0.5
    enabled_types: [section, figure, table]

  # Document summary (LLM-based)
  - type: enrichment
    name: doc-summary
    llm_provider: openai
    llm_model: gpt-4o-mini
    max_doc_chars: 50000
    max_tokens: 512

  # Per-chunk context (LLM-based, parallel)
  - type: enrichment
    name: doc-context
    llm_provider: openai
    llm_model: gpt-4o-mini
    mode: neighbors
    n_neighbors: 3
    max_workers: 16

  # Table summarization (uses LLM_API_KEY env var)
  - type: enrichment
    name: table-summarization

  # Image description (uses VLM_API_KEY env var)
  - type: enrichment
    name: image-description

See examples/*/pipeline.yaml for complete pipeline configurations with real output.

Python API¶

from stratum.enrichment import (
    EnrichmentRegistry,
    DocSummaryEnricher,
    DocContextEnricher,
    TableSummarizationEnricher,
    IntraDocumentReferenceEnricher,
)

# Get global registry with all default enrichers
registry = EnrichmentRegistry.get_global()
print(registry.list_enrichers())
# ['chunk-classification', 'cross-doc-context', 'doc-context', 'doc-summary',
#  'embedding', 'entity', 'image-description', 'intra-document-reference',
#  'keyword', 'noop', 'table-summarization', 'topic']

# Via registry (uses config dict)
enricher = registry.get("doc-summary", {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
})
enriched_doc = enricher.enrich(doc)

# Or instantiate directly
enricher = DocSummaryEnricher(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    max_doc_chars=50000,
)
enriched_doc = enricher.enrich(doc)

# Access results — all enrichments live in chunk.enrichments dict
for chunk in enriched_doc.chunks:
    print(chunk.enrichments['doc_summary'])    # Document summary (same for all chunks)
    print(chunk.enrichments['doc_context'])    # Per-chunk context
    print(chunk.enrichments.get('keywords'))   # TF-IDF keywords (if keyword enricher ran)
    print(chunk.enrichments.get('topic'))      # Topic label (if topic enricher ran)

Implementing Custom Enrichers¶

To implement a real enricher, create a class that satisfies the EnrichmentStep protocol:

from stratum.enrichment import EnrichmentStep, EnrichmentRegistry
from stratum.models.output import CanonicalDocument

class MyEmbeddingEnricher:
    """Real embedding enricher."""

    name = "my-embedding"
    version = "1.0.0"

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        # Initialize embedding model...

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        """Add embeddings to chunks."""
        for chunk in document.chunks:
            # Generate embedding for chunk.text
            embedding = self._generate_embedding(chunk.text)
            # Store in chunk's enrichments field
            chunk.enrichments['embedding'] = embedding

        # Track enrichment metadata at document level
        document.document.enrichments.append({
            'name': self.name,
            'version': self.version,
            'timestamp': datetime.now().isoformat()
        })
        return document

    def supports_document(self, document: CanonicalDocument) -> bool:
        """Support all documents."""
        return True

    def _generate_embedding(self, text: str) -> list[float]:
        # Call embedding API...
        pass

# Register the enricher
registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)

Default Registered Enrichers¶

The global registry comes with the following enrichers:

Name	Description	LLM Required	Storage
`intra-document-reference`	Internal document reference detection	No (rule-based)	`chunk.enrichments['references']`
`table-summarization`	LLM-based table summarization	Yes (VLM, with fallback)	`chunk.enrichments['table_summary']`
`image-description`	VLM-based image description	Yes (VLM, with fallback)	`chunk.enrichments['image_description']`
`doc-summary`	Document-level summary attached to all chunks	Yes	`chunk.enrichments['doc_summary']`
`doc-context`	Per-chunk contextual description	Yes	`chunk.enrichments['doc_context']`
`cross-doc-context`	Context from similar chunks in other documents	Yes (LLM + embeddings)	`chunk.enrichments['cross_doc_context']`
`chunk-classification`	Per-chunk category label (taxonomy or freeform)	Yes	`chunk.enrichments['classification']`
`keyword`	TF-IDF keyword extraction (no LLM, multilingual)	No (scikit-learn)	`chunk.enrichments['keywords']`
`topic`	Topic discovery from keywords + assignment by overlap	Yes (1 LLM call)	`chunk.enrichments['topic']`
`noop`	No-operation enricher (testing)	No	metadata only
`embedding`	Vector embeddings	Placeholder (no-op)	-
`entity`	Named entity extraction	Placeholder (no-op)	-

Placeholders allow pipeline configs to reference future enrichers without errors.

LLM Provider Support¶

The doc-level enrichers (doc-summary, doc-context, cross-doc-context, chunk-classification, topic) support multiple LLM backends:

Provider	Config Value	API Key Env Var	Notes
OpenAI	`openai`	`OPENAI_API_KEY`	GPT-4o, GPT-4o-mini, etc.
Anthropic	`anthropic`	`ANTHROPIC_API_KEY`	Claude models
Google	`google`	`GOOGLE_API_KEY`	Gemini models
Local/Custom	`local`	`LOCAL_API_KEY` (optional)	Ollama, vLLM, any OpenAI-compatible. Requires `llm_base_url`.

All enrichers — including table-summarization and image-description — use the same llm_provider / llm_model / llm_base_url configuration as the doc-level enrichers. Configure them the same way in pipeline.yaml:

- type: enrichment
  name: table-summarization
  llm_provider: google
  llm_model: gemini-2.0-flash
  max_tokens: 300

- type: enrichment
  name: image-description
  llm_provider: local
  llm_model: /models/gemma-3-12b-it
  llm_base_url: http://localhost:8000/v1

`priority` / `extra_body`¶

When using vLLM with priority scheduling, add priority: N to any enrichment step. This is shorthand for extra_body: {"priority": N}:

- type: enrichment
  name: doc-context
  llm_provider: local
  llm_model: /models/my-model
  llm_base_url: http://localhost:8000/v1
  max_workers: 8
  priority: 5           # vLLM scheduling priority (lower = less priority)

extra_body can also be set explicitly for any additional vendor-specific fields to merge into every request body.

Intra-Document Reference Enricher¶

The intra-document-reference enricher detects and extracts internal document references (cross-references) from chunk text.

Supported Reference Types¶

English Patterns:

Type	Examples	Patterns
Section	Section 3, Section 2.4.3, §5, Sec. 1.2	Deep hierarchical numbering, alphanumeric suffixes
Figure	Figure 1, Fig. 2, Figures 1-3	Subfigures (3a, 3b), ranges, lists
Table	Table 1, Tab. 2, Tables 1-3	Alphanumeric (A1, B2), ranges
Appendix	Appendix A, App. B.1, Appendix C.1.2.3	Deep subsections
Equation	Equation 1, Eq. (5), Eqn. 3	Parenthesized, ranges
Algorithm	Algorithm 1, Alg. 2
Listing	Listing 1
Chapter	Chapter 1, Ch. 3	Ranges

French (fr) Patterns:

Type	Examples	Patterns
Section	Section 3, Partie 1, Paragraphe 2	Same as English + Partie, Paragraphe
Figure	Figure 1, Schéma 2, Graphique 3, Illustration 1	Schéma, Graphique, Encadré
Table	Tableau 1, Tab. 2, Tableaux 1-3	Tableau/Tableaux
Appendix	Annexe A, Annexe B.1	Annexe
Equation	Équation 1, Formule 2, Théorème 3, Lemme 1	Équation, Formule, Théorème, Lemme, Corollaire
Algorithm	Algorithme 1, Alg. 2	Algorithme
Listing	Exemple 1, Définition 2	Exemple, Définition
Chapter	Chapitre 1, Chap. 3	Chapitre

Other Language Packs:

Code	Language	Figure keyword	Table keyword	Section keyword
`de`	German	Abbildung, Abb.	Tabelle, Tab.	Abschnitt, Kap.
`es`	Spanish	Figura, Fig.	Tabla, Tab.	Sección, Cap.
`it`	Italian	Figura, Fig.	Tabella, Tab.	Sezione, Cap.
`pt`	Portuguese	Figura, Fig.	Tabela, Tab.	Seção, Cap.
`nl`	Dutch	Figuur, Fig.	Tabel, Tab.	Sectie, H.
`ru`	Russian	Рисунок, Рис.	Таблица, Табл.	Раздел, Гл.
`zh`	Chinese	图	表	节、章
`ar`	Arabic	شكل	جدول	قسم، فصل

Full language names (english, french, german, …) are also accepted as aliases.

Usage¶

from stratum.enrichment import IntraDocumentReferenceEnricher

# English only (default)
enricher = IntraDocumentReferenceEnricher(
    confidence_threshold=0.5,
    enabled_types=["section", "figure", "table"],
)

# French documents
enricher_fr = IntraDocumentReferenceEnricher(
    languages=["fr"],
    confidence_threshold=0.5,
)

# Multilingual (English + French)
enricher_multi = IntraDocumentReferenceEnricher(
    languages=["en", "fr"],
    confidence_threshold=0.5,
    custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)

enriched_doc = enricher.enrich(document)

# Access detected references
for chunk in enriched_doc.chunks:
    if "references" in chunk.enrichments:
        refs = chunk.enrichments["references"]["intra_document"]
        for ref in refs:
            print(f"{ref['type']}: {ref['normalized']} at {ref['position']}")

# Check enrichment metadata
for metadata in enriched_doc.document.enrichments:
    print(f"Applied: {metadata['name']} v{metadata['version']} at {metadata['timestamp']}")

Pipeline Configuration¶

steps:
  - type: enrichment
    name: intra-document-reference
    confidence_threshold: 0.5
    # Language support. Supported: en, fr, de, es, it, pt, nl, ru, zh, ar
    # Default: ['en']. Combine for multilingual documents.
    languages:
      - en
      - fr
    # Optionally restrict to specific types
    enabled_types:
      - section
      - figure
      - table
    # Or disable specific types
    disabled_types:
      - algorithm
    # High-level custom reference type names (auto-generates patterns).
    # Creates a new ref_type (lowercased) in the output.
    # Detects "Theorem 3", "Lemma 5.2", "Theorems 1 and 2", etc.
    custom_reference_types:
      - Theorem
      - Lemma
      - Proposition
    # Map custom keywords to existing ref_types so they participate in
    # resolution. E.g. if your document calls figures "Illustration":
    custom_type_mappings:
      Illustration: figure
      Diagram: figure
    # Add raw regex patterns for an existing type
    custom_patterns:
      section:
        - "\\bPart\\s+(\\d+)"

Custom Reference Type Names¶

For domain-specific references not in the built-in list, use custom_reference_types:

enricher = IntraDocumentReferenceEnricher(
    custom_reference_types=["Theorem", "Lemma", "Proposition"],
)

This auto-generates patterns that match Theorem 3, Theorems 2 and 3, Lemma 5.2, etc. Each name becomes a new ref_type (lowercased) in the output.

Mapping Custom Keywords to Existing Types¶

If your document uses non-standard names for figures, tables, etc., use custom_type_mappings to map the keyword to an existing ref_type:

enricher = IntraDocumentReferenceEnricher(
    custom_type_mappings={
        "Illustration": "figure",   # "Illustration 3" -> figure ref, gets resolved
        "Diagram": "figure",
        "Spreadsheet": "table",
    },
)

Unlike custom_reference_types, mapped keywords are merged into the target ref_type so they participate fully in resolution (i.e. resolved_to gets populated for figure/table, and for section/appendix the chunk is found via heading_path).

Custom Types File¶

Use custom_types_file to load type names from a file instead of embedding them in code:

# math_types.txt
Theorem              # creates new "theorem" ref_type
Lemma                # creates new "lemma" ref_type
Illustration:figure  # maps "Illustration" to existing figure type
Diagram:figure       # maps "Diagram" to existing figure type

enricher = IntraDocumentReferenceEnricher(
    custom_types_file="math_types.txt",
)

For raw regex control over patterns, use custom_patterns:

enricher = IntraDocumentReferenceEnricher(
    custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)

Reference Storage Format¶

References are stored in chunk.enrichments['references']['intra_document']:

{
    "type": "section",           # Reference type (section/figure/table/appendix/…)
    "raw_text": "Section 2.4.3", # Original text matched
    "normalized": "2.4.3",       # Normalized identifier (number, letter, etc.)
    "position": {
        "start": 4,              # Character offset within THIS chunk's text (not the full document)
        "end": 17
    },
    "confidence": 0.95,          # Confidence score (0.0-1.0)
    # For ranges:
    "is_range": true,
    "range_start": "2",
    "range_end": "5",
    # For lists:
    "is_list": true,
    "list_items": ["1", "2", "3"]
}

Note on position: start/end are character offsets within the chunk's own text field, not document-level offsets. Use chunk.text[ref["position"]["start"]:ref["position"]["end"]] to recover the matched text.

Note: Enrichment metadata is tracked at the document level in document.document.enrichments.

Features¶

Deep hierarchical numbering: Section 1.2.3.4.5.6
Alphanumeric suffixes: Section 2a, Figure 3b
Letter-prefixed sections: Section A.1, Section B.2.3
Range detection: Sections 2-5, Figures 1-3
List detection: Figures 1, 2, and 3
False positive filtering: Ignores "figure of speech", "section of the code"
Custom patterns: Add project-specific patterns via configuration
Reference resolution: Link references to actual table/figure artifacts
LLM validation (planned): Optional validation with Gemini/OpenAI

Reference Resolution¶

Enable resolve_references=True to link detected references to their targets in the document:

enricher = IntraDocumentReferenceEnricher(resolve_references=True)
result = enricher.enrich(document)

for chunk in result.chunks:
    if "references" in chunk.enrichments:
        for ref in chunk.enrichments["references"]["intra_document"]:
            if "resolved_to" in ref:
                print(f"{ref['raw_text']} -> chunk {ref['resolved_to']['chunk_id']}")

What gets resolved:

Type	How it resolves	`artifact_id`
`figure`	Matches caption text in chunks with images	Image artifact ID (e.g. `doc_image_001`)
`table`	Matches caption text in chunks with tables	Table artifact ID (e.g. `doc_table_001`)
`section`	Matches section number in `chunk.heading_path`	`section_<number>` (synthetic)
`appendix`	Matches appendix letter in `chunk.heading_path`	`appendix_<letter>` (synthetic)
custom (mapped to figure/table)	Same as figure/table	Same as figure/table

Section and appendix resolution works via the document's heading structure — each chunk that begins a new section has a heading_path (e.g. ["Chapter 2", "2.3 Experimental Setup"]), and the section number is extracted from there.

Resolved reference format:

{
    "type": "table",
    "raw_text": "Table 1",
    "normalized": "1",
    "resolved_to": {
        "artifact_id": "doc_table_001",   # artifact or synthetic section ID
        "chunk_id": "doc_chunk_005"        # chunk containing the target
    }
}

If a reference cannot be resolved (e.g. the referenced section is in a different document, or its number doesn't match any heading), resolved_to is omitted.

Table Summarization Enricher¶

The table-summarization enricher uses LLM to generate concise summaries of tables.

Configuration¶

Requires OpenAI-compatible API. Set environment variables:

# Option 1: Direct LLM config
LLM_API_KEY=your-api-key
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-2.0-flash

# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key

Usage¶

from stratum.enrichment.table import TableSummarizationEnricher

# Create from environment
enricher = TableSummarizationEnricher.from_env()

# Or with explicit config
from stratum.enrichment.llm_client import LLMConfig

config = LLMConfig(
    api_key="your-key",
    base_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
)
enricher = TableSummarizationEnricher(config=config)

# Enrich document
result = enricher.enrich(document)

# Access summaries
for chunk in result.chunks:
    if "table_summary" in chunk.enrichments:
        print(chunk.enrichments["table_summary"])

Fallback Mode¶

When LLM is unavailable, falls back to header extraction:

# Fallback only (no LLM)
enricher = TableSummarizationEnricher(config=None, fallback_enabled=True)

Pipeline Configuration¶

steps:
  - type: enrichment
    name: table-summarization
    fallback_enabled: true

Image Description Enricher¶

The image-description enricher uses VLM (Vision Language Model) to describe images.

Configuration¶

# Option 1: Direct VLM config
VLM_API_KEY=your-api-key
VLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
VLM_MODEL=gemini-2.0-flash

# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key

Usage¶

from stratum.enrichment.image import ImageDescriptionEnricher

# Create from environment
enricher = ImageDescriptionEnricher.from_env()

# Enrich document
result = enricher.enrich(document)

# Access descriptions
for chunk in result.chunks:
    if "image_description" in chunk.enrichments:
        print(chunk.enrichments["image_description"])

Fallback Mode¶

When VLM is unavailable, uses image captions from parser:

enricher = ImageDescriptionEnricher(config=None, fallback_enabled=True)

Pipeline Configuration¶

steps:
  - type: enrichment
    name: image-description
    fallback_enabled: true

Document Summary Enricher¶

The doc-summary enricher generates a document-level summary using an LLM and stores it in every chunk's enrichments['doc_summary']. This allows retrieval systems to include document context alongside individual chunks.

Usage¶

from stratum.enrichment import DocSummaryEnricher

enricher = DocSummaryEnricher(
    llm_provider="openai",       # or "anthropic", "google", "local"
    llm_model="gpt-4o-mini",
    max_doc_chars=50000,         # truncate long documents
    max_tokens=512,
    temperature=0.3,
)

result = enricher.enrich(document)

# Same summary in every chunk
for chunk in result.chunks:
    print(chunk.enrichments['doc_summary'])

Pipeline Configuration¶

steps:
  - type: enrichment
    name: doc-summary
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    max_doc_chars: 50000
    max_tokens: 512
    temperature: 0.3

Document Context Enricher¶

The doc-context enricher generates per-chunk contextual descriptions, explaining each chunk in relation to its surrounding content or the broader document. Results are stored in chunk.enrichments[context_key] (default key: doc_context).

Two Modes¶

Mode	Description	When to use
`neighbors` (default)	Summarizes each chunk in light of its surrounding chunks (positional window).	Fast. Good for section-level context.
`document`	Explains each chunk in light of the full document text (possibly truncated).	Richer context. Higher token cost.

Neighbors mode: context window = n_neighbors chunks on each side of the current chunk.

Document mode: the full document text (concatenated chunks) is passed to the LLM, truncated to max_doc_chars if set.

Multiple doc-context steps¶

Use context_key to run two doc-context enrichers in the same pipeline without one overwriting the other:

steps:
  - type: enrichment
    name: doc-context
    mode: neighbors
    n_neighbors: 3
    context_key: doc_context_neighbors    # stored in chunk.enrichments['doc_context_neighbors']
    llm_provider: google
    llm_model: gemini-2.0-flash

  - type: enrichment
    name: doc-context
    mode: document
    context_key: doc_context_document     # stored in chunk.enrichments['doc_context_document']
    max_doc_chars: 30000
    llm_provider: google
    llm_model: gemini-2.0-flash

Without context_key, the default key doc_context is used and the second step will overwrite the first.

Usage¶

from stratum.enrichment import DocContextEnricher

enricher = DocContextEnricher(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    mode="neighbors",          # or "document"
    n_neighbors=3,             # chunks on each side (neighbors mode)
    max_doc_chars=50000,       # truncate in document mode
    max_tokens=512,
    temperature=0.3,
    max_workers=16,            # parallel LLM requests
    context_key="doc_context", # output key in chunk.enrichments
)

result = enricher.enrich(document)

for chunk in result.chunks:
    print(chunk.enrichments['doc_context'])   # unique context per chunk

Pipeline Configuration¶

steps:
  - type: enrichment
    name: doc-context
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    mode: neighbors
    n_neighbors: 3
    max_tokens: 512
    max_workers: 16
    context_key: doc_context    # optional; default is "doc_context"

Cross-Document Context Enricher¶

The cross-doc-context enricher finds similar chunks from OTHER documents using a FAISS vector index, then generates contextual descriptions. Useful for corpus-level enrichment.

Workflow¶

All chunks are embedded and stored in a FAISS vector index
For each chunk, similar chunks from other documents are retrieved
An LLM generates context based on the similar chunks

Usage¶

from stratum.enrichment import CrossDocContextEnricher

enricher = CrossDocContextEnricher(
    vector_store_index_path="db/corpus/faiss.index",
    vector_store_metadata_path="db/corpus/metadata.pkl",
    embedding_provider="local",
    embedding_model="nomic-embed-text",
    embedding_base_url="http://localhost:11434/v1",
    llm_provider="local",
    llm_model="gemma3:4b",
    llm_base_url="http://localhost:11434/v1",
    n_neighbors=5,
    max_workers=16,
)

# Process multiple documents — index builds incrementally
for doc in documents:
    enriched = enricher.enrich(doc)

# Access results
for chunk in enriched.chunks:
    if "cross_doc_context" in chunk.enrichments:
        ctx = chunk.enrichments["cross_doc_context"]
        print(ctx["context"])       # LLM-generated context
        print(ctx["similar_docs"])  # list of doc_ids

Pipeline Configuration¶

steps:
  - type: enrichment
    name: cross-doc-context
    vector_store_index_path: db/corpus/faiss.index
    vector_store_metadata_path: db/corpus/metadata.pkl
    embedding_provider: local
    embedding_model: nomic-embed-text
    embedding_base_url: http://localhost:11434/v1
    embedding_batch_size: 32
    llm_provider: local
    llm_model: gemma3:4b
    llm_base_url: http://localhost:11434/v1
    n_neighbors: 5
    max_tokens: 512
    max_workers: 16

Custom Prompt Files¶

All LLM-based enrichers that use a PromptBuilder class accept a prompt_file parameter. This overrides the built-in hardcoded prompt with a YAML file you control.

YAML format¶

# system prompt (optional)
system: |
  You are a precise document analyst. Be concise and factual.

# user_template (required) — use {variable} placeholders
user_template: |
  Summarize this document in 3 sentences.

  DOCUMENT:
  {document_text}

# dspy_signature:  # reserved for future DSPy migration

The system key is optional — if absent, no system message is sent. The user_template key is required and must contain all {variable} placeholders expected by the enricher.

Supported enrichers and their template variables¶

Enricher	Template variables
`doc-summary`	`{document_text}`
`doc-context` (neighbors mode)	`{previous_chunks}`, `{current_chunk}`, `{following_chunks}`
`doc-context` (document mode)	`{document_text}`, `{current_chunk}`
`chunk-classification` (taxonomy)	`{chunk_text}`, `{categories}`
`chunk-classification` (freeform)	`{chunk_text}`
`topic`	`{keywords_text}`
`cross-doc-context`	`{context_block}`, `{target_text}`

Pipeline YAML usage¶

- type: enrichment
  name: doc-summary
  llm_provider: google
  llm_model: gemini-2.0-flash
  prompt_file: prompts/my_summary.yaml

- type: enrichment
  name: chunk-classification
  llm_provider: openai
  llm_model: gpt-4o-mini
  prompt_file: prompts/my_classification.yaml

Default YAML prompts¶

Ready-to-use reference prompts are bundled in stratum/enrichment/prompts/:

stratum/enrichment/prompts/
├── __init__.py                    # load_prompt_file() helper
├── doc_summary.yaml
├── doc_context_neighbors.yaml
├── doc_context_document.yaml
├── classification.yaml
├── classification_freeform.yaml
├── topic.yaml
└── cross_doc_context.yaml

Copy and customise any of these as a starting point.

DSPy readiness¶

The # dspy_signature: comment in YAML files is reserved for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm, so migration will only require filling in that key — no structural changes.

Module Structure¶

stratum/enrichment/
├── __init__.py        # Public exports (all enrichers)
├── protocols.py       # EnrichmentStep protocol
├── base.py            # BaseEnrichmentStep abstract class
├── base_prompt.py     # BasePromptBuilder for prompt construction
├── registry.py        # EnrichmentRegistry with global singleton
├── pipeline.py        # EnrichmentPipeline for chaining
├── noop.py            # NoOpEnricher implementation
├── llm_client.py      # OpenAI-compatible LLM/VLM client (table/image)
├── prompts/           # Bundled YAML prompt files + load_prompt_file()
│   ├── __init__.py    # load_prompt_file(path) helper
│   ├── doc_summary.yaml
│   ├── doc_context_neighbors.yaml
│   ├── doc_context_document.yaml
│   ├── classification.yaml
│   ├── classification_freeform.yaml
│   ├── topic.yaml
│   └── cross_doc_context.yaml
├── llm/               # Multi-provider LLM abstraction (doc-level)
│   ├── __init__.py
│   ├── base.py        # LLMClient ABC + LLMResponse dataclass
│   ├── factory.py     # create_llm() factory function
│   ├── openai_compatible.py  # OpenAI/local provider
│   ├── anthropic_client.py   # Anthropic Claude provider
│   └── google_client.py      # Google Gemini provider
├── reference/         # Intra-document reference detection
│   ├── __init__.py
│   ├── patterns.py    # Configurable regex patterns (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│   ├── detector.py    # ReferenceDetector class
│   ├── block_index.py # BlockIndex for reference resolution
│   └── enricher.py    # IntraDocumentReferenceEnricher
├── table/             # Table summarization
│   ├── __init__.py
│   └── summarizer.py  # TableSummarizationEnricher
├── image/             # Image description
│   ├── __init__.py
│   └── describer.py   # ImageDescriptionEnricher
├── doc_summary/       # Document-level summary
│   ├── __init__.py
│   ├── enricher.py    # DocSummaryEnricher
│   └── prompt.py      # DocSummaryPromptBuilder
├── doc_context/       # Per-chunk contextual description
│   ├── __init__.py
│   ├── enricher.py    # DocContextEnricher
│   └── prompt.py      # DocContextPromptBuilder
├── cross_doc_context/ # Cross-document context
│   ├── __init__.py
│   ├── enricher.py    # CrossDocContextEnricher
│   └── prompt.py      # CrossDocContextPromptBuilder
├── classification/    # Chunk classification
│   ├── __init__.py
│   ├── enricher.py    # ChunkClassificationEnricher
│   └── prompt.py      # ClassificationPromptBuilder
├── keyword/           # TF-IDF keyword extraction (no LLM)
│   ├── __init__.py
│   ├── extractor.py   # KeywordExtractor (scikit-learn)
│   └── enricher.py    # KeywordEnricher
├── topic/             # Topic discovery and assignment
│   ├── __init__.py
│   ├── prompt.py      # TopicDiscoveryPromptBuilder
│   └── enricher.py    # TopicEnricher
└── context/           # Vector index for cross-doc context
    ├── __init__.py
    └── index.py       # FAISS vector index

EnrichmentStep Protocol¶

from typing import Protocol
from stratum.models.output import CanonicalDocument

class EnrichmentStep(Protocol):
    """Protocol for document enrichers."""

    name: str
    version: str

    def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
        """Enrich document with additional metadata."""
        ...

    def supports_document(self, document: CanonicalDocument) -> bool:
        """Check if enricher supports this document."""
        ...

Verbose Mode¶

With -v flag, CLI shows enrichment progress:

$ stratum doc.pdf --pipeline-config pipeline.yaml -v
Loaded pipeline config: enrichment-pipeline v1.0.0
Applied enricher: embedding
Applied enricher: summary
Processed: doc.pdf
Chunks: 15

Architecture - System architecture and pipeline documentation

Enrichment Component¶

Overview¶

Current Architecture¶

Usage¶

CLI Flags (Quick Start)¶

CLI with Pipeline Config¶

Pipeline Configuration¶

Python API¶

Implementing Custom Enrichers¶

Default Registered Enrichers¶

LLM Provider Support¶

priority / extra_body¶

Intra-Document Reference Enricher¶

Supported Reference Types¶

Usage¶

Pipeline Configuration¶

Custom Reference Type Names¶

Mapping Custom Keywords to Existing Types¶

Custom Types File¶

Reference Storage Format¶

Features¶

Reference Resolution¶

Table Summarization Enricher¶

Configuration¶

Usage¶

Fallback Mode¶

Pipeline Configuration¶

Image Description Enricher¶

Configuration¶

Usage¶

Fallback Mode¶

Pipeline Configuration¶

Document Summary Enricher¶

Usage¶

Pipeline Configuration¶

Document Context Enricher¶

Two Modes¶

Multiple doc-context steps¶

Usage¶

Pipeline Configuration¶

Cross-Document Context Enricher¶

Workflow¶

Usage¶

Pipeline Configuration¶

Custom Prompt Files¶

YAML format¶

Supported enrichers and their template variables¶

Pipeline YAML usage¶

Default YAML prompts¶

DSPy readiness¶

Module Structure¶

EnrichmentStep Protocol¶

Verbose Mode¶

Related¶

`priority` / `extra_body`¶