Enrichment Component¶
Status: Fully implemented with 10 enrichers (+ 2 placeholders)
Overview¶
The enrichment component adds metadata to chunks without modifying the canonical document structure. Multiple enrichers are available for reference detection, table/image understanding, document summarization, cross-document context, chunk classification, keyword extraction, and topic discovery.
Current Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ CLI / Pipeline / Python API │
│ │
│ --pipeline-config pipeline.yaml │
│ --enrich-doc-summary --enrich-doc-context │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EnrichmentRegistry + EnrichmentPipeline │
│ │
│ - Loads enrichers from global registry │
│ - Applies each enricher sequentially │
│ - Supports fail-fast or continue-on-error modes │
└─────────────────────────────────────────────────────────────┘
│
┌──────────┬──────────┬───┴────┬──────────┬──────────┐
▼ ▼ ▼ ▼ ▼ ▼
┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐
│ ref ││ table ││ image ││doc-sum ││doc-ctx ││cross- │
│detect ││summary ││describe││ mary ││ ext ││doc-ctx │
│ ││ ││ ││ ││ ││ │
│No LLM ││VLM/LLM││VLM/LLM ││ LLM ││ LLM ││LLM+emb│
└────────┘└────────┘└────────┘└────────┘└────────┘└────────┘
Usage¶
CLI Flags (Quick Start)¶
# Document summary (requires --llm-provider and --llm-model)
stratum document.pdf \
--enrich-doc-summary \
--llm-provider openai --llm-model gpt-4o-mini \
-o output.json
# Document summary + per-chunk context
stratum document.pdf \
--enrich-doc-summary --enrich-doc-context \
--doc-context-mode neighbors --doc-context-n-neighbors 3 \
--llm-provider local --llm-model gemma3:4b \
--llm-base-url http://localhost:11434/v1 \
-o output.json
# Table/image enrichers use environment variables
LLM_API_KEY=your-key stratum document.pdf \
--pipeline-config examples/arxiv_paper/pipeline.yaml \
-o output.json -v
CLI with Pipeline Config¶
# Use pipeline config with enrichment steps
stratum document.pdf --pipeline-config pipeline.yaml -o output.json -v
Pipeline Configuration¶
# pipeline.yaml
name: enrichment-pipeline
version: "1.0.0"
steps:
- type: parser
name: docling
- type: chunker
target_size: 1000
# Reference detection (no LLM needed)
- type: enrichment
name: intra-document-reference
confidence_threshold: 0.5
enabled_types: [section, figure, table]
# Document summary (LLM-based)
- type: enrichment
name: doc-summary
llm_provider: openai
llm_model: gpt-4o-mini
max_doc_chars: 50000
max_tokens: 512
# Per-chunk context (LLM-based, parallel)
- type: enrichment
name: doc-context
llm_provider: openai
llm_model: gpt-4o-mini
mode: neighbors
n_neighbors: 3
max_workers: 16
# Table summarization (uses LLM_API_KEY env var)
- type: enrichment
name: table-summarization
# Image description (uses VLM_API_KEY env var)
- type: enrichment
name: image-description
See examples/*/pipeline.yaml for complete pipeline configurations with real output.
Python API¶
from stratum.enrichment import (
EnrichmentRegistry,
DocSummaryEnricher,
DocContextEnricher,
TableSummarizationEnricher,
IntraDocumentReferenceEnricher,
)
# Get global registry with all default enrichers
registry = EnrichmentRegistry.get_global()
print(registry.list_enrichers())
# ['chunk-classification', 'cross-doc-context', 'doc-context', 'doc-summary',
# 'embedding', 'entity', 'image-description', 'intra-document-reference',
# 'keyword', 'noop', 'table-summarization', 'topic']
# Via registry (uses config dict)
enricher = registry.get("doc-summary", {
"llm_provider": "openai",
"llm_model": "gpt-4o-mini",
})
enriched_doc = enricher.enrich(doc)
# Or instantiate directly
enricher = DocSummaryEnricher(
llm_provider="openai",
llm_model="gpt-4o-mini",
max_doc_chars=50000,
)
enriched_doc = enricher.enrich(doc)
# Access results — all enrichments live in chunk.enrichments dict
for chunk in enriched_doc.chunks:
print(chunk.enrichments['doc_summary']) # Document summary (same for all chunks)
print(chunk.enrichments['doc_context']) # Per-chunk context
print(chunk.enrichments.get('keywords')) # TF-IDF keywords (if keyword enricher ran)
print(chunk.enrichments.get('topic')) # Topic label (if topic enricher ran)
Implementing Custom Enrichers¶
To implement a real enricher, create a class that satisfies the EnrichmentStep protocol:
from stratum.enrichment import EnrichmentStep, EnrichmentRegistry
from stratum.models.output import CanonicalDocument
class MyEmbeddingEnricher:
"""Real embedding enricher."""
name = "my-embedding"
version = "1.0.0"
def __init__(self, model: str = "text-embedding-3-small"):
self.model = model
# Initialize embedding model...
def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
"""Add embeddings to chunks."""
for chunk in document.chunks:
# Generate embedding for chunk.text
embedding = self._generate_embedding(chunk.text)
# Store in chunk's enrichments field
chunk.enrichments['embedding'] = embedding
# Track enrichment metadata at document level
document.document.enrichments.append({
'name': self.name,
'version': self.version,
'timestamp': datetime.now().isoformat()
})
return document
def supports_document(self, document: CanonicalDocument) -> bool:
"""Support all documents."""
return True
def _generate_embedding(self, text: str) -> list[float]:
# Call embedding API...
pass
# Register the enricher
registry = EnrichmentRegistry.get_global()
registry.register("my-embedding", MyEmbeddingEnricher)
Default Registered Enrichers¶
The global registry comes with the following enrichers:
| Name | Description | LLM Required | Storage |
|---|---|---|---|
intra-document-reference |
Internal document reference detection | No (rule-based) | chunk.enrichments['references'] |
table-summarization |
LLM-based table summarization | Yes (VLM, with fallback) | chunk.enrichments['table_summary'] |
image-description |
VLM-based image description | Yes (VLM, with fallback) | chunk.enrichments['image_description'] |
doc-summary |
Document-level summary attached to all chunks | Yes | chunk.enrichments['doc_summary'] |
doc-context |
Per-chunk contextual description | Yes | chunk.enrichments['doc_context'] |
cross-doc-context |
Context from similar chunks in other documents | Yes (LLM + embeddings) | chunk.enrichments['cross_doc_context'] |
chunk-classification |
Per-chunk category label (taxonomy or freeform) | Yes | chunk.enrichments['classification'] |
keyword |
TF-IDF keyword extraction (no LLM, multilingual) | No (scikit-learn) | chunk.enrichments['keywords'] |
topic |
Topic discovery from keywords + assignment by overlap | Yes (1 LLM call) | chunk.enrichments['topic'] |
noop |
No-operation enricher (testing) | No | metadata only |
embedding |
Vector embeddings | Placeholder (no-op) | - |
entity |
Named entity extraction | Placeholder (no-op) | - |
Placeholders allow pipeline configs to reference future enrichers without errors.
LLM Provider Support¶
The doc-level enrichers (doc-summary, doc-context, cross-doc-context, chunk-classification, topic) support multiple LLM backends:
| Provider | Config Value | API Key Env Var | Notes |
|---|---|---|---|
| OpenAI | openai |
OPENAI_API_KEY |
GPT-4o, GPT-4o-mini, etc. |
| Anthropic | anthropic |
ANTHROPIC_API_KEY |
Claude models |
google |
GOOGLE_API_KEY |
Gemini models | |
| Local/Custom | local |
LOCAL_API_KEY (optional) |
Ollama, vLLM, any OpenAI-compatible. Requires llm_base_url. |
All enrichers — including table-summarization and image-description — use the same llm_provider / llm_model / llm_base_url configuration as the doc-level enrichers. Configure them the same way in pipeline.yaml:
- type: enrichment
name: table-summarization
llm_provider: google
llm_model: gemini-2.0-flash
max_tokens: 300
- type: enrichment
name: image-description
llm_provider: local
llm_model: /models/gemma-3-12b-it
llm_base_url: http://localhost:8000/v1
priority / extra_body¶
When using vLLM with priority scheduling, add priority: N to any enrichment step. This is shorthand for extra_body: {"priority": N}:
- type: enrichment
name: doc-context
llm_provider: local
llm_model: /models/my-model
llm_base_url: http://localhost:8000/v1
max_workers: 8
priority: 5 # vLLM scheduling priority (lower = less priority)
extra_body can also be set explicitly for any additional vendor-specific fields to merge into every request body.
Intra-Document Reference Enricher¶
The intra-document-reference enricher detects and extracts internal document references (cross-references) from chunk text.
Supported Reference Types¶
English Patterns:
| Type | Examples | Patterns |
|---|---|---|
| Section | Section 3, Section 2.4.3, §5, Sec. 1.2 | Deep hierarchical numbering, alphanumeric suffixes |
| Figure | Figure 1, Fig. 2, Figures 1-3 | Subfigures (3a, 3b), ranges, lists |
| Table | Table 1, Tab. 2, Tables 1-3 | Alphanumeric (A1, B2), ranges |
| Appendix | Appendix A, App. B.1, Appendix C.1.2.3 | Deep subsections |
| Equation | Equation 1, Eq. (5), Eqn. 3 | Parenthesized, ranges |
| Algorithm | Algorithm 1, Alg. 2 | |
| Listing | Listing 1 | |
| Chapter | Chapter 1, Ch. 3 | Ranges |
French (fr) Patterns:
| Type | Examples | Patterns |
|---|---|---|
| Section | Section 3, Partie 1, Paragraphe 2 | Same as English + Partie, Paragraphe |
| Figure | Figure 1, Schéma 2, Graphique 3, Illustration 1 | Schéma, Graphique, Encadré |
| Table | Tableau 1, Tab. 2, Tableaux 1-3 | Tableau/Tableaux |
| Appendix | Annexe A, Annexe B.1 | Annexe |
| Equation | Équation 1, Formule 2, Théorème 3, Lemme 1 | Équation, Formule, Théorème, Lemme, Corollaire |
| Algorithm | Algorithme 1, Alg. 2 | Algorithme |
| Listing | Exemple 1, Définition 2 | Exemple, Définition |
| Chapter | Chapitre 1, Chap. 3 | Chapitre |
Other Language Packs:
| Code | Language | Figure keyword | Table keyword | Section keyword |
|---|---|---|---|---|
de |
German | Abbildung, Abb. | Tabelle, Tab. | Abschnitt, Kap. |
es |
Spanish | Figura, Fig. | Tabla, Tab. | Sección, Cap. |
it |
Italian | Figura, Fig. | Tabella, Tab. | Sezione, Cap. |
pt |
Portuguese | Figura, Fig. | Tabela, Tab. | Seção, Cap. |
nl |
Dutch | Figuur, Fig. | Tabel, Tab. | Sectie, H. |
ru |
Russian | Рисунок, Рис. | Таблица, Табл. | Раздел, Гл. |
zh |
Chinese | 图 | 表 | 节、章 |
ar |
Arabic | شكل | جدول | قسم، فصل |
Full language names (english, french, german, …) are also accepted as aliases.
Usage¶
from stratum.enrichment import IntraDocumentReferenceEnricher
# English only (default)
enricher = IntraDocumentReferenceEnricher(
confidence_threshold=0.5,
enabled_types=["section", "figure", "table"],
)
# French documents
enricher_fr = IntraDocumentReferenceEnricher(
languages=["fr"],
confidence_threshold=0.5,
)
# Multilingual (English + French)
enricher_multi = IntraDocumentReferenceEnricher(
languages=["en", "fr"],
confidence_threshold=0.5,
custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)
enriched_doc = enricher.enrich(document)
# Access detected references
for chunk in enriched_doc.chunks:
if "references" in chunk.enrichments:
refs = chunk.enrichments["references"]["intra_document"]
for ref in refs:
print(f"{ref['type']}: {ref['normalized']} at {ref['position']}")
# Check enrichment metadata
for metadata in enriched_doc.document.enrichments:
print(f"Applied: {metadata['name']} v{metadata['version']} at {metadata['timestamp']}")
Pipeline Configuration¶
steps:
- type: enrichment
name: intra-document-reference
confidence_threshold: 0.5
# Language support. Supported: en, fr, de, es, it, pt, nl, ru, zh, ar
# Default: ['en']. Combine for multilingual documents.
languages:
- en
- fr
# Optionally restrict to specific types
enabled_types:
- section
- figure
- table
# Or disable specific types
disabled_types:
- algorithm
# High-level custom reference type names (auto-generates patterns).
# Creates a new ref_type (lowercased) in the output.
# Detects "Theorem 3", "Lemma 5.2", "Theorems 1 and 2", etc.
custom_reference_types:
- Theorem
- Lemma
- Proposition
# Map custom keywords to existing ref_types so they participate in
# resolution. E.g. if your document calls figures "Illustration":
custom_type_mappings:
Illustration: figure
Diagram: figure
# Add raw regex patterns for an existing type
custom_patterns:
section:
- "\\bPart\\s+(\\d+)"
Custom Reference Type Names¶
For domain-specific references not in the built-in list, use custom_reference_types:
enricher = IntraDocumentReferenceEnricher(
custom_reference_types=["Theorem", "Lemma", "Proposition"],
)
This auto-generates patterns that match Theorem 3, Theorems 2 and 3, Lemma 5.2, etc. Each name becomes a new ref_type (lowercased) in the output.
Mapping Custom Keywords to Existing Types¶
If your document uses non-standard names for figures, tables, etc., use custom_type_mappings to map the keyword to an existing ref_type:
enricher = IntraDocumentReferenceEnricher(
custom_type_mappings={
"Illustration": "figure", # "Illustration 3" -> figure ref, gets resolved
"Diagram": "figure",
"Spreadsheet": "table",
},
)
Unlike custom_reference_types, mapped keywords are merged into the target ref_type so they participate fully in resolution (i.e. resolved_to gets populated for figure/table, and for section/appendix the chunk is found via heading_path).
Custom Types File¶
Use custom_types_file to load type names from a file instead of embedding them in code:
# math_types.txt
Theorem # creates new "theorem" ref_type
Lemma # creates new "lemma" ref_type
Illustration:figure # maps "Illustration" to existing figure type
Diagram:figure # maps "Diagram" to existing figure type
enricher = IntraDocumentReferenceEnricher(
custom_types_file="math_types.txt",
)
For raw regex control over patterns, use custom_patterns:
enricher = IntraDocumentReferenceEnricher(
custom_patterns={"section": [r"\bPart\s+(\d+)"]},
)
Reference Storage Format¶
References are stored in chunk.enrichments['references']['intra_document']:
{
"type": "section", # Reference type (section/figure/table/appendix/…)
"raw_text": "Section 2.4.3", # Original text matched
"normalized": "2.4.3", # Normalized identifier (number, letter, etc.)
"position": {
"start": 4, # Character offset within THIS chunk's text (not the full document)
"end": 17
},
"confidence": 0.95, # Confidence score (0.0-1.0)
# For ranges:
"is_range": true,
"range_start": "2",
"range_end": "5",
# For lists:
"is_list": true,
"list_items": ["1", "2", "3"]
}
Note on position: start/end are character offsets within the chunk's own text field, not document-level offsets. Use chunk.text[ref["position"]["start"]:ref["position"]["end"]] to recover the matched text.
Note: Enrichment metadata is tracked at the document level in document.document.enrichments.
Features¶
- Deep hierarchical numbering: Section 1.2.3.4.5.6
- Alphanumeric suffixes: Section 2a, Figure 3b
- Letter-prefixed sections: Section A.1, Section B.2.3
- Range detection: Sections 2-5, Figures 1-3
- List detection: Figures 1, 2, and 3
- False positive filtering: Ignores "figure of speech", "section of the code"
- Custom patterns: Add project-specific patterns via configuration
- Reference resolution: Link references to actual table/figure artifacts
- LLM validation (planned): Optional validation with Gemini/OpenAI
Reference Resolution¶
Enable resolve_references=True to link detected references to their targets in the document:
enricher = IntraDocumentReferenceEnricher(resolve_references=True)
result = enricher.enrich(document)
for chunk in result.chunks:
if "references" in chunk.enrichments:
for ref in chunk.enrichments["references"]["intra_document"]:
if "resolved_to" in ref:
print(f"{ref['raw_text']} -> chunk {ref['resolved_to']['chunk_id']}")
What gets resolved:
| Type | How it resolves | artifact_id |
|---|---|---|
figure |
Matches caption text in chunks with images | Image artifact ID (e.g. doc_image_001) |
table |
Matches caption text in chunks with tables | Table artifact ID (e.g. doc_table_001) |
section |
Matches section number in chunk.heading_path |
section_<number> (synthetic) |
appendix |
Matches appendix letter in chunk.heading_path |
appendix_<letter> (synthetic) |
| custom (mapped to figure/table) | Same as figure/table | Same as figure/table |
Section and appendix resolution works via the document's heading structure — each chunk that begins a new section has a heading_path (e.g. ["Chapter 2", "2.3 Experimental Setup"]), and the section number is extracted from there.
Resolved reference format:
{
"type": "table",
"raw_text": "Table 1",
"normalized": "1",
"resolved_to": {
"artifact_id": "doc_table_001", # artifact or synthetic section ID
"chunk_id": "doc_chunk_005" # chunk containing the target
}
}
If a reference cannot be resolved (e.g. the referenced section is in a different document, or its number doesn't match any heading), resolved_to is omitted.
Table Summarization Enricher¶
The table-summarization enricher uses LLM to generate concise summaries of tables.
Configuration¶
Requires OpenAI-compatible API. Set environment variables:
# Option 1: Direct LLM config
LLM_API_KEY=your-api-key
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-2.0-flash
# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key
Usage¶
from stratum.enrichment.table import TableSummarizationEnricher
# Create from environment
enricher = TableSummarizationEnricher.from_env()
# Or with explicit config
from stratum.enrichment.llm_client import LLMConfig
config = LLMConfig(
api_key="your-key",
base_url="https://api.openai.com/v1",
model="gpt-4o-mini",
)
enricher = TableSummarizationEnricher(config=config)
# Enrich document
result = enricher.enrich(document)
# Access summaries
for chunk in result.chunks:
if "table_summary" in chunk.enrichments:
print(chunk.enrichments["table_summary"])
Fallback Mode¶
When LLM is unavailable, falls back to header extraction:
# Fallback only (no LLM)
enricher = TableSummarizationEnricher(config=None, fallback_enabled=True)
Pipeline Configuration¶
steps:
- type: enrichment
name: table-summarization
fallback_enabled: true
Image Description Enricher¶
The image-description enricher uses VLM (Vision Language Model) to describe images.
Configuration¶
# Option 1: Direct VLM config
VLM_API_KEY=your-api-key
VLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
VLM_MODEL=gemini-2.0-flash
# Option 2: Google API key (auto-configures Gemini)
GOOGLE_API_KEY=your-google-api-key
Usage¶
from stratum.enrichment.image import ImageDescriptionEnricher
# Create from environment
enricher = ImageDescriptionEnricher.from_env()
# Enrich document
result = enricher.enrich(document)
# Access descriptions
for chunk in result.chunks:
if "image_description" in chunk.enrichments:
print(chunk.enrichments["image_description"])
Fallback Mode¶
When VLM is unavailable, uses image captions from parser:
enricher = ImageDescriptionEnricher(config=None, fallback_enabled=True)
Pipeline Configuration¶
steps:
- type: enrichment
name: image-description
fallback_enabled: true
Document Summary Enricher¶
The doc-summary enricher generates a document-level summary using an LLM and stores it in every chunk's enrichments['doc_summary']. This allows retrieval systems to include document context alongside individual chunks.
Usage¶
from stratum.enrichment import DocSummaryEnricher
enricher = DocSummaryEnricher(
llm_provider="openai", # or "anthropic", "google", "local"
llm_model="gpt-4o-mini",
max_doc_chars=50000, # truncate long documents
max_tokens=512,
temperature=0.3,
)
result = enricher.enrich(document)
# Same summary in every chunk
for chunk in result.chunks:
print(chunk.enrichments['doc_summary'])
Pipeline Configuration¶
steps:
- type: enrichment
name: doc-summary
llm_provider: local
llm_model: gemma3:4b
llm_base_url: http://localhost:11434/v1
max_doc_chars: 50000
max_tokens: 512
temperature: 0.3
Document Context Enricher¶
The doc-context enricher generates per-chunk contextual descriptions, explaining each chunk in relation to its surrounding content or the broader document. Results are stored in chunk.enrichments[context_key] (default key: doc_context).
Two Modes¶
| Mode | Description | When to use |
|---|---|---|
neighbors (default) |
Summarizes each chunk in light of its surrounding chunks (positional window). | Fast. Good for section-level context. |
document |
Explains each chunk in light of the full document text (possibly truncated). | Richer context. Higher token cost. |
Neighbors mode: context window = n_neighbors chunks on each side of the current chunk.
Document mode: the full document text (concatenated chunks) is passed to the LLM, truncated to max_doc_chars if set.
Multiple doc-context steps¶
Use context_key to run two doc-context enrichers in the same pipeline without one overwriting the other:
steps:
- type: enrichment
name: doc-context
mode: neighbors
n_neighbors: 3
context_key: doc_context_neighbors # stored in chunk.enrichments['doc_context_neighbors']
llm_provider: google
llm_model: gemini-2.0-flash
- type: enrichment
name: doc-context
mode: document
context_key: doc_context_document # stored in chunk.enrichments['doc_context_document']
max_doc_chars: 30000
llm_provider: google
llm_model: gemini-2.0-flash
Without context_key, the default key doc_context is used and the second step will overwrite the first.
Usage¶
from stratum.enrichment import DocContextEnricher
enricher = DocContextEnricher(
llm_provider="openai",
llm_model="gpt-4o-mini",
mode="neighbors", # or "document"
n_neighbors=3, # chunks on each side (neighbors mode)
max_doc_chars=50000, # truncate in document mode
max_tokens=512,
temperature=0.3,
max_workers=16, # parallel LLM requests
context_key="doc_context", # output key in chunk.enrichments
)
result = enricher.enrich(document)
for chunk in result.chunks:
print(chunk.enrichments['doc_context']) # unique context per chunk
Pipeline Configuration¶
steps:
- type: enrichment
name: doc-context
llm_provider: local
llm_model: gemma3:4b
llm_base_url: http://localhost:11434/v1
mode: neighbors
n_neighbors: 3
max_tokens: 512
max_workers: 16
context_key: doc_context # optional; default is "doc_context"
Cross-Document Context Enricher¶
The cross-doc-context enricher finds similar chunks from OTHER documents using a FAISS vector index, then generates contextual descriptions. Useful for corpus-level enrichment.
Workflow¶
- All chunks are embedded and stored in a FAISS vector index
- For each chunk, similar chunks from other documents are retrieved
- An LLM generates context based on the similar chunks
Usage¶
from stratum.enrichment import CrossDocContextEnricher
enricher = CrossDocContextEnricher(
vector_store_index_path="db/corpus/faiss.index",
vector_store_metadata_path="db/corpus/metadata.pkl",
embedding_provider="local",
embedding_model="nomic-embed-text",
embedding_base_url="http://localhost:11434/v1",
llm_provider="local",
llm_model="gemma3:4b",
llm_base_url="http://localhost:11434/v1",
n_neighbors=5,
max_workers=16,
)
# Process multiple documents — index builds incrementally
for doc in documents:
enriched = enricher.enrich(doc)
# Access results
for chunk in enriched.chunks:
if "cross_doc_context" in chunk.enrichments:
ctx = chunk.enrichments["cross_doc_context"]
print(ctx["context"]) # LLM-generated context
print(ctx["similar_docs"]) # list of doc_ids
Pipeline Configuration¶
steps:
- type: enrichment
name: cross-doc-context
vector_store_index_path: db/corpus/faiss.index
vector_store_metadata_path: db/corpus/metadata.pkl
embedding_provider: local
embedding_model: nomic-embed-text
embedding_base_url: http://localhost:11434/v1
embedding_batch_size: 32
llm_provider: local
llm_model: gemma3:4b
llm_base_url: http://localhost:11434/v1
n_neighbors: 5
max_tokens: 512
max_workers: 16
Custom Prompt Files¶
All LLM-based enrichers that use a PromptBuilder class accept a prompt_file parameter. This overrides the built-in hardcoded prompt with a YAML file you control.
YAML format¶
# system prompt (optional)
system: |
You are a precise document analyst. Be concise and factual.
# user_template (required) — use {variable} placeholders
user_template: |
Summarize this document in 3 sentences.
DOCUMENT:
{document_text}
# dspy_signature: # reserved for future DSPy migration
The system key is optional — if absent, no system message is sent. The user_template key is required and must contain all {variable} placeholders expected by the enricher.
Supported enrichers and their template variables¶
| Enricher | Template variables |
|---|---|
doc-summary |
{document_text} |
doc-context (neighbors mode) |
{previous_chunks}, {current_chunk}, {following_chunks} |
doc-context (document mode) |
{document_text}, {current_chunk} |
chunk-classification (taxonomy) |
{chunk_text}, {categories} |
chunk-classification (freeform) |
{chunk_text} |
topic |
{keywords_text} |
cross-doc-context |
{context_block}, {target_text} |
Pipeline YAML usage¶
- type: enrichment
name: doc-summary
llm_provider: google
llm_model: gemini-2.0-flash
prompt_file: prompts/my_summary.yaml
- type: enrichment
name: chunk-classification
llm_provider: openai
llm_model: gpt-4o-mini
prompt_file: prompts/my_classification.yaml
Default YAML prompts¶
Ready-to-use reference prompts are bundled in stratum/enrichment/prompts/:
stratum/enrichment/prompts/
├── __init__.py # load_prompt_file() helper
├── doc_summary.yaml
├── doc_context_neighbors.yaml
├── doc_context_document.yaml
├── classification.yaml
├── classification_freeform.yaml
├── topic.yaml
└── cross_doc_context.yaml
Copy and customise any of these as a starting point.
DSPy readiness¶
The # dspy_signature: comment in YAML files is reserved for future migration to DSPy-based prompt optimization. The current system + user_template structure maps directly to DSPy's Signature paradigm, so migration will only require filling in that key — no structural changes.
Module Structure¶
stratum/enrichment/
├── __init__.py # Public exports (all enrichers)
├── protocols.py # EnrichmentStep protocol
├── base.py # BaseEnrichmentStep abstract class
├── base_prompt.py # BasePromptBuilder for prompt construction
├── registry.py # EnrichmentRegistry with global singleton
├── pipeline.py # EnrichmentPipeline for chaining
├── noop.py # NoOpEnricher implementation
├── llm_client.py # OpenAI-compatible LLM/VLM client (table/image)
├── prompts/ # Bundled YAML prompt files + load_prompt_file()
│ ├── __init__.py # load_prompt_file(path) helper
│ ├── doc_summary.yaml
│ ├── doc_context_neighbors.yaml
│ ├── doc_context_document.yaml
│ ├── classification.yaml
│ ├── classification_freeform.yaml
│ ├── topic.yaml
│ └── cross_doc_context.yaml
├── llm/ # Multi-provider LLM abstraction (doc-level)
│ ├── __init__.py
│ ├── base.py # LLMClient ABC + LLMResponse dataclass
│ ├── factory.py # create_llm() factory function
│ ├── openai_compatible.py # OpenAI/local provider
│ ├── anthropic_client.py # Anthropic Claude provider
│ └── google_client.py # Google Gemini provider
├── reference/ # Intra-document reference detection
│ ├── __init__.py
│ ├── patterns.py # Configurable regex patterns (EN/FR/DE/ES/IT/PT/NL/RU/ZH/AR)
│ ├── detector.py # ReferenceDetector class
│ ├── block_index.py # BlockIndex for reference resolution
│ └── enricher.py # IntraDocumentReferenceEnricher
├── table/ # Table summarization
│ ├── __init__.py
│ └── summarizer.py # TableSummarizationEnricher
├── image/ # Image description
│ ├── __init__.py
│ └── describer.py # ImageDescriptionEnricher
├── doc_summary/ # Document-level summary
│ ├── __init__.py
│ ├── enricher.py # DocSummaryEnricher
│ └── prompt.py # DocSummaryPromptBuilder
├── doc_context/ # Per-chunk contextual description
│ ├── __init__.py
│ ├── enricher.py # DocContextEnricher
│ └── prompt.py # DocContextPromptBuilder
├── cross_doc_context/ # Cross-document context
│ ├── __init__.py
│ ├── enricher.py # CrossDocContextEnricher
│ └── prompt.py # CrossDocContextPromptBuilder
├── classification/ # Chunk classification
│ ├── __init__.py
│ ├── enricher.py # ChunkClassificationEnricher
│ └── prompt.py # ClassificationPromptBuilder
├── keyword/ # TF-IDF keyword extraction (no LLM)
│ ├── __init__.py
│ ├── extractor.py # KeywordExtractor (scikit-learn)
│ └── enricher.py # KeywordEnricher
├── topic/ # Topic discovery and assignment
│ ├── __init__.py
│ ├── prompt.py # TopicDiscoveryPromptBuilder
│ └── enricher.py # TopicEnricher
└── context/ # Vector index for cross-doc context
├── __init__.py
└── index.py # FAISS vector index
EnrichmentStep Protocol¶
from typing import Protocol
from stratum.models.output import CanonicalDocument
class EnrichmentStep(Protocol):
"""Protocol for document enrichers."""
name: str
version: str
def enrich(self, document: CanonicalDocument) -> CanonicalDocument:
"""Enrich document with additional metadata."""
...
def supports_document(self, document: CanonicalDocument) -> bool:
"""Check if enricher supports this document."""
...
Verbose Mode¶
With -v flag, CLI shows enrichment progress:
$ stratum doc.pdf --pipeline-config pipeline.yaml -v
Loaded pipeline config: enrichment-pipeline v1.0.0
Applied enricher: embedding
Applied enricher: summary
Processed: doc.pdf
Chunks: 15
Related¶
- Architecture - System architecture and pipeline documentation