Parsers Component¶
Overview¶
The parsers component (stratum/parsers/) converts various document formats into a unified Document representation with ContentBlock elements.
Current Status¶
| Parser | Formats | Status | Loading |
|---|---|---|---|
| DoclingParser | PDF, DOCX, HTML | ✅ Supported and tested | Lazy (requires pip install stratum) |
| MarkdownParser | Markdown (.md) | ✅ Supported and tested | Eager |
| TxtParser | Plain text (.txt) | ✅ Supported and tested | Eager |
| OCRJSONParser | OCR JSON (.json) | ✅ Supported and tested | Eager |
| DotsOCRParser | PDF, IMAGE, OCR JSON | ✅ Available | Not auto-registered; register with force=True if needed |
Architecture¶
┌─────────────────────────────────────────────────────────────┐
│                       ParserRegistry                        │
│                          (base.py)                          │
│                                                             │
│  Maintains mapping: DocumentFormat → Parser class           │
│  Auto-selects parser based on file extension                │
└─────────────────────────────────────────────────────────────┘
                               │
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
         ┌───────────┐   ┌───────────┐   ┌───────────┐
         │  Docling  │   │ Markdown  │   │    Txt    │
         │  Parser   │   │  Parser   │   │  Parser   │
         │           │   │           │   │           │
         │ PDF, DOCX │   │ Markdown  │   │   Plain   │
         │           │   │   only    │   │   text    │
         └───────────┘   └───────────┘   └───────────┘
               │               │               │
               └───────────────┼───────────────┘
                               ▼
                      ┌─────────────────┐
                      │    Document     │
                      │ (ContentBlock[])│
                      └─────────────────┘
ParserRegistry¶
Central registry for parser discovery and selection.
from pathlib import Path
from stratum.models.document import DocumentFormat
from stratum.parsers import ParserRegistry, get_parser
# Auto-select by file extension
parser = ParserRegistry.get_for_file(Path("document.pdf"))
# Select by format
parser = ParserRegistry.get_for_format(DocumentFormat.PDF)
# Convenience function
parser = get_parser(path=Path("document.pdf"))
parser = get_parser(format="pdf")
# List supported formats
formats = ParserRegistry.get_supported_formats()
# [DocumentFormat.PDF, DocumentFormat.MARKDOWN, ...]
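To make the format-to-parser mapping concrete, here is a minimal, self-contained sketch of a registry that resolves a parser from a file extension. It is not Stratum's implementation: `SketchRegistry`, `Fmt`, `EXTENSIONS`, and `ToyMarkdownParser` are hypothetical stand-ins, and the real extension table may differ.

```python
from enum import Enum
from pathlib import Path


class Fmt(Enum):
    """Hypothetical stand-in for DocumentFormat."""
    PDF = "pdf"
    MARKDOWN = "markdown"
    TXT = "txt"


# Assumed extension table; the real mapping lives in Stratum and may differ.
EXTENSIONS = {".pdf": Fmt.PDF, ".md": Fmt.MARKDOWN, ".txt": Fmt.TXT}


class SketchRegistry:
    """Maps a format to a parser class and resolves files by extension."""

    _parsers: dict[Fmt, type] = {}

    @classmethod
    def register(cls, parser_cls: type) -> None:
        # One parser class may serve several formats.
        for fmt in parser_cls.supported_formats:
            cls._parsers[fmt] = parser_cls

    @classmethod
    def get_for_format(cls, fmt: Fmt):
        try:
            return cls._parsers[fmt]()
        except KeyError:
            raise ValueError(f"No parser registered for {fmt.value}") from None

    @classmethod
    def get_for_file(cls, path: Path):
        # Lower-case the suffix so "README.MD" and "readme.md" behave alike.
        fmt = EXTENSIONS.get(path.suffix.lower())
        if fmt is None:
            raise ValueError(f"Unknown extension: {path.suffix!r}")
        return cls.get_for_format(fmt)


class ToyMarkdownParser:
    supported_formats = [Fmt.MARKDOWN]


SketchRegistry.register(ToyMarkdownParser)
parser = SketchRegistry.get_for_file(Path("README.MD"))
print(type(parser).__name__)  # ToyMarkdownParser
```

Keeping the extension table separate from the parser table means a new extension can be mapped to an existing format without touching any parser class.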
Available Parsers¶
DoclingParser¶
Uses the Docling library for document parsing.
Supported formats: PDF, DOCX, HTML
from pathlib import Path
from stratum.parsers import DoclingParser
# Images + tables extraction
parser = DoclingParser(
    images_dir="output/images",
    tables_dir="output/tables",
    image_dpi=300,  # DPI for PDF image rendering (default: 300)
)
document = parser.parse_file(Path("paper.pdf"))
Features:
- High-quality PDF text extraction
- Table detection and structure preservation
- Table export as Markdown .md files (when tables_dir set)
- HD image extraction via PyMuPDF at configurable DPI (when images_dir set + PDF)
- Parallel image rendering with ThreadPoolExecutor
- Heading level detection
- Page number tracking
Requirements:
pip install stratum
MarkdownParser¶
Built-in parser for Markdown files. No external dependencies.
Supported formats: Markdown (.md)
from pathlib import Path
from stratum.parsers import MarkdownParser
parser = MarkdownParser()
document = parser.parse_file(Path("readme.md"))
Features:
- Heading level detection (H1-H6)
- Code block preservation (fenced and indented)
- Table detection
- List detection
- No external dependencies
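Heading detection has to coexist with code block preservation, because a `#` line inside a fenced block is a comment, not a heading. The sketch below shows one simple way to do that with regular expressions. It is an illustration only: `find_headings` is a hypothetical helper, it handles ATX (`#`) headings and a naive fence toggle, and it says nothing about how MarkdownParser is actually implemented.

```python
import re

# ATX heading: 1-6 '#' characters, mandatory whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# A fence opens or closes with at least three backticks or tildes.
FENCE_RE = re.compile(r"^(`{3,}|~{3,})")


def find_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs, ignoring '#' lines inside fenced code."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # naive toggle; does not pair fence types
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


fence = "`" * 3
sample = "\n".join([
    "# Guide",
    fence + "python",
    "# a comment, not a heading",
    fence,
    "## Install",
])
print(find_headings(sample))  # [(1, 'Guide'), (2, 'Install')]
```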
TxtParser¶
Built-in parser for plain text files. No external dependencies.
Supported formats: Plain text (.txt)
from pathlib import Path
from stratum.parsers import TxtParser
parser = TxtParser()
document = parser.parse_file(Path("notes.txt"))
Features:
- Paragraph detection (separated by blank lines)
- Encoding auto-detection (UTF-8, UTF-16, Latin-1)
- BOM handling (UTF-8, UTF-16)
- No external dependencies
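The encoding and paragraph features listed above can be sketched with the standard library alone. The order used here (BOM check, then UTF-8, then Latin-1) is an assumption about a reasonable detection strategy, not a description of TxtParser's internals; `decode_text` and `split_paragraphs` are hypothetical names.

```python
import codecs
import re


def decode_text(data: bytes) -> str:
    """Decode bytes by checking BOMs first, then UTF-8, then Latin-1."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return data.decode("latin-1")


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    return [chunk.strip() for chunk in chunks if chunk.strip()]


raw = codecs.BOM_UTF8 + "First paragraph.\n\nSecond\nparagraph.\n".encode("utf-8")
print(split_paragraphs(decode_text(raw)))
# ['First paragraph.', 'Second\nparagraph.']
```

Latin-1 works as a last resort because it can decode any byte sequence, at the cost of producing wrong characters if the file was really in another single-byte encoding.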
BaseParser¶
Abstract base class for all parsers.
from stratum.parsers.base import BaseParser

class BaseParser(ABC):
    name: str                                # Parser identifier
    supported_formats: list[DocumentFormat]  # Supported formats

    @abstractmethod
    def parse_file(self, path: Path) -> Document:
        """Parse file and return Document."""
        ...

    def parse_bytes(self, data: bytes, filename: str) -> Document:
        """Parse from bytes (optional, may not be supported)."""
        ...
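Since `parse_bytes` is optional, one plausible way for a subclass to support it is to spill the bytes to a temporary file and delegate to `parse_file`. The sketch below illustrates that pattern with a hypothetical `SketchParser` stand-in that returns plain strings instead of a Document; whether Stratum's parsers work this way is an assumption.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SketchParser(ABC):
    """Hypothetical stand-in for BaseParser; returns strings for brevity."""

    @abstractmethod
    def parse_file(self, path: Path) -> str:
        ...

    def parse_bytes(self, data: bytes, filename: str) -> str:
        # Keep the original suffix so extension-based logic still works.
        suffix = Path(filename).suffix
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp) / f"input{suffix}"
            tmp_path.write_bytes(data)
            # Parse before the directory is cleaned up on exit.
            return self.parse_file(tmp_path)


class UpperTxtParser(SketchParser):
    """Toy subclass: 'parsing' just upper-cases the text."""

    def parse_file(self, path: Path) -> str:
        return path.read_text(encoding="utf-8").upper()


print(UpperTxtParser().parse_bytes(b"hello", "notes.txt"))  # HELLO
```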
Document Model¶
Parser output is a Document with ContentBlock elements:
@dataclass
class Document:
    document_id: str
    metadata: DocumentMetadata
    blocks: list[ContentBlock]
ContentBlock¶
Atomic content unit:
@dataclass
class ContentBlock:
    text: str
    category: BlockCategory
    level: int | None = None        # Heading level (1-6)
    page_number: int | None = None
    bbox: tuple | None = None       # Bounding box
    metadata: dict = field(default_factory=dict)
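As an example of consuming these fields, the sketch below builds an indented outline from the blocks that carry a heading level. `Block` is a hypothetical stand-in that mirrors a subset of the ContentBlock fields, with the category simplified to a string, so the snippet runs without Stratum installed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """Hypothetical stand-in mirroring a subset of ContentBlock."""
    text: str
    category: str
    level: Optional[int] = None
    page_number: Optional[int] = None
    metadata: dict = field(default_factory=dict)


def outline(blocks: list[Block]) -> list[str]:
    """Indent heading blocks by level to form a table of contents."""
    lines = []
    for block in blocks:
        if block.level is not None:
            indent = "  " * (block.level - 1)
            lines.append(f"{indent}{block.text} (p. {block.page_number})")
    return lines


blocks = [
    Block("Introduction", "Section-header", level=1, page_number=1),
    Block("Some body text.", "Text", page_number=1),
    Block("Background", "Section-header", level=2, page_number=2),
]
print("\n".join(outline(blocks)))
# Introduction (p. 1)
#   Background (p. 2)
```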
BlockCategory¶
Content type enumeration:
class BlockCategory(Enum):
    # Headings
    TITLE = "Title"
    SECTION_HEADER = "Section-header"
    HEADING = "Title"            # Alias for TITLE
    # Body content
    TEXT = "Text"                # Paragraphs
    LIST_ITEM = "List-item"      # List items
    # Special content
    TABLE = "Table"              # Tables
    FORMULA = "Formula"          # Math formulas
    CODE = "Code"                # Code blocks
    # Media
    PICTURE = "Picture"          # Images
    CAPTION = "Caption"          # Figure/table captions
    # Page elements (typically skipped)
    PAGE_HEADER = "Page-header"
    PAGE_FOOTER = "Page-footer"
    FOOTNOTE = "Footnote"        # Footnotes
    # Fallback
    UNKNOWN = "Unknown"
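Two details of this enumeration are easy to miss: a member that repeats an existing value becomes an alias in Python's `Enum`, and the page elements are the ones a consumer would usually filter out. The sketch below demonstrates both on a trimmed copy. `Category`, `SKIPPED`, and `to_category` are hypothetical names, and the choice of which categories to skip is an assumption based on the "typically skipped" comment.

```python
from enum import Enum


class Category(Enum):
    """Trimmed, hypothetical copy of BlockCategory for illustration."""
    TITLE = "Title"
    HEADING = "Title"  # same value, so this becomes an alias of TITLE
    TEXT = "Text"
    PAGE_HEADER = "Page-header"
    PAGE_FOOTER = "Page-footer"
    UNKNOWN = "Unknown"


# Assumed skip set; adjust to whatever a downstream consumer needs.
SKIPPED = {Category.PAGE_HEADER, Category.PAGE_FOOTER}


def to_category(label: str) -> Category:
    """Map a raw label to a category, falling back to UNKNOWN."""
    try:
        return Category(label)
    except ValueError:
        return Category.UNKNOWN


labels = ["Title", "Page-header", "Text", "Sidebar", "Page-footer"]
kept = [c for c in map(to_category, labels) if c not in SKIPPED]
print([c.name for c in kept])              # ['TITLE', 'TEXT', 'UNKNOWN']
print(Category.HEADING is Category.TITLE)  # True
```

Because `HEADING` is an alias, iterating over the enum or looking up `"Title"` by value always yields `TITLE`.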
DocumentMetadata¶
Source document information:
@dataclass
class DocumentMetadata:
    source_file: str | None = None
    format: DocumentFormat | None = None
    title: str | None = None
    total_pages: int | None = None
    extra: dict = field(default_factory=dict)
Creating Custom Parsers¶
Implement BaseParser:
from pathlib import Path
from stratum.parsers.base import BaseParser, ParserRegistry
from stratum.models.document import Document, DocumentFormat, DocumentMetadata
from stratum.models.block import ContentBlock, BlockCategory

class MyCustomParser(BaseParser):
    name = "my_parser"
    supported_formats = [DocumentFormat.PDF]

    def __init__(self, **kwargs):
        # Custom initialization
        self.options = kwargs

    def parse_file(self, path: Path) -> Document:
        # Read the file; PDF is binary, so read bytes rather than text
        raw = path.read_bytes()
        # Create blocks (placeholder text; real parsing of `raw` goes here)
        blocks = [
            ContentBlock(
                text="Parsed content",
                category=BlockCategory.TEXT,
                page_number=1,
            )
        ]
        # Return Document
        return Document(
            document_id=path.stem,
            metadata=DocumentMetadata(
                source_file=str(path),
                format=DocumentFormat.PDF,
            ),
            blocks=blocks,
        )

# Register parser
ParserRegistry.register(MyCustomParser)
# Now available via registry
parser = ParserRegistry.get_for_format(DocumentFormat.PDF)
Parser Selection Priority¶
ParserRegistry uses first-registered-wins semantics:
- Light parsers (MarkdownParser, OCRJSONParser, TxtParser) are registered eagerly at import time.
- DoclingParser is registered lazily: it is imported only when a heavy format (PDF/DOCX/HTML) is first requested via get_parser().
- A custom parser registered before the lazy import wins; use force=True to override any existing entry.
# Override default PDF parser (must use force=True — DoclingParser is already registered)
from stratum.parsers.base import ParserRegistry
ParserRegistry.register(MyPDFParser, force=True) # Now used for PDF
Note:
ParserRegistry.get_for_file() and ParserRegistry.get_for_format() are lower-level APIs that do not trigger lazy loading of DoclingParser. Use get_parser() (or parse_document()) for the full lazy-loading behaviour.
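The interaction between first-registered-wins, `force=True`, and lazy loading can be hard to picture, so here is a toy model of it. Everything in it is hypothetical (`PriorityRegistry`, `HeavyParser`, `get_parser_class`, string format names); it mirrors the rules described in this section but is not the real ParserRegistry code.

```python
class PriorityRegistry:
    """Hypothetical registry keyed by format name."""

    _parsers: dict[str, type] = {}

    @classmethod
    def register(cls, parser_cls: type, force: bool = False) -> None:
        for fmt in parser_cls.supported_formats:
            # First registration wins unless the caller forces an override.
            if force or fmt not in cls._parsers:
                cls._parsers[fmt] = parser_cls

    @classmethod
    def get_for_format(cls, fmt: str) -> type:
        # Low-level lookup: never triggers the lazy registration below.
        if fmt not in cls._parsers:
            raise ValueError(f"No parser registered for {fmt!r}")
        return cls._parsers[fmt]


class HeavyParser:
    """Stands in for a parser with an expensive import."""
    supported_formats = ["pdf", "docx", "html"]


class MyPDFParser:
    supported_formats = ["pdf"]


HEAVY_FORMATS = {"pdf", "docx", "html"}
_heavy_loaded = False


def get_parser_class(fmt: str) -> type:
    """High-level lookup: registers the heavy parser on first heavy request."""
    global _heavy_loaded
    if fmt in HEAVY_FORMATS and not _heavy_loaded:
        PriorityRegistry.register(HeavyParser)  # no force: earlier entries survive
        _heavy_loaded = True
    return PriorityRegistry.get_for_format(fmt)


PriorityRegistry.register(MyPDFParser)         # registered before the lazy load
print(get_parser_class("pdf").__name__)        # MyPDFParser (first registered wins)
print(get_parser_class("docx").__name__)       # HeavyParser
PriorityRegistry.register(HeavyParser, force=True)
print(get_parser_class("pdf").__name__)        # HeavyParser (forced override)
```

In this model, calling `PriorityRegistry.get_for_format("docx")` before any `get_parser_class` call would raise, which is the same reason the note above recommends the higher-level functions.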
Convenience Functions¶
from pathlib import Path
from stratum.parsers import (
    get_parser,
    parse_document,
    get_supported_formats,
)
# Get parser instance
parser = get_parser(path=Path("doc.pdf"))
parser = get_parser(format="markdown")
# Parse directly, from a path or from in-memory bytes
doc = parse_document(path=Path("doc.pdf"))
pdf_bytes = Path("doc.pdf").read_bytes()
doc = parse_document(data=pdf_bytes, filename="doc.pdf")
# List formats
formats = get_supported_formats()
# ["pdf", "markdown", "txt", "docx"]
Image Extraction¶
DoclingParser saves images when images_dir is set. For PDFs, images are rendered via PyMuPDF at the specified DPI (default 300), producing sharp exports of vector figures. For non-PDF inputs the image embedded in the Docling parse result is used as a fallback.
from pathlib import Path
from stratum.models.block import BlockCategory
from stratum.parsers import DoclingParser
parser = DoclingParser(images_dir="output/images", image_dpi=300)
document = parser.parse_file(Path("paper.pdf"))
# Images saved to output/images/
# References in blocks:
for block in document.blocks:
    if block.category == BlockCategory.PICTURE:
        print(f"Image: {block.metadata.get('image_path')}")
Image naming: {doc_stem}_img_{NNN}.png
Example: paper_img_001.png
Rendering is parallelised with ThreadPoolExecutor — each thread opens its own fitz.Document so multiple pages render concurrently.
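The pattern described above, one document handle per thread plus a zero-padded naming scheme, can be sketched as follows. This is an illustration under assumptions, not DoclingParser's code: `image_name`, `render_page`, and `render_pages` are hypothetical helpers, the sketch renders whole pages rather than individual figures, and numbering from 001 is inferred from the example file name. `fitz` is the import name of PyMuPDF.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def image_name(doc_stem: str, index: int) -> str:
    """Build a name following the {doc_stem}_img_{NNN}.png pattern."""
    return f"{doc_stem}_img_{index:03d}.png"


def render_page(pdf_path: Path, page_index: int, out_path: Path, dpi: int) -> Path:
    """Render one page to PNG; each call opens its own document handle."""
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    with fitz.open(str(pdf_path)) as doc:
        doc[page_index].get_pixmap(dpi=dpi).save(str(out_path))
    return out_path


def render_pages(pdf_path: Path, page_indexes: list[int], images_dir: Path,
                 dpi: int = 300, workers: int = 4) -> list[Path]:
    """Render the given pages concurrently, numbering outputs from 001."""
    images_dir.mkdir(parents=True, exist_ok=True)
    targets = [images_dir / image_name(pdf_path.stem, n)
               for n in range(1, len(page_indexes) + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(render_page, pdf_path, idx, out, dpi)
                   for idx, out in zip(page_indexes, targets)]
        # Collect in submission order so results line up with page_indexes.
        return [future.result() for future in futures]


print(image_name("paper", 1))  # paper_img_001.png
```

Opening the file inside `render_page` keeps each worker's document handle private, which avoids sharing a single handle across threads.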
Table Export¶
DoclingParser exports tables as Markdown files when tables_dir is set:
parser = DoclingParser(tables_dir="output/tables")
document = parser.parse_file(Path("paper.pdf"))
# Saves: output/tables/paper_table_001.md, paper_table_002.md, ...
Table naming: {doc_stem}_table_{NNN}.md
The Markdown file contains the full table in GitHub-Flavored Markdown syntax. If Markdown export fails, the parser falls back to HTML.
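A Markdown-first export with an HTML fallback can be sketched as below. The helpers `to_markdown`, `to_html`, and `export_table` are hypothetical, and treating a ragged table as the failure case is an assumption made for the example; the conditions under which Docling's Markdown export actually fails are not documented here.

```python
from html import escape


def to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a GFM table; the first row is the header."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("table must be non-empty and rectangular")

    def line(cells: list[str]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in rows[0]) + " |"
    return "\n".join([line(rows[0]), separator, *map(line, rows[1:])])


def to_html(rows: list[list[str]]) -> str:
    """Render rows as a plain HTML table; tolerates ragged rows."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def export_table(rows: list[list[str]]) -> str:
    """Prefer Markdown and fall back to HTML when Markdown export fails."""
    try:
        return to_markdown(rows)
    except ValueError:
        return to_html(rows)


print(export_table([["Model", "F1"], ["A", "0.91"]]))
print(export_table([["Model", "F1"], ["A"]]))  # ragged, so HTML is used
```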
Error Handling¶
from pathlib import Path
from stratum.parsers import get_parser
# Unknown format
try:
    parser = get_parser(format="xyz")
except ValueError as e:
    print(f"Unsupported format: {e}")
# Parse error
parser = get_parser(format="pdf")
try:
    doc = parser.parse_file(Path("corrupted.pdf"))
except Exception as e:
    print(f"Parse error: {e}")