Parsers Component

Overview

The parsers component (stratum/parsers/) converts various document formats into a unified Document representation with ContentBlock elements.

Current Status

| Parser | Formats | Status | Loading |
|---|---|---|---|
| DoclingParser | PDF, DOCX, HTML | ✅ Supported and tested | Lazy (requires pip install stratum) |
| MarkdownParser | Markdown (.md) | ✅ Supported and tested | Eager |
| TxtParser | Plain text (.txt) | ✅ Supported and tested | Eager |
| OCRJSONParser | OCR JSON (.json) | ✅ Supported and tested | Eager |
| DotsOCRParser | PDF, IMAGE, OCR JSON | ✅ Available | Not auto-registered; register with force=True if needed |

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ParserRegistry                           │
│                      (base.py)                               │
│                                                              │
│  Maintains mapping: DocumentFormat → Parser class            │
│  Auto-selects parser based on file extension                 │
└─────────────────────────────────────────────────────────────┘
                              │
      ┌────────────────┬──────┴──────┐
      ▼                ▼             ▼
┌───────────┐  ┌───────────┐  ┌───────────┐
│ Docling   │  │ Markdown  │  │   Txt     │
│ Parser    │  │ Parser    │  │  Parser   │
│           │  │           │  │           │
│ PDF, DOCX,│  │ Markdown  │  │ Plain     │
│ HTML      │  │ only      │  │ text      │
└───────────┘  └───────────┘  └───────────┘
      │                │             │
      └────────────────┴──────┬──────┘
                              ▼
                    ┌─────────────────┐
                    │    Document     │
                    │ (ContentBlock[])│
                    └─────────────────┘

ParserRegistry

Central registry for parser discovery and selection.

from pathlib import Path

from stratum.models.document import DocumentFormat
from stratum.parsers import ParserRegistry, get_parser

# Auto-select by file extension
parser = ParserRegistry.get_for_file(Path("document.pdf"))

# Select by format
parser = ParserRegistry.get_for_format(DocumentFormat.PDF)

# Convenience function (also triggers lazy loading of DoclingParser)
parser = get_parser(path=Path("document.pdf"))
parser = get_parser(format="pdf")

# List supported formats
formats = ParserRegistry.get_supported_formats()
# [DocumentFormat.PDF, DocumentFormat.MARKDOWN, ...]

Available Parsers

DoclingParser

Uses the Docling library for document parsing.

Supported formats: PDF, DOCX, HTML

from pathlib import Path

from stratum.parsers import DoclingParser

# Images + tables extraction
parser = DoclingParser(
    images_dir="output/images",
    tables_dir="output/tables",
    image_dpi=300,           # DPI for PDF image rendering (default: 300)
)
document = parser.parse_file(Path("paper.pdf"))

Features:

- High-quality PDF text extraction
- Table detection and structure preservation
- Table export as Markdown .md files (when tables_dir is set)
- HD image extraction via PyMuPDF at configurable DPI (when images_dir is set and the input is a PDF)
- Parallel image rendering with ThreadPoolExecutor
- Heading level detection
- Page number tracking

Requirements:

pip install stratum

MarkdownParser

Built-in parser for Markdown files. No external dependencies.

Supported formats: Markdown (.md)

from pathlib import Path

from stratum.parsers import MarkdownParser

parser = MarkdownParser()
document = parser.parse_file(Path("readme.md"))

Features:

- Heading level detection (H1-H6)
- Code block preservation (fenced and indented)
- Table detection
- List detection
- No external dependencies
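The heading and fenced-code detection listed above can be illustrated with a small standalone sketch. This is not MarkdownParser's actual implementation: the `split_markdown` function and its `(kind, level, text)` tuples are hypothetical names chosen for illustration, and only ATX headings (`#` to `######`) and backtick fences are handled.

```python
import re

# One to six '#' characters, whitespace, then the heading text (optional closing '#'s).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3  # backtick fence marker


def split_markdown(text: str) -> list[tuple]:
    """Return (kind, level, text) tuples for headings, code blocks and paragraphs."""
    blocks = []
    paragraph = []   # lines of the paragraph being collected
    code = None      # list of lines while inside a fence, else None

    def flush_paragraph():
        if paragraph:
            blocks.append(("text", None, " ".join(paragraph)))
            paragraph.clear()

    for line in text.splitlines():
        if code is not None:
            if line.strip().startswith(FENCE):   # closing fence
                blocks.append(("code", None, "\n".join(code)))
                code = None
            else:
                code.append(line)                # keep code lines verbatim
            continue
        if line.strip().startswith(FENCE):       # opening fence
            flush_paragraph()
            code = []
            continue
        match = HEADING_RE.match(line)
        if match:
            flush_paragraph()
            blocks.append(("heading", len(match.group(1)), match.group(2)))
        elif line.strip():
            paragraph.append(line.strip())
        else:
            flush_paragraph()                    # blank line ends a paragraph

    if code is not None:                         # unterminated fence: keep what we have
        blocks.append(("code", None, "\n".join(code)))
    flush_paragraph()
    return blocks
```

The heading level is simply the number of leading `#` characters, which maps directly onto the H1-H6 range mentioned above.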

TxtParser

Built-in parser for plain text files. No external dependencies.

Supported formats: Plain text (.txt)

from pathlib import Path

from stratum.parsers import TxtParser

parser = TxtParser()
document = parser.parse_file(Path("notes.txt"))

Features:

- Paragraph detection (separated by blank lines)
- Encoding auto-detection (UTF-8, UTF-16, Latin-1)
- BOM handling (UTF-8, UTF-16)
- No external dependencies
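One plausible way to combine BOM handling, encoding fallback and blank-line paragraph splitting is sketched below with the standard library only. The function names `decode_text` and `split_paragraphs` are hypothetical, and the detection order (BOM first, then UTF-8, then Latin-1) is an assumption rather than TxtParser's documented behaviour. UTF-16 input without a BOM is not detected by this sketch.

```python
import codecs
import re

# Byte-order marks checked before any trial decoding.
# The "utf-8-sig" and "utf-16" codecs strip the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(data: bytes) -> str:
    """Decode bytes using a BOM if present, else UTF-8, else Latin-1."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return data.decode("latin-1")


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs separated by one or more blank lines."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return [p.strip() for p in re.split(r"\n\s*\n", normalized) if p.strip()]
```

Latin-1 works as a last resort precisely because it never raises, which also means it can silently mis-decode files in other single-byte encodings.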

BaseParser

Abstract base class for all parsers.

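from abc import ABC, abstractmethod
from pathlib import Path

from stratum.models.document import Document, DocumentFormat

# Interface as defined in stratum/parsers/base.py
# (import it with: from stratum.parsers.base import BaseParser)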

class BaseParser(ABC):
    name: str  # Parser identifier
    supported_formats: list[DocumentFormat]  # Supported formats

    @abstractmethod
    def parse_file(self, path: Path) -> Document:
        """Parse file and return Document."""
        ...

    def parse_bytes(self, data: bytes, filename: str) -> Document:
        """Parse from bytes (optional, may not be supported)."""
        ...
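Since parse_bytes is optional, a parser that only knows how to read files could support it by writing the bytes to a temporary file and delegating to parse_file. The sketch below shows that pattern in isolation. `FileBackedParser` and `LineCountParser` are hypothetical stand-ins (they return plain dicts rather than Document objects), and nothing here is meant to describe how stratum implements parse_bytes.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileBackedParser(ABC):
    """Hypothetical stand-in for BaseParser, used only for this sketch."""

    @abstractmethod
    def parse_file(self, path: Path) -> dict:
        """Parse a file on disk."""

    def parse_bytes(self, data: bytes, filename: str) -> dict:
        """Write the bytes to a temporary file and reuse parse_file."""
        # Keep the original suffix so extension-based logic still works.
        suffix = Path(filename).suffix
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp) / f"input{suffix}"
            tmp_path.write_bytes(data)
            return self.parse_file(tmp_path)


class LineCountParser(FileBackedParser):
    """Toy parser: reports the file suffix and the number of lines."""

    def parse_file(self, path: Path) -> dict:
        text = path.read_text(encoding="utf-8")
        return {"suffix": path.suffix, "lines": len(text.splitlines())}
```

A call such as `LineCountParser().parse_bytes(b"a\nb\n", "notes.txt")` goes through the temporary file and then through the same code path as a regular file parse.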

Document Model

Parser output is a Document with ContentBlock elements:

from dataclasses import dataclass

@dataclass
class Document:
    document_id: str
    metadata: DocumentMetadata
    blocks: list[ContentBlock]

ContentBlock

Atomic content unit:

from dataclasses import dataclass, field

@dataclass
class ContentBlock:
    text: str
    category: BlockCategory
    level: int | None = None      # Heading level (1-6)
    page_number: int | None = None
    bbox: tuple | None = None     # Bounding box
    metadata: dict = field(default_factory=dict)

BlockCategory

Content type enumeration:

from enum import Enum

class BlockCategory(Enum):
    # Headings
    TITLE = "Title"
    SECTION_HEADER = "Section-header"
    HEADING = "Title"         # Alias for TITLE

    # Body content
    TEXT = "Text"             # Paragraphs
    LIST_ITEM = "List-item"   # List items

    # Special content
    TABLE = "Table"           # Tables
    FORMULA = "Formula"       # Math formulas
    CODE = "Code"             # Code blocks

    # Media
    PICTURE = "Picture"       # Images
    CAPTION = "Caption"       # Figure/table captions

    # Page elements (typically skipped)
    PAGE_HEADER = "Page-header"
    PAGE_FOOTER = "Page-footer"
    FOOTNOTE = "Footnote"     # Footnotes

    # Fallback
    UNKNOWN = "Unknown"
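Because HEADING shares its value with TITLE, Python's Enum treats it as an alias rather than a separate member. The reduced copy of the enum below illustrates that behaviour, together with one way a consumer might skip page elements. The `SKIPPED` set and the `body_categories` helper are hypothetical; which categories to skip is an assumption left to the caller.

```python
from enum import Enum


class BlockCategory(Enum):
    # Reduced copy of the enum above, for illustration only.
    TITLE = "Title"
    HEADING = "Title"            # same value as TITLE, so this is an alias
    TEXT = "Text"
    PAGE_HEADER = "Page-header"
    PAGE_FOOTER = "Page-footer"
    FOOTNOTE = "Footnote"


# Hypothetical choice of categories to drop before further processing.
SKIPPED = {
    BlockCategory.PAGE_HEADER,
    BlockCategory.PAGE_FOOTER,
    BlockCategory.FOOTNOTE,
}


def body_categories(categories):
    """Return the categories that are not page elements, in order."""
    return [c for c in categories if c not in SKIPPED]


# Both names refer to the same member object.
assert BlockCategory.HEADING is BlockCategory.TITLE
# Aliases do not appear when iterating over the enum.
assert BlockCategory.HEADING.name == "TITLE"
```

A practical consequence is that comparing against BlockCategory.HEADING or BlockCategory.TITLE gives the same result, and that serialising a block by name will always produce "TITLE".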

DocumentMetadata

Source document information:

from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    source_file: str | None = None
    format: DocumentFormat | None = None
    title: str | None = None
    total_pages: int | None = None
    extra: dict = field(default_factory=dict)

Creating Custom Parsers

Implement BaseParser:

from pathlib import Path
from stratum.parsers.base import BaseParser, ParserRegistry
from stratum.models.document import Document, DocumentFormat, DocumentMetadata
from stratum.models.block import ContentBlock, BlockCategory

class MyCustomParser(BaseParser):
    name = "my_parser"
    supported_formats = [DocumentFormat.PDF]

    def __init__(self, **kwargs):
        # Custom initialization
        self.options = kwargs

    def parse_file(self, path: Path) -> Document:
        # Read the raw file; PDFs are binary, so read bytes rather than text
        raw = path.read_bytes()

        # Create blocks
        blocks = [
            ContentBlock(
                text="Parsed content",
                category=BlockCategory.TEXT,
                page_number=1,
            )
        ]

        # Return Document
        return Document(
            document_id=path.stem,
            metadata=DocumentMetadata(
                source_file=str(path),
                format=DocumentFormat.PDF,
            ),
            blocks=blocks,
        )

# Register parser
ParserRegistry.register(MyCustomParser)

# Now available via registry
parser = ParserRegistry.get_for_format(DocumentFormat.PDF)

Parser Selection Priority

ParserRegistry uses first-registered-wins semantics:

  1. Light parsers (MarkdownParser, OCRJSONParser, TxtParser) are registered eagerly at import time.
  2. DoclingParser is registered lazily: it is imported only when a heavy format (PDF/DOCX/HTML) is first requested via get_parser().
  3. A custom parser registered before the lazy import wins; use force=True to override any existing entry.

# Override the default PDF parser. force=True is required once DoclingParser is already registered.
# MyPDFParser stands for any BaseParser subclass that supports DocumentFormat.PDF.
from stratum.parsers.base import ParserRegistry
ParserRegistry.register(MyPDFParser, force=True)  # Now used for PDF

Note: ParserRegistry.get_for_file() and ParserRegistry.get_for_format() are lower-level APIs that do not trigger lazy loading of DoclingParser. Use get_parser() (or parse_document()) for the full lazy-loading behaviour.
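The interaction between first-registered-wins, force=True and lazy loading can be modelled in a few lines. The sketch below is a toy model, not stratum's ParserRegistry: `SketchRegistry`, `HeavyParser`, `MarkdownLike` and `CustomPdf` are hypothetical names, formats are plain strings, and both lookup methods return parser classes for simplicity.

```python
class SketchRegistry:
    """Toy registry illustrating the selection rules described above."""

    def __init__(self, heavy_formats=(), heavy_loader=None):
        self._parsers = {}                        # format name -> parser class
        self._heavy_formats = set(heavy_formats)  # formats served by the lazy parser
        self._heavy_loader = heavy_loader         # callable returning the lazy parser class

    def register(self, parser_cls, force=False):
        for fmt in parser_cls.supported_formats:
            # First registration wins unless force=True is passed.
            if force or fmt not in self._parsers:
                self._parsers[fmt] = parser_cls

    def get_for_format(self, fmt):
        """Low-level lookup: never triggers lazy loading."""
        if fmt not in self._parsers:
            raise ValueError(f"no parser registered for {fmt!r}")
        return self._parsers[fmt]

    def get_parser(self, fmt):
        """High-level lookup: loads the heavy parser on first heavy request."""
        if fmt in self._heavy_formats and self._heavy_loader is not None:
            loader, self._heavy_loader = self._heavy_loader, None
            # Registered without force, so earlier custom registrations win.
            self.register(loader())
        return self.get_for_format(fmt)


class HeavyParser:
    supported_formats = ["pdf", "docx", "html"]


class MarkdownLike:
    supported_formats = ["markdown"]


class CustomPdf:
    supported_formats = ["pdf"]


registry = SketchRegistry(heavy_formats={"pdf", "docx", "html"},
                          heavy_loader=lambda: HeavyParser)
registry.register(MarkdownLike)   # eager registration
registry.register(CustomPdf)      # registered before the lazy load
print(registry.get_parser("pdf").__name__)   # CustomPdf
print(registry.get_parser("docx").__name__)  # HeavyParser
```

In this model a custom PDF parser registered early keeps the PDF slot, while the lazily loaded parser still fills the formats nobody else claimed.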

Convenience Functions

from pathlib import Path

from stratum.parsers import (
    get_parser,
    parse_document,
    get_supported_formats,
)

# Get parser instance
parser = get_parser(path=Path("doc.pdf"))
parser = get_parser(format="markdown")

# Parse directly from a path, or from in-memory bytes
doc = parse_document(path=Path("doc.pdf"))
pdf_bytes = Path("doc.pdf").read_bytes()
doc = parse_document(data=pdf_bytes, filename="doc.pdf")

# List formats
formats = get_supported_formats()
# ["pdf", "markdown", "txt", "docx"]

Image Extraction

DoclingParser saves images when images_dir is set. For PDFs, images are rendered via PyMuPDF at the specified DPI (default 300), producing sharp exports of vector figures. For non-PDF inputs the image embedded in the Docling parse result is used as a fallback.

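from pathlib import Path

from stratum.models.block import BlockCategory
from stratum.parsers import DoclingParser

parser = DoclingParser(images_dir="output/images", image_dpi=300)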
document = parser.parse_file(Path("paper.pdf"))

# Images saved to output/images/
# References in blocks:
for block in document.blocks:
    if block.category == BlockCategory.PICTURE:
        print(f"Image: {block.metadata.get('image_path')}")

Image naming: {doc_stem}_img_{NNN}.png

Example: paper_img_001.png

Rendering is parallelised with ThreadPoolExecutor: each thread opens its own fitz.Document, so multiple pages render concurrently.
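The pattern of giving every worker its own fitz.Document can be sketched as follows. This is a simplified illustration that renders whole pages, not DoclingParser's code: `render_pages`, `render_page_with_pymupdf`, the `_page_` file naming and the `max_workers` default are all assumptions. The default renderer needs PyMuPDF installed, and a different `render` callable can be passed in for testing.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_page_with_pymupdf(pdf_path, page_index, out_path, dpi):
    """Render one page to PNG. Requires PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    # Each call opens its own document, so no Document object
    # is ever shared between threads.
    with fitz.open(str(pdf_path)) as doc:
        pixmap = doc[page_index].get_pixmap(dpi=dpi)
        pixmap.save(str(out_path))
    return out_path


def render_pages(pdf_path, page_indices, out_dir, dpi=300,
                 render=render_page_with_pymupdf, max_workers=4):
    """Render the given 0-based pages in parallel; return output paths in order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    # Hypothetical naming: {stem}_page_{NNN}.png with 1-based page numbers.
    jobs = [(i, out_dir / f"{stem}_page_{i + 1:03d}.png") for i in page_indices]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(render, pdf_path, i, out, dpi) for i, out in jobs]
        # result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]
```

Collecting results in submission order keeps the output list aligned with `page_indices` even though pages may finish rendering in a different order.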

Table Export

DoclingParser exports tables as Markdown files when tables_dir is set:

from pathlib import Path

from stratum.parsers import DoclingParser

parser = DoclingParser(tables_dir="output/tables")
document = parser.parse_file(Path("paper.pdf"))
# Saves: output/tables/paper_table_001.md, paper_table_002.md, ...

Table naming: {doc_stem}_table_{NNN}.md

The Markdown file contains the full table in GitHub-Flavored Markdown syntax. If Markdown export fails, the parser falls back to HTML.
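The image and table naming patterns share the same shape, so predicting an output filename for a given source document is straightforward. The helper below is a hypothetical convenience, not part of stratum. It assumes that NNN is a 1-based, three-digit zero-padded counter, which is inferred from the examples paper_img_001.png and paper_table_001.md.

```python
from pathlib import Path


def asset_name(source: Path, kind: str, index: int, extension: str) -> str:
    """Build '{doc_stem}_{kind}_{NNN}.{extension}' for a 1-based index."""
    return f"{source.stem}_{kind}_{index:03d}.{extension}"


print(asset_name(Path("paper.pdf"), "img", 1, "png"))    # paper_img_001.png
print(asset_name(Path("paper.pdf"), "table", 2, "md"))   # paper_table_002.md
```

A helper like this can be useful for checking that expected files exist after a parse run without scanning the output directories.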

Error Handling

from pathlib import Path

from stratum.parsers import get_parser

# Unknown format
try:
    parser = get_parser(format="xyz")
except ValueError as e:
    print(f"Unsupported format: {e}")

# Parse error
try:
    parser = get_parser(path=Path("corrupted.pdf"))
    doc = parser.parse_file(Path("corrupted.pdf"))
except Exception as e:
    print(f"Parse error: {e}")