PII Detection Component

Status: Infrastructure ready, implementation pending

Overview

The PII (Personally Identifiable Information) component will detect and optionally redact sensitive information in documents. The pipeline infrastructure already accepts PII steps, but no concrete detector has been implemented yet.

Pipeline Configuration

PII steps can be declared in the pipeline config:

# pii-pipeline.yaml
name: pii-pipeline
version: "1.0.0"

steps:
  - type: parser

  # PII detection before chunking
  - type: pii
    name: regex-pii
    mode: detect  # detect | redact
    types:
      - email
      - phone
      - ssn
    confidence_threshold: 0.7

  - type: chunker
    target_size: 500

Current Behavior

With the -v flag, the CLI reports that PII steps are recognized but not yet implemented:

$ stratum doc.pdf --pipeline-config pii-pipeline.yaml -v
Loaded pipeline config: pii-pipeline v1.0.0
PII step 'regex-pii' not yet implemented, skipping
Processed: doc.pdf
Chunks: 10

The document passes through unchanged: the PII step is currently a no-op.
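
The pass-through behavior can be pictured with a short sketch. The signature of `apply_pii_steps` below and the logger name are assumptions; only the function name and the log message come from this page.

```python
import logging

logger = logging.getLogger("stratum.sketch")  # logger name is an assumption


def apply_pii_steps(document, steps: list[dict]):
    """No-op sketch: log each PII step, then return the document untouched."""
    for step in steps:
        if step.get("type") == "pii":
            logger.info("PII step %r not yet implemented, skipping", step.get("name"))
    return document
```

Because the same object is returned, later steps such as the chunker see exactly what the parser produced.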

Planned Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Pipeline Config                          │
│                                                              │
│  steps:                                                      │
│    - type: pii                                               │
│      name: regex-pii                                         │
│      mode: detect                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   apply_pii_steps()                          │
│                      (cli.py)                                │
│                                                              │
│  - Currently: logs and passes through (no-op)                │
│  - Future: loads PIIDetector from PIIRegistry                │
│  - Applies detection/redaction based on config               │
└─────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
  │  regex-pii  │     │ presidio    │     │  custom     │
  │             │     │             │     │             │
  │ Regex-based │     │ ML-based    │     │ User impl   │
  │ detection   │     │ detection   │     │             │
  └─────────────┘     └─────────────┘     └─────────────┘
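
The diagram names a PIIRegistry but does not define it. One possible shape is a name-to-factory mapping, sketched below. The `register` and `create` methods and the `StubDetector` class are hypothetical and only illustrate the lookup step.

```python
class PIIRegistry:
    """Hypothetical registry mapping a detector name to a factory callable."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def create(self, name, **options):
        """Build the detector registered under name, passing config options through."""
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"no PII detector registered as {name!r}") from None
        return factory(**options)


class StubDetector:
    """Stand-in detector used only to exercise the registry."""

    def __init__(self, types=None):
        self.types = types or []


registry = PIIRegistry()
registry.register("regex-pii", StubDetector)
detector = registry.create("regex-pii", types=["email", "phone"])
```

Keying on the `name` field of the step lets a config switch between a regex-based, ML-based, or custom detector without code changes.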

Planned PII Types

| Type | Example | Detection Method |
|---|---|---|
| email | john@example.com | Regex |
| phone | +1-555-123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| name | John Smith | NER model |
| address | 123 Main St, City | NER model |
| date_of_birth | 01/15/1990 | Regex + context |
| ip_address | 192.168.1.1 | Regex |
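
To make the "Regex + Luhn" entry concrete, the sketch below pairs a deliberately loose card-number pattern with a Luhn checksum so that random digit runs are rejected. The patterns and function names are simplified assumptions for illustration, not the patterns stratum will ship.

```python
import re

# Simplified illustrative patterns; real-world patterns need more care.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_credit_cards(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The regex finds candidates cheaply and the checksum filters out most false positives, which is why the two are combined for card numbers.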

Planned Modes

Detect Mode

Detect PII and record it as metadata without modifying the text:

- type: pii
  mode: detect

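The output includes the location of each match. In the example, start and end are character offsets into text, with end exclusive: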

{
  "chunks": [{
    "text": "Contact john@example.com for details",
    "pii_detected": [
      {"type": "email", "start": 8, "end": 24, "confidence": 0.99}
    ]
  }]
}
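
A minimal sketch of how such entries could be produced with `re.finditer` follows. The email pattern is simplified and the fixed 0.99 confidence for regex hits is an assumption; the planned scoring is not specified on this page.

```python
import re

# Simplified pattern and a fixed confidence score; both are assumptions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
EMAIL_CONFIDENCE = 0.99


def detect_emails(text: str, threshold: float = 0.7) -> list[dict]:
    """Return one entry per email match, or nothing if below the threshold."""
    if EMAIL_CONFIDENCE < threshold:
        return []
    return [
        {"type": "email", "start": m.start(), "end": m.end(),
         "confidence": EMAIL_CONFIDENCE}
        for m in EMAIL_RE.finditer(text)
    ]


text = "Contact john@example.com for details"
matches = detect_emails(text)
# Slicing with the reported offsets recovers the matched value.
recovered = text[matches[0]["start"]:matches[0]["end"]]
```

Using an exclusive end offset means `text[start:end]` returns the match directly, which keeps downstream redaction code simple.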

Redact Mode

Detect PII and rewrite the matched text according to the chosen strategy:

- type: pii
  mode: redact
  strategy: mask  # mask | remove | replace

Output with redaction:

{
  "chunks": [{
    "text": "Contact [EMAIL] for details",
    "pii_redacted": true
  }]
}

Redaction Strategies

| Strategy | Input | Output |
|---|---|---|
| mask | Email: john@example.com | Email: [EMAIL] |
| remove | Email: john@example.com | Email: |
| replace | Email: john@example.com | Email: user123@example.org |
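
The three strategies can be sketched as a single function over detected spans. The function name `redact_text`, the `FAKE_VALUES` table, and the match format (dicts with type, start, and end) are assumptions for this example.

```python
# Hypothetical substitute values for the "replace" strategy.
FAKE_VALUES = {"email": "user123@example.org"}


def redact_text(text: str, matches: list[dict], strategy: str = "mask") -> str:
    """Rewrite each matched span according to strategy."""
    # Work right to left so earlier offsets stay valid after each edit.
    for match in sorted(matches, key=lambda m: m["start"], reverse=True):
        if strategy == "mask":
            replacement = f"[{match['type'].upper()}]"
        elif strategy == "remove":
            replacement = ""
        elif strategy == "replace":
            # Fall back to a mask when no substitute value is known.
            replacement = FAKE_VALUES.get(match["type"], f"[{match['type'].upper()}]")
        else:
            raise ValueError(f"unknown redaction strategy: {strategy!r}")
        text = text[:match["start"]] + replacement + text[match["end"]:]
    return text


sample = "Email: john@example.com"
spans = [{"type": "email", "start": 7, "end": 23}]
masked = redact_text(sample, spans, "mask")
```

Note that this sketch leaves surrounding whitespace in place, so the remove strategy keeps the space after "Email:"; whether to trim it is a design decision left open here.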

Implementing a PII Detector

To implement a detector, create a class that satisfies the PIIDetector protocol:

from stratum.models.output import CanonicalDocument

class RegexPIIDetector:
    """Regex-based PII detector."""

    name = "regex-pii"
    version = "1.0.0"
    supported_types = ["email", "phone", "ssn"]

    def __init__(self, types: list[str] | None = None, confidence_threshold: float = 0.7):
        self.types = types or self.supported_types
        self.threshold = confidence_threshold

    def detect(self, document: CanonicalDocument) -> list[dict]:
        """Detect PII in document chunks."""
        matches = []
        for chunk in document.chunks:
            # Apply regex patterns
            for pii_type in self.types:
                pattern = self._get_pattern(pii_type)
                # Find matches in chunk.text
                ...
        return matches

    def redact(self, document: CanonicalDocument, matches: list[dict], strategy: str = "mask") -> CanonicalDocument:
        """Redact detected PII."""
        # Apply redaction strategy
        ...
        return document
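
The PIIDetector protocol itself is not shown on this page. Judging from the example class, it might look roughly like the `typing.Protocol` sketch below. The exact members and signatures are assumptions, and `Any` stands in for CanonicalDocument so the snippet runs without stratum installed.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PIIDetector(Protocol):
    """Assumed shape of the detector protocol, inferred from the example class."""

    name: str
    version: str
    supported_types: list[str]

    def detect(self, document: Any) -> list[dict]: ...

    def redact(self, document: Any, matches: list[dict], strategy: str = "mask") -> Any: ...


class NullDetector:
    """Trivial detector that finds nothing; used to show structural typing."""

    name = "null"
    version = "0.0.1"
    supported_types: list[str] = []

    def detect(self, document: Any) -> list[dict]:
        return []

    def redact(self, document: Any, matches: list[dict], strategy: str = "mask") -> Any:
        return document


conforms = isinstance(NullDetector(), PIIDetector)
```

With a protocol, detectors do not need to inherit from a base class; any object with the right attributes and methods qualifies. The runtime check only verifies that the members exist, not their signatures.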

Placement in Pipeline

PII steps can run at different stages of the pipeline:

Pre-Chunking (affects boundaries)

steps:
  - type: parser
  - type: pii        # Before chunking
    mode: redact
  - type: chunker

Redacting before chunking changes the text the chunker sees, so chunk boundaries may fall in different places than they would on the original text.
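
A toy example shows the effect. It uses a naive fixed-size character splitter, which is an assumption made purely for illustration; the real chunker and the unit of its target_size are not described here.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


original = "Contact john@example.com for details"
redacted = "Contact [EMAIL] for details"  # the same text after mask redaction

before = fixed_chunks(original, 20)
after = fixed_chunks(redacted, 20)
```

Here the first chunk of the original text ends in the middle of the email address, while the first chunk of the redacted text ends after "for ", so the same splitter produces different boundaries.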

Post-Chunking (preserves boundaries)

steps:
  - type: parser
  - type: chunker
  - type: pii        # After chunking
    mode: detect

Detecting after chunking leaves chunk boundaries intact, which is useful when chunks are filtered by PII metadata at retrieval time.
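
As a sketch of retrieval-time filtering, the helper below drops chunks whose planned `pii_detected` metadata contains blocked types. The function name `without_pii` and its parameters are hypothetical.

```python
def without_pii(chunks: list[dict], blocked_types=None) -> list[dict]:
    """Keep chunks with no detected PII of the blocked types.

    With blocked_types=None, any detected PII excludes the chunk.
    """
    kept = []
    for chunk in chunks:
        found = {match["type"] for match in chunk.get("pii_detected", [])}
        if blocked_types is None:
            blocked = bool(found)
        else:
            blocked = bool(found & set(blocked_types))
        if not blocked:
            kept.append(chunk)
    return kept


chunks = [
    {"text": "Contact john@example.com for details",
     "pii_detected": [{"type": "email", "start": 8, "end": 24, "confidence": 0.99}]},
    {"text": "Shipping takes three days", "pii_detected": []},
]
safe = without_pii(chunks)
```

Because detect mode leaves the text untouched, the same index can serve both unrestricted queries and queries that must exclude PII.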