PII Detection Component
Status: Infrastructure ready, implementation pending
Overview
The PII (Personally Identifiable Information) component will detect and optionally redact sensitive information in documents. The pipeline infrastructure already accepts PII steps, but no concrete detector has been implemented yet.
Pipeline Configuration
PII steps can be specified in the pipeline config:
# pii-pipeline.yaml
name: pii-pipeline
version: "1.0.0"
steps:
  - type: parser
  # PII detection before chunking
  - type: pii
    name: regex-pii
    mode: detect  # detect | redact
    types:
      - email
      - phone
      - ssn
    confidence_threshold: 0.7
  - type: chunker
    target_size: 500
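A loader could check a PII step before running it. The following is a minimal sketch, not the project's loader: `validate_pii_step`, the default values, and the set of accepted type names are assumptions based on the fields shown on this page, and the step is passed in as an already-parsed dict rather than read from YAML.

```python
ALLOWED_MODES = {"detect", "redact"}
# Assumed set of type names, taken from the types listed on this page.
KNOWN_TYPES = {
    "email", "phone", "ssn", "credit_card",
    "name", "address", "date_of_birth", "ip_address",
}


def validate_pii_step(step: dict) -> list[str]:
    """Return a list of problems found in a PII step; empty means it looks valid."""
    problems = []
    if step.get("type") != "pii":
        problems.append("step type must be 'pii'")
    # Defaulting to "detect" is an assumption; the page does not state a default.
    mode = step.get("mode", "detect")
    if mode not in ALLOWED_MODES:
        problems.append(f"unknown mode: {mode!r}")
    unknown = sorted(set(step.get("types", [])) - KNOWN_TYPES)
    if unknown:
        problems.append(f"unknown PII types: {unknown}")
    threshold = step.get("confidence_threshold", 0.7)
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("confidence_threshold must be between 0 and 1")
    return problems
```

Returning a list of problems instead of raising on the first one lets a CLI report every configuration mistake in a single run.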
Current Behavior
With the -v flag, the CLI reports that PII steps are recognized but not yet implemented:
$ stratum doc.pdf --pipeline-config pii-pipeline.yaml -v
Loaded pipeline config: pii-pipeline v1.0.0
PII step 'regex-pii' not yet implemented, skipping
Processed: doc.pdf
Chunks: 10
The document passes through unchanged. The PII step is currently a no-op.
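The pass-through behavior described above can be pictured roughly as follows. This is a sketch under assumptions: the real `apply_pii_steps()` in cli.py may take different arguments, and the logger name here is hypothetical.

```python
import logging

logger = logging.getLogger("stratum.pii")  # hypothetical logger name


def apply_pii_steps(document, steps: list[dict]):
    """Log every PII step as unimplemented and return the document untouched."""
    for step in steps:
        if step.get("type") == "pii":
            logger.info("PII step %r not yet implemented, skipping", step.get("name"))
    # No detection or redaction happens yet, so the same object comes back.
    return document
```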
Planned Architecture
┌─────────────────────────────────────────────────────────────┐
│                       Pipeline Config                       │
│                                                             │
│  steps:                                                     │
│    - type: pii                                              │
│      name: regex-pii                                        │
│      mode: detect                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      apply_pii_steps()                      │
│                          (cli.py)                           │
│                                                             │
│  - Currently: logs and passes through (no-op)               │
│  - Future: loads PIIDetector from PIIRegistry               │
│  - Applies detection/redaction based on config              │
└─────────────────────────────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
   │  regex-pii  │      │  presidio   │      │   custom    │
   │             │      │             │      │             │
   │ Regex-based │      │  ML-based   │      │  User impl  │
   │  detection  │      │  detection  │      │             │
   └─────────────┘      └─────────────┘      └─────────────┘
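The lookup step in the diagram, where a detector is loaded by name from PIIRegistry, might look something like the sketch below. Only the name `PIIRegistry` comes from this page; the `register` and `create` methods and their signatures are assumptions made for illustration.

```python
from typing import Callable


class PIIRegistry:
    """Maps a detector name from the config to a factory that builds it."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[..., object]] = {}

    def register(self, name: str, factory: Callable[..., object]) -> None:
        # Hypothetical method: associate a config name with a detector class.
        self._factories[name] = factory

    def create(self, name: str, **options) -> object:
        # Hypothetical method: build the detector, passing config options through.
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"no PII detector registered as {name!r}") from None
        return factory(**options)
```

With a registry like this, a step such as `name: regex-pii` would resolve to whichever class was registered under that name, and custom detectors could be added without changing the CLI.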
Planned PII Types
| Type | Example | Detection Method |
|---|---|---|
| email | john@example.com | Regex |
| phone | +1-555-123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| name | John Smith | NER model |
| address | 123 Main St, City | NER model |
| date_of_birth | 01/15/1990 | Regex + context |
| ip_address | 192.168.1.1 | Regex |
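For the regex-based rows in the table, detection could start from patterns like the ones below. These are deliberately simple illustrations rather than the planned implementation: the pattern set and the names `PATTERNS` and `luhn_valid` are assumptions, and real-world formats vary far more than these expressions allow. The Luhn check shows how a candidate card number from the regex can be confirmed by its checksum.

```python
import re

# Simplified patterns for illustration only; they match the examples on this page.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d{1,3}[-. ]\d{3}[-. ]\d{3}[-. ]\d{4}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right and test the sum mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

Running the checksum after the regex reduces false positives, since most random 16-digit strings fail it.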
Planned Modes
Detect Mode
Detects PII and records it as metadata without modifying the text:
- type: pii
  mode: detect
Output includes PII locations:
{
  "chunks": [{
    "text": "Contact john@example.com for details",
    "pii_detected": [
      {"type": "email", "start": 8, "end": 24, "confidence": 0.99}
    ]
  }]
}
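The offsets in this output behave like a half-open range: slicing the text from 8 up to, but not including, 24 yields the address. A minimal sketch of producing such records is shown below. The function names are hypothetical, and the fixed confidence of 0.99 is an assumption, since a plain regex match carries no natural confidence score.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_emails(text: str, confidence: float = 0.99) -> list[dict]:
    """Return one record per email match; `end` is exclusive, as in Python slicing."""
    return [
        {"type": "email", "start": m.start(), "end": m.end(), "confidence": confidence}
        for m in EMAIL.finditer(text)
    ]


def above_threshold(matches: list[dict], threshold: float = 0.7) -> list[dict]:
    """Drop matches whose confidence falls below the configured threshold."""
    return [m for m in matches if m["confidence"] >= threshold]


text = "Contact john@example.com for details"
matches = above_threshold(detect_emails(text))
# The recorded span maps straight back to the matched substring.
assert text[matches[0]["start"]:matches[0]["end"]] == "john@example.com"
```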
Redact Mode
Detects PII and rewrites it according to the chosen strategy:
- type: pii
  mode: redact
  strategy: mask  # mask | remove | replace
Output with redaction:
{
  "chunks": [{
    "text": "Contact [EMAIL] for details",
    "pii_redacted": true
  }]
}
Redaction Strategies
| Strategy | Input | Output |
|---|---|---|
| mask | Email: john@example.com | Email: [EMAIL] |
| remove | Email: john@example.com | Email: |
| replace | Email: john@example.com | Email: user123@example.org |
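The three strategies can be sketched as a single function over detected spans. This is an illustration under assumptions: `redact_text` and `FAKE_VALUES` are hypothetical names, matches are assumed not to overlap, and the substitute used for `replace` is a fixed placeholder rather than a generated value.

```python
# Hypothetical stand-in values for the "replace" strategy.
FAKE_VALUES = {"email": "user123@example.org"}


def redact_text(text: str, matches: list[dict], strategy: str = "mask") -> str:
    """Rewrite `text`, substituting each matched span according to `strategy`."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for match in sorted(matches, key=lambda m: m["start"], reverse=True):
        if strategy == "mask":
            substitute = f"[{match['type'].upper()}]"
        elif strategy == "remove":
            substitute = ""
        elif strategy == "replace":
            substitute = FAKE_VALUES[match["type"]]
        else:
            raise ValueError(f"unknown redaction strategy: {strategy!r}")
        text = text[: match["start"]] + substitute + text[match["end"] :]
    return text
```

Processing spans from the end of the string backwards avoids recomputing offsets, because a substitution only changes positions to its right.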
Implementing a PII Detector
To implement a detector, create a class that satisfies the PIIDetector protocol:
from typing import Optional

from stratum.models.output import CanonicalDocument


class RegexPIIDetector:
    """Regex-based PII detector."""

    name = "regex-pii"
    version = "1.0.0"
    supported_types = ["email", "phone", "ssn"]

    def __init__(self, types: Optional[list[str]] = None, confidence_threshold: float = 0.7):
        # Copy so instances never share (and mutate) the class-level default list.
        self.types = list(types or self.supported_types)
        self.threshold = confidence_threshold

    def detect(self, document: CanonicalDocument) -> list[dict]:
        """Detect PII in document chunks."""
        matches = []
        for chunk in document.chunks:
            # Apply regex patterns
            for pii_type in self.types:
                # _get_pattern is not shown here; it would return the regex for pii_type.
                pattern = self._get_pattern(pii_type)
                # Find matches in chunk.text
                ...
        return matches

    def redact(self, document: CanonicalDocument, matches: list[dict], strategy: str = "mask") -> CanonicalDocument:
        """Redact detected PII."""
        # Apply redaction strategy
        ...
        return document
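The protocol itself is not shown on this page. Judging only from the attributes and methods of the example class, it might be declared roughly as below. This is a guess at its shape, not the project's definition, and `Any` stands in for `CanonicalDocument` so the sketch runs without stratum installed.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PIIDetector(Protocol):
    """Hypothetical protocol shape, inferred from the example detector class."""

    name: str
    version: str
    supported_types: list[str]

    def detect(self, document: Any) -> list[dict]: ...

    def redact(self, document: Any, matches: list[dict], strategy: str = "mask") -> Any: ...
```

Because the protocol is structural, a detector would not need to inherit from anything. Marking it `runtime_checkable` lets a registry use `isinstance()` to confirm that the expected attributes and methods are present, although that check does not verify signatures.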
Placement in Pipeline
A PII step can be placed at different stages of the pipeline:
Pre-Chunking (affects boundaries)
steps:
  - type: parser
  - type: pii  # Before chunking
    mode: redact
  - type: chunker
Redacting before chunking changes the text the chunker sees, so chunk and semantic boundaries may differ from those of the unredacted document.
Post-Chunking (preserves boundaries)
steps:
  - type: parser
  - type: chunker
  - type: pii  # After chunking
    mode: detect
Detection after chunking preserves chunk boundaries, which is useful for filtering at retrieval time.
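Retrieval-time filtering could then rely on the `pii_detected` metadata attached to each chunk. The sketch below assumes chunks are plain dicts shaped like the detect-mode output on this page; `chunks_without_pii` is a hypothetical helper, not part of the library.

```python
from typing import Optional


def chunks_without_pii(chunks: list[dict], blocked_types: Optional[set[str]] = None) -> list[dict]:
    """Keep chunks with no detections, or with none of `blocked_types` if a set is given."""
    kept = []
    for chunk in chunks:
        found = {hit["type"] for hit in chunk.get("pii_detected", [])}
        if blocked_types is None:
            flagged = bool(found)  # any detection excludes the chunk
        else:
            flagged = bool(found & blocked_types)  # only listed types exclude it
        if not flagged:
            kept.append(chunk)
    return kept
```

Since the text is left untouched in detect mode, the same index can serve callers with different policies, for example blocking only `ssn` for one consumer and every detected type for another.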
Related
- Architecture - System architecture and pipeline documentation
- Usage Guide - Integration and usage patterns