PHI Scanning Skill — Screen Before You Share

Usage: Load this file into Claude Code (via CLAUDE.md pointer or skill trigger) or paste into an LLM session before publishing analysis outputs or exposing clinical repos to cloud services. Ensures PHI/PII is detected locally before data leaves the machine.

Principle: Clinical data environments contain PHI by default. Any output — rendered reports, CSV exports, notebooks, scripts with hardcoded values — must be scanned before publishing, sharing, or sending to cloud-based LLMs. The cost of a 30-second scan is always less than a HIPAA incident.

1) When This Skill Activates

This skill should be invoked before any data crosses a trust boundary:

Finished HTML reports or rendered notebooks (.html, .pdf)
Pulling repos from organizational DevOps (Azure DevOps, GitHub Enterprise) that may contain clinical data — scan before pointing cloud tools at them
Creating txtarchives or code bundles for LLM ingestion
Committing CSV/data exports to version control
Sharing analysis files with collaborators outside the secure environment

Do NOT skip scanning because: - You “de-identified the data” (check — residual identifiers are common) - The file is “just code” (scripts can contain hardcoded MRNs, names in comments, sample data) - The output is aggregated (small cell counts, unique combinations can re-identify)

2) Tool: Microsoft Presidio + Custom Healthcare Recognizers

What It Is

Microsoft Presidio is an open-source PII detection framework. It combines NLP (spaCy) with pattern matching and context-aware scoring. The A PHI scanner that wraps Presidio with custom recognizers (e.g. my asksage/phi_scanner.py) extends it for healthcare-specific patterns that Presidio doesn’t cover by default.

Architecture

Input files → Text extraction → Presidio AnalyzerEngine → Findings with scores
                                       │
                                       ├── Built-in recognizers (PERSON, SSN, EMAIL, PHONE, ...)
                                       └── Custom recognizers (MRN, DOB, AGE, NPI)

All processing is local — no data leaves the machine. Uses spaCy en_core_web_lg model on CPU (no GPU required).

Built-in Entity Types (Presidio)

Entity	Examples
`PERSON`	John Smith, Dr. Wilson
`US_SSN`	234-56-7890
`EMAIL_ADDRESS`	user@example.com
`PHONE_NUMBER`	555-123-4567
`LOCATION`	Springfield, IL
`CREDIT_CARD`	4111-1111-1111-1111
`DATE_TIME`	01/15/2024, March 2023
`URL`	example.com
`US_DRIVER_LICENSE`	State-specific patterns
`US_PASSPORT`	Passport numbers
`IBAN_CODE`	International bank accounts
`IP_ADDRESS`	192.168.1.1

Custom Healthcare Entity Types

Entity	Pattern	Confidence
`MEDICAL_RECORD_NUMBER`	6-8 digits near MRN context words	0.85-1.0
`DATE_OF_BIRTH`	Dates near DOB/birthdate context	0.90
`AGE`	“XX year old”, “age XX”, “XX y/o”	0.70-0.75
`NPI`	10-digit number near NPI context	0.85

3) CLI Reference

Installation (Nix + Python venv)

# In the asksage project with Nix dev shell
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

The Nix flake.nix must include LD_LIBRARY_PATH for stdenv.cc.cc.lib so numpy/spaCy can find libstdc++.so.6.

Basic Usage

# Scan a single file
python phi_scanner.py report.md

# Scan a directory recursively
python phi_scanner.py ./my_analysis/

# Scan specific file types
python phi_scanner.py . --pattern "*.csv" "*.html" "*.md" "*.ipynb"

# Higher confidence threshold (fewer false positives)
python phi_scanner.py . --threshold 0.8

# JSON output for programmatic use
python phi_scanner.py . --format json

# CSV output saved to file
python phi_scanner.py . --format csv -o phi_report.csv

Default File Patterns

*.md *.txt *.csv *.py *.R *.qmd *.ipynb *.html *.json *.yaml *.yml

Exit Codes

Code	Meaning
0	Clean — no PHI/PII detected
1	PHI/PII detected — review findings
2	Error (file not found, etc.)

Exit code 1 enables use in CI pipelines and pre-commit hooks.

Output Formats

Table (default) — human-readable terminal output with per-file findings and summary:

Presidio scanning /path/to/target for PHI before exposing it to cloud LLM.
Initializing analyzer...
  [1/3] ./report.md
  [2/3] ./data.csv
  [3/3] ./analysis.ipynb

========================================================================
  ./report.md
========================================================================
  Line   Type                         Score  Text
  ------ ---------------------------- ------ ----------------------------
  3      PERSON                       0.85   John Smith
  5      MEDICAL_RECORD_NUMBER        1.0    MRN: 12345678

JSON — structured output with findings array and summary object.

CSV — one row per finding with columns: file, line, entity_type, score, text, start, end.

4) Workflow Integration

Before Publishing Reports

# After rendering a Quarto document
quarto render analysis.qmd
python phi_scanner.py analysis.html --threshold 0.7

Before Exposing a Repo to Cloud LLMs

# After cloning from Azure DevOps or internal Git
git clone <internal-repo-url> ./project
python phi_scanner.py ./project/ --pattern "*.csv" "*.md" "*.ipynb" "*.R" "*.py"
# Only proceed with cloud tools if exit code is 0

Before Creating txtarchives

# Scan source files, then archive
python phi_scanner.py ./my_analysis/
python -m txtarchive archive ./my_analysis/ output.txt --llm-friendly

In a Pre-Commit Hook

#!/bin/bash
# .git/hooks/pre-commit
python phi_scanner.py . --pattern "*.csv" "*.md" "*.ipynb" --threshold 0.7
if [ $? -eq 1 ]; then
    echo "PHI detected — commit blocked. Review findings above."
    exit 1
fi

5) Interpreting Results

Confidence Scores

Range	Interpretation	Action
0.9-1.0	High confidence — very likely real PII	Must review and remediate
0.7-0.89	Moderate confidence — probable PII	Review in context
0.5-0.69	Low confidence — possible false positive	Check if sensitive in context

Common False Positives

Pattern	Why It Fires	Mitigation
Author names in headers	`PERSON` recognizer	Raise threshold to 0.7+
URLs in code/config	`URL` recognizer	Expected — ignore or raise threshold
Dates in non-clinical context	`DATE_TIME` recognizer	Context-dependent — review
Package version numbers	Sometimes matches patterns	Usually low confidence — ignored

When Findings Require Action

Real PHI found — Remove or redact before proceeding. Do NOT commit, share, or send to cloud.
False positive confirmed — Safe to proceed. Consider raising --threshold for this scan.
Uncertain — Treat as real PHI. Err on the side of caution.

6) Extending the Scanner

Adding Custom Recognizers

The scanner uses Presidio’s PatternRecognizer class. To add a new entity type:

from presidio_analyzer import Pattern, PatternRecognizer

def _build_my_recognizer() -> PatternRecognizer:
    patterns = [
        Pattern("MY_PATTERN", r"regex_here", 0.85),
    ]
    return PatternRecognizer(
        supported_entity="MY_ENTITY_TYPE",
        patterns=patterns,
        context=["context", "words", "that", "boost", "confidence"],
    )

Then add it to the registry in build_analyzer():

registry.add_recognizer(_build_my_recognizer())

Context Words

Presidio boosts confidence scores when context words appear near a pattern match. For healthcare scanning, relevant context includes: patient, chart, medical, clinical, hospital, physician, diagnosis, treatment, admitted, discharged.

7) Dependencies and Environment

Package	Purpose	Size
`presidio-analyzer`	Core PII detection engine	~5 MB
`presidio-anonymizer`	Anonymization utilities	~1 MB
`spacy`	NLP framework	~15 MB
`en_core_web_lg`	English NLP model (NER, POS)	~400 MB

Nix Environment Note

When running in a Nix dev shell, LD_LIBRARY_PATH must include stdenv.cc.cc.lib for numpy’s C extensions to find libstdc++.so.6. This is configured in flake.nix:

LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
  pkgs.stdenv.cc.cc.lib
];

8) Relationship to AskSage Data Labeling Plugin

AskSage offers a Data Labeling plugin that labels content as PII/PHI/CUI via cloud API. The local Presidio scanner and the AskSage plugin serve complementary roles:

Aspect	Local Presidio Scanner	AskSage Data Labeling
Where it runs	Local machine	AskSage cloud
Data leaves machine?	No	Yes (sent to AskSage API)
Use case	Pre-screening before cloud exposure	Labeling data already in AskSage
Cost	Free (open source)	Free plugin (uses inference tokens)
When to use	BEFORE sending to any cloud service	AFTER data is already in AskSage

Workflow: Always run the local scanner first. Only use the AskSage plugin for data that has already passed local screening and been intentionally uploaded.