PII Scan
Heratio includes a built-in personal-data scanner that examines free-text fields in archival descriptions for email addresses, phone numbers, national identifiers, credit-card numbers, IP addresses and dates of birth. The scanner is pattern-based and runs entirely on the host with no external AI calls, so it is suitable for sensitive content that must not leave your jurisdiction.
Running a scan
You can invoke a scan on a single information_object from the CLI:
php artisan privacy:scan-io 12345
Add --no-persist to print findings to the terminal without writing a scan-report row. Use --jurisdiction= (one of gdpr, popia, uk_gdpr, ccpa) to override the configured default for one-off jurisdictional checks.
Each persisted scan writes a row in the ahg_pii_scan_report table. The row captures total hits, a count per type, and the full finding list (capped at 500 entries) as JSON.
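To make the shape of a persisted row concrete, here is a minimal Python sketch of how such a row could be assembled. It is illustrative only and not Heratio's implementation: the key names (information_object_id, total_hits, hits_by_type, findings) are hypothetical, and it assumes that totals are counted before the 500-entry cap is applied to the stored list.

```python
import json
from collections import Counter

# Cap on the stored finding list, as described above.
MAX_STORED_FINDINGS = 500


def build_report_row(io_id, findings):
    """Assemble a report row from a list of finding dicts.

    Assumption: each finding dict has at least a "type" key, and the
    totals reflect every hit even when the stored list is truncated.
    All key names in the returned dict are hypothetical.
    """
    counts = Counter(f["type"] for f in findings)
    return {
        "information_object_id": io_id,
        "total_hits": len(findings),
        "hits_by_type": json.dumps(dict(counts), sort_keys=True),
        "findings": json.dumps(findings[:MAX_STORED_FINDINGS]),
    }
```

Under these assumptions, a description with 600 email hits would report a total of 600 while storing only the first 500 findings.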
Reading the result
A finding is a tuple of type, value, offset_start, offset_end and confidence. Confidence is a value between 0 and 1 that reflects how strong the signal is:
- ~0.9-0.95 for emails, well-formed SSNs and Luhn-validated credit cards
- ~0.7-0.85 for E.164 phone numbers and IPs
- ~0.55 for free-form dates of birth (further bounded by a sanity window of 1900 to the current year)
- Lower values when a regex matches but a checksum (Luhn) fails - useful at review time as a soft signal rather than a hard hit
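The confidence tiers above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the scanner's actual code: the card regex, the "credit_card" type label, the exact scores 0.95 and 0.4, and the end-exclusive offsets are all hypothetical choices. Only the Luhn checksum itself is the standard algorithm.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Finding:
    type: str
    value: str
    offset_start: int
    offset_end: int  # assumption: end-exclusive, like Python slices
    confidence: float


def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits in `number`."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def scan_cards(text: str) -> list:
    """Return card-like findings; a failed checksum lowers the score."""
    findings = []
    for m in CARD_RE.finditer(text):
        value = m.group(0)
        # Illustrative scores: high when Luhn passes, soft signal otherwise.
        confidence = 0.95 if luhn_valid(value) else 0.4
        findings.append(Finding("credit_card", value, m.start(), m.end(), confidence))
    return findings


def plausible_birth_year(year: int, today=None) -> bool:
    """Sanity window for dates of birth: 1900 up to the current year."""
    current = (today or date.today()).year
    return 1900 <= year <= current
```

Keeping a failed-checksum match as a low-confidence finding, rather than discarding it, mirrors the review-time "soft signal" behaviour described in the last bullet.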
Open a finding and decide one of four outcomes:
- pending (default) - newly scanned, not yet reviewed
- reviewed - the privacy officer has examined the findings
- redacted - the underlying record has been redacted to remove the PII
- accepted_risk - PII is intentional (donor agreement, archival context) and remains
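If you handle these statuses in your own tooling, for example when reading exported scan reports, they could be modelled as a small enumeration. The sketch below is an assumption-laden illustration: the status strings come from the list above, but the ReviewStatus class and parse_status helper are hypothetical names and not part of Heratio.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    PENDING = "pending"
    REVIEWED = "reviewed"
    REDACTED = "redacted"
    ACCEPTED_RISK = "accepted_risk"


# Newly scanned reports start out as pending.
DEFAULT_STATUS = ReviewStatus.PENDING


def parse_status(raw=None):
    """Map a raw string to a ReviewStatus; empty input falls back to the default.

    Unknown values raise ValueError instead of being silently accepted.
    """
    if raw is None or raw.strip() == "":
        return DEFAULT_STATUS
    return ReviewStatus(raw.strip().lower())
```

Rejecting unknown values early keeps a typo such as "accepted-risk" from being recorded as a valid review decision.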
Privacy jurisdiction
The configured jurisdiction is stored in ahg_setting.privacy_jurisdiction. Set it to your home market to favour locally relevant regex sets (e.g. SA ID numbers under POPIA, NINOs under UK GDPR, SSNs under CCPA). When set to gdpr, the scanner uses a maximum-recall union of all national-id and phone patterns, which is useful for institutions with international holdings.
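The selection logic described above can be pictured with a short Python sketch. It is a simplification under stated assumptions: the regexes are rough, illustrative shapes for each identifier (not Heratio's actual patterns), the names NATIONAL_ID_PATTERNS and patterns_for are hypothetical, phone patterns are omitted, and "favouring" a local set is reduced here to selecting only that set.

```python
import re

# Illustrative, deliberately loose patterns keyed by jurisdiction.
NATIONAL_ID_PATTERNS = {
    "popia": {"sa_id": re.compile(r"\b\d{13}\b")},                # SA ID: 13 digits
    "uk_gdpr": {"nino": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b")},   # simplified NINO shape
    "ccpa": {"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")},        # hyphenated SSN
}


def patterns_for(jurisdiction: str) -> dict:
    """Return the national-id patterns to apply for a jurisdiction.

    "gdpr" yields the union of every set (maximum recall); any other
    known jurisdiction yields only its local set.
    """
    if jurisdiction == "gdpr":
        merged = {}
        for group in NATIONAL_ID_PATTERNS.values():
            merged.update(group)
        return merged
    if jurisdiction not in NATIONAL_ID_PATTERNS:
        raise ValueError(f"unknown jurisdiction: {jurisdiction}")
    return dict(NATIONAL_ID_PATTERNS[jurisdiction])
```

The trade-off is the usual one between recall and noise: the union catches identifiers from any market but will also flag more coincidental digit runs than a single local set.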
What the scanner does not do
The Phase 1 scanner does not OCR images, does not analyse PDF or TIFF binary content, and does not produce redaction artefacts. Image and document redaction is handled by the visual redaction editor (a separate Phase 1 deliverable). Auto-deletion and the subject rights portal arrive in a later phase.