Private PII redaction for Snowflake and self-hosted Docker

Support logs, call transcripts, intake forms, chat histories, and agent output are full of PII. Teams are generating and storing more sensitive text than ever, faster than humans can safely clean it.

Detect, redact, and de-identify PII where your data already lives: run in Snowflake or self-host the Docker container. No external API calls for payload processing.

Start Snowflake Trial · Get Self-hosted Trial Key

One engine, multiple private runtime paths. Choose the Snowflake Native App for governed SQL masking, or the self-hosted Docker container for private HTTP APIs, AI pipelines, and infrastructure you control.

99% Name detection accuracy

18 Languages

0 Data egress

<30min Setup time

Deployments

Choose your private PII redaction runtime

Same PII detection and de-identification engine. Pick Snowflake for warehouse text columns or self-hosted Docker for private API workflows.

Snowflake Native App

Data already lives in Snowflake

For governed warehouse workflows that need PII detection, redaction, and de-identification from SQL, dbt, tasks, and notebooks.

Call from SQL

SELECT agent_mask_en.app_public.mask(
  ARRAY_CONSTRUCT(note_text)
);

Best for: dbt, tasks, dynamic tables
Interface: One Snowflake function call
Ops model: Snowpark Container Services; no separate API tier to host
Boundary: Runs inside your Snowflake account
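
As a rough illustration of running the call above over a whole column, the Python sketch below templates the statement for a pipeline script. Only the function name agent_mask_en.app_public.mask and the ARRAY_CONSTRUCT argument come from the example above. The table support.tickets, the columns ticket_id and note_text, and the helper build_mask_query are hypothetical, and the shape of the function's return value is not shown on this page.

```python
import re

# Conservative check for unquoted SQL identifiers (letters, digits, _ and $).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def build_mask_query(table: str, text_column: str, key_column: str) -> str:
    """Return a SELECT that applies the mask function to one text column.

    Hypothetical helper: it only builds SQL text and never connects to Snowflake.
    """
    for name in (table, text_column, key_column):
        # Accept dotted names such as schema.table, validated part by part.
        if not all(_IDENTIFIER.match(part) for part in name.split(".")):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT {key_column},\n"
        f"       agent_mask_en.app_public.mask(ARRAY_CONSTRUCT({text_column})) AS masked\n"
        f"FROM {table}"
    )


# Hypothetical table and column names.
query = build_mask_query("support.tickets", "note_text", "ticket_id")
print(query)
```

Validating identifiers before templating keeps the generated statement safe to hand to a scheduler or a dbt pre-hook. In a dbt model you would more likely write the SELECT directly.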

Start Snowflake Trial · View Snowflake details

Self-hosted Container

Data moves through private services

For apps, agents, files, queues, and AI pipelines that need a self-hosted PII redaction API inside your infrastructure.

Call from your API tier

POST /mask/sync
Authorization: Bearer $API_KEY
{ "data": [[0, text, {}]] }

Best for: Apps, AI pipelines, files, APIs, VPC workflows
Interface: Docker container with private HTTP API endpoints
Ops model: You control container, GPU/runtime, ingress, and storage
Boundary: No outbound internet required at runtime
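
For a sense of what the request above might look like from application code, here is a minimal Python sketch that builds it with the standard library. The /mask/sync path, the Bearer header, and the {"data": [[0, text, {}]]} shape come from the example above. Everything else is an assumption: the host and port, the example key, the helper name build_mask_request, and the reading of each row as [row index, text, options].

```python
import json
import urllib.request


def build_mask_request(base_url: str, api_key: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST /mask/sync request.

    Assumption: each row is [row_index, text, options], mirroring the
    single-row example; the options object is left empty.
    """
    rows = [[index, text, {}] for index, text in enumerate(texts)]
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/mask/sync",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host, port, and key; substitute your own private endpoint.
request = build_mask_request(
    "http://localhost:8080", "example-key", ["Call Sarah Chen at 555-867-5309"]
)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Building the request separately from sending it makes the payload easy to inspect in tests before any text leaves the calling process.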

Get Self-hosted Trial Key · Docker Hub image · Read install docs

The Hidden Cost

The Data Tax

You’re paying to store data you can’t safely use.

A customer shares their credit card number in a live chat to process a refund.
A nurse types a patient’s name into a support ticket to report a software bug.
An account holder reads their SSN to a rep to verify their identity.
An employee signs their name in an "anonymous" survey to ensure a follow-up.

Now there’s a credit card number in your chat logs. A patient identity in your help desk. A Social Security number in your call transcripts. A personal identity in your "anonymous" dataset.

That’s the data tax. You can’t safely feed it to your LLMs. Your analysts need six approvals to query it. Your ML team won’t touch it for training. It’s either a compliance liability or wasted insights—pick one.

That data has value—if you could just separate what’s sensitive from what’s useful.

Regulatory Risk

Unmasked PII = audit findings, fines, breach risk

Locked Analytics

Data teams can’t use what they can’t safely access

Blocked AI/ML

Models shouldn’t be trained on data riddled with PII

Risky Data Sprawl

PII copies spreading across dev, test, and staging

The Solution

Reclaim the Value

Find hidden PII in unstructured text, text columns, and documents, then redact or de-identify it inside the private runtime you choose.

Solutions for

HIPAA-Oriented PHI De-Identification

Clinical notes, medical transcripts, and discharge summaries contain PII and PHI in unstructured text. Agent Mask helps teams detect and mask PII/PHI for de-identification workflows without sending payloads to an external API.

De-identify clinical text for research and analytics workflows
Review detected PHI categories before downstream sharing
Train ML models on de-identified language patterns while reducing raw PHI exposure

Outcome

Move clinical notes and transcripts into controlled research, analytics, and review workflows with de-identified output plus audit/review evidence.

Discharge Summary

Margaret Chen [PERSON], 67F, discharged 03/14/2024 [DATE] following cardiac catheterization. Attending: Dr. Robert Okonkwo [PERSON] (NPI: 1528496379 [NPI]). Pt to follow up with cardiology in 2 weeks. Daughter Linda Chen [PERSON] (415-555-0189 [PHONE]) designated emergency contact. Insurance: Blue Cross ID 7294851036 [INSURANCE_ID].

PCI-DSS and Privacy Redaction Workflows

Loan applications, transaction notes, customer communications—financial data lives in documents and conversations. Agent Mask helps find and mask sensitive values wherever they hide.

Clean transaction data for fraud analytics
Give BI teams governed, low-risk views of sensitive notes
Provision masked datasets for dev and QA environments

Outcome

Run fraud models on transaction notes that were previously off-limits. Support data-driven decisions while preserving your existing privacy and compliance controls.

Advisor Call Notes

Client David Ramirez [PERSON] from Meridian Capital Partners [ORG] called re: wire. IBAN: DE89370400440532013000 [IBAN]. Verified via SSN 412-68-6789 [SSN] and DOB 11/03/1978 [DOB]. Card 4532-7891-2345-4421 [CREDIT_CARD]. Callback: 832-555-0147 [PHONE].

Safe Data Sharing at Scale

Employee feedback, user research, customer surveys—valuable data locked behind privacy concerns. Agent Mask makes it safe to share across teams.

De-identify employee surveys for workforce analytics
Clean user research before sharing with product
Prepare customer feedback for company-wide insights

Outcome

Turn restricted data into company-wide assets. Analyze employee feedback without exposing who said what.

Employee Survey Response

Honestly, my manager Kevin Walsh [PERSON] has been great but the workload since March [DATE] is unsustainable. I've talked to Priya [PERSON] and James [PERSON] on my team and they feel the same. I'm starting to look elsewhere. You can reach me at t.morrison@company.com [EMAIL] if HR wants to discuss.

FOIA and Public Records Workflows

Court filings, body cam transcripts, investigative reports—government records require redaction before release. Agent Mask reduces the manual review burden by detecting and masking sensitive values before final review.

Accelerate FOIA response turnaround
Enable public records search with less raw data exposure
Prepare documents for inter-agency sharing

Outcome

Prepare public records with masked output and review metadata, then make release decisions through your normal disclosure process.

Constituent Complaint

My name is Barbara Hendricks [PERSON] and I live at 2847 Oak Street, Apt 4B [ADDRESS]. I'm writing about the situation at Riverside Elementary [ORG]. Please contact me at bhendricks@gmail.com [EMAIL] or 555-294-8831 [PHONE]. My case number is GOV-2024-08472 [CASE_ID].

Under the Hood

Comprehensive Detection

Detecting SSNs and credit cards is easy. Agent Mask finds the messy cases generic tools miss: Austin as a person, not a city; drug names buried in clinical prose; and sensitive data and patterns unique to your industry.

Agent Mask understands context, resolves name variants to a single identity, and lets you define custom categories in plain English.

One engine for healthcare, finance, government, and enterprise data across supported languages.

Flexible De-Identification

Choose how each entity is de-identified, from simple placeholders to hashing or reversible encryption.

Custom Entity Detection

Describe what to find in plain English. The model infers what matches.

Identity-Aware Masking

Simple redaction replaces text. Identity-aware masking understands that “Chen” and “Sarah Elizabeth Chen” are the same person and masks them consistently.

Private Deployment Paths

Run as a Snowflake Native App for warehouse workflows or self-host the container for private API and AI pipelines. Either way, Agent Mask does not process your data outside the environment you choose.

Document Redaction

Submit PDFs, scanned PDFs, images, DICOM, DOCX, RTF, and ZIP archives. Get de-identified document text, redaction metadata, and visually redacted outputs for supported visual formats.

Multi-Language

Detect PII across supported languages with language-aware models. Your EU, APAC, and LATAM data gets the same coverage — no extra tools, no extra vendors.

Not Just Redaction

Flexible De-Identification

Choose a replacement method for each entity type. Every response includes a full entity mapping for audit trails and authorized re-identification.
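
To make the per-entity idea concrete, here is a small Python sketch of choosing a replacement method per entity type and recording a mapping for each replacement. It is an illustration only and not Agent Mask's implementation or API: the Span type, the three method functions, and the mapping fields are hypothetical, and the spans are assumed to come from a detector.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    entity_type: str


def placeholder(value: str, entity_type: str) -> str:
    return f"[{entity_type}]"


def sha256_prefix(value: str, entity_type: str) -> str:
    # One-way hash, truncated for readability.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def mask_all_but_last4(value: str, entity_type: str) -> str:
    return "*" * max(len(value) - 4, 0) + value[-4:]


def find_span(text: str, value: str, entity_type: str) -> Span:
    start = text.index(value)
    return Span(start, start + len(value), entity_type)


def deidentify(text, spans, methods):
    """Replace each non-overlapping span using the method for its entity type."""
    mapping = []
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        method = methods.get(span.entity_type, placeholder)
        replacement = method(original, span.entity_type)
        out = out[:span.start] + replacement + out[span.end:]
        mapping.append({
            "type": span.entity_type,
            "start": span.start,
            "original": original,
            "replacement": replacement,
        })
    mapping.reverse()  # report entries in document order
    return out, mapping


text = "Patient Sarah Chen, SSN: 078-05-1120"
spans = [find_span(text, "Sarah Chen", "PERSON"), find_span(text, "078-05-1120", "SSN")]
masked, mapping = deidentify(text, spans, {"SSN": mask_all_but_last4})
print(masked)  # Patient [PERSON], SSN: *******1120
```

The mapping list is what an audit trail or an authorized re-identification step would read. In a real deployment it would need the same access controls as the original text.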

Patient Record

Patient Sarah Chen (DOB: 03/15/1987, SSN: 078-05-1120) presented with recurring lower back pain and bilateral hip stiffness. Symptoms began approximately six months ago and have worsened with prolonged sitting. No history of acute trauma. Referring physician Dr. James Whitfield documented initial assessment on 01/08/2025 and noted prior conservative treatment including physical therapy and NSAIDs with limited improvement. Imaging ordered. Patient to follow up with orthopedics within two weeks. Reach Sarah Chen at sarah@acme.com or 555-867-5309 to confirm scheduling.

Describe It, Find It

Your Business. Their Blind Spot.

Other tools ship with fixed lists and patterns. Your most sensitive data falls through the cracks. Describe what you're looking for in plain English—Agent Mask figures out what matches.

Semantic Inference, Not Pattern Matching

Category Definitions

mrn · medical record numbers (MRN)
insurance · health insurance: plan names, group numbers
rx_med · prescription drug names: Zoloft, Prozac, Ambien, metformin
dosage · medication dosages: 50mg BID, 10mg IV push, 500mg TID
mental_dx · psychiatric diagnoses: schizophrenia, OCD, anorexia, ADHD
substance · substance abuse: cocaine, heroin, methamphetamine, alcohol dependence

6 custom categories

Patient Record

CONFIDENTIAL - Integrated Care Assessment
Patient: Maria Santos [PERSON] (MRN: 847291 [MRN], DOB: 03/15/1978 [DATE])
Insurance: Blue Cross PPO [INSURANCE]
Psychiatric History:
Current medications:
- Lexapro [RX_MED] - 20mg PO daily [DOSAGE] for generalized anxiety disorder [MENTAL_DX]
- Klonopin [RX_MED] - 0.5mg SL PRN [DOSAGE] for panic attacks
- Seroquel [RX_MED] - 100mg PO QHS [DOSAGE] for sleep/mood
Diagnoses: bipolar II disorder [MENTAL_DX], post-traumatic stress disorder [MENTAL_DX], persistent depressive disorder [MENTAL_DX]
Pain Management:
- Percocet [RX_MED] for cancer pain
- Suboxone [RX_MED] - 8mg/2mg SL daily [DOSAGE] for opioid addiction [SUBSTANCE], in remission

18 entities • 6 custom categories

No regex. No lookup tables. Just describe what's sensitive — Agent Mask understands what you mean.

Industry Starter Kits

Pre-built. Ready to go.

Load a preset and start detecting industry-specific data immediately—diagnoses, medications, account numbers, employee IDs, and more. Mix with your own definitions for complete coverage.

Format Enforcement

Flexible detection. Strict matching.

Layer pattern rules on top of semantic detection to kill false positives. Enforce org-specific formats like MRNs, account numbers, and case IDs—the model detects, you decide what’s real.
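
One way to picture format enforcement is as a strict filter applied after detection. The Python sketch below assumes a detector has already proposed (type, text) candidates. The rule table, the two example patterns, and the helper enforce_formats are hypothetical and are not the product's configuration syntax.

```python
import re

# Hypothetical org-specific formats; adjust to your own identifiers.
FORMAT_RULES = {
    "MRN": re.compile(r"\d{6}"),
    "CASE_ID": re.compile(r"GOV-\d{4}-\d{5}"),
}


def enforce_formats(candidates, rules=FORMAT_RULES):
    """Keep a candidate only if its text fully matches the rule for its type.

    Types without a rule pass through unchanged.
    """
    kept = []
    for entity_type, value in candidates:
        rule = rules.get(entity_type)
        if rule is None or rule.fullmatch(value):
            kept.append((entity_type, value))
    return kept


candidates = [
    ("MRN", "847291"),
    ("MRN", "2024"),              # detected by mistake: wrong length
    ("CASE_ID", "GOV-2024-08472"),
    ("PERSON", "Maria Santos"),   # no rule, so it is kept
]
print(enforce_formats(candidates))
```

Using fullmatch rather than search is the point of the exercise: a candidate that merely contains a valid-looking fragment is still rejected.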

Fine-Grained Control

Tune each type independently.

Set different sensitivity levels for different data types. Aggressive detection for medication names, strict matching for structured IDs—each with its own de-identification method, without one affecting the other.
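
Per-type tuning can be sketched as a threshold lookup. This Python example assumes each detection carries a confidence score, which this page does not confirm. The threshold values, the field names, and the helper filter_by_threshold are all hypothetical.

```python
# Hypothetical thresholds: a lower value means more aggressive detection.
THRESHOLDS = {"RX_MED": 0.30, "MRN": 0.90}
DEFAULT_THRESHOLD = 0.50


def filter_by_threshold(detections, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep detections whose score meets the threshold for their own type."""
    return [
        d for d in detections
        if d["score"] >= thresholds.get(d["type"], default)
    ]


# Illustrative scores only; they are not real model output.
detections = [
    {"type": "RX_MED", "text": "Lexapro", "score": 0.42},
    {"type": "MRN", "text": "2024", "score": 0.55},
    {"type": "MRN", "text": "847291", "score": 0.97},
    {"type": "PERSON", "text": "Maria Santos", "score": 0.80},
]
kept = filter_by_threshold(detections)
print([d["text"] for d in kept])  # ['Lexapro', '847291', 'Maria Santos']
```

Because each type looks up its own threshold, loosening medication detection has no effect on how strictly structured IDs are matched.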

That clinical record above? Your data looks just like it. Psychiatric diagnoses, substance history, medication names, dosages — buried in free text that generic tools don’t know to look for.

Context, Not Strings

Same Entity. Same Mask.

Other tools give the same person three different placeholders — and your data stops making sense. Contextual matching and cross-field consistency keep your data analytically useful.

PERSON_1 · Sarah Elizabeth Chen, Chen, Sarah

PERSON_2 · James Park, Park

PERSON_3 · Lisa Chen-Nakamura, Lisa

LOCATION_1 · Mercy General Hospital, Mercy General

Demographics

Patient: Sarah Elizabeth Chen [PERSON_1]
Employer: Mercy General Hospital [LOCATION_1]
Referred by: Dr. James Park [PERSON_2]
Emergency contact: Lisa Chen-Nakamura [PERSON_3] (sister)

Clinical Notes

Dr. Park [PERSON_2] referred pt for chronic migraine. Chen [PERSON_1] reports worsening with aura. Seen at Mercy General [LOCATION_1] outpatient neuro. Sarah [PERSON_1] declines imaging. Sister Lisa [PERSON_3] present. Park [PERSON_2] to follow up in 4wk.
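
The identity map above can be read as a lookup from every variant to one placeholder. The Python sketch below shows only the final substitution step, assuming the map has already been produced. It uses plain string matching, so it does none of the contextual disambiguation described in this section. The helper apply_identity_map is hypothetical.

```python
import re

# Variants grouped by placeholder, copied from the example above.
IDENTITY_MAP = {
    "PERSON_1": ["Sarah Elizabeth Chen", "Chen", "Sarah"],
    "PERSON_2": ["James Park", "Park"],
    "PERSON_3": ["Lisa Chen-Nakamura", "Lisa"],
    "LOCATION_1": ["Mercy General Hospital", "Mercy General"],
}


def apply_identity_map(text, identity_map):
    """Replace every known variant with the placeholder of its identity."""
    lookup = {
        variant: placeholder
        for placeholder, variants in identity_map.items()
        for variant in variants
    }
    # Longest variants first so "Lisa Chen-Nakamura" wins over "Lisa" or "Chen".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b")
    return pattern.sub(lambda match: lookup[match.group(0)], text)


note = "Dr. Park referred pt for chronic migraine. Chen reports worsening with aura."
print(apply_identity_map(note, IDENTITY_MAP))
# Dr. PERSON_2 referred pt for chronic migraine. PERSON_1 reports worsening with aura.
```

The hard part is deciding which mentions belong to the same identity in the first place. That step is not shown here.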

Reference Resolution

Every variant, one mask.

Contextual AI matches name variants, abbreviations, and acronyms that rules alone would miss—so “Sarah Elizabeth Chen”, “Chen”, and “Sarah” all collapse to a single placeholder. Deterministic normalization does the same for structured data—“(555) 123-4567” and “555.123.4567”, or “January 15, 2024” and “01/15/2024”. Your de-identified data reads like real data—not a bag of disconnected placeholders.
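
The deterministic half of this can be sketched in a few lines of Python: normalize each structured value to a canonical key, then issue one placeholder per key. This is an illustration under stated assumptions, not the product's normalizer. The two date formats, the digits-only phone key, and the PlaceholderMap class are hypothetical, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Only the two date layouts mentioned above; real data would need more.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y")


def normalize_phone(value: str) -> str:
    return re.sub(r"\D", "", value)  # keep digits only


def normalize_date(value: str) -> str:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # unknown layout: leave the value as its own key


NORMALIZERS = {"PHONE": normalize_phone, "DATE": normalize_date}


class PlaceholderMap:
    """Hand out one numbered placeholder per (type, normalized value)."""

    def __init__(self):
        self._by_key = {}
        self._counts = {}

    def placeholder(self, entity_type: str, value: str) -> str:
        normalizer = NORMALIZERS.get(entity_type, str.strip)
        key = (entity_type, normalizer(value))
        if key not in self._by_key:
            self._counts[entity_type] = self._counts.get(entity_type, 0) + 1
            self._by_key[key] = f"{entity_type}_{self._counts[entity_type]}"
        return self._by_key[key]


ids = PlaceholderMap()
print(ids.placeholder("PHONE", "(555) 123-4567"))   # PHONE_1
print(ids.placeholder("PHONE", "555.123.4567"))     # PHONE_1
print(ids.placeholder("DATE", "January 15, 2024"))  # DATE_1
print(ids.placeholder("DATE", "01/15/2024"))        # DATE_1
```

Name variants such as “Chen” and “Sarah” cannot be handled this way, which is why the text above pairs normalization with contextual matching.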

Corpus Consistency

Mix formats. One identity map.

Send text columns, PDFs, images, and DOCX files together. Agent Mask links the same identity across them, so each name gets one replacement everywhere it appears. No manual alignment required.

vs. Simple Redaction

Every mention = different placeholder
Your data loses all referential meaning
Useless for analytics or ML

vs. Rule-Based & Name-Part Matching

Exact strings or split name parts—no context
Can’t disambiguate “Smith” when John Smith and Jane Smith both appear
Names only—no locations, orgs, dates, or phone numbers

vs. Manual Review

Doesn't scale past dozens of records
Human reviewers miss cross-column links
Can't link name variants to the same person

Specifications

Built for Production

Technical Capabilities

Supported Entities

Dozens of built-in entity types across personal, financial, healthcare, and digital categories—plus unlimited custom types

Languages

Supported languages with language-aware models

Detection

Context-aware AI — 99% person name detection, 95% overall NER quality across 17 locales

Processing

GPU-optimized batch processing for high-volume workloads

Deployment

Snowflake Native App or self-hosted Docker container

Data Residency

Processing stays inside your Snowflake account or your self-hosted environment

De-Identification

Pseudonymization, masking, hashing, encryption, synthetic data (Faker), redaction, keep (detect-only)

Document Formats

PDF, images, and DICOM visual redaction with bounding-box metadata; DOCX and RTF extraction; ZIP archive processing

Entity Collapsing

AI-driven coreference for names, places, and orgs (strict / moderate / broad threshold) + deterministic normalization for everything else

Personal Identifiers

Person · Email · Phone · SSN · ITIN · Driver's License · Passport · Vehicle ID

Financial

Credit Card · Account Number · Bank Account / Routing Number

Healthcare

NPI · MBI · DEA · Health Plan ID · Medical Record Number · Medical License · Device Identifier · Date

Digital & Location

IP Address · URL · ZIP Code · Location

Organizations & Groups

Organization · Religion, Nationality, Political Affiliation

Your Custom Entities

Define domain-specific entity types in natural language and let the model do the rest. Start from an industry starter kit or build your own.

HIPAA

HIPAA De-Identification Workflows

Masks names, dates, contacts, SSNs, and medical record numbers across clinical text.

GDPR

EU Personal Data

Pseudonymization and data minimization across supported languages.

CCPA

Consumer Data Controls

Detects and de-identifies personal information in unstructured text to support consumer privacy workflows.

PCI

Payment Data

Detects and masks common payment identifiers such as credit card numbers, expiration dates, and CVVs.

SOC2

Audit/Review Controls

Private-runtime processing, permission-scoped access, and entity evidence give auditors and reviewers concrete controls to inspect.

Agent Mask is not a legal determination or compliance certification. PII/PHI detection is probabilistic, and outputs should be validated against your policy, configuration, and target standard before external sharing or regulated use.

Controlled-Environment Architecture

Agent Mask runs where you deploy it. We do not receive, store, or process your data outside your Snowflake account or self-hosted environment. The app uses only the access you grant—nothing more.

Private redaction from SQL or HTTP.

Call Agent Mask from dbt models, scheduled tasks, batch queries, or a private REST API. It only touches the fields you hand it. No scanning. No crawling. No third-party processing by Agent Mask.

The Trade-Offs End Here

Where Alternatives Break Down

LLM APIs look easy until token costs, parsing, and auditability show up. Cloud APIs need pipelines. Enterprise platforms weren't built for free text. Agent Mask keeps de-identification in your private runtime: Snowflake or self-hosted.

You're probably not using any of these.

Most teams are not replacing a tool. They are leaving sensitive data unprotected until an audit, breach, or AI project forces the issue. Every month without coverage keeps your data risky, restricted, and unusable for AI. Agent Mask makes it easy to start today.

LLM APIs

GPT · Claude · Gemini · Bedrock

You probably thought of using an LLM already. Send text to GPT, Claude, or Gemini and prompt-engineer it to redact the PII. Works great for a demo. Then you do the math on per-token costs at production volume. And shipping your most sensitive data to a third-party API doesn't sit right. Oh, and they just raised their prices. Again.

Their Limitations

Metered per-token pricing. The prompts, the inputs, the outputs, the retries—costs grow with every batch
You send sensitive text to a third-party API, parse the response, and write results back, so much of your effort goes into maintaining another bespoke pipeline
Non-deterministic output can vary between runs, or miss real PII

Agent Mask Advantage

Private-runtime pricing based on your deployment path, not per-token LLM usage
One SQL function or private API—process text inside Snowflake or your own infrastructure without an external LLM pipeline
Deterministic output with exact character positions and reproducible audit trails

Read the full LLM API redaction breakdown

Snowflake AI_REDACT

Cortex Built-in Function

Snowflake's built-in option. Their docs say it "works best with well-formed English text." Convenient for a quick demo, but the gaps show fast in production.

Their Limitations

Tight limits: a 4K token limit on input and output combined, with a 1K token output cap
English-optimized only; currently supports only US PII and some UK and Canadian PII
No pseudonymization, configurable replacements, or cross-column consistency

Agent Mask Advantage

No token limits—process documents of any length
Broader language and regional coverage, plus custom entity types
Pseudonymization, other configurable replacements, and cross-column coreference resolution

Read the full AI_REDACT alternative breakdown

John Snow Labs

Healthcare NLP

John Snow Labs spans clinical NLP, extraction, annotation, and model workflows. Agent Mask is built for the de-identification control layer: custom sensitive categories, replacement choices, audit evidence, related-field consistency, supported files, and private deployment in Snowflake or self-hosted infrastructure.

Their Limitations

The Snowflake app returns redacted text, not analyzer results, span evidence, or an original-to-replacement ledger
Custom sensitive categories are not defined inside the de-identification call; specialized categories live in separate extraction workflows
The Snowflake app centers on masked or obfuscated output, not per-entity operators or format enforcement
Related fields, partial names, and aliases do not share one identity map in the Snowflake app

Agent Mask Advantage

Return de-identified output plus analyzer results, span metadata, and entity ledger entries for review
Define custom entities in plain English or regex inside the same de-identification request
Choose replacement methods per entity: placeholders, masking, hashing, reversible encryption, synthetic values, keep rules, and date shifting
Keep one replacement for the same person, place, or organization across related fields, aliases, and supported files
Run the same de-identification engine in Snowflake or a self-hosted container for private API workflows

Read the full John Snow Labs breakdown

Google Sensitive Data Protection

Cloud DLP

Google Sensitive Data Protection is a serious platform: a deep library of built-in detectors, custom infoTypes, deterministic tokenization, and mature DLP controls. If your data already lives in Google Cloud, this one belongs on the shortlist. If not, you still need to move sensitive text through a separate Google Cloud workflow and stitch the results back into your systems.

Their Limitations

Sensitive text has to be sent to Google, processed through DLP jobs or API calls, then written back to the source workflow
Usage-based per-GB pricing adds a second meter for inspection, transformation, storage, and orchestration
Deterministic tokenization preserves identical strings, but it does not resolve aliases like "John Smith," "Dr. Smith," and "Smith" as one person

Agent Mask Advantage

Runs in the private runtime you choose: Snowflake Native App for warehouse workflows or self-hosted container for private services
No DLP job orchestration, templates, cross-cloud staging, or writeback path just to de-identify free text
Coreference-aware pseudonymization with exact match locations—the same real-world identity gets the same replacement, even when mentions vary

Read the full Google Sensitive Data Protection breakdown

AWS Comprehend

PII Detection API

AWS Comprehend PII is useful if your stack already runs on AWS and you need English or Spanish PII offsets or redaction. But it requires stitching together more cloud infrastructure: IAM, S3 or API jobs, retries, parsing, and writeback. Custom entities are a separate trained-model path, not something you describe in the de-identification call.

Their Limitations

Teams still build AWS plumbing: IAM, S3 or API jobs, retry handling, output parsing, and writeback
Comprehend PII is English and Spanish with fixed PII types; Comprehend Medical is a separate English clinical service
Custom entities require annotations or entity lists to train a separate recognizer, not a plain-English description in the masking request

Agent Mask Advantage

One SQL function or private HTTP API for text fields and documents—no AWS job orchestration
Built-in entity detection across supported languages, not just English and Spanish
Custom entities described in plain English inside the masking request, with no separate recognizer to train

Read the full AWS Comprehend PII breakdown

Data Privacy Vaults

Skyflow · Protegrity

Enterprise platforms built around structured-data governance—Skyflow’s privacy vault for tokenized PII columns, Protegrity’s field-level protection across enterprise data stores. Both have layered on unstructured-text capabilities recently, but their flagship products were designed for known PII in fixed schemas. Expect enterprise sales cycles, governance rollouts, and six-figure annual contracts—for a platform built around a different problem than yours.

Their Limitations

Flagship products are structured-data platforms—Skyflow’s privacy vault and Protegrity’s field-level tokenization; unstructured-text capabilities are recent layers on top, not the core engineering focus
Some deployment patterns add a separate vault or external processing layer before text can be de-identified
Enterprise procurement path—security review, governance approval, and professional-services engagement—before the first query runs

Agent Mask Advantage

Purpose-built for unstructured text with custom entity types, coreference resolution, and cross-column consistency
Runs inside your chosen private runtime—no vault infrastructure and no Agent Mask data egress
Focused unstructured-text de-identification—start from Snowflake Marketplace or a self-hosted trial key without a field-mapping engagement

Read the full breakdowns: Skyflow, Protegrity

Get Started

Stop Paying the Data Tax

You already know your data is filled with PII. You’re already paying the data tax. Pick the runtime that fits your workflow and get de-identification moving this afternoon.

Under 30-minute trial setup · Self-hosted container trial · Free proof of concept

Choose the route that matches where your sensitive data already lives.

Choose

Start with the Snowflake Native App or get a self-hosted trial key for the private container.

Connect

Call Agent Mask from SQL, dbt, scheduled tasks, applications, or private API workflows.

Get Clean Data

Receive de-identified output with PII replaced. Your original stays intact.

No $200K enterprise contracts. No per-character API fees. Start with a free proof of concept on your actual data.

Workflow Fit

Describe your workflow

Tell us what you need to de-identify, where it lives, and what the clean output needs to feed. We will tell you if Agent Mask fits, what to try first, or if another path is cleaner.

Response

Within one business day

Usually faster for workflow-fit questions, Marketplace access, self-hosted keys, BAAs, and deployment blockers.

Useful context

Your data type and target pipeline

Clinical notes, tickets, call transcripts, LLM pipelines, or anything else packed with unstructured PII.

Deployment fit

Not sure which path fits?

Tell us your data location, security constraints, target pipeline, and evaluation timeline.

Direct email

info@agentmask.io

Same inbox, same humans, no mailing list.

Secure intake · Takes about 30 seconds

We only use this to reply. No newsletter, no third-party list.
