Stop Hoarding Data

Every system you run is dumping unstructured text into your warehouse — support tickets, call transcripts, intake forms, survey responses, chat logs. And then you added AI. Now AI agents, assistants, and LLM pipelines are generating text faster than humans ever did. Every record loaded with PII that’s one query away from exposure. And every regulation says that’s your problem now.

Detect and de-identify it in a single function call, entirely inside Snowflake. No data egress. No third-party risk.

97% name detection accuracy

We detect dozens of entities and even custom entities. But names are what you care about most, and they are the hardest to get right. 97% accuracy across 77K+ samples.

14 languages

Most de-identification tools are English-only. Yours doesn’t have to be.

0 data egress

Your data never leaves Snowflake. No external API calls, no third-party processing, no data residency concerns. The simplest compliance story you’ll ever tell.

<30 min setup to first scan

Install from the Marketplace, call one function. You’re scanning real data today, not next quarter.

The Data Tax

You’re paying to store data you can’t safely use.

  • A customer shares their credit card number in a live chat to process a refund.
  • A nurse types a patient’s name into a support ticket to report a software bug.
  • An account holder reads their SSN to a rep to verify their identity.
  • An employee signs their name in an "anonymous" survey to ensure a follow-up.

Now there’s a credit card number in your chat logs. A patient identity in your help desk. A Social Security number in your call transcripts. A personal identity in your "anonymous" dataset.

That’s the data tax. You can’t safely feed it to your LLMs. Your analysts need six approvals to query it. Your ML team won’t touch it for training. It’s either a compliance liability or wasted insights—pick one.

That data has value—if you could just separate what’s sensitive from what’s useful.

Regulatory Risk

Unmasked PII = audit findings, fines, breach risk

Locked Analytics

Data teams can’t use what they can’t safely access

Blocked AI/ML

Models shouldn’t be trained on data riddled with PII

Risky Data Sprawl

PII copies spreading across dev, test, and staging

Reclaim the Value

Scan text fields and documents for hidden PII and mask it automatically—all inside Snowflake.

HIPAA-Compliant Analytics

Clinical notes, medical transcripts, discharge summaries—healthcare data is unstructured and sensitive. Agent Mask detects PII in freeform text with the precision HIPAA demands.

  • De-identify clinical text for research and analytics
  • Share with research partners without expanding BAA scope
  • Train ML models on the language, not the PII

Outcome

Share data with internal teams and external partners for research, analytics, and care coordination—without compromising patient privacy. Train AI on real clinical notes and transcripts, safely.

Discharge Summary
Margaret Chen [PERSON], 67F, discharged 03/14/2024 [DATE] following cardiac catheterization. Attending: Dr. Robert Okonkwo [PERSON] (NPI: 1528496372 [NPI]). Pt to follow up with cardiology in 2 weeks. Daughter Linda Chen [PERSON] (415-555-0189 [PHONE]) designated emergency contact. Insurance: Blue Cross ID 7294851036 [INSURANCE_ID].

PCI-DSS and Privacy Compliance

Loan applications, transaction notes, customer communications—financial data lives in documents and conversations. Agent Mask finds sensitive data wherever it hides.

  • Clean transaction data for fraud analytics
  • Enable BI teams to query without compliance risk
  • Provision safe datasets for dev and QA environments

Outcome

Run fraud models on transaction notes that were previously off-limits. Enable data-driven decisions while maintaining the regulatory compliance your business depends on.

Advisor Call Notes
Client David Ramirez [PERSON] from Meridian Capital Partners [ORG] called re: wire. IBAN: DE89370400440532013000 [IBAN]. Verified via SSN 412-68-6789 [SSN] and DOB 11/03/1978 [DOB]. Card 4532-7891-2345-4421 [CREDIT_CARD]. Callback: 832-555-0147 [PHONE].

Safe Data Sharing at Scale

Employee feedback, user research, customer surveys—valuable data locked behind privacy concerns. Agent Mask makes it safe to share across teams.

  • De-identify employee surveys for workforce analytics
  • Clean user research before sharing with product
  • Prepare customer feedback for company-wide insights

Outcome

Turn restricted data into company-wide assets. Analyze employee feedback without exposing who said what.

Employee Survey Response
Honestly, my manager Kevin Walsh [PERSON] has been great but the workload since March [DATE] is unsustainable. I've talked to Priya [PERSON] and James [PERSON] on my team and they feel the same. I'm starting to look elsewhere. You can reach me at t.morrison@company.com [EMAIL] if HR wants to discuss.

FOIA and Public Records Compliance

Court filings, body cam transcripts, investigative reports—government records require redaction before release. Agent Mask automates what used to take hours of manual review.

  • Accelerate FOIA response turnaround
  • Enable public records search without exposure
  • Prepare documents for inter-agency sharing

Outcome

Prepare public records without manual review of every document. Meet disclosure deadlines without compromising privacy.

Constituent Complaint
My name is Barbara Hendricks [PERSON] and I live at 2847 Oak Street, Apt 4B [ADDRESS]. I'm writing about the situation at Riverside Elementary [ORG]. Please contact me at bhendricks@gmail.com [EMAIL] or 555-294-8831 [PHONE]. My case number is GOV-2024-08472 [CASE_ID].

Under the Hood

Comprehensive Detection

Detecting SSNs and credit cards is the easy part — every tool does that. The hard part is everything else: ambiguous contexts where Austin is a person, not a city. Drug names buried in clinical prose. Sensitive data unique to your industry that no generic model knows to look for.

Agent Mask understands context, resolves name variants to a single identity, and lets you define custom categories in plain English.

One engine for healthcare, finance, government, and enterprise data across 14 languages.

Native to Snowflake

Your data never leaves your environment. No file transfers, no API calls, no additional infrastructure to manage, no third-party exposure. Your data stays yours.

Document Redaction

Submit PDFs and scanned documents. Get back extracted text with PII de-identified, plus visually redacted files with PII masked in both the text layer and the rendered image — so no one can copy-paste or extract their way around it.

Multi-Language

Detect PII across 14 languages with dedicated models for each. Your EU, APAC, and LATAM data gets the same coverage — no extra tools, no extra vendors.

Flexible De-Identification

Eight operators, configured per entity type. Every response includes a full entity mapping for audit trails and authorized re-identification.

Patient Sarah Chen (DOB: 03/15/1987, SSN: 123-45-6789) presented with recurring lower back pain and bilateral hip stiffness. Symptoms began approximately six months ago and have worsened with prolonged sitting. No history of acute trauma. Referring physician Dr. James Whitfield documented initial assessment on 01/08/2025 and noted prior conservative treatment including physical therapy and NSAIDs with limited improvement. Imaging ordered. Patient to follow up with orthopedics within two weeks. Reach Sarah Chen at sarah@acme.com or 555-867-5309 to confirm scheduling.
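To make "operators configured per entity type" and "a full entity mapping" concrete, here is a minimal Python sketch. It is not Agent Mask's API: the names `Entity`, `OPERATORS`, `find_spans`, and `deidentify` are hypothetical, the detector is a literal string search standing in for a real model, and only four illustrative operators are shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "PERSON"


# Hypothetical per-type configuration; operator names are illustrative only.
OPERATORS = {"PERSON": "pseudonymize", "SSN": "mask", "EMAIL": "redact", "DATE": "keep"}


def find_spans(text, needle, label):
    """Stand-in for a detector: locate every literal occurrence of `needle`."""
    return [Entity(m.start(), m.end(), label)
            for m in re.finditer(re.escape(needle), text)]


def deidentify(text, entities, operators=OPERATORS):
    """Apply one operator per entity type; return (new_text, mapping).

    Assumes the entity spans do not overlap.
    """
    counters, placeholders, mapping, out = {}, {}, [], []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        original = text[ent.start:ent.end]
        op = operators.get(ent.label, "redact")  # unknown types are redacted
        if op == "keep":
            replacement = original  # detect-only: report it, leave it in place
        elif op == "mask":
            replacement = "".join("*" if c.isalnum() else c for c in original)
        elif op == "pseudonymize":
            key = (ent.label, original)
            if key not in placeholders:
                counters[ent.label] = counters.get(ent.label, 0) + 1
                placeholders[key] = f"[{ent.label}_{counters[ent.label]}]"
            replacement = placeholders[key]
        else:
            replacement = f"[{ent.label}]"
        out.append(text[cursor:ent.start])
        out.append(replacement)
        # One mapping row per entity: the audit trail described above.
        mapping.append({"label": ent.label, "start": ent.start, "end": ent.end,
                        "original": original, "replacement": replacement,
                        "operator": op})
        cursor = ent.end
    out.append(text[cursor:])
    return "".join(out), mapping


text = ("Patient Sarah Chen (DOB: 03/15/1987, SSN: 123-45-6789). "
        "Reach Sarah Chen at sarah@acme.com.")
entities = (find_spans(text, "Sarah Chen", "PERSON")
            + find_spans(text, "03/15/1987", "DATE")
            + find_spans(text, "123-45-6789", "SSN")
            + find_spans(text, "sarah@acme.com", "EMAIL"))
clean, mapping = deidentify(text, entities)
print(clean)
```

Because the mapping records the original value and offsets for every replacement, whoever holds it can audit or reverse the output. That also means the mapping itself is sensitive and should be access-controlled like the source data.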

Your Business. Their Blind Spot.

Other tools ship with fixed lists and patterns. Your most sensitive data falls through the cracks. Describe what you're looking for in plain English—Agent Mask figures out what matches.

Semantic Inference, Not Pattern Matching

Category Definitions
mrn · medical record numbers (MRN)
insurance · health insurance: plan names, group numbers
rx_med · prescription drug names: Zoloft, Prozac, Ambien, metformin
dosage · medication dosages: 50mg BID, 10mg IV push, 500mg TID
mental_dx · psychiatric diagnoses: schizophrenia, OCD, anorexia, ADHD
substance · substance abuse: cocaine, heroin, methamphetamine, alcohol dependence
terminal_dx · terminal diagnoses: ALS, stage IV cancer, end-stage renal
genetic · genetic markers and test results: BRCA2, HER2, Lynch syndrome
orientation · sexual orientation: gay, lesbian, queer, LGBTQ+
immigration · immigration status: visa type, undocumented, asylum
Patient Record
CONFIDENTIAL - Integrated Care Assessment
Patient: Maria Santos [PERSON] (MRN: 847291 [MRN], DOB: 03/15/1978 [DATE])
Insurance: Blue Cross PPO [INSURANCE]
Psychiatric History:
Current medications:
- Lexapro [RX_MED] - 20mg PO daily [DOSAGE] for generalized anxiety disorder [MENTAL_DX]
- Klonopin [RX_MED] - 0.5mg SL PRN [DOSAGE] for panic attacks
- Seroquel [RX_MED] - 100mg PO QHS [DOSAGE] for sleep/mood
Diagnoses: bipolar II disorder [MENTAL_DX], post-traumatic stress disorder [MENTAL_DX], persistent depressive disorder [MENTAL_DX]
Pain Management:
- Percocet [RX_MED] for cancer pain
- Suboxone [RX_MED] - 8mg/2mg SL daily [DOSAGE] for opioid addiction [SUBSTANCE], in remission
Oncology:
Diagnosis: metastatic pancreatic adenocarcinoma [TERMINAL_DX]
Genetic testing: BRCA1 [GENETIC] positive
Social Assessment:
Currently at Salvation Army shelter following eviction. Patient identifies as bisexual [ORIENTATION]. On H-1B visa [IMMIGRATION], renewal pending.
No regex. No lookup tables. Just describe what's sensitive — Agent Mask understands what you mean.

Industry Starter Kits

Pre-built. Ready to go.

Load a preset and start detecting industry-specific data immediately—diagnoses, medications, account numbers, employee IDs, and more. Mix with your own definitions for complete coverage.

Format Enforcement

Flexible detection. Strict matching.

Layer pattern rules on top of semantic detection to kill false positives. Enforce org-specific formats like MRNs, account numbers, and case IDs—the model detects, you decide what’s real.
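One way to picture this layering is a format filter that runs after detection. The sketch below is an assumption about the general technique, not Agent Mask's implementation: `FORMAT_RULES`, `enforce_formats`, and the two example formats (a six-digit MRN and a `GOV-YYYY-NNNNN` case ID) are hypothetical.

```python
import re

# Hypothetical org-specific formats. Types without a rule pass through untouched.
FORMAT_RULES = {
    "MRN": re.compile(r"\d{6}"),
    "CASE_ID": re.compile(r"GOV-\d{4}-\d{5}"),
}


def enforce_formats(candidates, rules=FORMAT_RULES):
    """Keep a (label, text) candidate only if its type has no rule,
    or its text matches that type's rule in full."""
    kept = []
    for label, text in candidates:
        rule = rules.get(label)
        if rule is None or rule.fullmatch(text):
            kept.append((label, text))
    return kept


# Candidates as a semantic detector might propose them (illustrative data).
candidates = [
    ("MRN", "847291"),
    ("MRN", "room 12"),            # plausible false positive
    ("CASE_ID", "GOV-2024-08472"),
    ("PERSON", "Maria Santos"),    # no format rule for names
]
print(enforce_formats(candidates))
```

Using `fullmatch` rather than `search` is the design choice that makes the rule strict: a candidate that merely contains a six-digit run is rejected.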

Fine-Grained Control

Tune each type independently.

Set different sensitivity levels for different data types. Aggressive detection for medication names, strict matching for structured IDs—each with its own de-identification method, without one affecting the other.

Same Entity. Same Mask.

Other tools give the same person three different placeholders — and your data stops making sense. Contextual matching and cross-field consistency keep your data analytically useful.

PERSON_1 · Sarah Elizabeth Chen, Chen, Sarah
PERSON_2 · James Park, Park
PERSON_3 · Lisa Chen-Nakamura, Lisa
LOCATION_1 · Mercy General Hospital, Mercy General

Demographics
Patient: Sarah Elizabeth Chen [PERSON_1]
Employer: Mercy General Hospital [LOCATION_1]
Referred by: Dr. James Park [PERSON_2]
Emergency contact: Lisa Chen-Nakamura [PERSON_3] (sister)

Clinical Notes
Dr. Park [PERSON_2] referred pt for chronic migraine. Chen [PERSON_1] reports worsening with aura. Seen at Mercy General [LOCATION_1] outpatient neuro. Sarah [PERSON_1] declines imaging. Sister Lisa [PERSON_3] present. Park [PERSON_2] to follow up in 4wk.

Reference Resolution

Every variant, one mask.

Contextual AI matches name variants, abbreviations, and acronyms that rules alone would miss—so “Sarah Elizabeth Chen”, “Chen”, and “Sarah” all collapse to a single placeholder. Deterministic normalization does the same for structured data—“(555) 123-4567” and “555.123.4567”, or “January 15, 2024” and “01/15/2024”. Your de-identified data reads like real data—not a bag of disconnected placeholders.
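The deterministic half of this is easy to sketch. The Python below covers only normalization of structured values, not the contextual matching of name variants, and it is an illustration under assumptions: `normalize`, `IdentityMap`, and the list of accepted date formats are hypothetical, not Agent Mask's actual rules.

```python
import re
from datetime import datetime

# Date layouts this sketch accepts (an assumption; real inputs vary more widely).
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def normalize(label, value):
    """Reduce a detected value to a canonical form for its entity type."""
    if label == "PHONE":
        return re.sub(r"\D", "", value)  # keep digits only
    if label == "DATE":
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
    # Fallback: collapse whitespace and ignore case.
    return " ".join(value.split()).casefold()


class IdentityMap:
    """Hands out one placeholder per (type, canonical value) pair."""

    def __init__(self):
        self._placeholders = {}
        self._counters = {}

    def placeholder(self, label, value):
        key = (label, normalize(label, value))
        if key not in self._placeholders:
            n = self._counters.get(label, 0) + 1
            self._counters[label] = n
            self._placeholders[key] = f"{label}_{n}"
        return self._placeholders[key]


ids = IdentityMap()
print(ids.placeholder("PHONE", "(555) 123-4567"))
print(ids.placeholder("PHONE", "555.123.4567"))
print(ids.placeholder("DATE", "January 15, 2024"))
print(ids.placeholder("DATE", "01/15/2024"))
```

Reusing one `IdentityMap` across every column and document in a batch is what keeps a value's placeholder the same wherever it appears.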

Corpus Consistency

Mix formats. One identity map.

Send text columns, PDFs, images, and DOCX files together and Agent Mask connects the dots across all of them. The same name always gets the same replacement, everywhere it appears—no manual alignment needed.

vs. Simple Redaction

  • Every mention = different placeholder
  • Your data loses all referential meaning
  • Useless for analytics or ML

vs. Rule-Based & Name-Part Matching

  • Exact strings or split name parts—no context
  • Can’t disambiguate “Smith” when John Smith and Jane Smith both appear
  • Names only—no locations, orgs, dates, or phone numbers

vs. Manual Review

  • Doesn't scale past dozens of records
  • Human reviewers miss cross-column links
  • Can't link name variants to the same person

Built for Production

Technical Capabilities

Supported Entities: Dozens of built-in entity types across personal, financial, healthcare, and digital categories, plus unlimited custom types
Languages: 14 languages with dedicated models for each
Detection: Context-aware AI; 97% person name detection, 94% overall NER quality across 17 locales
Processing: GPU-optimized batch processing for high-volume workloads
Deployment: Snowflake Native App, no external infrastructure
Data Residency: All processing within your Snowflake account
De-Identification: Pseudonymization, masking, hashing, encryption, synthetic data (Faker), redaction, keep (detect-only)
Document Formats: PDF (text and scanned/OCR), DOCX; visual redaction with bounding-box metadata
Entity Collapsing: AI-driven coreference for names, places, and orgs (strict / moderate / broad threshold) + deterministic normalization for everything else

Personal Identifiers

Person · Email · Phone · SSN · ITIN · Driver's License · Passport

Financial

Credit Card · Bank Account

Healthcare

NPI · MBI · DEA · Health Plan ID · Date

Digital & Location

IP Address · URL · ZIP Code · Location

Organizations & Groups

Organization · Religion, Nationality, Political Affiliation

Your Custom Entities

Define domain-specific entity types in natural language and let the model do the rest. Start from an industry starter kit or build your own.

HIPAA Ready

Healthcare data protection. BAA support and PHI detection.

GDPR Compliant

EU data types, right to erasure, data minimization.

CCPA Ready

California consumer data protection and disclosure.

PCI-DSS Aligned

Credit card detection and masking for payments.

SOC 2 Ready

Built with SOC 2 controls for enterprise security.

Zero Trust Architecture

Agent Mask operates on a zero-trust model. We never see your data, never store your data, never have access to your data. The application runs in your Snowflake environment with the permissions you grant—nothing more.

Batch-First. Snowflake-Native. Expanding.

Call Agent Mask from dbt models, scheduled tasks, or batch queries — it only touches the fields you hand it. No scanning. No crawling. Today that means Snowflake. Azure, SageMaker, and self-hosted are on the roadmap. Reach out if that’s what you’re waiting for.

What You Give Up With Every Alternative

Cloud APIs need a pipeline to ship your data out. LLMs hallucinate. Regex misses context. Enterprise platforms weren't built for free text. Pick your poison—or don't.

Cloud APIs

AWS Comprehend · Google Cloud DLP

To use these, you become the pipeline engineer: export from Snowflake, route through API Gateway or Lambda, process on AWS or Google, parse the response, write back. Google DLP is capable—200+ detectors, pseudonymization options—but you're engineering and maintaining that pipeline yourself. Comprehend is narrower: two languages, redaction only. Both charge per unit of data processed.

Their Limitations
  • Requires building and maintaining an export-process-import pipeline
  • Comprehend: English and Spanish only; limited to baked-in entity types, no custom entities
  • Per-character (AWS) or per-GB (Google) pricing scales unpredictably
  • Comprehend: redaction only—no pseudonymization
  • No cross-column entity consistency or coreference resolution
Agent Mask Advantage
  • Zero pipeline engineering—call a function inside Snowflake
  • 14 languages with dedicated AI models and custom entity types—no code required
  • Predictable pricing—budget accurately instead of watching costs scale with data volume
  • Deterministic pseudonymization: same person = same token everywhere
  • Coreference resolution—nicknames, titles, and partial names all map to one identity
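For readers wondering how "same person = same token everywhere" can hold without a shared lookup table, one common technique is a keyed hash. The sketch below uses HMAC-SHA256 from Python's standard library. It illustrates the general idea only: the source does not say how Agent Mask derives its tokens, and `pseudonym` and its parameters are hypothetical.

```python
import hashlib
import hmac


def pseudonym(label, value, secret_key, length=12):
    """Derive a stable token from an entity type and value using a secret key.

    The same (label, value, key) always yields the same token. Without the
    key, the token cannot be recomputed from a guessed value.
    """
    canonical = " ".join(value.split()).casefold()  # ignore case and spacing
    message = f"{label}:{canonical}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return f"{label}_{digest[:length]}"


key = b"example-key-do-not-use-in-production"  # placeholder key for the demo
print(pseudonym("PERSON", "Sarah Chen", key))
print(pseudonym("PERSON", "sarah  chen", key))  # same token as the line above
print(pseudonym("PERSON", "James Park", key))
```

The trade-off against counter-style placeholders such as PERSON_1 is that keyed tokens need no shared state between batches, but they are less readable and depend on keeping the key secret and stable.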

John Snow Labs

Healthcare NLP

If you’re a large health system, they deserve a spot on your shortlist. Enterprise pricing and enterprise contracts that may take months to close. Outside of healthcare, the picture changes fast: 26 clinical entity types with no support for financial, government, or custom PII.

Their Limitations
  • Healthcare-only—26 clinical entity types, no credit cards, SSNs, or custom entities
  • Limited language support
  • Only two masking modes
  • Name consistency is string-splitting, not contextual—splits “John Smith” into parts and reuses them, but can’t tell that “Smith” refers to John vs. Jane when both appear in the text
  • No coreference resolution for locations, organizations, dates, or phone numbers
  • $82.88/credit + per-character processing + Snowflake infrastructure costs
Agent Mask Advantage
  • General-purpose: healthcare, financial, enterprise, and government PII in one tool
  • 14 languages with dedicated AI models
  • Eight operators—mask, hash, encrypt, synthetic data, pseudonymize, and more—configured per entity type
  • Context-aware coreference—uses AI to resolve ambiguous mentions (“Smith” in a paragraph about John Smith) across people, places, organizations, dates, and phone numbers
  • Predictable pricing via Snowflake Marketplace

LLM APIs

GPT-4 · Claude · Gemini

You'd build a pipeline to send text to an LLM API, parse whatever it returns, and hope it's consistent. Run the same prompt twice, get different results. LLMs hallucinate PII that isn't there and miss PII that is. Your compliance team will love explaining non-deterministic redaction to auditors.

Their Limitations
  • Requires building a pipeline to send data to external LLM APIs
  • Non-deterministic—different results each run
  • Hallucinates entities that don't exist
  • Returns prose, not structured positions
  • Per-token costs at $2–$75 per million tokens
  • No audit trail or reproducibility
Agent Mask Advantage
  • Zero pipeline engineering—runs inside Snowflake
  • Deterministic—same input, same output, every time
  • AI detection + checksum validation—catches what LLMs miss, rejects what they hallucinate
  • Returns exact character positions for each entity
  • Predictable Snowflake Marketplace pricing—no per-token metering
  • Full audit trail for compliance
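Checksum validation is the easiest of these claims to illustrate. Payment card numbers carry a Luhn check digit, so a detector can reject digit strings that only look like cards. The sketch below shows the standard Luhn algorithm in Python. Which checksums Agent Mask actually applies is not stated in the source, and `luhn_valid` is a hypothetical helper name.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum.

    Separators such as spaces and hyphens are ignored. Only lengths of
    13 to 19 digits, the usual range for payment cards, are accepted.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A check like this cuts false positives on long digit strings such as order numbers. It cannot confirm that a card exists, only that the number is well-formed.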

Snowflake AI_REDACT

Cortex Built-in Function

Snowflake's built-in option—went GA December 2025. Their docs say it "works best with well-formed English text." 4K token limit on input and output combined, US/UK/CA entities only, no entity positions returned. Convenient for a quick demo, but the gaps show fast in production.

Their Limitations
  • English-optimized only—Snowflake's docs say it "works best with well-formed English text"
  • 4K token limit on input and output combined; 1K token output cap
  • US/UK/CA entities only—no EU, APAC, LATAM, or medical identifiers
  • No pseudonymization or entity positions—just redacted text with [LABELS]
  • No cross-column consistency: same name in column A ≠ column B
Agent Mask Advantage
  • 14 languages with dedicated AI models
  • No token limits—process documents of any length
  • Built-in healthcare, EU, APAC, and LATAM entities—plus custom entity types for anything unique to your data
  • Deterministic pseudonymization with exact entity positions and confidence scores
  • Cross-column coreference resolution—same person gets the same pseudonym everywhere

Regex & Rule-Based

In-house Keyword Lists · Custom Scripts

A greedy pattern matches half your dataset. A tight one misses everything spelled slightly differently. You're maintaining hundreds of rules across ICD codes, NPI formats, and regional ID numbers—and one bad commit can redact entire columns of legitimate data or silently miss real PII for months.

Their Limitations
  • Brittle—one character off and the pattern breaks or over-matches catastrophically
  • Maintaining hundreds of patterns across formats, codes, and regional IDs
  • No semantic understanding—can't distinguish a name from a product or a place
  • Can't detect names, addresses, or context-dependent PII at all
  • Every new edge case, locale, or format = another rule to write and test
Agent Mask Advantage
  • AI detection with checksum validation catches edge cases regex misses
  • No pattern maintenance—AI handles formats, codes, and regional IDs automatically
  • AI models understand semantic context—distinguishes names from products from places
  • Detects names, addresses, and context-dependent PII that regex can never match
  • Custom entity types—describe what to find, no regex required

Data Privacy Vaults

Skyflow · Protegrity

Enterprise platforms for tokenizing structured data—credit card columns, SSN fields, known PII in fixed schemas. Skyflow and Protegrity have added unstructured text capabilities recently, but their core product is vault infrastructure and field-level encryption. Expect months of integration work, enterprise sales cycles, and $100K-$200K+/year—for a platform designed around a different problem than yours.

Their Limitations
  • Built for structured data governance—unstructured text detection is a recent add-on, not the core product
  • Protegrity: external function calls route data out of Snowflake for processing
  • Skyflow: ~$195K/year enterprise contracts; Protegrity: custom enterprise pricing
  • Months of integration, enterprise sales cycles, and professional services
  • You’re buying a data governance platform when you need a text de-identification tool
Agent Mask Advantage
  • Purpose-built for unstructured text de-identification at warehouse scale
  • Runs inside Snowflake—no vault infrastructure, no data egress
  • A fraction of Skyflow’s ~$195K/year enterprise contracts
  • Deploy in minutes, not quarters—no enterprise sales cycles or professional services
  • Focused tool, not a platform—custom entity types, coreference resolution, and cross-column consistency included

Stop Paying the Data Tax

You already know your data is filled with PII. You’re already paying the data tax. Get it de-identified this afternoon. Not next quarter. Not after a six-month integration. Today.

Setup in under 30 minutes · No infrastructure to deploy · Free proof of concept

You're already on Snowflake—that's the hard part done.

1. Install

Get Agent Mask from the Marketplace. Grant access to your schemas.

2. Point

Pass your text columns through the function.

3. Get Clean Data

Receive de-identified output with PII replaced. Your original stays intact.

No $200K enterprise contracts. No per-character API fees. Start with a free proof of concept on your actual data.