Stop Hoarding Data
Every system you run is dumping unstructured text into your warehouse — support tickets, call transcripts, intake forms, survey responses, chat logs. And then you added AI. Now AI agents, assistants, and LLM pipelines are generating text faster than humans ever did. Every record loaded with PII that’s one query away from exposure. And every regulation says that’s your problem now.
Detect and de-identify it in a single function call, entirely inside Snowflake. No data egress. No third-party risk.
The Data Tax
You’re paying to store data you can’t safely use.
- A customer shares their credit card number in a live chat to process a refund.
- A nurse types a patient’s name into a support ticket to report a software bug.
- An account holder reads their SSN to a rep to verify their identity.
- An employee signs their name in an "anonymous" survey to ensure a follow-up.
Now there’s a credit card number in your chat logs. A patient identity in your help desk. A Social Security number in your call transcripts. A personal identity in your "anonymous" dataset.
That’s the data tax. You can’t safely feed it to your LLMs. Your analysts need six approvals to query it. Your ML team won’t touch it for training. It’s either a compliance liability or wasted insights—pick one.
That data has value—if you could just separate what’s sensitive from what’s useful.
Regulatory Risk
Unmasked PII = audit findings, fines, breach risk
Locked Analytics
Data teams can’t use what they can’t safely access
Blocked AI/ML
Models shouldn’t be trained on data riddled with PII
Risky Data Sprawl
PII copies spreading across dev, test, and staging
Reclaim the Value
Scan text fields and documents for hidden PII and mask it automatically—all inside Snowflake.
HIPAA-Compliant Analytics
Clinical notes, medical transcripts, discharge summaries—healthcare data is unstructured and sensitive. Agent Mask detects PII in freeform text with the precision HIPAA demands.
- De-identify clinical text for research and analytics
- Share with research partners without expanding BAA scope
- Train ML models on the language, not the PII
Share data with internal teams and external partners for research, analytics, and care coordination—without compromising patient privacy. Train AI on real clinical notes and transcripts, safely.
PCI-DSS and Privacy Compliance
Loan applications, transaction notes, customer communications—financial data lives in documents and conversations. Agent Mask finds sensitive data wherever it hides.
- Clean transaction data for fraud analytics
- Enable BI teams to query without compliance risk
- Provision safe datasets for dev and QA environments
Run fraud models on transaction notes that were previously off-limits. Enable data-driven decisions while maintaining the regulatory compliance your business depends on.
Safe Data Sharing at Scale
Employee feedback, user research, customer surveys—valuable data locked behind privacy concerns. Agent Mask makes it safe to share across teams.
- De-identify employee surveys for workforce analytics
- Clean user research before sharing with product
- Prepare customer feedback for company-wide insights
Turn restricted data into company-wide assets. Analyze employee feedback without exposing who said what.
FOIA and Public Records Compliance
Court filings, body cam transcripts, investigative reports—government records require redaction before release. Agent Mask automates what used to take hours of manual review.
- Accelerate FOIA response turnaround
- Enable public records search without exposure
- Prepare documents for inter-agency sharing
Prepare public records without manual review of every document. Meet disclosure deadlines without compromising privacy.
Comprehensive Detection
Detecting SSNs and credit cards is the easy part — every tool does that. The hard part is everything else: ambiguous contexts where Austin is a person, not a city. Drug names buried in clinical prose. Sensitive data unique to your industry that no generic model knows to look for.
Agent Mask understands context, resolves name variants to a single identity, and lets you define custom categories in plain English.
One engine for healthcare, finance, government, and enterprise data across 14 languages.
Native to Snowflake
Your data never leaves your environment. No file transfers, no API calls, no additional infrastructure to manage, no third-party exposure. Your data stays yours.
Document Redaction
Submit PDFs and scanned documents. Get back extracted text with PII de-identified, plus visually redacted files with PII masked in both the text layer and the rendered image — so no one can copy-paste or extract their way around it.
Multi-Language
Detect PII across 14 languages with dedicated models for each. Your EU, APAC, and LATAM data gets the same coverage — no extra tools, no extra vendors.
Flexible De-Identification
Eight operators, configured per entity type. Every response includes a full entity mapping for audit trails and authorized re-identification.
Your Business. Their Blind Spot.
Other tools ship with fixed lists and patterns. Your most sensitive data falls through the cracks. Describe what you're looking for in plain English—Agent Mask figures out what matches.
Semantic Inference, Not Pattern Matching
- mrn: "medical record numbers (MRN)"
- insurance: "health insurance: plan names, group numbers"
- rx_med: "prescription drug names: Zoloft, Prozac, Ambien, metformin"
- dosage: "medication dosages: 50mg BID, 10mg IV push, 500mg TID"
- mental_dx: "psychiatric diagnoses: schizophrenia, OCD, anorexia, ADHD"
- substance: "substance abuse: cocaine, heroin, methamphetamine, alcohol dependence"
- terminal_dx: "terminal diagnoses: ALS, stage IV cancer, end-stage renal"
- genetic: "genetic markers and test results: BRCA2, HER2, Lynch syndrome"
- orientation: "sexual orientation: gay, lesbian, queer, LGBTQ+"
- immigration: "immigration status: visa type, undocumented, asylum"
Industry Starter Kits
Pre-built. Ready to go.
Load a preset and start detecting industry-specific data immediately: diagnoses, medications, account numbers, employee IDs, and more. Mix with your own definitions for complete coverage.
Format Enforcement
Flexible detection. Strict matching.
Layer pattern rules on top of semantic detection to kill false positives. Enforce org-specific formats like MRNs, account numbers, and case IDs: the model detects, you decide what’s real.
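As a rough illustration of the layering idea, here is a minimal Python sketch. It is not Agent Mask's implementation or API: the `Candidate` structure, the `enforce_formats` helper, and both format rules (an MRN as `MRN-` plus seven digits, a case ID as two capitals plus six digits) are hypothetical, chosen only to show how a strict pattern can veto a loose detection.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A span proposed by some upstream detector (hypothetical structure)."""
    entity_type: str
    start: int
    end: int


# Hypothetical org-specific formats, invented for this example.
FORMAT_RULES = {
    "mrn": re.compile(r"MRN-\d{7}"),
    "case_id": re.compile(r"[A-Z]{2}\d{6}"),
}


def enforce_formats(text, candidates, rules=FORMAT_RULES):
    """Keep a candidate only if its text fully matches the rule for its type.

    Entity types without a rule pass through unchanged.
    """
    kept = []
    for c in candidates:
        rule = rules.get(c.entity_type)
        if rule is None or rule.fullmatch(text[c.start:c.end]):
            kept.append(c)
    return kept


def span_of(text, needle):
    """Return (start, end) of the first occurrence of needle in text."""
    start = text.index(needle)
    return start, start + len(needle)


text = "Patient MRN-0012345 called about order 7734512."
candidates = [
    Candidate("mrn", *span_of(text, "MRN-0012345")),  # real MRN
    Candidate("mrn", *span_of(text, "7734512")),      # detector false positive
]
print(enforce_formats(text, candidates))
```

Only the first candidate survives: the order number looks numeric enough to fool a permissive detector, but it fails the strict MRN format.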
Fine-Grained Control
Tune each type independently.
Set different sensitivity levels for different data types. Aggressive detection for medication names, strict matching for structured IDs, each with its own de-identification method, without one affecting the other.
Same Entity. Same Mask.
Other tools give the same person three different placeholders — and your data stops making sense. Contextual matching and cross-field consistency keep your data analytically useful.
Reference Resolution
Every variant, one mask.
Contextual AI matches name variants, abbreviations, and acronyms that rules alone would miss, so “Sarah Elizabeth Chen”, “Chen”, and “Sarah” all collapse to a single placeholder. Deterministic normalization does the same for structured data: “(555) 123-4567” and “555.123.4567”, or “January 15, 2024” and “01/15/2024”. Your de-identified data reads like real data, not a bag of disconnected placeholders.
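The deterministic half of that is easy to picture. The Python sketch below is an illustration only, not Agent Mask's code: the helper names and the list of accepted date formats are assumptions, and it reads slash dates month-first (US style). It covers structured values only; resolving name variants needs context and is not attempted here.

```python
import re
from datetime import datetime

# Assumed set of input formats for this example; slash dates are read month-first.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def normalize_phone(raw: str) -> str:
    """Drop everything except digits so punctuation differences disappear."""
    return re.sub(r"\D", "", raw)


def normalize_date(raw: str) -> str:
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


print(normalize_phone("(555) 123-4567"), normalize_phone("555.123.4567"))
print(normalize_date("January 15, 2024"), normalize_date("01/15/2024"))
```

Once both spellings reduce to the same canonical string, they can share one placeholder.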
Corpus Consistency
Mix formats. One identity map.
Send text columns, PDFs, images, and DOCX files together and Agent Mask connects the dots across all of them. The same name always gets the same replacement, everywhere it appears, with no manual alignment needed.
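Conceptually, an identity map is a lookup shared across every source. The Python sketch below is a simplified illustration, not Agent Mask's internals: the `IdentityMap` class and the `[TYPE_n]` placeholder format are hypothetical, and it assumes each mention has already been resolved to a canonical value upstream.

```python
from collections import defaultdict


class IdentityMap:
    """Assign one stable placeholder per (entity type, canonical value)."""

    def __init__(self):
        self._placeholders = {}
        self._counters = defaultdict(int)

    def placeholder(self, entity_type: str, canonical: str) -> str:
        """Return the existing placeholder for this identity or mint a new one."""
        key = (entity_type, canonical.casefold())
        if key not in self._placeholders:
            self._counters[entity_type] += 1
            self._placeholders[key] = f"[{entity_type}_{self._counters[entity_type]}]"
        return self._placeholders[key]

    def mapping(self):
        """Return placeholder -> (type, canonical value), e.g. for an audit trail."""
        return {ph: key for key, ph in self._placeholders.items()}


ids = IdentityMap()
from_ticket = ids.placeholder("PERSON", "Sarah Elizabeth Chen")      # text column
from_pdf = ids.placeholder("PERSON", "sarah elizabeth chen")         # PDF extract
someone_else = ids.placeholder("PERSON", "John Smith")
print(from_ticket, from_pdf, someone_else)
```

The same person gets `[PERSON_1]` whether the mention came from a ticket or a PDF, and the stored mapping is what makes an audit trail or authorized re-identification possible.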
vs. Simple Redaction
- Every mention = different placeholder
- Your data loses all referential meaning
- Useless for analytics or ML
vs. Rule-Based & Name-Part Matching
- Exact strings or split name parts—no context
- Can’t disambiguate “Smith” when John Smith and Jane Smith both appear
- Names only—no locations, orgs, dates, or phone numbers
vs. Manual Review
- Doesn't scale past dozens of records
- Human reviewers miss cross-column links
- Can't link name variants to the same person
Built for Production
Personal Identifiers
Financial
Healthcare
Digital & Location
Organizations & Groups
Your Custom Entities
Define domain-specific entity types with natural language, let the model do the rest. Ship with industry starter kits or build your own.
HIPAA Ready
Healthcare data protection. BAA support and PHI detection.
GDPR Compliant
EU data types, right to erasure, data minimization.
CCPA Ready
California consumer data protection and disclosure.
PCI-DSS Aligned
Credit card detection and masking for payments.
SOC 2 Ready
Built with SOC 2 controls for enterprise security.
Zero Trust Architecture
Agent Mask operates on a zero-trust model. We never see your data, never store your data, never have access to your data. The application runs in your Snowflake environment with the permissions you grant—nothing more.
Batch-First. Snowflake-Native. Expanding.
Call Agent Mask from dbt models, scheduled tasks, or batch queries — it only touches the fields you hand it. No scanning. No crawling. Today that means Snowflake. Azure, SageMaker, and self-hosted are on the roadmap. Reach out if that’s what you’re waiting for.
What You Give Up With Every Alternative
Cloud APIs need a pipeline to ship your data out. LLMs hallucinate. Regex misses context. Enterprise platforms weren't built for free text. Pick your poison—or don't.
Cloud APIs
AWS Comprehend · Google Cloud DLP
To use these, you become the pipeline engineer: export from Snowflake, route through API Gateway or Lambda, process on AWS or Google, parse the response, write back. Google DLP is capable (200+ detectors, pseudonymization options), but you're engineering and maintaining that pipeline yourself. Comprehend is narrower: two languages, redaction only. Both charge per unit of data processed.
- Requires building and maintaining an export-process-import pipeline
- Comprehend: English and Spanish only; limited to baked-in entity types, no custom entities
- Per-character (AWS) or per-GB (Google) pricing scales unpredictably
- Comprehend: redaction only—no pseudonymization
- No cross-column entity consistency or coreference resolution
With Agent Mask
- Zero pipeline engineering: call a function inside Snowflake
- 14 languages with dedicated AI models and custom entity types—no code required
- Predictable pricing—budget accurately instead of watching costs scale with data volume
- Deterministic pseudonymization: same person = same token everywhere
- Coreference resolution—nicknames, titles, and partial names all map to one identity
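For readers who want to see what "same person = same token" can look like, here is a minimal Python sketch using a keyed hash (HMAC-SHA256). This is a generic technique shown for illustration, not a description of how Agent Mask generates tokens; the `pseudonymize` helper, the token format, and the demo key are all assumptions.

```python
import hashlib
import hmac


def pseudonymize(value: str, entity_type: str, key: bytes) -> str:
    """Derive a stable token from a value with a secret key.

    The same (type, value, key) always yields the same token; without the key
    the token cannot be recomputed from a guessed value.
    """
    message = f"{entity_type}:{value.strip().casefold()}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{entity_type}_{digest[:12]}"


DEMO_KEY = b"demo-key-do-not-use-in-production"  # placeholder key for the example

token_a = pseudonymize("Sarah Chen", "PERSON", DEMO_KEY)
token_b = pseudonymize("  sarah chen ", "PERSON", DEMO_KEY)
token_c = pseudonymize("Sarah Chen", "PERSON", b"a-different-key")
print(token_a == token_b, token_a == token_c)
```

Tokens agree across rows, columns, and runs as long as the key is the same, and rotating the key produces an unrelated set of tokens.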
John Snow Labs
Healthcare NLP
If you’re a large health system, they deserve a spot on your shortlist. Expect enterprise pricing and enterprise contracts that may take months to close. Outside of healthcare, the picture changes fast: 26 clinical entity types with no support for financial, government, or custom PII.
- Healthcare-only—26 clinical entity types, no credit cards, SSNs, or custom entities
- Limited language support
- Only two masking modes
- Name consistency is string-splitting, not contextual—splits “John Smith” into parts and reuses them, but can’t tell that “Smith” refers to John vs. Jane when both appear in the text
- No coreference resolution for locations, organizations, dates, or phone numbers
- $82.88/credit + per-character processing + Snowflake infrastructure costs
With Agent Mask
- General-purpose: healthcare, financial, enterprise, and government PII in one tool
- 14 languages with dedicated AI models
- Eight operators—mask, hash, encrypt, synthetic data, pseudonymize, and more—configured per entity type
- Context-aware coreference—uses AI to resolve ambiguous mentions (“Smith” in a paragraph about John Smith) across people, places, organizations, dates, and phone numbers
- Predictable pricing via Snowflake Marketplace
LLM APIs
GPT-4 · Claude · Gemini
You'd build a pipeline to send text to an LLM API, parse whatever it returns, and hope it's consistent. Run the same prompt twice, get different results. LLMs hallucinate PII that isn't there and miss PII that is. Your compliance team will love explaining non-deterministic redaction to auditors.
- Requires building a pipeline to send data to external LLM APIs
- Non-deterministic—different results each run
- Hallucinates entities that don't exist
- Returns prose, not structured positions
- Per-token costs at $2–$75 per million tokens
- No audit trail or reproducibility
With Agent Mask
- Zero pipeline engineering: runs inside Snowflake
- Deterministic—same input, same output, every time
- AI detection + checksum validation—catches what LLMs miss, rejects what they hallucinate
- Returns exact character positions for each entity
- Predictable Snowflake Marketplace pricing—no per-token metering
- Full audit trail for compliance
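Checksum validation is the kind of check that rejects a hallucinated card number. The Python sketch below shows the standard Luhn check used by payment card numbers; it is a generic example, and whether Agent Mask applies exactly this check is an assumption. The `luhn_valid` name and the minimum length of 12 digits are choices made for the illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if len(digits) < 12:  # assumed lower bound; too short to be a card number
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A 16-digit string that fails the checksum is almost certainly not a real card number, so a detector can drop it instead of redacting an order ID.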
Snowflake AI_REDACT
Cortex Built-in Function
Snowflake's built-in option, GA since December 2025. Their docs say it "works best with well-formed English text." 4K token limit on input and output combined, US/UK/CA entities only, no entity positions returned. Convenient for a quick demo, but the gaps show fast in production.
- English-optimized only—Snowflake's docs say it "works best with well-formed English text"
- 4K token limit on input and output combined; 1K token output cap
- US/UK/CA entities only—no EU, APAC, LATAM, or medical identifiers
- No pseudonymization or entity positions—just redacted text with [LABELS]
- No cross-column consistency: same name in column A ≠ column B
With Agent Mask
- 14 languages with dedicated AI models
- No token limits—process documents of any length
- Built-in healthcare, EU, APAC, and LATAM entities—plus custom entity types for anything unique to your data
- Deterministic pseudonymization with exact entity positions and confidence scores
- Cross-column coreference resolution—same person gets the same pseudonym everywhere
Regex & Rule-Based
In-house Keyword Lists · Custom Scripts
A greedy pattern matches half your dataset. A tight one misses everything spelled slightly differently. You're maintaining hundreds of rules across ICD codes, NPI formats, and regional ID numbers, and one bad commit can redact entire columns of legitimate data or silently miss real PII for months.
- Brittle—one character off and the pattern breaks or over-matches catastrophically
- Maintaining hundreds of patterns across formats, codes, and regional IDs
- No semantic understanding—can't distinguish a name from a product or a place
- Can't detect names, addresses, or context-dependent PII at all
- Every new edge case, locale, or format = another rule to write and test
With Agent Mask
- AI detection with checksum validation catches edge cases regex misses
- No pattern maintenance—AI handles formats, codes, and regional IDs automatically
- AI models understand semantic context—distinguishes names from products from places
- Detects names, addresses, and context-dependent PII that regex can never match
- Custom entity types—describe what to find, no regex required
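The tight-versus-greedy trade-off is easy to reproduce. In the Python sketch below, both patterns and all three sample strings are made up for illustration: a strict SSN pattern misses a space-separated SSN, while a looser one also flags a nine-digit order number.

```python
import re

# Tight: only the dashed form.
TIGHT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Loose: allows any (or no) single separator between the digit groups.
LOOSE = re.compile(r"\d{3}\D?\d{2}\D?\d{4}")

samples = {
    "ssn_dashed": "SSN 123-45-6789 on file",
    "ssn_spaced": "SSN 123 45 6789 on file",
    "order_number": "Order 987654321 shipped",
}

tight_hits = {name: bool(TIGHT.search(s)) for name, s in samples.items()}
loose_hits = {name: bool(LOOSE.search(s)) for name, s in samples.items()}
print(tight_hits)
print(loose_hits)
```

The tight pattern finds only the dashed SSN. The loose one finds both SSNs and the order number, and nothing in the pattern itself can tell them apart.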
Data Privacy Vaults
Skyflow · Protegrity
Enterprise platforms for tokenizing structured data: credit card columns, SSN fields, known PII in fixed schemas. Skyflow and Protegrity have added unstructured text capabilities recently, but their core product is vault infrastructure and field-level encryption. Expect months of integration work, enterprise sales cycles, and $100K-$200K+/year for a platform designed around a different problem than yours.
- Built for structured data governance—unstructured text detection is a recent add-on, not the core product
- Protegrity: external function calls route data out of Snowflake for processing
- Skyflow: ~$195K/year enterprise contracts; Protegrity: custom enterprise pricing
- Months of integration, enterprise sales cycles, and professional services
- You’re buying a data governance platform when you need a text de-identification tool
With Agent Mask
- Purpose-built for unstructured text de-identification at warehouse scale
- Runs inside Snowflake—no vault infrastructure, no data egress
- A fraction of Skyflow’s ~$195K/year enterprise contracts
- Deploy in minutes, not quarters—no enterprise sales cycles or professional services
- Focused tool, not a platform—custom entity types, coreference resolution, and cross-column consistency included
Stop Paying the Data Tax
You already know your data is filled with PII. You’re already paying the data tax. Get it de-identified this afternoon. Not next quarter. Not after a six-month integration. Today.
You're already on Snowflake—that's the hard part done.
Install
Get Agent Mask from the Marketplace. Grant access to your schemas.
Point
Pass your text columns through the function.
Get Clean Data
Receive de-identified output with PII replaced. Your original stays intact.
No $200K enterprise contracts. No per-character API fees. Start with a free proof of concept on your actual data.