Privacy and sensitive data
- Build a pre-LLM redactor that catches 5+ PII categories deterministically
- Keep an audit log that is itself privacy-safe (hashes, not raw data)
- Say out loud which tasks should not use AI at all
Scientific environments often handle sensitive material. Privacy is load-bearing infrastructure, not a disclaimer at the end of a README — and the cheapest control is a deterministic redaction step that runs before any model sees the data.
- If your audit log stores the raw PII you were trying to redact, have you improved or worsened the situation?
- Name one task where AI should not be used at all, even with redaction.
Privacy-aware design means limiting what data enters the system, what is stored, who can see outputs, and when AI should not be used at all. This is a system-design issue, not just a legal note. The best privacy control is a node that runs *before* the LLM call.
Layers: (1) don't collect what you don't need; (2) redact what you do collect; (3) log with hashes, not raw strings; (4) mark untouchable data and refuse to ingest it.
Your AI pipeline receives these 5 inputs. Which are safe to process, which need PII scrubbing, and which should be rejected entirely?
A student assistant may process scheduling data safely but must never ingest identifiable health data (name + DOB + diagnosis) without explicit consent, redaction, and justification.
▸ Use the instructor's finished build before you build yours, to feel what "done" looks like; then recreate it
Not loading? https://dify.32dots.de/chat/ZjRgciCl3OgI9dXi
- UNESCO guidance ↗
Build a pre-processor: Trigger → Code node (regex + heuristic redaction of names, emails, phone numbers, DOB patterns, MRN patterns, 9-digit IDs) → IF (if high-sensitivity patterns remain, reject with "manual review required") → Chat Model → Output. Log each redaction to a Postgres audit table.
Webhook / Form Trigger → Code node (Python/JS regex for PII) → IF (sensitivity threshold) → Chat Model (only reached on clean text) → Postgres (audit log with hashes, not raw data) → Output / notification
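A minimal sketch of what the Code node's redaction step could look like in Python. The pattern set and the category labels (EMAIL, MRN, ID9, DOB, NAME, PHONE) are illustrative assumptions, not a complete PII catalogue, and the regexes are heuristics you would tune to your own data.

```python
import re

# Illustrative deterministic patterns. Order matters: specific categories
# (EMAIL, MRN, ID9, DOB) run before the broad PHONE heuristic.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "ID9":   re.compile(r"\b\d{9}\b"),
    "DOB":   re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"),
    "NAME":  re.compile(r"\b(?:Dr|Mr|Ms|Mrs|Prof)\.?\s+[A-Z][a-z]+\b"),
    "PHONE": re.compile(r"\b\d{1,4}(?:[\s.-]\d{2,4}){1,3}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace every match with its category tag; return the text plus the hit list."""
    hits: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        hits.extend([label] * count)
    return text, hits
```

Running this as the first node means the Chat Model only ever sees category tags like `[EMAIL]` or `[MRN]`, and the hit list gives you exactly what to write into the audit table.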
The important layer happens *before* the LLM ever sees the data, and it involves deterministic redaction + audit logging — pure plumbing, hard requirements, no judgment. n8n gives you a clean pre-processing step you can unit-test. A Dify workflow could do redaction with a Code node, but the audit trail and cross-system enforcement belong in the plumbing layer.
- **LLM-based redaction** — asking a model to "remove any PII" is not a control; it's a wish. Use regex + allow-lists for categories you can enumerate.
- **Raw data in the audit log** — audit logs get read, shared, backed up. Store hashes, not plaintext.
- **No refusal path** — if the system's only possible outcome is "scrubbed and continued", some inputs will leak. There must be inputs you reject outright (see the sketch below).
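For that refusal path, a sketch of the IF-node logic, reusing the `redact()` helper from the sketch above. The high-sensitivity keyword list and the `RejectedInput` exception are invented for illustration; the point is that some inputs never reach the Chat Model at all.

```python
import re

# Illustrative "still too sensitive even after redaction" patterns.
HIGH_SENSITIVITY = [
    re.compile(r"\bdiagnos(?:is|es|ed)\b", re.IGNORECASE),
    re.compile(r"\b(?:genome|genotype|sequencing)\b", re.IGNORECASE),
]

class RejectedInput(Exception):
    """Raised instead of forwarding the text to the model."""

def gate(text: str) -> str:
    clean, _hits = redact(text)  # redact() from the sketch above
    if any(p.search(clean) for p in HIGH_SENSITIVITY):
        # The refusal path: this input never reaches the Chat Model.
        raise RejectedInput("manual review required")
    return clean
```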
Why is privacy protection a design constraint rather than an afterthought?
Build an identifier-redaction pre-processor that handles at least 5 PII categories and maintains a tamper-evident audit log.
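One way to make the audit log both privacy-safe and tamper-evident: store a keyed hash of each redacted value and chain every entry to the previous one. A sketch assuming SHA-256 and a plain dict record; the field names are placeholders, not a prescribed Postgres schema.

```python
import hashlib
import hmac
import json
import time

# Illustrative secret; in a real deployment this comes from a secrets store.
# An unkeyed hash of low-entropy PII (phone numbers, DOBs) can be brute-forced
# back to the original value, which is why the value hash is keyed.
AUDIT_KEY = b"replace-me"

def audit_entry(prev_hash: str, category: str, raw_value: str) -> dict:
    """Record WHAT was redacted (category + keyed hash), never the raw value."""
    record = {
        "ts": time.time(),
        "category": category,
        "value_hmac": hmac.new(AUDIT_KEY, raw_value.encode(), hashlib.sha256).hexdigest(),
        "prev": prev_hash,
    }
    # Chaining each entry to the previous one makes silent edits detectable:
    # changing any historical row breaks every hash that follows it.
    record["entry_hash"] = hashlib.sha256(
        (prev_hash + json.dumps(record, sort_keys=True)).encode()
    ).hexdigest()
    return record
```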
Redaction-aware workflow with a test suite of 10 strings (5 clean, 5 with planted PII) and 100% redaction rate on the dirty ones.
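A possible shape for that test suite, assuming the `redact()` sketch above and pytest-style assertions; the ten strings are invented placeholders, not real data.

```python
CLEAN = [
    "Seminar room booking for Thursday",
    "Please reorder pipette tips",
    "The incubator log looks normal",
    "Draft agenda for the lab meeting",
    "Upload the anonymised summary statistics",
]
DIRTY = [
    "Contact me at jane.doe@example.org",
    "Patient DOB 12/03/1994, follow-up pending",
    "MRN: 00482917 needs review",
    "Call +49 30 1234 5678 tomorrow",
    "Staff ID 123456789 requested access",
]

def test_clean_strings_pass_untouched():
    for s in CLEAN:
        redacted, hits = redact(s)
        assert hits == [] and redacted == s           # no false positives

def test_planted_pii_is_always_caught():
    for s in DIRTY:
        redacted, hits = redact(s)
        assert hits, f"missed PII in: {s}"            # 100% hit rate on planted PII
        assert not any(p.search(redacted) for p in PII_PATTERNS.values())
```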
For data in your own research, which category would you rather refuse to ingest than try to redact — and why is refusing the safer design?