32dots HEIDELBERG AI
Session 22

Privacy and sensitive data

hard
  • Build a pre-LLM redactor that catches 5+ PII categories deterministically
  • Keep an audit log that is itself privacy-safe (hashes, not raw data)
  • Say out loud which tasks should not use AI at all

Scientific environments often handle sensitive material. Privacy is load-bearing infrastructure, not a disclaimer at the end of a README — and the cheapest control is a deterministic redaction step that runs before any model sees the data.

  • If your audit log stores the raw PII you were trying to redact, have you improved or worsened the situation?
  • Name one task where AI should not be used at all, even with redaction.

Privacy-aware design means limiting what data enters the system, what is stored, who can see outputs, and when AI should not be used at all. This is a system-design issue, not just a legal note. The best privacy control is a node that runs *before* the LLM call.

Layers: (1) don't collect what you don't need; (2) redact what you do collect; (3) log with hashes, not raw strings; (4) mark untouchable data and refuse to ingest it.
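Layer (3) can be made concrete in a few lines. A minimal sketch, assuming Python in the Code node; the field names are illustrative, not a fixed schema:

```python
import hashlib

def audit_entry(category: str, matched_text: str) -> dict:
    """Privacy-safe audit record: store a digest of the matched PII,
    never the raw string. Note: for low-entropy values (DOBs, short IDs)
    a plain hash can be brute-forced, so in practice prefer a keyed hash
    (HMAC with a secret) -- omitted here to keep the sketch minimal."""
    digest = hashlib.sha256(matched_text.encode("utf-8")).hexdigest()
    return {"category": category, "sha256": digest, "length": len(matched_text)}

entry = audit_entry("email", "dr.schmidt@uniklinik.de")
```

The log now records *that* an email was redacted and *which* value it was (via the digest, for dedup and tamper-evidence), without the log itself becoming a PII store.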

scenario

Your AI pipeline receives these 5 inputs. Which are safe to process, which need PII scrubbing, which should be rejected entirely?

"Summarise: Olaparib showed HR 0.30 in BRCA-mutated patients (SOLO-1, Lancet 2018)."
"Patient Anna Müller, DOB 1974-03-12, reports fatigue and nausea post-olaparib."
"From: dr.schmidt@uniklinik.de — Please analyse this patient's genomic report (attached PDF)."
"Cohort median age 58, IQR 49–66, 63% female, ECOG 0–1."
"Trial participant #P-4821, case #CR-2024-0093, enrolled 2024-01-15."

A student assistant may process scheduling data safely but must never ingest identifiable health data (name + DOB + diagnosis) without explicit consent, redaction, and justification.

▸ Use the instructor's finished build before you build yours: feel what "done" looks like, then recreate it

Not loading? https://dify.32dots.de/chat/ZjRgciCl3OgI9dXi

n8n task: Open in n8n → 🔑 student@cos.32dots.de · cos2026

Build a pre-processor: Trigger → Code node (regex + heuristic redaction of names, emails, phone numbers, DOB patterns, MRN patterns, 9-digit IDs) → IF (if high-sensitivity patterns remain, reject with "manual review required") → Chat Model → Output. Log each redaction to a Postgres audit table.
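A minimal sketch of that Code-node redaction step, assuming Python. The patterns below are illustrative starting points, not production-grade; name detection (which needs an allow-list or NER heuristic rather than regex) is omitted:

```python
import re

# Illustrative patterns for the categories named above; tune to your data.
# Order matters: DOB must run before PHONE, or date digits match as phones.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DOB":   re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{7,}\d"),
    "MRN":   re.compile(r"\b(?:MRN|CR|P)-?\d{4,}\b", re.IGNORECASE),
    "ID9":   re.compile(r"\b\d{9}\b"),
}

def redact(text: str):
    """Replace each match with a [CATEGORY] placeholder; return text + hits."""
    hits = []
    for category, pattern in PATTERNS.items():
        def _sub(match, c=category):
            hits.append(c)
            return f"[{c}]"
        text = pattern.sub(_sub, text)
    return text, hits

clean, hits = redact("Patient DOB 1974-03-12, contact dr.schmidt@uniklinik.de")
```

Because this is deterministic, you can unit-test it: the same input always produces the same placeholders and the same hit list, which is exactly what the IF node and the audit log downstream depend on.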

Nodes needed
  • Webhook / Form Trigger
  • Code node (Python/JS regex for PII)
  • IF (sensitivity threshold)
  • Chat Model (only reached on clean text)
  • Postgres (audit log with hashes, not raw data)
  • Output / notification

The important layer happens *before* the LLM ever sees the data, and it involves deterministic redaction + audit logging — pure plumbing, hard requirements, no judgment. n8n gives you a clean pre-processing step you can unit-test. A Dify workflow could do redaction with a Code node, but the audit trail and cross-system enforcement belong in the plumbing layer.

  • **LLM-based redaction** — asking a model to "remove any PII" is not a control; it's a wish. Use regex + allow-lists for categories you can enumerate.
  • **Raw data in the audit log** — audit logs get read, shared, backed up. Store hashes, not plaintext.
  • **No refuse path** — if your system's only answer is "scrubbed and continued", some inputs will leak. There must be inputs you reject outright.
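The refuse path from the last bullet can be a small gate that runs after redaction. A sketch with made-up high-sensitivity signals; which patterns trigger rejection in your pipeline is a policy decision, not something to copy from here:

```python
import re

# Illustrative high-sensitivity signals: if any survive redaction,
# reject the input instead of forwarding it to the chat model.
HIGH_SENSITIVITY = [
    re.compile(r"\battach(?:ed|ment)\b", re.IGNORECASE),  # files we cannot inspect
    re.compile(r"\bgenomic report\b", re.IGNORECASE),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                 # a date the redactor missed
]

def gate(redacted_text: str) -> str:
    """IF-node logic: PASS clean text to the model, REJECT everything else."""
    if any(p.search(redacted_text) for p in HIGH_SENSITIVITY):
        return "REJECT: manual review required"
    return "PASS"
```

Note the asymmetry: the redactor edits, the gate only decides. Keeping them separate means a redaction bug fails closed (the gate still rejects) rather than leaking.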

Why is privacy protection a design constraint rather than an afterthought?

Build an identifier-redaction pre-processor that handles at least 5 PII categories and maintains a tamper-evident audit log.

Deliverable

Redaction-aware workflow with a test suite of 10 strings (5 clean, 5 with planted PII) and 100% redaction rate on the dirty ones.
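A sketch of what that test suite can look like, with a deliberately tiny stand-in redactor (two categories only) so the harness runs standalone; the test strings are invented examples, and you would swap in your full Code-node redactor:

```python
import re

# Stand-in redactor covering two categories; replace with the real one.
PII = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DOB":   re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def redact(text: str) -> str:
    for tag, pattern in PII.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

CLEAN = [
    "Cohort median age 58, IQR 49-66, 63% female.",
    "Olaparib showed HR 0.30 in SOLO-1.",
    "Median PFS not reached at 41 months.",
    "ECOG 0-1 in 78% of participants.",
    "Summarise the trial design in two sentences.",
]
DIRTY = [  # (input, planted PII that must not survive)
    ("Contact dr.schmidt@uniklinik.de for the report.", "dr.schmidt@uniklinik.de"),
    ("Patient DOB 1974-03-12 reports fatigue.", "1974-03-12"),
    ("Email anna.mueller@example.org about scheduling.", "anna.mueller@example.org"),
    ("Born 1988-11-02, enrolled last week.", "1988-11-02"),
    ("Reply to p.weber@klinik.example with DOB 1969-07-30.", "p.weber@klinik.example"),
]

for s in CLEAN:
    assert redact(s) == s, f"false positive on clean input: {s}"
for s, pii in DIRTY:
    assert pii not in redact(s), f"leaked PII: {pii}"
```

The clean half matters as much as the dirty half: a redactor that mangles aggregate statistics like "IQR 49-66" will quietly corrupt the science it was supposed to protect.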

For data in your own research, which category would you rather refuse to ingest than try to redact — and why is refusing the safer design?