32dots HEIDELBERG AI
Session 22

Privacy and sensitive data

hard
  • Build a pre-LLM redactor that catches 5+ PII categories deterministically
  • Keep an audit log that is itself privacy-safe (hashes, not raw data)
  • Say out loud which tasks should not use AI at all

Scientific environments often handle sensitive material. Privacy is load-bearing infrastructure, not a disclaimer at the end of a README — and the cheapest control is a deterministic redaction step that runs before any model sees the data.

  • If your audit log stores the raw PII you were trying to redact, have you improved or worsened the situation?
  • Name one task where AI should not be used at all, even with redaction.

Privacy-aware design means limiting what data enters the system, what is stored, who can see outputs, and when AI should not be used at all. This is a system-design issue, not just a legal note. The best privacy control is a node that runs *before* the LLM call.

Layers: (1) don't collect what you don't need; (2) redact what you do collect; (3) log with hashes, not raw strings; (4) mark untouchable data and refuse to ingest it.
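Layer (3) can be made concrete in a few lines. A minimal sketch, assuming Python in the Code node; the field names are illustrative, not a fixed schema:

```python
import hashlib

def audit_entry(category: str, matched_text: str) -> dict:
    """Privacy-safe audit record: store a digest of the matched PII,
    never the raw string. Note: for low-entropy values (DOBs, short IDs)
    a plain hash can be brute-forced, so in practice prefer a keyed hash
    (HMAC with a secret) -- omitted here to keep the sketch minimal."""
    digest = hashlib.sha256(matched_text.encode("utf-8")).hexdigest()
    return {"category": category, "sha256": digest, "length": len(matched_text)}

entry = audit_entry("email", "dr.schmidt@uniklinik.de")
```

The log now records *that* an email was redacted and *which* value it was (via the digest, for dedup and tamper-evidence), without the log itself becoming a PII store.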

scenario

Your AI pipeline receives these 5 inputs. Which are safe to process, which need PII scrubbing, which should be rejected entirely?

"Summarise: Olaparib showed HR 0.30 in BRCA-mutated patients (SOLO-1, Lancet 2018)."
"Patient Anna Müller, DOB 1974-03-12, reports fatigue and nausea post-olaparib."
"From: dr.schmidt@uniklinik.de — Please analyse this patient's genomic report (attached PDF)."
"Cohort median age 58, IQR 49–66, 63% female, ECOG 0–1."
"Trial participant #P-4821, case #CR-2024-0093, enrolled 2024-01-15."

A student assistant may process scheduling data safely but must never ingest identifiable health data (name + DOB + diagnosis) without explicit consent, redaction, and justification.

▸ Use the instructor's finished build before you build yours: feel what "done" looks like, then recreate it

Not loading? https://dify.32dots.de/chat/ZjRgciCl3OgI9dXi

n8n task: Open in n8n → 🔑 student@cos.32dots.de · cos2026

Build a pre-processor: Trigger → Code node (regex + heuristic redaction of names, emails, phone numbers, DOB patterns, MRN patterns, 9-digit IDs) → IF (if high-sensitivity patterns remain, reject with "manual review required") → Chat Model → Output. Log each redaction to a Postgres audit table.
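A minimal sketch of that Code-node redaction step, assuming Python. The patterns below are illustrative starting points, not production-grade; name detection (which needs an allow-list or NER heuristic rather than regex) is omitted:

```python
import re

# Illustrative patterns for the categories named above; tune to your data.
# Order matters: DOB must run before PHONE, or date digits match as phones.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DOB":   re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{7,}\d"),
    "MRN":   re.compile(r"\b(?:MRN|CR|P)-?\d{4,}\b", re.IGNORECASE),
    "ID9":   re.compile(r"\b\d{9}\b"),
}

def redact(text: str):
    """Replace each match with a [CATEGORY] placeholder; return text + hits."""
    hits = []
    for category, pattern in PATTERNS.items():
        def _sub(match, c=category):
            hits.append(c)
            return f"[{c}]"
        text = pattern.sub(_sub, text)
    return text, hits

clean, hits = redact("Patient DOB 1974-03-12, contact dr.schmidt@uniklinik.de")
```

Because this is deterministic, you can unit-test it: the same input always produces the same placeholders and the same hit list, which is exactly what the IF node and the audit log downstream depend on.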

Nodes needed
  • Webhook / Form Trigger
  • Code node (Python/JS regex for PII)
  • IF (sensitivity threshold)
  • Chat Model (only reached on clean text)
  • Postgres (audit log with hashes, not raw data)
  • Output / notification

The important layer happens *before* the LLM ever sees the data, and it involves deterministic redaction + audit logging — pure plumbing, hard requirements, no judgment. n8n gives you a clean pre-processing step you can unit-test. A Dify workflow could do redaction with a Code node, but the audit trail and cross-system enforcement belong in the plumbing layer.

  • **LLM-based redaction** — asking a model to "remove any PII" is not a control; it's a wish. Use regex + allow-lists for categories you can enumerate.
  • **Raw data in the audit log** — audit logs get read, shared, backed up. Store hashes, not plaintext.
  • **No refuse path** — if your system's only answer is "scrubbed and continued", some inputs will leak. There must be inputs you reject outright.
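The refuse path from the last bullet can be a small gate that runs after redaction. A sketch with made-up high-sensitivity signals; which patterns trigger rejection in your pipeline is a policy decision, not something to copy from here:

```python
import re

# Illustrative high-sensitivity signals: if any survive redaction,
# reject the input instead of forwarding it to the chat model.
HIGH_SENSITIVITY = [
    re.compile(r"\battach(?:ed|ment)\b", re.IGNORECASE),  # files we cannot inspect
    re.compile(r"\bgenomic report\b", re.IGNORECASE),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                 # a date the redactor missed
]

def gate(redacted_text: str) -> str:
    """IF-node logic: PASS clean text to the model, REJECT everything else."""
    if any(p.search(redacted_text) for p in HIGH_SENSITIVITY):
        return "REJECT: manual review required"
    return "PASS"
```

Note the asymmetry: the redactor edits, the gate only decides. Keeping them separate means a redaction bug fails closed (the gate still rejects) rather than leaking.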

Why is privacy protection a design constraint rather than an afterthought?

Build an identifier-redaction pre-processor that handles at least 5 PII categories and maintains a tamper-evident audit log.

Deliverable

Redaction-aware workflow with a test suite of 10 strings (5 clean, 5 with planted PII) and 100% redaction rate on the dirty ones.
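A sketch of what that test suite can look like, with a deliberately tiny stand-in redactor (two categories only) so the harness runs standalone; the test strings are invented examples, and you would swap in your full Code-node redactor:

```python
import re

# Stand-in redactor covering two categories; replace with the real one.
PII = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DOB":   re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def redact(text: str) -> str:
    for tag, pattern in PII.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

CLEAN = [
    "Cohort median age 58, IQR 49-66, 63% female.",
    "Olaparib showed HR 0.30 in SOLO-1.",
    "Median PFS not reached at 41 months.",
    "ECOG 0-1 in 78% of participants.",
    "Summarise the trial design in two sentences.",
]
DIRTY = [  # (input, planted PII that must not survive)
    ("Contact dr.schmidt@uniklinik.de for the report.", "dr.schmidt@uniklinik.de"),
    ("Patient DOB 1974-03-12 reports fatigue.", "1974-03-12"),
    ("Email anna.mueller@example.org about scheduling.", "anna.mueller@example.org"),
    ("Born 1988-11-02, enrolled last week.", "1988-11-02"),
    ("Reply to p.weber@klinik.example with DOB 1969-07-30.", "p.weber@klinik.example"),
]

for s in CLEAN:
    assert redact(s) == s, f"false positive on clean input: {s}"
for s, pii in DIRTY:
    assert pii not in redact(s), f"leaked PII: {pii}"
```

The clean half matters as much as the dirty half: a redactor that mangles aggregate statistics like "IQR 49-66" will quietly corrupt the science it was supposed to protect.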

For data in your own research, which category would you rather refuse to ingest than try to redact — and why is refusing the safer design?