32dots HEIDELBERG AI
Session 19

Evaluating outputs

medium
  • Write a 4-criterion rubric BEFORE you look at any output
  • Implement a generator + reviewer loop that rejects below threshold
  • Log scores across runs so you can compare prompts or models empirically

Students need to judge quality rather than assume it. "It looks good" is not a quality signal — it's the absence of one. The generator+reviewer pattern is the smallest eval that actually works.

  • If your rubric has one criterion ("is it good?"), what have you actually measured?
  • Why write the rubric before seeing any outputs, not after?

Evaluation means defining what a good answer looks like before trusting the system. Criteria may include factual support, completeness, clarity, structure, safety, and usefulness for the target audience. A "generator + reviewer" loop is the simplest eval pattern.

A good rubric has orthogonal criteria — accuracy, completeness, support-by-source, tone — and each is scored independently. Summing them gives a noisy signal; looking at per-criterion failure modes gives a useful one.
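To make "scored independently" concrete, here is a minimal Python sketch of a review record. The criterion names and the 0/1 scale are illustrative assumptions, not a fixed format:

```python
# Minimal sketch of per-criterion rubric scoring.
# Criterion names and the 0/1 scale are illustrative, not prescriptive.
from dataclasses import dataclass

@dataclass
class Review:
    scores: dict[str, int]  # one independent 0/1 judgment per criterion
    feedback: str           # one-line explanation of any failure

    @property
    def total(self) -> int:
        return sum(self.scores.values())  # the noisy summary signal

    def failures(self) -> list[str]:
        # The useful signal: which criteria failed, not the sum.
        return [c for c, s in self.scores.items() if s == 0]

r = Review(
    scores={"accuracy": 1, "completeness": 0, "supported_by_source": 1, "tone": 1},
    feedback="Control-arm numbers are missing.",
)
print(r.total, r.failures())  # 3 ['completeness']
```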

scenario

An AI pipeline produces these 5 outputs. Before reading the rubric: which ones pass evaluation and which need rejection or revision?

"Olaparib significantly improves PFS in BRCA-mutated ovarian cancer (HR 0.30, 95% CI 0.22–0.41, SOLO-1 trial)."
"Studies show olaparib is one of the most effective PARP inhibitors available for ovarian cancer patients."
"Olaparib works by inhibiting PARP enzymes, trapping them on DNA breaks, which is lethal in BRCA-deficient cells."
"Some patients experience fatigue and nausea. Results may vary."
"Based on the trial data provided, the 3-year OS was 67% in the olaparib arm vs 46% in placebo (SOLO-1, n=391)."

A methods summary is scored against the source on {accuracy, missing details, jargon level, supported-by-text}, one point per criterion. If the score is below 3/4, the reviewer sends it back with feedback.
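As a sketch, the reviewer step can be a single LLM call that is forced to return per-criterion JSON. `call_llm` is a placeholder for whatever model client your stack uses, and the prompt wording is only a starting point:

```python
# Sketch of a reviewer call scoring a summary on the four criteria above.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual reviewer-model client."""
    raise NotImplementedError

REVIEWER_PROMPT = """You are a strict reviewer. Score the SUMMARY against the SOURCE.
Return only JSON, no prose:
{{"scores": {{"accuracy": 0 or 1, "missing_details": 0 or 1,
"jargon_level": 0 or 1, "supported_by_text": 0 or 1}},
"feedback": "one line"}}

SOURCE:
{source}

SUMMARY:
{summary}"""

def review(source: str, summary: str) -> tuple[int, dict, str]:
    raw = call_llm(REVIEWER_PROMPT.format(source=source, summary=summary))
    data = json.loads(raw)                # malformed JSON should count as a fail
    total = sum(data["scores"].values())  # 0..4, one point per criterion
    return total, data["scores"], data["feedback"]
```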

▸ Use the instructor's finished build before you build yours: feel what "done" looks like, then recreate it

Not loading? https://dify.32dots.de/chat/5nOF2thuDE317MQ6

Finished: Evaluating outputs
The finished app — use this as your target

Build a **Workflow**: Start → **LLM** (generator: summarize methods) → **LLM** (reviewer: rubric-score the summary, return JSON `{score, feedback}`) → **If-Else** (if score ≥ 3, return; else loop back with feedback via **Iteration**) → End.

Blocks needed
  • Dify: Workflow with 2× LLM, If-Else, (optional) Iteration
  • n8n: 2× Basic LLM Chain, IF, Set (for feedback), loop-back via sub-workflow
  • Model: Claude Sonnet 4.6 for the reviewer (more careful), Groq Llama for the generator
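Outside any builder, the same loop is a few lines of Python. This sketch reuses `review` from the sketch above, treats `generate` as a placeholder for the generator model, and caps retries, as the pitfalls below insist:

```python
# Sketch of the generator + reviewer loop with capped retries.
MAX_RETRIES = 3  # always cap: after this, escalate instead of looping forever
THRESHOLD = 3    # pass mark out of 4 rubric points

def generate(source: str, feedback: str = "") -> str:
    """Placeholder: generator-model call; feed reviewer feedback into retries."""
    raise NotImplementedError

def eval_loop(source: str) -> dict:
    feedback, best = "", {"total": -1}
    for attempt in range(1, MAX_RETRIES + 1):
        summary = generate(source, feedback)
        total, scores, feedback = review(source, summary)  # reviewer from above
        result = {"attempt": attempt, "summary": summary,
                  "total": total, "scores": scores, "feedback": feedback}
        if total >= THRESHOLD:
            return result    # pass: ship it
        if total > best["total"]:
            best = result    # remember the best failed attempt
    best["warning"] = "never passed review; escalate to a human"
    return best
```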
n8n task: Open in n8n → 🔑 student@cos.32dots.de · cos2026

Same loop in n8n: Trigger → Chat Model (generator) → Chat Model (reviewer) → IF node on score → loop back via Execute Workflow or output.

If the eval loop is internal to one app and you want to ship the loop as one thing, Dify. If the eval writes scores to a sheet, compares across runs, or sends failures to Slack for human review, n8n. Most real pipelines have both: the inner loop in Dify, the outer dashboard in n8n.

  • **Same model for generator + reviewer** — if both are the same model, the reviewer has the same blind spots as the generator. Use a stronger model for review.
  • **Scalar-only scores** — `score: 7/10` tells you nothing about *what* went wrong. Require per-criterion scores + one-line feedback.
  • **Infinite loop** — always cap the number of retries. After N failures, escalate to a human or emit the best attempt with a warning.

Why is a plausible answer not enough in scientific work?

Build an answer-reviewer for a prior workflow (e.g. the abstract comparator from card 18). Define a 4-criterion rubric before you look at any output.

Deliverable

Generator+reviewer loop with rubric, applied to 5 runs, with scores logged.
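For the logged scores, appending one CSV row per attempt is enough to compare prompts or models later. A minimal sketch, assuming the `result` dict from the loop sketch above; the file name and columns are arbitrary choices:

```python
# Minimal sketch of score logging across runs.
import csv, datetime, json, pathlib

LOG = pathlib.Path("eval_log.csv")  # arbitrary location

def log_run(run_id: str, prompt_version: str, result: dict) -> None:
    first_write = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if first_write:
            writer.writerow(["timestamp", "run_id", "prompt_version",
                             "attempt", "total", "scores", "feedback"])
        writer.writerow([datetime.datetime.now().isoformat(), run_id,
                         prompt_version, result["attempt"], result["total"],
                         json.dumps(result["scores"]), result["feedback"]])
```

After the 5 runs, compare per-criterion failure counts between prompt versions, not just mean totals.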

For the assistant you'd most like to trust, what would the 4 rubric items be — and which one are you least sure how to measure?