32dots HEIDELBERG AI
Session 19

Evaluating outputs

medium
  • Write a 4-criterion rubric BEFORE you look at any output
  • Implement a generator + reviewer loop that rejects below threshold
  • Log scores across runs so you can compare prompts or models empirically

Students need to judge quality rather than assume it. "It looks good" is not a quality signal — it's the absence of one. The generator+reviewer pattern is the smallest eval that actually works.

  • If your rubric has one criterion ("is it good?"), what have you actually measured?
  • Why write the rubric before seeing any outputs, not after?

Evaluation means defining what a good answer looks like before trusting the system. Criteria may include factual support, completeness, clarity, structure, safety, and usefulness for the target audience. A "generator + reviewer" loop is the simplest eval pattern.

A good rubric has orthogonal criteria — accuracy, completeness, support-by-source, tone — and each is scored independently. Summing them gives a noisy signal; looking at per-criterion failure modes gives a useful one.
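To make "scored independently" concrete, here is a minimal Python sketch of a review record. The criterion names and the 0/1 scale are illustrative assumptions, not a fixed format:

```python
# Minimal sketch of per-criterion rubric scoring.
# Criterion names and the 0/1 scale are illustrative, not prescriptive.
from dataclasses import dataclass

@dataclass
class Review:
    scores: dict[str, int]  # one independent 0/1 judgment per criterion
    feedback: str           # one-line explanation of any failure

    @property
    def total(self) -> int:
        return sum(self.scores.values())  # the noisy summary signal

    def failures(self) -> list[str]:
        # The useful signal: which criteria failed, not the sum.
        return [c for c, s in self.scores.items() if s == 0]

r = Review(
    scores={"accuracy": 1, "completeness": 0, "supported_by_source": 1, "tone": 1},
    feedback="Control-arm numbers are missing.",
)
print(r.total, r.failures())  # 3 ['completeness']
```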

scenario

An AI pipeline produces these 5 outputs. Before reading the rubric: which ones pass evaluation and which need rejection or revision?

"Olaparib significantly improves PFS in BRCA-mutated ovarian cancer (HR 0.30, 95% CI 0.22–0.41, SOLO-1 trial)."
"Studies show olaparib is one of the most effective PARP inhibitors available for ovarian cancer patients."
"Olaparib works by inhibiting PARP enzymes, trapping them on DNA breaks, which is lethal in BRCA-deficient cells."
"Some patients experience fatigue and nausea. Results may vary."
"Based on the trial data provided, the 3-year OS was 67% in the olaparib arm vs 46% in placebo (SOLO-1, n=391)."

A methods summary is scored against the source on {accuracy, missing details, jargon level, supported-by-text}, one point per criterion. If the score is below 3/4, the reviewer sends it back with feedback.
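As a sketch, the reviewer step can be a single LLM call that is forced to return per-criterion JSON. `call_llm` is a placeholder for whatever model client your stack uses, and the prompt wording is only a starting point:

```python
# Sketch of a reviewer call scoring a summary on the four criteria above.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual reviewer-model client."""
    raise NotImplementedError

REVIEWER_PROMPT = """You are a strict reviewer. Score the SUMMARY against the SOURCE.
Return only JSON, no prose:
{{"scores": {{"accuracy": 0 or 1, "missing_details": 0 or 1,
"jargon_level": 0 or 1, "supported_by_text": 0 or 1}},
"feedback": "one line"}}

SOURCE:
{source}

SUMMARY:
{summary}"""

def review(source: str, summary: str) -> tuple[int, dict, str]:
    raw = call_llm(REVIEWER_PROMPT.format(source=source, summary=summary))
    data = json.loads(raw)                # malformed JSON should count as a fail
    total = sum(data["scores"].values())  # 0..4, one point per criterion
    return total, data["scores"], data["feedback"]
```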

▸ Use the instructor's finished build before you build yours: feel what "done" looks like, then recreate it

Not loading? https://dify.32dots.de/chat/5nOF2thuDE317MQ6

Finished: Evaluating outputs
The finished app — use this as your target

Build a **Workflow**: Start → **LLM** (generator: summarize methods) → **LLM** (reviewer: rubric-score the summary, return JSON `{score, feedback}`) → **If-Else** (if score ≥ 3, return; else loop back with feedback via **Iteration**) → End.

Blocks needed
  • Dify: Workflow with 2× LLM, If-Else, (optional) Iteration
  • n8n: 2× Basic LLM Chain, IF, Set (for feedback), loop-back via sub-workflow
  • Model: Claude Sonnet 4.6 for the reviewer (more careful), Groq Llama for the generator
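Outside any builder, the same loop is a few lines of Python. This sketch reuses `review` from the sketch above, treats `generate` as a placeholder for the generator model, and caps retries, as the pitfalls below insist:

```python
# Sketch of the generator + reviewer loop with capped retries.
MAX_RETRIES = 3  # always cap: after this, escalate instead of looping forever
THRESHOLD = 3    # pass mark out of 4 rubric points

def generate(source: str, feedback: str = "") -> str:
    """Placeholder: generator-model call; feed reviewer feedback into retries."""
    raise NotImplementedError

def eval_loop(source: str) -> dict:
    feedback, best = "", {"total": -1}
    for attempt in range(1, MAX_RETRIES + 1):
        summary = generate(source, feedback)
        total, scores, feedback = review(source, summary)  # reviewer from above
        result = {"attempt": attempt, "summary": summary,
                  "total": total, "scores": scores, "feedback": feedback}
        if total >= THRESHOLD:
            return result    # pass: ship it
        if total > best["total"]:
            best = result    # remember the best failed attempt
    best["warning"] = "never passed review; escalate to a human"
    return best
```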
n8n task: Open in n8n → 🔑 student@cos.32dots.de · cos2026

Same loop in n8n: Trigger → Chat Model (generator) → Chat Model (reviewer) → IF node on score → loop back via Execute Workflow or output.

If the eval loop is internal to one app and you want to ship the loop as one thing, Dify. If the eval writes scores to a sheet, compares across runs, or sends failures to Slack for human review, n8n. Most real pipelines have both: the inner loop in Dify, the outer dashboard in n8n.

  • **Same model for generator + reviewer** — if both are the same model, the reviewer has the same blind spots as the generator. Use a stronger model for review.
  • **Scalar-only scores** — `score: 7/10` tells you nothing about *what* went wrong. Require per-criterion scores + one-line feedback.
  • **Infinite loop** — always cap the number of retries. After N failures, escalate to a human or emit the best attempt with a warning.

Why is a plausible answer not enough in scientific work?

Build an answer-reviewer for a prior workflow (e.g. the abstract comparator from card 18). Define a 4-criterion rubric before you look at any output.

Deliverable

Generator+reviewer loop with rubric, applied to 5 runs, with scores logged.
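For the logged scores, appending one CSV row per attempt is enough to compare prompts or models later. A minimal sketch, assuming the `result` dict from the loop sketch above; the file name and columns are arbitrary choices:

```python
# Minimal sketch of score logging across runs.
import csv, datetime, json, pathlib

LOG = pathlib.Path("eval_log.csv")  # arbitrary location

def log_run(run_id: str, prompt_version: str, result: dict) -> None:
    first_write = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if first_write:
            writer.writerow(["timestamp", "run_id", "prompt_version",
                             "attempt", "total", "scores", "feedback"])
        writer.writerow([datetime.datetime.now().isoformat(), run_id,
                         prompt_version, result["attempt"], result["total"],
                         json.dumps(result["scores"]), result["feedback"]])
```

After the 5 runs, compare per-criterion failure counts between prompt versions, not just mean totals.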

For the assistant you'd most like to trust, what would the 4 rubric items be — and which one are you least sure how to measure?