Evaluating outputs
- Write a 4-criterion rubric BEFORE you look at any output
- Implement a generator + reviewer loop that rejects below threshold
- Log scores across runs so you can compare prompts or models empirically
Students need to judge quality rather than assume it. "It looks good" is not a quality signal — it's the absence of one. The generator+reviewer pattern is the smallest eval that actually works.
- If your rubric has one criterion ("is it good?"), what have you actually measured?
- Why write the rubric before seeing any outputs, not after?
Evaluation means defining what a good answer looks like before trusting the system. Criteria may include factual support, completeness, clarity, structure, safety, and usefulness for the target audience. A "generator + reviewer" loop is the simplest eval pattern.
A good rubric has orthogonal criteria — accuracy, completeness, support-by-source, tone — and each is scored independently. Summing them gives a noisy signal; looking at per-criterion failure modes gives a useful one.
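The point about per-criterion failure modes can be made concrete with a minimal Python sketch. The criterion names come from the rubric above; the function shape is illustrative, not a fixed API:

```python
# Each criterion is judged independently; the list of failed criteria is the
# useful signal, while the summed score is only a noisy summary.

CRITERIA = ["accuracy", "completeness", "supported_by_source", "tone"]

def score_output(judgments: dict) -> dict:
    """Turn per-criterion pass/fail judgments into a total plus a failure list."""
    failures = [c for c in CRITERIA if not judgments.get(c, False)]
    return {
        "score": len(CRITERIA) - len(failures),  # noisy summary signal
        "max_score": len(CRITERIA),
        "failed_criteria": failures,             # actionable failure modes
    }

result = score_output({
    "accuracy": True, "completeness": False,
    "supported_by_source": True, "tone": True,
})
# result == {"score": 3, "max_score": 4, "failed_criteria": ["completeness"]}
```

Two outputs can share the same `score` of 3 yet fail on entirely different criteria; only the `failed_criteria` breakdown tells you which prompt change to try next.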
An AI pipeline produces these 5 outputs. Before reading the rubric: which ones pass evaluation and which need rejection or revision?
A methods-summary is scored against the source on {accuracy, missing details, jargon level, supported-by-text}. If score < 3/4, the reviewer sends it back with feedback.
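The accept/reject decision in that loop can be sketched as follows, assuming the reviewer LLM is prompted to return JSON of the shape `{"scores": {...}, "feedback": "..."}` with one 0/1 entry per criterion. The field names are illustrative, not a fixed contract:

```python
import json

THRESHOLD = 3  # accept at 3 of 4 criteria, per the rubric above

def review_decision(reviewer_reply: str):
    """Parse the reviewer's JSON reply; return (accepted, feedback)."""
    data = json.loads(reviewer_reply)
    total = sum(data["scores"].values())
    return total >= THRESHOLD, data["feedback"]

accepted, feedback = review_decision(
    '{"scores": {"accuracy": 1, "missing_details": 1, "jargon_level": 0,'
    ' "supported_by_text": 1}, "feedback": "Spell out HPLC on first use."}'
)
# accepted is True (3/4); when False, feedback goes back to the generator
```

Forcing the reviewer into this JSON shape is what makes the If-Else step mechanical: the threshold check is arithmetic, not another judgment call.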
▸ Use the instructor's finished build before you build your own: feel what "done" looks like, then recreate it
Not loading? https://dify.32dots.de/chat/5nOF2thuDE317MQ6
- Anthropic, "Building effective agents"
Build a **Workflow**: Start → **LLM** (generator: summarize methods) → **LLM** (reviewer: rubric-score the summary, return JSON `{score, feedback}`) → **If-Else** (if score ≥ 3 return, else loop with feedback via **Iteration**) → End.

### n8n task

Same loop in n8n: Trigger → Chat Model (generator) → Chat Model (reviewer) → IF node on score → loop back via Execute Workflow or output.

### Blocks needed

- Dify: Workflow with 2× LLM, If-Else, (optional) Iteration
- n8n: 2× Basic LLM Chain, IF, Set (for feedback), loop-back via sub-workflow
- Model: Claude Sonnet 4.6 for the reviewer (more careful), Groq Llama for the generator
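The control flow of that workflow can be sketched in plain Python. The `generate` and `review` stubs stand in for the two LLM nodes (their interfaces are assumptions, not a real API); note the retry cap, which the Dify and n8n builds need as well:

```python
MAX_RETRIES = 3  # cap retries so a stubborn failure cannot loop forever

def generate(task: str, feedback: str = "") -> str:
    # Stub: a real node would call the generator model with task + feedback.
    return f"summary of {task}" + (" (revised)" if feedback else "")

def review(draft: str) -> tuple:
    # Stub: a real node would rubric-score the draft against the source.
    return (4, "") if "(revised)" in draft else (2, "missing key details")

def eval_loop(task: str) -> dict:
    feedback, best = "", {"draft": None, "score": -1}
    for attempt in range(1, MAX_RETRIES + 1):
        draft = generate(task, feedback)
        score, feedback = review(draft)
        if score > best["score"]:
            best = {"draft": draft, "score": score}
        if score >= 3:                     # threshold from the rubric
            return {"status": "accepted", "attempts": attempt, **best}
    return {"status": "escalate_to_human", "attempts": MAX_RETRIES, **best}

result = eval_loop("methods section")
# result["status"] == "accepted" on the second attempt
```

Keeping the best attempt alongside the status means that even the escalation path hands a human something concrete to review, not just a failure flag.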
If the eval loop is internal to one app and you want to ship the loop as one thing, Dify. If the eval writes scores to a sheet, compares across runs, or sends failures to Slack for human review, n8n. Most real pipelines have both: the inner loop in Dify, the outer dashboard in n8n.
- **Same model for generator + reviewer** — if both are the same model, the reviewer has the same blind spots as the generator. Use a stronger model for review.
- **Scalar-only scores** — `score: 7/10` tells you nothing about *what* went wrong. Require per-criterion scores + one-line feedback.
- **Infinite loop** — always cap the number of retries. After N failures, escalate to a human or emit the best attempt with a warning.
Why is a plausible answer not enough in scientific work?
Build an answer-reviewer for a prior workflow (e.g. the abstract comparator from card 18). Define a 4-criterion rubric before you look at any output.
A generator+reviewer loop with a rubric, applied to 5 runs, with scores logged.
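The logging half of the deliverable can be sketched as one JSON line per run plus a per-criterion aggregate; the file path and field names here are illustrative:

```python
import json
import os
import statistics
import tempfile

def log_run(path: str, model: str, scores: dict) -> None:
    """Append one run's per-criterion scores as a JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps({"model": model, "scores": scores}) + "\n")

def per_criterion_means(path: str) -> dict:
    """Average each criterion across all logged runs."""
    runs = [json.loads(line) for line in open(path)]
    criteria = runs[0]["scores"].keys()
    return {c: statistics.mean(r["scores"][c] for r in runs) for c in criteria}

log_path = os.path.join(tempfile.mkdtemp(), "eval_runs.jsonl")
for s in ({"accuracy": 1, "tone": 1}, {"accuracy": 0, "tone": 1}):
    log_run(log_path, "model-a", s)
print(per_criterion_means(log_path))  # {'accuracy': 0.5, 'tone': 1}
```

With two models logged to the same file, the same aggregation (filtered by the `model` field) turns "which prompt is better?" into a number instead of an impression.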
For the assistant you'd most like to trust, what would the 4 rubric items be — and which one are you least sure how to measure?