Session 1
Speed + the big free budget: fast agent loops & large models
What makes Cerebras worth using
Cerebras runs on wafer-scale silicon that delivers ~2,600 tok/s — the fastest public inference for large models anywhere. Combined with 1 M free tokens/day, it is the right choice for workflows that fire many short requests.
Log into n8n.32dots.de with the email and password you received when you signed up. Will be live on session day
- 1 Understand the speed edge vs Groq: Cerebras is fastest on large models (70B–405B) and has a far larger free daily budget. Groq wins for audio (Whisper STT) and broad batch jobs; OpenRouter wins on model variety.
- 2 Pick a model for your task: llama-3.3-70b (best everyday balance), qwen-3-235b or llama3.1-405b (maximum reasoning at still-fast speed), deepseek-r1-distill (chain-of-thought), llama-4-scout (extended-context model — but read the 8K cap warning in the next lesson before using it on the free tier).
- 3 Run a tight agent loop — fire 10–20 classification or extraction calls in sequence (e.g. classify each abstract in a literature batch). At 2,600 tok/s the round-trip is dominated by network latency, not model time.
- 4 Monitor your quota in the Cerebras dashboard. 1 M tokens resets daily; a typical research loop (10 calls × 500 tokens) consumes ~5,000 tokens — well within budget.
You have run a multi-call loop (5+ requests) and confirmed usage in the Cerebras dashboard is well under the 1 M daily limit.
Batch-abstract classifier
Build a short script that classifies a list of paper abstracts using llama-3.3-70b and prints throughput.
Your task
Given a list of 10–20 short paper abstracts, classify each as 'methods paper', 'review', or 'results paper' in a loop and print total elapsed time.
- 1 Create a list of 10 real or dummy abstracts (one sentence each is fine for testing).
- 2 Write a loop that sends each abstract to Cerebras (base_url="https://api.cerebras.ai/v1", model="llama-3.3-70b") with a one-line system prompt: "Classify this abstract as: methods paper, review, or results paper. Reply with one label only."
- 3 Time the full run with Python's time module and print tokens-per-second. Compare to what you would expect from a slower provider.
Deliverable
A script that prints one classification per abstract plus total elapsed time and estimated throughput.