Run open models locally from the CLI · private · OpenAI-compatible API on port 11434 · macOS / Windows / Linux
| Model size | RAM needed | Good for |
|---|---|---|
| 7–8 B | 5–6 GB | Most laptops — daily chat, drafts, code snippets |
| 13–14 B | 8–10 GB | 16 GB RAM Mac/PC — better reasoning, longer context |
| 30–34 B | 18–22 GB | 32 GB RAM — strong coding, analysis tasks |
| 70 B | 40–45 GB | GPU workstation / server — near-frontier quality, slow on CPU |
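As a rough illustration of how to read the table, the sketch below picks the largest size tier whose upper RAM figure fits a given amount of free memory. The thresholds are copied from the table. The function name and tier labels are hypothetical, and real memory use also depends on quantization and context length.

```python
from typing import Optional

# (tier label, upper RAM estimate in GB), copied from the table above.
TIERS = [
    ("7-8 B", 6),
    ("13-14 B", 10),
    ("30-34 B", 22),
    ("70 B", 45),
]


def largest_tier_that_fits(free_ram_gb: float) -> Optional[str]:
    """Return the largest tier whose RAM estimate fits, or None if none does."""
    best = None
    for label, needed_gb in TIERS:  # tiers are listed smallest first
        if needed_gb <= free_ram_gb:
            best = label
    return best


print(largest_tier_that_fits(12))  # prints 13-14 B
```

Treat the result as a starting point only: the operating system and other applications also need memory, so leaving some headroom is sensible.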
Ollama downloads pre-quantized models and GPU-accelerates them for you: you pick the tag, it handles the rest. Start with a 7–8 B model such as llama3, qwen2.5, or gemma. Browse the registry at ollama.com, or pull any GGUF model. If replies take longer than 30 s, try a smaller model. Downloading and running models locally is free, and Ollama itself is open source. Ollama also offers optional paid cloud tiers (Pro $20/mo, Max $100/mo) that gate access to larger hosted models; local use never costs anything.
macOS: download the app from ollama.com/download (requires macOS 14 Sonoma or later). Open it once and the background service starts.
Windows: download the installer from ollama.com/download and run it. The service runs in the background.
Linux: run curl -fsSL https://ollama.com/install.sh | sh. It installs the CLI and registers a systemd service.
ollama CLI reference
| Command | What it does |
|---|---|
| ollama run llama3 | Run and chat with a model (downloads it first if needed) |
| ollama pull llama3 | Download a model without starting a chat |
| ollama list | List the models you have downloaded |
| ollama ps | List models currently loaded in memory |
| ollama stop llama3 | Unload a running model and free its RAM |
| ollama rm llama3 | Delete a downloaded model from disk |
| ollama serve | Start the Ollama server manually (env vars listed via --help) |
| ollama create mymodel -f Modelfile | Create a customized model (here named mymodel, any name you choose) from a Modelfile |
Always-on by default: after install, Ollama runs as a background service that auto-starts on login — the API at localhost:11434 is up without you running ollama serve by hand. Use ollama serve only when running it manually (e.g. on a headless server where the service is not registered).
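To confirm from a script that the background service is up, one option is to query the native API. This is a minimal sketch, assuming the server's /api/tags route (which lists downloaded models, much like ollama list) and the default port. The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

BASE_URL = "http://localhost:11434"  # default address used throughout this guide


def model_names(payload: dict) -> List[str]:
    """Pull model names out of an /api/tags-style response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = BASE_URL,
                      timeout: float = 3.0) -> Optional[List[str]]:
    """Return downloaded model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        # Nothing is listening, or the connection failed or timed out.
        return None
```

If list_local_models() returns None, the service is probably not running, and starting the desktop app or ollama serve is the next thing to try.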
Native Ollama API
http://localhost:11434/api/chat
OpenAI-compatible endpoint
http://localhost:11434/v1/
The /v1/ path is an OpenAI-compatible endpoint. Any tool or SDK that accepts a custom base_url can use it — including the Python openai package, LangChain, and Hermes. The api_key field is required by the SDK but ignored by Ollama, so set it to any non-empty string such as ollama.
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "user", "content": "Explain the central dogma of molecular biology in two sentences."}
],
"stream": false
}'
Replace the model name with whatever you have pulled. With "stream": false you get one JSON object back; the message.content field holds the text.
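The same request can be sent from Python with only the standard library. This is a sketch under the assumptions already used above (default port, a pulled model named llama3). The function names are hypothetical.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the native chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response_body: dict) -> str:
    """Extract the assistant's text from the response object."""
    return response_body["message"]["content"]


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the reply (needs a running Ollama server)."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return reply_text(json.load(resp))
```

With the server running, chat("llama3", "Explain the central dogma of molecular biology in two sentences.") should return the reply as a plain string. The generous timeout is a guess that allows for the model loading on first use.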
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Summarise RNA-seq analysis in 3 bullet points."}
    ]
)
print(response.choices[0].message.content)
By default Ollama binds to localhost only. To let other machines on your network reach it, set the OLLAMA_HOST environment variable to 0.0.0.0:11434 and restart Ollama. Other machines then call http://<your-ip>:11434/v1/.
macOS
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# then restart the Ollama app
Linux (systemd)
systemctl edit ollama.service
# under [Service] add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
systemctl daemon-reload
systemctl restart ollama
Windows: open Settings → search "environment variables" → add OLLAMA_HOST = 0.0.0.0:11434 for your account, then restart Ollama.
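Before pointing a client at another machine, it can help to confirm that the port is actually reachable. This is a small sketch using only the standard library. The helper names and the example IP address are illustrative and not part of Ollama.

```python
import socket


def remote_base_url(host: str, port: int = 11434) -> str:
    """OpenAI-compatible base URL for an Ollama server on another machine."""
    return f"http://{host}:{port}/v1/"


def port_is_open(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (192.168.1.20 is a placeholder LAN address):
# if port_is_open("192.168.1.20"):
#     print(remote_base_url("192.168.1.20"))
```

A False result often means the server is still bound to localhost or a firewall is blocking the port, though this check cannot tell those cases apart.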
| Variable | What it controls |
|---|---|
| OLLAMA_HOST | Which address/port the server binds to (e.g. 0.0.0.0:11434 for LAN access) |
| OLLAMA_MODELS | Directory where downloaded models are stored (default ~/.ollama/models) |
| OLLAMA_KEEP_ALIVE | How long a model stays loaded in RAM: a duration like 10m or 24h, a number of seconds, 0 to unload at once, or a negative number to keep it loaded |
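When starting the server by hand, these variables can be set for just that one process instead of system-wide. This is a sketch assuming the ollama binary is on PATH. The function names and example values are illustrative.

```python
import os
import subprocess
from typing import Dict, Optional


def ollama_env(host: Optional[str] = None,
               models_dir: Optional[str] = None,
               keep_alive: Optional[str] = None) -> Dict[str, str]:
    """Copy the current environment and override only the variables given."""
    env = dict(os.environ)
    overrides = {
        "OLLAMA_HOST": host,
        "OLLAMA_MODELS": models_dir,
        "OLLAMA_KEEP_ALIVE": keep_alive,
    }
    env.update({k: v for k, v in overrides.items() if v is not None})
    return env


def start_server(env: Dict[str, str]) -> subprocess.Popen:
    """Launch `ollama serve` with the given environment (requires ollama)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Example: LAN-visible server that keeps models loaded for 24 hours.
# server = start_server(ollama_env(host="0.0.0.0:11434", keep_alive="24h"))
```

If the background service is already running, stop it first, since two servers cannot bind the same port.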
Add two lines to ~/.hermes/config.yaml, then add a placeholder key to ~/.hermes/.env.
~/.hermes/config.yaml
provider: custom
base_url: "http://localhost:11434/v1/"
For a model on another machine, replace localhost with its LAN IP.
~/.hermes/.env
OPENAI_API_KEY=ollama
Ollama ignores the key value. Hermes still requires the variable to be present.
See the Hermes course, lesson hermes-00 for full install instructions. Once these two lines are set, start Hermes with hermes --tui — it will use your local Ollama model for all conversations, with no data leaving your machine.
| Situation | Use |
|---|---|
| Sensitive or unpublished data (patient records, pre-submission results) | Local Ollama |
| Offline work (plane, field, no-internet lab) | Local Ollama |
| A private API endpoint for your scripts or pipeline | Local — OpenAI-compatible /v1/ |
| Shared lab endpoint on a spare/headless machine | Local — OLLAMA_HOST=0.0.0.0 |
| Complex reasoning, very long documents, frontier capability | Cloud (Claude / GPT / Gemini) |
| Hardware is a constraint (old laptop, no GPU) | Cloud or a smaller local model |