Ollama — Quick-reference

Run open models locally from the CLI · private · OpenAI-compatible API on port 11434 · macOS / Windows / Linux

Model size & RAM guide

Model size	RAM needed	Good for
7–8 B	5–6 GB	Most laptops — daily chat, drafts, code snippets
13–14 B	8–10 GB	16 GB RAM Mac/PC — better reasoning, longer context
30–34 B	18–22 GB	32 GB RAM — strong coding, analysis tasks
70 B	40–45 GB	GPU workstation / server — near-frontier quality, slow on CPU

Ollama downloads pre-quantized model weights and uses your GPU automatically when one is available; you pick the tag, it handles the rest. Start with a 7–8 B model such as llama3, qwen2.5, or gemma. Browse the registry at ollama.com, or pull any GGUF model. If replies take longer than 30 s, try a smaller model. Downloading and running models locally is free and open source. Ollama also offers optional paid cloud tiers (Pro $20/mo, Max $100/mo) that gate access to larger hosted models; local use never costs anything.
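The RAM figures in the table are roughly what a back-of-the-envelope estimate gives if you assume about 4 bits per weight plus some working overhead. The sketch below is only that estimate: the 4-bit default, the 25% overhead factor, and both function names are assumptions made for illustration, not values or APIs taken from Ollama.

```python
# Rough RAM estimate for a quantized model.
# ASSUMPTIONS (not from Ollama): ~4 bits per weight and a 25% overhead
# for context cache and runtime buffers. Real usage varies by model and context.

def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4.0,
                    overhead: float = 1.25) -> float:
    """Approximate RAM in GB: weight bytes (params * bits / 8) times overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb * overhead, 1)


def largest_model_that_fits(ram_gb: float, sizes_billion=(8, 14, 34, 70)):
    """Return the largest size (billions of parameters) whose estimate fits, or None."""
    fitting = [s for s in sizes_billion if estimate_ram_gb(s) <= ram_gb]
    return max(fitting) if fitting else None


for size in (8, 14, 34, 70):
    print(f"{size:>2} B -> about {estimate_ram_gb(size)} GB")
```

For the upper size in each row (8, 14, 34, 70 B) the estimate lands inside the table's range. Leave headroom for the operating system and other apps when comparing against your total RAM.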

Install

macOS

Download the app

Download from ollama.com/download (requires macOS 14 Sonoma or later). Open it once and the background service starts.

Windows

Download the installer

Download the Windows installer from ollama.com/download and run it. The service runs in the background.

Linux

One-line install script

Run curl -fsSL https://ollama.com/install.sh | sh. It installs the CLI and registers a systemd service.

`ollama` CLI reference

Command	What it does
`ollama run llama3`	Run and chat with a model (downloads it first if needed)
`ollama pull llama3`	Download a model without starting a chat
`ollama list`	List the models you have downloaded
`ollama ps`	List models currently loaded in memory
`ollama stop llama3`	Unload a running model and free its RAM
`ollama rm llama3`	Delete a downloaded model from disk
`ollama serve`	Start the Ollama server manually (env vars listed via `--help`)
`ollama create mymodel -f Modelfile`	Create a customized model named `mymodel` from a Modelfile

Always-on by default: after install, Ollama runs as a background service that auto-starts on login — the API at localhost:11434 is up without you running ollama serve by hand. Use ollama serve only when running it manually (e.g. on a headless server where the service is not registered).
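To confirm from a script that the background service is actually answering, something like the following may help. It is a minimal sketch using only the standard library; the helper names are made up for this example, and it assumes the default address localhost:11434 and the native /api/tags endpoint (the API counterpart of ollama list).

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:11434"  # default address used throughout this guide


def is_ollama_up(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, ...
        return False


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = BASE_URL, timeout: float = 2.0) -> list[str]:
    """Fetch /api/tags and return the names of the downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if is_ollama_up():
    print("Ollama is up. Models:", list_local_models())
else:
    print("No server at", BASE_URL, "- open the app or run `ollama serve`")
```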

Local API — base URL & endpoint

Native Ollama API

http://localhost:11434/api/chat

OpenAI-compatible endpoint

http://localhost:11434/v1/

The /v1/ path is an OpenAI-compatible endpoint. Any tool or SDK that accepts a custom base_url can use it — including the Python openai package, LangChain, and Hermes. The api_key field is required by the SDK but ignored by Ollama, so set it to any non-empty string such as ollama.

curl test

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Explain the central dogma of molecular biology in two sentences."}
  ],
  "stream": false
}'

Replace the model name with whatever you have pulled. With "stream": false you get one JSON object back; the message.content field holds the text.
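The same call can be made from Python without extra packages. The sketch below mirrors the curl request and also handles "stream": true, in which case the native API sends newline-delimited JSON chunks that each carry a piece of message.content, with "done": true on the last one. The function names are illustrative, not part of any Ollama library.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def join_stream(lines) -> str:
    """Concatenate message.content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat(model: str, prompt: str, stream: bool = False, url: str = CHAT_URL) -> str:
    """POST to /api/chat and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt, stream)).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        if stream:
            return join_stream(resp)  # iterating the response yields one line at a time
        return json.load(resp)["message"]["content"]
```

With the service running and the model pulled, chat("llama3", "Explain the central dogma of molecular biology in two sentences.") returns the reply as a string. Pass stream=True to exercise the chunked path.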

Python (openai SDK)

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"          # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Summarise RNA-seq analysis in 3 bullet points."}
    ]
)
print(response.choices[0].message.content)
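Chat completion requests are stateless: the model sees only what is in messages, so a follow-up question needs the earlier turns sent again. A minimal sketch of keeping that history is shown here. The helpers ask and make_client are hypothetical names for this example, and make_client simply wraps the same OpenAI(...) construction used in the snippet above.

```python
def make_client(base_url: str = "http://localhost:11434/v1/", api_key: str = "ollama"):
    """Build an OpenAI SDK client pointed at Ollama (needs: pip install openai)."""
    from openai import OpenAI  # imported lazily so the helper below works with any client
    return OpenAI(base_url=base_url, api_key=api_key)


def ask(client, history: list, user_text: str, model: str = "llama3") -> str:
    """Append a user turn, send the full history, record and return the reply."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

A typical session would start with history = [{"role": "system", "content": "You are a concise assistant."}], then call ask(make_client(), history, "...") once per turn. Long histories use more context and RAM, so trimming old turns is a reasonable next step.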

Serve on your LAN (headless / shared endpoint)

By default Ollama binds to localhost only. To let other machines on your network reach it, set the OLLAMA_HOST environment variable to 0.0.0.0:11434 and restart Ollama. Other machines then call http://<your-ip>:11434/v1/.

macOS

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# then restart the Ollama app

Linux (systemd)

sudo systemctl edit ollama.service
# under [Service] add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama

Windows: open Settings → search "environment variables" → add OLLAMA_HOST = 0.0.0.0:11434 for your account, then restart Ollama.
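On the client machines, a common slip is to paste the bind address 0.0.0.0 into base_url instead of the server's real IP. The helper below is a small sketch that builds the /v1/ base URL from whatever form of host you have and rejects 0.0.0.0. The function name and the example address 192.168.1.20 are placeholders, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 11434


def lan_base_url(host: str, port: int = DEFAULT_PORT) -> str:
    """Turn 'host', 'host:port' or 'http://host:port' into an OpenAI-style base URL."""
    if "://" not in host:
        host = "http://" + host
    parts = urlsplit(host)
    if parts.hostname in (None, "", "0.0.0.0"):
        # 0.0.0.0 means "all interfaces" on the server; clients cannot connect to it.
        raise ValueError("use the server's real LAN IP or hostname, not 0.0.0.0")
    return f"{parts.scheme}://{parts.hostname}:{parts.port or port}/v1/"


print(lan_base_url("192.168.1.20"))  # placeholder address; substitute your server's IP
```

The returned string can be passed as base_url to the openai client or written into a tool's config.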

Useful environment variables

Variable	What it controls
`OLLAMA_HOST`	Which address/port the server binds to (e.g. `0.0.0.0:11434` for LAN access)
`OLLAMA_MODELS`	Directory where downloaded models are stored (default `~/.ollama/models`; the Linux systemd service uses `/usr/share/ollama/.ollama/models`)
`OLLAMA_KEEP_ALIVE`	How long a model stays loaded in RAM after its last request (default `5m`): a duration like `10m`/`24h`, a number of seconds, `0` to unload at once, or a negative number to keep it loaded indefinitely
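The keep-alive setting can also be overridden per request: the native API accepts a keep_alive field with the same kinds of values. The sketch below builds prompt-less /api/generate bodies that only load or unload a model. The helper names are illustrative, and "stream": false is added so the reply is a single JSON object.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def keep_alive_request(model: str, keep_alive) -> dict:
    """Body for a prompt-less request that only changes how long the model stays loaded."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(body: dict, url: str = GENERATE_URL) -> dict:
    """POST the body to the native API and return the parsed JSON reply."""
    req = urllib.request.Request(url, data=json.dumps(body).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


preload = keep_alive_request("llama3", -1)     # load and keep in RAM until told otherwise
unload = keep_alive_request("llama3", 0)       # unload immediately
half_hour = keep_alive_request("llama3", "30m")  # keep for 30 minutes after this request
print(preload, unload, half_hour, sep="\n")
```

Calling send(unload) against a running server frees the model's RAM, much like ollama stop llama3.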

Connect Hermes to Ollama

Edit ~/.hermes/config.yaml — two lines. Then add a placeholder key to ~/.hermes/.env.

~/.hermes/config.yaml

provider: custom
base_url: "http://localhost:11434/v1/"

For a model on another machine, replace localhost with its LAN IP.

~/.hermes/.env

OPENAI_API_KEY=ollama

Ollama ignores the key value. Hermes still requires the variable to be present.

See the Hermes course, lesson hermes-00 for full install instructions. Once these two lines are set, start Hermes with hermes --tui — it will use your local Ollama model for all conversations, with no data leaving your machine.

Local vs cloud — quick decision

Situation	Use
Sensitive or unpublished data (patient records, pre-submission results)	Local Ollama
Offline work (plane, field, no-internet lab)	Local Ollama
A private API endpoint for your scripts or pipeline	Local — OpenAI-compatible `/v1/`
Shared lab endpoint on a spare/headless machine	Local — `OLLAMA_HOST=0.0.0.0`
Complex reasoning, very long documents, frontier capability	Cloud (Claude / GPT / Gemini)
Hardware is a constraint (old laptop, no GPU)	Cloud or a smaller local model