Ollama — Quick-reference

Run open models locally from the CLI · private · OpenAI-compatible API on port 11434 · macOS / Windows / Linux

Model size & RAM guide

Model size   RAM needed   Good for
7–8 B        5–6 GB       Most laptops: daily chat, drafts, code snippets
13–14 B      8–10 GB      16 GB RAM Mac/PC: better reasoning, longer context
30–34 B      18–22 GB     32 GB RAM: strong coding, analysis tasks
70 B         40–45 GB     GPU workstation / server: near-frontier quality, slow on CPU

Ollama downloads pre-quantized models and GPU-accelerates them for you: you pick the tag, it handles the rest. Start with a 7–8 B model such as llama3, qwen2.5, or gemma. Browse the registry at ollama.com, or pull any GGUF model. If replies take >30 s, try a smaller model. Downloading and running models locally is free, and Ollama itself is open source. Ollama also offers optional paid cloud tiers (Pro $20/mo, Max $100/mo) that gate access to larger hosted models; local use never costs anything.
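The RAM column can be approximated with simple arithmetic. The sketch below assumes 4-bit weights and a 1.2x overhead for the KV cache and runtime buffers; both numbers are illustrative assumptions chosen to land near the table, not values published by Ollama.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.0,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for a quantized model.

    weights  = parameters * bits_per_weight / 8 bytes
    overhead = assumed multiplier for KV cache and runtime buffers
    """
    # 1e9 parameters * (bits / 8) bytes = that many GB, so the 1e9 cancels.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


if __name__ == "__main__":
    for size in (8, 14, 34, 70):
        print(f"{size:>3} B -> ~{estimate_ram_gb(size):.1f} GB")
```

Longer context windows or higher-precision quantizations raise the real figure, so treat the result as a lower-end guide.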

Install

macOS
Download the app
Download from ollama.com/download (requires macOS 14 Sonoma or later). Open it once and the background service starts.
Windows
Download the installer
Download the Windows installer from ollama.com/download and run it. The service runs in the background.
Linux
One-line install script
Run curl -fsSL https://ollama.com/install.sh | sh. It installs the CLI and registers a systemd service.
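A quick post-install check, sketched in Python: it looks for the ollama executable on PATH and sends a GET to the server root on port 11434. Treating an HTTP 200 on the root path as "server is up" is an assumption about the server's behaviour; the function names are hypothetical.

```python
import shutil
import urllib.error
import urllib.request


def cli_installed() -> bool:
    """True if an `ollama` executable is on PATH."""
    return shutil.which("ollama") is not None


def server_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """True if an HTTP GET on the server root returns status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("ollama CLI on PATH:", cli_installed())
    print("server answering:  ", server_is_up())
```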

ollama CLI reference

Command                              What it does
ollama run llama3                    Run and chat with a model (downloads it first if needed)
ollama pull llama3                   Download a model without starting a chat
ollama list                          List the models you have downloaded
ollama ps                            List models currently loaded in memory
ollama stop llama3                   Unload a running model and free its RAM
ollama rm llama3                     Delete a downloaded model from disk
ollama serve                         Start the Ollama server manually (env vars listed via --help)
ollama create mymodel -f Modelfile   Create a customized model (here named mymodel, a placeholder) from a Modelfile
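Scripts can get the same information as ollama list over HTTP. This sketch assumes the native API exposes GET /api/tags and that the response is a JSON object with a models array whose entries carry a name field; the sample data is hand-written for illustration, not real output.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the assumed response shape (not real server output).
sample = {"models": [{"name": "llama3:latest"}, {"name": "qwen2.5:7b"}]}
print(model_names(sample))

# Against a live server: print(model_names(fetch_tags()))
```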

Always-on by default: after install, Ollama runs as a background service that auto-starts on login — the API at localhost:11434 is up without you running ollama serve by hand. Use ollama serve only when running it manually (e.g. on a headless server where the service is not registered).

Local API — base URL & endpoint

Native Ollama API

http://localhost:11434/api/chat

OpenAI-compatible endpoint

http://localhost:11434/v1/

The /v1/ path is an OpenAI-compatible endpoint. Any tool or SDK that accepts a custom base_url can use it — including the Python openai package, LangChain, and Hermes. The api_key field is required by the SDK but ignored by Ollama, so set it to any non-empty string such as ollama.

curl test

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Explain the central dogma of molecular biology in two sentences."}
  ],
  "stream": false
}'

Replace the model name with whatever you have pulled. With "stream": false you get one JSON object back; the message.content field holds the text.
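The same request from Python, using only the standard library. The helper names are hypothetical. join_stream covers the streaming case and assumes that, when "stream" is left on, the server sends one JSON object per line with a partial message.content and a final object marked "done": true.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """Encode the JSON body used in the curl test above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(body).encode("utf-8")


def extract_text(response: dict) -> str:
    """Pull the reply text out of a non-streamed /api/chat response."""
    return response["message"]["content"]


def join_stream(lines) -> str:
    """Concatenate the content of streamed chunks (one JSON object per line)."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat(model: str, prompt: str,
         url: str = "http://localhost:11434/api/chat") -> str:
    """Send one non-streamed chat request and return the reply text."""
    req = urllib.request.Request(
        url,
        data=build_chat_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_text(json.load(resp))


# With a server running and the model pulled:
# print(chat("llama3", "Explain the central dogma of molecular biology in two sentences."))
```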

Python (openai SDK)

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"          # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Summarise RNA-seq analysis in 3 bullet points."}
    ]
)
print(response.choices[0].message.content)

Serve on your LAN (headless / shared endpoint)

By default Ollama binds to localhost only. To let other machines on your network reach it, set the OLLAMA_HOST environment variable to 0.0.0.0:11434 and restart Ollama. Other machines then call http://<your-ip>:11434/v1/.

macOS

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# then restart the Ollama app

Linux (systemd)

sudo systemctl edit ollama.service
# under [Service] add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama

Windows: open Settings → search "environment variables" → add OLLAMA_HOST = 0.0.0.0:11434 for your account, then restart Ollama.
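From a client machine, it helps to confirm the port is reachable before pointing a tool at it. A minimal sketch: port_open attempts a plain TCP connection, and openai_base_url builds the /v1/ URL described above. The function names and the address 192.168.1.50 are placeholders.

```python
import socket


def port_open(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or blocked by a firewall.
        return False


def openai_base_url(host: str, port: int = 11434) -> str:
    """Build the OpenAI-compatible base URL for a given Ollama host."""
    return f"http://{host}:{port}/v1/"


if __name__ == "__main__":
    server = "192.168.1.50"  # placeholder LAN address, replace with your own
    print("reachable:", port_open(server))
    print("base_url: ", openai_base_url(server))
```

If port_open returns False while the server works locally, the usual suspects are OLLAMA_HOST not being applied after a restart or a firewall on the host.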

Useful environment variables

Variable            What it controls
OLLAMA_HOST         Which address/port the server binds to (e.g. 0.0.0.0:11434 for LAN access)
OLLAMA_MODELS       Directory where downloaded models are stored (default ~/.ollama/models)
OLLAMA_KEEP_ALIVE   How long a model stays loaded in RAM: a duration such as 10m or 24h, a number of seconds, 0 to unload immediately, or a negative number to keep it loaded indefinitely

Connect Hermes to Ollama

Edit ~/.hermes/config.yaml — two lines. Then add a placeholder key to ~/.hermes/.env.

~/.hermes/config.yaml

provider: custom
base_url: "http://localhost:11434/v1/"

For a model on another machine, replace localhost with its LAN IP.

~/.hermes/.env

OPENAI_API_KEY=ollama

Ollama ignores the key value. Hermes still requires the variable to be present.

See the Hermes course, lesson hermes-00 for full install instructions. Once these two lines are set, start Hermes with hermes --tui — it will use your local Ollama model for all conversations, with no data leaving your machine.

Local vs cloud — quick decision

Situation                                                                 Use
Sensitive or unpublished data (patient records, pre-submission results)   Local Ollama
Offline work (plane, field, no-internet lab)                              Local Ollama
A private API endpoint for your scripts or pipeline                       Local: OpenAI-compatible /v1/
Shared lab endpoint on a spare/headless machine                           Local: OLLAMA_HOST=0.0.0.0
Complex reasoning, very long documents, frontier capability               Cloud (Claude / GPT / Gemini)
Hardware is a constraint (old laptop, no GPU)                             Cloud or a smaller local model