Product

How to Enable and Use a Local LLM

How to Enable and Use a Local LLM

This guide walks through installing Ollama on your VPS, downloading a small model, and switching your agent to use it for chat.

What this gives you

A local LLM runs entirely on your VPS — no API calls leave the box, no per-token billing, and no rate limits. The trade-off is quality: small models that fit on a basic-tier VPS (1–4 GB) are useful for simple tasks but not as capable as Claude or GPT-5. Most users keep a cloud model as the primary and switch to local for specific tasks (privacy-sensitive work, offline use, cost control).


1) Install Ollama and a model

Go to Settings → Local LLM → Install New Model. Pick a model from the dropdown:

  • TinyLlama (1.1B) — 637 MB. Recommended for basic-tier VPS (2 GB RAM).
  • Qwen 2.5 (0.5B) — 397 MB. Smallest. Fastest. Useful for housekeeping tasks.
  • Llama 3.2 (1B) — 1.3 GB.
  • Phi-3 Mini (3.8B) — 2.3 GB. Needs at least 4 GB RAM.
  • Gemma 2 (2B) — 1.6 GB.

Click Install Selected Model. The webui installs Ollama itself if it’s missing, then pulls the model. Total time on first install is usually 3–5 minutes.

When it finishes, the This Instance status row reads ● Running and lists the installed model(s).


2) Switch your agent to the local model

Go to Settings → AI Model → Primary Model. The dropdown will now include any locally installed Ollama models, labeled <name> (Local Ollama). Pick one and click Save Model.

The OpenClaw gateway restarts and starts using the local model on the next message.

If the model doesn’t appear in the dropdown right away, refresh the Settings page. The dropdown queries Ollama live each time it loads.

Switching just for one session

You can also switch in chat without changing the default. Open the chat panel and run:

/model ollama/tinyllama

The session uses the local model until you switch back or restart. The default in Settings stays whatever you had set.


3) Verify it’s working

In chat, run:

/status

The active model line should show ollama/<your-model>.

You can also tail the gateway log from Logs to confirm requests are going to localhost:11434 instead of an outbound provider.


How it works under the hood

  • Ollama runs as a system service on your VPS, listening on http://127.0.0.1:11434.
  • The bundled OpenClaw ollama plugin auto-discovers models on that endpoint and uses synthetic auth — no API key required for local use.
  • The webui’s Settings → AI Model dropdown queries Ollama live and shows whatever models are installed.
  • When you save an ollama/... model as primary, the gateway writes it to openclaw.json and the plugin activates automatically.

The “Enable Local LLM” toggle on the Local LLM card is a separate task-routing knob (heartbeat / log analysis / sensitive data) and does not affect chat. Picking a local model from the AI Model dropdown is what makes chat use it.


Troubleshooting

Model didn’t appear in the AI Model dropdown after install

The dropdown queries Ollama live. If the install just finished, refresh the Settings page once. If it still doesn’t appear, check Local LLM → This Instance — if it says ⚠ Installed but not running, the Ollama service may need a restart:

sudo systemctl restart ollama

Chat is slow or times out

Small models on a 2 GB-RAM VPS are tight. First-token latency can be a few seconds and longer responses can take 30+ seconds. If you see timeouts:

  • Make sure swap is at least 2 GB (Settings → Virtual Memory).
  • Try a smaller model (Qwen 2.5 0.5B is the lightest).
  • Don’t run multiple chats concurrently against the local model.

“Model not found” errors after switching

The model name in the dropdown matches Ollama’s tag exactly (e.g. tinyllama:latest). If you manually edited openclaw.json or used a different tag, the gateway won’t be able to resolve it. Re-pick from the Settings dropdown to get the correct name.

Switching back to the cloud model

Go to Settings → AI Model → Primary Model and pick your Anthropic or OpenAI Codex model. Save. The gateway switches on next restart.


What’s coming in 0.2

The current bundle treats local LLMs as a manual switch — install, pick from dropdown, you’re done. In 0.2 the plan is:

  • Ship with a small model preinstalled (likely Qwen 2.5 0.5B) for housekeeping tasks.
  • Wire the task-routing toggle so heartbeat, log analysis, and sensitive-data flows automatically use local while chat stays on cloud.
  • Add a second-brain instance for embeddings and memory operations on local hardware.

For now: install, pick, use.