Local AI · Ollama v0.19

Ollama v0.19 + Apple Silicon: Local LLMs Are Fast Enough Now

Ollama v0.19 shipped Apple MLX integration in late March, and the speed jump on M-series chips is not marginal: it's the inflection point where local inference stops being a tradeoff and starts being a genuine alternative to cloud APIs. Here's what changed, what models benefit most, and how to run it all through OpenClaw.

🦞 claw.mobile Editorial · April 3, 2026 · 10 min read

The promise of local LLMs has always been clear: no API costs, complete data privacy, offline operation, zero latency to the cloud. The problem has been equally clear: they were slower, required careful model selection, and for most users, the friction was real enough to keep cloud APIs as the default.

Ollama v0.19 changes that calculus in a meaningful way, specifically for anyone running Apple silicon. The integration with Apple's MLX framework delivers prefill and decode speed improvements that are substantial: not the "technically faster" kind of improvement, but the kind that changes which models you can practically use for daily-driving.

This article covers what actually changed in v0.19, why MLX delivers the gains it does, which models to run, and how to wire the whole thing into OpenClaw so your persistent agent runs locally at API-competitive speeds.

What Changed in Ollama v0.19

The headline change in v0.19 is MLX backend support. Previously, Ollama on macOS used a Metal-based inference path that was already faster than CPU-only setups but didn't fully exploit the specific optimizations in Apple's neural engine architecture. MLX is Apple's array framework, designed specifically for ML workloads on Apple silicon; it understands the hardware at a lower level and schedules operations more efficiently across the CPU, GPU, and Neural Engine.

Prefill Speed

The rate at which the model processes your input prompt. v0.19 shows significant gains here, meaning the time between sending a long message and getting the first token back drops substantially.

Decode Speed

How fast the model generates tokens after the first one. Measured in tokens/second. v0.19's MLX path improves this, making streaming output feel more natural on M-series chips.

Memory Efficiency

MLX's unified memory model means better use of the shared memory pool on Apple silicon, reducing the chance of model-too-large errors on machines with 16-32GB RAM.

Beyond MLX, v0.19 also improved multi-model serving (switching between models in the same session with less overhead) and updated the model library with optimized versions of several of the most popular models for the MLX backend. If you've had Ollama installed for more than a few months, a fresh update plus re-pulling your models gets you the optimized versions.
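The multi-model serving improvement interacts with Ollama's `keep_alive` request parameter, which controls how long a model stays resident in memory after a call. A minimal sketch of a request body using it, assuming the standard `/api/generate` endpoint and field names from Ollama's API documentation (the values are illustrative):

```python
import json

# Sketch of an /api/generate request body. "keep_alive" controls how long
# Ollama keeps the model loaded after the call, which is what determines
# model-switching overhead in a multi-model session.
payload = {
    "model": "llama4:scout",
    "prompt": "hello",
    "stream": False,
    "keep_alive": "10m",  # keep the model resident for 10 minutes
}
body = json.dumps(payload)
# POST `body` to http://localhost:11434/api/generate with any HTTP client.
```

Setting `keep_alive` generously on the models you alternate between is what makes same-session switching feel cheap.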

MLX: Why the Hardware Integration Matters

Apple silicon (M1, M2, M3, M4, M5) uses a unified memory architecture in which the CPU, GPU, and Neural Engine all share the same memory pool. This is architecturally different from a traditional PC with discrete GPU VRAM, and it creates a specific opportunity for ML frameworks: you can transfer data between compute units without the bandwidth tax of PCIe copies.

Standard ML frameworks (PyTorch, TensorFlow) were designed for the traditional architecture. They work on Apple silicon but don't exploit the unified memory properties optimally. MLX was built from scratch with this architecture in mind. It schedules operations lazily, meaning it can fuse multiple operations and decide at the last moment which compute unit runs them based on current load, a luxury that standard frameworks don't have because they assume memory boundaries.

Why M5 Benefits Most

M5 chips have the highest memory bandwidth of any Apple silicon generation, up to 800 GB/s on the M5 Ultra. LLM inference is heavily memory-bandwidth-bound, so a framework that fully utilizes that bandwidth (MLX) versus one that partially utilizes it (Metal without MLX) shows the biggest gap on the chips with the most bandwidth to exploit.

On M3/M4, the gains are still real but smaller. On M1/M2, you'll see improvement but the absolute bandwidth ceiling is lower. If you're running a Mac mini M4 Pro or M5, v0.19 is a meaningful upgrade in practice.
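A back-of-envelope bound makes the bandwidth argument concrete: decoding one token streams every active weight through memory at least once, so memory bandwidth divided by weight size gives a rough ceiling on decode speed. A sketch, where 800 GB/s is the M5 Ultra figure above, the 12B dense model at 4-bit is an illustrative assumption, and real-world throughput lands well below this bound:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: int) -> float:
    """Rough upper bound on decode tok/s: each generated token must read
    all active weights from memory, so bandwidth / weight-bytes bounds it."""
    weight_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_gb

# 12B dense model, 4-bit quantized (~6 GB of weights), M5 Ultra bandwidth:
print(round(decode_ceiling_tok_s(800, 12, 4)))  # ~133 tok/s theoretical ceiling
```

The same arithmetic explains why quantization helps speed as well as memory: halving bits per weight roughly doubles the ceiling.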

Real-World Speed Numbers

Community benchmarks from early adopters of v0.19 on Apple silicon. These are informal numbers from consistent testing conditions, not controlled lab benchmarks, but the pattern holds across hardware generations.

| Model | Hardware | Pre-v0.19 (tok/s) | v0.19 MLX (tok/s) | Improvement |
|---|---|---|---|---|
| Llama 4 Scout (17B) | M4 Pro (24GB) | ~28 | ~47 | +68% |
| Gemma 4 12B | M3 (16GB) | ~22 | ~35 | +59% |
| Qwen3-Next 14B | M4 (16GB) | ~18 | ~31 | +72% |
| DeepSeek V3.2 (Q4) | M5 Pro (48GB) | ~8 | ~15 | +87% |

Numbers from community testing. Results vary by hardware config, quantization, and prompt characteristics.
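You can reproduce these numbers on your own hardware without extra tooling: with `"stream": false`, Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating them). A small helper, with a fabricated response fragment used purely for illustration:

```python
def tokens_per_second(resp: dict) -> float:
    """Convert Ollama /api/generate timing fields into decode tok/s.
    eval_duration is reported in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Fabricated response fragment: 470 tokens generated in 10 seconds.
resp = {"eval_count": 470, "eval_duration": 10_000_000_000}
print(round(tokens_per_second(resp)))  # 47
```

Run the same prompt before and after updating to compare your own pre- and post-MLX numbers.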

The numbers that matter most for practical daily use: below roughly 30 tok/s, streaming output starts to feel sluggish in interactive conversation. Getting models like Llama 4 Scout and Gemma 4 into the 35-50 tok/s range on consumer hardware crosses the threshold where local inference is comfortable for back-and-forth conversation, not just batch tasks.

Best Models to Run Locally in 2026

The local model landscape in April 2026 is meaningfully different from six months ago. Here's the practical breakdown for OpenClaw users:

llama4:scout
Best Overall

Meta's Llama 4 Scout (17B MoE): multimodal, native function calling, fast on MLX. The default recommendation for anyone running OpenClaw locally in 2026. Fits in 16GB with 4-bit quantization. Read the full Llama 4 + OpenClaw guide for setup details.

ollama pull llama4:scout
gemma4:12b
Best Reasoning

Google's Gemma 4 12B: exceptional reasoning per parameter, native function calling, AIME 2026 performance well above its weight class. Works well as the brain of an OpenClaw agent on 16GB hardware. Also available as the 26B MoE for machines with more RAM.

ollama pull gemma4:12b
qwen3-next:14b
Best for Coding

Qwen3-Next 14B: strong code generation, structured output, and tool-use capabilities. Alibaba's Qwen3 series has caught up significantly with Western models on coding benchmarks. A solid Claude Code alternative for local coding tasks.

ollama pull qwen3-next:14b
deepseek-v3:q4
Best on 48GB+

DeepSeek V3.2 (Q4): if you have 48GB+ RAM (Mac Studio M5 Pro or Ultra), DeepSeek V3.2 quantized to Q4 is the closest thing to frontier model performance you can run locally. The v0.19 speed improvements make it genuinely usable now, versus agonizing on pre-MLX versions.

ollama pull deepseek-v3:q4_K_M

Wire Ollama v0.19 into OpenClaw

OpenClaw has had Ollama support for a while, but at v0.19 performance levels it's worth revisiting the setup and treating local models as genuine first-class options rather than fallbacks. Here's the full path.

Step 1: Update and Pull Your Model

# Update Ollama to v0.19+ (the macOS app auto-updates; the script also works)
curl -fsSL https://ollama.com/install.sh | sh
# Confirm you're on v0.19+
ollama --version
# Pull a model (re-pull to get the MLX-optimized version)
ollama pull llama4:scout
# Verify it's working
ollama run llama4:scout "hello"

Step 2: Configure OpenClaw

Add the Ollama provider to your OpenClaw config. Ollama runs a local API server on port 11434 by default:

{
  "model": "ollama/llama4:scout",
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434"
    }
  }
}

Step 3: Switch Models On-Demand

In Telegram, you can switch between local and cloud models mid-conversation:

# Switch to local Llama 4
/model ollama/llama4:scout
# Switch to cloud Claude for complex tasks
/model anthropic/claude-sonnet-4-6
# Switch back to local
/model ollama/gemma4:12b

The hybrid approach

The most practical setup: run a local Ollama model as your default for everyday tasks (privacy, zero cost, fast), and route complex reasoning or long-context work to Claude; the cost calculator analysis breaks down what that cloud fallback costs. With v0.19 speeds, the quality gap for routine tasks has closed enough that this is a real option, not just a theory.
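One way to make the hybrid concrete is a small routing shim in front of your model choice. This is a sketch, not OpenClaw's actual routing API: the model IDs come from this article's examples, the ~4 characters-per-token estimate is a rough rule of thumb, and the 50K-token threshold mirrors the long-context caveat discussed below.

```python
# Hypothetical routing shim: default local, escalate to cloud when needed.
LOCAL_MODEL = "ollama/llama4:scout"
CLOUD_MODEL = "anthropic/claude-sonnet-4-6"

def pick_model(prompt: str, needs_frontier: bool = False) -> str:
    # Rough token estimate: ~4 characters per token for English text.
    est_tokens = len(prompt) / 4
    if needs_frontier or est_tokens > 50_000:  # long-context work goes to cloud
        return CLOUD_MODEL
    return LOCAL_MODEL

print(pick_model("summarize my notes"))  # everyday task stays local
print(pick_model("x" * 300_000))         # ~75K tokens, escalates to cloud
```

The `/model` command shown above is the manual version of this decision; a shim like this just automates the default.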

For the full OpenClaw + Ollama setup walkthrough including skills, cron jobs, and multi-model routing, see the OpenClaw with Ollama deep-dive. The main setup guide covers the full installation path from scratch.

Honest Take: Is Local AI Ready to Daily-Drive?

Yes, with qualifications.

For routine tasks โ€” drafting, research synthesis, code explanation, conversation, light tool use โ€” a well-configured Llama 4 Scout or Gemma 4 12B on Ollama v0.19 delivers results that are genuinely hard to distinguish from mid-tier cloud models. The speed is there, the quality is there, and the privacy and cost advantages are real.

The qualifications: complex multi-step reasoning, very long context tasks (>50K tokens), and tasks requiring the latest training data still favor frontier cloud models. Claude Opus 4.6 on a genuinely difficult software engineering task is not the same as Llama 4 Scout on the same task. The benchmark gap is real.

The practical recommendation: set local as your default in OpenClaw, route to cloud when you need frontier capability. Use the cost calculator to understand what the cloud fallback actually costs at your usage level. For most users, you'll end up with a hybrid that runs 70-80% of requests locally and costs dramatically less than running everything in the cloud.
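The arithmetic behind that claim is simple to sketch. Every number here is a hypothetical placeholder, not a real API rate or real usage; plug in figures from the cost calculator for your own case:

```python
# Hypothetical inputs -- placeholders, not real prices or usage.
cloud_cost_per_1k_tokens = 0.01   # blended $/1K tokens, placeholder
monthly_tokens = 5_000_000        # placeholder monthly volume
local_share = 0.75                # fraction of requests served locally

all_cloud = monthly_tokens / 1000 * cloud_cost_per_1k_tokens
hybrid = all_cloud * (1 - local_share)  # local requests cost $0 per token
print(round(all_cloud, 2), round(hybrid, 2))  # 50.0 12.5
```

Whatever the real rates, the structure holds: the hybrid bill scales with the share of requests you still send to the cloud.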

The trend line is also clear: local models are improving faster than cloud model pricing is falling. The gap will continue to close. Ollama v0.19 is another step in that direction, and it's a real one.

- $0 per-token API cost for local inference
- 100% of your data stays on your hardware
- +68% speed improvement on M4 Pro vs pre-MLX