Gemma 4 Local

Mac Setup Guide

Run Gemma 4 on Mac with Apple Silicon

Find the best model for your M1/M2/M3/M4 — auto-detects your chip and unified memory


Why Apple Silicon is ideal for running Gemma 4 locally

Apple's unified memory architecture gives Macs a unique advantage for local LLMs — the GPU shares system RAM, so a 32 GB MacBook has nearly 32 GB available for model weights (macOS reserves a slice for itself). Discrete GPUs on PCs are limited to their own VRAM, typically 8–24 GB.


Unified Memory

All RAM is shared between CPU and GPU — no VRAM bottleneck. 16 GB Mac ≈ 16 GB GPU.

Metal Acceleration

llama.cpp and Ollama use Metal natively. No CUDA needed — fast inference out of the box.


Silent & Efficient

Fanless MacBook Airs run LLMs in complete silence, and fan-equipped M-series Macs rarely spin up. Run Gemma 4 during meetings with zero noise.

Which Gemma 4 model fits your Mac?

| Mac config | Best model | Speed | Notes |
|---|---|---|---|
| M1/M2/M3 · 8 GB | E4B (Q4_K_M) | 10–20 tok/s | Keep context short (<4k). E2B for longer chats. |
| M1/M2/M3/M4 · 16 GB | 26B MoE (Q4_K_M) | 17–22 tok/s | Sweet spot. Keep context under 8k to avoid swap. |
| M1 Pro/Max/M2+ · 32 GB | 31B Dense (Q6_K) | 12–18 tok/s | Near-lossless quality. 26B MoE also runs fast here. |
| M1 Ultra/M2+ · 64 GB+ | 31B Dense (FP16) | 15–25 tok/s | Full weights + long context. No compromises. |
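The table condenses to a simple RAM threshold check. A sketch of the matcher's logic (the function name and thresholds are ours, read off the table above, not part of any tool):

```shell
# Map unified-memory size (GB) to a recommended Gemma 4 model tier.
recommend_model() {
  ram_gb="$1"
  if   [ "$ram_gb" -ge 64 ]; then echo "31B Dense (FP16)"
  elif [ "$ram_gb" -ge 32 ]; then echo "31B Dense (Q6_K)"
  elif [ "$ram_gb" -ge 16 ]; then echo "26B MoE (Q4_K_M)"
  else                            echo "E4B (Q4_K_M)"
  fi
}

# On a Mac, read unified memory via sysctl (bytes -> GB):
#   ram_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
recommend_model 16   # prints: 26B MoE (Q4_K_M)
```

Note that a 24 GB Mac lands in the 16 GB tier here, which matches the warning in the performance tips below.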

3 steps to run Gemma 4 on your Mac

Step 1

Install Ollama

Download from ollama.com. One installer, no dependencies. Uses Metal automatically.

Step 2

Pull the model

Run ollama pull gemma4:26b (or whichever model the matcher recommends).

Step 3

Start chatting

Run ollama run gemma4:26b — that's it. Use the matcher above for your exact command.

GGUF vs MLX: Which format on Mac?

Use GGUF via Ollama or llama.cpp — it's the universal standard with the best Gemma 4 compatibility. llama.cpp uses Metal acceleration automatically on Apple Silicon.

Apple's MLX framework can deliver better throughput for some models, but MLX has confirmed bugs with Gemma 4: Markdown output corruption, token parsing errors, and inconsistent formatting have been reported by multiple community members. Stick to GGUF until MLX support stabilizes.

Watch: Gemma 4 Mac setup walkthrough

Step-by-step: installing Ollama, downloading Gemma 4, and running your first local chat session on Mac.

Mac performance tips

Close memory-heavy apps — Chrome tabs, Docker, and Xcode compete for unified memory. Quit them before running 26B+ models on 16 GB Macs.

Set OLLAMA_NUM_PARALLEL=1 for 26B/31B models. This cuts the sliding-window attention (SWA) cache from ~3.2 GB to ~1.2 GB, which is crucial on 16 GB Macs.
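Two ways to apply that setting: export covers an ollama serve started from the same terminal, while launchctl setenv reaches the macOS menu-bar app (restart Ollama afterwards). The launchctl call is guarded so the snippet is a no-op off macOS:

```shell
# For an `ollama serve` launched from this shell session:
export OLLAMA_NUM_PARALLEL=1

# For the Ollama menu-bar app (macOS user launchd environment;
# restart the app so it picks the variable up):
command -v launchctl >/dev/null 2>&1 \
  && launchctl setenv OLLAMA_NUM_PARALLEL 1 \
  || true
```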

Watch for swap thrashing — if Activity Monitor shows high "Memory Pressure" (yellow/red), the model is too large. Drop to a smaller tier or shorter context window.

24 GB cannot run 31B Dense — model weights (~17.5 GB) plus macOS overhead leave near-zero room for KV cache. Use 26B MoE instead.
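A back-of-envelope check of that claim. The ~4.5 bits/weight average for a 4-bit K-quant and the ~75% GPU wired-memory cap are our assumptions, not figures from this guide:

```shell
awk 'BEGIN {
  weights_gb = 31e9 * 4.5 / 8 / 1e9   # 31B params at ~4.5 bits/weight (assumed)
  gpu_budget = 24 * 0.75              # macOS caps GPU-wired memory near 75% of RAM (assumed)
  printf "weights:    %.1f GB\n", weights_gb
  printf "GPU budget: %.1f GB\n", gpu_budget
  printf "KV room:    %.1f GB\n", gpu_budget - weights_gb
}'
```

Roughly 17.4 GB of weights against an ~18 GB GPU budget leaves well under 1 GB for the KV cache, hence the recommendation to drop to 26B MoE.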

Mac setup FAQ

Does Gemma 4 use the GPU on Mac or just the CPU?
Both. Apple Silicon's unified memory means CPU and GPU share the same RAM. Ollama and llama.cpp use Metal acceleration by default — the GPU handles matrix operations while the CPU handles tokenization and scheduling. No configuration needed.
Can I run Gemma 4 on an Intel Mac?
Technically yes (CPU-only mode), but it will be very slow. Intel Macs don't have unified memory, and Metal GPU inference for LLMs is poorly supported there, so everything runs on the CPU. The E2B model might be usable; anything larger is impractical. Apple Silicon is strongly recommended.
Why does the detector identify my chip as "Apple Silicon" instead of the exact model?
Our detector uses WebGPU and WebGL renderer strings to identify your exact chip (M1, M2 Pro, M4 Max, etc.) when the browser exposes it. In some browsers or privacy configurations, only "Apple GPU" is reported — in that case we fall back to benchmark-based identification. Either way, the model recommendation is based on your RAM amount, which is what actually determines which models fit.