Hardware Matcher
Run Gemma 4 Locally
Auto-detects your GPU — get the right model and command for your hardware
What this matcher helps you decide
This tool eliminates the guesswork:
- Which Gemma 4 tier fits your hardware — E2B for phones, 26B MoE for consumer desktops, 31B Dense for workstations.
- What framework to use — Ollama for CLI users, LM Studio for a GUI, Google AI Edge Gallery for mobile.
- What to expect in speed and stability — realistic token-per-second estimates and known caveats for your specific configuration.
Common Gemma 4 local setups
| Hardware | Recommended Setup | Best For | Notes |
|---|---|---|---|
| MacBook Pro 16 GB | 26B-A4B MoE via Ollama (GGUF) | General chat | Keep context under 8k; avoid MLX (known bugs) |
| RTX 4060 8 GB | E4B via Ollama / LM Studio | Chat, lightweight local use | 26B MoE needs 12 GB+ VRAM |
| RTX 4070 Ti 16 GB | 26B-A4B MoE via Ollama | Chat, multimodal, coding | Sweet spot — comfortable headroom for MoE |
| RTX 4090 24 GB | 31B Dense via Ollama | Coding, deep reasoning | Use -np 1; long context (10k+) is tight even on 24 GB |
| iPhone 15 Pro | E2B via AI Edge Gallery | Offline chat, translation | E4B crashes on <10 GB RAM — stick to E2B |
| Android flagship 8 GB | E2B via AI Edge Gallery | Offline assistant | E4B needs 10 GB+ RAM; E2B is the safe choice |
Gemma 4 hardware requirements: what you really need
Edge & Mobile (E2B / E4B): 4 GB to 8 GB RAM
Perfect for on-device translation and basic tasks on iOS, Android, or Raspberry Pi. The E2B model (2B parameters) runs within 4 GB of RAM; the E4B handles slightly more complex reasoning and fits in 6–8 GB. No GPU required — these models run entirely on the Neural Engine or CPU. Expect 10–25 tok/s on modern smartphone silicon.
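A rule of thumb lets you sanity-check whether a tier fits your RAM. The 4.5 bits-per-weight figure below is our assumed average for Q4-class GGUF quants, not an official Gemma 4 number:

```shell
# Rough model-weight footprint: params (billions) x bits-per-weight / 8.
# 4.5 bpw is an assumed Q4-class average, not an official Gemma 4 figure.
est_mb() { awk -v p="$1" 'BEGIN { printf "%d MB\n", p * 4.5 / 8 * 1000 }'; }
est_mb 2   # E2B: ~1.1 GB of weights, leaving headroom inside 4 GB of RAM
est_mb 4   # E4B: ~2.2 GB, which is why 6-8 GB devices are recommended
```

The remainder of the device's RAM has to cover the KV cache, the runtime, and the OS, which is why the stated minimums sit well above the raw weight size.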
Consumer Sweet Spot (26B-A4B MoE): 12 GB+ VRAM
The 26B Mixture-of-Experts model is the recommended option for running Gemma 4 locally on consumer hardware. Despite its 26B total parameter count, only ~4B activate per forward pass, keeping inference fast. 12 GB VRAM works but is tight due to Gemma 4's SWA cache — you must set `OLLAMA_NUM_PARALLEL=1` and keep context under 8k. 16 GB+ VRAM is where it gets comfortable. On Apple Silicon, 16 GB unified memory is sufficient for short-to-medium context. Expect 17–55 tok/s depending on hardware.
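In practice, the two constraints above translate into one environment variable and one request option. This is a sketch: the `gemma4:26b` tag matches the command shown later on this page, while `OLLAMA_NUM_PARALLEL` and `num_ctx` are standard Ollama knobs:

```shell
# One request slot means only one SWA cache is allocated at a time.
export OLLAMA_NUM_PARALLEL=1
ollama serve &

# Cap the context window at 8k through the API's num_ctx option.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Explain mixture-of-experts routing in one paragraph.",
  "options": { "num_ctx": 8192 }
}'
```

This requires a running Ollama server, so treat it as an invocation template rather than a copy-paste benchmark.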
High-End Workstations (31B Dense): 24 GB+ VRAM / 32 GB+ Mac
The flagship 31B Dense model requires a high-end GPU (RTX 4090/5090, 24 GB+) or a Mac with at least 32 GB unified memory (64 GB+ recommended for long context). Q4_K_M weights are ~17.5 GB alone — on a 24 GB Mac, macOS overhead leaves near-zero room for KV cache, causing severe swap thrashing. Even on 24 GB VRAM, long coding context (10k+) is very tight. Set `OLLAMA_NUM_PARALLEL=1` and keep context short. Best suited for research, deep-context coding, and long-form analysis.
How to run Gemma 4 locally
Three steps from zero to a working local setup.
Pick a runtime
Ollama
CLI-first. One command to download and run. Best for developers comfortable with the terminal.
LM Studio
GUI app for Mac and Windows. Visual model browser, real-time VRAM monitoring, no terminal needed.
Mobile (AI Edge Gallery)
Run E-series models offline on iOS and Android. No cloud, no API key.
Download the right model
Use the matcher above to find your recommended tier, then pull the model. For Ollama:

`ollama run gemma4:26b`

Run and verify

Start a conversation and watch memory usage. If you hit OOM errors or slowdowns, reduce context length via `--ctx-size 4096` or try a more aggressively quantized variant.
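To watch memory usage concretely on an Ollama setup, `ollama ps` reports each loaded model's footprint and whether it spilled off the GPU:

```shell
# "100% GPU" in the PROCESSOR column is healthy; any CPU share means the
# model spilled into system RAM and throughput will drop sharply.
ollama ps
```

If a CPU share appears, shrink the context window first; switching to a smaller quant is the next step.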
Related walkthroughs
Beginner setup walkthrough
Installing Ollama, downloading the model, and running your first local chat session.
Mobile setup demo
Running Gemma 4 entirely offline on an iPhone.
Gemma 4 model tiers explained
| Model | Best For | Minimum Hardware | Speed Expectation |
|---|---|---|---|
| 31B Dense | Research, deep coding, long-form reasoning | 24 GB+ VRAM or 32 GB+ Apple RAM | 10–30 tok/s |
| 26B-A4B MoE ★ | General chat, coding, multimodal | 12 GB+ VRAM (tight — use -np 1, keep ctx <8k) or 16 GB Apple RAM | 17–55 tok/s |
| E4B | Edge tasks, offline assistant, mobile | 6–8 GB RAM, no GPU needed | 10–35 tok/s |
| E2B | Minimal tasks, translation, embedded | 4–6 GB RAM, embedded hardware | 15–25 tok/s |
Local setup options: GGUF, MLX, and more
When setting up Gemma 4 locally, the model format you choose can meaningfully affect performance.
GGUF — The Universal Standard
GGUF is the format used by llama.cpp and Ollama. It supports all quantization levels (Q4_K_M, Q6_K, IQ2_XXS, etc.) and runs on any hardware — NVIDIA CUDA, Apple Metal, AMD ROCm, and CPU. Most users running Gemma 4 locally should start here.
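If you need a quantization level that isn't published, llama.cpp's tooling can produce it from a full-precision GGUF. A sketch (the input filename is a placeholder, and Q6_K is one of the levels listed above):

```shell
# Re-quantize a full-precision GGUF to Q6_K using llama.cpp's llama-quantize.
# "gemma4-26b-f16.gguf" is a placeholder filename, not a real download.
llama-quantize gemma4-26b-f16.gguf gemma4-26b-Q6_K.gguf Q6_K
```

This needs an existing f16/f32 GGUF on disk, so most users are better served by downloading a pre-quantized variant.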
MLX — Apple Silicon Native (⚠ Not Recommended for Gemma 4)
Apple's MLX framework runs natively on M-series chips and can deliver better throughput than GGUF for some models. However, MLX has confirmed bugs with Gemma 4: Markdown output corruption, token parsing errors, and inconsistent formatting have been reported by multiple community members. Use GGUF via Ollama or llama.cpp instead until MLX Gemma 4 support stabilizes.
EXL2 / ExLlamaV2
EXL2 quantization is optimized for NVIDIA GPUs and provides higher quality at equivalent bit depths compared to GGUF. Requires ExLlamaV2 as the backend. Best for 24 GB+ VRAM workstations targeting maximum throughput.
Gemma 4 for coding: Is it the right local model?
Gemma 4 31B is exceptional at creative writing, multilingual tasks, and general reasoning. In coding benchmarks it performs well, particularly on algorithmic problems and architecture discussions that benefit from strong natural language reasoning.
However, for strict agentic coding — autonomous multi-step tool-calling workflows like those used in Claude Code, OpenCode, or similar frameworks — community reports often favor Qwen-family models, citing more consistent tool-call formatting in sequential function-call chains.
Recommendation: Use Gemma 4 26B-A4B MoE for general developer chat, code review, and documentation. Use Qwen 3.5 14B or 32B if your primary workflow is fully autonomous coding agents with heavy tool use.
Can you run Gemma 4 locally on Mac, iPhone, or Android?
Mac (Apple Silicon)
Apple's unified memory architecture is a practical fit for the 26B-A4B MoE — on 16 GB Apple Silicon Macs, expect 17–22 tok/s with short-to-medium context (keep under 8k to avoid swap). Use GGUF format via Ollama — MLX has known bugs with Gemma 4. For the 31B Dense model, 32 GB is the minimum (64 GB+ recommended for long context). 24 GB Macs cannot run 31B — model weights (~17.5 GB) plus macOS overhead leave near-zero room for KV cache, causing severe swap thrashing.
iPhone / iOS
The E2B model runs entirely offline on recent flagship iPhones via Google AI Edge Gallery — no internet connection or API key required. Useful for private, on-device translation and basic Q&A. E4B (Q4_K_M ~9.6 GB) will crash on iPhones with less than 10 GB RAM — stick to E2B for reliable offline chat on 6–8 GB devices.
Android
Google provides native Gemma E-series support through AI Edge Gallery on Android. E2B runs comfortably on 6–8 GB Android flagships. E4B needs 10 GB+ RAM — most phones don't have that, so stick to E2B for stability.
Frequently asked questions
How does the automatic GPU detection work?

Why does the tool detect my GPU but not my RAM?

The `navigator.deviceMemory` API exists but is limited to Chromium browsers and caps at 8 GB — not useful for local AI workloads. GPU model names, on the other hand, are exposed through WebGPU and WebGL for rendering purposes, which is why we can detect your GPU but still need you to select your RAM manually.