Run Gemma 4 Locally

Auto-detects your GPU — get the right model and command for your hardware


What this matcher helps you decide

This tool eliminates the guesswork:

  • Which Gemma 4 tier fits your hardware — E2B for phones, 26B MoE for consumer desktops, 31B Dense for workstations.
  • What framework to use — Ollama for CLI users, LM Studio for a GUI, Google AI Edge Gallery for mobile.
  • What to expect in speed and stability — realistic token-per-second estimates and known caveats for your specific configuration.

Common Gemma 4 local setups

| Hardware | Recommended Setup | Best For | Notes |
|---|---|---|---|
| MacBook Pro 16 GB | 26B-A4B MoE via Ollama (GGUF) | General chat | Keep context under 8k; avoid MLX (known bugs) |
| RTX 4060 8 GB | E4B via Ollama / LM Studio | Chat, lightweight local use | 26B MoE needs 12 GB+ VRAM |
| RTX 4070 Ti 16 GB | 26B-A4B MoE via Ollama | Chat, multimodal, coding | Sweet spot — comfortable headroom for MoE |
| RTX 4090 24 GB | 31B Dense via Ollama | Coding, deep reasoning | Use -np 1; long context (10k+) is tight even on 24 GB |
| iPhone 15 Pro | E2B via AI Edge Gallery | Offline chat, translation | E4B crashes on <10 GB RAM — stick to E2B |
| Android flagship 8 GB | E2B via AI Edge Gallery | Offline assistant | E4B needs 10 GB+ RAM; E2B is the safe choice |

Gemma 4 hardware requirements: what you really need

Edge & Mobile (E2B / E4B): 4 GB to 8 GB RAM

Perfect for on-device translation and basic tasks on iOS, Android, or Raspberry Pi. The E2B model (2B parameters) runs inside 4 GB of RAM; the E4B handles slightly more complex reasoning and fits in 6–8 GB. No GPU required — these models run entirely on the Neural Engine or CPU. Expect 10–25 tok/s on modern smartphone silicon.

Consumer Sweet Spot (26B-A4B MoE): 12 GB+ VRAM

The 26B Mixture-of-Experts model is the recommended option for running Gemma 4 locally on consumer hardware. Despite its 26B total parameter count, only ~4B activate per forward pass, keeping inference fast. 12 GB VRAM works but is tight due to Gemma 4's SWA cache — you must use OLLAMA_NUM_PARALLEL=1 and keep context under 8k. 16 GB+ VRAM is where it gets comfortable. On Apple Silicon, 16 GB unified memory is sufficient for short-to-medium context. Expect 17–55 tok/s depending on hardware.
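
The two knobs mentioned above can be set concretely. A minimal sketch, assuming the model ships under the tag gemma4:26b (illustrative) and a stock Ollama install — a Modelfile caps the context via num_ctx, and OLLAMA_NUM_PARALLEL=1 keeps only one request's cache resident:

```shell
# Cap context at 8k via a Modelfile (model tag is illustrative)
cat > Modelfile <<'EOF'
FROM gemma4:26b
PARAMETER num_ctx 8192
EOF

# With Ollama installed, register and serve with a single parallel slot:
#   OLLAMA_NUM_PARALLEL=1 ollama serve
#   ollama create gemma4-8k -f Modelfile
#   ollama run gemma4-8k
```

The env var must be set on the server process (ollama serve), not on the client command.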

High-End Workstations (31B Dense): 24 GB+ VRAM / 32 GB+ Mac

The flagship 31B Dense model requires a high-end GPU (RTX 4090/5090 24 GB+) or a Mac with at least 32 GB unified memory (64 GB+ recommended for long context). Q4_K_M weights are ~17.5 GB alone — on a 24 GB Mac, macOS overhead leaves near-zero room for KV cache, causing severe swap thrashing. Even on 24 GB VRAM, long coding context (10k+) is very tight. Use OLLAMA_NUM_PARALLEL=1 and keep context short. Best suited for research, deep-context coding, and long-form analysis.
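
To see why 24 GB is tight, a back-of-envelope VRAM budget helps. All figures are approximate, and the runtime overhead is an assumption:

```shell
# Rough VRAM budget for 31B Dense Q4_K_M on a 24 GB GPU (figures approximate)
total_mb=24576      # 24 GB card
weights_mb=17920    # ~17.5 GB of Q4_K_M weights
runtime_mb=1536     # CUDA/Metal buffers + activations (assumed)
kv_room_mb=$(( total_mb - weights_mb - runtime_mb ))
echo "Left for KV cache: ${kv_room_mb} MB"   # about 5 GB
```

Roughly 5 GB of headroom for the KV cache is why 10k+ contexts push the limit, and why a 24 GB Mac (which also hosts macOS in the same pool) cannot run this tier at all.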

How to run Gemma 4 locally

Three steps from zero to a working local setup.

1. Pick a runtime

Ollama

CLI-first. One command to download and run. Best for developers comfortable with the terminal.

LM Studio

GUI app for Mac and Windows. Visual model browser, real-time VRAM monitoring, no terminal needed.

Mobile (AI Edge Gallery)

Run E-series models offline on iOS and Android. No cloud, no API key.

2. Download the right model

Use the matcher above to find your recommended tier, then pull the model. For Ollama:

ollama run gemma4:26b

3. Run and verify

Start a conversation and watch memory usage. If you hit OOM errors or slowdowns, reduce the context length (--ctx-size 4096 in llama.cpp, or PARAMETER num_ctx 4096 in an Ollama Modelfile) or try a more aggressively quantized variant.
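
Reducing context helps because KV-cache memory grows roughly linearly with context length, so halving the context roughly halves the cache. A sketch assuming ~0.5 MB of cache per token — the real per-token cost depends on the model's layer count and attention configuration:

```shell
# Approximate KV-cache size at two context lengths (per-token cost is assumed)
per_token_kb=512
for ctx in 8192 4096; do
  echo "ctx ${ctx}: ~$(( ctx * per_token_kb / 1024 )) MB KV cache"
done
```

Dropping from 8k to 4k context frees about 2 GB under this assumption — often the difference between OOM and a stable session on a 12 GB card.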

Related walkthroughs

Beginner setup walkthrough

Installing Ollama, downloading the model, and running your first local chat session.

Mobile setup demo

Running Gemma 4 entirely offline on an iPhone.

Gemma 4 model tiers explained

| Model | Best For | Minimum Hardware | Speed Expectation |
|---|---|---|---|
| 31B Dense | Research, deep coding, long-form reasoning | 24 GB+ VRAM or 32 GB+ Apple RAM | 10–30 tok/s |
| 26B-A4B MoE | General chat, coding, multimodal | 12 GB+ VRAM or 16 GB Apple RAM (12 GB tight — use -np 1, keep ctx <8k) | 17–55 tok/s |
| E4B | Edge tasks, offline assistant, mobile | 6–8 GB RAM, no GPU needed | 10–35 tok/s |
| E2B | Minimal tasks, translation, embedded | 4–6 GB RAM, embedded hardware | 15–25 tok/s |

Local setup options: GGUF, MLX, and more

When setting up Gemma 4 locally, the model format you choose can meaningfully affect performance.

GGUF — The Universal Standard

GGUF is the format used by llama.cpp and Ollama. It supports all quantization levels (Q4_K_M, Q6_K, IQ2_XXS, etc.) and runs on any hardware — NVIDIA CUDA, Apple Metal, AMD ROCm, and CPU. Most users running Gemma 4 locally should start here.
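As a rule of thumb, a GGUF file's size is roughly parameters × bits-per-weight ÷ 8, and Q4_K_M averages about 4.5 bits per weight. A quick integer-math sketch (ignores embedding and metadata overhead):

```shell
# Estimate GGUF size: params (billions) * avg bits per weight / 8
params_b=26       # 26B MoE
bits_x10=45       # Q4_K_M ~4.5 bits/weight, stored as tenths for integer math
echo "~$(( params_b * bits_x10 / 80 )) GB"   # ~14 GB for the 26B at Q4_K_M
```

The same arithmetic for the 31B Dense model gives ~17 GB, matching the ~17.5 GB figure quoted above; lower quants like IQ2_XXS shrink this further at a quality cost.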

MLX — Apple Silicon Native (⚠ Not Recommended for Gemma 4)

Apple's MLX framework runs natively on M-series chips and can deliver better throughput than GGUF for some models. However, MLX has confirmed bugs with Gemma 4: Markdown output corruption, token parsing errors, and inconsistent formatting have been reported by multiple community members. Use GGUF via Ollama or llama.cpp instead until MLX Gemma 4 support stabilizes.

EXL2 / ExLlamaV2

EXL2 quantization is optimized for NVIDIA GPUs and provides higher quality at equivalent bit depths compared to GGUF. Requires ExLlamaV2 as the backend. Best for 24 GB+ VRAM workstations targeting maximum throughput.

Gemma 4 for coding: Is it the right local model?

Gemma 4 31B is exceptional at creative writing, multilingual tasks, and general reasoning. In coding benchmarks it performs well, particularly on algorithmic problems and architecture discussions that benefit from strong natural language reasoning.

However, for strict agentic coding (autonomous multi-step tool-calling workflows like those used in Claude Code, OpenCode, and similar frameworks), community reports often favor Qwen-family models, citing more consistent tool-call formatting in sequential function-call chains.

Recommendation: Use Gemma 4 26B-A4B MoE for general developer chat, code review, and documentation. Use Qwen 3.5 14B or 32B if your primary workflow is fully autonomous coding agents with heavy tool use.

Can you run Gemma 4 locally on Mac, iPhone, or Android?

Mac (Apple Silicon)

Apple's unified memory architecture is a practical fit for the 26B-A4B MoE — on 16 GB Apple Silicon Macs, expect 17–22 tok/s with short-to-medium context (keep under 8k to avoid swap). Use GGUF format via Ollama — MLX has known bugs with Gemma 4. For the 31B Dense model, 32 GB is the minimum (64 GB+ recommended for long context). 24 GB Macs cannot run 31B — model weights (~17.5 GB) plus macOS overhead leave near-zero room for KV cache, causing severe swap thrashing.

iPhone / iOS

The E2B model runs entirely offline on recent flagship iPhones via Google AI Edge Gallery — no internet connection or API key required. Useful for private, on-device translation and basic Q&A. E4B (Q4_K_M ~9.6 GB) will crash on iPhones with less than 10 GB RAM — stick to E2B for reliable offline chat on 6–8 GB devices.

Android

Google provides native Gemma E-series support through AI Edge Gallery on Android. E2B runs comfortably on 6–8 GB Android flagships. E4B needs 10 GB+ RAM — most phones don't have that, so stick to E2B for stability.

Frequently asked questions

How does the automatic GPU detection work?
This tool uses browser-native APIs to identify your GPU — no downloads or extensions required. It first tries the WebGPU API (supported in Chrome, Edge, and recent Safari), which reports your GPU model and architecture directly. If WebGPU is unavailable, it falls back to WebGL renderer detection. The detected GPU name is then matched against a known database of VRAM capacities to auto-fill your hardware profile. Everything runs entirely in your browser — no data is sent to any server.
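
The final step — matching the reported renderer string against a VRAM table — can be sketched as a simple lookup. The entries below are illustrative, not the tool's actual database:

```shell
# Toy renderer-string -> VRAM lookup (entries illustrative)
match_vram() {
  case "$1" in
    *"RTX 4090"*)    echo "24 GB" ;;
    *"RTX 4070 Ti"*) echo "16 GB" ;;
    *"RTX 4060"*)    echo "8 GB" ;;
    *)               echo "unknown" ;;
  esac
}
match_vram "ANGLE (NVIDIA, NVIDIA GeForce RTX 4090 Direct3D11)"   # prints "24 GB"
```

Substring matching is why newer GPUs absent from the table fall through to "unknown" and trigger the manual-selection fallback described below.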
Why does the tool detect my GPU but not my RAM?
Browser security restrictions prevent websites from reading your exact system RAM. The navigator.deviceMemory API exists but is limited to Chromium browsers and caps at 8 GB — not useful for local AI workloads. GPU model names, on the other hand, are exposed through WebGPU and WebGL for rendering purposes, which is why we can detect your GPU but still need you to select your RAM manually.
The detection shows the wrong GPU or VRAM — what should I do?
Click "Switch to manual selection" below the detected hardware card to enter manual mode, where you can set your OS, RAM, and VRAM by hand. Common reasons for inaccurate detection include: laptops with dual GPUs (the browser may report the integrated GPU instead of the discrete one), privacy-focused browsers that block GPU fingerprinting, or newer GPU models not yet in our lookup database.
On Mac it says "Apple Silicon" but doesn't show my exact chip — is it still accurate?
Yes. macOS browsers report the Metal API version (e.g. "metal-3") rather than the specific chip name (M1, M2, M4, etc.), so we can confirm you're on Apple Silicon but can't pinpoint which generation. This doesn't affect the recommendation — on Macs, the model match is based on your RAM amount, not the chip model, because Apple Silicon uses unified memory shared between CPU and GPU. Just make sure you select the correct RAM in the dropdown.