Gemma 4 Local

Hardware Matcher

Run Gemma 4 Locally

Auto-detects your GPU — get the right model and command for your hardware

Gemma 4 Minimum Hardware Requirements

Quick reference for every Gemma 4 model tier and popular hardware configuration.

By Model Tier

Model | Params | Min RAM | Min VRAM | Recommended GPU | Speed
E2B | 5B Dense | 4 GB | 4 GB (Q4) | Any modern phone or PC | 15–30 t/s (phone), 40+ t/s (desktop)
E4B | 9B Dense | 8 GB | 6 GB (Q4) | RTX 3060+, M1+ | 20–35 t/s
26B-A4B MoE | 27B / 4B active | 16 GB | 12 GB (Q2), 16 GB (Q4) | RTX 4070 Ti, M-series 16 GB+ | 10–25 t/s (MoE advantage)
31B Dense | 31B | 32 GB | 20 GB (Q4); 24 GB recommended for headroom | RTX 4090, M-series 32 GB+ | 8–20 t/s
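The VRAM figures in the tier table can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is only an approximation. The bits-per-weight values are assumptions (real GGUF quantizations vary by recipe), and it ignores the KV cache and runtime overhead, which is why practical minimums sit above the raw weight size.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed average bits per weight; actual quantized files differ somewhat.
ASSUMED_BPW = {"Q2": 3.0, "Q4": 4.8, "Q8": 8.5}

# Parameter counts are taken from the tier table above.
for name, params in [("E4B", 9), ("26B-A4B MoE", 27), ("31B Dense", 31)]:
    sizes = ", ".join(
        f"{quant}: {weight_gb(params, bpw):.1f} GB" for quant, bpw in ASSUMED_BPW.items()
    )
    print(f"{name:12s} {sizes}")
```

Note that a MoE model still has to hold all of its parameters in memory, so its footprint follows the total count (27B), not the active count (4B).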

By Popular Hardware

Hardware | Runs Best | Quant | Expected Speed
RTX 4060 8 GB | E4B | Q4_K_M | 20–35 t/s
RTX 4070 Ti SUPER 16 GB | 26B MoE | Q4_K_M | 25–40 t/s
RTX 4090 24 GB | 31B Dense | Q4_K_M | 15–25 t/s
MacBook M1/M2/M3/M4 8 GB | E4B | Q4_K_M | 10–20 t/s
MacBook Pro M-series 16 GB | 26B MoE | Q3/Q4 | 10–22 t/s
MacBook Pro M-series 32 GB+ | 31B Dense | Q4_K_M | 10–18 t/s
iPhone 15 Pro | E2B | Built-in | 10–20 t/s
Android flagship 8 GB | E2B | Built-in | 8–15 t/s

Check Your Hardware Automatically

No installs, no sign-ups. Open the page and get a personalized setup in seconds.

How Auto-Detection Works

1. Auto-detect your GPU

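When you load this page, the tool reads your GPU through the browser-native WebGPU and WebGL APIs. On Mac, it identifies your Apple Silicon chip. On PC, it reads your NVIDIA / AMD model and matches it against a lookup database to find the VRAM, since browsers expose the GPU name but not its memory size. System RAM and unified memory cannot be read reliably from the browser, so you confirm those yourself. No data leaves your browser.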
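The tool itself runs in browser JavaScript, but the name-matching step can be illustrated in any language. The Python sketch below shows one plausible way to pull a GPU name out of a WebGL renderer string and look up its VRAM. The regular expression, the sample renderer strings, and the `VRAM_GB` table are all assumptions for illustration and are not the tool's actual database or code.

```python
import re
from typing import Optional

# Hypothetical lookup table; the real tool's database is not published here.
VRAM_GB = {"RTX 4060": 8, "RTX 4090": 24}

# Illustrative pattern covering a few common naming schemes.
GPU_PATTERN = re.compile(
    r"(RTX \d{4}(?: Ti)?(?: SUPER)?|RX \d{4}(?: XTX| XT)?|Apple M\d(?: Pro| Max| Ultra)?)",
    re.IGNORECASE,
)


def parse_renderer(renderer: str) -> Optional[str]:
    """Return a short GPU name found in a renderer string, or None."""
    match = GPU_PATTERN.search(renderer)
    return match.group(1) if match else None


def lookup_vram(renderer: str) -> Optional[int]:
    """Return the VRAM in GB for a known GPU, or None for an unknown one."""
    name = parse_renderer(renderer)
    return VRAM_GB.get(name) if name else None


sample = "ANGLE (NVIDIA, NVIDIA GeForce RTX 4060 Direct3D11 vs_5_0 ps_5_0, D3D11)"
print(parse_renderer(sample), lookup_vram(sample))
```

A renderer string with no recognizable model (some browsers report only a generic name) yields `None`, which corresponds to the "Unknown GPU" case that manual mode handles.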

2. Match to the best model

Your detected GPU, VRAM (or unified memory), and OS are cross-referenced against Gemma 4's model tiers — Edge for phones, 26B MoE for consumer hardware, 31B Dense for workstations. The matcher picks the largest model your hardware can run comfortably and flags any known caveats.
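The matching step can be sketched as a threshold lookup. This is a minimal illustration, not the tool's real logic: it uses only the Q4 VRAM minimums from the "By Model Tier" table, and the function name `pick_tier` is hypothetical. A production matcher would presumably also reserve headroom for context and account for the OS.

```python
from typing import Optional

# (tier, minimum VRAM in GB at Q4), from the "By Model Tier" table, largest first.
TIERS_Q4 = [
    ("31B Dense", 20),
    ("26B-A4B MoE", 16),
    ("E4B", 6),
    ("E2B", 4),
]


def pick_tier(vram_gb: float) -> Optional[str]:
    """Return the largest tier whose Q4 VRAM minimum fits, or None if none fit."""
    for tier, min_vram in TIERS_Q4:
        if vram_gb >= min_vram:
            return tier
    return None


for vram in (8, 16, 24):
    print(f"{vram} GB -> {pick_tier(vram)}")
```

With these thresholds, 8 GB, 16 GB, and 24 GB cards map to E4B, 26B-A4B MoE, and 31B Dense respectively, which agrees with the "By Popular Hardware" table.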

3. Copy your run command

You get a ready-to-paste terminal command for Ollama, llama.cpp, or Transformers — with the right model tag, context flags, and performance tweaks pre-filled. On mobile, the tool links directly to Google AI Edge Gallery for one-tap install.

Everything runs in your browser. GPU detection uses the same APIs that games and 3D apps use to render graphics. No data is sent to any server, and there are no cookies, analytics, or user tracking. You can even use the tool offline after the page loads.

Manual Mode

Click "Wrong? Edit manually" in the tool above to set OS, RAM, and VRAM by hand. Useful when auto-detection doesn't match your actual setup.

  • Dual-GPU laptop — the browser often reports the weaker integrated GPU instead of your discrete NVIDIA/AMD card.
  • New GPU not in our database — very recent models may show as "Unknown GPU". Manual mode lets you enter specs directly.
  • Planning a purchase — try different RAM/VRAM combos to see which upgrade unlocks a bigger model tier.

Common Gemma 4 Local Setups

Hardware | Recommended Setup | Best For | Notes
MacBook Pro 16 GB | 26B-A4B MoE via Ollama (GGUF) | General chat | Keep context under 8k; avoid MLX (known bugs)
RTX 4060 8 GB | E4B via Ollama / LM Studio | Chat, lightweight local use | 26B MoE needs 12 GB+ VRAM
RTX 4070 Ti SUPER 16 GB | 26B-A4B MoE via Ollama | Chat, multimodal, coding | Sweet spot: comfortable headroom for MoE
RTX 4090 24 GB | 31B Dense via Ollama | Coding, deep reasoning | Set OLLAMA_NUM_PARALLEL=1; long context (10k+) is tight even on 24 GB
iPhone 15 Pro | E2B via AI Edge Gallery | Offline chat, translation | E4B crashes on <10 GB RAM; stick to E2B
Android flagship 8 GB | E2B via AI Edge Gallery | Offline assistant | E4B needs 10 GB+ RAM; E2B is the safe choice

Supported Runtimes & Formats

Gemma 4 works with all major local inference engines. Choose based on your OS and preference.

Ollama (GGUF)

CLI-first. One command to download and run any Gemma 4 tier. Best for Mac and Linux users comfortable with the terminal.

ollama run gemma4:26b

LM Studio (GGUF, MLX)

GUI app for Mac, Windows, and Linux. Visual model browser with real-time VRAM monitoring. Search "gemma 4" inside the app to download.

  • Runs GGUF models through llama.cpp, plus MLX models on Apple Silicon
  • Built-in chat interface, no terminal needed
  • Highlights which quantizations fit your available VRAM

llama.cpp & MLX Support

llama.cpp is a low-level C++ inference engine. Apple MLX provides native Metal acceleration on Apple Silicon. Note: MLX has known bugs with Gemma 4 MoE, so prefer Ollama on Mac.

  • llama.cpp: GGUF format, CUDA / Metal / Vulkan
  • MLX: Apple Silicon native, safetensors format
  • Most flexible: build from source for custom configs

ExLlamaV2 (EXL2)

Among the fastest inference options for NVIDIA GPUs. The EXL2 format offers flexible bit-rate quantization. Best for RTX 4070 Ti and above running 26B MoE or 31B Dense.

  • Targets NVIDIA GPUs with CUDA
  • Gemma 4 31B EXL2 available on Hugging Face
  • High tok/s for a given quality level

How to Run Gemma 4 Locally

Three steps from zero to a working local setup.

1. Pick a Runtime

Ollama

CLI-first. One command to download and run. Best for developers comfortable with the terminal.

LM Studio

GUI app for Mac, Windows, and Linux. Visual model browser, real-time VRAM monitoring, no terminal needed.

Mobile (AI Edge Gallery)

Run E-series models offline on iOS and Android. No cloud, no API key.

2. Download the Right Model

Use the matcher above to find your recommended tier, then pull the model. For Ollama:

ollama run gemma4:26b

3. Run and Verify

Start a conversation and watch memory usage. If you hit OOM errors or slowdowns, reduce the context length (--ctx-size 4096 in llama.cpp, or the num_ctx parameter in Ollama) or try a more aggressively quantized variant.
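With Ollama, the context length can also be set per request through the `num_ctx` option of its HTTP API. The sketch below builds such a request without sending it. It assumes a default local Ollama server at `localhost:11434` and the `gemma4:26b` tag shown above; the helper name `build_generate_request` is hypothetical.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 4096) -> urllib.request.Request:
    """Build (but do not send) a POST to a local Ollama /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("gemma4:26b", "Say hello in one sentence.")

# To send it, a running Ollama server with the model already pulled is required:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["response"])
```

Lowering `num_ctx` shrinks the KV cache the runtime allocates, which is usually the quickest way to recover from an out-of-memory error.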

See it in action

Hardware matcher auto-detection showing Apple M1 Pro with 16 GB unified memory and ~100 GB/s bandwidth

Instant GPU detection

Open the page and your GPU is identified automatically, with no input needed.

Matched result showing 26B MoE recommendation with Ollama run command and 17-22 tok/s speed estimate

Personalized recommendation

Get the right model tier, expected speed, and a copy-paste terminal command.

Animated demo of switching from auto-detection to manual mode, selecting OS, RAM, and VRAM

Full manual control

Switch to manual mode to set OS, RAM, and VRAM by hand, or to compare upgrade scenarios.

Frequently Asked Questions

What are the minimum PC requirements for Gemma 4?
For the smallest model (E2B), you need a PC with at least 4 GB of free RAM and any modern GPU or CPU. For the popular 26B MoE model, aim for 16 GB of RAM and 12–16 GB of VRAM (RTX 4070 Ti or equivalent). For the full 31B Dense model, you'll want 24 GB+ VRAM (RTX 4090) or 32 GB+ unified memory (Apple Silicon Max chips). All models can run on CPU alone at reduced speed (3–8 t/s).
Can an RTX 4060 run Gemma 4?
Yes, an RTX 4060 (8 GB VRAM) can run the E4B model comfortably at Q4_K_M quantization with 20–35 t/s. The 26B MoE model requires at least 12 GB of VRAM, so it won't fit on an RTX 4060. To get started, run gemma4:e4b in Ollama or search for "gemma 4" in LM Studio.
Can an RTX 4090 run Gemma 4 31B?
Yes. The RTX 4090 (24 GB VRAM) can run Gemma 4 31B Dense at Q4_K_M quantization (~20 GB VRAM). You'll get approximately 15–25 t/s depending on context length. For best results with Ollama, set OLLAMA_NUM_PARALLEL=1 to limit parallelism (the llama.cpp server equivalent is -np 1). Long contexts (10k+ tokens) may be tight; if needed, reduce the context length with num_ctx in Ollama or --ctx-size in llama.cpp.
Does Gemma 4 support EXL2 format?
Yes. Gemma 4 models (including the 31B Dense) have been converted to EXL2 format by the community. EXL2 files are available on Hugging Face and can be used with ExLlamaV2 for fast NVIDIA inference. EXL2 supports flexible bit-rate quantization, letting you fine-tune the VRAM/quality tradeoff. Note: ExLlamaV2 targets NVIDIA GPUs with CUDA and does not run on Apple Silicon.
How do I run Gemma 4 in LM Studio?
Open LM Studio, go to the search tab, and type "gemma 4". You'll find GGUF files for all model tiers (E2B, E4B, 26B MoE, 31B Dense) in various quantization levels. LM Studio automatically highlights which versions fit your available VRAM. Click download, then switch to the chat tab to start talking. LM Studio runs GGUF (and MLX on Apple Silicon), not EXL2; for EXL2 files, use ExLlamaV2.
What's the difference between Gemma 4 26B MoE and 31B Dense?
The 26B-A4B MoE (Mixture of Experts) has 27B total parameters but only activates ~4B per token. This means it's much faster than its size suggests — comparable speed to a 4B model — while retaining higher quality. It fits in 12–16 GB VRAM and is the best choice for consumer hardware (RTX 4070 Ti, MacBook Pro 16 GB). The 31B Dense activates all 31B parameters on every token, giving it the highest quality output but requiring 20+ GB VRAM. It's best for workstations (RTX 4090, M-series 32 GB+).
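A rough way to see why the MoE model is faster: during generation, speed is often limited by how many bytes of weights must be read from memory per token. The sketch below computes that upper bound under stated assumptions (about 4.8 bits per weight, 100 GB/s of memory bandwidth, and every active weight read once per token). Real throughput is lower, and the function name is hypothetical.

```python
def max_tokens_per_second(active_params_billion: float,
                          bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumptions: roughly Q4 weights (4.8 bits each) and 100 GB/s memory bandwidth.
moe = max_tokens_per_second(4, 4.8, 100)     # 26B-A4B MoE: ~4B active per token
dense = max_tokens_per_second(31, 4.8, 100)  # 31B Dense: all 31B active per token
print(f"MoE bound: {moe:.1f} t/s, dense bound: {dense:.1f} t/s")
```

The ratio between the two bounds is simply 31 / 4, independent of bandwidth. The MoE model still needs memory for all 27B parameters, so it saves time per token rather than VRAM.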
Can I run Gemma 4 on my phone?
Yes. Install Google AI Edge Gallery on Android or iOS, then download the Gemma 4 E2B model (5B parameters, ~3 GB). It runs entirely offline with no API key. The E4B model (9B parameters) may work on phones with 10+ GB RAM but can crash on devices with less. For best results on mobile, stick to E2B.
How do I fix OOM errors when running Gemma 4 locally?
Out-of-memory errors mean the model plus its context window exceeds your available VRAM/RAM. Try these steps in order: 1) Reduce context length: set --ctx-size 4096 or lower in llama.cpp, or num_ctx in Ollama. 2) Use a lower-bit quantization: switch from Q4_K_M to Q3 or Q2. 3) Drop to a smaller model: use 26B MoE instead of 31B, or E4B instead of 26B. 4) Limit parallelism: set OLLAMA_NUM_PARALLEL=1 with Ollama. The KV cache grows with the context length you allocate, and Gemma 4 supports a window of up to 256K tokens, so context length is the single biggest adjustable VRAM factor.
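The effect of context length on memory can be estimated with the standard KV cache formula for full attention: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, head count, and head dimension below are hypothetical placeholders, not Gemma 4's actual configuration, so only the linear scaling should be read from the output.

```python
def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for full attention (keys + values in every layer)."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


# Hypothetical architecture numbers for illustration only.
for ctx in (4096, 32768, 262144):
    size = kv_cache_gb(ctx, layers=48, kv_heads=8, head_dim=128)
    print(f"{ctx:>7} tokens: {size:.2f} GB")
```

Models that use sliding-window attention on some layers need less than this formula suggests, and some runtimes can quantize the cache, but the cost still rises with every token of context you allocate.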
How does the automatic GPU detection work?
This tool uses browser-native APIs to identify your GPU, with no downloads or extensions required. It first tries the WebGPU API (supported in Chrome, Edge, and recent Safari), which reports your GPU vendor and architecture directly. If WebGPU is unavailable, it falls back to WebGL renderer detection. For Apple Silicon, it identifies your chip (M1, M2 Pro, M4 Max, etc.). Memory size is not exposed reliably by browsers, so VRAM is matched from the tool's lookup database and RAM is selected manually. Everything runs entirely in your browser, and no data is sent to any server.
Why does the tool detect my GPU but not my RAM?
Browser security restrictions prevent websites from reading your exact system RAM. The navigator.deviceMemory API exists but is limited to Chromium browsers and caps at 8 GB — not useful for local AI workloads. GPU model names, on the other hand, are exposed through WebGPU and WebGL for rendering purposes, which is why we can detect your GPU but still need you to select your RAM manually.
The detection shows the wrong GPU or VRAM. What should I do?
Click "Switch to manual selection" below the detected hardware card to enter manual mode, where you can set your OS, RAM, and VRAM by hand. Common reasons for inaccurate detection include: laptops with dual GPUs (the browser may report the integrated GPU instead of the discrete one), privacy-focused browsers that block GPU fingerprinting, or newer GPU models not yet in our lookup database.