Hardware Matcher

Run Gemma 4 Locally

Q: How does the automatic GPU detection work?

This tool uses browser-native WebGPU and WebGL APIs to identify your GPU — no downloads required. The detected GPU name is matched against a known VRAM database to auto-fill your hardware profile. Everything runs entirely in your browser.

Q: Why does the tool detect my GPU but not my RAM?

Browser security restrictions prevent reading exact system RAM. The navigator.deviceMemory API caps at 8 GB and only works in Chromium. GPU names are exposed through WebGPU/WebGL for rendering, so we can detect GPU but need manual RAM selection.

Q: The detection shows the wrong GPU or VRAM — what should I do?

Click 'Switch to manual selection' below the detected hardware card. Common reasons for inaccurate detection include laptops with dual GPUs, privacy browsers blocking GPU fingerprinting, or newer GPUs not yet in the lookup database.

Auto-detects your GPU — get the right model and command for your hardware

Gemma 4 Minimum Hardware Requirements

Quick reference for every Gemma 4 model tier and popular hardware configuration.

By Model Tier

Model	Params	Min RAM	Min VRAM	Recommended GPU	Speed
E2B	5B Dense	4 GB	4 GB (Q4)	Any modern phone or PC	15–30 t/s (phone), 40+ t/s (desktop)
E4B	9B Dense	8 GB	6 GB (Q4)	RTX 3060+, M1+	20–35 t/s
26B-A4B MoE	27B / 4B active	16 GB	12 GB (Q2), 16 GB (Q4)	RTX 4070 Ti, M-series 16 GB+	10–25 t/s (MoE advantage)
31B Dense	31B	32 GB	20 GB (Q4); 24 GB recommended	RTX 4090, M-series 32 GB+	8–20 t/s
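
The VRAM figures above track the size of the quantized weights fairly closely. As a rough sketch of that arithmetic, the snippet below multiplies parameter count by bits per weight. The function name and the bits-per-weight values are assumptions chosen for illustration, not measured figures for any specific Gemma 4 file, and the estimate excludes the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed average bits per weight for each quantization family (illustrative only).
ASSUMED_BITS = {"Q2": 3.0, "Q4": 4.8, "Q8": 8.5}

# Parameter counts are taken from the "By Model Tier" table.
for name, params in [("E4B", 9), ("26B-A4B MoE", 27), ("31B Dense", 31)]:
    size = weight_memory_gb(params, ASSUMED_BITS["Q4"])
    print(f"{name}: ~{size:.1f} GB of weights at Q4")
```

Note that the MoE model still has to hold all 27B parameters in memory, even though only about 4B are active per token. The MoE advantage is speed, not footprint.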

By Popular Hardware

Hardware	Runs Best	Quant	Expected Speed
RTX 4060 8 GB	E4B	Q4_K_M	20–35 t/s
RTX 4070 Ti SUPER 16 GB	26B MoE	Q4_K_M	25–40 t/s
RTX 4090 24 GB	31B Dense	Q4_K_M	15–25 t/s
MacBook M1/M2/M3/M4 8 GB	E4B	Q4_K_M	10–20 t/s
MacBook Pro M-series 16 GB	26B MoE	Q3/Q4	10–22 t/s
MacBook Pro M-series 32 GB+	31B Dense	Q4_K_M	10–18 t/s
iPhone 15 Pro	E2B	Built-in	10–20 t/s
Android flagship 8 GB	E2B	Built-in	8–15 t/s

Check Your Hardware Automatically

No installs, no sign-ups. Open the page and get a personalized setup in seconds.

How Auto-Detection Works

Auto-detect your GPU

The moment you load this page, the tool reads your GPU via browser-native WebGPU and WebGL APIs. On Mac, it identifies your exact Apple Silicon chip and unified memory size. On PC, it reads your NVIDIA / AMD model name and looks up its VRAM in a known database. No data leaves your browser.
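
The name-to-VRAM lookup can be sketched in a few lines. This is a minimal illustration written in Python for brevity (the tool itself runs in the browser). The table contents, the function names, and the sample renderer string are all assumptions, and the real database is presumably much larger.

```python
import re
from typing import Optional

# Hypothetical lookup table: normalized GPU model name -> VRAM in GB.
VRAM_GB = {
    "rtx 4060": 8,
    "rtx 4070": 12,
    "rtx 4070 ti": 12,
    "rtx 4070 ti super": 16,
    "rtx 4090": 24,
}


def normalize(renderer: str) -> str:
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return re.sub(r"\s+", " ", renderer.lower()).strip()


def lookup_vram_gb(renderer: str) -> Optional[int]:
    """Return VRAM for the longest known model name found in the renderer string."""
    text = normalize(renderer)
    matches = [name for name in VRAM_GB if name in text]
    if not matches:
        return None  # unknown GPU: fall back to manual selection
    # Prefer the longest match so "rtx 4070 ti super" wins over "rtx 4070".
    return VRAM_GB[max(matches, key=len)]


# Illustrative renderer string in the style browsers report through WebGL.
sample = "ANGLE (NVIDIA, NVIDIA GeForce RTX 4070 Ti SUPER Direct3D11 vs_5_0 ps_5_0, D3D11)"
print(lookup_vram_gb(sample))
```

Picking the longest match matters because shorter model names are often substrings of longer ones. An unmatched name returns None, which corresponds to the "Unknown GPU" case that manual mode handles.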

Match to the best model

Your detected GPU, VRAM (or unified memory), and OS are cross-referenced against Gemma 4's model tiers — Edge for phones, 26B MoE for consumer hardware, 31B Dense for workstations. The matcher picks the largest model your hardware can run comfortably and flags any known caveats.
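
One way to picture the "largest model that fits" rule is shown below. The thresholds are copied from the Q4 column of the By Model Tier table. The names Tier and pick_tier are hypothetical, and the sketch ignores OS, unified memory, and the caveat flags that the actual matcher may take into account.

```python
from typing import List, NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    min_ram_gb: int
    min_vram_gb: int  # Q4 figure from the "By Model Tier" table


# Ordered from smallest to largest model.
TIERS: List[Tier] = [
    Tier("E2B", 4, 4),
    Tier("E4B", 8, 6),
    Tier("26B-A4B MoE", 16, 16),
    Tier("31B Dense", 32, 20),
]


def pick_tier(ram_gb: int, vram_gb: int) -> Optional[Tier]:
    """Return the largest tier whose RAM and VRAM minimums are both met."""
    fitting = [t for t in TIERS if ram_gb >= t.min_ram_gb and vram_gb >= t.min_vram_gb]
    return fitting[-1] if fitting else None


choice = pick_tier(ram_gb=16, vram_gb=8)
print(choice.name if choice else "No tier fits; try a smaller quantization")
```

Because both minimums must be met, a machine with plenty of VRAM but little system RAM still lands on a smaller tier.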

Copy your run command

You get a ready-to-paste terminal command for Ollama, llama.cpp, or Transformers — with the right model tag, context flags, and performance tweaks pre-filled. On mobile, the tool links directly to Google AI Edge Gallery for one-tap install.

💡

Everything runs in your browser. The GPU detection uses the same APIs that games and 3D apps use to render graphics. No data is sent to any server, and there are no cookies, analytics, or user tracking. You can even use the tool offline after the page loads.

Manual Mode

Click "Wrong? Edit manually" in the tool above to set OS, RAM, and VRAM by hand. Useful when auto-detection doesn't match your actual setup.

→ Dual-GPU laptop — the browser often reports the weaker integrated GPU instead of your discrete NVIDIA/AMD card.
→ New GPU not in our database — very recent models may show as "Unknown GPU". Manual mode lets you enter specs directly.
→ Planning a purchase — try different RAM/VRAM combos to see which upgrade unlocks a bigger model tier.

Common Gemma 4 Local Setups

Hardware	Recommended Setup	Best For	Notes
MacBook Pro 16 GB	26B-A4B MoE via Ollama (GGUF)	General chat	Keep context under 8k; avoid MLX (known bugs)
RTX 4060 8 GB	E4B via Ollama / LM Studio	Chat, lightweight local use	26B MoE needs 12 GB+ VRAM
RTX 4070 Ti SUPER 16 GB	26B-A4B MoE via Ollama	Chat, multimodal, coding	Sweet spot: comfortable headroom for MoE
RTX 4090 24 GB	31B Dense via Ollama	Coding, deep reasoning	Set OLLAMA_NUM_PARALLEL=1; long context (10k+) is tight even on 24 GB
iPhone 15 Pro	E2B via AI Edge Gallery	Offline chat, translation	E4B crashes on <10 GB RAM; stick to E2B
Android flagship 8 GB	E2B via AI Edge Gallery	Offline assistant	E4B needs 10 GB+ RAM; E2B is the safe choice

Supported Runtimes & Formats

Gemma 4 works with all major local inference engines. Choose based on your OS and preference.

🦙

Ollama (GGUF)

CLI-first. One command to download and run any Gemma 4 tier. Best for Mac and Linux users comfortable with the terminal.

ollama run gemma4:26b

🖥

LM Studio (GGUF, MLX)

GUI app for Mac and Windows. Visual model browser with real-time VRAM monitoring. Search "gemma 4" inside the app to download.

• Supports GGUF, plus MLX on Apple Silicon
• Built-in chat interface — no terminal needed
• Automatic quantization selection based on your VRAM

lmstudio.ai

llama.cpp & MLX Support

llama.cpp is a low-level C++ inference engine. Apple MLX provides native Metal acceleration on Apple Silicon. Note: MLX has known bugs with Gemma 4 MoE, so prefer Ollama on Mac.

• llama.cpp: GGUF format, CUDA / Metal / Vulkan
• MLX: Apple Silicon native, GGUF / safetensors
• Most flexible — build from source for custom configs

⚡

ExLlamaV2 (EXL2)

Fastest inference for NVIDIA GPUs. EXL2 format offers flexible bit-rate quantization. Best for RTX 4070 Ti and above running 26B MoE or 31B Dense.

• NVIDIA-only — requires CUDA
• Gemma 4 31B EXL2 available on HuggingFace
• Highest tok/s for a given quality level

How to Run Gemma 4 Locally

Three steps from zero to a working local setup.

Pick a Runtime

Ollama

CLI-first. One command to download and run. Best for developers comfortable with the terminal.

LM Studio

GUI app for Mac and Windows. Visual model browser, real-time VRAM monitoring, no terminal needed.

lmstudio.ai

Mobile (AI Edge Gallery)

Run E-series models offline on iOS and Android. No cloud, no API key.

Google Play App Store

Download the Right Model

Use the matcher above to find your recommended tier, then pull the model. For Ollama:

ollama run gemma4:26b

Run and Verify

Start a conversation and watch memory usage. If you hit OOM errors or slowdowns, reduce the context length (for example, --ctx-size 4096 in llama.cpp, or /set parameter num_ctx 4096 inside an Ollama session) or try a more aggressively quantized variant.

See it in action

Hardware matcher auto-detection showing Apple M1 Pro with 16 GB unified memory and ~100 GB/s bandwidth

Instant GPU detection

Open the page and your GPU and memory are identified automatically, with no input needed.

Matched result showing 26B MoE recommendation with Ollama run command and 17-22 tok/s speed estimate

Personalized recommendation

Get the right model tier, expected speed, and a copy-paste terminal command.

Animated demo of switching from auto-detection to manual mode, selecting OS, RAM, and VRAM

Full manual control

Switch to manual mode to set OS, RAM, VRAM by hand — or compare upgrade scenarios.

Setup Guides by Platform

Detailed guides with device-specific tips, model recommendations, and step-by-step instructions.

🍎

Mac (Apple Silicon)

Apple Silicon unified memory makes Macs ideal for local LLMs. 16 GB runs 26B MoE at 17–22 tok/s.

Read the Mac guide →

📱

iOS & Android

Run Gemma 4 Edge offline on iPhone & Android — no API key, no internet. Private AI in your pocket.

Read the mobile guide →

🖥

Windows (NVIDIA / AMD)

Use Ollama or LM Studio on Windows. RTX 4070 Ti+ for 26B MoE, RTX 4090 for 31B Dense. AMD supported via Vulkan.

Use the matcher above ↑

Frequently Asked Questions

What are the minimum PC requirements for Gemma 4?

For the smallest model (E2B), you need a PC with at least 4 GB of free RAM and any modern GPU or CPU. For the popular 26B MoE model, aim for 16 GB of RAM and 12–16 GB of VRAM (RTX 4070 Ti or equivalent). For the full 31B Dense model, you'll want 24 GB+ VRAM (RTX 4090) or 32 GB+ unified memory (Apple Silicon Max chips). All models can run on CPU alone at reduced speed (3–8 t/s).

Can an RTX 4060 run Gemma 4?

Yes, an RTX 4060 (8 GB VRAM) can run the E4B model comfortably at Q4_K_M quantization with 20–35 t/s. The 26B MoE model requires 12 GB+ VRAM at minimum, so it won't fit on an RTX 4060. Use Ollama or LM Studio and search for gemma4:e4b to get started.

Can an RTX 4090 run Gemma 4 31B?

Yes. The RTX 4090 (24 GB VRAM) can run Gemma 4 31B Dense at Q4_K_M quantization (~20 GB VRAM). You'll get approximately 15–25 t/s depending on context length. For best results, set OLLAMA_NUM_PARALLEL=1 with Ollama (or pass -np 1 to llama.cpp's server) to limit parallelism. Long contexts (10k+ tokens) may be tight; reduce the context length if needed (--ctx-size in llama.cpp, num_ctx in Ollama).

Does Gemma 4 support EXL2 format?

Yes. Gemma 4 models (including the 31B Dense) have been converted to EXL2 format by the community. EXL2 files are available on HuggingFace and can be used with ExLlamaV2 for the fastest NVIDIA inference. EXL2 supports flexible bit-rate quantization, letting you fine-tune the VRAM/quality tradeoff. Note: EXL2 requires an NVIDIA GPU with CUDA — it does not support AMD or Apple Silicon.

How do I run Gemma 4 in LM Studio?

Open LM Studio, go to the search tab, and type "gemma 4". You'll find GGUF files for all model tiers (E2B, E4B, 26B MoE, 31B Dense) in various quantization levels. LM Studio automatically highlights which versions fit your available VRAM. Click download, then switch to the chat tab to start talking.

What's the difference between Gemma 4 26B MoE and 31B Dense?

The 26B-A4B MoE (Mixture of Experts) has 27B total parameters but only activates ~4B per token. This means it's much faster than its size suggests — comparable speed to a 4B model — while retaining higher quality. It fits in 12–16 GB VRAM and is the best choice for consumer hardware (RTX 4070 Ti, MacBook Pro 16 GB). The 31B Dense activates all 31B parameters on every token, giving it the highest quality output but requiring 20+ GB VRAM. It's best for workstations (RTX 4090, M-series 32 GB+).

Can I run Gemma 4 on my phone?

Yes. Install Google AI Edge Gallery on Android or iOS, then download the Gemma 4 E2B model (5B parameters, ~3 GB). It runs entirely offline with no API key. The E4B model (9B parameters) may work on phones with 10+ GB RAM but can crash on devices with less. For best results on mobile, stick to E2B.

How do I fix OOM errors when running Gemma 4 locally?

Out-of-memory errors mean the model plus its context window exceeds your available VRAM/RAM. Try these steps in order: 1) Reduce context length: use --ctx-size 4096 or lower in llama.cpp, or set num_ctx to 4096 in Ollama. 2) Use a more aggressive quantization: switch from Q4_K_M to Q3 or Q2. 3) Drop to a smaller model: use 26B MoE instead of 31B, or E4B instead of 26B. 4) Limit parallelism: use OLLAMA_NUM_PARALLEL=1 with Ollama. The KV cache grows with context length, and Gemma 4 supports a context window of up to 256K, so context length is the single biggest VRAM factor.
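
To see why context length dominates, it helps to work through the standard KV cache estimate for a transformer with full attention: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and context length. The architecture numbers below are placeholders chosen for illustration, not Gemma 4's actual configuration, and models that use sliding-window attention on some layers will need less than this.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # 2 bytes per value assumes an fp16 cache
) -> int:
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (4096, 32768, 262144):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"context {ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

The estimate is linear in context length, so halving the context halves the cache. That is why reducing context is the first step to try before changing the quantization or the model.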

How does the automatic GPU detection work?

This tool uses browser-native APIs to identify your GPU — no downloads or extensions required. It first tries the WebGPU API (supported in Chrome, Edge, and recent Safari), which reports your GPU model and architecture directly. If WebGPU is unavailable, it falls back to WebGL renderer detection. For Apple Silicon, it identifies your exact chip (M1, M2 Pro, M4 Max, etc.) and unified memory size. Everything runs entirely in your browser — no data is sent to any server.

Why does the tool detect my GPU but not my RAM?

Browser security restrictions prevent websites from reading your exact system RAM. The navigator.deviceMemory API exists but is limited to Chromium browsers and caps at 8 GB — not useful for local AI workloads. GPU model names, on the other hand, are exposed through WebGPU and WebGL for rendering purposes, which is why we can detect your GPU but still need you to select your RAM manually.

The detection shows the wrong GPU or VRAM. What should I do?

Click "Switch to manual selection" below the detected hardware card to enter manual mode, where you can set your OS, RAM, and VRAM by hand. Common reasons for inaccurate detection include: laptops with dual GPUs (the browser may report the integrated GPU instead of the discrete one), privacy-focused browsers that block GPU fingerprinting, or newer GPU models not yet in our lookup database.