I have spent the last few weeks running Gemma 3 on a Mac Mini M4 with 32GB of unified memory. This is the write-up I would have wanted before I started, with the rough numbers and the honest limitations.
The short version is that the small Gemma 3 variants are perfectly usable for personal projects on this hardware, the large ones really are not, and the line between the two sits exactly where you would expect it to once you understand how unified memory interacts with the GPU.
Why bother running this locally
Three reasons made it worth the effort.
The first is privacy. A surprising amount of the work I do on the consultancy side touches client material that I would prefer not to upload to anyone, ever. Local inference keeps that data on a single machine that I physically own. There is no terms of service to read carefully, no data retention setting to misconfigure, and no quiet model retraining clause to worry about.
The second is cost shape. Hosted APIs are wonderful when you are using them for a few hundred requests a day. They become expensive when you are iterating on a long-running pipeline that calls a model thousands of times during development. The Mac Mini draws somewhere around 30 to 60 watts under load, which translates to a fixed monthly electricity cost in the low single digits of pounds. Whether that beats hosted pricing depends on the workload, and I will come back to that.
The third is the simple satisfaction of owning your stack. If a vendor changes pricing, deprecates a model, or rate limits me on a Sunday afternoon, I would rather have a local fallback that still works. That is worth a few hours of fiddling.
The hardware
The machine is a base configuration M4 Mac Mini with the memory upgraded to 32GB. List price came to a touch over a thousand pounds. I considered three configurations before settling on this one.
The 16GB version is too tight. By the time the operating system, the browser, a couple of editor windows, and the model weights are all sharing unified memory, anything beyond a 4 to 7 billion parameter model becomes painful. You spend most of your time fighting the swap.
The Mac Studio with more memory is the obvious upgrade path. It is also two to three times the cost, and most of that money is going on memory bandwidth and capacity that I do not yet need. If my local inference workload grew enough to justify the larger box, the move would be straightforward.
The 32GB Mini sits in the middle. It comfortably hosts an 8 to 12 billion parameter model with a usable context window, leaves enough headroom for a chat interface, a vector database, and a development server, and runs near silent under sustained load. It is the smallest configuration I would recommend for genuine local inference work.
The stack
The runtime is Ollama. I tried llama.cpp directly for a week, decided that the convenience of ollama pull and ollama run was worth the small abstraction tax, and have not looked back. Ollama uses llama.cpp under the bonnet on Apple Silicon, so the underlying performance characteristics are similar.
brew install ollama
brew services start ollama
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
The chat interface is Open WebUI, running in Docker, exposed on the local network so I can use it from the laptop or the phone in another room.
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - WEBUI_AUTH=true
    volumes:
      - open-webui-data:/app/backend/data

volumes:
  open-webui-data:
The host.docker.internal line matters on macOS. Without it the container cannot find Ollama on the host. With it, you have a clean separation between the runtime and the interface, and Open WebUI manages chat history, RAG documents, and user accounts on its own SQLite store.
A quick sanity check against the API once everything is up:
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Summarise the difference between a hash table and a hash set.",
"stream": false
}'
If that returns a coherent response in a few seconds, you are done with setup.
What runs well
The 4 billion parameter Gemma 3 variant is the workhorse on this hardware. Cold start is roughly two seconds. Streaming throughput on short prompts feels close to a hosted model and sits in the rough range of 40 to 60 tokens per second on this box, depending on context size. For drafting, classification, summarisation of single documents, and conversational use, that is plenty.
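If you want to sanity check those numbers on your own machine rather than trust mine, the sketch below reads the timing fields Ollama returns from a non-streaming generate call. It assumes Python with the requests package installed, Ollama on its default port, and a placeholder prompt; the figures you get will vary with context size and whatever else the machine is doing.

import requests

# Rough tokens-per-second check against the local Ollama instance.
# Assumes gemma3:4b has already been pulled and Ollama is on its default port.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "Explain the difference between a mutex and a semaphore.",
        "stream": False,
    },
    timeout=300,
)
stats = resp.json()

# Ollama reports eval_count (tokens generated) and eval_duration (nanoseconds).
tokens = stats["eval_count"]
seconds = stats["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s, about {tokens / seconds:.0f} tok/s")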
The 12 billion parameter variant is where it gets interesting. Throughput drops into the 18 to 28 tokens per second range, depending on how long the context has grown. For a chat session it is still entirely usable. For a pipeline that needs to chew through a few hundred documents in an hour, it is borderline.
I have a small RAG setup over a corpus of around 50 internal notes and reference documents. Embedding is done with nomic-embed-text running on the same host. With Gemma 3 12B as the answering model, the end to end latency from question to first token sits in the two to four second range, with the answer streaming out over the next ten or twenty seconds. That is well below the threshold where you start to feel impatient, and the answers are good enough that I have largely stopped reaching for the hosted equivalents on personal queries.
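For anyone curious what the moving parts look like, here is a stripped down sketch of that retrieval loop. It is the shape of the setup rather than the setup itself: the note corpus, titles, and prompt wording are placeholders, embeddings come from nomic-embed-text through Ollama's embeddings endpoint, and gemma3:12b answers over the top matches.

import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # nomic-embed-text served by the same Ollama instance as the chat model.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Placeholder corpus; in practice this is the ~50 notes, embedded once up front.
notes = {
    "backups": "Client backups run nightly and are encrypted before leaving the NAS.",
    "vpn": "The site-to-site VPN renews its certificates every ninety days.",
}
index = {title: embed(text) for title, text in notes.items()}

def answer(question: str, k: int = 3) -> str:
    q = embed(question)
    top = sorted(notes, key=lambda t: cosine(q, index[t]), reverse=True)[:k]
    context = "\n\n".join(notes[t] for t in top)
    r = requests.post(f"{OLLAMA}/api/generate", json={
        "model": "gemma3:12b",
        "prompt": f"Using only the context below, answer the question.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
    })
    return r.json()["response"]

print(answer("How often do the VPN certificates renew?"))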
Where it falls over
The 27 billion parameter Gemma 3 variant is the line. On paper it should fit. In practice, once context grows past a few thousand tokens, the workload starts spilling between the GPU and the CPU, and throughput collapses from a respectable 8 to 12 tokens per second down to something closer to 1 or 2.
The reason is the unified memory ceiling. The model weights alone for a 27B at sensible quantisation eat north of 16GB. By the time you add the KV cache for an extended context window, you are right up against what macOS is willing to allocate to the GPU process. When that ceiling is hit, the runtime starts paging tensors back and forth, and the latency profile changes from pleasant to painful within a couple of turns.
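A back of the envelope version of that arithmetic is below. The quantisation width and the attention dimensions are assumptions picked for scale, not Gemma 3 27B's published architecture, and the KV figure ignores the attention tricks that shrink it in practice, so read it as an upper bound sketch rather than a measurement.

# Back-of-the-envelope memory estimate for a 27B model on unified memory.
# Every figure below is an assumption chosen for scale, not a published spec.

params = 27e9
bits_per_weight = 5                    # roughly what a 4-bit K-quant costs with overheads
weights_gb = params * bits_per_weight / 8 / 1e9        # ~17 GB

# KV cache per token: 2 (keys and values) * layers * kv_heads * head_dim * 2 bytes (fp16)
layers, kv_heads, head_dim = 60, 16, 128               # illustrative only
kv_per_token = 2 * layers * kv_heads * head_dim * 2    # ~0.5 MB per token
context_tokens = 8192
kv_gb = kv_per_token * context_tokens / 1e9            # ~4 GB

print(f"weights ~{weights_gb:.0f} GB, KV cache at {context_tokens} tokens ~{kv_gb:.1f} GB")
# macOS caps GPU-visible memory well below the full 32 GB, so weights plus cache
# plus the OS and everything else runs out of headroom quickly.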
You can mitigate it. Trimming the context window aggressively helps. So does keeping other GPU heavy applications closed, which on macOS includes some browser tabs in ways that are not always obvious. None of that fully fixes it. The honest answer is that 27B class models want a Mac Studio or a discrete GPU with dedicated VRAM, not a Mini.
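If you do want to push the 27B on the Mini anyway, the context trim is a per-request option in Ollama rather than anything you have to rebuild. A minimal sketch, with the 4096 figure chosen arbitrarily and the prompt a placeholder:

import requests

# Cap the context window so the KV cache stays small enough to keep the
# whole workload resident on the GPU side of unified memory.
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:27b",
    "prompt": "Summarise these meeting notes in five bullet points: ...",
    "options": {"num_ctx": 4096},   # smaller context window, smaller KV cache
    "stream": False,
})
print(resp.json()["response"])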
A comparison with a Windows box and a discrete GPU
I keep a spare Windows desktop with a 16GB GPU in the corner of the office. I ran the same workload on it for comparison.
For the 27B model, the Windows box is materially better. Throughput on a long context sits comfortably around 25 to 35 tokens per second, and the experience is consistent rather than collapsing past a certain context length. If your workload genuinely needs the bigger model, this is the right shape of machine.
For everything else, the Mini wins. The Windows box sounds like a small jet engine when the GPU spins up. Idle power is significantly higher. The thermals warm the room in a way that is genuinely uncomfortable in summer. And the operating system gets in the way constantly, in ways that macOS simply does not.
If I were designing a home AI workstation from scratch today, I would probably build a small Linux machine with a single capable GPU and put up with the noise. As a thing that sits on a desk and is also a perfectly good general purpose computer, the Mini is hard to beat.
Cost per token, roughly
This is the calculation that everyone wants and that nobody quite trusts. Take it as a sketch, not a number to put in a budget.
Assume the Mini draws 50 watts under sustained inference, sits idle the rest of the day, and the relevant electricity rate is around 27 pence per kilowatt hour. Eight hours of inference a day works out to a bit under 11 pence per day, or roughly £40 a year of marginal energy. Amortise the hardware over a sensible useful life and the all-in cost, electricity plus depreciation, comes to something like £400 a year if you push the machine hard.
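Spelled out so you can substitute your own tariff, duty cycle, and amortisation period, the sketch looks like this; the purchase price and the three year life are both assumptions, not quotes.

# Running-cost sketch; every input here is an assumption to swap for your own.
watts = 50                   # assumed sustained draw under inference
hours_per_day = 8
pence_per_kwh = 27

kwh_per_day = watts / 1000 * hours_per_day            # 0.4 kWh
pence_per_day = kwh_per_day * pence_per_kwh           # ~10.8p
energy_per_year = pence_per_day * 365 / 100           # ~£39

hardware_price = 1050        # rough purchase price in pounds
useful_life_years = 3        # assumed amortisation period
all_in_per_year = energy_per_year + hardware_price / useful_life_years   # ~£390

print(f"energy ~£{energy_per_year:.0f}/yr, all-in ~£{all_in_per_year:.0f}/yr")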
If that £400 a year produced a steady ten thousand requests a day, the per request cost is well below what you would pay for the same volume against a hosted Gemma class model. If it produces ten requests a day, the hosted option is far cheaper per request, and the Mini is being paid for by the privacy and convenience benefits rather than by the maths.
The honest version is that local inference economics only beat hosted economics at meaningful sustained volume, or when you genuinely need the privacy story. For most personal use, the Mini pays for itself in subjective ways, not in pence per token.
Honest closing
The Mac Mini M4 with 32GB is a very good local inference machine for personal projects, small RAG experiments, and the kind of side work where you are happy with an 8 to 12 billion parameter model. It is silent, it sips power, and it integrates well with the rest of a working desk.
It is also not yet good enough to replace a hosted frontier model for the work I would put in front of a client. The capability gap on multi step reasoning, code generation across more than a single file, and long context comprehension is still meaningful. I expect that gap to keep narrowing. For now, I run Gemma locally for everything personal, keep a hosted account for everything that genuinely needs the bigger brain, and find that the split sits in roughly the right place.