Get started
Running in 30 minutes.
You have a Mac. Here's the exact sequence to run a local AI model on it: no API key, no cloud dependency, no vendor agreement.
I'll skip the theory. You can read the Playbook for that. This is the shortest path from "I have a Mac" to "I'm running a local model."
Before you start
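You need an Apple Silicon Mac, M1 or newer. Apple started selling these in late 2020, so most Macs bought since then qualify, though some Intel models stayed on sale for a while after. If you're not sure: Apple menu → About This Mac. If the chip is listed as "Apple M" anything, you're good.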
16 GB of unified memory is the minimum that's worth your time. You can run small models on less, but the results aren't useful for real work. If you have 24 GB or more, everything here works better.
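If you'd rather check from the terminal than click through About This Mac, here is a minimal sketch. The helper name bytes_to_gib is my own, not part of any tool; the two sysctl keys are standard on macOS, and the check is skipped on anything that isn't a Mac.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count to whole GB (integer division by 2^30).
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# These sysctl keys only exist on macOS, so skip the check elsewhere.
if [ "$(uname -s)" = "Darwin" ]; then
  chip=$(sysctl -n machdep.cpu.brand_string)          # e.g. "Apple M2"
  mem_gib=$(bytes_to_gib "$(sysctl -n hw.memsize)")   # hw.memsize is in bytes
  echo "Chip: $chip"
  echo "Unified memory: ${mem_gib} GB"
fi
```

If the chip line starts with "Apple M" and the memory line says 16 or more, carry on.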
Step 1, Install Ollama
Ollama is the tool that downloads and runs open-weight models locally. It takes about two minutes to install and handles everything: model storage, the inference server, the API endpoint.
Download it at ollama.com and run the installer. Or if you have Homebrew:
brew install ollama
Once installed, start the server. The desktop app runs it in the background and keeps it there across reboots if you let it. Started from a terminal, it stays in the foreground, so leave that window open:
ollama serve
Step 2, Pull a model
Pick the command that matches your RAM. These are the models I'd reach for first: strong benchmarks, practical speed, and a clean one-command pull.
16 GB, start here
ollama pull qwen3.5:9b-q4_K_M
Qwen 3.5 9B is currently the strongest model at this size tier. The download is ~5.5 GB. It'll take a few minutes on a decent connection and runs fully offline after that.
24 GB, step up
ollama pull qwen3.5:27b-q4_K_M
A real step up in quality. ~16 GB download. If you're going to use this for actual work like drafting documents, reviewing text, or structured analysis, this is the tier where it starts feeling professional.
64 GB+, serious work
ollama pull llama3.3:70b-instruct-q4_K_M
70B at Q4. ~40 GB download. This is where the output quality becomes hard to distinguish from a cloud API, on hardware you own.
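The three tiers above can be sketched as a small lookup. The function name pick_model is hypothetical, and treating everything from 24 GB up to 63 GB as the 27B tier is my assumption, since the tiers only name 16, 24, and 64+. The model tags are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: map memory in GB to the model tag from the tiers above.
# Assumption: anything between 24 and 63 GB falls into the 27B tier.
pick_model() {
  mem_gb=$1
  if [ "$mem_gb" -ge 64 ]; then
    echo "llama3.3:70b-instruct-q4_K_M"
  elif [ "$mem_gb" -ge 24 ]; then
    echo "qwen3.5:27b-q4_K_M"
  elif [ "$mem_gb" -ge 16 ]; then
    echo "qwen3.5:9b-q4_K_M"
  else
    # Below the 16 GB minimum: report on stderr and signal failure.
    echo "below the 16 GB minimum" >&2
    return 1
  fi
}

pick_model 32   # prints qwen3.5:27b-q4_K_M
```

To pull the matching model in one go, something like ollama pull "$(pick_model 24)" would work.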
Step 3, Run it
Once the model finishes downloading, one command starts an interactive session:
ollama run qwen3.5:9b-q4_K_M
Type a prompt and hit Enter. The model runs entirely on your machine. No network calls, no API keys, no tokens being consumed somewhere.
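If the run command complains that it can't connect, the server from Step 1 probably isn't running. A quick way to check is to ask the local API endpoint for its version. This is a sketch, assuming the default address http://localhost:11434; the function name ollama_is_up is my own.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if an Ollama server answers at the base URL.
# Assumes the default address unless another one is passed as $1.
ollama_is_up() {
  base_url=${1:-http://localhost:11434}
  # -f: fail on HTTP errors, -sS: quiet, --max-time: give up after 2 seconds
  curl -fsS --max-time 2 "$base_url/api/version" >/dev/null 2>&1
}

if ollama_is_up; then
  echo "Ollama is answering on localhost:11434"
else
  echo "No Ollama server found; start one with: ollama serve"
fi
```

The request goes to localhost only, so it works with the network unplugged.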
The first response will feel slow because the model is loading into memory. After that, it stays loaded while you keep using it and responds faster. On a 16 GB Mac, expect 20–30 tokens per second on a 9B model. Step up to a 64 GB machine and 70B runs at 12–16 tokens per second. Fast enough for real use.
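To check those numbers on your own machine, the arithmetic is just tokens generated divided by generation time. Ollama's /api/generate response reports both as eval_count and eval_duration (in nanoseconds). A minimal sketch, with a helper name, tokens_per_sec, that is mine:

```shell
#!/bin/sh
# Hypothetical helper: tokens per second from a token count and a duration in ns.
# $1 = eval_count (tokens generated), $2 = eval_duration (nanoseconds)
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

tokens_per_sec 250 10000000000   # 250 tokens in 10 s, prints 25.0
```

If you don't want to touch the API, running the model with ollama run and the --verbose flag should print similar timing stats after each response.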
Step 4, A better interface (optional)
The terminal works. But if you want a chat interface that looks like what you're used to, Open WebUI runs locally on top of Ollama and gives you a browser-based UI:
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
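Then open http://localhost:3000 in your browser. You'll see a chat interface with all your Ollama models available. Once the image is downloaded, it runs fully offline. No cloud account required: the login it asks you to create on first launch is stored on your own machine.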
You can also skip Docker entirely and just use the terminal. Many practitioners do. The interface doesn't change what the model produces.
What to do next
You're running a local model. The sequence that's worked for every practitioner I've watched go through this:
- Try your actual work. Give it a real document from your practice, a meeting transcript, a client email, a report you'd normally draft. See what it produces without any fine-tuning. The output will surprise you in both directions.
- Read the experiments relevant to your industry. The experiments document exactly what works, at what hardware tier, with what prompting. Don't start from scratch: start from a result.
- Understand the failure modes before you build any workflow around local AI. The Playbook covers this. Here's the short version: these models are confident even when wrong. Human review is not optional.
Honest caveats
Running the model is the easy part. Getting here took me about 20 minutes the first time. The harder part is building a workflow where you're actually saving time, not just producing text that needs heavy editing. That takes iteration. The experiments document what that iteration looked like for specific tasks.
Also: these models hallucinate. They state false things confidently, especially when asked about specific facts, citations, or calculations. For any professional work product, every output is a first draft that requires review by someone who knows the subject.
That's not a reason to avoid local AI. It's the reason to use it correctly, as a drafting tool, not a decision-making tool.