How LLMs actually work
You don't need to train a model to deploy one. But understanding four mechanics made every hardware and model decision on this site legible to me: parameters, tokenizers, context windows, and Apple Silicon's architecture. I keep coming back to these when a result surprises me.
Parameters are weights, weights are math, math costs RAM
A language model is a very large collection of numbers, its parameters. During training those numbers are nudged billions of times until the model gets good at predicting the next token in a sequence. After training, the numbers are frozen. Inference is just loading those numbers into memory and doing matrix multiplication fast enough to generate text in real time.
This is why the parameter count maps almost directly to the RAM requirement: the memory for the weights is roughly the parameter count times the bits per weight, divided by eight to get bytes. A 7B model at 4-bit quantization needs roughly 4–5 GB. Scale up to 70B at the same quantization and you need 35–40 GB. There's no trick: you need enough RAM to hold the weights, and the weights are the model.
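The arithmetic behind those numbers fits in a few lines. This is a rough sketch, not how any particular runtime measures memory: the function name is made up for illustration, and the 4.5 bits-per-weight figure is my assumption for "4-bit" formats that also store per-block scaling data.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes).

    Ignores context (KV cache), activations, and runtime overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for label, params in [("7B", 7), ("70B", 70)]:
        # 4.0 is the idealized case; 4.5 is an assumed effective rate
        # once quantization metadata is included.
        low = weight_memory_gb(params, 4.0)
        high = weight_memory_gb(params, 4.5)
        print(f"{label}: about {low:.1f} to {high:.1f} GB for the weights alone")
```

Whatever the runtime adds on top (context, buffers, the OS itself) comes out of the same pool, which is why the weight figure is a floor rather than a budget.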
The Min RAM column in the models table reflects this directly: it's the floor below which the model won't load, before you've handled a single token of context.
Tokenizers determine what the model can see
Models don't read text. They read integers. A tokenizer converts your prompt into a sequence of token IDs, each of which the model looks up in an embedding table to get a vector it can process. On the way out, the model produces a probability distribution over the same token vocabulary and samples the next token from it.
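To make that pipeline concrete, here is a toy version of the round trip: text to token IDs, IDs to embedding vectors, then a probability distribution over the vocabulary and one sampled token. Everything in it is a stand-in I made up for illustration: the five-word vocabulary, the whitespace "tokenizer", the random embedding table, and the averaging step that plays the role of the model.

```python
import math
import random

vocab = ["<unk>", "the", "cat", "sat", "mat"]  # toy vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Map whitespace-separated words to token IDs (unknown words -> 0)."""
    return [token_to_id.get(word, 0) for word in text.split()]


rng = random.Random(0)
dim = 4
# One vector per token ID: the embedding table.
embedding = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(ids):
    """Stand-in for the model: average the input embeddings, then score
    that vector against every token's embedding to get one logit per token."""
    vectors = [embedding[i] for i in ids]
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
    return softmax(logits)


ids = encode("the cat sat")
probs = next_token_distribution(ids)
next_id = rng.choices(range(len(vocab)), weights=probs, k=1)[0]
print("input ids:", ids)
print("sampled next token:", vocab[next_id])
```

A real model replaces the averaging step with the full transformer stack, but the shape of the interface is the same: integers in, a distribution over the same vocabulary out.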
Most production models use byte-pair encoding (BPE): the tokenizer is trained on a corpus much like the model's own training data, learning which character sequences appear together often enough to become a single token. "Transcription" might be one token. "ekpnguyen" is probably five. A rare programming language's keywords might not be in the vocabulary as whole tokens at all, so they get split into many short pieces, down to single characters or bytes. That costs extra context and extra generation steps.
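The merge-learning idea is small enough to sketch. This is a toy character-level BPE trainer, not the tokenizer any production model ships with (those typically work on bytes and add pre-tokenization rules); the function names and the tiny corpus are my own for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most
    frequent adjacent symbol pair in the corpus."""
    corpus = Counter(tuple(word) for word in words)  # words as char tuples
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=10)
print(tokenize("newest", merges))  # seen often: few tokens
print(tokenize("xyzq", merges))    # never seen: one token per character
```

The second call is the practical point: a string the tokenizer never saw during training gets no merges, so it costs one token per character.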
This matters practically: a mismatch between the text the tokenizer was trained on and your input domain affects both speed and quality. A tokenizer trained on general web text will encode dense legal citations less efficiently than one trained with legal corpora.
Context window is the model's working memory
Every transformer has a maximum sequence length, the context window. It's the total number of tokens the model can hold in attention at once: your system prompt, the conversation history, the document you pasted in, and the response it's generating, all counted together.
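Increasing context length isn't free. Attention scales quadratically with sequence length in the naive implementation: doubling the context window roughly quadruples the attention computation. Modern techniques chip away at different parts of the cost: sliding window attention limits how far back each token looks, grouped-query attention shrinks the key-value cache that grows with every token of context, and RoPE scaling stretches a model to longer sequences than it was originally trained on. None of them is an architectural free lunch. A 128K context model is genuinely harder to run than a 4K context model at the same parameter count.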
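Two quick calculations show where the cost comes from. This is a back-of-envelope sketch: the function names are mine, and the model shape (32 layers, 128-dimensional heads, 16-bit cache values) is a hypothetical configuration, not any specific model.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one naive attention score matrix: every token
    is scored against every token, so the count grows as n squared."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values across all layers.

    The leading 2 accounts for one key tensor plus one value tensor.
    Grows linearly with sequence length.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    base = attention_score_entries(4096)
    for n in (4096, 8192, 131072):
        ratio = attention_score_entries(n) // base
        print(f"{n:>7} tokens: {ratio}x the score entries of a 4K context")

    # Hypothetical 32-layer model at 4K context, 16-bit cache values.
    full = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
    grouped = kv_cache_bytes(4096, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"KV cache, 32 KV heads: {full / 2**30:.2f} GiB")
    print(f"KV cache,  8 KV heads: {grouped / 2**30:.2f} GiB")
```

The second pair of numbers is the grouped-query attention effect: sharing keys and values across query heads cuts the cache, which is memory you pay on top of the weights.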
For regulated-industry use cases (a patient encounter note, a contract, a meeting transcript), 8K to 32K of context covers most real documents. The experiments on this site are all run within those bounds.
Why Apple Silicon changes the local AI equation
Conventional AI accelerators (NVIDIA GPUs) have a fixed pool of VRAM separate from system RAM. A 24 GB GPU can only run a model at full speed if it fits in 24 GB of VRAM. Anything larger means spilling into much slower system memory or building an expensive multi-GPU setup. Apple Silicon uses a unified memory architecture: the CPU, GPU, and Neural Engine all share the same physical memory pool.
A MacBook Pro with 128 GB of unified memory can load a model that would have to be split across several 24 GB GPUs on conventional hardware. The bandwidth is also high for consumer hardware: around 400 GB/s on the top M3 Max configuration, and more on Ultra-class chips. That matters because memory bandwidth, not raw compute, is usually the bottleneck for token generation at large model sizes.
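The bandwidth point can be turned into a rough ceiling. The sketch below assumes a dense model generating one token at a time, where every weight has to be read from memory once per token; under that assumption, tokens per second can't exceed bandwidth divided by model size. The function name is made up for illustration, and real throughput lands below this bound.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream generation speed.

    Assumes every weight is read once per generated token, and ignores
    compute time, KV cache reads, and any runtime overhead.
    """
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # 400 GB/s against a 70B model at 4-bit (about 35 GB of weights)
    # and a 7B model at 4-bit (about 4 GB of weights).
    for label, size_gb in [("70B @ 4-bit", 35.0), ("7B @ 4-bit", 4.0)]:
        ceiling = max_tokens_per_second(400.0, size_gb)
        print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

Treat the result as a sanity check: if a benchmark reports far more than this ceiling for a dense model, something other than plain token-by-token decoding is going on.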
This is why Ground Floor is Apple Silicon-first. It isn't brand loyalty: I looked at every option. It's the only consumer hardware I found where running a 70B model is a reasonable target without building out specialized infrastructure.
Layers, attention heads, and why they matter less than you think
A transformer is a stack of structurally identical blocks, each with its own weights. Each block has two main components: self-attention (which figures out which tokens should pay attention to which other tokens) and a feed-forward network (which processes those relationships into a representation the next layer can use). More layers means more computation per token and more parameters, but it also means the model can build more abstract representations of what it's reading.
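The link between layers and parameter count can be sketched with one formula. It assumes a classic GPT-2-style block: four square projection matrices in attention (query, key, value, output) and a feed-forward network that expands to four times the model width and back. Biases, layer norms, and embeddings are left out, and newer architectures with gated feed-forward layers or grouped-query attention won't match it exactly. The function name is mine.

```python
def block_params(d_model: int, ffn_multiplier: int = 4) -> int:
    """Approximate weights in one transformer block (no biases or norms)."""
    attention = 4 * d_model * d_model                 # Q, K, V, output projections
    feed_forward = 2 * d_model * (ffn_multiplier * d_model)  # up and down projections
    return attention + feed_forward


def stack_params(n_layers: int, d_model: int) -> int:
    """Weights across the whole stack of blocks, excluding embeddings."""
    return n_layers * block_params(d_model)


if __name__ == "__main__":
    # A GPT-2-small-like shape: 12 layers, width 768.
    total = stack_params(n_layers=12, d_model=768)
    print(f"12 layers x width 768: {total / 1e6:.1f}M parameters in the blocks")
    # Doubling depth doubles the count; doubling width quadruples it.
    print(stack_params(24, 768) // total, stack_params(12, 1536) // total)
```

Width matters more than depth for the total: parameters grow linearly with the number of layers but with the square of the model dimension.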
The attention heads within a layer each learn to attend to different features: one might track subject-verb agreement, another might link pronouns back to their referents. More heads means more capacity to track relationships simultaneously.
In practice: you rarely need to tune these numbers. They're baked into the model architecture the lab shipped. What you control is which model you run, at what quantization, on what hardware. That's the whole game, and it's exactly what the models table and experiments let you play yourself.
Further reading
If you want to go deeper, implement a transformer in PyTorch yourself and watch a small model learn to write Shakespeare from scratch:
- LLM from Scratch: hands-on workshop by Angelos Papageorgiou (ElevenLabs speech-to-text team). Trains a 10M parameter GPT-2-style model in under an hour on a laptop. Pure PyTorch, no pretrained weights.
- nanoGPT: Andrej Karpathy's minimal GPT implementation. The canonical starting point for understanding transformer internals without production-system complexity.
These explanations are intentionally simplified for practitioners, not researchers. Precision is sacrificed where it would obscure the practical point.