Public edition · Methodology

The Playbook

How I design, run, and publish experiments. The methodology is public because reproducibility matters more than mystique.

Why this project exists

I kept running into the same moment. A physician, an attorney, or a solo RIA would look at a good AI tool, see the internet dependency, and quietly step back from the whole conversation. Not out of technophobia. Out of a very reasonable calculation about vendor agreements, compliance overhead, and what happens when something goes wrong on someone else's server.

Air-gapped AI is the architectural answer. A model running on hardware you own makes no network calls. The data stays where you put it. Ground Floor documents whether these models are actually good enough to be worth the hardware investment, and for which tasks, on which hardware.

The air-gapped AI thesis

By 2025, consumer hardware had crossed the threshold I'd been watching for. Apple Silicon's unified memory architecture removed the separate VRAM ceiling that had made local inference impractical for years. A quantized 8B model on a $799 Mac mini can return a response faster than many cloud API round-trips, with no data leaving the building. It's 2026, and the feasibility question is answered. What I didn't expect is how much documentation work remains: which models, which tasks, which hardware tiers, which failure modes.

Experiment design

Choosing a task

A task qualifies if it meets three criteria:

High volume in the target industry. The task must represent a meaningful share of practitioners' non-billable time.
Evaluable output. There must be a clear standard to judge AI output against: a format, a checklist, or a rubric a practitioner can apply.
Human-in-the-loop review. A qualified human reviews and owns the final output. Ground Floor does not test autonomous AI decision-making.

Verdict criteria

Viable: The model's output is good enough that the practitioner's job is editing, not authoring. Time savings are real and the error rate is low enough that a normal review process catches problems before they matter.
Partial: The model helps for a subset of the task or volume. Works well for simple cases, struggles with complex ones. Worth using with awareness of where it breaks down.
Not yet: Current models at this hardware tier don't meet the bar for practical use. The output requires so much correction that it doesn't meaningfully reduce the practitioner's work. Failure modes are documented clearly. This is the most useful verdict for understanding what's missing.

Hardware documentation

Every experiment specifies the exact hardware: machine, RAM, chip generation, runtime version. Results are not portable across hardware tiers without re-testing. When an experiment is documented as "viable at entry level," the configuration was deliberately constrained to match a base Mac mini, even though the lab cluster has significantly more headroom.
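
One way to keep that specification consistent across writeups is a small structured record. This is a minimal sketch, not the format Ground Floor actually uses: the HardwareRecord class, its field names, and the same_tier rule (chip and RAM must match) are assumptions made for illustration, and the values in the example are placeholders.

```python
import json
import platform
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HardwareRecord:
    """Hypothetical record of the hardware behind one experiment run."""
    machine: str          # machine model, entered by hand
    chip: str             # chip generation, entered by hand
    ram_gb: int           # installed memory in GB
    runtime: str          # inference runtime name
    runtime_version: str  # exact runtime version used for the run
    os_version: str       # filled in automatically by make_record


def make_record(machine, chip, ram_gb, runtime, runtime_version):
    """Build a record, taking the OS string from the standard library."""
    return HardwareRecord(
        machine=machine,
        chip=chip,
        ram_gb=ram_gb,
        runtime=runtime,
        runtime_version=runtime_version,
        os_version=platform.platform(),
    )


def same_tier(a, b):
    """Assumed rule: two runs are comparable only if chip and RAM match."""
    return (a.chip, a.ram_gb) == (b.chip, b.ram_gb)


if __name__ == "__main__":
    # Placeholder values, not a real tested configuration.
    record = make_record("Mac mini", "example-chip", 16, "example-runtime", "0.0.0")
    print(json.dumps(asdict(record), indent=2))
```

Storing the record next to the results makes the "re-test before porting" rule checkable: if same_tier returns False for two runs, their verdicts should not be compared directly.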

Publication format

Each experiment ships in three formats:

Writeup (this site): Full methodology, results, failure modes, replication instructions. Permanent reference.
Video: Screen recording of setup and inference in action. Shows what the model actually produces, not a curated highlight.
Post (LinkedIn): 3–5 paragraph summary with verdict. Links to writeup for depth.

Failure modes: what to watch for

Every experiment on this site documents failure modes explicitly. Here's the pattern I see across all three regulated industries.

False precision

The most dangerous failure mode isn't a wrong answer: it's a confidently wrong answer with no signal that it's wrong. A local 8B model drafting a SOAP note will occasionally invent a medication dose; a legal model will cite a clause number that doesn't exist; a financial model will produce a figure that's plausible but off. These aren't rare edge cases. They happen.

Here's the defense: treat every output as a first draft by a junior colleague who is smart but hasn't verified their work. Your review is not optional. It's the point of the architecture. The model handles the drafting friction. You handle the accuracy.

Complexity degrades quality

Smaller models (8B–13B) perform well on structured, predictable tasks. They struggle on complex ones: a new patient with a complicated history, a contract with cross-referencing clauses, a financial meeting with emotionally complicated family dynamics. The output becomes more generic, misses nuance, or confidently omits relevant detail.

This isn't a reason to avoid local AI. It's the reason to know which tasks belong at which hardware tier. The experiments map this out explicitly.

Prompt quality determines output quality

A vague prompt produces vague output. "Summarize this meeting" produces something generic. "Extract action items, decisions made, and follow-up commitments from this meeting transcript, structured as a CRM note with the sections: Topics Discussed, Decisions, Client Actions, Advisor Actions" produces something useful.

The experiments publish the actual prompts used. Start there.
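
The specific prompt quoted above can be kept as a small template so the section list stays identical from run to run. This is a sketch under stated assumptions: build_crm_prompt and SECTIONS are hypothetical names, and the "Transcript:" framing is my own addition rather than the wording of any published experiment prompt.

```python
# Section names taken from the example prompt in the text above.
SECTIONS = ["Topics Discussed", "Decisions", "Client Actions", "Advisor Actions"]


def build_crm_prompt(transcript, sections=SECTIONS):
    """Return the specific prompt with the transcript appended.

    The instruction wording comes from the text; the "Transcript:" label
    and the layout are illustrative assumptions.
    """
    section_list = ", ".join(sections)
    return (
        "Extract action items, decisions made, and follow-up commitments "
        "from this meeting transcript, structured as a CRM note with the "
        f"sections: {section_list}.\n\n"
        "Transcript:\n"
        f"{transcript.strip()}\n"
    )


if __name__ == "__main__":
    print(build_crm_prompt("Advisor: Let's review the rollover paperwork."))
```

Keeping the sections in one list means the reviewer always knows which headings to expect in the draft, which makes a missing section easy to spot.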

Red flags by industry

Medical: Watch for invented clinical specifics: dosages, lab values, or procedure names that weren't in the transcript. The model will fill gaps with plausible-sounding detail rather than flagging uncertainty.
Legal: Watch for invented case citations, clause numbers that don't match the document, and jurisdiction-specific analysis that's generic rather than accurate. The model lacks depth on state-level variations.
Financial: Watch for misattributed action items (the model swaps who owns what), invented figures in numerical summaries, and regulatory language that sounds right but doesn't match current rules.
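
Invented dosages and figures share one mechanical symptom: a number appears in the draft that never appeared in the source. A rough pre-review check for that symptom can be sketched as follows. This is an illustrative helper, not part of the published experiments; unsupported_numbers and numbers_in are hypothetical names. It only sees digits, so it will miss spelled-out numbers, invented citations, procedure names, and misattributed items. It is a prompt for the reviewer, not a substitute for review.

```python
import re

# Matches integers and decimals, including thousands separators (1,250.50).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text):
    """Return the set of numeric tokens in text, with commas removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(source, draft):
    """List numbers that appear in the draft but nowhere in the source."""
    return sorted(numbers_in(draft) - numbers_in(source))


if __name__ == "__main__":
    # Made-up example text, for illustration only.
    source = "Patient takes lisinopril 10 mg daily. BP 120/80."
    draft = "Continue lisinopril 20 mg daily. BP 120/80."
    print(unsupported_numbers(source, draft))  # ['20']
```

An empty result does not mean the draft is accurate, since a number can match the source and still be attached to the wrong drug, clause, or account. A non-empty result is simply a list of places to look first.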

What this playbook doesn't cover

Compliance guidance, legal interpretation, and clinical recommendations are out of scope. The Scope & Disclaimers page explains this in detail.

This is the public edition of the Ground Floor playbook. It describes methodology and framework, not internal processes.