Building a 100 tok/s Inference Engine in 1,800 Lines of C++

Dirk Harms-Merbitz · February 2026 · linuxtoaster.com

Our local inference engine, written in C, was slower than Python. That shouldn't be possible.

It took us a while, but today it runs a 30-billion-parameter model at 100 tokens per second on a MacBook — roughly 70 words per second, fast enough that responses feel like they're being typed by someone who can think at the speed of reading. You hit enter and the answer starts before your finger lifts off the key.

Metric                                        Before      After
Prompt reading (17K tokens)                   7 tok/s     394 tok/s
Time to first word (20K history, cache hit)   75 s        0.6 s
Response writing (short context)              23 tok/s    107 tok/s
Response writing (28K context)                ~23 tok/s   ~80 tok/s
Response writing (68K context)                N/A         ~60 tok/s

The final toasted.cpp is ~1,800 lines. Full hybrid DeltaNet/attention inference with a hand-tuned Metal kernel. MoE routing through 512 experts. Chunked batch prefill. Compiled step functions. Session cache with LRU eviction. Unix socket daemon with JSON protocol. No Python. No dependencies beyond MLX.

This is the story of how we got there, the invisible bug that cost us half our throughput, and what happened when we asked the model to optimize its own engine.

How I got here

In 2022 I was working at Amazon on the Linux team. CVE descriptions were a mess — different formats, different severity language, no consistency across sources. I eventually wrote a Python script that piped them through an LLM and got back five clean bullet points every time. It made my brain hurt less.

When Amazon reorganized, the US Amazon Linux team was laid off. I decided to write a book about AI. The idea was that AI would write it for me.

It didn't. You actually have to write it yourself. There are two thousand pages sitting in a corner of the office. It is a good book, and eventually I'll finish it.

Getting 35,000 coherent words out of an LLM requires iterative editing loops — feed a draft in, refine, feed it back, repeat. I needed tooling for that. So I kept building. One annoyance at a time, for three years, and now I have a startup called LinuxToaster.

Unix got talking to a computer right: small tools composed with pipes. LinuxToaster's core tool is toast — composable AI for the terminal. Think sed with a brain. You pipe text through an LLM the same way you'd pipe through grep or sed:

cat error.log | toast "diagnose"
git diff | toast "review"
curl site.com | toast "summarize" | toast "translate to Spanish"

Toast routes inference through cloud providers — Anthropic, OpenAI, Cerebras, Groq. It works. But every request is a round trip, and data leaves the machine. Cloud providers say they don't read our data, and before AI there was a certain safety in volume: who would be motivated enough to look? With AI, it is now easy to read EVERYTHING. Privacy seems prudent. So I tried local inference on my MacBook Pro (M4 Max, 128 GB of unified memory), starting with Ollama.

Why not just use Ollama

Fair question. Ollama and LM Studio are good products. I have used Ollama, LM Studio, and everything else. Toast even supports them as local inference providers.

But here's what bugged me. I wanted toast to be a reflex — like grep or curl. Something you reach for without thinking. And every time I set up local inference for someone, I hit the same walls: Python's half-second startup tax, pip install that breaks between Tuesday and Wednesday, a hidden ~/.ollama directory full of mystery blob files, default context windows too small for real work, and a general feeling of managing a second system just to avoid API calls.

What I wanted was a single daemon that loads a single model, listens on a Unix socket, and disappears into the workflow. No Python. No hidden directories. No configuration. Not yet another framework. Just toasted /path/to/model and forget about it. When it's running, toast auto-detects it. Same interface as cloud. When it's not running, toast uses cloud. The pipe doesn't care where the intelligence comes from.

We could have used MLX, but that's Python and all that comes with it. We could have used llama.cpp, but that brings a lot of complexity. In the end we wrote our own toasted.cpp from scratch.

The Model

We chose Qwen3-Coder-Next. Because our Mac Minis have 64 GB of RAM, we picked 4-bit quantization. My laptop with 128 GB can handle 6-bit or 8-bit; I tend to use the 6-bit. Qwen3-Coder-Next is an unusual model, 42.7 GB on disk at 4-bit quantization, designed for efficient inference through two architectural choices:

Mixture-of-experts: Each token routes through 8 of 512 feed-forward experts. The other 504 sit idle. The model has the knowledge of all 512 — training saw all of them — but inference only pays for 8. You get a 30B parameter model running at speeds you'd normally associate with a 7B dense model.
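The routing step itself is simple to sketch. This is an illustrative CPU-side version, not the engine's actual code (the real gate runs as a quantized matmul on the GPU): take the top-k gate scores, softmax only the winners, and mix those k experts' outputs by the resulting weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Route one token: pick the top-k of n_experts gate logits, then softmax
// over just those k to get mixing weights. The other experts are never touched.
std::vector<std::pair<int, float>> route_token(const std::vector<float>& gate_logits, int k) {
    std::vector<int> idx(gate_logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partial sort: only order the k highest-scoring experts.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return gate_logits[a] > gate_logits[b]; });
    // Softmax over the selected k logits (max-subtracted for stability).
    float max_logit = gate_logits[idx[0]];
    float denom = 0.0f;
    std::vector<std::pair<int, float>> routed;
    for (int i = 0; i < k; ++i) {
        float e = std::exp(gate_logits[idx[i]] - max_logit);
        routed.push_back({idx[i], e});
        denom += e;
    }
    for (auto& [expert, w] : routed) w /= denom;
    return routed;
}
```

With k = 8 and 512 experts, the partial sort and softmax are trivially cheap; the savings come from never reading the 504 unselected experts' weights.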

Hybrid attention architecture: Most LLMs are 100% transformer attention — every token attends to every other token, and cost grows with context length. This model uses DeltaNet, a linear recurrent architecture, for 36 of its 48 layers. DeltaNet maintains a fixed-size state matrix regardless of context length. Only 12 layers use traditional attention with KV caches that grow.
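The memory contrast is easy to see in a toy sketch. This uses a plain linear-attention-style update, not DeltaNet's actual delta rule; the point is only that a recurrent layer carries one fixed d x d state no matter how long the context gets, while an attention layer's KV cache grows with every token.

```cpp
#include <cassert>
#include <vector>

// Recurrent layer: state is a fixed d*d matrix, allocated once, never grows.
struct RecurrentLayer {
    int d;
    std::vector<float> state;
    explicit RecurrentLayer(int dim) : d(dim), state(dim * dim, 0.0f) {}
    void step(const std::vector<float>& k, const std::vector<float>& v) {
        // Fold this token into the state: S += k v^T (simplified update rule).
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                state[i * d + j] += k[i] * v[j];
    }
};

// Attention layer: the cache gains one key and one value per token, forever.
struct AttentionLayer {
    std::vector<std::vector<float>> k_cache, v_cache;
    void step(const std::vector<float>& k, const std::vector<float>& v) {
        k_cache.push_back(k);
        v_cache.push_back(v);
    }
};
```

With 36 fixed-state layers and only 12 growing ones, per-token cost at long context is dominated by a fraction of the layers rather than all of them.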

Back-of-napkin math: ~5 GB of weight data read per decode step. At ~546 GB/s unified memory bandwidth on an M4 Max, that gives a theoretical floor of ~9.1ms per token, or ~110 tok/s. We wanted to see how close we could get.
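That arithmetic, spelled out. The constants are the article's numbers, and this is an upper bound only: real decode also pays dispatch and CPU overhead on top of memory reads.

```cpp
#include <cassert>
#include <cmath>

// Bandwidth-bound decode ceiling: a step that must stream `bytes_per_step`
// of weights through memory at `bandwidth_bytes_s` can't finish faster than
// bytes / bandwidth, which bounds tokens per second from above.
double decode_ceiling_tok_s(double bytes_per_step, double bandwidth_bytes_s) {
    double seconds_per_token = bytes_per_step / bandwidth_bytes_s;
    return 1.0 / seconds_per_token;
}
```

Plugging in ~5 GB per step and ~546 GB/s gives roughly 9.2 ms per token, about 109 tok/s, the ceiling the rest of this post chases.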

Attempt one: C against the MLX C API

The first version of toasted was pure C. We wrote attention, feed-forward, rotary position embeddings, RMS normalization, top-p sampling — all directly against Apple's MLX C API. It read the model directory (safetensors weights, config, tokenizer) and ran inference natively. No Python anywhere.

It generated correct tokens at 21 tok/s.

That was worse than Apple's reference Python implementation. We wrote the whole thing in C to avoid Python and ended up slower than Python. Not a great start.

The problem: the MLX C bindings aren't fully supported. We had to issue each operation individually — multiply here, add here, softmax here. The GPU spent more time waiting for dispatch instructions than doing math. The C API doesn't support the operation fusion that MLX's Python layer gets automatically.

The Baking Workaround

Let's think about compile time versus runtime. What if we could skip runtime graph construction entirely? We wrote a tool that runs the model's forward pass in Python but, instead of executing it, captures the entire computation graph — every matrix multiply, every attention head, every expert routing decision — and serializes it to disk.

Because each trace is shape-specific, we needed one for each possible token position. For a 262,144-token context window, that's 262,144 individual graphs, plus 2,048 chunk-prefill graphs. The model weights (42 GB) are stored once and referenced by every trace, so each additional graph adds only its topology — the full export is about 45 GB regardless of trace count.

With pre-computed graphs: ~80 tok/s. Parity with MLX's Python implementation.

But the baking step took 18 hours, and we had to bump an MLX resource limit in the code so it wouldn't run out — meaning that whenever Apple updates MLX, we'd have to do it all again. Not sustainable.

Rewrite in C++

We rewrote toasted in C++ against MLX's C++ API, which has proper graph fusion support. First test: generation jumped to 90 tok/s. But prefill — the model processing its input before generating a response — crashed to 7 tok/s. Down from 40.

Two bugs. First: the code called mx::eval(hidden) every 4 layers during prefill. Twelve GPU synchronization points that flushed the pipeline each time. For a 14-token prompt through 48 layers, the GPU spent more time waiting than computing.

Second: prefill ran on a separate GPU stream, but our custom DeltaNet Metal kernel didn't explicitly inherit it. This caused implicit synchronization between streams — the worst kind of performance bug because it's invisible in the code and invisible in the output.

The fix was two lines. Remove intermediate evals for short sequences. Drop the separate prefill stream. Prefill jumped back to 40 tok/s.

The type leak

Even with the C++ rewrite, generation was stuck at 21 tok/s. We tried everything. Profiling. Different batch sizes. Different kernel configurations. The number wouldn't move. We read Apple's source code. Nothing.

Finally, profiling showed our MoE layer was 2.4× slower than Apple's reference — 0.90ms per layer versus 0.29ms — despite calling the same gather_qmm kernel.

An fp16 vs fp32 type leak! The MoE gate's quantized weights use bfloat16 scales and biases. When MLX dequantizes them, the output promotes to float32. Apple's Python code computes softmax in float32 but casts the result back to float16 before feeding it downstream. Our C++ code didn't cast back.

The float32 gate output propagated through expert selection, into gather_qmm, through the expert projections, through the residual add — every subsequent operation inherited the wider type. Every tensor that touched the MoE path doubled in size, doubling the memory bandwidth consumed per token.

The effect was invisible in the code and invisible in the output. The model produced correct tokens. But on a bandwidth-bound workload, doubling your memory reads is fatal.

Making sure everything stayed fp16 got us to ~50 tok/s, and then into the high 80s with further tuning.

Prefill: from 40 to 394 tok/s

At 40 tok/s prefill, a 17,000-token conversation — a normal long chat, maybe with a file in it — meant 7 minutes of staring at a blank screen before the first word appeared.

The bottleneck: prefill processed the entire prompt as one massive batch. The 12 attention layers used scaled_dot_product_attention with a causal mask over all 17K tokens simultaneously.

We restructured prefill into 32-token chunks. Each chunk processes through all 48 layers, updates the cache, and moves on. The attention layers only see their chunk plus cached KV from previous chunks, keeping matrix operations small.
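The shape of that loop, as an illustrative skeleton rather than toasted.cpp verbatim: walk the prompt in fixed-size chunks, and note that the attention score matrix for any step is only chunk_len rows tall, never the full prompt on both axes.

```cpp
#include <algorithm>
#include <cassert>

struct PrefillStats {
    int chunks = 0;
    long long largest_score_matrix = 0; // rows*cols of the biggest QK^T computed
};

// Chunked prefill: each chunk attends to itself plus the KV already cached
// from earlier chunks, then folds its own keys/values into the cache.
PrefillStats chunked_prefill(int prompt_len, int chunk_size) {
    PrefillStats s;
    int cached = 0; // tokens already in the KV cache
    for (int start = 0; start < prompt_len; start += chunk_size) {
        int len = std::min(chunk_size, prompt_len - start);
        // The real engine runs this chunk through all 48 layers here; the
        // attention scores are len x (cached + len), never prompt_len squared.
        s.largest_score_matrix = std::max(
            s.largest_score_matrix, (long long)len * (cached + len));
        cached += len;
        ++s.chunks;
    }
    return s;
}
```

For a 17K-token prompt in 32-token chunks, the biggest score matrix is about 32 x 17K instead of 17K x 17K — roughly 500× fewer elements in the worst single matmul.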

Result: 394 tok/s. The same 17K prompt now prefills in ~44 seconds instead of 7 minutes.

The session cache: 75 seconds → 600 milliseconds

This was the big one. And it wasn't a GPU optimization — it was an architectural insight.

Every request from toast included the entire conversation history. At 20K tokens, that meant 60+ seconds of redundant prefill, re-processing identical messages that hadn't changed since the last turn. The user types "thanks" and waits over a minute.

The insight: on each turn, only the last message is new. Everything before it was already processed on the previous turn. If we save the model's internal state — KV caches for 12 attention layers, recurrent states for 36 DeltaNet layers — we can restore it and only prefill the new message.

The implementation:

  1. Hash messages[0..n-2] (everything except the last message) using FNV-1a
  2. Check against an LRU cache of saved sessions
  3. On hit: restore cached state, tokenize only the last message (~30–80 tokens), prefill just that
  4. On miss: full prefill
  5. After generation: save the new state with a hash that includes the generated response

The save hash is constructed so the next request's lookup hash will match — because the next request's messages[0..n-2] will include this response as the second-to-last message.
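A minimal sketch of that lookup path, with hypothetical names; the real cache stores multi-gigabyte model-state blobs, elided here down to a bare key so only the FNV-1a hashing and LRU mechanics show.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// FNV-1a over the serialized conversation prefix.
uint64_t fnv1a(const std::string& data) {
    uint64_t h = 14695981039346656037ull; // FNV-1a 64-bit offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull; // FNV-1a 64-bit prime
    }
    return h;
}

// LRU set of saved session keys (the real entries carry KV caches and
// DeltaNet states; omitted here).
class SessionCache {
    size_t capacity_;
    std::list<uint64_t> order_; // most-recently-used first
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;
public:
    explicit SessionCache(size_t cap) : capacity_(cap) {}
    bool lookup(uint64_t key) { // hit: bump to front
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        order_.splice(order_.begin(), order_, it->second);
        return true;
    }
    void save(uint64_t key) {
        if (lookup(key)) return;
        order_.push_front(key);
        index_[key] = order_.begin();
        if (order_.size() > capacity_) { // evict least-recently-used
            index_.erase(order_.back());
            order_.pop_back();
        }
    }
};
```

Hashing the prefix rather than the full message list is what makes the save/lookup key agreement work: the next turn's prefix is exactly this turn's prefix plus this turn's response.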

Result: 75 seconds → 600 milliseconds. A 125× improvement in time-to-first-token.

The first version of the cache had a memory leak. Each turn created a new entry. After 8 turns: 8 snapshots of essentially the same conversation at different points — ~11 GB of dead state. The fix was parent eviction: when saving a new session restored from an existing one, evict the parent. A single chat now holds exactly one session. Memory tracks active conversations, not history.

Speculative decoding: knowing when to stop

We tried speculative decoding — a smaller draft model proposes tokens, the large model verifies in batch. Two approaches: n-gram prompt lookup and self-speculative early exit.

Both failed. N-gram got 11–14% acceptance rates on text; you need above 70% to break even. Self-speculative got 0% — the early layers of this hybrid DeltaNet/attention architecture produce representations too different from the final output. Maybe for a code-only model this would work better.
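A toy cost model shows why low acceptance kills it. The costs here are assumptions for illustration (a draft step costing ~45% of a verify step, 4 drafted tokens per round), not measurements; under those assumptions, break-even lands near 70% acceptance.

```cpp
#include <cassert>
#include <cmath>

// Speculative decoding round: the draft proposes k tokens, the big model
// verifies them in one batched pass. With per-token acceptance rate a, a
// round yields the accepted prefix plus one corrected token; it pays k draft
// steps plus one verify step. Speedup is measured against one token per
// verify step (plain decoding).
double spec_speedup(double a, int k, double draft_cost_ratio) {
    // Expected tokens per round: 1 + a + a^2 + ... + a^k (geometric sum).
    double tokens = (1.0 - std::pow(a, k + 1)) / (1.0 - a);
    double round_cost = k * draft_cost_ratio + 1.0; // in verify-step units
    return tokens / round_cost;
}
```

At 11–14% acceptance the round produces barely more than one token while paying for five model invocations, so it runs well below 1.0× no matter how the costs shift.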

All speculative code was reverted. Sometimes the best optimization is knowing when to stop.

Compiled step functions: the last 10%

At ~89 tok/s we were close to the theoretical ceiling. Gap analysis showed ~1.0ms per token was CPU-side graph construction — rebuilding the MLX computation graph on every decode step.

MLX supports mx::compile() — trace a graph once, cache it, replay it. We wrapped all 48 layers into compiled step functions.
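What mx::compile buys can be illustrated with a plain-C++ analogue — a memoization sketch of the idea, not MLX code: pay graph construction once per input shape, then replay the cached result on every subsequent step.

```cpp
#include <cassert>
#include <map>

// Stand-in for a traced computation graph.
struct Plan { int built_for = 0; };

// Build a plan the first time a shape is seen; replay it thereafter.
class CompiledStep {
    std::map<int, Plan> cache_; // one cached plan per input shape
public:
    int builds = 0; // how many times we paid for construction
    const Plan& get(int shape) {
        auto it = cache_.find(shape);
        if (it == cache_.end()) {
            ++builds; // graph construction: paid once per shape
            it = cache_.emplace(shape, Plan{shape}).first;
        }
        return it->second; // replay: effectively free afterwards
    }
};
```

Decode steps all share one shape (a single new token), so the ~1.0 ms of per-step graph construction collapses to a lookup.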

Result: 100+ tok/s at short context, ~80 tok/s at 28K tokens, ~60 tok/s at 68K tokens.

The model reviews its own code

Once toasted was running, we asked the model — running on toasted, on the engine we'd just built — to review that engine's code and suggest optimizations.

It proposed a paged KV cache (vLLM-style). Sounds impressive. In practice, our concatenate runs once every 256 tokens, copies 12 MB at 546 GB/s, and costs 0.02ms. The actual bottleneck costs 1.5ms per token. The optimization would be 20,000× less impactful than the problem it claims to solve.

It suggested caching the last hidden state from prefill to avoid redundant computation. There is no redundant computation — prefill produces logits, we sample a token, then gen_step embeds that different token and runs it through all 48 layers. The model misread its own execution path.

It recommended enabling MetalFX. That's Apple's game graphics upscaler. It makes explosions look better. It has nothing to do with compute shaders.

It suggested pinning threads to performance cores. The CPU thread just submits work to the GPU and reads back one integer. It is not the bottleneck.

After several pages of confident, wrong suggestions, the model eventually reasoned itself to the correct conclusion:

"Your code is optimal. The only path forward is a faster Mac."

It had seen thousands of blog posts and papers about inference optimization during training. It knew the vocabulary, the concepts, the common patterns. But it had zero visibility into its own architecture, its own runtime, or the hardware it was literally running on at that moment. It pattern-matched on code structure rather than tracing actual execution. It knew every cookbook but couldn't tell you if the soup needed salt.

Oh, and the system prompt said "be brief, as if text messaging." The model reviewing its own engine responded with excited 2,000-word essays and emoji section headers. This is perhaps the most relatable LLM behavior of all.

Distribution: rsync

A 42–60 GB model file presents a distribution problem. HTTP downloads die at 80% and start over. CDNs charge per gigabyte. BitTorrent is too weird for enterprise.

We use rsync. Resume on failure — built in. Delta transfers — a model update that changes 2 GB only transfers 2 GB, not 60. Progress reporting. Checksum verification. Compression in transit. Every Mac already has it.

$ toasted pull     # rsync from our server, resumable
$ toasted start   # loads model once, listens on Unix socket
$ toast "hello"   # auto-detects local inference

A Unix tool solving a Unix tool's distribution problem. There are many such solutions in Unix that just work, and toast knows all of them.

What we learned

The biggest wins aren't where you expect. Compiled step functions — the "obvious" optimization — gave 10%. The session cache — an architectural change to avoid redundant work — gave 12,500%. The best optimization was not making the hot loop faster, but not entering it at all.

LLMs are better at writing code than reviewing it. The model wrote the DeltaNet kernel, the MoE routing, the session cache, all correctly on first or second attempt. But when asked to optimize existing code, it pattern-matched against training data rather than reasoning about the specific system. Writing is synthesis; optimization is analysis. They're different skills.

Unified memory changes the optimization landscape. Most inference optimization literature assumes discrete GPUs with separate CPU and GPU memory. On Apple Silicon, there's no PCIe transfer, no memory copies, no staging buffers. Half the standard playbook doesn't apply. You have to reason from first principles about bandwidth and compute, not cargo-cult techniques from GPU server deployments.

An invisible bug can be the whole problem. The fp16 type leak produced correct output, passed every test, and cost us half our throughput. The model ran fine — just slowly. These are the hardest bugs to find because nothing tells you to even look.

The model that can't taste the soup. Qwen3-Coder-Next, running on the very engine we built for it, couldn't identify a single valid optimization for that engine. It had read every cookbook but couldn't tell if the soup needed salt. There's a lesson in there about the difference between knowledge and understanding — one that feels important as we build systems that are very good at the former.