Local Inference Is a Unix Problem | LinuxToaster

Why the inference stack has to stop looking like a research notebook.

Most local LLM inference today runs on Python. Ollama wraps llama.cpp in Go, but the surrounding ecosystem — MLX, transformers, vLLM, the fine-tuning stacks, the notebook tutorials — is overwhelmingly Python. This is a historical accident that is costing the ecosystem more than it realizes.

Python won machine learning because ML started as a research activity, and Python is what researchers write. Research code wants interactive notebooks, quick iteration, and a rich scientific library ecosystem. Python delivers that. But inference — the part where you take a trained model and run it in production — is not research. It is plumbing. And Python is a bad language for plumbing.

What "plumbing" means in Unix

In Unix, a tool is plumbing when it behaves well in a pipe. It starts fast, reads stdin, writes stdout, exits cleanly, returns a useful status code, and does not care about its neighbors. grep is plumbing. jq is plumbing. curl is plumbing. You can chain a dozen of them together and the shell orchestrates the composition with essentially zero overhead.

A Python-based inference server is not plumbing. It takes seconds to start — interpreter initialization, library imports, model loading, CUDA or Metal context setup. It wants to be long-running, because paying the startup cost for every call is intolerable. It exposes an HTTP API because that is how Python services talk to the world. It has a GIL, so concurrency is a careful dance. It has a dependency tree with sharp edges — CUDA versions, transformers releases, tokenizer compatibility.

None of this is Python's fault exactly. Python was designed for a different job. But the result is that local inference in 2026 looks like enterprise Java circa 2005: a heavyweight runtime with a service-oriented API, running as a daemon, talked to over HTTP. That is not a tool. That is a server.

The cost of servers

When inference is a server, composition breaks. You cannot put a server in the middle of a pipe. You can put a client to a server in the middle of a pipe, but now every invocation is a round-trip to localhost, every invocation serializes to JSON, every invocation pays the overhead of HTTP framing. For high-throughput batch work this is fine. For interactive shell use — the kind of thing where you type cat file | toast "explain" and expect an answer — it is a tax you pay every time.

More subtly, servers have lifecycle. They have to be started, monitored, restarted on crash, upgraded without breaking clients. They accumulate configuration — which model is loaded, which port, which quantization, which context length. Every one of these is a small operational burden. Multiply by the number of models you want to try and you have built a little fleet inside your laptop.

Unix tools do not have lifecycle in this sense. You install grep, you use grep, you are done. You do not start grep. You do not monitor grep. grep is not a service.

What a Unix-native inference tool looks like

When we started building toasted, our local inference daemon for Apple Silicon, the first question was: how do we make this feel like a Unix tool even though inference has non-trivial startup cost?

The answer has two parts. First, the heavyweight work — loading weights, setting up the Metal context, allocating KV cache — happens once in a background daemon. Second, the client that goes into a pipe is tiny and starts instantly. toasted is the daemon; toast -p toasted is the client. The client opens a local socket, streams the prompt, streams back tokens, exits.
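The client side of that split is small enough to sketch in full. This is a minimal sketch, not toasted's actual source: the socket path (/tmp/toasted.sock) and the newline-delimited framing are assumptions for illustration, and the function names are ours. The shape is what matters: connect, write, stream, exit.

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <string>

// Connect to the daemon's Unix-domain socket. The path here is a
// hypothetical default, not toasted's documented location.
int connect_daemon(const char* path = "/tmp/toasted.sock") {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Send the prompt, half-close the write side, then copy the token
// stream to stdout until the daemon closes the connection. Returns a
// shell-friendly exit status: 0 on success, 1 on I/O error.
int relay(int fd, const std::string& prompt) {
    std::string msg = prompt + "\n";  // newline-delimited framing (assumed)
    if (write(fd, msg.data(), msg.size()) != (ssize_t)msg.size()) return 1;
    shutdown(fd, SHUT_WR);            // "that's the whole prompt"
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return n < 0 ? 1 : 0;
}
```

Wrap relay() in a few lines of main that call connect_daemon() and pass argv[1], and the result drops into a pipe like any other filter: it starts in milliseconds because all the expensive state lives in the daemon.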

From the shell's perspective, it is a normal tool. You can pipe to it. You can pipe from it. You can xargs it. You can put it in cron. The daemon is an implementation detail of "how do you make a 30B-parameter model respond in 0.6 seconds"; it is not exposed to the user as a service they have to manage.

We wrote it in C++ against Apple's MLX C API. No Python. No transformers library. No virtual environment. No pip install. The binary is a few megabytes. It runs Qwen3-Next-Coder at ~100 tokens per second of generation and ~400 tokens per second of prefill, with session caching that brings time-to-first-token down to 0.6 seconds.

The reason this matters is not the raw performance numbers, though those are good. The reason it matters is that a tool this fast and this small can be composed. You can put it inside a loop without thinking about it. You can call it from find -exec without worrying about throughput. You can cron it for nightly reports. You stop thinking about "the inference server" and start thinking about the model as just another filter in the pipeline.

Why Python made the choice hard

The case for Python inference is real. You get transformers, you get the huggingface ecosystem, you get fine-tuning, you get every new model architecture the day it drops. Switching off Python means reimplementing operations, dealing with new architectures as they appear, maintaining your own kernels.

This is a legitimate cost. We took it because we thought the composition story mattered more. We think, in the long run, the ecosystem will too — not because Python is bad at ML, but because inference and research are different activities with different requirements, and the industry is still treating them as the same thing.

The same split happened before. Compilers used to be research activities in Lisp; now they are production tools in C++ and Rust. Databases used to be research prototypes; now they are system software. Machine learning is going through the same transition. The research stays in Python. The inference moves to something leaner. This is not a criticism of Python. It is recognition that the job has changed.

What else needs to be Unix-shaped

Inference is the obvious candidate, but it is not the only one. A partial list of things that currently want to be services but would be better as tools:

Embeddings. Today you spin up a sentence-transformers server. Tomorrow it should be embed < file.txt, reading stdin, writing vectors to stdout, exiting. A cold-start latency of a second is fine for interactive use; a daemonized version is fine for high-throughput use; the interface is the same.
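To make the filter shape concrete, here is a sketch of what such a tool's core looks like. The "model" below is a stand-in — a hashed bag-of-words, L2-normalized — not any real embedding model; the function name embed and the eight-dimensional output are illustrative assumptions. A real tool would load actual weights here, and everything else about the program (read stdin, write stdout, exit) would stay the same.

```cpp
#include <cmath>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Stand-in "model": hash each whitespace-separated word into one of
// `dim` buckets, count, then L2-normalize. This is NOT a real
// embedding -- it only demonstrates the text-in, vector-out contract.
std::vector<float> embed(const std::string& text, size_t dim = 8) {
    std::vector<float> v(dim, 0.0f);
    std::istringstream words(text);
    std::string w;
    while (words >> w)
        v[std::hash<std::string>{}(w) % dim] += 1.0f;
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0f)
        for (float& x : v) x /= norm;
    return v;
}
```

A few lines of main that slurp stdin, call embed, and print the floats space-separated turn this into exactly the embed < file.txt invocation described above — and swapping the stand-in for real weights changes nothing about the interface.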

Retrieval. A vector search over a local index should be search "query" < index.db, not a database service with an HTTP API and its own query language. SQLite showed that for a huge class of databases, the service model is overkill. The same is true for most vector search.
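The core of such a search tool is almost embarrassingly small. A sketch, assuming the index has already been parsed from disk into memory as (id, vector) pairs — the function name nearest and the flat-scan design are our illustrative choices, not any particular tool's implementation:

```cpp
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// Brute-force nearest neighbor by cosine similarity: the whole "vector
// database" is a flat array scanned in one pass. For indexes that fit
// in memory -- which covers most local use -- this is the entire core.
std::pair<std::string, float> nearest(
    const std::vector<float>& query,
    const std::vector<std::pair<std::string, std::vector<float>>>& index) {
    std::string best_id;
    float best = -2.0f;               // cosine similarity lies in [-1, 1]
    for (const auto& [id, v] : index) {
        float dot = 0, qn = 0, vn = 0;
        for (size_t i = 0; i < v.size(); ++i) {
            dot += query[i] * v[i];
            qn  += query[i] * query[i];
            vn  += v[i] * v[i];
        }
        float sim = dot / (std::sqrt(qn) * std::sqrt(vn) + 1e-9f);
        if (sim > best) { best = sim; best_id = id; }
    }
    return {best_id, best};
}
```

Everything the "database service" adds on top of this loop — the HTTP API, the query language, the daemon lifecycle — is the part the tool model lets you delete.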

Rerankers, classifiers, small specialized models. All of these are currently shipped as services because that is what Python produces. All of them should be tools.

The pattern in every case is the same: take something that is currently a Python service, and ask what it would look like as a Unix tool. Usually the answer is "faster, smaller, easier to compose, and exactly as capable."

Closing

Local inference is a Unix problem in the sense that Unix already figured out the shape of the answer. Small tools, composed through pipes, running on demand. The model is not a service. The model is a filter.

Python inference is what you get when you take a research stack and deploy it. Unix-native inference is what you get when you take composition seriously.

The hard part is that the ecosystem has to be rebuilt. Model formats, quantization tools, tokenizer libraries, all of it. That is a lot of work. But it is the work that separates "we have Python in production" from "we have the thing working." More than fifty years after Unix, the lesson keeps recurring: production is what happens when you stop writing research code and start writing tools.

This essay is part of LinuxToaster — Unix re-imagined for the era of AI.