← Back to Blog
AI
Running LLMs on CPU: Quantisation in Practice
5 April 2026 · 11 min read
Not every deployment has access to GPU infrastructure. In defence environments especially, you’re often working with air-gapped systems running commodity hardware. We’ve spent the last year making AI inference practical on CPU-only machines — here’s what actually works.
This is a practical guide based on production deployments, not benchmarks-on-paper. If you’re considering running LLMs without a GPU farm, this article covers the tooling, quantisation techniques, performance expectations, and architectural patterns we use in real client work.
Why CPU Inference Matters in 2026
GPU availability is a luxury, and a costly one. Beyond the obvious cost issue, many of our clients operate in environments where GPU hardware isn’t available, isn’t approved, or isn’t practical:
- Air-gapped defence networks — classified environments with strict hardware approval lists.
- Healthcare on-prem deployments — hospitals with existing server estates that can’t add GPU racks.
- Edge devices — medical imaging carts, industrial sensors, retail kiosks. CPU is what you have.
- Cost-sensitive SaaS — if your product runs an LLM call per user action, GPU costs scale faster than revenue.
- Privacy-first products — cloud LLMs leak data by design. On-device inference is the only honest answer.
The good news is that modern quantisation techniques have closed the gap dramatically. A 7B parameter model quantised to 4-bit runs comfortably on a modern desktop CPU with 16GB of RAM — producing usable output for document analysis, classification, structured extraction, and conversational interfaces.
Quantisation: The Practical Bits
Quantisation reduces the precision of model weights from 32-bit or 16-bit floating point down to 8-bit or 4-bit integers. The maths is straightforward — you’re trading precision for speed and memory.
What’s less obvious is which quantisation method to use. We’ve benchmarked GPTQ, AWQ, and GGUF across a range of tasks relevant to our clients:
- GGUF (llama.cpp) — the best choice for pure CPU inference. No GPU required at all. Runs on AVX2-capable processors and benefits significantly from AVX-512 on newer hardware. We use this for all air-gapped deployments.
- AWQ — better quality preservation at 4-bit than GPTQ, but needs GPU for inference. Not suitable for our CPU-only use cases. Worth considering if you have hybrid infrastructure.
- GPTQ — mature and well-supported, but GPU-dependent for practical speed. CPU inference exists but is dramatically slower than GGUF equivalents.
Choosing a Quantisation Level
Within GGUF, you have multiple precision options. Our rule of thumb based on production testing:
- Q8_0 — nearly indistinguishable from FP16. Use when memory isn’t the bottleneck and quality matters.
- Q5_K_M — sweet spot for most tasks. Small quality loss, significant memory savings.
- Q4_K_M — our default for production. Acceptable quality for most tasks, half the memory of Q8_0.
- Q3_K_M and below — quality degradation becomes noticeable. Only use when memory is severely constrained.
Anecdotally, the perplexity numbers you see in quantisation papers don’t fully capture the user-perceived quality difference. Test with your actual prompts and tasks before committing to a precision level.
Real-World Performance Benchmarks
On an Intel Xeon with 64GB RAM (typical server hardware in defence environments), we measure:
- Llama 2 7B (Q4_K_M): ~15 tokens/sec — usable for interactive applications.
- Mistral 7B (Q4_K_M): ~18 tokens/sec — slightly faster architecture, same memory profile.
- Llama 2 13B (Q4_K_M): ~8 tokens/sec — still practical for batch processing.
- Llama 2 70B (Q3_K_M): ~1-2 tokens/sec — only for offline batch work.
On consumer hardware (Apple Silicon M2 Pro or modern AMD Ryzen with 32GB RAM), the numbers are surprisingly competitive thanks to memory bandwidth advantages:
- Llama 2 7B (Q4_K_M) on M2 Pro: ~28 tokens/sec — faster than the Xeon because memory bandwidth matters more than core count for inference.
- Mistral 7B (Q4_K_M) on M2 Pro: ~35 tokens/sec.
These numbers aren’t going to win benchmarks against an A100, but they’re absolutely practical for document summarisation, classification, structured extraction, retrieval-augmented generation, and conversational interfaces — the tasks our defence, healthcare, and SaaS clients actually need.
Deployment Architecture
We wrap llama.cpp in a Go service that provides an OpenAI-compatible API. This means client applications don’t need to know whether they’re talking to a local CPU model or a cloud GPU service — the interface is identical. Swap one for the other with a config change.
This architecture has practical advantages:
- Hot-swap models — load multiple quantised models, route requests by complexity (small for classification, large for generation).
- Graceful degradation — under load, automatically downshift to smaller models rather than queue requests indefinitely.
- Compatibility — existing client libraries built for OpenAI work unchanged. Useful for clients migrating from cloud LLM APIs to on-prem.
- Observability — standard HTTP metrics, request tracing, model performance dashboards.
Threading and Memory Considerations
llama.cpp gives you fine-grained control over thread count, batch size, and context length. Common mistakes we’ve seen:
- Over-threading — using all CPU cores often performs worse than 4-8 threads due to memory bandwidth saturation. Benchmark with your hardware.
- Underestimating context cost — KV cache memory scales linearly with context length. A 32K context consumes significantly more RAM than 4K, sometimes blocking the model from loading at all.
- Misconfigured NUMA — on multi-socket servers, pinning the inference process to a single NUMA node typically outperforms unrestricted scheduling.
For production deployments, we instrument the inference service to track tokens-per-second, memory usage per session, and queue depth. These metrics catch performance regressions before they become user complaints.
When to Use CPU Inference
CPU inference isn’t a universal solution. It makes sense when:
- You can’t use cloud APIs (compliance, privacy, air-gap).
- Your throughput needs are modest (single-digit to low-double-digit concurrent users per server).
- Your latency targets are conversational, not real-time (sub-second tokens are fine, sub-millisecond aren’t needed).
- Your tasks fit smaller models (7B-13B parameter range covers most practical NLP work).
It doesn’t make sense when you need:
- Massive throughput (thousands of concurrent users) — GPU is more cost-effective at scale.
- Frontier-model capability (GPT-4 class) — the open models that fit on CPU are 6-12 months behind frontier in capability.
- Sub-100ms latency for real-time UX — cloud GPU inference is faster, period.
Tooling We Recommend
The CPU LLM ecosystem is healthy but fragmented. Our production stack:
- llama.cpp — the core inference engine. Active development, mature codebase, supports nearly every modern open model.
- llamafile — single-binary distribution of llama.cpp + model weights. Excellent for air-gapped or zero-dependency deployments.
- vLLM (CPU mode) — if you need batching across multiple concurrent users, vLLM’s CPU support is improving rapidly.
- Hugging Face Hub — for sourcing pre-quantised GGUF models.
TheBloke has been the most reliable community quantiser, but most major model releases now publish official GGUF variants.
Working With Us
If you’re running AI workloads in constrained environments — air-gapped, edge, on-prem, or cost-sensitive — we’d like to hear about your use case. We provide dedicated AI engineering services for clients deploying production ML without GPU infrastructure.
For the bigger picture on how we approach AI engineering — from animation retargeting to LLM deployment to ML pipelines — see our AI sector page.