← Back to Blog
AI

Running LLMs on CPU: Quantisation in Practice

5 April 2026 · 11 min read

Not every deployment has access to GPU infrastructure. In defence environments especially, you’re often working with air-gapped systems running commodity hardware. We’ve spent the last year making AI inference practical on CPU-only machines — here’s what actually works.

This is a practical guide based on production deployments, not benchmarks-on-paper. If you’re considering running LLMs without a GPU farm, this article covers the tooling, quantisation techniques, performance expectations, and architectural patterns we use in real client work.

Why CPU Inference Matters in 2026

GPU availability is a luxury, and a costly one. Beyond the obvious cost issue, many of our clients operate in environments where GPU hardware isn’t available, isn’t approved, or isn’t practical:

The good news is that modern quantisation techniques have closed the gap dramatically. A 7B parameter model quantised to 4-bit runs comfortably on a modern desktop CPU with 16GB of RAM — producing usable output for document analysis, classification, structured extraction, and conversational interfaces.

Quantisation: The Practical Bits

Quantisation reduces the precision of model weights from 32-bit or 16-bit floating point down to 8-bit or 4-bit integers. The maths is straightforward — you’re trading precision for speed and memory.

What’s less obvious is which quantisation method to use. We’ve benchmarked GPTQ, AWQ, and GGUF across a range of tasks relevant to our clients:

Choosing a Quantisation Level

Within GGUF, you have multiple precision options. Our rule of thumb based on production testing:

Anecdotally, the perplexity numbers you see in quantisation papers don’t fully capture the user-perceived quality difference. Test with your actual prompts and tasks before committing to a precision level.

Real-World Performance Benchmarks

On an Intel Xeon with 64GB RAM (typical server hardware in defence environments), we measure:

On consumer hardware (Apple Silicon M2 Pro or modern AMD Ryzen with 32GB RAM), the numbers are surprisingly competitive thanks to memory bandwidth advantages:

These numbers aren’t going to win benchmarks against an A100, but they’re absolutely practical for document summarisation, classification, structured extraction, retrieval-augmented generation, and conversational interfaces — the tasks our defence, healthcare, and SaaS clients actually need.

Deployment Architecture

We wrap llama.cpp in a Go service that provides an OpenAI-compatible API. This means client applications don’t need to know whether they’re talking to a local CPU model or a cloud GPU service — the interface is identical. Swap one for the other with a config change.

This architecture has practical advantages:

Threading and Memory Considerations

llama.cpp gives you fine-grained control over thread count, batch size, and context length. Common mistakes we’ve seen:

For production deployments, we instrument the inference service to track tokens-per-second, memory usage per session, and queue depth. These metrics catch performance regressions before they become user complaints.

When to Use CPU Inference

CPU inference isn’t a universal solution. It makes sense when:

It doesn’t make sense when you need:

Tooling We Recommend

The CPU LLM ecosystem is healthy but fragmented. Our production stack:

Working With Us

If you’re running AI workloads in constrained environments — air-gapped, edge, on-prem, or cost-sensitive — we’d like to hear about your use case. We provide dedicated AI engineering services for clients deploying production ML without GPU infrastructure.

For the bigger picture on how we approach AI engineering — from animation retargeting to LLM deployment to ML pipelines — see our AI sector page.