Most teams talk about CPU inference like it is a temporary fallback. In practice, it is often the first real production environment because procurement, policy, and budget move slower than product demand.
The useful question is not, "Can CPU beat GPU?" It cannot.
The useful question is, "Can this product meet user expectations with CPU constraints?" In many internal workloads, the answer is yes.
CPU-first works when request shape is controlled and latency is treated as a product contract, not a best effort.
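Treating latency as a contract means writing the budget down where code can check it. A minimal sketch, with illustrative numbers and names (nothing here comes from a specific framework):

```typescript
// Hypothetical latency contract; the thresholds are examples, not recommendations.
type LatencyContract = {
  p50Ms: number          // median target
  p95Ms: number          // the tail latency the product actually promises
  maxQueueWaitMs: number // admission ceiling before the request should fast-fail
}

const internalQaContract: LatencyContract = {
  p50Ms: 1500,
  p95Ms: 4000,
  maxQueueWaitMs: 1000,
}

// True when an observed request stayed inside the contract.
function withinContract(c: LatencyContract, totalMs: number, queueWaitMs: number): boolean {
  return totalMs <= c.p95Ms && queueWaitMs <= c.maxQueueWaitMs
}
```

The point is not the specific numbers; it is that the budget exists as a checkable artifact rather than a slide.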
## Where CPU works well, and where it struggles
CPU deployments fail when every request is allowed to become a long, open-ended generation path. They succeed when teams intentionally bound context, output size, and concurrency behavior.
| Workload pattern | CPU viability | Why |
|---|---|---|
| Retrieval-grounded Q&A with bounded context | Strong | Predictable token counts and stable response paths |
| Structured extraction and classification | Strong | Short outputs and repeatable prompt templates |
| Internal copilots at moderate concurrency | Moderate to strong | Works with queue controls and clear admission limits |
| Long-form creative generation at high concurrency | Weak | Tail latency and queue pressure grow quickly |
| Tool-heavy chains with serial external calls | Mixed | Orchestration latency often dominates model time |
This is why CPU performance is usually a system design issue before it is a model issue.
## The architecture pattern that keeps latency sane
The most reliable pattern is role separation. A smaller model handles light routing and normalization tasks, while a medium model handles the final grounded response. That avoids spending expensive generation cycles on requests that could have been resolved earlier in the pipeline.
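The separation can be sketched in a few lines. Model calls are stubbed out here, and the intent categories are illustrative; the point is that only one branch ever pays for the medium model:

```typescript
// Sketch of role separation. In a real system the small model itself would
// produce the intent; here a typed argument stands in for its output.
type Route = 'small_only' | 'small_then_medium'

function routeRequest(intent: 'classify' | 'extract' | 'grounded_answer'): Route {
  // Light tasks stop at the small model; only grounded answers
  // spend medium-model generation cycles.
  return intent === 'grounded_answer' ? 'small_then_medium' : 'small_only'
}
```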
Request shaping matters just as much. Hard token ceilings, retrieval limits, and fast-fail behavior for over-budget requests give you predictability that raw benchmark tuning cannot.
```yaml
inference:
  max_input_tokens: 2200
  max_output_tokens: 320
  queue_timeout_ms: 12000
  cancel_on_client_disconnect: true
retrieval:
  max_chunks: 8
  rerank_top_n: 4
routing:
  default_path: "small_model_then_medium_model"
  over_budget_behavior: "return_short_answer_with_source_links"
```
Configuration like this is not glamorous, but it prevents the majority of runaway latency incidents.
## Measure phases, not a single latency number
A single end-to-end metric hides where CPU systems actually degrade. You want phase-level visibility so the team can separate queue pressure from retrieval delay from generation delay.
```typescript
// trackEvent is assumed to be the team's existing telemetry sink.
type CpuPhaseMetrics = {
  submitToAckMs: number
  queueWaitMs: number
  retrievalMs: number
  generationMs: number
  completeMs: number
  route: string
}

export function recordCpuInferenceMetrics(m: CpuPhaseMetrics) {
  trackEvent('cpu_inference_phase_metrics', {
    route: m.route,
    submit_to_ack_ms: m.submitToAckMs,
    queue_wait_ms: m.queueWaitMs,
    retrieval_ms: m.retrievalMs,
    generation_ms: m.generationMs,
    complete_ms: m.completeMs,
  })
}
```
When queue wait is rising faster than generation time, you have a capacity and admission problem. When generation dominates, you have a request-shaping or model-path problem.
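That rule of thumb can run directly on the phase metrics. A sketch of the heuristic, with an illustrative dominance ratio:

```typescript
type Bottleneck = 'capacity_or_admission' | 'request_shaping_or_model_path' | 'balanced'

// Queue wait dominating points at capacity and admission control;
// generation dominating points at request shape or model path.
// The 1.5x ratio is an illustrative threshold, not a standard.
function classifyBottleneck(queueWaitMs: number, generationMs: number): Bottleneck {
  if (queueWaitMs > generationMs * 1.5) return 'capacity_or_admission'
  if (generationMs > queueWaitMs * 1.5) return 'request_shaping_or_model_path'
  return 'balanced'
}
```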
## What good runtime telemetry looks like
Operational triage gets much faster when each request emits a compact, structured payload.
```json
{
  "request_id": "rq-7f3b2c",
  "route": "small_model_then_medium_model",
  "tokens_in": 1840,
  "tokens_out": 241,
  "queue_wait_ms": 920,
  "retrieval_ms": 310,
  "generation_ms": 2870,
  "status": "ok"
}
```
That gives you enough signal to answer, in one glance, whether the system is compute-bound, queue-bound, or orchestration-bound.
## When GPU migration is the right move
CPU-first should not become CPU-only dogma. Move to GPU when product thresholds are consistently missed after request shaping and queue controls are already in place.
Typical triggers are persistent p95 latency beyond SLA, queue waits that break interaction flow, or quality requirements that need larger models than CPU can serve economically.
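The p95 trigger can be checked mechanically from a window of recent end-to-end latencies. A sketch using nearest-rank p95 and an illustrative SLA:

```typescript
// Nearest-rank p95 over a window of latency samples.
function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b)
  const idx = Math.ceil(0.95 * sorted.length) - 1
  return sorted[idx]
}

const SLA_P95_MS = 4000 // illustrative product threshold

// Migration signal: p95 over the SLA even though shaping and
// queue controls are already in place.
function shouldConsiderGpu(samplesMs: number[]): boolean {
  return samplesMs.length > 0 && p95(samplesMs) > SLA_P95_MS
}
```

In practice this check belongs on a sustained window (days, not minutes) so a single traffic spike does not trigger a migration conversation.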
## Final note
Running modern LLMs without a GPU is a valid production strategy, not a stunt. The teams that succeed are not the ones with the cleverest benchmark screenshots. They are the ones that treat latency budgets, admission control, and request design as core product engineering.