How Modern LLM Serving Systems Actually Work
I'm an AI and Quality Engineering Lead at HBLAB, Vietnam's trusted partner for transforming enterprises with modern technology.
A Technical Breakdown of the Stack Behind Fast, Cheap Inference
Running a large language model in production is nothing like running one in a notebook. The gap between "it works on my A100" and "it serves 10,000 users at acceptable latency for under $X/hour" is where most teams discover that model quality is only half the problem. The other half is the serving stack.
This post breaks down the core techniques that modern inference systems use to close that gap: batching strategies, KV cache management, attention optimizations, speculative decoding, and quantization. Each section explains not just what the technique does, but why naive approaches fail first.
The Architecture You Are Actually Deploying Into
Before examining individual techniques, the overall system shape matters.
When a user sends a request to an LLM-powered application, that request travels to an inference server, joins a queue, and gets managed by a scheduler. The scheduler decides which requests enter processing based on available hardware resources. Inside the inference server, an inference engine groups incoming requests into batches before GPU computation begins.
Every optimization described below targets a specific bottleneck somewhere in this pipeline. Understanding the pipeline prevents the mistake of applying the right technique to the wrong problem.
1. Prefill and Decode: The Two Phases of Inference
LLM inference is not a single uniform computation. It splits into two distinct phases with fundamentally different performance profiles.
Prefill
During prefill, the model processes the entire input prompt at once. It tokenizes the input, runs the full forward pass across all input tokens simultaneously, and builds the KV cache (more on this shortly). Because the entire input is available upfront, this phase is highly parallelizable and makes aggressive use of GPU compute.
Prefill is compute-bound. The bottleneck is raw floating-point throughput.
Decode
Decode generates output tokens one at a time. Each new token depends on all previous tokens, which means the model cannot parallelize across output positions. It reads the KV cache produced during prefill and appends to it with every step.
Decode is memory-bandwidth-bound. The bottleneck is how fast the GPU can read model weights and cache data from memory, not how fast it can compute. Every step must stream the full set of weights to produce a single token per sequence, which is why decode is far slower per token than prefill, even though each step handles one token rather than hundreds.
This asymmetry drives most of the architectural decisions that follow.
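A back-of-the-envelope calculation makes the asymmetry concrete. The sketch below assumes the common approximation of about 2 FLOPs per parameter per token for a dense forward pass, and it ignores attention and KV cache traffic. The function names and the hardware numbers in the usage note are hypothetical round figures, not measurements of any specific GPU.

```python
def decode_step_floor_ms(weight_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight byte is read from memory once."""
    return weight_bytes / mem_bandwidth_bytes_per_s * 1e3


def prefill_floor_ms(n_params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Lower bound on prefill, assuming ~2 FLOPs per parameter per prompt token."""
    return 2 * n_params * prompt_tokens / peak_flops * 1e3


# Hypothetical setup: 7B parameters in 16-bit weights (14 GB),
# 2 TB/s memory bandwidth, 300 TFLOP/s peak compute.
decode_ms = decode_step_floor_ms(14e9, 2e12)
prefill_ms = prefill_floor_ms(7e9, 1000, 300e12)
prefill_ms_per_token = prefill_ms / 1000
```

Under these assumed numbers, one decode step cannot finish faster than about 7 ms, while prefill works through a 1,000-token prompt in roughly 47 ms, or under 0.05 ms per token.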
2. Batching: From Naive to Continuous
GPUs are parallel processors. Their throughput advantage over CPUs depends on keeping thousands of cores occupied simultaneously. A GPU running inference on a single request at a time wastes most of that parallelism.
Batching addresses this by loading model weights once and running multiple requests through a single forward pass. One weight load, multiple sequences processed. The result is higher memory bandwidth utilization, higher throughput, and lower per-request cost.
Why Naive Batching Breaks Down
The standard approach, processing a fixed batch until every sequence finishes, fails under real-world conditions for one structural reason: sequences in a batch finish at different times.
Consider a batch of eight requests. One finishes after 12 tokens, three finish after 40, and two run to 200, with the other two landing somewhere in between. Under naive batching, the batch does not close until the longest sequence finishes, so the GPU slots held by finished sequences sit idle. Those empty slots are compute you already paid for and did not use.
This matters more in production than in benchmarks. Chatbot inputs vary from one sentence to multi-page documents. Outputs range from a single word to full essays. Uniform batch sizes cannot accommodate this variance efficiently.
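To put a number on the idle slots, the sketch below counts slot-steps for a static batch. It uses only the six output lengths that the example above spells out, and `static_batch_waste` is a hypothetical helper name. It also assumes every sequence occupies its slot until the longest one finishes, which is the naive behavior described here.

```python
def static_batch_waste(output_lengths):
    """Fraction of slot-steps spent idle when a batch runs until its longest sequence ends."""
    steps = max(output_lengths)                 # the batch closes with the longest sequence
    total_slots = steps * len(output_lengths)   # slot-steps the GPU reserves
    useful = sum(output_lengths)                # slot-steps that actually produce a token
    return 1 - useful / total_slots


lengths = [12, 40, 40, 40, 200, 200]
waste = static_batch_waste(lengths)
```

For these lengths, 532 of the 1,200 reserved slot-steps produce tokens, so a little over half of the batch capacity is idle.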
Continuous Batching (ORCA)
The ORCA system introduced iteration-level scheduling to solve this. Instead of fixing batch membership for the duration of a request, the scheduler makes a new batching decision at every forward pass iteration.
When a sequence generates its stop token, it leaves the batch immediately. A new sequence from the queue takes its slot at the next iteration, so the GPU stays close to fully occupied as long as requests are waiting.
ORCA also applies selective batching: for most operations (matrix multiplications, normalization layers), tokens from different requests are flattened into a single 2D tensor and processed together as if they were one request. For attention operations, which require distinct KV pairs per request, the batch splits and each request processes independently before outputs merge again.
The scheduler itself tracks memory availability across requests, prioritizes by arrival time, and pipelines multiple batches across workers to minimize idle time between iterations.
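The difference between the two scheduling policies can be sketched with a toy simulation. This is a simplified model, not ORCA's scheduler: it assumes one token per active sequence per forward pass, ignores prefill cost and memory tracking, and admits requests strictly in arrival order. The function names are hypothetical.

```python
from collections import deque


def static_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Iteration-level scheduling: free slots are refilled before every forward pass."""
    queue = deque(lengths)   # waiting requests, FIFO by arrival
    running = []             # tokens still to generate for each active sequence
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1           # one forward pass emits one token per active sequence
        running = [r - 1 for r in running if r > 1]   # finished sequences leave now
    return steps


lengths = [4, 1, 1, 1]
static_total = static_steps(lengths, slots=2)
continuous_total = continuous_steps(lengths, slots=2)
```

With two slots, the static schedule needs 5 forward passes for these four requests and the continuous schedule needs 4, because the short requests slip into the slot next to the long one instead of waiting for it.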
Continuous batching is now standard in production inference frameworks including vLLM, TensorRT-LLM, and SGLang.
3. Paged Attention: Fixing the KV Cache Memory Problem
KV cache is what makes decode fast. Without it, the model would recompute attention keys and values for every prior token on every decoding step, so the total work to generate a sequence grows at least quadratically with its length.
With KV cache, each token's key and value vectors are computed once (during prefill for prompt tokens, and at the step that produces them for generated tokens) and stored. Each decoding step reads from this store rather than recomputing, so the per-step cost becomes linear in sequence length.
The problem is memory.
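How much memory is easy to estimate: the cache holds one key vector and one value vector per token, per layer, per KV head. The sketch below applies that formula. The model shape in the usage lines (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumed configuration chosen for illustration, not one taken from this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed model shape, for illustration only.
per_token = kv_cache_bytes(80, 8, 128, n_tokens=1)
per_sequence = kv_cache_bytes(80, 8, 128, n_tokens=4096)
per_sequence_gib = per_sequence / 2**30
```

With this assumed shape, each token costs 320 KiB of cache and a single 4,096-token sequence costs 1.25 GiB, so a few dozen concurrent long sequences can rival the weights in memory use.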
What Traditional KV Cache Allocation Gets Wrong
Conventional systems allocate KV cache as a contiguous block at the start of a request, sized for the maximum possible output length. This creates three categories of waste:
Internal fragmentation: Memory is reserved for the maximum output length at step one, even though the model might generate only 10% of that. The unused allocation sits locked until the request finishes.
Reservation waste: Even partially used allocations block other requests from accessing the free portions. A block reserved for 2,048 tokens that has used 200 is functionally 1,848 tokens of inaccessible memory.
External fragmentation: Contiguous allocations of different sizes leave gaps between them that are too small for new requests and too scattered to combine.
Research measurements from the vLLM team found that 60 to 80% of allocated KV cache memory was wasted under these conditions.
How Paged Attention Works
Paged Attention borrows directly from operating system virtual memory management. Rather than one contiguous block per request, it allocates fixed-size pages of KV cache dynamically as the sequence grows.
A logical-to-physical page table maps each request's growing KV cache to physical memory pages wherever they are available, scattered across GPU memory. Pages are allocated on demand, one block at a time. When a request finishes, its pages return to the free pool immediately.
This eliminates external fragmentation entirely. Internal fragmentation drops to at most one partial page per request (the last page, which may not be fully used) rather than potentially thousands of wasted tokens per request.
Under Paged Attention, measured waste drops from 60 to 80% to approximately 4%. This directly increases the maximum achievable batch size at a given memory budget.
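The bookkeeping behind this can be sketched in a few lines. The class below is a minimal illustration of on-demand page allocation with a per-request page table. It is not vLLM's implementation: it tracks page ids only, stores no tensors, and all names are hypothetical.

```python
class PagedKVAllocator:
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # physical page ids not in use
        self.page_tables = {}                      # request id -> list of physical page ids
        self.lengths = {}                          # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token, taking a new page only when needed."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.page_size:       # last page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())    # any free page will do
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

    def wasted_slots(self, req_id):
        """Unused token slots; always less than one page."""
        return len(self.page_tables[req_id]) * self.page_size - self.lengths[req_id]


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(20):
    alloc.append_token("req-a")
```

After 20 tokens with 16-token pages, the request holds two pages and wastes 12 slots, all of them in the last page. Nothing is reserved for output that has not been generated yet.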
Paged Attention also enables parallel sampling without duplication: multiple output sequences generated from the same prompt can share KV cache pages for the prompt portion through copy-on-write, allocating separate pages only when sequences diverge.
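Copy-on-write sharing comes down to reference counting on pages. The sketch below is a minimal illustration under that assumption: it tracks page ids and reference counts only, and a real system would also copy the page contents when a shared page is split. The class and method names are hypothetical.

```python
class SharedPages:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = {}                 # physical page id -> number of sequences using it

    def alloc(self):
        page = self.free.pop()
        self.refcount[page] = 1
        return page

    def fork(self, table):
        """Create a second sequence that shares every page of the first."""
        for page in table:
            self.refcount[page] += 1
        return list(table)

    def writable_last_page(self, table):
        """Return a page this sequence may write to, splitting it off if it is shared."""
        page = table[-1]
        if self.refcount[page] > 1:
            self.refcount[page] -= 1
            table[-1] = self.alloc()       # a real system copies the page contents here
        return table[-1]


pool = SharedPages(num_pages=8)
parent = [pool.alloc(), pool.alloc()]      # pages holding the shared prompt
child = pool.fork(parent)                  # second sample from the same prompt
child_page = pool.writable_last_page(child)
```

The child keeps sharing the first prompt page and gets a private copy of the last one only when it is about to write, which is the point at which the two samples diverge.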
4. Radix Attention: Reusing KV Cache Across Requests
Paged Attention improves memory use within a request. Radix Attention addresses a different inefficiency: the KV cache built during one request's prefill is discarded after the request ends, even if the next request shares an identical prefix.
In practice, prefix sharing is common. A system prompt prepended to every user message. A few-shot example block reused across a batch of classification queries. A retrieval-augmented generation context retrieved from the same document. Each of these generates the same KV cache for the shared prefix, over and over.
The Radix Tree Structure
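Radix Attention stores KV cache entries in a radix tree (a compressed prefix tree) indexed by token sequences. Each edge is labeled with a run of tokens, and each node represents the prefix formed by the edges on the path from the root. The KV cache for a run of tokens is stored with the node it leads to, so any request that shares that prefix can reuse it.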
When a new request arrives, the system traverses the tree to find the longest matching prefix. If a match exists, the corresponding KV cache is reused for the matching portion, and only the novel suffix requires prefill computation.
For requests with no prefix match, Radix Attention behaves identically to standard KV cache. For requests with substantial shared prefixes, it can skip the prefill phase entirely for those tokens.
Cache eviction uses a Least Recently Used (LRU) policy. Nodes not accessed recently are evicted first, preserving the cache entries most likely to benefit future requests.
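Longest-prefix matching and LRU eviction can be sketched as follows. To stay short, this uses a plain per-token trie rather than a compressed radix tree, and it records only which prefixes are cached, not the KV tensors themselves. It illustrates the lookup and eviction logic and is not SGLang's implementation; all names are hypothetical.

```python
import itertools


class PrefixCache:
    """Per-token trie; a real radix tree would merge single-child chains into one edge."""

    def __init__(self):
        self.root = {"children": {}, "last_used": 0, "parent": None, "token": None}
        self.clock = itertools.count(1)    # logical time for LRU ordering

    def insert(self, tokens):
        node, now = self.root, next(self.clock)
        for tok in tokens:
            node = node["children"].setdefault(
                tok, {"children": {}, "last_used": now, "parent": node, "token": tok})
            node["last_used"] = now

    def match(self, tokens):
        """Return how many leading tokens already have cached entries."""
        node, matched, now = self.root, 0, next(self.clock)
        for tok in tokens:
            child = node["children"].get(tok)
            if child is None:
                break
            child["last_used"] = now
            node, matched = child, matched + 1
        return matched

    def evict_lru_leaf(self):
        """Drop the least recently used leaf; returns False if nothing is cached."""
        leaves, stack = [], list(self.root["children"].values())
        while stack:
            n = stack.pop()
            if n["children"]:
                stack.extend(n["children"].values())
            else:
                leaves.append(n)
        if not leaves:
            return False
        victim = min(leaves, key=lambda n: n["last_used"])
        del victim["parent"]["children"][victim["token"]]
        return True


cache = PrefixCache()
cache.insert([1, 2, 3, 4])            # for example, system prompt plus first user message
cache.insert([1, 2, 9])               # same system prompt, different user message
reused = cache.match([1, 2, 3, 7])    # only the last token needs prefill
```

Evicting leaves first means a shared prefix is dropped only after every longer sequence that depends on it has gone.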
Radix Attention is most effective in high-query-rate scenarios with structured, repeated inputs. In workloads where every request is fully novel, tree management adds overhead and delivers no benefit.
5. Speculative Decoding: Running Two Models to Go Faster
The decode phase is the primary latency bottleneck for long outputs. Each token requires one full forward pass of the model. For a 70-billion parameter model, that is expensive per token, and the cost compounds across hundreds of output tokens.
Speculative decoding reduces the number of forward passes required from the large target model by using a smaller, faster draft model to propose candidate tokens in bulk, then verifying them in a single parallel forward pass of the target model.
How the Process Works
Step 1: Draft. The draft model (typically a distilled or smaller version of the target) generates K tokens autoregressively. This is fast because the draft model is small.
Step 2: Verify. The K candidate tokens are fed into the target model in a single forward pass. Because the target model processes them in parallel rather than sequentially, this costs roughly one target model call for the small values of K used in practice.
Step 3: Accept or Reject. The target model's probability distribution at each position is compared against the draft model's. In the standard formulation, a draft token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to it, so tokens the target rates at least as likely as the draft did are always accepted. The first rejected token triggers a fallback: a replacement is sampled from the target's distribution adjusted for the rejection, the remaining draft tokens are discarded, and generation continues from there. This rule keeps the output distribution identical to sampling from the target model alone.
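The accept-or-reject step can be sketched as below, using the rule from the speculative decoding paper listed under Further Reading: accept a draft token with probability min(1, p/q), and on rejection resample from the residual max(0, p - q), renormalized. Distributions are represented as plain dictionaries from token to probability, which is an assumption made for readability, and the function names are hypothetical.

```python
import random


def _sample(weights, rng):
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """
    draft_tokens: K token ids proposed by the draft model
    q_dists: K draft distributions the tokens were sampled from
    p_dists: K + 1 target distributions from one parallel target pass
    Returns the tokens to append to the output.
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)                       # accepted as is
            continue
        # Rejected: resample from the part of p that q under-covers, then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(_sample(residual, rng))
        return out
    # All K accepted: the same target pass also yields one extra token.
    out.append(_sample(p_dists[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
agree = verify_draft([1], [{1: 1.0}], [{1: 1.0}, {2: 1.0}], rng)
disagree = verify_draft(["a"], [{"a": 1.0}], [{"b": 1.0}, {"c": 1.0}], rng)
```

When the two models agree, the draft token is kept and the target contributes one more token from the same pass. When they disagree completely, the draft token is replaced by the target's choice and the rest of the draft is dropped.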
If the draft model closely matches the target, most tokens are accepted, and each target forward pass yields up to K + 1 tokens instead of one. If the draft model is poorly matched, many tokens are rejected, and the system can run slower than standard decoding because it pays for both models while gaining few tokens per pass.
The technique works best when the draft model shares training data and architecture patterns with the target model (making acceptance rates high) and when output tokens have predictable distributions (common in instruction-following and structured generation tasks).
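The effect of acceptance rate can be estimated with the closed form from the speculative decoding paper listed under Further Reading. It assumes each draft token is accepted independently with the same probability, written alpha here, which is a simplification. The function name is hypothetical.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(k + 1)            # every draft token accepted, plus one extra
    return (1 - alpha ** (k + 1)) / (1 - alpha)


good_match = expected_tokens_per_target_pass(0.8, k=4)
poor_match = expected_tokens_per_target_pass(0.2, k=4)
```

At an assumed 80% acceptance rate with four draft tokens, each target pass yields about 3.4 tokens. At 20% it yields about 1.25, which is unlikely to pay for four draft model calls.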
6. Quantization: Shrinking the Model Without Replacing It
A model with 70 billion parameters stored in 32-bit floating point requires approximately 280 GB of memory to load, before accounting for KV cache or activations. That is far more than a single 80 GB GPU holds, so the weights alone must be split across several devices. Quantization makes large models deployable on less hardware by reducing the bit-width of stored weights.
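The arithmetic behind that figure is parameters times bits per weight. The sketch below covers weights only and uses decimal gigabytes. `weight_memory_gb` is a hypothetical helper, and it ignores the small overhead that real quantization formats add for scale factors and zero-points.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


footprint = {bits: weight_memory_gb(70e9, bits) for bits in (32, 16, 8, 4)}
```

For 70 billion parameters this gives 280 GB at 32 bits, 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits.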
The Core Trade-off
Training uses high-precision weights (float32 or bfloat16) because gradient updates are small and accumulate over many steps. Inference does not require the same numerical range. Quantization maps the weight distribution to a lower-bit representation, accepting a small precision loss in exchange for a large reduction in memory footprint and faster compute on hardware that natively supports lower precision formats.
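Moving from float32 to int8 cuts weight memory to a quarter; moving from the more common 16-bit formats to int8 halves it. Moving from float32 to int4 reduces it by a factor of eight. Many NVIDIA GPUs accelerate int8 matrix math through tensor cores (int4 support varies by GPU generation), and smaller weights also mean less data to read during memory-bound decode, so inference often runs faster on quantized weights, not just cheaper.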
Quantization Strategies
Symmetric (linear) quantization maps the weight range to a symmetric integer scale centered at zero. Straightforward to implement, but less efficient when weight distributions are skewed, because the integer range is split evenly even if most values cluster on one side.
Asymmetric quantization defines the mapping using the actual minimum and maximum of each weight tensor, with an arbitrary zero-point. This uses the full integer range more efficiently for skewed distributions, reducing rounding error.
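The two mappings can be compared directly on a skewed set of values. The sketch below quantizes to 8 bits per tensor and then dequantizes, so the rounding error is visible. It is a minimal illustration with hypothetical function names: it assumes the values are not all identical, and production schemes usually work per channel or per group rather than per tensor.

```python
def quantize_symmetric(values, bits=8):
    """Round-trip through a signed integer grid centered at zero."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [x * scale for x in q]                   # dequantized values


def quantize_asymmetric(values, bits=8):
    """Round-trip through an unsigned grid fitted to the observed min and max."""
    qmax = 2 ** bits - 1                            # 255 for uint8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)                 # integer that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return [(x - zero_point) * scale for x in q]


def mean_abs_error(values, dequantized):
    return sum(abs(a - b) for a, b in zip(values, dequantized)) / len(values)


skewed = [i / 100 for i in range(101)]              # all values in [0, 1], none negative
err_sym = mean_abs_error(skewed, quantize_symmetric(skewed))
err_asym = mean_abs_error(skewed, quantize_asymmetric(skewed))
```

Because every value here is non-negative, the symmetric mapping leaves the negative half of its integer range unused, and its average rounding error comes out at roughly twice that of the asymmetric mapping.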
Post-Training Quantization (PTQ) applies quantization to a trained model without retraining. A small calibration dataset establishes scale factors. PTQ is fast to apply but can degrade accuracy, particularly on models with outlier weight values in certain layers.
Quantization-Aware Training (QAT) simulates quantization noise during training, allowing the model to adapt its weights to the precision constraints before deployment. QAT preserves accuracy better than PTQ but requires access to the training pipeline and substantial additional compute.
Mixed-precision quantization assigns different bit widths to different layers based on their sensitivity to precision loss. Layers identified as numerically sensitive retain higher precision (float16); less sensitive layers drop to int8 or int4. This approach often offers the best balance between compression and preserved accuracy and is common in production deployments of large models.
The practical risk in quantization is calibration error: applying the wrong scale factor or failing to account for activation outliers can degrade model behavior significantly without obvious failure modes. Models that produce coherent but subtly degraded outputs after quantization require careful evaluation, not just perplexity checks.
Putting It Together: How These Techniques Interact
These five techniques address distinct bottlenecks but do not operate in isolation.
Continuous batching maximizes GPU utilization across requests. Paged Attention makes it possible to run larger batch sizes within a fixed memory budget. Radix Attention reduces redundant prefill computation across requests with shared context. Speculative decoding reduces target model calls during the decode phase. Quantization shrinks the memory footprint of weights, freeing capacity for larger KV caches and batch sizes.
A production inference system for a large model typically combines several of these. vLLM implements Paged Attention and continuous batching, and offers its own automatic prefix caching. SGLang, where Radix Attention originated, builds its runtime around prefix sharing. TensorRT-LLM integrates quantization deeply into its compute kernels. Speculative decoding is increasingly standard across frameworks.
The right configuration depends on workload: prefix-sharing optimizations benefit high-throughput batch processing more than interactive chat. Speculative decoding benefits workloads with consistent output length distributions more than highly variable ones. Quantization trade-offs depend on how sensitive the specific model is to precision reduction.
What This Means for Teams Deploying LLMs
Understanding this stack changes how teams approach infrastructure decisions.
The choice between GPU SKUs is partially a KV cache problem, not just a compute problem. A GPU with higher memory bandwidth benefits decode-heavy workloads more than raw FLOPS numbers suggest.
Serving framework selection matters as much as model selection. Two teams running the same model with different inference stacks can achieve throughput differences of 3x to 5x at equivalent hardware cost.
Profiling must separate prefill latency from decode latency. Optimizing for one while ignoring the other produces misleading benchmarks that do not reflect user-facing performance.
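One way to keep the two phases apart is to record when each output token arrives and report two numbers per request: time to first token, which is dominated by queueing and prefill, and average time per output token after that, which reflects decode. The sketch below assumes the client collects these timestamps from a streaming response. The function and key names are hypothetical.

```python
def latency_breakdown(request_start, token_times):
    """token_times: timestamps in seconds at which each output token arrived."""
    ttft = token_times[0] - request_start                 # queueing plus prefill
    if len(token_times) > 1:
        # Average gap between consecutive tokens, i.e. decode speed.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}


stats = latency_breakdown(0.0, [0.5, 0.55, 0.6, 0.65])
```

In this made-up trace the request waits 0.5 s for its first token and then receives one token every 0.05 s. An end-to-end average would blur those two numbers together.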
The 60-80% memory waste figure from pre-Paged Attention systems is not historical trivia. Teams running older serving stacks or custom implementations without dynamic paging are likely experiencing this waste today.
Further Reading
Agrawal et al., "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (2023)
Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023), vLLM
Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023), Google Research
Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models" (2022)
Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)



