Recent advances in large language models (LLMs) such as GPT-4 and PaLM have led to transformative capabilities in natural language tasks. LLMs are being incorporated into applications such as chatbots, search engines, and programming assistants. However, serving LLMs at scale remains challenging because of their substantial GPU and memory requirements.
Approaches to overcome this generally fall into two main categories:
- Model Compression Techniques
These techniques aim to reduce the size of the model while maintaining accuracy. Common approaches include:
- Pruning – Removing redundant or less important parameters from the model. This creates a sparse model with fewer parameters.
- Quantization – Representing weights with lower-precision numbers, such as int8 instead of fp16, or bfloat16 instead of fp32. This reduces the memory footprint.
- Knowledge distillation – Training a smaller “student” model to mimic a large “teacher” model. The smaller model is then used for inference.
- Selective Execution
Rather than compressing the model, these techniques execute only parts of it for each inference:
- Sparse activations – Skipping computation on zero activations.
- Conditional computation – Executing only certain layers conditioned on the input.
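To make the quantization idea from the list above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and production systems typically use per-channel or per-group scales and dedicated kernels.

```python
from array import array

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Storage cost: 4 bytes per weight as fp32 versus 1 byte per weight as int8.
fp32_bytes = array("f", weights).itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
print(list(q), fp32_bytes, int8_bytes)
```

The trade-off is visible in the third weight: 0.003 is smaller than half a quantization step, so it rounds to zero. That loss of precision is the price of the 4x smaller footprint relative to fp32.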
Complementary to these model-level techniques, work on the systems side has produced serverless inference systems that enable faster deployment of LLMs. In serverless architectures, LLMs are hosted on shared GPU clusters and allocated dynamically based on demand. This allows efficient utilization of GPUs and reduces costs for developers. Prominent implementations include Amazon SageMaker, Microsoft Azure ML, and open-source options like KServe.
Despite the promise of serverless LLMs, existing systems exhibit high latency overheads that degrade user experience in interactive applications:
- Costly checkpoint downloads: LLMs have large memory footprints, often gigabytes to terabytes in size. Downloading checkpoints from remote storage is time-consuming, taking over 20 seconds even with optimized networks.
- Inefficient checkpoint loading: Even with local SSD storage, loading checkpoints into GPU memory takes tens of seconds due to factors like tensor deserialization and allocation. This adds significant delays beyond container startup time.
To address these issues, researchers led by the University of Edinburgh proposed ServerlessLLM, a system that achieves low-latency serverless inference for LLMs. ServerlessLLM improves locality by exploiting the abundant yet underutilized capacity and bandwidth of the multi-tier storage available on GPU servers.
Key Innovations in ServerlessLLM
ServerlessLLM incorporates several novel designs to slash LLM loading times in serverless environments:
- Rapid checkpoint loading
- Loading-optimized checkpoint format that enables fast sequential reading and efficient in-memory tensor addressing.
- Multi-tier checkpoint loading pipeline that maximizes bandwidth utilization across network, SSDs, DRAM, and GPU memory through techniques like direct I/O, pinned memory transfer, and parallelism.
- Live migration for locality-driven inference
- Token-based migration that only transmits essential prompt tokens over the network, avoiding slow snapshot transfer.
- Two-phase migration that allows uninterrupted inference by asynchronously recomputing cache states on the destination server before transferring final tokens.
- Latency-optimized server allocation
- Accurate models to estimate checkpoint loading times from each tier and migration times for a server.
- Locality-aware scheduler that selects servers minimizing expected startup latency using the above models.
These optimizations allow ServerlessLLM to reduce LLM loading times by 4-8X and end-to-end startup times by over 25X compared to existing checkpoint loaders and serving systems such as PyTorch, TensorFlow, and KServe.
Let's dive deeper into how ServerlessLLM achieves these significant performance gains.
Accelerating Checkpoint Loading
The first major bottleneck addressed by ServerlessLLM is the high latency of loading LLM checkpoints from storage into GPU memory.
To enable rapid checkpoint loading, ServerlessLLM introduces:
- Loading-optimized checkpoint format
Standard checkpoints used by frameworks like PyTorch are designed for model training and debugging. But for serverless inference, checkpoints are read-only and accessed repeatedly.
To optimize for such read-intensive usage, ServerlessLLM converts checkpoints into a format with two key properties:
- Sequential chunk-based reading: Tensors are grouped into per-GPU binary files, facilitating large sequential reads.
- Efficient tensor addressing: An index maps tensor names to memory offsets, allowing direct in-memory restoration without deserialization.
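The two properties above can be sketched in a few lines of Python. This is not ServerlessLLM's actual on-disk format. The file names, the JSON index, and the float32-only layout are assumptions made for illustration. The point is that one large sequential read plus an offset index lets each tensor be addressed in place, with no per-tensor deserialization.

```python
import json
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32 value

def save_checkpoint(tensors, data_path, index_path):
    """Write all tensors back to back and record where each one starts."""
    index, offset = {}, 0
    with open(data_path, "wb") as f:
        for name, values in tensors.items():
            raw = struct.pack(f"={len(values)}f", *values)  # native byte order
            f.write(raw)
            index[name] = {"offset": offset, "count": len(values)}
            offset += len(raw)
    with open(index_path, "w") as f:
        json.dump(index, f)

def load_checkpoint(data_path, index_path):
    """Read the data file in one sequential pass, then address tensors by offset."""
    with open(index_path) as f:
        index = json.load(f)
    with open(data_path, "rb") as f:
        view = memoryview(f.read())  # single large read
    tensors = {}
    for name, entry in index.items():
        start = entry["offset"]
        end = start + entry["count"] * FLOAT_SIZE
        # Zero-copy: a typed view into the buffer, not a parsed copy.
        tensors[name] = view[start:end].cast("f")
    return tensors

# Hypothetical tensors; values are exactly representable in float32.
original = {"layer0.weight": [1.0, 2.5, -3.0], "layer0.bias": [0.25]}
with tempfile.TemporaryDirectory() as d:
    data_path = os.path.join(d, "gpu0.bin")
    index_path = os.path.join(d, "gpu0.index.json")
    save_checkpoint(original, data_path, index_path)
    restored = {n: list(t) for n, t in load_checkpoint(data_path, index_path).items()}
print(restored)
```

By contrast, a pickle-style checkpoint has to be parsed object by object before any tensor is usable, which is one source of the deserialization overhead mentioned earlier.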
- Multi-tier checkpoint loading pipeline
ServerlessLLM exploits the tiered storage architecture of GPU servers: remote storage reached over the network, local SSDs (often NVMe), host DRAM, and GPU memory attached over PCIe.
The system incorporates a multi-stage pipeline to maximize bandwidth utilization across all tiers:
- In-memory data chunks are allocated using pinned memory for fast GPU transfer.
- Direct I/O is used for efficient SSD reads without caching overheads.
- Multiple threads read different storage chunks in parallel.
- Inter-stage coordination occurs via asynchronous task queues.
Together, these techniques saturate the bandwidth of even the fastest storage tiers, such as NVMe RAID. Experiments reveal that ServerlessLLM achieves 6-8X faster loading than PyTorch/TensorFlow, reducing startup times for large LLMs from over a minute to under 10 seconds.
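The overall shape of such a pipeline, with parallel readers feeding a copy stage through a queue, can be sketched with Python threads. This is a toy under stated assumptions: a `bytes` object stands in for the SSD, a `bytearray` stands in for GPU memory, and the chunk size is tiny. Real pinned memory and direct I/O need CUDA and OS-level APIs that are not shown here, and all names are hypothetical.

```python
import queue
import threading

CHUNK_SIZE = 4    # bytes per chunk; unrealistically small, for illustration
_DONE = object()  # sentinel a reader sends when it has no more chunks

def read_stage(storage, chunk_ids, out_q):
    """Stage 1: a reader thread fetches its share of chunks from 'storage'."""
    for i in chunk_ids:
        start = i * CHUNK_SIZE
        out_q.put((i, storage[start:start + CHUNK_SIZE]))
    out_q.put(_DONE)

def copy_stage(in_q, num_readers, dest):
    """Stage 2: copy chunks into the destination buffer as soon as they arrive."""
    finished = 0
    while finished < num_readers:
        item = in_q.get()
        if item is _DONE:
            finished += 1
            continue
        i, data = item
        dest[i * CHUNK_SIZE:i * CHUNK_SIZE + len(data)] = data

def load_pipelined(storage, num_readers=3):
    num_chunks = -(-len(storage) // CHUNK_SIZE)  # ceiling division
    dest = bytearray(len(storage))               # stands in for GPU memory
    q = queue.Queue(maxsize=8)                   # bounded queue between stages
    readers = [
        threading.Thread(
            target=read_stage,
            args=(storage, range(r, num_chunks, num_readers), q),
        )
        for r in range(num_readers)
    ]
    copier = threading.Thread(target=copy_stage, args=(q, num_readers, dest))
    for t in readers + [copier]:
        t.start()
    for t in readers + [copier]:
        t.join()
    return bytes(dest)

checkpoint = bytes(range(50))  # hypothetical 50-byte "checkpoint"
loaded = load_pipelined(checkpoint)
print(loaded == checkpoint)
```

Because the stages overlap, the copy stage is already moving early chunks while later chunks are still being read. The bounded queue keeps a fast reader from buffering the whole checkpoint in memory.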
Locality-Driven LLM Inference via Live Migration
With accelerated loading, ServerlessLLM faces a new challenge – how to leverage pre-loaded checkpoints for locality without interrupting ongoing inferences on busy servers?
ServerlessLLM introduces a novel technique – live migration of LLM inference across GPU servers. This allows seamlessly transferring execution to servers with local checkpoints available.
Key enablers of live LLM migration:
- Token-based migration
Rather than snapshotting the entire inference state, ServerlessLLM migrates only the tokens (the prompt plus any output generated so far) over the network, and the destination rebuilds its state from them. This transfers orders of magnitude less data than a snapshot.
- Two-phase migration
The destination server asynchronously precomputes cache states from the prompt tokens. Once it is ready, the source server transfers the final tokens and then releases its resources. This prevents inference stalls.
Experiments reveal that token-based migration slashes migration times from tens of seconds to under a second even for long sequences. Live migration is crucial to prevent queuing delays when achieving locality-driven allocation.
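A toy simulation can make the two phases easier to follow. Everything here is hypothetical: `ToyServer` stands in for a GPU server, its `cache` list stands in for the KV cache, and the next-token rule is an arbitrary deterministic function. The asynchronous recomputation is also simulated sequentially. The sketch only shows that tokens alone are enough for the destination to continue the same generation.

```python
class ToyServer:
    def __init__(self, name):
        self.name = name
        self.cache = []  # stands in for the KV cache: one entry per processed token

    def prefill(self, tokens):
        """Recompute cache entries for any tokens not yet processed."""
        self.cache.extend(tokens[len(self.cache):])

    def decode_one(self, tokens):
        """Produce the next token from the history (arbitrary toy rule)."""
        self.prefill(tokens)
        return (sum(tokens) * 31 + len(tokens)) % 1000

def generate(server, tokens, steps):
    for _ in range(steps):
        tokens.append(server.decode_one(tokens))
    return tokens

def live_migrate(src, dst, tokens, steps_during_recompute=3):
    # Phase 1: send the tokens so far; dst rebuilds its cache from them
    # while src keeps serving the request (simulated sequentially here).
    snapshot = list(tokens)
    dst.prefill(snapshot)
    generate(src, tokens, steps_during_recompute)
    # Phase 2: send only the tokens produced in the meantime, then hand over.
    delta = tokens[len(snapshot):]
    dst.prefill(snapshot + delta)
    src.cache.clear()  # the source releases its resources
    return len(snapshot) + len(delta)  # number of tokens sent over the network

prompt = [5, 17, 42]  # hypothetical prompt token IDs

# Reference run: 8 output tokens on a single server, no migration.
baseline = generate(ToyServer("solo"), list(prompt), 8)

# Migrated run: 2 tokens on src, 3 more during migration, 3 on dst.
src, dst = ToyServer("src"), ToyServer("dst")
tokens = generate(src, list(prompt), 2)
sent = live_migrate(src, dst, tokens)
migrated = generate(dst, tokens, 3)
print(migrated == baseline, sent)
```

Only eight integers cross the network in this run, whereas a snapshot would have to carry the whole cache, which for a real model is far larger than the token list.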
Latency-Optimized Model Scheduling
To minimize end-to-end latency, ServerlessLLM extends the scheduler to account for locality when selecting a server. This involves:
- Fine-grained loading time estimator
Models predict loading times from network, SSD caches, and memory for each server using metrics like queue delays, model sizes, and measured bandwidth.
- Accurate migration time predictor
The scheduler estimates migration times for servers using the number of prompt and output tokens. It tracks inference progress asynchronously to avoid overhead.
- Locality-aware allocation
For each inference request, the scheduler evaluates estimated loading and migration times across servers. It selects the server minimizing expected startup latency.
The scheduler also maintains server task queues and leverages a strongly consistent store for fault tolerance. Together, these innovations reduce scheduling overheads while maximizing locality benefits.
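The selection logic described above can be sketched as a simple cost comparison. The cost models here are deliberate simplifications and not the paper's estimators: loading time is taken as queue delay plus checkpoint size divided by the bandwidth of the fastest tier holding the checkpoint, and migration time as a hypothetical fixed cost per token. All names and numbers are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    bandwidth_gb_per_s: dict                    # measured bandwidth per tier
    cached: dict = field(default_factory=dict)  # model -> fastest tier holding it
    queue_delay_s: float = 0.0
    busy_tokens: int = 0                        # tokens of a running inference, 0 if idle

def loading_time(server, model, size_gb):
    """Estimated seconds to get the checkpoint into GPU memory on this server."""
    tier = server.cached.get(model, "remote")   # fall back to remote download
    return server.queue_delay_s + size_gb / server.bandwidth_gb_per_s[tier]

def migration_time(server, per_token_s=0.001):
    """Estimated seconds to migrate away a running inference (assumed linear)."""
    return server.busy_tokens * per_token_s

def pick_server(servers, model, size_gb):
    """Choose the server with the lowest estimated startup latency."""
    return min(
        servers,
        key=lambda s: loading_time(s, model, size_gb) + migration_time(s),
    )

tiers = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}  # hypothetical GB/s figures
servers = [
    Server("A", tiers),                                      # idle, nothing cached
    Server("B", tiers, cached={"opt-13b": "ssd"}),           # idle, checkpoint on SSD
    Server("C", tiers, cached={"opt-13b": "dram"}, busy_tokens=800),  # busy, in DRAM
]
best = pick_server(servers, "opt-13b", size_gb=26.0)
print(best.name)
```

With these numbers the estimates are 26 s for A, 5.2 s for B, and about 2.1 s for C (1.3 s of loading plus 0.8 s of migration). The busy server with the checkpoint already in DRAM wins, which is exactly the case where live migration pays off.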
Evaluating ServerlessLLM Performance
Comprehensive experiments benchmark the end-to-end effectiveness of ServerlessLLM against existing systems using real-world models like OPT-175B and workloads modeled after Azure traces.
- Microbenchmarks: ServerlessLLM accelerates checkpoint loading by 3.6-8.2X over PyTorch/TensorFlow. It fully saturates storage bandwidth, even for cutting-edge NVMe RAID.
- Scheduling: ServerlessLLM reduces allocation latency by 4-12X compared to random scheduling, highlighting benefits of locality-awareness. Live migration prevents queuing delays.
- End-to-end serving: For large models like OPT-30B, ServerlessLLM improves 99th percentile latency by 28-200X over systems like KServe and Ray Serve. It also enhances resource efficiency.
These substantial gains demonstrate ServerlessLLM's ability to overcome bottlenecks in existing serverless implementations and unlock the power of LLMs for interactive services.
The optimizations introduced in ServerlessLLM, like multi-tier loading, live migration, and latency-driven scheduling, can help inform the design of future serverless architectures. The system's ability to slash loading and startup times unblocks the scalable deployment of large language models for practical applications.
Looking Ahead: Ongoing Challenges
While a significant leap forward, ServerlessLLM represents only the first step in optimizing serverless inference for massive LLMs. Several open problems remain, including:
- Predicting real-time model demand to guide provisioning and pre-loading
- Intelligently placing checkpoints across servers to maximize cache hits
- Efficiently scaling scheduling algorithms to handle larger clusters
- Ensuring fairness in resource allocation across models and developers
- Generalizing innovations like live migration to other serverless workloads
Addressing these areas can help build on the promise of serverless LLMs and make their capabilities even more accessible. Beyond system-level optimizations, reducing the egregious carbon footprint and potential harms of large models also remains an urgent priority.
ServerlessLLM demonstrates that tremendous headroom exists for innovation in next-generation serverless architectures for AI workloads. As LLMs continue ballooning in size and popularity, solutions like ServerlessLLM that unlock their scalability will grow even more impactful. The confluence of systems and machine learning research can introduce new paradigms in serving, sharing, and scaling AI models safely and sustainably.
The post The Future of Serverless Inference for Large Language Models appeared first on Unite.AI.