LLM training spot instances: Runpod, Vast.ai, Vultr compared for cost
Cut LLM training costs with spot instances. We compare Runpod, Vast.ai, and Vultr's A100 spot pricing, preemption rates, and strategies for resilient training.
On a quiet Tuesday in mid-May 2026, we stumbled upon an A100 80GB instance on Vast.ai’s marketplace for an absurd $0.48/hr. This wasn’t a fluke; it was a spot instance, available for the taking, and it immediately made us question every “on-demand” dollar we’d spent on LLM training. The promise of cutting costs by half or more for interruptible workloads is enticing, but the reality of managing preemption can quickly erode those savings. We’ve spent the last few weeks digging through pricing pages, availability dashboards, and preemption policies for Runpod, Vast.ai, and Vultr to see where the real savings and headaches lie when training LLMs on spot instances.
Why spot instances are a game-changer for LLM training costs
Spot instances are the deep discounts of the cloud world. Providers offer spare compute capacity at drastically reduced rates, sometimes 70-90% off on-demand prices. The catch, of course, is preemption: the provider can reclaim the instance with little warning if demand for on-demand capacity rises. For LLM training, especially experimental runs, hyperparameter tuning, or even long-running pre-training with robust checkpointing, this trade-off can be incredibly attractive. Why pay full price for an A100 when you can design your workflow to recover gracefully from an interruption? The delta between a standard on-demand A100 and a spot A100 can mean the difference between affording a new experiment every day and affording one a week. For a detailed look at standard A100 costs, check out our A100 Cloud Pricing Comparison. The goal here isn’t just to find the cheapest hour, but the cheapest reliable hour when preemption is a constant threat.
Runpod’s spot market: what to expect for LLM workloads
Runpod offers what they call “Secure Cloud Spot” instances, often at a significant discount compared to their on-demand rates. The appeal is straightforward: you pick your GPU, OS, and software, and if it’s available as a spot instance, you get a lower hourly rate. As of late May 2026, we found that Runpod A100 80GB secure cloud spot instance pricing can be as low as $0.79/hr. This is a static, published rate, which brings a degree of predictability that the more dynamic marketplaces lack.
Availability for these lower-cost A100s can fluctuate, particularly during peak times, but we generally found them more reliably listed than on some other platforms. For LLM training, the secure cloud environment means you’re not sharing your host machine with unknown tenants, which is a common concern on truly decentralized marketplaces. This also means you’re generally getting a more consistent network and storage experience. Speaking of storage, Runpod’s block storage costs $0.05/GB per month. While this seems low, it’s worth factoring into your long-term training costs, especially if your datasets are large or you’re managing numerous checkpoints. The preemption risk is still there, but in our experience monitoring their network, it feels a bit less volatile than some truly open markets.
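To put that storage rate in context, here is a rough sketch of the arithmetic. The dataset size, checkpoint size, and number of retained checkpoints are made-up inputs for illustration, not measurements from our runs; only the $0.05/GB per month and $0.79/hr figures come from the pricing quoted above.

```python
# Rates quoted in this article (late May 2026); everything else is a hypothetical input.
STORAGE_USD_PER_GB_MONTH = 0.05
SPOT_A100_USD_PER_HOUR = 0.79


def monthly_storage_cost(dataset_gb: float, checkpoint_gb: float, checkpoints_kept: int,
                         usd_per_gb_month: float = STORAGE_USD_PER_GB_MONTH) -> float:
    """Monthly block storage bill for one dataset plus a set of retained checkpoints."""
    total_gb = dataset_gb + checkpoint_gb * checkpoints_kept
    return total_gb * usd_per_gb_month


def storage_as_gpu_hours(storage_usd: float,
                         gpu_usd_per_hour: float = SPOT_A100_USD_PER_HOUR) -> float:
    """Express a storage bill as the number of spot GPU hours it would have bought."""
    return storage_usd / gpu_usd_per_hour


# Hypothetical project: 500 GB of data, five retained checkpoints of 40 GB each.
cost = monthly_storage_cost(dataset_gb=500, checkpoint_gb=40, checkpoints_kept=5)
print(f"storage: ${cost:.2f}/month, or {storage_as_gpu_hours(cost):.1f} spot A100 hours")
```

With those inputs the bill is roughly $35 a month, which is small next to the GPU spend but not nothing if you keep every checkpoint forever.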
Vast.ai’s marketplace: the ultimate bargain hunter’s guide
Vast.ai operates a decentralized marketplace where individuals and small providers rent out their idle GPUs. This model leads to incredible price variability and, often, the absolute lowest rates if you’re willing to hunt. In late May 2026, the A100 80GB spot listings we observed averaged around $0.55/hr, though prices can dip significantly lower depending on demand and host specifics. For instance, we occasionally see offers below $0.50/hr for A100 80GB instances, provided you’re flexible with location or host reputation. You can verify typical pricing directly on the Vast.ai console when creating an instance.
The challenge with Vast.ai lies in consistency. The quality of hosts can vary: some are professional setups, others are literally someone’s gaming PC. This means network performance, underlying CPU power, and even the stability of the GPU itself can be a roll of the dice. You’ll want to filter heavily by host reliability scores and verify specs before committing to a long job. For LLM training, this means you need a robust checkpointing strategy and a high tolerance for restarts. It also means paying close attention to storage costs, which, like on other platforms, can add up. Our previous dive into GPU Instance Storage: The Hidden Cost You Keep Forgetting is particularly relevant here, as some Vast.ai hosts might have slower or more expensive local storage options. Despite the quirks, for those on a tight budget or with extremely fault-tolerant workloads, Vast.ai often presents unbeatable prices. For a deeper look into navigating this unique market, particularly for iterative workloads, our guide on Vast.ai for hobbyist ML covers some of the practicalities.
Here’s a snapshot of typical A100 80GB spot pricing and characteristics we observed across a sample of Vast.ai hosts in late May 2026:
| Characteristic | Typical Range (A100 80GB) | Notes |
|---|---|---|
| Hourly Rate | $0.45 - $0.80 | Highly variable, depends on host, location, demand |
| Preemption Risk | High | Depends on host stability and utilization |
| Host Reliability | Variable | Check host scores; some are excellent, some are not |
| Network Speed | Variable | Can range from consumer-grade to datacenter-tier |
| Storage Cost/Type | Often included (local) | Verify NVMe availability and performance |
| Setup Complexity | Moderate | Requires some familiarity with host configuration |
Vultr’s spot instances: predictable savings for LLMs?
Vultr, traditionally known for its simpler cloud compute offerings, has entered the GPU market with its own brand of spot instances. Their model is more akin to traditional cloud providers than Vast.ai’s marketplace, offering a more stable price point and, ostensibly, more consistent infrastructure. For LLM training, this predictability can be a significant advantage, reducing the “unknown unknowns” that come with highly variable marketplaces.
As of late May 2026, Vultr A100 80GB spot instance pricing starts at approximately $1.10/hr. This is higher than both the typical Vast.ai listing and Runpod’s published spot rate, but it comes with the promise of more enterprise-grade infrastructure and a more controlled environment. Preemption rates are generally lower than on decentralized platforms, though still present. Their spot instances are typically available across their global data centers, which can be useful for reducing latency if your data sources are regionally dispersed. One often-overlooked cost on any cloud is egress, and Vultr’s approach is fairly standard. We’ve detailed the importance of tracking these transfers in our egress cost guide, which applies here as much as anywhere else. Vultr’s spot offerings are a good middle ground for those who want significant savings but aren’t quite ready for the wild west of a fully decentralized market.
Minimizing preemption and managing checkpoints for LLM training
The allure of cheap spot instances for LLM training is clear, but the threat of preemption is real. Losing hours of training progress because your instance got reclaimed is not a cost-saving measure. The solution lies in robust fault tolerance.
First, checkpointing is non-negotiable. Your training loop must save model weights and optimizer states frequently: every few minutes for critical stages, or every few hundred steps. Store these checkpoints on durable storage that persists beyond the life of your ephemeral spot instance, such as a network volume, S3-compatible object storage, or dedicated block storage. This means that when preemption hits, you can simply spin up a new instance, load the latest checkpoint, and resume training. Tools like PyTorch Lightning, Hugging Face Accelerate, or even custom scripts make this relatively straightforward.
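As a minimal sketch of the save-and-resume pattern, the snippet below uses only the Python standard library. The `state` dict and its fake "weights" are stand-ins we made up; in a real PyTorch job you would serialize the model and optimizer state dicts instead of pickling a toy dict. The part worth copying is the atomic write: the checkpoint goes to a temporary file first and is renamed into place, so a preemption mid-write never leaves a truncated file as the "latest" checkpoint. We assume `ckpt_dir` sits on a volume that outlives the instance.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, ckpt_dir: Path, step: int) -> Path:
    """Write a checkpoint atomically: temp file first, then rename into place."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.ckpt"  # zero-padded so names sort by step
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    os.replace(tmp_name, final_path)  # atomic when both paths are on one filesystem
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return the newest saved state, or None if there is nothing to resume from."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("step_*.ckpt"))
    if not candidates:
        return None
    with open(candidates[-1], "rb") as f:
        return pickle.load(f)


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> dict:
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    state = load_latest_checkpoint(ckpt_dir) or {"step": 0, "weights": 0.0}
    while state["step"] < total_steps:
        state["weights"] += 0.1  # stand-in for one real optimizer step
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, ckpt_dir, state["step"])
    return state
```

Calling `train` again with the same directory after an interruption picks up from the last saved step rather than from zero. A real setup would also prune old checkpoints to keep storage costs in check.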
Second, orchestration and automation are key. Manually monitoring and restarting instances is a recipe for frustration. Leverage tools like Kubernetes with custom operators, or simpler shell scripts wrapped in tmux or screen that can detect preemption (e.g., by checking for instance termination signals or API calls) and automatically trigger a new instance launch and resume command. Some platforms offer built-in mechanisms for this, but a custom solution often provides more flexibility.
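The simplest version of this idea is a supervisor that re-runs the training command until it exits cleanly. The sketch below is exactly that and nothing more. It only covers the case where the process dies but the machine it runs on survives. Recovering from a real preemption also means launching a replacement instance through your provider's API from somewhere outside the spot instance, and that part is provider-specific, so we leave it out. The `train.py --resume` command in the usage comment is a hypothetical script name.

```python
import subprocess
import time


def run_until_done(cmd: list[str], max_restarts: int = 20, backoff_s: float = 30.0) -> int:
    """Re-run cmd until it exits with status 0. Returns how many restarts were needed."""
    restarts = 0
    while True:
        result = subprocess.run(cmd)  # blocks until the training process exits
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts; last exit code {result.returncode}"
            )
        restarts += 1
        time.sleep(backoff_s)  # brief pause so a crash loop does not hammer the host


# Hypothetical usage, assuming train.py resumes from its latest checkpoint:
#   run_until_done(["python", "train.py", "--resume"])
```

The restart cap matters: without it, a job that crashes on startup because of a bug will happily burn billed hours restarting forever.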
Third, dataset management. Ensure your training data is also accessible from durable storage, ideally co-located with your compute region to minimize transfer costs and latency. Pre-loading datasets into a local cache on the spot instance can speed up training, but the canonical source should always be external.
Finally, understand preemption signals. Some providers offer a warning period (e.g., 30 seconds to 2 minutes) before an instance is preempted. Build logic into your application to catch these signals and initiate a final quick checkpoint save. This small window can often be enough to prevent data loss.
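One way to use that window, sketched below, is a signal handler that only flips a flag, with the training loop checking the flag between steps and writing one last checkpoint before exiting. We are assuming the warning arrives as a SIGTERM delivered to the training process. Whether a given provider does that, or exposes the warning some other way, is something to confirm in their documentation. `save_fn` and `step_fn` are placeholder callbacks for your own checkpoint and training-step code.

```python
import signal


class PreemptionFlag:
    """Records that a termination signal arrived so the loop can exit cleanly."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.triggered = False
        for sig in signals:
            # Must be called from the main thread of the main interpreter.
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: set a flag and let the training loop do the saving.
        self.triggered = True


def train_loop(total_steps: int, flag: PreemptionFlag, save_fn, step_fn) -> int:
    """Run steps until done or until a signal arrives, then save once and stop."""
    step = 0
    while step < total_steps:
        step_fn(step)
        step += 1
        if flag.triggered:
            save_fn(step)  # final quick checkpoint inside the warning window
            break
    return step
```

Doing the save in the loop rather than inside the handler avoids writing a checkpoint halfway through an optimizer step. The trade-off is that a single step has to finish comfortably within the warning period.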
Our verdict: which spot market for your LLM training?
Choosing the right spot market for LLM training boils down to your tolerance for risk, your budget, and the fault tolerance of your workflow. There’s no single “best” option, but clear winners emerge for specific use cases.
For the absolute lowest price and maximum flexibility, Vast.ai wins, hands down. Listings average around $0.55/hr for an A100 80GB, and we’ve occasionally seen them under $0.50/hr, which is hard to beat. However, you’re signing up for variability in host quality, network performance, and preemption frequency. It requires a robust, battle-hardened checkpointing and orchestration setup. If you’re a hobbyist, researcher, or a small team with a highly fault-tolerant workflow and budget is paramount, Vast.ai is your playground.
If you need a balance of savings and predictability, Runpod’s Secure Cloud Spot is a strong contender. At around $0.79/hr for A100 80GB, it’s significantly cheaper than on-demand, while offering a more stable and secure environment than Vast.ai’s open marketplace. Preemption is still a factor, but it feels less chaotic. This is a good choice for iterative development, fine-tuning, or mid-scale training jobs where you want cost savings without constant babysitting. If you want to try the same workload yourself, our referral link is an easy way to get started.
Vultr has the highest spot price of the three, starting around $1.10/hr for A100 80GB. While less aggressive on discounts, it offers the most “traditional cloud” experience, with more consistent performance and potentially lower preemption rates than Vast.ai. It’s a solid option for teams already integrated into Vultr’s ecosystem, or those who prioritize a more managed feel over the deepest possible discounts, while still seeking cost reductions over on-demand rates.
Here’s a quick summary of our findings:
| Provider | Typical A100 80GB Spot Rate (May 2026) | Preemption Risk | Host Consistency | Best For |
|---|---|---|---|---|
| Vast.ai | ~$0.55/hr (can go lower) | High | Variable | Extreme budget constraints, highly fault-tolerant workloads, hobbyists |
| Runpod | ~$0.79/hr | Moderate | High | Balance of cost savings and predictability, secure environments |
| Vultr | ~$1.10/hr | Moderate-Low | High | Teams wanting cloud predictability with some spot savings, existing Vultr users |
Ultimately, the cheapest hourly rate isn’t always the cheapest overall. A $0.50/hr instance that preempts every hour might cost more in lost time and re-starts than a $0.80/hr instance that runs for days. For LLM training, a robust checkpointing strategy is your non-negotiable insurance policy against the inherent instability of spot markets, regardless of provider.
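To make that trade-off concrete, here is a back-of-the-envelope model. It is our own simplification, not a provider formula: each preemption is assumed to cost a fixed restart overhead plus, on average, half a checkpoint interval of redone work, and all of that time is assumed to be billed. The preemption frequencies, checkpoint intervals, and overheads below are illustrative guesses. Only the hourly prices echo the figures in this article.

```python
def effective_cost_per_useful_hour(price_per_hour: float,
                                   hours_between_preemptions: float,
                                   checkpoint_interval_hours: float,
                                   restart_overhead_hours: float) -> float:
    """Dollars paid per hour of training progress that actually survives preemption."""
    # Average loss per preemption: relaunch time plus half a checkpoint interval of rework.
    lost = restart_overhead_hours + checkpoint_interval_hours / 2
    useful = hours_between_preemptions - lost
    if useful <= 0:
        raise ValueError("instance is preempted before it makes any net progress")
    return price_per_hour * hours_between_preemptions / useful


# Hypothetical scenarios: (label, $/hr, hours between preemptions).
scenarios = [("cheap but flaky", 0.50, 1.0), ("pricier but stable", 0.80, 72.0)]

for ckpt_h, restart_h in [(0.5, 0.25), (5 / 60, 0.1)]:
    print(f"checkpoint every {ckpt_h * 60:.0f} min, restart overhead {restart_h * 60:.0f} min")
    for label, price, uptime_h in scenarios:
        cost = effective_cost_per_useful_hour(price, uptime_h, ckpt_h, restart_h)
        print(f"  {label}: ${cost:.2f} per useful hour")
```

Under the first set of assumptions (30-minute checkpoints, 15-minute restarts), the $0.50/hr instance works out to about $1.00 per useful hour and loses to the $0.80/hr one. Tighten checkpoints to every five minutes with a faster restart and the cheap instance comes out ahead again, which is the whole argument for investing in checkpointing before chasing the lowest sticker price.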