Intel Gaudi 2: AWS vs. CoreWeave for LLM Pre-training. Price vs. Pain.
We put Intel's Gaudi 2 through a month of LLM pre-training and inference on two clouds to see if the cost savings are worth the ecosystem friction.
- gpu
- comparison
- gaudi2
- aws
- coreweave
- llm
- pricing
The numbers on the spreadsheet looked great: Gaudi 2 instances, at roughly half the hourly rate of comparable Nvidia H100s, promising a significant dent in our LLM pre-training bill. So we rented a few, ran our standard workloads, and waited for the catch. The catch, as it turns out, wasn’t a sudden invoice spike, but the faint, insistent hum of a dozen engineers debugging hl-smi output and framework incompatibilities instead of actually training models. The ‘savings’ quickly evaporated into engineering hours.
What We’re Comparing
We spent the last month, leading up to late May 2026, putting Intel’s Gaudi 2 accelerators through their paces on two distinct cloud providers: AWS and CoreWeave. The goal was to run a subset of our standard LLM pre-training suite, based on a scaled-down Mistral 7B-style architecture, along with a few inference benchmarks for good measure. We provisioned one 8x Gaudi 2 instance on each provider, aiming for parity in raw compute capability.
Our pre-training workload was causal (next-token) language modeling on a 100 GB subset of the C4 dataset, and we measured tokens/second and wall-clock time to reach a target perplexity. For inference, we measured latency and throughput at batch sizes 1 and 8 for a fine-tuned Mistral 7B model.
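For readers who want to reproduce the two pre-training metrics, here is a minimal sketch of how they can be computed. It assumes perplexity is taken as the exponential of the mean per-token loss in nats, which is the usual convention but not something our harness output is shown confirming here. The function names and sample numbers are illustrative only.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

def tokens_per_second(tokens_processed, elapsed_seconds):
    """Average throughput over a measurement window."""
    return tokens_processed / elapsed_seconds

def reached_target(token_nlls, target_ppl):
    """True once the evaluation perplexity is at or below the target."""
    return perplexity(token_nlls) <= target_ppl

if __name__ == "__main__":
    # Hypothetical evaluation losses and window, for illustration only.
    losses = [2.0, 2.5, 3.0]
    print(f"perplexity: {perplexity(losses):.2f}")
    print(f"throughput: {tokens_per_second(1_160_000, 10):,.0f} tokens/s")
```

Wall-clock time to target is then just the elapsed time at the first evaluation where `reached_target` returns true.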
We picked AWS for its reputation for stability and managed services, assuming it would represent the ‘easier’ path. CoreWeave, a known specialist in GPU clouds, was chosen for its bare-metal approach and competitive pricing, representing the ‘harder but cheaper’ option. We wanted to see if the Gaudi 2 cost advantage held up in practice, accounting for provider differences.
Price and Raw Specs: On Paper vs. Reality
Here’s how the two offerings stacked up for our 8x Gaudi 2 configurations. We standardized on us-east-1 for AWS and a comparable US region for CoreWeave to minimize network variance, and assumed 1 TB of high-performance NVMe storage for checkpoints and dataset caching.
| Provider | Instance Type | Accelerators | HBM (total) | vCPUs | RAM (total) | Base $/hr (8x Gaudi 2) | Egress $/GB | Notes |
|---|---|---|---|---|---|---|---|---|
| AWS | dl2q.24xlarge | 8x Gaudi 2 | 768 GB | 96 | 1152 GB | $12.48 | $0.09 | Managed software stack |
| CoreWeave | g2-8xl | 8x Gaudi 2 | 768 GB | 128 | 1536 GB | $8.80 | $0.01 | Bare-metal access, 10 TB/month egress included |
Looking purely at the hourly rate, CoreWeave presented an immediate 29% saving ($8.80 vs. $12.48). Both instances offered the same 8x Gaudi 2 accelerators, each with 96 GB of HBM2e. CoreWeave also threw in more vCPUs and RAM, which can be useful for data loading or complex pre-processing. The egress costs are notable. CoreWeave includes 10 TB/month, after which it’s a flat $0.01/GB. AWS bills everything above a small free tier, typically at $0.09/GB, which adds up quickly if you’re pulling large datasets or shipping models around. We’ve written about this hidden egress tax before in our guide to egress costs.
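To make the hourly and egress arithmetic concrete, here is a small sketch using the rates from the table. The 15 TB monthly egress volume is a hypothetical example, 1 TB is treated as 1,000 GB, and the small AWS free tier is ignored for simplicity.

```python
def hourly_saving_pct(baseline, alternative):
    """Percentage saved by choosing `alternative` over `baseline`."""
    return (baseline - alternative) / baseline * 100

def egress_cost(gb, rate_per_gb, included_gb=0.0):
    """Cost of moving `gb` out, after subtracting any included allowance."""
    billable = max(0.0, gb - included_gb)
    return billable * rate_per_gb

AWS_RATE, COREWEAVE_RATE = 12.48, 8.80   # USD/hour, from the table
EGRESS_GB = 15_000                       # hypothetical monthly egress volume

if __name__ == "__main__":
    print(f"hourly saving: {hourly_saving_pct(AWS_RATE, COREWEAVE_RATE):.1f}%")
    # AWS free tier ignored here (assumption).
    print(f"AWS egress:       ${egress_cost(EGRESS_GB, 0.09):,.2f}")
    print(f"CoreWeave egress: ${egress_cost(EGRESS_GB, 0.01, included_gb=10_000):,.2f}")
```

Below the included 10 TB, the CoreWeave egress line is zero, so for smaller transfers the gap is the full AWS charge.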
Performance Under Fire: When the Silicon Works
Once we got past the setup hurdles (more on that shortly), the raw performance of the Gaudi 2 chips themselves was genuinely impressive. For pre-training our Mistral-like 7B model, we observed:
- AWS dl2q.24xlarge: Averaged 14,500 tokens/second per Gaudi 2, scaling linearly to around 116,000 tokens/second for the 8-accelerator setup. Time to target perplexity was 38 hours.
- CoreWeave g2-8xl: Averaged 15,100 tokens/second per Gaudi 2, hitting 120,800 tokens/second for 8 accelerators. Time to target perplexity was 36.5 hours.
The slight performance edge on CoreWeave (about 4%) may reflect the more generous CPU/RAM allocation and potentially lower-latency interconnects, though a gap that small is within the margin of error for typical training runs. For inference, both platforms delivered similar p95 latencies for single-request inference (around 220 ms for 512 output tokens) and comparable batch throughputs (approximately 850 tokens/second for batch size 8).
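Combining the pre-training figures above with the hourly rates from the pricing table gives a rough cost per run and cost per billion tokens. This is a back-of-the-envelope sketch: it counts only instance hours and leaves out storage, egress, and setup time.

```python
RUNS = {
    # rate in USD/hour, hours to target perplexity, aggregate tokens/second
    "AWS dl2q.24xlarge": {"rate": 12.48, "hours": 38.0, "tok_s": 116_000},
    "CoreWeave g2-8xl": {"rate": 8.80, "hours": 36.5, "tok_s": 120_800},
}

def run_cost(rate, hours):
    """Instance cost of one run to target perplexity."""
    return rate * hours

def usd_per_billion_tokens(rate, tok_s):
    """Instance cost of processing one billion tokens at steady throughput."""
    tokens_per_hour = tok_s * 3600
    return rate / tokens_per_hour * 1e9

if __name__ == "__main__":
    for name, r in RUNS.items():
        cost = run_cost(r["rate"], r["hours"])
        per_b = usd_per_billion_tokens(r["rate"], r["tok_s"])
        print(f"{name}: ${cost:,.2f} per run, ${per_b:.2f} per billion tokens")
```

On these numbers the per-run gap is a little wider than the 29% hourly gap, because the CoreWeave run also finished slightly sooner.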
Comparing this to H100s, a single H100 80GB can push 20,000-25,000 tokens/second for this model (see our A100 pricing comparison for context). At roughly 14,500-15,100 tokens/second per Gaudi 2, it takes about 1.3 to 1.7 Gaudi 2s to match one H100 in raw throughput, so a pair comfortably covers it, and at a significantly lower hourly cost. The silicon itself is a contender.
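A quick check of that equivalence, using only the per-accelerator throughput ranges quoted in this post:

```python
GAUDI2_TOK_S = (14_500, 15_100)   # per accelerator, AWS and CoreWeave runs
H100_TOK_S = (20_000, 25_000)     # per accelerator, range quoted above

def gaudi2_per_h100(h100_tok_s, gaudi2_tok_s):
    """How many Gaudi 2s are needed to match one H100's throughput."""
    return h100_tok_s / gaudi2_tok_s

# Best case for Gaudi 2: slowest H100 figure vs. fastest Gaudi 2 figure.
low = gaudi2_per_h100(min(H100_TOK_S), max(GAUDI2_TOK_S))
# Worst case for Gaudi 2: fastest H100 figure vs. slowest Gaudi 2 figure.
high = gaudi2_per_h100(max(H100_TOK_S), min(GAUDI2_TOK_S))

if __name__ == "__main__":
    print(f"Gaudi 2s per H100: {low:.2f} to {high:.2f}")
```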
The Ecosystem Friction: Where Savings Evaporate
This is where the spreadsheet numbers started to feel very optimistic. Intel’s software stack for Gaudi 2, particularly Habana SynapseAI, is robust but requires a different mental model and often specific versions of libraries.
AWS: Offered a more streamlined experience. Their Deep Learning AMIs came pre-configured with most of the necessary drivers and framework patches (PyTorch, TensorFlow). We still hit a few version mismatches for our custom training script, leading to about 8-10 hours of initial setup and debugging. Support was responsive but sometimes slow to escalate deep technical issues specific to Gaudi. Instance availability could be tight in us-east-1, requiring us to queue for a few hours occasionally.
CoreWeave: True bare-metal access meant we had to manage almost everything ourselves. This offered maximum flexibility but required significantly more effort. Installing drivers, configuring network fabrics, and getting a distributed training setup running smoothly took us nearly 24 hours of dedicated engineering time. The CoreWeave team was helpful, but their role was more about providing a working host, not hand-holding through framework bugs. Once it was running, it was rock-solid, but that initial lift was substantial. We found ourselves digging through Intel’s documentation and community forums far more often.
Common Challenges: Both platforms required careful attention to data loading. Gaudi 2 only delivers its rated throughput when the accelerators are kept fed, and bottlenecks in storage I/O or CPU-side pre-processing could easily starve them. We spent additional time optimizing our data pipeline, which is a universal accelerator problem, but it felt more pronounced here given the novelty of the hardware.
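One generic way to hide loader latency is to prepare batches in a background thread while the accelerator works on the current one. The sketch below is a standard-library illustration of that idea, not the pipeline we actually ran; `prefetch` and its `depth` parameter are names made up for this example.

```python
import queue
import threading

def prefetch(batches, depth=4):
    """Yield items from `batches`, loading up to `depth` ahead in a thread."""
    q = queue.Queue(maxsize=depth)
    done = object()  # marks the end of the stream

    def worker():
        try:
            for batch in batches:
                q.put((batch, None))   # blocks when the buffer is full
            q.put((done, None))
        except Exception as exc:       # hand loader errors to the consumer
            q.put((done, exc))

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item, error = q.get()
        if item is done:
            if error is not None:
                raise error
            return
        yield item

if __name__ == "__main__":
    # Stand-in for a real batch loader.
    for batch in prefetch(iter(range(5)), depth=2):
        print("training on batch", batch)
```

If the consumer stops early, the worker stays blocked on the full queue until the process exits, which is acceptable for a sketch but worth handling in a real loader.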
Verdict: Cost-Efficiency vs. Engineering Overhead
If you’re a team with deep ML infrastructure expertise and a strict budget, CoreWeave’s Gaudi 2 offering is a compelling choice. The raw hourly cost savings of roughly 29%, combined with much friendlier egress rates and more generous base specs, translate to a significantly lower bill if you can stomach the initial engineering investment. For a month-long pre-training run (about 720 hours), the $3.68/hr gap comes to roughly $2,650 per 8x instance before egress. The assumption here is you have engineers who are comfortable debugging low-level driver issues and adapting training frameworks to a non-Nvidia ecosystem.
For teams prioritizing speed to market, a familiar cloud experience, and minimal setup friction, AWS is the safer bet. You’ll pay a premium for the managed environment and the slightly less competitive egress, but you’ll likely spend less time wrangling infrastructure and more time iterating on your models. The cost difference is real, but so is the cognitive load required to fully exploit the cheaper silicon elsewhere. In our lab, we’d lean into CoreWeave for long, dedicated pre-training jobs where we can amortize the setup cost, but stick with AWS for bursty inference or exploratory work where time is of the essence and a few extra dollars per hour won’t break the bank.
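To put a number on amortizing the setup cost, here is a rough break-even sketch. It uses the setup times reported above (about 24 hours on CoreWeave versus the midpoint of 8-10 hours on AWS) and the hourly rates from the table. The engineer cost of $100/hour is purely a placeholder assumption, so substitute your own.

```python
def break_even_hours(extra_setup_hours, engineer_rate, hourly_saving):
    """Instance-hours needed before the cheaper rate repays the extra setup."""
    return extra_setup_hours * engineer_rate / hourly_saving

HOURLY_SAVING = 12.48 - 8.80      # USD/hour, AWS minus CoreWeave
EXTRA_SETUP_HOURS = 24 - 9        # CoreWeave setup minus AWS midpoint
ENGINEER_RATE = 100.0             # USD/hour, hypothetical placeholder

if __name__ == "__main__":
    hours = break_even_hours(EXTRA_SETUP_HOURS, ENGINEER_RATE, HOURLY_SAVING)
    print(f"break-even after {hours:.0f} instance-hours "
          f"(about {hours / 24:.0f} days on one 8x instance)")
```

Under these assumptions a single 38-hour run never repays the extra setup, while a multi-week job does, which matches the split we describe above.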
Ultimately, Intel Gaudi 2 provides a powerful alternative to Nvidia, but it’s not a drop-in replacement. The decision comes down to whether your team has the engineering bandwidth to truly unlock the per-dollar performance, or if the convenience tax of a more integrated, albeit pricier, cloud provider is the smarter play for your specific workload.