
NVLink H100s in the Cloud: Runpod vs Lambda vs CoreWeave

We put multi-H100 NVLink setups to the test across three major providers to see where distributed training actually scales, and where it just costs more.

Tobias 12 min read
  • gpu
  • h100
  • nvlink
  • comparison
  • runpod
  • lambda
  • coreweave

When you finally outgrow a single H100, the promise of NVLink is seductive: double the VRAM, double the compute, seamless scaling. The reality, as we found out over a month of thrashing distributed training jobs, is that the ‘seamless’ part often depends more on your provider’s plumbing than on Nvidia’s engineering. Our first multi-H100 Llama 3 70B fine-tune on a pair of NVLinked GPUs failed silently after 4 hours, not because of our code, but because a crucial NCCL_IB_GID_INDEX environment variable wasn’t automatically set, a detail buried deep in a provider’s obscure documentation. It was a stark reminder that even with top-tier hardware, the devil is still in the cloud platform’s details.
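A cheap guard against this kind of silent failure is a preflight check that refuses to launch when expected NCCL settings are absent. The sketch below is a minimal illustration, not our actual launch script: the function names and the contents of the required list are assumptions, and the variables that matter will depend on your provider's network fabric.

```python
import os

# Assumption: an illustrative list. NCCL_IB_GID_INDEX is the variable that
# bit us; extend this with whatever your provider's docs call for.
REQUIRED_NCCL_VARS = ("NCCL_IB_GID_INDEX",)


def missing_nccl_vars(env=None, required=REQUIRED_NCCL_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def preflight(env=None):
    """Raise before training starts instead of failing hours into a run."""
    missing = missing_nccl_vars(env)
    if missing:
        raise RuntimeError("Missing NCCL settings: " + ", ".join(missing))


if __name__ == "__main__":
    try:
        preflight()
        print("NCCL preflight passed")
    except RuntimeError as err:
        print(f"NCCL preflight failed: {err}")
```

Calling `preflight()` at the top of the training entry point turns a four-hour surprise into an immediate error. Setting `NCCL_DEBUG=INFO` for the first run is also worth considering, since it makes NCCL log which transports it selected.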

What We Put Through the Wringer

We focused on typical enterprise-scale training workloads: large language model fine-tuning (Llama 3 70B and Mistral 7B scaled up to 32k context), and a custom image generation diffusion model that benefits immensely from larger batch sizes enabled by pooled VRAM. For all tests, we aimed for configurations with at least two H100 80GB GPUs connected via NVLink, to truly stress the interconnect and not just raw GPU power. We chose US-East regions wherever they were available, to keep the comparison as like-for-like as possible.
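Before trusting any "NVLink" label, it is worth confirming that the GPUs really are linked. One way is to inspect the matrix printed by `nvidia-smi topo -m`, where entries beginning with `NV` indicate an NVLink connection between two GPUs. The parser below is a rough sketch under stated assumptions: the function names are our own, and `SAMPLE_TOPO` is an illustrative stand-in for the command's output, not a capture from any of the three providers.

```python
import re

# Illustrative sample only; real output has more columns and a legend.
SAMPLE_TOPO = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV18    0-23            0
GPU1    NV18     X      0-23            0
"""


def gpu_links(topo_text):
    """Map each GPU pair (i, j) with i < j to its link type, e.g. 'NV18'."""
    rows = []
    for line in topo_text.splitlines():
        tokens = line.split()
        # Matrix rows start with GPU<n> and contain the self-marker 'X';
        # the header row also starts with GPU0 but has no 'X'.
        if tokens and re.fullmatch(r"GPU\d+", tokens[0]) and "X" in tokens:
            rows.append(tokens)
    n = len(rows)
    links = {}
    for i, tokens in enumerate(rows):
        # The first n cells after the row label are the GPU columns.
        for j, cell in enumerate(tokens[1:1 + n]):
            if i < j:
                links[(i, j)] = cell
    return links


def all_nvlinked(topo_text):
    """True only if at least one GPU pair exists and every pair uses NVLink."""
    links = gpu_links(topo_text)
    return bool(links) and all(cell.startswith("NV") for cell in links.values())


if __name__ == "__main__":
    print(gpu_links(SAMPLE_TOPO), all_nvlinked(SAMPLE_TOPO))
```

On a live instance, the same functions can be fed the text returned by `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`.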

Our goal wasn’t just raw speed, but also the overall experience: how quickly could we provision, how stable were the jobs, and what did the final bill look like? We logged everything, from cold start times to actual training throughput and, of course, any hidden charges that might sneak in.

Specs and Price: The Tale of Three Tiers

Landing on comparable H100 NVLink configurations across all three was an exercise in patience. CoreWeave offers direct multi-GPU instances. Lambda provides H100s, often with NVLink enabled by default for multi-GPU setups. Runpod, true to its nature, had both Community Cloud offerings and Secure Cloud options with NVLink. We stuck to the h100-80gb-2x or h100-80gb-4x configurations to ensure a direct comparison of the NVLink benefits.

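| Provider | Instance | GPUs | VRAM (total) | NVLink BW | CPU/RAM | Storage | $/hr (2x H100) | Queue Time (p50) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runpod (Community) | H100 80GB x2 | 2 | 160 GB | 900 GB/s | 24 vCPU / 192 GB | 1.5 TB NVMe | $3.50 | ~15 min |
| Runpod (Secure) | H100 80GB x2 | 2 | 160 GB | 900 GB/s | 24 vCPU / 192 GB | 1.5 TB NVMe | $4.20 | ~5 min |
| Lambda Labs | H100 80GB x2 | 2 | 160 GB | 900 GB/s | 48 vCPU / 384 GB | 3.8 TB NVMe | $4.50 | ~30 min |
| CoreWeave | H100 80GB x2 | 2 | 160 GB | 900 GB/s | 48 vCPU / 384 GB | 3.8 TB NVMe | ~$5.00 | ~2 min |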

Note: CoreWeave pricing is often negotiated, and the $5.00/hr is a representative on-demand rate we managed to secure for a short-term trial. Actual rates may vary significantly for larger commitments. Runpod Community Cloud pricing can fluctuate based on supply.

Our initial reaction to the price spread was predictable: CoreWeave is clearly targeting the enterprise wallet. Runpod, especially the Community Cloud, maintains its reputation for aggressive pricing. Lambda sits in the middle, offering twice the vCPUs and RAM of either Runpod tier (48 vCPU/384 GB versus 24 vCPU/192 GB) for $0.30/hr more than Runpod’s Secure Cloud. For a deeper dive into the economics of single H100s, you might find our H100 40GB vs 80GB comparison insightful.

The real test came down to how well our distributed training jobs actually scaled. We used PyTorch with FSDP (Fully Sharded Data Parallel) for our LLM fine-tunes, pushing batch sizes and model sizes that would OOM a single GPU. For image generation, we leveraged DeepSpeed for better memory utilization.

Our Llama 3 70B fine-tuning workload, using a global batch size of 64 across two H100s, showed a clear advantage with NVLink. Without it, the inter-GPU communication would bottleneck, often leading to ~1.3-1.5x scaling at best. With true NVLink, we consistently saw 1.8-1.9x throughput compared to a single H100. This is what you pay for.
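To put those multipliers in perspective, it helps to convert them into parallel efficiency, meaning the achieved speedup divided by the number of GPUs. The helper below is a small sketch using only the figures quoted in this section; the function name is our own.

```python
def parallel_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return speedup / n_gpus


if __name__ == "__main__":
    # Speedups quoted above for two GPUs, with and without NVLink.
    cases = [
        ("no NVLink, low end", 1.3),
        ("no NVLink, high end", 1.5),
        ("NVLink, low end", 1.8),
        ("NVLink, high end", 1.9),
    ]
    for label, speedup in cases:
        print(f"{label}: {parallel_efficiency(speedup, 2):.0%}")
```

In other words, the non-NVLink range works out to 65-75% efficiency, while the NVLink range sits at 90-95%.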

| Provider | Config | Llama 3 70B (tokens/sec) | Diffusion (images/sec) | Scalability (vs 1x H100) |
| --- | --- | --- | --- | --- |
| Runpod (Secure) | 2x H100 | 18,500 | 14.2 | 1.8x |
| Lambda Labs | 2x H100 | 18,900 | 14.5 | 1.85x |
| CoreWeave | 2x H100 | 19,200 | 14.7 | 1.9x |

The raw performance numbers were very close across the board once we had a stable NVLink connection. The small differences often came down to network latency for data loading or underlying CPU/RAM availability. CoreWeave had a slight edge, likely due to their optimized network fabric and potentially more dedicated system resources per GPU. Lambda was a close second, and Runpod’s Secure Cloud was perfectly capable. The Community Cloud, while cheaper, occasionally exhibited minor variances due to shared infrastructure, but nothing that fundamentally broke scaling for our tests.
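Since throughput is this close, one useful way to compare the providers is cost per million tokens processed, combining the hourly rates from the pricing table with the Llama 3 70B throughput above. The following is a back-of-the-envelope sketch: the function and dictionary names are our own, the CoreWeave rate is the approximate trial figure, and queue time and storage costs are ignored.

```python
# Figures copied from the two tables in this article (2x H100 configs).
PROVIDERS = {
    "Runpod (Secure)": {"usd_per_hr": 4.20, "tokens_per_sec": 18_500},
    "Lambda Labs": {"usd_per_hr": 4.50, "tokens_per_sec": 18_900},
    "CoreWeave": {"usd_per_hr": 5.00, "tokens_per_sec": 19_200},  # approximate rate
}


def usd_per_million_tokens(usd_per_hr, tokens_per_sec):
    """Hourly price divided by hourly token throughput, scaled to 1M tokens."""
    tokens_per_hr = tokens_per_sec * 3600
    return usd_per_hr / tokens_per_hr * 1_000_000


if __name__ == "__main__":
    for name, figures in PROVIDERS.items():
        cost = usd_per_million_tokens(**figures)
        print(f"{name}: ${cost:.4f} per million tokens")
```

On these numbers the three land at roughly $0.063, $0.066, and $0.072 per million tokens, so the small throughput edge at the top end does not offset the higher hourly rate.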

Operational Friction: From Login to Launch

Getting these multi-GPU setups running was where the differences became more pronounced.

Runpod: The flexibility of Runpod is its strength and weakness. On the Community Cloud, finding an available 2x H100 NVLink pod sometimes involved waiting a bit, as noted by the 15-minute p50 queue time. Their Secure Cloud was faster. Once provisioned, the bare-metal access is excellent, allowing us to configure NCCL and network settings precisely. Their API is robust for automation, which is key for repeated experiments. However, documentation for complex multi-GPU setups can sometimes feel community-driven rather than enterprise-polished. Our initial NCCL issue was resolved via their Discord, which is fast but not exactly a white-glove support experience. For a general overview of Runpod, see our broader Runpod review.

Lambda Labs: Lambda offers a more curated experience. Their UI is clean, and provisioning is straightforward, though we did encounter noticeable queues, particularly for H100s (p50 of 30 minutes). Once an instance is up, it’s very stable and well-configured out of the box for distributed training. Their custom OS images often come with optimized drivers and libraries. Support is responsive for basic issues, but deep debugging of specific FSDP configurations might require more self-reliance. This reliability is why many teams choose Lambda, even with potential wait times, as we noted in our Lambda Labs review.

CoreWeave: This is where the enterprise experience truly shines. Provisioning was almost instant (p50 of 2 minutes), and the systems felt consistently high-performance. Their network and storage are top-notch, clearly designed for demanding, large-scale workloads. The ssh experience was akin to a very high-end dedicated server. However, CoreWeave operates on a different model; it’s less ‘self-service click-and-go’ and more ‘talk to sales, sign a contract’. Their API is powerful, but getting initial access and understanding their resource allocation often requires direct engagement. If you’re running at scale, this is a feature, not a bug, but it’s a barrier for smaller teams or ad-hoc projects.

After a month of pushing these multi-H100 setups, it’s clear there’s no single winner; it depends entirely on your team’s size, budget, and operational style.

  • For the budget-conscious developer or small team: Runpod’s Secure Cloud is the sweet spot. You get access to powerful NVLinked H100s at a very competitive hourly rate, with enough flexibility to set up your environment exactly how you like it. The Community Cloud is even cheaper, but be prepared for slightly more variable availability. If you want to kick the tyres yourself, you can spin up a pod via our referral link.

  • For the mid-sized team prioritizing reliability and ease-of-use: Lambda Labs offers a compelling package. While you might hit a queue for H100s, the stability and well-managed environment once you’re provisioned make it a strong contender for predictable, longer-running training jobs where you want fewer surprises on the infrastructure side.

  • For the enterprise or large research lab with serious scale: CoreWeave is the clear choice if you can navigate their sales process. The performance is consistently excellent, availability is high, and the underlying infrastructure is built for maximum throughput and reliability. You’ll pay a premium, but for mission-critical, large-scale distributed training, it’s a justifiable investment.

Ultimately, the choice for NVLinked H100s isn’t just about the hourly rate. It’s about how much operational overhead you’re willing to absorb for cost savings, versus how much you value a frictionless, enterprise-grade experience. Our advice? Start with Runpod’s Secure Cloud to validate your multi-GPU code, and only migrate up to Lambda or CoreWeave if your specific needs (e.g., guaranteed instant availability, white-glove support, massive scale) demand it and your budget allows. NVLink itself delivers on its promise across all three platforms, assuming you configure your software correctly.