TPU v4 vs H100: Cheaper LLM Fine-Tuning Isn't Just $/hr

We threw Llama 3 70B at Google's TPUs and Nvidia's H100s for a month, and the real cost came down to more than just the hourly rate.

Tobias 7 min read
  • gpu
  • tpu
  • h100
  • llm
  • fine-tuning
  • comparison

We spent a few weeks trying to fine-tune Llama 3 70B on Google’s TPU v4 and Nvidia’s H100, assuming it would be a simple ‘cost per hour’ comparison. It wasn’t. The real cost comes down to how much you’re willing to re-architect your codebase, how much you value ecosystem flexibility, and whether you want to bet on a single vendor’s specific flavour of acceleration.

Our initial thought was simple: TPUs are Google’s answer to Nvidia’s dominance, purpose-built for AI workloads. They should be cheaper, or at least significantly more performant, for workloads like LLM fine-tuning where tensor operations are king. We set up Llama 3 70B for full fine-tuning (not LoRA; we wanted to push the memory limits) and ran it against a TPU v4-8 configuration and an 8x H100 setup, both on Google Cloud for a fair network comparison.

The Raw Specs and Sticker Shock

On paper, the pricing for a TPU v4-8 pod (8 chips, 16GB HBM each, 128GB total) often looks competitive, sometimes even aggressive, against an 8x 80GB H100 setup. But the devil, as always, is in the details, and the ecosystem surrounding each choice is anything but equal.

Here’s a snapshot of what we were looking at in terms of base hardware and pricing as of mid-May 2026. Note that H100 pricing varies wildly across providers, so we’re using Google Cloud’s a3-highgpu-8g for a direct comparison, though cheaper H100s exist elsewhere (see our A100 Cloud Pricing comparison for context).

| Feature | Google Cloud TPU v4-8 | Google Cloud 8x H100 |
| --- | --- | --- |
| Accelerator | 8x TPU v4 chips | 8x Nvidia H100 GPUs |
| HBM per chip/GPU | 16 GB | 80 GB |
| Total HBM | 128 GB | 640 GB |
| Interconnect | TPU Interconnect (4800 Gbps total) | NVLink (900 GB/s per GPU) |
| Base hourly cost | ~$13.50/hr | ~$30.00/hr |
| Typical instance RAM | 256 GB | 1152 GB |
| OS & Drivers | Pre-configured, Google-managed | Linux, Nvidia Drivers, CUDA |
| Primary Frameworks | JAX, TensorFlow, (PyTorch/XLA) | PyTorch, TensorFlow, JAX |

At first glance, the TPU v4-8 looks like a clear winner on raw hourly cost if you only need 128GB of aggregate HBM. But 128GB is a tight squeeze for fine-tuning Llama 3 70B, even with clever sharding. The 8x H100 setup offers 640GB of HBM, five times as much, giving you far more headroom for larger models, batch sizes, or longer sequences. To get close to that on TPUs, you’d need a larger configuration: a v4-16 only doubles the total to 256GB, and a v4-32 reaches 512GB, with the hourly cost rising accordingly.
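To make the memory squeeze concrete, here is a back-of-the-envelope sketch that counts bytes per parameter. It is a rough estimate, not something measured on either setup: the 70e9 parameter count is rounded, the bytes-per-parameter figures for each scenario are assumptions, and activations, offloading, and quantized optimizer states are all left out.

```python
# Back-of-the-envelope HBM estimate for full fine-tuning.
# The bytes-per-parameter figures are assumptions, not measurements.

PARAMS = 70e9  # Llama 3 70B, rounded


def training_state_gb(params, weight_bytes, grad_bytes, optim_bytes):
    """Weights + gradients + optimizer state in GB (1e9 bytes). Activations excluded."""
    return params * (weight_bytes + grad_bytes + optim_bytes) / 1e9


SCENARIOS = {
    # label: (weight, gradient, optimizer-state) bytes per parameter
    "bf16 weights only": (2, 0, 0),
    "bf16 weights, grads and Adam moments": (2, 2, 4),
    "bf16 weights and grads, fp32 master copy and Adam moments": (2, 2, 12),
}

ESTIMATES = {
    label: training_state_gb(PARAMS, *per_param_bytes)
    for label, per_param_bytes in SCENARIOS.items()
}

for label, gb in ESTIMATES.items():
    print(f"{label}: {gb:,.0f} GB | under 128 GB: {gb <= 128} | under 640 GB: {gb <= 640}")
```

Under these assumptions the bf16 weights alone come to roughly 140 GB, so the choice of precision for gradients and optimizer state decides whether a full fine-tune fits in a given HBM budget at all.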

The Fine-Tuning Performance & Workflow Reality

This is where the theoretical advantages of TPUs often hit the practical wall of existing workflows. If your team is already deeply entrenched in PyTorch and hasn’t heavily optimized for XLA (Accelerated Linear Algebra) or JAX, the migration effort is non-trivial. Nvidia’s CUDA ecosystem has close to two decades of maturity and community support, and virtually every PyTorch library and model is designed to run on it.

We found that getting Llama 3 70B fine-tuning to perform optimally on the TPU v4-8 required significant code changes to our existing PyTorch script. While PyTorch/XLA has improved dramatically, it still felt like we were fighting the framework rather than working with it. Compiler errors, debugging issues, and a general lack of familiar tooling added hours to setup and iteration time. Once running, the TPUs did perform well on the tensor operations they’re designed for, often showing higher FLOPS utilization than the H100s on purely matrix-multiplication-heavy work. But that’s a small slice of a full fine-tuning loop.

On the 8x H100 setup, our existing PyTorch DDP (Distributed Data Parallel) code for multi-GPU training just worked. The 640GB of total HBM meant we could use larger batch sizes, so each step processes more tokens and wall-clock convergence is often faster, even if the per-step time isn’t strictly faster than a perfectly optimized TPU run. The NVLink interconnect on the H100s provided high-bandwidth communication between GPUs, making gradient synchronization efficient.

For our Llama 3 70B fine-tuning, measured in tokens per second during training, the H100 setup consistently delivered higher effective throughput after accounting for all the overheads:

| Accelerator | Setup Time (initial run) | Average Batch Size | Effective Tokens/sec (per chip/GPU) | Effective Tokens/sec (total) |
| --- | --- | --- | --- | --- |
| TPU v4-8 | 6 hours (incl. code adapt) | 8 | 1,200 | 9,600 |
| 8x H100 | 1 hour (existing code) | 16 | 1,800 | 14,400 |

These numbers are estimates from our specific Llama 3 70B fine-tuning workload, using a sequence length of 2048 and a global batch size that fit comfortably on each setup. The setup time for TPUs included several iterations of debugging XLA compilation errors and PyTorch data loading issues. The H100 setup was essentially ‘plug and play’ for our pre-existing code.
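The hourly prices and throughput figures can be folded into a cost per token, and the setup hours into a one-off fixed cost. The sketch below does that arithmetic with the numbers from the two tables. Two things in it are assumptions rather than anything from our runs: `ENGINEER_USD_PER_HOUR` is a hypothetical loaded engineer cost, and setup hours are billed as if the accelerators were running the whole time.

```python
# Cost per token and break-even point, using the table figures.
# ENGINEER_USD_PER_HOUR is hypothetical; billing setup hours as accelerator
# time is an assumption.

SETUPS = {
    "tpu_v4_8": {"usd_per_hour": 13.50, "tokens_per_sec": 9_600, "setup_hours": 6},
    "h100_x8": {"usd_per_hour": 30.00, "tokens_per_sec": 14_400, "setup_hours": 1},
}
ENGINEER_USD_PER_HOUR = 100.0  # hypothetical loaded cost of one engineer


def usd_per_token(setup):
    """Accelerator cost of training on a single token."""
    return setup["usd_per_hour"] / (setup["tokens_per_sec"] * 3600)


def fixed_cost(setup, engineer_rate):
    """One-off setup cost: accelerator time plus engineer time."""
    return setup["setup_hours"] * (setup["usd_per_hour"] + engineer_rate)


def total_cost(setup, train_tokens, engineer_rate):
    return fixed_cost(setup, engineer_rate) + train_tokens * usd_per_token(setup)


def break_even_tokens(cheap_per_token, cheap_to_start, engineer_rate):
    """Training tokens at which the two setups cost the same in total."""
    fixed_gap = fixed_cost(cheap_per_token, engineer_rate) - fixed_cost(cheap_to_start, engineer_rate)
    per_token_gap = usd_per_token(cheap_to_start) - usd_per_token(cheap_per_token)
    return fixed_gap / per_token_gap


tpu, h100 = SETUPS["tpu_v4_8"], SETUPS["h100_x8"]
for name, setup in SETUPS.items():
    print(f"{name}: ${usd_per_token(setup) * 1e6:.3f} per million tokens")
print(f"break-even: {break_even_tokens(tpu, h100, ENGINEER_USD_PER_HOUR) / 1e9:.2f}B tokens")
```

With these inputs the TPU comes out at about $0.39 per million tokens against about $0.58 for the H100s, so on accelerator time alone the TPU is cheaper. The extra setup hours push the break-even point out to roughly 2.9 billion training tokens at the hypothetical $100/hr engineer rate, and every later round of TPU-specific debugging pushes it further.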

The Ecosystem and Operational Friction

The choice between TPU and H100 isn’t just about the silicon; it’s about the entire ecosystem you’re buying into. With TPUs, you’re deep in the Google Cloud world. While powerful, it can feel a bit like a walled garden. The gcloud CLI and APIs are robust, but the specialized TPU tooling comes with a learning curve if you’re used to a more generic Linux/CUDA environment. Debugging at the hardware level is opaque, and you’re relying heavily on Google’s support.

With H100s, you have choice. You can rent them on Google Cloud, sure, but also on AWS, Azure, Lambda Labs, Runpod, Vultr, and dozens of other providers. This provides flexibility and reduces vendor lock-in. If Google Cloud’s H100s are too expensive or unavailable, you can spin up an instance on, say, Runpod, and your code will likely run with minimal changes. This flexibility is a significant operational advantage, especially for smaller teams who need to react quickly to pricing or availability changes.

Consider also the cost of egress. While our fine-tuning didn’t involve massive outbound data transfers, if you’re pulling large datasets from outside GCP or pushing checkpoints to an external object store, Google Cloud’s egress fees can add up. We’ve written about this extensively in our Egress Fees Still Trap You in 2026 post. While not unique to TPUs, it’s another line item on the bill that can erode any perceived hourly savings.
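For a sense of scale, the egress line item can be estimated from checkpoint size, checkpoint count, and a per-GB rate. In the sketch below every input is a placeholder: the $0.10/GB rate and the ten checkpoints are hypothetical, and the checkpoint size assumes bf16 weights only, with no optimizer state. Substitute your provider’s current rate before drawing conclusions.

```python
# Egress estimate for pushing checkpoints to an external object store.
# Every input is a placeholder; the per-GB rate is hypothetical.


def checkpoint_gb(params, bytes_per_param=2):
    """Checkpoint size in GB (1e9 bytes), weights only."""
    return params * bytes_per_param / 1e9


def egress_cost_usd(gb_per_checkpoint, n_checkpoints, usd_per_gb):
    return gb_per_checkpoint * n_checkpoints * usd_per_gb


ckpt_gb = checkpoint_gb(70e9)  # bf16 weights of a 70B-parameter model
cost = egress_cost_usd(ckpt_gb, n_checkpoints=10, usd_per_gb=0.10)
tpu_hours_equivalent = cost / 13.50  # v4-8 hourly price from the pricing table

print(f"checkpoint: {ckpt_gb:.0f} GB, egress for 10 checkpoints: ${cost:,.2f}")
print(f"equivalent to {tpu_hours_equivalent:.1f} hours of TPU v4-8 time")
```

Expressing the fee in accelerator-hours makes it easy to compare against whatever hourly saving you expected from the cheaper instance.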

So, Which One Would We Actually Use?

For a team starting fresh with a strong JAX background or a willingness to heavily optimize their PyTorch code for XLA, Google Cloud TPUs can offer a compelling price-to-performance ratio, particularly for very specific, large-scale, matrix-heavy workloads. The base hourly cost for a v4-8 is undeniably attractive if 128GB of HBM is enough. However, the operational friction, the steeper learning curve, and the ecosystem lock-in are real costs that don’t appear on the hourly invoice.

For the vast majority of teams, especially those already working with PyTorch and needing flexibility across cloud providers, the Nvidia H100 (or even A100 for smaller budgets) remains the safer, more productive, and ultimately often cheaper option when factoring in developer time and iteration speed. The mature CUDA ecosystem, the abundance of tooling, and the sheer availability across diverse providers make H100s the pragmatic choice. Unless you’re building a massive, custom-optimized model from the ground up and have dedicated engineering resources for TPU-specific optimizations, we’d stick with the H100s and focus on optimizing our training pipelines there.