Hetzner Dedicated RTX 4070 Ti vs. Cloud RTX 4070 Super for Llama 3
Comparing Hetzner's dedicated RTX 4070 Ti vs. cloud RTX 4070 Super for Llama 3 inference. Find out which offers better cost per token and flexibility.
- gpu
- comparison
- hetzner
- rtx4070
- llama3
On a cool morning in late May, we crunched numbers for a new Llama 3 8B inference deployment, and the hourly cloud GPU rate for a 4070 Super looked reasonable enough. Then we compared it to a fully-spec’d Hetzner dedicated server with an RTX 4070 Ti at a fixed monthly price, and the cloud’s ‘flexibility’ started to look like a very expensive indulgence.
Dedicated vs. cloud for inference: why it matters
When you’re running Llama 3 inference, especially at scale, the GPU is only one part of the equation. You’re paying for the hardware, sure, but also for the setup time, the network, the storage, and, perhaps most crucially, the mental overhead of constant cost monitoring. For many teams, the instant availability of cloud GPUs makes sense for bursty workloads or initial experimentation. Spinning up an instance, running a quick test, and spinning it down keeps the bill predictable in an hourly sense, if not always in a monthly one.
However, once a workload becomes continuous or predictable, that hourly premium quickly compounds. This is where dedicated hardware, like a Hetzner dedicated server, starts to look appealing. You pay a fixed monthly fee, get full control over the machine, and often find significantly better price-to-performance ratios for sustained use. The trade-off, of course, is the increased operational responsibility and the lack of instant scalability. For Llama 3 inference, where VRAM and raw throughput are key, getting the most bang for your buck on the GPU itself can dramatically alter your total cost of ownership.
The contenders: Hetzner dedicated RTX 4070 Ti vs. cloud RTX 4070 Super
We’re pitting a self-managed Hetzner dedicated server featuring an RTX 4070 Ti against a typical on-demand cloud instance running an RTX 4070 Super. Both are solid mid-range cards with 12GB of VRAM. That is enough for Llama 3 8B once the weights are quantized to 8-bit or lower (FP16 weights alone take roughly 16GB), but not for Llama 3 70B, which exceeds 12GB even at 4-bit, so we’ll stick to 8B for a consistent comparison. Both cards share the same Ada Lovelace architecture; the differences lie in their core counts and clock speeds.
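As a rough check on what fits in 12GB, here is a minimal weights-only estimate. The parameter counts are rounded (8B and 70B), the bytes-per-parameter figures ignore quantization metadata, and the 2GB headroom for KV cache and runtime overhead is an assumption chosen for illustration, not a measured value.

```python
# Weights-only VRAM estimate. KV cache, activations, and CUDA context are
# covered only by the assumed headroom below.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str,
         vram_gb: float = 12.0, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weights_gb(params_billion, precision) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for name, size in [("Llama 3 8B", 8.0), ("Llama 3 70B", 70.0)]:
        for precision in BYTES_PER_PARAM:
            gb = weights_gb(size, precision)
            print(f"{name} @ {precision}: ~{gb:.0f} GB weights, "
                  f"fits in 12 GB: {fits(size, precision)}")
```

In practice the usable context length shrinks as the weights take up more of the card, so treat a "fits" result as a starting point and not a guarantee.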
The RTX 4070 Ti launched about a year before the Super variant, within the same generation, and was a higher-tier card at its release. The RTX 4070 Super, on the other hand, is a refresh that slots neatly into the mid-range. Here’s how their key specs stack up, per Nvidia’s published data as of 2026-06-14:
| Feature | NVIDIA RTX 4070 Ti | NVIDIA RTX 4070 Super |
|---|---|---|
| CUDA Cores | 7,680 | 7,168 |
| VRAM | 12 GB GDDR6X | 12 GB GDDR6X |
| Memory Interface | 192-bit | 192-bit |
| Memory Bandwidth | 504 GB/s | 504 GB/s |
| TDP | 285W | 220W |
As you can see, the RTX 4070 Ti holds a roughly 7% edge in CUDA cores (7,680 vs. 7,168), which helps compute-bound work such as prompt processing and large-batch inference. Memory bandwidth, a critical factor for feeding large models and typically the limit on single-stream token generation, is identical between the two cards. The Ti’s higher TDP means it can draw more power under load, which reflects the larger enabled chip. For our Llama 3 8B comparison, we’d expect the 4070 Ti to perform marginally better.
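The bandwidth and core-count figures in the table can be reproduced with a little arithmetic. This sketch assumes a 21 Gbps per-pin GDDR6X data rate, which is not stated in the table above; the function name is purely illustrative.

```python
def memory_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate x bus width / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


# Assumption: both cards run GDDR6X at 21 Gbps per pin on a 192-bit bus.
bandwidth = memory_bandwidth_gbs(21.0, 192)

# CUDA core counts taken from the spec table.
core_advantage = 7680 / 7168 - 1.0

if __name__ == "__main__":
    print(f"Peak bandwidth: {bandwidth:.0f} GB/s")
    print(f"4070 Ti CUDA core advantage: {core_advantage:.1%}")
```

Because the bandwidth is the same on both cards, any gap in single-request decode speed is likely to be smaller than the core-count gap suggests.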
Comparing the hourly and monthly costs
This is where the rubber meets the road. For dedicated servers, you’re looking at a fixed monthly cost, regardless of how much you use the GPU (within reason, for power consumption). For cloud instances, it’s typically an hourly rate, which can quickly add up if you run continuously. We’re using publicly available pricing as of 2026-06-14.
Hetzner Dedicated Server (EX44-NVMe with RTX 4070 Ti):
Hetzner’s EX44-NVMe, configured with an Intel Core i7-12700, 64GB DDR4 RAM, 2x 1TB NVMe SSDs, and a dedicated NVIDIA RTX 4070 Ti, is listed at approximately €59.00/month (before VAT, per their dedicated root server configurations page). This includes 20 TB of traffic, which is usually more than enough for inference workloads.
Cloud RTX 4070 Super (e.g., Runpod, on-demand):
Providers like Runpod offer on-demand RTX 4070 Super instances. Per their GPU prices page, an RTX 4070 Super with a reasonable CPU, 32GB RAM, and 50GB NVMe storage goes for around $0.34/hour. While RTX 4070 Super cloud pricing can vary slightly between providers, this rate is a good average.
Let’s break down the costs:
| Item | Hetzner Dedicated RTX 4070 Ti | Cloud RTX 4070 Super (Runpod) |
|---|---|---|
| GPU | RTX 4070 Ti | RTX 4070 Super |
| List Price (approx) | €59.00/month (~$64.00) | $0.34/hour |
| Monthly Cost (720 hrs) | ~$64.00 (fixed) | $244.80 |
| Included Traffic | 20 TB | Varies (typically 1 TB free) |
Immediately, the disparity is clear. The cloud’s hourly rate, while low for short bursts, becomes significantly more expensive for continuous operation. If you need your Llama 3 inference engine running 24/7, the dedicated server is less than a third of the cost.
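To see where the two pricing models cross over, here is a minimal sketch using the prices quoted above ($64/month fixed vs. $0.34/hour). The constant and function names are illustrative, and storage, egress, and VAT are left out.

```python
HOURS_PER_MONTH = 720          # 30 days x 24 hours
DEDICATED_MONTHLY_USD = 64.00  # approximate USD equivalent of EUR 59.00
CLOUD_HOURLY_USD = 0.34


def cloud_monthly_cost(hours_used: float,
                       hourly_rate: float = CLOUD_HOURLY_USD) -> float:
    """On-demand cost for the given number of billed hours."""
    return hours_used * hourly_rate


def break_even_hours(monthly_fee: float = DEDICATED_MONTHLY_USD,
                     hourly_rate: float = CLOUD_HOURLY_USD) -> float:
    """Billed hours per month at which the cloud bill equals the fixed fee."""
    return monthly_fee / hourly_rate


if __name__ == "__main__":
    full_month = cloud_monthly_cost(HOURS_PER_MONTH)
    crossover = break_even_hours()
    print(f"Cloud, 24/7: ${full_month:.2f} vs dedicated ${DEDICATED_MONTHLY_USD:.2f}")
    print(f"Break-even: {crossover:.0f} h/month (~{crossover / 30:.1f} h/day)")
```

Below the break-even point the on-demand instance is the cheaper option on price alone; above it, every additional hour widens the gap in favor of the fixed fee.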
Llama 3 inference performance and cost per million tokens
For Llama 3 8B inference, both cards provide ample VRAM. The performance difference will come down to raw compute. Based on the CUDA core counts and general performance benchmarks for these cards, we can estimate:
- RTX 4070 Ti: ~700-750 tokens/sec for Llama 3 8B with vLLM, batch size 1 (a desk-research estimate, not a measurement; actual performance will vary with model quantization, framework, and system load).
- RTX 4070 Super: ~650-700 tokens/sec for Llama 3 8B with vLLM, batch size 1 (again, an estimate).
Let’s take the low end of each range, 700 tokens/sec for the 4070 Ti and 650 tokens/sec for the 4070 Super, to calculate the cost per million tokens. For simplicity, we’ll convert the Hetzner price to USD so both columns use the same currency.
| Metric | Hetzner Dedicated RTX 4070 Ti | Cloud RTX 4070 Super (Runpod) |
|---|---|---|
| Estimated Tokens/sec | 700 | 650 |
| Tokens per Hour | 2,520,000 | 2,340,000 |
| Cost per Hour (USD) | ~$0.09 (fixed $64/720 hrs) | $0.34 |
| Cost per Million Tokens | ~$0.035 | ~$0.145 |
These numbers put the economic break-even point at roughly six hours of use per day: $64 buys about 188 hours of the cloud instance at $0.34/hour. If your Llama 3 inference workload consistently runs longer than that, the dedicated server becomes the cheaper Llama 3 inference option. At full utilization, the Hetzner server is roughly four times cheaper for every million tokens processed. Even if you only used the dedicated server for 8 hours a day, its effective hourly rate for that period (about $0.27 over 240 hours) would still beat the cloud’s on-demand offering.
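The cost-per-token arithmetic behind the table can be sketched as follows. The throughput figures are the article’s estimates, not measurements, and the function names are illustrative. Using the unrounded hourly rate ($64/720) gives a dedicated figure slightly below the one you get from the rounded $0.09.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * SECONDS_PER_HOUR
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def dedicated_hourly_cost(monthly_fee_usd: float,
                          hours_used: float = HOURS_PER_MONTH) -> float:
    """Effective hourly cost of a fixed monthly fee spread over the hours used."""
    return monthly_fee_usd / hours_used


# Estimated throughputs from the article: 700 tok/s (Ti), 650 tok/s (Super).
dedicated_full = cost_per_million_tokens(dedicated_hourly_cost(64.0), 700)
dedicated_8h = cost_per_million_tokens(dedicated_hourly_cost(64.0, 8 * 30), 700)
cloud = cost_per_million_tokens(0.34, 650)

if __name__ == "__main__":
    print(f"Dedicated, 24/7:   ${dedicated_full:.3f} per 1M tokens")
    print(f"Dedicated, 8 h/day: ${dedicated_8h:.3f} per 1M tokens")
    print(f"Cloud on-demand:   ${cloud:.3f} per 1M tokens")
```

Swapping in your own measured tokens/sec is the most useful change to make here, since the result scales inversely with throughput.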
Beyond raw cost: setup, flexibility, and egress
Raw cost per token is important, but it’s not the whole story. The operational realities differ significantly:
- Setup and Management: With Hetzner, you’re responsible for the entire software stack. OS installation, GPU driver setup, Docker, vLLM, model loading: it’s all on you. This requires more technical expertise and initial time investment. Cloud instances, while still needing configuration, often come with pre-baked AMIs or Docker images that streamline the process. For a detailed look at this trade-off, see our comparison of Hetzner GPU Cloud vs. Dedicated.
- Flexibility and Scalability: Cloud GPUs win here hands down. Need more GPUs for a burst of traffic? Spin up more instances. Need to switch GPU types? Terminate and launch a new one. With a dedicated server, you’re locked into that specific hardware for the duration of your rental. Scaling means renting another physical server, which takes time and isn’t dynamic.
- Instance Availability: On-demand cloud GPUs, especially popular models like the 4070 Super, can sometimes be unavailable in certain regions during peak times. Dedicated servers, once provisioned, are yours. However, initial provisioning can take a few days or even weeks if specific hardware configurations are in high demand.
- Egress Costs: This is often the hidden killer. Both providers offer some free egress, but high-volume Llama 3 inference can generate substantial output data. Hetzner’s dedicated servers often come with very generous traffic allowances (e.g., 20TB/month), effectively making egress free for most use cases. Cloud providers, on the other hand, typically offer 1TB free and then charge around $0.01 - $0.05 per GB. If your Llama 3 API serves millions of users, those egress charges can quickly erode any perceived cloud flexibility benefits.
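To put the egress figures in concrete terms, here is a minimal sketch of a free-tier-then-per-GB charge. The 1TB free tier and the $0.01 to $0.05 range come from the text above; the $0.03 default is simply the midpoint, and the function name is illustrative. Real providers may bill in tiers or by region.

```python
def egress_cost_usd(gb_out: float,
                    free_gb: float = 1000.0,
                    rate_per_gb: float = 0.03) -> float:
    """Monthly egress charge: traffic beyond the free allowance times the rate."""
    billable_gb = max(0.0, gb_out - free_gb)
    return billable_gb * rate_per_gb


if __name__ == "__main__":
    for tb in (0.5, 5, 20):
        low = egress_cost_usd(tb * 1000, rate_per_gb=0.01)
        high = egress_cost_usd(tb * 1000, rate_per_gb=0.05)
        print(f"{tb} TB out: ${low:.2f} to ${high:.2f} per month")
```

Measuring your actual response sizes before plugging in a traffic figure is worthwhile, since the charge depends entirely on how much data leaves the instance each month.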
Which option wins for your Llama 3 inference workload?
The verdict largely depends on your usage pattern and tolerance for operational complexity. If your Llama 3 inference needs are occasional, bursty, or you require instant, dynamic scaling for unpredictable demand, the cloud RTX 4070 Super is the more practical choice. The premium on cost per token is the price of flexibility and reduced management overhead.
However, if you have a consistent, predictable Llama 3 inference workload that runs for more than about six hours a day, or if you’re building a long-term service and value cost control above all else, the Hetzner dedicated RTX 4070 Ti is the clear winner. The initial setup time and management burden are quickly offset by the dramatically lower cost per token and generous egress allowance. We’d absolutely lean towards dedicated hardware for any workload that looks like it will run for more than a month straight. If you’re looking to try the cloud route yourself, Runpod is a solid place to start your experiments.
Note: the Hetzner price is an example from their EX series; actual auction prices may vary. The Runpod price does not include storage or egress.