January 23, 2025 by Vincent Schmalbach
Last week, DeepSeek unveiled their V3 model, trained on just 2,048 H800 GPUs - a fraction of the hardware used by OpenAI or Meta. DeepSeek claims the model matches or exceeds GPT-4 and Claude on several benchmarks.
What's interesting isn't just the results, but how they got there.
The Numbers Game
Let's look at the raw figures:
- Training cost: $5.5M (vs an estimated $40M for GPT-4)
- GPU count: 2,048 H800s (vs estimated 20,000+ H100s for major labs)
- Parameters: 671B
- Training compute: 2.788M H800 GPU hours
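Those last figures line up neatly. Here's a quick back-of-the-envelope check using only the numbers above; the roughly $2 per GPU-hour rate is the implied rental price, not an official quote:

```python
# Sanity check on the reported figures (numbers taken from the list above).
gpu_hours = 2.788e6        # total H800 GPU hours
training_cost = 5.5e6      # reported training cost in USD

# Implied rental rate per GPU-hour (~$1.97, i.e. roughly $2/hour per H800).
print(f"implied rate: ${training_cost / gpu_hours:.2f} per GPU-hour")

# Wall-clock time if all 2,048 GPUs ran continuously (~57 days).
print(f"wall-clock: ~{gpu_hours / 2048 / 24:.0f} days on 2,048 H800s")
```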
Recent research shows model training costs growing by 2.4x annually since 2016. Everyone assumed you needed massive GPU clusters to compete at the frontier. DeepSeek suggests otherwise.
Export Controls: Task Failed Successfully?
The U.S. banned high-end GPU exports to China to slow its AI progress. DeepSeek had to work with H800s - handicapped versions of the H100 with roughly half the chip-to-chip interconnect bandwidth. But this constraint might have accidentally spurred innovation.
Instead of throwing compute at the problem, DeepSeek focused on architectural efficiency:
- FP8 mixed-precision training (a minimal sketch of the idea follows this list)
- Co-optimization of algorithms and infrastructure
- Novel training frameworks
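To make the FP8 point concrete, here's a minimal sketch of per-tensor FP8 (E4M3) quantize/dequantize - the scaling pattern at the heart of FP8 mixed-precision training. It assumes PyTorch 2.1+ (which exposes torch.float8_e4m3fn) and only simulates the precision loss; treat it as an illustration of the general technique, not DeepSeek's actual training kernels.

```python
# Minimal sketch of per-tensor FP8 (E4M3) quantization with dynamic scaling.
# Assumes PyTorch 2.1+ for the torch.float8_e4m3fn dtype. Illustrative only:
# real FP8 training runs the matmuls in 8-bit on the GPU's tensor cores.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def fp8_quantize(x: torch.Tensor):
    """Scale a tensor into the E4M3 range and cast it to FP8."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Cast back to FP32 and undo the scaling."""
    return x_fp8.to(torch.float32) * scale

# Toy linear layer: quantize weights and activations, then compare against FP32.
w = torch.randn(256, 256)
x = torch.randn(32, 256)

w_fp8, w_scale = fp8_quantize(w)
x_fp8, x_scale = fp8_quantize(x)

y_fp8 = fp8_dequantize(x_fp8, x_scale) @ fp8_dequantize(w_fp8, w_scale).T
y_ref = x @ w.T
print(f"max abs error vs FP32: {(y_fp8 - y_ref).abs().max().item():.4f}")
```

Dynamic scales like these are what keep FP8's narrow dynamic range workable; the efficiency win comes from doing the heavy matrix multiplies in 8-bit on hardware like the H800's tensor cores.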
They couldn't access unlimited hardware, so they made their hardware work smarter. It's like they were forced to solve a different, potentially more valuable problem.
The High-Flyer Factor
Context matters, though. DeepSeek isn't a typical startup - they're backed by High-Flyer, an $8B quant fund. Their CEO, Liang Wenfeng, built High-Flyer from scratch and seems focused on foundational research over quick profits:
"If the goal is to make applications, using the Llama structure for quick product deployment is reasonable. But our destination is AGI, which means we need to study new model structures to realize stronger model capability with limited resources."
Beyond the Hype
We should be careful about overinterpreting these results. Yes, DeepSeek achieved impressive efficiency. No, this doesn't mean export controls "backfired" or that they've cracked some magic formula.
What it does show is that the path to better AI isn't just about throwing more GPUs at the problem. There's still huge room for fundamental improvements in how we train these models.
For developers, this is actually exciting news. It suggests you don't need a hyperscaler's budget to do meaningful work at the frontier. The real innovations might come from being resource-constrained, not resource-rich.
The Road Ahead
DeepSeek's paper mentions they're working on "breaking through the architectural limitations of transformers." Given their track record with efficiency improvements, this is worth watching.