Mega-Watts are a thing of the past

It's actually quite nice when ML researchers publish not only their results but also the time and compute it took to train their models.

The DeepSeek-V3 paper (arXiv:2412.19437) even puts it in the abstract:

Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.

They say "only" because there are models that require more.

2.788 million hours! That is about 318 years on a single H800 GPU, which draws about 350 watts. That works out to roughly 975,800,000 watt-hours, or about 976 MWh of energy: nearly a gigawatt-hour.
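
A quick back-of-the-envelope sketch of that arithmetic in Python (the 350 W figure is the assumed per-GPU power draw from the paragraph above, not a number reported in the paper):

    # Rough conversion of reported GPU hours into years and energy.
    GPU_HOURS = 2_788_000      # H800 GPU hours reported for DeepSeek-V3
    TDP_WATTS = 350            # assumed power draw of one H800 GPU
    HOURS_PER_YEAR = 24 * 365  # 8760

    years = GPU_HOURS / HOURS_PER_YEAR   # sequential years on one GPU
    energy_wh = GPU_HOURS * TDP_WATTS    # energy in watt-hours
    energy_mwh = energy_wh / 1e6         # same figure in megawatt-hours

    print(f"~{years:.0f} years on a single H800")      # ~318 years
    print(f"{energy_wh:,.0f} Wh = ~{energy_mwh:.0f} MWh")  # 975,800,000 Wh = ~976 MWh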

I wonder what Grace Hopper would have said to that.