Reports say Meta's H100 GPUs were failing about once every 3 hours.
$NVIDIA (NVDA.US)$ During the training of the Llama 3 large language model, Meta was reportedly plagued by frequent H100 GPU failures. With 16,384 H100 80GB GPUs in the training cluster, unexpected component failures occurred on average about once every 3 hours. More than half of these failures were attributed to the GPUs or their memory.
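For a sense of scale, here is a rough back-of-the-envelope sketch of what those two figures imply per GPU. It uses only the numbers quoted above and assumes, purely for illustration, that failures are spread evenly over time and across the cluster. It also counts every unexpected failure, not only the GPU or memory ones, so it is not a measured reliability figure for the H100. The function names are made up for this example.

```python
# Rough scaling of the figures quoted in the post (illustrative only).
GPUS = 16_384                  # H100 80GB GPUs in the training cluster
HOURS_BETWEEN_FAILURES = 3.0   # cluster-wide average quoted in the post
HOURS_PER_YEAR = 24 * 365


def gpu_hours_per_failure(gpus: int, hours_between_failures: float) -> float:
    """GPU-hours of operation accumulated per unexpected failure, cluster-wide."""
    return gpus * hours_between_failures


def failures_per_day(hours_between_failures: float) -> float:
    """Cluster-wide unexpected failures per day at the quoted average rate."""
    return 24.0 / hours_between_failures


gpu_hours = gpu_hours_per_failure(GPUS, HOURS_BETWEEN_FAILURES)
gpu_years = gpu_hours / HOURS_PER_YEAR

print(f"GPU-hours per failure: {gpu_hours:,.0f}")
print(f"Equivalent GPU-years per failure: {gpu_years:.1f}")
print(f"Cluster-wide failures per day: {failures_per_day(HOURS_BETWEEN_FAILURES):.0f}")
```

Under these assumptions, one failure every 3 hours across 16,384 GPUs works out to roughly one failure per 49,152 GPU-hours, or about 5.6 GPU-years of operation, which is a different picture from a single GPU breaking every 3 hours.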
Disclaimer: Community is offered by Moomoo Technologies Inc. and is for educational purposes only.
ジェンスノファン : So I had no choice but to buy Blackwell!
shuan : Excerpted like this, it gives a slightly different impression from the nuance of the original article.
Meta reports that half of the failures during Llama 3 training were caused by frequent NVIDIA H100 GPU failures | XenoSpectrum