Meta reports an H100 failure once every 3 hours during Llama 3 training
$NVIDIA (NVDA.US)$ While training the Llama 3 large language model, Meta found itself suffering frequent H100 GPU failures. Across a cluster of 16,384 H100 80GB GPUs, unexpected component failures occurred at an average rate of once every 3 hours. More than half of these alarmingly frequent failures were attributed to the GPUs or their memory.
Disclaimer: Community is offered by Moomoo Technologies Inc. and is for educational purposes only.
ジェンスノファン : So I had no choice but to buy Blackwell!
shuan : Excerpted like this, the impression is slightly different from the nuance conveyed in the original article
Meta reports that half of the failures during Llama 3 training were caused by frequent NVIDIA H100 GPU failures | XenoSpectrum