Meta reports that half of the failures during Llama 3 training were caused by NVIDIA H100 GPUs and their HBM3 memory

2024/7/30 (some excerpts)
While training the Llama 3 large language model, Meta found itself contending with frequent breakdowns of its NVIDIA H100 GPUs. According to research recently published by Meta, during a 54-day training run on a cluster of 16,384 NVIDIA H100 80GB GPUs, unexpected component failures occurred at an average rate of about once every 3 hours. More than half of these remarkably frequent failures were attributable to the GPUs themselves or their onboard memory.
GPUs are essential, but the results point to reliability issues
The Meta research team trained the Llama 3 405B model for 54 days. The cluster experienced a total of 466 job interruptions during that period: 47 were due to planned maintenance, while the remaining 419 stemmed from unexpected failures. A breakdown of the unexpected interruptions shows that GPU-related issues were by far the largest factor.
Specifically, 58.7% of unexpected interruptions were attributed to GPU-related issues. Of all unexpected interruptions, 30.1% were various GPU failures (including NVLink failures) and 17.2% were HBM3 memory failures. Given that the NVIDIA H100 draws roughly 700 W and is exposed to the accompanying thermal stress, this failure rate may not be surprising.
Meanwhile, there were only two CPU failures over the same period, which makes the relative fragility of the GPUs stand out. The result suggests that while GPUs play a central role in modern large-scale AI training, their reliability remains a concern.
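To put those percentages into rough absolute numbers, the shares can be multiplied against the 419 unexpected interruptions. The counts below are an illustrative back-of-the-envelope calculation, not figures reported directly by Meta.

```python
# Back-of-the-envelope reconstruction of failure counts from the reported shares.
# The percentages come from the article; the derived counts are approximations.
unexpected_interruptions = 419

shares = {
    "GPU-related (all causes)": 0.587,
    "GPU failures incl. NVLink": 0.301,
    "HBM3 memory failures": 0.172,
}

for cause, share in shares.items():
    print(f"{cause}: ~{share * unexpected_interruptions:.0f} interruptions")

# Approximate output:
#   GPU-related (all causes): ~246 interruptions
#   GPU failures incl. NVLink: ~126 interruptions
#   HBM3 memory failures: ~72 interruptions
```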

GPU failures were not the only problem: environmental factors also had a non-negligible impact on training performance. GPU throughput fluctuated by 1-2% over the course of the day as temperatures varied, most likely because the GPUs' dynamic voltage and frequency scaling responded to the temperature changes.
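As a rough illustration of how such an effect could be observed, the sketch below samples GPU temperature, SM clock, and power draw over time using the NVIDIA Management Library via the pynvml package. This is generic monitoring code, not Meta's tooling, and the sampling interval is an arbitrary choice.

```python
# Minimal sketch: periodically sample GPU temperature, SM clock, and power
# to observe DVFS behaviour. Requires the pynvml package; not Meta's tooling.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this node

for _ in range(10):
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
    print(f"temp={temp}C  sm_clock={sm_clock}MHz  power={power_w:.0f}W")
    time.sleep(60)  # sample once a minute; log for a full day to see the diurnal cycle

pynvml.nvmlShutdown()
```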

Furthermore, it was found that fluctuations in the simultaneous power consumption of tens of thousands of GPUs placed a heavy load on the data center's power grid. These swings sometimes reached tens of megawatts, pushing the grid to its limits and highlighting the need for Meta to secure sufficient power supply for future AI training runs.
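A quick order-of-magnitude check makes the scale of those swings plausible. The estimate below is illustrative arithmetic based on the GPU count and the approximate per-GPU power draw cited above, not a figure from the paper.

```python
# Order-of-magnitude estimate of the cluster's GPU power draw (illustrative only).
gpus = 16_384
watts_per_gpu = 700  # approximate H100 power draw cited in the article

gpu_power_mw = gpus * watts_per_gpu / 1e6
print(f"GPU power alone at full load: ~{gpu_power_mw:.1f} MW")
# ~11.5 MW before counting CPUs, networking, and cooling, so synchronized
# stalls and restarts across the cluster can shift grid load by megawatts.
```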
Despite these frequent failures, the Meta team managed to keep effective training time above 90%. This was possible because several mitigation strategies worked together effectively.

First, they shortened job startup and checkpoint creation times so that downtime after a failure could be kept to a minimum. They also developed proprietary diagnostic tools to identify and resolve problems quickly.
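The logic behind frequent checkpointing is straightforward: the more often state is saved, the less work is lost when a job dies. The sketch below shows a generic PyTorch pattern under assumed names (`model`, `optimizer`, `CHECKPOINT_EVERY`); it is not Meta's actual checkpointing stack, which also had to make the saves themselves fast at this scale.

```python
# Generic periodic-checkpoint pattern: a failure only costs the work done
# since the last save. Illustrative sketch, not Meta's system.
import os
import torch

CHECKPOINT_EVERY = 500        # steps between saves (assumed value)
CKPT_PATH = "checkpoint.pt"   # assumed path

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume from the latest checkpoint if one exists; otherwise start at step 0."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

# Inside the training loop:
#     if step % CHECKPOINT_EVERY == 0:
#         save_checkpoint(model, optimizer, step)
```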

Furthermore, PyTorch's NCCL Flight Recorder was used to diagnose and resolve hangs and performance problems, particularly those involving NCCLX, Meta's customized version of NCCL. The tool captures collective-communication metadata and stack traces, which contributed to rapid problem resolution.
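Enabling the flight recorder is typically a matter of setting a few environment variables before the process group is created. The variable names below reflect recent PyTorch releases and may differ between versions, so treat this as a sketch rather than a definitive configuration.

```python
# Sketch: enable PyTorch's NCCL flight recorder before initializing the process
# group. Variable names follow recent PyTorch releases and may differ across
# versions -- check the documentation for the release you are running.
import os

os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"   # ring buffer of recent collectives
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"        # dump traces when the watchdog fires
os.environ["TORCH_NCCL_DEBUG_INFO_TEMP_FILE"] = "/tmp/nccl_trace_rank_"  # dump file prefix

import torch.distributed as dist

# Assumes a launcher such as torchrun has set RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
dist.init_process_group(backend="nccl")
```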
They also developed a specialized tool for identifying “straggler” GPUs that slow down the rest of the cluster. This allowed problematic communication to be detected and dealt with promptly, preserving overall training efficiency.
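Meta's straggler-detection tooling is not public, but the underlying idea can be sketched: have every rank report its step time and flag the ranks that are markedly slower than the rest. The function below is a minimal illustration under assumed names and thresholds, not the actual tool.

```python
# Minimal straggler-detection sketch: gather per-rank step times and flag
# ranks well above the median. Illustrative only; not Meta's internal tool.
import statistics
import torch.distributed as dist

def find_stragglers(step_time_s: float, threshold: float = 1.2) -> list[int]:
    """Return ranks whose reported step time exceeds threshold x the median."""
    world_size = dist.get_world_size()
    all_times = [None] * world_size
    dist.all_gather_object(all_times, step_time_s)  # every rank receives every time
    median = statistics.median(all_times)
    return [rank for rank, t in enumerate(all_times) if t > threshold * median]

# Usage inside the training loop, after timing one step:
#     stragglers = find_stragglers(measured_step_time)
#     if dist.get_rank() == 0 and stragglers:
#         print(f"slow ranks this step: {stragglers}")
```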

The fact that Meta's 16,384-GPU cluster experienced 419 unexpected failures over 54 days (7.76 per 24 hours, or roughly one every 3 hours) has important implications for the reliability of even larger AI training clusters.

For example, the cluster of 100,000 H100 GPUs operated by xAI is about six times larger than Meta's. Assuming a similar per-GPU failure rate, the xAI cluster could see failures even more frequently. This projection suggests that ensuring reliability in large-scale AI training will only grow in importance.
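The scaling argument is simple proportional arithmetic. The sketch below reproduces the observed rate and the extrapolation, under the illustrative assumption that the failure count scales linearly with GPU count.

```python
# Observed failure rate on Meta's cluster and a linear extrapolation to a larger
# cluster. Illustrative arithmetic; assumes failures scale with GPU count.
meta_gpus = 16_384
failures = 419
days = 54

failures_per_day = failures / days
print(f"Meta: {failures_per_day:.2f} failures/day, "
      f"one every {24 / failures_per_day:.1f} h")
# -> about 7.76 failures/day, i.e. one every ~3.1 hours

xai_gpus = 100_000
scaled_per_day = failures_per_day * xai_gpus / meta_gpus
print(f"100,000-GPU cluster: ~{scaled_per_day:.0f} failures/day, "
      f"one every ~{24 / scaled_per_day * 60:.0f} min")
# -> roughly 47 failures/day, i.e. one every ~30 minutes
```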

Meta's experience highlights the complexity of operating large-scale AI systems. The fact that a high effective training time was maintained despite frequent breakdowns shows the value of proactive failure-mitigation strategies. At the same time, it is clear that improvements are needed on both the hardware and infrastructure fronts: more reliable hardware, more efficient cooling systems, and more stable power delivery.
As AI models and their training clusters continue to grow, the lessons from Meta's experience will serve as an important guide for the AI industry as a whole. Addressing these issues will require cooperation among hardware manufacturers, data center designers, and AI researchers, and doing so will be essential to the development of next-generation AI systems.

Sources
Meta: The Llama 3 Herd of Models
Tom's Hardware: Faulty Nvidia H100 GPUs and HBM3 memory responsible for half of failures during Llama 3 training — one failure every three hours for Meta's 16,384 GPU training cluster