Datacenter GPUs for AI may have a shorter lifespan than expected: reportedly only 1-3 years
October 25, 2024
The lifespan of datacenter GPUs used for training and inference of Large Language Models (LLMs) may fall well below conventional expectations. According to an anonymous Chief AI Architect at Alphabet, these GPUs may last only 1-3 years under high load. If accurate, this could have a significant impact on the profitability of the rapidly expanding AI industry.
1、The reality of shortened GPU lifespans and the factors behind it
Utilization rates of GPUs used for AI processing in data centers reach 60-70%. Under this sustained high load, a GPU is estimated to last only 1-2 years, or about 3 years at most. This information was reported by TechFund, a tech investor known for its reliable sources.
The primary reason for the shortened lifespan is that the latest datacenter GPUs consume over 700W of power and generate a correspondingly large amount of heat. This heat load is believed to place serious stress on the delicate semiconductor chips.
2、Failure rates indicated by actual data
A report published by Meta this year on the training of its Llama 3 405B model provides concrete failure data. During the 54-day training period on a cluster of 16,384 NVIDIA H100 80GB GPUs, the cluster experienced a total of 466 job interruptions, most of which were caused by GPUs. The following figures were reported:
- 419 of the interruptions were unexpected
- Among them, 148 cases (30.1%) were GPU-related failures (including NVLink failures)
- 72 cases (17.2%) were HBM3 memory failures
The annualized failure rate estimated from these data is approximately 9%, which would add up to about 27% over 3 years of use. Furthermore, failure rates reportedly tend to rise the longer the hardware remains in service.
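The article does not state how the 9% figure was derived, but it is consistent with annualizing the GPU and HBM3 failure counts from the 54-day run. The following is a minimal Python sketch of that arithmetic, not the original authors' method. It assumes failures occur at a constant rate over time and that each reported failure affects exactly one GPU; both are simplifications, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the ~9% annualized failure rate.
# Assumptions (not from the source): failures arrive at a constant rate,
# and each reported failure corresponds to exactly one GPU.

GPUS = 16_384          # H100 GPUs in the cluster
RUN_DAYS = 54          # length of the training run in days
GPU_FAILURES = 148     # GPU-related failures (including NVLink)
HBM3_FAILURES = 72     # HBM3 memory failures


def annualized_failure_rate(failures: int, units: int, days: float) -> float:
    """Scale the per-unit failure fraction observed over `days` to 365 days."""
    return failures / units * 365 / days


def cumulative_linear(annual_rate: float, years: float) -> float:
    """Simple linear extrapolation: annual rate multiplied by years."""
    return annual_rate * years


def cumulative_compound(annual_rate: float, years: float) -> float:
    """Share of units failing at least once, treating each year independently."""
    return 1 - (1 - annual_rate) ** years


annual = annualized_failure_rate(GPU_FAILURES + HBM3_FAILURES, GPUS, RUN_DAYS)
print(f"annualized failure rate: {annual:.1%}")
print(f"3-year, linear:          {cumulative_linear(annual, 3):.1%}")
print(f"3-year, compounding:     {cumulative_compound(annual, 3):.1%}")
```

Under these assumptions the annualized rate comes out at roughly 9%, and multiplying it by three gives roughly 27%, in line with the figures quoted in the article. The compounding variant gives a somewhat lower three-year figure, which suggests the 27% is a simple linear extrapolation.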
3、Impact and Countermeasures for the AI Industry
Shortened GPU lifespans are having a serious economic impact on the AI industry. The financial estimates for OpenAI, a leading company in the field, are a telling example. Despite substantial support from Microsoft, the company is expected to report a $5 billion loss in 2024. One significant contributor to this loss is the cost of the computational resources required to train and operate large language models.
Google, meanwhile, continues to invest heavily in AI processing capacity, reportedly spending $13.2 billion in the second quarter of 2024, primarily on hardware for AI processing. However, such investments no longer behave like the long-term capital expenditures they were traditionally assumed to be. If equipment must be replaced on cycles of three years or less, the expected return on investment changes significantly.
In response, some datacenter operators are deliberately lowering GPU utilization to extend hardware lifespan. This countermeasure comes with a significant trade-off: lower utilization lengthens the equipment's depreciation period and thereby reduces investment efficiency. The dilemma is becoming a structural challenge for the entire AI industry.
4、Xenospectrum’s Take
The revelation of the shortened lifespan of GPUs marks an important turning point for the AI industry. This issue goes beyond mere technical challenges and may affect the entire industrial structure.
First, the approach to equipment investment will need to be reevaluated. Investment plans based on the traditional 3-year depreciation period are becoming increasingly outdated. As shorter payback periods become necessary, AI companies will face even stronger time pressure to monetize.
Ironically, this situation further strengthens the dominance of NVIDIA, which holds an overwhelming share of the GPU market. The company's market capitalization reached 3 trillion dollars in June 2024, and the steady replacement demand for GPUs is expected to drive further growth.
The more fundamental challenge, however, lies in the sustainability of the current AI business model. The current approach to developing and operating large language models, which requires massive computational resources, may need a fundamental rethink given how short-lived the hardware turns out to be.
Going forward, the AI industry will have little choice but to focus on developing more efficient model architectures and new training methods. Competition to develop dedicated AI accelerators as an alternative to GPUs is also expected to intensify. The short lifespan of GPUs could influence the direction in which AI technology itself evolves, and how well companies handle these technical and economic challenges will be a crucial factor in their future competitiveness.