Trend Insight | Is the end of AI really photovoltaics, energy storage, and nuclear energy? Back from GTC: thoughts on AIDC design and construction (Part 1)

GDS · Apr 10, 00:00

The following article is from GDS PiloTalk, by Grandpa Gang

INTRODUCTION

"The end of AI is energy." The moment this opinion appeared, it sparked a buzz across the technology community, even against the "extreme" energy-efficiency performance of NVIDIA's Blackwell architecture, whose single-GPU performance is 5 times higher while energy consumption is 25 times lower.
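
Taken at face value, the two figures compound. A minimal back-of-the-envelope sketch, using only the numbers quoted above and assuming both apply to the same workload, shows the implied performance-per-energy gain:

```python
# Back-of-the-envelope check of the quoted Blackwell efficiency claim.
# Assumption (ours): "performance x5" and "energy /25" refer to the same
# workload, so performance per unit of energy compounds multiplicatively.
perf_gain = 5        # single-GPU performance multiplier (quoted above)
energy_divisor = 25  # energy consumption reduction factor (quoted above)

perf_per_energy_gain = perf_gain * energy_divisor
print(f"Implied performance-per-energy improvement: {perf_per_energy_gain}x")
# -> 125x under the stated assumption
```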

For data center companies, reducing energy consumption and improving energy efficiency is an eternal topic, and as AIDC is deployed ever more widely, this topic will only draw more attention.

Of course, in the data center field, the "end" worth thinking about is by no means limited to energy. A so-called limit often reflects the boundaries of our current perception and technical capability, and it may turn out to have no boundary at all. So here, we would like to share our own thoughts on some of today's apparent limits, in light of AIDC's development trends.

1. AI chip density is booming: where is the server's limit?

The NVL72, recently released by NVIDIA, draws a maximum electrical power of 120 kW per cabinet and uses a cold-plate liquid cooling solution. Of that 120 kW, roughly 15%, about 20 kW, still has to be removed by air cooling, which is close to the upper limit of the cooling capacity of room-level air conditioning. According to information from the GTC conference floor, as GPU power density increases further, the next cooling plan will consider a combined immersion plus cold-plate liquid cooling solution, and single-cabinet power is expected to reach 300 kW within the next 2 to 3 years.
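
A small arithmetic sketch, using only the figures quoted above, makes the heat split concrete and shows why the residual air-cooled share already presses against room-level air conditioning limits:

```python
# Heat split for an NVL72-class cabinet, using the figures from the text.
rack_power_kw = 120         # maximum electrical power per cabinet
air_cooled_fraction = 0.15  # share still rejected to air (~15%, quoted)

air_load_kw = rack_power_kw * air_cooled_fraction
liquid_load_kw = rack_power_kw - air_load_kw
print(f"Air-cooled load:    {air_load_kw:.0f} kW per rack")    # ~18 kW, the "about 20 kW" above
print(f"Liquid-cooled load: {liquid_load_kw:.0f} kW per rack")
```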

In fact, a careful look at the NVL72 architecture shows why this new product is so striking: it no longer relies on the simple, repetitive stacking of the traditional chip-server-network architecture. Instead, with the brute-force aesthetics of disruptive deconstruction and restructuring, it selects the most appropriate technical solutions from basic physical principles, driving the exponential evolution of the GPU. The sharp increase in chip density made liquid cooling a necessity, and the heat-removal efficiency of cold-plate liquid cooling in turn brought copper cables, long since replaced by optical modules and optical fiber in classical network architectures, back onto the historical stage, giving birth to a new species. This is a classic case of quantitative change producing qualitative change, and a typical example of how supporting traditional technologies get displaced once the underlying technology iterates.

Perhaps, many years from now, the upper power limit of a single GPU will no longer be set by the chip's cooling capacity but by the capacity of the PDU, UPS, or transformer in the power distribution system; and the maximum capacity of a single AIDC campus will be capped not by chip computing power but by the capacity of power plants and the grid.

2. What is the most appropriate granularity for today's data centers, at the level of a single computer room, a single building, and a campus?

Consider that in the past, data center demand came in megawatt or ten-megawatt increments, so single deliveries and overall demand were small. In campus and building planning, granularity was therefore determined mostly by infrastructure considerations: building planning requirements, fire codes, grid capacity (10 kV / 110 kV / 220 kV), and the optimal trade-off between redundancy and cost for MEP facilities such as diesel generators, chillers, and air conditioning. Today, the explosive growth of AIGC business means that demand for DC capacity is growing exponentially, and when matching the optimal DC or campus capacity, planning is instead dominated by the upper limits of network architecture capacity and chip density.

Under today's mainstream architectures, cloud vendors and Internet companies already build at 100 to 200 MW scale, and as chip density keeps rising this will reach 300 to 500 MW or more. Against this backdrop, and with matching power and land conditions, planners of data center infrastructure should learn from NVIDIA's approach to the GPU: move from designing a data center as a computer to designing a data center as a GPU. If the ideal capacity of such a giant GPU is 100 MW, then the optimal infrastructure granularity is 100 MW. However, chip technology is changing rapidly: the efficiency of NVIDIA's GPU chips has increased 1,000-fold in 8 years, while a data center's investment, construction, and payback period runs 10 to 15 years. Predicting the size of a single data center from current chip technology and matching it with a perfect, ultimate data center plan is neither realistic nor scientific.
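
To make the notion of granularity tangible, here is a rough sizing sketch. The 100 MW block and the 120 kW cabinet come from the text; the PUE value is our own illustrative assumption:

```python
# Rough sizing of a 100 MW "giant GPU" block at NVL72-class rack density.
# Assumption (ours): PUE of 1.2 for a largely liquid-cooled facility.
block_capacity_mw = 100  # candidate infrastructure granularity (from the text)
pue = 1.2                # assumed facility overhead factor
rack_power_kw = 120      # NVL72-class cabinet power (from the text)

it_capacity_kw = block_capacity_mw * 1000 / pue
racks = it_capacity_kw / rack_power_kw
print(f"IT capacity: {it_capacity_kw:.0f} kW -> about {racks:.0f} racks per block")
# At the projected 300 kW/rack, the same block would hold ~278 racks.
```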

We may need to change our approach: take infrastructure efficiency as the guiding dimension; stay reasonable, feasible, and as low-cost as possible; overcome challenges such as building and fire-code planning restrictions, the initial CAPEX ratio, and subsequent continuous phased construction; and create an integrated IDC product model that can adapt to future changes in business requirements. This is a practical problem that data center practitioners need to solve, and a key element in how IDC companies build the core competitiveness of their products.

3. Is it reasonable to say that the end of AI is photovoltaics, energy storage, and nuclear energy?

This must be one of the hottest topics in the industry recently: the end of AI is photovoltaics, energy storage, and nuclear energy, along with energy-adjacent plays such as transformers, copper, and cables. In essence, energy demand is surging because business demand is growing rapidly.

Site selection: moving from being close to the load center to being close to an integrated energy center

From a power supply perspective, once a campus reaches 200 MW or more, a 220 kV substation is required, and once it exceeds 500 MW, even a 220 kV substation cannot meet the demand. For a data center cluster of that size, the capacity of the existing grid is limited, and the data center can only be established closer to an energy center with sufficient electricity. The contradiction is that large energy centers require large one-time investments; if data center demand cannot be matched within a short period, the initial investment cost becomes enormous.

Therefore, to better solve such problems, regional integrated energy management systems supporting data center infrastructure have emerged. By applying source-grid-load-storage coordination, the various kinds of energy within a local area can be used effectively and economically. Furthermore, a well-guaranteed energy supply with long-term stable OPEX adds more value to the certainty of data center infrastructure investment than civil works and MEP systems, which are comparatively easy to deliver.

On-site: photovoltaics and storage, all in one

Wind, photovoltaic, and storage technology by itself, whether on-site or off-site, has nothing inherently to do with the data center; whether it can be used depends entirely on the maturity of the technology and the application scenario. At least under current market and technical conditions, photovoltaics and storage are strongly tied to project location and suffer from unstable output, so they are difficult to use as the sole general-purpose power solution for an AIDC and must be combined with other stable energy sources. On-site PV and storage are hard to scale because of space constraints, but since they can supply additional energy over time and can be matched to the data center's load profile and local climate, they can be integrated with the data center infrastructure to improve the facility's grid-to-IT power conversion rate, replace some UPS backup equipment, and generate additional economic benefits, potentially far greater than those of conventional PV and storage alone.
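
An illustrative sizing sketch shows why standalone PV struggles to carry an AIDC; every parameter value below is our own assumption, not a figure from the text:

```python
# Why on-site PV alone struggles to power an AIDC (illustrative only).
# All parameter values are our own assumptions for this sketch.
dc_load_mw = 100        # assumed steady data center load
capacity_factor = 0.15  # assumed PV capacity factor (strongly site-dependent)
watts_per_m2 = 200      # assumed installed PV watts per square meter of land

nameplate_mw = dc_load_mw / capacity_factor  # nameplate needed to cover the average load
land_km2 = (nameplate_mw * 1e6 / watts_per_m2) / 1e6
print(f"Nameplate PV needed: ~{nameplate_mw:.0f} MW")
print(f"Land required:       ~{land_km2:.1f} km^2, before adding storage for nights")
```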

The future: prospects for the application of nuclear energy technology

As an IDC practitioner who worked in the nuclear power industry for many years, I never imagined that these two industries would combine into today's hot topic. Over the past two years, Microsoft's and AWS's moves toward nuclear power have also sparked plenty of discussion among peers. Looking at nuclear plants and data centers side by side, the energy side and the load side have a great deal in common, and the fit is rather elegant: one provides stable output while the other presents stable demand, and the two follow similar logic in their safety and redundancy design.

After years of development, China's nuclear power industry has independently completed the Hualong One and Guohe One advanced third-generation technologies with full intellectual property rights, and is among the world leaders. Thanks to the completeness of China's nuclear supply chain, localization has exceeded 90%; the capital cost of nuclear power has come down to about 15,000 RMB/kW, less than 20% of the cost of comparable reactor types overseas, and the feed-in cost of electricity is now below 0.4 yuan per kWh. It is also worth remembering that the design life of a nuclear power plant is 40 to 60 years. Compared with photovoltaic or wind power systems with lifespans of only 10 to 20 years, whether measured by supply stability or by overall investment cost, nuclear power should only become more competitive as the technology continues to iterate. In terms of technical maturity, the ACP100, a small nuclear reactor independently developed by China, already meets the conditions for commercialization, and small modular reactors (SMRs) in the 100 to 300 MW class are well matched in capacity to an AIDC campus.
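
A rough per-kWh capital-cost comparison illustrates the lifetime argument. The nuclear capex and the lifetimes are the text's figures; the capacity factors and the PV capex are our own assumptions, and fuel, O&M, and financing are ignored, so these are not full LCOE numbers:

```python
# Rough capital-cost-per-kWh comparison (not a full LCOE: fuel, O&M and
# financing are ignored). Capacity factors and PV capex are assumptions.
HOURS_PER_YEAR = 8760

def capex_per_kwh(capex_rmb_per_kw, life_years, capacity_factor):
    """Capital cost spread over lifetime generation, per installed kW."""
    lifetime_kwh = HOURS_PER_YEAR * life_years * capacity_factor
    return capex_rmb_per_kw / lifetime_kwh

nuclear = capex_per_kwh(15000, life_years=50, capacity_factor=0.90)  # text: 15,000 RMB/kW, 40-60 yr life
pv = capex_per_kwh(3500, life_years=15, capacity_factor=0.15)        # assumed capex; text: 10-20 yr life
print(f"Nuclear capital cost: ~{nuclear:.3f} RMB/kWh")  # ~0.04
print(f"PV capital cost:      ~{pv:.3f} RMB/kWh")       # ~0.18
```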

However, to further and substantively solve the problem of integrating nuclear energy with data centers, the author tentatively offers the following thoughts:

Matching rapid delivery: Advanced third-generation and fourth-generation nuclear power technology is mature in itself, and there is no need to worry about its safety. However, the world's nuclear safety regulatory systems remain very complicated, and project planning, development, and construction cycles run 8 to 10 years or more, which contradicts AIDC's need for rapid deployment. Most international and domestic small modular reactors today have a capacity of around 100 MWe. Although the technology has been optimized relative to traditional gigawatt-class plants, the basic architecture is still that of a traditional nuclear power plant, somewhat like a scaled-down version of a large plant; the associated regulatory and standards systems therefore struggle to break away from the original industry standards, and the development cycle remains mismatched. Imagine instead going back to the first principles behind the NVL72: if we could find a balance in the 20 MWe to 100 MWe reactor capacity range, break through the traditional technical architecture of nuclear power systems, maximally simplify the safety systems, and develop a passively safe small or micro reactor with rapid delivery capability, then even though it would be smaller and might cost more per unit, it would be far easier to replicate quickly and far better matched to AIDC. In fact, today's large reactors were developed from earlier small reactors, so there should be quite a few mature solutions among the prototype reactors worth revisiting, perhaps even turning waste into treasure.

Universal matching: The current nuclear power standards system is very complete, but it differs substantially from conventional civil engineering in equipment manufacturing, design, and construction management, and those differences have created industry barriers and cost premiums. To make the construction speed and cost of nuclear power more competitive, smaller-capacity reactor designs should make better use of the commercial-grade materials and equipment commonly available on the market as substitutes, raising the share of off-the-shelf products while ensuring safety and avoiding the cost premiums created by industry barriers.

Matching site selection conditions: The siting requirements for nuclear power plants are far stricter than those for data centers, and a qualified nuclear site is a scarce resource. The advantage is that nuclear plants are largely unaffected by climate conditions, and the latest technology no longer has to rely on the sea or rivers for heat rejection. As long as the network and latency requirements of the data center's own siting can be met, there should not be much of a problem.

One more thing: as a power plant, a nuclear plant also produces a large amount of waste heat during operation. Through combined cooling, heating, and power (trigeneration), energy conversion efficiency and computing-power conversion efficiency can improve together, and PUE would probably no longer be a problem.
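
That last point can be made concrete with the Energy Reuse Effectiveness (ERE) metric, which, unlike PUE, credits exported heat. The values below are our own illustrative assumptions:

```python
# PUE vs. ERE when waste heat is exported to a co-located trigeneration
# system. All values below are our own illustrative assumptions.
total_facility_mw = 120  # assumed total facility power draw
it_load_mw = 100         # assumed IT load
reused_heat_mw = 30      # assumed heat exported for cooling/heating reuse

pue = total_facility_mw / it_load_mw
ere = (total_facility_mw - reused_heat_mw) / it_load_mw
print(f"PUE: {pue:.2f}   ERE: {ere:.2f}")  # ERE drops below 1.0 when
# exported heat exceeds the facility's non-IT overhead.
```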

EPILOGUE

Above, drawing on the blockbuster products unveiled at GTC 2024, the information gathered from keynote speeches and exchanges there, and the author's many years of deep, hands-on experience in the data center and nuclear energy fields, we have analyzed the limits, the end point, and the future of AI GPUs and AIDC. In the second half, we will continue to explore future trends in AIDC infrastructure around one keyword: change.

Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.