As HPC chips grow larger and hotter, so does the need for 1kW+ chip cooling

An increasingly apparent trend in the high-performance computing (HPC) space is that power consumption per chip and per rack unit is not stopping at the limits of air cooling. Supercomputers and other high-performance systems have already reached those limits, and in some cases exceeded them, yet power requirements and power densities continue to climb. And based on news from TSMC’s recent annual technology symposium, we can expect this trend to continue as TSMC lays the groundwork for even denser chip configurations.

The problem at hand is not new: transistor power consumption is not decreasing as quickly as transistor dimensions are shrinking. And because chipmakers are unwilling to leave performance on the table (or to deliver sub-par generational gains to their customers), power per transistor in the HPC space is growing rapidly. As an added wrinkle, chiplets pave the way for building chips with even more silicon than traditional reticle limits allow, which is good for performance and latency, but even more problematic for cooling.
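
As a rough illustration of why power density climbs, consider a single node-to-node shrink in which transistor density improves faster than per-transistor power falls. The scaling factors below are purely illustrative assumptions, not TSMC figures; the point is only that the ratio between the two pushes watts per square millimeter upward.

```python
# Back-of-envelope sketch: why power density rises when per-transistor power
# scaling lags density scaling. Both factors are illustrative assumptions,
# not published TSMC figures.

density_gain = 1.6           # assumed improvement in transistors per mm^2 per node
power_per_transistor = 0.75  # assumed per-transistor power vs. the previous node

# If a chip keeps the same die area and spends the whole density gain on
# more transistors, power per mm^2 changes by the product of the two:
power_density_change = density_gain * power_per_transistor
print(f"Relative power density after one node: {power_density_change:.2f}x")
# -> 1.20x: the chip gets denser *and* hotter per unit area.
```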

Enabling this kind of silicon and power growth are modern technologies such as TSMC’s CoWoS and InFO, which allow chipmakers to build integrated multi-chiplet system-in-packages (SiPs) with double the amount of silicon otherwise allowed by TSMC’s reticle limits. By 2024, advancements in TSMC’s CoWoS packaging technology are expected to enable even larger multi-chiplet SiPs, with TSMC anticipating stitching together more than four reticle-sized chiplets. This will allow for enormous levels of complexity (more than 300 billion transistors per SiP is a possibility TSMC and its partners are looking at) and performance, but of course at the cost of formidable power consumption and heat generation.
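
To put the 300-billion-transistor figure in perspective, a back-of-envelope check is possible. The roughly 858 mm² reticle limit (a ~26 mm × 33 mm exposure field) is a standard lithography constraint; the logic density used below is an assumed round number in the ballpark of leading-edge nodes, not a TSMC specification.

```python
# Rough sanity check of the ">300 billion transistors per SiP" figure.
# The density value is an assumed ballpark for a leading-edge logic process,
# not an official TSMC number.

reticle_limit_mm2 = 858            # ~26 mm x 33 mm exposure field
chiplets = 4                       # "more than four reticle-sized chiplets"
assumed_density_mtr_per_mm2 = 100  # assumed million transistors per mm^2

total_transistors = reticle_limit_mm2 * chiplets * assumed_density_mtr_per_mm2 * 1e6
print(f"~{total_transistors/1e9:.0f} billion transistors")
# -> roughly 343 billion, in line with the 300B+ ballpark TSMC describes.
```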

Already, flagship products like NVIDIA’s H100 accelerator module require upwards of 700W of power for peak performance, so the prospect of multiple GH100-sized chiplets on a single product raises eyebrows, along with energy budgets. TSMC foresees that within a few years there will be multi-chiplet SiPs drawing around 1000 W or even more, posing a serious cooling challenge.
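
A quick, hedged extrapolation shows how fast those numbers add up. Treating the H100 module’s ~700W as a per-chiplet budget is a simplification (real products mix compute dies, HBM, and I/O), but it illustrates the scale TSMC is describing.

```python
# Crude power extrapolation: if one GH100-sized compute die accounts for the
# bulk of a ~700W module, what might a multi-chiplet SiP draw?
# Treating 700W as a per-chiplet budget is a simplifying assumption.

module_power_w = 700   # NVIDIA H100 SXM module, peak
compute_chiplets = 2   # hypothetical dual-GH100-class SiP

sip_power_w = module_power_w * compute_chiplets
print(f"Hypothetical 2-chiplet SiP: ~{sip_power_w} W")
# -> ~1400 W, already past the ~1000 W threshold TSMC expects
#    to push data centers toward immersion cooling.
```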

At 700W, the H100 already requires liquid cooling, and the story is much the same for Intel’s chiplet-based Ponte Vecchio and AMD’s Instinct MI250X. But traditional liquid cooling has its limits as well. By the time chips reach a cumulative 1 kW, TSMC envisions that data centers will need to use immersion liquid cooling systems for such extreme AI and HPC processors. Immersion liquid cooling, in turn, requires the data centers themselves to be redesigned, which will be a major change in design and a major challenge in terms of continuity.
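
The underlying issue is heat flux rather than total wattage alone. The rough calculation below uses GH100’s published die size of roughly 814 mm² and, as a simplifying assumption, treats the module’s power as if it were dissipated entirely through that one die.

```python
# Approximate heat-flux arithmetic. Assumes (simplistically) that all module
# power exits through the compute die; real modules spread some heat across
# HBM stacks and the rest of the package.

die_area_mm2 = 814              # GH100 die, per NVIDIA's published figure
die_area_cm2 = die_area_mm2 / 100

for power_w in (700, 1000):
    flux = power_w / die_area_cm2
    print(f"{power_w:>4} W over {die_area_cm2:.1f} cm^2 -> ~{flux:.0f} W/cm^2")
# 700 W  -> ~86 W/cm^2
# 1000 W -> ~123 W/cm^2, well beyond what air coolers comfortably handle
```

Those figures are why the conversation moves from air to cold plates, and from cold plates to immersion, as package power crosses the 1 kW mark.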

Short-term challenges aside, once data centers are set up for immersion liquid cooling, they will be ready for even hotter chips. Liquid immersion cooling has a lot of potential for handling large cooling loads, which is one reason why Intel is investing heavily in the technology in an effort to make it more mainstream.

In addition to immersion liquid cooling, there is another technology that can be used to cool ultra-hot chips: on-chip water cooling. Last year, TSMC revealed that it had been experimenting with on-chip water cooling, saying that even a 2.6 kW SiP could be cooled with this technology. But of course, on-chip water cooling is an extremely expensive technology in itself, which will drive the costs of those extreme AI and HPC solutions to unprecedented heights.
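
One way to see why a 2.6 kW package calls for something as aggressive as bringing water channels onto the chip is to estimate the thermal resistance the cooling path would need. The temperature-rise budget below is an assumed round number, not a figure from TSMC’s experiments.

```python
# Required junction-to-coolant thermal resistance for a very high-power package.
# The 40 C temperature-rise budget is an illustrative assumption.

package_power_w = 2600   # SiP power TSMC says on-chip water cooling handled
delta_t_c = 40           # assumed allowable rise over coolant temperature

required_r_th = delta_t_c / package_power_w   # K/W (equivalently C/W)
print(f"Required thermal resistance: ~{required_r_th*1000:.0f} mK/W")
# -> ~15 mK/W, roughly an order of magnitude tighter than a typical high-end
#    air cooler, which is why coolant ends up inside the package itself.
```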

Nevertheless, while the future is not set in stone, it does seem to be cast in silicon. TSMC’s chipmaking customers are willing to pay top dollar for these ultra-high-performance solutions (think hyperscale cloud data center operators), even with the high cost and technical complexity that comes with them. To bring things back to where we started, this is why TSMC developed its CoWoS and InFO packaging processes in the first place: because there are customers ready to break the reticle limit via chiplet technology. We’re already seeing this today with products like Cerebras’ massive Wafer Scale Engine processor, and through large chiplets, TSMC is gearing up to make smaller (but still reticle-breaking) designs more accessible to its wider customer base.

Such extreme demands on performance, packaging, and cooling push not only the manufacturers of semiconductors, servers, and cooling systems to their limits, but also require changes to cloud data centers themselves. Indeed, if massive SiPs for AI and HPC workloads become widespread, cloud data centers will look quite different in a few years’ time.
