The exponential growth of AI computing power has ushered in unprecedented thermal challenges for modern data centers. High-performance AI accelerators like NVIDIA’s Blackwell GB200 and Meta’s Catalina platforms now demand cooling solutions capable of dissipating 4,000W+ per processor—far exceeding the limits of traditional air cooling. Liquid cooling technologies, particularly direct liquid cooling (DLC) and immersion cooling, have emerged as critical enablers for next-generation AI infrastructure. This article explores cutting-edge advancements in liquid cooling, including cold plate designs, immersion systems, and hybrid architectures, with real-world insights from industry leaders like CoolIT, Meta, Intel, and Alibaba Cloud.
DLC systems transfer heat by circulating coolant directly over heat-generating components. Cold plates—metallic blocks embedded with microchannel arrays—are core to this approach. Recent breakthroughs include:
·CoolIT’s 4000W Single-Phase DLC Cold Plate
Heat Dissipation: Achieves a 4,000W thermal load at a thermal resistance below 0.009 K/W, outperforming conventional solutions by 2x (see the worked example after this list).
Split-Flow Technology: Coolant enters through microchannel midpoints, optimizing flow distribution to hotspots (e.g., GPU/CPU cores).
OMNI™ Monolithic Design: Full-copper cold plates eliminate brazing seams, reducing leakage risks and thermal resistance.
Industry Adoption: Deployed in Dell PowerEdge and Lenovo ThinkSystem servers for hyperscale AI workloads.
·Intel’s FSW-Based Cold Plates
Friction-stir welding (FSW) replaces traditional brazing, enhancing structural integrity and thermal performance. FSW reduces welding temperatures by 40%, minimizing thermal stress on high-TDP processors.
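The headline cold plate numbers translate directly into temperature headroom. A minimal sketch in Python converts the 4,000W load and 0.009 K/W resistance into a die-to-coolant temperature rise; the inlet temperature and silicon limit mentioned in the comments are assumed typical values, not vendor specs:

```python
# Back-of-the-envelope check: what does R_th < 0.009 K/W buy you?
# The inlet temperature below is an assumed typical warm-water DLC value.

Q_WATTS = 4000.0        # thermal load handled by the cold plate (from the article)
R_TH = 0.009            # cold-plate thermal resistance, K/W (from the article)
T_INLET_C = 40.0        # assumed coolant inlet temperature

delta_t = Q_WATTS * R_TH            # temperature rise across the cold plate
t_surface = T_INLET_C + delta_t     # approximate case temperature at the plate

print(f"Temperature rise across cold plate: {delta_t:.1f} K")
print(f"Estimated surface temperature: {t_surface:.1f} °C")
# -> 36.0 K rise, ~76 °C surface at a 40 °C inlet, leaving margin below
#    typical ~90-105 °C silicon limits even at the full 4 kW load.
```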
Immersion cooling takes a complementary approach, submerging entire servers in dielectric fluid. Two variants are in production use:
·Single-Phase Immersion (SPILC):
Alibaba Cloud’s SPILC System: Using electronic fluorinated liquid (EFL-3), SPILC captures 97.3% of server heat at a 6 L/min flow rate. Counter-gravity flow improves heat transfer by 35% compared to co-flow configurations (a heat-balance sketch follows this list).
Meta’s Catalina AI Server: Combines air-cooled components (E1.S SSDs, OCP NICs) with liquid-cooled GB200 NVL72 GPUs. A hybrid approach balances efficiency (40°C coolant inlet) and scalability.
·Two-Phase Immersion (TPILC):
Fluids absorb heat via phase change (liquid-to-vapor). While TPILC improves COP by 72–79% over SPILC (Kanbur et al.), its complexity and costs limit adoption. Recent studies reveal TPILC underperforms cold plates at >300W/cm² fluxes due to critical heat flux (CHF) restrictions.
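To relate the SPILC capture fraction and flow rate above, here is a minimal heat-balance sketch. The fluid density and specific heat are generic assumed values for a fluorinated dielectric coolant, not published EFL-3 properties, and the server load is an assumed example:

```python
# Single-phase immersion heat balance: how much does the fluid warm up
# at a given flow rate? Q = m_dot * cp * dT.

RHO = 1600.0       # density, kg/m^3 (typical fluorinated fluid, assumed)
CP = 1100.0        # specific heat, J/(kg*K) (assumed)
CAPTURE = 0.973    # fraction of heat captured by the fluid (from the article)

def outlet_rise(server_watts: float, flow_l_min: float) -> float:
    """Coolant temperature rise across the tank."""
    m_dot = RHO * (flow_l_min / 1000.0 / 60.0)   # volumetric flow -> kg/s
    return CAPTURE * server_watts / (m_dot * CP)

for flow in (4, 6, 8, 10):
    print(f"{flow} L/min -> dT = {outlet_rise(2000, flow):.1f} K")
# The rise shrinks hyperbolically with flow (~11 K at 6 L/min for an
# assumed 2 kW server), which is why gains taper once dT is already small.
```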

Within cold plates, microchannel flow architecture largely determines hotspot control. Parallel vs. split-flow designs:
·Parallel Flow: Simple but prone to hotspots.
·Split Flow: Midpoint entry divides coolant into radial streams, reducing thermal gradient by 61% (Intel).
·Meta’s Dual-Mode Cold Plates: Catalina’s cold plates support a 3.9kW TDP via optimized microchannels and Frost LC-25 coolant (PG25-based).
·Copper vs. Aluminum: Copper conducts heat roughly twice as well as common aluminum alloys (~400 vs. ~170–200 W/m·K) but adds weight and cost (see the conduction sketch after this list).
·Boiling Enhancement Coatings (BECs): Microporous copper layers boost heat transfer coefficients by 15x in TPILC systems.
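The copper-versus-aluminum trade-off above can be made concrete with a one-dimensional conduction estimate, R = t / (k·A). The plate thickness and die footprint below are assumed illustrative values, not vendor dimensions:

```python
# Why copper for sub-0.01 K/W cold plates? Conduction through the base
# plate alone, before convection is even counted.

T_BASE = 0.003          # base plate thickness, m (assumed 3 mm)
A_DIE = 0.04 * 0.04     # heated footprint, m^2 (assumed 40 x 40 mm die)

K_COPPER = 400.0        # W/(m*K), pure copper
K_ALUMINUM = 180.0      # W/(m*K), typical aluminum alloy

for name, k in (("copper", K_COPPER), ("aluminum alloy", K_ALUMINUM)):
    r_cond = T_BASE / (k * A_DIE)
    print(f"{name:15s} base conduction: {r_cond * 1000:.1f} mK/W")
# Copper's base alone consumes ~4.7 mK/W of a 9 mK/W budget; an aluminum
# base would use ~10.4 mK/W, overshooting the budget before convection.
```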
·Coolant Distribution Units (CDUs):
Liquid-Liquid CDUs: Transfer heat from server loops to facility water (e.g., Boyd’s 100L/min systems).
Air-Liquid CDUs: Ideal for retrofitting air-cooled racks (e.g., CoolIT’s Dynamic Cold Plate).
·Alibaba Cloud’s Findings:
Counter-gravity flow reduces CPU temps by 33.8% over co-flow.
Optimal EFL-3 coolant (low viscosity) lowers thermal resistance by 10.5% vs. mineral oils.
Flow rate prioritization: Beyond 8 L/min, SPILC gains diminish while pumping power climbs steeply (see the flow-rate sketch below).
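The diminishing-returns finding above follows from basic scaling: convective resistance falls only slowly with flow (roughly flow^-0.8 in turbulent channels) while pump power grows roughly with the cube of flow. A normalized sketch with illustrative constants, not Alibaba data:

```python
# Diminishing returns above ~8 L/min: thermal gain vs. pump power,
# both normalized to a 6 L/min baseline. Exponents are standard
# turbulent-flow scalings, used here purely for illustration.

def relative_resistance(flow: float, ref: float = 6.0) -> float:
    """Convective thermal resistance relative to the baseline (~Re^-0.8)."""
    return (ref / flow) ** 0.8

def relative_pump_power(flow: float, ref: float = 6.0) -> float:
    """Pump power relative to the baseline (dP * V ~ V^3)."""
    return (flow / ref) ** 3

for f in (6, 8, 10, 12):
    print(f"{f:2d} L/min: resistance x{relative_resistance(f):.2f}, "
          f"pump power x{relative_pump_power(f):.1f}")
# Going from 8 to 12 L/min shaves only ~28% more off the convective
# resistance but raises pump power from ~2.4x to 8x the baseline.
```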
·Meta’s Catalina Case Study:
Coolant Flow: 100 L/min at 15 PSI using Frost LC-25 (a rack heat-balance sketch follows this list).
Hybrid Cooling: Combines rear-door cold plates (GPUs) with air-cooled front components.
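A quick heat balance shows what 100 L/min of Frost LC-25 can carry. The fluid properties approximate a 25% propylene-glycol/water mix, and the rack load is an assumed round number, not a Meta figure:

```python
# Rack-level heat balance for the quoted 100 L/min coolant flow.

RHO = 1020.0     # kg/m^3, ~PG25 mix at operating temperature (assumed)
CP = 3900.0      # J/(kg*K) (assumed)
FLOW_L_MIN = 100.0
RACK_KW = 120.0  # assumed liquid-cooled rack load for illustration

m_dot = RHO * FLOW_L_MIN / 60000.0           # kg/s
delta_t = RACK_KW * 1000.0 / (m_dot * CP)    # loop temperature rise, K

print(f"Mass flow: {m_dot:.2f} kg/s, loop dT: {delta_t:.1f} K")
# ~1.7 kg/s and an ~18 K rise: a 40 °C inlet returns high-50s °C water,
# warm enough for efficient facility-side heat rejection.
```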
·CHF Limitations: R1233zd refrigerants max out at ~80W/cm² (Google, 2024), making TPILC unfit for AI chips exceeding 300W/cm² (see the flux check below).
·Fluid Compatibility: Novec 649’s low boiling point (49°C) restricts ΔT in high-TDP scenarios.
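A short flux check shows why the CHF ceiling rules TPILC out for current accelerators. The die footprint is an assumed illustrative value, not a published dimension:

```python
# CHF sanity check: does two-phase immersion keep up with a modern
# accelerator die?

GPU_WATTS = 1200.0      # per-GPU power (from the article)
DIE_AREA_CM2 = 8.0      # assumed die/hotspot footprint
CHF_R1233ZD = 80.0      # W/cm^2, cited limit for R1233zd
COLD_PLATE_OK = 300.0   # W/cm^2, flux where cold plates still lead

flux = GPU_WATTS / DIE_AREA_CM2
print(f"Die heat flux: {flux:.0f} W/cm^2")
print(f"Within R1233zd CHF? {flux <= CHF_R1233ZD}")
print(f"Within cold-plate envelope? {flux <= COLD_PLATE_OK}")
# -> 150 W/cm^2: nearly 2x past the refrigerant's CHF, but comfortably
#    inside what microchannel cold plates handle.
```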
Catalina’s design combines off-the-shelf GB200 NVL72 racks with custom cooling:
·Liquid-Cooled GPUs: Cold plates handle 1200W Blackwell GPUs.
·Air-Cooled Support Components: E1.S SSDs, OCP NICs cooled via traditional airflow.
·Power Efficiency: PUE of 1.15, roughly 30% lower than comparable air-cooled deployments (see the sketch below).
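In absolute terms, since PUE is total facility power divided by IT power, the savings scale directly with IT load. A minimal comparison, assuming a 1 MW IT load and an air-cooled baseline of ~1.6 (both assumed figures, not from the article):

```python
# What a PUE of 1.15 means in absolute terms.

def facility_power(it_kw: float, pue: float) -> float:
    """Total facility draw given IT load and PUE = total / IT."""
    return it_kw * pue

IT_KW = 1000.0  # assumed 1 MW of IT load
for label, pue in (("liquid-cooled (PUE 1.15)", 1.15),
                   ("air-cooled baseline (PUE ~1.6, assumed)", 1.6)):
    total = facility_power(IT_KW, pue)
    print(f"{label}: {total:.0f} kW total, {total - IT_KW:.0f} kW overhead")
# Overhead drops from ~600 kW to 150 kW per MW of IT load: cooling and
# power-delivery losses shrink by roughly 75%.
```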
As seen in HPE’s 100% fanless DLC systems, air-assisted liquid cooling (AALC) uses passive airflow to supplement liquid loops during peak loads, reducing pump dependency.
·Phase-Change Cooling: Intel’s research on R134a refrigerant boiling in microchannels reduces GPU temps by 34.5% at 6 L/min.
·Sustainable Fluids: Bio-based dielectric coolants (e.g., Shell Nature 3.0) lower GWP by 75% vs. synthetic fluids.
·AI-Driven Thermal Control: Google’s ML-based CDU optimization cuts cooling energy consumption by 20% via predictive flow adjustments (a toy controller sketch follows this list).
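As a toy illustration of predictive flow control (not Google’s implementation; every class name, parameter, and constant below is an assumed placeholder), a controller can lead rising loads rather than chase temperatures after the thermal lag:

```python
# Toy sketch: anticipate rack load from a short history and set CDU flow
# ahead of the thermal lag, instead of reacting after temperature rises.

from collections import deque

class PredictiveFlowController:
    def __init__(self, min_flow=40.0, max_flow=100.0, kw_per_lpm=1.2):
        self.history = deque(maxlen=5)   # recent rack power samples, kW
        self.min_flow = min_flow         # L/min floor to protect the loop
        self.max_flow = max_flow         # pump/CDU ceiling
        self.kw_per_lpm = kw_per_lpm     # assumed heat removed per L/min

    def update(self, rack_kw: float) -> float:
        """Return a flow setpoint (L/min) for the next control interval."""
        self.history.append(rack_kw)
        trend = 0.0
        if len(self.history) >= 2:
            trend = self.history[-1] - self.history[0]  # rising or falling?
        predicted_kw = rack_kw + max(trend, 0.0)        # lead rising loads
        flow = predicted_kw / self.kw_per_lpm
        return min(max(flow, self.min_flow), self.max_flow)

ctrl = PredictiveFlowController()
for load in (60, 62, 70, 85, 85, 80):
    print(f"load {load} kW -> flow {ctrl.update(load):.0f} L/min")
```

Production systems replace the simple trend term with a learned load forecast, but the control structure (predict, then set flow ahead of demand) is the same idea.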
Liquid cooling is no longer optional for AI infrastructure. Cold plates dominate with their reliability in 4,000W+ scenarios, while SPILC provides scalable efficiency for hyperscale data centers. Hybrid systems like Meta’s Catalina highlight the industry’s shift toward adaptive thermal architectures, balancing performance, cost, and sustainability. As AI chip TDPs approach 5,000W by 2026, innovations in microchannel design, two-phase fluids, and AI-driven cooling will define the next frontier in thermal management.