OPS
Managing 10k H100s
David Kim
Jul 5, 2025
Scaling physical infrastructure is a logistical nightmare. Here is how we manage heat dissipation and power load balancing across 12 availability zones.
Thermal Throttling & Power
When you pack GPUs this dense, airflow becomes fluid dynamics. We had to write custom firmware to undervolt cards dynamically based on ambient data center temperature. A 1% efficiency gain translates to millions of dollars in power savings annually, and prevents thermal throttling during peak loads.
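The control loop behind that dynamic undervolting can be sketched as a simple policy function. The thresholds, watt-per-degree slope, and floor below are illustrative placeholders, not our production values, and applying the cap to real hardware (e.g. via `nvidia-smi -pl`) is left out:

```python
# Sketch: map ambient data-center temperature to a per-card power cap.
# All numbers here are hypothetical, not production tuning values.

def power_limit_watts(ambient_c: float, base_limit: int = 700, floor: int = 450) -> int:
    """Return a power cap (watts) for an H100-class card given ambient temp (C).

    Above a comfort threshold we shed watts per degree to stay ahead of
    thermal throttling, but never drop below the floor needed for stable
    clocks under load.
    """
    threshold_c = 27       # hypothetical ambient comfort threshold
    watts_per_degree = 10  # hypothetical shedding slope
    if ambient_c <= threshold_c:
        return base_limit
    shed = int((ambient_c - threshold_c) * watts_per_degree)
    return max(base_limit - shed, floor)

# Cooler aisle: full power. Hot aisle: shed load, bounded by the floor.
print(power_limit_watts(25))  # → 700
print(power_limit_watts(30))  # → 670
```

A pure function like this is easy to unit-test against historical telemetry before any firmware ever touches a card, which is most of the battle when a bad cap can brown out a rack.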
The Checkpoint Problem
Saving the state of a massive model during training requires writing terabytes of data to disk instantly. Standard file systems choke on this burst throughput. We built a custom distributed file system optimized for burst writes, preventing the "checkpoint pause" that idles expensive compute resources for minutes at a time.
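The core trick is fanning one logical checkpoint out into many shards written concurrently, so no single writer absorbs the full burst. Here is a minimal single-node sketch of that idea using threads and a made-up flat file layout; the real system distributes shards across machines, which this does not attempt:

```python
# Sketch: shard a checkpoint and write the shards in parallel.
# The shard-file naming and header format are illustrative, not a real spec.
import os
from concurrent.futures import ThreadPoolExecutor

def write_checkpoint(state: dict, out_dir: str, num_shards: int = 8) -> list:
    """Split `state` (tensor name -> serialized bytes) round-robin into
    shards, write each shard on its own thread, and return shard paths."""
    shards = [{} for _ in range(num_shards)]
    for i, (name, blob) in enumerate(sorted(state.items())):
        shards[i % num_shards][name] = blob

    def dump(idx: int) -> str:
        path = os.path.join(out_dir, f"shard-{idx:04d}.bin")
        with open(path, "wb") as f:
            for name, blob in shards[idx].items():
                # Tiny length-prefixed record format, purely for illustration.
                f.write(f"{name}:{len(blob)}\n".encode())
                f.write(blob)
        return path

    # One writer per shard keeps each file stream sequential while the
    # aggregate burst is spread across num_shards concurrent writes.
    with ThreadPoolExecutor(max_workers=num_shards) as ex:
        return list(ex.map(dump, range(num_shards)))
```

The design point worth noting: each shard is written sequentially (which storage loves) while the burst is absorbed in aggregate, and because shard assignment is deterministic, a reader can reopen the same layout without a manifest.
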
Conclusion
Building the cloud is different from using the cloud. At this scale, hardware failure is a statistic, not an exception, and our software must be resilient enough to treat a burning GPU as a mundane event.