OPS

Managing 10k H100s

David Kim

//

Jul 5, 2025

Scaling physical infrastructure is a logistical nightmare. Here is how we manage heat dissipation and power load balancing across 12 availability zones.

Thermal Throttling & Power

When you pack GPUs this densely, airflow becomes a fluid dynamics problem. We had to write custom firmware to undervolt cards dynamically based on ambient data center temperature. A 1% efficiency gain translates to millions of dollars in power savings annually, and it prevents thermal throttling during peak loads.
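The core of dynamic undervolting is mapping ambient temperature to a per-card power cap. The sketch below shows the shape of that logic; the function name, thresholds, and derating slope are illustrative assumptions, not our actual firmware values.

```python
# Illustrative sketch: linear power-cap derating driven by ambient temperature.
# All constants here are hypothetical, not production firmware values.

def power_cap_watts(ambient_c: float,
                    base_cap_w: float = 700.0,      # full cap at nominal ambient
                    min_cap_w: float = 450.0,       # safe floor, never go below
                    nominal_ambient_c: float = 22.0,
                    derate_w_per_c: float = 15.0) -> float:
    """Derate the per-GPU power cap linearly as ambient temperature
    rises above nominal, clamped to a safe minimum."""
    excess = max(0.0, ambient_c - nominal_ambient_c)
    return max(min_cap_w, base_cap_w - derate_w_per_c * excess)
```

In a real deployment this value would be pushed to each card's power-management controller on a control loop, so a hot aisle automatically trades a little clock speed for thermal headroom instead of hitting the hardware throttle cliff.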

The Checkpoint Problem

Saving the state of a massive model during training requires writing terabytes of data to disk instantly. Standard file systems choke on this burst throughput. We built a custom distributed file system optimized for burst writes, preventing the "checkpoint pause" that idles expensive compute resources for minutes at a time.
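The key idea behind surviving burst writes is that no single writer should be the bottleneck: shard the checkpoint and write every shard in parallel. A minimal sketch of that pattern, using plain files and a thread pool (the sharding scheme and file layout are illustrative assumptions, not our file system's actual format):

```python
# Minimal sketch of burst-optimized checkpointing: split the model state
# into shards and write them concurrently so writes saturate many spindles
# or NICs at once. Layout and naming here are hypothetical.
import os
from concurrent.futures import ThreadPoolExecutor


def write_shard(path: str, data: bytes) -> int:
    # A real burst-optimized store would use direct/streaming I/O;
    # plain buffered writes keep this sketch portable.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return len(data)


def checkpoint(state: bytes, out_dir: str, num_shards: int = 8) -> list:
    """Split `state` into roughly equal shards and write them in parallel."""
    shard_size = -(-len(state) // num_shards)  # ceiling division
    shards = [state[i:i + shard_size] for i in range(0, len(state), shard_size)]
    paths = [os.path.join(out_dir, "shard_%04d.bin" % i)
             for i in range(len(shards))]
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        list(pool.map(write_shard, paths, shards))
    return paths
```

The production version layers this over a distributed file system so each shard lands on a different storage node, which is what turns a multi-minute "checkpoint pause" into a short blip.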

Conclusion

Building the cloud is different from using the cloud. At this scale, hardware failure is a statistic, not an exception. Our software must be resilient enough to treat a burning GPU as a mundane event.
