OPS
Managing 10k H100s
David Kim
Jul 5, 2025
Scaling physical infrastructure is a logistical nightmare. Here is how we manage heat dissipation and power load balancing across 12 availability zones.
Thermal Throttling & Power
When you pack GPUs this dense, airflow becomes fluid dynamics. We had to write custom firmware to undervolt cards dynamically based on ambient data center temperature. A 1% efficiency gain translates to millions of dollars in power savings annually, and prevents thermal throttling during peak loads.
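The control loop behind that dynamic undervolting can be sketched as a simple policy function. The thresholds, watt-per-degree slope, and floor below are illustrative placeholders, not our production values, and applying the cap to real hardware (e.g. via `nvidia-smi -pl`) is left out:

```python
# Sketch: map ambient data-center temperature to a per-card power cap.
# All numbers here are hypothetical, not production tuning values.

def power_limit_watts(ambient_c: float, base_limit: int = 700, floor: int = 450) -> int:
    """Return a power cap (watts) for an H100-class card given ambient temp (C).

    Above a comfort threshold we shed watts per degree to stay ahead of
    thermal throttling, but never drop below the floor needed for stable
    clocks under load.
    """
    threshold_c = 27       # hypothetical ambient comfort threshold
    watts_per_degree = 10  # hypothetical shedding slope
    if ambient_c <= threshold_c:
        return base_limit
    shed = int((ambient_c - threshold_c) * watts_per_degree)
    return max(base_limit - shed, floor)

# Cooler aisle: full power. Hot aisle: shed load, bounded by the floor.
print(power_limit_watts(25))  # → 700
print(power_limit_watts(30))  # → 670
```

A pure function like this is easy to unit-test against historical telemetry before any firmware ever touches a card, which is most of the battle when a bad cap can brown out a rack.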
The Checkpoint Problem
Saving the state of a massive model during training requires writing terabytes of data to disk instantly. Standard file systems choke on this burst throughput. We built a custom distributed file system optimized for burst writes, preventing the "checkpoint pause" that idles expensive compute resources for minutes at a time.
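The core trick is fanning one logical checkpoint out into many shards written concurrently, so no single writer absorbs the full burst. Here is a minimal single-node sketch of that idea using threads and a made-up flat file layout; the real system distributes shards across machines, which this does not attempt:

```python
# Sketch: shard a checkpoint and write the shards in parallel.
# The shard-file naming and header format are illustrative, not a real spec.
import os
from concurrent.futures import ThreadPoolExecutor

def write_checkpoint(state: dict, out_dir: str, num_shards: int = 8) -> list:
    """Split `state` (tensor name -> serialized bytes) round-robin into
    shards, write each shard on its own thread, and return shard paths."""
    shards = [{} for _ in range(num_shards)]
    for i, (name, blob) in enumerate(sorted(state.items())):
        shards[i % num_shards][name] = blob

    def dump(idx: int) -> str:
        path = os.path.join(out_dir, f"shard-{idx:04d}.bin")
        with open(path, "wb") as f:
            for name, blob in shards[idx].items():
                # Tiny length-prefixed record format, purely for illustration.
                f.write(f"{name}:{len(blob)}\n".encode())
                f.write(blob)
        return path

    # One writer per shard keeps each file stream sequential while the
    # aggregate burst is spread across num_shards concurrent writes.
    with ThreadPoolExecutor(max_workers=num_shards) as ex:
        return list(ex.map(dump, range(num_shards)))
```

The design point worth noting: each shard is written sequentially (which storage loves) while the burst is absorbed in aggregate, and because shard assignment is deterministic, a reader can reopen the same layout without a manifest.
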
Conclusion
Building the cloud is different from using the cloud. At this scale, hardware failure is a statistic, not an exception, and our software must be resilient enough to treat a burning GPU as a mundane event.