Cloud vs On-Premise

Following up from the previous post, "What I learned from Open Compute Project"

I initially thought other clients should buy directly from ODMs instead of OEMs. I later realised the situation is more complex - there's money on the table for a reason. As for cloud versus on-premise, I am rather confused. Both sides claim cost advantages, and it really depends on the scale and context. When I spoke with the hedge fund industry, most opted for on-prem for security and control. What is the cost of compute is a frequent question, but a complex one to answer. The peak flops, the utilisation rate, cost of hardware depreciation, and the desired throughput are some examples of the factors.

Some of the key insights I've learned are from Prag Mishra:
1. Flexibility: The cost of FLOP/$ doubles every two years for ML/AI-specific GPUs (source: Epoch AI). Cloud instance requires a fixed-year contract, which might not be cost-effective. By maintaining flexibility to upgrade compute resources, you can cut the cost by half in a year.

2. Utilization: 100% utilization in cloud are rare. The max compute per dollar for cloud instance is 25PFlops/$ on a one year upfront. An on-prem GPU server, if utilized 60% of the time, can be a fully deprecated in two years to match the maximum PFLOP/$ that can be achieved in cloud instances.

3. Execution: On-prem is hard and risky. One missing piece - a generator stuck in backlog - can halt everything. Key components like HBM supply can't scale overnight. At the application layer, OpenAI needs compute that's ready and reliable to serve the customers at speed.

4. Cost of Performance: In July 2024, the cost of H100 instances in the cloud was $77/hr for AWS p5.48xlarge with 8xH100s. In August 2025, its $55/h - much lower. Actual costs vary depending on the region.

I've read many resources online for TCO costs, such as semi-analysis or Neoclouds themselves. I'm wary of some of the assumptions, especially the high utilisation rate, which might not be reflective of reality.

Example: OpenAI Training Estimates
According to reports from the Information, the New York Times and Epoch AI. For GPT 4.5, Epoch AI estimated that OpenAI utilises between 40,000 to 100,000 H100s to train its model, between 90 to 165 days for $2 per H100 hour. Based on these assumptions and a maximum utilization rate of 90%, the total training cost for a single model is estimated to range between $192 million and $890 million. According to Winsome, OpenAI plans to spend $350B by 2030 on compute infrastructure, with annual server bills at $85B.

How would you split between on-prem vs cloud?

Decision will be made mainly on technical strengths and costs. Beware of online commentary on clouds. I met an ex-OpenAI technical staff member who spoke highly of his 2023 experience with Azure vs the other cloud provider. I'm cautious about online commentary - much of it reflects incentives rather than reality. From my experience, GPU memory and networking performance often outweigh FLOPs in selecting hardware. This makes Nvidia networking superior for training, and the ecosystem is working hard to catch up - Broadcom recently announced an 800GB Ethernet NIC in a bid against Nvidia ConnectX.

CXL memory is another industry solution for memory constraints. After talking to Marvell in OCP, my understanding is that it's mostly used by hyperscalers. The software stack is still immature.

Assuming training occurs on bare-metal nodes, differences in hyperscaler software stacks are largely negligible. What truly matters is speed, reliability and efficiency.