Scaling hardware isn't straightforward

When I was working on deep learning at my previous job, I wasn't expecting to be interested in hardware. However, AI is different because the cost and quality of the infra have a significant impact on the model. So I did what I always do when I get curious: I went down the rabbit hole.

Introduction

The increase in machine learning usage - from natural language processing and graph neural networks to Monte Carlo tree search (AlphaGo) - has driven a massive surge in demand for computational power. Models continue to double in size, and rack power is on track to reach 1 MW by 2030.

There are several problems arising from this. First, adding extra GPUs doesn't always help - total system performance scales sub-linearly with the number of GPUs, so each extra compute node tends to lower overall system efficiency. Second, migrating hardware between sub-domains is tough. Financial trading models and autonomous driving have different priorities in terms of latency, memory usage, and throughput. AI models are unlikely to converge across sub-domains, and existing algorithms continue to evolve. Third, an average server requires ~50 subcomponents, and as Jensen introduces new GPU models, parts of the supply chain get reworked.

As a result, scaling hardware blindly leads to diminishing returns — and without careful hardware-software co-design, most of that extra compute goes to waste.
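
To make the diminishing returns concrete, here is a minimal sketch using Amdahl's law. That is my choice of model, not something the sources above prescribe, and the 95% parallel fraction is a made-up number purely for illustration.

```python
def amdahl_speedup(n_gpus: int, parallel_fraction: float) -> float:
    """Speedup over a single GPU when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_gpus)


def scaling_efficiency(n_gpus: int, parallel_fraction: float) -> float:
    """Fraction of ideal (linear) speedup that is actually achieved."""
    return amdahl_speedup(n_gpus, parallel_fraction) / n_gpus


if __name__ == "__main__":
    p = 0.95  # hypothetical: 95% of the job parallelises perfectly
    for n in (1, 8, 64, 512):
        print(f"{n:>4} GPUs: speedup {amdahl_speedup(n, p):6.2f}x, "
              f"efficiency {scaling_efficiency(n, p):6.1%}")
```

Under this toy model the speedup can never exceed 1 / (1 - p), no matter how many GPUs you add, which is the "extra compute goes to waste" effect in its simplest form. Real clusters also lose time to communication, which this sketch ignores.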

Physical System Design*

1. Chips

Modern algorithms need lots of cores and fast memory. The easiest way to scale is with chiplets: instead of building one big chip, you build smaller ones and connect them. Smaller dies yield better, so it's cheaper, more reliable, and scales better.
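
The cost argument for chiplets can be sketched with the simple Poisson yield model, where the chance that a die has no defects falls exponentially with its area. The die sizes and defect density below are hypothetical, and the sketch assumes dies are tested before assembly and ignores packaging and interconnect costs, so treat it as a rough intuition rather than a real cost model.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area has zero defects (Poisson model)."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_unit(die_area_mm2: float, dies_per_unit: int,
                          defects_per_cm2: float) -> float:
    """Wafer area (mm^2) consumed per working package, assuming bad dies
    are discarded before assembly."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_cm2)
    return dies_per_unit * die_area_mm2 / good_fraction


if __name__ == "__main__":
    d0 = 0.1  # hypothetical defect density, defects per cm^2
    monolithic = silicon_per_good_unit(800.0, 1, d0)  # one big 800 mm^2 die
    chiplets = silicon_per_good_unit(200.0, 4, d0)    # four 200 mm^2 chiplets
    print(f"monolithic: {monolithic:.0f} mm^2 of wafer per good unit")
    print(f"chiplets:   {chiplets:.0f} mm^2 of wafer per good unit")
```

Both designs ship the same 800 mm^2 of silicon, but the monolithic die throws away far more wafer area to defects under these assumptions.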

New players like Groq are rapidly entering the inference space, aiming to carve out market share. China is trying to move to local chips from vendors like Huawei. To keep up, you need a modular system - something that can take new chips without starting over. Standards like Open Accelerator Infrastructure (OAI) and Universal Baseboard (UBB) make that possible.

2. Chassis - Tray

A chassis combines key components - such as processors, storage drives and memory - into a compute, storage or memory node. Traditionally, 19-inch racks have been the industry standard. 21-inch racks, supported by OCP, are becoming increasingly popular due to bigger AI workload needs.

3. Server - Compute*

The GB200 NVL72 rack consists of 18 compute trays and 9 NVSwitch trays. Meta has made its Catalina NVL72 system open-source. Catalina is Meta's next-generation AI/ML rack that supports large cluster training and inference use cases. The design focuses on achieving a fast time to market, alignment with industry references, and providing cutting-edge performance.

4. Pod Density - Compute Nodes

A pod is a series of compute nodes that work together as if they were a single computer. Even though the job is split across multiple physical machines, the software sees it as one machine. The nodes in a pod are connected with a low-latency interconnect like NVSwitch. In the reference NVL72 design, each tray contains 2 CPUs and 4 GPUs, and 18 trays put 72 accelerators in a single rack. Meta's version instead connects two racks to fit 72 accelerators per pod. Pod density is expected to increase in the future, but due to current power and liquid-cooling constraints, most data centres cannot support the density of NVL72 in one rack.
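
The power constraint can be sketched as a simple packing problem: given a per-rack power budget, how many racks does it take to host the 18 compute trays? The per-tray power and the rack budgets below are placeholders I picked for illustration, not vendor specifications, and the sketch ignores switch trays, power shelves and cooling overhead.

```python
import math


def racks_needed(total_trays: int, tray_kw: float, rack_budget_kw: float) -> int:
    """Number of racks required to power `total_trays` compute trays."""
    trays_per_rack = int(rack_budget_kw // tray_kw)
    if trays_per_rack < 1:
        raise ValueError("rack budget cannot power a single tray")
    return math.ceil(total_trays / trays_per_rack)


if __name__ == "__main__":
    COMPUTE_TRAYS = 18  # tray count quoted in the text
    TRAY_KW = 6.0       # hypothetical power draw per compute tray
    for budget_kw in (120.0, 60.0):  # hypothetical per-rack power budgets
        n = racks_needed(COMPUTE_TRAYS, TRAY_KW, budget_kw)
        print(f"{budget_kw:.0f} kW per rack -> {n} rack(s) for the pod")
```

With these made-up numbers, a facility that can only deliver half the power per rack has to spread the same pod over two racks, which is the trade-off described above.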

OCP also talked about a future with more than two accelerators per high-performance module, and depending on advancements in fabric technology, that number could reach 576 per rack!

Different types of clients also require different ratios of GPUs to other components, depending on the primary purpose of the hardware. SemiAnalysis wrote a great article on how to reduce bare-metal cost by omitting less important units.

5. Networking

AI workloads often involve a huge amount of data moving between CPUs, GPUs, memory, storage and sometimes even across data centres. If all these transfers happened one after another, the network would become the bottleneck. Parallelism in networking means designing the network so that multiple data flows - CPU to GPU, GPU to memory, node to node, cluster to cluster - can run at the same time. This involves interconnects like NVLink and InfiniBand, and specialised network architectures. As my friend Reynold says, "Networking is a whole set of different complex problems."
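
A back-of-the-envelope sketch of why overlapping transfers matters: if each flow has its own link, the total time is set by the slowest flow rather than by the sum of all of them. The flow names, sizes and bandwidths are invented for illustration, and the model assumes fully independent links with no contention, which real networks rarely achieve.

```python
# Hypothetical flows: name -> (data size in GB, link bandwidth in GB/s)
FLOWS = {
    "cpu_to_gpu": (32.0, 64.0),
    "gpu_to_gpu": (90.0, 900.0),
    "node_to_node": (25.0, 50.0),
}


def transfer_seconds(size_gb: float, link_gb_per_s: float) -> float:
    """Time to move `size_gb` over a link, ignoring latency and protocol overhead."""
    return size_gb / link_gb_per_s


def serial_total(flows: dict) -> float:
    """Total time if the transfers run one after another."""
    return sum(transfer_seconds(size, bw) for size, bw in flows.values())


def parallel_total(flows: dict) -> float:
    """Total time if every transfer runs at once on its own link."""
    return max(transfer_seconds(size, bw) for size, bw in flows.values())


if __name__ == "__main__":
    print(f"one after another: {serial_total(FLOWS):.2f} s")
    print(f"all at once:       {parallel_total(FLOWS):.2f} s")
```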

Different networks that co-exist for different purposes

Frontend Networking (normal Ethernet): Connects servers to the outside world.

Backend Networking (InfiniBand/RoCE Ethernet or other high-performance fabrics): Connects all nodes and servers, moving huge amounts of data between GPUs in different servers with minimal delay.

Scale-up Accelerator Interconnect (NVLink): Connects GPUs within a server so they can share memory and exchange data faster than over PCIe.

Out-of-Band Networking for Manageability: If the other networks are overloaded or down, admins can still access the system this way.

*Source: Open Compute Project, SemiAnalysis