The Hardware You Can't See

I’ve never seen an AI server in person, but I’m excited for the chance to. This year, I was lucky enough to be sponsored to attend the OCP Global Summit and join NeurIPS in San Diego.

I’m also really curious about how open source might help reduce costs over time. I remember how open-sourcing Android lowered phone manufacturing costs, and I’m hoping to see something similar happen here, though it’ll definitely take time. For Meta, the benefit seems straightforward: lower costs for them as well. (Update, Nov 2025: I’m more skeptical about open-source hardware now.)

Building AI clusters takes more than GPUs. Networking and bandwidth are key to keeping them fast. Meta's open-source system consists of an isolated high-bandwidth network that connects all their GPUs and domain-specific accelerators. Bandwidth is expected to grow by 5x to 10x by 2030. Hopper GPUs had 900 GB/s of NVLink bandwidth; Blackwell doubles that to 1.8 TB/s. Each new GPU generation at least doubles compute throughput, so interconnect bandwidth has to keep up or the chips sit idle waiting for data.
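
To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. The two NVLink figures are the ones quoted above; the 10 GB payload is a made-up number purely for illustration, and the calculation ignores latency and protocol overhead, so treat it as a best case rather than a measurement.

```python
# Per-GPU NVLink bandwidth in GB/s, using the figures quoted above.
NVLINK_GB_PER_S = {"Hopper": 900, "Blackwell": 1800}


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time in milliseconds to move payload_gb at the given bandwidth.

    Ignores latency and protocol overhead, so real transfers will be slower.
    """
    return payload_gb / bandwidth_gb_per_s * 1000


# Hypothetical payload: 10 GB exchanged between GPUs in one step (assumption).
PAYLOAD_GB = 10.0

for generation, bandwidth in NVLINK_GB_PER_S.items():
    print(f"{generation}: {transfer_time_ms(PAYLOAD_GB, bandwidth):.2f} ms")
```

Under these assumptions the same payload takes roughly 11 ms on Hopper and half that on Blackwell. If compute per step also halves with each generation, the link has to double just to keep the GPU from waiting.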


To support this, AI labs need a high-performance, multi-tier, non-blocking network fabric that can manage traffic smoothly. Meta decided to open-source Catalina, a high-power rack that supports up to 140 kW. The other reference design you can typically find online is Bianca, but that is a fully integrated system from Nvidia.

Source: Meta Catalina Specification via OCP

Components within Meta's compute tray

1. GB200 High Performance Module (N) - Modular component that contains the CPU and GPU
2. Host Management Controller (N) - Control panel that monitors all parts, checks temperatures, etc.
3. ConnectX-7 (N) - Network interface card that talks to your GPU and the data center fabric
4. Power Distribution Board (M) - Receives bulk power and distributes it to everything else
5. Data Center Secure Control Module (M) - Low-speed chips are moved to the DC-SCM, making the HPM less dense and cheaper. Allows upgrades without changing the whole HPM
6. E1.S NVMe backplane (M) - Interconnect board connecting the SSDs to the main system
7. OSFP Carrier Board (M) - Acts as an interface between the compute trays and the network fabric. Best for thermals and maintainability
8. Front IO Board (M) - Interface board that keeps the motherboard cleaner
9. CX7 OCP NIC 3.0 (Commodity) - Separate board for the CX7 NIC to allow easier upgrades and simplify cooling and signal integrity

N = Nvidia-designed
M = Meta-designed
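
As a quick way to see how the tray splits between vendors, the list above can be written down as data and tallied. This is just my own transcription of the list, not anything taken from the specification itself.

```python
from collections import Counter

# (component, designer) pairs transcribed from the compute tray list above.
COMPUTE_TRAY = [
    ("GB200 High Performance Module", "Nvidia"),
    ("Host Management Controller", "Nvidia"),
    ("ConnectX-7", "Nvidia"),
    ("Power Distribution Board", "Meta"),
    ("Data Center Secure Control Module", "Meta"),
    ("E1.S NVMe backplane", "Meta"),
    ("OSFP Carrier Board", "Meta"),
    ("Front IO Board", "Meta"),
    ("CX7 OCP NIC 3.0", "Commodity"),
]

# Count how many boards each party is responsible for.
by_designer = Counter(designer for _, designer in COMPUTE_TRAY)

for designer, count in by_designer.most_common():
    print(f"{designer}: {count}")
```

Going by this list, Meta designed five of the nine boards, mostly the ones surrounding Nvidia's core module.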

Thermal Solution
Catalina combines liquid and fan cooling in a thermal solution suspected to be code-named "Channel Island". The liquid loop uses a PG25-based coolant at 10-12 degrees Celsius. Thermal sensors within the baseboard management controller (BMC) have a tolerance of ±2°C. Channel Island can also detect leaks via sensors, contain them through its mechanical design, and respond by shutting down power or turning off the supply once a leak is detected.
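
The detect-and-respond behaviour can be pictured as a small decision function. This is only a sketch of the idea and not Meta's firmware: the names `Reading`, `Action`, and `decide` are made up, the alert state is my own addition, and treating the ±2°C tolerance as a widening of the acceptable 10-12°C band is an assumption on my part.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NORMAL = "normal"
    ALERT = "raise alert"  # hypothetical state, not described in the source
    SHUT_OFF = "shut down power or turn off supply"


# Figures from the text above.
COOLANT_MIN_C = 10.0
COOLANT_MAX_C = 12.0
SENSOR_TOLERANCE_C = 2.0


@dataclass
class Reading:
    coolant_temp_c: float
    leak_detected: bool


def decide(reading: Reading) -> Action:
    """Pick a response for one sensor reading (illustrative policy only)."""
    # A detected leak always takes priority over temperature.
    if reading.leak_detected:
        return Action.SHUT_OFF
    # Assumption: widen the target band by the sensor tolerance before alerting.
    low = COOLANT_MIN_C - SENSOR_TOLERANCE_C
    high = COOLANT_MAX_C + SENSOR_TOLERANCE_C
    if not (low <= reading.coolant_temp_c <= high):
        return Action.ALERT
    return Action.NORMAL


print(decide(Reading(coolant_temp_c=11.0, leak_detected=False)).value)
print(decide(Reading(coolant_temp_c=11.0, leak_detected=True)).value)
```

The main design point is the ordering: a leak is checked first because it is the one condition where the described response is to cut power or supply outright.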

Within the compute tray, there are eight fans. These fans provide air cooling for the E1.S drives, the front-end CX7 OCP NICs, the DC-SCM, and the power conversion circuitry on the PDB. The thermal design is resilient and can continue to operate with a single fan rotor failure. A cold plate loop provides liquid cooling for the high-power components, such as the GB200 and the CX7 back-end NIC modules.

Electricity
From reading SemiAnalysis, I was intrigued to learn that power is cheaper in some US states than in other parts of the world. Notably, it costs a third of what it does in Singapore! Power demand is accelerating, and 50 MW+ per facility will no longer be enough. Legacy data centres will no longer be relevant.
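
To get a feel for why 50 MW stops being a lot, here is a rough estimate of how many 140 kW Catalina racks such a facility could power. The PUE value (the share of facility power lost to cooling and other overhead) is my own placeholder and not from any source, and the calculation assumes every rack draws its full 140 kW.

```python
import math

RACK_POWER_KW = 140      # Catalina's maximum rack power, from earlier in this post
FACILITY_POWER_MW = 50   # the facility size mentioned above
ASSUMED_PUE = 1.2        # hypothetical power usage effectiveness (assumption)


def max_racks(facility_mw: float, rack_kw: float, pue: float) -> int:
    """Number of full-power racks a facility can feed after overhead."""
    # Only facility power divided by PUE is left for the IT equipment itself.
    it_power_kw = facility_mw * 1000 / pue
    return math.floor(it_power_kw / rack_kw)


print(max_racks(FACILITY_POWER_MW, RACK_POWER_KW, ASSUMED_PUE))
```

Even with zero overhead (a PUE of 1.0) that works out to only a few hundred racks, which is small next to the cluster sizes AI labs are now talking about.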
 
Meta has also open-sourced other components beyond power and thermal design. Currently, they use Quanta, a Taiwanese supplier, to build their bare-metal hardware.