I’m also really curious about how open source might help reduce costs over time. I remember how open-sourcing Android lowered phone manufacturing costs, and I’m hoping to see something similar happen here—though it’ll definitely take time. And for Meta, the benefit seems straightforward: lower costs for them as well. (Update, Nov 2025: I’m more skeptical about open-source hardware now.)
Building AI clusters takes more than GPUs: networking and bandwidth are what keep them fast. Meta's open-source system is built around an isolated high-bandwidth network that connects all of its GPUs and domain-specific accelerators. Bandwidth is expected to grow by 5x to 10x by 2030. Hopper GPUs had 900 GB/s of NVLink bandwidth; Blackwell doubles that to 1.8 TB/s. Each new GPU generation at least doubles compute throughput, which forces interconnect bandwidth to keep up so the chips aren't left waiting for data.
To support this, AI labs need a high-performance, multi-tier, non-blocking network fabric that can manage traffic smoothly. Meta decided to open-source Catalina, a high-power rack that supports up to 140 kW. The other reference design you can typically find online is Bianca, but that is a fully integrated system from Nvidia.
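To make the "chips waiting for data" point concrete, here is a rough sketch. The NVLink figures come from the numbers above; the workload (90 GB exchanged and 0.4 s of compute per step) is entirely hypothetical, and the model assumes compute and communication never overlap, which real systems try hard to avoid.

```python
# Per-GPU NVLink bandwidth from the text, in GB/s.
HOPPER_NVLINK_GB_S = 900.0
BLACKWELL_NVLINK_GB_S = 2 * HOPPER_NVLINK_GB_S  # "twice that"


def transfer_seconds(payload_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return payload_gb / bandwidth_gb_s


def waiting_fraction(compute_s: float, comm_s: float) -> float:
    """Share of a step spent waiting on data, assuming compute and
    communication do not overlap (a deliberate simplification)."""
    return comm_s / (compute_s + comm_s)


# Hypothetical workload, chosen only for illustration.
PAYLOAD_GB = 90.0
COMPUTE_S = 0.4

# Starting point: Hopper-class link, original compute time.
baseline = waiting_fraction(
    COMPUTE_S, transfer_seconds(PAYLOAD_GB, HOPPER_NVLINK_GB_S)
)
# Compute throughput doubles (time halves) but the link stays at 900 GB/s.
faster_compute_only = waiting_fraction(
    COMPUTE_S / 2, transfer_seconds(PAYLOAD_GB, HOPPER_NVLINK_GB_S)
)
# Compute throughput doubles and the link doubles with it.
both_doubled = waiting_fraction(
    COMPUTE_S / 2, transfer_seconds(PAYLOAD_GB, BLACKWELL_NVLINK_GB_S)
)

print(f"{baseline:.2f} {faster_compute_only:.2f} {both_doubled:.2f}")
```

Under these made-up numbers, doubling compute alone pushes the waiting share from about a fifth of each step to about a third, and doubling the link as well brings it back to where it started.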
1. GB200 High Performance Module (N) - Modular component that contains the CPU and GPU
5. Data Center Secure Control Module (M) - Low-speed chips are moved to the DC-SCM, making the HPM less dense and cheaper. Allows upgrades without replacing the whole HPM
6. E1.S NVMe backplane (M) - Interconnect board connecting the SSDs to the main system
7. OSFP Carrier Board (M) - Acts as an interface between the compute trays and the network fabric. Good for thermals and maintainability.
8. Front IO Board (M) - Interface board that keeps the motherboard cleaner
9. CX7 OCP NIC 3.0 (Commodity) - Separate board for the CX7 NIC to allow easier upgrades and simplify cooling and signal integrity
N = Nvidia-designed
M = Meta-designed
Catalina uses a combination of liquid and fan cooling, in a design with the suspected code name "Channel Island". It uses a PG25-based coolant at a temperature of 10-12 degrees Celsius. It has thermal sensors within the baseboard management controller (BMC) that tolerate ±2°C. Channel Island can also detect leaks via sensors, contain them through its mechanical design, and respond by shutting down power or turning off the coolant supply once a leak is detected.
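The detect-and-respond behaviour described above can be sketched roughly as follows. This is not Meta's firmware or any real BMC API: the names `Reading`, `Action`, and `decide` are hypothetical, and treating the ±2°C tolerance as a band around the 10-12°C coolant range is my own reading of the numbers.

```python
from dataclasses import dataclass
from enum import Enum

# Figures from the text; applying the tolerance around the range is an assumption.
COOLANT_MIN_C = 10.0
COOLANT_MAX_C = 12.0
TOLERANCE_C = 2.0


class Action(Enum):
    NONE = "none"          # everything within limits
    ALERT = "alert"        # temperature outside the tolerated band
    SHUT_OFF = "shut_off"  # cut power or coolant supply after a leak


@dataclass
class Reading:
    """One hypothetical sensor sample for a compute tray."""
    coolant_temp_c: float
    leak_detected: bool


def decide(reading: Reading) -> Action:
    """Pick a response; a leak always takes priority over temperature."""
    if reading.leak_detected:
        return Action.SHUT_OFF
    low = COOLANT_MIN_C - TOLERANCE_C
    high = COOLANT_MAX_C + TOLERANCE_C
    if not (low <= reading.coolant_temp_c <= high):
        return Action.ALERT
    return Action.NONE


if __name__ == "__main__":
    print(decide(Reading(coolant_temp_c=11.0, leak_detected=False)))
    print(decide(Reading(coolant_temp_c=15.5, leak_detected=False)))
    print(decide(Reading(coolant_temp_c=11.0, leak_detected=True)))
```

The one design choice worth noting is the ordering: a leak short-circuits everything else, because the text describes shutting off power or supply as the response once a leak is detected.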
Within the compute tray, there are eight fans. These fans provide air cooling for the E1.S drives, the front-end CX7 OCP NICs, the DC-SCM, and the power conversion circuitry on the PDB. The thermal design is resilient and can continue to operate with a single fan rotor failure. The cold plate loop provides liquid cooling for the high-powered components, such as the GB200 and the CX7 back-end NIC modules.
Electricity
From reading SemiAnalysis, I was intrigued to learn that power is cheaper in some US states than in some other parts of the world. Notably, costs are a third of Singapore's! Power demand is growing quickly, and 50 MW+ per facility will no longer be enough. Legacy data centres will no longer be relevant.
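For a sense of scale, here is a back-of-the-envelope sketch combining the 140 kW Catalina rack figure with a 50 MW facility. The `pue` parameter (power usage effectiveness) is something I have added for illustration; the default of 1.0 pretends every watt reaches the racks, which no real site achieves, and the 1.3 in the usage lines is a hypothetical value.

```python
import math

RACK_POWER_KW = 140.0     # Catalina's maximum rack power, from the text
FACILITY_POWER_MW = 50.0  # facility size mentioned in the text


def max_racks(facility_mw: float, rack_kw: float, pue: float = 1.0) -> int:
    """Whole racks a facility can power at full load.

    pue is total facility power divided by IT power. The default of 1.0
    is an idealised assumption, not a figure for any real site.
    """
    it_power_kw = facility_mw * 1000.0 / pue
    return math.floor(it_power_kw / rack_kw)


if __name__ == "__main__":
    # Idealised case: all facility power goes to racks.
    print(max_racks(FACILITY_POWER_MW, RACK_POWER_KW))
    # Hypothetical overhead for cooling and power distribution.
    print(max_racks(FACILITY_POWER_MW, RACK_POWER_KW, pue=1.3))
```

Even in the idealised case, a 50 MW facility tops out at a few hundred racks of this class, which is a small cluster by current standards.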