Compute is a key lever for AI progress. I wanted to work on the business side of AI infrastructure, but that itself encompasses many layers: business development, operations, finance, technical program management, and so on. Since it really comes down to the right role opening up at the right time, I figured it’s better to build a solid big-picture understanding of how everything fits together, and then go deep on the specifics when the opportunity comes.

Having an entrepreneurial mindset, plus backgrounds in finance and data science, sets me up well for system-level thinking. I think managing costs, hardware/software optimisation, and actually executing are the things that matter most, and I wanted a skill set that lines up with that. I love adventures, and the idea of taking on a role that involves navigating ambiguity in a fast-moving environment really excites me.

The goal of this blog is to cover:

High-level aspects of how modern machine learning infrastructure works
Hardware advancements that accelerate deep learning workloads
Industry insights

It builds on themes from my previous blog posts (linked at the end of this post), including supply chain dynamics, neocloud, TCO analysis, hardware fundamentals, and OCP reflections.

Machine Learning Infrastructure

A full ML platform usually has two parallel data pipelines:

Real-time pipeline: Handles data that arrives continuously and needs low-latency processing.
Batch pipeline: Processes large historical datasets on a schedule (hourly, daily, etc.) and is usually used for training.

Real-time: Apache Kafka receives a continuous stream of events, such as log records or click events. Flink consumes these events and performs ETL (Extract, Transform, Load) into the real-time feature store. The prediction service then uses the latest model and features to make instant predictions.
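
A minimal sketch of the real-time path described above, using in-memory stand-ins: a plain list plays the role of the Kafka topic, a function plays the role of the Flink job, and a dict plays the role of the real-time feature store. The event fields, the `click_count` feature, and the toy `predict` rule are all hypothetical; a real deployment would use the actual Kafka and Flink clients.

```python
from collections import defaultdict

# Stand-in for a Kafka topic: a stream of events (hypothetical schema).
events = [
    {"user_id": "u1", "type": "click"},
    {"user_id": "u2", "type": "click"},
    {"user_id": "u1", "type": "click"},
    {"user_id": "u1", "type": "view"},
]

# Stand-in for the real-time feature store: user_id -> feature dict.
feature_store = defaultdict(lambda: {"click_count": 0})


def stream_etl(event, store):
    """Stand-in for the Flink job: extract, transform, load one event."""
    if event["type"] != "click":  # transform: keep only click events
        return
    store[event["user_id"]]["click_count"] += 1  # load: update the feature


def predict(user_id, store):
    """Toy 'model': treat a user as engaged once they have 2+ clicks."""
    return store[user_id]["click_count"] >= 2


for event in events:  # consume events one by one, as they arrive
    stream_etl(event, feature_store)

print(predict("u1", feature_store))  # True
print(predict("u2", feature_store))  # False
```

The point of the structure is that the prediction call only reads features that the streaming job has already kept up to date, which is what keeps latency low.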

Batch: The data lake stores large volumes of historical data, which a Spark ETL job processes. The resulting features go to the batch feature store, which serves both training and batch inference, while the features and labels go to training. Batch prediction jobs periodically run predictions over large datasets and save the output back to the data lake.
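
A minimal sketch of the batch path described above, again with in-memory stand-ins: a list of historical records plays the role of the data lake and a function plays the role of the Spark ETL job. The record fields (`user_id`, `amount`, `defaulted`) and the derived features are hypothetical.

```python
from collections import defaultdict

# Stand-in for the data lake: historical records (hypothetical schema).
data_lake = [
    {"user_id": "u1", "amount": 100.0, "defaulted": 0},
    {"user_id": "u1", "amount": 300.0, "defaulted": 0},
    {"user_id": "u2", "amount": 50.0, "defaulted": 1},
]


def batch_etl(records):
    """Stand-in for the Spark ETL job: aggregate records per user.

    Returns (batch_feature_store, training_rows), where training_rows
    pairs each user's features with a label.
    """
    amounts = defaultdict(list)
    labels = {}
    for r in records:
        amounts[r["user_id"]].append(r["amount"])
        # Label: did the user ever default in the historical data?
        labels[r["user_id"]] = max(labels.get(r["user_id"], 0), r["defaulted"])

    feature_store = {
        user: {"avg_amount": sum(vals) / len(vals), "n_txns": len(vals)}
        for user, vals in amounts.items()
    }
    training_rows = [(feature_store[u], labels[u]) for u in feature_store]
    return feature_store, training_rows


batch_features, training_rows = batch_etl(data_lake)
print(batch_features["u1"])  # {'avg_amount': 200.0, 'n_txns': 2}
print(training_rows[1])      # ({'avg_amount': 50.0, 'n_txns': 1}, 1)
```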

Why do you need batch inference?
Some predictions are too computationally intensive to run on demand, so they are computed ahead of time. The end user might not request the prediction directly, but a downstream system does. For example, a financial system loads precomputed risk rankings for reporting.
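
As a sketch of the risk-ranking example: a scheduled job scores every account once and stores the result, so the reporting system only does a lookup. The accounts, the `risk_score` formula, and the function names are hypothetical placeholders for an expensive model.

```python
def risk_score(account):
    """Placeholder for an expensive model call (hypothetical formula)."""
    return account["exposure"] * account["default_prob"]


def run_batch_prediction(accounts):
    """Scheduled job: score all accounts and rank them (1 = highest risk)."""
    scored = [(acc["id"], risk_score(acc)) for acc in accounts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {acc_id: rank for rank, (acc_id, _) in enumerate(scored, start=1)}


accounts = [
    {"id": "A", "exposure": 1000.0, "default_prob": 0.02},
    {"id": "B", "exposure": 500.0, "default_prob": 0.10},
    {"id": "C", "exposure": 2000.0, "default_prob": 0.005},
]

# Run once on a schedule; in practice the result would be saved to the data lake.
risk_rankings = run_batch_prediction(accounts)

# At report time the system reads a stored value instead of running the model.
print(risk_rankings["B"])  # 1
```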

System Constraints

ML workloads are also shaped by hard physical limits:

Input/Output: How fast can data be fed into compute workloads?
Compute: How many TFLOPS are needed?
Memory: How much capacity is needed to hold models, and how much bandwidth is needed to move data between memory and compute?
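
One rough way to reason about these limits is to estimate how long a single step would spend on each resource and see which one dominates. The sketch below does that; the function name and every number in it are made-up placeholders, not measurements of any real system.

```python
def bottleneck(input_gb, io_gb_per_s, flops_needed, peak_flops,
               bytes_moved, mem_bw_bytes_per_s):
    """Return (name, seconds) for the slowest of the three resources."""
    times = {
        "io": input_gb / io_gb_per_s,
        "compute": flops_needed / peak_flops,
        "memory_bandwidth": bytes_moved / mem_bw_bytes_per_s,
    }
    name = max(times, key=times.get)
    return name, times[name]


# Hypothetical step: 1 GB of input at 4 GB/s, 1e14 FLOPs on a 1e15 FLOP/s
# device, and 1e12 bytes moved at 2e12 bytes/s of memory bandwidth.
name, seconds = bottleneck(
    input_gb=1.0, io_gb_per_s=4.0,
    flops_needed=1e14, peak_flops=1e15,
    bytes_moved=1e12, mem_bw_bytes_per_s=2e12,
)
print(name, seconds)  # memory_bandwidth 0.5
```

This ignores overlap between data loading, compute, and memory traffic, so it is only a first-order estimate of where the time goes.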

This makes hardware-software optimisation critical. For inference, this means optimising the entire stack, from GPU hardware to the model runtime. Currently, NVIDIA GPUs dominate, and NVIDIA’s TensorRT library is highly optimised for deep learning inference. Large companies such as Uber and Meta have their own heavily optimised, proprietary internal inference stacks. From my experience in the AI supercomputing centre, LLMs are often limited by memory rather than compute. My understanding is that this impacts training more than inference. Uber* talked about offloading memory to system memory or NVMe SSDs, but that makes network/PCIe bandwidth the bottleneck. OCP* talked about CXL (Compute Express Link) as a next-generation memory solution, but based on my conversations with experts, the software stack is still limited.
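
A back-of-the-envelope sketch of why offloading shifts the bottleneck to the link. The 7-billion-parameter model and both bandwidth figures are illustrative assumptions, not vendor specifications.

```python
def weights_bytes(n_params, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes/param for 16-bit)."""
    return n_params * bytes_per_param


def read_time_s(n_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to stream n_bytes once over a link."""
    return n_bytes / bandwidth_bytes_per_s


N_PARAMS = 7e9       # assumed model size
DEVICE_BW = 2e12     # assumed on-device memory bandwidth, bytes/s
HOST_LINK_BW = 5e10  # assumed host-to-device link bandwidth, bytes/s

size = weights_bytes(N_PARAMS)
print(size / 1e9)                       # 14.0 (GB of 16-bit weights)
print(read_time_s(size, DEVICE_BW))     # 0.007 (seconds, from device memory)
print(read_time_s(size, HOST_LINK_BW))  # 0.28 (seconds, over the host link)
```

Under these assumed numbers, each pass over offloaded weights takes about 40 times longer than reading them from device memory, which is the kind of link bottleneck mentioned above.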

Industry Landscape: Technical Debt

One issue companies face is the decentralisation of workloads. Some companies do not have a uniform pipeline for managing training and prediction data at scale. In one case*, engineering teams were building bespoke one-off systems to use models created by data scientists. Models could not be trained on anything larger than a data scientist’s desktop, and there was no established path to deploying a model in production. Furthermore, technical debt can be difficult to detect because it exists at the system level rather than at the code level.

Google’s paper, “Hidden Technical Debt in Machine Learning Systems,” captures these issues well.

“For instance, consider a system that uses features x1,...xn in a model. If we change the input distribution of values in x1, the importance, weights, or use of the remaining n − 1 features may all change. This is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature xn+1 can cause similar changes, as can removing any feature xj. No inputs are ever really independent.”

The authors call this the “CACE” principle, Changing Anything Changes Everything, and give in-depth examples of how ML systems accumulate debt.
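
A small, contrived illustration of CACE, not taken from the paper: the target below is `x1 * x2`, and a linear model is fit on the two features. Shifting only the mean of `x1` changes the weight the model learns for `x2`, even though `x2` itself is untouched. The data-generating process and the function names are assumptions made for this demo.

```python
import random


def fit_two_feature_linear(xs1, xs2, ys):
    """Least-squares weights (w1, w2) for y ~ w1*x1 + w2*x2 + b."""
    n = len(ys)
    m1, m2, my = sum(xs1) / n, sum(xs2) / n, sum(ys) / n
    s11 = sum((a - m1) ** 2 for a in xs1)
    s22 = sum((b - m2) ** 2 for b in xs2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(xs1, xs2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(xs1, ys))
    s2y = sum((b - m2) * (y - my) for b, y in zip(xs2, ys))
    det = s11 * s22 - s12 ** 2  # solve the 2x2 normal equations
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


def learned_weights(x1_mean, n=20000, seed=0):
    """Fit the linear model on synthetic data where only x1's mean varies."""
    rng = random.Random(seed)
    xs1 = [rng.gauss(x1_mean, 1.0) for _ in range(n)]
    xs2 = [rng.gauss(2.0, 1.0) for _ in range(n)]  # x2 is never changed
    ys = [a * b for a, b in zip(xs1, xs2)]         # target depends on both
    return fit_two_feature_linear(xs1, xs2, ys)


w1_before, w2_before = learned_weights(x1_mean=1.0)
w1_after, w2_after = learned_weights(x1_mean=3.0)
print(round(w2_before, 2), round(w2_after, 2))
```

In this setup the learned weight on `x2` moves from roughly 1 to roughly 3 when only the distribution of `x1` changes, which is the entanglement the quote describes.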

Some of my related blog posts:

TCO: https://finetti.posthaven.com/total-cost-of-ownership-analysis
Hardware: https://finetti.posthaven.com/metas-open-sourced-hardware
Hardware Part Two: https://finetti.posthaven.com/scaling-hardware-isnt-straightforward
Risk of Supply Chain: https://finetti.posthaven.com/risk-and-realities-of-supply-chain
OCP 2025 Reflection: https://finetti.posthaven.com/there-is-money-in-the-table-for-a-reason
Neocloud: https://finetti.posthaven.com/neo-clouds
Opportunity in AI Infra: https://finetti.posthaven.com/turns-out-i-am-10-years-late

Sources:

*https://www.uber.com/blog/from-predictive-to-generative-ai/
*https://www.opencompute.org/documents/ocp-document-submission-ai-co-design-docx-pdf
*https://www.uber.com/blog/scaling-ai-ml-infrastructure-at-uber/