MLSYS ENGINEERING

Preface

In 2025, as the AI arms race intensified, Meta spent billions to recruit a handful of top AI researchers. Yet it also restructured its legendary FAIR lab, let go of hundreds of AI researchers, and saw the departure of pioneers like Turing Award winner Yann LeCun. Meanwhile, Elon Musk eliminated the title of "researcher" at xAI, reclassifying everyone as an engineer.

These events signal a fundamental shift in the demand for AI talent. Research budgets are concentrating on a few high-paying roles, away from the majority. The focus is no longer open-ended exploration but product-led research. A company now needs only a few elite researchers to make the algorithmic breakthroughs, backed by an army of machine learning engineers to improve the model's product quality.

The industry's goals have become clear and urgent: improve AI and accelerate its mass adoption using existing technology. Training large models is extremely compute-intensive and costly, so companies concentrate scarce compute resources on a small number of the most promising researchers to reduce time and expense. As a result, people with alternative research visions, or those who are not at the top of the job market, are increasingly sidelined, even if their ideas are valid. There is simply no time or resources left for them.

Today, researchers and machine learning engineers want to learn more about Machine Learning Systems (MLSys) because compute is the bottleneck. Running the models more efficiently means you can get more things done with the given compute. For example, researchers are encouraged to learn how to write custom kernels to optimize the performance of their models.

At the same time, the demand for MLSys engineers is growing steadily. Tech giants are buying up every scrap of compute available on the planet, and even putting some in space, yet they remain "compute-hungry." Companies are bottlenecked by hardware, creating a massive need for MLSys engineers who can ease the crunch by making systems more efficient.

This demand will only continue to grow. While NVIDIA currently dominates the hardware market and charges a steep premium, everyone is trying to work around this. Other hardware vendors are hiring MLSys engineers to build competitive software stacks, while model builders need MLSys engineers to port workloads to more diverse, affordable hardware. In addition, as models grow larger and more complex, the complexity of the systems supporting them will only increase.

The supply of MLSys engineers, and of researchers or ML engineers with good MLSys skills, remains low because the field is incredibly challenging. Once you master it, you become highly valuable and enjoy immense job security. Seen from another angle, NVIDIA's primary moat, the CUDA ecosystem, remains formidable precisely because it is hard to replicate. If MLSys engineering were easy, NVIDIA would have lost its monopoly years ago.

I was a little skeptical about the low supply at first. I thought the founding fathers of compilers and distributed computing would easily fill all the demand for MLSys engineers, but that turned out to be false. First, there are not many of them. The tech industry has grown enormously, and the talent pool from decades ago is too small to meet today's demand. Second, they are not at an age or in a position to switch careers. They are indispensable at their current jobs, which may involve maintaining some fundamental technology that powers the entire industry. They are unlikely to start fresh as MLSys engineers.

In summary, the industry is shifting from wanting more elite researchers to wanting more product-driven researchers, ML engineers, and MLSys engineers. All of them need to upskill in MLSys to stay relevant. The gap between high demand and low supply of MLSys engineers, together with the challenging nature of the field, makes MLSys engineering a rare and vital skill.

While many of my friends with exceptional skills have joined that short list of elite researchers, I chose to become an MLSys engineer, based on my passion and the reasoning above. The transition wasn't easy. The learning curve was steep, and at times, the knowledge felt impenetrable.

I went through those hurdles so you don't have to. I've documented what I learned so you can navigate the MLSys landscape more smoothly. Getting there takes time and energy, but you aren't alone on this journey. I, and fellow readers from around the world, are with you. Now, hoist the sails. We are ready to set out!

The Author
San Francisco, CA
December 2025