





















































Twilio Segment was purpose-built so that you don’t have to worry about your data. Forget the data chaos, dissolve the silos between teams and tools, and bring your data together with ease. So that you can spend more time innovating and less time integrating.
Hi,
Welcome to the ninth issue of Deep Engineering.
As CPUs, GPUs, TPUs, and custom accelerators proliferate, compilers have become the thin yet critical layer that enables both abstraction and performance.
Our feature this week looks at Multi-Level Intermediate Representation (MLIR)—a compiler infrastructure that promises to unify optimization across wildly different domains. Born at Google and now adopted in projects like OpenXLA, LLVM Flang, NVIDIA’s CUDA Quantum, and even hardware DSLs like Chisel, MLIR offers a powerful foundation—but one that comes with real‑world friction: steep learning curves, ecosystem fragmentation, and legacy integration challenges. We unpack where MLIR delivers, where developers struggle with it, and what its future might mean for software architects.
Building on this theme, we’re also kicking off a new series on Mojo🔥, a programming language built entirely on MLIR. Written by Ivo Balbaert, Lector at CVO Antwerpen and author of The Way to Go and Packt introductions to Dart, Julia, Rust, and Red, Building with Mojo (Part 1): A Language Born for AI and Systems explores Mojo’s origins, its design goals, and its promise to unify Pythonic ergonomics with AI‑scale performance. Future parts will go deeper—covering Mojo’s tooling, metaprogramming, hardware abstraction, and its role in simplifying development pipelines that currently span Python, CUDA, and systems languages.
Read on for our take on MLIR’s trajectory—and then take your first step into Mojo, a language built for the next wave of AI and systems programming.
To use a clichéd statement: hardware and software are becoming increasingly diverse and complex. And because modern workloads must run efficiently across this diversity, in the form of CPUs, GPUs, TPUs, and custom accelerators, compilers are now critical for both abstraction and performance. MLIR emerged to tame this complexity by enabling multiple layers of abstraction in one framework. MLIR has rapidly grown from a Google research project into an industry-wide technology. After MLIR was open-sourced and contributed to LLVM in 2019, its modular design attracted a broad community.
Today MLIR underpins projects beyond Google’s TensorFlow. For example, it is the foundation of OpenXLA, an open compiler ecosystem co-developed by industry leaders (AMD, Apple, NVIDIA, etc.) to unify ML model deployment on diverse hardware. It’s also inside OpenAI’s Triton (for GPU kernel optimization) and even quantum computing compilers like NVIDIA’s CUDA Quantum (which defines a “Quake” IR on MLIR). In hardware design, the LLVM-affiliated experimental CIRCT project applies MLIR to circuit design and digital logic – so much so that a modern hardware DSL like Chisel moved its back-end to MLIR for richer analysis than standard RTL provides. MLIR’s multi-dialect flexibility has proven useful well beyond machine learning.
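To make "multi-dialect" concrete, here is a minimal sketch using the upstream MLIR Python bindings (the `mlir` package built from LLVM; we are assuming a recent build with the standard dialects registered). It parses a tiny module that mixes the func and arith dialects and runs a small canonicalization pipeline over it; exact API details can vary between LLVM releases.

```python
# A minimal sketch of MLIR's multi-dialect design, assuming the upstream
# MLIR Python bindings (the `mlir` package built from LLVM) are installed
# with the standard dialects registered.
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

# One module mixing the `func` and `arith` dialects; real pipelines layer in
# many more (linalg, scf, gpu, llvm, ...) at different abstraction levels.
ASM = """
func.func @axpy(%a: f32, %x: f32, %y: f32) -> f32 {
  %0 = arith.mulf %a, %x : f32
  %1 = arith.addf %0, %y : f32
  return %1 : f32
}
"""

with Context():
    module = Module.parse(ASM)  # parse the textual IR into an in-memory module
    pm = PassManager.parse("builtin.module(canonicalize,cse)")
    pm.run(module.operation)    # older bindings expect the module itself here
    print(module)               # dump the optimized IR
```

The same textual IR, context, and pass-manager machinery is what every MLIR-based project builds on, whether the dialects involved describe tensor algebra, GPU kernels, or digital circuits.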
MLIR has also made inroads into a traditional compiled language. The new LLVM Fortran compiler (Flang) adopted MLIR to represent high-level Fortran IR (FIR), allowing more powerful optimizations than the old approach of jumping straight to LLVM IR. This MLIR-based Flang already achieves performance on par with classic Fortran compilers in many benchmarks (within a few percent of GCC’s Fortran). In fact, in 2024, AMD announced its next-gen Fortran compiler will be based on Flang/MLIR to target AMD GPUs and CPUs in a unified way.
However, MLIR’s adoption remains uneven across domains. For example, the LLVM C/C++ frontend (Clang) still uses its traditional monolithic pipeline. There is work in progress on a Clang IR dialect (“CIR”) to eventually bring C/C++ into MLIR, but Clang’s large legacy and stability requirements mean it won’t rewrite itself overnight.
MLIR is proving itself in new or specialized compilers (AI, HPC, DSLs) faster than it can retrofit into long-established general-purpose compilers. It is technically capable of being a general compiler framework, but the industry is still in transition.
Engineers may be enthusiastic about MLIR’s potential but also hit real pain points when evaluating it for production. One of the most cited is the explosion of dialects early in the project’s life. As Chris Lattner has put it:
“Unfortunately, this explosion happened very early in MLIR’s design, and many design decisions in these dialects weren’t ideal for the evolving requirements of GenAI. For example, much of this early work was directed towards improving TensorFlow and building OpenXLA, so these dialects weren’t designed with first-class PyTorch and GenAI support.”
The result was that by the time generative AI and PyTorch use cases rose, the upstream MLIR dialects (like linalg or tensor) were not a perfect fit for new workloads. Companies ended up forking or inventing their own dialects (e.g., Google’s StableHLO vs. others), leading to ecosystem fracture. Lattner describes it as an “identity crisis.” Architecturally, it is difficult to determine which dialects to build on or standardize around. On the bright side, the MLIR project recently established a new governance structure and an MLIR area team to improve consistency, but it will take time to harmonize the dialect zoo.
But probably the most practical pain point is day-to-day developer experience. Debugging an MLIR-based compiler can be challenging – error messages often come from deep in the MLIR/LLVM machinery, and stepping through multi-dialect lowering is hard. So, there are challenges and tradeoffs in MLIR adoption at both the organizational and individual levels. But how have these trade-offs played out in the real world: who is successfully using MLIR today, and what did they learn from it?
Despite the hurdles, some teams have embraced MLIR and demonstrated tangible benefits. Let’s explore four use cases:
MLIR’s value multiplies in “greenfield” projects or where incumbents are hitting limits. New hardware with no legacy compiler, new languages (like Mojo, which we will talk about shortly) or AI serving stacks that need every ounce of performance – these are where MLIR has shined. The most effective MLIR deployments often abstract MLIR behind a higher-level interface. Flang hides MLIR behind normal Fortran semantics for end-users; SiFive’s users see an AI runtime API, not MLIR directly; even OpenXLA exposes a compiler API and uses MLIR internally. This suggests a potential best practice to ease adoption: shield developers from MLIR’s complexity via good APIs or DSLs, so they benefit from it without needing to write MLIR from scratch.
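As a purely illustrative sketch of that best practice, the hypothetical KernelCompiler facade below (all names invented for this example, not taken from any real project) keeps MLIR lowering and pass pipelines as private implementation details and exposes only a small compile call to users.

```python
# Purely illustrative: a hypothetical facade that hides MLIR behind a small API,
# in the spirit of how Flang, SiFive's runtime, and OpenXLA keep MLIR internal.
# None of these names come from a real project; the MLIR-facing steps are stubbed.
from dataclasses import dataclass


@dataclass
class CompiledKernel:
    """Opaque handle returned to users; no MLIR types leak out of the facade."""
    name: str
    target: str
    artifact: bytes


class KernelCompiler:
    """Hypothetical high-level entry point wrapping an internal MLIR pipeline."""

    def __init__(self, target: str = "cpu") -> None:
        self.target = target

    def compile(self, name: str, source: str) -> CompiledKernel:
        ir = self._lower_to_mlir(source)   # front end: user source -> MLIR dialects
        ir = self._run_passes(ir)          # middle end: dialect-to-dialect lowering
        binary = self._codegen(ir)         # back end: LLVM IR / target code
        return CompiledKernel(name, self.target, binary)

    # Stubs standing in for the MLIR-driving internals.
    def _lower_to_mlir(self, source: str) -> str:
        return f"// (would emit MLIR for {len(source)} bytes of source)"

    def _run_passes(self, ir: str) -> str:
        return ir  # would run canonicalization, fusion, bufferization, ...

    def _codegen(self, ir: str) -> bytes:
        return ir.encode()  # would produce object code or a runtime module


# Users see a plain Python API; MLIR never appears in their code.
kernel = KernelCompiler(target="gpu").compile("axpy", "y = a * x + y")
print(kernel.name, kernel.target, len(kernel.artifact))
```

The design choice is the point here: the team that owns the facade absorbs MLIR's complexity once, and everyone downstream gets its benefits through an ordinary API.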
No discussion of MLIR in 2025 is complete without Mojo – a new programming language from Modular (a company founded by Chris Lattner and others) that has been making waves. Mojo is, in many ways, a distillation of what MLIR can enable in software design. It’s billed as a superset of Python, combining Python’s ease with C++/Rust-like performance. Under the hood, Mojo is built entirely on MLIR – in fact, Mojo’s compiler is an MLIR pipeline specialized for the language. This design choice sheds light on what MLIR brings that classic LLVM IR could not.
Mojo’s success so far validates MLIR’s promised benefits. Within a few months of Mojo’s preview release, the Modular team itself used Mojo to write all the high-performance kernels in their AI engine. As we mentioned earlier, Mojo was born because writing those kernels in pure MLIR was too slow – by creating a high-level language that compiles via MLIR, the Modular team combined productivity with performance.
Figure 1.1: “Mojo is built on top of MLIR, which makes it uniquely powerful when writing systems-level code for AI workloads.” (Source: Modular Blog)
Mojo’s compile-time cost is mitigated by MLIR’s design as well – parallelizing and caching in the compiler are easier with MLIR’s explicit pass pipeline, so Mojo can afford to do more heavy analysis without long build times. The language is still young, but it shines a promising light on what’s possible.
(As an aside for readers, Mojo’s use of MLIR is a deep topic on its own. In Building with Mojo (Part 1): A Language Born for AI and Systems, Ivo introduces Mojo’s origins, design goals, and its promise to unify Pythonic ergonomics with AI-scale performance—but only at a high level. Later parts of the series will go deeper into Mojo’s internals, including how MLIR enables compile-time metaprogramming, hardware-specific optimizations, and seamless Python interoperability. To receive these articles in your inbox as soon as they are published, subscribe here.)
MLIR’s trajectory over the past year shows cautious but real momentum toward broader adoption. The community has addressed key pain points like dialect fragmentation with new governance and curated core dialects, while new tooling—such as the Transform dialect presented at CGO 2025—lowers the barrier for tuning compiler optimizations. Proposed additions like a WebAssembly dialect and Clang CIR integration suggest MLIR is expanding beyond its “ML-only” roots into systems compilers and web domains. Industry trends reinforce its relevance: heterogeneous compute continues to grow, and MLIR already underpins projects like OpenXLA with backing from NVIDIA, AMD, Intel, Apple, and AWS. Still, its success depends on balancing generality with usability and proving its value beyond Google and Modular; competing approaches like SPIR‑V and TVM remain viable alternatives. Yet with advocates like Chris Lattner, ongoing research from firms like Meta and DeepMind, and AMD and Fujitsu adopting MLIR for HPC compilers, it’s likely to become a cornerstone of future compiler infrastructure if it maintains this pace.
IREE – MLIR-Based Compiler & Runtime
Intermediate Representation Execution Environment (IREE) is an open-source end-to-end compiler and runtime for machine learning models, built on MLIR. In the OpenXLA ecosystem, IREE serves as a modular MLIR-based compiler toolchain that can lower models from all major frameworks (TensorFlow, PyTorch, JAX, ONNX, etc.) into highly optimized executables for a wide variety of hardware targets.
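To give a flavour of the developer experience, the hedged sketch below compiles a small element-wise multiply module for a CPU target using the iree-compiler Python package; the compile_str entry point and the "llvm-cpu" backend name follow IREE's documented Python API at the time of writing and may change between releases.

```python
# A hedged sketch of compiling a small MLIR module with IREE's Python tooling.
# Requires the `iree-compiler` package; `compile_str` and the "llvm-cpu" backend
# name follow IREE's documented Python API and may shift between releases.
import iree.compiler as ireec

MLIR_MODULE = """
func.func @simple_mul(%a: tensor<4xf32>, %b: tensor<4xf32>) -> tensor<4xf32> {
  %0 = arith.mulf %a, %b : tensor<4xf32>
  return %0 : tensor<4xf32>
}
"""

# Lower the module through IREE's MLIR pipeline to a VM flatbuffer for the CPU.
flatbuffer = ireec.compile_str(
    MLIR_MODULE,
    target_backends=["llvm-cpu"],
)

# The resulting bytes can be written to disk or loaded and executed with the
# companion iree-runtime package (`iree.runtime`) on the matching device.
print(f"Compiled IREE module: {len(flatbuffer)} bytes")
```

Swapping the target backend (for example to a GPU or embedded target) is the main change needed to retarget the same module, which is precisely the portability argument IREE and OpenXLA make.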
Highlights:
That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.
Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.
We’ll be back next week with more expert-led content.
Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering
If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.