AI on the edge is moving fast—and that speed is both an opportunity and a problem. In this post, we introduce AI2EDGE, a publicly funded project supported by the European Union and the Ministerium für Wirtschaft, Industrie, Klimaschutz und Energie des Landes Nordrhein-Westfalen (MWIKE). The project brings together compiler technology and virtual platforms to make it easier to evaluate and deploy AI workloads on constrained, heterogeneous hardware. Before diving into the "how", it's worth clarifying the "why": edge AI deployment is still too hard, too slow, and too dependent on having the right hardware on your desk.
Training an AI model is only half the job. The other half — getting it to run reliably and efficiently on the target device — often consumes the most time. In practice, teams face a fragmented ecosystem: different frameworks and model formats (e.g., TensorFlow, Caffe, Apache TVM), rapidly changing toolchains, and inevitable incompatibilities. Some approaches age out quickly, leaving behind abandoned conversion scripts and brittle pipelines.
On edge devices (phones, embedded Linux systems, microcontrollers), the constraints are harsher: limited memory, limited compute, strict power budgets, and a smaller software stack. Even if you manage to get a model running, performance and efficiency can still be far from production-ready.
This is where the compiler toolchain of our partner Roofline AI comes into play: it helps bridge the gap between models and diverse hardware targets by turning AI workloads into efficient implementations for the chosen platform.

Compiler support alone doesn't remove a major real-world bottleneck: hardware availability.
Many companies evaluate multiple chips and acceleration options in parallel. But physical prototypes are often scarce, arrive late, or are shared among many teams. That makes early software bring-up, performance exploration, and regression testing difficult.
MachineWare addresses this with Virtual Platforms (VPs). A VP simulates a complete microprocessor-based system on a general-purpose computer. For example, a VP can include a RISC-V CPU model, peripherals, and a Neural Processing Unit (NPU).

When done well, software developers can work against a VP with the same workflows they would use on real hardware—often long before physical devices are broadly available. Because it's software, a VP can also be cloned, versioned, and integrated into CI/CD for repeatable regression testing.
Within AI2EDGE, MachineWare contributes their SIM-V instruction-set simulator for CPU simulation and their open-source peripheral modeling library VCML.
AI2EDGE combines these two worlds: Roofline's compilation technology and MachineWare's simulation technology. The goal is a workflow where teams can quickly and repeatedly answer questions like "Can my model run on chip X, and what performance should I expect?" without needing physical prototypes. This enables rapid iteration across hardware options and reduces risk when selecting a target platform. Use cases and requirements are defined together with Fraunhofer IPT. Overall, the project aims to deliver an integrated system that combines both.

If you're working on edge AI deployment and want to reduce the friction between "model trained" and "model running on target hardware," AI2EDGE aims to close that gap by pairing robust compilation with realistic, automation-friendly virtual platforms.

We are building the deployment platform for edge AI and are looking for exceptional people to join us.
If you want to help bring the next generation of AI software infrastructure to market, we would love to hear from you.
We have the following open positions:
- Content Marketing Lead
- AI Compiler Engineer (Senior Staff, Senior, Junior, Master Thesis)
- ML Infrastructure and Validation Engineer
- Build System & Packaging Engineer
All roles: https://lnkd.in/dA2y4f6y
#Hiring #EdgeAI #AIDeployment #AICompiler #Roofline

We're excited to present two talks at this year's EuroLLVM Developers' Meeting by the LLVM Foundation.
Florian Walbroel will present our open-source tool mlir-track-src for tracking operations through MLIR pass pipelines: https://lnkd.in/demfG--8
Ege Beysel will talk about optimizations for efficient tiling and vectorization in MLIR's linalg dialect.
Together with Maximilian Bartel, they will be in Dublin for the entire conference. Reach out if you are around!
#MLIR #LLVM #AICompiler #EdgeAI #AIDeployment #OpenSource #Roofline

We recently announced compiler enablement for NXP Semiconductors' eIQ® Neutron NPU. Today, we’re sharing a hands-on demo of how that helps developers iterate faster on their edge AI products.
Our AI Engineer Juan Pisula built a factory monitoring agent on NXP's i.MX 95 applications processor. It detects fires and triggers actions, using a combination of vision and language models.
The joint NXP × roofline software enablement provides broad model support across CPU, GPU, and NPU, allowing developers to easily swap and test models. In this demo, switching from a full VLM to CLIP for a more targeted "fire vs no fire" classification delivers a 7x speed-up.
That’s what faster iteration cycles look like, and they translate into a shorter time-to-market for edge AI products.

Tech talks, food, drinks, and plenty of time to connect with the LLVM community. Whether you are a student, researcher, or seasoned compiler engineer: come by!
📍 Design Offices Dominium, Tunisstr. 19-23, 50667 Köln
🗓️ Tuesday, March 31 · 18:00h–21:00h
Join here: https://lnkd.in/ekiaE6Rf

Edge AI innovation is accelerating, and software velocity is its key enabler. NXP and Roofline have teamed up to showcase how scalable software infrastructure, combined with deep hardware-specific optimizations, unlocks NPU-based systems for real-world adoption.
Starting with LLM enablement for NXP’s eIQ Neutron NPU on the i.MX 95 applications processor, we highlight three tangible advantages: 1) broad model coverage across cutting-edge LLMs, 2) support for larger models exceeding the NPU's 2 GB local memory, and 3) up to 3.2x faster LLM prefill compared to CPU-only execution.
By orchestrating heterogeneous execution across CPU and NPU and offloading matrix multiplications at the compiler level, we enable full SoC utilization and Day-0 support for the latest models.
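To make the idea of compiler-level offloading more concrete, here is a minimal Python sketch of how a partitioner might split a model into CPU and NPU segments. It is purely illustrative and not the Roofline compiler or any NXP API: the `Op` type, the `NPU_KINDS` set, and the operation names are all hypothetical, and the only rule assumed is "matrix multiplications go to the NPU, everything else stays on the CPU".

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Op:
    """Hypothetical operation in a linear model graph."""
    name: str
    kind: str  # e.g. "matmul", "softmax", "layernorm"


# Assumption for this sketch: only matrix multiplications are offloaded.
NPU_KINDS = {"matmul"}


def assign_device(op: Op) -> str:
    """Pick a device for a single op based on its kind."""
    return "npu" if op.kind in NPU_KINDS else "cpu"


def partition(ops: list[Op]) -> list[tuple[str, list[str]]]:
    """Group consecutive ops that share a device into one segment."""
    return [
        (device, [op.name for op in group])
        for device, group in groupby(ops, key=assign_device)
    ]


ops = [
    Op("ln1", "layernorm"),
    Op("q_proj", "matmul"),
    Op("k_proj", "matmul"),
    Op("attn_softmax", "softmax"),
    Op("out_proj", "matmul"),
]

for device, names in partition(ops):
    print(device, names)
```

Grouping consecutive ops keeps the number of CPU-to-NPU handovers small, which is one reason a real partitioner would likely use a cost model rather than a fixed rule like the one assumed here.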
Read the full case study for technical details, performance insights, and the practical implications for developers building on NXP hardware: https://lnkd.in/d86p7Gks
Thanks to Dr. Sebastian Vogel, Lennart Bamberg, Ali O. Ors, Moritz Riesterer, Davis Sawyer, and the entire NXP team for the collaboration, as well as Toradex for the provided i.MX 95 EVK.
#EdgeAI #AIDeployment #AICompiler #MLIR #IREE #NXP #Roofline

LLMs are moving onto edge devices and naturally come with variable prompt lengths. Unlike traditional inference with fixed input sizes, LLM prefill therefore operates on dynamic input shapes. At the same time, edge GPUs and NPUs are typically optimized for fixed-size computations. This makes dynamic shape handling a key prerequisite for high-performance on-device LLM inference.
roofline makes the handling of dynamic input shapes a first-class compiler capability for on-device LLMs. Building on established operator-level techniques such as padding, peeling, and masking, we introduce a model-level approach that constrains dynamic prompt lengths once, at the model boundary, to hardware-friendly multiples. This global guarantee enables efficient fixed-size tensor execution on edge hardware.
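As a rough illustration of constraining a prompt length at the model boundary, the following Python sketch pads a token sequence up to the next multiple of a tile size and returns a mask marking the real tokens. It is a simplified assumption of how such a step could look, not the compiler's actual implementation: the function names, the `pad_id` value, and the multiple of 4 are all hypothetical.

```python
def round_up_to_multiple(length: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= `length`."""
    if length < 0 or multiple <= 0:
        raise ValueError("length must be >= 0 and multiple must be > 0")
    return ((length + multiple - 1) // multiple) * multiple


def pad_prompt(token_ids: list[int], multiple: int, pad_id: int = 0):
    """Pad a prompt once, at the model boundary, to a hardware-friendly length.

    Returns the padded token list and a mask (1 = real token, 0 = padding)
    so that later stages can ignore the padded positions.
    """
    padded_len = round_up_to_multiple(len(token_ids), multiple)
    n_pad = padded_len - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask


# Hypothetical 5-token prompt, padded to a multiple of 4.
tokens, mask = pad_prompt([7, 8, 9, 10, 11], multiple=4)
print(len(tokens), mask)
```

Because every downstream operator then sees a length that is a known multiple, per-operator padding or peeling decisions no longer need to be made separately, which is the point of doing this once at the boundary.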
In the video below, our LLM wizard Thomas Ziereis guides you through the key concepts and demonstrates up to 23× higher prefill performance for Qwen3-0.6B on an NVIDIA RTX 3070.
Read the full case study here: https://lnkd.in/eucmJDBN
#EdgeAI #AIDeployment #AICompiler #MLIR #Roofline