February 27, 2025

Will 2025 be the breakthrough year for on-device LLMs?

The first major proof point is here, and roofline is ready. DeepSeek R1 marks a significant step toward running capable LLMs on constrained edge devices.

We ran DeepSeek R1 Distill Qwen 1.5B through our pipeline, and it worked out of the box thanks to the flexibility of our MLIR-based compiler. The performance of our CPU compiler for edge devices is shown below. Today we give you a sneak peek at our early LLM pipeline, which matches llama.cpp in tokens/s. Unlike llama.cpp, however, our pipeline is model agnostic: without any model-specific optimization, we achieved a 4x memory saving over TorchInductor. This highlights the flexibility and efficiency of our edge AI deployment solution.
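For readers curious how a tokens/s figure like the one above is typically obtained, here is a minimal, generic benchmark sketch. It is not our pipeline's actual harness; the names `measure_throughput` and `dummy_generate` are illustrative, and the dummy generator merely simulates a decode loop.

```python
# Hypothetical throughput harness: times any callable that yields tokens
# and reports decode speed in tokens/s. Purely illustrative.
import time
from typing import Callable, Iterable, Tuple


def measure_throughput(generate: Callable[[], Iterable[str]]) -> Tuple[int, float]:
    """Run one generation pass; return (token_count, tokens_per_second)."""
    start = time.perf_counter()
    count = sum(1 for _ in generate())  # drain the token stream
    elapsed = time.perf_counter() - start
    return count, count / elapsed


def dummy_generate(n: int = 100):
    # Stand-in for a real decoder loop: each "step" sleeps briefly
    # to mimic per-token latency, then emits a placeholder token.
    for _ in range(n):
        time.sleep(0.001)
        yield "tok"


if __name__ == "__main__":
    n, tps = measure_throughput(dummy_generate)
    print(f"{n} tokens at {tps:.1f} tokens/s")
```

In practice the same harness would wrap a real model's decode loop, and the resulting tokens/s numbers are what make comparisons such as ours against llama.cpp possible.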