Pushing Matrix-Vector Multiplication Performance on AMD AI Engines for Low-Latency Edge Inference
Authors/Creators
Dimitrios Danopoulos
Description
Matrix-vector (GEMV) operations are a common building block in many deep learning models, particularly for the large dense layers found in convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs). Despite their importance, GEMV kernels have historically underperformed compared to matrix-matrix (GEMM) operations due to their lower arithmetic intensity and limited data reuse, making them harder to scale efficiently. This work presents the first comprehensive analysis and optimization of matrix-vector operations using AMD’s AI Engines on the latest AIE-ML architecture. It addresses key bottlenecks in deploying AI models that rely on such operations for low-latency edge inference, such as meeting the tight real-time requirements of the CERN trigger system. Our proposed GEMV kernel achieves high throughput and low latency by exploiting the AI Engine array, scaling efficiently across tiles both horizontally and vertically via a custom placement strategy. Furthermore, we introduce a novel graph connection mechanism that enables efficient pipelining across multiple layers. The design is modular and integrates in a straightforward manner with frameworks such as hls4ml. Our multi-layer implementation achieves close to microsecond-level latency, demonstrating its suitability for ultra-low-latency applications. These results make AMD's AI Engines a realistic middle-ground solution that can offer the scalability that FPGAs struggle to reach for large models, while maintaining the ultra-low latency that GPUs typically cannot provide.
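To make the arithmetic-intensity argument concrete, here is a minimal reference GEMV in plain C++ (a sketch for illustration, not the optimized AIE-ML kernel from the paper): each weight is loaded exactly once and participates in a single multiply-accumulate, so the kernel performs roughly 2mn flops against about mn weight loads, leaving little data to reuse compared with GEMM.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Reference GEMV: y = A * x, with A an m-by-n row-major matrix.
// Every element A[i*n + j] is read once and used in one
// multiply-accumulate, so there is no weight reuse to exploit:
// ~2*m*n flops against ~m*n weight loads.
void gemv(const float* A, const float* x, float* y,
          std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            acc += A[i * n + j] * x[j];  // one MAC per weight load
        }
        y[i] = acc;
    }
}

int main() {
    const std::size_t m = 4, n = 3;
    std::vector<float> A(m * n, 1.0f), x(n, 2.0f), y(m);
    gemv(A.data(), x.data(), y.data(), m, n);
    for (float v : y) std::printf("%.1f\n", v);  // prints 6.0 four times
}
```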
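The custom placement strategy and graph connection mechanism themselves are described in the linked PDF. As a rough sketch of the programming model involved, the following shows how two GEMV layers could be chained in AMD's standard ADF graph API so that the second layer starts consuming results while the first is still producing them. The kernel names, data types, and file paths are placeholders, not the paper's actual design, and the code targets the Vitis AI Engine toolchain rather than a host compiler.

```cpp
#include <adf.h>
using namespace adf;

// Hypothetical per-layer GEMV kernels, defined in separate source
// files; these signatures are placeholders for illustration only.
void gemv_layer0(input_stream_int32* in, output_stream_int32* out);
void gemv_layer1(input_stream_int32* in, output_stream_int32* out);

// Two GEMV layers connected kernel-to-kernel over streams, so the
// layers form a pipeline: PLIO -> layer0 -> layer1 -> PLIO.
class MlpGraph : public graph {
public:
    input_plio  in;
    output_plio out;
    kernel layer0, layer1;

    MlpGraph() {
        in  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        out = output_plio::create("DataOut", plio_32_bits, "data/output.txt");

        layer0 = kernel::create(gemv_layer0);
        layer1 = kernel::create(gemv_layer1);
        source(layer0) = "gemv_layer0.cc";
        source(layer1) = "gemv_layer1.cc";
        runtime<ratio>(layer0) = 0.9;
        runtime<ratio>(layer1) = 0.9;

        // Stream connections let layer1 begin as soon as layer0
        // emits its first outputs, pipelining the two layers.
        connect<stream>(in.out[0],     layer0.in[0]);
        connect<stream>(layer0.out[0], layer1.in[0]);
        connect<stream>(layer1.out[0], out.in[0]);
    }
};

MlpGraph mlp_graph;  // top-level graph instance
```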
Files
- fastML_2025_dimitrios_danopoulos.pdf (2.0 MB, md5:55ae5a604d1217f5257e2816725ea467)
Additional details
Funding
- Schmidt Family Foundation
Conference
- Title: Fast Machine Learning for Science Conference 2025
- Dates: 1–5 September 2025
- Place: ETH Zurich