Accelerating Deployment of FPGA-based AI in hls4ml with Parallel Synthesis through Model Partitioning
Authors/Creators
Description
The increasing reliance on deep learning in high-energy physics demands efficient FPGA-based implementations. However, deploying complex neural networks on FPGAs is often constrained by limited hardware resources and prolonged synthesis times, and conventional monolithic implementations suffer from scalability bottlenecks, necessitating modular, resource-aware design paradigms. hls4ml, an open-source tool that translates machine learning models into FPGA-compatible architectures, has been instrumental in this effort but still faces synthesis bottlenecks for large networks. To address this challenge, we introduce a novel partitioning methodology that integrates seamlessly with hls4ml, allowing users to segment neural networks at predefined layers. This approach enables parallel synthesis and stepwise optimization, improving both scalability and resource efficiency. The partitioned components are systematically reassembled into a unified architecture through an automated workflow built on AMD Vivado, ensuring functional correctness while minimizing manual intervention. An automated RTL-level testbench verifies system-wide correctness, eliminating manual validation steps and accelerating deployment. Experimental evaluations on convolutional neural networks, including ResNet20, demonstrate up to a 3.5× reduction in synthesis time alongside enhanced debugging flexibility, improving FPGA prototyping and deployment.
While existing tools such as hls4ml facilitate the translation of machine learning models to FPGA architectures, they struggle with large networks due to long synthesis times and resource constraints. Our approach partitions a network at predefined layers, allowing parallel synthesis and stepwise optimization and thereby significantly improving scalability and resource efficiency. The key contributions are: (a) a partitioning methodology that splits neural networks at predefined layers, enabling parallel synthesis; (b) an automated workflow integrated with hls4ml and AMD Vivado that reassembles the partitioned components into a unified design with minimal manual intervention; and (c) automated RTL-level verification that eliminates manual validation. Experimental results show up to a 3.5× reduction in synthesis time. This work represents a significant update to hls4ml, improving FPGA-based deep learning for trigger applications and making AI model deployment more practical for real-time data processing in high-energy physics.
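As an illustrative sketch only (this is not the actual hls4ml API, and `partition_layers` and the layer names below are hypothetical), the idea of cutting a network at predefined layers can be expressed as splitting an ordered list of layers into contiguous segments; each segment could then be converted and synthesized as an independent job, e.g. one Vivado HLS run per segment executed in parallel.

```python
def partition_layers(layers, cut_after):
    """Split an ordered list of layer names into contiguous segments,
    cutting after each layer named in `cut_after` (hypothetical helper,
    not part of hls4ml)."""
    segments, current = [], []
    cuts = set(cut_after)
    for name in layers:
        current.append(name)
        if name in cuts:
            segments.append(current)
            current = []
    if current:  # trailing layers form the final segment
        segments.append(current)
    return segments


# Example: a small convolutional stack cut after the activation layers.
segments = partition_layers(
    ["conv1", "relu1", "conv2", "relu2", "flatten", "dense"],
    cut_after=["relu1", "relu2"],
)
# Each segment would then be handed to a separate synthesis job
# (e.g. via concurrent.futures.ProcessPoolExecutor), and the resulting
# RTL blocks reassembled afterwards in an automated Vivado flow.
```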
Files
- ACAT_2025_dimitrios_danopoulos.pdf (3.2 MB)
  - md5:4c03481f831d09b243cc7118ff343ca9
Additional details
Funding
- Schmidt Family Foundation
Conference
- Acronym
- ACAT2025
- Dates
- 8-12 September 2025