Accelerating Deployment of FPGA-based AI in hls4ml with Parallel Synthesis through Model Partitioning
Authors/Creators
Description
The increasing reliance on deep learning in high-energy physics demands efficient FPGA-based implementations. However, deploying complex neural networks on FPGAs is often constrained by limited hardware resources and prolonged synthesis times, and conventional monolithic implementations suffer from scalability bottlenecks, necessitating modular, resource-aware design paradigms. hls4ml, an open-source tool that translates machine learning models into FPGA-compatible architectures, has been instrumental in this effort but still faces synthesis bottlenecks for large networks. To address this challenge, we introduce a novel partitioning methodology that integrates seamlessly with hls4ml, allowing users to segment neural networks at predefined layers. This approach enables parallel synthesis and stepwise optimization, improving both scalability and resource efficiency. The partitioned components are systematically reassembled into a unified architecture through an automated workflow built on AMD Vivado, ensuring functional correctness while minimizing manual intervention. An automated RTL-level testbench verifies system-wide correctness, eliminating manual validation steps and accelerating deployment. Experimental evaluations on convolutional neural networks, including ResNet20, demonstrate up to a 3.5× reduction in synthesis time alongside enhanced debugging flexibility, improving FPGA prototyping and deployment.
While existing tools such as hls4ml facilitate the translation of machine learning models to FPGA architectures, they struggle with large networks due to long synthesis times and resource constraints. Our approach partitions a network at predefined layers, allowing parallel synthesis and stepwise optimization and thereby significantly improving scalability and resource efficiency. The key contributions are: (a) a partitioning methodology that splits neural networks at predefined layers, enabling parallel synthesis; (b) an automated workflow integrated with hls4ml and AMD Vivado that reassembles the partitioned components into a unified design with minimal manual intervention; and (c) automated RTL-level verification that eliminates manual validation. Experimental results show up to a 3.5× reduction in synthesis time. This work represents a significant update to hls4ml, improving FPGA-based deep learning for trigger applications and making AI model deployment more practical for real-time data processing in high-energy physics.
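As an illustrative sketch only (this is not the actual hls4ml API, and `partition_layers` and the layer names below are hypothetical), the idea of cutting a network at predefined layers can be expressed as splitting an ordered list of layers into contiguous segments; each segment could then be converted and synthesized as an independent job, e.g. one Vivado HLS run per segment executed in parallel.

```python
def partition_layers(layers, cut_after):
    """Split an ordered list of layer names into contiguous segments,
    cutting after each layer named in `cut_after` (hypothetical helper,
    not part of hls4ml)."""
    segments, current = [], []
    cuts = set(cut_after)
    for name in layers:
        current.append(name)
        if name in cuts:
            segments.append(current)
            current = []
    if current:  # trailing layers form the final segment
        segments.append(current)
    return segments


# Example: a small convolutional stack cut after the activation layers.
segments = partition_layers(
    ["conv1", "relu1", "conv2", "relu2", "flatten", "dense"],
    cut_after=["relu1", "relu2"],
)
# Each segment would then be handed to a separate synthesis job
# (e.g. via concurrent.futures.ProcessPoolExecutor), and the resulting
# RTL blocks reassembled afterwards in an automated Vivado flow.
```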
Files
- ACAT_2025_dimitrios_danopoulos.pdf (3.2 MB)
  - md5:4c03481f831d09b243cc7118ff343ca9
Additional details
Funding
- Schmidt Family Foundation
Conference
- Acronym
- ACAT2025
- Dates
- 8-12 September 2025