Published September 2, 2025 | Version v1
Presentation | Open Access

End-to-End Neural Network Compression and Deployment for Hardware Acceleration Using PQuant and hls4ml

Authors/Creators

Description

As the demand for efficient machine learning on resource-limited devices grows, model compression techniques like pruning and quantization have become increasingly vital. Despite their importance, these methods are typically developed in isolation, and while some libraries attempt to offer unified interfaces for compression, they often lack support for deployment tools such as hls4ml. To bridge this gap, we developed PQuant, a Python library designed to streamline the training and compression of machine learning models. PQuant offers a unified interface for applying a range of pruning and quantization techniques, catering to users with minimal background in compression while still providing detailed configuration options for advanced use. Notably, it features built-in compatibility with hls4ml, enabling seamless deployment of compressed models on FPGA-based accelerators. This makes PQuant a versatile resource both for researchers exploring compression strategies and for developers targeting efficient implementation on edge devices or custom hardware platforms. We will present the PQuant library and the performance of several compression algorithms implemented with it, and we will demonstrate the conversion flow of a neural network model from an uncompressed state to optimized FPGA firmware.
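The abstract describes the end-to-end flow from an uncompressed model to FPGA firmware, but the record itself contains no code. The sketch below illustrates that flow under stated assumptions: the compression step is a plain magnitude-pruning stand-in written directly against Keras weights, not PQuant's actual interface (which this record does not show), while the conversion calls (hls4ml.utils.config_from_keras_model and hls4ml.converters.convert_from_keras_model) are hls4ml's documented Keras entry points. The model architecture, output directory, and FPGA part number are illustrative placeholders.

```python
# Minimal sketch: uncompressed Keras model -> pruned weights -> hls4ml project.
# The pruning step is a simple magnitude-pruning stand-in, NOT PQuant's API;
# it only emulates the kind of sparse weights a compression library produces.
import numpy as np
import hls4ml
from tensorflow import keras

# A small dense network standing in for the user's model.
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(16,), name='fc1'),
    keras.layers.Dense(5, activation='softmax', name='output'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# ... normal training would happen here ...

# Stand-in compression: zero out the smallest 50% of each layer's weights
# (unstructured magnitude pruning).
for layer in model.layers:
    weights = layer.get_weights()
    if not weights:
        continue
    w = weights[0]
    threshold = np.quantile(np.abs(w), 0.5)
    weights[0] = np.where(np.abs(w) < threshold, 0.0, w)
    layer.set_weights(weights)

# hls4ml conversion using its documented Keras entry points.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='pquant_demo_prj',   # placeholder project directory
    part='xcu250-figd2104-2L-e',    # placeholder FPGA part number
)
hls_model.compile()                 # builds the C-simulation bridge
# hls_model.build(csim=False)       # would invoke HLS synthesis for firmware
```

In the actual workflow presented here, PQuant's unified interface would replace the manual pruning loop, and its built-in hls4ml compatibility would hand the compressed model to the same conversion step shown above.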

Files

PQuant_fastML.pdf (1.6 MB)
md5:ed49469b02ff4ebae43c048f7fa89126

Additional details

Funding

Schmidt Family Foundation

Conference

Title: Fast Machine Learning for Science Conference 2025
Dates: 1-5 September 2025