Published October 21, 2025 | Version v1.0.0
Thesis
Open
Lifecycle-Aware Scheduling and Resource Monitoring of ATLAS Workloads on the NHR-HPC Cluster EMMY
Contributors
Supervisors:
Description
As the university-based Tier-2 computing centres in Germany transition towards integration with national HPC facilities, the challenge arises of exploiting these resources opportunistically without disrupting primary production workloads. This thesis investigates methods for background job execution and monitoring within the established integration of the GoeGrid cluster with the NHR cluster EMMY. Building on the existing drone-based deployment, the work focused on enabling background jobs through Condor slots, isolating them via cgroup-based controls, and validating scheduler behaviour under different activation policies.
A monitoring stack based on the ELK framework was implemented, offering visibility into network traffic, power consumption, node lifecycle states and PanDA queue assignments. These dashboards provided operational transparency and ensured that drones and queues behaved as expected in production-like settings. Controlled experiments showed that background jobs can successfully recover idle resources, but when activated prematurely they interfered with single-core foreground tasks, causing an efficiency loss.
Restricting background jobs to draining phases mitigated interference while maintaining high utilisation. In parallel, the drone management codebase was refactored to emphasise modularity, configurability and reproducibility. Although static drone lifetimes and limited error handling remain, the system provides a solid basis for future automation with COBalD/TARDIS.
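The activation policies compared above can be illustrated with a minimal sketch. It is not taken from the thesis code: the names `DroneState`, `Drone`, `background_cores_allowed` and the policy labels `"always"` and `"draining_only"` are hypothetical, and the sketch assumes a drone simply reports its lifecycle state and the number of cores occupied by foreground jobs.

```python
from dataclasses import dataclass
from enum import Enum


class DroneState(Enum):
    # Hypothetical lifecycle states; the thesis may use different ones.
    BOOTING = "booting"
    ACTIVE = "active"
    DRAINING = "draining"
    SHUTDOWN = "shutdown"


@dataclass
class Drone:
    name: str
    state: DroneState
    total_cores: int
    busy_foreground_cores: int

    @property
    def idle_cores(self) -> int:
        # Cores not currently used by foreground (primary) jobs.
        return max(self.total_cores - self.busy_foreground_cores, 0)


def background_cores_allowed(drone: Drone, policy: str = "draining_only") -> int:
    """Return how many cores background jobs may claim on this drone."""
    if policy == "always":
        # Premature activation: background jobs may start while the drone
        # still accepts new foreground work.
        allowed_states = {DroneState.ACTIVE, DroneState.DRAINING}
    elif policy == "draining_only":
        # Background jobs only fill cores once the drone stops accepting
        # new foreground jobs.
        allowed_states = {DroneState.DRAINING}
    else:
        raise ValueError(f"unknown policy: {policy}")
    if drone.state not in allowed_states:
        return 0
    return drone.idle_cores
```

Under the draining-only policy in this sketch, an active drone offers no cores to background jobs, so newly arriving single-core foreground tasks never compete with them; idle cores are recovered only once the drone has entered its draining phase.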
Files
| Name | Size |
|---|---|
| MSc_Thesis_25845968_Ughur_Mammadzada.pdf (md5:46a125c78bcc096588345182f853bfe8) | 5.6 MB |
Additional details
CERN
- Department: EP
- Programme: No program participation
- Accelerator: CERN LHC
- Experiment: ATLAS