Published September 3, 2025 | Version v1
Presentation Open

COLLIDE-2V - 750 Million Dual-View LHC Event Dataset for Low-Latency ML

Authors/Creators

  • 1. Massachusetts Inst. of Technology (US)

Description

Modern foundation models (FMs) have pushed the frontiers of language, vision, and multimodal tasks by training ever-larger neural networks (NNs) on unprecedented volumes of data. FMs have yet to be established in collider physics, which lacks both a comparably sized, general-purpose dataset on which to pre-train universal event representations and a clearly demonstrated need. Real-time event identification presents a possible need, because it requires fast classification and selection across all collisions at the LHC. To address this, we construct a dual-view LHC collision dataset (COLLIDE-2V), a 50 TB public dataset comprising ~750 million proton-proton events generated with MadGraph + Pythia + Delphes under High-Luminosity LHC (HL-LHC) conditions (⟨μ⟩ = 200). Spanning everything from minimum-bias and γ+jets to top, Higgs, di-boson, multi-boson, exotic long-lived signatures and dark showers, the sample covers 50+ distinct processes and >99% of the CMS Run-3 trigger menu in a single coherent format. To enable effective real-time event interpretation, each event is provided in two views, stored as Parquet files that retain the physics-critical content:

  • Offline - a full CMS-like reconstruction emulated by a tuned Delphes card
  • L1T - a low-latency, lower-resolution view obtained via a custom Level-1 Trigger (L1T) parameterisation (degraded vertex, track and calorimeter performance, altered PUPPI, |η| ≤ 2.5 tracking, pT thresholds, etc.)
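
The relationship between the two views can be illustrated with a toy sketch that turns a list of offline-like tracks into an L1T-like list by applying an acceptance cut, a resolution smearing and a pT threshold. This is not the Delphes parameterisation used to produce COLLIDE-2V: the function name `to_l1t_view`, the track dictionary layout, the 2 GeV threshold and the 10% smearing are all assumptions made for illustration. Only the |η| ≤ 2.5 tracking acceptance comes from the description above.

```python
import random

# Illustrative placeholders; only ETA_MAX is taken from the dataset description.
ETA_MAX = 2.5          # tracking acceptance quoted for the L1T view
PT_MIN_GEV = 2.0       # hypothetical pT threshold
PT_SMEAR_FRAC = 0.10   # hypothetical relative pT resolution degradation


def to_l1t_view(offline_tracks, rng):
    """Return a degraded copy of offline tracks (toy model, not the real card)."""
    l1t_tracks = []
    for trk in offline_tracks:
        # Acceptance: keep only tracks inside |eta| <= ETA_MAX.
        if abs(trk["eta"]) > ETA_MAX:
            continue
        # Resolution: scale pT by a Gaussian factor centred on 1.
        smeared_pt = trk["pt"] * rng.gauss(1.0, PT_SMEAR_FRAC)
        # Threshold: drop tracks whose smeared pT falls below the cut.
        if smeared_pt < PT_MIN_GEV:
            continue
        l1t_tracks.append({"pt": smeared_pt, "eta": trk["eta"], "phi": trk["phi"]})
    return l1t_tracks


# Hypothetical offline tracks (pT in GeV).
offline = [
    {"pt": 35.0, "eta": 0.4, "phi": 1.2},
    {"pt": 12.0, "eta": 3.1, "phi": -0.7},  # outside the tracking acceptance
    {"pt": 0.5, "eta": -1.0, "phi": 2.9},   # far below the pT threshold
]
l1t = to_l1t_view(offline, random.Random(0))
print(len(offline), "offline tracks ->", len(l1t), "L1T-like tracks")
```

Passing an explicit `random.Random` instance keeps the smearing reproducible, which is convenient when comparing a model trained on one view against the other.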

As a proof of concept, COLLIDE-2V supports a wide spectrum of research applications, ranging from few-shot transfer learning, fine-tuning, pileup mitigation, detector-level generative modelling and cross-experiment benchmarking to fast simulation surrogates, real-time trigger inference and entirely novel anomaly-detection approaches, thereby accelerating the shift from handcrafted topology cuts to data-driven decision making throughout the HL-LHC program.

Files

FM_Collide2V_EMoreno.pdf (10.3 MB)
md5:065104c42f4d0548a0bba4edfad9707d

Additional details

Funding

Schmidt Family Foundation

Conference

Acronym: FASTML25
Dates: 1-5 September 2025
Place: Zurich, Switzerland