Published January 28, 2019 | Version v1
Thesis
Open
Interactive data analysis of data from high energy physics experiments using Apache Spark
Description
The primary goal of the project was to evaluate a set of Big Data tools that enable interactive or semi-interactive work with large amounts of data, applied to the analysis of data from the TOTEM experiment. The product is a set of analysis codes and notebooks written in a distributed model, together with their performance profiles, reports, and execution results. The analysis application has several requirements to fulfill: correctness of the results, the capacity to work interactively using data-science notebooks, scalability of the solution, a simple way to visualize the results, and the use of existing storage services. The significant characteristic of this analysis code is its use of the RDataFrame model. At the time of our work, RDataFrame was an innovation introduced in the newest release of ROOT and was still under heavy development. The concept is broadly similar to the data frames known from languages such as Python or R, and it provides a high-level interface for working with data in ROOT format. It brings a functional approach to the mostly iterative analyses done in C++ ROOT, giving users interesting possibilities such as implicit multithreading. We ran a sample analysis on 4.7 TB of data from the TOTEM experiment, rewriting the analysis code to leverage PyROOT and the RDataFrame model and to take full advantage of the parallel processing capabilities offered by Apache Spark. The analysis was evaluated on a system that combines public cloud infrastructure (Helix Nebula Science Cloud) with storage and processing services developed by CERN (Science Box).
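As a rough illustration of the distributed data-frame idea described above, the sketch below splits a dataset into entry ranges, applies a filter and a derived-column step to each range, fills a partial histogram per range, and merges the partial results. It uses plain Python rather than ROOT or Apache Spark, and the toy events, binning, and function names (`split_ranges`, `mapper`, `reducer`) are assumptions made for illustration, not code from the thesis.

```python
from functools import reduce

# Hypothetical toy events; the real analysis reads TOTEM data in ROOT format.
events = [{"x": float(i % 10), "valid": i % 3 != 0} for i in range(1000)]

N_BINS = 9        # assumed binning, chosen only for this example
BIN_WIDTH = 10.0


def split_ranges(n_entries, n_parts):
    """Split [0, n_entries) into at most n_parts contiguous ranges."""
    step = -(-n_entries // n_parts)  # ceiling division
    return [(lo, min(lo + step, n_entries)) for lo in range(0, n_entries, step)]


def mapper(entry_range):
    """Run the filter/define chain on one range and fill a partial histogram."""
    lo, hi = entry_range
    hist = [0] * N_BINS
    for ev in events[lo:hi]:
        if not ev["valid"]:            # filter step: drop rejected events
            continue
        x2 = ev["x"] ** 2              # define step: derived quantity
        bin_index = min(int(x2 / BIN_WIDTH), N_BINS - 1)
        hist[bin_index] += 1
    return hist


def reducer(h1, h2):
    """Merge two partial histograms bin by bin."""
    return [a + b for a, b in zip(h1, h2)]


ranges = split_ranges(len(events), 4)
# On a cluster, this map step would be distributed across workers.
partials = map(mapper, ranges)
total = reduce(reducer, partials)
print(total)
```

Because each range is processed independently and the merge is associative, the same `mapper` and `reducer` pair could in principle be handed to a parallel backend without changing the analysis logic.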
Files
| Name | Size | Checksum |
|---|---|---|
| CERN-THESIS-2019-004.pdf | 40.3 MB | md5:a80673f1f50a1b486172e49bc169862b |
Additional details
Additional titles
- Translated title: Interaktywna analiza danych z eksperymentów fizyki wysokich energii z użyciem Apache Spark
Identifiers
- CDS: 2655457
- CDS Report Number: CERN-THESIS-2019-004
CERN
- Department: EP
- Programme: No program participation
- Accelerator: CERN LHC
- Experiment: TOTEM