Analyzing Open Data at Scale

Image credit: CERN

The previous lessons have shown how CMS Open Data MiniAOD files can be processed with tools like the Physics Object Extractor Tool (POET) to produce ROOT files containing "flat" TTree objects with event and physics object information. You have also seen how such ROOT files can be analyzed using Python tools like Coffea.

On small scales, all of these techniques can be tested and applied to a specific analysis using Docker resources on your local machine. Processing all of the Open Data needed for a typical physics search or measurement, however, is far more practical with additional computing resources.

Many distributed computing resources are accessible to Open Data researchers! Universities and laboratories often host Linux computing clusters to which you may have access. Cloud computing is also available to the general public for various fees through (at least) Google, Amazon, and Microsoft platforms. Thanks to the participation of the Tata Institute of Fundamental Research (TIFR) in this workshop, this lesson will show you how to analyze full Open Data datasets using an HTCondor queue system on a Linux cluster.

Prerequisites

For this lesson you will log in to a computing cluster provided by the Tata Institute of Fundamental Research (TIFR). Open a terminal on your computer from which you can use ssh.
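
Before the workshop, it can help to confirm that an ssh client is actually available on your machine. The snippet below is a minimal sketch of one way to check, using only the Python standard library; the helper name `find_ssh_client` is hypothetical and is not part of any lesson material. It does not contact TIFR or test your account, it only looks for an `ssh` executable on your PATH.

```python
import shutil
from typing import Optional


def find_ssh_client(command: str = "ssh") -> Optional[str]:
    """Return the full path of the given command if it is on PATH, else None.

    Hypothetical helper for illustration only.
    """
    return shutil.which(command)


if __name__ == "__main__":
    path = find_ssh_client()
    if path is None:
        print("No ssh client found on PATH; install one (for example OpenSSH) before the lesson.")
    else:
        print(f"ssh client found at {path}")
```

If the script reports that no client was found, install an ssh client or switch to a terminal that provides one before continuing.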

Schedule

	Setup	Download files required for the lesson
00:00	1. Check access to TIFR (Jan 5)	Can you log in to TIFR to use HTCondor?
00:00	2. Logistics of an Open Data analysis (Jan 10)	What does a full CMS analysis workflow contain? What is the role of distributed computing in an Open Data analysis? What resources exist for Open Data distributed computing?
00:20	3. HTCondor submission	How can I use the CMSSW Docker container in an HTCondor job? How can I divide a dataset up into several jobs? How do I track the progress of HTCondor jobs?
00:40	4. Run your analysis	Can you run POET through Apptainer in an HTCondor job? Can you merge the output files? Can you export the merged files to your laptop?
01:20	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.