This lesson is being piloted (Beta version)

Dataset scouting

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is the point of these exercises?

  • How do I find the data I want to work with?

Objectives
  • To understand why we start with the Open Data Portal

  • To understand the basics of how the datasets are divided up

Get ahead!

The 3/4 of this lesson is done entirely in the browser.

However, Episode 4: What is in the data files?, requires the use of a running CMSSW environment, either using Docker or the VM. You may want to jump ahead to that episode to remind yourself of how to do that and get that running in the background, so that it’s all ready to to go when we get there!

But come back here right after! :)

You’ve got a great idea! What’s next?

Suppose you have a great idea that you want to test out with real data! You’re going to want to know:

  • What year was the data taken that would work best for you?
  • What triggers were applied? (how was the data pre-selected to be saved and recorded)
  • What Monte Carlo datasets are available and appropriate for your studies?
    • This may mean finding simulated physics processes that are background to your signal
    • This may mean finding simulated physics processes for your signal, if they exist
    • Possibly just finding simulated datasets where you know the answer, allowing you to test your new analysis techniques

In this lesson, we’ll walk through the process of finding out what data and Monte Carlo is available to you, how to find it, and how to examine what data is in the individual data files.

First of all, let’s understand how the data is stored and why we need certain tools to access it.

The CERN Open Data Portal

In some of the earliest discussions about making HEP data publicly available there were many concerns about people using and analyzing “other people’s” data. The concern centered around well-meaning scientists improperly analyzing data and coming up with incorrect conclusions.

While no system is perfect, one way to guard against this is to only release well-understood, well-calibrated datasets and to make sure open data analysts only use these datasets. Because the CERN Open Data Portal is a mutable environment and some datasets may change over time as they are being validated, it is important that we make sure that analysts only use these vetted datasets. These datasets are given a Digital Object Identifier (DOI) code for tracking. And if there are ever questions about the validity of the data, it allows us to check the data provenance.

DOI

The Digital Object Identifier (DOI) system allows people to assign a unique ID to any piece of digital media: a book, a piece of music, a software package, or a dataset. If you want to learn more about the DOI process, you can learn more at their FAQ. Assigning of DOIs to CERN products is generally handled through Zenodo.

Challenge!

You will find that all the datasets have their DOI listed at the top of their page on the portal. Can you locate where the DOI is shown for this dataset, Record 6029, DoubleElectron primary dataset in AOD format from Run of 2012 (/DoubleElectron/Run2012C-22Jan2013-v1/AOD)

With a DOI, you can create citations to any of these records, for example using a tool like doi2bib.

Provenance

You will hear experimentalists refer to the “provenance” of a dataset. From the Cambridge dictionary, provenance refers to “the place of origin of something”. The way we use it, we are referring to how we keep track of the history of how a dataset was processed: what version of the software was used for reconstruction, what period of calibrations was used during that processing, etc. In this way, we are documenting the data lineage of our datasets.

From Wikipeda

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

Provenance is an an important part of our data quality checks and another reason we want to make sure you are using only vetted and calibrated data.

This lesson

For all the reasons given above, we encourage you to familiarize yourself with the search features and options on the portal. With your feedback, we can also work to create better search tools/options and landing points.

This exercise will guide you through the current approach to finding data and Monte Carlo. Let’s go!

Key Points

  • Finding the data is non-trivial, but all the information is on the portal

  • A careful understanding of the search options can help with finding what you need


Where are the datasets?

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • Where do I find datasets for data and Monte Carlo?

Objectives
  • Be able to find the data and Monte Carlo datasets

CERN Open Data Portal

Our starting point is the landing page for CERN Open Data Portal. You should definitely take some time to explore it. But for now we will select the CMS data.

CERN Open Data Portal

The landing page for the CERN Open Data Portal.

Make a selection!

Find the CMS link under Focus on and click on it.

CMS-specific datasets

The figure below shows the website after we have chosen the CMS data. Note the left-hand sidebar that allows us to filter our selections. Let’s see what’s there. (Note! I’ve collapsed some of the options so while the order is the same when you view it, your webpage may look a little different at first glance.)

CERN Open Data Portal - CMS data

The first pass to filter on CMS data

At first glance we can see a few things. First, there is an option to select only Dataset rather than documentation or software or similar materials. Great! Going forward we’ll select Dataset.

Next we see that there are a lot of entries for data from 2010, 2011, and 2012, the 7 TeV and 8 TeV running periods. That’s what we’ll be working with for these exercises.

Coming soon!

For the keen of eye, you can see that there are some entries for 2016 and later data, the 13 TeV run periods! But that will be for a future workshop. :)

Make a selection!

For the next module, let’s select Dataset and 2012.

Key Points

  • The data and Monte Carlo are stored in directories with names that give you some insight as to what they contain


What data and Monte Carlo are available?

Overview

Teaching: 5 min
Exercises: 10 min
Questions
  • What data and run periods are available?

  • What triggers were used when the data was taken?

  • What Monte Carlo samples are available?

Objectives
  • To be able to navigate the CERN Open Data Portal’s search tools

  • To be able to find what triggers and Monte Carlo datasets there are using these search tools

Data and triggers

We make a distinction between data which comes from the real-life CMS detector and Monte Carlo data. In general, when we say data, we mean the real, CMS-detector-created data.

The main data available is from what is known as Run 1 and spans 2010-2012. These run periods can also be broken into A, B, C, and so-on, sub-periods and you may see that in some of the dataset names.

Make a selection!

If you are coming from the previous module you should have selected CMS, Dataset, and 2012.

CERN Open Data Portal - CMS datasets

Selecting CMS, Dataset, and 2012.

Your view might look slightly different than this screenshot as the available datasets and tools are regularly updated.

When Dataset is selected, there are 3 subcategories:

Make a selection!

Let’s now unselect Derived and Simulation so that only the Collision option is set under Dataset.

Collision data

When you select Collision you’ll see a lot of datasets with names that may be confusing. Let’s take a look at two of them and see if we can break down these names.

CERN Open Data Portal - CMS datasets

Some samples from the 2012 collision data

/DoubleElectron/Run2012B-v1/RAW /SingleMu/Run2012B-22Jan2013-v1/AOD

There are three (3) parts to the names, separated by `/’.

Trigger

DoubleElectron or SingleMu is the name of the trigger. This is the trigger that selected out this subset of data. Ideally, you have completed the pre-exercise on triggers in CMS but for now, remind yourself that they select out some subset of the collisions based on certain criteria in the hardware or software.

Some of them are quite difficult to intuit what they mean. Others should be roughly understandable. For example,

For more information on these triggers, you can either reach out to the organizers through Mattermost or review the pre-exercise on triggers in CMS.

Run period

Run2012B-v1 and Run2012B-22Jan2013-v1 refer to when the data was taken and in the case of the second, when the data was processed. The details are not so important for you because the open data coordinators have taken care to only post vetted data. If you were a CMS analyst working on the data as it was being processed, you might have to shift your analysis to a different dataset once all calibrations were completed.

Data format

Further information

If you click on the link to any of these datasets, you will find even more information, including

There are multiple text files that contain the paths to these ROOT files. If we click on any one of them, we see something like this.

CERN Open Data Portal - CMS datasets

Sample listing of some of the ROOT files in the /SingleMu/Run2012B-22Jan2013-v1/AOD dataset.

The prepended root: is because of how these files are stored. We’ll use these directory paths when we go to inspect some of these files.

Monte Carlo

We can go through a similar exercise with the Monte Carlo data. One major difference is that the Monte Carlo is not broken up by trigger. Instead, when you analyze the Monte Carlo, you will apply the trigger to the data to simulate what happens in the real data. You will learn more about this in the upcoming trigger exercise.

For now, let’s look at some of the Monte Carlo datasets that are available to you.

Make some selections! But first make some unselections!

Unselect everthing except for CMS, Dataset, Simulated (under Dataset) and 2012.

Next, select a new button near the top of the left-hand sidebar, include on-demand datasets. This will give us some search options related to the Monte Carlo samples.

CERN Open Data Portal - CMS datasets

Selection of the Monte Carlo dataset search options

There are a lot of Monte Carlo samples! It’s up to you to determine which ones might contribute to your background. The names try to give you some sense of the primary process, subsequent decays, the beam energy and specific simulation software (e.g. Pythia), but if you have questions, reach out to the organizers through Mattermost.

As with the collision data, here are three (3) parts to the names, separated by `/’.

Let’s look at one of them: /DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/Summer12_DR53X-No_PU_RD1_START53_V7N-v1/AODSIM

Physics process/Monte Carlo sample

DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball is hard to understand at first glance, but if we take our time we might be able to intuit some of the meaning. This appears to simulate a Drell-Yan process in which two quarks interact to produce a virtual photon/Z boson which then couples to two leptons. The M-50 refers to a selection that has been imposed requiring the mass of the di-lepton pair to be above 50 GeV/c^2 and the remaining fields tell us something about what software was used to generate this (madgraph) the beam energy (8TeV) and some extra, quite frankly, Byzantine text. :)

Global tag

Summer12_DR53X-No_PU_RD1_START53_V7N-v1 refers to how and when this Monte Carlo was processed. The details are not so important for you because the open data coordinators have taken care to only post vetted data. But it is all part of the data provenance.

Data format

The last field refers to the data format and here again there is a slight difference.

One difference is that you will want to select the Monte Carlo events that pass certain triggers at the time of your analysis, while that selection was already done in the data by the detector hardware/software itself.

If you click on any of these fields, you can see more details about the samples, similar to the collision data.

More Monte Carlo samples

If you would like a general idea of what other physics processes have been simulated, you can check scroll down the sidebar until you come to Filter by category.

CERN Open Data Portal - CMS datasets

Selection of the Monte Carlo dataset search options

You may have to do a bit of poking around to find the dataset that is most appropriate for what you want to do, but remember, you can always reach out to the organizers through Mattermost.

Summary

By now you should have a good sense of how to find your data using the Open Data Portal’s search tools.

Key Points

  • The triggers are all given their own collision datasets

  • The Monte Carlo samples all have their own datasets

  • Navigating the Open Data Portal is the right way to find out what is available


What is in the datafiles?

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How do I inspect these files to see what is in them?

Objectives
  • To be able to see what objects are in the data files

  • To be able to see how big these files are and how much space these object take up.

This part of the lesson will be done from within either a Docker container or VM. All commands will be typed inside that environment.

Setting up your CMSSW area

If you completed the lessons on virtual machines or Docker you should already have a working CMSSW area.

Make sure you change directories to the CMSSW_5_3_32/src area; for instance, in Docker:

cd /home/cmsusr/CMSSW_5_3_32/src

Note that we are not really “installing” CMSSW but setting up an environment for it. CMSSW was already installed. This is why every time you open a new shell you will have to issue the cmsenv command, which is just a script that runs to set some environmental variables for your working area:

cmsenv

edmXXX tools

CMS uses a set of homegrown tools to interact with the AOD format, all of which are prefixed by edm, which stands for Event Data Model. We will not show you all of them, but introduce a few to give you an idea of what can be done.

edmDumpEventContent

The edmXXX tools take as an argument the full path to a file. Following a similar approach to the previous module, we’ve chosen one of the Monte Carlo files to test, but these commands would equally well with a data file.

Let’s start by using edmDumpEventContent and looking at the options

edmDumpEventContent --help
Usage: edmDumpEventContent [options] templates.root
Prints out info on edm file.

Options:
  -h, --help      show this help message and exit
  --name          print out only branch names
  --all           Print out everything: type, module, label, process, and
                  branch name
  --lfn           Force LFN2PFN translation (usually not necessary)
  --lumi          Look at 'lumi' tree
  --run           Look at 'run' tree
  --regex=REGEX   Filter results based on regex
  --skipping      Print out branches being skipped
  --forceColumns  Forces printouts to be in nice columns

We will first use edmDumpEventContent to see what is in one of these files with no other options. It may take 15-60 seconds to run and there will be a lot of output. You may find it useful to redirect the output to a file and then look at it there using less or a similar command (you can exit less by typing q).

edmDumpEventContent root://eospublic.cern.ch//eos/opendata/cms/MonteCarlo2012/Summer12_DR53X/TTJets_SemiLeptMGDecays_8TeV-madgraph/AODSIM/PU_RD1_START53_V7N-v1/10000/EA978C41-27D1-E211-9424-003048D46016.root > test_edm_output.log

less test_edm_output.log
Type                                  Module                      Label             Process
----------------------------------------------------------------------------------------------
LHEEventProduct                       "source"                    ""                "LHE"
GenEventInfoProduct                   "generator"                 ""                "SIM"
edm::TriggerResults                   "TriggerResults"            ""                "SIM"
vector<int>                           "genParticles"              ""                "SIM"
vector<reco::GenJet>                  "ak5GenJets"                ""                "SIM"
vector<reco::GenJet>                  "ak7GenJets"                ""                "SIM"
vector<reco::GenJet>                  "kt4GenJets"                ""                "SIM"
vector<reco::GenJet>                  "kt6GenJets"                ""                "SIM"
vector<reco::GenMET>                  "genMetCalo"                ""                "SIM"
vector<reco::GenMET>                  "genMetCaloAndNonPrompt"    ""                "SIM"
.
.
.
vector<reco::TrackExtra>              "tevMuons"                  "firstHit"        "RECO"
vector<reco::TrackExtra>              "tevMuons"                  "picky"           "RECO"
vector<reco::TrackExtrapolation>      "trackExtrapolator"         ""                "RECO"
vector<reco::TrackJet>                "ak5TrackJets"              ""                "RECO"
vector<reco::Vertex>                  "offlinePrimaryVertices"    ""                "RECO"
vector<reco::Vertex>                  "offlinePrimaryVerticesWithBS"   ""                "RECO"
vector<reco::VertexCompositeCandidate>    "generalV0Candidates"       "Kshort"          "RECO"
vector<reco::VertexCompositeCandidate>    "generalV0Candidates"       "Lambda"          "RECO"

You can get from this information the names of physics objects you may be interested in (e.g. ak5TrackJets) as well as what stage of processing they were produced at (SIM is for simulations and RECO is for reconstruction).

This information can be useful when writing your analysis code, which will be discussed in a later lesson.

Some of the other command-line options can be useful as well to filter the information.

Challenge!

Try the following options (with the same file) and see what it gives you. Can you see why this might be useful?

edmDumpEventContent --regex=Muon root://eospublic.cern.ch//eos/opendata/cms/MonteCarlo2012/Summer12_DR53X/TTJets_SemiLeptMGDecays_8TeV-madgraph/AODSIM/PU_RD1_START53_V7N-v1/10000/EA978C41-27D1-E211-9424-003048D46016.root

edmDumpEventContent --name root://eospublic.cern.ch//eos/opendata/cms/MonteCarlo2012/Summer12_DR53X/TTJets_SemiLeptMGDecays_8TeV-madgraph/AODSIM/PU_RD1_START53_V7N-v1/10000/EA978C41-27D1-E211-9424-003048D46016.root

Key Points

  • It’s useful to sometimes inspect the files before diving into the full analysis

  • Some files may not have the information you’re looking for


Break

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • Is it time for a break?

Objectives
  • To get some coffee or tea or water or a snack

  • To stretch your legs

  • To use the restroom

Take a break!

You earned it!

Key Points

  • Taking a break is good!


Hands-on activity

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How well do you understand what we covered?

Objectives
  • To review…

Let’s review some of the concepts we covered in the previous episodes. You are encouraged to go back and engage with those episodes in order to answer these questions.

You can post questions on the Mattermost channel for this exercise or, if you are doing this exercise in real time, use the “Raise hand” feature on Zoom.

Let’s review!

Can you get to the starting page for the CMS Open Data Portal? Can you select the CMS data from that page?

Let’s review!

The bulk of the CMS released data (as of July 2021) covers what years? What was the collision energy for those years?

Let’s review!

What’s the difference between Collision data and Simulated data?

Let’s review!

Select the CMS collision data for 2012. Select only the AOD data. Do you remember what the AOD data are?

Try to identify five (5) different triggers. What do you think they are triggering on? How do you find out what these triggers are actually doing?

Let’s review!

Select the CMS simulated data for 2012. Select the Heavy Gauge Bosons (under Filter by category) and find a Zprime sample. Click on it. These are hypothetical gauge bosons.

How can you learn about CMS simulated dataset names? What do you think the Z’ is decaying to? What is the assumed mass of the Z’ in this particular sample?

How many events are in this sample? How much hard drive space does this sample occupy?

Key Points

  • The information is all there for you to find datasets you want for your analysis.

  • But it make take some poking around to find it.

  • Familiarizing yourself with the search options is time well-spent.


Offline challenge

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How well do you understand what we covered?

Objectives
  • Determine what collision data is applicable to some search

  • Determine what simulation data is potential background to some search

Will you make the next big discovery?

Will you make the next big discovery? Image from Science magazine

You have formulated a new theory that predicts a brand new particle! You call this particle an ULTIMATON

You calculate that the most likely decay of an ultimaton is to two (2) same-sign charged leptons and two (2) jets of any flavour. The predicted mass is in the 300-500 GeV/c^2 range and you plan on going looking for it in the CMS Open Data!

Let’s see what you’ll need. Consider only the 2012 run period.

Triggers

What triggers are most sensitive to this process? How many records are there for each of the triggers you are interested in? How many events?

Background

What Standard Model processes might contribute to the background? Are there any simulation samples that can help you study this?

Remind yourself of the following:

  • What does the top-quark decay to?
  • What does a W boson decay to?
  • What does a Z boson decay to?
  • What is the Drell-Yan process?

Do you think any of these process are significant? Can any be ignored?

Look at the Standard Model samples. Do you see any of these processes in there?

Is this background?

Consider this sample. What process is this? Could this be a significant background, if the cross section is large?

Key Points

  • It can take some time and some effort to figure out what collision data and what simulation data is appropriate for your analysis