What is in the datafiles?

Last updated on 2024-05-17

Overview

Questions

  • How do I inspect these files to see what is in them?

Objectives

  • To be able to see what objects are in the data files

In the following, we will learn about CMS data in the NANOAOD format. This is the data format that can be accessed without CMS-specific software using ROOT or python tools.

NANOAOD variable list


Each NANOAOD dataset has a variable list, with a brief description of each variable, attached to its portal record.

Challenge: variables

Find the variable listings for a collision data record and a Monte Carlo data record. What are the differences? What is the same?

The data records have a “Luminosity” block with some beam related information whereas MC records have a “Runs” block with event generation information.

The MC records have event generator or simulation information in the “Events” block.

The variables of reconstructed objects, such as Muons, are the same for data and MC.

Inspect datasets with ROOT


This part of the lesson will be done from within the ROOT tools container. You should already have it available. Start it with:

BASH

docker start -i my_root

All ROOT commands will be typed inside that environment.

If you are using VNC for the graphics, remember to start it before starting ROOT in the container prompt:

BASH

start_vnc

Work through the quick introduction to getting started with CMS NANOAOD Open Data in the getting started guide page on the CERN Open Data portal.

You will learn:

Inspect datasets with python tools


This part of the lesson will be done from within the python tools container. You should already have it available. Start it with:

BASH

docker start -i my_python

Start JupyterLab with:

BASH

jupyter-lab --ip=0.0.0.0 --no-browser

Open a new jupyter notebook from the jupyter lab tab that the container will open in your browser. Type the commands in code cells of the notebook.

First, import some python libraries:

PYTHON

import uproot
import matplotlib.pyplot as plt
import awkward as ak
import numpy as np

  • uproot is a python interface to ROOT
  • matplotlib can be used for plotting
  • numpy is a python package for scientific computing
  • awkward is a python library for variable-size data

Then, open a file from a dataset:

PYTHON

file = uproot.open("root://eospublic.cern.ch//eos/opendata/cms/Run2016H/SingleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/61FC1E38-F75C-6B44-AD19-A9894155874E.root")

You can check the content blocks of the file with:

PYTHON

file.classnames()

OUTPUT

{'tag;1': 'TObjString',
 'Events;1': 'TTree',
 'LuminosityBlocks;1': 'TTree',
 'Runs;1': 'TTree',
 'MetaData;1': 'TTree',
 'ParameterSets;1': 'TTree'}

Events is the TTree object that contains the variables of interest to us. Get it from the file with:

PYTHON

events = file['Events']

You can print the full list of variables with:

PYTHON

events.keys()
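
The full list is long, so it often helps to narrow it down to one object collection. The sketch below shows the idea with the standard library only: it applies a shell-style wildcard to a short, hand-picked sample of NANOAOD-style variable names. The `sample_keys` list and the `select_keys` helper are assumptions made for illustration and are not part of uproot. With a real file you would pass the output of `events.keys()` instead, or try uproot's own `filter_name` argument, as in `events.keys(filter_name="Muon_*")`.

```python
from fnmatch import fnmatchcase

# Hand-picked sample of NANOAOD-style variable names (illustrative only;
# the real list returned by events.keys() is much longer).
sample_keys = [
    "run", "event", "nMuon", "Muon_pt", "Muon_eta", "Muon_phi",
    "nElectron", "Electron_pt",
]


def select_keys(keys, pattern):
    """Return the names in `keys` that match a shell-style wildcard pattern."""
    return [name for name in keys if fnmatchcase(name, pattern)]


# Keep only the per-muon variables. "nMuon" is not selected because the
# pattern requires the name to start with "Muon_".
muon_keys = select_keys(sample_keys, "Muon_*")
print(muon_keys)
```

The counter variable `nMuon` needs its own pattern, for example `"nMuon"` or `"*Muon*"`.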

Now, fetch some variables of interest. For example, read all muon pt values into an array:

PYTHON

pt = events['Muon_pt'].array()

Challenge: inspect pt

Find out the type of the pt array. Print values of some of its elements. What do you observe?

Find the type with

PYTHON

type(pt)

OUTPUT

awkward.highlevel.Array

Print out some of the values with

PYTHON

print(pt)

OUTPUT

[[23.1], [39.3], [24.1], [91.9, 52.7], ... [88.7, 14.2], [58.9], [27.1], [50]]

Note that each entry of the muon pt array is itself a list of values, and its length depends on the number of reconstructed muons in the event.

From the awkward array documentation, you can find out how to choose the events with exactly two reconstructed muons:

PYTHON

pt[ak.num(pt) == 2]

OUTPUT

<Array [[91.9, 52.7], ... [88.7, 14.2]] type='2713 * var * float32'>
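
To see what this selection does step by step, here is a minimal pure-python sketch of the same logic on a toy nested list. The list `pt_toy` is an assumption made for illustration: it reuses a few of the values printed earlier and is not the real dataset. For real data, awkward does this for all events at once without an explicit python loop.

```python
# Toy jagged list: one inner list of muon pt values per event
# (a few values borrowed from the printed array; illustrative only).
pt_toy = [[23.1], [39.3], [24.1], [91.9, 52.7], [88.7, 14.2], [58.9]]

# Counterpart of ak.num(pt): the number of muons in each event.
n_muons = [len(event) for event in pt_toy]

# Counterpart of ak.num(pt) == 2: one boolean per event.
mask = [n == 2 for n in n_muons]

# Counterpart of pt[mask]: keep only the events where the mask is True.
dimuon_pt = [event for event, keep in zip(pt_toy, mask) if keep]

print(n_muons)
print(dimuon_pt)
```

The selection keeps whole events, so the two pt values of each dimuon event stay together in the result.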

You can learn about awkward arrays and much more in the HEP Software Foundation Scikit-HEP tutorial.

Plot a variable with python tools

Common python plotting tools cannot handle arrays of arrays with different sizes, such as the muon pt. If you want to plot all muon pt values, “flatten” the pt array first:

PYTHON

ptflat = ak.flatten(pt)

Now plot the muon pt values with

PYTHON

plt.hist(ak.flatten(pt), bins=200, range=(0, 200))
plt.show()
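
To make the flattening and the histogram arguments explicit, the following sketch reproduces both by hand on a toy list, using plain python only. The values in `pt_toy` are assumptions made for illustration, including one entry placed outside the plotted range on purpose. With `bins=200` and `range=(0, 200)`, each bin is 1 unit wide (1 GeV for NANOAOD muon pt), and values outside the range are not counted.

```python
# Toy jagged list of muon pt values per event (illustrative only).
# The last event holds a hypothetical value outside the histogram range.
pt_toy = [[23.1], [39.3], [24.1], [91.9, 52.7], [250.0]]

# Counterpart of ak.flatten(pt): remove one level of nesting, so the
# information about which muon belongs to which event is dropped.
flat = [value for event in pt_toy for value in event]

# Same binning as plt.hist(..., bins=200, range=(0, 200)).
n_bins, low, high = 200, 0.0, 200.0
bin_width = (high - low) / n_bins

counts = [0] * n_bins
for value in flat:
    if low <= value < high:
        counts[int((value - low) / bin_width)] += 1
    elif value == high:
        # The upper edge is included in the last bin.
        counts[-1] += 1

print(len(flat), bin_width, sum(counts))
```

Six values go in, but only five are counted, because 250.0 lies outside the range.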

You could also plot the ptflat variable directly, but if you want to apply cuts that rely on knowing which values belong to the same event, you need to apply them to the original pt array. For example, to overlay the events with exactly 2 muons:

PYTHON

plt.hist(ak.flatten(pt), bins=200, range=(0, 200))
plt.hist(ak.flatten(pt[ak.num(pt) == 2]), bins=200, range=(0, 200))
plt.show()

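This is what you will see:

[Figure: histogram of the muon pt for all events, overlaid with the histogram for events with exactly two muons]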

If you’ve chosen another dataset or another file, it will look different.

Key Points

  • It’s useful to inspect the files before diving into the full analysis.
  • Some files may not have the information you’re looking for.