What is in the datafiles?

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How do I inspect these files to see what is in them?

Objectives
  • To be able to see what objects are in the data files

  • To be able to see how big these files are and how much space these objects take up.

This part of the lesson will be done from within the CMSSW docker container. All commands will be typed inside that environment.

Go to your CMSSW area

If you completed the lesson on Docker, and your operating system supports it, you should already have a working CMSSW area.

Restart your existing my_od container with:

docker start -i my_od

Make sure you are in the CMSSW_7_6_7/src area; if needed, change to that directory:

cd /code/CMSSW_7_6_7/src

edmXXX tools

CMS uses a set of homegrown tools to interact with the AOD format, all of which are prefixed by edm, which stands for Event Data Model. We will not show you all of them, but introduce a few to give you an idea of what can be done.

edmDumpEventContent

The edmXXX tools take as an argument the full path to a file. Following a similar approach to the previous module, we’ve chosen one of the Monte Carlo files to test, but these commands would work equally well with a data file.

Let’s start by using edmDumpEventContent and looking at its options:

edmDumpEventContent --help
Usage: edmDumpEventContent [options] templates.root
Prints out info on edm file.

Options:
  -h, --help      show this help message and exit
  --name          print out only branch names
  --all           Print out everything: type, module, label, process, and
                  branch name
  --lfn           Force LFN2PFN translation (usually not necessary)
  --lumi          Look at 'lumi' tree
  --run           Look at 'run' tree
  --regex=REGEX   Filter results based on regex
  --skipping      Print out branches being skipped
  --forceColumns  Forces printouts to be in nice columns

We will first use edmDumpEventContent to see what is in one of these files with no other options. It may take 15-60 seconds to run and there will be a lot of output. You may find it useful to redirect the output to a file and then look at it there using less or a similar command (you can exit less by typing q).

edmDumpEventContent root://eospublic.cern.ch//eos/opendata/cms/mc/RunIIFall15MiniAODv2/DYJetsToLL_M-5to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/00000/029FD4E8-26DE-E511-92E9-0CC47A78A45A.root  > test_edm_output.log

less test_edm_output.log
Type                                  Module                      Label             Process
----------------------------------------------------------------------------------------------
LHEEventProduct                       "externalLHEProducer"       ""                "LHE"
GenEventInfoProduct                   "generator"                 ""                "SIM"
edm::TriggerResults                   "TriggerResults"            ""                "SIM"
edm::TriggerResults                   "TriggerResults"            ""                "HLT"
HcalNoiseSummary                      "hcalnoise"                 ""                "RECO"
L1GlobalTriggerReadoutRecord          "gtDigis"                   ""                "RECO"
double                                "fixedGridRhoAll"           ""                "RECO"
double                                "fixedGridRhoFastjetAll"    ""                "RECO"
.
.
.
vector<reco::GenJet>                  "slimmedGenJets"            ""                "PAT"
vector<reco::GenJet>                  "slimmedGenJetsAK8"         ""                "PAT"
vector<reco::GenParticle>             "prunedGenParticles"        ""                "PAT"
vector<reco::GsfElectronCore>         "reducedEgamma"             "reducedGedGsfElectronCores"   "PAT"
vector<reco::PhotonCore>              "reducedEgamma"             "reducedGedPhotonCores"   "PAT"
vector<reco::SuperCluster>            "reducedEgamma"             "reducedSuperClusters"   "PAT"
vector<reco::Vertex>                  "offlineSlimmedPrimaryVertices"   ""                "PAT"
vector<reco::VertexCompositePtrCandidate>    "slimmedSecondaryVertices"   ""                "PAT"
unsigned int                          "bunchSpacingProducer"      ""                "PAT"

From this listing you can read off the names of physics objects you may be interested in (e.g. slimmedGenJets) as well as the stage of processing at which they were produced (SIM is for simulation, RECO is for reconstruction, and PAT is for the MINIAOD level).

This information can be useful when writing your analysis code, which will be discussed in a later lesson.
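Once the listing is saved to a file, ordinary shell tools can summarize it. The sketch below is one possible approach, not part of the CMSSW tooling: it assumes the four-column layout shown above, and the helper names count_by_process and find_rows, as well as the small sample file, are hypothetical and introduced only for illustration. The sample rows are copied from the listing above so the snippet can be tried anywhere; inside the container you would point the helpers at test_edm_output.log instead.

```shell
# Hypothetical sample file: a few rows copied from the listing above.
# In the container, use the real test_edm_output.log instead.
cat > sample_edm_output.log <<'EOF'
Type                                  Module                      Label             Process
----------------------------------------------------------------------------------------------
LHEEventProduct                       "externalLHEProducer"       ""                "LHE"
GenEventInfoProduct                   "generator"                 ""                "SIM"
edm::TriggerResults                   "TriggerResults"            ""                "SIM"
edm::TriggerResults                   "TriggerResults"            ""                "HLT"
HcalNoiseSummary                      "hcalnoise"                 ""                "RECO"
vector<reco::GenJet>                  "slimmedGenJets"            ""                "PAT"
vector<reco::GenJet>                  "slimmedGenJetsAK8"         ""                "PAT"
unsigned int                          "bunchSpacingProducer"      ""                "PAT"
EOF

# Hypothetical helper: count how many rows belong to each process.
# The process is read from the last field, because the Type column
# can itself contain spaces (e.g. "unsigned int").
count_by_process() {
    awk '$NF ~ /^".*"$/ { gsub(/"/, "", $NF); n[$NF]++ }
         END { for (p in n) print p, n[p] }' "$1" | sort
}

# Hypothetical helper: print only the rows that contain a given string.
find_rows() {
    grep -- "$1" "$2"
}

count_by_process sample_edm_output.log
find_rows Jet sample_edm_output.log
```

The header and separator lines are skipped automatically because their last field is not a quoted string.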

Some of the other command-line options can be useful as well to filter the information.

Challenge!

Try the following options (with the same file) and see what they give you. Can you see why this might be useful?

edmDumpEventContent --regex=Muon root://eospublic.cern.ch//eos/opendata/cms/mc/RunIIFall15MiniAODv2/DYJetsToLL_M-5to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/00000/029FD4E8-26DE-E511-92E9-0CC47A78A45A.root

edmDumpEventContent --name root://eospublic.cern.ch//eos/opendata/cms/mc/RunIIFall15MiniAODv2/DYJetsToLL_M-5to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/00000/029FD4E8-26DE-E511-92E9-0CC47A78A45A.root
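The commands above repeat the same long file path each time. As a convenience, and purely as a sketch, the path can be stored in a shell variable and wrapped in a small function. The names EDM_FILE and dump_content are hypothetical and not part of CMSSW; the function simply forwards whatever options it is given to edmDumpEventContent, and reports an error if the tool is not on the PATH (for example, when run outside the container).

```shell
# Store the long file path once so later commands stay readable.
# EDM_FILE is a hypothetical variable name chosen for this sketch.
EDM_FILE="root://eospublic.cern.ch//eos/opendata/cms/mc/RunIIFall15MiniAODv2/DYJetsToLL_M-5to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/00000/029FD4E8-26DE-E511-92E9-0CC47A78A45A.root"

# Hypothetical helper: pass any options through to edmDumpEventContent,
# always using the file stored in EDM_FILE.
dump_content() {
    if command -v edmDumpEventContent >/dev/null 2>&1; then
        edmDumpEventContent "$@" "$EDM_FILE"
    else
        echo "edmDumpEventContent not found; run this inside the CMSSW container" >&2
        return 1
    fi
}

# Inside the container, the two challenge commands then become:
#   dump_content --regex=Muon
#   dump_content --name
```

Because the variable only lives in the current shell session, it has to be set again after restarting the container.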

Key Points

  • It’s sometimes useful to inspect the files before diving into the full analysis

  • Some files may not have the information you’re looking for