Running your own NanoAOD

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I run over many AOD files to produce NanoAOD?

Objectives
  • Experience running an entire sample

By this point you have made many updates to AOD2NanoAOD.cc, and for a future analysis it could be customized entirely to your needs. There are several ways to run over large datasets (in contrast to the test files we have used so far). A cloud-based solution will be covered in detail in a later lesson, so here we introduce a basic sequential method.

Extending the file list

In configs/simulation_cfg.py, we are currently instructing CMSSW to open one file:

process.source = cms.Source("PoolSource",
        fileNames = cms.untracked.vstring('root://eospublic.cern.ch//eos/opendata/cms/MonteCarlo2012/Summer12_DR53X/TTbar_8TeV-Madspin_aMCatNLO-herwig/AODSIM/PU_S10_START53_V19-v2/00000/000A9D3F-CE4C-E311-84F8-001E673969D2.root'))

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(200))

Note that fileNames expects a vstring, which stands for “vector of strings”, so these configuration files can easily support running over multiple input files (with the corresponding increase in run time and output file size).

Multiple files can be specified as a comma-separated list (cms.vstring('file1.root', 'file2.root')), or by reading an entire text file with one ROOT file per line:

import FWCore.Utilities.FileUtils as FileUtils

# Define files of dataset
files = FileUtils.loadListFromFile("data/CMS_MonteCarlo2012_Summer12_DR53X_TTbar_8TeV-Madspin_aMCatNLO-herwig_AODSIM_PU_S10_START53_V19-v2_00000_file_index.txt")

files.extend(FileUtils.loadListFromFile("data/CMS_MonteCarlo2012_Summer12_DR53X_TTbar_8TeV-Madspin_aMCatNLO-herwig_AODSIM_PU_S10_START53_V19-v2_20000_file_index.txt"))

process.source = cms.Source(
   "PoolSource", fileNames=cms.untracked.vstring(*files))

These .txt files live in the data/ subdirectory and contain “eospublic” links to all the ROOT files in a sample.
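
If you want a quick test before committing to a full sample, one simple option is to slice the loaded list so that only the first few files are processed. Below is a minimal sketch of the relevant lines in simulation_cfg.py, assuming the same file index as above; the choice of the first five files is purely illustrative:

import FWCore.Utilities.FileUtils as FileUtils

# Load the full index, but pass only the first five files to the source
files = FileUtils.loadListFromFile("data/CMS_MonteCarlo2012_Summer12_DR53X_TTbar_8TeV-Madspin_aMCatNLO-herwig_AODSIM_PU_S10_START53_V19-v2_00000_file_index.txt")

process.source = cms.Source(
   "PoolSource", fileNames=cms.untracked.vstring(*files[:5]))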

Parallelization

Parallelization is important for efficiently running over many files. To get a sense of the time involved, set the maxEvents parameter in simulation_cfg.py to -1 and start the job. This will give a time estimate for processing a single full file.

Time test

Edit the configuration file and run it:

# In configs/simulation_cfg.py, change the max events and the output file name:
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(-1))

process.TFileService = cms.Service(
   "TFileService", fileName=cms.string("output_fullfile.root"))

# Save and quit, then run the job in the background:
$ cmsRun configs/simulation_cfg.py > fullfile.log 2>&1 &
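
The trailing & sends the job to the background, and 2>&1 redirects both standard output and errors into fullfile.log, so the shell stays free for other work. You can follow the progress with tail -f fullfile.log; the framework's per-event progress messages are time-stamped, so comparing the first and last of them gives a rough wall-clock estimate for one full file.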

A script called submit_jobs.sh exists in the AOD2NanoAODOutreachTool repository for anyone who has access to an HTCondor system on which they can run this CMS code. When working inside an Open Data container, a few options for parallelization are also possible; one lightweight possibility is sketched below.
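
One such approach is to make the configuration steerable from the command line with CMSSW's VarParsing utility and then launch one cmsRun process per file-index list. This is only an illustration, not the submit_jobs.sh workflow: only the steering-related lines of simulation_cfg.py are shown, and the output file names and the inputFiles_load way of reading a list from a text file are assumptions to adapt to your own setup.

from FWCore.ParameterSet.VarParsing import VarParsing

# The standard 'analysis' option set provides inputFiles, outputFile and maxEvents
options = VarParsing('analysis')
options.parseArguments()

process.source = cms.Source(
   "PoolSource", fileNames=cms.untracked.vstring(options.inputFiles))
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(options.maxEvents))
process.TFileService = cms.Service(
   "TFileService", fileName=cms.string(options.outputFile))

# Each background job then reads its own file list and writes its own output, e.g.:
#   cmsRun configs/simulation_cfg.py \
#     inputFiles_load=data/CMS_MonteCarlo2012_Summer12_DR53X_TTbar_8TeV-Madspin_aMCatNLO-herwig_AODSIM_PU_S10_START53_V19-v2_00000_file_index.txt \
#     outputFile=output_00000.root maxEvents=-1 > job_00000.log 2>&1 &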

If you use a method that produces more than one output ROOT file per sample (ex: 54 output files for VBF Higgs production), combining them can simplify later steps of your physics analysis. This can be done interactively with ROOT's hadd utility:

$ cmsenv # if this is a new shell
$ hadd mergedfile.root input1.root input2.root input3.root #...and so on...

Note: the internal content of the ROOT files must be the same (ex: tree names and branch lists) for ROOT to add them intelligently.

There is also a script called merge_jobs.py (with the bash wrapper merge_jobs.sh) provided in the repository to look for ROOT files in a certain output path and merge them using ROOT’s TChain::Merge method.
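
For reference, the core of such a merge takes only a few lines of PyROOT. In this sketch the tree path aod2nanoaod/Events and the output_*.root naming pattern are assumptions; adjust both to whatever your jobs actually produced:

import glob
import ROOT

# Chain all per-job output trees together
chain = ROOT.TChain("aod2nanoaod/Events")
for path in sorted(glob.glob("output_*.root")):
    chain.Add(path)

# TChain::Merge writes the combined entries into a single new file
chain.Merge("mergedfile.root")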

Key Points

  • The configuration file can be adapted to run over many files in sequence

  • HTCondor and other tools help when running many files in parallel