Higgs to Tau Tau analysis

Overview

Teaching: 5 min
Exercises: 10 min

Questions

How is the Higgs -> Tau Tau analysis example set up?

Objectives

Checkout Higgs -> Tau Tau code for the workshop

Review the basic RDataFrame commands for filtering and defining variables

Setup

To begin, check out the workshop’s version of this analysis code if you have not done so already:
$ cd ~/CMSSW_5_3_32/src/
$ git clone git://github.com/cms-opendata-workshop/HiggsTauTauNanoAODOutreachAnalysis Htautau
$ cd Htautau

Data and simulation files

You learned in the Data Scouting lesson to search for CMS Open Data samples, and in the previous lesson we discussed how to run the AOD2NanoAOD tool over multiple files to incorporate your changes to AOD2NanoAOD.cc. For the sake of time in the workshop, data and simulation NanoAOD samples have already been produced for you.

Data samples to be used in the analysis:

Run2012B_TauPlusX
Run2012C_TauPlusX

Signal simulations to be used in the analysis:

GluGluToHToTauTau (gluon-gluon fusion Higgs production)
VBF_HToTauTau (vector boson fusion Higgs production)

Background processes can produce very similar signatures in the detector which have to be considered in the anaylsis. The most prominent background processes with a similar signature include:

DYJetsToLL (gamma*/Z production with decays to leptons)
TTbar (top quark pair production)
W1JetsToLNu (W boson produced with 1 jet)
W2JetsToLNu (W boson produced with 2 jets)
W3JetsToLNu (W boson produced with 3 jets)

All of these files can be accessed from the “eospublic” realm:

$ root -l root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/GluGluToHToTauTau.root
[0] TBrowser b

Review of TTree::Draw and RDataFrame

Before moving on, please follow the pre-exercises on ROOT (especially TTree::Draw) and RDataFrame if you did not do so earlier.

To review: the Events tree inside the NanoAOD file can be used to draw histograms of branches within the tree, and cuts can be performed using any branch in the tree.

[0] TTree *Events = (TTree*)_file0->Get("Events")
[1] Events->Draw("Muon_pt")  // draws a histogram of muon momentum
[2] Events->Draw("Muon_pt","Muon_pt > 17") // draws a histogram of muon momentum for muons with pT > 17 GeV
[3] Events->Draw("Muon_pt","value_jet_n > 10") // draws a histogram of muon momentum in events with more than 10 jets

The RDataFrame classes in ROOT offer column-based processing that is useful for speeding up analyses using ROOT files. RDataFrame also allows for multi-threading. To draw the same set of 3 histograms from the review example above, we will need to use some “filters”.

[0] ROOT::EnableImplicitMT() // Tell ROOT you want to go parallel
[1] ROOT::RDataFrame df("Events", "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/GluGluToHToTauTau.root"); //Interface to TTree

[2] auto myHisto = df.Histo1D("Muon_pt"); // This happens in parallel!

[3] auto df2 = df.Define("highPtMuons","Muon_pt > 17");
[4] auto df3 = df2.Define("Muon_highpt","Muon_pt[highPtMuons]");
[5] auto df4 = df3.Filter("nJet > 10");

[6] auto myHisto2 = df3.Histo1D("Muon_highpt"); // new branch in a dataframe without extra cuts
[7] auto myHisto3 = df4.Histo1D("Muon_pt"); // old branch in a dataframe with a new cut applied

[8] myHisto->Draw();
[9] myHisto2->Draw("pe same");
[10] myHisto3->Draw("pe same");

An RDataFrame can perform transformations (to manupulate the data) and actions (to produce a result from the data). The Define and Filter methods are transformations while the Count and Report methods are actions.

Define: Creates a new column in the datset
Filter: Filters the rows of the dataset
Count: Returns the number of events processed
Report: Obtains statistics on how many entries have been accepted and rejected by the filters

std::cout << "Number of events: " << *df.Count() << std::endl;

Challenge: replicate histograms with RDataFrame

Perform the examples above and confirm that you get matching histograms from either method. Are you able to tell a difference in speed for this small test? To run the RDataFrame example you will need to have cvmfs installed on your local machine and mounted in your container. Source the proper ROOT environment and get a ROOT command line:
$ source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/setup.sh
$ root -l

Key Points

The NanoAOD samples required for this analysis include data, simulated signals, and several simulated backgrounds

RDataFrame tools are useful for efficiently manipulating data and plotting histograms.

Event selection

Overview

Teaching: 10 min
Exercises: 20 min

Questions

How can I select events in which to search for Higgs -> tau tau?

Objectives

Understand typical object selection criteria

Implement selections for the muon and tau examples

Begin the workshop’s physics analysis example

Event selection for the Higgs -> tau tau search is performed in the file called skim.cxx. As we select events, our goal is to keep the signal events from Higgs to tau tau decay and suppress the background events from other sources (Drell-Yan, W boson, top quarks, etc). Many background processes can be suppressed by introducing kinematic limits on the physics objects. For instance, a W boson could decay to a muon and neutrino, similar to a tau decay, but only the W boson background processes should have the transverse mass of the muon and neutrino (modeled by the missing transverse energy) near 80 GeV. Cuts on key distinguishing variables can strongly suppress background.

Trigger and final state definition

The first event selection requirement should be a set of triggers. In fact, we have made a hidden trigger selection by analyzing the “TauPlusX” datasets! Only events passing some tau-related trigger will enter our data files.

In the MinimalSelection function we will filter events by requiring they fire a hadronic tau + muon trigger – this means we are choosing to search for events in which one of the tau leptons decayed to a muon and the other decayed to hadrons. We will certainly reject signal events based on this requirement, but narrowing the final state makes the relevant background processes more clear and simplifies the search procedure.

template <typename T>
auto MinimalSelection(T &df) {
    return df.Filter("HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == true", "Passes trigger")
}

Challenge: require muons and taus

In the same function, require that each event have at least one muon and at least one tau lepton. (hint: .Filter() commands can be chained together in the same statement) Open a VBF signal ROOT file and use TTree::Draw to plot a histogram of the number of tau leptons in each event, and then the number of muons in each event. What do the distributions tell you about the sources of these leptons?
Solution

Add two more Filter statements in MinimalSelection:
  return df.Filter("HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == true", "Passes trigger")
           .Filter("nMuon > 0", "there are muons")
	      .Filter("nTau > 0", "there are taus");
The plot of nMuon from the gg fusion H -> tau tau sample is shown below – there are many more than the 1 expected from the signal, so clearly we need to do more work to identify which muon is the interesting one from the tau decay.

Lepton selection criteria

We clearly need to slim down the number of muons! The typical event has far more than 2 muons from tau decays. The FindGoodMuons function will add a new column to the dataframe that defines a “good muon” for the purpose of our analysis.

Question

What constitutes a good muon?

For any physics object, the selection criteria typically include:

kinematic constraints (momentum, pseudorapidity, masses of object pairs, etc)
identification requirements (loose, medium, tight quality levels). We have stored all of these labels as boolean pass/fail variables.
isolations requirements (loose, medium, tight isolation levels). When considering the relative isolation variables calculated in the previous exercise for muons, “loose” isolation would correspond to values < 0.4 while “tight” isolation would correspond to values < 0.15.

Thinking about the signal process guides our decisions here: in signal events we expect one hadronic tau, one muon, and neutrinos that were the prompt decay products of a massive particle. The tau and the muon will likely be produced with reasonably high momentum, and are not likely to be surrounded by energy deposits from other particles. For this physics process we should select hadronic taus and muons that pass at least their “loose” ID criteria, and perhaps even the “tight” ID criteria. It may help reject background to require that these leptons be isolated. Making these decisions often involves studying the signal and background efficiencies for a variety of choices.

Which kinematic constraints should be applied? The first place to look is the trigger path: “HLT_IsoMu17_eta2p1_LooseIsoPFTau20”. This trigger path selects events with an isolated muon that has pT > 17 GeV and |eta| < 2.1; along with a loose, isolation tau lepton with pT > 20 GeV. In an ideal case, “offline” (post-trigger) selection criteria are universally more restrictive than “online” (trigger) selection criteria, especially in terms of transverse momentum. This allows analysts to avoid the murky kinematic regions of “trigger turn-ons” in which data and simulation might respond differently to the application of the trigger because of small differences in the input variables.

Challenge: define good muons

Complete the FindGoodMuons function with selection criteria on muon momentum, pseudorapidity, identification level, and isolation level.
template <typename T>
auto FindGoodMuons(T &df) {
   return df.Define("goodMuons", "BOOLEAN CONDITIONS GO HERE");
}
(hint: Muon_pt, Muon_eta, Muon_tightId, Muon_looseId, etc, are branches of interest.)
Solution

I’ve opted to tighten the momentum cut a little compared to the trigger, copy the trigger’s eta requirement, and to require tight ID and loose isolation on the muon.
   return df.Define("goodMuons","Muon_pt > 20 && abs(Muon_eta) < 2.1 && Muon_tightId == true && Muon_pfreliso03all < 0.4");

Question

What constitutes a good hadronic tau?

Taus that decay to hadrons and neutrinos are identified using similar logic, but with combinations of tau identification discriminants instead of the simpler loose/medium/tight framework. The tau ID reference page gives a brief description of each type identification variable. They note that we always want to use the “ByDecayModeFinding” discriminant.

Challenge: define good taus

Complete the FindGoodTaus function with selection criteria on the tau identification discriminants. Based on the trigger path, we have already placed some kinematic constraints on the tau.
template <typename T>
auto FindGoodTaus(T &df) {
   return df.Define("goodTaus", "Tau_charge != 0 && abs(Tau_eta) < 2.3 && Tau_pt > 20 && BOOLEAN ID CONDITIONS HERE");
}
Solution

Based on the TWiki, we should always use byDecayModeFinding. Based on the trigger, we should require one of the isolation-based IDs. Additionally, since we want to be very careful to choose a hadronic tau, let’s use an anti-electron and anti-muon ID:
   return df.Define("goodTaus","Tau_charge != 0 && abs(Tau_eta) < 2.3 && Tau_pt > 20 &&\
                                Tau_idDecayMode == true && Tau_idIsoTight == true &&\
                                Tau_idAntiEleTight == true && Tau_idAntiMuTight == true");
The choices of tight vs medium or loose are up to the user based on studies of signal and background efficiency.

Selecting on jets and MET

In this analysis we will not explicitly require jets or missing transverse momentum. Analyses expecting MET will set thresholds on MET_pt and might also require that MET is not close to collinear with a jet (aka, require MET to be “significant”).

When selecting jets, the “loose” identification criteria that you added in AOD2NanoAOD.cc is recommended, and the number of jets required would be informed by your physics process (keeping in mind that jets can be lost and radiation can produce extra jets). An extra wrinkle comes in when leptons and jets are expected in the final state: particle-flow jets are clustered from all particle-flow candidates – muons and electrons are not removed from this list! For many final state topologies, it is sufficient to remove “jets” from the list if they closely overlap a tight, isolated muon or electron. If this is expected based on the physics process (for instance, the decay of a very high momentum top quark to a lepton and b quark jet), the four-vector of the lepton can be subtracted from the uncorrected four-vector of the jet before the corrections are applied.

Selecting events

We can now reduce the dataset to the interesting events containing at least one good muon and good tau candidate.

template <typename T>
auto FilterGoodEvents(T &df) {
    return df.Filter("Sum(goodTaus) > 0", "Event has good taus")
             .Filter("Sum(goodMuons) > 0", "Event has good muons");
}

Question

How should we determine which muon-tau pair for the best Higgs boson candidate?

This is not possible will the simple one-liner RDataFrame commands we have used so far. Instead, we will need a C++ function based on the RVec columns. The function is called build_pair and it takes several arguments:

goodMuons and goodTaus: these columns were defined in the previous functions and hold a pass/fail (1/0) value for each muon or tau in the event
momentum, eta, and phi of the muon (pt_1, eta_1 and phi_1)
isolation, eta, and phi of the tau (iso_2, eta_2, phi_2)

The function finds all possible pairs of a good muon and a good tau, and filters out pairs in which the leptons are too close together. Then preference is given to the pair with the highest muon pT, and then (if any options remain) the tau with the lowest isolation.

template <typename T>
auto FindMuonTauPair(T &df) {
    using namespace ROOT::VecOps;
    auto build_pair = [](RVec<int>& goodMuons, RVec<float>& pt_1, RVec<float>& eta_1, RVec<float>& phi_1,
                         RVec<int>& goodTaus, RVec<float>& iso_2, RVec<float>& eta_2, RVec<float>& phi_2)
            {
                 // Get indices of all possible combinations
                 auto comb = Combinations(pt_1, eta_2);
                 const auto numComb = comb[0].size();

                 // Find valid pairs based on delta r
                 std::vector<int> validPair(numComb, 0);
                 for(size_t i = 0; i < numComb; i++) {
                     const auto i1 = comb[0][i];
                     const auto i2 = comb[1][i];
                     if(goodMuons[i1] == 1 && goodTaus[i2] == 1) {
                         const auto deltar = sqrt(
                                 pow(eta_1[i1] - eta_2[i2], 2) +
                                 pow(Helper::DeltaPhi(phi_1[i1], phi_2[i2]), 2));
                         if (deltar > 0.5) {
                             validPair[i] = 1;
                         }
                     }
                 }

                 // Find best muon based on pt
                 int idx_1 = -1;
                 float maxPt = -1;
                 for(size_t i = 0; i < numComb; i++) {
                     if(validPair[i] == 0) continue;
                     const auto tmp = comb[0][i];
                     if(maxPt < pt_1[tmp]) {
                         maxPt = pt_1[tmp];
                         idx_1 = tmp;
                     }
                 }

                 // Find best tau based on iso
                 int idx_2 = -1;
                 float minIso = 999;
                 for(size_t i = 0; i < numComb; i++) {
                     if(validPair[i] == 0) continue;
                     if(int(comb[0][i]) != idx_1) continue;

		     // COMPLETE THE FUNCTION
                 }

                 return std::vector<int>({idx_1, idx_2});
            };
}

Challege: select lowest-iso tau

Complete the function above to select the pair with the highest muon pT (already implemented) and the most isolated tau (missing)
Solution

The example has already chosen a well-separated pair with the highest-pT muon. We just need to minimize the minIso variable:
 // Find best tau based on iso
 int idx_2 = -1;
 float minIso = 999;
 for(size_t i = 0; i < numComb; i++) {
     if(validPair[i] == 0) continue;
     if(int(comb[0][i]) != idx_1) continue;

	// get the index of the tau (element 1) in combination #i
     const auto tmp = comb[1][i]

	// test that tau's isolation against the minimum and reset if needed
     if(minIso < iso_2[tmp]){
 	    minIso = iso_2[tmp];
	    idx_2 = tmp;
	}
 }

The rest of the FindMuonTauPair function shows how to call build_pair in RDataFrame and pass it the necessary columns. A Define statement is used to store the pair index from the output of build_pair. This object is a vector of 2 values, so we can also use Define to store the first (muon) and second (tau) index separately. These indices will show which lepton in the column should be used to form a Higgs boson.

    return df.Define("pairIdx", build_pair,
                     {"goodMuons", "Muon_pt", "Muon_eta", "Muon_phi",
                      "goodTaus", "Tau_relIso_all", "Tau_eta", "Tau_phi"})
             .Define("idx_1", "pairIdx[0]")
             .Define("idx_2", "pairIdx[1]")

Challenge: Filter on a good muon and tau in the pair

Add Filter statements to the end of FindMuonTauPair to filter out events with bad indices for the muon or tau in the pair.
Solution

A bad index means the value is -1. The index would only remain -1 if no valid pairs of good muons and good taus were found, so this is a way to require that a good pair exists in the event.
 .Define("idx_2", "pairIdx[0]")
 .Filter("idx_1 != -1", "muon index is good")
 .Filter("idx_2 != -1", "tau index is good")

Finally, to actually perform event selection we must call all these functions we have created!

ROOT::EnableImplicitMT();
ROOT::RDataFrame df("Events", "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/GluGluToHToTauTau.root");

auto df2 = MinimalSelection(df);
auto df3 = FindGoodMuons(df2);
auto df4 = FindGoodTaus(df3);
auto df5 = FilterGoodEvents(df4);
auto df6 = FindMuonTauPair(df5);
auto df7 = DeclareVariables(df6); // adds more column definitions
auto df8 = CheckGeneratorTaus(df7, sample); // checks that the generated particles are good
auto df9 = AddEventWeight(df8, sample);

// Final dataframe!
auto dfFinal = df9;
auto report = dfFinal.Report();
dfFinal.Snapshot("Events", sample + "Skim.root", finalVariables);

The last line above saves a “snapshot” of the dataframe, which is a ROOT file with a new tree (still called “Events”) containing only the branches listed in “finalVariables”.

Challenge: go through the final variables

Go through DeclareVariables to understand the variables being defined. Describe in words all the masses computed. If time permits, edit the definition of goodJets to require that they are separated from the muon (see the previous definitions in the chain!) by DR > 0.4.
.Define("goodJets", "Jet_puId == true && abs(Jet_eta) < 4.7 && Jet_pt > 30")
Solution

Several masses are defined:

m_1 = mass of the good muon selected for the Higgs decay

m_2 = mass of the good tau selected for the Higgs decay

mt_1 = transverse mass from the compute_mt helper function, calculated from the muon and the MET

mt_2 = transverse mass from the compute_mt helper function, calculated from the tau and the MET

m_vis = “visible” mass of the Higgs boson. This is computed from the p4 variable, which was previously defined as the sum of the muon and tau four-vectors

mjj = jet mass from the compute_mjj function, which returns the mass of the four-vector in the first argument if there are at least 2 good jets. The first argument is the jP4 variable, defined a few lines above as the sum of the first and second jet four-vectors. The compute_mjj function adds a protection against cases where there is only one jet.

To add a jet-muon separation requirement, we could add a definition and then use it in goodJets. For simplicity, let’s write a compute_dR function:
auto compute_dR = [](RVec<float> &eta_j, RVec<float> &phi_j, float eta_m, float phi_m)
{
 RVec<float> jetmuDR(eta_j.size(),-9);

 for(unsigned int ijet = 0; ijet < eta_j.size(); ijet++){
    const auto dphi = Helper::DeltaPhi(phi_j[ijet], phi_m);
	jetmuDR[ijet] = std::sqrt((eta_j[ijet] - eta_m)*(eta_j[ijet] - eta_m) + dphi*dphi);
 }
 return jetmuDR
}
Now we can use this function in a definition:
 .Define("Jet_muDR",compute_dR, {"Jet_eta", "Jet_phi", "eta_1", "phi_1"})
 .Define("goodJets", "Jet_puId == true && abs(Jet_eta) < 4.7 && Jet_pt > 30 && Jet_muDR > 0.4")

Running the skimmer

With your edits in place, run skim.cxx to process all the data and simulation samples. Great time for a break!

$ source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/setup.sh
$ g++ -g -O3 -Wall -Wextra -Wpedantic -o skim skim.cxx $(root-config --cflags --libs)
$ ./skim

Key Points

Selection criteria include kinematic limits (momentum and angle), identification, and isolation.

Kinematic limits are usually based on detector ranges and the physics process being studied.

Identification and isolation criteria depend mostly on your physics analysis goals.

Object ID and Event Selection

Higgs to Tau Tau analysis

Overview

Setup

Data and simulation files

Review of TTree::Draw and RDataFrame

Challenge: replicate histograms with RDataFrame

Key Points

Event selection

Overview

Trigger and final state definition

Challenge: require muons and taus

Solution

Lepton selection criteria

Question

Challenge: define good muons

Solution

Question

Challenge: define good taus

Solution

Selecting on jets and MET

Selecting events

Question

Challege: select lowest-iso tau

Solution

Challenge: Filter on a good muon and tau in the pair

Solution

Challenge: go through the final variables

Solution

Running the skimmer

Key Points