Event selection

Overview

Teaching: 10 min
Exercises: 20 min

Questions

How can I select events in which to search for Higgs -> tau tau?

Objectives

Understand typical object selection criteria

Implement selections for the muon and tau examples

Begin the workshop’s physics analysis example

Event selection for the Higgs -> tau tau search is performed in the file called skim.cxx. As we select events, our goal is to keep the signal events from Higgs to tau tau decay and suppress the background events from other sources (Drell-Yan, W boson, top quarks, etc). Many background processes can be suppressed by introducing kinematic limits on the physics objects. For instance, a W boson could decay to a muon and neutrino, similar to a tau decay, but only the W boson background processes should have the transverse mass of the muon and neutrino (modeled by the missing transverse energy) near 80 GeV. Cuts on key distinguishing variables can strongly suppress background.

Trigger and final state definition

The first event selection requirement should be a set of triggers. In fact, we have made a hidden trigger selection by analyzing the “TauPlusX” datasets! Only events passing some tau-related trigger will enter our data files.

In the MinimalSelection function we will filter events by requiring they fire a hadronic tau + muon trigger – this means we are choosing to search for events in which one of the tau leptons decayed to a muon and the other decayed to hadrons. We will certainly reject signal events based on this requirement, but narrowing the final state makes the relevant background processes more clear and simplifies the search procedure.

template <typename T>
auto MinimalSelection(T &df) {
    return df.Filter("HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == true", "Passes trigger")
}

Challenge: require muons and taus

In the same function, require that each event have at least one muon and at least one tau lepton. (hint: .Filter() commands can be chained together in the same statement) Open a VBF signal ROOT file and use TTree::Draw to plot a histogram of the number of tau leptons in each event, and then the number of muons in each event. What do the distributions tell you about the sources of these leptons?
Solution

Add two more Filter statements in MinimalSelection:
  return df.Filter("HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == true", "Passes trigger")
           .Filter("nMuon > 0", "there are muons")
	      .Filter("nTau > 0", "there are taus");
The plot of nMuon from the gg fusion H -> tau tau sample is shown below – there are many more than the 1 expected from the signal, so clearly we need to do more work to identify which muon is the interesting one from the tau decay.

Lepton selection criteria

We clearly need to slim down the number of muons! The typical event has far more than 2 muons from tau decays. The FindGoodMuons function will add a new column to the dataframe that defines a “good muon” for the purpose of our analysis.

Question

What constitutes a good muon?

For any physics object, the selection criteria typically include:

kinematic constraints (momentum, pseudorapidity, masses of object pairs, etc)
identification requirements (loose, medium, tight quality levels). We have stored all of these labels as boolean pass/fail variables.
isolations requirements (loose, medium, tight isolation levels). When considering the relative isolation variables calculated in the previous exercise for muons, “loose” isolation would correspond to values < 0.4 while “tight” isolation would correspond to values < 0.15.

Thinking about the signal process guides our decisions here: in signal events we expect one hadronic tau, one muon, and neutrinos that were the prompt decay products of a massive particle. The tau and the muon will likely be produced with reasonably high momentum, and are not likely to be surrounded by energy deposits from other particles. For this physics process we should select hadronic taus and muons that pass at least their “loose” ID criteria, and perhaps even the “tight” ID criteria. It may help reject background to require that these leptons be isolated. Making these decisions often involves studying the signal and background efficiencies for a variety of choices.

Which kinematic constraints should be applied? The first place to look is the trigger path: “HLT_IsoMu17_eta2p1_LooseIsoPFTau20”. This trigger path selects events with an isolated muon that has pT > 17 GeV and |eta| < 2.1; along with a loose, isolation tau lepton with pT > 20 GeV. In an ideal case, “offline” (post-trigger) selection criteria are universally more restrictive than “online” (trigger) selection criteria, especially in terms of transverse momentum. This allows analysts to avoid the murky kinematic regions of “trigger turn-ons” in which data and simulation might respond differently to the application of the trigger because of small differences in the input variables.

Challenge: define good muons

Complete the FindGoodMuons function with selection criteria on muon momentum, pseudorapidity, identification level, and isolation level.
template <typename T>
auto FindGoodMuons(T &df) {
   return df.Define("goodMuons", "BOOLEAN CONDITIONS GO HERE");
}
(hint: Muon_pt, Muon_eta, Muon_tightId, Muon_looseId, etc, are branches of interest.)
Solution

I’ve opted to tighten the momentum cut a little compared to the trigger, copy the trigger’s eta requirement, and to require tight ID and loose isolation on the muon.
   return df.Define("goodMuons","Muon_pt > 20 && abs(Muon_eta) < 2.1 && Muon_tightId == true && Muon_pfreliso03all < 0.4");

Question

What constitutes a good hadronic tau?

Taus that decay to hadrons and neutrinos are identified using similar logic, but with combinations of tau identification discriminants instead of the simpler loose/medium/tight framework. The tau ID reference page gives a brief description of each type identification variable. They note that we always want to use the “ByDecayModeFinding” discriminant.

Challenge: define good taus

Complete the FindGoodTaus function with selection criteria on the tau identification discriminants. Based on the trigger path, we have already placed some kinematic constraints on the tau.
template <typename T>
auto FindGoodTaus(T &df) {
   return df.Define("goodTaus", "Tau_charge != 0 && abs(Tau_eta) < 2.3 && Tau_pt > 20 && BOOLEAN ID CONDITIONS HERE");
}
Solution

Based on the TWiki, we should always use byDecayModeFinding. Based on the trigger, we should require one of the isolation-based IDs. Additionally, since we want to be very careful to choose a hadronic tau, let’s use an anti-electron and anti-muon ID:
   return df.Define("goodTaus","Tau_charge != 0 && abs(Tau_eta) < 2.3 && Tau_pt > 20 &&\
                                Tau_idDecayMode == true && Tau_idIsoTight == true &&\
                                Tau_idAntiEleTight == true && Tau_idAntiMuTight == true");
The choices of tight vs medium or loose are up to the user based on studies of signal and background efficiency.

Selecting on jets and MET

In this analysis we will not explicitly require jets or missing transverse momentum. Analyses expecting MET will set thresholds on MET_pt and might also require that MET is not close to collinear with a jet (aka, require MET to be “significant”).

When selecting jets, the “loose” identification criteria that you added in AOD2NanoAOD.cc is recommended, and the number of jets required would be informed by your physics process (keeping in mind that jets can be lost and radiation can produce extra jets). An extra wrinkle comes in when leptons and jets are expected in the final state: particle-flow jets are clustered from all particle-flow candidates – muons and electrons are not removed from this list! For many final state topologies, it is sufficient to remove “jets” from the list if they closely overlap a tight, isolated muon or electron. If this is expected based on the physics process (for instance, the decay of a very high momentum top quark to a lepton and b quark jet), the four-vector of the lepton can be subtracted from the uncorrected four-vector of the jet before the corrections are applied.

Selecting events

We can now reduce the dataset to the interesting events containing at least one good muon and good tau candidate.

template <typename T>
auto FilterGoodEvents(T &df) {
    return df.Filter("Sum(goodTaus) > 0", "Event has good taus")
             .Filter("Sum(goodMuons) > 0", "Event has good muons");
}

Question

How should we determine which muon-tau pair for the best Higgs boson candidate?

This is not possible will the simple one-liner RDataFrame commands we have used so far. Instead, we will need a C++ function based on the RVec columns. The function is called build_pair and it takes several arguments:

goodMuons and goodTaus: these columns were defined in the previous functions and hold a pass/fail (1/0) value for each muon or tau in the event
momentum, eta, and phi of the muon (pt_1, eta_1 and phi_1)
isolation, eta, and phi of the tau (iso_2, eta_2, phi_2)

The function finds all possible pairs of a good muon and a good tau, and filters out pairs in which the leptons are too close together. Then preference is given to the pair with the highest muon pT, and then (if any options remain) the tau with the lowest isolation.

template <typename T>
auto FindMuonTauPair(T &df) {
    using namespace ROOT::VecOps;
    auto build_pair = [](RVec<int>& goodMuons, RVec<float>& pt_1, RVec<float>& eta_1, RVec<float>& phi_1,
                         RVec<int>& goodTaus, RVec<float>& iso_2, RVec<float>& eta_2, RVec<float>& phi_2)
            {
                 // Get indices of all possible combinations
                 auto comb = Combinations(pt_1, eta_2);
                 const auto numComb = comb[0].size();

                 // Find valid pairs based on delta r
                 std::vector<int> validPair(numComb, 0);
                 for(size_t i = 0; i < numComb; i++) {
                     const auto i1 = comb[0][i];
                     const auto i2 = comb[1][i];
                     if(goodMuons[i1] == 1 && goodTaus[i2] == 1) {
                         const auto deltar = sqrt(
                                 pow(eta_1[i1] - eta_2[i2], 2) +
                                 pow(Helper::DeltaPhi(phi_1[i1], phi_2[i2]), 2));
                         if (deltar > 0.5) {
                             validPair[i] = 1;
                         }
                     }
                 }

                 // Find best muon based on pt
                 int idx_1 = -1;
                 float maxPt = -1;
                 for(size_t i = 0; i < numComb; i++) {
                     if(validPair[i] == 0) continue;
                     const auto tmp = comb[0][i];
                     if(maxPt < pt_1[tmp]) {
                         maxPt = pt_1[tmp];
                         idx_1 = tmp;
                     }
                 }

                 // Find best tau based on iso
                 int idx_2 = -1;
                 float minIso = 999;
                 for(size_t i = 0; i < numComb; i++) {
                     if(validPair[i] == 0) continue;
                     if(int(comb[0][i]) != idx_1) continue;

		     // COMPLETE THE FUNCTION
                 }

                 return std::vector<int>({idx_1, idx_2});
            };
}

Challege: select lowest-iso tau

Complete the function above to select the pair with the highest muon pT (already implemented) and the most isolated tau (missing)
Solution

The example has already chosen a well-separated pair with the highest-pT muon. We just need to minimize the minIso variable:
 // Find best tau based on iso
 int idx_2 = -1;
 float minIso = 999;
 for(size_t i = 0; i < numComb; i++) {
     if(validPair[i] == 0) continue;
     if(int(comb[0][i]) != idx_1) continue;

	// get the index of the tau (element 1) in combination #i
     const auto tmp = comb[1][i]

	// test that tau's isolation against the minimum and reset if needed
     if(minIso < iso_2[tmp]){
 	    minIso = iso_2[tmp];
	    idx_2 = tmp;
	}
 }

The rest of the FindMuonTauPair function shows how to call build_pair in RDataFrame and pass it the necessary columns. A Define statement is used to store the pair index from the output of build_pair. This object is a vector of 2 values, so we can also use Define to store the first (muon) and second (tau) index separately. These indices will show which lepton in the column should be used to form a Higgs boson.

    return df.Define("pairIdx", build_pair,
                     {"goodMuons", "Muon_pt", "Muon_eta", "Muon_phi",
                      "goodTaus", "Tau_relIso_all", "Tau_eta", "Tau_phi"})
             .Define("idx_1", "pairIdx[0]")
             .Define("idx_2", "pairIdx[1]")

Challenge: Filter on a good muon and tau in the pair

Add Filter statements to the end of FindMuonTauPair to filter out events with bad indices for the muon or tau in the pair.
Solution

A bad index means the value is -1. The index would only remain -1 if no valid pairs of good muons and good taus were found, so this is a way to require that a good pair exists in the event.
 .Define("idx_2", "pairIdx[0]")
 .Filter("idx_1 != -1", "muon index is good")
 .Filter("idx_2 != -1", "tau index is good")

Finally, to actually perform event selection we must call all these functions we have created!

ROOT::EnableImplicitMT();
ROOT::RDataFrame df("Events", "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/GluGluToHToTauTau.root");

auto df2 = MinimalSelection(df);
auto df3 = FindGoodMuons(df2);
auto df4 = FindGoodTaus(df3);
auto df5 = FilterGoodEvents(df4);
auto df6 = FindMuonTauPair(df5);
auto df7 = DeclareVariables(df6); // adds more column definitions
auto df8 = CheckGeneratorTaus(df7, sample); // checks that the generated particles are good
auto df9 = AddEventWeight(df8, sample);

// Final dataframe!
auto dfFinal = df9;
auto report = dfFinal.Report();
dfFinal.Snapshot("Events", sample + "Skim.root", finalVariables);

The last line above saves a “snapshot” of the dataframe, which is a ROOT file with a new tree (still called “Events”) containing only the branches listed in “finalVariables”.

Challenge: go through the final variables

Go through DeclareVariables to understand the variables being defined. Describe in words all the masses computed. If time permits, edit the definition of goodJets to require that they are separated from the muon (see the previous definitions in the chain!) by DR > 0.4.
.Define("goodJets", "Jet_puId == true && abs(Jet_eta) < 4.7 && Jet_pt > 30")
Solution

Several masses are defined:

m_1 = mass of the good muon selected for the Higgs decay

m_2 = mass of the good tau selected for the Higgs decay

mt_1 = transverse mass from the compute_mt helper function, calculated from the muon and the MET

mt_2 = transverse mass from the compute_mt helper function, calculated from the tau and the MET

m_vis = “visible” mass of the Higgs boson. This is computed from the p4 variable, which was previously defined as the sum of the muon and tau four-vectors

mjj = jet mass from the compute_mjj function, which returns the mass of the four-vector in the first argument if there are at least 2 good jets. The first argument is the jP4 variable, defined a few lines above as the sum of the first and second jet four-vectors. The compute_mjj function adds a protection against cases where there is only one jet.

To add a jet-muon separation requirement, we could add a definition and then use it in goodJets. For simplicity, let’s write a compute_dR function:
auto compute_dR = [](RVec<float> &eta_j, RVec<float> &phi_j, float eta_m, float phi_m)
{
 RVec<float> jetmuDR(eta_j.size(),-9);

 for(unsigned int ijet = 0; ijet < eta_j.size(); ijet++){
    const auto dphi = Helper::DeltaPhi(phi_j[ijet], phi_m);
	jetmuDR[ijet] = std::sqrt((eta_j[ijet] - eta_m)*(eta_j[ijet] - eta_m) + dphi*dphi);
 }
 return jetmuDR
}
Now we can use this function in a definition:
 .Define("Jet_muDR",compute_dR, {"Jet_eta", "Jet_phi", "eta_1", "phi_1"})
 .Define("goodJets", "Jet_puId == true && abs(Jet_eta) < 4.7 && Jet_pt > 30 && Jet_muDR > 0.4")

Running the skimmer

With your edits in place, run skim.cxx to process all the data and simulation samples. Great time for a break!

$ source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/setup.sh
$ g++ -g -O3 -Wall -Wextra -Wpedantic -o skim skim.cxx $(root-config --cflags --libs)
$ ./skim

Key Points

Selection criteria include kinematic limits (momentum and angle), identification, and isolation.

Kinematic limits are usually based on detector ranges and the physics process being studied.

Identification and isolation criteria depend mostly on your physics analysis goals.

previous episode

Object ID and Event Selection

lesson home

Event selection

Overview

Trigger and final state definition

Challenge: require muons and taus

Solution

Lepton selection criteria

Question

Challenge: define good muons

Solution

Question

Challenge: define good taus

Solution

Selecting on jets and MET

Selecting events

Question

Challege: select lowest-iso tau

Solution

Challenge: Filter on a good muon and tau in the pair

Solution

Challenge: go through the final variables

Solution

Running the skimmer

Key Points

previous episode

lesson home