This lesson is being piloted (Beta version)

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is the point of these exercises?

  • How do I find the data I want to work with?

Objectives
  • To understand why we start with the Open Data Portal

  • To understand the basics of how the datasets are divided up

Get ahead!

Three quarters of this lesson is done entirely in the browser.

However, Episode 4, “What is in the data files?”, requires a running CMSSW environment, using either Docker or the VM. You may want to jump ahead to that episode to remind yourself how to set that up and get it running in the background, so that it’s all ready to go when we get there!

But come back here right after! :)

You’ve got a great idea! What’s next?

Suppose you have a great idea that you want to test out with real data! You’re going to want to know:

  • What year was the data taken that would work best for you?
  • What triggers were applied? (That is, how was the data pre-selected to be saved and recorded?)
  • What Monte Carlo datasets are available and appropriate for your studies?
    • This may mean finding simulated physics processes that are background to your signal
    • This may mean finding simulated physics processes for your signal, if they exist
    • Possibly just finding simulated datasets where you know the answer, allowing you to test your new analysis techniques

In this lesson, we’ll walk through the process of finding out what data and Monte Carlo datasets are available to you, how to find them, and how to examine what is in the individual data files.

First of all, let’s understand how the data is stored and why we need certain tools to access it.

The CERN Open Data Portal

In some of the earliest discussions about making high-energy physics (HEP) data publicly available, there were many concerns about people using and analyzing “other people’s” data. The concern centered on well-meaning scientists analyzing the data improperly and reaching incorrect conclusions.

While no system is perfect, one way to guard against this is to release only well-understood, well-calibrated datasets and to make sure open data analysts use only these datasets. Because the CERN Open Data Portal is a mutable environment and some datasets may change over time as they are being validated, it is important that analysts use only these vetted datasets. Each of these datasets is given a Digital Object Identifier (DOI) for tracking, and if there are ever questions about the validity of the data, the DOI allows us to check the data provenance.

DOI

The Digital Object Identifier (DOI) system allows people to assign a unique ID to any piece of digital media: a book, a piece of music, a software package, or a dataset. If you want to learn more about the DOI process, see their FAQ. The assignment of DOIs to CERN products is generally handled through Zenodo.

Challenge!

You will find that all the datasets have their DOI listed at the top of their page on the portal. Can you locate where the DOI is shown for this dataset, Record 6029, DoubleElectron primary dataset in AOD format from Run of 2012 (/DoubleElectron/Run2012C-22Jan2013-v1/AOD)?
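The dataset name in parentheses above already hints at how the datasets are divided up. As a minimal sketch, assuming the usual three-part CMS naming of primary dataset, processing label, and data tier, the name can be split apart like this. The class and function names here are hypothetical and only for illustration; they are not part of any portal or CMSSW tool.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    primary_dataset: str  # the stream the events were sorted into, e.g. "DoubleElectron"
    processing: str       # run period, reprocessing label, and version
    data_tier: str        # file format, e.g. "AOD"


def parse_dataset_name(name: str) -> DatasetName:
    """Split '/Primary/Processing/Tier' into its three parts (assumed layout)."""
    parts = name.strip().split("/")
    # A well-formed name starts with '/', so the first piece is empty.
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError(f"unexpected dataset name: {name!r}")
    return DatasetName(*parts[1:])


example = parse_dataset_name("/DoubleElectron/Run2012C-22Jan2013-v1/AOD")
print(example.primary_dataset)
print(example.processing)
print(example.data_tier)
```

Reading the name this way can help when you compare search results on the portal: two records may share a primary dataset but differ in run period or data tier.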

With a DOI, you can create citations to any of these records, for example using a tool like doi2bib.
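If you prefer to script this step, one possible approach is DOI content negotiation: asking the doi.org resolver for a BibTeX representation instead of the landing page. The sketch below is not how doi2bib itself works (its internals are not covered here), and it assumes that the registration agency behind the DOI supports the `application/x-bibtex` format. The function names are hypothetical, and you would pass in the DOI that you copied from the record page.

```python
import urllib.request


def bibtex_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a request asking doi.org for BibTeX."""
    doi = doi.strip()
    # Very light sanity check: DOIs look like "10.<registrant>/<suffix>".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi: str, timeout: float = 10.0) -> str:
    """Send the request; this needs network access and BibTeX support
    on the registrar's side (an assumption, not a guarantee)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_bibtex` with the DOI from a record page should return a citation entry as text, provided the assumptions above hold; if they do not, a web tool like doi2bib remains the simpler route.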

Provenance

You will hear experimentalists refer to the “provenance” of a dataset. From the Cambridge dictionary, provenance refers to “the place of origin of something”. The way we use it, we are referring to how we keep track of the history of how a dataset was processed: what version of the software was used for reconstruction, what period of calibrations was used during that processing, etc. In this way, we are documenting the data lineage of our datasets.

From Wikipedia

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

Provenance is an important part of our data quality checks and another reason we want to make sure you are using only vetted and calibrated data.

This lesson

For all the reasons given above, we encourage you to familiarize yourself with the search features and options on the portal. With your feedback, we can also work to create better search tools/options and landing points.

This exercise will guide you through the current approach to finding data and Monte Carlo. Let’s go!

Key Points

  • Finding the data is non-trivial, but all the information is on the portal

  • A careful understanding of the search options can help with finding what you need