Downloading data using the cernopendata-client

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • How can I download data from the Open Data portal?

  • Should I stream the data from CERN or have them available locally?

Objectives
  • Understand basic usage of cernopendata-client

The cernopendata-client is a recently released tool that allows you to access records from the CERN Open Data portal via the command line.

cernopendata-client-go is a light-weight implementation of this tool with a particular focus on usage “in the cloud”. Additionally, it can download records in parallel. We will be using cernopendata-client-go in the following. Note that you can execute these commands on your local computer, and they will equally work in the CloudShell (which is Linux x86_64).

Getting the tool

You can download the latest release from GitHub. For use on your own computer, download the binary archive matching your processor architecture and operating system. If you are on macOS Catalina or later, you will have to right-click on the extracted binary, hold CTRL when clicking on “Open”, and confirm once more to open it. Afterwards, you should be able to execute the file from the command line.
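
As a sketch, downloading and unpacking a release archive from the command line could look like the following, assuming the repository lives at github.com/clelange/cernopendata-client-go (matching the Docker image name used below); <VERSION> and <ARCHIVE> are placeholders for the release tag and archive name shown on the releases page:

curl -OL https://github.com/clelange/cernopendata-client-go/releases/download/<VERSION>/<ARCHIVE>.tar.gz
tar -xzf <ARCHIVE>.tar.gz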

The binary will have a different name depending on your operating system and architecture. Execute it with --help to get an overview, and append --help to the available sub-commands for more details:

./cernopendata-client-go --help
./cernopendata-client-go list --help
./cernopendata-client-go download --help

As you can see from the releases page, you can also use Docker instead of downloading the binaries:

docker run --rm -it clelange/cernopendata-client-go --help

Listing files

When browsing the CERN Open Data portal, you will see from the URL that every record has a number. For example, Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012 has the URL http://opendata.web.cern.ch/record/12350. The record ID is therefore 12350.

You can list the associated files as follows:

docker run --rm -it clelange/cernopendata-client-go list -r 12350

This yields:

http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/HiggsTauTauNanoAODOutreachAnalysis-1.0.zip
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/histograms.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/plot.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/skim.cxx

Downloading files

You can download full records in an analogous way:

docker run --rm -it clelange/cernopendata-client-go download -r 12350

By default, these files will end up in a directory called download, which contains one subdirectory per record ID (i.e. ./download/12350 here).
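
Mind that when you run the download through Docker as above, the files end up inside the container and are gone once it exits. A minimal sketch of one way around this is to mount a local directory into the container and point the client at it with the -o output flag (the same flag the Kubernetes Job below uses); the mount point /download inside the container is an arbitrary choice here:

docker run --rm -it -v "$PWD/download":/download clelange/cernopendata-client-go download -r 12350 -o /download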

Creating download jobs

As mentioned in the introductory slides, it is advantageous to download the desired data sets/records to cloud storage before processing them. With cernopendata-client-go, this can be achieved by creating Kubernetes Jobs.

A job to download a record making use of the volume we have available could look like this:

# job-data-download.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-data-download
spec:
  backoffLimit: 2              # retry the pod at most twice on failure
  ttlSecondsAfterFinished: 100 # delete the finished job automatically after 100 seconds
  template:
    spec:
      volumes:
        - name: volume-opendata
          persistentVolumeClaim:
            claimName: nfs-<NUMBER> # replace <NUMBER> with your volume's number
      containers:
        - name: pod-data-download
          image: clelange/cernopendata-client-go
          # download record 12351 to the mounted volume
          args: ["download", "-r", "12351", "-o", "/opendata"]
          volumeMounts:
            - mountPath: "/opendata/"
              name: volume-opendata
      restartPolicy: Never

Again, please adjust the <NUMBER>.

You can create (submit) the job like this:

kubectl apply -n argo -f job-data-download.yaml

The job will create a pod, and you can monitor the progress both via the pod and the job resources:

kubectl get pod -n argo
kubectl get job -n argo
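
You can also follow the download progress in the log output of the pod that the job created. Since the pod name carries a random suffix, it is easiest to address it via the job (standard kubectl usage):

kubectl logs -n argo job/job-data-download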

The files are downloaded to the storage volume, and you can see them through the fileserver as instructed in the previous episode or through the storage pod:

kubectl exec pod/pv-pod -n argo -- ls /mnt/data/
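
Assuming the client creates the same per-record subdirectory under the output directory as it does by default, the files from the job above should end up in a directory named after the record, which you could check with:

kubectl exec pod/pv-pod -n argo -- ls /mnt/data/12351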

Challenge: Download all records needed

In the following, we will run the Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012, for which 9 different records are listed on the webpage. So far we have only downloaded the GluGluToHToTauTau data set. Can you download the remaining ones using Jobs?

Solution: Download all records needed

In principle, the only thing that needs to change is the record ID argument. However, Kubernetes resources need to have unique names, which means that either the job name needs to be changed or the finished job(s) have to be deleted first.
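
As a sketch, one way to go about this is to delete the finished job, change the record ID in the args line of job-data-download.yaml, and submit the manifest again:

kubectl delete job/job-data-download -n argo
kubectl apply -n argo -f job-data-download.yaml

Alternatively, give each job a unique name in the manifest (e.g. by appending the record ID) so that several download jobs can run in parallel.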

Key Points

  • It is usually advantageous to have the data where the CPU cores are.