How to access metadata on the command line?

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • What is cernopendata-client?

  • How to use cernopendata-client container image?

  • How to get the list of files in a dataset on the command line?

Objectives
  • To be able to access the information programmatically

Dataset information

Each CMS Open Data dataset comes with metadata that describes its contents (size, number of files, number of events, file listings, etc.) and provenance (data-taking or generator configurations, and information on reprocessings), and provides usage instructions (the recommended CMSSW version, Global tag, container image, etc.). The previous section showed how this information is displayed on the CERN Open Data portal. This section shows how it can also be retrieved programmatically or on the command line.

Command-line tool

cernopendata-client is a command-line tool to download files or metadata of CERN Open Data portal records. It can be installed with pip or used through the container image.

As you already have Docker installed, we use the latter option. Pull the container image with:

docker pull docker.io/cernopendata/cernopendata-client

Display the command help with:

docker run -i -t --rm docker.io/cernopendata/cernopendata-client --help
Usage: cernopendata-client [OPTIONS] COMMAND [ARGS]...

  Command-line client for interacting with CERN Open Data portal.

Options:
  --help  Show this message and exit.

Commands:
  download-files      Download data files belonging to a record.
  get-file-locations  Get a list of data file locations of a record.
  get-metadata        Get metadata content of a record.
  list-directory      List contents of a EOSPUBLIC Open Data directory.
  verify-files        Verify downloaded data file integrity.
  version             Return cernopendata-client version.

This is equivalent to running cernopendata-client --help when the tool is installed with pip. The command runs in the container, displays the output, and exits. Note the --rm flag: it removes the container created by the command, so that stopped containers do not accumulate in your container list each time you run a command.
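
To avoid retyping the full docker run invocation, one option is to wrap it in a small shell function. This is only a sketch: the function name cod_client and the variables COD_IMAGE and DOCKER are hypothetical names chosen for this example, not part of cernopendata-client. The -t flag is left out here on the assumption that a pseudo-terminal is not needed when the output is consumed by a script.

```shell
# Container image used throughout this lesson.
COD_IMAGE="docker.io/cernopendata/cernopendata-client"

# Hypothetical wrapper: forwards all arguments to cernopendata-client
# running inside a throw-away container (--rm removes it on exit).
# DOCKER can be set to another compatible runner; it defaults to docker.
cod_client() {
    "${DOCKER:-docker}" run -i --rm "$COD_IMAGE" "$@"
}
```

With this function defined in your shell, cod_client --help should behave like the docker run command shown above.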

Get dataset information

Each dataset has a unique identifier, a record id (or recid), that appears in the dataset URL and can be used in the cernopendata-client commands. For example, in the previous sections, you have seen that file listings are split across multiple text files. To get the list of all files in the /SingleMu/Run2012B-22Jan2013-v1/AOD dataset, use the following (the recid of this dataset is 6021):

docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-file-locations --recid 6021 --protocol xrootd

or redirect the output to a local file with:

docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-file-locations --recid 6021 --protocol xrootd > files-recid-6021.txt
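
Once the list is saved, it can be post-processed with standard shell tools. The sketch below assumes that each line of the saved file is one XRootD location starting with root://, and that lines may end with a carriage return (which can happen when the container is run with a terminal allocated via -t). The function names clean_file_list and count_files are hypothetical.

```shell
# Print a saved file list with any carriage returns removed.
clean_file_list() {
    tr -d '\r' < "$1"
}

# Count the XRootD entries (lines starting with root://) in a saved list.
count_files() {
    clean_file_list "$1" | grep -c '^root://'
}
```

For example, count_files files-recid-6021.txt prints the number of file locations in the list saved above.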

Explore other metadata fields

Get the metadata fields in JSON format with:

docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-metadata --recid 6021

Note that the same JSON can be displayed in the CERN Open Data portal web interface by appending /export/json to the dataset URL. Try it!
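
The JSON export address can also be built in a script from the recid. This is a minimal sketch, assuming dataset URLs on the portal have the form https://opendata.cern.ch/record/<recid>; the function name record_json_url is hypothetical.

```shell
# Build the JSON export URL for a record id (recid).
# Assumption: dataset pages live under https://opendata.cern.ch/record/<recid>.
record_json_url() {
    printf 'https://opendata.cern.ch/record/%s/export/json\n' "$1"
}
```

Running record_json_url 6021 prints the address to open in a browser or to pass to a download tool such as curl, if one is available on your system.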

Summary

cernopendata-client is a handy tool for getting dataset information programmatically in scripts and workflows.

We are working on making all the metadata needed for an analysis of CMS Open Data datasets retrievable through cernopendata-client, and some improvements are about to be included in the latest version. Let us know on Mattermost what you would like to see implemented!

Key Points

  • cernopendata-client is a command-line tool to download dataset files and metadata from the CERN Open Data portal.