How to access metadata on the command line?
Overview
Teaching: 5 min
Exercises: 5 minQuestions
What is cernopendata-client?
How to use cernopendata-client container image?
How to get the list of files in a dataset on the command line?
Objectives
To be able to access the information programmatically
Dataset information
Each CMS Open Data dataset comes with metadata that describes its contents (size, number of files, number of events, file listings, etc) and provenance (data-taking or generator configurations, and information on reprocessings), or provides usage instructions (the recommended CMSSW version, Global tag, container image, etc). The previous section showed how this information is displayed on the CERN Open Data portal. Now we see how it can also be retrieved programmatically or on the command line.
Command-line tool
cernopendata-client is a command-line tool to download files or metadata of CERN Open Data portal records. It can be installed with pip
or used through the container image.
As you already have Docker installed, we use the latter option. Pull the container image with:
docker pull docker.io/cernopendata/cernopendata-client
Display the command help with:
docker run -i -t --rm docker.io/cernopendata/cernopendata-client --help
Usage: cernopendata-client [OPTIONS] COMMAND [ARGS]...
Command-line client for interacting with CERN Open Data portal.
Options:
--help Show this message and exit.
Commands:
download-files Download data files belonging to a record.
get-file-locations Get a list of data file locations of a record.
get-metadata Get metadata content of a record.
list-directory List contents of a EOSPUBLIC Open Data directory.
verify-files Verify downloaded data file integrity.
version Return cernopendata-client version.
This is equivalent to running cernopendata-client --help
, when the tool is installed with pip
. The command runs in the container, it displays the output and exits. Note the flag --rm
: it removes the container created by the command so that they are not accumulated in your container list each time you run a command.
Get dataset information
Each dataset has a unique identifier, a record id (or recid
), that shows in the dataset URL and can be used in the cernopendata-client commands.
For example, in the previous sections, you have seen that file listings are in multiple text files. To get the list of all files in the /SingleMu/Run2012B-22Jan2013-v1/AOD dataset, use the following (recid
of this dataset is 6021):
docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-file-locations --recid 6021 --protocol xrootd
or pipe it to a local file with:
docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-file-locations --recid 6021 --protocol xrootd > files-recid-6021.txt
Explore other metadata fields
See how to get the metadata fields in JSON format with
docker run -i -t --rm docker.io/cernopendata/cernopendata-client get-metadata --recid 6021
Note that the JSON format can be displayed in the CERN Open data portal web interface by adding
/export/json
in the dataset URL. Try it!
Summary
cernopendata-client is a handy tool for getting dataset information programmatically in scripts and workflows.
We will be working on getting all metadata needed for an analysis of CMS Open Data datasets retrievable through cernopendata-client, and some improvements are about to be included in the latest version. Let us know what you would like to see implemented on Mattermost!
Key Points
cernopendata-client is a command-line tool to download dataset files and metadata from the CERN Open Data portal.