Summary and Setup
This tutorial guides you through setting up CMS open data processing using Google Cloud Platform resources.
This is for you if:
- the standard NanoAOD content is not sufficient for your use case
- you would like to use MiniAOD content for your analysis, but you do not have computing resources available.
The example task is to produce a custom NanoAOD starting from MiniAOD. The output is in the NanoAOD format, enriched with the Particle-Flow (PF) candidates. Part of the existing open data is already available in this format, and by following this example you will be able to process more of it if needed in your analysis.
Google Cloud account
Create an account for Google Cloud Platform (GCP). The login is an e-mail address: it can be your existing Gmail account, or you can create an account with another e-mail address.
If you want to use the free trial ($300), you need to activate it. A billing account with a credit card number is needed, but you will not be charged during the free trial.
Software setup
In this tutorial, we expect that you will be using a Linux shell on native Linux, on macOS, or through WSL2 on Windows. If none of these is available to you, you can work through Google Cloud Shell, but the instructions in this tutorial are written for a Linux shell.
You should be familiar with basic Linux shell commands and have git available.
GCP Command-line interface: gcloud
Follow these instructions to install gcloud, the command-line interface to GCP resources.
Initialize it with:
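```bash
# follow the prompts to log in and set the defaults
gcloud init
```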
List your project(s) with:
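```bash
gcloud projects list
```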
Check the current project with:
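```bash
gcloud config get-value project
```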
Change the project if needed with:
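```bash
# replace <PROJECT_ID> with the project ID shown by "gcloud projects list"
gcloud config set project <PROJECT_ID>
```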
Kubernetes command-line tool: kubectl
kubectl is the command-line tool for interacting with Kubernetes clusters. Install it either on its own or with Docker Desktop.
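Once installed, you can verify that the client is working with:

```bash
kubectl version --client
```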
Deploying resources: Terraform
This tutorial uses Terraform scripts to facilitate the creation and deletion of GCP resources. Install it following these instructions (WSL2 users should follow the Linux instructions).
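Once installed, you can check that Terraform is available in your shell with:

```bash
terraform version
```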
Running workflows: Argo CLI
The example processing workflow is defined as an "Argo workflow". To be able to submit it and follow its progress from your terminal, download the Argo command-line interface following these instructions. Note that the steps under "Controller and Server" require the cluster to be available; do not do them yet.
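You can confirm that the CLI is installed with:

```bash
argo version
```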
Optional for building a customized CMSSW container image: Docker
This tutorial uses a CMSSW open data container image with the pfnano producer code downloaded and compiled. You do not need to install Docker to use it in the context of this tutorial. You need Docker only if you want to modify the code with your own selections and build a new container image.
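If you do decide to build your own image, the general pattern is sketched below; the image name and registry user here are placeholders, not the actual image used in this tutorial:

```bash
# build from a Dockerfile in the current directory; tag and user are placeholders
docker build -t <registry-user>/pfnano-cmssw:custom .

# push to a registry that the cluster nodes can pull from
docker push <registry-user>/pfnano-cmssw:custom
```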
Building an image disk: go
A secondary boot disk image with the CMSSW container image preinstalled lets the processing workflow step start immediately. Otherwise, the container image has to be pulled at the start of the processing step, separately for each cluster node.
This disk image can be created and stored using GCP resources. A go script is available for creating the image. To run it, install go following these instructions (WSL2 users should follow the Linux instructions).
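You can check the installation with:

```bash
go version
```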