Summary and Setup
This tutorial guides you through setting up CMS open data processing using Google Cloud Platform resources.
This is for you if:
- the standard NanoAOD content is not sufficient for your use case
- you would like to use MiniAOD content for your analysis, but you do not have computing resources available.
The example task is to produce a custom NanoAOD starting from MiniAOD. The output is in the NanoAOD format, enriched with the Particle-Flow (PF) candidates. Part of the existing open data is already available in this format, and by following this example you will be able to process more of it if needed in your analysis.
Google Cloud account
Create an account for Google Cloud Platform (GCP). The login is an e-mail address: it can be your existing Gmail account, or you can create an account with another e-mail address.
If you want to use the free trial ($300), you need to activate it. A billing account with a credit card number is needed, but you will not be charged during the free trial.
Software setup
In this tutorial, we expect that you will be using a Linux shell on native Linux, on macOS, or through WSL2 on Windows. If none of these is available to you, you can work through Google Cloud Shell, but the instructions in this tutorial are written for a Linux shell.
You should be familiar with basic Linux shell commands and have git available.
GCP Command-line interface: gcloud
Follow these instructions to install gcloud, the command-line interface to GCP resources.
Initialize it with:
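```bash
# follow the prompts to log in and set the defaults
gcloud init
```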
List your project(s) with:
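```bash
gcloud projects list
```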
Check the current project with:
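```bash
gcloud config get-value project
```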
Change the project if needed with:
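```bash
# replace <PROJECT_ID> with the project ID shown by "gcloud projects list"
gcloud config set project <PROJECT_ID>
```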
Kubernetes command-line tool: kubectl
kubectl is the command-line tool for interacting with Kubernetes clusters. Install it either on its own or with Docker Desktop.
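Once installed, you can verify that the client is working with:

```bash
kubectl version --client
```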
Deploying resources: Terraform
This tutorial uses Terraform scripts to facilitate the creation and deletion of GCP resources. Install it following these instructions (WSL2 users should follow the Linux instructions).
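Once installed, you can check that Terraform is available in your shell with:

```bash
terraform version
```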
Running workflows: Argo CLI
The example processing workflow is defined as an "Argo workflow". To be able to submit it and follow its progress from your terminal, download the Argo command-line interface following these instructions. Note that the steps under "Controller and Server" require the cluster to be available; do not do them yet.
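You can confirm that the CLI is installed with:

```bash
argo version
```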
Optional for building a customized CMSSW container image: Docker
This tutorial uses a CMSSW open data container image with the pfnano producer code downloaded and compiled. You do not need to install Docker to use it in the context of this tutorial. You need Docker only if you want to modify the code with your own selections and build a new container image.
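If you do decide to build your own image, the general pattern is sketched below; the image name and registry user here are placeholders, not the actual image used in this tutorial:

```bash
# build from a Dockerfile in the current directory; tag and user are placeholders
docker build -t <registry-user>/pfnano-cmssw:custom .

# push to a registry that the cluster nodes can pull from
docker push <registry-user>/pfnano-cmssw:custom
```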
Building an image disk: go
A secondary boot disk image with the CMSSW container image preinstalled lets the processing workflow step start immediately. Otherwise, the container image has to be pulled at the start of the processing step, separately for each cluster node.
This disk image can be created and stored using GCP resources. A go script is available for creating the image. To run it, install go following these instructions (WSL2 users should follow the Linux instructions).
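You can check the installation with:

```bash
go version
```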