Summary and Schedule

This tutorial will guide you through setting up CMS open data processing using Google Cloud Platform resources.

This is for you if:

  • the standard NanoAOD content is not sufficient for your use case, or
  • you would like to use MiniAOD content for your analysis but do not have computing resources available.

The example task is to produce a custom NanoAOD starting from MiniAOD. The output of this example task is in the NanoAOD format, enriched with Particle-Flow (PF) candidates. Part of the existing open data is already available in this format, and by following this example you will be able to process more of it if needed for your analysis.

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Google Cloud account


Create an account for Google Cloud Platform (GCP). The login is an e-mail address: it can be your existing Gmail account, or you can create an account with another e-mail address.

If you want to use the free trial credits ($300), you need to activate them. A billing account with a credit card number is required, but you will not be charged during the free trial.

Software setup


In this tutorial, we expect you to use a Linux shell: on native Linux, on macOS, or through WSL2 on Windows. If none of these is available, you can work in Google Cloud Shell, but the instructions in this tutorial are written for a local Linux shell.

You should be familiar with basic Linux shell commands and have git available.
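You can check that git is available with

BASH

git --version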

GCP Command-line interface: gcloud

Follow these instructions to install gcloud, the command-line interface to GCP resources.
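As an example, on Debian/Ubuntu (including WSL2) gcloud can be installed from Google's apt repository. This is one possible route, sketched from the gcloud documentation at the time of writing; prefer the linked instructions for your own platform.

BASH

# Add Google's apt repository and its signing key (Debian/Ubuntu example)
sudo apt-get install apt-transport-https ca-certificates gnupg curl
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
# Install the gcloud CLI
sudo apt-get update && sudo apt-get install google-cloud-cli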

Initialize it with

BASH

gcloud init

List your project(s) with

BASH

gcloud projects list

Check the current project with

BASH

gcloud config list

Change the project if needed with

BASH

gcloud config set project <PROJECT_ID>

Kubernetes command-line tool: kubectl

kubectl is the command-line tool for interacting with Kubernetes clusters. Install it either on its own or as part of Docker Desktop.
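As a sketch: if you installed gcloud from Google's apt repository, kubectl is available from the same repository; with a standalone gcloud installation, it can be added as a gcloud component instead. In either case, verify the client afterwards.

BASH

# With an apt-based gcloud installation
sudo apt-get install kubectl
# With a standalone gcloud installation, use this instead:
# gcloud components install kubectl
# Verify the client
kubectl version --client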

Deploying resources: Terraform

This tutorial uses Terraform scripts to facilitate the creation and deletion of GCP resources. Install it following these instructions (WSL2 users should follow the Linux instructions).
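For illustration, on Debian/Ubuntu Terraform can be installed from the HashiCorp apt repository; the linked instructions also cover key verification and other platforms.

BASH

# Add the HashiCorp signing key and repository (Debian/Ubuntu example)
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
# Install and verify
sudo apt update && sudo apt install terraform
terraform -version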

Running workflows: Argo CLI

The example processing workflow is defined as an “Argo workflow”. To be able to submit it and follow its progress from your terminal, download the Argo command-line interface following these instructions. Note that the steps under “Controller and Server” will be done once the cluster is available; do not do them yet.
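As a sketch, the CLI is a single binary downloaded from the Argo Workflows GitHub releases page; the version below is only an example, so check the releases page (and the linked instructions) for the current one.

BASH

# Download a release binary (example version; pick the latest release)
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.5.8/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
# Verify the installation
argo version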

Optional for building a customized CMSSW container image: Docker

This tutorial uses a CMSSW open data container image with the pfnano producer code downloaded and compiled. You do not need to install Docker to use it in the context of this tutorial. You need Docker only if you want to modify the code with your own selections and build a new container image.
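If you do modify the code, the general pattern is to build your own image and push it to a registry that the cluster can pull from. The image name below is a hypothetical placeholder, not the image actually used in this tutorial.

BASH

# Build an image from a Dockerfile in the current directory
# (<registry>/<repository> is a placeholder; use your own registry path)
docker build -t <registry>/<repository>/pfnano-custom:latest .
# Push it to the registry so the cluster can pull it
docker push <registry>/<repository>/pfnano-custom:latest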

Building an image disk: go

A secondary boot disk image with the CMSSW container image preinstalled makes the processing workflow step start immediately. Without it, the container image must be pulled at the start of the processing step, separately for each cluster node.

This disk image can be created and stored using GCP resources, and a go script is available for creating it. To run the script, install go following these instructions (WSL2 users should follow the Linux instructions).
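For example, on Linux go can be installed from an official release tarball; the version below is illustrative, so check go.dev for the current release.

BASH

# Download and unpack an official go release (example version)
curl -LO https://go.dev/dl/go1.22.5.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.22.5.linux-amd64.tar.gz
# Make go available in the current shell
export PATH=$PATH:/usr/local/go/bin
go version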