Summary and Schedule
This tutorial will guide you through setting up CMS Open data processing using Google Cloud Platform resources.
This is for you if:
- the standard NanoAOD content is not sufficient for your use case, or
- you would like to use MiniAOD content for your analysis but do not have computing resources available.
The example task is to produce a custom NanoAOD starting from MiniAOD. The output of this example task is in NanoAOD format, enriched with the Particle-Flow (PF) candidates. Part of the existing open data is already available in this format, and by following this example you will be able to process more of it if needed in your analysis.
| Time | Episode | Questions |
| --- | --- | --- |
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Introduction | What are public cloud providers? Why would you use them? What is Kubernetes? What else do you need? |
| Duration: 00h 15m | 2. Persistent storage | How to create a Google Cloud Storage bucket? What are the basic operations? What are the costs of storage and download? |
| Duration: 00h 30m | 3. Disk image | Why build a disk image for cluster nodes? How to build a disk image? |
| Duration: 00h 45m | 4. Kubernetes cluster | How to create a Google Kubernetes Engine cluster? How to access the cluster from the command line? |
| Duration: 01h 00m | 5. Set up workflow | How to set up the Argo Workflow engine? How to submit a test job? Where to find the output? |
| Duration: 01h 15m | 6. Scale up | How to process a full dataset? What is an optimal cluster setup? What is an optimal job configuration? |
| Duration: 01h 30m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Google Cloud account
Create an account for Google Cloud Platform (GCP). The login is an e-mail address: it can be your existing Gmail account, but you can also create an account with another e-mail address.
If you want to use the free trial ($300), you need to activate it. A billing account with a credit card number is needed, but you will not be charged during the free trial.
Software setup
In this tutorial, we expect that you will be using a Linux shell on native Linux, on macOS, or through WSL2 on Windows. If none of these is available to you, you can work through Google Cloud Shell, but the instructions in this tutorial are written for the Linux shell.
You should be familiar with basic Linux shell commands and have git available.
GCP Command-line interface: gcloud
Follow these instructions to install gcloud, the command-line interface to GCP resources.
Initialize it with:
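```bash
# interactive setup: log in and choose a default project
gcloud init
```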
List your project(s) with:
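```bash
# lists the projects your account has access to
gcloud projects list
```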
Check the current project with:
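```bash
# shows the currently active project (one of several ways to check it)
gcloud config get-value project
```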
Change the project if needed with:
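```bash
# replace PROJECT_ID with the project identifier from the list above
gcloud config set project PROJECT_ID
```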
Kubernetes command-line tool: kubectl
kubectl is the tool to interact with Kubernetes clusters. Install it either on its own or with Docker Desktop.
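Once installed, you can verify the client with, for example:

```bash
# prints the kubectl client version; no cluster connection is needed at this stage
kubectl version --client
```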
Deploying resources: Terraform
This tutorial uses Terraform scripts to facilitate the creation and deletion of GCP resources. Install it following these instructions (WSL2 users should follow the Linux instructions).
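A quick way to check the installation is, for example:

```bash
# prints the installed Terraform version
terraform version
```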
Running workflows: Argo CLI
The example processing workflow is defined as an “Argo workflow”. To be able to submit it and follow its progress from your terminal, download the Argo command-line interface following these instructions. Note that the steps under “Controller and Server” will be done once the cluster is available; do not do them yet.
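Once the CLI is installed, you can check that it is found in your terminal with, for example:

```bash
# prints the Argo CLI version; the server-side components come later
argo version
```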
Optional for building a customized CMSSW container image: Docker
This tutorial uses a CMSSW open data container image with the pfnano producer code downloaded and compiled. You do not need to install Docker to use it in the context of this tutorial. You only need Docker if you want to modify the code with your own selections and build a new container image.
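If you do build your own image, the commands follow the usual Docker pattern sketched below; the registry, image name and tag are placeholders, not the names used in this tutorial:

```bash
# build from a Dockerfile in the current directory (placeholder name and tag)
docker build -t <registry>/<your-image>:<tag> .
# push to a registry that your cluster can pull from
docker push <registry>/<your-image>:<tag>
```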
Building a disk image: go
A secondary boot disk image with the CMSSW container image preinstalled makes the processing workflow step start immediately. Otherwise, the container image needs to be pulled at the start of the processing step, separately on each cluster node.
This disk image can be created and stored using GCP resources. A go script is available for creating the image. To run it, install go following these instructions (WSL2 users should follow the Linux instructions).
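After installing, you can check that go is available with, for example:

```bash
# prints the installed go version
go version
```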