CMS Open Data using Kubernetes

Introduction

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is Kubernetes?

  • Why would I use Kubernetes?

Objectives
  • Get a rough understanding of Kubernetes and some of its components

The introduction will be performed using slides available on the Workshop Indico page.

Once you have successfully completed the steps there, please continue to the next episode.

Key Points

  • Kubernetes is a powerful tool to schedule containers.


Getting started with Google Kubernetes Engine

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How to create a cluster on Google Cloud Platform?

  • How to set up Google Kubernetes Engine?

  • How to set up a workflow engine to submit jobs?

  • How to run a simple job?

Objectives
  • Understand how to run a simple workflow in a commercial cloud environment

Get access to Google Cloud Platform

Usually, you would need to create your own account (and add a credit card) in order to be able to access the public cloud. For this workshop, however, you will have access to Google Cloud via the ARCHIVER project. Alternatively, you can create a Google Cloud account and you will get free credits worth $300, valid for 90 days (requires a credit card).

The following steps will be performed using the Google Cloud Platform console.

Use an incognito/private browser window

We strongly recommend you use an incognito/private browser window to avoid possible confusion and mixups with personal Google accounts you might have.

Find your cluster

By now, you should already have created your first Kubernetes cluster. In case you do not find it, use the “Search products and resources” field to search and select “Kubernetes Engine”. It should show up there.

Open the working environment

You can work from the cloud shell, which opens from an icon in the tool bar, indicated by the red arrow in the figure below:

In the following, all the commands are typed in that shell. We will make use of many kubectl commands. If interested, you can read an overview of this command line tool for Kubernetes.
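If you would like to get a first feeling for kubectl before moving on, a few read-only commands (standard kubectl, not specific to this workshop) that you can safely try are listed below. Replace <NODE-NAME> by one of the names printed by the first command.

kubectl get nodes                   # list the nodes of your cluster
kubectl get pods --all-namespaces   # list all pods in all namespaces
kubectl describe node <NODE-NAME>   # show details of a given node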

Install argo as a workflow engine

While jobs can also be run manually, a workflow engine makes defining and submitting jobs easier. In this tutorial, we use argo. Install it into your working environment with the following commands (all commands to be entered into the cloud shell):

kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/quick-start-postgres.yaml
curl -sLO https://github.com/argoproj/argo/releases/download/v2.11.1/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv ./argo-linux-amd64 /usr/local/bin/argo

This will also install the argo binary, which makes managing the workflows easier.

Reconnecting after longer time away

In case you leave your computer, you might have to reconnect to the CloudShell, possibly from a different computer. If the argo command is not found, run the commands above again, starting from the curl command.

You need to execute the following command so that the argo workflow controller has sufficient rights to manage the workflow pods. Replace XXX with the number for the login credentials you received.

kubectl create clusterrolebinding cern-cms-cluster-admin-binding --clusterrole=cluster-admin --user=cms-gXXX@arkivum.com

You can now check that argo is available with

argo version

We need to apply a small patch to the default argo config. Create a file called patch-workflow-controller-configmap.yaml:

data:
  artifactRepository: |
    archiveLogs: false

Apply:

kubectl patch configmap workflow-controller-configmap -n argo --patch "$(cat patch-workflow-controller-configmap.yaml)"

Run a simple test workflow

To test the setup, run a simple test workflow with

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
argo list -n argo
argo get -n argo @latest
argo logs -n argo @latest
argo delete -n argo @latest

Please mind that it is important to delete your workflows once they have completed. If you do not do this, the pods associated with the workflow will remain scheduled in the cluster, which might lead to additional charges. You will learn how to automatically remove them later.

Kubernetes namespaces

The above commands as well as most of the following use a flag -n argo, which defines the namespace in which the resources are queried or created. Namespaces separate resources in the cluster, effectively giving you multiple virtual clusters within a cluster.

You can change the default namespace to argo as follows:

kubectl config set-context --current --namespace=argo
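Afterwards, you can omit the -n argo flag from the kubectl commands in this lesson. You can check which namespace is currently active with:

kubectl config view --minify --output 'jsonpath={..namespace}'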

Key Points

  • With Kubernetes one can run workflows similar to a batch system


Storing workflow output on Google Kubernetes Engine

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I set up shared storage for my workflows?

  • How to run a simple job and get the output?

  • How to run a basic CMS Open Data workflow and get the output files?

Objectives
  • Understand how to set up shared storage and use it in a workflow

One concept that is particularly different with respect to running on a local computer is storage. Similar to mounting directories or volumes in Docker, one needs to mount available storage into a Pod for it to be able to access it. Kubernetes provides a large number of storage types.

Defining a storage volume

The test job above did not produce any output data files, just text logs. The data analysis jobs will produce output files and, in the following, we will go through a few steps to set up a volume where the output files will be written and from where they can be fetched. All definitions are passed as “yaml” files, which you’ve already used in the steps above. Due to some restrictions of the Google Kubernetes Engine, we need to use an NFS persistent volume to allow parallel access (see also this post).

The following commands will take care of this. You can look at the content of the files that are directly applied from GitHub in the workshop-payload-kubernetes repository.

As a first step, a volume needs to be created:

gcloud compute disks create --size=100GB --zone=us-central1-c gce-nfs-disk-<NUMBER>

Replace <NUMBER> by your account number, e.g. 023. The zone should be the same as the one of the cluster created (no need to change here).

The output will look like this (don’t worry about the warnings):

WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#performance.
Created [https://www.googleapis.com/compute/v1/projects/cern-cms/zones/us-central1-c/disks/gce-nfs-disk-001].
NAME              ZONE           SIZE_GB  TYPE         STATUS
gce-nfs-disk-001  us-central1-c  100      pd-standard  READY
New disks are unformatted. You must format and mount a disk before it
can be used. You can find instructions on how to do this at:
https://cloud.google.com/compute/docs/disks/add-persistent-disk#formatting

Now, let’s get to using this disk:

curl -LO https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/001-nfs-server.yaml

The file looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server-<NUMBER>
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server-<NUMBER>
  template:
    metadata:
      labels:
        role: nfs-server-<NUMBER>
    spec:
      containers:
      - name: nfs-server-<NUMBER>
        image: gcr.io/google_containers/volume-nfs:0.8
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /exports
            name: mypvc
      volumes:
        - name: mypvc
          gcePersistentDisk:
            pdName: gce-nfs-disk-<NUMBER>
            fsType: ext4

Replace all occurrences of <NUMBER> by your account number, e.g. 023. You can edit files directly in the console or by opening the built-in graphical editor.
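If you prefer the command line over the editor, a quick way to do the replacement is with sed; the example below assumes your number is 023:

sed -i 's/<NUMBER>/023/g' 001-nfs-server.yaml

Then apply the manifest: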

kubectl apply -n argo -f 001-nfs-server.yaml

Then on to the server service:

curl -OL https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/002-nfs-server-service.yaml

This looks like this:

apiVersion: v1
kind: Service
metadata:
  name: nfs-server-<NUMBER>
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    role: nfs-server-<NUMBER>

As above, replace all occurrences of <NUMBER> by your account number, e.g. 023, and apply the manifest:

kubectl apply -n argo -f 002-nfs-server-service.yaml

Download the manifest needed to create the PersistentVolume and the PersistentVolumeClaim:

curl -OL https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/003-pv-pvc.yaml

The file looks as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-<NUMBER>
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <Add IP here>
    path: "/"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-<NUMBER>
  namespace: argo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi

In the line containing server: replace <Add IP here> by the output of the following command:

kubectl get -n argo svc nfs-server-<NUMBER> |grep ClusterIP | awk '{ print $3; }'

This command queries the nfs-server service that we created above and then filters out the ClusterIP that we need to connect to the NFS server. Replace <NUMBER> as before.
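Equivalently, you can ask kubectl for the field directly using its jsonpath output, which avoids the grep/awk parsing:

kubectl get svc nfs-server-<NUMBER> -n argo -o jsonpath='{.spec.clusterIP}'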

Apply this manifest:

kubectl apply -n argo -f 003-pv-pvc.yaml

Let’s confirm that this worked:

kubectl get pvc nfs-<NUMBER> -n argo

You will see output similar to this

NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs    Bound    nfs      100Gi      RWX                           5h2m

Note that it may take some time before the STATUS gets to the state “Bound”.
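Instead of re-running the command, you can also watch for the status change (stop with Ctrl-C):

kubectl get pvc nfs-<NUMBER> -n argo -w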

Now we can use this volume in the workflow definition. Create a workflow definition file argo-wf-volume.yaml with the following contents:

# argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<NUMBER>
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the ouput" > /mnt/vol/test.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Submit and check this workflow with

argo submit -n argo argo-wf-volume.yaml
argo list -n argo

Take the name of the workflow from the output (replace XXXXX in the following command) and check the logs:

kubectl logs pod/test-hostpath-XXXXX  -n argo main

Once the job is done, you will see something like:

ls -l /mnt/vol: total 20 drwx------ 2 root root 16384 Sep 22 08:36 lost+found -rw-r--r-- 1 root root 18 Sep 22 08:36 test.txt

Get the output file

The example job above produced a text file as an output. It resides in the persistent volume that the workflow job has created. To copy the file from that volume to the cloud shell, we will define a container, a “storage pod” and mount the volume there so that we can get access to it.

Create a file pv-pod.yaml with the following contents:

# pv-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<NUMBER>
  containers:
    - name: pv-container
      image: busybox
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /mnt/data
          name: task-pv-storage

Create the storage pod and copy the files from there

kubectl apply -f pv-pod.yaml -n argo
kubectl cp  pv-pod:/mnt/data /tmp/poddata -n argo

and you will get the file created by the job in /tmp/poddata/test.txt.
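You can check the content of the file with

cat /tmp/poddata/test.txt

which should print the “This is the output” line written by the workflow.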

Run a CMS open data workflow

If the steps above are successful, we are now ready to run a workflow to process CMS open data.

Create a workflow file argo-wf-cms.yaml with the following content:

# argo-wf-cms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nanoaod-argo-
spec:
  entrypoint: nanoaod-argo
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: nfs-<NUMBER>
  templates:
  - name: nanoaod-argo
    script:
      image: cmsopendata/cmssw_5_3_32
      command: [sh]
      source: |
        source /opt/cms/entrypoint.sh
        sudo chown $USER /mnt/vol
        mkdir workspace
        cd workspace
        git clone git://github.com/cms-opendata-analyses/AOD2NanoAODOutreachTool  AOD2NanoAOD
        cd AOD2NanoAOD
        scram b -j8
        nevents=100
        eventline=$(grep maxEvents configs/data_cfg.py)
        sed -i "s/$eventline/process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32($nevents) )/g" configs/data_cfg.py
        cmsRun configs/data_cfg.py
        cp output.root /mnt/vol/
        echo  ls -l /mnt/vol
        ls -l /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Submit the job with

argo submit -n argo argo-wf-cms.yaml --watch

The option --watch gives a continuous follow-up of the progress. To get the logs of the job, use the process name (nanoaod-argo-XXXXX) which you can also find with

argo get -n argo @latest

and follow the container logs with

kubectl logs pod/nanoaod-argo-XXXXX  -n argo main

Get the output file output.root from the storage pod in a similar manner as it was done above.
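For example, with the storage pod still running, something like the following should work:

kubectl cp pv-pod:/mnt/data/output.root /tmp/poddata/output.root -n argo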

Accessing files via http

With the storage pod, you can copy files between the storage element and the CloudConsole. However, a practical use case would be to run the “big data” workloads in the cloud, and then download the output to your local desktop or laptop for further processing. An easy way of making your files available to the outside world is to deploy a webserver that mounts the storage volume.

We first patch the config of the webserver to be created as follows:

mkdir conf.d
cd conf.d
curl -sLO https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/conf.d/nginx-basic.conf
cd ..
kubectl create configmap basic-config --from-file=conf.d -n argo

Then prepare to deploy the fileserver by downloading the manifest:

curl -sLO https://github.com/cms-opendata-workshop/workshop-payload-kubernetes/raw/master/deployment-http-fileserver.yaml

Open this file and again adjust the <NUMBER>:

# deployment-http-fileserver.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    service: http-fileserver
  name: http-fileserver
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      service: http-fileserver
  template:
    metadata:
      labels:
        service: http-fileserver
    spec:
      volumes:
      - name: volume-output
        persistentVolumeClaim:
          claimName: nfs-<NUMBER>
      - name: basic-config
        configMap:
          name: basic-config
      containers:
      - name: file-storage-container
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
          - mountPath: "/usr/share/nginx/html"
            name: volume-output
          - name: basic-config
            mountPath: /etc/nginx/conf.d

Apply and expose the port as a LoadBalancer:

kubectl create -n argo -f deployment-http-fileserver.yaml
kubectl expose deployment http-fileserver -n argo --type LoadBalancer --port 80 --target-port 80

Exposing the deployment will take a few minutes. Run the following command to follow its status:

kubectl get svc -n argo

You will initially see a line like this:

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
http-fileserver               LoadBalancer   10.8.7.24    <pending>     80:30539/TCP   5s

The <pending> EXTERNAL-IP will update after a few minutes (run the command again to check). Once it’s there, copy the IP and paste it into a new browser tab. This should welcome you with a “Hello from NFS” message. In order to enable file browsing, we need to delete the index.html file in the pod. Determine the pod name using the first command listed below and adjust the second command accordingly.

kubectl get pods -n argo
kubectl exec http-fileserver-XXXXXXXX-YYYYY -n argo -- rm /usr/share/nginx/html/index.html

Warning: anyone can now access these files

This IP is now accessible from anywhere in the world, and so are your files (mind: there are charges for outgoing bandwidth). Please delete the service again once you have finished downloading your files.

kubectl delete svc/http-fileserver -n argo

Run the kubectl expose deployment command to expose it again.

Remember to delete your workflow again to avoid additional charges.

argo delete -n argo @latest

Key Points

  • CMS Open Data workflows can be run in a commercial cloud environment using modern tools


Downloading data using the cernopendata-client

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • How can I download data from the Open Data portal?

  • Should I stream the data from CERN or have them available locally?

Objectives
  • Understand basic usage of cernopendata-client

The cernopendata-client is a recently released tool that allows you to access records from the CERN Open Data portal via the command line.

cernopendata-client-go is a light-weight implementation of this with a particular focus on usage “in the cloud”. Additionally, it allows you to download records in parallel. We will be using cernopendata-client-go in the following. Mind that you can execute these commands on your local computer as well; they will also work in the CloudShell (which is Linux_x86_64).

Getting the tool

You can download the latest release from GitHub. For use on your own computer, download the corresponding binary archive for your processor architecture and operating system. If you are on macOS Catalina or later, you will have to right-click on the extracted binary, hold CTRL when clicking on “Open”, and confirm again to open. Afterwards, you should be able to execute the file from the command line.

The binary will have a different name depending on your operating system and architecture. Execute it to get help, and get more detailed help via the available sub-commands:

./cernopendata-client-go --help
./cernopendata-client-go list --help
./cernopendata-client-go download --help

As you can see from the releases page, you can also use docker instead of downloading the binaries:

docker run --rm -it clelange/cernopendata-client-go --help

Listing files

When browsing the CERN Open Data portal, you will see from the URL that every record has a number. For example, Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012 has the URL http://opendata.web.cern.ch/record/12350. The record ID is therefore 12350.

You can list the associated files as follows:

docker run --rm -it clelange/cernopendata-client-go list -r 12350

This yields:

http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/HiggsTauTauNanoAODOutreachAnalysis-1.0.zip
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/histograms.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/plot.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsTauTauNanoAODOutreachAnalysis/skim.cxx

Downloading files

You can download full records in an analogous way:

docker run --rm -it clelange/cernopendata-client-go download -r 12350

By default, these files will end up in a directory called download, which contains directories of the record ID (i.e. ./download/12350 here).
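Note that when running via docker, the downloaded files stay inside the container unless you mount a host directory. A possible way to keep them on your machine, reusing the -o option that also appears in the Job below, is:

docker run --rm -it -v "$PWD/opendata":/opendata clelange/cernopendata-client-go download -r 12350 -o /opendata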

Creating download jobs

As mentioned in the introductory slides, it is advantageous to download the desired data sets/records to cloud storage before processing them. With the cernopendata-client-go this can be achieved by creating Kubernetes Jobs.

A job to download a record making use of the volume we have available could look like this:

# job-data-download.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-data-download
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 100
  template:
    spec:
      volumes:
        - name: volume-opendata
          persistentVolumeClaim:
            claimName: nfs-<NUMBER>
      containers:
        - name: pod-data-download
          image: clelange/cernopendata-client-go
          args: ["download", "-r", "12351", "-o", "/opendata"]
          volumeMounts:
            - mountPath: "/opendata/"
              name: volume-opendata
      restartPolicy: Never

Again, please adjust the <NUMBER>.

You can create (submit) the job like this:

kubectl apply -n argo -f job-data-download.yaml

The job will create a pod, and you can monitor the progress both via the pod and the job resources:

kubectl get pod -n argo
kubectl get job -n argo

The files are downloaded to the storage volume, and you can see them through the fileserver as instructed in the previous episode or through the storage pod:

kubectl exec pod/pv-pod -n argo -- ls /mnt/data/

Challenge: Download all records needed

In the following, we will run the Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012, for which 9 different records are listed on the webpage. We only downloaded the GluGluToHToTauTau dataset so far. Can you download the remaining ones using Jobs?

Solution: Download all records needed

In principle, the only thing that needs to be changed is the record ID argument. However, resources need to have a unique name, which means that the job name needs to be changed or the finished job(s) have to be deleted.
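One possible way to submit several download Jobs without editing the manifest by hand is sketched below. The record IDs used in the loop are placeholders; replace them with the actual IDs of the remaining records from the Open Data portal.

# Sketch: create one Job per record by patching the Job name and the record ID.
# The record IDs below are placeholders and must be replaced.
for RID in 11111 22222 33333
do
  sed -e "s/name: job-data-download/name: job-data-download-${RID}/" \
      -e "s/\"12351\"/\"${RID}\"/" job-data-download.yaml | kubectl apply -n argo -f -
done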

Key Points

  • It is usually advantageous to have the data where the CPU cores are.


Running large-scale workflows

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I run more than a toy workflow?

Objectives
  • Run a workflow with several parallel jobs

Autoscaling

Google Kubernetes Engine allows you to configure your cluster so that it is automatically rescaled based on pod needs.

When creating a pod you can specify how much of each resource a container needs. More information on compute resources can be found on the Kubernetes pages. This information is then used to schedule your pod. If there is no node matching your pod’s requirements, it has to wait until some pods are terminated or a new node is added.

The cluster autoscaler keeps an eye on the pods scheduled and checks whether adding a new node, similar to the others in the cluster, would help. If yes, it resizes the cluster to accommodate the waiting pods.

The cluster autoscaler also scales down the cluster to save resources. If it notices that one or more nodes are not needed for an extended period of time (currently set to 10 minutes), it downscales the cluster.

To configure the autoscaler you simply specify a minimum and maximum for the node pool. The autoscaler then periodically checks the status of the pods and takes action accordingly. You can set the configuration either with the gcloud command-line tool or via the dashboard.
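As a sketch, enabling autoscaling on an existing node pool with gcloud could look like the following; the cluster name, node pool name and node limits are placeholders that you would need to adapt:

gcloud container clusters update <CLUSTER-NAME> --zone=us-central1-c \
  --node-pool=default-pool --enable-autoscaling --min-nodes=1 --max-nodes=4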

Deleting pods automatically

Argo allows you to describe the strategy to use when deleting completed pods. The pods are deleted automatically without deleting the workflow. Define one of the following strategies in your Argo workflow under the field spec:

spec:
  podGC:
    # pod gc strategy must be one of the following
    # * OnPodCompletion - delete pods immediately when pod is completed (including errors/failures)
    # * OnPodSuccess - delete pods immediately when pod is successful
    # * OnWorkflowCompletion - delete pods when workflow is completed
    # * OnWorkflowSuccess - delete pods when workflow is successful
    strategy: OnPodSuccess

Scaling down

Occasionally, the cluster autoscaler cannot scale down completely and extra nodes are left hanging around. Some such situations are documented here. Therefore it is useful to know how to manually scale down your cluster.

Click on your cluster, listed at Kubernetes Engine - Clusters. Scroll down to the end of the page where you will find the Node pools section. Clicking on your node pool will take you to its details page.

In the upper right corner, next to EDIT and DELETE you’ll find RESIZE.

Clicking on RESIZE opens a text field that allows you to manually adjust the number of nodes in your node pool.
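The same can also be done from the cloud shell; for example, shrinking a node pool to a single node could look like this (cluster and node pool names are placeholders):

gcloud container clusters resize <CLUSTER-NAME> --node-pool=default-pool \
  --num-nodes=1 --zone=us-central1-c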

Key Points

  • Argo Workflows on Kubernetes are very powerful.


Building a Docker image on GCP

Overview

Teaching: 10 min
Exercises: 30 min
Questions
  • How do I push to a private registry on GCP?

Objectives
  • Build a Docker image on GCP and push it to the private container registry GCR.

Creating your Docker image

The first thing to do when you want to create your Dockerfile is to ask yourself what you want to do with it. Our goal here is to run a physics analysis with the ROOT framework. To do this our Docker container must contain ROOT as well as all the necessary dependencies it requires. Fortunately there are already existing images matching this, such as rootproject/root-conda:6.18.04. We will use this image as our base.

Next we need our source code. We will git clone this into our container from the Github repository https://github.com/awesome-workshop/payload.git.

To save time, we will compile the source code as part of the image build. The other option is to add the source code to our image without compiling it. However, by compiling the code at this step, we avoid having to do it over and over again as our workflow runs multiple replicas of our container.

In the CloudShell, create a file compile.sh with the following contents:

#!/bin/bash

# Compile executable
echo ">>> Compile skimming executable ..."
COMPILER=$(root-config --cxx)
FLAGS=$(root-config --cflags --libs)
$COMPILER -g -O3 -Wall -Wextra -Wpedantic -o skim skim.cxx $FLAGS

Make it executable with

chmod +x compile.sh

Finally, create a Dockerfile that executes everything mentioned above. Create the file Dockerfile with the following contents:

# The Dockerfile defines the image's environment
# Import ROOT runtime
FROM rootproject/root-conda:6.18.04

# Argument available during build-time
ARG HOME=/home

# Set up working directory
WORKDIR $HOME

# Clone source code
RUN git clone https://github.com/awesome-workshop/payload.git
WORKDIR $HOME/payload

# Compile source code
ADD compile.sh $HOME/payload
RUN ./compile.sh

Building your Docker image

A Docker image can be built with the docker build command:

docker build -t [IMAGE-NAME]:[TAG] .

If you do not specify a container registry in the image name, the image will be associated with DockerHub, the default registry. We would like to push our images to the GCP container registry instead. To do so we need to name the image gcr.io/[PROJECT-ID]/[IMAGE-NAME]:[TAG], where [PROJECT-ID] is your GCP project ID. You can find your ID by clicking your project in the top left corner, or in the cloud shell prompt. In our case the ID is cern-cms. The hostname gcr.io tells the client to push it to the GCP registry.

[IMAGE-NAME] and [TAG] are up to you to choose, however a rule of thumb is to be as descriptive as possible. Due to Docker naming rules we also have to keep them lowercase. One example of how the command could look is:

docker build -t gcr.io/cern-cms/root-conda-<NUMBER>:higgstautau .

Replace <NUMBER> with the number for the login credentials you received.

Choose a unique name

Note that all attendants of this workshop are sharing the same registry! If you fail to choose a unique name and tag combination you risk overwriting the image of a fellow attendee. Naming your image with the number of your login credentials is a good way to keep that from happening.

Adding your image to the container registry

Before you push or pull images on GCP you need to configure Docker to use the gcloud command-line tool to authenticate with the registry. Run the following command:

gcloud auth configure-docker

This will list some warnings and instructions, which you can ignore for this tutorial. To push the image, run:

docker push gcr.io/cern-cms/root-conda-<NUMBER>:higgstautau

View your images

You can view images hosted by the container registry via the cloud console, or by visiting the image’s registry name in your web browser at http://gcr.io/cern-cms/root-conda-<NUMBER>.
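Alternatively, you can list your images and their tags from the cloud shell:

gcloud container images list --repository=gcr.io/cern-cms
gcloud container images list-tags gcr.io/cern-cms/root-conda-<NUMBER>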

Cleaning up your private registry

To avoid incurring charges to your GCP account you can remove your Docker image from the container registry. Do not execute this command right now, we want to use our container in the next step.

When it is time to remove your Docker image, execute the following command:

gcloud container images delete gcr.io/cern-cms/root-conda-<NUMBER>:higgstautau --force-delete-tags

Key Points

  • GCP allows you to store your Docker images in your own private container registry.


Getting real

Overview

Teaching: 5 min
Exercises: 20 min
Questions
  • How can I now use this for real?

Objectives
  • Get an idea of a full workflow with argo

So far we’ve run smaller examples, but now we have everything at hand to run physics analysis jobs in parallel.

Download the workflow with the following command:

curl -OL https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/higgs-tau-tau-workflow.yaml

The file will look something like this:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-nested-
spec:
  arguments:
    parameters:
    - name: files-list
      value: |
        [
          {"file": "GluGluToHToTauTau", "x-section": "19.6", "process": "ggH"},
          {"file": "VBF_HToTauTau", "x-section": "1.55", "process": "qqH"},
          {"file": "DYJetsToLL", "x-section": "3503.7", "process": "ZTT"},
          {"file": "TTbar", "x-section": "225.2", "process": "TT"},
          {"file": "W1JetsToLNu", "x-section": "6381.2", "process": "W1J"},
          {"file": "W2JetsToLNu", "x-section": "2039.8", "process": "W2J"},
          {"file": "W3JetsToLNu", "x-section": "612.5", "process": "W3J"},
          {"file": "Run2012B_TauPlusX", "x-section": "1.0", "process": "dataRunB"},
          {"file": "Run2012C_TauPlusX", "x-section": "1.0", "process": "dataRunC"}
        ]
    - name: histogram-list
      value: |
        [
          {"file": "GluGluToHToTauTau", "x-section": "19.6", "process": "ggH"},
          {"file": "VBF_HToTauTau", "x-section": "1.55", "process": "qqH"},
          {"file": "DYJetsToLL", "x-section": "3503.7", "process": "ZTT"},
          {"file": "DYJetsToLL", "x-section": "3503.7", "process": "ZLL"},
          {"file": "TTbar", "x-section": "225.2", "process": "TT"},
          {"file": "W1JetsToLNu", "x-section": "6381.2", "process": "W1J"},
          {"file": "W2JetsToLNu", "x-section": "2039.8", "process": "W2J"},
          {"file": "W3JetsToLNu", "x-section": "612.5", "process": "W3J"},
          {"file": "Run2012B_TauPlusX", "x-section": "1.0", "process": "dataRunB"},
          {"file": "Run2012C_TauPlusX", "x-section": "1.0", "process": "dataRunC"}
        ]
  entrypoint: parallel-worker
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: nfs-<NUMBER>
  templates:
  - name: parallel-worker
    inputs:
      parameters:
      - name: files-list
      - name: histogram-list
    dag:
      tasks:
      - name: skim-step
        template: skim-template
        arguments:
          parameters:
          - name: file
            value: ""
          - name: x-section
            value: ""
        withParam: ""
      - name: histogram-step
        dependencies: [skim-step]
        template: histogram-template
        arguments:
          parameters:
          - name: file
            value: ""
          - name: process
            value: ""
        withParam: ""

      - name: merge-step
        dependencies: [histogram-step]
        template: merge-template

      - name: plot-step
        dependencies: [merge-step]
        template: plot-template

      - name: fit-step
        dependencies: [merge-step]
        template: fit-template

  - name: skim-template
    inputs:
      parameters:
      - name: file
      - name: x-section
    script:
      image: gcr.io/cern-cms/root-conda-002:higgstautau
      command: [sh]
      source: |
        LUMI=11467.0 # Integrated luminosity of the unscaled dataset
        SCALE=0.1 # Same fraction as used to down-size the analysis
        mkdir -p $HOME/skimm
        ./skim root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/{{inputs.parameters.file}}.root $HOME/skimm/{{inputs.parameters.file}}-skimmed.root {{inputs.parameters.x-section}} $LUMI $SCALE
        ls -l $HOME/skimm
        cp $HOME/skimm/* /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol
      resources:
        limits:
          memory: 2Gi
        requests:
          memory: 1.7Gi
          cpu: 750m

  - name: histogram-template
    inputs:
      parameters:
      - name: file
      - name: process
    script:
      image: gcr.io/cern-cms/root-conda-002:higgstautau
      command: [sh]
      source: |
        mkdir -p $HOME/histogram
        python histograms.py /mnt/vol/{{inputs.parameters.file}}-skimmed.root {{inputs.parameters.process}} $HOME/histogram/{{inputs.parameters.file}}-histogram-{{inputs.parameters.process}}.root
        ls -l $HOME/histogram
        cp $HOME/histogram/* /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol
      resources:
        limits:
          memory: 2Gi
        requests:
          memory: 1.7Gi
          cpu: 750m

  - name: merge-template
    script:
      image: gcr.io/cern-cms/root-conda-002:higgstautau
      command: [sh]
      source: |
        hadd -f /mnt/vol/histogram.root /mnt/vol/*-histogram-*.root
        ls -l /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol
      resources:
        limits:
          memory: 2Gi
        requests:
          memory: 1.7Gi
          cpu: 750m

  - name: plot-template
    script:
      image: gcr.io/cern-cms/root-conda-002:higgstautau
      command: [sh]
      source: |
        SCALE=0.1
        python plot.py /mnt/vol/histogram.root /mnt/vol $SCALE
        ls -l /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

  - name: fit-template
    script:
      image: gcr.io/cern-cms/root-conda-002:higgstautau
      command: [sh]
      source: |
        python fit.py /mnt/vol/histogram.root /mnt/vol
        ls -l /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Adjust the workflow as follows: replace <NUMBER> in the claimName by your number as before, and replace the image used in the templates (gcr.io/cern-cms/root-conda-002:higgstautau in the example) by the image you built and pushed in the previous episode.
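If you like, sed can again do this for you; the example below assumes that your number is 023 and that your image is called gcr.io/cern-cms/root-conda-023:higgstautau:

sed -i -e 's/<NUMBER>/023/g' -e 's/root-conda-002/root-conda-023/g' higgs-tau-tau-workflow.yaml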

Then execute the workflow and keep your thumbs pressed:

argo submit -n argo --watch higgs-tau-tau-workflow.yaml

Good luck!

Key Points

  • Argo is a powerful tool for running parallel workflows


Cleaning up

Overview

Teaching: 3 min
Exercises: 2 min
Questions
  • How can I delete my cluster and disks?

Objectives
  • Know what needs to be deleted to not be charged.

Two things need to be done: delete the cluster (in the console under Kubernetes Engine - Clusters) and delete the persistent disk that was used for the NFS server (under Compute Engine - Disks).
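If you prefer the command line over the console, the corresponding gcloud commands could look as follows; the cluster name is a placeholder, while the disk name and zone match what was used earlier. Delete the cluster first, since the disk cannot be deleted while it is still attached to the node running the NFS server:

gcloud container clusters delete <CLUSTER-NAME> --zone=us-central1-c
gcloud compute disks delete gce-nfs-disk-<NUMBER> --zone=us-central1-c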

Key Points

  • The cluster and disks should be deleted if not needed anymore.