This lesson is in the early stages of development (Alpha version)

Cloud Computing

Prep-work: Kubernetes Clusters

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Kubernetes?

  • What is a Kubernetes cluster and why do I need one?

Objectives
  • Learn the very basics of Kubernetes

  • Learn a bit about the architecture of a Kubernetes cluster

Introduction

Most of you have been working with Docker containers throughout this workshop. As you know by now, these Docker containers can be seen as isolated machines that run on a host machine in a very efficient way. Your desktop or laptop could actually run several of these containers if enough resources are available. For instance, one could maximize the use of the hardware by running multiple CMSSW open data containers on one desktop. Imagine that you could run 10 CMSSW containers (each over a single ROOT file) at the same time on a single physical machine: skimming a full dataset would then take roughly a tenth of the time.
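To make the idea concrete, here is a rough sketch (not part of this lesson's material; the image name and configuration files are placeholders) of how one could launch several skimming containers in parallel with a simple shell loop:

# Hypothetical sketch: run 10 skim jobs in parallel, one container each,
# assuming some CMSSW image <cmssw-image> and per-file configs skim_cfg_1.py ... skim_cfg_10.py
for i in $(seq 1 10); do
  docker run -d --name skim-${i} <cmssw-image> cmsRun skim_cfg_${i}.py
done

docker ps            # check that the 10 containers are running
docker logs skim-1   # follow the output of the first one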

Now, if you had more machines available, say 3 or 4, and on each of them you could run 10 containers, that would certainly speed up the data processing (especially the first stage in our analysis flow example, which we will see later). However, if you had access to many more machines, a few problems would appear. For instance, how would you install the required software on all those machines? Would there be enough person power to take care of that? How would you take care of, and babysit, all those containers?

The answer to most of these questions is Kubernetes, and, particularly, Kubernetes running on commercial clusters. The Kubernetes (K8s) software is said to be an orchestrator of containers: not necessarily Docker containers, but any container technology that shares the same basic principles.

In this lesson you will learn about using commercial cloud computing resources in order to process CMS open data. Basically, we will introduce you to the art of using a few machines (a cluster of computers) to run your containerized open data analysis workflows and manage them with tools that interact with the Kubernetes software.

K8s architecture

We believe that learning a bit about the K8s architecture will help you better understand what goes on in the next episodes of this lesson. In simple terms, when you access a cloud K8s cluster you are getting access to a cluster of computers run by Kubernetes. These computers comprise one or more main servers (also called the control plane) and some worker nodes. The main servers take care of the bookkeeping and handling of the worker nodes, while the worker nodes are the ones running your containers (spoiler alert: in the K8s abstraction, a running container essentially corresponds to a pod). As a matter of fact, you should think of the K8s cluster not as a cluster of ordinary physical machines, but as a cluster of pods running on physical machines.

Take into account that commercial K8s clusters were built with software and application developers in mind. They are mostly used to run services like web pages with database access, apps, etc., but the technology works well for our needs, i.e., sending batch jobs, perhaps performing some additional steps, and collecting the output.

Please watch this video, which explains the basic Kubernetes cluster architecture with a nice analogy:

Key Points

  • Kubernetes is an orchestrator of containers. It is most useful when run in a cluster of computers.

  • Commercial K8s clusters are a good option for large computing needs.

  • We can run our containerized CMSSW jobs and subsequent analysis workflows in a K8s cluster.


Demo: Creating a cluster

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What are the basic concepts and jargon I need to know?

  • How do I manually create a K8s cluster on the GCP?

Objectives
  • Learn a few words and concepts that will be used during this lesson

  • Learn how to create a K8s cluster from scratch

Introduction

In this demonstration we will show you the very basic way in which you can create a computer cluster (a Kubernetes cluster, to be exact) in the cloud so you can do some data processing and analysis using those resources. In the process we will make sure you learn the jargon. During the hands-on session tomorrow, a cluster similar to this one will be provided to you for the exercises.

Basic concepts

The Google Cloud Platform (GCP)

A place on the web that interfaces the user with all the different services that Google provides on the cloud.

GCP Console

The exact name of the GCP interface where you can explore all the different services that GCP provides. They include, but are not limited to, individual virtual machines, disk storage, Kubernetes clusters, etc.

Google Kubernetes Engine

A Google service to create Kubernetes clusters and run containerized applications and/or jobs/workflows.

Kubernetes (K8s)

Software which orchestrates containers in a computer cluster. You already had a chance to learn about its architecture.

Workflow

A series of sequential operations performed in order to achieve a final result. In our case it could be, for instance:

skimming -> merging output files -> EventLoop analysis of resulting files -> Plotting histograms

In the context of this lesson, workflows are written in YAML files.

Pod

The smallest abstraction layer in a K8s cluster. For all practical purposes, a pod is an abstraction of a container running in the K8s cluster.

Deployment

This is an abstraction layer above pods. In practice, you always create deployments in K8s, not pods.

Argo

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes.

Creating your own cluster on GKE

For the hands-on part of this lesson you will not have to create the cluster yourself; it will already be done for you. For pedagogical reasons, however, we will show an example of how to do it by hand. The settings below should be good, and cheap, enough for CMSSW-related workflows.
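For reference only, and assuming you prefer the command line over the web console, a minimal sketch of creating such a cluster from the Cloud Shell could look like this (the cluster name, zone, number of nodes and machine type are just example values):

# Create a small GKE cluster (example values, adjust to your needs and budget)
gcloud container clusters create my-cluster \
  --zone us-central1-c \
  --num-nodes 4 \
  --machine-type e2-standard-4

# Check that it appears in the list
gcloud container clusters list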

While we wait, let's inspect the Cloud Shell…

Cloud shell

GCP provides an access machine so you can interact with its different services, including our newly created K8s cluster. This machine (and its terminal) is not really part of the cluster; as was said, it is an entry point. From here you can connect to your cluster.

This is something you will have to do only once. For the hands-on part of the lesson, it is likely that you were already authenticated.

The gcloud command

The gcloud command-line interface is the primary CLI tool to create and manage Google Cloud resources. You can use this tool to perform many common platform tasks either from the command line or in scripts and other automations.
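As a quick, optional warm-up in the Cloud Shell (the project ID is a placeholder), you could check your authentication and configuration with:

gcloud auth list                        # the account(s) you are authenticated with
gcloud config list                      # current project, region/zone, etc.
gcloud config set project <PROJECT_ID>  # select the project, if needed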

Connect to your cluster

Once the cluster is ready (a green check-mark should appear next to it in the console), you can connect to it.
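Pressing Connect in the console will offer you the exact command to run; essentially, it is a gcloud call like the following sketch (cluster name and zone are placeholders):

# Fetch the cluster credentials so that kubectl can talk to it;
# this is what the console's Connect button runs for you
gcloud container clusters get-credentials <CLUSTER_NAME> --zone us-central1-c

# Verify the connection
kubectl get nodes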

Key Points

  • It takes just a few clicks to create your own K8s cluster


Kubectl and additional tools and services

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What is kubectl?

  • What is Argo workflows?

  • What kind of services/resources will I need to instantiate in my cluster?

Objectives
  • Learn what the kubectl command can do

  • Appreciate the necessity for the Argo Workflows tool (or similar)

  • Learn how to set up different services/resources to get the most out of your cluster

The kubectl command

Just as gcloud is the one command to rule them all for the GCP, the kubectl command is the main tool for interacting with your K8s cluster. You will use it to do essentially anything in the cluster. Here is the official cheat sheet, which is very useful but already quite long.

Let’s run a few examples.

kubectl get nodes  
kubectl cluster-info  # Display addresses of the master and services

Let’s list some kubernetes components:

kubectl get pod
kubectl get services

We don’t have much going on. Let’s create some components.

kubectl create -h

Note that pod is not in the list of resources you can create: in K8s you don't create pods directly but deployments, which in turn create and manage the pods under the hood.

kubectl create deployment mynginx-depl --image=nginx

The nginx image will be pulled down from the Docker Hub. This is the most minimalist way of creating a deployment.

kubectl get deployment
kubectl get pod
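If you want to poke around a bit further (optional; replace <POD_NAME> with the pod name printed by the previous command), you can inspect what the deployment created:

kubectl describe deployment mynginx-depl   # details of the deployment and its replicas
kubectl describe pod <POD_NAME>            # image, node and events of the pod
kubectl logs <POD_NAME>                    # stdout/stderr of the container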

Yaml Files

Another way of creating components in a K8s cluster is through YAML files. These are intuitive, logical and configurable, although very picky about indentation.

Let’s take a look at one of these files, nginx-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.16
        ports:
        - containerPort: 8080

Let’s delete our previous deployment and deploy using the yaml file:

kubectl delete deployment mynginx-depl
kubectl apply -f nginx-deployment.yaml
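As a quick sanity check (optional), you can verify that the two replicas requested in the file are coming up; the names and the app=nginx label come from the manifest above:

kubectl get deployment nginx-deployment
kubectl get pods -l app=nginx                        # pods selected by the label in the yaml
kubectl rollout status deployment/nginx-deployment   # wait until the deployment is ready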

Namespaces

Namespaces are a kind of reservation (a logical partition) in your K8s cluster. Let's create one for the Argo workflows we will use:

kubectl create ns <NAMESPACE>
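A few optional, related commands, just as a sketch:

kubectl get namespaces                                         # list the existing namespaces
kubectl get pods -n <NAMESPACE>                                # scope a query to one namespace
kubectl config set-context --current --namespace=<NAMESPACE>   # optionally make it the default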

Argo

While jobs can also be run manually, a workflow engine makes defining and submitting jobs easier. In this tutorial we use Argo Workflows; following its quick start page, install it with:

kubectl create clusterrolebinding YOURNAME-cluster-admin-binding --clusterrole=cluster-admin --user=YOUREMAIL@gmail.com
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml

Download argo CLI:

# Download the binary
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.1.2/argo-linux-amd64.gz

# Unzip
gunzip argo-linux-amd64.gz

# Make binary executable
chmod +x argo-linux-amd64

# Move binary to path
sudo mv ./argo-linux-amd64 /usr/local/bin/argo

# Test installation
argo version

Run a simple test flow:

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml
argo list -n argo
argo get -n argo @latest
argo logs -n argo @latest

Storage volumes

If we run an application or workflow, we usually need some disk space to which we can write our results. There is no persistent disk by default; we have to create it.

You could create a disk by clicking through the web interface above, but let's do it faster on the command line.

Create the volume (disk) we are going to use:

gcloud compute disks create --size=100GB --zone=us-central1-c gce-nfs-disk-1

Set up an nfs server for this disk:

wget https://cms-opendata-workshop.github.io/workshop2021-lesson-cloud/files/001-nfs-server.yaml
kubectl apply -n argo -f 001-nfs-server.yaml

Set up an nfs service, so we can access the server:

wget https://cms-opendata-workshop.github.io/workshop2021-lesson-cloud/files/002-nfs-server-service.yaml
kubectl apply -n argo -f 002-nfs-server-service.yaml

Let’s find out the IP of the nfs server:

kubectl get -n argo svc nfs-server |grep ClusterIP | awk '{ print $3; }'
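Equivalently (just a convenience, not a requirement), kubectl can extract that field directly with a jsonpath expression:

kubectl get svc -n argo nfs-server -o jsonpath='{.spec.clusterIP}'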

Let’s create a persisten volume out of this nfs disk. Note that persisten volumes are not namespaced they are available to the whole cluster.

We need to write that IP number above into the appropriate place in this file:

wget https://cms-opendata-workshop.github.io/workshop2021-lesson-cloud/files/003-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-1
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <Add IP here>
    path: "/"

Deploy:

kubectl apply -f 003-pv.yaml

Check:

kubectl get pv

Apps can claim persistent volumes through persistent volume claims (pvc). Let’s create a pvc:

wget https://cms-opendata-workshop.github.io/workshop2021-lesson-cloud/files/003-pvc.yaml
kubectl apply -n argo -f 003-pvc.yaml

Check:

kubectl get pvc -n argo

Now an Argo workflow could claim and access this volume with a configuration like:

# argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<NUMBER>
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the ouput" > /mnt/vol/test.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Now the workflow can access the disk. However, it is rather cumbersome to get the output out of the cluster. One of the best ways to do that is to follow a similar procedure to set up, for example, an http web server.

As you can see all these tasks take time. The good news is that there are tools to make the cluster creation and the setting up of volumes, pvs, pvcs, servers, services, etc., automatic. This is what we will be doing for you (using Terraform) for the upcoming hands-on part of this lesson.

You will continue learning about this, hands-on, tomorrow, but concentrating on analysis workflows.

Key Points

  • kubectl is the ruler of GKE

  • Argo is a very useful tool for running workflows and parallel jobs

  • To be able to write, read and extract data, a few services/resources need to be set up on the GCP


See you tomorrow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

See you tomorrow for the hands-on activity.

Key Points


Hands-on: Accessing your K8s cluster

Overview

Teaching: 0 min
Exercises: 45 min
Questions
  • How do I join the GCP project for these hands-on activities?

  • What GCP resources do I have?

  • How do I connect to my cluster?

Objectives
  • Log into the CMS-opendata project on the GCP

  • Identify the different resources and tools that come pre-installed with your cluster

  • Learn how to connect to your cluster

Introduction

Sign-in instructions

As it is possible to automate the process of creating resources in the cloud, we have built all the required resources for this tutorial in a GCP Project.

Here you will find some instructions on how to join this project, called CMS-opendata, on the GCP. Please do follow that procedure in order for you to have access to your K8s cluster.

Your K8s cluster

Every one of you will have your own Kubernetes cluster already created. Remember we went through the trouble, yesterday, of creating the cluster by hand. We did not even finish because all the needed elements require some time (and knowledge) to be set up. The good news is that we have taken care of that for you.

Your cluster contains all the toys that you need to run an analysis workflow and extract the output. The most relevant are access to an abstraction of disk space, called a pvc (persistent volume claim), and an http fileserver that will allow you to access, from outside the cluster, the files you will be producing on this disk. Also, some processes (in pods) belonging to the Argo tool we mentioned yesterday are already running on your cluster. These will allow you to execute workflows for your analysis.

Each participant has a special ID consisting of the first 7 characters [0-9a-z] of the participant's email. All the resources you will see later reflect this fact:

Resources

  • gce-nfs-disk-<ID>

cluster-<ID> (namespace “argo”)

  • Deployment/Service: nfs-server-<ID>
  • Deployment/Service: http-fileserver-<ID>
  • Deployment/Service: argo-server-<ID>
  • PV/PVC: nfs-<ID>

For example, jo.hn2000@cern.ch would have a cluster-john200 in the namespace argo and a process called nfs-server-john200 running on their cluster.

Remember, namespaces are reservations inside a K8s cluster. Argo gets installed within the argo namespace. Other than that, this will not influence our work in any visible way.


Connect to your cluster

First, notice, by looking at the top of your web browser, that you are already in the GCP CMS-opendata project.

To connect to your cluster we are going to use the GCP CloudShell. Head to GCP > Kubernetes Engine > Clusters:



Remember

You can always use the filter with your ID to list only your resources



Find your cluster, press Connect and choose RUN IN CLOUD SHELL in the window that pops up. Then hit enter to execute the gcloud command that will connect you to your cluster.

Press Authorize if prompted to do so.


Getting to know your cluster

Let’s go to the Cloud Shell terminal and run some kubectl commands to verify that all the different resources are actually running on your cluster.

If you get disconnected from the terminal, just hit the Reconnect button.

Check the pods that are running under the namespace argo:

kubectl get pods -n argo

If you forget the namespace declaration, -n argo, you will get empty results. This is because all the elements were created under this namespace (with the exception of the pv, which is not namespaced).

NAME                                       READY   STATUS    RESTARTS   AGE
argo-server-edgarfe-64444f4574-brq7d       1/1     Running   2          10h
http-fileserver-edgarfe-5f8468d9fb-nvljz   1/1     Running   0          10h
minio-edgarfe-58ff684dbd-bbt5t             1/1     Running   0          10h
nfs-server-edgarfe-7ccc5c5fb8-gtbwb        1/1     Running   0          10h
postgres-6b5c55f477-5hqkj                  1/1     Running   0          10h
workflow-controller-7fb47d49bb-6l6wc       1/1     Running   2          10h

Note the two Argo components: the argo-server and its workflow-controller. They are there, we presume, to make sure this service is always available to listen for and deal with the workflows a client might request.

Now, let’s check the persistent volume, pv, that was created globally (it is not namespaced) as an abstraction for the disk storage infrastructure:

kubectl get pv
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
nfs-edgarfe   100Gi      RWX            Retain           Bound    argo/nfs-edgarfe   standard                12h

Our workflows will be claiming this persistent volume through a persistent volume claim, pvc, that has been also created for us in our argo namespace. Let’s check:

kubectl get pvc -n argo
NAME          STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-edgarfe   Bound    nfs-edgarfe   100Gi      RWX            standard       10h

The Bound STATUS reflects the fact that it is linked to the pv. A pvc can claim all the disk space available in the pv or only a fraction of it. In this case, we are claiming the whole disk: 100Gi.

Take note of the NAME of the pvc as this will need to be referenced when trying to write to the disk it is linked to. Make sure you replace the name of the examples in this tutorial with the name of your own pvc.

Check the services running and the associated IP addresses:

kubectl get svc -n argo
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
argo-server-edgarfe           LoadBalancer   10.122.199.159   34.88.193.217   2746:30307/TCP               10h
http-fileserver-edgarfe       LoadBalancer   10.122.206.111   35.228.53.183   80:32703/TCP                 10h
minio-edgarfe                 ClusterIP      10.122.201.144   <none>          9000/TCP                     10h
nfs-server-edgarfe            ClusterIP      10.122.206.212   <none>          2049/TCP,20048/TCP,111/TCP   10h
postgres                      ClusterIP      10.122.202.183   <none>          5432/TCP                     10h
workflow-controller-metrics   ClusterIP      10.122.199.13    <none>          9090/TCP                     10h

Take note of the EXTERNAL-IP address for the http-fileserver as this is the address you will point your browser to in order to access any output files written to the mounted volumes.

You can point your browser to this IP address right now. You will find a familiar but empty tree structure. Your files will arrive there once they are written to the volume(s) mounted with the pvc.
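If you prefer the command line, a quick check from the Cloud Shell could look like this (replace the placeholder with your own EXTERNAL-IP):

curl http://<EXTERNAL-IP>/    # should return the (for now empty) directory listing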

To remember

When exposing a server like this, with a public IP address, you get charged on GCP for the outgoing and incoming traffic.

Also take note of the EXTERNAL-IP address for the argo-server.

If you point your browser to that IP address, formatted in this way: https://34.88.193.217:2746 (replace with your own IP; the port is always 2746), you will get access to the Argo GUI. It is a nice application to monitor your workflows. We will check it out later.
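As an optional alternative (a sketch, assuming you have gcloud and kubectl configured for this cluster on your own machine), you can reach the Argo server without using its public IP by port-forwarding the service:

kubectl -n argo port-forward svc/argo-server-<ID> 2746:2746
# then browse to https://localhost:2746 on that machine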

Do not mind the minio and postgres processes; they come with the setup, but we are not going to use them.

Key Points

  • Once a cluster is created on the Kubernetes Engine, it is really simple to explore its elements.

  • Connection to the cluster can be achieved using the Cloud Shell.


Break

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Let’s go for a break!

Key Points


Hands-on: Running and understanding Argo workflows

Overview

Teaching: 0 min
Exercises: 40 min
Questions
  • How are Argo workflows configured?

  • What are the main ingredients of an Argo yaml workflow file?

Objectives
  • Learn about the basic structure of Argo yaml workflow files

  • Learn how to put together different steps in a workflow and make them run in series

Run a simple Argo workflow

In order to understand the logic of YAML files, and to watch a simple Argo workflow take place, let's use this simple example. Download it to your Cloud Shell terminal with:

wget https://cms-opendata-workshop.github.io/workshop2021-lesson-cloud/files/argo-wf-volume.yaml

Edit it with nano or, even better, using the built-in editor of the console (just click on the Open Editor button). Change the <ID> in the nfs-<ID> string to yours (remember, this nfs-<ID> is the name of the pvc we are writing our output to):

#argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<ID>
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the ouput" > /mnt/vol/test.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Before we submit this workflow, let’s try to understand what this configuration is doing.

Indentation

YAML files are very picky about indentation: they have to be, since indentation is how they organize their structure. Everything aligned under a certain level corresponds to the same organizational block. Be careful not to accidentally perturb this structure.

If you start reading it from top to bottom:

The spec is the blueprint of the whole deployment. It starts at an entrypoint, which is literally the place where the workflow will start processing routines. In this case, the entrypoint is test-hostpath (you can give any name to these variables, but they have to be consistent with the logic of the YAML). We have to jump down to find its definition under templates. You may recognize that this template is a template of a routine or action. Templates are identified by their names, and usually have a script, which contains the commands that will be executed. This script needs to run inside a container (after all, K8s is all about containers), so it specifies the corresponding container image to be pulled. In the present case, the container will be just a simple alpine:latest Linux container. Whatever commands are under the source tag will be executed inside this container. Also, this template has a volumeMounts section, which works very similarly to what we have been doing with our own containers, i.e., mounting whatever persistentVolumeClaim disk was defined in the volumes section of the spec, in this case nfs-<ID> (our pvc, remember?), at the /mnt/vol mountPath directory.

apiVersions tend to change from time to time, so, occasionally, that will render your YAML config file obsolete and you will have to update the api information.
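If you ever suspect such a mismatch, you can ask the cluster which API versions and custom resources it actually serves; for the Argo installation in our cluster this could look like:

kubectl api-versions | grep argoproj    # should list argoproj.io/v1alpha1 if Argo is installed
kubectl get crd | grep argoproj         # the custom resource definitions added by Argo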

Let’s submit our workflow:

argo submit -n argo argo-wf-volume.yaml

Yep, it fails:

-bash: argo: command not found

This is because, even though we have the Argo server and its workflow-controller running, we don’t have the CLI to interact with it. Let’s download Argo CLI:

# Download the binary
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.1.2/argo-linux-amd64.gz

# Unzip
gunzip argo-linux-amd64.gz

# Make binary executable
chmod +x argo-linux-amd64

# Move binary to path
sudo mv ./argo-linux-amd64 /usr/local/bin/argo

# Test installation
argo version

You will see something like:

argo: v3.1.2
  BuildDate: 2021-07-15T21:53:44Z
  GitCommit: 98721a96eef8e4fe9a237b2105ba299a65eaea9a
  GitTreeState: clean
  GitTag: v3.1.2
  GoVersion: go1.15.7
  Compiler: gc
  Platform: linux/amd64

Ok, now we are truly ready to submit the workflow. Let’s try again:

argo submit -n argo argo-wf-volume.yaml

If you check the status with:

argo list -n argo

you will first see something like:

NAME                  STATUS    AGE   DURATION   PRIORITY
test-hostpath-r5wpg   Running   22s   22s        0

but then

NAME                  STATUS      AGE   DURATION   PRIORITY
test-hostpath-r5wpg   Succeeded   29s   24s        0

when it has finished.

If you were paying attention, this workflow should have written a test.txt file to our disk (which is visible externally through our http fileserver). Aren't you eager to check whether you were able to produce it successfully? Point your browser to the corresponding IP address (the EXTERNAL-IP of your http-fileserver). In our case, we succeeded!
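You can also double-check from the terminal; a possible sketch (the IP is the EXTERNAL-IP of your http-fileserver):

argo logs -n argo @latest             # the lines echoed by the script
curl http://<EXTERNAL-IP>/test.txt    # the file as served by the http fileserver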

Let’s also the check the Argo GUI:

If you click on the name, you can see a sort of workflow diagram. Ok, in this case our workflow was more like a single task. Let’s make things a little bit more interesting.

Let’s spice things up

Now we are going to run a more serious workflow. What we will be doing here is to mimic a full analysis flow (so, a full workflow) of essentially what we have already done during the workshop.

We will start by automatically getting the list of files from the CERN Open Data portal with the containerized cernopendata-client tool.

Then we will generate a list of the files that we want to process; we will then run a version of the POET code, merge the output files into a single ROOT file and, finally, run a simplified version of the EventLoopAnalysisTemplate to obtain a test histogram and check that everything went well.

Download the YAML file with:

wget https://raw.githubusercontent.com/cms-opendata-workshop/workshop2021-poetpayload-cloud/master/PhysObjectExtractor/cloud/workshop_argo.yaml

Inspect it carefully and try to make sense of its logic. This is a good time to ask the experts in the room!

Submit your workflow and get the results.
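Based on the commands we have used so far, submitting and monitoring could look something like this:

argo submit -n argo --watch workshop_argo.yaml   # submit and watch the steps complete
argo get -n argo @latest                         # summary of the latest workflow
argo logs -n argo @latest                        # logs from all the steps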

You can open the http fileserver assigned to your cluster to see the histogram file and to download the merged output file poetoutput.root.

Will you be able to change the workflow file to process the dataset of your choice, or the number of files of your choice, etc.? Note that, to run over all events in each file, you can set the number of events to -1.

Key Points

  • With a simple but strict YAML structure, a full-blown analysis can be performed using the Argo tool on a K8s cluster.