  • What is Kubernetes?

  • What is a Kubernetes cluster and why do I need one?

  • What is Argo?

  • Learn the very basics of Kubernetes

  • Learn a bit about the architecture of a Kubernetes cluster

  • Learn the very basics of Argo


Kubernetes is a container orchestration system open sourced by Google. Its main purpose is to schedule services to run on a cluster of computers while abstracting away the existence of the cluster from the services. Kubernetes is now maintained by the Cloud Native Computing Foundation, which is a part of the Linux Foundation. Kubernetes can flexibly handle replication, impose resource limits, and recover quickly from failures.

Most of you have been working with Docker containers throughout this workshop. As you now know, these containers behave like isolated machines that run very efficiently on a host machine. Your desktop or laptop can run several of them at once if enough resources are available. For instance, you could make the most of your hardware by running multiple CMSSW open data containers on one desktop. If you could run 10 CMSSW containers at the same time on a single physical machine, each processing a single ROOT file, skimming a full dataset would take roughly one tenth of the time.

Now, if you had more machines available, let’s say 3 or 4, and could run 10 containers on each of them, that would certainly speed up the data processing (especially the first stage in our analysis flow example, which we will see later). With access to many more machines, however, a few problems appear. For instance, how would you install the required software on all those machines? Would there be enough personpower to take care of that? How would you look after and babysit all those containers?
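The scaling argument above can be put into numbers with a small shell function. This is only a back-of-the-envelope sketch: the dataset size (120 files) and the processing time per file (5 minutes) are made-up values, and it assumes files are spread evenly over containers with no overhead.

```shell
# Estimate the wall-clock time (in minutes) to process a dataset when files
# are spread evenly over all containers. All numbers below are hypothetical.
estimate_wall_minutes() {
  local n_files=$1 minutes_per_file=$2 n_machines=$3 containers_per_machine=$4
  local slots=$(( n_machines * containers_per_machine ))
  # Ceiling division: the busiest container handles ceil(n_files / slots) files.
  local files_per_slot=$(( (n_files + slots - 1) / slots ))
  echo $(( files_per_slot * minutes_per_file ))
}

estimate_wall_minutes 120 5 1 1    # one container: 600
estimate_wall_minutes 120 5 1 10   # ten containers on one machine: 60
estimate_wall_minutes 120 5 4 10   # four machines, ten containers each: 15
```

In practice the gain is smaller than this ideal estimate, because containers on the same machine compete for CPU, memory and disk.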

Kubernetes (K8s)

A Kubernetes cluster consists of “master” nodes and “worker” nodes. In short, master nodes share state to manage the cluster and schedule jobs to run on workers. It is considered best practice to run an odd number of masters.

Kubernetes Components

When you deploy Kubernetes, you get a cluster. A Kubernetes cluster consists of a set of worker machines, called nodes. The worker nodes host the Pods that are the components of the application workload.

Nodes Components


Kubernetes supports autoscaling to optimise your nodes’ resources as well as to adjust CPU and memory to meet your application’s real usage. When you need to save some money, you can scale down. You probably want to pay only for what you use and keep resources only while you need them. If you want to learn about pricing, check the next link: Google-cloud

  • Kubernetes is an orchestrator of containers. It is most useful when it is run in a cluster of computers.

  • Commercial K8s clusters are a good option for large computing needs.

  • We can run our containerized CMSSW jobs and subsequent analysis workflows in a K8s cluster.

Getting started with Argo and Kubectl


  • How do I use kubectl commands?

  • What is kubectl?

  • What is Argo Workflows?

  • Learn what the kubectl command can do

  • Appreciate the necessity for the Argo workflows tool (or similar)

  • Learn how to set up different services/resources to get the most out of your cluster

The kubectl command

Kubernetes provides a command-line tool, kubectl, for communicating with a Kubernetes cluster’s control plane using the Kubernetes API. Use the following syntax to run kubectl commands from your terminal window:

kubectl [command] [TYPE] [NAME] [flags]

where command, TYPE, NAME, and flags are:

command: Specifies the operation that you want to perform on one or more resources, for example create, get, describe, delete.
TYPE: Specifies the resource type.
NAME: Specifies the name of the resource.
flags: Specifies optional flags.
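As an illustration of how the four parts fit together, the sketch below assembles one command line from them and prints it instead of running it. The pod name my-pod is a hypothetical placeholder; -n is the standard kubectl flag for selecting a namespace.

```shell
# Assemble a kubectl invocation from its four parts and print it.
# The pod name "my-pod" is a hypothetical placeholder.
op="describe"        # command: the operation to perform
type="pod"           # TYPE: the resource type
name="my-pod"        # NAME: the resource name
flags="-n default"   # flags: here, the namespace to look in
kubectl_line="kubectl $op $type $name $flags"
echo "$kubectl_line"
```

Printing the line first is a cheap way to check a command before pointing it at a real cluster.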

For installation instructions, see Installing kubectl; for a quick guide:

Let’s run a few examples.

kubectl get nodes  
kubectl cluster-info  # Display addresses of the master and services

Let’s list some Kubernetes components:

kubectl get pod
kubectl get services

We don’t have much going on. Let’s create some components.

kubectl create -h

Note that pod is not among the resource types that kubectl create offers: in K8s you usually don’t create pods directly but deployments. A deployment creates its pods for you, and they run under the hood.

kubectl create deployment mynginx-depl --image=nginx

The nginx image will be pulled from Docker Hub. This is the most minimalist way of creating a deployment.

kubectl get deployment
kubectl get pod
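To see which pods a deployment created, one option is to wait for the rollout and then select pods by label. The helper below is a sketch, not part of the lesson: the function name is made up, and it relies on kubectl create deployment labelling its pods with app=<deployment name>, which is its default behaviour.

```shell
# Wait for a deployment to finish rolling out, then list the pods it owns.
# Assumes the label app=<deployment name>, which "kubectl create deployment"
# sets by default. The function name is hypothetical.
show_deployment_pods() {
  local name=$1
  kubectl rollout status "deployment/$name" --timeout=120s &&
    kubectl get pods -l "app=$name" -o wide
}

# Usage (requires access to a cluster):
#   show_deployment_pods mynginx-depl
```

The pod names in the output start with the deployment name followed by generated suffixes, which shows that the pods belong to the deployment.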


Argo is a collection of open source tools that let us extend the functionality of Kubernetes. Using Argo brings several benefits.

Argo as a workflow engine

While jobs can also be run manually, a workflow engine makes defining and submitting jobs easier. In this tutorial, we use Argo. Follow the Argo quick start page to install it into your working environment (all commands are to be entered into the cloud shell).


Namespaces are a kind of reservation in your K8s cluster. Let’s create one for the Argo workflows we will use:

kubectl create ns <NAMESPACE>

Kubernetes namespaces

The above commands as well as most of the following use a flag -n argo, which defines the namespace in which the resources are queried or created. Namespaces separate resources in the cluster, effectively giving you multiple virtual clusters within a cluster.

You can change the default namespace to argo as follows:

kubectl config set-context --current --namespace=argo
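The two namespace steps can be combined into one helper that is safe to run repeatedly. This is a sketch under the assumption that you want the namespace created if it is missing; the function name is made up, while the kubectl subcommands and the jsonpath query are standard.

```shell
# Create a namespace only if it does not exist yet, make it the default for
# the current context, and print the namespace now in effect.
# The function name is hypothetical.
use_namespace() {
  local ns=$1
  kubectl get namespace "$ns" >/dev/null 2>&1 || kubectl create namespace "$ns"
  kubectl config set-context --current --namespace="$ns"
  kubectl config view --minify --output 'jsonpath={..namespace}'
}

# Usage (requires access to a cluster):
#   use_namespace argo
```

Once the default namespace is set, the -n argo flag can be omitted from later commands, although keeping it makes each command self-explanatory.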

  • kubectl is the ruler of GKE

  • Argo is a very useful tool for running workflows and parallel jobs

  • To be able to write, read and extract data, a few services/resources need to be set up on the GCP

Demo: Creating a cluster


  • What are the basic concepts and jargon I need to know?

  • How do I manually create a K8s cluster on GCP?

  • Learn how to create a K8s cluster from scratch


In this demonstration we will show you the very basic way in which you can create a computer cluster (a Kubernetes cluster to be exact) in the cloud so you can do some data processing and analysis using those resources.  In the process we will make sure you learn about the jargon.  During the hands-on session of the workshop (cloud computing), a cluster similar to this one will be provided to you for the exercises.  

If needed you can watch a walkthrough here:


If you’ve opted to use minikube, creating a Kubernetes cluster is as easy as running:

minikube start

Feel free to skip to the next episode.

Creating your own cluster on GKE

For the hands-on part of this lesson you will not have to create the cluster yourself; it will already be done for you. For pedagogical reasons, however, we will show an example of how to do it by hand. The settings below should be good and cheap enough for CMSSW-related workflows.

Here you will find some instructions on how to use preemptible machines in GKE. Preemptible VMs offer similar functionality to Spot VMs, but only last up to 24 hours after creation. Keep this in mind if reducing costs is a concern.

  • It takes just a few clicks to create your own K8s cluster
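A cluster on preemptible nodes can also be created from the command line instead of the web console. The sketch below is not the workshop’s own setup: the function name, cluster name, machine type and node count are assumptions chosen for illustration, and the zone is simply the one used in the disk example later in this lesson.

```shell
# Sketch: create a small GKE cluster on preemptible nodes from the command line.
# The cluster name, machine type and node count are hypothetical defaults.
create_preemptible_cluster() {
  local name=${1:-my-cms-cluster} zone=${2:-us-central1-c}
  gcloud container clusters create "$name" \
    --zone "$zone" \
    --machine-type e2-standard-4 \
    --num-nodes 2 \
    --preemptible &&
    # Point kubectl at the new cluster.
    gcloud container clusters get-credentials "$name" --zone "$zone"
}

# Usage (creates billable resources in your GCP project):
#   create_preemptible_cluster my-cms-cluster us-central1-c
```

Because preemptible nodes can be reclaimed at any time, they suit jobs that can be restarted, such as independent skimming jobs over single files.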

Demo: Storing a workflow output on Kubernetes


  • How do I set up a workflow engine to submit jobs?

  • How do I run a simple job?

  • How can I set up shared storage for my workflows?

  • How do I run a simple job and get the output?

  • Understand how to run a simple workflow in a commercial cloud environment or on a local machine

  • Understand how to set up shared storage and use it in a workflow

Storage Volume

When we run an application or workflow, we usually need disk space where we can dump our results. Unlike on GKE, our local machine is the persistent disk by default. So let's create a persistent volume out of this local disk. Note that persistent volumes are not namespaced; they are available to the whole cluster.

nano pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"


kubectl apply -f pv.yaml -n argo


kubectl get pv

Apps can claim persistent volumes through persistent volume claims (pvc). Let’s create a pvc:

nano pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi


kubectl apply -f pvc.yaml -n argo


kubectl get pvc -n argo
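Before using the claim it is worth checking that it has been bound to the volume. A minimal sketch, with a made-up function name, that prints only the claim’s phase using a jsonpath query:

```shell
# Print the phase of a PersistentVolumeClaim ("Pending" or "Bound").
# The function name is hypothetical; the jsonpath query is standard kubectl.
pvc_phase() {
  local claim=$1 ns=${2:-argo}
  kubectl get pvc "$claim" -n "$ns" -o jsonpath='{.status.phase}'
}

# Usage (requires access to a cluster):
#   pvc_phase task-pv-claim argo
```

A claim that stays Pending usually means no volume matches its storage class, access mode or requested size.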

Now an Argo workflow could claim and access this volume with a configuration like:

# argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: workdir
      hostPath:
        path: /mnt/data
        type: DirectoryOrCreate
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the ouput" > /mnt/vol/test.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: workdir
        mountPath: /mnt/vol

Submit and check this workflow with:

argo submit -n argo argo-wf-volume.yaml
argo list -n argo

Take the name of the workflow from the output (replace XXXXX in the following command) and check the logs:

kubectl logs pod/test-hostpath-XXXXX  -n argo main

Once the job is done, you will see something like:

time="2022-07-25T05:51:14.221Z" level=info msg="capturing logs" argo=true
ls -l /mnt/vol: total 4 -rw-rw-rw- 1 root root 18 Jul 25 05:51 test.txt
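As an alternative to copying the generated workflow name by hand, the argo CLI accepts @latest as a shorthand for the most recently submitted workflow. The helper below is a sketch with a made-up function name; it waits for the workflow to finish and then prints its summary and logs.

```shell
# Wait for the most recent workflow in a namespace, then show its summary
# and logs. The function name is hypothetical.
show_latest_workflow() {
  local ns=${1:-argo}
  # Block until the workflow finishes, then print its status and pod logs.
  argo wait -n "$ns" @latest &&
    argo get -n "$ns" @latest &&
    argo logs -n "$ns" @latest
}

# Usage (requires the argo CLI and access to a cluster):
#   show_latest_workflow argo
```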

Get the output file

The example job above produced a text file as output. It resides in the persistent volume that the workflow job wrote to. To copy the file from that volume to the cloud shell, we will define a container, a “storage pod”, and mount the volume there so that we can access it.

Create a file pv-pod.yaml with the following contents:

# pv-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: busybox
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /mnt/data
          name: task-pv-storage

Create the storage pod and copy the files from there

kubectl apply -f pv-pod.yaml -n argo
kubectl cp  task-pv-pod:/mnt/data /tmp/poddata -n argo

and you will get the file created by the job in /tmp/poddata/test.txt.
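If the copy is attempted right after the pod is created, it can fail because the container is not running yet. One way around this, sketched below with a made-up function name, is to wait for the pod to become Ready before copying, and to delete the storage pod afterwards (an optional step, assuming you no longer need it).

```shell
# Wait for the storage pod, copy the volume contents, then remove the pod.
# The function name is hypothetical; the defaults match the names used above.
fetch_volume_contents() {
  local pod=${1:-task-pv-pod} dest=${2:-/tmp/poddata} ns=${3:-argo}
  kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout=120s &&
    kubectl cp "$pod:/mnt/data" "$dest" -n "$ns" &&
    kubectl delete pod "$pod" -n "$ns"
}

# Usage (requires access to a cluster):
#   fetch_volume_contents task-pv-pod /tmp/poddata argo
```

Deleting the pod does not delete the data: the files live in the persistent volume, not in the pod.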

  • With Kubernetes one can run workflows similar to a batch system

  • Open Data workflows can be run in a commercial cloud environment using modern tools

Demo: Deploy a Webserver


  • How can I visualize my workflows?

  • How do I deploy my Argo GUI?

  • Prepare to deploy the fileserver that mounts the storage volume.

  • Submit your workflow and get the results.


This episode is relevant when working on the Google Kubernetes Engine, as will be done during the hands-on session of the workshop. If you are going through these pre-exercises on minikube, just read this as part of your information, but do not work through it.

Accessing files via http

With the storage pod, you can copy files between the storage element and the CloudConsole. However, a practical use case would be to run the “big data” workloads in the cloud, and then download the output to your local desktop or laptop for further processing. An easy way of making your files available to the outside world is to deploy a webserver that mounts the storage volume.

We first patch the config of the webserver to be created as follows:

mkdir conf.d
cd conf.d
curl -sLO
cd ..
kubectl create configmap basic-config --from-file=conf.d -n argo

Then prepare to deploy the fileserver by downloading the manifest:

curl -sLO
# deployment-http-fileserver.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    service: http-fileserver
  name: http-fileserver
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      service: http-fileserver
  template:
    metadata:
      labels:
        service: http-fileserver
    spec:
      volumes:
      - name: volume-output
        persistentVolumeClaim:
          claimName: nfs-1
      - name: basic-config
        configMap:
          name: basic-config
      containers:
      - name: file-storage-container
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
          - mountPath: "/usr/share/nginx/html"
            name: volume-output
          - name: basic-config
            mountPath: /etc/nginx/conf.d

Apply and expose the port as a LoadBalancer:

kubectl create -n argo -f deployment-http-fileserver.yaml
kubectl expose deployment http-fileserver -n argo --type LoadBalancer --port 80 --target-port 80

Exposing the deployment will take a few minutes. Run the following command to follow its status:

kubectl get svc -n argo

You will initially see a line like this:

NAME                          TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
http-fileserver               LoadBalancer    <pending>     80:30539/TCP   5s
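Instead of rerunning the command by hand, the external IP can be polled until it is assigned. The helper below is a sketch: the function name, the 10-second polling interval and the limit of 30 attempts are arbitrary choices, while the jsonpath expression reads the standard status field of a LoadBalancer service.

```shell
# Poll a LoadBalancer service until it gets an external IP, then print it.
# The function name, polling interval and attempt limit are hypothetical.
wait_for_external_ip() {
  local svc=${1:-http-fileserver} ns=${2:-argo} ip="" attempt
  for attempt in $(seq 1 30); do
    ip=$(kubectl get svc "$svc" -n "$ns" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    sleep 10
  done
  echo "no external IP assigned after 30 attempts" >&2
  return 1
}

# Usage (requires access to a cluster):
#   wait_for_external_ip http-fileserver argo
```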

The <pending> EXTERNAL-IP will update after a few minutes (run the previous command again to check). Once it’s there, copy the IP and paste it into a new browser tab. This should welcome you with a “Hello from NFS” message. In order to enable file browsing, we need to delete the index.html file in the pod. Determine the pod name using the first command listed below and adjust the second command accordingly.

kubectl get pods -n argo
kubectl exec http-fileserver-XXXXXXXX-YYYYY -n argo -- rm /usr/share/nginx/html/index.html

Warning: anyone can now access these files

This IP is now accessible from anywhere in the world, and therefore also your files (mind: there are charges for outgoing bandwidth). Please delete the service again once you have finished downloading your files.

kubectl delete svc/http-fileserver -n argo

To expose it again later, rerun the kubectl expose deployment command shown above.

Argo GUI

Check the services running and the associated IP addresses:

kubectl get svc -n argo
kubectl -n argo port-forward deployment/argo-server 2746:2746

Once it has started forwarding the port, we have to manually enable the port. To do this, open a new cloud shell tab and run the following command:

lynx https://localhost:2746

Access it and then quit. Return to the previous tab and you will see that the port is being accessed and handled. You can exit with ^C and finally patch the service with:

kubectl patch svc argo-server -n argo -p '{"spec": {"type": "LoadBalancer"}}'

Since this creates an external IP, wait a couple of minutes. You can check whether it is ready with:

kubectl get svc -n argo

  • With a simple but strict YAML structure, a full-blown analysis can be performed using the Argo tool on a K8s cluster.

Cleaning up


  • How do I clean my workspace?

  • How do I delete my cluster?

  • Clean my workflows

  • Delete my storage volume

Cleaning workspace

Remember to delete your workflows to avoid additional charges. Run this until you get a message indicating there are no more workflows:

argo delete -n argo @latest
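Rather than repeating the command once per workflow, the argo CLI can delete every workflow in a namespace in one go with its --all flag. A sketch, with a made-up function name, that also lists what remains afterwards:

```shell
# Delete all workflows in a namespace, then list what is left.
# The function name is hypothetical.
delete_all_workflows() {
  local ns=${1:-argo}
  argo delete -n "$ns" --all &&
    argo list -n "$ns"
}

# Usage (requires the argo CLI and access to a cluster):
#   delete_all_workflows argo
```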

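Delete the argo namespace and all yaml files and configurations with the commands below. Note that the rm commands remove everything in the current directory, so run them only from your workshop working directory: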

kubectl delete ns argo
rm *
rm -r *

Delete your disk:

gcloud compute disks delete DISK_NAME [DISK_NAME …] [--region=REGION | --zone=ZONE]

Demo delete disk

To delete the disk ‘gce-nfs-disk-1’ in zone ‘us-central1-c’ that was used as an example in this workshop, run:

gcloud compute disks delete gce-nfs-disk-1 --zone=us-central1-c

Delete cluster
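On GKE, a cluster can be deleted from the command line with gcloud. The sketch below uses a made-up function name and takes the cluster name and zone as arguments, since neither is fixed by this lesson; it lists the existing clusters first so you can check the name before deleting.

```shell
# List the clusters in the project, then delete the named one.
# The function name is hypothetical; pass your own cluster name and zone.
delete_cluster() {
  local name=$1 zone=$2
  gcloud container clusters list &&
    gcloud container clusters delete "$name" --zone "$zone"
}

# Usage (gcloud asks for confirmation before deleting):
#   delete_cluster CLUSTER_NAME ZONE
```

If you are using minikube instead, the local cluster can be removed with minikube delete.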

  • Cleaning your workspace during periods when you are not running workflows will save you money.

  • With a couple of commands it is easy to get back to square one.