Cloud Pre-Exercise

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Kubernetes?

  • What is a Kubernetes cluster and why do I need one?

Objectives
  • Learn the very basics of Kubernetes

  • Learn a bit about the architecture of a Kubernetes cluster

Introduction

Throughout this workshop, you have become familiar with Docker containers and their ability to function as isolated, efficient virtual machines running on a host machine. Considering this, imagine the potential of maximizing hardware resources by running multiple CMSSW (Compact Muon Solenoid Software) open data containers on a single desktop: for example, 10 CMSSW containers, each processing a single ROOT file, skimming through entire datasets in parallel on your own machine. Scaling up to a larger number of machines introduces new challenges. How would you manage the software installation across all these machines? Do you have sufficient resources to handle these tasks? How would you effectively manage and monitor all the containers running across the distributed infrastructure?

These questions highlight the need for a robust orchestration system like Kubernetes. By leveraging Kubernetes, you can streamline and automate the deployment, scaling, and management of containers across multiple machines. Kubernetes provides a unified platform to address these challenges and ensures efficient utilization of computing resources, enabling researchers to focus on their analysis tasks rather than infrastructure management.

In the upcoming sections of this workshop, we will delve into the practical aspects of using Kubernetes for managing CMSSW containers and orchestrating data processing workflows. We will explore techniques for software deployment, container management, and effective utilization of distributed resources. By the end of the workshop, you will have gained the knowledge and skills to leverage Kubernetes for efficient and scalable physics data analysis.

Kubernetes (K8s) - Microservices Concepts

Kubernetes is a powerful container orchestration platform that facilitates the deployment, scaling, and management of microservices-based applications. Microservices architecture is an approach to developing software applications as a collection of small, independent services that work together to form a larger application. Kubernetes provides essential features and functionality to support the deployment and management of microservices.

K8s API

The Kubernetes API (Application Programming Interface) is a set of rules and protocols that allows users and external systems to interact with a Kubernetes cluster. It serves as the primary interface for managing and controlling various aspects of the cluster, including deploying applications, managing resources, and monitoring the cluster’s state. Users can interact with the API using various methods, such as command-line tools (e.g., kubectl), programming languages (e.g., Python, Go), or through user interfaces built on top of the API.
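To make this concrete, here is a minimal sketch (assuming a running cluster and a configured kubectl, which you will set up below): kubectl can proxy the API to localhost, and any HTTP client can then query it directly.

# serve the cluster's REST API on localhost, in the background
kubectl proxy --port=8001 &
# list all namespaces by calling the API directly
curl http://localhost:8001/api/v1/namespaces

Commands like kubectl get are, under the hood, convenience wrappers around REST calls of exactly this kind.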

Kubernetes Components

When deploying Kubernetes, you establish a cluster that comprises two main components: masters and workers. The masters run the control-plane components, such as the API server, etcd, the scheduler, and the controller manager, while the workers run your application containers.

By separating the responsibilities of the masters and workers, Kubernetes ensures a distributed and scalable architecture. The masters focus on managing the cluster’s control plane and coordinating the overall state, while the workers handle the execution of application workloads. This division of labor allows for efficient scaling, fault tolerance, and high availability in a Kubernetes cluster.

Nodes Components

Kubernetes nodes, also referred to as worker nodes or simply nodes, are the individual machines or virtual machines that make up a Kubernetes cluster. These nodes are responsible for executing the actual workloads and running the containers that make up your applications. Each node in a Kubernetes cluster plays a crucial role in the distributed system and contributes to the overall functioning of the cluster. The key components running on every node are:

  • kubelet, the agent that talks to the control plane and ensures containers are running in their pods

  • a container runtime (e.g., containerd or Docker), which pulls images and runs the containers

  • kube-proxy, which maintains the network rules that make Kubernetes services reachable

Nodes form the backbone of a Kubernetes cluster, offering the necessary computational resources for running applications. Working in collaboration with the master components, nodes play a crucial role in orchestrating, scheduling, and managing the lifecycle of containers and pods. By hosting and executing pods, nodes effectively utilize their compute resources, ensuring optimal execution, resource allocation, and scalability. With Kubernetes’ intelligent scheduling capabilities, containers are seamlessly distributed across nodes, enabling efficient resource utilization and facilitating fault tolerance in a distributed environment.

Autoscaling

Autoscaling is a powerful feature supported by Kubernetes that allows you to optimize the allocation of resources on your nodes based on the actual usage patterns of your applications. Kubernetes enables you to automatically scale up or down the number of nodes in your cluster, as well as adjust the CPU and memory resources allocated to those nodes.

By utilizing autoscaling, you can ensure that your applications have the necessary resources to handle increased workloads during peak times, while also dynamically reducing resource allocation during periods of lower demand. This flexibility not only improves performance and responsiveness but also helps optimize costs by allowing you to pay only for the resources you actually need. If you want to learn about pricing for this workshop’s cloud provider, check out Google’s Compute Engine pricing.
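As a minimal sketch of what pod-level autoscaling looks like (assuming a metrics server is available, e.g. via minikube addons enable metrics-server, and a Deployment named deploy-example like the one created later in this lesson):

# keep between 1 and 5 replicas, targeting 80% average CPU utilization
kubectl autoscale deployment deploy-example --min=1 --max=5 --cpu-percent=80
# inspect the resulting HorizontalPodAutoscaler
kubectl get hpa

Scaling the number of nodes themselves is configured on the provider side, e.g. with GKE's cluster autoscaler.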

Key Points

  • Kubernetes is an orchestrator of containers. It is most useful when it is run in a cluster of computers.

  • Commercial K8s clusters are a good option for large computing needs.

  • We can run our containerized CMSSW jobs and subsequent analysis workflows in a K8s cluster.


Getting started with Kubectl

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Kubectl?

  • How to use Kubectl commands?

Objectives
  • Learn what the kubectl command can do

  • Learn how to set up different services/resources to get the most of your cluster

K8s - Imperative vs Declarative programming

In the context of Kubernetes, imperative and declarative are two different paradigms used to define and manage the desired state of resources within a cluster. While imperative commands are useful for ad-hoc tasks or interactive exploration, the declarative approach is more suitable for managing and maintaining resources in a Kubernetes cluster efficiently. Let’s explore each approach! But first, we need a tool to interact with our cluster.

Kubectl

The kubectl command-line tool is a powerful utility provided by Kubernetes that allows you to interact with and manage Kubernetes clusters. Both Minikube and K8s on Docker Desktop come with a built-in kubectl installation. Use the following syntax to run kubectl commands from your terminal window:

kubectl [command] [TYPE] [NAME] [flags]

Where:

  • command: the operation to perform on one or more resources, e.g. create, get, describe, delete

  • TYPE: the resource type, e.g. pod, deployment, service (types are case-insensitive and accept singular, plural, or abbreviated forms)

  • NAME: the name of a specific resource; omit it to list all resources of that TYPE

  • flags: optional flags, e.g. -o wide for extra output columns or -n to select a namespace
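For example, in the command below (using the mynginx pod you will create shortly), get is the command, pod is the TYPE, mynginx is the NAME, and -o wide is a flag:

kubectl get pod mynginx -o wide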

Do not forget to go through the setup episode to get your environment up and running. Also, check out the kubectl cheat sheet.

Windows Users - Reminder

To enable Kubernetes on WSL2, you have two options: activating Kubernetes in Docker Desktop or installing Minikube following the Linux option on WSL2 Ubuntu. Note that the Windows instructions in the Minikube installation guide direct users to PowerShell, but running the CMSSW container there will cause issues; therefore, execute those commands within the Ubuntu shell.

Docker Desktop needs to be running in the background for Minikube to start. From a terminal with administrator access (but not logged in as root), run:

minikube start
If you have already installed kubectl and it is pointing to some other environment, such as docker-desktop or a GKE cluster, ensure you change the context so that kubectl is pointing to minikube:
kubectl config use-context minikube

If minikube fails to start, see the drivers page for help setting up a compatible container or virtual-machine manager.
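Once it is up, you can verify the cluster with two standard commands:

minikube status
kubectl get nodes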

Congratulations! You have successfully activated Minikube. You can now use Kubernetes to deploy and manage containerized applications on your local machine. Remember to familiarize yourself with Kubernetes concepts and commands to make the most of its capabilities.

To enable Kubernetes in Docker Desktop, follow these steps from the documentation:
  1. From the Docker Dashboard, select the Settings.
  2. Select Kubernetes from the left sidebar.
  3. Next to Enable Kubernetes, select the checkbox.
  4. Select Apply & Restart to save the settings and then click Install to confirm. This instantiates images required to run the Kubernetes server as containers, and installs the /usr/local/bin/kubectl command on your machine.
If you have already installed kubectl and it is pointing to some other environment, such as minikube or a GKE cluster, ensure you change the context so that kubectl is pointing to docker-desktop:
kubectl config use-context docker-desktop
You can test the command by listing the available nodes:
kubectl get nodes

Congratulations! You have successfully activated Docker Desktop Kubernetes. You can now use Kubernetes to deploy and manage containerized applications on your local machine. Remember to familiarize yourself with Kubernetes concepts and commands to make the most of its capabilities.

Imperative Approach

In the imperative approach, you specify the exact sequence of commands or actions to be performed to create or modify Kubernetes resources. You interact with the Kubernetes API by issuing explicit instructions.

Let’s first create a pod running Nginx by using the imperative way.

Create the pod using the Imperative way
kubectl run mynginx --image=nginx
Get a list of pods and their status
kubectl get pods

Output

The status of the pod is “ContainerCreating,” which means that Kubernetes is currently in the process of creating the container for the pod. The “0/1” under the “READY” column indicates that the pod has not yet reached a ready state.

NAME      READY   STATUS              RESTARTS   AGE
mynginx   0/1     ContainerCreating   0          5s

Once the pod is successfully created and the container is running, the status will change to “Running” or a similar value, and the “READY” column will reflect that the pod is ready.

NAME      READY   STATUS              RESTARTS   AGE
mynginx   1/1     Running             0          70s
Get more info
kubectl get pods -o wide

Output

The updated output indicates that the pod named “mynginx” is now successfully running and ready to serve requests.

NAME      READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
mynginx   1/1     Running   0          4m23s   10.244.0.122   minikube   <none>           <none>
kubectl describe pod mynginx

Output

The describe output provides detailed information about the pod: its metadata, container state, volumes, conditions, and recent events. The inline comments below explain each field.

Name:             mynginx                                     # Pod name
Namespace:        default                                     # Namespace of the Pod
Priority:         0                                           # Priority assigned to the Pod
Service Account:  default                                     # Service account used by the Pod
Node:             minikube/192.168.49.2                       # Node where the Pod is running
Start Time:       Thu, 01 Jun 2023 18:46:23 -0500             # Time when the Pod started
Labels:           run=mynginx                                 # Labels assigned to the Pod
Annotations:      <none>                                      # Annotations associated with the Pod
Status:           Running                                     # Current status of the Pod
IP:               10.244.0.122                                # IP address assigned to the Pod
IPs:
  IP:  10.244.0.122                                           
Containers:
  mynginx:
    Container ID:   docker://c22dce8c953394...               # ID of the container
    Image:          nginx                                    # Image used by the container
    Image ID:       docker-pullable://nginx@sha256:af296b... # ID of the container image
    Port:           <none>                                   # Port configuration for the container
    Host Port:      <none>                                   # Host port configuration for the container
    State:          Running                                  # Current state of the container
      Started:      Thu, 01 Jun 2023 18:46:50 -0500          # Time when the container started
    Ready:          True                                     # Indicates if the container is ready
    Restart Count:  0                                        # Number of times the container has been restarted
    Environment:    <none>                                   # Environment variables set for the container
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from...  # Mount points for the container
Conditions:
  Type              Status                                    # Various conditions related to the Pod's status
  Initialized       True                                      # Pod has been initialized
  Ready             True                                      # Pod is ready
  ContainersReady   True                                      # Containers are ready
  PodScheduled      True                                      # Pod has been scheduled
Volumes:
  kube-api-access-hvg8b:                                      # Volumes used by the Pod
    Type:                    Projected
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort                         # Quality of service class for the Pod
Node-Selectors:              <none>                             # Node selectors for the Pod
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message         # Events related to the Pod
  ----    ------     ----   ----               -------
  Normal  Scheduled  7m32s  default-scheduler  Successfully assigned default/mynginx to minikube
  Normal  Pulling    7m32s  kubelet            Pulling image "nginx"
  Normal  Pulled     7m5s   kubelet            Successfully pulled image "nginx" in 26.486269096s (26.486276637s including waiting)
  Normal  Created    7m5s   kubelet            Created container mynginx
  Normal  Started    7m5s   kubelet            Started container mynginx
Delete the pod
kubectl delete pod mynginx

Declarative Approach

In the declarative approach, you define the desired state of Kubernetes resources in a declarative configuration file (e.g., YAML or JSON). Rather than specifying the steps to achieve that state, you describe the desired outcome and let Kubernetes handle the internal details.

Create a pod using the declarative way

Download the file:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/kubectl/myapp.yaml

YAML File

This YAML file describes a Pod with an nginx web server container that listens on port 80 and has an environment variable set. The specific behavior and functionality of the nginx web server will depend on the configuration of the nginx image used.

# myapp.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
    ports:
    - containerPort: 80
    env:
    - name: DBCON
      value: myconnectionstring

Now, let’s create a pod using the YAML file

kubectl create -f myapp.yaml

Get some info

kubectl get pods -o wide
kubectl describe pod myapp-pod

Open a shell in the running pod

kubectl exec -it myapp-pod -- bash

Output

We used this command to execute an interactive shell session within the “myapp-pod” pod. After executing it, you are inside the pod, running commands as the “root” user. The prompt “root@myapp-pod:/#” indicates that you are currently in the pod’s shell environment and can run commands.

root@myapp-pod:/#

Print the DBCON environment variable that was set in the YAML file.

echo $DBCON

Output

The command “echo $DBCON” is used to print the value of the environment variable “DBCON”.

root@myapp-pod:/# echo $DBCON
myconnectionstring

Exit from the container

exit

Delete the pod

kubectl delete -f myapp.yaml

The declarative approach is the recommended way to manage resources in Kubernetes. It promotes consistency, reproducibility, and automation. You can easily version control the configuration files, track changes over time, and collaborate with team members more effectively.
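As a small illustration of that workflow (reusing the myapp.yaml file from above), kubectl apply lets you re-apply an edited manifest, and kubectl diff previews what would change on the cluster before you commit to it:

# create (or update) the pod from the manifest
kubectl apply -f myapp.yaml
# ...edit myapp.yaml, e.g. pin the image to nginx:alpine...
# preview what applying the edited file would change
kubectl diff -f myapp.yaml
# apply the change, then clean up
kubectl apply -f myapp.yaml
kubectl delete -f myapp.yaml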

Let’s run a few examples.

Kubernetes namespaces partition resources in a cluster, creating isolated virtual clusters. They allow multiple teams or applications to coexist while maintaining separation and preventing conflicts.

Get the currently configured namespaces:

kubectl get namespaces
kubectl get ns

Both commands are equivalent and will retrieve the list of namespaces in your Kubernetes cluster.
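Namespaces are themselves Kubernetes resources, so you can manage them like anything else; the name mytest below is just an example:

# create a namespace, list its (initially empty) pods, then remove it
kubectl create namespace mytest
kubectl get pods -n mytest
kubectl delete namespace mytest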

Get the pods list:

Get a list of all the running pods.

kubectl get pods

You get the pods from the default namespace. Try getting the pods from the kube-system namespace; you will get a different list.

kubectl get pods -n kube-system

Output

The command “kubectl get pods -n kube-system” is used to retrieve information about the pods running in the “kube-system” namespace.

NAME                               READY   STATUS    RESTARTS        AGE
coredns-787d4945fb-62wpc           1/1     Running   9 (4d19h ago)   34d
etcd-minikube                      1/1     Running   8 (4d19h ago)   34d
kube-apiserver-minikube            1/1     Running   9 (4d19h ago)   34d
kube-controller-manager-minikube   1/1     Running   9 (4d19h ago)   34d
kube-proxy-hm78n                   1/1     Running   8 (4d19h ago)   34d
kube-scheduler-minikube            1/1     Running   8 (4d19h ago)   34d
storage-provisioner                1/1     Running   16 (18m ago)    34d

Get nodes information:

Get a list of all the available nodes. Using Docker Desktop or Minikube, there should be only one.

kubectl get nodes

Output

The output lists each node with its name, status, roles, age, and version.

NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   34d   v1.26.3

For much more detail about the node, including its capacity, addresses, conditions, and running pods, use:

kubectl describe node

Run your first deployment

A Deployment is a higher-level resource that provides declarative updates and manages the deployment of Pods. It allows you to define the desired state of your application, including the number of replicas, container images, and resource requirements.
Download the file:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/kubectl/deploy-example.yaml

YAML File

This Deployment will create and manage three replicas of an nginx container based on the nginx:alpine image. The Pods will have resource requests and limits defined, and the container will expose port 80. The Deployment ensures that the desired state of the replicas is maintained, managing scaling and updating as needed.

# deploy-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-example
spec:
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: nginx
      env: prod
  template:
    metadata:
      labels:
        app: nginx
        env: prod
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi        
        ports:
        - containerPort: 80

Create the Deployment:

kubectl apply -f deploy-example.yaml

Get the pods list

kubectl get pods -o wide

Output

The Deployment has created three replica pods, all scheduled on the single minikube node.

NAME                             READY   STATUS              RESTARTS   AGE   IP       NODE       NOMINATED NODE   READINESS GATES
deploy-example-54fbc9897-56s28   0/1     ContainerCreating   0          4s    <none>   minikube   <none>           <none>
deploy-example-54fbc9897-9twh8   0/1     ContainerCreating   0          4s    <none>   minikube   <none>           <none>
deploy-example-54fbc9897-lng78   0/1     ContainerCreating   0          4s    <none>   minikube   <none>           <none>

All three pods are currently in the “ContainerCreating” state, indicating that the containers are being created for these pods. After some time you will get:

NAME                             READY   STATUS    RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
deploy-example-54fbc9897-56s28   1/1     Running   0          42s   10.244.0.126   minikube   <none>           <none>
deploy-example-54fbc9897-9twh8   1/1     Running   0          42s   10.244.0.124   minikube   <none>           <none>
deploy-example-54fbc9897-lng78   1/1     Running   0          42s   10.244.0.125   minikube   <none>           <none>

Get more details about the pod

kubectl describe pod deploy-example

Output

The output shows three running pods named “deploy-example-54fbc9897-56s28”, “deploy-example-54fbc9897-9twh8”, and “deploy-example-54fbc9897-lng78” in the “default” namespace. These pods are controlled by the ReplicaSet “deploy-example-54fbc9897” and are running the “nginx:alpine” image. They have successfully pulled the image, created the container, and started running. Each pod has an IP assigned, and they are running on the “minikube” node.

Get the Deployment info

kubectl get deploy

Output

All 3 replicas are ready, up-to-date, and available, indicating that the deployment is successfully running.

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
deploy-example   3/3     3            3           4m57s

Get the ReplicaSet name:

A ReplicaSet is a lower-level resource that ensures a specified number of replicas of a Pod are running at all times.

kubectl get rs

Output

The current replica count matches the desired count, indicating that the ReplicaSet has successfully created and maintained 3 replicas. All 3 replicas are ready and available.

NAME                       DESIRED   CURRENT   READY   AGE
deploy-example-54fbc9897   3         3         3       6m12s

In summary, a Deployment provides a higher-level abstraction for managing and updating the desired state of Pods, while a ReplicaSet is a lower-level resource that ensures the specified number of Pod replicas are maintained. Deployments use ReplicaSets under the hood to achieve the desired state and handle scaling and rolling updates.
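You can watch this machinery in action with a quick, optional experiment: scale the Deployment and see the ReplicaSet counts follow.

# ask for 5 replicas instead of 3
kubectl scale deployment deploy-example --replicas=5
# DESIRED/CURRENT on the ReplicaSet move to 5 as new pods are created
kubectl get rs
# scale back down to the original 3
kubectl scale deployment deploy-example --replicas=3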

Cleanup

Delete the pod

kubectl delete -f deploy-example.yaml

Key Points

  • kubectl is the main tool for interacting with any K8s cluster, including GKE


Getting started with Argo

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Argo?

  • How to use Argo commands?

  • What are Argo workflows?

  • How do I deploy my Argo GUI?

Objectives
  • Appreciate the necessity for the Argo workflows tool (or similar)

  • Learn the basics of Argo commands

Argo

Argo is a collection of open-source tools that extend the functionality of Kubernetes, providing several benefits for workflow management, such as defining workflows as code, running steps in parallel, and managing everything through native Kubernetes resources.

In the context of Argo, there are three important tools that facilitate working with workflows; we will be using the Argo Workflow Engine.

Install the Argo Workflows CLI

Download the latest Argo CLI from the releases page. The CLI is required to interact with Argo from the command line.
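For reference, a typical Linux installation looks like the following sketch; the version below is only an example, so substitute the latest release from that page:

# example version; check the releases page for the current one
ARGO_VERSION=v3.4.8
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo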

Verify the Argo installation with:

argo version

Argo Workflow Engine

The Argo Workflow Engine is designed to execute complex job orchestration, including both serial and parallel execution of stages, with each stage executed as a container.

In the context of scientific analysis, such as physics analysis using datasets from the CMS Open Data portal and CMSSW, Argo’s orchestration capabilities are particularly valuable. By leveraging Argo, researchers can automate and streamline complex analysis workflows, which often involve multiple processing stages and dependencies. Argo’s support for parallel execution and container-based environments allows for efficient utilization of computational resources, enabling faster and more scalable data analysis.

Install argo as a workflow engine

While jobs can be run manually, utilizing a workflow engine like Argo simplifies the process of defining and submitting jobs. In this tutorial, we will use the Argo Quick Start page to install and configure Argo in your working environment.

Install it into your working environment with the following commands (all commands to be entered into your local shell):

kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/master/manifests/quick-start-postgres.yaml

Check that everything has finished running before continuing with:

kubectl get all -n argo

Port-forward the UI

To open a port-forward so you can access the UI, open a new shell and run:

kubectl -n argo port-forward deployment/argo-server 2746:2746

This will serve the UI on https://localhost:2746. Due to the self-signed certificate, you will receive a TLS error, which you will need to manually approve. The Argo interface will look similar to this:

Argo GUI

In the Argo GUI, you can submit, inspect, resubmit, stop, and delete workflows, browse their logs and artifacts, and monitor their progress in real time.

Pay close attention to the URL: it uses https, not http. Navigating to http://localhost:2746 results in a server-side error that breaks the port-forwarding.

Run a simple test workflow

Make sure that all argo pods are running before submitting the test workflow:

kubectl get pods -n argo

You should get a similar output:

NAME                                  READY   STATUS      RESTARTS        AGE
argo-server-76f9f55f44-9d6c5          1/1     Running     6 (5d14h ago)   23d
httpbin-7d6678b4c5-vhk2k              1/1     Running     3 (5d14h ago)   23d
minio-68dc5544c4-8jst4                1/1     Running     3 (5d14h ago)   23d
postgres-6f9cb49458-sc5fx             1/1     Running     3 (5d14h ago)   23d
workflow-controller-769bfc84b-ndgp7   1/1     Running     8 (13m ago)     23d

To test the setup, run a simple test workflow with:

argo submit -n argo https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml

This might take a while. To see the status of your workflows, run:

argo list -n argo

Output

The output indicates that the workflow “hello-world-mjgvb” has succeeded. It was created 2 minutes ago and took 32 seconds to complete, with priority 0 and no message associated with it.

NAME                STATUS      AGE   DURATION   PRIORITY   MESSAGE
hello-world-mjgvb   Succeeded   2m    32s        0

The Argo GUI gives you an interactive glimpse of how Argo workflows can be monitored and managed. Feel free to explore the various functions this tool offers!

Argo Hello World Workflow

Within your workflow, you can select your pod (in this example, it’s called hello-world-mjgvb) to get a quick summary of the workflow details.

Argo Hello World Workflow

You can get the logs with:

argo logs -n argo @latest

Output

If Argo was installed correctly, you will see the following:

hello-world-mjgvb: time="2023-06-02T00:37:54.468Z" level=info msg="capturing logs" argo=true
hello-world-mjgvb:  _____________
hello-world-mjgvb: < hello world >
hello-world-mjgvb:  -------------
hello-world-mjgvb:     \
hello-world-mjgvb:      \
hello-world-mjgvb:       \
hello-world-mjgvb:                     ##        .
hello-world-mjgvb:               ## ## ##       ==
hello-world-mjgvb:            ## ## ## ##      ===
hello-world-mjgvb:        /""""""""""""""""___/ ===
hello-world-mjgvb:   ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
hello-world-mjgvb:        \______ o          __/
hello-world-mjgvb:         \    \        __/
hello-world-mjgvb:           \____\______/
hello-world-mjgvb: time="2023-06-02T00:37:55.484Z" level=info msg="sub-process exited" argo=true error="<nil>"

You can also check the logs with Argo GUI: Argo Hello World Workflow Logs

Please note that it is important to delete your workflows once they have completed. If you do not, the pods associated with the workflow remain scheduled in the cluster, which might lead to additional charges. You will learn how to remove them automatically later.

argo delete -n argo @latest

Kubernetes namespaces

The above commands as well as most of the following use a flag -n argo, which defines the namespace in which the resources are queried or created. Namespaces separate resources in the cluster, effectively giving you multiple virtual clusters within a cluster.

You can change the default namespace to argo as follows:

kubectl config set-context --current --namespace=argo
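You can verify which namespace is currently the default with a standard kubectl query:

kubectl config view --minify | grep namespace: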

Key Points

  • Argo is a very useful tool for running workflows and parallel jobs


Storing a workflow output

Overview

Teaching: 5 min
Exercises: 30 min
Questions
  • How to set up a workflow engine to submit jobs?

  • How to run a simple job?

  • How can I set up shared storage for my workflows?

  • How to run a simple job and get the output?

Objectives
  • Understand how to run simple workflows in a commercial cloud environment or on a local machine

  • Understand how to set up shared storage and use it in a workflow

Kubernetes Cluster - Storage Volume

With Minikube, you can utilize persistent volumes and persistent volume claims to enable data persistence within your local Kubernetes cluster. Leveraging local storage volumes lets you conveniently create and use storage resources, supporting data persistence and local development.

Let’s create a persistent volume. Retrieve the persistent volume configuration file with:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/minikube/pv.yaml

It has the following content; you can alter the storage capacity to whatever value you’d like.

YAML File

# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/mnt/vol"

Deploy:

kubectl apply -f pv.yaml

Check:

kubectl get pv

Expected output:

NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
task-pv-volume   5Gi        RWX            Retain           Available           manual                  11s
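If you want more detail than this one-line summary, describe the volume:

kubectl describe pv task-pv-volume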

Apps can claim persistent volumes through persistent volume claims (PVCs). Let’s create a PVC. Retrieve the pvc.yaml file with:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/minikube/pvc.yaml

It has the following content. You can alter the storage request if you’d like, but it must be less than or equal to the storage capacity defined in our persistent volume (previous step).

YAML File

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

Deploy:

kubectl apply -f pvc.yaml -n argo

Check:

kubectl get pvc -n argo

Expected output:

NAME            STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
task-pv-claim   Bound    task-pv-volume   5Gi        RWX            manual         10s

Now an Argo workflow can claim and access this volume. Retrieve the configuration file with:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/minikube/argo-wf-volume.yaml

It has the following content:

YAML File

# argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: workdir
      hostPath:
        path: /mnt/vol
        type: DirectoryOrCreate
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the new ouput" > /mnt/vol/test1.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: workdir
        mountPath: /mnt/vol

Submit and check this workflow with:

argo submit argo-wf-volume.yaml -n argo

Wait till the pod test-hostpath-XXXXX is created. You can check with:

kubectl get pods -n argo

List all the workflows with:

argo list -n argo

Take the name of the workflow from the output (replace XXXXX in the following command) and check the logs:

kubectl logs pod/test-hostpath-XXXXX  -n argo main

Once the job is done, you will see something like:

time="2022-07-25T05:51:14.221Z" level=info msg="capturing logs" argo=true
ls -l /mnt/vol: total 4 -rw-rw-rw- 1 root root 18 Jul 25 05:51 test1.txt

Get the output file

The example job above produced a text file as output. It resides in the persistent volume that the workflow wrote to. To copy the file from that volume to the shell, we will define a container, a “storage pod”, and mount the volume there so that we can access it.

Retrieve the file pv-pod.yaml with:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/minikube/pv-pod.yaml

It has the following content:

YAML File

# pv-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: busybox
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /mnt/vol
          name: task-pv-storage
      resources:
        limits:
          cpu: "2"
          memory: "3Gi"
        requests:
          cpu: "1"
          memory: "512Mi"

Create the storage pod and copy the files from there with:

kubectl apply -f pv-pod.yaml -n argo

Wait till the pod task-pv-pod is created. You can check with:

kubectl get pods -n argo

Now copy the files into your machine with:

kubectl cp task-pv-pod:/mnt/vol /tmp/poddata -n argo

You will get the file created by the job in /tmp/poddata/test1.txt. Remember to show hidden files/folders if you are browsing with a graphical file manager. In your terminal run:

cat /tmp/poddata/test1.txt

Expected output:

This is the new ouput

Every time you want the files copied from the pv-pod to your local computer, you must run:

kubectl cp task-pv-pod:/mnt/vol /tmp/poddata -n argo

Key Points

  • With Kubernetes one can run workflows similar to a batch system

  • Open Data workflows can be run in a commercial cloud environment using modern tools


Create an Argo Workflow

Overview

Teaching: 5 min
Exercises: 20 min
Questions
  • How can I visualize my workflows?

  • How do I deploy my Argo GUI?

Objectives
  • Prepare to deploy the fileserver that mounts the storage volume.

  • Submit your workflow and get the results.

Workflow Definition

In this section, we will explore the structure and components of an Argo Workflow. Workflows in Argo are defined using YAML syntax and consist of various tasks that can be executed sequentially, in parallel, or conditionally.

To define a workflow, create a YAML file (e.g., my-workflow.yaml) that specifies the workflow’s metadata, an entrypoint, and the templates describing each task.

Here’s an example of a simple Argo Workflow definition; get it with:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/argo/container-workflow.yaml

The container template will have the following content:

YAML File

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
spec:
  entrypoint: my-task
  templates:
    - name: my-task
      container:
        image: my-image
        command: [echo, "Hello, Argo!"]

Let’s run the workflow:

argo submit container-workflow.yaml -n argo

You can add the --watch flag to supervise the creation of the workflow in real time, like so:

argo submit --watch container-workflow.yaml -n argo

Open the Argo Workflows UI, then navigate to the workflow; you should see a single container running.

Exercise

Edit the workflow to make it echo “howdy world”.

Solution

apiVersion: argoproj.io/v1alpha1
kind: Workflow                 
metadata:
  generateName: container-   
spec:
  entrypoint: main         
  templates:
  - name: main             
    container:
      image: docker/whalesay
      command: [cowsay]         
      args: ["howdy world"]

Learn more about container templates in the Argo Workflows documentation.

DAG Template

A DAG template is a common type of orchestration template. Let’s look at a complete example:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/argo/dag-workflow.yaml

That has the content:

YAML File

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: a
            template: whalesay
          - name: b
            template: whalesay
            dependencies:
              - a
    - name: whalesay
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "hello world" ]

In this example, we have two templates: the “main” template, which defines the DAG, and the “whalesay” template, which runs the cowsay container.

The DAG has two tasks: “a” and “b”. Both run the “whalesay” template, but as “b” depends on “a”, it won’t start until “a” has completed successfully.

Let’s run the workflow:

argo submit --watch dag-workflow.yaml -n argo

You should see something like:

STEP         TEMPLATE  PODNAME              DURATION  MESSAGE
 ✔ dag-shxn5  main
 ├─✔ a        whalesay  dag-shxn5-289972251  6s
 └─✔ b        whalesay  dag-shxn5-306749870  6s

Did you see how b did not start until a had completed?

Open the Argo Server tab and navigate to the workflow; you should see two containers.

Exercise

Add a new task named “c” to the DAG. Make it depend on both “a” and “b”. Go to the UI and view your updated workflow graph.

Solution

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: a
            template: whalesay
          - name: b
            template: whalesay
            dependencies:
              - a
          - name: c
            template: whalesay
            dependencies:
              - a
              - b
    - name: whalesay
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "hello world" ]

The expected output is:

STEP          TEMPLATE  PODNAME                        DURATION  MESSAGE
✔ dag-hl6lc  main                                                 
├─✔ a        whalesay  dag-hl6lc-whalesay-1306143144  10s         
├─✔ b        whalesay  dag-hl6lc-whalesay-1356476001  10s         
└─✔ c        whalesay  dag-hl6lc-whalesay-1339698382  9s       

And the workflow you should see in Argo GUI is: DAG-diagram1

Learn more about DAG templates in the Argo Workflows documentation.

Input Parameters

Let’s have a look at an example:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/argo/input-parameters-workflow.yaml

See the content:

Yaml file

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: input-parameters-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: message
        value: hello world
  templates:
    - name: main
      inputs:
        parameters:
          - name: message
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "" ]

This template declares that it has one input parameter named “message”. See how the workflow itself has arguments?

Run it:

argo submit --watch input-parameters-workflow.yaml -n argo

You should see:

STEP                       TEMPLATE  PODNAME                 DURATION  MESSAGE
 ✔ input-parameters-mvtcw  main      input-parameters-mvtcw  8s          

If a workflow has parameters, you can change them from the CLI with -p:

argo submit --watch input-parameters-workflow.yaml -p message='Welcome to Argo!' -n argo

You should see:

STEP                       TEMPLATE  PODNAME                 DURATION  MESSAGE
 ✔ input-parameters-lwkdx  main      input-parameters-lwkdx  5s          

Let’s check the output in the logs:

argo logs @latest -n argo

You should see:

 __________________
< Welcome to Argo! >
 ------------------
    \
     \
      \     
                    ##        .            
              ## ## ##       ==            
           ## ## ## ##      ===            
       /""""""""""""""""___/ ===        
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~   
       \______ o          __/            
        \    \        __/             
          \____\______/   

Learn more about parameters in the Argo Workflows documentation.

Output Parameters

Output parameters can come from a few places, but typically the most versatile source is a file. In this example, the container creates a file with a message in it:

  - name: whalesay
    container:
      image: docker/whalesay
      command: [sh, -c]
      args: ["echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param		
        valueFrom:
          path: /tmp/hello_world.txt

In a DAG or steps template, you can reference the output of one task as the input of another using a template tag:

      dag:
        tasks:
          - name: generate-parameter
            template: whalesay
          - name: consume-parameter
            template: print-message
            dependencies:
              - generate-parameter
            arguments:
              parameters:
                - name: message
                  value: ""

Get the complete workflow:

wget https://cms-opendata-workshop.github.io/workshop2023-lesson-introcloud/files/argo/parameters-workflow.yaml

Yaml File

# parameters-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parameters-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: generate-parameter
            template: whalesay
          - name: consume-parameter
            template: print-message
            dependencies:
              - generate-parameter
            arguments:
              parameters:
                - name: message
                  value: ""

    - name: whalesay
      container:
        image: docker/whalesay
        command: [ sh, -c ]
        args: [ "echo -n hello world > /tmp/hello_world.txt" ]
      outputs:
        parameters:
          - name: hello-param
            valueFrom:
              path: /tmp/hello_world.txt

    - name: print-message
      inputs:
        parameters:
          - name: message
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "" ]

Run it:

argo submit --watch parameters-workflow.yaml -n argo

You should see:

STEP                     TEMPLATE       PODNAME                      DURATION  MESSAGE
  parameters-vjvwg      main                                                    
 ├─✔ generate-parameter  whalesay       parameters-vjvwg-4019940555  43s         
 └─✔ consume-parameter   print-message  parameters-vjvwg-1497618270  8s          

Learn more about parameters in the Argo Workflows documentation.

Conclusion

Congratulations! You have completed the Argo Workflows tutorial, where you learned how to define and execute workflows using Argo. You explored workflow definitions, DAG templates, input and output parameters, and monitoring. This will be important when processing files from the CMS Open Data Portal, similarly to the DAG and Parameters examples in this lesson.

Argo Workflows offers a wide range of features and capabilities for managing complex workflows in Kubernetes. Continue to explore its documentation and experiment with more advanced workflow scenarios.

Happy workflow orchestrating with Argo!

Key Points

  • With a simple but tight YAML structure, a full-blown analysis can be performed using the Argo tool on a K8s cluster.


Cleaning up

Overview

Teaching: 5 min
Exercises: 10 min
Questions
  • How do I clean my workspace?

  • How do I delete my cluster?

Objectives
  • Clean my workflows

  • Delete my storage volume

Cleaning workspace

Remember to delete your workflows. Run this until you get a message indicating there are no more workflows:

argo delete -n argo @latest
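You can confirm that nothing is left with:

argo list -n argo

An empty list means all workflows, and the pods they created, are gone.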

Cleaning resources

With respect to K8s, deleting the argo namespace will delete all the resources created in this pre-exercise:

kubectl delete ns argo

Do not forget to download or delete any files created in your local /tmp/poddata/ directory.

Minikube - Stop your cluster

Before closing Docker Desktop, from a terminal with administrator access (but not logged in as root), run:

minikube stop

Key Points

  • With a couple of commands it is easy to get back to square one.