Running large-scale workflows

Overview

Teaching: 10 min
Exercises: 10 min

Questions

How can I run more than a toy workflow?

Objectives

Run a workflow with several parallel jobs

Autoscaling

Google Kubernetes Engine allows you to configure your cluster so that it is automatically rescaled based on pod needs.

When creating a pod, you can specify how much of each resource a container needs. More information on compute resources can be found in the Kubernetes documentation. This information is then used to schedule your pod. If no node matches your pod’s requirements, the pod has to wait until other pods terminate or a new node is added.
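As a minimal sketch of what such a resource specification can look like, the pod manifest below sets requests and limits for a single container. The pod name, image, and all values are hypothetical and chosen only for illustration; adjust them to what your jobs actually need.

```yaml
# Hypothetical pod: name, image, and resource values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "echo hello && sleep 30"]
      resources:
        requests:          # what the scheduler uses to find a suitable node
          cpu: "500m"      # half a CPU core
          memory: "256Mi"
        limits:            # upper bound enforced while the container runs
          cpu: "1"
          memory: "512Mi"
```

If no node has at least the requested CPU and memory free, the pod stays in the Pending state, which is the situation the autoscaler reacts to.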

The cluster autoscaler keeps an eye on the scheduled pods and checks whether adding a new node, similar to the others in the cluster, would help. If so, it resizes the cluster to accommodate the waiting pods.

The cluster autoscaler also scales down the cluster to save resources. If the autoscaler notices that one or more nodes are not needed for an extended period of time (currently set to 10 minutes), it downscales the cluster.

To configure the autoscaler you simply specify a minimum and maximum for the node pool. The autoscaler then periodically checks the status of the pods and takes action accordingly. You can set the configuration either with the gcloud command-line tool or via the dashboard.

Deleting pods automatically

Argo allows you to describe the strategy to use when deleting completed pods. The pods are deleted automatically without deleting the workflow. Define one of the following strategies in your Argo workflow under the spec field:

spec:
  podGC:
    # pod gc strategy must be one of the following
    # * OnPodCompletion - delete pods immediately when pod is completed (including errors/failures)
    # * OnPodSuccess - delete pods immediately when pod is successful
    # * OnWorkflowCompletion - delete pods when workflow is completed
    # * OnWorkflowSuccess - delete pods when workflow is successful
    strategy: OnPodSuccess
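To show where the podGC field sits in a complete workflow, here is a sketch of a workflow that fans out into several parallel jobs and cleans up each pod as soon as it succeeds. The workflow name, template names, image, item list, and resource values are all assumptions made for this example; they are not taken from the lesson's own workflows.

```yaml
# Hypothetical workflow: names, image, items, and resource values are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-demo-
spec:
  entrypoint: fan-out
  podGC:
    strategy: OnPodSuccess   # delete each pod as soon as it succeeds
  templates:
    - name: fan-out
      steps:
        # One step expanded with withItems: Argo starts one pod per item,
        # and the pods run in parallel.
        - - name: process
            template: echo
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems: [1, 2, 3, 4]
    - name: echo
      inputs:
        parameters:
          - name: item
      container:
        image: busybox
        command: [echo, "processing item {{inputs.parameters.item}}"]
        resources:
          requests:          # per-pod request, used for scheduling
            cpu: "250m"
            memory: "128Mi"
```

With OnPodSuccess, pods that fail are kept, so you can still inspect their logs, while the workflow itself remains listed in either case.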

Scaling down

Occasionally, the cluster autoscaler cannot scale down completely and extra nodes are left behind. Some of these situations are described in the autoscaler documentation. It is therefore useful to know how to scale down your cluster manually.

Click on your cluster, listed at Kubernetes Engine - Clusters. Scroll down to the end of the page where you will find the Node pools section. Clicking on your node pool will take you to its details page.

In the upper right corner, next to EDIT and DELETE you’ll find RESIZE.

Clicking on RESIZE opens a text field that allows you to manually adjust the number of nodes in your node pool.

Key Points

Argo Workflows on Kubernetes are very powerful.
