Content from Introduction


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • What are public cloud provides?
  • Why would you use them?
  • What is Kubernetes?
  • What else do you need?

Objectives

  • Explain the motivation for using Kubernetes cluster from public cloud providers.
  • Explain the tools used to set up the Kubernetes cluster and run the processing workflow.

Introduction


This tutorial shows how to set up a CMS open data processing workflow using Kubernetes clusters from Google Cloud Platform (GCP).

Callout

To learn about CMS open data and the different data formats, work through the tutorials in one of our workshops.

Using public cloud resources is an option for you if you do not have enough computing resources and want to run some heavy processing. In this tutorial, we use as an example processing of CMS open data MiniAOD to a “custom” NanoAOD, including more information than the standard NanoAOD but still in the same flat file format.

We assume that you would want to download the output files to your local area and analyse them with your own resources. Note that analysis using GCP resources with your files stored on GCP is also possible, but is not covered in this tutorial.

Google Cloud Platform


Public cloud providers are companies that offers computing resources and services over the internet to multiple users or organizations. Google Cloud Platform (GCP) is one of them. You define and deploy the resources that you need and pay for what you use. As many other such resource providers (for example AWS, Azure, OHV), it offers some free getting-started “credits”.

Callout

GCP offers free trial credits for $300 for a new account. This credit is valid for 90 days.

This tutorial was set up using Google Cloud Research credits. You can apply for similar credits for your research projects. Take note that the credit has to be used within 12 months.

You can create, manage and delete resources using the Google Cloud Console (a Web UI) or a command-line tool gcloud.

In this tutorial, we use gcloud commands to create the persistent storage for the output data, and a Terraform script to provision the Kubernetes cluster where the processing workflow will run.

Terraform


Terraform is a tool to define, provision, and manage cloud infrastructure using configuration scripts.

In this tutorial, we use Terraform scripts to create the Kubernetes cluster in a single step. The advantage - compared to plain command-line gcloud commands - is that we can easily configure input variables. Also, after the workflow finishes, it is easy to delete the resources in a single step.

Kubernetes


Kubernetes is a system to managed containerized workflows on computing clusters. kubectl is the command-line tool to interact with the cluster resources.

In this tutorial, we use kubectl commands to set up some services and to observe the status of the cluster and the data processing workflow.

Argo Workflows


The processing workflow consist of some sequential and parallel steps. We use Argo Workflows to define (or “orchestrate”) the workflow steps.

In this tutorial, the Argo Workflows services are set up using a kubectl command. We use argo, the command-line tool, to submit and manage the workflows.

Ready to go?


Checklist

Check the instructions in Software setup

When done, let’s go!

If you don’t have access to a Linux terminal or prefer not to install tools locally, you can use Google Cloud Shell. You’ll need a Google Cloud Platform (GCP) account and a GCP project. To open Cloud Shell, click the Cloud Shell icon in the top-right corner of the Google Cloud Console.

Cloud Shell comes pre-installed with gcloud, kubectl, terraform, and go. However, you’ll need to install the Argo CLI manually.

Remember that while using Cloud Shell is free, you will need to have credits (or pay) for the resources you deploy.

Key Points

  • Public cloud providers are companies that offer pay-as-you-go computing resources and services over the internet to multiple users or organizations.
  • Terraform is an open-source tool to provision and delete computing infrastructure.
  • Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications and their associated workflows across clusters of hosts..
  • Argo Workflows is an open-source tool for orchestrating sequential and parallel jobs on Kubernetes.

Content from Persistent storage


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • How to create a Google Cloud Storage bucket?
  • What are the basic operations?
  • What are the cost of storage and download?

Objectives

  • Learn to create a Google Cloud Storage bucket
  • Learn to list the contents of a bukcet
  • Understand the persistant storage costs.

Storage for the output files


The processing workflow writes the output files to a storage from which they can be downloaded afterwards.

For this tutorial, we use a Google Cloud Storage (GCS) bucket.

Callout

The storage is created separately from the cluster resources. You can then delete the cluster just after the processing and avoid unnecessary costs, but keep the output files.

Prerequisites


GCP account and project

Make sure that you are in the GCP account and project that you intend to use for this work. In your Linux terminal, type

BASH

gcloud config list

The output shows your account and project.

If they are not what you expect (in case you have many), change them with

gcloud config set account <ACCOUNT>

and

BASH

gcloud config set project <PROJECT_ID>

Accounts

You can can check the credentialed accounts with

BASH

gcloud auth list

If you get none or your account does not appear in the list, run

BASH

gcloud init

and follow the steps in the output.

Projects

List the projects within the active account with

BASH

gcloud projects list

Billing account?

If this is your first project or you created it from the Google Cloud Console Web UI, it will have a billing account linked to it, and you are ready to go.

If you created the project from the command line without specifying the billing account, you must link it to an existing billing account.

First list the billing accounts

BASH

gcloud billing accounts list

Take the account id from the output, and check if your project is linked to it

BASH

gcloud billing projects list --billing-account <ACCOUNT_ID>

If not, link your project to this account with

BASH

gcloud billing projects link <PROJECT_ID> --billing-account <ACCOUNT_ID>

Choose your region

In this tutorial, we will use the Google computing centre europe-west4 located in Netherlands. You can use another one, but use it consistently.

If you computing resources are in a different region than your storage, additional costs and delay may occur.

Create the bucket


Create a storage bucket with

BASH

gcloud storage buckets create gs://<BUCKET_NAME> --location europe-west4

You can test copying a file to it with

BASH

echo test > test.txt
gcloud storage cp test.txt gs://<BUCKET_NAME>/

and list the contents of the bucket with

BASH

gcloud storage ls gs://<BUCKET_NAME>/*

You can remove the file with

BASH

gcloud storage rm gs://<BUCKET_NAME>/test.txt

Note that the bucket is tied to your GCP project and you can only access it when authenticated.

Costs


Storage

The storage cost of a GCS bucket depends on the data volume. The costs may vary from a region to another. For europe-west4, the current (October 2024) monthly cost is $0.020 / GB.

Operations

An operation is an action that makes changes to or retrieves information from a bucket and it has a tiny cost: $0.005 per 1000 operations. This applies, for example, to listing the contents of the bucket on your terminal. The traffic between GCP computing resources and storage within the same region (e.g. in europe-west4) is free.

In the context of this tutorial, the operations costs are insignificant. However, keep this is mind if you plan to write scripts that list the bucket content from your terminal.

Networking and download

Downloading data from the GCS bucket to your computer has a significant cost: the current (October 2024) cost to internet locations (excluding China and Austalia) is $0.12 / GB.

In the context of the example of this tutorial, the resulting output files are approximately 30% of the original MiniAOD dataset volume. For example, downloading the 330 GB ouput from a processing of a 1.1 TB MiniAOD dataset will cost $40.

Key Points

  • Google Cloud Storage bucket can be used to store the output files.
  • The storage cost depends on the volume stored and for this type of processing is very small.
  • The download of big output files from the bucket can be costly.

Content from Disk image


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • Why to build a disk image for cluster nodes?
  • How to build a disk image?

Objectives

  • Create a disk image with the container images.

Why a disk image?


The container image for the MiniAOD processing is very big and it needs to be pulled to the nodes of the cluster. It can take 30 mins to pull it. Making it available to the nodes of the cluster through a pre-built secondary disk image speeds up the workflow start.

We will follow the instructions provided by GCP to build the secondary disk image. All necessary steps are provided in this lesson.

Prerequisites


GCP account and project

Make sure that you are in the GCP account and project that you intend to use for this work. In your Linux terminal, type

BASH

gcloud config list

The output shows your account and project.

Bucket for logs

To build an image disk, we will use a script that produces some logs. Create a bucket for these logs with

gcloud storage buckets create gs://<BUCKET_FOR_LOGS>/ --location europe-west4

Go installed?

You should have go installed, see Software setup.

Enabling services

Before you can create resources on GCP, you will need to enable the corresponding services.

Your project usually has several services enabled already. You can list them with

BASH

gcloud services list --enabled

You can list all available services with

BASH

gcloud services list --available

but that’s a long list!

To build a disk image, you will need to enable Cloud Build API (cloudbuild.googleapis.com) and Compute Engine API (compute.googleapis.com) with

BASH

gcloud services enable cloudbuild.googleapis.com compute.googleapis.com

When services are enabled, some “service accounts” with specific roles get created. You can list them with

BASH

gcloud projects get-iam-policy <PROJECT_ID>

Often, the resources need to work with each other. In this case, the Cloud Build service needs to have two additional roles to access Compute resources (the image disk belongs to that category).

Add them with

BASH

gcloud projects add-iam-policy-binding <PROJECT_ID> --member serviceAccount:<PROJECT_NR>@cloudbuild.gserviceaccount.com --role roles/compute.serviceAgent

BASH

gcloud projects add-iam-policy-binding <PROJECT_ID> --member serviceAccount:<PROJECT_NR>@cloudbuild.gserviceaccount.com --role roles/compute.admin

Application login

You need to create a credential file to run the script:

BASH

gcloud auth application-default login

and authenticate in the browser window that opens.

Get the code


Pull the code from https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/tools/gke-disk-image-builder

If you want only this directory and not the full repository, use the sparse checkout:

git init <your_new_folder>
cd <your_new_folder>
git remote add -f origin git@github.com:GoogleCloudPlatform/ai-on-gke.git

This takes a while.

git config core.sparseCheckout true
echo "tools/gke-disk-image-builder/" >>  .git/info/sparse-checkout
git pull origin main
cd tools/gke-disk-image-builder/

Create the image


To run the script to build the image, you must have go installed.

Run the script with

BASH

go run ./cli --project-name=<PROJECT_ID> --image-name=pfnano-disk-image --zone=europe-west4-a --gcs-path=gs://<BUCKET_FOR_LOGS> --disk-size-gb=50 --container-image=docker.io/cernopendata/cernopendata-client:latest --container-image=docker.io/rootproject/root:latest  --container-image=ghcr.io/cms-dpoa/pfnano-image-build:main --timeout 100m

The script will create a secondary disk images with the container images that are needed in the processing workflow:

  • cernopendata/cernopendata-client for getting the metadata
  • ghcr.io/cms-dpoa/pfnano-image-build:main for the processing
  • rootproject/root for an eventual test plot.

The container image pfnano-image-build is the standard CMS Open data container image with the PFNano processing code compiled.

Note the timeout options, the default tiemout of 20 mins is not enough.

Callout

Note that while images can in most cases be “pulled” from Dockerhub specifying only the image name (e.g. cernopendata/cernopendata-client and rootproject/root), in this script you must give the full registry address starting with docker.io and specify a tag (i.e. :latest).

Failure with errors of type

Code: QUOTA_EXCEEDED
Message: Quota 'N2_CPUS' exceeded.

are due to requested machine type no being available in the requested zone. Nothing to do with you quota.

Try in a different region or with a different machine type. You can give them as parameters e.g. --zone=europe-west4-a --machine-type=e2-standard-4. Independent of the zone specified in parameters, the disk image will have eu as the location, so any zone in europe is OK (if you plan to create your cluster in a zone in europe).

Note that the bucket for logs has to be in the same region so you might need to create another one. Remove the old one with gcloud storage rm -r gs://<BUCKET_FOR_LOGS>.

Once the image is built, you can see it in the list of available images with

BASH

gcloud compute images list

That’s a long list, there are many images already available. Your new image has your project name under “PROJECT” and secondary-disk-image under “FAMILY”.

Costs


Computing

The script runs a Google Cloud Build process and there’s per-minute small cost. 120 minutes are included in the Free tier services.

Storage

The image is stored in Google Compute Engine image storage, and the cost is computed by the archive size of the image. The cost is very low: in our example case, the size is 12.25 GB, and the monthly cost is $0.05/GB.

There’s a minimal cost for the output logs GCS bucket. The bucket can be deleted after the build.

Key Points

  • A secondary boot disk with the container image preloaded can speed up the workflow start.

Content from Kubernetes cluster


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • How to create a Google Kubernetes Engine cluster?
  • How to access the cluster from the command line?

Objectives

  • Learn to create a Google Kubernetes Engine cluster.
  • Access the cluster and inspect it from the command line.

Prerequisites


GCP account and project

Make sure that you are in the GCP account and project that you intend to use for this work. In your Linux terminal, type

BASH

gcloud config list

The output shows your account and project.

Enabling services

Before you can create resources on GCP, you will need to enable them

In addition to what was enabled in the previous section, we will now enable Kubernetes Engine API (container.googleapis.com):

BASH

gcloud services enable container.googleapis.com

Bucket

If you worked through Section 02, you have now a storage bucket for the output files.

List the buckets with

BASH

gcloud storage ls

Secondary disk

If you worked through Section 03, you have a secondary boot disk image available

Get the code


The example Terraform scripts and Argo Workflow configuration are in https://github.com/cms-dpoa/cloud-processing/tree/main/standard-gke-cluster-gcs-imgdisk

Get them with

BASH

git clone git@github.com:cms-dpoa/cloud-processing.git
cd cloud-processing/standard-gke-cluster-gcs-imgdisk

About Terraform


The resources are created by Terraform scripts (with the file extension .tf) in the working directory.

In the example case, they are very simple and the same could be easily done with gcloud commands.

Terraform, however, makes it easy to change and keep track of the parameters. Read more about Terraform on Google Cloud in the Terraform overview.

The configurable parameters are defined in variables.tf and can be modified in terraform.tfvars.

Create the cluster


Set the variables in the terraform.tfvars file for example to

project_id          = "<PROJECT_ID>"
region              = "europe-west4-a"
name                = "1"
gke_num_nodes       = 2

With these parameters, a cluster named cluster-1 with 2 nodes will be created in the region europe-west4-a. You must define your GCP project.

To create the resources, run

BASH

terraform apply

and confirm “yes”.

Connect to the cluster and inspect


Once the cluster is created - it will take a while - connect to it with

BASH

gcloud container clusters get-credentials <CLUSTER_NAME> --region europe-west4-a --project <PROJECT_ID>

You can inspect the cluster with kubectl commands, for example the nodes:

BASH

kubectl get nodes

and the namespaces:

BASH

kubectl get ns

You will see several namespaces, and most of them are used by Kubernetes for different services. We will be using the argo namespace.

For more information about kubectl, check the Quick Reference and the links therein, or use the --help option with any of the kubectl commands.

Costs


Cluster management fee

For the GKE “Standard” cluster, there’s a cluster management fee of $0.10 per hour.

CPU and memory

The cost is determined by the machine and disk type and is per time. For this small example cluster with two e2-standard-4 nodes (4 vCPUs and 16 GB memory) the cost the cost is 0.3$ per hour. Each node has a 100 GB disk, and the cost is for two of these disks is 0.006$ per hour, i.e. very small compared to the machine cost.

The cluster usage contributes to the cost through data transfer and networking, but for this example case it is minimal.

Key Points

  • Kubernetes clusters can be created with Terraform scripts.
  • kubectl is the tool to interact with the cluster.

Content from Set up workflow


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • How to set up Argo Workflow engine?
  • How to submit a test job?
  • Where to find the output?

Objectives

  • Deploy Argo Workflows services to the cluster.
  • Submit a test job.
  • Find the output in your bucket.

Prerequisites


GCP account and project

Make sure that you are in the GCP account and project that you intend to use for this work. In your Linux terminal, type

BASH

gcloud config list

The output shows your account and project.

Bucket

If you worked through Section 02, you have now a storage bucket for the output files.

List the buckets with

BASH

gcloud storage ls

Code

In the previous section, you pulled the code and move to the cloud-processing/standard-gke-cluster-gcs-imgdisk directory.

Argo CLI installed?

You should have Argo CLI installed, see Software setup.

Deploy Argo Workflows service


Deploy Argo Workflows services with

BASH

kubectl apply -n argo  -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.10/install.yaml
kubectl apply -f argo/service_account.yaml
kubectl apply -f argo/argo_role.yaml
kubectl apply -f argo/argo_role_binding.yaml

Wait for the services to start.

You should see the following:

BASH

$ kubectl get all -n argo
NAME                                       READY   STATUS    RESTARTS   AGE
pod/argo-server-5f7b589d6f-jkf4z           1/1     Running   0          24s
pod/workflow-controller-864c88655d-wsfr8   1/1     Running   0          24s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/argo-server   ClusterIP   34.118.233.69   <none>        2746/TCP   25s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/argo-server           1/1     1            1           24s
deployment.apps/workflow-controller   1/1     1            1           24s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/argo-server-5f7b589d6f           1         1         1       24s
replicaset.apps/workflow-controller-864c88655d   1         1         1       24s

About Argo Workflows


The data processing example is defined as an Argo workflow. You can learn about Argo Workflows in their documentation.

Every step in the workflow runs in a container, and there several ways to pass the information between the steps.

The example configuration in argo/argo_bucket_run.yaml has comments to help you to understand how the files and/or parameters can be passed from a step to another.

Submit a test job


The workflow is defined in the argo/argo_bucket_run.yaml file. It is composed of four steps:

  • get-metadata: gets the number of files and the file list for a given CMS Open Data 2016 MiniAOD record
  • joblist: divides the file list into given number of jobs and coumputes the number of events per job (if not all)
  • runpfnano: runs the processing in parallel jobs
  • plot: creates some simple plots.

Edit the parameters in the argo/argo_bucket_run.yaml so that they are

    parameters:
    - name: nEvents
      # Number of events in the dataset to be processed (-1 is all)
      value: 1000
    - name: recid
      # Record id of the dataset to be processed
      value: 30511
    - name: nJobs
      # Number of jobs the processing workflow should be split into
      value: 2
    - name: bucket
      # Name of cloud storage bucket for storing outputs
      value: <YOUR_BUCKET_NAME>

Now submit the workflow with

BASH

argo submit -n argo argo/argo_bucket_run.yaml

Observe its progress with

BASH

argo get -n argo @latest

Once done, check the ouput in the bucket with

BASH

$ gcloud storage ls gs://<YOUR_BUCKET_NAME>/**
gs://<YOUR_BUCKET_NAME>/pfnano/30511/files_30511.txt
gs://<YOUR_BUCKET_NAME>/pfnano/30511/logs/1.logs
gs://<YOUR_BUCKET_NAME>/pfnano/30511/logs/2.logs
gs://<YOUR_BUCKET_NAME>/pfnano/30511/plots/h_num_cands.png
gs://<YOUR_BUCKET_NAME>/pfnano/30511/plots/h_pdgid_cands.png
gs://<YOUR_BUCKET_NAME>/pfnano/30511/scatter/pfnanooutput1.root
gs://<YOUR_BUCKET_NAME>/pfnano/30511/scatter/pfnanooutput2.root

You can copy the files with gcloud storage cp ....

Delete resources

Delete the workflow after each run so that the “pods” do not accumulate. They are not running anymore but still visible.

BASH

argo delete -n argo @latest

Do not delete the cluster if you plan to continue to the next section, but do not keep it idle. The cost goes by the time it exists, not by the time it is in use. You can always create a new cluster.

You can delete all resources created by the Terraform script with

BASH

terraform destroy

Confirm with “yes”.

Costs


Cluster

Running the workflow in the cluster does not increase the cost, so for this small test, the estimates in the previous section are valid.

Data download

As detailed in Section 02, downloading data from the storage costs $0.12 / GB. The output file size - and consequently the download cost - for this quick example is small.

Key Points

  • Once the cluster is up, you will first deploy the Argo Workflows services using kubectl.
  • You will submit and monitor the workflow with argo.
  • You can see the output in the bucket with gcloud commands or on Google Cloud Console Web UI.

Content from Scale up


Last updated on 2024-10-21 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • How to process a full dataset?
  • What is an optimal cluster setup?
  • What is an optimal job configuration?

Objectives

  • Optimize the cluster setup for a full dataset processing.
  • Learn about job configuration.

Resource needs


Create a small cluster and connect

If you do not have the cluster from the previous sections, create a new one. Set the number of nodes gke_num_nodes to 2 in the terraform.tfvars file and create the resources, run

BASH

terraform apply

and confirm “yes”.

Connect to it with

BASH

gcloud container clusters get-credentials <CLUSTER_NAME> --region europe-west4-a --project <PROJECT_ID>

Run a test job

Run a small test as in the previous section, but set the number of step to one. While the job is running, observe the resources usage with

BASH

kubectl get pods

The big processing step - with runpfnano in the name - defines the resource needs. The output indicates how much CPU (in units of 1/1000 of a CPU) and memory the process consumes. Follow the resource consumption during the job to see how it evolves.

The job configuration, i.e. how many jobs will be submitted in a node, is defined by the resouce requests in the workflow definition.

In this example, we have defined them as

      resources:
        requests:
          cpu: "780m"
          memory: "1.8Gi"
          ephemeral-storage: "5Gi"  

and the main constraint here is the CPU. This request will guarantee that only 1 jobs will run on a CPU.

Delete the cluster

Once you have understood the resource consumption for a single job, delete the cluster with terraform destroy.

Cluster configuration


Input data

The optimal cluster configuration depends on the input dataset. Datasets consist of files, and the number of files can vary. In practical terms, the input to the parallel processing steps is a list of files. Dividing events from input files to different processing steps can be done, but would require a filtering list as an input to the processing.

In an ideal case, the parallel steps should take the same amount of time to complete. However, this is usually not the case because

  • input files are not equal in size
  • processing time per events can vary.

However, the best approach is to have the same amount of files in each parallel step. Eventually, the files could be sorted according to their size and their share to the nodes could be optimized.

In the example case, we have used the MuonEG MiniAOD dataset with 353 input files. For that number of files, we deployed a cluster with 90 nodes 4-vCPU nodes, providing a total of 360 vCPUs. We can therefore define a workflow with 353 parallel jobs and have a close to full occupation of the cluster.

The relevant Terraform input variables fur such cluster are

project_id          = "<PROJECT_ID>"
region              = "europe-west4"
gke_num_nodes       = 30

Note that as the “zone” (a, b or c in the location name) is not defined, the cluster will have 30 nodes in each zone, in total 90.

Alternatively, a smaller cluster would be less expensive per unit time but the processing takes a longer. In the benchmarking, a large cluster was found to be practical.

GCP sets quotas to resources, and you can either increase them - to a certain limit - or request a quota increase.

You will notice when you go beyond the predefined quota from this type of message during the cluster creation:

[...]
Error: error creating NodePool: googleapi: Error 403:
Error waiting for creating GKE cluster: Insufficient quota to satisfy the request:

If that happens, go to the Quotas & System Limits page at the Google Cloud Console. The link is also in the error message. Search for quotas that you need to increase.

For a cluster with a big number of nodes, you must increase the quotas for “CPUs” and “In-use regional external IPv4 addresses”.

Once you find the quota line, click on the three vertical dots and choose “Edit quota”.

If you can’t increase them to desired value, submit a quota increase request through this form. You will receive an email with increase request approved (or rarely denied if the location is down in resources). It is usually immediate, but takes some minutes to propagate.

Autoscaling

The Terraform script gke.tf has the autoscaling activated. This makes the cluster scale up or down according to resources in use. This reduces the cost in particular for a cluster with a big amount of nodes. It often happens that some jobs get longer than the other, and in that case the cluster lifetime (and the cost) is defined by the longest job. Autoscaling removes the nodes once they do not have active processes running.

Costs


Cluster management fee

For the GKE “Standard” cluster, there’s a cluster management fee of $0.10 per hour.

CPU and memory

The cost is determined by the machine and disk type and is per time.

The cluster usage contributes to the cost through data transfer and networking, but for this example case it is minimal.

Data download

Key Points