Storing workflow output on Google Kubernetes Engine

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I set up shared storage for my workflows?

  • How do I run a simple job and get the output?

  • How do I run a basic CMS Open Data workflow and get the output files?

Objectives
  • Understand how to set up shared storage and use it in a workflow

One concept that differs markedly from running on a local computer is storage. Just as when mounting directories or volumes in Docker, storage must be mounted into a Pod for the Pod to be able to access it. Kubernetes supports a large number of storage types.
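
As a rough analogy, this is the declarative counterpart of mounting a directory into a Docker container on the command line. The sketch below is purely illustrative (the paths are placeholders) and is not part of the exercise:

# Docker analogy: mount a local directory into the container at /mnt/vol
docker run --rm -v "$PWD/output:/mnt/vol" alpine ls -l /mnt/vol

In Kubernetes, the same idea is expressed in the Pod manifest: volumes are declared under volumes and attached to containers via volumeMounts, as we will do below.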

Defining a storage volume

The test job above did not produce any output data files, just text logs. The data analysis jobs will produce output files, so in the following we go through a few steps to set up a volume where the output files will be written and from which they can be fetched. All definitions are passed as “yaml” files, as you have already done in the steps above. Due to some restrictions of the Google Kubernetes Engine, we need to use an NFS persistent volume to allow parallel access (see also this post).

The following commands will take care of this. You can inspect the content of the files that are applied directly from GitHub in the workshop-payload-kubernetes repository.

As a first step, a volume needs to be created:

gcloud compute disks create --size=100GB --zone=us-central1-c gce-nfs-disk-<NUMBER>

Replace <NUMBER> with your account number, e.g. 023. The zone must match the one in which your cluster was created (no change needed here).

The output will look like this (don’t worry about the warnings):

WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#per
formance.
Created [https://www.googleapis.com/compute/v1/projects/cern-cms/zones/us-central1-c/disks/gce-nfs-disk-001].
NAME              ZONE           SIZE_GB  TYPE         STATUS
gce-nfs-disk-001  us-central1-c  100      pd-standard  READY
New disks are unformatted. You must format and mount a disk before it
can be used. You can find instructions on how to do this at:
https://cloud.google.com/compute/docs/disks/add-persistent-disk#formatting
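
If you want to double-check later that the disk exists and is ready, you can list it with gcloud (as before, <NUMBER> stands for your account number):

# Optional check: confirm the disk exists and is READY
gcloud compute disks list --filter="name=gce-nfs-disk-<NUMBER>"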

Now, let’s put this disk to use. Download the manifest for the NFS server deployment:

curl -LO https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/001-nfs-server.yaml

The file looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server-<NUMBER>
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server-<NUMBER>
  template:
    metadata:
      labels:
        role: nfs-server-<NUMBER>
    spec:
      containers:
      - name: nfs-server-<NUMBER>
        image: gcr.io/google_containers/volume-nfs:0.8
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /exports
            name: mypvc
      volumes:
        - name: mypvc
          gcePersistentDisk:
            pdName: gce-nfs-disk-<NUMBER>
            fsType: ext4

Replace all occurrences of <NUMBER> with your account number, e.g. 023. You can edit files directly in the console or by opening the built-in graphical editor.
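
If you prefer the command line over the editor, a sed one-liner does the substitution in place; this is just a convenience sketch, with 023 standing in for your own number:

# Replace every occurrence of <NUMBER> with your account number (023 used as an example)
sed -i 's/<NUMBER>/023/g' 001-nfs-server.yaml

Then apply the manifest: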

kubectl apply -n argo -f 001-nfs-server.yaml
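
The NFS server pod may take a moment to start. A quick, optional way to check that the deployment and its pod are running (again with your own <NUMBER>):

# Check the NFS server deployment and the pod it manages
kubectl get deployment nfs-server-<NUMBER> -n argo
kubectl get pods -n argo -l role=nfs-server-<NUMBER>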

Then on to the server service:

curl -OL https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/002-nfs-server-service.yaml

The file looks like this:

apiVersion: v1
kind: Service
metadata:
  name: nfs-server-<NUMBER>
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    role: nfs-server-<NUMBER>

As above, replace all occurrences of <NUMBER> with your account number, e.g. 023, and apply the manifest:

kubectl apply -n argo -f 002-nfs-server-service.yaml

Download the manifest needed to create the PersistentVolume and the PersistentVolumeClaim:

curl -OL https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/003-pv-pvc.yaml

The file looks as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-<NUMBER>
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <Add IP here>
    path: "/"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-<NUMBER>
  namespace: argo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi

In the line containing server:, replace <Add IP here> with the output of the following command:

kubectl get -n argo svc nfs-server-<NUMBER> | grep ClusterIP | awk '{ print $3; }'

This command queries the nfs-server service that we created above and extracts the ClusterIP that we need to connect to the NFS server. Replace <NUMBER> as before.
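
If you prefer not to paste the IP by hand, you can capture it in a shell variable and let sed fill it in; this is an optional shortcut, not part of the original instructions:

# Extract the ClusterIP of the NFS service and substitute it into the manifest
NFS_IP=$(kubectl get -n argo svc nfs-server-<NUMBER> | grep ClusterIP | awk '{ print $3; }')
sed -i "s/<Add IP here>/$NFS_IP/" 003-pv-pvc.yaml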

Apply this manifest:

kubectl apply -n argo -f 003-pv-pvc.yaml

Let’s confirm that this worked:

kubectl get pvc nfs-<NUMBER> -n argo

You will see output similar to this:

NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs    Bound    nfs      100Gi      RWX                           5h2m

Note that it may take some time for the STATUS to reach “Bound”.
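
If it is not bound yet, you can watch the claim until its status changes (interrupt with Ctrl-C):

# Watch the PersistentVolumeClaim until STATUS shows Bound
kubectl get pvc nfs-<NUMBER> -n argo -w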

Now we can use this volume in the workflow definition. Create a workflow definition file argo-wf-volume.yaml with the following contents:

# argo-wf-volume.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-hostpath-
spec:
  entrypoint: test-hostpath
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<NUMBER>
  templates:
  - name: test-hostpath
    script:
      image: alpine:latest
      command: [sh]
      source: |
        echo "This is the ouput" > /mnt/vol/test.txt
        echo ls -l /mnt/vol: `ls -l /mnt/vol`
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Submit and check this workflow with

argo submit -n argo argo-wf-volume.yaml
argo list -n argo

Take the name of the workflow from the output (replace XXXXX in the following command) and check the logs:

kubectl logs pod/test-hostpath-XXXXX -n argo main
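
Alternatively, with a reasonably recent Argo CLI you can fetch the logs of the most recent workflow without looking up the pod name:

# Show the logs of the latest workflow via the Argo CLI
argo logs -n argo @latest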

Once the job is done, you will see something like:

ls -l /mnt/vol: total 20 drwx------ 2 root root 16384 Sep 22 08:36 lost+found -rw-r--r-- 1 root root 18 Sep 22 08:36 test.txt

Get the output file

The example job above produced a text file as output. It resides on the persistent volume used by the workflow job. To copy the file from that volume to the Cloud Shell, we will define a container, a “storage pod”, and mount the volume in it so that we can access the files.

Create a file pv-pod.yaml with the following contents:

# pv-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: nfs-<NUMBER>
  containers:
    - name: pv-container
      image: busybox
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /mnt/data
          name: task-pv-storage

Create the storage pod and copy the files from it:

kubectl apply -f pv-pod.yaml -n argo
kubectl cp pv-pod:/mnt/data /tmp/poddata -n argo

and you will get the file created by the job in /tmp/poddata/test.txt.
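
A quick check that the copy worked:

# Should print the line written by the workflow
cat /tmp/poddata/test.txt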

Run a CMS open data workflow

If the steps above are successful, we are now ready to run a workflow to process CMS open data.

Create a workflow file argo-wf-cms.yaml with the following content:

# argo-wf-cms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nanoaod-argo-
spec:
  entrypoint: nanoaod-argo
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: nfs-<NUMBER>
  templates:
  - name: nanoaod-argo
    script:
      image: cmsopendata/cmssw_5_3_32
      command: [sh]
      source: |
        source /opt/cms/entrypoint.sh
        sudo chown $USER /mnt/vol
        mkdir workspace
        cd workspace
        git clone https://github.com/cms-opendata-analyses/AOD2NanoAODOutreachTool AOD2NanoAOD
        cd AOD2NanoAOD
        scram b -j8
        nevents=100
        eventline=$(grep maxEvents configs/data_cfg.py)
        sed -i "s/$eventline/process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32($nevents) )/g" configs/data_cfg.py
        cmsRun configs/data_cfg.py
        cp output.root /mnt/vol/
        echo  ls -l /mnt/vol
        ls -l /mnt/vol
      volumeMounts:
      - name: task-pv-storage
        mountPath: /mnt/vol

Submit the job with

argo submit -n argo argo-wf-cms.yaml --watch

The option --watch gives a continuous follow-up of the progress. To get the logs of the job, use the pod name (nanoaod-argo-XXXXX), which you can also find with

argo get -n argo @latest

and follow the container logs with

kubectl logs pod/nanoaod-argo-XXXXX -n argo main

Get the output file output.root from the storage pod in a similar manner as it was done above.
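
For example, a command along these lines should work, assuming the same storage pod and target directory as before:

# Copy the ROOT file from the storage pod to the Cloud Shell
kubectl cp pv-pod:/mnt/data/output.root /tmp/poddata/output.root -n argo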

Accessing files via http

With the storage pod, you can copy files between the storage element and the CloudConsole. However, a practical use case would be to run the “big data” workloads in the cloud, and then download the output to your local desktop or laptop for further processing. An easy way of making your files available to the outside world is to deploy a webserver that mounts the storage volume.

We first download the configuration for the webserver that will be deployed and store it in a ConfigMap:

mkdir conf.d
cd conf.d
curl -sLO https://raw.githubusercontent.com/cms-opendata-workshop/workshop-payload-kubernetes/master/conf.d/nginx-basic.conf
cd ..
kubectl create configmap basic-config --from-file=conf.d -n argo
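
Optionally, confirm that the ConfigMap exists and contains the nginx configuration file:

# Show the ConfigMap and the keys (files) it contains
kubectl describe configmap basic-config -n argo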

Then prepare to deploy the fileserver by downloading the manifest:

curl -sLO https://github.com/cms-opendata-workshop/workshop-payload-kubernetes/raw/master/deployment-http-fileserver.yaml

Open this file and again adjust the <NUMBER>:

# deployment-http-fileserver.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    service: http-fileserver
  name: http-fileserver
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      service: http-fileserver
  template:
    metadata:
      labels:
        service: http-fileserver
    spec:
      volumes:
      - name: volume-output
        persistentVolumeClaim:
          claimName: nfs-<NUMBER>
      - name: basic-config
        configMap:
          name: basic-config
      containers:
      - name: file-storage-container
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
          - mountPath: "/usr/share/nginx/html"
            name: volume-output
          - name: basic-config
            mountPath: /etc/nginx/conf.d

Apply and expose the port as a LoadBalancer:

kubectl create -n argo -f deployment-http-fileserver.yaml
kubectl expose deployment http-fileserver -n argo --type LoadBalancer --port 80 --target-port 80

Exposing the deployment will take a few minutes. Run the following command to follow its status:

kubectl get svc -n argo

You will initially see a line like this:

NAME              TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
http-fileserver   LoadBalancer   10.8.7.24    <pending>     80:30539/TCP   5s

The <pending> EXTERNAL-IP will update after a few minutes (run the command again to check). Once it’s there, copy the IP and paste it into a new browser tab. This should welcome you with a “Hello from NFS” message. In order to enable file browsing, we need to delete the index.html file in the pod. Determine the pod name using the first command listed below and adjust the second command accordingly.

kubectl get pods -n argo
kubectl exec http-fileserver-XXXXXXXX-YYYYY -n argo -- rm /usr/share/nginx/html/index.html
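
Once the index page is gone, the volume contents are listed in the browser, and you can also download individual files from your local machine, for example with wget (replace <EXTERNAL-IP> with the address shown by kubectl get svc; the file name is just an example):

# Download a workflow output file directly from the exposed fileserver
wget http://<EXTERNAL-IP>/output.root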

Warning: anyone can now access these files

This IP is now accessible from anywhere in the world, and therefore so are your files (mind that there are charges for outgoing bandwidth). Please delete the service again once you have finished downloading your files.

kubectl delete svc/http-fileserver -n argo

If you need access again later, run the kubectl expose deployment command above to re-expose it.

Remember to delete your workflow as well, to avoid additional charges.

argo delete -n argo @latest

Key Points

  • CMS Open Data workflows can be run in a commercial cloud environment using modern tools