Fuzzball Documentation

Complicated Job DAGs, Ingress, and Egress

We can continue to build upon this Fuzzfile to create a workflow with a more complicated Directed Acyclic Graph (DAG) of jobs and dependencies. We will also add data ingress and egress to demonstrate the process of importing and exporting data to and from your workflow.

Our new workflow will create multiple jobs that run the cowsay program on the input data with different options to produce different results. Instead of creating a fortune from scratch, we will use data ingress to import data into our workflow, and then add to the data using the fortune command. Finally, we will use a job to concatenate all of our results into a single file, and data egress will move the file to an S3 bucket where we have read/write access through a saved secret.

This example requires that you are able to create ephemeral volumes, you have access to an external S3 bucket, and you have configured a secret to access the bucket. You may need to spend some time reviewing the linked docs and/or you may need your administrator to set permissions for you.

You can open the workflow that we created in the last example in the Workflow Editor. If it is not already open, navigate to the Workflows tab (on the left), find the workflow that you executed in the last example (the one where the cowsay command displays text created by the fortune command), open it in the workflow dashboard, and select “Open in Workflow Editor” in the upper right.

First, let’s add data ingress to the existing volume so that it starts with some data already saved on it. We’ll just grab a public file from GitHub containing the text “Hello World!”. Click on the vertical Volumes tab, and open the Volume we created to support the jobs (called testVolume in our example). Now click on the Add Ingress button:

menu for adding a new ingress to the existing volume

A new dialog box opens up to guide you through the process of creating ingress. Use the drop-down under the Location heading to select “https”. Now you can add the following text to the dialog box to get the README file from one of Octocat’s repos on GitHub.

raw.githubusercontent.com/octocat/Hello-World/master/README

Since this file will be used as the starting point for a new fortune, we will set the destination to a file called fortune.txt. (Keeping the file name the same as in the last example will also help minimize the custom changes we have to make.)

configuring ingress from github

After pressing “OK” you can double-check the changes in the new Ingress box in the Volume Configuration tab.
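
In the Fuzzfile, this corresponds to an ingress block under the volume definition (you will see it again in the complete Fuzzfile at the end of this section):

ingress:
  - source:
      uri: https://raw.githubusercontent.com/octocat/Hello-World/master/README
    destination:
      uri: file://fortune.txt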

Now let’s make a small tweak to our fortune job. First, click the job to open the Job tab in the job configuration menu. Add a second angle bracket (>) to the command so that standard output is appended to the end of the file instead of overwriting it. The command should look like this:

fortune >>/tmp/fortune.txt

new command in fortune job

Now the fortune command will append some text to the file we got from GitHub instead of overwriting it.
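
Behind the scenes, the Workflow Editor wraps each command in a /bin/sh -c invocation, which is why shell features like the >> redirect work. In the Fuzzfile, the fortune job’s command is stored like this:

command:
  - /bin/sh
  - '-c'
  - fortune >>/tmp/fortune.txt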

Now let’s tweak our cowsay job so that it saves a file instead of echoing its standard output to the log. Append >/tmp/cow1.txt to the end of the command so that the full command looks like this:

cat /tmp/fortune.txt | cowsay >/tmp/cow1.txt

Next, let’s make a few more jobs that will run in parallel with the job that executes the cowsay command. These jobs will create a few different ASCII-art files. Use the button with the plus sign in the lower right corner to drag and drop two more jobs into the workflow grid. You can name them sheepsay and tuxsay. Finally, draw lines from the fortune job to these new jobs to indicate dependencies.

adding two new jobs to run in parallel with cowsay

If you are comfortable editing the Fuzzfile directly, it might be easier to press the “e” key to open the editor and then copy text blocks under the jobs field to “clone” the cowsay job with all of its settings in place.
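
For example, once the sheepsay job has been cloned, renamed, and had its command updated (as described below), its entry under the jobs field will look like this:

sheepsay:
  image:
    uri: oras://godlovedc/lolcow:sif
  mounts:
    testVolume:
      location: /tmp
  command:
    - /bin/sh
    - '-c'
    - cat /tmp/fortune.txt | cowsay -f sheep >/tmp/cow2.txt
  requires:
    - fortune
  resource:
    cpu:
      cores: 1
    memory:
      size: 1GB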

If your workflow starts to look a little messy, you can use the “o” key to automatically organize the workflow jobs.

Now we need to configure the new jobs. You can simply copy all of the settings from the cowsay job to the newly created sheepsay and tuxsay jobs. Don’t forget to copy the Mounted Volume configuration as well.

The one change that we will make to the sheepsay and tuxsay jobs is to tweak their commands so that they produce different output and save it to different files. Make their commands look like this:

cat /tmp/fortune.txt | cowsay -f sheep >/tmp/cow2.txt

cat /tmp/fortune.txt | cowsay -f tux >/tmp/cow3.txt

Now let’s create another job that will concatenate all of the results into a single file. Of course, this job will need to run after all of the other jobs have completed. The DAG will reflect this.

Use the drag and drop widget to create another job in your workflow grid. Name this job concatenate. Draw connections from all three *say jobs to the top of the new concatenate job. Your finished workflow should look like this in the editor:

finished workflow in editor grid

This workflow starts with one “preprocessing” job and then fans out into several other jobs that perform work in parallel. The output from these jobs is reduced back to one job like a funnel. This is a useful pattern that will be familiar to many HPC users.
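
In the Fuzzfile, this fan-out/fan-in structure is expressed entirely through each job’s requires field. Stripped down to just the dependencies (this skeleton is for illustration only, not a runnable Fuzzfile), the DAG looks like this:

jobs:
  fortune: {}        # no requires; runs first
  cowsay:
    requires:
      - fortune
  sheepsay:
    requires:
      - fortune
  tuxsay:
    requires:
      - fortune
  concatenate:
    requires:
      - cowsay
      - sheepsay
      - tuxsay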

In the Job tab, make sure to add the following command to the concatenate job.

cat /tmp/cow*txt >/tmp/output.txt

We don’t need any special programs to be installed in the container that supports this job, so you can use the lightweight Alpine container by heading to the Environment tab and adding the following URI:

docker://alpine

While you are there, you can go ahead and click the Add Mounted Volume button and bind the testVolume to /tmp as we did with the previous jobs.

Then you can go to the Resources tab and allocate 1 core and 1GB of memory. Once you save your changes, the concatenate job should be fully configured.
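
Putting those settings together, the concatenate job’s entry in the Fuzzfile should look like this:

concatenate:
  image:
    uri: docker://alpine
  mounts:
    testVolume:
      location: /tmp
  command:
    - /bin/sh
    - '-c'
    - cat /tmp/cow*txt >/tmp/output.txt
  requires:
    - cowsay
    - sheepsay
    - tuxsay
  resource:
    cpu:
      cores: 1
    memory:
      size: 1GB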

To finish out the example, let’s make sure that the output.txt file gets saved to a location of our choice using data egress. Click on the vertical Volumes tab and click on the testVolume that we set up earlier. Then you can click the button to Add Data Egress. In the dialog box that opens, add the appropriate values to access your S3 bucket. In my case, the values below do the trick, but you will need to use something different to access your own S3 bucket with your configured secret.

egress configured for S3 bucket
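
In the Fuzzfile, my egress configuration corresponds to the following block under the volume definition (substitute a bucket path and secret that you have access to):

egress:
  - source:
      uri: file://output.txt
    destination:
      uri: s3://co-ciq-misc-support/godloved/output.txt
      secret: secret://user/GODLOVED_S3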

And that’s it! After running the workflow and downloading the resulting output.txt file you can see that it contains something like this. (Your fortune will probably be different.)

 _________________________________________
/ Hello World! If you sow your wild oats, \
\ hope for a crop failure.                /
 -----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
 _________________________________________
/ Hello World! If you sow your wild oats, \
\ hope for a crop failure.                /
 -----------------------------------------
  \
   \
       __
      UooU\.'@@@@@@`.
      \__/(@@@@@@@@@@)
           (@@@@@@@@)
           `YY~~~~YY'
            ||    ||
 _________________________________________
/ Hello World! If you sow your wild oats, \
\ hope for a crop failure.                /
 -----------------------------------------
   \
    \
        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

As in the previous sections, you can view the Fuzzfile at any time from the Workflow Editor by clicking the ellipsis menu in the lower right of the workflow grid and selecting “Edit YAML”, or by pressing “e” on your keyboard. You can also view the Fuzzfile by clicking on the “Definition” tab in the “Workflows” dashboard. The Fuzzfile generated by the Workflow Editor is now fairly complicated, but hopefully it is also approachable and understandable after we’ve gone through the exercise of creating it.

version: v1
jobs:
  cowsay:
    image:
      uri: oras://godlovedc/lolcow:sif
    mounts:
      testVolume:
        location: /tmp
    command:
      - /bin/sh
      - '-c'
      - cat /tmp/fortune.txt | cowsay >/tmp/cow1.txt
    requires:
      - fortune
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  tuxsay:
    image:
      uri: oras://godlovedc/lolcow:sif
    mounts:
      testVolume:
        location: /tmp
    command:
      - /bin/sh
      - '-c'
      - cat /tmp/fortune.txt | cowsay -f tux >/tmp/cow3.txt
    requires:
      - fortune
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  fortune:
    image:
      uri: oras://godlovedc/lolcow:sif
    mounts:
      testVolume:
        location: /tmp
    command:
      - /bin/sh
      - '-c'
      - fortune >>/tmp/fortune.txt
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  sheepsay:
    image:
      uri: oras://godlovedc/lolcow:sif
    mounts:
      testVolume:
        location: /tmp
    command:
      - /bin/sh
      - '-c'
      - cat /tmp/fortune.txt | cowsay -f sheep >/tmp/cow2.txt
    requires:
      - fortune
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  concatenate:
    image:
      uri: docker://alpine
    mounts:
      testVolume:
        location: /tmp
    command:
      - /bin/sh
      - '-c'
      - cat /tmp/cow*txt >/tmp/output.txt
    requires:
      - cowsay
      - sheepsay
      - tuxsay
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
volumes:
  testVolume:
    egress:
      - source:
          uri: file://output.txt
        destination:
          uri: s3://co-ciq-misc-support/godloved/output.txt
          secret: secret://user/GODLOVED_S3
    ingress:
      - source:
          uri: https://raw.githubusercontent.com/octocat/Hello-World/master/README
        destination:
          uri: file://fortune.txt
    reference: volume://user/ephemeral
If you want to replicate this or any of the workflows in these examples, but you don’t want to manually recreate them using the Workflow Editor, you can always copy and paste this text into a file and open the file in the Workflow Editor. Or you can just press “e” to open the text editor window in the Workflow Editor and paste in this text!

The preceding sections have covered the major aspects of building workflows through the GUI. We will cover job arrays, distributed jobs, GPU jobs, and other advanced resource requests in other sections.