Parallel Workflows with Job Arrays

Overview

Fuzzball supports parallel workloads through task arrays.

A parallel workload has some number of independent tasks that follow a similar pattern. For example, imagine you have 1000 data files, and processing each file requires 1 core and takes about an hour. If you processed all of these files sequentially, the job would take around 42 days to complete. Now imagine you have a powerful workstation with 32 cores allowing you to process 32 input files simultaneously. As files finish processing, other files will take their place and begin processing. Using this strategy, you can decrease the time needed to process the entire data set to around 31 1/2 hours. If you had access to 1000 cores you could reduce the processing time to a single hour!

Basic example

Here is a basic example Fuzzfile illustrating task arrays in Fuzzball.

version: v1
jobs:
  rollcall:
    image:
      uri: oras://docker.io/godlovedc/lolcow:sif
    command:
      - /bin/sh
      - '-c'
      - 'echo "cow #${FB_TASK_ID}" reporting for duty | cowsay'
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
    task-array:
      start: 1
      end: 6
      concurrency: 3

The task-array section lets you specify that multiple copies of your job should run in parallel. Each task has a task ID. You can set the range using the start and end fields. Jobs contain an environment variable called FB_TASK_ID that allows you to reference the task ID of the currently running job. The concurrency field allows you to specify how many jobs should run in parallel (assuming that enough resources exist in the cluster).

Screenshots from the Fuzzball GUI may give you a better idea of how this task array runs. Note that Fuzzball queues up 3 tasks in the array at a time, respecting the concurrency field.

first three tasks in array starting

After the job runs you can check the logs to see how the FB_TASK_ID variable is used to change the standard output in each job.

herd of cows task array output

A Directory of Files

It is a very common pattern to have a collection of files and to need to do something to all of them. Consider this collection of files in the directory some-files.

$ ls some-files/
file10.txt  file2.txt  file3.txt  file4.txt  file5.txt  file6.txt
file7.txt  file8.txt  file999.txt  file9.txt

$ cat /tmp/some-files/*.txt
this is file 10
this is file 2
this is file 3
this is file 4
this is file 5
this is file 6
this is file 7
this is file 8
this is file 999
this is file 9

Note that the filenames are numbered, but that number 1 is missing and the last file is 999. The commands in our job cannot just iterate on file numbers.

Here is a generic Fuzzfile that will let you iterate through a directory of files and perform some action on them in a parallel workflow. In this example, we simply echo the contents of each file with ascii-art, but you can change the command to do whatever you want.

version: v1
jobs:
  untar:
    mounts:
      v1:
        location: /tmp
    cwd: /tmp
    command:
      - /bin/sh
      - '-c'
      - tar xvf some-files.tar.gz
    image:
      uri: docker://alpine
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  process-files:
    requires:
      - untar
    mounts:
      v1:
        location: /tmp
    cwd: /tmp/
    command:
      - /bin/bash
      - '-c'
      - cd some-files && v=$(ls) && a=($v) && cat "${a[$FB_TASK_ID]}" | cowsay
    image:
      uri: oras://docker.io/godlovedc/lolcow:sif
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
    task-array:
      start: 1
      end: 10
      concurrency: 10
volumes:
  v1:
    ingress:
      - source:
          uri: s3://co-ciq-misc-support/godloved/some-files.tar.gz
          secret: secret://user/GODLOVED_S3
        destination:
          uri: file://some-files.tar.gz
    reference: volume://user/ephemeral

The key to this strategy is this command that appears in the process-files job.

cd some-files && v=$(ls) && a=($v) && cat "${a[$FB_TASK_ID]}" | cowsay

After moving to the correct directory, the files are listed and captured in a variable (v). Then v is converted to an array (a) that can then be indexed by $FB_TASK_ID to access the correct file.

The typical /bin/sh invocation has been replaced in this Fuzzfile with /bin/bash. This allows us to use a Bash array to index the file list by job number.

This job produces logs like the following:

logs from the processing files example job