Application packages

Workflows take the form of OGC Application Packages, which include a Common Workflow Language (CWL) script that defines the step(s) to be computed within a workflow, the outputs from each step, and the overall inputs and outputs for the workflow. An example CWL script can be found here. This script takes as input a URL to an image file, a function name, and a scale factor expressed as a percentage. It then carries out the specified function on the image, in this case a resize by the provided scale factor. The resulting image is output, along with a STAC catalogue describing the image asset.
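A tool of this kind might be sketched in CWL as follows. This is an illustrative outline only, not the example script itself: the entry point, input names, and prefixes are assumptions.

```yaml
cwlVersion: v1.0
class: CommandLineTool
id: resize
# Hypothetical entry point; the real package defines its own command
baseCommand: ["python", "/app/resize_image.py"]
inputs:
  image_url:
    type: string
    inputBinding: {prefix: --url}
  fn:
    type: string
    inputBinding: {prefix: --fn}
  scale_percent:
    type: string
    inputBinding: {prefix: --scale}
outputs:
  results:
    type: Directory
    outputBinding:
      # Capture the directory containing the resized image and catalog.json
      glob: .
```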


Some best practices for creating application packages can be found here, including requirements that a workflow must comply with in order to work with the WR API. These cover the types of input supported as well as the output format required to work successfully with the WR STAGEOUT step.


The following restrictions must be followed:

  • Workflow IDs cannot contain underscores and must be at most 19 characters long
  • Your final outputs must be captured with a `glob` expression, which must capture a directory containing a catalog.json file along with any other outputs you wish to export
  • You may define default input values in the workflow, but these are not currently used during execution, so you must provide these values again when executing a workflow
  • The requested resource size, if specified, must not exceed what the cluster provides: currently 16GB of RAM and 4 CPUs
  • Your workflow must produce a catalog.json STAC Catalog which points to your generated STAC Items, and these must be captured in the output of the CWL using a `glob`
  • CWL inputs are passed as command-line strings, and there is a limit on how long these can be. On the cluster this is currently 2621440 bytes, so longer command-line arguments will lead to errors. Instead you should load such data from a file, for example using the STAGEIN functionality discussed later
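The output-capture requirement above might look like the following in a CWL tool definition. The output name and directory are illustrative assumptions:

```yaml
outputs:
  stac_catalog:
    type: Directory
    outputBinding:
      # The globbed directory must contain catalog.json plus any
      # STAC Items and assets you wish to export
      glob: ./output
```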


The Workflow Runner also includes two built-in steps that are run automatically, when required, for each workflow.

  • A STAGEIN step is run whenever a defined workflow includes a Directory definition pointing to a STAC asset. This step extracts the STAC data from the provided location and downloads it ready for local processing. The data provided in the Directory input can come from a URL, a workspace Block Store, or an S3 Object Store. Any Directory inputs must point to a STAC Feature; the STAGEIN step then extracts all Assets within this Item and constructs a new STAC Catalog containing the Item and its assets. This new Catalog is indexed locally with local file hrefs. See this example workflow, which invokes the STAGEIN step by providing a URL to a STAC Feature.
  • A STAGEOUT step is always run for a workflow and is responsible for extracting the workflow outputs and placing them in the correct Workspace S3 Object Store location, ready for harvesting and ingesting into the Resource Catalogue. This leads to a major requirement for OGC Application Packages: they must produce valid STAC catalog outputs containing a catalog.json file and any number of STAC Features and assets. You can see a simple function that constructs this STAC Catalog and STAC Items here. Upon workflow completion the generated STAC Catalog (catalog.json) file must be present in the set of results files, and it must point to any STAC Features you also wish to be output from the workflow. An example catalog.json file can be seen below:
{
  "stac_version": "1.0.0",
  "id": "catalog",
  "type": "Catalog",
  "description": "Root catalog",
  "links": [
    {
      "type": "application/geo+json",
      "rel": "item",
      "href": "item.json"
    },
    {
      "type": "application/json",
      "rel": "self",
      "href": "catalog.json"
    }
  ]
}

And an example of a STAC Feature, item.json, generated from the same workflow:

{
  "stac_version": "1.0.0",
  "id": "item-1728909682.980245290",
  "type": "Feature",
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [-180, -90],
        [-180, 90],
        [180, 90],
        [180, -90],
        [-180, -90]
      ]
    ]
  },
  "properties": {
    "created": "2024-10-14T12:41:22.980Z",
    "datetime": "2024-10-14T12:41:22.980Z",
    "updated": "2024-10-14T12:41:22.980Z"
  },
  "bbox": [-180, -90, 180, 90],
  "assets": {
    "item": {
      "type": "image/png",
      "roles": ["data"],
      "href": "item.png",
      "file:size": 19133
    }
  },
  "links": [
    {
      "type": "application/json",
      "rel": "parent",
      "href": "catalog.json"
    },
    {
      "type": "application/geo+json",
      "rel": "self",
      "href": "item.json"
    },
    {
      "type": "application/json",
      "rel": "root",
      "href": "catalog.json"
    }
  ]
}

It is vital that these outputs are generated correctly and captured in the workflow outputs, as the WR uses the links in these files to ensure the outputs are harvested into the Resource Catalogue.
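A workflow step can write files of this shape with nothing more than the standard library. The sketch below is illustrative only, not the platform's own helper function; the function name, arguments, and defaults are assumptions, and a real step would fill in a meaningful geometry and asset list.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_stac_outputs(output_dir, asset_href, asset_type="image/png"):
    """Write a minimal catalog.json and item.json matching the examples above.

    Illustrative sketch only (assumed names/defaults), not the platform helper.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")

    # A single global-extent Item pointing back to the root catalog
    item = {
        "stac_version": "1.0.0",
        "id": f"item-{datetime.now(timezone.utc).timestamp()}",
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-180, -90], [-180, 90], [180, 90],
                             [180, -90], [-180, -90]]],
        },
        "properties": {"created": now, "datetime": now, "updated": now},
        "bbox": [-180, -90, 180, 90],
        "assets": {
            "item": {"type": asset_type, "roles": ["data"], "href": asset_href},
        },
        "links": [
            {"type": "application/json", "rel": "parent", "href": "catalog.json"},
            {"type": "application/geo+json", "rel": "self", "href": "item.json"},
            {"type": "application/json", "rel": "root", "href": "catalog.json"},
        ],
    }

    # Root catalog linking to the Item; STAGEOUT follows these links
    catalog = {
        "stac_version": "1.0.0",
        "id": "catalog",
        "type": "Catalog",
        "description": "Root catalog",
        "links": [
            {"type": "application/geo+json", "rel": "item", "href": "item.json"},
            {"type": "application/json", "rel": "self", "href": "catalog.json"},
        ],
    }

    (out / "item.json").write_text(json.dumps(item, indent=2))
    (out / "catalog.json").write_text(json.dumps(catalog, indent=2))
    return out / "catalog.json"
```

The directory returned here is what the workflow's `glob` must capture, so that catalog.json and every linked Item and asset travel together into STAGEOUT.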

Note that, for the time being, Collections need not be generated by the workflow: a Collection is automatically generated in the STAGEOUT step, and it is this Collection, “col_{job_id}”, that will be ingested into the User Workspace Catalogue. Any Collections generated by the workflow will not be used in the STAGEOUT step and should not be produced at this time.