Workflow workspace access

As discussed throughout this guide, workflow inputs can be specified however the developer wishes. However, if an input needs to be data read from a file, rather than a simple string or array value, that data can come from a variety of locations, both public and private.


The simplest example is to pass in a public URL pointing to a STAC item as a Directory input. Provided the URL is public, the Workflow Runner will call it and download the data as a STAC Catalog ready for processing. This URL can of course point to data taken directly from the DataHub Resource Catalog, for example this Sentinel2_ARD item could be used as an input.
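As a rough sketch of what this might look like when calling the execution endpoint, assuming an OGC API Processes style interface (the endpoint URL, workflow name, token handling and input name are all illustrative and depend on your deployment and on how your workflow defines its inputs):

import requests

# Illustrative values: replace with your deployment's execution endpoint,
# workflow name, API token and the Directory input name your workflow defines.
execution_url = "https://<hub-host>/processes/<workflow-name>/execution"
token = "<your-api-token>"

payload = {
    "inputs": {
        # Public URL to a STAC item, e.g. a Sentinel2_ARD item from the
        # DataHub Resource Catalog
        "stac_item": "https://<resource-catalog-host>/collections/sentinel2_ard/items/<item-id>"
    }
}

response = requests.post(
    execution_url,
    json=payload,
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()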

However, if you wish to pass in data from your own workspace, for example when processing private data, the Workflow Runner (WR) supports passing in data files from both S3 Object Stores and AWS Block Stores.

Note, in the following guidance, when we refer to "your workspace" we mean the workspace you are using to execute the workflow, i.e. the one you are authenticated as when calling the execution endpoint. If you are calling a public workflow, this depends only on who is executing the workflow, not on who deployed it.

To input data from an S3 Object Store within your workspace you can follow the example here. To upload data to your S3 Object Store you can use an S3 client with credentials generated from the DataHub here; a sketch of such an upload is shown below.
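A minimal sketch of an upload with boto3, assuming credentials generated from the DataHub; the endpoint URL, bucket name and key layout are placeholders:

import boto3

# Credentials generated from the DataHub; the endpoint, bucket and key
# below are illustrative placeholders.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<access-key-from-datahub>",
    aws_secret_access_key="<secret-key-from-datahub>",
    aws_session_token="<session-token-from-datahub>",
    endpoint_url="https://<s3-endpoint>",  # omit if the default AWS endpoint applies
)

# Upload a local file into your workspace area of the workspace bucket
s3.upload_file(
    Filename="local-data/scene.tif",
    Bucket="<workspaces-bucket>",
    Key="<workspace-name>/inputs/scene.tif",
)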


Accessing S3 data within your workflows

There are two ways you can access S3 data within your workflows.
The first allows you to access S3 Objects directly within your workflow steps (a sketch of such a step follows the list below):

  • Make sure your workflow configures an S3 client (e.g. boto3 in Python) with no credentials. The credentials will be provided by the Service Account applied to the workflow at execution time
  • Then pass in string inputs to your workflow: one for the access point giving access to your workspace sub-directory in the workspace bucket, and others specifying the file keys within this bucket you wish to access. You can specify as many file keys as you wish, or even hard-code them in the workflow itself if you do not expect them to change
  • Make sure the file does indeed sit within your workspace directory
  • The access point should look like
arn:aws:s3:<region>:<account-id>:accesspoint/<access-point-name>
  • To obtain your access point, please request it from a Hub administrator
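A minimal sketch of such a step, assuming the access point ARN and file key arrive as command-line string inputs (the argument handling and output filename are illustrative):

import sys

import boto3

# String inputs passed to the workflow step
access_point_arn = sys.argv[1]  # arn:aws:s3:<region>:<account-id>:accesspoint/<access-point-name>
file_key = sys.argv[2]          # key of a file within your workspace directory

# No credentials are configured here: they are provided by the Service
# Account applied to the workflow at execution time.
s3 = boto3.client("s3")

# An S3 access point ARN can be used in place of a bucket name
obj = s3.get_object(Bucket=access_point_arn, Key=file_key)

with open("input.dat", "wb") as f:
    f.write(obj["Body"].read())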


The second will use the STAGEIN step to load the data from S3 (an example of constructing the input follows the list below):

  • Make sure your workflow defines a Directory input
  • Pass in an S3 link pointing to the file you want to load as input, using your workspace access point, e.g.
s3://arn:aws:s3:<region>:<account-id>:accesspoint/<access-point-name>/path/to/file

where "path/to/file" is the full file key for the item in this bucket.

  • To obtain your access point, please request it from a Hub administrator
  • Make sure the file is in your workspace directory
  • The STAGEIN step will use this access point and file key to access your workspace directory and download the specified data
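For example, the Directory input could be constructed as follows (the input name "workspace_data" and the payload shape are illustrative):

# Illustrative construction of the Directory input for the STAGEIN approach
access_point_arn = "arn:aws:s3:<region>:<account-id>:accesspoint/<access-point-name>"
file_key = "path/to/file"  # full file key for the item in the workspace bucket

payload = {
    "inputs": {
        # The STAGEIN step resolves this URL against your workspace access
        # point and downloads the object before your workflow steps run
        "workspace_data": f"s3://{access_point_arn}/{file_key}"
    }
}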
The other option is to input data directly from your workspace Block Store. A Block Store behaves more like the file system of your computer and means we can work with file paths rather than URLs.


Managing datasets in your workspace

To manage datasets in your workspace Block Store you can use the AppHub (JupyterLab) application on the Hub. This allows you to upload data and create directories to organise your data as you wish. Once the files you want to use are in your Block Store, you can construct a workflow that again uses the STAGEIN step to load your input data (a sketch of this approach follows the list below).

  • Make sure your workflow defines a Directory input
  • Your workspace Block Store is mounted within the workflow pods at the path
/workspace/pv-<workspace-name>-workspace

so ensure your input path starts with this prefix, followed by the path to your file within your Block Store

  • Pass in this constructed file path as a Directory input
  • The STAGEIN step will use this file path to access your workspace Block Store and load the specified data ready for processing
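A minimal sketch of building such a path (the sub-directory layout and input name are illustrative):

# Illustrative construction of a Block Store file path for a Directory input
workspace_name = "<workspace-name>"

# Your workspace Block Store is mounted in the workflow pods at this prefix
mount_prefix = f"/workspace/pv-{workspace_name}-workspace"

# Path to a file previously uploaded via the AppHub (JupyterLab) application
input_path = f"{mount_prefix}/inputs/scene.tif"

payload = {"inputs": {"workspace_data": input_path}}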