The toasty Image-Processing Pipeline Framework

The toasty package provides a series of “pipeline” commands that automate the process of tiling a collection of images and publishing the resulting data files in a format easily usable by the AAS WorldWide Telescope.

Overview

The toasty image-processing pipeline is designed to run automatically and incrementally. The basic framework is that there should be some central data repository that contains the pipeline configuration and previously processed data; in production circumstances, this repository is web-accessible such that the data can be accessed by WWT clients. As much configuration as possible is stored in the data repository, to promote reproducibility. Pipeline operations then follow this general scheme:

  1. Fetch a set of new images to process from a data source.

  2. Process each image into WWT-friendly formats as needed, emitting appropriate metadata files.

  3. Publish the processed image data to the data repository.

  4. Rebuild the repository’s indexes of available data sets.

The fetch stage can use stored configuration and data to ensure that only new images are processed. The processing and publishing stages are separated so that a human can review the processing outputs if so desired. The re-indexing stage is logically separate and can be run independently of the data ingest.

Each pipeline stage is implemented as a subcommand of the toasty command-line program.
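
Putting the stages together, a complete ingest cycle might look something like the following sketch, using the local data-access mode described below; the repository path /data/wwt-feed is hypothetical:

toasty pipeline-fetch-inputs --local=/data/wwt-feed .
toasty pipeline-process-todos --local=/data/wwt-feed .
# ... review the outputs in out_todo here, if desired ...
toasty pipeline-publish-todos --local=/data/wwt-feed .
toasty pipeline-reindex --local=/data/wwt-feed .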

Data Access

Access to the central data repository is essential to pipeline operations. The access mechanism is specified with command-line arguments, which must be passed to every toasty pipeline command. There are currently two categories of data repository that can be accessed, each of which has its own set of arguments.

Local

The “repository” can be a directory on the local machine. This mode is mostly intended for testing, but it could also be useful if the pipeline is being run on a web server. To use this data access mode, pass the command-line argument:

--local=path/to/data

where the value of the argument is some directory name. Data and configuration will be stored inside the directory.
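
For example, to fetch new inputs from a local repository rooted at the hypothetical path /data/wwt-feed, using the current directory as the work directory:

toasty pipeline-fetch-inputs --local=/data/wwt-feed .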

Azure Blob Storage

The data can also be stored on an Azure Blob Storage server. To use this data access mode, pass the following command-line arguments.

  • --azure-conn-env=CONNSTR — this argument gives the name of an environment variable that contains an Azure Blob Storage connection string. Note that it is not the connection string itself, since it is easy to leak secrets through command-line arguments. In this example, there should be an environment variable named CONNSTR whose value is the actual connection string.

  • --azure-container=mydata — this is the name of the blob container that stores the data.

  • --azure-path-prefix=myfeed — this is a Unix-style folder path under which data will be stored inside the container. This argument is optional; if it is not given, the data will be stored in the root of the container.
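
For example, a complete fetch invocation using this access mode might look like the following sketch; the connection string value is, of course, a placeholder:

export CONNSTR='DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...'
toasty pipeline-fetch-inputs \
  --azure-conn-env=CONNSTR \
  --azure-container=mydata \
  --azure-path-prefix=myfeed \
  .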

Configuration

Wherever your data are stored, the root of the repository should contain a configuration file named toasty-pipeline-config.yaml. As implied, this file contains structured data in the YAML format. An example is:

source_type: astropix
publish_url_prefix: //wwtfiles.blob.core.windows.net/feeds/nrao/
folder_name: NRAO Studies
folder_thumbnail_url: //www.worldwidetelescope.org/wwtweb/thumbnail.aspx?name=radiostudies

astropix:
  json_query_url: https://astropix.ipac.caltech.edu/link/c60?format=json

The top-level settings are:

  • source_type specifies the source of imagery to be obtained in the fetch stage. Values are documented below.

  • publish_url_prefix specifies the URL prefix below which the data produced by the pipeline will ultimately become publicly accessible. This setting needs to be provided in order to write the correct data access URLs into the WTML files generated by the pipeline.

  • folder_name specifies the user-facing name that will be given to the folder of image data emitted by the pipeline.

  • folder_thumbnail_url gives a URL for the thumbnail image to be associated with the folder.

AstroPix Data Source

Currently, the only allowed source_type is astropix, which downloads and parses an imagery feed from the AstroPix service.

When using the astropix data source, the toasty-pipeline-config.yaml file should contain a dictionary named astropix as in the example above. This dictionary should contain one key, json_query_url. This key should give the URL of a saved AstroPix search in the JSON output format. When fetching new data, the pipeline will download JSON from this URL and parse it to index the available imagery.

Stage 1: Data Fetching

The first step of the pipeline is to download new imagery for local processing. The command-line invocation is:

toasty pipeline-fetch-inputs {data-args} {work-dir}

where {data-args} should be replaced with the correct data-access arguments and {work-dir} is a path to a local directory that will store pipeline data. The current directory (.) is a fine choice.

The data will be downloaded into a subdirectory cache_todo of the work directory. Within this directory, there will be one subdirectory for each image to process. Images that have already been processed will be skipped; an image counts as processed if the data repository contains an index.wtml file inside a folder named after the image’s unique ID. You can force an image to be skipped by creating a file named skip.flag in the folder where its index.wtml would go.
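
After a successful fetch, the work directory might therefore look something like this sketch, where the image identifiers are hypothetical:

cache_todo/
  noao-m81-radio/
    (downloaded image data and metadata)
  alma-hltau-disk/
    (downloaded image data and metadata)

Likewise, to force the hypothetical image noao-m81-radio to be skipped in a local repository, assuming the per-image folders live at the repository root, you could run:

touch /data/wwt-feed/noao-m81-radio/skip.flag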

Stage 2: Data Processing

Once the data have been cached locally, the next step is to convert them into WWT formats. This is done with:

toasty pipeline-process-todos {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as described above.

This stage will process the images, potentially creating tile pyramids, and write the outputs into a directory out_todo of the work directory. As before, there will be one subdirectory inside this directory for each successfully processed image. The image cache directories will be moved from cache_todo to cache_done as they are successfully processed, allowing the pipeline to work its way through the data incrementally if any problems are encountered.

Each “out” subdirectory will contain at least two WTML files, both of which contain a folder with a single item corresponding to the processed image in question. The file index.wtml contains absolute URLs pointing to the eventual destination of the published data, while index_rel.wtml contains relative URLs. These files can be used, or modified as needed, to verify that each image was processed successfully.
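
As a sketch, the outputs for one successfully processed image might look like this, again with a hypothetical image identifier; tile data only appear if a tile pyramid was generated:

out_todo/
  noao-m81-radio/
    index.wtml      (absolute URLs, rooted at publish_url_prefix)
    index_rel.wtml  (relative URLs)
    (image and/or tile data)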

Stage 3: Data Publishing

After all the new images have been successfully processed and verified, the next step is to upload the processed data to the repository. This is done with:

toasty pipeline-publish-todos {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as described above.

As before, this stage will run through each image subdirectory in out_todo inside the work directory, moving each one to out_done once its image is fully uploaded. Once again, this allows incremental operation in the case of any problems.
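
For example, if an upload is interrupted partway through, re-running the same command should pick up where it left off, since only the images still remaining in out_todo are considered. This sketch reuses the hypothetical local repository from earlier:

toasty pipeline-publish-todos --local=/data/wwt-feed .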

Stage 4: Reindexing

After all of the new images are uploaded, the collection should be re-indexed. The command interface follows the same pattern as before:

toasty pipeline-reindex {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as described above.

Unlike the previous stages, this stage doesn’t particularly care about which images may have been processed or cached locally. It scans the data repository and builds a list of all available images, then writes an index.wtml file in the repository root that contains a reverse-chronological list of everything available. The contents of this file are obtained by reading the set of per-image index.wtml files and merging them all together.
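
Because this stage works purely from the repository contents, it can be invoked on its own, for instance to refresh the master index after the repository has been modified by hand. Reusing the hypothetical local repository path from earlier:

toasty pipeline-reindex --local=/data/wwt-feed .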