==============================================
The toasty Image-Processing Pipeline Framework
==============================================

The toasty_ package provides a series of “pipeline” commands that automate
the process of tiling a collection of images and publishing the resulting
data files in a format easily usable by the `AAS WorldWide Telescope`_.

.. _toasty: https://toasty.readthedocs.io/
.. _AAS WorldWide Telescope: http://worldwidetelescope.org/


Overview
========

The toasty_ image-processing pipeline is designed to run automatically and
incrementally. The basic framework is that there should be some central data
repository that contains the pipeline configuration and previously processed
data; in production circumstances, this repository is web-accessible such
that the data can be accessed by WWT clients. As much configuration as
possible is stored in the data repository, to promote reproducibility.

Pipeline operations then follow this general scheme:

1. Fetch a set of new images to process from a data source.
2. Process each image into WWT-friendly formats as needed, emitting
   appropriate metadata files.
3. Publish the processed image data to the data repository.
4. Rebuild the repository's indexes of available data sets.

The *fetch* stage can use stored configuration and data to ensure that only
new images are processed. The *processing* and *publishing* stages are
separated so that a human can review the processing outputs if so desired.
The *re-indexing* stage is logically separate and can be run independently
of the data ingest.

Each pipeline stage is implemented as a subcommand of the ``toasty``
command-line program.


Data Access
===========

Access to the central data repository is essential to pipeline operations.
When running the ``toasty`` pipeline commands, the access mechanism is
specified with command-line arguments. There are currently two categories of
data repository that can be accessed, each of which has its own arguments,
which must be given to every pipeline command.

Local
-----

The "repository" can be a directory on the local machine. This mode is
mostly intended for testing, but could also be useful if the pipeline is
being run on a web server. To use this data access mode, pass the
command-line argument:

.. code-block::

   --local=path/to/data

where the value of the argument is some directory name. Data and
configuration will be stored inside the directory.

Azure Blob Storage
------------------

The data can also be stored on an Azure Blob Storage server. To use this
data access mode, pass the following command-line arguments:

- ``--azure-conn-env=CONNSTR`` — this argument gives the *name of an
  environment variable* that contains an Azure Blob Storage connection
  string. Note that it is **not** the connection string itself, since it is
  easy to leak secrets from command-line arguments. In the above example,
  there should be an environment variable named ``CONNSTR`` whose value is
  the actual connection string.
- ``--azure-container=mydata`` — this is the name of the blob container that
  stores the data.
- ``--azure-path-prefix=myfeed`` — this is a Unix-style folder path under
  which data will be stored inside the container. This argument is optional;
  if it is not given, the data will be stored in the root of the container.
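For concreteness, the sketch below shows how these data-access arguments
might be passed to the ``toasty pipeline-fetch-inputs`` command described
later in this document. The connection string, container name, and path
prefix are placeholder values, not real credentials:

.. code-block:: shell

   # Local mode: the repository is a directory on this machine.
   toasty pipeline-fetch-inputs --local=path/to/data .

   # Azure mode: the connection string is read from the CONNSTR environment
   # variable, never passed directly on the command line.
   export CONNSTR='DefaultEndpointsProtocol=https;AccountName=...'
   toasty pipeline-fetch-inputs \
     --azure-conn-env=CONNSTR \
     --azure-container=mydata \
     --azure-path-prefix=myfeed \
     .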
Configuration
=============

Wherever your data are stored, the root of the repository should contain a
configuration file named ``toasty-pipeline-config.yaml``. As implied, this
file contains structured data in the `YAML <https://yaml.org/>`_ format. An
example is:

.. code-block:: yaml

   source_type: astropix
   publish_url_prefix: //wwtfiles.blob.core.windows.net/feeds/nrao/
   folder_name: NRAO Studies
   folder_thumbnail_url: //www.worldwidetelescope.org/wwtweb/thumbnail.aspx?name=radiostudies

   astropix:
     json_query_url: https://astropix.ipac.caltech.edu/link/c60?format=json

The top-level settings are:

- ``source_type`` specifies the source of imagery to be obtained in the
  fetch stage. Values are documented below.
- ``publish_url_prefix`` specifies the URL prefix below which the data
  produced by the pipeline will ultimately become publicly accessible. This
  setting needs to be provided in order to write the correct data access
  URLs into the WTML files generated by the pipeline.
- ``folder_name`` specifies the user-facing name that will be given to the
  folder of image data emitted by the pipeline.
- ``folder_thumbnail_url`` gives a URL for the thumbnail image to be
  associated with the folder.

AstroPix Data Source
--------------------

Currently, the only allowed ``source_type`` is ``astropix``, which downloads
and parses an imagery feed from the `AstroPix
<https://astropix.ipac.caltech.edu/>`_ service.

When using the ``astropix`` data source, the ``toasty-pipeline-config.yaml``
file should contain a dictionary named ``astropix``, as in the example
above. This dictionary should contain one key, ``json_query_url``, giving
the URL of a saved AstroPix search in the JSON output format. When fetching
new data, the pipeline will download JSON from this URL and parse it to
index the available imagery.


Stage 1: Data Fetching
======================

The first step of the pipeline is to download new imagery for local
processing. The command-line invocation is:

.. code-block::

   toasty pipeline-fetch-inputs {data-args} {work-dir}

where ``{data-args}`` should be replaced with the correct data-access
arguments and ``{work-dir}`` is a path to a local directory that will store
pipeline data. The current directory (``.``) is a fine choice.

The data will be downloaded into a subdirectory ``cache_todo`` of the work
directory. Within this directory, there will be one subdirectory for each
image to process. Images that have already been processed, as determined by
checking for an ``index.wtml`` file in the data repository inside a folder
named with each image’s unique ID, will be skipped. You can force an image
to be skipped by creating a file named ``skip.flag`` in the subfolder where
its ``index.wtml`` would go.
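As a concrete sketch, suppose the repository is a local directory named
``feeddata`` and the feed includes an image whose hypothetical unique ID is
``exampleimage123``. Assuming the per-image folders sit directly under the
repository root, a fetch followed by a forced skip might look like:

.. code-block:: shell

   # Fetch any new imagery into ./cache_todo/ (local repository mode).
   toasty pipeline-fetch-inputs --local=feeddata .

   # Force this image to be skipped on subsequent fetches by creating a
   # skip.flag file where its index.wtml would otherwise be written.
   mkdir -p feeddata/exampleimage123
   touch feeddata/exampleimage123/skip.flag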
Stage 2: Data Processing
========================

Once the data have been cached locally, the next step is to convert them
into WWT formats. This is done with:

.. code-block::

   toasty pipeline-process-todos {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as
described above.

This stage will process the images, potentially creating tile pyramids, into
a directory ``out_todo`` of the work directory. As before, there will be one
subdirectory inside this directory for each successfully processed image.
The image cache directories will be moved from ``cache_todo`` to
``cache_done`` as they are successfully processed, allowing the pipeline to
work its way through the data incrementally if any problems are encountered.

Each "out" subdirectory will contain at least two WTML files, both of which
contain a folder with a single item corresponding to the processed image in
question. The file ``index.wtml`` contains absolute URLs pointing to the
eventual destination of the published data, while ``index_rel.wtml``
contains relative URLs. These files can be used or modified to verify that
each image was processed successfully.


Stage 3: Data Publishing
========================

After all the new images have been successfully processed and verified, the
next step is to upload the processed data to the repository. This is done
with:

.. code-block::

   toasty pipeline-publish-todos {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as
described above. As before, this will run through each image subdirectory in
``out_todo`` inside the work directory, moving each to ``out_done`` when the
image is fully uploaded. Once again, this allows incremental operation in
the case of any problems.


Stage 4: Reindexing
===================

After all of the new images are uploaded, the collection should be
re-indexed. The command interface follows the same pattern as before:

.. code-block::

   toasty pipeline-reindex {data-args} {work-dir}

where the braced parameters should be replaced with task-specific values as
described above.

Unlike the previous stages, this stage doesn't particularly care about which
images may have been processed or cached locally. It scans the data
repository and builds a list of *all* available images, then writes an
``index.wtml`` file in the repository root that contains a
reverse-chronological list of everything available. The contents of this
file are obtained by reading the set of per-image ``index.wtml`` files and
synthesizing them into a single listing.
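Putting the stages together, a complete incremental run might look like the
following sketch. The data-access arguments reuse the placeholder Azure
values from the data-access section, and the current directory serves as the
work directory:

.. code-block:: shell

   # Placeholder data-access arguments; substitute your own values.
   DATA_ARGS="--azure-conn-env=CONNSTR --azure-container=mydata --azure-path-prefix=myfeed"

   toasty pipeline-fetch-inputs  $DATA_ARGS .  # stage 1: download into cache_todo/
   # ... review cache_todo/, adding skip.flag files if desired ...
   toasty pipeline-process-todos $DATA_ARGS .  # stage 2: tile into out_todo/
   # ... inspect the index.wtml / index_rel.wtml outputs in out_todo/ ...
   toasty pipeline-publish-todos $DATA_ARGS .  # stage 3: upload, moving to out_done/
   toasty pipeline-reindex       $DATA_ARGS .  # stage 4: rebuild the root index.wtml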