Workflow provenance-based scheduling

Scientific computing workflows have become increasingly complex, often comprising numerous interdependent tasks executed on distributed computing resources. Provenance data, the recorded history of computational processes, provide a vital link between data reproducibility and task scheduling. Workflows with recorded data provenance can integrate with separate workflow management systems without requiring inter-system communication. In this talk, we introduce a novel tool for provenance-based workflow scheduling. Our approach uses an abstract graph builder to create abstract graphs that represent the high-level structure of a workflow. These abstract graphs emphasize dependencies and data flows, making the computational process easier to understand. In parallel, we extract concrete graphs, which reflect the actual execution history, from workflow provenance data recorded with DataLad. The core of our approach lies in comparing the abstract graph with the concrete graphs produced by separate runs of the workflow for a given set of input parameters. By computing the difference, we can pinpoint tasks that remain unexecuted or that require re-execution because of errors or changes in input data, and schedule these tasks automatically. We will also outline future directions for this research, including potential extensions to support system-agnostic scheduling and scalability considerations.
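The comparison step can be illustrated with a minimal Python sketch. It is not the tool presented in the talk: the dictionary layout, the `tasks_to_schedule` function, and the `ok` and `input_hash` fields are hypothetical, and extracting the concrete graph from DataLad provenance records is not shown. The sketch assumes the abstract graph maps each task to its upstream tasks and the concrete graph holds one record per task found in the provenance.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def tasks_to_schedule(abstract, concrete, current_inputs):
    """Return the tasks to (re)run, in dependency order.

    abstract:       task -> set of upstream tasks (the planned workflow)
    concrete:       task -> {"ok": bool, "input_hash": str}, one record per
                    task found in the provenance (hypothetical layout)
    current_inputs: task -> hash of the inputs the task would receive now
    """
    stale = set()
    order = []
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(abstract).static_order():
        record = concrete.get(task)
        needs_run = (
            record is None                                      # never executed
            or not record["ok"]                                 # failed run
            or record["input_hash"] != current_inputs.get(task) # inputs changed
            or bool(abstract.get(task, set()) & stale)          # upstream reruns
        )
        if needs_run:
            stale.add(task)
            order.append(task)
    return order


abstract = {
    "preprocess": set(),
    "fit": {"preprocess"},
    "plot": {"fit"},
    "report": {"fit", "plot"},
}
concrete = {
    "preprocess": {"ok": True, "input_hash": "a1"},
    "fit": {"ok": False, "input_hash": "b2"},   # recorded as failed
    "plot": {"ok": True, "input_hash": "c3"},
}                                               # "report" never ran
current_inputs = {"preprocess": "a1", "fit": "b2", "plot": "c3", "report": "d4"}

print(tasks_to_schedule(abstract, concrete, current_inputs))
# ['fit', 'plot', 'report']
```

In this sketch a task is also rescheduled when any of its upstream tasks is rescheduled, which is one possible policy; a real scheduler might instead compare output hashes before invalidating downstream tasks.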
