Important dates

  • Registration opens: September 14th
  • Registration and submission closes: midnight October 22nd
  • Registration and submission feedback: November 1st
  • Late registration opens, with hackathon waiting list: November 6th

At a glance

Event schedule (preliminary)

You can access the (preliminary) raw schedule data as Pentabarf XML.

For viewing the schedule on mobile, we recommend the Giggity app (F-Droid, Google Play) – enter the link to the schedule file inside the app.
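
If you prefer to script against the schedule rather than use an app, the Pentabarf XML file can be parsed with a few lines of Python. The sketch below is illustrative only: the URL is a placeholder for the actual schedule link, and the element names follow the usual Pentabarf layout (day, event, start, duration, title), so adjust them if the published file differs.

    # Minimal sketch: print all scheduled events from a Pentabarf XML file.
    # The URL is a placeholder -- substitute the actual schedule link from this page.
    import urllib.request
    import xml.etree.ElementTree as ET

    SCHEDULE_URL = "https://example.org/distribits/schedule.xml"  # placeholder

    with urllib.request.urlopen(SCHEDULE_URL) as response:
        root = ET.parse(response).getroot()

    for day in root.iter("day"):
        for event in day.iter("event"):
            start = event.findtext("start", default="?")
            duration = event.findtext("duration", default="?")
            title = event.findtext("title", default="(untitled)")
            print(f"{day.get('date')} {start} ({duration})  {title}")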

2024-04-04

Coffee

08:30 (00:30) Foyer

Welcome and overview

09:00 (00:30) Event Hall

Welcome

"what's in the Datalad sandwich" AKA DataLad "ecosystem"

09:30 (00:20) Event Hall

Yaroslav Halchenko

Overview of the core and extensions, CIs, testing, building (git-annex), and log archival (con/tinuous).

"git annex is complete, right?"

09:50 (00:20) Event Hall

Joey Hess

My father has asked me this question more than once over the years. So has an experienced developer, recently. Seeing the same question from two such different perspectives got me asking it of myself. While a new data storage system can always be added to git-annex, or a new command added to improve some use case, both of those can also be accomplished without needing changes to git-annex, by external remotes and more targeted frontends such as Datalad. So what, then, is the potential surface area of problem space that git-annex may expand to cover? Do diminishing returns and complexity make such expansions a good idea? I will explore this by considering recent developments in git-annex, and the impact of lesser-used features.

Git annex recipes

10:10 (00:20) Event Hall

Timothy Sanders

I have come up with many recipes over the years for scaling git-annex repositories in terms of large numbers of keys, large file sizes, and transfer efficiency. I have working examples that I use internally and can demonstrate.

(1) Second-order keys: using metadata to describe keys that can be derived from other keys. I primarily used this to help with the problem of too many keys referencing small files. This builds off of the work of others, but I believe I have made useful improvements, and I would like to polish it up and share it. One very early example is here: https://github.com/unqueued/repo.macintoshgarden.org-fileset/ For now, I stripped out all but the location data from the git-annex branch. Files smaller than 50M are contained in second-order keys (8b0b0af8-5e76-449c-b0ae-766fcac0bc58). The other UUIDs are for standard backends, including a Google Drive account which has very strict limits on requests, and it would have been very difficult to process over 10k keys directly. There are also other cases where keys can be reliably reproduced from other keys.

(2) Differential storage with git-annex using bup (or borg). I built off of posts on the forums from years ago and came up with some really useful workflows for combining the benefits of git-annex location tracking and branching with differential compression. I have scripts used for automation, and some example repos and case studies. For example, I have a repo which contains file indexes that are over 60 GiB but only consume about 6 GiB, using bup packfiles. I can benefit from differential compression over different time ranges, like per year or for the entire history, while minimizing storage usage. I will publish a working example in the next few weeks, but I have only used it internally for years.
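
As a rough, hypothetical illustration of the second-order key idea described in this abstract (not the speaker's actual tooling), git-annex metadata can be used to record which archive key a small file is derivable from. The field name below is invented for the example, but both git-annex commands used are stock commands.

    # Hypothetical sketch: record, as git-annex metadata, which archive key a small
    # file can be re-derived from. Uses two stock git-annex commands:
    #   git annex lookupkey <file>          -> print the key a file points at
    #   git annex metadata <file> --set ... -> attach metadata to that file's key
    # The 'derived-from' field name is invented for this illustration.
    import subprocess

    def annex(*args: str) -> str:
        """Run a git-annex subcommand in the current repository and return stdout."""
        result = subprocess.run(
            ["git", "annex", *args], check=True, capture_output=True, text=True
        )
        return result.stdout.strip()

    def record_derived_from(small_file: str, archive_file: str) -> None:
        """Note that small_file can be reconstructed from archive_file's key."""
        parent_key = annex("lookupkey", archive_file)
        annex("metadata", small_file, "--set", f"derived-from={parent_key}")

    # Example usage (paths are placeholders):
    # record_derived_from("extracted/app.bin", "archives/fileset.7z")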

Datalad-annex

10:30 (00:20) Event Hall

Michael Hanke

A talk on datalad-annex, a Git remote helper for depositing a repository in any place reachable by a git-annex special remote implementation.
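
For readers unfamiliar with the helper: once the datalad-next extension is installed, a datalad-annex:: URL can be used with ordinary git push and git clone. The sketch below uses a 'directory' special remote purely as an example; treat the exact query parameters as assumptions and check them against the datalad-annex documentation.

    # Sketch of using the datalad-annex Git remote helper (provided by the
    # datalad-next extension). The store path and query parameters are example
    # assumptions -- consult the datalad-annex docs for the options of the
    # special remote type you actually use.
    import subprocess

    STORE_URL = "datalad-annex::?type=directory&directory=/tmp/repo-store&encryption=none"

    # Deposit the current repository in the store ...
    subprocess.run(["git", "remote", "add", "store", STORE_URL], check=True)
    subprocess.run(["git", "push", "store", "--all"], check=True)

    # ... and clone it back from the same location somewhere else.
    subprocess.run(["git", "clone", STORE_URL, "/tmp/repo-copy"], check=True)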

OpenNeuro and DataLad

10:50 (00:20) Event Hall

Nell Hardcastle

A history of OpenNeuro's adoption of DataLad and the evolution of DataLad and git-annex support on the platform. In 2017, OpenNeuro was preparing to launch with its original data backend implemented as block storage without git-annex. The decision was made to move OpenNeuro to DataLad, and a quick prototype of this backend service was created and eventually brought to production for the public release of OpenNeuro. Since 2017, the platform has evolved to support many of the unique advantages of DataLad datasets. This talk discusses the architecture of OpenNeuro, some of the challenges encountered using git-annex as the center of our application’s data model in cloud environments, the solutions developed, and future work to improve upon OpenNeuro’s archival and distribution of DataLad datasets.

Questions and panel discussion

11:10 (00:50) Event Hall

Questions and panel discussion

Lunch (outside venue)

12:00 (01:30) Foyer

GIN-Tonic: (trying to) making datalad accessible to non-coders

13:50 (00:20) Event Hall

Julien Colomb

With the GIN-Tonic suite, we work toward a workflow where users only need to use the browser and some synchronisation scripts to manage their DataLad repositories. Settings are prepared in a template (including submodules), and a browser extension (written in Go) creates new repositories from that template. I will report on successes and failures of this strategy.

Onedata as a Platform: Distributed Repository for Virtualizing Data and Long-term Preservation

14:10 (00:20) Event Hall

Łukasz Dutka

With the proliferation of digital data, reliable storage, easy accessibility, and long-term preservation have become paramount. Onedata, a novel platform, emerges as a solution to these challenges by enabling a distributed repository framework for virtualizing data. This presentation delves into how Onedata facilitates seamless data management and ensures long-term preservation. By virtualizing data, Onedata abstracts the underlying storage infrastructures, enabling a unified view and easy sharing among different stakeholders. Furthermore, its distributed repository nature significantly enhances data durability and availability. The built-in mechanisms for metadata management and data replication ensure that the information remains intact and accessible over extended periods. Through a detailed exploration of its architecture and functionalities, this presentation will highlight how Onedata can be a robust platform for modern data management and long-term preservation needs, catering to academia, industry, and beyond. The insights provided will foster a better understanding of leveraging distributed repository platforms in navigating the complex landscape of digital data preservation.

Questions and panel discussion

14:30 (00:30) Event Hall

Questions and panel discussion

Coffee

15:00 (00:30) Foyer

Workflow provenance–based scheduling

15:30 (00:20) Event Hall

Pedro Martinez

Scientific computing workflows have become increasingly complex, often comprising numerous interdependent tasks executed on distributed computing resources. Provenance data, or the history of computational processes, provide a vital link between data reproducibility and task scheduling. Workflows with recorded data provenance can seamlessly integrate with separate workflow management systems, eliminating the need for inter-system communication. In this talk, we introduce a novel tool to perform provenance-based workflow scheduling. Our approach leverages an abstract graph builder tool designed to create abstract graphs representing the high-level structure of workflows. These abstract graphs emphasize dependencies and data flows, facilitating a better understanding of the computational process. Concurrently, we extract concrete graphs from workflow provenance data recorded with DataLad that reflect the actual execution history. The core of our approach lies in comparing the abstract graph to concrete graphs produced by separate runs of the workflow for a set of input parameters. By computing the difference, we can pinpoint tasks that remain unexecuted or require re-execution due to errors or changes in input data, and automatically schedule these tasks. We will outline future directions for this research, including potential extensions to support system-agnostic scheduling, and scalability considerations.
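
To make the graph-comparison step concrete, here is a toy sketch of the general idea; the names and data structures are invented for illustration and are not taken from the authors' tool. A task is (re-)scheduled when it has no provenance record, its recorded inputs differ from the current ones, or an upstream task is itself stale.

    # Toy sketch of provenance-based scheduling: compare an abstract task graph
    # (what should run) against execution records recovered from provenance
    # (what did run, and with which inputs), and schedule whatever is missing
    # or outdated. All names and digests below are illustrative.
    from graphlib import TopologicalSorter

    # task -> set of upstream tasks it depends on (the abstract graph)
    abstract_graph = {
        "preprocess": set(),
        "fit_model": {"preprocess"},
        "report": {"fit_model"},
    }

    # task -> digest of the inputs it was last executed with (from provenance)
    concrete_runs = {"preprocess": "sha1:aaa", "fit_model": "sha1:bbb"}

    # task -> digest of the inputs it would be executed with now
    current_inputs = {"preprocess": "sha1:aaa", "fit_model": "sha1:ccc", "report": "sha1:ddd"}

    def tasks_to_schedule(graph, runs, inputs):
        stale = set()
        for task in TopologicalSorter(graph).static_order():
            outdated = runs.get(task) != inputs[task]
            upstream_stale = any(dep in stale for dep in graph[task])
            if outdated or upstream_stale:
                stale.add(task)
        return stale

    print(tasks_to_schedule(abstract_graph, concrete_runs, current_inputs))
    # -> {'fit_model', 'report'}  (preprocess is up to date)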

Optimisation in Network Engineering: Challenges and Solutions in Research Data Management

15:50 (00:10) Event Hall

Julius Breuer

In the complex realm of network engineering design, optimisation methods have been instrumental, using a range of components across different systems and scenarios. However, this complexity presents a threefold challenge: first, managing, tracking, and combining thousands of optimisation calculations, including the specifics of component data, system classifications, scenarios considered, and settings applied; second, integrating diverse data from multiple sources that do not all reside in one place; third, enabling collaboration (in this case with students, potentially with more people). Such challenges emphasise the need for rigorous research data management. Questions such as “which component data was used in which system?” or the provenance of component data come to the fore. To answer these questions, DataLad is used to store disparate data, models, settings, and results in an effective and distributed manner. DataLad's provenance tracking reduces the redundancy of storage and the effort required for publication, while increasing confidence in the results. This is done in the context of a research project, but the same questions arise for the industrial application of what has been researched.

fMRI Pipelines on HPC with DataLad and ReproMan

16:00 (00:10) Event Hall

Joe Wechsler

In this lightning talk, I will share my experience using DataLad, git-annex and ReproMan to run software pipelines on hundreds of fMRI datasets on an HPC cluster. Potential topics may include: (a) The use of ReproMan to avoid the difficulties of using datalad containers-run in parallel on an HPC. (b) How to use DataLad on a scratch filesystem that periodically purges files to save space. (c) A simple algorithm I implemented in ReproMan to prevent excess runtime due to outliers in parallel jobs. (d) The pros and cons of the YODA-BIDS layout for neuroimaging data. I hope my talk will prompt discussion with those hoping to learn more from my experience as well as those who have found alternative solutions to similar challenges.

Reproducibility vs. computational efficiency on HPC systems

16:10 (00:10) Event Hall

Felix Hoffstaedter

Questions and panel discussion

16:20 (00:40) Event Hall

Questions and panel discussion

Dinner and social (outside venue)

17:00 (03:00) Foyer


2024-04-05

Coffee

08:30 (00:30) Foyer

Coffee time!

Welcome and overview

09:00 (00:15) Event Hall

Welcome and overview: day 2.

Neuroscientific data management using DataLad

09:15 (00:20) Event Hall

Julian Kosciessa

Robust data management from raw data to result publication is necessary to make scientific research more widely reusable. This remains a challenge, particularly in projects that involve a variety of subcomponents and large data. In this talk, I provide a user perspective on using DataLad procedures for structuring, managing, and sharing complex cognitive neuroscience projects. By showcasing example multimodal neuroimaging projects that include, e.g., electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and behavioral data, I will highlight workflows that are uniquely enabled by the distributed nature of DataLad. Based on my experiences, I will also point out what I perceive as remaining roadblocks to widespread adoption.

Staying in Control of your Scientific Data with Git Annex

09:35 (00:20) Event Hall

Yann Büchau

Scientific experiments can produce a lot of data, often very different in kind and scattered across devices and even remote locations. Keeping all of this in check is not a simple task, and failure to do so can easily cause data loss due to accidental deletion or hardware failure (think cheap SD cards in measurement devices at remote locations). Git Annex can help with the synchronisation, cataloguing, versioning, and archival of data, as well as with collaboration.

Questions and panel discussion

09:55 (00:20) Event Hall

Questions and panel discussion

Fusion of Multiple Open Source Applications for Data Management Workflows in Psychology and Neuroscience

10:15 (00:20) Event Hall

Julia-Katharina Pfarr

Finding a compromise between researchers’ needs, their skills in data management, data access restrictions, and limited funding for RDM is a complex but highly relevant and timely challenge. At the University of Marburg, this challenge is taken on by the team of the “Data Hub”. The team consists of several people with different responsibilities, backgrounds, and affiliations, such as project management staff, scientific staff, data stewards, data scientists, and technical administrative staff, located in Marburg and Gießen. The Data Hub is funded by The Adaptive Mind (TAM) and supported by the information infrastructure project (NOWA) of the SFB135, which are consortia in the fields of psychology and neuroscience, with over 50 involved PIs, based in several locations in the federal state of Hesse, Germany.

Although the research data in the two consortia are restricted to the fields of psychology and neuroscience, a major challenge is the need to harmonize heterogeneous data. The data encompass research data from different modalities such as behavior, eye tracking, EEG, and neuroimaging, as well as code for experiments and analysis in various programming languages. Therefore, the data management workflow needs to be applicable to heterogeneous in- and output data, different project sizes, and different numbers of researchers involved. Furthermore, tools need to be able to integrate those heterogeneous data by utilizing a harmonizing standard in the field (here: BIDS). To increase the reproducibility of research findings, an integration of version control and provenance tracking (here: DataLad) should be available. For this, the team must understand at which point to include the researchers: How much background knowledge about software do they have, and how much do they really need? Which functions of the software are necessary, and which ones can be skipped because they’ll never apply to the researchers’ work? Do they need a lot of hands-on practice, or is the concept enough?

In our presentation, we will first introduce the Data Hub of the University of Marburg and its technical architecture. We will then present the data management tools utilized in the Data Hub, i.e., DataLad, GIN, GitLab, JupyterHub, and BIDS. We will specifically focus on how these tools are interconnected, i.e., the research data management workflow of the Data Hub. Then, we will outline the challenges for both the researchers and the Data Stewards regarding training, support, and maintenance of the services.

Balancing Efficiency and Standardization for a Microscopic Image Repository on an HPC System

10:35 (00:20) Event Hall

Julia Thönnißen

Understanding the human brain is one of the greatest challenges of modern science. In order to study its complex structural and functional organization, data from different modalities and resolutions must be linked together. This requires scalable and reproducible workflows, ranging from the extraction of multimodal data from different repositories to AI-driven analysis and visualization [1]. One fundamental challenge therein is to store and organize big image datasets in appropriate repositories. Here we address the case of building a repository of high-resolution microscopy scans of whole human brain sections on the order of multiple petabytes [1]. Since data duplication is prohibitive for such volumes, images need to be stored in a way that follows community standards, supports provenance tracking, and meets the performance requirements of high-throughput ingestion, highly parallel processing on HPC systems, as well as ad-hoc random access for interactive visualization.

To digitize an entire human brain, high-throughput scanners need to capture over 7000 histological brain sections. During this process, a scanner acquires a z-stack, which consists of 30 TIFF images per tissue section, each representing a different focus level. The images are automatically transferred from the scanner to a gateway server, where they are pre-organised into subfolders per brain section for detailed automated quality control (QC). Once a z-stack passes QC, it is transferred to the parallel file system (GPFS) on the supercomputer via NFS mount. For one human brain, this results in 7000 folders with about 2 PByte of image data in about 20K files in total.

From there, the data are accessed simultaneously by different applications and pipelines with very heterogeneous requirements. HPC analyses based on deep learning, such as cell segmentation or brain mapping, rely on fast random access and parallel I/O to stream image patches efficiently to GPUs. Remote visualization and annotation, on the other hand, require exposure of the data through an HTTP service on a VM, with access to higher-capacity storage to serve different data at the same time. These demands can be covered by multi-tier HPC storage, which provides dedicated partitions: the High Performance Storage Tier offers low latency and high bandwidth for analysis, while the Extended Capacity Storage Tier is capacity-optimized with a lower latency, meeting the needs for visualization. Exposing the data on different tiers requires controlled staging and unstaging. We organize the image data folders via DataLad datasets, which allows well-defined staging across these partitions for different applications, ensures that all data is tracked and versioned from distributed storage throughout the workflow, and enables provenance tracking. To reduce the number of files in one DataLad repository, each section folder has been designed as a subdataset of a superdataset that contains all section folders.

The current approach to managing data has two deficiencies. Firstly, the TIFF format is not optimized for HPC usage due to the lack of parallel I/O support, resulting in data duplication due to conversion to HDF5. Secondly, the current data organization is not compatible with upcoming community standards, complicating collaborative efforts. Therefore, standardization of the file format and folder structure is a major objective for the near future. The widely accepted community standard for organizing neuroscience data is the Brain Imaging Data Structure (BIDS). Its extension for microscopy proposes splitting the data into subjects and samples, while using either (OME-)TIFF or OME-ZARR as the file format. In particular, the NGFF file format OME-ZARR appears to be the suitable choice for the workflow described, as it is more performant on HPC and cloud-compatible, as opposed to TIFF.

However, restructuring the current data layout is a complex task. Adopting the BIDS standard results in large numbers of inodes and files, because (1) multiple folders and sidecar files are created and (2) OME-ZARR files are composed of many small files. The DataLad annex grows as the number of files increases, leading to high inode usage and reduced performance. An effective solution to this problem may involve optimizing the size of DataLad subdatasets. However, the key consideration is that GPFS file systems enforce a limit on the number of inodes, which cannot be surpassed. This raises the following questions: How can the usage of inodes be minimized while adhering to BIDS and utilizing DataLad? Should performant file formats with minimal inode usage, such as ZARR v3 or HDF5, be incorporated into the BIDS standard? What is a good balance for DataLad subdataset sizes? Discussions with the community may provide valuable perspectives for advancing this issue.

[1] Amunts K, Lippert T. Brain research challenges supercomputing. Science 374, 1054-1055 (2021). DOI:10.1126/science.abl8519
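
The superdataset/subdataset layout described above can be expressed with DataLad's Python API. The following is a minimal sketch with placeholder paths and a placeholder section count, not the project's actual tooling.

    # Minimal sketch: one superdataset for the whole brain, one subdataset per
    # histological section folder. Paths and counts are placeholders.
    import datalad.api as dl

    SUPER = "/data/brain-repo"   # placeholder superdataset location
    N_SECTIONS = 7000            # one folder per section, as in the abstract

    dl.create(SUPER)             # the superdataset tracking all sections

    for i in range(1, N_SECTIONS + 1):
        # Registering each section as its own subdataset keeps the file count
        # (and the git-annex branch) of every individual dataset bounded.
        dl.create(f"{SUPER}/section-{i:04d}", dataset=SUPER)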

DataLad-Registry: A Registry of DataLad Datasets with Metadata

10:55 (00:20) Event Hall

Isaac To

A talk on datalad-registry, a registry of DataLad datasets with extracted metadata.

Questions and panel discussion

11:15 (00:45) Event Hall

Questions and panel discussion

Lunch (outside venue)

12:00 (01:30) Foyer

Reproducible and replicable data science in the presence of patient confidentiality concerns by utilizing git-annex and the Data Science Orchestrator

13:50 (00:40) Event Hall

Markus Katharina Brechtel, Philipp Kaluza

Health-related patient data is among the most sensitive data when it comes to data privacy concerns. Data science projects in the medical domain must thus pass a very high bar before data researchers are given access to potentially personally identifiable data, or to pseudonymized patient data that carries an inherent risk of depseudonymization. In the project “Data Science Orchestrator”, we propose an organizational framework for ethically chaperoning and risk-managing such projects while they are underway, and a software stack that helps with this task. At the same time, this software stack will provide an audit trail across the project that is verifiable even by external scientists without access to the raw data, while keeping the option of future reproducibility and replicability studies open. This is achieved by utilizing git-annex and DataLad in a novel way to provide partial data blinding. Because collecting study-relevant data is often a time- and labor-intensive undertaking in the medical domain, many projects are carried out by associations that span multiple hospitals, administrative domains, and often even multiple states. Therefore, the “Data Science Orchestrator” project also implements distributed data science computations, which make it possible to honor these existing administrative boundaries by means of a federated access model, all while keeping the most sensitive data in-house and exclusively in a tightly controlled computation environment. This work was sponsored by the Deutsche Zentren für Gesundheitsforschung (DZG) and BMBF.

TBA

14:30 (00:20) Event Hall

TBA

TBA

14:50 (00:20) Event Hall

TBA

Questions and panel discussion

15:10 (00:20) Event Hall

Questions and panel discussion

Coffee

15:30 (00:30) Foyer

Coffee time!

Unconference

16:00 (01:00) Event Hall

Unconference slot

Dinner and social (outside venue)

17:00 (01:00) Foyer

Dinner and social time (outside venue)


2024-04-06

Coffee

08:30 (00:30) Seminar room 1, 3rd floor

Coffee time!

Kick off / Pitches

09:00 (00:30) Seminar room 1, 3rd floor

Hacking

09:30 (02:30) Seminar room 1, 3rd floor

Lunch (outside venue)

12:00 (01:30) Seminar room 1, 3rd floor

Hacking

13:30 (01:00) Seminar room 1, 3rd floor

Coffee

14:30 (00:15) Seminar room 1, 3rd floor

Coffee time!

Hacking

14:45 (01:45) Seminar room 1, 3rd floor

Wrap-up

16:30 (00:30) Seminar room 1, 3rd floor

Dinner and social (outside venue)

17:00 (02:00) Seminar room 1, 3rd floor

Dinner and social time (outside venue)