At a glance
- Conference
- Hackathon
Event schedule
The schedule shown below is also available as Pentabarf XML.
For viewing the schedule on mobile, we recommend the Giggity app (F-Droid, Google Play); enter the link to the XML file in the app.
2024-04-04
Arrival and registration
08:30 (00:30) Foyer
Welcome and overview
09:00 (00:30) Event Hall
Abstract
Welcome from the organizers
"What's in the DataLad sandwich" AKA DataLad "ecosystem"
09:30 (00:20) Event Hall
Yaroslav Halchenko
Abstract
At the heart of many innovative tools lies a simple spark of necessity. For DataLad, that spark was a father's quest in 2013 for an effortless way to access free children's cartoons and movies. What started as a way to scratch a personal itch has evolved into a grant-funded platform addressing a broad range of data logistics challenges. Utilizing the strengths of git and git-annex, DataLad has not only expanded its capabilities but has also contributed to the enhancement of git-annex features, tailor-made to suit its needs. Through the innovative use of git external protocols and git-annex external special remotes, DataLad offers a seamless experience to users, fetching data with remarkable flexibility. To push the boundaries further, DataLad introduced an "extensions mechanism," enabling the platform to adapt and extend beyond its core functionalities. This modular architecture, while offering unparalleled flexibility, hints at a potential for complexity and fragility. In this presentation, I will take you on a journey through the foundational elements that give DataLad its unique extensibility—spanning git, git-annex, and beyond—with a few practical examples that bring these concepts to life. Despite the inherent challenges of a modular system, our dedicated "dev-ops" components, which I will demonstrate, ensure a stable and efficient ecosystem. By developing, testing, and distributing these components, we've crafted not just a tool, but a robust platform ready to tackle the data logistics needs of today and tomorrow.
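The external special remote mechanism mentioned in this abstract is a simple line-based protocol spoken over stdin/stdout. The following is a minimal, illustrative Python sketch of a directory-backed remote, not anything shipped by git-annex or DataLad: the `demodir` name and storage path are made up, and a real remote needs far more careful error handling.

```python
#!/usr/bin/env python3
# Hypothetical example: save as `git-annex-remote-demodir` on PATH, then
#   git annex initremote demo type=external externaltype=demodir encryption=none
import os
import shutil
import sys

STORE = os.path.expanduser("~/.demodir-store")  # assumed storage location

def send(line):
    sys.stdout.write(line + "\n")
    sys.stdout.flush()

def main():
    send("VERSION 1")  # the remote speaks first
    for raw in sys.stdin:
        words = raw.strip().split(" ")
        cmd = words[0]
        if cmd in ("INITREMOTE", "PREPARE"):
            os.makedirs(STORE, exist_ok=True)
            send(cmd + "-SUCCESS")
        elif cmd == "EXTENSIONS":
            send("EXTENSIONS")  # no protocol extensions supported
        elif cmd == "TRANSFER":
            direction, key, path = words[1], words[2], " ".join(words[3:])
            try:
                if direction == "STORE":
                    shutil.copyfile(path, os.path.join(STORE, key))
                else:  # RETRIEVE
                    shutil.copyfile(os.path.join(STORE, key), path)
                send(f"TRANSFER-SUCCESS {direction} {key}")
            except OSError as err:
                send(f"TRANSFER-FAILURE {direction} {key} {err}")
        elif cmd == "CHECKPRESENT":
            key = words[1]
            status = "SUCCESS" if os.path.exists(os.path.join(STORE, key)) else "FAILURE"
            send(f"CHECKPRESENT-{status} {key}")
        elif cmd == "REMOVE":
            key = words[1]
            try:
                os.remove(os.path.join(STORE, key))
            except FileNotFoundError:
                pass  # removing absent content still counts as success
            send(f"REMOVE-SUCCESS {key}")
        else:
            send("UNSUPPORTED-REQUEST")

if __name__ == "__main__":
    main()
```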
"git annex is complete, right?"
09:50 (00:20) Event Hall
Joey Hess
Abstract
My father has asked me this question before over the years. So has an experienced developer recently. Seeing the same question from two such different perspectives got me asking it of myself. While a new data storage system can always be added to git-annex, or a new command be added to improve some use case, both of those can also be accomplished without needing changes to git-annex, by external remotes and more targeted frontends such as DataLad. So what then is the potential surface area of problem space that git-annex may expand to cover? Do diminishing returns and complexity make such expansions a good idea? I will explore this by considering recent developments in git-annex, and the impact of lesser-used features.
DataLad beyond Git, connecting to the rest of the world
10:10 (00:20) Event Hall
Michael Hanke
Abstract
DataLad has been built on Git and git-annex as foundational pillars. However, the vast majority of data infrastructures are not Git-aware. Git-annex can work with a much broader array of services, but the need to "keep the Git repo somewhere" imposes undesirable technical and procedural complexity on users. In this talk I illustrate existing means to take Git-based DataLad datasets to places that Git cannot reach on its own. Moreover, I introduce ongoing work that aims to enable DataLad users to consume non-DataLad resources as native DataLad datasets, and non-DataLad users to consume DataLad resources without DataLad, git-annex, or even Git.
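One existing means in this direction is publishing a dataset to a RIA store, which needs only a plain file system or SSH target rather than a Git-aware service. A minimal sketch using DataLad's Python API, with a hypothetical store location and sibling name:

```python
# Publish a dataset to a RIA store on a plain filesystem path
# (could equally be a ria+ssh:// URL); names are illustrative.
import datalad.api as dl

ds = dl.Dataset("my-dataset")
ds.create_sibling_ria("ria+file:///data/store", name="store",
                      new_store_ok=True)
# Push Git history and annexed file content to the store.
ds.push(to="store")
```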
Questions and panel discussion
10:30 (00:30) Event Hall
Abstract
Questions and panel discussion
Coffee
11:00 (00:30) Foyer
OpenNeuro and DataLad
11:30 (00:20) Event Hall
Nell Hardcastle
Abstract
A history of OpenNeuro's adoption of DataLad and the evolution of DataLad and git-annex support on the platform. In 2017 OpenNeuro was preparing to launch with the original data backend implemented as block storage without git-annex. The decision was made to move OpenNeuro to DataLad and a quick prototype for this backend service was created and eventually brought to production for the public release of OpenNeuro. Since 2017 the platform has evolved to support many of the unique advantages of DataLad datasets. This talk discusses the architecture of OpenNeuro, some of the challenges encountered using git-annex as the center of our application’s data model in cloud environments, solutions developed, and future work to improve upon OpenNeuro’s archival and distribution of DataLad datasets.
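To make the DataLad side of this concrete, here is a minimal sketch of consuming an OpenNeuro dataset via one of OpenNeuro's public GitHub mirrors; the accession number ds000001 and the file path are arbitrary examples, not part of the talk itself.

```python
import datalad.api as dl

# Cloning fetches only Git metadata; annexed content stays remote.
ds = dl.clone(source="https://github.com/OpenNeuroDatasets/ds000001.git",
              path="ds000001")
# File content is retrieved on demand.
ds.get("sub-01/anat/sub-01_T1w.nii.gz")
```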
Balancing Efficiency and Standardization for a Microscopic Image Repository on an HPC System
11:50 (00:20) Event Hall
Julia Thönnißen
Abstract
Understanding the human brain is one of the greatest challenges of modern science. In order to study its complex structural and functional organization, data from different modalities and resolutions must be linked together. This requires scalable and reproducible workflows, ranging from the extraction of multimodal data from different repositories to AI-driven analysis and visualization [1]. One fundamental challenge therein is to store and organize big image datasets in appropriate repositories. Here we address the case of building a repository of high-resolution microscopy scans of whole human brain sections, on the order of multiple petabytes [1]. Since data duplication is prohibitive at such volumes, images need to be stored in a way that follows community standards, supports provenance tracking, and meets the performance requirements of high-throughput ingestion, highly parallel processing on HPC systems, and ad-hoc random access for interactive visualization.

To digitize an entire human brain, high-throughput scanners need to capture over 7000 histological brain sections. During this process, a scanner acquires a z-stack of 30 TIFF images per tissue section, each representing a different focus level. The images are automatically transferred from the scanner to a gateway server, where they are pre-organised into subfolders per brain section for detailed automated quality control (QC). Once a z-stack passes QC, it is transferred to the parallel file system (GPFS) on the supercomputer via an NFS mount. For one human brain, this results in 7000 folders with about 2 PB of image data in about 20K files in total.

From there, the data are accessed simultaneously by different applications and pipelines with very heterogeneous requirements. HPC analyses based on deep learning, such as cell segmentation or brain mapping, rely on fast random access and parallel I/O to stream image patches efficiently to GPUs. Remote visualization and annotation, on the other hand, require exposing the data through an HTTP service on a VM, with access to higher-capacity storage to serve different data at the same time. These demands can be covered by multi-tier HPC storage, which provides dedicated partitions: the High Performance Storage Tier offers low latency and high bandwidth for analysis, while the Extended Capacity Storage Tier is capacity-optimized, with higher latency, meeting the needs of visualization.

Exposing the data on different tiers requires controlled staging and unstaging. We organize the image data folders via DataLad datasets, which allows well-defined staging across these partitions for different applications, ensures that all data is tracked and versioned across distributed storage throughout the workflow, and enables provenance tracking. To reduce the number of files in any one DataLad repository, each section folder is designed as a subdataset of a superdataset that contains all section folders.

The current approach to managing the data has two deficiencies. First, the TIFF format is not optimized for HPC usage due to its lack of parallel I/O support, resulting in data duplication through conversion to HDF5. Second, the current data organization is not compatible with upcoming community standards, complicating collaborative efforts. Therefore, standardizing the file format and folder structure is a major objective for the near future. The widely accepted community standard for organizing neuroscience data is the Brain Imaging Data Structure (BIDS). Its extension for microscopy proposes splitting the data into subjects and samples, while using either (OME-)TIFF or OME-ZARR as the file format. In particular, the NGFF file format OME-ZARR appears to be the suitable choice for the workflow described, as it is more performant on HPC and cloud-compatible, as opposed to TIFF.

However, restructuring the current data layout is a complex task. Adopting the BIDS standard results in large numbers of inodes and files, because (1) multiple folders and sidecar files are created and (2) OME-ZARR files are composed of many small files. The git-annex object store grows with the number of files, leading to high inode usage and reduced performance. An effective solution to this problem may involve optimizing the size of DataLad subdatasets. The key consideration, however, is that GPFS file systems enforce a limit on the number of inodes, which cannot be surpassed. This raises the following questions: How can inode usage be minimized while adhering to BIDS and utilizing DataLad? Should performant file formats with minimal inode usage, such as ZARR v3 or HDF5, be incorporated into the BIDS standard? What is a good balance for DataLad subdataset sizes? Discussions with the community may provide valuable perspectives for advancing this issue.

[1] Amunts K, Lippert T. Brain research challenges supercomputing. Science 374, 1054-1055 (2021). DOI: 10.1126/science.abl8519
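A minimal sketch of the nesting described above, using DataLad's Python API: one subdataset per brain section inside a superdataset, keeping each repository's file count manageable. Paths and the section count are illustrative (7000 in production).

```python
import datalad.api as dl

superds = dl.create("brain-repo")
for section in (1, 2, 3):
    # dataset=superds registers the new dataset as a subdataset
    # of the superdataset.
    dl.create(f"brain-repo/section-{section:04d}", dataset=superds)
```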
Questions and panel discussion
12:10 (00:20) Event Hall
Abstract
Questions and panel discussion
Lunch (self-organized, outside venue)
12:30 (01:30) Foyer
NeuroBagel and NiPoppy for a neuro-federation
14:00 (00:20) Event Hall
JB Poline
Abstract
We present an ecosystem consisting of NeuroBagel, a distributed and scalable approach based on semantic web technologies for harmonizing and sharing phenotypic and neuroimaging variables with a DataLad backend, and NiPoppy, a specification for MRI processing that integrates derived data and curation information. We used NeuroBagel tools to harmonize the OpenNeuro MRI data as well as several Parkinson datasets (Quebec Parkinson Network, Parkinson Progression Marker Initiative, etc.) and will demonstrate how new neuroimaging cohorts can be defined from several distributed open or closed datasets. We will show how NiPoppy, extending BIDS, could help standardize the management and monitoring of neuroimaging data processing. We hope that the proposed distributed ecosystem will foster easier and more scalable neuroimaging data sharing and contribute to more diverse and larger samples in machine learning applications.
Onedata as a Platform: Distributed Repository for Virtualizing Data and Long-term Preservation
14:20 (00:20) Event Hall
Łukasz Dutka, Łukasz Opioła
Abstract
With the proliferation of digital data, reliable storage, easy accessibility, and long-term preservation have become paramount. Onedata, a novel platform, emerges as a solution for these challenges by enabling a distributed repository framework for virtualizing data. This presentation delves into how Onedata facilitates seamless data management and ensures long-term preservation. By virtualizing data, Onedata abstracts the underlying storage infrastructures enabling a unified view and easy sharing among different stakeholders. Furthermore, its distributed repository nature significantly enhances data durability and availability. The in-built mechanisms for metadata management and data replication ensure that the information remains intact and accessible over extended periods. Through a detailed exploration of its architecture and functionalities, this presentation will highlight how Onedata can be a robust platform for modern data management and long-term preservation needs, catering to academia, industry, and beyond. The insights provided will foster a better understanding of leveraging distributed repository platforms in navigating the complex landscape of digital data preservation.
Questions and panel discussion
14:40 (00:20) Event Hall
Abstract
Questions and panel discussion
Coffee
15:00 (00:30) Foyer
Workflow provenance-based scheduling
15:30 (00:20) Event Hall
Pedro Martinez
Abstract
Scientific computing workflows have become increasingly complex, often comprising numerous interdependent tasks executed on distributed computing resources. Provenance data, or the history of computational processes, provide a vital link between data reproducibility and task scheduling. Workflows with recorded data provenance can seamlessly integrate with separate workflow management systems, eliminating the need for inter-system communication. In this talk, we introduce a novel tool to perform provenance-based workflow scheduling. Our approach leverages an abstract graph builder tool designed to create abstract graphs representing the high-level structure of workflows. These abstract graphs emphasize dependencies and data flows, facilitating a better understanding of the computational process. Concurrently, we extract concrete graphs from workflow provenance data recorded with DataLad that reflect the actual execution history. The core of our approach lies in comparing the abstract graph to concrete graphs produced by separate runs of the workflow for a set of input parameters. By computing the difference, we can pinpoint tasks that remain unexecuted or require re-execution due to errors or changes in input data, and automatically schedule these tasks. We will outline future directions for this research, including potential extensions to support system-agnostic scheduling, and scalability considerations.
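A toy sketch of the comparison described above: the abstract graph lists every task the workflow defines, the set of executed tasks is reconstructed from provenance records of an earlier run (e.g. parsed from `datalad run` commit records), and their difference yields the tasks still to be scheduled. Task names are hypothetical.

```python
# Abstract graph: task name -> set of tasks it depends on.
abstract_graph = {
    "convert": set(),            # no dependencies
    "segment": {"convert"},
    "visualize": {"segment"},
}
# Tasks found in recorded provenance of a previous execution.
executed = {"convert"}

def schedulable(graph, done):
    """Tasks not yet run whose dependencies are all satisfied."""
    return [task for task, deps in graph.items()
            if task not in done and deps <= done]

print(schedulable(abstract_graph, executed))  # -> ['segment']
```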
Optimisation in Network Engineering: Challenges and Solutions in Research Data Management
15:50 (00:10) Event Hall
Julius Breuer
Abstract
In the complex realm of network engineering design, optimisation methods have been instrumental, using a range of components across different systems and scenarios. However, this complexity presents a threefold challenge: first, managing, tracking, and combining thousands of optimisation calculations, including the specifics of component data, system classifications, scenarios considered, and settings applied; second, integrating diverse data from multiple sources that do not all reside in one place; and third, enabling collaboration (in this case with students, potentially with more people). Such challenges emphasise the need for rigorous research data management. Questions such as "which component data was used in which system?" or the provenance of component data come to the fore. To answer these questions, DataLad is used to store disparate data, models, settings, and results in an effective and distributed manner. DataLad's provenance tracking reduces storage redundancy and the effort required for publication, while increasing confidence in the results. This is done in the context of a research project, but the same questions arise for the industrial application of what has been researched.
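For context, the kind of provenance capture referred to here can be done with DataLad's `run` command, which records a re-executable command together with its inputs and outputs in a commit; that record later answers "which component data produced this result?". A minimal sketch with hypothetical file names and command:

```python
import datalad.api as dl

ds = dl.Dataset("network-study")
ds.run(
    "python optimize.py --components components.csv --out results/run1.json",
    inputs=["components.csv"],
    outputs=["results/run1.json"],
    message="Optimisation run 1: scenario A",
)
```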
fMRI Pipelines on HPC with DataLad and ReproMan
16:00 (00:10) Event Hall
Joe Wexler
Abstract
In this lightning talk, I will share my experience using DataLad, git-annex and ReproMan to run software pipelines on hundreds of fMRI datasets on an HPC cluster. Potential topics may include: (a) The use of ReproMan to avoid the difficulties of using datalad containers-run in parallel on an HPC. (b) How to use DataLad on a scratch filesystem that periodically purges files to save space. (c) A simple algorithm I implemented in ReproMan to prevent excess runtime due to outliers in parallel jobs. (d) The pros and cons of the YODA-BIDS layout for neuroimaging data. I hope my talk will prompt discussion with those hoping to learn more from my experience as well as those who have found alternative solutions to similar challenges.
Reproducibility vs. computational efficiency on HPC systems
16:10 (00:10) Event Hall
Felix Hoffstaedter
Abstract
HPC systems have particular hardware and software configurations that introduce specific challenges for implementing reproducible data processing workflows. The DataLad-based 'FAIRly big workflow' allows separating the compute environment from the processing pipeline, enabling automatic reproducibility across systems. Yet the sheer amount of RAM and number of CPUs on HPC systems allow for different ways to optimize compute jobs, in contrast to smaller compute clusters and certainly the average workstation or laptop. In this talk, I discuss general differences between HPC and more standard compute environments regarding the choices necessary for setting up reproducible processing pipelines. Among the main factors are the availability of RAM, local storage, inodes, and wall-clock time.
Questions and panel discussion
16:20 (00:40) Event Hall
Abstract
Questions and panel discussion
End of day
17:00 (00:30) Foyer
2024-04-05
Arrival and registration
08:30 (00:30) Foyer
Welcome and overview
09:00 (00:15) Event Hall
Abstract
Welcome and overview: day 2.
Neuroscientific data management using DataLad
09:15 (00:20) Event Hall
Julian Kosciessa
Abstract
Robust data management from raw data to published results is necessary to make scientific research more widely reusable. This remains a challenge, particularly in projects that involve a variety of subcomponents and large data. In this talk, I provide a user perspective on using DataLad procedures for structuring, managing, and sharing complex cognitive neuroscience projects. By showcasing example multimodal neuroimaging projects that include, e.g., electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and behavioral data, I will highlight workflows that are uniquely enabled by the distributed nature of DataLad. Based on my experiences, I will also point out the remaining roadblocks I perceive to widespread adoption.
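For readers unfamiliar with DataLad procedures: they are small configuration recipes applied to a dataset. A minimal sketch using two procedures that ship with DataLad (the dataset path is an example, not from the talk):

```python
import datalad.api as dl

# cfg_proc runs a configuration procedure right after creation;
# `yoda` sets up a code/ directory and a conventional project layout.
ds = dl.create("my-study", cfg_proc="yoda")

# Procedures can also be applied to an existing dataset,
# e.g. to keep text files in Git rather than in the annex.
ds.run_procedure("cfg_text2git")
```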
Staying in Control of your Scientific Data with Git Annex
09:35 (00:20) Event Hall
Yann Büchau
Abstract
Scientific experiments can produce a lot of data, often very different in kind and scattered across devices and even remote locations. Keeping all of this in check is not a simple task, and failure to do so can easily cause data loss through accidental deletion or hardware failure (think cheap SD cards in measurement devices at remote locations). Git Annex can help with synchronisation, cataloguing, versioning, and archival of data, as well as collaboration.
Questions and panel discussion
09:55 (00:20) Event Hall
Abstract
Questions and panel discussion
Coffee and key signing
10:15 (00:30) Foyer
Fusion of Multiple Open Source Applications for Data Management Workflows in Psychology and Neuroscience
10:45 (00:20) Event Hall
Julia-Katharina Pfarr
Abstract
Finding a compromise between researchers' needs, their data management skills, data access restrictions, and limited funding for RDM is a complex but highly relevant and timely challenge. At the University of Marburg, this challenge is taken on by the team of the "Data Hub". The team consists of several people with different responsibilities, backgrounds, and affiliations, such as project management staff, scientific staff, data stewards, data scientists, and technical administrative staff, located in Marburg and Gießen. The Data Hub is funded by The Adaptive Mind (TAM) and supported by the information infrastructure project (NOWA) of the SFB135, consortia in the fields of psychology and neuroscience with over 50 PIs, based at several locations in the federal state of Hesse, Germany.

Although the research data in the two consortia are restricted to the fields of psychology and neuroscience, a major challenge is the need to harmonize heterogeneous data. The data encompass research data from different modalities, such as behavior, eye tracking, EEG, and neuroimaging, as well as code for experiments and analysis in various programming languages. Therefore, the data management workflow needs to be applicable to heterogeneous input and output data, different project sizes, and varying numbers of researchers involved. Furthermore, the tools need to be able to integrate these heterogeneous data by utilizing a harmonizing standard in the field (here: BIDS). To increase the reproducibility of research findings, integrated version control and provenance tracking (here: DataLad) should be available. For this, the team must understand at which point to include the researchers: How much background knowledge about the software do they have, and how much do they really need? Which functions of the software are necessary, and which can be skipped because they will never apply to the researchers' work? Do they need a lot of hands-on practice, or is the concept enough?

In our presentation, we will first introduce the Data Hub of the University of Marburg and its technical architecture. We will then present the data management tools utilized in the Data Hub, i.e., DataLad, GIN, GitLab, JupyterHub, and BIDS, focusing specifically on how these tools are interconnected in the Data Hub's research data management workflow. Finally, we will outline the challenges for both the researchers and the data stewards regarding training, support, and maintenance of the services. This talk is not live-streamed.
Git annex recipes
11:05 (00:20) Event Hall
Timothy Sanders
Abstract
This talk will be a survey of various recipes I have come up with for git-annex, including: 1) a discussion of git-annex as a format and its implications; 2) usage of git-annex for collaborative mirroring of non-scientific datasets; 3) using git-annex for system administration purposes, including integration with Gentoo portage; 4) techniques for handling large numbers of keys by reconstructing subset repos; and 5) leveraging BTRFS for transferring data outside of git-annex. See notes here: https://gist.github.com/unqueued/8db6361b66224a84edf9d0d0bbe58439
DataLad-Registry: Bringing Benefits of Centrality to DataLad
11:25 (00:20) Event Hall
Isaac To, Yaroslav Halchenko
Abstract
DataLad-Registry is a service that maintains up-to-date information on over ten thousand datasets, with the collection expanding as more datasets are added. This talk will explore how DataLad-Registry automatically registers datasets from the internet, extracts metadata from them, and keeps these datasets and their corresponding metadata up-to-date. We'll showcase the datasets and metadata types currently available within DataLad-Registry and demonstrate the service's search capability. Additionally, we'll provide an overview of the API and reveal the underlying service components of DataLad-Registry. The presentation will conclude with a discussion on ongoing and future developments, inviting audience input to shape the future of DataLad-Registry.
Questions and panel discussion
11:45 (00:30) Event Hall
Abstract
Questions and panel discussion
Lunch (self-organized, outside venue)
12:15 (01:30) Foyer
Reproducible and replicable data science in the presence of patient confidentiality concerns by utilizing git-annex and the Data Science Orchestrator
13:45 (00:40) Event Hall
Markus Katharina Brechtel, Philipp Kaluza
Abstract
Health-related patient data is among the most sensitive data when it comes to privacy concerns. Data science projects in the medical domain must thus pass a very high bar before data researchers are given access to potentially personally identifiable data, or to pseudonymized patient data that carries an inherent risk of depseudonymization. In the project "Data Science Orchestrator", we propose an organizational framework for ethically chaperoning and risk-managing such projects while they are under way, and a software stack that helps in this task. At the same time, this software stack provides an audit trail across the project that is verifiable even by external scientists without access to the raw data, while keeping the option of future reproducibility and replicability studies open. This is achieved by utilizing git-annex and DataLad in a novel way to provide partial data blinding. Because collecting study-relevant data is often a time- and labor-intensive undertaking in the medical domain, many projects are undertaken by associations that span multiple hospitals, administrative domains, and often even multiple states. The "Data Science Orchestrator" project therefore also implements distributed data science computations that honor these existing administrative boundaries by means of a federated access model, all while keeping the most sensitive data in-house and exclusively in a tightly controlled computation environment. This work was sponsored by the Deutsche Zentren für Gesundheitsforschung (DZG) and the BMBF.
A Tour of Magit
14:25 (00:20) Event Hall
Kyle Meyer
Abstract
Magit is an Emacs interface to Git. Through it, you can drive Git operations, even advanced ones, by typing short key sequences. This talk will show Magit in action. I will give a general overview and then highlight features for preparing and refining a series of commits.
Questions and panel discussion
14:45 (00:20) Event Hall
Abstract
Questions and panel discussion
Coffee
15:05 (00:30) Foyer
Distributed Metadata and Data with Dataverse
15:35 (00:40) Event Hall
Philip Durbin, Jan Range, Oliver Bertuch
Abstract
Dataverse is open source research data repository software that has long supported distributed metadata and is increasingly supporting distributed data. Learn about the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and file "stores" within Dataverse, which can be hosted locally, on S3, on Globus, or at remote locations. We plan to demonstrate a proof of concept for a distributed storage configuration in Jülich DATA and the use of Dataverse APIs to manage data with Python.
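As a taste of the kind of Python API usage mentioned above, here is a minimal sketch using the third-party pyDataverse client; the server URL, API token, and DOI are placeholders, and this is not necessarily the talk's own demo.

```python
from pyDataverse.api import NativeApi

api = NativeApi("https://demo.dataverse.org", "MY-API-TOKEN")

# Fetch a dataset's metadata by its persistent identifier.
resp = api.get_dataset("doi:10.70122/FK2/EXAMPLE")
print(resp.json()["data"]["latestVersion"]["versionState"])
```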
Unconference
16:15 (00:45) Event Hall
Abstract
Unconference slot
End of the conference
17:00 (00:30) Foyer
2024-04-06
Coffee
08:30 (00:30) Seminar room 1, 3rd floor
Kick off / Pitches
09:00 (00:30) Seminar room 1, 3rd floor
Hacking
09:30 (02:30) Seminar room 1, 3rd floor
Lunch (self-organized, outside venue)
12:00 (01:30) Seminar room 1, 3rd floor
Hacking
13:30 (01:00) Seminar room 1, 3rd floor
Coffee
14:30 (00:15) Seminar room 1, 3rd floor
Hacking
14:45 (01:45) Seminar room 1, 3rd floor
Wrap-up
16:30 (00:30) Seminar room 1, 3rd floor
Dinner and social (self-organized, outside venue)
17:00 (02:00) Seminar room 1, 3rd floor