Reproducible and replicable data science in the presence of patient confidentiality concerns by utilizing git-annex and the Data Science Orchestrator
Health-related patient data is among the most sensitive data where privacy is concerned. Data science projects in the medical domain must therefore pass a very high bar before researchers are allowed access to potentially personally identifiable data, or to pseudonymized patient data that carries an inherent risk of depseudonymization. In the project “Data Science Orchestrator”, we propose an organizational framework for ethically chaperoning and risk-managing such projects while they are under way, and a software stack that helps in this task. At the same time, this software stack will provide an audit trail across the project that is verifiable even by external scientists without access to the raw data, while keeping the option of future reproducibility and replicability studies open. This is achieved by using git-annex and datalad in a novel way to provide partial data blinding. Because collecting study-relevant data is often a time- and labor-intensive undertaking in the medical domain, many projects are undertaken by associations that span multiple hospitals, administrative domains, and often even multiple states. The “Data Science Orchestrator” project therefore also implements distributed data science computations, which make it possible to honor these existing administrative boundaries by means of a federated access model, all while keeping the most sensitive data in-house and exclusively in a tightly controlled computation environment. This work was sponsored by Deutsche Zentren für Gesundheitsforschung (DZG) and BMBF.