Recommendations on Data Versioning

I’d like to initiate a discussion about a recently published paper containing recommendations on data versioning:

Klump, J., Pampel, H., Rothfritz, L., & Strecker, D. (2024). Recommendations on Data Versioning. Berlin, Germany: Humboldt-Universität zu Berlin. https://doi.org/10.5281/zenodo.13743876

In my view, these recommendations are very useful to help us turn general principles into actionable recommendations. Here are some comments on the paper:

Page 10: “Recommendations”

Comment: For data managed without Git or another versioning system, I think there should be a recommendation that all changes result in a new minor or major release. Otherwise, how can you ensure that changes are documented in a transparent and consistent manner? Simply storing the data somewhere alongside a document describing the changes is not enough to ensure reproducibility in a persistent way.
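To make this concrete, here is a purely hypothetical sketch of what such a requirement could look like for a dataset managed outside a versioning system: every change, however small, is tied to a new release number with a dated entry (the file name, numbers, and entries below are invented for illustration).

```
VERSIONS.txt (hypothetical example)

2.0.0  2024-09-01  Dropped deprecated station records (breaking schema change)
1.1.0  2024-05-14  Added 2023 measurements for stations 12-18
1.0.1  2024-02-02  Corrected unit label in the "temp" column (no data change)
1.0.0  2023-11-20  Initial release
```

With such a rule, the state of the data cited at any given version number can always be recovered and compared, which a free-form change description alone does not guarantee.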

Page 15: “Using Git for revision management automatically records provenance.”

Comment: I think this needs to be modified. Git records “technical” changes at the file level, but unless the people making the changes provide additional information, e.g., about methodology, the origin of sources, etc., the Git-recorded information does not constitute sufficient provenance.

Page 15: “There are no known examples where manifestations of data have been separately identified with PIDs.”

Comment: Not sure I understand what this sentence means, but in DataverseNO, the same expression (e.g., tabular data) may be archived as multiple manifestations (e.g., as an Excel file and as a tab-separated plain text file), each of them identified with its own file-level DOI.

Page 15: “Reference to the item supports reproducibility and gives credit to the infrastructure providing access to this resource.”

Comment: In what way does reference to the item give credit to the infrastructure and not to the creator?

Page 17: “[C]are must be taken to communicate clearly to the users which identifier is used for the object described by the metadata and which identifier refers to the metadata record.”

Comment: I’m not sure whether I understand this completely. Let’s consider the case of a dataset published in DataverseNO, a Dataverse-based repository. Each dataset has a metadata record describing the dataset, including a dataset-level DOI. This information is displayed on the dataset landing page and can be retrieved in a machine-friendly way. If the dataset contains files, which is the most common case, each of these files has some recorded metadata, including a file-level DOI. This information is displayed on the file landing page and can be retrieved in a machine-friendly way. In this use case, 1) does the dataset-level DOI correspond to what the Recommendations describe as the identifier referring to the metadata record, and 2) does the file-level DOI correspond to what the Recommendations describe as the identifier used for the object described by the metadata?

Best,
Philipp


Philipp Conzett
UiT The Arctic University of Norway

Adding to this discussion, we developers in the realm of distributed data versioning software were a bit disheartened to see only DVC mentioned for Git-like versioning of data files. It is a great open source tool, but arguably one with quite a tight focus on the domain of machine learning pipelines. I hope people don’t mind if I point out some other tools (disclaimer/COI: I’m an RSE working on some of them).

An especially noteworthy Git extension is git-annex (git-annex.branchable.com), an actively developed free and open source tool first released in 2010. It’s a completely domain-agnostic tool for distributed data versioning. At the Research Centre Jülich (my employer) alone, the amount of data version-controlled with this software is in the range of petabytes. I myself am part of the development team of the DataLad (datalad.org) ecosystem, academic, domain-agnostic open source software for data versioning, which extends Git and git-annex with, among other things, the ability to version computational environments alongside code and data. There are even Git hosting services that support data versioned with these tools, such as Gogs (e.g., gin.g-node.org) or the Gitea fork Forgejo (forgejo-aneksajo, codeberg.org/matrss/forgejo-aneksajo). Beyond those two tools I have personally contributed to, there are many more tools for (scientific) data versioning; we even hosted a conference on the topic of distributed data versioning in April (https://distribits.live), and I’m certain there are more out there than what was presented at the conference.
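For readers unfamiliar with git-annex, a typical session looks roughly like this (an illustrative transcript, not a runnable script; the file name is invented, and the drop/get steps assume at least one other clone or special remote already holds a copy of the content):

```
$ git init mydata && cd mydata
$ git annex init
$ git annex add big-scan.nii       # content goes into the annex; Git tracks a small pointer
$ git commit -m "Add raw scan"
$ git annex whereis big-scan.nii   # lists which repositories hold the actual content
$ git annex drop big-scan.nii      # free local space without losing the data
$ git annex get big-scan.nii       # fetch the content back on demand
```

The design point is that Git versions lightweight pointers while the (potentially huge) file content is tracked and moved separately, which is what makes petabyte-scale versioning feasible.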

Maybe the authors could consider extending the list of version control systems for data in the next version of the recommendations. :slight_smile:

Thank you for pointing out the additional tools available for data versioning. Even though I have worked on this topic for almost ten years, and some of us have been involved in the Helmholtz Open Science activities, we had not encountered them. This shows how siloed the field still is. I will add them to the use case collection of the RDA Data Versioning IG (Updated Data Versioning Use Cases 1.2 Draft for Comments - Google Docs). The “Recommendations” will be discussed at the next RDA Plenary.