I’d like to initiate a discussion about a recently published paper containing recommendations on data versioning:
Klump, J., Pampel, H., Rothfritz, L., & Strecker, D. (2024). Recommendations on Data Versioning. Berlin, Germany: Humboldt-Universität zu Berlin. https://doi.org/10.5281/zenodo.13743876
In my view, these recommendations are very useful in helping us turn general principles into actionable practice. Here are some comments on the paper:
Page 10: “Recommendations”
Comment: For data managed without Git or another version control system, I think there should be a recommendation that every change results in a new minor or major release. Otherwise, how can you ensure that changes are documented in a transparent and consistent manner? Simply storing the data somewhere together with a document describing the changes is not enough to ensure reproducibility in a persistent way.
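To illustrate what I mean, here is a minimal sketch of such a rule in Python. It is only an illustration under my own assumptions, not something taken from the paper: the names `Release`, `ReleaseHistory`, and `publish` are hypothetical, and so is the simple "MAJOR.MINOR" numbering. The point is that a change cannot be recorded without both a new version number and a description of the change.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Release:
    version: str  # "MAJOR.MINOR" (hypothetical numbering scheme)
    note: str     # human-readable description of what changed


@dataclass
class ReleaseHistory:
    releases: List[Release] = field(default_factory=list)

    def publish(self, note: str, major: bool = False) -> Release:
        """Record a change as a new release; a change note is mandatory."""
        if not note.strip():
            raise ValueError("every release needs a change description")
        if not self.releases:
            # The first release always starts at 1.0.
            version = "1.0"
        else:
            maj, mino = map(int, self.releases[-1].version.split("."))
            # A major change resets the minor counter; otherwise bump minor.
            version = f"{maj + 1}.0" if major else f"{maj}.{mino + 1}"
        release = Release(version, note)
        self.releases.append(release)
        return release
```

With this kind of rule, the history itself becomes the change documentation: each entry pairs a citable version with the reason for it, rather than relying on a separate document that may or may not be kept up to date.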
Page 15: “Using Git for revision management automatically records provenance.”
Comment: I think this needs to be modified. Git records “technical” changes at the file level, but unless the people making the changes provide more information, e.g., about methodology, origin of sources, etc., the Git-recorded information does not constitute sufficient provenance.
Page 15: “There are no known examples where manifestations of data have been separately identified with PIDs.”
Comment: Not sure I understand what this sentence means, but in DataverseNO, the same expression (e.g., tabular data) may be archived as multiple manifestations (e.g., as an Excel file and as a tab-separated plain text file), each of them identified with its own file-level DOI.
Page 15: “Reference to the item supports reproducibility and gives credit to the infrastructure providing access to this resource.”
Comment: In what way does reference to the item give credit to the infrastructure and not to the creator?
Page 17: “[C]are must be taken to communicate clearly to the users which identifier is used for the object described by the metadata and which identifier refers to the metadata record.”
Comment: I’m not sure whether I understand this completely. Let’s consider the case of a dataset published in DataverseNO, a Dataverse-based repository. Each dataset has a metadata record describing the dataset, including a dataset-level DOI. This information is displayed on the dataset landing page and can be retrieved in a machine-friendly way. If the dataset contains files, which is the most common case, each of these files has some recorded metadata, including a file-level DOI. This information is displayed on the file landing page and can be retrieved in a machine-friendly way. In this use case, 1) does the dataset-level DOI correspond to what the Recommendations describe as the identifier referring to the metadata record, and 2) does the file-level DOI correspond to what the Recommendations describe as the identifier used for the object described by the metadata?
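To make the use case concrete, the structure I have in mind can be sketched roughly as follows. This is only an illustration and not the actual Dataverse data model; the class names, fields, and DOIs are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileRecord:
    doi: str   # file-level DOI, shown on the file landing page
    name: str  # file-level metadata (only the file name in this sketch)


@dataclass
class DatasetRecord:
    doi: str    # dataset-level DOI, shown on the dataset landing page
    title: str  # dataset-level metadata (only the title in this sketch)
    files: List[FileRecord] = field(default_factory=list)

    def identifiers(self) -> List[str]:
        """Return the dataset-level DOI followed by all file-level DOIs."""
        return [self.doi] + [f.doi for f in self.files]


# Placeholder identifiers, not real DOIs.
dataset = DatasetRecord(
    doi="doi:10.9999/EXAMPLE/DS1",
    title="Example dataset",
    files=[
        FileRecord(doi="doi:10.9999/EXAMPLE/DS1/F1", name="data.xlsx"),
        FileRecord(doi="doi:10.9999/EXAMPLE/DS1/F2", name="data.tab"),
    ],
)
```

My two questions above are then about which role `DatasetRecord.doi` and `FileRecord.doi` each play in the terminology of the Recommendations.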
Best,
Philipp
Philipp Conzett
UiT The Arctic University of Norway