Do metadata records need separate PIDs?

0000-0001-5911-6022 · June 1, 2022, 3:29am

This post resulted from an online discussion among Keith Russell, Kirsten Elger, Natasha Simons, Lesley Wyborn, and Jens Klump.

In an effort to accelerate broad community convergence on FAIR implementation options, the GO FAIR community has launched the development of machine-actionable FAIR Implementation Profiles (FIP) (Schultes et al., 2020). In their questionnaire, the very first question raises a point that needs clarification.

The FIP questionnaire starts with the question “What globally unique, resolvable identifiers do you use for metadata records?”. This question is related to FAIR Principle F1 (Findable) (Wilkinson et al., 2016).

Asking for separate PIDs as identifiers for metadata records, as well as for the data object, raises the question: what is the use case? Why would I want to identify the metadata record? Is the metadata record a research artefact that I need to identify unambiguously into the future? What is the metadata record identifying? Are we identifying the data at the “expression” or “item” level (Klump et al. 2021)?

This question was added because data set PIDs frequently resolved to landing pages, with no consistent way for a machine to find the link to the actual data set. To ensure that both the landing page and the data set could be found, a request for separate DOIs was added: one resolving to the landing page and one resolving to the data set.

The question of where a PID should resolve to is as old as the PID community itself, and early on it was agreed that a PID should always resolve to a landing page. DataCite states in its notes for DOI best practice:

DOIs should resolve to a landing page, not directly to the content

It is important that both humans and machines have context for the item that the DOI is resolving to. DOIs should therefore resolve to a landing page containing metadata about the item, rather than to a PDF, for example. The landing page should contain a full bibliographic citation, so that a human can tell they have arrived at the correct item, and so that a machine can retrieve additional information about the item that might not be easily retrievable from the item itself.

A more thorough discussion on how to use PIDs in scholarly data repositories was published by Fenner et al. (2019).

However, it has been recognised that there should also be a machine-actionable pathway from the resolved PID to the content object, so that machines can find their way from the PID to the data set without human intervention.

As an example, the DataCite Metadata Working Group proposes to introduce a new metadata element that would point from the metadata record to the content object. For a more scalable approach, this element could also be embedded in the DOI Handle object itself (Weigel et al., 2019).
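To make the second option more concrete, here is a minimal sketch of how a client might read a content pointer from a Handle record. It assumes a record shaped loosely like the JSON returned by the Handle proxy REST API (a list of typed values). The value type `CONTENT_URL`, the handle, and all URLs are hypothetical placeholders, not anything DataCite or the Handle System has defined.

```python
from typing import Optional

# Sample record shaped loosely like a Handle proxy REST API response.
# The handle, the URLs, and the "CONTENT_URL" type are made up for illustration.
SAMPLE_RECORD = {
    "responseCode": 1,
    "handle": "10.1234/example",
    "values": [
        {
            "index": 1,
            "type": "URL",  # conventional pointer to the landing page
            "data": {"format": "string", "value": "https://repo.example.org/landing/42"},
        },
        {
            "index": 2,
            "type": "CONTENT_URL",  # hypothetical type pointing at the content object
            "data": {"format": "string", "value": "https://repo.example.org/files/42.nc"},
        },
    ],
}


def find_value(record: dict, value_type: str) -> Optional[str]:
    """Return the first value of the given type in the record, or None."""
    for entry in record.get("values", []):
        if entry.get("type") == value_type:
            return entry.get("data", {}).get("value")
    return None


landing_page = find_value(SAMPLE_RECORD, "URL")
content_url = find_value(SAMPLE_RECORD, "CONTENT_URL")
print(landing_page)
print(content_url)
```

With a pointer like this in the record, one PID could serve both audiences: humans follow the landing page value, machines follow the content value.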

To implement FAIR fully, it would be important for other persistent identifier systems to follow an approach similar to DataCite's. It is also important that this information is presented on the landing page in a consistent, machine-readable fashion so that machines can easily parse it and find their way to the data set.
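One way to picture a consistent, machine-readable landing page is typed links in the HTML head. The sketch below assumes the landing page uses link relations in the style of the Signposting convention (`cite-as`, `describedby`, `item`). That convention is not mentioned in the post and is used here only as an illustration; the page content and URLs are placeholders.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TypedLinkCollector(HTMLParser):
    """Collect <link rel="..." href="..."> elements, grouped by relation type."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # A rel attribute may carry several space-separated relation types.
        for rel in (attributes.get("rel") or "").split():
            self.links.setdefault(rel, []).append(href)


def typed_links(html: str) -> Dict[str, List[str]]:
    """Parse a landing page and return its typed links keyed by relation."""
    collector = TypedLinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical landing page; every URL here is a placeholder.
SAMPLE_LANDING_PAGE = """
<html><head>
<title>Example data set</title>
<link rel="cite-as" href="https://doi.org/10.1234/example">
<link rel="describedby" href="https://repo.example.org/records/42.json">
<link rel="item" href="https://repo.example.org/files/42.nc">
</head><body><p>Full bibliographic citation goes here.</p></body></html>
"""

links = typed_links(SAMPLE_LANDING_PAGE)
print(links.get("item"))         # candidate location(s) of the data set
print(links.get("describedby"))  # candidate location(s) of the metadata record
```

A client that resolves the PID and lands on such a page can reach the data set without scraping the visible page content.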

Fenner, M., Crosas, M., Grethe, J. S., Kennedy, D., Hermjakob, H., Rocca-Serra, P., et al. (2019). A data citation roadmap for scholarly data repositories. Scientific Data, 6(1), 1–9.

Klump, J., Wyborn, L. A. I., Wu, M., Martin, J., Downs, R. R., & Asmi, A. (2021). Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles. Data Science Journal, 20(1), 12 p. https://doi.org/10.5334/dsj-2021-012

Schultes, E., Magagna, B., Hettne, K. M., Pergl, R., Suchánek, M., & Kuhn, T. (2020). Reusable FAIR Implementation Profiles as Accelerators of FAIR Convergence. In G. Grossmann & S. Ram (Eds.), Advances in Conceptual Modelling (Vol. 12584, pp. 138–147). Cham, Switzerland: Springer International Publishing.

Weigel, T., Schwardmann, U., Klump, J., Bendoukha, S., & Quick, R. (2019). Making data and workflows findable for machines. Data Intelligence, 2(1–2), 40–46.

Wilkinson, M. D., Dumontier, M., Packer, A. L., Gray, A. J. G., Mons, A., Gonzalez-Beltran, A., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.

matt.buys · June 2, 2022, 8:01am

It may also be of interest to know that the DataCite Metadata Working Group is considering implementing content URLs as part of the schema (see https://portal.productboard.com/71qotggkmbccdwzokuudjcsb/c/70-metadata-schema-support-for-content-urls?utm_medium=social&utm_source=portal_share).

sharif.islam · June 7, 2022, 11:47am

Hi Jens (@0000-0001-5911-6022),

Thanks for summarising the issue regarding separate PIDs and machine-actionable pathways. You are probably already aware that the FAIR Implementation Profile is now a new working group in the FAIR Digital Object Forum (https://fairdo.org/wg/fdo-fipp/). We are also aligning these conversations there.

A FAIR Digital Object (FDO) vantage point provides us with a level of abstraction for dealing with data versus metadata issues: put simply, everything is an object with an identifier and attributes. We are also discussing a more granular understanding of the term machine actionability, breaking it down further into readability and interpretability (while acknowledging the overlaps and the lack of clear conceptual distinctions between these terms). This, I believe, gives us a framework to think beyond “datasets” (we need PIDs for software, workflows, machine learning feature sets etc.) and to focus on particular domain-specific use cases. For example, within DiSSCo we are trying to establish a bi-directional link between samples and genomic sequences. The sequence in GenBank has an internal ID (accession id: example) and in some cases the MIME type “x-fasta” is used, but we are far from PID graph/DOI-level interactions between specimens in the museum and the DNA record stored elsewhere. For a machine-actionable pathway here, users are often more accustomed to a Jupyter notebook type environment than to a landing page. So we also need FAIR implementations for notebooks and the various workflow artefacts.

I think this is a great initiative from DataCite. Also, the elements embedded in the global handle PID record will improve data discoverability a lot:

dwinston · June 24, 2022, 2:51am

A metadata record is a set of assertions about a data object. There may be multiple such assertion-sets over time.

They may have different authors, they may be revisions stewarded by the same service provider, etc.

So-called “intrinsic” metadata about the data object is unlikely to vary from record to record, although errors may be corrected.
So-called “user-defined” or provenance metadata, however, may vary considerably, and it sets the context for analysis. All records point to the same data object.

How can we communicate what we knew about a data object at a given time? We do this with git repositories by referring to commit hashes. The value of PIDs for metadata records is akin to the value of revision control for code, and to the value of the archive.org Wayback Machine for preserving the context of past references to webpage content.
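The commit-hash analogy can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields and identifiers are hypothetical, and a content hash is not itself a resolvable PID. It only shows how each revision of a metadata record can get its own stable reference while every revision still points to the same data object.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Hash a metadata record in a canonical form, much as git hashes a commit."""
    # Sorting keys and fixing separators makes the hash independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical revisions of a metadata record about the same data object.
revision_1 = {
    "data_object": "https://doi.org/10.1234/example",  # placeholder identifier
    "title": "Example data set",
    "description": "Temperature measurments, 2020",  # typo later corrected
}
revision_2 = dict(revision_1, description="Temperature measurements, 2020")

fingerprint_1 = record_fingerprint(revision_1)
fingerprint_2 = record_fingerprint(revision_2)

# Each revision has its own reference, yet both describe the same data object.
print(fingerprint_1 != fingerprint_2)
print(revision_1["data_object"] == revision_2["data_object"])
```

Citing a specific fingerprint (or, in practice, a PID assigned to that revision) records what was asserted about the data object at that point in time.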