Comments by Alex Hardisty (Cardiff University) on EOSC PID Policy draft

As background, I presently lead digital object architecture work for DiSSCo RI (http://dissco.eu/), including dealing with our requirements for a global “Natural Sciences Identifier” (NSId) PID Scheme. I’m also closely involved in finalising generic guidelines and specific technical implementation guidelines/requirements to be met by FAIR Digital Objects i.e., the “FAIR Digital Object Framework” (FDOF).

In general, this draft EOSC PID policy document is very good, being helpful and clear on the expectations for PID management and infrastructure.

The final version of the document should refer also to the emerging FDO Framework, which is being finalized at present. Complying with FDOF makes FDOs machine-actionable.
(The FDOF current version can be found here: http://bit.ly/fdof102)

Specific remarks

3.4.1.2 on Kernel Information: The implication as written is that contextual metadata exists as a separate object from that which the PID resolves to and that Kernel Information should contain a pointer to this object. This is not always the case and it should certainly not be a recommendation that KI contains this. It is optional, depending on domain. In many cases, the main object can contain its own metadata attributes. A second comment is that the recommendation should be that KI records contain a pointer to the type definition and not the type name or type definition itself. Otherwise, it is implied that a) a consuming system must already know where to find the type definition and/or b) multiple KI records must be updated whenever the type definition changes. I suggest rephrasing as: “In general, the Kernel Information should at least contain attributes that point to where the bit sequence of the referent can be found and a pointer to a type definition. Optionally, it may contain pointer(s) to further contextual objects including metadata referenced through their own PIDs.”
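
To illustrate the rephrasing suggested for 3.4.1.2, a minimal sketch of such a KI record might look as follows. This is only an illustration: the class, the attribute names and the identifier strings are hypothetical and are not taken from the policy draft or from any existing PID scheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class KernelInformation:
    """Hypothetical minimal KI record; all attribute names are illustrative."""
    pid: str
    # Required: where the bit sequence of the referent can be found.
    referent_locations: List[str]
    # Required: a pointer (PID) to the type definition, not the definition itself.
    type_definition_pid: str
    # Optional: PIDs of further contextual objects, including metadata.
    contextual_pids: List[str] = field(default_factory=list)


def validate(ki: KernelInformation) -> List[str]:
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    if not ki.referent_locations:
        problems.append("no location for the bit sequence")
    if not ki.type_definition_pid:
        problems.append("no pointer to a type definition")
    return problems


# Invented identifiers, for illustration only.
example = KernelInformation(
    pid="prefix.example/obj-1",
    referent_locations=["https://repo.example.org/objects/obj-1"],
    type_definition_pid="prefix.example/type-a",
)
```

Because the record holds only a pointer to the type definition, a change to that definition does not require any KI record to be updated.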

3.4.1.3: Resolution must indeed still return the kernel information in the resolution response. It would be helpful to insert a note here that says: “Note that in such cases, it would be good practice that the kernel information should provide an alternative ‘tombstone’ location for the object that states that the object is no longer available, perhaps also giving a standard reason.” Or refer to clause 5.7, where this is also dealt with.
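
As a rough sketch of the tombstone note suggested for 3.4.1.3, resolution could keep returning the kernel information after an object is withdrawn, with the KI itself carrying the tombstone location and a reason. The registry, the field names, the identifiers and the URLs below are all invented for illustration and do not describe any existing resolver.

```python
from typing import Any, Dict

# Hypothetical in-memory registry mapping PIDs to kernel information.
REGISTRY: Dict[str, Dict[str, Any]] = {
    "prefix.example/obj-1": {
        "type_definition_pid": "prefix.example/type-a",
        "referent_locations": ["https://repo.example.org/objects/obj-1"],
    },
    "prefix.example/obj-2": {
        "type_definition_pid": "prefix.example/type-a",
        "referent_locations": [],
        # The object is gone, but the KI still says where its tombstone is.
        "tombstone_location": "https://repo.example.org/tombstones/obj-2",
        "tombstone_reason": "withdrawn by owner",
    },
}


def resolve(pid: str) -> Dict[str, Any]:
    """Always return the kernel information, even for withdrawn objects."""
    return dict(REGISTRY[pid])


def describe(pid: str) -> str:
    """Summarise what a consumer would learn from the resolution response."""
    ki = resolve(pid)
    if "tombstone_location" in ki:
        return (f"no longer available ({ki['tombstone_reason']}); "
                f"see {ki['tombstone_location']}")
    return f"available at {ki['referent_locations'][0]}"
```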

4.1: Section 2.3 of the policy states that objects can be identified by multiple different PID types simultaneously, and this is correct. The implication is that multiple PID infrastructures can exist, for example for different research domains. It is unlikely there will be a single EOSC PID infrastructure (even one based on the Handle system) that is appropriate for all research domains. Even though this is acknowledged in the subsequent subsections of 4 and in 5.2, it should be made explicit in 4.1, e.g., “A PID infrastructure, of which there can be multiple within the EOSC, has a number of defined roles …”.

4.6: The phrase “… processes PIDs and their associated metadata …” seems to be misleading. To avoid confusion, it would be better to say “… processes PIDs and their associated kernel information …”, leaving the term ‘metadata’ to be understood as contextual data applying to the object the PID identifies, and which is likely located elsewhere than at the PID issuing and resolution services.

4.6: What is meant by a PID metadata service? In fact, do these service types have definitions anywhere?

4.6: There is potentially a PID assignment service missing. Issuing of a PID and its assignment (attachment) to a specific object are two different operations e.g., it’s possible to obtain (reserve) a PID long before the associated object comes into existence.
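
The distinction between issuing and assignment can be sketched as two separate operations. This is a minimal illustration only; the class name, the method names and the use of a random UUID as the PID suffix are my own assumptions, not part of the policy or of any particular PID system.

```python
import uuid
from typing import Dict, Optional


class PidService:
    """Hypothetical service that separates issuing a PID from assigning it."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        # Maps each issued PID to the object location, or None while reserved.
        self._assignments: Dict[str, Optional[str]] = {}

    def issue(self) -> str:
        """Issue (reserve) a PID that is not yet attached to any object."""
        pid = f"{self.prefix}/{uuid.uuid4()}"
        self._assignments[pid] = None
        return pid

    def assign(self, pid: str, location: str) -> None:
        """Attach a previously issued PID to a specific object."""
        if pid not in self._assignments:
            raise KeyError(f"{pid} was never issued")
        if self._assignments[pid] is not None:
            raise ValueError(f"{pid} is already assigned")
        self._assignments[pid] = location

    def is_assigned(self, pid: str) -> bool:
        return self._assignments[pid] is not None
```

In this sketch a PID can sit in the reserved state for any length of time before `assign` is called, which is the case described above.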

4.7/4.8: It’s worth pointing out explicitly that both the PID Manager and PID Owner roles can also be performed by machines, as well as by individuals or organisations.

5.2: A common API specification is an excellent suggestion. This should be given immediate and high-priority attention to prevent, as far as possible, the emergence of several different APIs.

5.3: In this sentence, the word ‘in’ should be ‘into’. What is meant by ‘encryption of PIDs’? Surely no PID itself needs to be encrypted but its kernel information may need to be? Or are you saying that there are some applications where even the existence of a PID needs to be disguised/hidden?

5.6: The support of versioning is a complex area where requirements vary greatly from one application to the next, and the first sentence in this clause is misleading. It is objects that can be versioned, not PIDs.

There is a conflict between any requirement of a PID to contain an indication of object version and the requirement expressed in 3.3.1.2 that PID strings should not contain semantics. Regimes for versioning must specifically consider the needs of referencing/citation (i.e., to be able to retrieve the object as it was on the day it was referred to/cited) but also the possibility to retrieve the latest version of the object at any time. Zenodo addresses this problem by maintaining both a top-level PID that points to all versions of a deposition and separate PIDs identifying each version. This works well in an archive or dataset publication context where an object might occasionally be replaced by a corrected or updated version. However, it doesn’t work for dynamic (mutable) objects where the object content always represents the latest state of the data/item, and that may have changed since the object was referred to or cited. In this case, as with time-series datasets, timestamps must be used to reconstruct or access an object as it was at a specific time. The need to take and persistently identify a snapshot of an object in order to refer to it or cite it should be avoided, as this represents an additional barrier to widespread PID scheme adoption and use.

I suggest rewriting the clause as follows: “There should be clear guidelines for users assigning PIDs on how versioning of objects is supported. Applications and repositories must have policies on how to manage versioning in case the FAIR Digital Object or entity changes.”
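
To make the timestamp-based approach for mutable objects under 5.6 more concrete, the following sketch keeps one PID per object and answers both “latest state” and “state as of the citation date” from a timestamped history, without minting a PID per snapshot. It is an illustration under my own assumptions: the class and method names are hypothetical, and changes are assumed to be recorded in chronological order.

```python
import bisect
from datetime import datetime
from typing import Any, List


class MutableObjectHistory:
    """Hypothetical timestamped history of one mutable object (one PID)."""

    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._states: List[Any] = []

    def record(self, when: datetime, state: Any) -> None:
        """Record a new state; changes must arrive in chronological order."""
        if self._times and when < self._times[-1]:
            raise ValueError("changes must be recorded in chronological order")
        self._times.append(when)
        self._states.append(state)

    def as_of(self, when: datetime) -> Any:
        """Return the state as it was at the given time (e.g. citation date)."""
        index = bisect.bisect_right(self._times, when)
        if index == 0:
            raise LookupError("object did not exist at that time")
        return self._states[index - 1]

    def latest(self) -> Any:
        """Return the current state of the object."""
        if not self._states:
            raise LookupError("no state recorded")
        return self._states[-1]
```

A citation would then carry the PID together with the access timestamp, and a call such as `as_of(citation_time)` would recover the object as it was cited.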

6.2: Add to the last sentence to say: “All digital representations should be FAIR Digital Objects that follow the generic guidelines and meet the requirements of the FAIR Digital Object Framework.”

Appendix 2, Glossary: The definition of Digital Object is incorrect. It should read: “A Digital Object has a bit sequence that can be stored in multiple repositories and is associated with a Persistent Identifier (PID) and a type definition”. There is no notion of quality in the definition of a DO.

Appendix 2, Glossary: Add a definition of the FAIR Digital Object Framework: “A Framework of generic guidelines and requirements to be met by FAIR Digital Objects that extends the digital object concept with semantic relationships.”

Alex Hardisty
15th January 2020

School of Computer Science and Informatics
Cardiff University

email: hardistyar@cardiff.ac.uk
END.

Thank you @Hardisty, for your excellent comments and suggestions!

We will consider them for the next version.

Anders (from EOSC PID task force)