Persistent Identifier (PID) Definition

carly.robinson · April 7, 2021, 2:11pm

I’m interested to learn how others/organizations define the term persistent identifier (PID). Here’s the current definition we use:

A PID is a digital identifier that is globally unique, persistent, machine resolvable, has an associated metadata schema, identifies an entity (e.g., individual researcher, publication, award, digital research output, organization) in perpetuity, and is frequently used to disambiguate between entities.

0000-0002-5119-2271 · April 7, 2021, 3:56pm

This part - "has an associated metadata schema" - is a bit sensitive, @carly.robinson, for me personally. Handles, for instance, are just pointers; they don’t come with a metadata schema, yet they are still PIDs…

Also, "in perpetuity" is kinda a big statement; PIDs always break and need fixing and redirecting…

Eugene

sheila.rabun · April 7, 2021, 4:16pm

I agree that the metadata schema requirement may be up for debate. ARKs, for example, don’t necessarily need to have metadata attached.

alicemeadows · April 8, 2021, 5:19pm

Thanks for raising this @carly.robinson! As mentioned in the call yesterday, I (rather lazily, admittedly!) usually use the Wikipedia definition - or a version of it: "A persistent identifier (PID) is a long-lasting reference to a digital object. Typically, such an identifier is not only persistent but actionable." It would be great to agree a PID Forum community definition that we could all use! I’ve just invited feedback on Twitter too…

carly.robinson · April 8, 2021, 5:32pm

Thank you for inviting feedback on Twitter! Agreed, it would be great if there was a community definition.

carly.robinson · April 8, 2021, 5:42pm

Thanks so much for sharing! I really appreciate the comment from you and @sheila.rabun about the associated metadata schema being sensitive in the definition. It would be great to hear more about use cases where an associated metadata schema isn’t needed - e.g. using Handles or ARKs vs. another PID.

For our use cases, the associated metadata is key for creating connections between PIDs. For example, we assign DOIs to datasets. In the dataset DOI metadata we want to include the ORCID iDs for the data creator, the ROR ID for the data creator’s affiliation, the ROR ID for the funding organization, an Award DOI for the funding, associated journal article DOIs in related identifiers, etc. Without the associated metadata schema, we wouldn’t be able to create those connections.
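As a minimal sketch of the connections described above, the dataset record below is a plain Python dictionary. The field names (`creators`, `affiliation_ror`, `funding`, `related_identifiers`) and every identifier value are hypothetical placeholders chosen for illustration; this is not the DataCite or Crossref schema.

```python
# Hypothetical dataset record. Field names and identifier values are
# illustrative placeholders, not a real metadata schema or real PIDs.
dataset_record = {
    "doi": "10.0000/example-dataset",
    "creators": [
        {
            "name": "Example Creator",
            "orcid": "0000-0000-0000-0000",
            "affiliation_ror": "https://ror.org/example-affiliation",
        }
    ],
    "funding": [
        {
            "funder_ror": "https://ror.org/example-funder",
            "award_doi": "10.0000/example-award",
        }
    ],
    "related_identifiers": [
        {"relation": "IsSupplementTo", "doi": "10.0000/example-article"}
    ],
}


def linked_pids(record):
    """Collect every (role, PID) pair that the record's metadata points to."""
    links = []
    for creator in record.get("creators", []):
        links.append(("creator", creator["orcid"]))
        links.append(("affiliation", creator["affiliation_ror"]))
    for grant in record.get("funding", []):
        links.append(("funder", grant["funder_ror"]))
        links.append(("award", grant["award_doi"]))
    for related in record.get("related_identifiers", []):
        links.append((related["relation"], related["doi"]))
    return links


for role, pid in linked_pids(dataset_record):
    print(f"{dataset_record['doi']} -> {role}: {pid}")
```

Remove the metadata and the dataset DOI still identifies the dataset, but `linked_pids` has nothing to return, which is the loss of connections described above.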

0000-0002-5119-2271 · April 8, 2021, 7:08pm

One example would be licensed data where copies are available in multiple places and we don’t want the metadata exposed broadly, so as not to confuse users. In our Abacus Dataverse - https://abacus.library.ubc.ca/ - we share more than 40,000 data files with handles only. For research data and research objects, we do mint DOIs; in fact, more than 260,000 of them in the last few years.

E.

Hardisty · April 14, 2021, 3:51pm

The definition that DiSSCo (Distributed System of Scientific Collections) uses is: “a persistent identifier is a string (functioning as a symbol/name) that identifies a digital object. The identifier can be persistently and reliably resolved to digitally actionable meaningful information about the identified digital object.”

We don’t say that it is globally unique, but it must be; otherwise reliable resolution won’t occur. Also, we don’t mention metadata because that is a characteristic of the object, not a characteristic of its identifier. As Eugene @0000-0002-5119-2271 said, Handles (which are the PIDs DiSSCo will use) are just pointers, but only once they have a PID Record associated with them. Before that, they’re just names, like my name “Alex”. It doesn’t tell you anything about me or how/where to find me.
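The difference between a bare name and a pointer can be sketched with a toy in-memory registry. This is only an illustration under stated assumptions: the identifier string, the `pid_records` dictionary, and the `location` field are all hypothetical, and a real Handle service is a networked system rather than a Python dict.

```python
# Toy registry mapping identifier strings to PID records.
# Everything here is hypothetical and for illustration only.
pid_records = {}


def resolve(identifier):
    """Return the PID record for an identifier, or None if it is only a name."""
    return pid_records.get(identifier)


name = "20.500.00000/example"  # made-up identifier string

# Before a record is associated, the string is just a name.
print(resolve(name))

# Once a PID record is associated, the same string works as a pointer.
pid_records[name] = {"location": "https://example.org/objects/1"}
print(resolve(name))
```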

Metadata is often needed of course, to tell you something about the identified thing and to make connections to other things, as has been mentioned already.

It’s also not true to say that PIDs exist in perpetuity. They exist only for as long as they are needed, which can be a very long time (>100 years in the case of DiSSCo). How long is a policy decision related to the purpose for which they are being used. There are use cases, for example, where workflows can create huge numbers of PIDs for intermediate results during multiple parameter sweeps and data sweep executions that don’t need to be retained beyond the workflow runs.

However, while PIDs do exist they must persistently identify the correct thing and persistently resolve reliably and stably. It is the identifier that is persistent, not necessarily the thing to which it resolves, although in many use cases that is also the case.

Jonathan_DOI · April 15, 2021, 4:15pm

Hi @carly.robinson, the DOI Foundation needed a broad definition in anticipation of a wide variety of use cases. This is what we came up with: A DOI is an identifier of an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks.

carly.robinson · April 16, 2021, 1:59pm

Thanks @Jonathan_DOI! How do you define/think about persistence? Related to @Hardisty’s great comment - we were tying persistence to identifying an entity in perpetuity, but maybe that isn’t the correct way to think about it.

Jonathan_DOI · April 23, 2021, 3:44pm

@carly.robinson We think about persistence a lot! Have you seen Andrew Treloar’s “5 persistencies”? (see here). It’s a really helpful breakdown of the different ways in which a PID system is (or should be) persistent. We have used it to audit the DOI System.

The first thing to think about is the persistence of the identifier as a thing. We believe that PIDs should never be deleted, although there may be edge cases for which it is justified (e.g. GDPR). However, for the vast majority of DOIs we think there are, broadly speaking, two situations:
Either the DOI should never have been minted in the first place, in which case we would resolve it to a tombstone page that says so,
or
it was minted in error and a correct DOI exists, in which case we use aliasing to point to the correct one.
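The two situations above can be sketched as a small lookup routine. This is a toy model, not how the DOI System is implemented: the `registry` layout, the `status` values, the `resolve_doi` function, and all identifiers are assumptions made for illustration.

```python
# Toy registry: each hypothetical DOI maps to a record with a status.
# The layout and the identifiers are illustrative assumptions only.
registry = {
    "10.0000/good": {"status": "active", "url": "https://example.org/good"},
    "10.0000/duplicate": {"status": "alias", "target": "10.0000/good"},
    "10.0000/mistake": {
        "status": "tombstone",
        "note": "This DOI should not have been minted.",
    },
}


def resolve_doi(doi, max_hops=5):
    """Follow aliases until an active record or a tombstone is reached."""
    for _ in range(max_hops):
        record = registry.get(doi)
        if record is None:
            raise KeyError(f"unknown identifier: {doi}")
        if record["status"] == "alias":
            doi = record["target"]  # minted in error; point to the correct DOI
            continue
        return doi, record  # active record or tombstone, never a deletion
    raise RuntimeError("alias chain too long")


for example in ("10.0000/good", "10.0000/duplicate", "10.0000/mistake"):
    print(example, "->", resolve_doi(example))
```

In this sketch no identifier is ever removed from the registry: a mistaken DOI still resolves, either to the correct record through its alias or to a tombstone that explains what happened.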

The persistence of the binding between identifier and object is obviously very important. We believe that a PID must resolve to something; otherwise, what is the point? We have processes to check that DOIs do resolve to where they should - although this is much harder to do in practice than you might think. In the unfortunate event that a (valid) PID no longer resolves, we think there should be some sort of “tombstone” page that displays useful information (e.g. the metadata).

The persistence of the object itself is tricky for a PID provider. In an ideal world, we would only work with people who can guarantee the persistence of the objects they manage! In practice, all we can realistically do is to set policy and share best practices, and try to mitigate with metadata.

Finally, there is the persistence of the PID service / infrastructure. We distinguish between disaster recovery and business continuity. Dealing with disaster recovery is relatively straightforward: think back-ups and archiving services such as CLOCKSS or PORTICO. It’s best practice for any information service provider.

Business continuity is harder. What should happen in the event a PID service provider is unable to continue offering services for a reason such as funding being withdrawn or bankruptcy? This is a much harder problem than you might think - there is a lot of embedded know-how in organisations that is very hard to archive or escrow. Obviously the best way is to make sure it never happens or that there is some kind of fallback, or, in the worst case, that people close to the failed entity have the right to restart afresh.

So we believe the way to deal with this is to consider it as a business risk and to mitigate this risk. For instance, our not-for-profit status protects us from a predatory takeover; we have explicit terms in our agreements to protect the service and allow a smooth transition in the event of bankruptcy; we maintain cash reserves so we can meet our financial liabilities; and so on.

I hope this is helpful.
It’s an ongoing discussion point for us and I’m more than happy to share our thinking with you and others who are interested.

carly.robinson · April 26, 2021, 12:45pm

Thank you, @Jonathan_DOI! It’s incredibly helpful to hear more about how you think of persistence. I had not seen Andrew Treloar’s 5 persistencies. It is a wonderful resource and something I’d like to share within our communities for discussion.

It’s also helpful to hear about the DOI edge cases. As an organization working with Crossref and DataCite to assign DOIs, we try very hard to keep records/objects for which we’ve assigned DOIs available. But as you say, there are cases where an object should never have received a DOI. When a duplicate DOI has been assigned in error, we have been using the related identifier metadata fields to create a persistent link between the DOIs. Knowing which DOI to consider the “correct” DOI is another interesting question.