Developing a US National PID Strategy Report

TAC_NISO · March 19, 2024, 1:29pm

We are thrilled to announce the publication of Developing a US National PID Strategy.

Utilizing the framework created by the Research Data Alliance, this report was created in collaboration with members of the Higher Education Leadership Initiative for Open Scholarship (HELIOS Open) and the Community Effort on Research Output Tracking workstreams organized by the Open Research Funders Group (ORFG).

This report outlines the benefits of PIDs, their associated metadata, and the systems that connect them in advancing open scholarship goals in the United States. It provides information on the research and policy landscape associated with PIDs, discusses the value of PID infrastructure, and offers recommendations for effective utilization of PIDs in connecting and tracking research outputs. Ideally, this guidance will be widely adopted by organizations throughout the research ecosystem in the US and potentially adapted in other national contexts, as part of a growing movement to deploy national persistent identifier strategies.

To read the full report: Developing a US National PID Strategy

TAC_NISO · March 19, 2024, 7:07pm

We welcome your thoughts and comments on this report. As the report describes, this effort will ideally advance through a consensus standards process within NISO, where the project and its recommendations will be vetted again by the community. Additional feedback is encouraged either here or via the RDA Interest Group on National PID Strategies discussion board.

erik-desci · March 21, 2024, 1:50pm

Hello!

Thanks for sharing this. Two quick questions:

1. I see where section 3.6.1 emphasizes the importance of “centralized” persistent identifier systems. Recent concerns have been raised about the scalability of centralized systems in the naturally distributed landscape of academia. Any particular reason for the emphasis on centralization given recent issues? See here, here, here, and here
2. Would it be possible to include sections around monitoring emerging PID systems? I believe the Dutch National PID policy has included some thoughts on the topic which may be useful. A number of interesting identifiers are emerging now which I worry the US may miss out on (dARKs, ISCC Codes, and dPIDs). It might also be good to expand the supervisory group for emerging PIDs past incumbents such as Crossref and DataCite to ensure fair and equitable representation.

Thank you,
Erik Van Winkle

TAC_NISO · March 21, 2024, 2:06pm

Erik,

Thank you for your comment.
A few thoughts:
A) This is a preliminary draft to advance the discussions and shouldn’t be considered “final”, apart from being what members of the group considered to be best thinking at this stage. It will go on, we expect, to additional refinement as new voices and thoughts, such as yours, are added to the conversation. This will be the next phase of the project, as we move toward greater formalization and consensus.
B) It is important to distinguish between “centralized” identifier systems, the distributed systems you describe, and more ad hoc identifiers like accession numbers. While each potentially has its place, this report is more making a case against systems that lack
C) Distributed systems certainly could provide value and I’m supportive of the experiments and proofs of concept (I’m part of the ISCC project, soon to be published by ISO). However, they are too early in their development to be recommended as best practice in the community. For example, we didn’t stress adoption of RAiD for exactly the same reason.
D) More fundamentally, I have a couple issues with distributed approaches based on the intrinsic identification approach. For example, they rely on search and aggregation rather than registry. This could create issues with persistence, linking, and preservation. DOI-based systems aren’t without their faults, mainly because of the responsibility to maintain metadata. Distributed systems solve this by not addressing this issue of persistent resolution to the referent. You may have a persistently-bound identifier to a piece of content, but not be able to locate it or anything about it. Also, distributed systems don’t necessarily provide the linking capacity that is the real value of the graph. It’s not that these couldn’t be overcome, but they give me pause.

Your advice about monitoring new and emergent PID systems is good, thank you.

This is certainly something that will rouse a lot of debate and should foster further conversation, which we welcome. How it gets formulated in further drafts is certainly something we all should consider.

erik-desci · March 21, 2024, 2:39pm

This all makes perfect sense and thanks for the response. Continued discussions around these points and more can be invaluable for the advancement of the scholarly record. Thanks for being receptive to the feedback.

Are the discussions around this PID strategy open to members of the public? I would love the ability to contribute if possible. Small changes in wording (like the centralized point mentioned above) can have large impacts.

Congratulations on ISCC by the way! Huge fan of the project.

Another one I forgot to mention in my above post is Software Heritage IDs. DAG-based identifiers have the potential to modernize the space.

castedo · March 21, 2024, 3:09pm

We can interpret “distributed and intrinsic identification approaches” to mean many things. Some of those approaches can be agnostic as to whether some people choose to rely on a registry. Some intrinsic identification approaches can advocate a choice on using registries rather than repudiating all registries. This is an intrinsic identifier approach I am taking.

I am developing DSI (Document Succession Identifier), which is an intrinsic identifier and a YAPID (Yet Another PID). I’ve intentionally made the specification agnostic as to what prefix is used with a DSI.

The prefix to a DSI can be a domain name like perm.pub or it could be a DOI prefix. In this approach a registry might or might not be relied upon, depending on the prefix, and depending on the DSI. The important point is that one is not required to use a registry.

erik-desci · March 21, 2024, 4:30pm

@TAC_NISO We have an active working group at the DeSci Foundation bringing together the developers of emerging PID systems. It’s an open group, you’re welcome to join. dPID Working Group

TAC_NISO · March 21, 2024, 4:50pm

The next formal stage of the development of this work is expected to take place as a NISO Working Group to develop the recommendations into a National Standard. If approved by the NISO membership and launched, the project working group will be somewhat open: non-members of NISO are welcome to participate, with a few caveats. Non-members can participate so long as the organization they represent (i.e., their employer, etc.) isn’t already engaged in another NISO project (i.e., non-NISO-members can appoint one person to one project at a time), and so long as similar interests and perspectives are not already adequately represented by member organizations (i.e., NISO members get preferential appointments over non-NISO-members). We also try to limit the size of groups, simply so that they are able to function effectively (i.e., really large groups of >40 people have trouble even agreeing a time to meet, for example). Every project, though, has public comment periods in which those not directly involved can raise issues and concerns, which the committee is required to consider and respond to.

If the project is approved and launched, we will of course issue a public call for participation. I’m hoping this will happen before summer.

And yes, I’ve had some conversations with Software Heritage about their IDs and trying to advance a standards project related to that. Software is noted in the report as a gap in the community.

Todd

danielskatz · March 21, 2024, 10:28pm

I’m a bit sad to see that the key members of various RDA (and other) groups that have looked at PIDs for software over many years don’t seem to have been involved in this process, leading to a recommendation for software PIDs that I think is both somewhat naive and impractical.

Examples of such work include the RDA Software Source Code Identification WG and its output, along with the work of the RDA/ReSA FAIR4RS and FORCE11 Software Citation groups.

In brief, we’ve learned that using DOIs for software (as this report recommends) is ok for some use cases, but is almost certainly not the best answer for a bunch of other use cases, where discipline-specific PIDs (eg RRIDs, ASCL IDs, …) or Software Heritage IDs have much better characteristics. This is partially because software is much more dynamic than things like papers, people, etc.

Because of these issues related to software PIDs, and perhaps for others, it would have been nice if this group had created and circulated a draft report for community feedback before publishing and publicizing this one, in my opinion.

TAC_NISO · March 21, 2024, 11:01pm

Dan,
Again, thank you for the feedback. It is greatly appreciated.

The reasoning here is much the same as that noted above for why IDs for projects and the RAiD system weren’t stressed: it isn’t that there aren’t models for how to do so. The sense was that these systems were too early in their development and deployment to be recommended as systems which can be implemented nationwide. Of course, that isn’t to say these aren’t important and don’t need ID systems, which is why we stressed them as areas of emergent need for further development and investments.

Again, this should NOT be viewed as the last word on what a final Strategy will be. Your comments are exactly the community feedback that we were looking for. As we move forward, certainly more discussion and input will be valued and incorporated into the next stage of this process.

danielskatz · March 21, 2024, 11:32pm

Todd,

I think my concern here is that Table 1 reads as “you should use DOIs as PIDs for research software”, and doesn’t have any of the nuance in your message above.

And the fact that this has been published as the “Developing a US National PID Strategy” report, not a draft document or similar, makes it seem like this is a consensus that should be followed, rather than the first step. Again, I don’t think the publish-first-and-then-get-feedback model makes sense in situations like this, where it would be, to me, much better to get feedback and work towards consensus before publishing.

erik-desci · March 22, 2024, 8:11am

@TAC_NISO what is the process by which emerging PIDs like dPIDs (and established PIDs like Software Heritage IDs) move from “too early” into formal recommendations? To @danielskatz’s point, I also read this document as “We should use DOIs for software” and was confused/concerned. Using DOIs for the identification of software would be inconsistent with multi-decade-long identification best practices coming from the software development industry. DAG-based identification is the foundation of the git primitive itself, having been used billions of times over.

Do you have thoughts on cold start problems coming from this work? Only solutions tested at scale are recommended and only recommended solutions are tested at scale, hindering innovation.

Do you have thoughts on the Lindy effect? By starting with DOIs for software, we entrench a poorly suited solution in culture, process, and tooling alike.

TAC_NISO · March 22, 2024, 10:55am

We were very careful not to call this report “A US National Strategy” or to presume that we could speak for anyone other than the participants. This is why the report was titled “Developing a US National PID Strategy”, not something more declarative. The group recognized that it couldn’t speak for the entire research community, nor for any Federal Agency, institution or funder, and was therefore reluctant to claim it could define a completed US Strategy. This report is just the first phase of a larger, longer process. That process, and the plans to move it beyond just the group that drafted it, are described on the first page of the report. This forum, the RDA-US group and a subsequent NISO working group are envisaged as the places where feedback will be gathered.

The next phase of the initiative will be to launch a broader consensus working group, organized by NISO under our ANSI procedures. As noted, this group will involve a public call for members. Anyone may respond to that call and participation will not be limited to NISO members, with the caveats described above. That group will operate under formal procedures for transparency and for gathering community input and feedback. Additionally, the group will be required to respond publicly to comments that are received during the process and members of the community will have the opportunity to object to what is included before publication.

To @erik-desci's point, how the community advances and adopts new identifiers and new systems as technology develops is a great example of productive feedback and should probably be incorporated into a subsequent policy document.

danielskatz · March 22, 2024, 1:05pm

Todd - Thanks for your response. I understand what you are saying, but I don’t think the document is written consistently. The part you mention and the title indeed suggest future work, but Table 1 is more imperative, and the later section about software PIDs (3.7.2) says the landscape is fluid (which is correct), but doesn’t really tie that to a path toward a future version of this report (or a national PID system). I suppose Table 1 and its definitiveness are really my main point of disagreement with this.

twdragon · March 22, 2024, 5:45pm

@TAC_NISO I read the document and I very much appreciate the attempt to harmonize with EOSC. I should also say that I support @erik-desci in his reply to paragraph 3.6.1:

I see where section 3.6.1 emphasizes the importance of “centralized” persistent identifier systems. Recent concerns have been raised about the scalability of centralized systems in the naturally distributed landscape of academia. Any particular reason for the emphasis on centralization given recent issues?

The remark I want to place is based on paragraph 3.1.2.2 which says:

Importantly, the scalability of PIDs ensures that, as data volumes increase,
the benefits persist without a corresponding rise in overhead costs.

Here scalability should be indicated as the key point together with the resolver-side and client-side reproducibility of the PID itself, I think. In the context of the prevalence of rotting links, the overhead costs of systems utilizing non-reproducible PIDs become increasingly unpredictable as the number of minted PIDs grows. My opinion here is that interoperability and reproducibility of the PIDs themselves should become a part of the whole FAIR concept for PIDs, at least in the open science context. This opinion is based on the outcomes of the ExPaNDS project, where we were working on openly shared intermediate datasets from the Photon and Neutron community, and where rotting links became a problem even for principal investigators.

The next remark I want to place here is about paragraph 3.4:

Therefore, it is important that individuals and organizations actively contribute to and support the core PID infrastructures that underpin this ecosystem.

In the context of this paragraph, I am for a clearer distinction between the roles of the individual User and the Adopter in Table 2, because the User may easily become an Adopter by nature in the context of a federated or decentralised PID infrastructure implementation, but the definition of the Adopter role is written with the assumption that the Adopter is likely acting as a society, group or organization.

In section 3.5 it is written:

Inadequate metadata support: Lacking the capacity to store comprehensive
metadata, impeding efforts to provide context and information about the
associated data. PIDs come equipped to integrate with metadata standards from
multiple institutions, allowing for the inclusion of crucial details that enhance
the understanding and utility of the research data.

I think there must be a recommendation to support standardisation and one-click adoption of predefined metadata schemas (via DTD, Common Grammar, JSON Field Table, etc.), because it is essential here not to tie a given general-purpose PID implementation rigidly to a single schema.

In section 3.6 I am completely for the replacement of the term “Centralized” with the term “Registry-based” because the current term gives the connotation of the simplest complete database-like centralisation whilst even the DOI and Handle systems are implemented over the federated approach, and the internal metadata persistence handling is not built over the simple replication.

In conclusion, I want to repeat my point about interoperability and reproducibility of the PID itself, in the context of section 3.8. The interoperability (at least, temporarily dropping the question about reproducibility) of the given PIDs, APIs and usage methodologies should be included there, because it would become a quantitative criterion of success. As an example of this, the presence and coverage of a trustless API could be proposed.

bandrow · March 22, 2024, 7:09pm

Todd,

As was stated by my colleagues, the document touted as National PID policy misses the primary use cases for RRIDs, where they are widely adopted (key biological resources critical for rigor and reproducibility, see NIH documents; they are adopted at >1000 journals, see the relatively recent review Bandrowski 2023). I don’t think that adoption at thousands of journals and some of the largest scientific resource providers (Thermo Fisher, BD, JAX, Addgene…) is something that is consistent with “lacking adoption”. It seems more consistent with a broad industry and academic adoption of a PID that positively impacts the reliability of data produced and rigor of academic manuscripts.

The use case in the document that mentions RRIDs, is a relatively recent extension of our work. We leverage the broad adoption of RRIDs to help core directors get authors to also cite cores, a function that assists them in filling out reports to funders. This was undertaken with the ABRF society’s Core Marketplace. In addition to registering all ABRF active cores, we have moved forward together with the ABRF community, and we are making headway into registering cores from outside of the US, and outside of the previous remit of the ABRF. Since we began this work we have registered over 1500 cores, and gathered citations to about 500 of them (examples here here). While I admit that this is not implemented evenly across all fields, we seem to be getting directors of shared facilities at various universities coming to register all of their cores, which, once complete, will give us a broad adoption.

The document states that RRID lack of adoption is “complicating the replication of experiments and the validation of findings”, but the RRID implementation at cores seems to be the only PID that is held to this standard.

Do all endorsed PIDs have full and broad adoption across all fields of study? Is it fair that some PIDs are written about as negatively impacting the validation of findings while others get a pass? The document’s lack of nuance and understanding of PIDs is troubling.

castedo · March 23, 2024, 3:47pm

@TAC_NISO, I can suggest that perhaps the term “Discussion Document” be used, or any similar term, like “Request For Comments”, that suggests feedback and discussion. And make it more prominent, for instance, in the abstract. At the moment, the “for input and feedback” clarification is buried in the third paragraph.

castedo · March 23, 2024, 4:59pm

One of these systems is SWHIDs. I find the claim that SWHIDs are “too early in their development and deployment” completely unconvincing.

@TAC_NISO If your working group has this sense, you have a problem with how you are forming a working group to make recommendations about PIDs for software. If your working group can not make credible recommendations why should the public heed these recommendations, consider it a relevant standard, or care what NISO publishes?

I say this as a contributor to a research software package, widely used in population genetics, that has a SWHID and not a DOI to the underlying software code. We use DOIs to identify papers talking about the software and for credit in academia. SWHIDs identify the software code. I don’t claim all researchers should take the approach we took, but it is very much the approach of the real research software code that I worked on. And this has been the case for more than two years.

maxence_azzouz · March 25, 2024, 7:00pm

Dear @TAC_NISO,

I am working at Software Search - zbMATH Open

Our service for mathematical software indexes software.
However, we consider our identifiers relevant for the metadata of software but not their source code. This is precisely why we have adopted SWHID for the software source code.

I do not understand your point on the persistence of the SWHID, which anyone can compute at any time on any computer. It is a package you can install on any Linux system. No one can compute a DOI and ensure the identifier is correct in 30 years.
This software comes with more than 120 000 commits:
GitHub - sagemath/sage: Main repository of SageMath. Now open for Issues and Pull Requests.
Any mathematician publishing a paper about a new release he authored can himself compute the SWHID of the source code he intends to publish. The SWHID can be recomputed by any other reviewers of the venue afterward.
Software Heritage (https://www.softwareheritage.org/) archives all the open-source code, and the URI of every piece of software is based on SWHID, ensuring the findability of the source code for the future.
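
To illustrate the point about recomputability, here is a minimal sketch in Python. It assumes the SWHID scheme for content objects, in which the identifier is `swh:1:cnt:` followed by the Git-style SHA-1 of the file's bytes. The function name `content_swhid` is hypothetical and this is not the official tooling; it only covers a single file, not directories, revisions, or releases.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a content SWHID for the given bytes (hypothetical helper).

    Assumption: a content SWHID uses the same hash as a Git blob, i.e.
    SHA-1 over the header "blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # Anyone holding the same bytes gets the same identifier,
    # with no registry lookup involved.
    print(content_swhid(b""))
```

Because the identifier depends only on the bytes, a reviewer who holds the same file can rerun the function and compare results. Locating the content from the identifier is a separate question that still depends on an archive.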

jak · March 27, 2024, 5:52pm

I agree that this effort should be retitled to make it clear that the recommendations of the group are only a beginning. The current wording (eg, “report”) connotes something final and summative and it feels like the landscape survey misses important PID schemes.

Besides RRIDs and SWHIDs, there are hundreds of schemes used daily by scientists and researchers – schemes supported and recognized by meta-resolvers such as identifiers.org, n2t.net, and bioregistry.io. In a sense, PIDs are just permalinks that come with a scheme name, and there are extremely heavily used permalinks that don’t get much airtime in PID discussions; for example, Wikidata and other mediawiki-based permalinks. The world of PIDs is much larger than those based on the DOI or sponsored by Crossref (DOI, ORCID, ROR).
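
As a rough illustration of how such meta-resolvers are typically used, here is a sketch that assumes a compact identifier of the form `scheme:local-id` is simply appended to the resolver's base URL. The helper `resolver_url` and the sample identifier are hypothetical, no network request is made, and each resolver's actual rules may differ.

```python
# Base URLs of the meta-resolvers named above.
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t.net": "https://n2t.net/",
    "bioregistry.io": "https://bioregistry.io/",
}


def resolver_url(resolver: str, compact_id: str) -> str:
    """Build a resolver URL for a 'scheme:local-id' identifier.

    Assumption: the resolver accepts the compact identifier appended
    directly to its base URL. This only builds a string; it does not
    check that the scheme is registered or that the identifier resolves.
    """
    scheme, sep, local_id = compact_id.partition(":")
    if not sep or not scheme or not local_id:
        raise ValueError("expected 'scheme:local-id', got %r" % compact_id)
    return RESOLVERS[resolver] + compact_id


if __name__ == "__main__":
    # Hypothetical ARK, for illustration only.
    print(resolver_url("n2t.net", "ark:/12345/x9example"))
```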

One very mature and heavily adopted PID scheme that deserves more attention in this NISO activity is the ARK (Archival Resource Key). Over the past 23 years, more than 1200 ARK (arks.org) organizations have created over 8.2 billion ARK identifiers. They include the Smithsonian, the Louvre, the Frick Collection, the Internet Archive, and UNESCO, as well as 10 national libraries, 145 universities, 184 archives, and 75 journals. Adoption of ARKs among smaller institutions and in the global South is accelerating because of their flexibility and low cost. Non-paywalled identifiers such as ARKs have important implications for (inter)national strategies promoting access and equity.