A global Natural Sciences Identifier (NSId) scheme for specimens and collections

Digital technologies are transforming the ways in which natural science collections are managed and used. Digitization initiatives around the world inevitably create digital representations of the physical specimens in collections. These representations are becoming increasingly semantically meaningful (and thus machine-actionable), and increasingly act as the mutable place for curating all data (first- and third-party) derived from and relating to the physical specimen. The notion of universal and stable persistent identifiers (PIDs) for these ‘digital specimens’ is central to museums’ ambitions for widening access, and to the proposed notions of Extended Specimens (Webster et al., 2017; Lendemer et al., 2019) and Next Generation Collections (Schindel and Cook, 2018). PIDs act as a digital doorway that allows us to do more than just find specimens. A wide variety of novel first- and third-party services become possible, for example: harmonizing the arrangement of loans and visits, finding specimens related to one another (think: ‘frequently bought together’, ‘customers who viewed this also viewed these’), linking to third-party information, and supporting Access and Benefit Sharing. Such services in the natural sciences can be compared to those enabled by Digital Object Identifiers (DOIs) and offered by Crossref, such as Cited-by, the Metadata API, and Event Data; or, in the case of Entertainment Identifier Registry (EIDR) identifiers, to the more than 25 applications that identify and track content production and distribution in the film and TV entertainment supply chain, from studio to theatre, TV or mobile device.

To avoid fragmentation along national and/or regional lines, the global natural science collections community urgently needs action towards a common global scheme for persistent, unambiguous and actionable identification of digital specimens and collections. We propose a ‘Natural Sciences Identifier’ (NSId) scheme based on the Handle system, in a joint international governance arrangement under the Alliance for Biodiversity Knowledge (https://www.allianceforbio.org/).

Such a mechanism must persistently identify digital objects on the timescales typical of natural science collections, i.e., from decades to centuries. To achieve that, the mechanism must be independent of, and resistant to changes in, specific implementation technologies. It is instructive to compare it with International Standard Book Numbers (ISBN), which have been in use since the mid-1960s to identify each edition and variation of published books. In such a form, the NSId represents everything a digital/extended specimen stands for, rendering each one unambiguously findable, accessible, and reusable for future science, commercial, policy and societal purposes, and establishing a trusted ‘brand’ over the very long term.

Reliable identifiers derive from robust services supporting persistence (minting, resolution) and machine actionability (semantics) under formal governance arrangements, which Alliance for Biodiversity Knowledge stakeholders are well placed to provide. Reliable identifiers enhance the value of collections and specimens. Identifiers that uniquely and meaningfully identify specimens and collections enhance the quality and accuracy of work. They confer authority, raising overall trust throughout the value chain (figure 1) founded in the worldwide collections of physical specimens and the digital assets arising out of digitization initiatives.

Figure 1: Value chain founded in natural science collections

At every point in the chain, reliable identifiers make it possible to unambiguously identify, refer to, use, trace and track natural science objects whose digital representations, derived third-party data and associated transactions are stored and manipulated in computer and information systems. Multiple value-adding service opportunities arise at all points along the chain. These can respond well to the emergence of a new Natural Sciences Identifier (NSId) scheme that is unconfounded by existing schemes, whose objects have quite different characteristics.

The first step towards these ambitions is a jointly governed, global, Handle-based system, layered over existing institutional identification practices, that allows a global PID to be created at the earliest moment. Additionally, it would help to overcome limitations linked to the functionality of URIs (cf. RFC 3650). Our requirements and use cases, however, are different to those of the DOI-based and other identifier schemes. Alongside guaranteeing the association (link) between the physical specimen and its digital representation, these include (figure 2): persistence for the very long term, governance by stakeholders themselves, a brand that inspires trust and authority, and scalability (circa 30 billion identifiers) with pertinent tailored services.

Figure 2: Governance, trust, scalability and persistence enabled by the scheme reinforce each other

The non-profit DONA Foundation (www.dona.net), based in Geneva, Switzerland, is the neutral forum for global governance of the Handle system: the rapid-resolution, globally distributed system, run by multiple groups, that the public can use for resolving identifiers (Handles). DOIs make up a significant proportion of these Handles, and it is under this system that we propose Natural Science Identifiers. Such a scheme would, we believe, have the backing of DONA.
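As a rough illustration of what resolving a Handle involves, the sketch below splits an identifier into its prefix (the naming authority) and its suffix (the local name), and builds a URL for the public Handle proxy at hdl.handle.net. The example identifier is the DOI of Lendemer et al. (2019) from the reference list, since DOIs are Handles. The function and class names are invented for this sketch, and no NSId prefix exists yet, so none is shown.

```python
from typing import NamedTuple
from urllib.parse import quote

# Public Handle proxy; assumed here to be the entry point for resolution.
HANDLE_PROXY = "https://hdl.handle.net/"


class Handle(NamedTuple):
    prefix: str  # naming authority, e.g. "10.1093"
    suffix: str  # local name chosen by the issuing organisation


def parse_handle(identifier: str) -> Handle:
    """Split 'prefix/suffix' at the first slash (hypothetical helper)."""
    prefix, sep, suffix = identifier.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a Handle of the form prefix/suffix: {identifier!r}")
    return Handle(prefix, suffix)


def resolver_url(identifier: str) -> str:
    """Build a proxy URL for the Handle, percent-encoding unsafe characters."""
    handle = parse_handle(identifier)
    return HANDLE_PROXY + quote(f"{handle.prefix}/{handle.suffix}", safe="/")


if __name__ == "__main__":
    print(parse_handle("10.1093/biosci/biz140"))
    print(resolver_url("10.1093/biosci/biz140"))
```

Only the first slash separates prefix from suffix, so a suffix may itself contain slashes, as in the DOI above.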

The time is right to establish a worldwide joint declaration of intent, and to proceed with plans to initiate a global persistent identification scheme for natural science digital object types such as digital specimens and collections.

Alex Hardisty, Dimitris Koureas, Wouter Addink
Distributed System of Scientific Collections (DiSSCo)
16th December 2019.

The Distributed System of Scientific Collections (DiSSCo) is a new world-class Research Infrastructure (RI) for natural science collections. The DiSSCo RI works for the digital unification of all European natural science assets under common curation and access policies and practices. These aim to make the data easily Findable, more Accessible, Interoperable and Reusable (FAIR).

References and further reading:

Lannom, L., Koureas, D. and Hardisty, A.R. FAIR data and services in biodiversity science and geoscience. doi: 10.1162/dint_a_00034.

Lendemer et al., 2019. doi: 10.1093/biosci/biz140.

Schindel and Cook, 2018. doi: 10.1371/journal.pbio.2006125.

Webster et al., 2017. ISBN: 978-1-4987-2915-4.

Have you considered the IGSN (http://www.igsn.org/) for this purpose? This group has expanded beyond geo samples to include natural science identifiers.

IGSN has been, and is still being, considered, and DiSSCo.eu is in dialogue with IGSN stakeholders. However, there are some constraints today that prevent IGSNs in their present form from being suitable for identifying museum specimens: namely, the requirement that the physical sample and its digital representation share one and the same identifier with a nine-character suffix, and DiSSCo's desire for a recognisably distinct and branded 2-digit top-level prefix that clearly indicates that what is being identified is a natural sciences specimen.

An IGSN can be of any length; it is not limited to nine characters (see https://igsn.github.io/syntax/). However, since human operators are involved in many IGSN applications, the advice given by the IGSN Implementation Organization is to keep identifiers short so that they fit easily onto labels, into tables, etc. Nobody wants to copy a 32-digit UUID by hand, and there are other potential sources of transcription errors.
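To put rough numbers on the length argument, the sketch below estimates how short a suffix could be while still covering the circa 30 billion identifiers mentioned in the proposal, and compares that with the 32 hexadecimal digits of a UUID. The 36-symbol alphabet (digits plus uppercase letters) is an assumption made purely for illustration and is not taken from the IGSN syntax; the function name is likewise invented.

```python
import uuid

ALPHABET_SIZE = 36              # assumed: 0-9 plus A-Z, for illustration only
TARGET_IDENTIFIERS = 30_000_000_000  # "circa 30 billion" from the proposal


def min_suffix_length(n_identifiers: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Smallest length L such that alphabet_size ** L >= n_identifiers."""
    length = 1
    while alphabet_size ** length < n_identifiers:
        length += 1
    return length


# A UUID written as plain hexadecimal, without hyphens, has 32 characters.
uuid_hex_length = len(uuid.uuid4().hex)

if __name__ == "__main__":
    print("minimum suffix length:", min_suffix_length(TARGET_IDENTIFIERS))
    print("capacity of 9 characters:", ALPHABET_SIZE ** 9)
    print("UUID hex length:", uuid_hex_length)
```

Under this assumption, seven characters would already cover 30 billion identifiers, and nine characters allow roughly 10^14, so short suffixes and large scale are not in conflict.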

In the current applications of IGSN, the identifier points to a landing page (plus other digital resources) that represents the physical object on the web, and it can point to other related resources identified by PIDs or URIs. Does DiSSCo assume that an object can have more than one digital representation? Or does its digital representation change over time? What is the reason for distinguishing between the identity of the physical object and the identity of its digital representation?

By branded 2-digit top-level prefix I assume that you mean a top-level prefix in the Handle.net namespace. Is that correct? IGSN has a concept of top-level namespaces within its own Handle.net namespace to allow institutional branding and to accommodate already existing identifier systems that are locally unique but not yet globally unique (see https://igsn.github.io/namespaces/).

Based on the reply from @0000-0001-5911-6022, is this still a barrier please, @Hardisty? The nine-character suffix is seemingly not a problem, but can you elaborate on any concerns relating to physical and digital representation?