Creating resolvable PIDs without registering them in a registry, through CIDs and IPFS

keesvanbochove · October 8, 2020, 11:57am

Dear all,

For a while now I’ve been pondering the possibility of creating persistent, unique and resolvable identifiers, without having to register these identifiers anywhere in a specific registry. This whole idea came up when trying to find a way to register PIDs for European healthcare databases in the IMI EHDEN project in which I’m involved, and I realized how hard it is currently to do that in a future-proof way (hopefully EOSC can address this going forward).

But… is this even possible? Yes, it is theoretically possible, if we agree on a standardized and reproducible way of encoding data and use its hash as the identifier. More than that, the technology already exists today: the IPLD standard with CIDs as identifiers, backed by the growing IPFS (InterPlanetary File System), an ambitious project to decentralize the internet. If you have no idea what I’m talking about (that happens, often to myself), perhaps the clearest explanation is this small article with code examples: https://docs.ipld.io/tutorial.html#addressing.
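To make the "hash as identifier" idea concrete, here is a minimal sketch in Python that derives a CIDv1 from a byte string using only the standard library. It assumes the simplest possible case: a single block, the `raw` codec, SHA-256, and base32 text encoding. The function name `cid_v1_raw` is made up for this illustration. Note that `ipfs add` on a file normally wraps and chunks the content first, so it will generally report a different CID for the same bytes.

```python
import base64
import hashlib

# Codes taken from the public multiformats tables; each fits in one varint byte.
CID_VERSION_1 = 0x01
RAW_CODEC = 0x55       # "raw": the block is an opaque byte string
SHA2_256 = 0x12        # multihash code for SHA-256
DIGEST_LENGTH = 0x20   # 32-byte digest


def cid_v1_raw(data: bytes) -> str:
    """Return the CIDv1 (raw codec, sha2-256, base32) of a single block."""
    digest = hashlib.sha256(data).digest()
    binary = bytes([CID_VERSION_1, RAW_CODEC, SHA2_256, DIGEST_LENGTH]) + digest
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(binary).decode("ascii").lower().rstrip("=")


if __name__ == "__main__":
    # Anyone, anywhere, derives the same identifier from the same bytes.
    print(cid_v1_raw(b"hello world"))
```

No registry is consulted at any point: the identifier is a pure function of the content and of the agreed encoding choices.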

I just don’t know a good place to bring this topic up and experiment with it. The open science / FAIR digital objects community where PIDs live and the blockchain world of CIDs have almost zero intersection; it’s just that I happen to be interested in both. So I’m trying this forum, which is about PIDs: has anyone thought about this, or is anyone interested in experimenting with it?

Greetings,

Kees

Luc · October 8, 2020, 2:00pm

Hi Kees,

Interesting topic. Can you elaborate on the resolvability of CIDs? I see no mention of URIs and protocols, so I’m a bit confused. If there’s no implied protocol and host, I don’t understand how a CID is more than a hash. (For example, https://proto.school/anatomy-of-a-cid/01 mentions that QmcRD4wkPPi6dig81r5sLj9Zm1gDCL4zgpEj9CfuRrGbzF is a CID, while actually linking to https://ipfs.io/ipfs/QmcRD4wkPPi6dig81r5sLj9Zm1gDCL4zgpEj9CfuRrGbzF.)

https://docs.ipld.io/tutorial.html#addressing mentions that “We can now ask any random device on the internet do you have this CID?” I feel like you can already do that with any HTTP HEAD/GET request and ask random devices if they have a given path, then inspect the response code. I understand the benefit of knowing, with CIDs, whether the server lies, but I feel like I’m missing the bigger picture.

Also, how do you handle metadata without a registry? You cannot base the CID on data that is subject to change (hosting organizations, addresses, etc.). So how do you exchange metadata about a given resource, and how do you link different versions (hence different CIDs) of a given resource?

I also wanted to point out that CIDs’ dependency on cryptographic hashes reminded me of the SWHIDs used by Software Heritage, which use a URI scheme registered with IANA, in case you don’t already know that project.

keesvanbochove · October 8, 2020, 4:19pm

Hi Luc,

Thanks for your reply! Your questions help a lot in brainstorming this. I guess the other part of the puzzle is IPFS itself. The ipfs.io/ipfs bridge is just a way to bridge today’s centralized internet stack to IPFS. Think about it as a permanent peer-to-peer network where as long as you are connected with some node that has the data (which could be a researcher over in the next university that also downloaded the same dataset) you can always retrieve it.
The original video from Juan Benet probably says it best (https://www.youtube.com/watch?v=HUVmypx9HGI). The newer ones are very flashy, but you can also look at a hands-on overview like this: https://www.freecodecamp.org/news/ipfs-101-understand-by-doing-it-9f5622c4d4ed/.
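The practical difference from asking random HTTP hosts for a path is that the answer can be checked against the identifier itself. The sketch below illustrates that under simplifying assumptions: peers are modeled as plain Python dictionaries instead of real network nodes, only CIDv1 / raw / SHA-256 / base32 identifiers are handled, and the names `expected_digest` and `fetch_verified` are invented for the example.

```python
import base64
import hashlib
from typing import Iterable, Mapping, Optional

# CIDv1, raw codec, sha2-256, 32-byte digest
HEADER = bytes([0x01, 0x55, 0x12, 0x20])


def expected_digest(cid: str) -> bytes:
    """Extract the SHA-256 digest embedded in a base32 CIDv1 (raw codec)."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 (multibase 'b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that multibase strips
    binary = base64.b32decode(body)
    if len(binary) != 36 or binary[:4] != HEADER:
        raise ValueError("this sketch only handles CIDv1 / raw / sha2-256")
    return binary[4:]


def fetch_verified(cid: str, peers: Iterable[Mapping[str, bytes]]) -> Optional[bytes]:
    """Return the first answer whose hash matches the CID, whoever served it."""
    want = expected_digest(cid)
    for peer in peers:
        data = peer.get(cid)
        if data is not None and hashlib.sha256(data).digest() == want:
            return data
    return None


# CID of the 11 bytes b"hello world" (raw codec, sha2-256).
HELLO_CID = "bafkreifzjut3te2nhyekklss27nh3k72ysco7y32koao5eei66wof36n5e"

lying_host = {HELLO_CID: b"something else entirely"}
honest_peer = {HELLO_CID: b"hello world"}  # e.g. a colleague who kept a copy
```

With these toy peers, `fetch_verified(HELLO_CID, [lying_host, honest_peer])` skips the tampered answer and returns the bytes from the honest peer, without needing to trust either of them in advance.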

Of course, the big question is whether IPFS will actually get enough traction: its user community is growing but still small compared to the current web. On the other hand, you really only need a handful of data producers and users, the network auto-scales… and as a bonus it really would work nicely on Mars too.

Your other point about metadata handling is a good one. I think it’s not difficult to add metadata because IPLD supports links between blocks. You can see that in how qri.io uses IPFS (e.g. https://ipfs.io/ipfs/QmTKhugTGYXe9ozosHSdeSDCfjr3fi5DDCzmeqM9czfExE). Like data standards, metadata standards will evolve over time, but if you used something like JSON-LD, you would have a decentralized linked web of data. If someone has already defined an entity that you want to reuse, you can just copy the JSON-LD, and as a bonus you would be storing an additional copy of that block.
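As a rough illustration of linking metadata to data, the sketch below stores a data block and a separate metadata block that points to it by CID, using the `{"/": "<cid>"}` link notation from DAG-JSON. Several things are assumptions made for brevity: the in-memory `store` dictionary stands in for a real block store, the metadata fields are invented, and the metadata bytes are hashed as a `raw` block with sorted-key JSON, where a real IPLD implementation would use a proper codec such as DAG-JSON or DAG-CBOR.

```python
import base64
import hashlib
import json


def block_cid(data: bytes) -> str:
    """CIDv1 / raw / sha2-256 / base32 of one block (simplified, see above)."""
    binary = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(data).digest()
    return "b" + base64.b32encode(binary).decode("ascii").lower().rstrip("=")


def encode(node: dict) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace."""
    return json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")


def put(store: dict, data: bytes) -> str:
    """Store a block under its own CID and return that CID."""
    cid = block_cid(data)
    store[cid] = data
    return cid


def resolve_data(store: dict, metadata_cid: str) -> bytes:
    """Follow the link in a metadata block to the data block it describes."""
    node = json.loads(store[metadata_cid])
    return store[node["data"]["/"]]


store = {}
dataset_cid = put(store, b"patient_id,age\n1,42\n")
metadata = {
    "name": "Example dataset",   # hypothetical metadata fields
    "license": "CC-BY-4.0",
    "data": {"/": dataset_cid},  # DAG-JSON style link to the data block
}
metadata_cid = put(store, encode(metadata))
```

Because the encoding is deterministic, two people who write down the same metadata independently arrive at the same `metadata_cid`, which is what makes reuse by copying work. Changing any field, however, yields a new CID.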

However, you wouldn’t be able to version the metadata, because this whole idea hinges on the immutability of the content. As soon as you introduce versioning, you will have to have some sort of registry. This could still be a decentralized one, such as IPNS (https://docs.ipfs.io/concepts/ipns/, see https://dweb-primer.ipfs.io/publishing-changes) or the ENS (https://ens.domains/), but that’s not the idea I have in the title.

Thanks for the link to SWHIDs, I see a number of parallels there with IPFS and IPLD indeed!

Greetings,

Kees

rgiessmann · October 13, 2020, 7:50am

Hi @keesvanbochove, exciting to read that you looked into resolvable PIDs with IPFS and Co.!

I wondered about that issue, too, and found the following solutions, depending on what level you need:

w3id.org: it is managed via GitHub pull requests, so there is a delay between asking to register something and the identifier becoming available. But it works out of the box.
Trying to sneak into public infrastructure that mints handles (handle.net): here it took me roughly 3 months to get answers via email, and I needed to try several public handle services, but ultimately the most responsive (and still free!) was https://epic.grnet.gr/. The pidconsortium.net is still in the making (so it will take around 4 months).

In your post you also referred to “community uptake / traction” – although I love the idea of IPFS, I believe handles (and thus http) are the way to go right now (and even handles are already quite advanced, I often feel). But again, thanks for sharing, I would say this is a good place indeed.

See you around!

keesvanbochove · April 21, 2021, 11:43am

OK, so an update to this topic: it turns out that https://ceramic.network/ is pioneering an interesting solution to the metadata problem raised earlier. The workflow is that a W3C DID is used to establish a Genesis record which contains some basic metadata (including the CID of the first version of the document). This record is then used as a permanent identifier for the document and can optionally also be anchored to a blockchain. Subsequent updates reference this identifier and can add new versions of the document.
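The general shape of that workflow can be sketched as a hash-linked log. The code below is a toy model only, not Ceramic's API or record format: the record fields, the function names, and the `did:example:alice` controller are all invented for illustration, and signatures and blockchain anchoring are left out entirely.

```python
import base64
import hashlib
import json


def put(store: dict, record: dict) -> str:
    """Serialize a record deterministically and store it under its CID."""
    data = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    binary = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(data).digest()
    cid = "b" + base64.b32encode(binary).decode("ascii").lower().rstrip("=")
    store[cid] = data
    return cid


def create_stream(store: dict, controller_did: str, content_cid: str) -> str:
    """Write the genesis record; its CID serves as the permanent identifier."""
    return put(store, {"controller": controller_did, "content": content_cid, "prev": None})


def append_update(store: dict, stream_id: str, prev_cid: str, content_cid: str) -> str:
    """Write an update that references the stream and the previous record."""
    return put(store, {"stream": stream_id, "prev": prev_cid, "content": content_cid})


def history(store: dict, tip_cid: str):
    """Walk back from the newest record to genesis; return (stream_id, versions)."""
    versions = []
    cursor = tip_cid
    while True:
        record = json.loads(store[cursor])
        versions.append(record["content"])
        if record["prev"] is None:
            return cursor, list(reversed(versions))
        cursor = record["prev"]


store = {}
doc_v1 = put(store, {"title": "Draft"})
doc_v2 = put(store, {"title": "Final"})
stream_id = create_stream(store, "did:example:alice", doc_v1)
tip = append_update(store, stream_id, stream_id, doc_v2)
```

Here `history(store, tip)` returns the unchanged `stream_id` together with both document versions in order. What the toy model does not answer is how a reader learns the current `tip`, which is exactly where some form of registry, anchoring, or name system comes back in.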

I also just discovered the ISCC Foundation, which investigates this topic in more depth (see ISCC Concepts).

RayPlante · April 21, 2021, 11:47am

I know this is a bit of an old thread, but it’s an interesting question. I faced this question myself, and I ended up settling on ARK IDs. (Remember them?)

I was building a repository and wanted to choose a form for our internal identifiers. In this application, many things would have a DOI, but not all. Some items would be publicly accessible, but not all. I wanted an identifier that has most of the features of a PID (global uniqueness, resolution, etc.) but without the dependency on an external resolver.

ARK identifiers have all of these characteristics (ARK). ARK has the concept of a global resolver, but it also supports a model where the resolver is explicitly provided. Another advantage for me was that my organization already had an assigned NAAN (Name Assigning Authority Number, which establishes our globally unique namespace).

So essentially what I did was to mint ARK IDs that can only be resolved via our own resolver. The resolver+ARK-ID URL becomes a “long-term URL” assigned to an object. When a DOI is assigned, it can resolve to this long-term URL.
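In code, composing such a long-term URL is little more than string assembly. The sketch below assumes the classic `ark:/NAAN/name` form; the resolver address, the NAAN `12345`, and the object name are placeholders, not the values used in the repository described above.

```python
def ark_id(naan: str, name: str) -> str:
    """Build an ARK identifier from a NAAN and a locally minted name."""
    return f"ark:/{naan}/{name}"


def long_term_url(resolver_base: str, ark: str) -> str:
    """Prefix the ARK with an explicitly chosen resolver."""
    return f"{resolver_base.rstrip('/')}/{ark}"


# Placeholder values for illustration only.
LOCAL_RESOLVER = "https://repo.example.org/id/"
ark = ark_id("12345", "ds-0001")
url = long_term_url(LOCAL_RESOLVER, ark)
```

The ARK string stays the same if the resolver ever moves; only the prefix passed to `long_term_url` changes.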

hope this helps,
Ray

twdragon · August 18, 2023, 3:46pm

I have been working on this topic since the beginning of 2021 for the ExPaNDS project. I see the following main peculiarities that both types of PIDs used in the IPFS/IPLD ecosystem have, and that known partially centralized solutions cannot match:

CIDs are resolvable from anywhere, as long as an IPFS endpoint is available, and in any format accepted by the implementation, with free conversion between the formats within the spec.
A CID replaces cryptographic data signing: the IPFS block objects linked to a given CID carry their own metadata, including the tree of links, so any CID of even partially stored data is verifiable and reproducible on-site.
The IPLD/IPNS ecosystem separates the concepts of data identifiers and endpoint identifiers (PeerIDs), deriving endpoint identifiers from user private keys. PeerIDs are thus self-signed identifiers that can be minted only by the private key’s owner.
IPNS entry keys are resolvable to CIDs.
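One example of the free conversion between formats mentioned in the first point is rewriting a CIDv0 (the `Qm…` form quoted earlier in this thread) as a CIDv1. The sketch below does this with the Python standard library plus a small hand-written base58 codec. It assumes SHA-256 multihashes and base32 output only, and the function names are made up for this illustration.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
DAG_PB = 0x70  # CIDv0 always implies the dag-pb codec


def b58decode(text: str) -> bytes:
    """Decode base58btc text; each leading '1' stands for one zero byte."""
    number = 0
    for char in text:
        number = number * 58 + B58_ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * (len(text) - len(text.lstrip("1"))) + body


def b58encode(data: bytes) -> str:
    """Encode bytes as base58btc text."""
    number = int.from_bytes(data, "big")
    chars = []
    while number:
        number, remainder = divmod(number, 58)
        chars.append(B58_ALPHABET[remainder])
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + "".join(reversed(chars))


def v0_to_v1(cid_v0: str) -> str:
    """CIDv0 is a bare multihash; CIDv1 adds version and codec prefixes."""
    multihash = b58decode(cid_v0)
    if len(multihash) != 34 or multihash[:2] != b"\x12\x20":
        raise ValueError("expected a base58btc sha2-256 multihash")
    binary = bytes([0x01, DAG_PB]) + multihash
    return "b" + base64.b32encode(binary).decode("ascii").lower().rstrip("=")


def v1_to_v0(cid_v1: str) -> str:
    """Only possible for CIDv1 values that use dag-pb and sha2-256."""
    if not cid_v1.startswith("b"):
        raise ValueError("this sketch only handles base32 CIDv1 strings")
    body = cid_v1[1:].upper()
    body += "=" * (-len(body) % 8)  # restore stripped base32 padding
    binary = base64.b32decode(body)
    if binary[:4] != bytes([0x01, DAG_PB, 0x12, 0x20]):
        raise ValueError("not convertible to CIDv0")
    return b58encode(binary[2:])


if __name__ == "__main__":
    v0 = "QmcRD4wkPPi6dig81r5sLj9Zm1gDCL4zgpEj9CfuRrGbzF"
    print(v0_to_v1(v0))
```

Both strings carry the same hash of the same content. Only the self-describing prefix and the text encoding differ, which is why an implementation can accept either form.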

From my point of view, the enumerated peculiarities make it easy to create decentralized PID infrastructures, even without a blockchain, relying mostly on PeerIDs and easily covering the tasks described, for example, here. The only task that should be left to an authority is trusted verification that a given PeerID is owned by a known publisher. This task does not require a blockchain, as it can be solved with an infrastructure similar to the root CAs used for TLS verification on the web. Holding and enforcing this authority is a procedural question.

In this presentation some additional concepts and proposals are described. There is also a Zenodo community including, among other stuff, the information about decentralization of the PIDs elaborated in the ExPaNDS project.

In my opinion, the described peculiarities of IPFS could be especially useful for open-source software and dataset provenance. In a few days, I will publish a topic here with a link to helper tools that support direct IPNS publishing of Git repositories under an IPNS entry key with immutable intermediate stages (there is an existing one, but it does not support IPNS directly).