Dear PID developers!
This topic was created to encourage decentralised PID developers to propose, discuss, and develop their projects within the growing community. Please feel free to link related projects and topics, and to discuss, propose, and promote your own experiments!
Here is the Google document containing the initial draft proposal for a decentralised PID infrastructure that complies with the FAIR principles and allows for backward compatibility and cross-system lookup. Please feel free to comment, discuss, and propose improvements. In the near future the document will be converted to Markdown and uploaded to GitHub.
Hi Andrei,
Thanks for making this thread. This is Erik from DeSci Labs (we help the DeSci Foundation develop the dPID protocol). We’ve been developing dPIDs for 2-3 years now. It’s great to see discussion popping up around the topic.
Here are a few resources we’ve found through our explorations with dPID: Github, Website
Content Addressable Storage Networks
- IPFS as the base network for content addressable storage
- Bacalhau as a compute orchestration layer on the decentralized web
- Ceramic for data streaming
Blockchain related
- Squaring Zooko’s Triangle: As luck would have it, Aaron Swartz was tangentially interested in decentralized PIDs. I came across this old blog post of his in which he talks about how a blockchain can serve as a solution for squaring Zooko’s triangle.
- Hyperstructures is one of the best high-level explanations I’ve seen of some of the autonomous protocols coming out of the decentralized web right now.
And a few other resources
- A presentation on content addressable PIDs that should hopefully provide more context on the dPID system for anyone who isn’t a developer.
- An example research object built on scalable PID infrastructure. The hardest part of making PIDs is explaining to people why they should care about them. This research object is granularly citable, carries five versions in its history, uses annotations as components to facilitate reproducibility, and is completely portable across platforms. I use it every day to explain why PIDs matter!
Thank you, Andrey, for opening this topic. It is worth mentioning (for other readers) that you organised quite a few meetups at International Data Week to find others interested in this topic, which was really great. (And sorry for my late response here.)
I think decentralized PIDs open a lot of interesting questions and opportunities. As for the questions, the most obvious to me is how we define persistence. I’d love to have discussions about whether and how this can improve what persistence means to us. Most PID systems define persistence through a governance/trust/licensing model, achieving it with considerable effort put into infrastructure maintenance, redundancy, and federation. For distributed systems, this will be different. People will need to understand where PIDs will “live” physically (think of pinning in IPFS) in order to understand what persistence means in these concepts. And this new approach to persistence has to be communicated properly (for each concept/implementation), otherwise no one will understand it.

As for the opportunities: I am not a PID developer but a research software developer. Some of our software makes heavy use of PIDs (Handles, in our case) in the context of FAIR Digital Objects, where PIDs are ubiquitous. I am therefore mainly interested in how automation, access, and different usage modalities could benefit.
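Since pinning comes up here: for readers who have not worked with IPFS, below is a minimal sketch of what “keeping a payload alive” looks like in practice. It assumes a local Kubo (go-ipfs) node exposing its standard RPC API on the default port 5001; the CID is a placeholder, and the `requests` calls simply wrap Kubo’s documented `/api/v0/pin/*` endpoints.

```python
# Minimal sketch: in IPFS, "persistence" means at least one node has
# pinned the content. Assumes a local Kubo node exposing its RPC API
# on the default port 5001; the CID below is a placeholder.
import requests

KUBO_RPC = "http://127.0.0.1:5001/api/v0"
cid = "bafy-placeholder-cid"  # CID of the payload we want to keep alive

# Kubo's RPC API takes POST requests with the target CID as a query arg.
resp = requests.post(f"{KUBO_RPC}/pin/add", params={"arg": cid}, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"Pins": ["bafy-placeholder-cid"]}

# Listing pins shows what this node has promised to retain. The
# persistence of a payload is the union of such promises across nodes:
# if every node unpins it and garbage-collects, the payload is gone.
pins = requests.post(f"{KUBO_RPC}/pin/ls", timeout=60).json()
print(list(pins["Keys"]))
```

This is exactly why the question of where PIDs and payloads physically live has to be part of any definition of persistence for these systems.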
Dear all,
Here you can find the recording of the webinar we held together with @erik-desci and DeSci Labs. If you have any ideas, points to note, or opinions to share, please feel free to write a post here or open a linked topic.
Thank you for the specification and the opportunity to comment. I believe these are exciting technical developments in PID infrastructure that warrant widespread discussion and testing. Overall, the decentralised approach offers several benefits that deserve wider adoption, but a few nuances need to be clarified. We also need to engage our user community and funding organisations to better understand the problems we are trying to solve. PID resolution, zero-trust infrastructure, and cryptographic verification are related, but each can be implemented in different ways depending on the use case.
I have added a few comments to the specification document. Here, I would like to summarise a few high-level points that can facilitate productive discussion, rather than framing the discussion as centralised versus decentralised. Most decentralised technologies still rely on centralised governance models.
In essence, we need to think about types of decentralisation and hybrid implementations, taking into account the context of other components in the infrastructure. PID is just one aspect.
Centralised service providers often adopt a distributed and redundant approach within their own infrastructure, which is also a crucial aspect of decentralised technology, so it is important to keep this in mind.
A few things stand out to me:
- Keeping data as close as possible to the compute and analytics platform, while also allowing for local domain ownership. This can change where and how the PID infrastructure interacts with the data and compute part.
- Can we think of a hybrid solution that takes advantage of centralised governance and data aggregation as well as the independence of a decentralised model?
- Leveraging the existing trust models of centralised organisations. Trust needs to accommodate complex social and local contexts; cryptographic verification is just one aspect.
Dear @sharif.islam, thank you for the comments and for the discussion points you wrote on the Google doc (I will consider them a bit later). Let me share some opinions on the points you highlighted here, because these questions are in the air: only recently we discussed very similar topics with #DeSci Labs and @erik-desci, so interested people will recognise these points easily.
- Keeping data as close as possible to the compute and analytics platform, while also allowing for local domain ownership. This can change where and how the PID infrastructure interacts with the data and compute part.
I think that if every research institution could run the same open-source software to operate its own resolver node with an integrated domain provenance layer (or at least a trusted digital signature authority), this could accelerate edge computing drastically. So it is a good point for improving the spec, and for writing a whitepaper aimed at external audiences too.
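To make the “trusted digital signature authority” idea a bit more concrete, here is a hypothetical sketch in Python (using Ed25519 from the `cryptography` package) of an institutional node signing the PID-to-CID bindings it vouches for. The record fields, identifiers, and CID are all invented for illustration; nothing here comes from a published dPID spec.

```python
# Hypothetical sketch of an institutional provenance layer: a resolver
# node signs each PID -> CID binding it publishes, so anyone holding
# the institution's public key can verify the binding's origin offline.
# Requires the 'cryptography' package; all field names are illustrative.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# In practice this key would live in the institution's key service/HSM.
institution_key = Ed25519PrivateKey.generate()
public_key = institution_key.public_key()

def sign_binding(pid: str, cid: str) -> dict:
    """Produce a signed provenance record binding a PID to a content CID."""
    record = {"pid": pid, "cid": cid, "issuer": "example-university"}
    payload = json.dumps(record, sort_keys=True).encode()
    return {"record": record, "signature": institution_key.sign(payload).hex()}

def verify_binding(signed: dict) -> bool:
    """Check that the binding was really issued by the key holder."""
    payload = json.dumps(signed["record"], sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(signed["signature"]), payload)
        return True
    except InvalidSignature:
        return False

signed = sign_binding("dpid://example/42", "bafy-example-cid")
assert verify_binding(signed)
```

The point of such a design is that verification needs only the institution’s public key, so provenance can be checked offline and independently of any central registry.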
- Can we think of a hybrid solution that takes advantage of centralised governance and data aggregation as well as the independence of a decentralised model?
I guess this is what almost everyone involved already has in mind: as I wrote above, a trusted centralised signature authority could serve as an integrated provenance layer, but it should sit at a higher level than the protocols, software, and minting policies. So we can say: yes, and it will be done.
- Leveraging the existing trust models of centralised organisations. Trust needs to accommodate complex social and local contexts; cryptographic verification is just one aspect.
The answer is already given above: yes, and we need to facilitate this as one of the main convenience points, also to attract people from the traditional PID world to collaborate.
Hi all,
For starters, the working group is going strong! This level of enthusiasm is exciting to see! Check out the draft prospectus if you’re curious about the group: [Decentralised PID Working Group - Prospectus - Google Docs]. A link to the common notes is included.
@APfeil I’m reiterating the answer from the thread on this forum so that others who stumble across it can join the conversation. This is documented in the dPID Working Group Common Notes.
Erik S brought a framework for segmenting persistence to my attention after one of his presentations at the National Academies. It has been useful in structuring and clarifying my thoughts around the topic. I can’t find the source; if anyone knows of it, please let me know. The framework splits persistence into six distinct points:
- Persistence of the payload as a thing
- Persistence of the mechanism to handle the payload’s non-persistence
- Persistence of the identifier as a thing
- Persistence of the binding between the identifier and the payload
- Persistence of the service to resolve from the identifier to the payload
- Persistence of the service to allow for updating of the binding between identifier and payload
See below for my take on persistence in the dPID system and the places where social contracts may need to be taken into account by designated resources (preferably federated or decentralized). Comments and thoughts are definitely appreciated.
Persistence points 2, 3, and 4 are architected to be fully decentralized. Point 1 (the payload, referring mainly to metadata) could benefit from dedicated centralized or federated resources, and points 5 and 6 could benefit from federated or decentralized resources. Federated/decentralized resolver infrastructure and an insurance policy on Manifest/DAG storage are important; this could be built into PID minting stacks relatively easily and hosted locally by institutions, and since Manifest files are small, the task should have a relatively light footprint. Given the rise of decentralized compute orchestration layers and decentralized content delivery networks, this may put an interesting spin on the role of a federated PID infrastructure. (A toy sketch after the numbered list below illustrates points 2, 4, and 6.)
1. Persistence of the payload as a thing (as far as possible): (Meta)data includes both data and metadata. Ensuring the persistence of all data is an impossible task, but we should work to make metadata permanent. Designated IPFS nodes could be useful for Manifest/DAG storage. DeSci Labs is already committed to this functionality.
2. Persistence of the mechanism to handle the payload’s non-persistence: Content drift is eliminated in dPID, and link rot is mitigated in pretty much all cases except file deletion. From a technical standpoint, this is already handled by content-addressable storage networks. I would appreciate thoughts on this, but I don’t really see how point 2 would benefit from a central authority: once data is completely off a P2P network, there’s not much a central authority can do.
3. Persistence of the identifier as a thing: The PID is stored on a DLT, which is as close as we can come to permanence on the internet. A central authority is likely not helpful here; helping to secure the chain is always appreciated, but blockchains can’t differentiate or discriminate.
4. Persistence of the binding (link) between the identifier and the payload: The CID is an abstracted hash of the underlying content, so no central authority would be helpful there. IPFS uses a distributed hash table to link PIDs to CIDs. It’s always helpful to have more resources managing this hash table, but in that capacity they would effectively be contributors to a decentralized network, not centralized authorities. IPFS can’t differentiate or discriminate.
5. Persistence of the service to resolve from the identifier to the payload: An open-source, deployable application provides resolution of identifiers to their payloads over HTTP. I could see the utility of dedicated providers; however, a small group of central authorities seems restrictive given that IPFS is an open P2P network. Best to federate/decentralize as much as possible on this one through a dPID stack.
6. Persistence of the service to update/CRUD over the binding between identifier and payload: Handled by the protocol, and likely better handled on an org-by-org level. As long as the manifest file is stored, this should be fine. Different communities will want to handle updating differently, and that’s OK. Best to federate/decentralize as much as possible on this one through a dPID stack.
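To make points 2, 4, and 6 a bit more tangible, here is a self-contained toy model in Python. This is emphatically not the dPID codebase, and plain SHA-256 stands in for real multihash/CIDv1 identifiers; it only illustrates why content-derived identifiers cannot drift, and how an append-only binding table supports updates while keeping every earlier version resolvable.

```python
# Toy model of the persistence points above (not the dPID codebase).
# Content addressing makes drift impossible (points 2 and 4), while an
# append-only binding table supports updates with a full version
# history (point 6). Plain SHA-256 stands in for real CIDs here.
import hashlib

store: dict[str, bytes] = {}         # content-addressed blob store
bindings: dict[str, list[str]] = {}  # PID -> ordered list of CIDs

def put(content: bytes) -> str:
    """Store content under an identifier derived from the content itself."""
    cid = hashlib.sha256(content).hexdigest()
    store[cid] = content
    return cid

def bind(pid: str, content: bytes) -> str:
    """Update a PID by appending a new version; old versions stay resolvable."""
    cid = put(content)
    bindings.setdefault(pid, []).append(cid)
    return cid

def resolve(pid: str, version: int | None = None) -> bytes:
    """Resolve a PID to its latest payload, or to a specific earlier version."""
    history = bindings[pid]
    cid = history[-1] if version is None else history[version]
    return store[cid]

bind("dpid://example/1", b"draft manuscript")
bind("dpid://example/1", b"revised manuscript")
assert resolve("dpid://example/1") == b"revised manuscript"
assert resolve("dpid://example/1", version=0) == b"draft manuscript"
# Drift is impossible by construction: editing a payload yields a new
# hash, so an existing CID can never silently point at different bytes.
```

Point 5 then amounts to exposing a function like `resolve()` over HTTP, which is part of why resolution federates so naturally.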
The group was in general agreement with this sentiment; additional discussion is on the Google doc if you’re curious. @twdragon should be sharing the recording of that call shortly.
@sharif.islam Those are excellent points and questions.
We also need to engage our user community and funding organisations to better understand the problems we are trying to solve.
Totally agreed. In addition to universities and funding agencies, we should be bringing other prominent FAIR organizations into these discussions (FDO, RDA, GFF, CODATA, etc.). Any thoughts on how we can bring in these three groups?
Here, I would like to summarise a few high-level points that can facilitate productive discussion, rather than framing the discussion as centralised versus decentralised. Most decentralised technologies still rely on centralised governance models.
In essence, we need to think about types of decentralisation and hybrid implementations, taking into account the context of other components in the infrastructure. PID is just one aspect.
I couldn’t agree more. Governance for the PID system will need to be considered in depth, and you’re right about centralized governance. The DeSci Foundation is standing up a governance committee right now. It would be good to have the governance committee tightly coupled with a technical steering committee. I know that is on the horizon for the Foundation. The experience of members on this forum thread would be invaluable there.
I’m sure Philipp would appreciate discussion on this topic, reach out to him if you’d like to discuss.
- Keeping data as close as possible to the compute and analytics platform, while also allowing for local domain ownership. This can change where and how the PID infrastructure interacts with the data and compute part.
I could see a case for repositories playing a more important role in the governance and operations of a PID system that integrates more closely with data through decentralized technologies like Bacalhau.
I’ve broached this topic in the past and have encountered quite a bit of pushback. Nobody disagrees with the concept, but it seems to be too early for this discussion. Cybersecurity, architecture, privacy, cost, etc. all get brought up regularly (and are valid concerns). We should keep this edge-compute functionality in mind, but focus on communicating more basic concepts like persistent folder structures.
- Can we think of a hybrid solution that takes advantage of centralised governance and data aggregation as well as the independence of a decentralised model?
This is a very good question. While I don’t have an immediate answer, I would love to explore your thoughts. Do you have any specific approaches in mind?
- Leveraging the existing trust models of centralised organisations. Trust needs to accommodate complex social and local contexts; cryptographic verification is just one aspect.
One of the main points of discussion so far in the dPID working group has been how we can build on these existing trust models. One concern that has been expressed is that current trust models tend to conflate the existence of a PID with the trustworthiness of the information it identifies (seeing PIDs as a “stamp of science”). While there is definitely a correlation between the two, it is not causation, and the assumption can be dangerous. Do you have any thoughts on using the scalability of IPFS-based PIDs to build a more explicit verification and social trust layer post-minting? We talked about doing that through annotations and attestations in this video.
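For the sake of discussion, here is a hypothetical sketch of what such a post-minting attestation could look like as a plain data structure. The field names and identifiers are invented for illustration and are not taken from the dPID spec or the video.

```python
# Hypothetical shape of a post-minting attestation: trust becomes an
# explicit, signed claim about an already-minted PID, rather than a
# property implied by the PID's existence. Field names are illustrative.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Attestation:
    subject_pid: str     # the research object being vouched for
    attester: str        # who makes the claim (e.g. an ORCID, DID, or org)
    claim: str           # what is asserted, ideally machine-readable
    issued_at: str       # ISO 8601 timestamp
    signature: str = ""  # detached signature over the other fields

att = Attestation(
    subject_pid="dpid://example/42",
    attester="orcid:0000-0000-0000-0000",  # placeholder identifier
    claim="reproduced-computational-results",
    issued_at=datetime.now(timezone.utc).isoformat(),
)
print(asdict(att))
```

Many independent attestations could accumulate on a single PID over time, which would make trust auditable rather than assumed.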
Dear participants,
I am really excited to see so much interest and enthusiasm for this topic! Thank you all very much! Here is a video recording of our initial discussion.
On my YouTube channel you can also find a playlist where I am collecting all the related discussion recordings and videos describing the topic.
Totally agreed. In addition to universities and funding agencies, we should be bringing other prominent FAIR organizations into these discussions (FDO, RDA, GFF, CODATA, etc.). Any thoughts on how we can bring in these three groups?
I think a good starting point would be to publish the prospectus, with some supplementary text, on the RDA blog platform. Many of the people involved read these articles, so we can achieve some general dissemination this way.
I’ve broached this topic in the past and have encountered quite a bit of pushback. Nobody disagrees with the concept, but it seems to be too early for this discussion. Cybersecurity, architecture, privacy, cost, etc. all get brought up regularly (and are valid concerns). We should keep this edge-compute functionality in mind, but focus on communicating more basic concepts like persistent folder structures.
I see in practice that concerns about the cost/outcome balance are now becoming the most important factor for a research institution that has to take care of its data by itself. I am still thinking about how to successfully explain the advantages of a decentralised system to the people in charge. In this respect, I am relying on your experience, guys!