Content-based addressing and PIDs for version-controlled data/software

twdragon · August 23, 2023, 12:50pm

Dear colleagues,
Could you please share your knowledge and best practices for minting and using PIDs for:

Content-addressed datasets and their elements (for example, collections that can be searched by image or by any other non-verbal type of search key);
Version-controlled datasets and software.

Most users handle the related data curation tasks with a VCS plus a DOI authority, such as the Git + Zenodo pair, but I would also like to collect information about practices that are not yet widespread or well known in the community, especially hash-based PIDs.

Frauke · September 5, 2023, 10:02am

Hi,
in the AV portal at the TIB we use hash-based DOIs to cite videos at a certain timestamp. Here is an example:
DOI for the video: https://doi.org/10.5446/62930
DOI for a certain timestamp of that video: https://doi.org/10.5446/62930#t=00:56
DOI for a certain section of that video: https://doi.org/10.5446/62930#t=02:21,02:35

The resolving of the hash takes place in the AV portal itself, not at DOI level.

twdragon · September 6, 2023, 8:57am

@Frauke could you please tell a bit more about the hashing functions you use, and the resolver pipeline implemented on your side?

Frauke · September 8, 2023, 12:13pm

Hi,

in the AV-Portal, the video player component of our JavaScript application takes over the task of resolving the timestamps in the DOI (which points to the actual portal URL, starting with https://av.tib.eu/media/…). Some special parsing methods are implemented to make the player compatible with the common timestamp resolving standard (so-called media fragments) that modern HTML5 players should support. The hash can contain the starting playback timestamp (second-wise), but you can also use the from-to format (start of the segment → end of the segment). Additionally, you can use the built-in citation function of the portal, which lets you cite the currently playing segment of the video in the aforementioned from-to format.
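To make the timestamp formats concrete, here is a minimal Python sketch of how a `#t=...` fragment could be turned into start and end times in seconds. It is not the AV-Portal's actual player code (that is a JavaScript component); the function names `npt_to_seconds` and `parse_temporal_fragment` are hypothetical, and only the plain `ss`, `mm:ss` and `hh:mm:ss` time forms are handled.

```python
from urllib.parse import urlsplit


def npt_to_seconds(value: str) -> float:
    """Convert 'ss', 'mm:ss' or 'hh:mm:ss' (optionally fractional) to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_temporal_fragment(url: str):
    """Return (start, end) in seconds for a '#t=...' fragment, or None.

    'end' is None when only a start timestamp is given.
    """
    fragment = urlsplit(url).fragment
    for pair in fragment.split("&"):
        name, _, value = pair.partition("=")
        if name != "t":
            continue
        # An explicit 'npt:' prefix is optional in temporal fragments.
        if value.startswith("npt:"):
            value = value[len("npt:"):]
        start, _, end = value.partition(",")
        return (
            npt_to_seconds(start) if start else 0.0,
            npt_to_seconds(end) if end else None,
        )
    return None


print(parse_temporal_fragment("https://doi.org/10.5446/62930#t=00:56"))
print(parse_temporal_fragment("https://doi.org/10.5446/62930#t=02:21,02:35"))
```

With the two example DOIs from earlier in the thread, the first call yields a start of 56 seconds with no end, and the second yields the segment from 141 to 155 seconds.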

You may find more helpful information on media fragments here:

https://www.w3.org/TR/media-frags/

twdragon · September 8, 2023, 12:59pm

@Frauke There is nothing at the link you shared :(

As I understand it, the hash is used only as a URI affix, where the URI is the DOI resolved by the centralized DOI gateway. Could you also explain how the affix is formed and hashed (perhaps by sharing the internal spec, if it is open)? I would really appreciate being able to study your use case, also with a view to implementing openly reproducible PIDs and an affix model based on them, like I described here:

mvermeyen · September 12, 2023, 11:19am

A bit late to the discussion, but for version control systems, you could also take a look at how Software Heritage handles this:

https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
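As a small illustration of what a hash-based identifier looks like there: for file contents, the identifier has the form `swh:1:cnt:<hash>`, where the hash is computed the same way Git computes a blob ID (SHA-1 over a `blob <size>\0` header followed by the bytes). The sketch below covers only the content (`cnt`) case; the function name `swhid_for_content` is hypothetical, and the linked documentation remains the authoritative reference for the other object types.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute a content identifier of the form 'swh:1:cnt:<sha1_git>'.

    The hash is the Git blob hash: SHA-1 of 'blob <length>\\0' + data.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file has the well-known Git blob ID e69de29b...
print(swhid_for_content(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is derived from the bytes alone, anyone holding the same file can recompute it without asking a registry.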

Kind regards,
Maarten

twdragon · September 12, 2023, 1:24pm

@mvermeyen thank you! It is a good concept, especially because it uses a Merkle DAG. I implemented a Git bridge for IPFS, which is also a content-addressed network built on a Merkle DAG. There should be a way to integrate Merkle DAG structures with Git and content-addressed networking directly.
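For readers less familiar with the term, here is a toy Merkle DAG in Python. It is a deliberately simplified sketch: the serialization (the `leaf`/`node` tags, SHA-256, name-sorted children) is an assumption made for illustration and matches neither Git's, IPFS's, nor Software Heritage's actual object formats. It only shows the shared idea that a node's ID is a hash over its children's IDs.

```python
import hashlib
from typing import Mapping


def leaf_id(data: bytes) -> str:
    """ID of a leaf: hash of its raw bytes (with a type tag)."""
    return hashlib.sha256(b"leaf\0" + data).hexdigest()


def node_id(children: Mapping[str, str]) -> str:
    """ID of an inner node: hash over the sorted (name, child ID) pairs."""
    h = hashlib.sha256(b"node\0")
    for name in sorted(children):
        h.update(name.encode("utf-8") + b"\0")
        h.update(children[name].encode("ascii") + b"\n")
    return h.hexdigest()


# Version 1 of a tiny dataset: a README plus a 'src' directory.
src = node_id({"main.py": leaf_id(b"print('hi')\n")})
root_v1 = node_id({"README": leaf_id(b"hello\n"), "src": src})

# Version 2 changes only the README; the unchanged 'src' subtree keeps its ID.
root_v2 = node_id({"README": leaf_id(b"hello, world\n"), "src": src})

print(root_v1 != root_v2)  # the root ID changes with any change below it
```

The unchanged subtree is referenced by the same ID in both versions, which is what lets version-controlled content be deduplicated and addressed by hash.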

I am also working on a prefix-based concept that allows decentralized resolution of DOIs and all other prefix-based identifiers. Do you know of any similar projects that already exist?

Frauke · September 13, 2023, 6:51am

@twdragon https://av.tib.eu/media/ is not really a link, just the start of a URL for explanatory purposes.

The hash suffix in this case is the common standard Media Fragment Identifier (see the link in my earlier post). The DOI system ignores everything coming after a hash by default and affixes it to the resolved URL. So for example https://dx.doi.org/10.14454/3w3z-sa82#pdf transforms to the following URL: https://schema.datacite.org/meta/kernel-4.4/#pdf, although in this case the hash does not do anything. But it could also be used, for example, to jump to a certain caption on an HTML page.
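The net effect can be sketched in a few lines of Python. This is only a model of the behaviour described above, not resolver code: the helper names are hypothetical, and the sketch does not make any network request or say which component (resolver or browser) carries the fragment across the redirect.

```python
from urllib.parse import urldefrag, urlsplit


def split_doi_url(url: str):
    """Separate the part used for DOI resolution from the '#...' suffix."""
    base, fragment = urldefrag(url)
    return base, fragment


def apply_fragment(target_url: str, fragment: str) -> str:
    """Re-attach the original fragment to the resolved landing URL.

    If the target already has its own fragment, it is left untouched.
    """
    if not fragment or urlsplit(target_url).fragment:
        return target_url
    return f"{target_url}#{fragment}"


base, fragment = split_doi_url("https://dx.doi.org/10.14454/3w3z-sa82#pdf")
# 'base' is what gets resolved; the landing URL below is taken from the example above.
landing = "https://schema.datacite.org/meta/kernel-4.4/"
print(apply_fragment(landing, fragment))
# https://schema.datacite.org/meta/kernel-4.4/#pdf
```

The same two steps apply to the video DOIs: the DOI part resolves to the portal URL, and `t=...` is handed unchanged to the player.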

So to recap, the TIB AV Portal uses the Media Fragment Identifier standard and combines it with the DOI system’s default settings.

I hope this helps to clear it up a bit.

twdragon · September 13, 2023, 9:20am

@Frauke Thank you! This clarifies things