Granularity of datasets

mhfenner · July 10, 2020, 10:40am

Datasets can be stored and described by metadata in a variety of ways: for example, as one big dataset with a single DOI and associated metadata, or as a list of dataset files that are grouped together in a collection, each with its own persistent identifier and metadata. What is needed to provide more clarity?

Philipp · July 10, 2020, 11:13am

Thanks for following up on this. I addressed this issue a couple of years ago; see this Dataverse GitHub issue.

Currently, dataset-level DOI records are conflated with dataset-file-level DOI records in DataCite. This is quite unsatisfactory because, among other things, it leads to a proliferation of file metadata records in DataCite Search results and in ORCID record search results.

As a first step to mitigate this issue, I’d like to suggest introducing a new Resource Type General value to be used for DOI records about files in a dataset. We might call this type Dataset file, Dataset part or Part of Dataset. This type would then be available for filtering records in DataCite Search, Fabrica and other (DataCite) webpages and services.

mhfenner · July 10, 2020, 11:28am

Thanks Philipp. One challenge I see is that data repositories are implementing this in different ways. For example:

one dataset (with one DOI) that holds everything
one dataset with multiple data files, where only the dataset has a DOI and DataCite metadata
one dataset with multiple data files, each with a DOI and metadata
multiple datasets (each with multiple data files), aggregated in one or more collections

My specific question, then, is whether the first item is a dataset or a dataset file.

An alternative way to distinguish (and filter out) dataset files might be to look at isPartOf/HasPart in the metadata.
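A minimal sketch of such a filter, assuming each record is a dict with a `relatedIdentifiers` list whose entries carry a `relationType` (spelled `IsPartOf` in the DataCite schema). The helper name `is_dataset_file` and the sample DOIs are hypothetical, made up for illustration:

```python
def is_dataset_file(record):
    """Return True if the record declares an IsPartOf relation.

    Assumption: an IsPartOf relation marks a file-level record.
    """
    related = record.get("relatedIdentifiers") or []
    # Compare case-insensitively so that isPartOf and IsPartOf both match.
    return any((r.get("relationType") or "").lower() == "ispartof" for r in related)


# Hypothetical sample records.
records = [
    {"doi": "10.1234/dataset", "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/dataset/file1", "relationType": "HasPart"}]},
    {"doi": "10.1234/dataset/file1", "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/dataset", "relationType": "IsPartOf"}]},
    {"doi": "10.1234/standalone"},
]

datasets_only = [r["doi"] for r in records if not is_dataset_file(r)]
print(datasets_only)
```

Under this rule, a single dataset that holds everything declares no IsPartOf relation and would therefore be kept as a dataset. Whether that is the right answer for every repository is an open question.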

mhfenner · July 10, 2020, 11:46am

It helps to identify the use cases we want to solve. I'll start:

In DataCite Search, only get the datasets and not all associated files, to reduce noise
In an ORCID record, only include the dataset, and not associated files, to focus on the overall record
Aggregate all citations, views and downloads by dataset, even if they are associated with individual dataset files
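The third use case could be sketched as a roll-up, where counts recorded against a file DOI are added to the parent dataset found via its IsPartOf relation. The metric field names (`citations`, `views`, `downloads`), the helpers `parent_of` and `aggregate_by_dataset`, and the DOIs are assumptions for illustration, not DataCite API fields:

```python
from collections import defaultdict


def parent_of(record):
    """Return the DOI this record is part of, or its own DOI if it has no IsPartOf relation."""
    for rel in record.get("relatedIdentifiers", []):
        if rel.get("relationType", "").lower() == "ispartof":
            return rel["relatedIdentifier"]
    return record["doi"]


def aggregate_by_dataset(records):
    """Sum citations, views and downloads per dataset, folding file-level counts into the parent."""
    totals = defaultdict(lambda: {"citations": 0, "views": 0, "downloads": 0})
    for record in records:
        bucket = totals[parent_of(record)]
        for metric in bucket:
            bucket[metric] += record.get(metric, 0)
    return dict(totals)


# Hypothetical sample records: one dataset and two of its files.
records = [
    {"doi": "10.1234/dataset", "citations": 2, "views": 10, "downloads": 1},
    {"doi": "10.1234/dataset/file1", "citations": 1, "views": 5, "downloads": 4,
     "relatedIdentifiers": [{"relatedIdentifier": "10.1234/dataset", "relationType": "IsPartOf"}]},
    {"doi": "10.1234/dataset/file2", "views": 3, "downloads": 2,
     "relatedIdentifiers": [{"relatedIdentifier": "10.1234/dataset", "relationType": "IsPartOf"}]},
]

totals = aggregate_by_dataset(records)
print(totals["10.1234/dataset"])
```

This only follows one level of IsPartOf. Datasets nested inside collections would need the relation to be resolved repeatedly, or a decision about which level to report on.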

Philipp · July 10, 2020, 11:54am

As far as I can see from the metadata records in DataCite Fabrica, filtering based on isPartOf in the Related Identifiers > Relation Type field may work; see e.g. this screenshot of a DataCite file-level DOI record from a Dataverse repository:

mhfenner · July 10, 2020, 12:14pm

And should we do this automatically in DataCite Search? I am fine with that, but only after we have implemented solid navigation from the dataset to the dataset files, similar to how this is implemented in Dataverse.

For ORCID claiming via auto-update, we have for a long time excluded all DataCite DOIs that have an isIdenticalTo, isPartOf or isVersionOf relationship.

mhfenner · July 12, 2020, 6:18am

@Philipp In schema.org (and in DCAT where this comes from), we have the concepts of distribution and dataDownload that I think align nicely with what you are proposing.
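As a rough illustration of that alignment, a dataset with two files might be expressed in schema.org JSON-LD as a `Dataset` whose `distribution` property lists `DataDownload` items. The DOI, file names and URLs below are hypothetical placeholders:

```python
import json

# Hypothetical dataset description; identifiers and URLs are placeholders.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.1234/dataset",
    "name": "Example dataset",
    "distribution": [
        {
            "@type": "DataDownload",
            "name": "file1.csv",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/files/file1.csv",
        },
        {
            "@type": "DataDownload",
            "name": "file2.csv",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/files/file2.csv",
        },
    ],
}

print(json.dumps(dataset, indent=2))
```

In this shape the files hang off the dataset record rather than appearing as separate top-level records, which is close to the dataset versus dataset file distinction discussed above.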

0000-0001-6975-6816 · October 2, 2020, 2:14am

Hi @mhfenner and @Philipp, I’m glad that you have raised these questions. There’s an RDA BoF that we have started, currently at the task force stage. We’d love to have you be a part of this group. These are exactly the types of concerns and solutions we are looking into. We have a BoF session at the virtual plenary in November: https://www.rd-alliance.org/data-granularity
Katy McNeill and I are chairing this group.

mhfenner · October 2, 2020, 4:06am

Perfect timing! I will try to attend the BoF. My interest is twofold: a) whether there is a need to adjust the DataCite Metadata Schema, and b) how to better display these relations in DataCite Search and now DataCite Commons.