DataCite blog post: Making the most out of available Metadata

mhfenner · July 10, 2020, 10:45am

Yesterday DataCite published a blog post on how improving DataCite services can make DataCite metadata more useful: https://doi.org/10.5438/1dgk-1m22

The blog post focusses on three metadata properties (language, rights, subject area) and two DataCite services (Fabrica DOI registration service, DataCite Search). Please put your questions and comments here.

tedhabermann · July 10, 2020, 7:19pm

I made this comment on the original blog post and repeat it here for completeness:
Thanks for the update on three DataCite metadata elements that help improve findability and reuse and the inclusion of these in the DataCite metadata registration interface. It is worth noting that many elements that can facilitate findability are rather rare in DataCite metadata. My analysis last year showed that ~60% of DataCite collections include subject (keywords) in their metadata and a similar analysis shows that ~40% of the collections include rights information. There are clearly opportunities for improvement here. Your recommendations for standard vocabularies definitely start us out on the right track. Quantitative measures applied across the DataCite corpus can help us identify and celebrate good examples and measure progress.

mhfenner · July 12, 2020, 6:26am

Crossref Participation Reports show the completeness of important metadata summarized by publisher. I would be interested in how much they have helped improve the completeness of the metadata they cover, which includes, for example, license URLs.

It is a complementary but slightly different approach. What we are trying to accomplish with the DOI form and search facets is not just metadata completeness for language, rights and subject area (the focus of the blog post), but also to encourage the use of standard vocabularies in these fields.

tedhabermann · July 12, 2020, 4:25pm

100% correct that standard vocabularies are a good thing. It would be interesting to know where we are starting from. What vocabularies are people already using and how many people are using multiple vocabularies… Two interesting questions…

mhfenner · July 13, 2020, 8:48am

The challenge is of course that there is rarely a single standard. Language is a case in point: is the standard ISO 639-1, ISO 639-2 or BCP 47? The schema documentation says to use ISO 639-1 or BCP 47, but we have many records using ISO 639-2, because until now we didn’t check which standard was used.
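To make the ambiguity concrete, here is a minimal sketch of a shape-based check that guesses which convention a language value follows. The function name and the labels it returns are made up for illustration, and this is not how DataCite validates the field; it only looks at the form of the string and does not consult the ISO 639 or IANA registries.

```python
import re

# Shape-only patterns (assumption: good enough for a rough survey of records).
# They do NOT verify that a code is actually registered.
TWO_LETTER = re.compile(r"[A-Za-z]{2}")
THREE_LETTER = re.compile(r"[A-Za-z]{3}")
# Simplified: a 2-3 letter primary subtag plus one or more hyphenated subtags.
TAG_WITH_SUBTAGS = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})+")


def guess_language_scheme(value: str) -> str:
    """Return a rough label for the shape of a language value (hypothetical helper)."""
    value = value.strip()
    if TWO_LETTER.fullmatch(value):
        return "two-letter code (ISO 639-1 shape)"
    if THREE_LETTER.fullmatch(value):
        return "three-letter code (ISO 639-2 shape)"
    if TAG_WITH_SUBTAGS.fullmatch(value):
        return "tag with subtags (BCP 47 shape)"
    return "unrecognized"


if __name__ == "__main__":
    for sample in ["es", "spa", "es-MX", "Spanish"]:
        print(sample, "->", guess_language_scheme(sample))
```

A bare two-letter code such as `es` is also a valid BCP 47 tag, so the form alone cannot always tell the schemes apart.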

tedhabermann · July 13, 2020, 1:05pm

Yes, there are many vocabularies… Fortunately we can include the source in the metadata; I think that is as good as it gets.

mhfenner · July 13, 2020, 3:33pm

I think we should enable users to search, for example, for all “dissertations written in Spanish”. For the language property it would help to a) allow multiple language entries (for multiple languages and/or multiple vocabularies used), and b) add a new property languageScheme to indicate the vocabulary used.
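As a rough sketch of what this could enable, assuming records could carry several language entries, each tagged with the proposed languageScheme. The class names, field names and the `resource_type` value below are hypothetical and do not reflect the current DataCite schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LanguageEntry:
    value: str   # e.g. "es", "spa", "es-MX"
    scheme: str  # the proposed languageScheme, e.g. "ISO 639-1"


@dataclass
class Record:
    title: str
    resource_type: str  # hypothetical simplification of the resource type
    languages: List[LanguageEntry] = field(default_factory=list)


# Code for Spanish in each vocabulary discussed in this thread.
SPANISH_BY_SCHEME = {"ISO 639-1": "es", "ISO 639-2": "spa", "BCP 47": "es"}


def is_spanish(entry: LanguageEntry) -> bool:
    """Compare the primary subtag against the Spanish code for the entry's scheme."""
    expected = SPANISH_BY_SCHEME.get(entry.scheme)
    primary = entry.value.strip().lower().split("-")[0]
    return expected is not None and primary == expected


def find_dissertations_in_spanish(records: List[Record]) -> List[str]:
    """Return titles of dissertations with at least one Spanish language entry."""
    return [
        r.title
        for r in records
        if r.resource_type == "Dissertation" and any(is_spanish(e) for e in r.languages)
    ]


records = [
    Record("Tesis A", "Dissertation", [LanguageEntry("es", "ISO 639-1")]),
    Record("Tesis B", "Dissertation", [LanguageEntry("spa", "ISO 639-2"),
                                       LanguageEntry("es-MX", "BCP 47")]),
    Record("Thesis C", "Dissertation", [LanguageEntry("en", "ISO 639-1")]),
    Record("Conjunto de datos D", "Dataset", [LanguageEntry("es", "ISO 639-1")]),
]

if __name__ == "__main__":
    print(find_dissertations_in_spanish(records))
```

Knowing the scheme for each entry is what lets the search treat `es`, `spa` and `es-MX` as the same language without guessing from the shape of the value.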

Gin · July 18, 2020, 11:01am

A very good question! Anecdotally, the participation reports have helped move the dial quite significantly: we have many examples where members just didn’t know they could (or their service provider could) add things like license and full-text links, abstracts or even references. We do 1:1 consultations with members with lower participation and definitely see a significant jump in the richness of their records a few months later. We don’t keep historical coverage metadata (yet), so unfortunately it’s hard to prove.