Experiences on data citation with editors and publishers

marco.marsella · July 9, 2019, 9:02am

We assign DOIs to Plant Genetic Resources for Food and Agriculture (PGRFA). Our stakeholders are genebanks, breeders, conservationists, researchers and policy makers worldwide. Many of them publish scientific results about those PGRFA in papers that get assigned a DOI.

With the release of Event Data by Crossref and DataCite, we have a fantastic way of locating publications and datasets citing our DOIs.
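For anyone who wants to try this, here is a minimal Python sketch of how such a lookup could work. It assumes the public Event Data query endpoint at `api.eventdata.crossref.org/v1/events` and its `obj-id`, `rows` and `mailto` parameters. The response field names (`subj_id`, `relation_type_id`, `source_id`) are also assumptions about the JSON layout, and the DOI in the usage note is a placeholder, not one of ours.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Crossref Event Data query API.
EVENT_DATA_URL = "https://api.eventdata.crossref.org/v1/events"


def build_query(doi, mailto, rows=100):
    """Build a query URL for events whose object (the cited item) is `doi`."""
    params = {"obj-id": doi, "rows": rows, "mailto": mailto}
    return EVENT_DATA_URL + "?" + urlencode(params)


def citing_works(payload):
    """Return (subject, relation, source) triples from a decoded response.

    The field names are assumptions about the response layout.
    """
    events = payload.get("message", {}).get("events", [])
    return [
        (e.get("subj_id"), e.get("relation_type_id"), e.get("source_id"))
        for e in events
    ]


def fetch_citing_works(doi, mailto):
    """Query the API over the network and list the works pointing at `doi`."""
    with urlopen(build_query(doi, mailto)) as response:
        return citing_works(json.load(response))
```

Calling `fetch_citing_works("10.5555/12345678", "you@example.org")` would return one triple per event. The `source_id` value indicates whether the link arrived via Crossref or DataCite metadata.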

Given how most current publishing systems are designed, the citation of our DOIs can be passed to Crossref and DataCite, to be later fed into Event Data, only if listed among other bibliographic references. Unfortunately, we are experiencing significant resistance from editors and publishers because they are not used to such odd-looking references.

In some cases, e.g. Cambridge University Press, we got the work done after a reasonable amount of pushing (see https://doi.org/10.1017/s0960258518000211) but others were not so amenable.

Crossref and DataCite are doing great work talking to editors and publishers, and we push from the author’s side, but a lot still needs to be done.

What are your experiences? Are you being more successful? If yes, what are your arguments?

dnoesgaard · July 10, 2019, 3:57pm

Dear Marco,

At GBIF we assign DataCite DOIs to datasets of biodiversity data (mainly species occurrences). Such datasets are used in publications at a rate of about 2 papers per day. I’ve recently done some statistics on how publishers are doing in our eyes, i.e. how many papers cite datasets the right way (using a DOI) and how many papers cite in a more generic style. You can see my findings here:

Happy to chat more about this!

mhfenner · July 10, 2019, 4:06pm

Thanks @dnoesgaard. And great to see that you also use Discourse at GBIF.

What I find interesting in the post by @marco.marsella is the focus on editors, as there are two pieces needed to have more data citations in publisher metadata:

a technical problem: how to properly ask authors for identifiers for data and how to put them into the metadata sent to Crossref
a social problem: convincing editors and reviewers to make sure the underlying data are included in submitted manuscripts and cited properly, via journal policies, training of editorial boards, etc.

We need to solve both aspects.

marco.marsella · July 10, 2019, 4:50pm

Thanks to both @dnoesgaard and @mhfenner for taking the time to reply to my post. It is interesting how GBIF data match our own experience, with the same publishers being more responsive than others.

I also agree with Martin that we need to tackle both issues. There is quite some confusion between editors and publishers as we had editors claim as impossible things that the publisher accepted without a blink. For instance, an editor told us that you cannot have publications (or, more in general, DOIs) in the list of bibliographic references that are not explicitly cited in the main text. The publisher said there would be no problem doing that (as expected, I would say).

For @dnoesgaard: how do you track usage? Event Data?

Thank you!

dnoesgaard · July 10, 2019, 7:04pm

@mhfenner, I think the social problem is the bigger one of the two.

@marco.marsella, we would use Event Data but there just aren’t enough proper citations. So while we wait for a better data citation culture, we rely on a fairly manual process (described here).

mhfenner · July 11, 2019, 4:49am

@dnoesgaard GBIF is one of the major contributors to Event Data, something like 90% of the links between datasets in the system come from you, and also a significant number of references to publications. As you say, the number of citations to your content coming from publishers via Crossref is much smaller, and that is typical for data citations in Event Data in general:

This slide is from a presentation I gave last month at a workshop for societies organized by the American Geophysical Union (AGU). Something we discussed there is to go back to the publisher(s) with the data citations provided in DataCite metadata by the data repository, discovered via the manual process @dnoesgaard describes. Publishers that are very supportive of data citation (such as AGU) can then go back to their journal articles and check why the specific data citation was not provided to Crossref. It might be that the dataset is not mentioned at all, is mentioned without the DOI, or is mentioned with the DOI but the DOI is not included in the metadata sent to Crossref.

rzepa · July 13, 2019, 8:11am

Can I illustrate this with a specific example? Article DOI 10.1002/open.201900099 has a metadata record https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1002/open.201900099

Cross-checking with the article itself reveals that, e.g., Ref 25 is missing from the metadata record. Ref 25 is a DataCite-registered data citation. In this article it is being treated merely as a URL rather than a PID (which is common publisher practice).

I am in the process of querying this exclusion with the publisher, but it would be good if a more concerted lobby to get references like this included in the metadata “ecosystem” were pursued!

mhfenner · July 14, 2019, 10:55am

The reference you mention looks like this in the Crossref metadata:

<citation key="e_1_2_6_21_1">
  <unstructured_citation>H. S. Rzepa Experimental evidence for “hidden intermediates”? Epoxidation of ethene by peracid. Blog post 2013 doi: 2010.14469/hpc/14807</unstructured_citation>
</citation>

Compare to for example the next reference:

<citation key="e_1_2_6_22_1">
  <journal_title>J. Chem. Theory Comput.</journal_title>
  <author>Knizia G.</author>
  <first_page>4834</first_page>
  <volume>9</volume>
  <cYear>2013</cYear>
  <doi provider="crossref">10.1021/ct400687b</doi>
</citation>

DataCite is currently not processing unstructured citations when converting from Crossref metadata. In cases like this that would be easy, and I will look into how we could include them.
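To illustrate why this case is easy, here is a rough Python sketch of pulling a DOI out of an unstructured citation. This is not how DataCite converts Crossref metadata. The regular expression is a common heuristic for DOI-like strings, and the function names are made up for this example. The pattern deliberately has no leading word boundary, because in the citation quoted above the DOI is directly preceded by other digits.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic for DOI-like strings: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

# The unstructured citation quoted earlier in this thread.
SAMPLE = """<citation key="e_1_2_6_21_1">
  <unstructured_citation>H. S. Rzepa Experimental evidence for “hidden intermediates”? Epoxidation of ethene by peracid. Blog post 2013 doi: 2010.14469/hpc/14807</unstructured_citation>
</citation>"""


def extract_doi(text):
    """Return the first DOI-like string in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def doi_from_citation(citation_xml):
    """Prefer a structured <doi> element, else fall back to the free text."""
    citation = ET.fromstring(citation_xml)
    doi = citation.findtext("doi")
    if doi:
        return doi.strip()
    return extract_doi(citation.findtext("unstructured_citation") or "")


print(doi_from_citation(SAMPLE))  # prints 10.14469/hpc/14807
```

A string found this way would still need to be checked against the DOI resolver before being trusted, since free-text matching can pick up truncated or mistyped identifiers.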

The trickier and more common case is unstructured citations for content with DataCite DOIs that don’t include the DOI. This is common for content with Crossref DOIs, and Dominika at Crossref has done a lot of great work on this topic; see for example this blog post from last week.

rzepa · July 15, 2019, 6:38am

Thanks Martin for that really helpful insight into how Crossref categorises citations. It appears that by “structured”, they refer to the printed legacy type, with a page number, volume etc. Traditionally, that “structure” does not include any PID. I remember starting to add PIDs to my “structured” article citations a few years back, only to have them all removed at the proof-setting stage because they were not the “correct” structure.

I guess it also does not help that the reference you illustrate above (actually ref 9) is to a narrative, whereas ref 25 (which is not shown above) is to formal data. That difference is probably not formally declared in the associated metadata for the ref 9 at least (ref 25 however is indeed declared using resourceTypeGeneral=“Collection”).

mhfenner · July 16, 2019, 10:39am

The Crossref metadata for that DOI is at http://api.crossref.org/works/10.1002/open.201900099 (JSON) or http://api.crossref.org/works/10.1002/open.201900099.xml (XML). The reference 25 in the PDF looks like this:

[25] J. E. M. N. Klein, G.Knizia, H. S. Rzepa, Imperial College Research Data Repository 2019, DOI: 10.14469/hpc/13603.

In the Crossref XML submitted by the publisher it looks like this:

<citation key="e_1_2_6_45_1">
  <journal_title>Imperial College Research Data Repository</journal_title>
  <author>Klein J. E. M. N.</author>
  <cYear>2019</cYear>
</citation>

This means that important information, namely the DOI, was lost in the Crossref XML. This needs further investigation, but one possible reason is that the DOI https://doi.org/10.14469/hpc/13603 was looked up in the Crossref system for validation (as almost all DOIs in reference lists are from Crossref) and then not found. Another is that the DOI didn’t exist yet when the reference list was validated, as the dataset was published in parallel.
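A check for this kind of loss could be sketched in Python as follows: list every citation that carries no `<doi>` element. The `<citation_list>` wrapper and the absence of XML namespaces are assumptions made to keep the example small (real Crossref XML is namespaced), and `citations_without_doi` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# The two structured citations quoted in this thread, in an assumed wrapper.
CITATIONS = """<citation_list>
  <citation key="e_1_2_6_22_1">
    <journal_title>J. Chem. Theory Comput.</journal_title>
    <author>Knizia G.</author>
    <first_page>4834</first_page>
    <volume>9</volume>
    <cYear>2013</cYear>
    <doi provider="crossref">10.1021/ct400687b</doi>
  </citation>
  <citation key="e_1_2_6_45_1">
    <journal_title>Imperial College Research Data Repository</journal_title>
    <author>Klein J. E. M. N.</author>
    <cYear>2019</cYear>
  </citation>
</citation_list>"""


def citations_without_doi(xml_text):
    """Return the keys of all citations that have no non-empty <doi> child."""
    root = ET.fromstring(xml_text)
    missing = []
    for citation in root.iter("citation"):
        if not (citation.findtext("doi") or "").strip():
            missing.append(citation.get("key"))
    return missing


print(citations_without_doi(CITATIONS))  # prints ['e_1_2_6_45_1']
```

The keys reported this way can then be compared by hand with the reference list in the PDF to see which entries had a DOI in print but not in the deposited metadata.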

marco.marsella · January 28, 2021, 1:36pm

@mhfenner @rzepa and @dnoesgaard thank you again for the great discussion! I would like to open a new dimension related to dataset repositories. I do not mean to be critical, I am just reporting facts.

Harvard Dataverse is a popular repository and is also being actively used by many of our users (plant genetic resources communities, mostly genebanks). Unfortunately, I found out that Dataverse does not support data citation, i.e. they offer no way of referencing the DOIs of the plant material the dataset refers to. I took this up with Dataverse Support and they told me that while Dataverse was recently updated to retrieve information from Event Data, contributing information to Event Data was not on their roadmap!

Another example is e!DAL from the Leibniz Genebank in Germany (IPK). They are also a DataCite member and we collaborate on several plant genetic resources-related projects. Again, e!DAL did not support data citation (and did not even allow for DOI metadata to be updated after registration…). At least they agreed to commit resources to resolving the issue and the work is under way.

I have been able to look at Zenodo and found that it does support data citation. Maybe you can contribute other examples?

While I understand that publisher systems may suffer from their dated design, I am surprised that repositories that, at least in my ignorance, I believe have a more recent history, are affected by this issue.

We need to band together and get all these guys to listen to us!