Exploiting the DataCite schema and Elasticsearch for complex queries

Primarily as a user, my main objective is to exploit the DataCite search features to perform rich search filtering of datasets. When the DataCite schema 4.0 (4.1) was announced, we decided to populate as much metadata against that schema as was possible using simple workflows. The next step was to construct some searches against that metadata that we thought the community would find useful, in order to get feedback and perhaps improvements, and to help test the DataCite schema and its indexing in a real-world environment.

Here I offer to this forum, for such feedback, the 20 examples constructed thus far. These can be inspected at DOI: https://doi.org/10.14469/hpc/5920

One of the immediate applications was to be able to estimate how much FAIR data (i.e. data with rich registered metadata) of a particular type might be associated with a particular journal (e.g. entries 14-19).

Another was to evaluate the current indexing of Schema 4.0/4.1 metadata using the Elasticsearch engine (entry 6).

I would also welcome examples of any interesting searches that members of this forum might have encountered or developed themselves, possibly to be added to the existing list of 20.

In the longer term, it would be desirable for users to be able to construct such searches using a user-friendly GUI. The current syntax can be quite fragile, with a single incorrect character causing failure. My question to this forum is how/whether such GUIs should evolve and/or whether e.g. subject domains such as chemistry (the one I work in) should in fact develop their own?

Thanks! It’s useful to see what people are searching for.

I agree that it would be useful to be able to construct these kinds of complex searches through a GUI, and this is something that’s already on our radar. Please feel free to comment on the related card on our public roadmap, or chat about it here.

Thanks for sharing these, Henry. I particularly like 14-19 with the journal searches. This would require us to include the related publication information in the metadata, and this is something we don’t typically have at the time the DOI is generated. So, we would have to sort out a workflow for 1) identifying which objects have related articles, 2) finding out when they are published, and 3) updating the records with this information. Since we are currently generating DOIs one at a time in Fabrica, this could be quite labor intensive. I’d love to hear anyone’s thoughts on approaches to this, and I’d love to have better metadata in these records.

Yes, adding related publication information is currently a manual process (for the depositor), which can cause severe problems, for example if the depositor has left the host institution by the time the related publication appears and has hence lost the credentials needed to update their metadata record.

I understand that e.g. ChemRxiv has some sort of workflow to try to automatically detect associated publications appearing after a deposition, but that must be expensive to write and maintain.

Can I ask another question? I mentioned in my original post that much of our metadata is generated using automated workflows. It struck me that there is no method I know of to record at least some details of our workflow (if not the entire script) and to include how the metadata record was generated in the metadata itself. The alternative is of course to record that it was generated by a human answering questions.

One step further might be to assign a PID to any specific workflow used and to cite that (instead of extra metadata or perhaps as well as).

To illuminate this, I will describe just two of our automated workflows:

  1. When the depositor logs into the repository, they can only proceed if a workflow using OAuth to validate an ORCID has already been run. Thereafter the ORCID is always injected.

  2. We generate molecule identifiers known as InChI (International Chemical Identifier) using a script which invokes a program called OpenBabel. It would of course be good if OpenBabel itself had a software PID, but certainly the bare bones of this information should perhaps be included in the registered metadata.

Both of the above scripts are under the control of the overall institutional administrator and cannot be subverted by a user. So the provenance and trust of running these scripts is, I think, also worth recording. In contrast, of course, human-entered metadata relies entirely on the veracity of that human!

So can anyone suggest how to add the provenance of the metadata record to the metadata itself?

@rzepa adding provenance to the metadata is something the Metadata WG would discuss, and I can forward the suggestion. Changing the metadata schema is a long process, so please expect this to take some time to be discussed by the working group, and be aware that the working group may decide not to include provenance in the schema.

What you can do in the meantime is link to more metadata via a relatedIdentifier with relationType HasMetadata.
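As a rough illustration of that suggestion, the Python sketch below builds such a relatedIdentifier element with the standard library. The attribute names relatedIdentifierType and relationType follow the DataCite schema; the function name and the DOI 10.1234/example-workflow are hypothetical placeholders standing in for a PID assigned to a workflow description.

```python
import xml.etree.ElementTree as ET


def has_metadata_element(identifier: str, identifier_type: str = "DOI") -> ET.Element:
    """Build a relatedIdentifier element pointing at additional metadata.

    The identifier is assumed to resolve to a record describing how the
    metadata was generated (for example a workflow description).
    """
    element = ET.Element(
        "relatedIdentifier",
        {
            "relatedIdentifierType": identifier_type,
            "relationType": "HasMetadata",
        },
    )
    element.text = identifier
    return element


if __name__ == "__main__":
    # Hypothetical DOI for a workflow description; not a real record.
    el = has_metadata_element("10.1234/example-workflow")
    print(ET.tostring(el, encoding="unicode"))
```

The resulting element would sit inside the relatedIdentifiers block of the dataset's metadata record, alongside any links to related publications.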

@amyhodge DataCite is working on automatically pulling in metadata for related publications if you provide the DOI in relatedIdentifier metadata. This is only a few weeks away via API, and a little bit longer via web interface.

@mhfenner - what exactly do you mean by this?

When you include relatedIdentifiers in DOI metadata, we automatically extract those relations and put them into the Event Data service. We have done work recently to optionally include DOI metadata in the Event Data API response. The work is not completed, but an example with GBIF data and a Zootaxa publication would be https://api.datacite.org/events/14e83dec-55fa-4715-a4c2-b1679a3d4ef3?include=dois (you need to add an application/json; version=2 accept header to use version 2 of the Event Data API). This returns

 {
            "id": "10.11646/zootaxa.4457.4.5",
            "type": "dois",
            "attributes": {
                "doi": "10.11646/zootaxa.4457.4.5",
                "identifiers": [
                    {
                        "identifier": "https://doi.org/10.11646/zootaxa.4457.4.5",
                        "identifierType": "DOI"
                    }
                ],
                "creators": [
                    {
                        "name": "DUAN, YANI",
                        "nameType": "Personal",
                        "givenName": "YANI",
                        "familyName": "DUAN"
                    },
                    {
                        "name": "DIETRICH, CHRISTOPHER H.",
                        "nameType": "Personal",
                        "givenName": "CHRISTOPHER H.",
                        "familyName": "DIETRICH"
                    }
                ],
                "titles": [
                    {
                        "title": "A new species of Polyamia DeLong (Hemiptera: Cicadellidae: Deltocephalinae: Deltocephalini) representing the first record of the genus from South America"
                    }
                ],
                "publisher": "Magnolia Press",
                "container": {
                    "type": "Journal",
                    "issue": "4",
                    "title": "Zootaxa",
                    "volume": "4457",
                    "firstPage": "557",
                    "identifier": "1175-5334",
                    "identifierType": "ISSN"
                },
                "publicationYear": 2018,
                "subjects": null,
                "contributors": null,
                "dates": [
                    {
                        "date": "2018-08-10",
                        "dateType": "Issued"
                    },
                    {
                        "date": "2018-11-08T21:17:07Z",
                        "dateType": "Updated"
                    }
                ],
                "language": null,
                "types": {
                    "ris": "JOUR",
                    "bibtex": "article",
                    "citeproc": "article-journal",
                    "schemaOrg": "ScholarlyArticle",
                    "resourceType": "JournalArticle",
                    "resourceTypeGeneral": "Text"
                },
                "relatedIdentifiers": [
                    {
                        "relationType": "IsPartOf",
                        "relatedIdentifier": "1175-5334",
                        "resourceTypeGeneral": "Collection",
                        "relatedIdentifierType": "ISSN"
                    }
                ],
                "sizes": null,
                "formats": null,
                "version": null,
                "rightsList": null,
                "descriptions": [
                    {
                        "description": "Polyamia (Polyamia) choromorica sp. n., representing the first record of the genus Polyamia DeLong from South America, is described and illustrated. Previously described species of Polyamia DeLong appear to be restricted to North America. Color illustrations of Polyamia (Copolyamia) caperata (Ball), Polyamia (Copolyamia) similaris DeLong & Davidson and Polyamia (Polyamia) weedi Van Duzee are also provided for comparison. A species checklist and distribution summary for the genus is provided. Notes on other South American species of Deltocephalini with supernumerary forewing crossveins are also provided.",
                        "descriptionType": "Abstract"
                    }
                ],
                "geoLocations": null,
                "fundingReferences": null,
                "url": "https://biotaxa.org/Zootaxa/article/view/zootaxa.4457.4.5",
                "contentUrl": null,
                "metadataVersion": 1,
                "schemaVersion": "http://datacite.org/schema/kernel-4",
                "source": "levriero",
                "isActive": true,
                "state": "findable",
                "reason": null,
                "created": "2019-06-12T18:56:57.000Z",
                "registered": null,
                "published": "2018",
                "updated": "2019-07-06T04:59:15.000Z"
            }
 }

as part of the API response, and this includes all the information needed to generate a formatted citation, which we will also do for you via https://api.datacite.org/dois/text/x-bibliography/10.11646/zootaxa.4457.4.5?style=apa
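The request described above, and the extraction of citation fields from a record shaped like the one shown, can be sketched in Python as follows. The URL and the Accept header value come from this post; the function names are hypothetical, and where exactly the DOI record sits inside the full response is not assumed here, so citation_fields takes a single record as input.

```python
import urllib.request

API_BASE = "https://api.datacite.org"


def build_event_request(event_id: str) -> urllib.request.Request:
    """Prepare (but do not send) a version 2 Event Data request with DOI metadata."""
    url = f"{API_BASE}/events/{event_id}?include=dois"
    return urllib.request.Request(
        url, headers={"Accept": "application/json; version=2"}
    )


def citation_fields(record: dict) -> dict:
    """Collect the fields a citation formatter would need from one DOI record."""
    attrs = record["attributes"]
    container = attrs.get("container") or {}
    titles = attrs.get("titles") or []
    return {
        "authors": [c["name"] for c in attrs.get("creators") or []],
        "title": titles[0]["title"] if titles else None,
        "journal": container.get("title"),
        "volume": container.get("volume"),
        "issue": container.get("issue"),
        "first_page": container.get("firstPage"),
        "year": attrs.get("publicationYear"),
    }


if __name__ == "__main__":
    req = build_event_request("14e83dec-55fa-4715-a4c2-b1679a3d4ef3")
    print(req.full_url, req.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual call; that step is left out so the sketch runs without network access.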

Thanks @mhfenner, this is very interesting and quite powerful!

Thanks @dnoesgaard. We also support this in our GraphQL API at https://api.datacite.org/graphql. For example, to request all datasets referencing or referenced by the Zootaxa paper:

{
  publication(id: "https://doi.org/10.11646/zootaxa.4457.4.5") {
    id
    creators {
      name
    }
    titles {
      title
    }
    publisher
    publicationYear
    datasets(first: 25) {
      totalCount
      nodes {
        id
        creators {
          name
        }
        titles {
          title
        }
        publisher
        publicationYear
      }
    }
  }
}

This query returns 193 datasets, with the metadata requested for the first 25.
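For anyone wanting to run a query like this from a script rather than a GraphQL console, a minimal Python sketch follows. It assumes the endpoint accepts the usual GraphQL-over-HTTP convention of a POST with a JSON body containing a "query" key; the function name is hypothetical, the query is a shortened form of the one above, and the schema fields may differ in later versions of the API.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "https://api.datacite.org/graphql"

# Shortened version of the query above: only the dataset count is requested.
QUERY = """
{
  publication(id: "https://doi.org/10.11646/zootaxa.4457.4.5") {
    id
    datasets(first: 25) {
      totalCount
    }
  }
}
"""


def build_graphql_request(query: str, endpoint: str = GRAPHQL_ENDPOINT) -> urllib.request.Request:
    """Prepare (but do not send) a GraphQL POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_graphql_request(QUERY)
    print(req.get_method(), req.full_url)
    # To execute: json.load(urllib.request.urlopen(req))
```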

There is a bit more work needed before we cover all Crossref DOIs cross-referenced to DataCite DOIs.

I am charting the metadata evolution and cross-references of, for example, the article DOI: https://doi.org/10.1002/open.201900099 and the associated data DOI: https://doi.org/10.14469/hpc/3603

Thus far, https://api.datacite.org/dois/text/x-bibliography/10.1002/open.201900099?style=apa has not yet propagated (expected presumably any day now). However I have already checked that the metadata record does not include ref 25 (to https://doi.org/10.14469/hpc/3603) because the publisher has treated it as a URL and not a PID.

From the other direction, https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/3603 has a relatedIdentifier which references https://doi.org/10.1002/open.201900099 although the bibliographic version
https://api.datacite.org/dois/text/x-bibliography/10.14469/hpc/3603?style=apa
does not seem to include a reference to https://doi.org/10.1002/open.201900099
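A check of this kind on the XML record can be scripted. The Python sketch below lists the relatedIdentifier entries in a DataCite XML document, assuming the kernel-4 namespace; the function name is hypothetical, and the embedded sample is a cut-down illustration rather than the real record for 10.14469/hpc/3603 (in particular, the relationType shown is a guess).

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# Illustrative fragment only; the relationType here is an assumption.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.14469/hpc/3603</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.1002/open.201900099</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def related_identifiers(xml_text: str) -> list:
    """Return (relationType, relatedIdentifierType, value) for each relatedIdentifier."""
    root = ET.fromstring(xml_text)
    return [
        (
            el.get("relationType"),
            el.get("relatedIdentifierType"),
            (el.text or "").strip(),
        )
        for el in root.iterfind(".//d:relatedIdentifier", NS)
    ]


if __name__ == "__main__":
    for relation, id_type, value in related_identifiers(SAMPLE):
        print(relation, id_type, value)
```

Feeding the function the text retrieved from the data.datacite.org URL above, instead of SAMPLE, would show which cross-references the registered record actually carries.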

Is there a reason that 10.1002/open.201900099 is included in the full DataCite metadata record at https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/3603 but is not included in the bibliographic version at https://api.datacite.org/dois/text/x-bibliography/10.14469/hpc/3603?style=apa ? Perhaps the latter will soon catch up?

I guess we are not far away from fully symmetric cross citation between data and article!

@rzepa Citation styles typically do not include information about other content, with the exception of parent containers such as book or journal. So don’t expect https://api.datacite.org/dois/text/x-bibliography/10.14469/hpc/3603?style=apa to include https://doi.org/10.1002/open.201900099.

As for Crossref metadata via api.datacite.org, that is a process that will take more time to include all Crossref DOIs referenced by or referencing DataCite DOIs, as it is a workflow we only started in June.

It seems to me that the case for extending citation styles to include (FAIR?) data as content must surely be a strong one, given the general and increasing awareness of FAIR as an important part of the research process. Do you happen to know where the discussions about “citation styles” take place, and whether or not inclusion of data as content is on the cards?

This is where the hard part lies. We don’t have this at the time we create the DOI, because the related publication has not yet been published, and we sometimes do not yet know that it exists. So we have to figure out which ones have publications, keep tabs on when they are published, and then update the relatedIdentifiers information. If the publication cited the DOI, then maybe there is hope for identifying what needs to be updated, but sometimes the authors don’t do this, or don’t do this in a way that is picked up easily by software looking for these cross-references. Still much to be done!

@amyhodge we have in the past discussed with Crossref how the coordinated registration of DOIs for publication and dataset can be improved. On the DataCite side this means, for example, registering the DOI for the dataset early on, but setting the state to registered so that the DOI is not included in DataCite Search, and metadata are not exposed prematurely.
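A minimal Python sketch of what such an early registration payload might look like is given below. It assumes the JSON:API shape used by the DataCite REST API and my understanding that an "event" attribute of "register" moves a DOI to the registered state while "publish" makes it findable; please verify this against the current API documentation. The function name, DOI and landing page URL are hypothetical, and the other required metadata fields are omitted for brevity.

```python
import json


def registration_payload(doi: str, url: str, event: str = "register") -> dict:
    """Build a JSON:API style body for creating or updating a DOI.

    event="register" is assumed to give the registered (not findable) state;
    event="publish" is assumed to make the DOI findable later on.
    """
    if event not in {"register", "publish", "hide"}:
        raise ValueError(f"unexpected event: {event}")
    return {
        "data": {
            "type": "dois",
            "attributes": {"doi": doi, "url": url, "event": event},
        }
    }


if __name__ == "__main__":
    # Hypothetical DOI and landing page, for illustration only.
    early = registration_payload("10.1234/example-dataset", "https://example.org/dataset/1")
    print(json.dumps(early, indent=2))
```

Once the article appears, the same record could be updated with the relatedIdentifier for the publication and the event switched to "publish".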

Can I ask a question about media types used for content negotiation? Although these can be queried using Elasticsearch (https://search.datacite.org/works?query=media.media_type:*), these declarations are not explicit in the XML manifest for the metadata.

Did I see somewhere a project to explicitly incorporate them into the XML schema?

Media types are not part of the DataCite metadata, and there are currently no plans to add them.

We will soon announce that we will deprecate custom media types in DOI content negotiation, as DOI content negotiation is really about getting DOI metadata in different metadata formats, rather than the content. And custom metadata formats create a lot of overhead and confusion.

Registration and retrieval of media via API will of course continue to be supported, and requests with unknown media types will be forwarded to the registered landing page, where content negotiation can again happen.

The timeline for these changes is to end support via data.datacite.org on January 1st, and to end support via doi.org on October 1st. There will be a blog post with more info next week.

So to be clear, will a search of the type

https://search.datacite.org/works?query=media.media_type:chemical/x-mnpub*

stop working or still continue?

Your query calls the REST API and not content negotiation, so it will continue to be supported.

The REST API query might look different in the future. The /works endpoint belongs to an older version of the API (we use /dois in the current version), and the query syntax might change. But that is independent of the changes to content negotiation, and no changes on that end are planned in 2019.