For high-volume data such as Earth System Model output, data is usually cited at the level of collections consisting of many datasets (e.g. all datasets of a model run), whereas usage statistics are based on download information for individual datasets. The COUNTER Code of Practice for Research Data implicitly assumes that DOIs are registered for datasets. How can providers of high-volume data report data usage (or rather data views and data downloads) according to the COUNTER standard?
In Earth System Modeling, many variables are required to analyze the simulated climate of a future projection. Therefore, all variables created in the same model run or across an ensemble of model runs usually receive a single DOI, which is cited in scholarly articles and included in their reference lists. Moreover, several models in different configurations contribute their data to international research projects such as CMIP (Coupled Model Intercomparison Project) or CORDEX (Coordinated Regional Climate Downscaling Experiment). A common question from project management about data usage is therefore: Which climate variables were downloaded the most? The answer feeds into the data request for the next phase of the project. Another use case is our annual reporting on data downloads, where we count dataset downloads and the volume of downloaded data. Download information at the dataset level, i.e. at sub-DOI granularity, is thus essential.
Our use case in COUNTER terminology:
- Component = File (in binary community formats)
- Dataset = Time series of a variable consisting of one or more files. This granularity is suitable for statistics, versioning and tracing/provenance information.
- Collection = ~100 to ~10000 datasets belonging to a numerical model run or an ensemble of model runs. This data is created from one source (a numerical model) and for one research question, and is therefore suitable for data citation (DOI).
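To illustrate the hierarchy above and the "most downloaded variables" question, here is a minimal Python sketch. All class names, field names, identifiers and the placeholder DOI are assumptions made up for this example; they are not taken from the COUNTER standard or from our actual system.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """Time series of one variable, stored in one or more files (components)."""
    dataset_id: str                      # hypothetical local identifier
    variable: str                        # e.g. "tas" or "pr"
    files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    """All datasets of a model run or ensemble; this level carries the DOI."""
    doi: str                             # placeholder DOI, not a real one
    datasets: List[Dataset] = field(default_factory=list)


def downloads_per_variable(collection, download_log):
    """Count dataset downloads per variable.

    download_log is assumed to be a list of dataset ids, one per download.
    Ids that do not belong to the collection are ignored.
    """
    variable_of = {d.dataset_id: d.variable for d in collection.datasets}
    counts = Counter()
    for dataset_id in download_log:
        if dataset_id in variable_of:
            counts[variable_of[dataset_id]] += 1
    return counts
```

Calling `downloads_per_variable(collection, log).most_common()` would then rank the variables of one collection by number of downloads, which is exactly the information that is lost when everything is aggregated at the DOI level.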
How can the COUNTER standard be applied to our data?
1. Aggregate all information at the DOI granularity:
This option is easy to implement, but all information at the dataset level is lost, although it is essential for many statistical applications. The usefulness of the resulting download information is therefore questionable.
2. Provide the COUNTER information for datasets:
In this case, rich information on data views and data downloads is provided. However, what is the ‘dataset-id’? We can provide URLs leading to information on the specific dataset, but we would also like to keep a reference to the DOI to which the dataset belongs, as a relation of type ‘IsPartOf’.
3. Provide COUNTER information at both the DOI and the dataset level:
If we provide one report per DOI data collection, the DOI could be specified as part of the general information of the report. This would be an extension of the standard, since mixing report information with reported content is normally to be avoided. The body of the report would be the same as in option 2, but no reference to the DOI would be needed within the individual dataset entries.
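To make the options more concrete, here is a rough Python sketch of what the records could look like. Apart from ‘dataset-id’ and ‘IsPartOf’, which are mentioned above, all key names (`header`, `collection-doi`, `total-downloads`, `relation`, ...) and the input format are assumptions for illustration only and do not reproduce the official COUNTER report schema.

```python
def dataset_record(stat, collection_doi=None):
    """Build one dataset entry from a dict with 'url', 'views', 'downloads'.

    If collection_doi is given, the entry also carries an 'IsPartOf'
    relation to the collection DOI (dataset-level reporting).
    """
    record = {
        "dataset-id": stat["url"],       # URL of the dataset landing page
        "views": stat["views"],
        "downloads": stat["downloads"],
    }
    if collection_doi is not None:
        record["relation"] = {"type": "IsPartOf", "doi": collection_doi}
    return record


def collection_report(collection_doi, stats):
    """Build one report per DOI collection.

    The DOI and the aggregated totals go into the (hypothetical) header;
    the body lists the datasets without repeating the DOI in each entry.
    """
    return {
        "header": {
            "collection-doi": collection_doi,
            "total-views": sum(s["views"] for s in stats),
            "total-downloads": sum(s["downloads"] for s in stats),
        },
        "datasets": [dataset_record(s) for s in stats],
    }
```

The header totals correspond to pure DOI-level aggregation, `dataset_record(stat, doi)` corresponds to dataset-level reporting with an ‘IsPartOf’ relation, and `collection_report` combines both levels in a single report per DOI.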
Is anyone dealing with the same problem, or does anyone have a view on this?