Deduplicating names in the ROR data

stevecanham · April 23, 2021, 2:36pm

Hi

As 2 of n in posts about our project at ECRIN investigating ROR data…
We decided that to try and match the organisations in the ROR data with organisations in our own database, it was necessary to aggregate the various names under which organisations are listed in ROR, to form an ‘all_names’ table. This required aggregating the names present in the main organisation records with those listed as aliases, acronyms, and labels. But doing that gave rise to several hundred duplicate records, which need to be removed before the aggregate table can be used.
N.B. ‘Duplicate’ organisation names refer to the same name being applied to the same organisation (clearly duplicate names arise naturally across different organisations). There are two sets of duplicates to be considered, those within and those between the 4 source tables.

a) Within the labels table, 32 organisations have the same label applied 2, 3 or in one case 5 times to the same organisation, with a different language code. In most cases the additional language listed is clearly wrong. The attached document lists the delete statements required to remove the excess records.
b) Within the aliases table, 6 records have the same alias listed twice. Record sequence numbers are used to delete the version with the highest number.
c) Within the acronyms table, one organisation has trhe same acronym listed twice. Again sequence numbers can be used to delete the version with the highest number.
d) 65 aliases have the same name as that given for the main organisation record and need to be removed.
e) 359 aliases also appear in the labels records for the same organisation and can be removed.
f) 97 labels have the same name as that given for the main organisation record and need to be removed. As the labels record has the additional language code information, this is first preserved, with the languiage data copied across from the labels table. The duplicated labels can then be removed.
g) Finally 19 of the acronyms are also present as the main name, as an alias or as a label. These too need to be removed.

Once these operations have been completed the names and ids of all names can be aggregated, from all 4 tables, to creat an ‘all_names’ table with 192,205 records.
If the SQL statements used to perform the deletions above are of any use to anyone, they can be found at ror_data/Name deduplication 21040423.txt at master · ecrin-github/ror_data. This is a plain txt file and includes just the Postgres SQL rather than C#.
Cheers
Steve