Using OpenRefine to assess open research at your organization

The goal of this post is to share the process I used to collect open research publications data, using primarily OpenRefine, to understand the impact of our Open Access (OA) Publication Policy at the Michael J. Fox Foundation (MJFF). Inspired by the Year of Open Science, it is my hope that the practical recipes shared here can help others assess the impact of open research at their organizations. 

Tools used

For this work, I used three free tools which you can either use online or download to your computer:

  • Google Sheets - A spreadsheet program included as part of the free, web-based Google Docs Editors suite offered by Google. Wikipedia

  • OpenRefine - An open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. Wikipedia

  • Publish or Perish - A software program that retrieves and analyzes academic citations. 

Databases searched

Databases searched were selected based on indexing strength (in biomedical research) and the number of results they returned. Wikipedia has a list of academic databases and search engines, but I focused on a core list that included Google Scholar, Dimensions, Europe PMC, Scopus, and Web of Science, highlighted in this recent study. I searched all available fields in these databases for the string - Michael J “Fox Foundation”.

I used Publish or Perish to automate the process of exporting records from Google Scholar queries. Due to Google Scholar rate limitations, only the first 1000 records can be exported, but at least you can filter your queries, for instance, by year. To export records and download data from Dimensions, registration is required; the free version limits users to exporting 500 records at a time. Filtering is also available, by year, and/or OA status. Scopus and Web of Science are available via subscriptions (e.g., libraries).

Reconciling database results via DOIs 

With the exported records from each database, I used the digital object identifier (DOI) fields to find unique matches and reconcile the DOIs into one merged list. First, I uploaded the database-exported data, in either Excel or CSV format, to separate tabs in the same Google Spreadsheet. I then used the VLOOKUP function and recipe featured in Image 1 to find matches and unique DOIs. For details on the VLOOKUP recipe used, see Cross Checking/Finding Duplicate DOIs below. 

Image 1: Example Google Sheet demonstrating the VLOOKUP DOI matching approach. 

Retrieving Crossref metadata from the reconciled list of DOIs

I turned to Crossref, with favorable metadata license information, to retrieve metadata for the reconciled DOIs. Crossref is an official DOI registration agency for publisher metadata (including journal article metadata) in addition to other sources such as preprints, grants, and standards. I used OpenRefine, a free, open-source tool used for wrangling data, to import the reconciled DOIs and fetch associated metadata via the Crossref application programming interface (API) in JSON (Javascript Object Notation), an open standard file format and data interchange format. To learn more about OpenRefine and how to use it, see Library Carpentry: OpenRefine. I then parsed the imported JSON into separate fields using the recipes found in the  Crossref GREL Expressions section below. What is particularly helpful about OpenRefine is that you can facet by these fields (e.g., Select a column, Facet -> Text Facet) and sort by counts. In some cases, like the Subject field, you may need to Edit Cells -> split multi-valued cells by a delimiter, e.g. semicolon, before using the Text Facet option. 

Image 2: Screenshot of OpenRefine where Crossref fields have been retrieved from the reconciled DOIs. Download the example OpenRefine project and Import (the) Project, not Create (the) Project, in your OpenRefine application to see the recipes via the Undo / Redo tab. 

Using Unpaywall to find open access information connected to DOIs

Another helpful service that can be used to enhance Crossref data is Unpaywall, where content is harvested from over 50,000 journals and OA repositories, but it is handy for OA status and links to OA versions of articles. Once again I used OpenRefine and DOIs to fetch and parse Unpaywall data. See Fetch Unpaywall Metadata from DOIs and Unpaywall GREL Expressions below.

Image 3: Screenshot of OpenRefine where Unpaywall JSON and fields have been retrieved from the reconciled DOIs. Download the example OpenRefine project and Import (the) Project, not Create (the) Project, in your OpenRefine application to see the Unpaywall recipes used via the Undo / Redo tab. 

Data/software citations from DataCite

I queried DataCite, an international DOI registration agency, similar to Crossref, but geared towards data citation and associated metadata, to see if there were relationships between the list of reconciled DOIs and any data/software citations. The query returned the number of relationships between the DOI queried and additional DOIs and their URLs. See the Fetch DataCite Relationships from DOIs section for the OpenRefine recipes. To determine the sources of the DOIs, see the section on Leveraging the Crossref/DataCite APIs for Prefix Information

Note: Download the example OpenRefine project and Import (the) Project, not Create (the) Project, in your OpenRefine application to see the DataCite recipes used via the Undo / Redo tab

Challenges with data/software citations

The sharing of data and software by authors across publishers and journals is not standard and rates of sharing remain low. Data/software sharing comes in various forms (slide 12), and unfortunately, it is still common to see statements such as “data available upon reasonable request”. Persistent identifiers such as DOIs and Research Resource Identifiers (RRIDs), are extremely valuable to sharing data/software in a Findable, Accessible, Interoperable, and Reusable (FAIR) manner. This means we can unlock the potential of the underlying metadata associated with their DOIs to facilitate the discovery and reuse of data/software. To unlock this potential, with a DOI, authors can generate a citation from a repository (e.g., Zenodo) where they have preserved their data/software, cite it in their availability statement (slide 18), and include the citation in their references. Publishers can tag these citations and Crossref can make these paper-data/software relationships available via Event Data. It doesn’t always work this way though, and so we have a number of initiatives working towards mining this information from research publications (e.g., Open Global Data Citation Corpus, Software Mentions, policy compliance, and tracking). The Additional Ideas section below also has some ideas/recipes if you’d like to attempt mining data/software DOIs/RRIDs from research publications yourself. 

Conclusion

I hope the recipes and ideas below are helpful. When I started this process, not all this information was readily available and so I am sharing it with others that might be trying to do similar work. If you would like to visualize and share the bibliographic information you’ve collected in OpenRefine, I used the Lens.org Collections feature to create a dashboard (after registering) from the DOIs acknowledging MJFF. 

Throughout this process, it was not always apparent if/how I could contribute information back to the systems mentioned above to improve the discovery of MJFF-funded research. For instance, I learned that you can email support@unpaywall.org if you want to update OA information in Unpaywall. I also learned that updating MJFF grant information in Crossref and tying it to research publications can improve the discovery of MJFF research in the systems that rely on their data. In the spirit of the Year of Open Science, let’s work together to make it easier for the community to contribute information back to these systems. 

In closing, persistent identifiers (PIDs), particularly DOIs, were invaluable to matching and reconciling open data from the different systems I used and I especially want to thank Unpaywall, DataCite, and Crossref. I also want to thank Michelle Durborow (Michael J. Fox Foundation), Chelle Gentemann (NASA), Kelly Stathis (DataCite), Elisha Wood-Charlson (KBASE/Lawrence Berkeley National Lab), Sonya Dumanis (Aligning Science Across Parkinson's), Juliane Schneider (Pacific Northwest National Laboratory), Christine Kirkpatrick (San Diego Supercomputer Center), Kimberly Mann Bruch (San Diego Supercomputer Center), Rachael Lammey (Crossref), Kristen Ratan (ICOR), Martin Magdinier (OpenRefine), Yuhan Rao (NCICS), and Tom Morris (OpenRefine) for their help and feedback. 

Recipes

Cross Checking/Finding Duplicate DOIs 
For cross-checking DOIs from database exports, use VLOOKUP in Google Sheets. Example:

=VLOOKUP($A2,DB!$A$2:$A$100,1,FALSE)
$A2 - Cell with DOI where you are trying to match against a range of DOIs
DB! - The name of the database worksheet/tab that you are pulling the range of DOIs from
$A$2:$A$100 - Range of cells in the database worksheet/tab that contain DOIs to check against
1 - Index, returns the DOI if it is a match
FALSE - Looks for an exact match
$ - For absolute cell referencing 

Fetch Crossref Metadata from DOIs 
In OpenRefine, choose the column with DOIs. Use Edit Column > Add column by fetching URLs. Add the GREL expression below:

"https://api.crossref.org/works/"+value

Make sure the DOI ‘value’ is in the following format:

10.7554/eLife.79771

Remember to name the new column that will be created, for instance, CrossrefJSON, and consider a 3000 or 4000 throttle delay (to speed up the process). Also, consider faceting your search depending on how many records/DOIs there are to fetch.

You can also append your email address to the API call, for example:
"https://api.crossref.org/works/"+value+”&mailto=name@xyz.com”

Crossref GREL expressions
The following are OpenRefine GREL expressions to extract data from the CrossrefJSON into separate columns using Edit Column -> Add column. 

Publisher
value.parseJson().message.publisher

Title
value.parseJson().message["title"][0]

Journal Title
value.parseJson().message["container-title"][0]

Abstract
value.parseJson().message.abstract

Published Date
value.parseJson().message.published["date-parts"][0]
Year
value.parseJson().message.published["date-parts"][0][0]
Month
value.parseJson().message.published["date-parts"][0][1]
Day
value.parseJson().message.published["date-parts"][0][2]

ISSN (Electronic)
value.parseJson().message["issn-type"][1].value

Type
value.parseJson().message.type

Author
forEach(value.parseJson().message.author,v,v.family + ', ' + v.given).join(';')

Affiliation
forEach(value.parseJson().message.author,v,v.affiliation[0].name).join(';')

Referenced By
value.parseJson().message["is-referenced-by-count"]

Subject
forEach(value.parseJson().message.subject,v,v).join(';')

Has Preprint
value.parseJson().message.relation["has-preprint"]

BibTeX/Citation
From DOI, Edit Column > Add column by fetching URLs
"https://api.crossref.org/works/"+value+"/transform/application/x-bibtex" 
"https://api.crossref.org/works/"+value+"/transform/text/x-bibliography" 

Author Count
From Author, Edit Column > Add column based on this column…
value.split(";").length() 

Fetch Unpaywall metadata from DOIs
Similar to the Fetch Crossref Metadata from DOIs section, you can retrieve OA status and other information from Unpaywall using OpenRefine and DOIs.

Edit Column > Add column by fetching URLs
"https://api.unpaywall.org/v2/"+value+"?email=email@xyz.com"

Unpaywall GREL expressions
The following are OpenRefine GREL expressions to extract data from the UnpaywallJSON into separate columns using Edit Column -> Add column. 

OA Status
value.parseJson().oa_status

License
value.parseJson().best_oa_location.license

Repository Copy
value.parseJson().has_repository_copy

URL
value.parseJson().best_oa_location.url

PDF
value.parseJson().best_oa_location.url_for_pdf

URL Landing Page
value.parseJson().best_oa_location.url_for_landing_page

Fetch DataCite relationships from DOIs
Similar to the Fetch Crossref Metadata from DOIs section, you can retrieve dataset to scholarly article relationships from DataCite via OpenRefine using DOIs.

Edit Column > Add column by fetching URLs
"https://api.datacite.org/events?citation-type=Dataset-ScholarlyArticle&source-id=crossref&subj-id=https://doi.org/"+value

Alternatively, to potentially capture more relationships, the following are broader searches:
“https://api.datacite.org/events?citation-type=Dataset-ScholarlyArticle&doi="+value
“https://api.datacite.org/events?doi="+value

Total (Relationships)
value.parseJson().meta.total

Object (Data/Software) Relationship URLs
forEach(value.parseJson().data,v,v.attributes["obj-id"]).join(';')

Visualize relationships via DataCite Commons:
E.g., https://commons.datacite.org/doi.org/10.5880/icgem.2015.1

Leveraging the Crossref/DataCite APIs for prefix information
The following API calls can be used to retrieve prefix information for DOIs, or in other words, the repository names for DOIs. 

Look Up Crossref Prefix Information
https://doi.crossref.org/getPrefixPublisher/?prefix=10.7554 (Life)

Retrieve DataCite Client (Repository) ID 
Example: https://api.datacite.org/prefixes/10.5281 (Zenodo)

Look Up DataCite Client ID
Example: https://api.datacite.org/clients/cern.zenodo 

Additional ideas
The following are additional data sources/approaches that can be explored.

Directory of Open Access Journals (DOAJ):
Example: https://doaj.org/api/search/journals/2073-4409 

OpenRefine Reconciliation Services
https://reconciliation-api.github.io/testbench/#/ 
Examples: Wikidata, ROR

OpenAire Research Graph
Examples: https://graph.openaire.eu/develop/api.html 

Scholix (Event Data)
https://www.eventdata.crossref.org/guide/app-scholix/ 
Example: https://api.eventdata.crossref.org/v1/events/scholix?subj-id=10.7554/eLife.79771 

Extracting DOIs (and RRIDs) from HTML
Edit Column -> Add column based on this column (Python / Jython)
import re 
matches = re.findall(r"\b10\.\d{4,9}/[-.;()/:\w]+", value) 
return ";".join(matches)

A regular expression for RRIDs: 
RRID\:\S*

RRID API
Example: https://scicrunch.org/resolver/RRID:AB_2314866.json 

Extract Country from Affiliation
Examples: https://forum.openrefine.org/t/geolocation-country-from-affiliation-or-address-string/273/1 

Mining PDFs
Thread: https://forum.openrefine.org/t/mining-information-from-pdf-urls-using-regex/278/1 

GROBID - GeneRation Of BIbliographic Data
https://grobid.readthedocs.io/en/latest/Introduction/

Previous
Previous

ORCID & GO FAIR US: Collaborators Working to Realize a FAIR Data Ecosystem

Next
Next

NSF Awards $1.26 Million to SDSC to Extend FAIR Research Community Activities