On Friday, I posted an article on self-service scientific data repositories. I worked for the next couple of days on redoing a portion of our flagstone patio that was all off level and needed to be torn up, dug out, repacked, and laid all over again plus a number of other projects in the area. I’m sore as hell today, but it is looking nice.

Flagstone patio and porch from the driveway

Flagstone patio from the porch

It also gave me the opportunity to think all day (at least when I wasn’t concentrating on fitting the flagstone puzzle pieces together), and I spent part of that time pondering the dynamics of releasing scientific data and the tension between what the data librarians want that universe to look like and what publishing scientists are willing to stomach. My wife and I also took a lovely walk in the neighborhood yesterday morning, having opted to stay at home, work on house projects, and relax at our home in the mountains instead of going further into the mountains with everyone else this Independence Day weekend.

My wife recently retired from a career as a neonatal nurse practitioner. When I shared my thoughts on our walk, she offered a nice parallel from her field. As part of their pursuit of evidence-based medicine, the group of NNPs and neonatologists she worked with got together every few months in a “journal club” to review the latest literature and determine if any new evidence warranted a change in their medical practice. Medical librarians from one of the hospitals in the group would participate by assembling all of the papers on a given topic along with relevant data and making them available to the NNPs and neonatologists, who would divvy them up and prepare summaries for group discussion. The librarians would also participate in the discussions, helping to connect the dots with the preparatory work they were doing and improving their ability to pull together and organize research for the next topic.

Another parallel she brought up in our discussion was the Cochrane Library, a synthesis of medical research evidence assembled into an analyzable database system. Groups like the neonatology bunch here in Denver and countless others in many fields contribute to and draw from this synthesis every day to improve healthcare and standard of practice. This came up when we were discussing the central motivation that I believe drives most of our data management professionals in earth system science: a desire to enable more reproducibility and reusability of the data being collected and produced. I am a wholehearted supporter of and participant in this motivation, but my concern is that we’re asking some of our scientists, at the point of their initial publication and data release, to shoulder too much of the burden in eventual data integration and synthesis.

One other interesting thought from our discussion was my wife’s reaction to the concept of metadata. While data and metadata that describe them are a regular part of my daily life over the last few years, it’s a topic I’ve rarely discussed outside my work circle. In trying to explain the issue of repositories that demand complete standard metadata encoded in just the right way vs. those that demand nothing more than a working data file and basic description, I had to lay out the basic premise of what we mean by metadata. In discussing the important parts of metadata like the description of attributes contained in the data and methods for collection and processing, she basically said, “well, that’s just part of the data, right?” – aligning well with the first principle I posed in my previous post.

So, speaking of the different types of repositories, I mentioned Figshare as one of those that has been operating for a while. Amongst the group of earth science data people that I hang out with from both government and academic institutions, I’ve sometimes heard Figshare get a pretty bad rap as an uncurated pile of stuff that is poorly documented, not well managed, and of dubious utility. One colleague from Columbia University, for whom I have the greatest respect, often promotes the idea that scientific data repositories should really only be domain-specific, operated by professionals in particular scientific domains who have the ability to understand a particular type of data and get it organized in a way that is most useful and understandable for that community. By that argument, resources like Figshare and Dryad are presumably not as useful because they are simple online file shares. However, in browsing around a bit yesterday, I saw mostly positive reviews by the community of research scientists using Figshare.

So, how much are the various data repositories being used? How often are the hosted datasets cited? What is being cited from the repositories (many of which host everything from datasets to author copies of manuscripts, presentations, and posters)? How are the cited datasets being used? How does the number of hosted datasets (shown by registered DOIs) compare between the different repositories? How do the citation records of self-service and carefully curated repositories compare?

To dig into these questions a little bit, I used the DataCite API to grab registered DOI metadata for some of the interesting repositories and dumped them to a database for some quick analysis. I used the following DOI prefixes of interest. The prefix is a decent enough way to tease out specific facilities or organizations, though a prefix may not represent the full holdings of a given organization.

  • Figshare – 10.6084
  • Dryad – 10.5061
  • USGS – 10.5066

Pangaea (10.1594) was also interesting to me, but I didn’t want to take the time to deal with their large volume (700K+ registered DOIs). That DOI prefix actually represents a number of different facilities who all share the same DOI registration agent.
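The tally step can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: it assumes the DOI metadata have already been pulled down as simple (DOI, resourceType) pairs, it uses an in-memory SQLite database as a stand-in for whatever database you prefer, and the sample records and the function names `repository_for` and `count_resource_types` are hypothetical.

```python
import sqlite3

# DOI prefixes listed above, mapped to the repository each one stands for.
PREFIX_TO_REPOSITORY = {
    "10.6084": "Figshare",
    "10.5061": "Dryad",
    "10.5066": "USGS",
}


def repository_for(doi):
    """Return the repository name for a DOI, or None for an unknown prefix."""
    prefix = doi.split("/", 1)[0]
    return PREFIX_TO_REPOSITORY.get(prefix)


def count_resource_types(records):
    """Load (doi, resourceType) pairs into SQLite and count them per repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dois (repository TEXT, resourceType TEXT)")
    # Keep only DOIs with a known prefix; store a missing type as an empty string.
    rows = [
        (repository_for(doi), resource_type or "")
        for doi, resource_type in records
        if repository_for(doi) is not None
    ]
    conn.executemany("INSERT INTO dois VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT repository, resourceType, COUNT(*) FROM dois "
        "GROUP BY repository, resourceType "
        "ORDER BY repository, resourceType"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample records for illustration, not real registry entries.
SAMPLE_RECORDS = [
    ("10.5061/dryad.example1", "DataFile"),
    ("10.5061/dryad.example2", "DataFile"),
    ("10.6084/m9.figshare.example", "Dataset"),
    ("10.5066/EXAMPLE", None),
    ("10.9999/unknown", "Dataset"),
]

if __name__ == "__main__":
    for repository, resource_type, count in count_resource_types(SAMPLE_RECORDS):
        print(repository, resource_type or "(blank)", count)
```

Grouping on both repository and resourceType is what produces a table shaped like the one that follows, including the rows where the resourceType was left blank at registration.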

Repository Resource Type Counts

Counts of resource types from different repositories of interest (Figshare, Dryad, and USGS) from the DataCite registry as of July 5, 2015
repository   resourceType        COUNT
Dryad        (blank)               185
Dryad        DataFile            25406
Dryad        DataPackage          6546
Dryad        Dataset                 3
Figshare     (blank)               106
Figshare     Code                  334
Figshare     Dataset             18604
Figshare     Figure               2194
Figshare     Fileset              2162
Figshare     Genome Variation       44
Figshare     Media                 356
Figshare     Paper                4277
Figshare     Poster                723
Figshare     Presentation          691
Figshare     Thesis                169
USGS         (blank)               767
USGS         Collection             15
USGS         Dataset                93
USGS         Event                   5
USGS         Image                   1
USGS         Model                   2
USGS         Other                  41
USGS         Service                16
USGS         Workflow                1

The majority of the items from the three repositories are datasets of some kind with some variation in how they are classified in the resourceType attribute. There is an interesting artifact in the USGS registrations where the blank resourceTypes should actually be datasets.

The DataCite statistics interface also provides a way to examine the DOIs that have been resolved, meaning that someone clicked on a DOI link (presumably in a reference somewhere) and it was dereferenced through the DOI resolver system to access the referenced item. This is a part of the stats package from DataCite that is only available to registered users through the API, so I didn’t bother tracking down how to gain access yet. The stats interface provides the top 10 DOIs from a given prefix that have been resolved. I browsed through these to see what metadata were available behind those top 10 and made a couple of interesting observations.

  • The top 2 dataset DOIs from USGS being resolved are for datasets from my own program area. This is probably just a reflection of the fact that our local program is a bit ahead of the curve on registering DOIs and including them in metadata. None of the USGS DOIs that are getting resolved have information in the DataCite registry showing where they were referenced.
  • Of the four repositories examined (including Pangaea), only the Dryad top 10 show DOIs to journal articles where the datasets were referenced. Those are shown in the table below.

Top 10 DOIs referencing Dryad items

The top 10 journal DOIs referencing Dryad datasets or other items (May 2015)
Dryad DOI              Referenced By
10.5061/DRYAD.3RV62    doi: 10.1126/SCIENCE.AAA4984
10.5061/DRYAD.234      doi: 10.1111/J.1461-0248.2009.01285.X
10.5061/DRYAD.HQ4V0    doi: 10.1126/SCIENCE.AAA8902
10.5061/DRYAD.FP060    doi: 10.1038/NATURE14423
10.5061/DRYAD.S38N5    doi: 10.1371/JOURNAL.PCBI.1004128
10.5061/DRYAD.RQ43R    doi: 10.1371/JOURNAL.PONE.0122092
10.5061/DRYAD.6M653    doi: 10.1073/PNAS.1423853112
10.5061/DRYAD.KP905    doi: 10.1111/MEC.13243
10.5061/DRYAD.9MB54    doi: 10.7554/ELIFE.06664
10.5061/DRYAD.HF56V    doi: 10.1128/AEM.00631-15
10.5061/DRYAD.3M680    doi: 10.1111/MEC.13238
10.5061/DRYAD.4TM8J    doi: 10.1186/S12862-015-0359-4

This table of Dryad DOIs is getting closer to where I’d like to be in answering the questions I posed above. It would take further analysis and reading to understand the specific connection between the articles and referenced datasets. It is also interesting to look at the number of DOI resolutions across the different data centers. Where are all of these coming from? Do they represent real research or other consumer interest in the datasets or other items? This would require additional information from the data facilities themselves to examine web statistics coming from the DOI resolution to see where users are going from the landing pages. I do, however, draw a couple of useful conclusions from this initial look at a handful of repositories:

  • Self-service data repositories like Figshare and Dryad are definitely being used both from the standpoint of data producers and data consumers.
  • References to self-service data repositories include some articles from high impact journals like Science and Nature.
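For a quick sense of how the referencing articles in the Dryad table are spread across publishers, one option is to group the "Referenced By" DOIs by their registrant prefix (the part before the first slash). The sketch below is only illustrative: the DOIs are copied from the table above, the helper names `registrant_prefix` and `count_by_prefix` are my own, and it makes no attempt to look up which publisher or journal sits behind each prefix.

```python
from collections import Counter

# "Referenced By" DOIs copied from the Dryad table above.
REFERENCING_DOIS = [
    "10.1126/SCIENCE.AAA4984",
    "10.1111/J.1461-0248.2009.01285.X",
    "10.1126/SCIENCE.AAA8902",
    "10.1038/NATURE14423",
    "10.1371/JOURNAL.PCBI.1004128",
    "10.1371/JOURNAL.PONE.0122092",
    "10.1073/PNAS.1423853112",
    "10.1111/MEC.13243",
    "10.7554/ELIFE.06664",
    "10.1128/AEM.00631-15",
    "10.1111/MEC.13238",
    "10.1186/S12862-015-0359-4",
]


def registrant_prefix(doi):
    """Return the part of a DOI before the first slash."""
    return doi.split("/", 1)[0]


def count_by_prefix(dois):
    """Count how many DOIs fall under each registrant prefix."""
    return Counter(registrant_prefix(doi) for doi in dois)


if __name__ == "__main__":
    # Print prefixes from most to least frequent.
    for prefix, count in count_by_prefix(REFERENCING_DOIS).most_common():
        print(prefix, count)
```

A prefix identifies the organization that registered the DOI rather than a single journal, so this only gives a coarse view; matching prefixes to specific journals would still take a manual look at each article.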

What I’m essentially after in all this is a thoughtful examination of whether or not we should be considering something simpler, more straightforward, and mostly self-service for repositing and managing some of the data assets being produced by scientists in my particular organization and set of domains. A number of years ago, I helped to envision and engineer a data and information platform called ScienceBase for the USGS. ScienceBase does a lot of different things, but among them is a digital repository function for many of our USGS Science Centers, some of our partner agencies, and a couple of NGOs with which we are affiliated. The underlying information model of ScienceBase can be as simple as Figshare, somewhere in between like Dryad or Pangaea, or more rigorous in terms of formal metadata requirements. Quite a number of the hosted datasets are at the less well documented end of the spectrum, but with a raft of new data management and release policies now in a testing phase, our organization is working toward making the official data releases through ScienceBase (those that get a DOI and are officially blessed for release) much more on the rigorous side of things. We’re hearing rumblings from scientists and managers that the things we’re asking for will create additional barriers rather than enablers, and I’m wondering if we might be taking things too far and asking scientists to do work that rightly belongs in the hands of data librarians.

The Australian National Data Service, which is probably one of the organizations that is farthest along in thinking about this stuff, has quite a bit of information on the role of data librarians and what they term information specialists. They mention a few things that are worth repeating here:

Data librarians are professional library staff engaged in managing research data, using research data as a resource, or supporting researchers in these activities.

 

What do information specialists do?

The role may include supporting researchers or institutional initiatives in the following areas:

  • Data management
    • data management planning
    • issues such as copyright, intellectual property, licensing of data, embargoes, ethics and re-use, privacy
    • storing and managing data during the research project (curation)
    • depositing data in archives at the end of the project, determining retention and disposal
    • open access and publishing of data
    • research organisation policies affecting data
  • Metadata management
    • creating and maintaining metadata
    • developing and applying metadata standards
  • Using data (data as a resource)
    • finding or obtaining data for re-use
    • citing data
    • data analysis tools and support services
    • data literacy

 

I highlighted in orange the roles that I believe our organization is asking research scientists and publishing authors to take on for themselves. There is certainly a recognition that help resources need to be made available, and our organization has done a good job in providing online resources and tools that may be of some assistance. There is also a working notion that these roles may be most efficiently accomplished with an actual Information Specialist or domain-specific data manager who is a part of research labs or science centers. However, the ultimate responsibility for complying with policies and getting these roles fulfilled comes down to local managers and scientists throughout the organization, and we’re consistently hearing anecdotes from these folks that the policies are imposing an additional burden.

So, my open questions are these:

  1. Are we asking our research scientists to develop core competencies that are taking them too far beyond their chosen field of study?
  2. Is understanding and complying with robust metadata standards really necessary for data to be released and made useful in the scientific community?
  3. Should we be thinking about providing a minimalist, self-service platform for data release in addition to accommodating more robust documentation and curation where resources are available?
  4. Should we be putting more energy and resources into developing professional information specialists and/or data librarians in the organization who can take on data management roles?