I got a lovely invite in my inbox recently to the new Mendeley Data system, and it prompted me* to work through some concepts about scientific data release that have been bugging me as my own organization works to implement a number of new policies. Mendeley is a citation management tool and online resource that I used for a while; it was picked up and made part of the Elsevier family in 2013. Back when it first came online, the Mendeley folks had some really nice features, like rapid ways of building up bibliographies and sharing them with colleagues. Over time, I started pointing my students and others more toward Zotero for that sort of thing because it’s easier to use and has better export options, which we take advantage of for aggregating citation information into other systems.

This new offering is part of an overall movement over the last several years by journal publishers and the rest of us in the scientific data enterprise to figure out how to get more of our data online and openly accessible. There have been various community efforts to coordinate on approaches, like the Coalition for Publishing Data in the Earth and Space Sciences that I’ve been involved with. From the U.S. government standpoint, it’s all part of an open government movement that began long ago but has most recently involved the Digital Government Strategy, the OSTP memo on Expanding Public Access to the Results of Federally Funded Research, and the policy on Making Open and Machine Readable the New Default for Government Information (see also https://www.whitehouse.gov/open). I’ve been fairly involved in working out how we’re going to do all of this in the earth system science part of the federal government.

Across the global scientific community, including government institutions, academia, and private foundations, we have an interesting divide between the robust data technology haves and the have-nots. We have large, solidly funded data infrastructures that do a great job managing and providing their data assets, and then we have what is now known in the common vernacular as the long tail of research data (from the tail of the distribution curve). The big data facilities, often centered around major scientific instruments like satellite missions, are operating at a very high level, heading standards bodies to do things like establish methods for Audit and Certification of Trustworthy Digital Repositories (ISO 16363). Slightly less rigorous than the ISO standard but more widely used is the Data Seal of Approval (DSA) effort, of which the USGS is a participating member through our EROS Data Center. The efforts focused on standards and certification for digital repositories go into all kinds of detail about service robustness and long-term preservation. At the other end of the spectrum are efforts like this latest offering from Mendeley, as well as Figshare, Dryad, and Pangaea, plus myriad other small repositories in specific institutions. Some of these long-tail-focused efforts are participants in things like the DSA, and I know from personal conversations that many of them are thinking about some of the same principles promoted by the standards bodies.

Looking over the Mendeley Data deal this morning, I was struck by a few things. It’s basically one simple step above sharing your data from Dropbox. They make it super simple to upload data (as many files of 2 GB or less as you want), all of which get a Digital Object Identifier (DOI) assigned when you “publish” them. There are no metadata requirements other than a title, date, abstract, and author (basically what it takes to mint a DOI), plus whatever is inherent in the upload. The other services I mentioned (Figshare, Dryad, and Pangaea) all have a little more to them, with Pangaea probably the most robust in its metadata requirements and data services. For data management geeks and archivists who spend their time sweating over the completeness of metadata and the dynamics of long-term preservation and usability, the lack of any real metadata requirement in the Mendeley offering is going to be complete heresy. But for the average grad student, postdoc, and many career researchers, offerings like this are pretty darn attractive. For many scientists, the meaning and significance of the data they collect or produce are found in the interpretations and conclusions drawn from them and described in the scientific papers they author. More than a few practicing scientists really only want a super simple way to put the data somewhere online where some tech group will take care of the basics of making them available, a citation reference complete with a DOI, and the ability to put that citation in a paper and count on it working.
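To give a sense of just how little is required to mint a DOI, here is a minimal sketch loosely modeled on the mandatory fields in the DataCite metadata scheme (identifier, creator, title, publisher, publication year). The field values are hypothetical placeholders, and the XML produced is a bare illustration, not a schema-valid DataCite record:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal record for minting a DOI. The 10.5072 prefix is
# DataCite's reserved test prefix, so no real identifier is implied.
record = {
    "identifier": "10.5072/example-dataset",
    "creator": "Doe, Jane",
    "title": "Stream temperature observations, 2014-2015",
    "publisher": "Example Repository",
    "publicationYear": "2015",
}

def to_minimal_xml(rec):
    """Serialize the record as a bare-bones XML document -- an
    illustration of the tiny metadata footprint, not valid DataCite."""
    root = ET.Element("resource")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")

print(to_minimal_xml(record))
```

That handful of fields is essentially the whole barrier to entry, which is exactly why offerings like Mendeley Data feel so frictionless to a busy researcher.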

The Mendeley service, as an arm of Elsevier, is responding to that market demand in a typical and understandably self-serving way. Almost all of the major publishers and journal editors are now requiring data associated with publications to be available online and cited/linked from the publication. Many are starting to make the same demand for scientific software. By providing their own file sharing service, Elsevier can tell submitting authors, “If you don’t have some other option for hosting your data, join Mendeley, load your data there, and cite them in your manuscript.” From their point of view, the journal article is still the primary asset, and the data are still basically supplementary files. Perhaps with “Mendeley Data” they look a little more official than a simple file link listing on the journal article web page, but they are still inextricably linked to the article if anyone else is to be able to use them for something.

And the reality is that many publishing scientists wholeheartedly agree with that notion and are perfectly happy with that kind of solution. One of the critiques I’ve heard over and over as the concept and practice of data citation has evolved is that publishing authors worry about their data being cited more than their interpretive papers, the fear being that this will result in a lower-impact citation record. My personal feeling is that how we evaluate science and rate scientists needs to change, with a thorough evolution of data impact metrics, including altmetrics. I have a bias here in that my own scientific program spends its time building a national-scale synthesis of mostly other people’s data in order to conduct nationally consistent assessments of biodiversity status and trends. I’m also a bit of a technophile, and I really believe that information technology is enabling an acceleration in scientific discovery by increasing the efficiency whereby we climb on the shoulders of giants.

Within the community of scientists and data professionals who spend their time thinking deeply about these issues, I’ve mostly seen a very negative response to this notion and a struggle to make sure that ALL data are fully “first class citizens,” documented and managed with full rigor. After beating our heads against the wall for enough years on this issue, perhaps we should accept that much more of this is simply nailing jello to a wall. Perhaps we can accept that we simply can’t treat all data the same; that some scientific data will be documented mostly within the context of an interpretive paper, and might take a little more reading in order to be understood and potentially used for some new purpose.

In our community of government-based data lifecycle management thinkers for the earth sciences, the gold standard for metadata has become the ISO 191xx family of specifications, with a few holdouts for the venerable FGDC Content Standard for Digital Geospatial Metadata. The notion from many has been that if we can just get every scientist who produces data to learn and employ the standards, produce compliant XML encodings, and release their data with full metadata, we will have achieved scientific data nirvana. We can harvest all these sources into catalogs like Data.gov, or get Google to pay more attention to formal metadata, and make it super simple to find, get, and use all scientific data. We try to create carrots and sticks to make this work – carrots being tools that we think will make it easier to produce compliant metadata, and sticks being policies or even attempted legislation mandating that everyone do it. In my opinion, neither has worked very well. The tools are still too cumbersome, mostly because our motivation is compliance with the standard rather than an enjoyable, value-adding experience for the customer. We’ve now got the policies from the highest levels of the federal government, but a stick without a carrot – or a stick beating from behind while a massive barrier blocks forward progress – is simply abuse.

By and large, across most disciplines (in my world, at least), scientists are motivated to get our data online and accessible. Increasingly, as we encode our thinking into algorithms and scientific software, we are interested in making that part of the scientific record as well. Journal publishers and editors are now requiring that we make both data and software available as part of publications, and offerings like Mendeley Data will probably be followed by “Mendeley Software” sometime down the road. We already have lots of vehicles for getting our data and software out and citable, including DOIs for GitHub repositories. Many of these tools are available and being used by our government scientists, somewhat under the radar, because they are easy to use, we hear about them from colleagues in other institutions, and no one has yet put the hammer down to force us to do something else. USGS and other government earth science agencies do have an incredible legacy and a legislated responsibility for long-term data management and accessibility, and the reality is that these organizations should provide mechanisms for releasing and managing data and software through government-managed assets. But if our thinking doesn’t catch up with the reality of how scientists think about the data and software they produce and use, and our tools and methods don’t line up with the readily available public offerings, we’re going to continue to be more of an annoying hindrance to scientific progress than an enabler and accelerator.

So, enough philosophizing about all this. I pose here a few principles for data documentation and release that might help codify what I believe to be the reality of data handling.

  1. Focus on metadata that are actually useful in understanding and working with data, and put those attributes into the data themselves in a structured and software-usable way. Think about what we need to know about how the data were collected or generated in order to use them in analyses, and record those details in the data in some way depending on the type or structure of the data. When it comes time to release data or advertise them through some catalog, the attributes in the data can be extracted and encoded to something like the ISO19115 standard using software as opposed to filling out a form.
  2. When it’s possible for formal structured metadata to be part of the scientific workflow, such as through metadata produced by instruments or embedded in other data analysis tools, that’s a great situation and we should take advantage of it to get formal metadata as part of the data package or data stream.
  3. When formal structured metadata would be a burden and major learning curve for publishing scientists but they are perfectly willing to post their files with whatever documentation they have generated (including hopefully more of no. 1), provide an easy way to get this done following the patterns of Mendeley Data, Figshare, and other really simple tools. The peer review process that will include the data as well as the interpretive manuscript will help sort out any issues in being able to understand and use the data. If the reviewers can’t figure out how to use the data to check on the conclusions in the paper, they will likely say something in their review comments, prompting the authors to make improvements prior to publication and setting the pattern for future data release.
  4. If we really believe that data and software should be more prominent as scientific assets in their own right, in addition to interpretive papers, then we need to quantify, study, assess, and report on this dynamic. How are data and software being cited, what’s being done with them, and why? How do we measure this in terms of scientific impact, both within the larger scientific community and within our own unique subcultures?
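Principle no. 1 above can be sketched concretely. Here is a minimal, hypothetical example: descriptive attributes carried inside the data file itself (as a JSON header line on a CSV), which software can then extract and map onto a catalog record. The attribute names, file layout, and the simplified ISO 19115-flavored element names are all illustrative assumptions, not any particular repository's actual convention:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A hypothetical data file that carries its own descriptive attributes
# in a JSON header comment, followed by the tabular payload.
data_file = """\
#{"title": "Site water levels", "abstract": "Daily gage readings.", "author": "Doe, Jane"}
site,date,level_m
A1,2015-06-01,1.42
A1,2015-06-02,1.38
"""

def read_with_metadata(text):
    """Split the embedded JSON header from the tabular payload."""
    header, _, body = text.partition("\n")
    meta = json.loads(header.lstrip("#"))
    rows = list(csv.DictReader(io.StringIO(body)))
    return meta, rows

def to_iso_like_xml(meta):
    """Map the embedded attributes onto a simplified, ISO 19115-flavored
    record -- generated by software, not filled out on a form."""
    root = ET.Element("MD_Metadata")
    ET.SubElement(root, "title").text = meta["title"]
    ET.SubElement(root, "abstract").text = meta["abstract"]
    ET.SubElement(root, "pointOfContact").text = meta["author"]
    return ET.tostring(root, encoding="unicode")

meta, rows = read_with_metadata(data_file)
print(to_iso_like_xml(meta))
```

The point of the sketch is the direction of flow: the scientist records attributes once, where the data live, and the catalog encoding falls out automatically at release time.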

* Though it may be obvious from reading this blog and elsewhere online that I work for the USGS, the thoughts and opinions expressed are my own and are not reflective of anything official from the USGS or the U.S. Government.