In my area of earth science, we are working to evolve our thinking about how we publish the results of our research programs, with much of the current focus on data and software as part of the publication process. Our larger community has been debating this for several years, and a seminal contribution to that debate is the 2013 paper Is Data Publication the Right Metaphor? by Mark Parsons and Peter Fox. In the paper and the community discussion leading up to it, Parsons and Fox argue that data are a different kind of thing than the interpretive discussions that go into a scientific paper, and that perhaps we should think about more simply releasing our data with the appropriate documentation and in formats people can use. They raise some interesting questions about what peer review for data actually means, how feasible it might be for really large or complex datasets, and what’s really required to make usable scientific data available. The USGS has generally taken this concept to heart with our new policy on the matter, Review and Approval of Scientific Data for Release, and we’re currently struggling through how review and approval really work across the wide breadth of data that we produce and interpret.
With that policy released, we are now working through the dynamics of actual implementation. I’ve written previously about self-service data repositories and some questions about the data librarian job, and this post is part of my process of trying to make sense of what we are doing and offer some perspective that I can share externally with colleagues from other institutions. What I’m pondering here is whether the publish and release metaphors are really an either/or choice or whether we need both. As we work through implementing the new data release policy, we are also struggling with the legacy of a specialized report series, the Data Series report. The Data Series is similar in many ways to (and predates by several years) special journals like Nature: Scientific Data, where we essentially have a paper describing data and a pointer to some online location where the data can be accessed. Like every other science organization, USGS has also released data as supplementary files with others of its “Series Reports” and with journal articles. Even before our official policy on data release opened the door for all datasets to be released directly to the web as their own entities, some data that were large and complex enough were released via an online application or through one of the larger repositories like the EROS Data Center’s Long Term Archive.
One repository in USGS that is working through the dynamics of data release is ScienceBase, which has set itself up as a potential organization-wide resource. The team working on this issue is seeing some interesting data packages come along, like this one on Airborne Geophysical Surveys over the 2011 Mineral, Virginia, Earthquake Area (http://doi.org/10.5066/F78K773V). The author put together what is essentially a data paper (Background Information and Analysis Results) that, from her perspective, represented the important metadata for the data product. From the perspective of the data repository, the ScienceBase team worked with the author to refine the XML-encoded metadata, packaging details, and other trappings of the overall dataset. It is packaged together as a set of related ScienceBase items that are accessible through a web page.
By and large, this is all fine. The author is happy to have her data online and accessible, and the ScienceBase team is happy to have worked through an example like this and provided a useful service. (Note: You can browse a public JIRA with discussions on the data release issues the ScienceBase Team is tackling.) There are a couple of issues with this situation and similar ones that have me thinking about the whole data publication vs. data release dynamic.
1) We didn’t improve the technology
The end result of the geophysical survey data release and a few others that have been done is really not substantively different from the many Data Series reports that are already online, such as http://doi.org/10.3133/ds898. From a technical standpoint, one of the major problems with what we’ve been doing for years with the smaller datasets that can be released as USGS pubs is that they are essentially just piles of “supplementary” files sitting online somewhere, not cataloged or discoverable as data apart from their “report.” They sometimes have metadata, but, as in this case, the metadata files often sit in a downloads directory and cannot be cataloged or advertised in some other way for easy discovery and access through software. We still have not solved the data discovery problem in our earth science community (and many other domains), as evidenced by the many talks on the subject at AGU and elsewhere. Part of this is due to cases like this one, where actual data content is deep in file systems and generally opaque to both purpose-built data catalogs and search engines.
We want to get to a point where someone can not only quickly discover multiple datasets of relevance across disparate repositories but can also write code to crawl through the mounds of long-tail data we have online in heterogeneous collections from across multiple earth science organizations to discover and quickly begin using data that may be pertinent to some new synthesis or line of inquiry. ScienceBase provides the technical capability to do just that, but we haven’t taken full advantage of that ability in this or any of the other cases that have come online so far.
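To make the "write code to crawl" idea a little more concrete, here is a minimal Python sketch of the kind of walk I have in mind. It is not the ScienceBase API. The catalog is an in-memory stand-in, and the item shape (`id`, `title`, `files`, `children`), the function names, and the example.org URLs are all assumptions made up for illustration. In practice, `get_item` and `get_child_ids` would wrap calls to whatever repository is being crawled.

```python
from collections import deque
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical in-memory stand-in for a repository. No network calls are made;
# the item shape and URLs are invented for this sketch.
FAKE_CATALOG: Dict[str, dict] = {
    "pkg": {"id": "pkg", "title": "Survey package",
            "files": [{"name": "background.pdf",
                       "url": "https://example.org/background.pdf"}],
            "children": ["mag", "rad"]},
    "mag": {"id": "mag", "title": "Magnetic grid",
            "files": [{"name": "mag_grid.zip",
                       "url": "https://example.org/mag_grid.zip"}],
            "children": []},
    "rad": {"id": "rad", "title": "Radiometric lines",
            "files": [{"name": "rad_lines.csv",
                       "url": "https://example.org/rad_lines.csv"}],
            "children": []},
}


def get_item(item_id: str) -> dict:
    """Look up one item (stand-in for a repository request)."""
    return FAKE_CATALOG[item_id]


def get_child_ids(item_id: str) -> List[str]:
    """Return the ids of an item's children (stand-in for a repository request)."""
    return FAKE_CATALOG[item_id]["children"]


def crawl(root_id: str,
          fetch_item: Callable[[str], dict],
          fetch_child_ids: Callable[[str], List[str]]) -> Iterator[dict]:
    """Breadth-first walk over a tree of catalog items, yielding each item once."""
    seen = set()
    queue = deque([root_id])
    while queue:
        item_id = queue.popleft()
        if item_id in seen:
            continue  # guard against items that are linked more than once
        seen.add(item_id)
        yield fetch_item(item_id)
        queue.extend(fetch_child_ids(item_id))


def data_files(items,
               extensions: Tuple[str, ...] = (".csv", ".tif", ".zip")
               ) -> List[Tuple[str, str, str]]:
    """Collect (item title, file name, url) for files that look like data."""
    found = []
    for item in items:
        for f in item.get("files", []):
            if f["name"].lower().endswith(extensions):
                found.append((item["title"], f["name"], f["url"]))
    return found
```

Calling `data_files(crawl("pkg", get_item, get_child_ids))` walks the whole package and returns only the files that look like data, skipping the PDF write-up. The fetch functions are passed in so the same walk could, in principle, run against different repositories.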
There are 14 total items in the Airborne Geophysical Surveys package. There are a couple of ways to explore them with code, but the most robust information is returned in the native ScienceBase JSON through something like the following URL:
Unfortunately, whether you are a coder or not, it would take significant human logic to parse through this JSON and make any sense of what’s in the data package and what you can do with it. Another technical method returns what should be the most complete metadata for the individual items in the package encoded in FGDC XML:
This isn’t bad in that each item in the package is sufficiently described, but there is even less ability for anyone to write software that can navigate between the different parts of the data package in the way that I can as a human viewing it online. We’ve provided what may be slightly better than the simple online folder listing of files typical of data reports from the past, but the technical situation is not substantively improved.
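As a rough illustration of what software can and cannot get from a single FGDC record, here is a small Python sketch using only the standard library. The sample record is invented for this post (the title, abstract text, and example.org link are placeholders, not the real metadata for the geophysical survey package). The element paths use the standard FGDC CSDGM short names (`idinfo`, `citeinfo`, `onlink`, and so on).

```python
import xml.etree.ElementTree as ET

# Invented sample record using FGDC CSDGM short element names.
# The values are placeholders, not real metadata.
SAMPLE_FGDC = """<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <origin>Example Author</origin>
        <pubdate>2015</pubdate>
        <title>Example magnetic survey grid</title>
        <onlink>https://example.org/data/mag_grid.zip</onlink>
      </citeinfo>
    </citation>
    <descript>
      <abstract>Gridded total-field magnetic data (placeholder text).</abstract>
    </descript>
  </idinfo>
</metadata>"""


def summarize_fgdc(xml_text: str) -> dict:
    """Pull the title, abstract, and online links out of one FGDC record."""
    root = ET.fromstring(xml_text)

    def text(path: str):
        # Return stripped element text, or None if the element is missing or empty.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    links = [n.text.strip()
             for n in root.findall("idinfo/citation/citeinfo/onlink")
             if n.text]
    return {
        "title": text("idinfo/citation/citeinfo/title"),
        "abstract": text("idinfo/descript/abstract"),
        "links": links,
    }
```

This gets a decent description of one item, but nothing in a record shaped like this sample tells a program which other items belong to the same package or how they relate, which is the navigation gap described above.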
2) Authors are still thinking in terms of data publication
The author in this case provided what still amounts to a data report. It is not as robust or nicely formatted as the typical Data Series report, but it’s not substantively different and it serves the same purpose: providing human-readable background and summary information on the dataset so that others understand how it can be used, and “advertising” the product. More often than not, when authors are asked for metadata, what they provide is a text document describing their data collection methods and the substance of the dataset. Data geeks have been trying for years (decades, in fact) to get authors to provide standards-compliant metadata, nicely encoded as machine-parseable XML, with limited success. We keep hoping that if we can just provide the next great metadata tool or get scientists into the right training so they are indoctrinated with metadata goodness, we’ll solve this problem.
But is it really a problem? Shouldn’t we want the parts of the data documentation package that are intended for human consumption to be nicely packaged, formatted, and readable by humans? If many authors are going to think about presenting their data online and having other researchers access it and want to put the energy into writing a data paper, why shouldn’t we make that a central part of the package?
Another perspective in this area is coming from GeoSoft (I still can’t figure out the etymology in naming between GeoSoft and OntoSoft), one of the NSF EarthCube projects. This project started out focusing on the whole area of software documentation, but in doing so, they’ve ended up dealing with the larger spectrum of scientific publishing where it involves data and software. In particular, Yolanda Gil and others from the team are working on the Geoscience Paper of the Future, where the aim is to set a pattern for scientific papers that are more like “multimedia packages” (my phrase) that include data, software, and provenance (scientific workflows). In an effort I was involved in a few years ago, the Core Science Systems strategy for USGS, we tried to aim the USGS at the same concept, going so far, in an early draft that didn’t make the review cut, as to suggest that all scientific papers in the future should be fully encoded as software. There’s likely a balance in here, with “traditional” journal articles that present theoretical science and thinking meant to be read and absorbed in total, and other publications that are more data-driven science packages containing many individual artifacts and microcitations of individual findings, data subsets, and algorithms within larger works.
Are there actually 3 distinct jobs in data release?
So, how do we best facilitate releasing data as an institution? As a science organization, and particularly as a government science agency with a mandate for long-term data preservation and open access, we are more than individual PIs releasing the data they generate and analyze through research projects. We are an institution that has the ability and responsibility to invest in and support a larger, more robust data system as a key organizational asset. I posit that there are three different jobs that need to be examined.
Authors
Without authors or data producers of different kinds, there is no content and no deep knowledge about the scientific subject. We need a system of data release and publishing that allows authors to do what they do best, focus on the substance of their work and communicate it in understandable ways. We need to not force authors to learn more than they need to about other jobs. Authors do need to own the final package and presentation of their intellectual capital, and this has sometimes led us to try and force the next two jobs onto authors. I believe we are at an interesting crossroads as scientific institutions where we need to decide if we want to have data release or publication start and stop at the point of mostly author-owned, self-service repositories or layer on one or both of the next two roles as institutional capabilities.
Data Librarians or Curators
Some of my colleagues with whom I was debating this area in a meeting last week struggled with the term data librarian, mostly because of the tension that seems to be going on between traditional library functions and what those functions mean in a mostly digital world. I personally agree with many of the functions that the Australian National Data Service lists for data librarians and information specialists, but I’d like to take things up a notch on the technological acumen for these folks so they can work more deeply on advanced data services and on combining some types of discrete data into higher-order aggregated or integrated products. I believe we need to hire, train, and nurture a professional workforce of data librarians working collaboratively within some macro-domain level of the various sciences to maximize discoverability, accessibility, and preservation of scientific data and information. These folks need the capability to build a practice around making sure that the “we didn’t improve the technology” statements I made above no longer apply.
Publishing Specialists
There may be some data that are perfectly sufficient as data packages made up of the data content and documentation that can be encoded in standard ways and then presented as catalog records through various discovery engines. But there are also other cases, more like data papers or enhanced packages, that seem to be fairly prevalent in my own organization and perhaps others. These cases may start with a mostly prose-and-pictures document describing the background and circumstance of a dataset, summarizing its contents, and perhaps linking to interpretive content in other publications. Publishing specialists in the USGS have traditionally treated these cases as standard publications, focusing their attention on the jobs they do best: editing textual content for clarity, refining graphical elements (figures, charts, etc.), and building an attractive presentation. While these can be value-adding capabilities, the time and expense involved, and the fact that the end products still lack more cutting-edge features for how the data are made available, have often pushed authors toward finding alternate pathways.
I would like to see publishing specialists involved in the process, but I would love to see them amp up their technological game as well. I think about the following questions as things publishing specialists can bring to the table to ask and answer:
- How can more advanced data visualization capabilities (e.g., http://d3js.org/) be made a part of the technology stack so that visualizations in data papers are driven from live data, provide visualizations that are relevant and legitimate for the data in question, serve as an example of what others may be able to do with the data, and provide a more attractive and interactive feature with the data than static pictures?
- With data assets that data librarians or authors themselves make available through more advanced online services or APIs (beyond the download), how are these dynamics and the possibilities for access clearly articulated, both in the encoded metadata and in how the information is presented in a paper, so that users know what possibilities exist and what tools may be used with the data?
- How can summary tables, subsets, or specific highlights within a dataset that an author wants to call out and explain be pulled directly from the data in a live way (e.g., via a subsetting service) so that users can easily connect the summarization to the full dataset?
- How are the connections between data tools (commercial or open source software tools), applications (scientific models), interpretive publications, and other assets both encoded in the metadata so that software algorithms can exploit the connections and clearly articulated to users in the human readable part of the package?
- What can we do to measure and study the usability of different visualization and summarization methods across various types of data and usage patterns to learn what works and what doesn’t work with users?
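As one small example of the "pulled directly from the data" question above, here is a Python sketch in which a summary table is computed from the dataset itself rather than typed by hand into a paper. The CSV content, the column names (`line`, `mag_nT`), and the values are made up for illustration, and the inline string stands in for whatever a real download or subsetting service would return.

```python
import csv
import io
import statistics
from typing import Dict, List

# Made-up sample data standing in for the response from a data service.
SAMPLE_CSV = """line,lat,lon,mag_nT
L1,37.90,-77.95,51010.2
L1,37.91,-77.94,51012.8
L2,37.92,-77.93,50998.5
L2,37.93,-77.92,51001.1
"""


def summarize(csv_text: str, group_col: str, value_col: str) -> List[Dict[str, object]]:
    """Build a per-group summary table (count, mean, min, max) from CSV text."""
    groups: Dict[str, List[float]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return [
        {
            "group": key,
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in sorted(groups.items())
    ]
```

If the table in the data paper is generated by something like `summarize(SAMPLE_CSV, "line", "mag_nT")` at display time, it stays connected to the full dataset, and a reader can rerun the same call on a different subset.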
Software and Provenance
At some point, I’ll revisit this subject with regard to the full suite of what the GeoSoft project I mentioned is dealing with – data, software, and provenance. I believe similar issues exist to be explored with the value-adding capabilities that data librarians and publishing specialists can bring to the table for software and the encoding and presentation of scientific workflows.