I don’t think I’ve written anything about technology here. I’ve generally posted technical stuff related to my work on an internal space and done my philosophizing out here. However, I can’t share the internal blog with anyone else, so I figure to start using this space for that stuff as well.

One of the things that I kicked off a few years ago and shepherded through the first few rounds of architecture is ScienceBase – a management, distribution, and integration platform for scientific data used in the USGS and several other agencies and institutions. I don’t manage the program anymore, but I do use it in my current job and I poke at things occasionally to stir the pot a little on where the platform is going.

ScienceBase has a fairly robust API at its core with a number of different interface methods and return formats. Most of the API interaction uses JSON, and I had a colleague the other day with a use case where she needed to pull ScienceBase Item information into a spreadsheet. ScienceBase offers a canned CSV output for any search results that is okay, but occasionally someone needs to customize the output. I had a student a while ago who wrote some Google Script to pull ScienceBase query results into a Google Spreadsheet as part of a project, but he never posted anything anywhere convenient and lasting and when I went looking for it, it was nowhere to be found. Fortunately, it looks like a few folks have figured out some even more slick ways of doing this since that time, and I rounded one of them up. I’m posting the concept here for safekeeping.

Here’s the ScienceBase collection in question – a set of data assets for the Desert Landscape Conservation Cooperative. How those guys are using ScienceBase in some pretty cool ways is another story for another time.

https://www.sciencebase.gov/catalog/item/54fa3835e4b02419550da3a8

Here’s what the search interface looks like in a RESTful URL:

https://www.sciencebase.gov/catalog/items?parentId=54fa3835e4b02419550da3a8

If you add some parameters to that, you can get a CSV output of all records in the collection:

https://www.sciencebase.gov/catalog/items?parentId=54fa3835e4b02419550da3a8&max=200&format=csv

That’s not a bad output, but it doesn’t include everything you might want or all that you want from the information about the datasets in the collection. Google Spreasheets already have some built in functions for importData that can grab that CSV output and put it in a table and importXML that do what we want to do here with XML. OpenOffice spreadsheets have some similar functions, and there might even be a way to pull it off with some whacky method in Excel. But Google Spreadsheets offer a nice way to demonstrate the concept and get the work done online and collaboratively.

I found a great script from someone named Trevor Lohrbeer (@FastFedora) on GitHub:

https://github.com/fastfedora/google-docs/tree/master/scripts/ImportJSON

He also wrote about it in a blog post here where you can get more details. Once that’s loaded in as an available script for a spreadsheet, you can call it up from a cell to build out a table with some straightforward syntax that looks like this:

=ImportJSON(“https://www.sciencebase.gov/catalog/items?parentId=54fa3835e4b02419550da3a8&format=json&fields=title,body,webLinks,contacts&max=200”, “/items/id,/items/title,/items/body,/items/webLinks,/items/contacts”, “noTruncate”)

You’ll see the following in the REST URL to ScienceBase:

  • format=json (ask for the result in JSON, also invoked with content negotiation)
  • fields=<list of fields I want> (see Reference)
  • max=<max records to return> (ScienceBase will give you as many as 1000 records at a time)

The ImportJSON script flattens everything out in the JSON response and sets it up to configure certain fields in the output via XPath-like addresses (e.g., /items/title). And there are a few config parameters you can read about in the script.

The tricky thing with this is how it handles JSON arrays, which don’t lend themselves very well to flattening out into a spreadsheet (why we have more complex text formats like JSON and XML in the first place). @FastFedora did a pretty cool job with going ahead and writing out cases like webLinks and contacts in my example into multiple spreadsheet rows where some of the information is repeated and the multi-value attributes are thrown into the associated columns. If you want to go check out this example in action, I put it into an open spreadsheet here. You might notice, that it’s smart enough to go ahead and return all the attributes for individual /items/webLinks with a “Weblinks” prefixed column header, which is pretty cool. And actually in this particular JSON, I can simply tell ImportJSON to give me all of /items, and it will return everything that’s in that array all flattened out in a reasonably decipherable way.

Thanks to @FastFedora for coming up with a pretty kick ass script to do a simple job that should save quite a bit of time for folks wanting to build nice flat tables from ScienceBase items.