Wednesday, February 9, 2011

Visualization of GIS Records in the CI-BER Testbed

The goal of CI-BER is to develop scalable interfaces for exploring and understanding archival collections containing billions of records. In the archival sense, "records" do not signify rows in a database or a spreadsheet, easily broken down into machine-understandable structured chunks of data; rather, records follow the traditional meaning of the word. CI-BER records, archival records, are heterogeneous entities containing text, images, signatures, data in current or even obsolete binary formats, compressed digital archives, and more. Along with what one might think of as the data itself comes the structure of its collection: folders, subfolders, file dates and names, versioning information, and whatever metadata about ownership, provenance, or content the data came with (often sparse, if present at all). All of this must be preserved in a genuine archive, which makes building interfaces that make any sense of these collections a complicated problem.

Billions is a fair estimate, too. Every year, government agencies store exponentially more archival data in electronic formats, and transparency efforts make archiving it more important and necessary than ever before. The goal of the CI-BER project is to build tools and update storage mechanisms pre-emptively, so they can handle the influx as it continues to grow.

In browsing the records we added to the CI-BER test collection, we noticed that a significant number of geographic data sets are liberally scattered throughout many of the sub-collections. To explore and understand these collections, we built an index and an explorer interface for viewing the geographic metadata of CI-BER records, with a treemap interface inspired largely by the work of TACC.

For geographic records, our interface is meant to pick up where TACC leaves off. When the CI-BER geographic indexing process discovers a geographic file type, it runs through a series of open-source geographic programs looking for one that can decode the file. Once a decoding program is found, the indexer asks the file a series of questions, finding out its geographic boundaries, its assumed map projection (there are, roughly speaking, thousands of likely projections), the type of data stored at each point of the file, and the kind of features (lines, rasters, points, or polygons) that data is mapped to. The answers to these questions, along with some standard file metadata (location, size, owner, and creation date, among others), are stored in the index for use in the explorer.
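The decoder-chain step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `GeoMetadata` fields, the decoder and file names, and the coordinate values are hypothetical stand-ins, and a real indexer would call tools such as GDAL/OGR rather than the stub shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical record of the indexer's "answers": geographic bounds,
# assumed projection, and the kind of features the data is mapped to.
@dataclass
class GeoMetadata:
    path: str
    bounds: tuple          # (min_x, min_y, max_x, max_y)
    projection: str        # e.g. "EPSG:4326"
    feature_kind: str      # "points", "lines", "polygons", or "raster"

Decoder = Callable[[str], Optional[GeoMetadata]]

def index_geographic_file(path: str, decoders: List[Decoder]) -> Optional[GeoMetadata]:
    """Try each decoding program in turn; the first one that can
    decode the file answers the metadata questions."""
    for decode in decoders:
        meta = decode(path)
        if meta is not None:
            return meta
    return None  # no decoder recognized the file

# Stub standing in for a real decoder such as GDAL/OGR:
def fake_shapefile_decoder(path: str) -> Optional[GeoMetadata]:
    if not path.endswith(".shp"):
        return None
    return GeoMetadata(path, (-84.3, 33.8, -75.4, 36.6), "EPSG:4326", "polygons")

meta = index_geographic_file("nc_counties.shp", [fake_shapefile_decoder])
print(meta.feature_kind)  # polygons
```

The key design point is the fallback chain: the indexer does not need to know in advance which program understands a given file, only how to ask each one in turn.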

The goal of the explorer, and of future follow-on geographic interfaces, is to provide a level of collection understanding while also providing a way to cross-correlate different sub-collections, or even entire collections. Geography, like music is sometimes said to be, is a kind of universal language: once data is linked firmly to geography, that hard physical reference provides a way to relate it to a person or a town, and to other data coordinated within the same geographic region. CI-BER collections come from different government agencies, many of which don't interact, or interact only loosely. By collecting and indexing their holdings in terms of geography, cross-agency collaboration and understanding can be fostered.

The figure shown is the first iteration of the geographic explorer. On the right-hand side is a "treemap" visualization of the collection, sizing each entry in a particular sub-collection by the number of sub-entries it contains, or by file size if the entry is a single file. Grey entries are geographic dead-ends: cells containing no geographic data. They are included in the explorer to give the user a rough idea of how sparsely geographic data is distributed within the collection. Highlighted in yellow are collections containing only "features": individual points, outlines, or paths on which data is stored (typical data stored on features includes sensor readings, road names, elevations, census data, and municipality information). Highlighted in blue are collections containing "coverages": image data describing a two-dimensional surface or three-dimensional volume of real space. Typical data stored in coverages includes weather predictions and observations, climate or hydrography models, satellite images, and digital elevation models. Finally, highlighted in red are collections that contain both.
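The sizing and color-coding rules above amount to a pair of small functions. The sketch below is our own illustration, assuming a simple nested-dict representation of collection entries; the function names and entry fields are hypothetical, not taken from the explorer's code.

```python
def cell_size(entry: dict) -> int:
    """Size a treemap cell: number of sub-entries for a collection,
    file size in bytes for a single file."""
    children = entry.get("children")
    return len(children) if children is not None else entry.get("size", 0)

def cell_color(n_features: int, n_coverages: int) -> str:
    """Color a treemap cell by what its sub-collection contains."""
    if n_features and n_coverages:
        return "red"      # both features and coverages
    if n_features:
        return "yellow"   # features only: points, outlines, paths
    if n_coverages:
        return "blue"     # coverages only: image data over a surface/volume
    return "grey"         # geographic dead-end: no geographic data

print(cell_color(12, 0))                       # yellow
print(cell_size({"children": [{}, {}, {}]}))   # 3
```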

When a user hovers over one of the highlighted cells, the map on the left dynamically displays rectangular outlines marking the geographic boundaries of all the contained sub-collections, or of the file itself. A hover-over information box also appears, containing additional metadata about the file: projection, numerical boundaries, number of features or rasters contained, and POSIX file metadata. Clicking on a cell descends one level into the collection, refining the user's understanding of a single sub-collection as it breaks out into further sub-collections or individual files.
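Gathering the rectangles to outline on hover is essentially a recursive walk over the sub-collection. A minimal sketch, again assuming a hypothetical nested-dict representation rather than the explorer's actual data structures:

```python
def gather_bounds(entry: dict) -> list:
    """Collect one bounding rectangle per geographic file at or
    below this entry, for outlining on the map on hover."""
    if "bounds" in entry:                  # a single geographic file
        return [entry["bounds"]]
    rects = []
    for child in entry.get("children", []):
        rects.extend(gather_bounds(child))
    return rects

# Hypothetical sub-collection: two geographic files, one empty folder.
collection = {"children": [
    {"bounds": (-84.3, 33.8, -75.4, 36.6)},
    {"children": [{"bounds": (-77.1, 35.0, -76.0, 36.0)}]},
    {"children": []},                      # geographic dead-end
]}
print(len(gather_bounds(collection)))  # 2
```

In the explorer these rectangles would come from the pre-built index rather than being recomputed per hover, which keeps the interaction responsive over large sub-collections.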

This is just the first tool developed by the CI-BER working group for visualization. From what we learned in this rapid iteration, follow-on tools will be written that make it easier to understand huge collections in terms of their geography.


  1. Cool stuff. You should check out TextGrounder -

  2. Thanks for sharing information about this work, and the code on GitHub! I was wondering how many geographic data files were present in the archival record set. Were you looking at the 14 million unique files and 27.5TB of data that Richard blogged about earlier?