Wednesday, December 12, 2012

Big Data Meets Collaboration by Difference: HASTAC Goes CIBER

Big data, big collaboration. That’s what happened to the CI-BER project when it took on new partners this fall. CI-BER launched in 2011 as a cooperative research agreement between NARA (National Archives and Records Administration), NSF (National Science Foundation), and the University of North Carolina at Chapel Hill, with a mission to build a master copy of billions of federal electronic records and visualize that data in different ways.

Jeff Heard of RENCI (one of the project collaborators) summarized the CI-BER testbed as:

  • 75 million records and growing
  • 70 terabytes of data
  • Records spread across 150 government agencies
  • Thousands of file types
  • Dozens of data classes
  • Hundreds of ad-hoc human organizational structures
  • All replicated and compiled into a central iRODS repository for next-gen data-grid access
  • A geographic subset of more than 1.2 million records

As of fall 2012, CI-BER is expanding to include new partners from Duke University, UNC-Asheville, and the City of Asheville, creating a collaborative team that represents computer science, political science, the humanities, engineering, information and library science, three universities, the town of Asheville, and community leaders with a pressing need for big data. Over the next 9 months, we will be reporting on the multiple facets of this collaboration, not only sharing research results from our experiments and developments, but documenting the practice of collaborating across a complex mix of disciplines, organizations, and institutions.

Why this project? Public government records are growing exponentially each year, and CI-BER’s goal is to create new tools to store, view, and use that digital data. To give some idea of the volume of data heading our way, consider that George W. Bush transferred an estimated 77 terabytes of information to the National Archives upon leaving office (a terabyte is a 1 followed by 12 zeros, counted in bytes), 35 times what the Clinton administration generated. In 2011, President Obama ordered federal agencies to make wider use of digital recordkeeping systems, which promises to grow the national archives by an order of magnitude in size and complexity, giving CI-BER and other big data initiatives a sense of urgency.

Terabytes of information no longer elicit much awe for the average user, not when Google is processing 24 petabytes per day (a petabyte is a 1 followed by 15 zeros, counted in bytes) and you can buy 2 terabytes of storage for less than a hundred dollars. However, despite our comfort generating and consuming massive data, it is no minor thing to mine billions of files with different file formats and folder systems, each one a small dot of information in a giant matrix of heterogeneous data that needs to be accessed, viewed, remixed, and eventually shared and made available for public use.
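To put those units side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes decimal (SI) units; the variable names are ours, chosen for illustration.

```python
# Assumption: decimal (SI) units. Binary units (2**40, 2**50) would differ slightly.
TERABYTE = 10**12
PETABYTE = 10**15

bush_transfer = 77 * TERABYTE          # estimate quoted in the post
clinton_transfer = bush_transfer / 35  # "35 times" implies roughly this much
google_per_day = 24 * PETABYTE         # figure quoted in the post

seconds_per_day = 24 * 60 * 60
bytes_per_second = google_per_day / seconds_per_day
seconds_to_process = bush_transfer / bytes_per_second

print(f"Implied Clinton-era transfer: {clinton_transfer / TERABYTE:.1f} TB")
print(f"Time to process 77 TB at 24 PB/day: {seconds_to_process / 60:.1f} minutes")
```

By this rough measure, the Clinton-era transfer works out to a little over 2 terabytes, and a system running at the quoted Google rate would get through the entire 77-terabyte transfer in under five minutes. Raw volume, in other words, is the easy part; the heterogeneity is what makes the work hard.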

CI-BER’s first task is to load the collections and make the data web-accessible so users can eventually “check out” a record and view or manipulate the data, and even remix or mash it up if they wish. By “users” we mean not only archivists, but the public. One of our CI-BER goals is to make it possible for users to crowdsource geospatial metadata without changing the underlying record.
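One way to picture "crowdsourcing metadata without changing the underlying record" is to keep contributed annotations in a separate layer that points at a read-only record. The Python sketch below is purely illustrative: the class names, fields, record ID, and coordinates are hypothetical and do not describe CI-BER's actual data model or its iRODS setup.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import List


@dataclass(frozen=True)
class ArchivalRecord:
    """Hypothetical stand-in for an archived record; frozen means read-only."""
    record_id: str
    title: str


@dataclass(frozen=True)
class GeoAnnotation:
    """A crowdsourced location tag that refers to a record by ID."""
    record_id: str
    contributor: str
    latitude: float
    longitude: float


@dataclass
class AnnotationLayer:
    """Holds contributed metadata separately from the records themselves."""
    annotations: List[GeoAnnotation] = field(default_factory=list)

    def add(self, annotation: GeoAnnotation) -> None:
        self.annotations.append(annotation)

    def for_record(self, record_id: str) -> List[GeoAnnotation]:
        return [a for a in self.annotations if a.record_id == record_id]


# Illustrative data only: a made-up record tagged with approximate
# coordinates for downtown Asheville.
record = ArchivalRecord("rec-001", "Example planning document")
layer = AnnotationLayer()
layer.add(GeoAnnotation("rec-001", "volunteer_a", 35.5951, -82.5515))

try:
    record.title = "Edited title"  # attempts to alter the record fail
except FrozenInstanceError:
    print("Record is read-only; annotations live in the layer.")

print(len(layer.for_record("rec-001")), "annotation(s) for rec-001")
```

The design choice here is simply separation: volunteers can add, correct, or dispute location tags as often as they like, and the archived record stays exactly as it was deposited.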

Richard Marciano, co-founder of UNC’s Digital Innovation Lab and CI-BER’s PI, has experience making large archived collections web-accessible and searchable. One of his current research projects is T-Races (Testbed for the Redlining Archives of California’s Exclusionary Spaces), which looks at the impact that maps and reports from the 1930s had on mortgage lending policies and how that data influenced the resilience of neighborhoods through the 1960s. The end result is an innovative system that allows users to interact with and analyze historical data that was virtually inaccessible, allowing similar cities and neighborhoods around the country to mine and share their own public history.

That expertise will carry over to our own CI-BER place-based work. In this grant, Dr. Marciano will be joined by two co-PIs at Duke University: Robert Calderbank, Dean of the Natural Sciences and Professor of Electrical and Computer Engineering, and Cathy N. Davidson, John Hope Franklin Humanities Institute Professor of Interdisciplinary Studies and co-founder of the Humanities, Arts, Sciences, and Technology Advanced Collaboratory (HASTAC, or "haystack").

Other collaborators on the team include Sheryl Grant and Mandy Dailey of HASTAC; Jeff Heard, Erik Scott, and John McGee at RENCI; Chien-Yi Hou at the University of North Carolina at Chapel Hill's School of Information and Library Science; Priscilla Ndiaye of the Asheville Southside Community Advisory; Dwight Mullen of UNC-Asheville; Mark Conrad of NARA; and Sheau-Yen Chen of the University of California, San Diego. We expect that this marks the beginning of a long-term data connection across UNC, Duke, and beyond.

For our initial study, we chose to focus on the heterogeneous datasets and multi-source historical and digital collections of the City of Asheville, North Carolina. This allows us to validate core concepts of the research so they can subsequently be scaled nationally. We have assembled historical and born-digital collections for a particular neighborhood in Asheville spanning 100 years of urban development, including heterogeneous records of 1930s New Deal policies, 1960s Urban Renewal policies, and 1990s planning documents. Historically, thousands of cities nationwide followed the same development patterns, so we expect the case study to scale naturally to the national level.

By focusing on this particular place, we plan to demonstrate the potential for automation and integration of temporal and spatial datasets involving census, economic, historic, planning, insurance, scientific, and financial content, with the goal of making a national impact on future research. Dr. Marciano has already conducted surveys and a pilot of the types of data sources involved, from scanned imagery, maps, and digitized datasets to National Archives federal records and beyond, many of which are already part of the CI-BER testbed. Our plan is to demonstrate workflows for potential automation and integration at scale, and to show how citizen-scientist crowdsourcing projects can be deployed.

Interested in following along? Visit HASTAC.org (under the CIBER tag) to learn more.
