Tuesday, March 19, 2013

A Citizen-Led Crowdsourcing Roadmap for the CI-BER “Big Data” Project

March 2013

Priscilla Ndiaye (Asheville Southside Community Advisory Board)
Dwight Mullen (UNC Asheville - Political Science)
Richard Marciano (UNC Chapel Hill / SALT Lab)
Cathy Davidson (Duke / HASTAC)
Robert Calderbank (Duke / iiD)
Sheryl Grant (Duke / HASTAC)
Mandy Dailey (Duke / HASTAC)
Kristan Shawgo (Duke / HASTAC)
Jeff Heard (UNC Chapel Hill / RENCI)

“[A]t a time when the web is simultaneously transforming the way in which people collaborate and communicate, and merging the spaces which the academic and non-academic communities inhabit, it has never been more important to consider the role which public communities - connected or otherwise - have come to play.”  (Dunn & Hedges, 2012.)
This statement is especially true for the CI-BER project, a collaborative “big data” project based on the integration of heterogeneous datasets and multi-source historical and digital collections, including a place-based case study of the Southside neighborhood in Asheville, North Carolina. The CI-BER team is in the process of assembling historical and born-digital collections that span decades of urban development and demographic data and include heterogeneous records of 1930s New Deal policies, 1940 Census data, 1960/70s Urban Renewal policies, and contemporary planning data. Members of the CI-BER project are automating and integrating temporal and spatial datasets such as census, economic, historic, planning, insurance, scientific, and financial content. The next task is to co-create a crowdsourcing process to solicit feedback from community members that will accelerate the data identification process.  Crowdsourcing is becoming an increasingly popular technique in dealing with “big data” processing and management.
What sets CI-BER apart from a straightforward digitization or visualization project is its deployment of crowdsourcing, in which the community is an essential part of the design and implementation process. Crowdsourcing is “the process of leveraging public participation in or contributions to projects and activities,” and can be carried out in different ways depending on the community involved, the content to be crowdsourced, and the technology available. This document outlines a four-phase crowdsourcing process designed with and for the Southside Asheville community.
In the sections below, we outline the 1) crowdsourcing framework, 2) community history, 3) remapping process, and propose a 4) citizen-led crowdsourcing process in four proposed phases that describe possible relationships and workflow. The CI-BER project is a highly iterative and collaborative project, and feedback is both welcome and necessary.
I. Crowdsourcing Framework
Our overall frame for how we approach crowdsourcing takes as inspiration the policies of the Obama administration and the nation’s recordkeeper, the National Archives and Records Administration (NARA):
“...Our commitment to openness means more than simply informing the American people about how decisions are made. It means recognizing that the Government does not have all the answers, and that public officials need to draw on what citizens know. And that’s why, as of today, I’m directing members of my administration to find new ways of tapping the knowledge and experience of ordinary Americans. . . .’” (David Ferriero, quoting President Obama, 2009).
Drawing on what citizens know and want to know is at the heart of how CI-BER approaches crowdsourcing. The involvement of Asheville’s Southside community in co-creating and designing the process is of paramount importance to its success. In their Crowdsourcing Scoping Study, Dunn and Hedges (2012) refer to three types of crowdsourcing approaches: contributory, collaborative, and co-created. CI-BER proposes a process of co-creation, in which the community is actively involved in most or all steps of crowdsourcing. We refer to this as citizen-led crowdsourcing, and take our cue from the community, based on their energetic and vital participation in an earlier North Carolina Humanities Council (NCHC) project called Twilight of a Neighborhood: Asheville’s East End, 1970.
CI-BER’s use of the term “citizen-led sourcing” is inspired by the Obama administration’s Open Government Initiative, which encourages public participation and collaboration. It is a derivative of the term citizen sourcing which has been defined as the “government adoption of crowdsourcing techniques for the purposes of (1) enlisting citizens in the design and execution of government services and to (2) tapping into the citizenry’s collective intelligence.” Vivek Kundra, Chief Information Officer of the United States from March 2009 to August 2011 under President Obama, described citizen sourcing as a way of driving “innovation by tapping into the ingenuity of the American people to solve those problems that are too big for government to solve on its own.”  Citizen sourcing is derived from the term crowdsourcing and emphasizes the type of civic engagement typically enabled through Web 2.0 participatory technologies, over a more impersonal crowd-based distributed problem-solving and production model.
In the International Journal of Public Participation article, Citizensourcing: Applying the Concept of Open Innovation to the Public Sector, the authors present “a structural overview of how external collaboration and innovation between citizens and public administrations can offer new ways of citizen integration and participation, enhancing public value creation and even the political decision-making process.”  
The Archivist of the United States, David Ferriero, introduced the concept of “citizen archivists” in 2010. He drew a parallel with citizen scientists and spoke of increasing public engagement with the archives, given the National Archives and Records Administration’s over-abundance of paper records and the need to digitize and transcribe them. More recently, at the August 2011 Society of American Archivists (SAA) annual meeting in Chicago, Robert Townsend, Deputy Director of the American Historical Association, chaired a session examining the notion of “participatory archives.” The panelists, Kate Theimer, Elizabeth Yakel, and Alexandra Eveleigh, provided a definition and examples of participatory archives, discussing the latest research on the impact of user participation. Kate Theimer offered the following definition:
Participatory Archive:  An organization, site or collection in which people other than archives professionals contribute “knowledge or resources, resulting in increased understanding about archival materials, usually in an online environment.”
This definition is very useful as it relates to the notion of “citizen-led sourcing” we propose. Our citizen-led focus puts civically engaged community members at the forefront and indicates that the focus is on the community engaging the archive with control resting on their shoulders.
Fig. 1: From crowdsourcing to citizen-led sourcing (dates indicate when concepts were first introduced)
II. The Community
In 2008, the North Carolina Humanities Council sponsored Twilight of a Neighborhood, which was organized around the photographs of Andrea Clark who had documented the community’s life at the eve of urban renewal. This project, and the earlier October 2007 transfer of a nearly intact and complete collection of urban renewal documents from the City to the University of North Carolina Asheville Library, helped energize “an emerging movement of concerned Asheville citizens who believe that their culture and history will shape how they live in the present and define the future.”
The CI-BER crowdsourcing project builds on this vital community interest and energy, and works with displaced citizens and African American community leaders who participated in the North Carolina Humanities Council project. The purpose is to put technology at the service of the community to document, analyze, and represent its lost history, and help reclaim it for education benefits, awareness, civic action, and potentially economic development purposes. 
In October 2007, the Asheville City Council approved the transfer of the records of the Housing Authority of the City of Asheville (HACA) to the D.H. Ramsey Library Special Collections & University Archives at the University of North Carolina Asheville (UNCA). This collection of nearly 130 linear feet and some 129 cartons documents a number of “significant redevelopment projects undertaken from the early 1960s to the mid-1980s.”
CI-BER focuses on the Southside project (formerly known as East Riverside), which at over 400 acres was the largest urban renewal area in the southeastern United States. In 1966, the Southside community represented about 4,000 people, some 7% of the population, living in nearly 1,300 households, and 98% African American, housing “more than half of the Negro families in the City of Asheville.” “The scale of devastation here was unmatched with more than 1,100 homes lost.” The urban renewal African American experience in Asheville is one of painful displacement and fragmentation.
Urban renewal as a federal government program was a 24-year (1949-1973) initiative started under the Housing Act of 1949, and modified under the Housing Act of 1954. It used the 1930s Home Owners’ Loan Corporation (HOLC) redlining terminology of “blight” and “slums” to launch an ambitious redevelopment and eminent domain process that led to the bulldozing of some 2,500 neighborhoods in 993 American cities. It is estimated that one million people were dispossessed in the process. “Black America cannot be understood without a full and complete accounting of the social, economic, cultural, political, and emotional losses that followed the bulldozing of 1,600 neighborhoods,” wrote Mindy Fullilove, who added that “the obliteration of a neighborhood destroys the matrix that holds people together on particular paths and in specific relationships.”
Community project consultant Priscilla Ndiaye is a Southside native, a community leader, and the Chair of the Southside Community Advisory Board, which was housed in the W.C. Center in the former Livingston Street School building, one of the very few original neighborhood buildings standing. She and community members, Nikoletta and Alonzo Robinson, led the collection indexing and mapping effort in March 2013.  University of North Carolina-Asheville project director Dr. Dwight Mullen is a professor of Political Science, and as the lead partner at UNCA, directed four students in the digitization of initial portions of the collection from January to March 2013.  The digitization team also included Donnell Sloan, Noor Al-Sibai, Anna McGuire, and Jesse Rice-Evans. This work was enabled through generous support of UNCA’s University Librarian, Leah Dunn, and Special Collections staff, Colin Reeve and Laura Gardner.
III. Remapping the Community
In essence, the CI-BER crowdsourcing approach harnesses the power of engaged citizens who can help capture key elements (owner, renter, parcel number, street address) per scanned documents (initial appraisal sheet, and house photo), and rapidly remap the entire Southside neighborhood. CI-BER’s technical team will use this information with fragments of maps found in the collection to create a digital spatial canvas of the entire neighborhood in 1965 on the eve of urban renewal, where each parcel is clickable and linked to its key elements.
The goal is to recreate community and bring to life the entire collection in the very first iteration. Priscilla Ndiaye demonstrated the vitality of this process when she pulled various property acquisition folders from Southside boxes, and came across the house she was born in, the house she grew up in, and houses of friends, neighbors, and community members. Being able to reproduce this process online and identify and tag collection items is one of the goals of our citizen-led sourcing of the archive.  The community carries the living history of these records. Through iterative and incremental passes over the files, the CI-BER team will gradually make sense of the collection, digitize strategic content, transcribe it through citizen-led crowdsourcing, visualize and map the content, enhance the collection, develop a working content model, and add functionality to the software user interface being built.
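The parcel-by-parcel remapping described above can be sketched as data. The following is a minimal, illustrative sketch of how one crowdsourced parcel might be expressed as a GeoJSON-style feature so that each parcel on the digital canvas can be clicked and linked to its key elements; all function names, field names, and sample values are our assumptions, not the project’s actual schema.

```python
# Illustrative sketch: one Southside parcel as a GeoJSON-style feature.
# Field names and sample values are hypothetical.

def make_parcel_feature(parcel_number, street_address, owner, renter,
                        scanned_documents, geometry=None):
    """Bundle the key elements captured from scanned documents
    (e.g., appraisal sheet, house photo) with a map geometry."""
    return {
        "type": "Feature",
        "geometry": geometry,  # polygon traced from 1965 map fragments
        "properties": {
            "parcel_number": parcel_number,
            "street_address": street_address,
            "owner": owner,
            "renter": renter,
            "documents": scanned_documents,  # links to digitized scans
        },
    }

# Hypothetical example of a crowdsourced contribution:
feature = make_parcel_feature(
    parcel_number="12-34-56",
    street_address="123 Livingston St",
    owner="Jane Doe",
    renter=None,
    scanned_documents=["appraisal_sheet.tif", "house_photo.tif"],
)
```

A collection of such features could then be rendered as the clickable digital spatial canvas of the 1965 neighborhood.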
IV. Citizen-led Crowdsourcing
The CI-BER project uses an incremental agile development-like approach. During our first pass, we are producing an initial neighborhood map that serves as a starting canvas in space and time. As new iteration loops take place throughout March and April of 2013, neighborhood residents will be able to follow the progress, provide feedback, and upload additional content to ensure transparency and openness: “ensuring the community is in the loop.” CI-BER’s citizen-led focus puts civically engaged community members at the forefront so that control rests on their shoulders.
We propose a workflow in four phases, based on best practices from the Community History Digitization: How-To Manual and Exercises.
Phase 1: Student- and citizen-led crowdsourced digitization, indexing, and mapping (completed)
January-March 2013
  • Community member and student-driven digitization of an initial subset of the collection.
  • Community-led indexing and mapping of the collection.
Phase 2: Modeling and motivating community participation in the crowdsourcing design process
March-May 2013
  • Develop programming that can be used to build community buy-in around the project in its developmental stages.
  • Work with Jeff Heard from RENCI to modify the “Big Board” emergency mapping software to accommodate crowdsourcing capabilities.
Phase 3: Deploying the crowdsourcing model
April-June 2013
  • Develop a framework for the crowdsourcing logistics.
  • Evaluate other community-based historical digitization projects.
  • Develop a wishlist of features to help guide the design of the crowdsourcing interface.
Phase 4: Presentation of crowdsourcing process
Summer 2013
  • Rollout of project through a series of public events.
V. References
Dunn, S., & Hedges, M. (2012). Crowd-sourcing scoping study: Engaging the crowd with humanities research. Centre for e-Research, Department of Digital Humanities, King’s College London. Arts & Humanities Research Council, Connected Communities Theme. Retrieved from http://humanitiescrowds.org/2012/12/21/final-report/

Ferriero, D. (2013). Volunteers help NARA do its job, support professional archivists. Archival Outlook. January/February 2013, p. 16. Society of American Archivists.

Fullilove, M. (2004). Root Shock: How Tearing Up City Neighborhoods Hurts America, and What We Can Do About It. New York: One World/Ballantine Books.

Judson, S. (2010). Twilight of a neighborhood: Asheville’s East End, 1970. Crossroads: A Publication of the North Carolina Humanities Council. Summer/Fall 2010. Retrieved from http://nchumanities.org/sites/default/files/documents/Crossroads%20Summer%202010%20for%20web.pdf

Hilgers, D., & Ihl, C. (2010). Citizensourcing: Applying the concept of open innovation to the public sector. The International Journal of Public Participation, 4:1. Retrieved from http://www.iap2.org/associations/4748/files/Journal_10January_Vol4_No1_6...

Monday, February 11, 2013

Data Control: Then and Now

Alistair Croll was not referring to the historic uprooting of Asheville’s Southside neighborhood when he said, “Data doesn’t invade people’s lives. Lack of control over how it’s used does.”

But he could have been.

We treat “big data” as a 21st century phenomenon that Google or Facebook brought upon us, as though data are only now being mined for ends we have yet to imagine. For the residents of Asheville’s Southside, though, there is no need for imagination.  They have already experienced that lack of control.

Twilight of a Neighborhood was a public humanities project that focused on Asheville’s African American neighborhoods, which included East End, Burton Street, Stump Town, and Southside.  The Southside was one of an estimated 1,600 African American neighborhoods that were torn apart over three decades of continuous “urban renewal” between the 1950s and 1970s.

The result of that renewal, says Dr. Mindy Fullilove, an urban scholar and psychiatrist at Columbia University, is root shock, “the traumatic stress reaction to the loss of some or all of one’s emotional ecosystem.” Robert Hardy of the Southside Neighborhood Association spoke of the impact of root shock on his community: “The resulting 'fiasco' which we are now living is perpetual poverty for the descendants and gentrification of their land.”

Priscilla Ndiaye, born and raised in the Southside neighborhood, Chair of the Southside Community Advisory Board, and a collaborator on the CI-BER project, reflects on what happened to her neighborhood: “Multiple perspectives, lack of knowledge, much confusion, and discouraged and bitter individuals are all entwined as spiders in a web: any way you touch it, it trembles.”

“At over four hundred acres, the urban renewal project here was the largest in the southeastern United States. The scale of the devastation here was unmatched.” Over a thousand homes were bulldozed, as well as churches, gas stations, grocery stores, funeral homes, businesses, schools, doctor offices, and a hospital.

How were these neighborhoods targeted? One answer lies in an earlier type of “big data.” As far back as the 1930s, the U.S. Census was collecting data from citizens who had little knowledge of or control over how it was used. Alistair Croll is correct when he says that “Big data is our generation’s civil rights issue and we don’t know it,” except that for many African Americans, the link between big data and civil rights issues is nothing new.

Richard Marciano and Chien-Yi Hou extended an approach taken from an earlier project called T-RACES (mapping redlining in California cities) and applied it to cities in North Carolina including the city of Asheville. Advisors on the Asheville redlining project included Priscilla Ndiaye, chair of Asheville's Southside Advisory Committee, and Dwight Mullen, UNCA Political Science professor.  Richard writes that, “Urban renewal as a federal government program was a 24-year (1949-1973) initiative started under the Housing Act of 1949, and was modified under the Housing Act of 1954. It used the 1930’s Home Owners’ Loan Corporation (HOLC) redlining terminology of “blight” and “slums” to launch an ambitious redevelopment and eminent domain process that led to the bulldozing of 2,500 neighborhoods in 993 American cities.”

Referring to the image above, Richard continues, “There is a picture of the 1937 Asheville redlining map and on the right is a snapshot of an interactive web mapping application that allows exploration of redlining but also superimposes the four major urban renewal neighborhoods of Asheville impacted in the 60s, 70s, and beyond, including the Southside neighborhood.”

“What is remarkable about these preliminary findings (we believe this to be one of the first interactive juxtapositions of these sets of historical policies) is the fact that the urban renewal footprint is almost a perfect match with the earlier 1937 redlining disinvestment one. The legacy of redlining, urban renewal, and the social philosophy that authorized it as an economic and policy instrument is still evident in the range of problems that continue to impact many urban neighborhoods.”

This is where the CI-BER project gets involved. We have chosen to set our initial study in the context of the heterogeneous datasets and multi-source historical and digital collections of the City of Asheville in North Carolina.

This allows us to start small, to validate core concepts of the research, and subsequently scale nationally. We are assembling historical and born-digital collections that span 100 years of urban development and include heterogeneous records of 1930s New Deal policies, of 1960s Urban Renewal policies, and 1990s planning documents.

Nationwide, thousands of cities followed the same development patterns, and the case study naturally leads to scaling at a very large national level. By focusing on this particular place, we will demonstrate the potential for automation and integration of temporal and spatial datasets involving census, economic, historic, planning, insurance, scientific, and financial content, with an eye on scalability and the goal of making a national impact on future research.

Richard has already conducted surveys and a pilot of the types of data sources involved from scanned imagery, maps, digitized datasets, National Archives federal records, and beyond, many of which are already part of the CI-BER testbed. Workflows demonstrating the potential for automation and integration at scale will be researched and citizen-scientist crowdsourcing processes deployed.

We have already secured the support of a number of entities in Asheville, including citizen groups, non-profits, city organizations, and universities. In upcoming posts, we will look at other facets of the CI-BER collaboration.

Please join us on our Collaborative Data group at HASTAC.org to network with others working in this research area, and to receive updates on other collaborative data projects.

Flickr image courtesy of LaurenManning

Friday, February 1, 2013

Socializing Big Data: Collaborative Opportunities in Computer Science, the Social Sciences, and the Humanities

Last month at the Franklin Humanities Institute at Duke University, Richard Marciano talked about Socializing Big Data: Collaborative Opportunities in Computer Science, the Social Sciences, and the Humanities. Richard is a professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, Director of the Sustainable Archives and Leveraging Technologies (SALT) lab, and co-director of the Digital Innovation Lab (DIL).
(A copy of Richard's presentation is available here on Slideshare)
The following highlights the points in Richard's presentation that, for many of us, represent the best of what academia offers -- complex, collaborative, and innovative research that builds, applies, investigates, and distills knowledge across a diverse social landscape.
"Socializing big data" represents one of the most challenging and intriguing sociotechnical questions of the 21st century: We have the data. We have a lot of it. Now what?

Richard started his talk describing the issues that keep him up at night: "It is not just the messiness of all this data, but the notion that big data can create big collaborations, which invites key questions: How can people get along and bring diverse points to the table? Big collaborations also lead to bigger ideas, so how can we guide research directions and develop innovative approaches that benefit from that kind of diversity?"

To illustrate big data and big collaborations, Richard highlighted the “Records in the Cloud” project funded by the University of British Columbia iSchool in collaboration with the University of Washington iSchool and Mid Sweden University's Information Technology and Media program. The purpose of the project is to delegate to cloud providers the responsibility for security, accessibility, disposition, and preservation. To quote Richard, “This is the nature of a lot of these projects -- which is to say, it is cross-disciplinary in nature, and digs deep and broad to make sure there are viewpoints and representation that go far beyond the technological aspects.”

Going beyond the technological aspects of big data matters now more than ever. 
The White House announced that Big Data is a Big Deal in March of 2012, a headline with teeth, evidently, since they backed it up with $200 million in funding across six federal departments and agencies. We need to be smart about what we do with this opportunity. Writer Ed Dumbill of Forbes magazine shared his own take in an article titled Big Data, Big Hype, Big Deal, an intelligent forecast of big data’s potential for “[S]ensing, algorithmic discovery and gaining deeper insight through data. Essentially, the emergence of a global digital nervous system.”

The question, then, is what does it mean to gain deeper insight through data? As Alistair Croll wrote, “Big data is our generation’s civil rights issue and we don’t know it.” Generating deeper insights requires having diverse viewpoints at the table. As an example, Richard referred to the iRODS Primer: Integrated Rule-Oriented Data System that he co-authored. His question to us seems technical, but it is fundamentally ethical:
  • How can you specify rules of engagement and rules of authenticity? How can you instrument content management systems with new forms of automation?
  • How could you instruct these systems with rules of ethics, or rules of social behavior?
  • How can you customize the behavior of the system so it will be more user-friendly and smart and adapt as a single system that functions for everyone?
No one is doing this yet, says Richard. What would a system like this mean? “Access and linkage to your big data collections would be governed by principles, and then the collection would try to enforce those things. It’s a return to the old days of artificial intelligence when we had expert systems, not just collections of content. A set of rules, triggers, policies would customize the entire behavior. Dealing with big data brings us back to this kind of space, this kind of thinking.”
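To make the idea of a principle-governed collection concrete, here is a toy sketch in the spirit of rule-oriented systems such as iRODS, where access is decided by declarative rules rather than hard-coded logic. The class, rule predicates, and request fields below are our own illustrative assumptions, not the iRODS rule language or API.

```python
# Toy sketch of a policy-governed collection: rules, not code paths,
# decide what a request may do. Names and fields are hypothetical.

class PolicyGovernedCollection:
    def __init__(self):
        self.rules = []  # each rule is a (predicate, decision) pair

    def add_rule(self, predicate, decision):
        """Register a rule: if predicate(request) is True, return decision."""
        self.rules.append((predicate, decision))

    def check(self, request):
        """Evaluate rules in order; first match wins, default deny."""
        for predicate, decision in self.rules:
            if predicate(request):
                return decision
        return "deny"

collection = PolicyGovernedCollection()
# An ethics-style rule: community members may annotate records.
collection.add_rule(
    lambda r: r["role"] == "community" and r["action"] == "annotate",
    "allow")
# A preservation rule: nobody may modify the underlying archival record.
collection.add_rule(lambda r: r["action"] == "modify", "deny")

print(collection.check({"role": "community", "action": "annotate"}))  # allow
print(collection.check({"role": "public", "action": "modify"}))       # deny
```

Swapping in new rules changes the collection’s behavior without touching the engine, which is the “expert systems” quality Richard describes.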

When we start imagining intelligent systems that are governed by principles and ethics, Richard commented, “You really need different viewpoints at the table. You can’t afford these projects that solely deal with the role of cyberinfrastructure and that punt on these other topics.”

That brings us to another question raised during the talk: How does the advent of big data change the way we do social science, and what role will social scientists play? There are so many research issues that haven’t even been framed or formed yet. Understanding how to collaborate is as important as how we deal with big data. We are in the early stages of learning what it means to collaborate across big data projects.

On the CI-BER project that Richard and HASTAC are working on, the research involves not just big teams of scholarly collaborators, but big teams of neighborhood groups, public libraries, the chamber of commerce, county organizations, regional planning councils, and other stakeholders. “These projects are larger than life. When you bring them together, it supersedes the capability of any one individual to do them, so you have to really rethink the entire research process. We cannot simply automate, we cannot just rely on technology.”

Yes, to be technological, we are talking about “researching the cyberinfrastructure implications of supporting large-scale content-based indexing of highly heterogeneous digital collections potentially embodying non-uniform or sparse metadata architectures.” But we are also talking about the nuts and bolts of how people work together.

When asked by a member of the audience how he manages big collaboration, Richard responded, “Collaboration is essentially a translation and semantics issue. For some, it might make sense to hire a technical broker. Or it might make sense to bring on humanists and philosophers who can help bring people together, to help them position people and ideas. My experience is that this is very humbling and it takes a lot of time. It can take a year, a year and a half before relationships of trust are developed, and people begin to understand other people’s language.”
*** *** ***
Richard leads development of "big data" projects funded by Mellon, NSF, NARA, NHPRC, IMLS, DHS, NIEHS, and UNC. Recent 2012 grants include a JISC Digging into Data award with UC Berkeley and the U. of Liverpool, called "Integrating Data Mining and Data Management Technologies for Scholarly Inquiry," a Mellon / UNC award called "Carolina Digital Humanities Initiative," which involves translating big data challenges into curricular opportunities, and an NSF award for CI-BER, a collaborative big heterogeneous data integration project between Duke, University of North Carolina-Chapel Hill, and HASTAC.
He holds a B.S. in Avionics and Electrical Engineering, and an M.S. and Ph.D. in Computer Science, and has worked as a postdoc in Computational Geography. He conducted interdisciplinary research at the San Diego Supercomputer Center at UC San Diego, working with teams of scholars in the sciences, social sciences, and humanities.
Join HASTAC's Collaborative Data group to follow Richard's work with CI-BER and HASTAC's EAGER project, or to share your own data collaborations.

(Image of a starling murmuration is courtesy of http://www.flickr.com/photos/rocketjohn/6350566523/.)

Wednesday, December 12, 2012

Big Data Meets Collaboration by Difference: HASTAC Goes CIBER

Big data, big collaboration. That’s what happened to the CI-BER project when it took on new partners this fall. CI-BER launched in 2011 as a cooperative research agreement between NARA (National Archives and Records Administration), NSF (National Science Foundation), and the University of North Carolina at Chapel Hill: build a master copy of billions of federal electronic records and visualize that data in different ways.

Jeff Heard of RENCI (one of the project collaborators) summarized the CI-BER testbed as:

  • 75 million records and growing
  • 70 terabytes of data
  • Records spread across 150 government agencies
  • Thousands of file types
  • Dozens of data classes
  • Hundreds of ad-hoc human organizational structures
  • All replicated and compiled into a central iRODS repository for next-gen data-grid access
  • Geographic subset of > 1.2m records

As of fall 2012, CI-BER is expanding to include new partners from Duke University, UNC-Asheville, and the City of Asheville, creating a collaborative team that represents computer science, political science, the humanities, engineering, information and library science, three universities, the town of Asheville, and community leaders with a pressing need for big data. Over the next 9 months, we will be reporting on the multiple facets of this collaboration, not only sharing research results from our experiments and developments, but documenting the practice of collaborating across a complex mix of disciplines, organizations, and institutions.

Why this project? Public government records are growing exponentially each year, and CI-BER’s goal is to create new tools to store, view, and use that digital data. To give some idea of what volume of data is heading our way, consider that George W. Bush transferred an estimated 77 terabytes (a terabyte is 10^12 bytes) of information to the National Archives upon leaving office, 35 times what the Clinton administration generated. In 2011, President Obama ordered federal agencies to make wider use of digital-based recordkeeping systems, a move that promises to grow the national archives by orders of magnitude in both size and complexity, and that gives CI-BER and other big data initiatives a sense of urgency.

Terabytes of information no longer elicit much awe for the average user, not when Google is processing 24 petabytes per day (a petabyte is 10^15 bytes) and you can buy 2 terabytes of storage for less than a hundred dollars. However, despite our comfort generating and consuming massive amounts of data, it is no minor thing to mine billions of files with different file formats and folder systems, each one a small dot of information in a giant matrix of heterogeneous data that needs to be accessed, viewed, remixed, and eventually shared and made available for public use.

CI-BER’s first task is to load the collections and make the data web-accessible so users can eventually “check out” a record and view or manipulate the data, and even remix or mash it up if they wish. By “users” we mean not only archivists, but the public. One of our CI-BER goals is to make it possible for users to crowdsource geospatial metadata without changing the underlying record.
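One way to support this is to keep community-contributed metadata in a separate overlay keyed by record ID, leaving the archival record itself untouched. The sketch below is purely illustrative (the class and field names are our own invention, not CI-BER code):

```python
# Hypothetical sketch: crowdsourced geospatial metadata lives in an overlay
# store keyed by record ID, so the underlying archival record never changes.
from dataclasses import dataclass, field

@dataclass(frozen=True)        # frozen: the record itself is immutable
class ArchivalRecord:
    record_id: str
    path: str

@dataclass
class AnnotationStore:
    # record_id -> list of (contributor, key, value) triples
    annotations: dict = field(default_factory=dict)

    def annotate(self, record_id, contributor, key, value):
        self.annotations.setdefault(record_id, []).append((contributor, key, value))

    def metadata_for(self, record_id):
        return self.annotations.get(record_id, [])

record = ArchivalRecord("nara-0001", "/ciber/asheville/map_1937.tif")
store = AnnotationStore()
store.annotate(record.record_id, "community-member", "neighborhood", "Southside")
print(store.metadata_for("nara-0001"))   # [('community-member', 'neighborhood', 'Southside')]
```

The design choice matters for archives: provenance requires that the record stay bit-for-bit identical, so contributions accumulate beside it rather than inside it.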

Richard Marciano, co-founder of UNC’s Digital Innovation Lab and CI-BER’s PI, has experience making large archived collections web-accessible and searchable. One of his current research projects is T-RACES (Testbed for the Redlining Archives of California’s Exclusionary Spaces), which looks at the impact that maps and reports from the 1930s had on mortgage lending policies, and how that data influenced the resilience of neighborhoods through the 1960s. The end result is an innovative system that lets users interact with and analyze historical data that was virtually inaccessible, enabling similar cities and neighborhoods around the country to mine and share their own public history.

That expertise will carry over to our own CI-BER place-based work. In this grant, Dr. Marciano will be joined by two co-PIs at Duke University: Robert Calderbank, Dean of the Natural Sciences and Professor of Electrical and Computer Engineering, and Cathy N. Davidson, John Hope Franklin Humanities Institute Professor of Interdisciplinary Studies and co-founder of the Humanities, Arts, Science, and Technology Advanced Collaboratory (HASTAC, pronounced "haystack").

Other collaborators on the team include Sheryl Grant and Mandy Dailey of HASTAC; Jeff Heard, Erik Scott, and John McGee at RENCI; Chien-Yi Hou at the University of North Carolina at Chapel Hill's School of Information and Library Science; Priscilla Ndiaye of the Asheville Southside Community Advisory Board; Dwight Mullen of UNC-Asheville; Mark Conrad of NARA; and Sheau-Yen Chen of the University of California, San Diego. We expect that this marks the beginning of a long-term data connection across UNC, Duke, and beyond.

For our initial study, we chose to focus on the heterogeneous datasets and multi-source historical and digital collections of the City of Asheville, North Carolina. This allows us to validate core concepts of the research so they can subsequently be scaled nationally. We have assembled historical and born-digital collections for a particular neighborhood in Asheville spanning 100 years of urban development, including heterogeneous records of 1930s New Deal policies, 1960s Urban Renewal policies, and 1990s planning documents. Historically, thousands of cities followed the same development patterns nationwide, so the case study will naturally lead to scaling at a very large national level.

By focusing on this particular place, we plan to demonstrate the potential for automation and integration of temporal and spatial datasets involving census, economic, historic, planning, insurance, scientific, and financial content, with the goal of making a national impact on future research.  Dr. Marciano has already conducted surveys and a pilot of the types of data sources involved from scanned imagery, maps, digitized datasets, National Archives federal records, and beyond, many of which are already part of the CI-BER testbed. Our plan is to demonstrate workflows for potential automation and integration at scale, and to show how citizen-scientist crowdsourcing projects can be deployed.

Interested in following along? Visit HASTAC.org (under the CIBER tag) to learn more.

Monday, November 14, 2011

CI-BER at LDAV 2011

The CI-BER project took a poster of its latest work to the Large Data Analysis and Visualization symposium, co-located with VisWeek 2011 in Providence, RI from September 23-25th. The conference was great, and I wanted to quickly share the highlights of the poster we presented.

The poster focused on communicating the unique problems faced when indexing and visualizing archives: namely, dealing with data of highly variable structure and "quality." This has posed a number of challenges over the last year, as older files found ways of crashing the indexer process, corrupting the index, and confusing the indexer to no end. We eventually overcame this by separating the indexer core (the queue of files and the scanning process) from the part that actually delves into files to collect metadata. These are isolated in separate UNIX processes, so that a core dump or other major failure on one file cannot bring down the entire indexing run or grind it to a halt. This separation also lets us distribute indexing across multiple machines, improving overall performance. The indexer architecture is shown below:

This indexer allowed us to create prototypes that give a highly interactive, geographically oriented view of the more-than-60TB collection. The indexer core will be going open source on GitHub by December 2011, and I will post the announcement of its availability here.
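The process-isolation idea described above can be sketched roughly as follows. This is an illustrative toy, not the CI-BER indexer itself; `extract_metadata` and `index_files` are stand-in names we made up:

```python
# Illustrative sketch: the file queue lives in the parent process, while
# per-file metadata extraction runs in a short-lived child process, so a
# crash or hang on one pathological file cannot take down the whole run.
import multiprocessing as mp

def extract_metadata(path, out):
    # Stand-in for the real per-file probe; a corrupt file could
    # segfault or loop forever here without harming the parent.
    out.put((path, {"size_hint": len(path)}))

def index_files(paths, timeout=5.0):
    ctx = mp.get_context("fork")          # fork keeps the sketch self-contained
    index = {}
    for path in paths:
        out = ctx.Queue()
        worker = ctx.Process(target=extract_metadata, args=(path, out))
        worker.start()
        worker.join(timeout)
        if worker.is_alive():             # hung on this file: abandon it
            worker.terminate()
            worker.join()
            continue
        if worker.exitcode == 0:          # crashed children are simply skipped
            done_path, meta = out.get(timeout=1.0)
            index[done_path] = meta
    return index

if __name__ == "__main__":
    print(index_files(["/archive/a.shp", "/archive/b.tif"]))
```

Distributing the work across multiple machines, as the real indexer does, would amount to replacing the local queue with a shared one that remote workers pull from.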

In addition to the previously blogged prototype on treemap visualization, we created a visualization that allows a user to geographically search the collection. The user draws bounding boxes with a swipe of the finger on a tablet device (such as the iPad), which searches the index for metadata records that appear as results in the side pane. The user can then drill down in the side pane to retrieve the actual metadata record itself.
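Under the hood, a bounding-box query can be as simple as an interval-overlap test against each indexed record's bounds. The sketch below is a toy version (the record fields and paths are invented for illustration); a production index would use a spatial data structure rather than a linear scan:

```python
# Toy bounding-box search over indexed geographic metadata records.
def boxes_intersect(a, b):
    # Boxes are (min_x, min_y, max_x, max_y); they overlap unless one
    # lies entirely to one side of the other.
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search_index(index, query_box):
    """Return the metadata records whose bounds overlap the drawn box."""
    return [rec for rec in index if boxes_intersect(rec["bounds"], query_box)]

index = [
    {"path": "/ciber/asheville/parcels.shp", "bounds": (-82.58, 35.55, -82.53, 35.60)},
    {"path": "/ciber/alaska/dem.tif",        "bounds": (-150.0, 60.0, -140.0, 65.0)},
]
hits = search_index(index, (-82.60, 35.50, -82.50, 35.62))   # box drawn around Asheville
print([h["path"] for h in hits])   # only the Asheville record matches
```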

These tools will also go open source at about the same time and will be on GitHub with the indexer.

LDAV Poster on Slideshare -- Click *HERE*

Wednesday, February 9, 2011

Visualization of GIS Records in the CI-BER Testbed

The goal of CI-BER is to develop scalable interfaces for exploring and understanding archival collections containing billions of records. In the archival sense, "records" do not signify rows in a database or a spreadsheet, easily broken down into machine-understandable structured chunks of data; rather, records follow the traditional meaning of the word. CI-BER records, archival records, are heterogeneous entities containing text, images, signatures, data in current or even obsolete binary formats, compressed digital archives, and more. Along with what one might think of as the data itself is the structure of its collection: folders, subfolders, file dates and names, versioning information, and whatever metadata about ownership, provenance, or content the data came with (often sparse, if present at all). All of this must be preserved in a genuine archive, and thus building interfaces that make any sense of these collections is a complicated problem.

Billions is a fair estimate, too. Every year, government agencies store exponentially more archival data in electronic formats, and transparency efforts make archiving that data more important and necessary than ever. The goal of the CI-BER project is to build tools and update storage mechanisms pre-emptively, so they can handle the influx as it continues to grow.

In browsing the records we added to the CI-BER test collection, we noticed that a significant number of geographic data sets are liberally scattered throughout a number of sub-collections. To explore and understand these collections we put together an index and an explorer interface for viewing geographic metadata of CI-BER records, with a treemap interface inspired largely by the work of TACC.

For geographic records, our interface is meant to pick up where TACC leaves off. When the CI-BER geographic indexing process discovers a geographic file type, it goes through a series of open-source geographic programs looking for one that can decode the file. Once a decoding program is found, the indexer asks the file a series of questions, finding out the geographic boundaries, the assumed map projection (there are, roughly speaking, thousands of likely projections), the type of data stored at each point of the file, and what kind of features (lines, rasters, points, or polygons) that data is mapped to. The answers to these questions, along with some standard file metadata (location, size, owner, and creation date, among others), are stored in the index to be used in the explorer.
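The decode-then-interrogate loop might look like the following sketch. The decoders here are hypothetical stand-ins (in practice one would call into open-source libraries such as GDAL/OGR), and the metadata values are dummies:

```python
# Sketch of a decoder chain: each decoder either answers the indexer's
# questions -- bounds, projection, kind of data -- or returns None, and
# the first decoder that understands the file wins.
def try_raster(path):
    if path.endswith((".tif", ".dem")):
        return {"kind": "coverage", "bounds": (0, 0, 1, 1), "projection": "EPSG:4326"}
    return None

def try_vector(path):
    if path.endswith((".shp", ".gml")):
        return {"kind": "feature", "bounds": (0, 0, 1, 1), "projection": "EPSG:4326"}
    return None

DECODERS = [try_raster, try_vector]

def probe(path):
    for decode in DECODERS:
        meta = decode(path)
        if meta is not None:
            return meta
    return None    # not a geographic file any decoder understands

print(probe("roads.shp")["kind"])   # feature
```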

The goal of the explorer and future follow-on geographic interfaces is to provide a level of collection understanding while at the same time providing a way to cross-correlate different sub-collections or even entire collections. Geography, like some say about music, is a kind of universal language: once you have data linked firmly to geography, that hard physical reference provides a way to relate the data to a person or town, and to relate it to other data coordinated within the same geographical region. CI-BER collections come from different government agencies, many of which interact only loosely or not at all. By collecting and indexing their collections in terms of geography, cross-agency collaboration and understanding can be fostered.

The figure shown is the first iteration of the geographic explorer. On the right-hand side, we see a "treemap" visualization of the collection, sizing all the entries in a particular sub-collection by the number of sub-entries they contain, or by file size if the entry is a single file. Grey entries are geographic dead-ends: these cells contain no geographic data, and they are included in the explorer to give the user a rough idea of how sparsely geographic data is housed within the collection. Highlighted in yellow are collections containing only "features": individual points, outlines, or paths on which data is stored (typical data stored on features include sensor data, road names, elevations, census data, and municipality information). Highlighted in blue are collections containing "coverages": image data describing a two-dimensional surface or three-dimensional volume of real space (typical data stored in coverages include weather predictions and observations, climate or hydrography models, satellite images, and digital elevation models). Finally, highlighted in red are collections that contain both.
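The coloring rule above reduces to a small classification of each cell by the kinds of geographic data it contains. A minimal sketch (the function name is ours, not from the explorer code):

```python
# Classify a treemap cell by the kinds of geographic data found beneath it:
# grey = none, yellow = features only, blue = coverages only, red = both.
def cell_color(kinds):
    has_features = "feature" in kinds
    has_coverages = "coverage" in kinds
    if has_features and has_coverages:
        return "red"
    if has_features:
        return "yellow"
    if has_coverages:
        return "blue"
    return "grey"

print(cell_color({"feature"}))               # yellow
print(cell_color({"coverage"}))              # blue
print(cell_color({"feature", "coverage"}))   # red
print(cell_color(set()))                     # grey
```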

When a user hovers over one of the highlighted cells, the map on the left dynamically displays rectangular outlines marking the geographic boundaries of all the contained sub-collections, or of the file itself. Additionally, a hover-over information box appears showing additional metadata about the file, including the projection, numerical boundaries, number of features/rasters contained, and POSIX file metadata. Clicking on a cell descends a level into the collection, refining the user's understanding of a single sub-collection as it breaks out into further sub-collections or individual files.

This is just the first tool developed by the CI-BER working group for visualization. From what we learned in this rapid iteration, follow-on tools will be written that make it easier to understand huge collections in terms of their geography.

Complete Master Copy of NARA Research Holdings Now Available at UNC!

The CI-BER project has successfully built, at UNC Chapel Hill, a consolidated master copy of all of the research holdings of the National Archives. Files were merged into a master collection from a variety of networked locations.

The initial consolidated CI-BER testbed currently holds over 14 million unique files and 27.5TB of data. We expect the testbed to grow significantly over the next few months.

One of the goals is for the CI-BER testbed to enable empirical studies at scale that contribute to the understanding of how to apply new cyberinfrastructure approaches. We hope this will lead to an understanding of how NARA can characterize, analyze, document and manage its vast holdings of records.

The CI-BER team has already started to explore these holdings using data-intensive approaches and visual analytics. Results showcasing innovative navigation interfaces for geospatial collections will be blogged about next.