Monday, February 11, 2013

Data Control: Then and Now


Alistair Croll was not referring to the historic uprooting of Asheville’s Southside neighborhood when he said, “Data doesn’t invade people’s lives. Lack of control over how it’s used does.”

But he could have been.

We treat “big data” as a 21st century phenomenon that Google or Facebook brought upon us, as though data are only now being mined for ends we have yet to imagine. For the residents of Asheville’s Southside, though, there is no need for imagination.  They have already experienced that lack of control.

Twilight of a Neighborhood was a public humanities project that focused on Asheville’s African American neighborhoods, which included East End, Burton Street, Stump Town, and Southside.  The Southside was one of an estimated 1,600 African American neighborhoods that were torn apart over three decades of continuous “urban renewal” between the 1950s and 1970s.

The result of that renewal, says Dr. Mindy Fullilove, an urban scholar and psychiatrist at Columbia University, is root shock, “the traumatic stress reaction to the loss of some or all of one’s emotional ecosystem.” Robert Hardy of the Southside Neighborhood Association spoke of the impact of root shock on his community: “The resulting 'fiasco' which we are now living is perpetual poverty for the descendants and gentrification of their land.”

Priscilla Ndiaye, born and raised in the Southside neighborhood, Chair of the Southside Community Advisory Board, and a collaborator on the CI-BER project, reflects on what happened to her neighborhood: “Multiple perspectives, lack of knowledge, much confusion, and discouraged and bitter individuals are all entwined as spiders in a web: any way you touch it, it trembles.”

“At over four hundred acres, the urban renewal project here was the largest in the southeastern United States. The scale of the devastation here was unmatched.” Over a thousand homes were bulldozed, along with churches, gas stations, grocery stores, funeral homes, businesses, schools, doctors’ offices, and a hospital.

How were these neighborhoods targeted? One answer lies in an earlier kind of “big data.” As far back as the 1930s, the U.S. Census was collecting data from citizens who had little knowledge of, or control over, how it would be used. Alistair Croll is correct when he says that “Big data is our generation’s civil rights issue and we don’t know it,” except that for many African Americans, the link between big data and civil rights is nothing new.

Richard Marciano and Chien-Yi Hou extended an approach taken from an earlier project called T-RACES (mapping redlining in California cities) and applied it to cities in North Carolina, including Asheville. Advisors on the Asheville redlining project included Priscilla Ndiaye, chair of Asheville's Southside Advisory Committee, and Dwight Mullen, UNCA Political Science professor. Richard writes: “Urban renewal as a federal government program was a 24-year (1949-1973) initiative started under the Housing Act of 1949, and was modified under the Housing Act of 1954. It used the 1930s Home Owners’ Loan Corporation (HOLC) redlining terminology of ‘blight’ and ‘slums’ to launch an ambitious redevelopment and eminent domain process that led to the bulldozing of 2,500 neighborhoods in 993 American cities.”



Referring to the image above, Richard continues, “On the left is a picture of the 1937 Asheville redlining map, and on the right is a snapshot of an interactive web mapping application that allows exploration of redlining but also superimposes the four major urban renewal neighborhoods of Asheville impacted in the 60s, 70s, and beyond, including the Southside neighborhood.”

“What is remarkable about these preliminary findings (we believe this to be one of the first interactive juxtapositions of these sets of historical policies) is the fact that the urban renewal footprint is almost a perfect match with the earlier 1937 redlining disinvestment one. The legacy of redlining, urban renewal, and the social philosophy that authorized it as an economic and policy instrument is still evident in the range of problems that continue to impact many urban neighborhoods.”
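For readers curious how such a juxtaposition can actually be computed, here is a minimal sketch in Python using the geopandas library. The file names, attribute fields, and HOLC grade coding are hypothetical placeholders rather than the project's actual data; the point is simply to show how two historical map layers can be intersected and their overlap quantified.

```python
# A minimal sketch (not the project's code) of the juxtaposition described
# above: overlay 1937 HOLC redlining polygons with urban renewal project
# footprints and measure how much they coincide. File names, attribute
# fields, and the grade coding are hypothetical.
import geopandas as gpd

# Load the two historical layers (hypothetical GeoJSON exports).
redlining = gpd.read_file("asheville_holc_1937.geojson")
renewal = gpd.read_file("asheville_urban_renewal.geojson")

# Work in a projected CRS so areas are meaningful (NAD83 / North Carolina, meters).
redlining = redlining.to_crs(epsg=32119)
renewal = renewal.to_crs(epsg=32119)

# Keep only the HOLC grade "D" ("hazardous", i.e. redlined) areas.
redlined = redlining[redlining["holc_grade"] == "D"]

# Intersect the redlined areas with the urban renewal footprints.
overlap = gpd.overlay(renewal, redlined, how="intersection")

# What share of the urban renewal footprint falls inside redlined areas?
share = overlap.area.sum() / renewal.area.sum()
print(f"{share:.0%} of the urban renewal footprint was redlined in 1937")
```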

This is where the CI-BER project gets involved. We have chosen to set our initial study in the context of the heterogeneous datasets and multi-source historical and digital collections of the City of Asheville in North Carolina.

This allows us to start small, to validate core concepts of the research, and subsequently to scale nationally. We are assembling historical and born-digital collections that span 100 years of urban development and include heterogeneous records of 1930s New Deal policies, 1960s urban renewal policies, and 1990s planning documents.

Nationwide, thousands of cities followed the same development patterns, so the case study lends itself naturally to scaling up to the national level. By focusing on this particular place, we will demonstrate the potential for automation and integration of temporal and spatial datasets involving census, economic, historic, planning, insurance, scientific, and financial content, with an eye on scalability and the goal of making a national impact on future research.
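As a rough, hypothetical illustration of what integrating temporal and spatial datasets can look like at the table level, the sketch below aligns two invented sources on a shared census-tract and decade key and joins them. The file and column names are made up for the example and are not CI-BER's actual schemas.

```python
# A hypothetical sketch of temporal/spatial integration: join planning
# records to census observations by tract and decade. Files and column
# names are invented for illustration.
import pandas as pd

census = pd.read_csv("census_tracts.csv")       # tract_id, year, population, median_income
planning = pd.read_csv("planning_records.csv")  # tract_id, year, program, acres_affected

# Normalize both sources to a common temporal key (the decade).
for df in (census, planning):
    df["decade"] = (df["year"] // 10) * 10

# Integrate on the shared spatial (tract) and temporal (decade) keys.
merged = planning.merge(census, on=["tract_id", "decade"], how="left")

print(merged.head())
```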

Richard has already surveyed and piloted the types of data sources involved, from scanned imagery, maps, and digitized datasets to National Archives federal records, many of which are already part of the CI-BER testbed. Workflows demonstrating the potential for automation and integration at scale will be researched, and citizen-scientist crowdsourcing processes deployed.

We have already secured the support of a number of entities in Asheville, including citizen groups, non-profits, city organizations, and universities. In upcoming posts, we will look at other facets of the CI-BER collaboration.

Please join us on our Collaborative Data group at HASTAC.org to network with others working in this research area, and to receive updates on other collaborative data projects.

Flickr image courtesy of LaurenManning

Friday, February 1, 2013

Socializing Big Data: Collaborative Opportunities in Computer Science, the Social Sciences, and the Humanities



Last month at the Franklin Humanities Institute at Duke University, Richard Marciano talked about Socializing Big Data: Collaborative Opportunities in Computer Science, the Social Sciences, and the Humanities. Richard is a professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, Director of the Sustainable Archives and Leveraging Technologies (SALT) lab, and co-director of the Digital Innovation Lab (DIL).
(A copy of Richard's presentation is available here on Slideshare)
The following highlights the points in Richard's presentation that, for many of us, represent the best of what academia offers -- complex, collaborative, and innovative research that builds, applies, investigates, and distills knowledge across a diverse social landscape.
"Socializing big data" represents one of the most challenging and intriguing sociotechnical questions of the 21st century: We have the data. We have a lot of it. Now what?

Richard started his talk by describing the issues that keep him up at night: "It is not just the messiness of all this data, but the notion that big data can create big collaborations, which invites key questions: How can people get along and bring diverse points to the table? Big collaborations also lead to bigger ideas, so how can we guide research directions and develop innovative approaches that benefit from that kind of diversity?"

To illustrate big data and big collaborations, Richard highlighted the “Records in the Cloud” project, led by the University of British Columbia iSchool in collaboration with the University of Washington iSchool and Mid Sweden University's Information Technology and Media program. The project examines what it means to delegate to cloud providers the responsibility for security, accessibility, disposition, and preservation. To quote Richard, “This is the nature of a lot of these projects -- which is to say, it is cross-disciplinary in nature, and digs deep and broad to make sure there are viewpoints and representation that go far beyond the technological aspects.”

Going beyond the technological aspects of big data matters now more than ever. 
The White House announced that Big Data is a Big Deal in March 2012, a headline with teeth, evidently, since they backed it up with $200 million in funding across six federal departments and agencies. We need to be smart about what we do with this opportunity. Writer Edd Dumbill of Forbes magazine shared his own take in an article titled Big Data, Big Hype, Big Deal, an intelligent forecast of big data’s potential for “[S]ensing, algorithmic discovery and gaining deeper insight through data. Essentially, the emergence of a global digital nervous system.”

The question, then, is what does it mean to gain deeper insight through data? As Alistair Croll wrote, “Big data is our generation’s civil rights issue, and we don’t know it.” Generating deeper insights requires having diverse viewpoints at the table. As an example, Richard referred to the iRODS Primer: Integrated Rule-Oriented Data System that he co-authored. His questions to us seem technical, but they are fundamentally ethical:
  • How can you specify rules of engagement and rules of authenticity? How can you instrument content management systems with new forms of automation?
  • How could you instruct these systems with rules of ethics, or rules of social behavior?
  • How can you customize the behavior of the system so it is more user-friendly, smarter, and able to adapt as a single system that works for everyone?
No one is doing this yet, says Richard. What would a system like this mean? “Access and linkage to your big data collections would be governed by principles, and then the collection would try to enforce those things. It’s a return to the old days of artificial intelligence when we had expert systems, not just collections of content. A set of rules, triggers, and policies would customize the entire behavior. Dealing with big data brings us back to this kind of space, this kind of thinking.”
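To make the idea of a policy-governed collection slightly more concrete, here is a conceptual sketch in Python. It is not iRODS rule language and not code from the iRODS Primer; the class names, rules, and records are invented. It only illustrates the pattern Richard describes: access to a collection mediated by declarative rules that the collection itself enforces.

```python
# A conceptual sketch (not iRODS syntax) of a collection whose access is
# governed by declarative rules: every request is checked against each
# rule before any content is returned.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Request:
    user: str
    role: str          # e.g. "researcher", "community-member", "public"
    item: str
    purpose: str

@dataclass
class Rule:
    description: str                     # human-readable rationale
    allows: Callable[[Request], bool]    # predicate the request must satisfy

@dataclass
class GovernedCollection:
    items: dict = field(default_factory=dict)
    rules: list = field(default_factory=list)

    def access(self, req: Request):
        # Enforce every rule; refuse access with the rule's rationale if one fails.
        for rule in self.rules:
            if not rule.allows(req):
                raise PermissionError(f"Denied by rule: {rule.description}")
        return self.items[req.item]

collection = GovernedCollection(
    items={"southside_survey_1968": "scanned survey records (placeholder)"},
    rules=[
        Rule("Requests must state a purpose",
             lambda r: bool(r.purpose)),
        Rule("Survey records are open to researchers and community members",
             lambda r: r.role in {"researcher", "community-member"}),
    ],
)

print(collection.access(Request("pndiaye", "community-member",
                                "southside_survey_1968", "neighborhood history")))
```

In a real data grid such rules would live alongside the collection itself rather than in application code, which is the shift Richard is pointing to.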

When we start imagining intelligent systems that are governed by principles and ethics, Richard commented, “You really need different viewpoints at the table. You can’t afford these projects that solely deal with the role of cyberinfrastructure and that punt on these other topics.”

That brings us to another question raised during the talk: How does the advent of big data change the way we do social science, and what role will social scientists play? There are so many research issues that haven’t even been framed or formed yet. Understanding how to collaborate is just as important as how we deal with big data. We are in the early stages of learning what it means to collaborate across big data projects.

On the CI-BER project that Richard and HASTAC are working on, the research involves not just big teams of scholarly collaborators, but big teams of neighborhood groups, public libraries, the chamber of commerce, county organizations, regional planning councils, and other stakeholders. “These projects are larger than life. When you bring them together, it supersedes the capability of any one individual to do them, so you have to really rethink the entire research process. We cannot simply automate, we cannot just rely on technology.”

Yes, in technical terms, we are talking about “researching the cyberinfrastructure implications of supporting large-scale content-based indexing of highly heterogeneous digital collections potentially embodying non-uniform or sparse metadata architectures.” But we are also talking about the nuts and bolts of how people work together.
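To unpack that mouthful a little, here is a toy sketch of content-based indexing over records whose metadata is sparse and non-uniform: instead of assuming a fixed schema, the index is built from whatever text fields each record happens to carry. The records and field names are invented for illustration.

```python
# A toy sketch of indexing heterogeneous records with sparse, non-uniform
# metadata: build a simple inverted index from whichever string fields a
# record actually has, rather than requiring a fixed schema.
from collections import defaultdict

records = [
    {"id": "holc-1937-d4", "area": "Southside", "grade": "D",
     "notes": "detrimental influences noted by HOLC surveyors"},
    {"id": "renewal-1968-12", "title": "East Riverside redevelopment plan",
     "body": "acquisition and clearance of residential parcels"},
    {"id": "photo-0453"},   # a scanned image with almost no text metadata
]

def tokens(record):
    # Pull searchable words out of whichever string fields exist.
    for value in record.values():
        if isinstance(value, str):
            yield from value.lower().split()

# Inverted index: term -> set of record ids containing that term.
index = defaultdict(set)
for record in records:
    for term in tokens(record):
        index[term].add(record["id"])

print(sorted(index["clearance"]))   # -> ['renewal-1968-12']
```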

When asked by a member of the audience how he manages big collaboration, Richard responded, “Collaboration is essentially a translation and semantics issue. For some, it might make sense to hire a technical broker. Or it might make sense to bring on humanists and philosophers who can help bring people together and help position people and ideas. My experience is that this is very humbling and it takes a lot of time. It can take a year, a year and a half before relationships of trust are developed, and people begin to understand other people’s language.”
*** *** ***
Richard leads development of "big data" projects funded by Mellon, NSF, NARA, NHPRC, IMLS, DHS, NIEHS, and UNC. Recent 2012 grants include a JISC Digging into Data award with UC Berkeley and the U. of Liverpool, called "Integrating Data Mining and Data Management Technologies for Scholarly Inquiry"; a Mellon / UNC award called "Carolina Digital Humanities Initiative," which involves translating big data challenges into curricular opportunities; and an NSF award for CI-BER, a collaborative big heterogeneous data integration project between Duke, the University of North Carolina-Chapel Hill, and HASTAC.
He holds a B.S. in Avionics and Electrical Engineering, and an M.S. and Ph.D. in Computer Science, and has worked as a postdoc in Computational Geography. He conducted interdisciplinary research at the San Diego Supercomputer Center at UC San Diego, working with teams of scholars in the sciences, social sciences, and humanities.
Join HASTAC's Collaborative Data group to follow Richard's work with CI-BER and HASTAC's EAGER project, or to share your own data collaborations.

(Image of a starling murmuration is courtesy of http://www.flickr.com/photos/rocketjohn/6350566523/.)