Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse

Collaborators

External Funding

Project Website

Students

Synopsis

As a key component of the nation’s knowledge infrastructure, libraries must continuously reinvent themselves as new discovery paradigms emerge and become established. The recent wave of data-intensive science has motivated many high-profile library big data services, notably the ambitious plan to archive all tweets at the Library of Congress, the heterogeneous and geographically replicated archival storage of the Digital Preservation Network (DPN), the data mining facility at the HathiTrust Research Center (HTRC), and the metadata hubs developed at the Digital Public Library of America (DPLA) and the SHARE initiative. Many more are in development or being planned. The scope of this project is limited to the technical infrastructure of such services and its implications for staff training, two important components of the National Digital Platform.

With an emphasis on big data sharing and reuse, this research project aims to develop an evidence-based, broadly adaptable cyberinfrastructure (CI) strategy for operating digital library services.

Patterns for Library Big Data Services

Key Findings

How big is too big in terms of big data? 100 MB, 100 GB, 100 TB, or 100 PB? The answer depends on how fast we can process and use the data to answer questions, so performance is the key to solving big data problems. Despite rapid progress in hardware, software, and systems, an often overlooked infrastructural component of big data analytics is the data format. The most performant choice is not always aligned with the disciplinary or domain norm, which often lags behind big data best practices. Libraries have a role to play in helping researchers make the right choice and thereby remove potential barriers to use and reuse.

Take as an example ISO 28500, the WARC file format, which remains the standard archival format for most web archives. Our experiments showed that after converting web archive data from WARC to Parquet or Avro, running Latent Dirichlet Allocation (LDA), a frequently used topic modeling procedure, on the same data is 5 to 10 times faster. Even larger speedups, sometimes up to two orders of magnitude, can be reached on other routine data analysis operations. Such speedups make a radical difference for researchers: many previously impenetrable datasets open up for exploration and interrogation once they are converted to, or originally archived in, the more performant formats. A rough sketch of the kind of conversion and analysis involved follows below.
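To make the workflow concrete, the sketch below extracts response records from a WARC file into a columnar Parquet table with warcio and pyarrow, then runs Spark's LDA implementation on the converted data. The file paths, column names, and parameter values are illustrative assumptions, not the project's actual pipeline or benchmark configuration.

```python
# Sketch: convert WARC response records to Parquet, then topic-model with Spark LDA.
# Paths, column names, and parameters are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq
from warcio.archiveiterator import ArchiveIterator

urls, bodies = [], []
with open("crawl.warc.gz", "rb") as stream:  # hypothetical input file
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            urls.append(record.rec_headers.get_header("WARC-Target-URI"))
            bodies.append(record.content_stream().read().decode("utf-8", errors="replace"))

# Columnar storage lets downstream jobs read only the columns they need.
# A real pipeline would also filter by MIME type and strip HTML markup first.
pq.write_table(pa.table({"url": urls, "text": bodies}), "crawl.parquet")

# Topic modeling on the converted data with Spark's ML pipeline.
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("warc-lda").getOrCreate()
docs = spark.read.parquet("crawl.parquet")

tokens = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+").transform(docs)
vectors = CountVectorizer(inputCol="words", outputCol="features", vocabSize=10000) \
    .fit(tokens).transform(tokens)

lda_model = LDA(k=20, maxIter=10, featuresCol="features").fit(vectors)
lda_model.describeTopics(5).show(truncate=False)
```

Because Parquet stores the text column contiguously and carries its own schema, the Spark job above can skip the per-record WARC parsing entirely, which is where much of the observed speedup comes from.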

Our findings are corroborated by other studies, though in different disciplines and contexts. Commonly used data formats such as JSON, XML, HDF5, and FITS all become increasingly obstructive as data size grows, and therefore exact a heavy performance and reuse penalty on researchers and end users.
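One quick way to see this penalty on a single machine is to time reading the same table stored as row-oriented JSON versus columnar Parquet. The synthetic data and file names below are illustrative only; they are not the experiments or measurements reported by this project.

```python
# Sketch: compare read times for the same table stored as JSON lines vs. Parquet.
# The synthetic data and sizes are illustrative, not the project's benchmarks.
import time
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n),
    "value": np.random.rand(n),
    "label": np.random.choice(["a", "b", "c"], size=n),
})

df.to_json("sample.jsonl", orient="records", lines=True)
df.to_parquet("sample.parquet")  # requires pyarrow or fastparquet

for path, reader in [("sample.jsonl", lambda p: pd.read_json(p, lines=True)),
                     ("sample.parquet", pd.read_parquet)]:
    start = time.perf_counter()
    reader(path)
    print(f"{path}: {time.perf_counter() - start:.2f}s read time")
```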

Topic Modeling Speedups Using Different Data Formats

Related Publications