Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse
Tyler Walters, Dean, Virginia Tech Libraries
Edward Fox, Professor, Department of Computer Science, Virginia Tech
Pablo Tarazaga, Professor & Associate Department Head of Research and Strategic Initiatives, Department of Mechanical Engineering, Texas A&M University
Jiangping Chen, Professor and Chair, Department of Information Science, University of North Texas
Zhiwu Xie (PI), Tyler Walters (Co-PI), Edward Fox (Co-PI), Pablo Tarazaga (Co-PI), Jiangping Chen (Co-PI). Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse. $308,175. Institute of Museum and Library Services LG-71-16-0037-16.
Abhinav Kumar, MS Computer Science, Virginia Tech, Aug 2017 - May 2019. Now at Amazon. I am Abhinav's thesis committee member (chaired by Ed Fox).
Xinyue Wang, Computer Science PhD student, Virginia Tech, Aug 2017 - present. I am Xinyue's dissertation committee co-chair (with Ed Fox).
Xiaoyu Zhang, Information Science PhD student, University of North Texas. Advised by Prof. Jiangping Chen.
As a key component of the nation’s knowledge infrastructure, libraries must continuously reinvent themselves as new discovery paradigms emerge and become established. The recent wave of data-intensive science has motivated many high-profile library big data services, notably the ambitious plan to archive all tweets at the Library of Congress, the heterogeneous and geographically replicated archival storage known as the Digital Preservation Network (DPN), the data mining facility at the HathiTrust Research Center (HTRC), and the metadata hubs developed at the Digital Public Library of America (DPLA) and the SHARE initiative. Many more are under development or in planning. The scope of this project is limited to the technical infrastructure of such services and its implications for staff training, two important components of the National Digital Platform.
With an emphasis on big data sharing and reuse, this research project aims to develop an evidence-based, broadly adaptable cyberinfrastructure (CI) strategy to operate digital library services.
Patterns for Library Big Data Services
How big is too big in terms of big data? 100MB, 100GB, 100TB, or 100PB? It depends on how fast we can process and use the data to answer questions. Performance is therefore the key to solving big data problems. Despite rapid progress in hardware, software, and systems, the data format is an often overlooked infrastructural component of big data analytics. The most performant choice is not always aligned with the disciplinary or domain norm, which often lags behind big data best practices. Libraries have a role to play in helping researchers make the right choice so as to eliminate potential barriers to use and reuse.
Take as an example ISO 28500, the WARC file format, which is still the standard archival format for most web archives. Our experiments showed that converting web archive data from WARC to Parquet or Avro speeds up Latent Dirichlet Allocation (LDA), a frequently used topic modeling procedure, by a factor of 5 to 10 on the same data. Some other routine data analysis operations see even larger gains, sometimes up to two orders of magnitude. Such speedups make a radical difference for researchers: many previously impenetrable datasets suddenly open up for exploration and interrogation if they are converted to, or originally archived in, the more performant formats.
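To make the idea of format conversion concrete, the following is a minimal sketch, not the project's actual pipeline. It assumes an uncompressed, simplified WARC stream and uses only the Python standard library to pivot record-oriented data into a column-oriented layout, the organizing principle behind Parquet. The function names (iter_warc_records, to_columns, make_record) and the sample records are hypothetical; a real conversion would rely on a full WARC parser and an actual Parquet or Avro writer.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Simplified for illustration: no gzip handling and no header continuation lines.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length tells us exactly how many payload bytes follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def to_columns(records, fields=("WARC-Type", "WARC-Target-URI", "WARC-Date")):
    """Pivot an iterable of records into a dict of equally long column lists."""
    columns = {name: [] for name in fields}
    columns["payload"] = []
    for headers, payload in records:
        for name in fields:
            columns[name].append(headers.get(name))
        columns["payload"].append(payload)
    return columns


def make_record(rec_type, uri, date, body):
    """Build one simplified WARC record (hypothetical sample data only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


SAMPLE = (
    make_record("response", "http://example.org/a", "2016-01-01T00:00:00Z", b"first page")
    + make_record("response", "http://example.org/b", "2016-01-02T00:00:00Z", b"second page")
)

columns = to_columns(iter_warc_records(io.BytesIO(SAMPLE)))
uris = columns["WARC-Target-URI"]  # a single column, read without touching payloads
```

With this layout, an operation that needs only the URIs reads one list and never touches the page payloads, which is one plausible reason a columnar format such as Parquet can speed up routine analysis.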
Our findings are corroborated by much other research, though in different disciplines and contexts. Commonly used data formats such as JSON, XML, HDF5, and FITS all become increasingly obstructive as data size grows, and therefore exact a heavy performance and reuse penalty on researchers and end users.
Topic Modeling Speedups Using Different Data Formats
Zhiwu Xie, Yinlin Chen, Julie Speer, Tyler Walters, Pablo Tarazaga, and Mary Kasarda. Towards use and reuse driven big data management. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. Knoxville, TN. ACM, 2015. https://doi.org/10.1145/2756406.2756924
Zhiwu Xie, Edward A Fox, Tyler Walters, Pablo Tarazaga, and Jiangping Chen. Developing Library Cyberinfrastructure (LCI) Strategy for Big Data Sharing and Reuse, D-Lib Magazine, September/October, 2016. http://www.dlib.org/dlib/september16/09inbrief.html
Zhiwu Xie, Advancing Library Cyberinfrastructure for Big Data Sharing and Reuse, invited talk at National Federation of Advanced Information Services (NFAIS) 2017 Annual Conference, Alexandria, VA, 2017 (Feb. 27)
Greg Tananbaum, Zhiwu Xie, and Anita de Waard, Rolling in the Deep Analytics: Big Data Comes to Scholarly Communication, Charleston Conference, Charleston, SC, 2016
Yinlin Chen, Zhiwu Xie, and Edward Fox. A Library to Manage Web Archive Files in Cloud Storage, Bulletin of IEEE Technical Committee on Digital Libraries, 13(1), 2017.
Tyler Walters, Big Data: Making Research Accessible, Discoverable, and Reusable, 2017 Texas Library Association Annual Conference, San Antonio, TX, 2017
Zhiwu Xie, and Edward A Fox. Advancing library cyberinfrastructure for big data sharing and reuse. Information Services & Use, 37(3), 319–323, 2017. https://doi.org/10.3233/ISU-170853
Xinyue Wang and Zhiwu Xie. Web Archive Analysis Using Hive and Spark SQL. In Proceedings of the 19th ACM/IEEE-CS Joint Conference on Digital Libraries, Champaign, IL. ACM, 2019. https://doi.org/10.1109/JCDL.2019.00101
Xinyue Wang and Zhiwu Xie. The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle. In Proceedings of the 2020 ACM/IEEE-CS Joint Conference on Digital Libraries, Virtual Event, China. ACM, 2020. https://doi.org/10.1145/3383583.3398542
Natasha Vipond, Abhinav Kumar, Joseph James, Frederick Paige, Rodrigo Sarlo, and Zhiwu Xie. Real-time processing and visualization for smart infrastructure data. Automation in Construction, 2023. https://doi.org/10.1016/j.autcon.2023.104998