Librarian-in-the-Loop Deep Learning To Curate Very Large Biomedical Image Datasets
Collaborators
C Shan Xu, Harvey and Kate Cushing Professor of Cellular & Molecular Physiology, Yale School of Medicine
Song Pang, Director, FIB-SEM Collaboration Core, Yale School of Medicine
Yinlin Chen, Assistant Director & Assistant Professor, Center for Digital Research & Scholarship, Virginia Tech Libraries
Angelique Bordey, Rothberg Professor of Neurosurgery & Co Vice Chair of Research, Neurosurgery, Yale School of Medicine
External Funding
Zhiwu Xie (PI), Yinlin Chen (Co-PI, Assistant Director, Center for Digital Research & Scholarship, Virginia Tech Libraries), Song Pang (Director, FIB-SEM Collaboration Core, Yale University School of Medicine). Curating Very Large Biomedical Image Datasets For Librarian-In-The-Loop Deep Learning. $149,216 federal funding.
Students
NSF DS-PATH Summer Fellows (Jun - Aug 2024):
Sarah Ramirez, Computer Science MS student, University of California, Riverside
Gian Carlo Robles, Associate Degree in Data Science student, Moreno Valley College
Mubariz Mohammed, MS student in Computational Data Science, University of California, Riverside
Ali Becerra Jr., Computer Science BS student, California State University, San Bernardino
Charles O’Hagin, Data Science BS student, University of California, Riverside
Linhan Wang, Computer Science PhD student, Virginia Tech, Jun 2024 - present
Steve Neustadter-Schneider, Postgraduate Research Associate, Yale University School of Medicine, Aug 2023 - July 2024
Chongyu He, Computer Science MS student, Virginia Tech, Jan 2022 - July 2024. I am Chongyu's thesis committee co-chair (with Prof. Ed Fox at VT Computer Science/Digital Library Research Lab).
Lennon Headlee, Aerospace Engineering Undergraduate Student, Virginia Tech, Aug - Dec 2022
Sid Pothineni, Computer Science Undergraduate student, Virginia Tech, Aug - Dec 2022
Shruti Dongare, Computer Science PhD student, Virginia Tech, May - Aug 2022
Sareh Ahmadi, Computer Science PhD student, Virginia Tech, May - Aug 2022
Synopsis
From the double helix to the Pillars of Creation, science is often driven by innovative instrumentation and imaging. Recent breakthroughs in Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) technology have made it possible to reliably acquire nanometer-scale 3D images of sizable volumetric biomedical samples, each resulting in tens to hundreds of terabytes of raw image data. After generating landmark datasets for neuroscience and cell biology research, scientists at Yale are now bringing enhanced FIB-SEM to translational and clinical research to enable new discoveries. As in many other data-intensive science challenges, the bottleneck has shifted from data collection and storage to data curation for the primary purpose of extracting insights and knowledge from data, albeit with more stringent requirements on efficiency, timeliness, replicability, and reusability.
Existing data curation models and frameworks are insufficient to address these challenges. In addition, the very large data volume has rendered comprehensive close reading and manual image annotation impractical. For example, it has been estimated that FIB-SEM images taken from a single cell may take up to 60 person-years to annotate manually. To make sense of these images, researchers increasingly resort to machine learning methods. Supervised deep learning has been applied to FIB-SEM images, but its performance can be unreliable. Training a model for automatic image segmentation may take months on a GPU cluster and still result in overfitting. Thankfully, a recent study suggests that interventions from experienced and insightful domain experts and data curators may drastically speed up the training, although the performance gain originating from such human interventions has not been carefully benchmarked. Very large FIB-SEM datasets therefore present an archetypal test case for how best to orchestrate scientists, data curators, cyberinfrastructure, software, and deep learning algorithms to achieve the best data-to-insight performance.
This project will draw insights from our prior IMLS-funded project on curating very large research datasets. Our past experience has shown that: 1) Data curators/librarians should be deployed in the big data pipeline as early as possible, even at the stage of physically acquiring data. Knowledge of data acquisition often affords pertinent opportunities to optimize the data pipeline. 2) Data curation should be driven primarily by data use and reuse, which closely aligns librarians/data curators with domain scientists. Long-term preservation activities are better performed as a side effect of data use and reuse. 3) The efficiency, cost, and performance of extracting insights from data are often the critical success factors for data curation and are closely associated with both the data format and the cyberinfrastructure options and choices. Experimenting and benchmarking are often the more effective way to achieve balanced results, hence this prototyping project.
Preliminary Results (Updated July 2024)
Our initial work has been primarily focused on reproducing the results from the Open Organelle project, as described in this Nature paper.
Since Oct 2022, we have been working on FIB-SEM images taken from Bordey Lab's mouse brain samples. Our current focus is on identifying nuclear pores. Zhiwu leads the labelling efforts to produce ground truth data and works with Yinlin on the training and prediction efforts.
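As an illustration only, the sketch below shows one common way a predicted segmentation mask could be compared against hand-labelled ground truth, using the Dice coefficient and intersection-over-union (IoU). It is not the project's actual evaluation code: the function name overlap_scores, the toy volume, and the choice of metrics are all assumptions made for this example.

```python
import numpy as np


def overlap_scores(pred, truth):
    """Return (dice, iou) for two binary masks of the same shape.

    Hypothetical helper for illustration; not taken from the project.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")

    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()

    # Two empty masks agree perfectly by convention.
    if total == 0:
        return 1.0, 1.0
    return float(2.0 * intersection / total), float(intersection / union)


# Toy 3D volume standing in for a small labelled FIB-SEM crop (assumed data).
truth = np.zeros((4, 8, 8), dtype=bool)
truth[1:3, 2:6, 2:6] = True   # 32 labelled voxels

pred = np.zeros_like(truth)
pred[1:3, 3:7, 2:6] = True    # same size, shifted by one voxel along one axis

dice, iou = overlap_scores(pred, truth)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")  # Dice = 0.75, IoU = 0.60
```

Overlap metrics of this kind are one possible basis for benchmarking how much a round of expert labelling improves predictions, since the same score can be recomputed after each intervention.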