Librarian-in-the-Loop Deep Learning To Curate Very Large Biomedical Image Datasets




From the double helix to the Pillars of Creation, science is often driven by innovative instrumentation and imaging. Recent breakthroughs in Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) technology have made it possible to reliably acquire nanometer-scale 3D images of sizable volumetric biomedical samples, each yielding tens to hundreds of terabytes of raw image data. After generating landmark datasets for neuroscience and cell biology research, scientists at Yale are now applying enhanced FIB-SEM to enable discoveries in translational and clinical research. As with many other data-intensive science challenges, the bottleneck has shifted from data collection and storage to data curation, whose primary purpose is extracting insights and knowledge from data, albeit with more stringent requirements on efficiency, timeliness, replicability, and reusability.

Existing data curation models and frameworks are insufficient to address these challenges. In addition, the sheer data volume makes comprehensive close reading and manual image annotation impractical. For example, it has been estimated that the FIB-SEM images taken from a single cell may take up to 60 person-years to annotate manually. To make sense of these images, researchers increasingly resort to machine learning methods. Supervised deep learning has been applied to FIB-SEM images, but its performance can be unreliable: training a model for automatic image segmentation may take months on a GPU cluster and still result in overfitting. A recent study suggests that interventions from experienced domain experts and data curators may drastically speed up training, although the performance gain from such human interventions has not been carefully benchmarked. Very large FIB-SEM datasets therefore present an archetypal test case for how best to orchestrate scientists, data curators, cyberinfrastructure, software, and deep learning algorithms to achieve the best data-to-insight performance.

This project will draw insights from our prior IMLS-funded project on curating very large research datasets. Our past experience has shown that 1) data curators/librarians should be deployed in the big data pipeline as early as possible, even at the stage of physically acquiring data, because knowledge of data acquisition often affords pertinent opportunities to optimize the data pipeline; 2) data curation should be driven primarily by data use and reuse, which closely aligns librarians/data curators with domain scientists, and long-term preservation activities are better performed as a side effect of data use and reuse; and 3) the efficiency, cost, and performance of extracting insights from data are often the critical success factors for data curation and are closely associated with both the data format and the choice of cyberinfrastructure. Experimenting and benchmarking are often the more effective way to achieve balanced results, hence this prototyping project.

Preliminary Results (Updated Aug 2023)

Our initial work has focused primarily on reproducing the results of the Open Organelle project, as described in this Nature paper.

Since Oct 2022, we have been working on FIB-SEM images taken from the Bordey Lab's mouse brain samples. Our current focus is on identifying nuclear pores. Zhiwu leads the labeling efforts to produce ground truth data and works with Yinlin on the training and prediction efforts.
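The project description does not say how predictions are scored against the ground truth labels. As one possible illustration, the sketch below computes two common overlap measures for binary segmentation, the Dice coefficient and intersection over union (IoU). The function name `dice_and_iou`, the representation of a mask as a set of (z, y, x) voxel indices, and the toy data are all assumptions made for this example, not part of the project's pipeline.

```python
from typing import Set, Tuple

# Hypothetical representation: a binary mask is the set of (z, y, x)
# indices of the voxels labeled as nuclear pore.
Voxel = Tuple[int, int, int]


def dice_and_iou(truth: Set[Voxel], pred: Set[Voxel]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks given as sets of voxel indices."""
    overlap = len(truth & pred)  # voxels labeled in both masks
    union = len(truth | pred)    # voxels labeled in at least one mask
    if union == 0:
        # Both masks are empty; treat this as perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * overlap / (len(truth) + len(pred))
    iou = overlap / union
    return dice, iou


# Toy example (made-up data): 4 ground truth voxels, 3 predicted voxels,
# 2 of which coincide.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (0, 1, 2)}
dice, iou = dice_and_iou(truth, pred)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

Tracking a score of this kind across labeling schemes is one way the effect of curator interventions could be compared, although the measure actually used in this project may differ.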

Initial Labeling Scheme

Modified Labeling Scheme

Initial Prediction

Projections of Cell1_2_Crop1_new.avi

Predicted Nuclear Pores in 3D