The National Institutes of Health (NIH) is creating infrastructure capabilities and new programs to promote data discovery, use and sharing in alignment with its upcoming data management policy.
“Our goals are to catalyze data science capabilities across all the 27 institutes and centers at NIH,” Susan Gregurick, associate director for data science and director of the Office of Data Science Strategy, said at GovCIO Media & Research’s Infrastructure: Health IT virtual event. “We do that by working with a very large number of colleagues across NIH (almost over 200 NIH staff work with us on various teams) to help us implement different data science strategy capabilities.”
Data in the biomedical space is historically very heterogeneous, which has presented challenges in data integration and big data analytics. NIH has also faced obstacles with data policies and data governance that are structured for a “one dataset at a time approach.”
Serving as the foundation of NIH’s data efforts is the agency’s upcoming policy on data management, which is set to go into effect in January 2023. The policy will strive to make data as widely and freely available as possible while safeguarding privacy and protecting confidential and proprietary data.
In addition, the agency is preparing data repositories and developing guidance to help researchers create data management plans that ultimately drive the concept of “data curation at scale.”
“I hope this just gives you a little bit of idea of some of the things that are happening in the biomedical data landscape, computing architecture, to ways in which we're thinking of data as an infrastructure, to finding data and resources across different programs, and then of course, helping researchers create data management and sharing plans and really enable greater data sharing across all our programs,” Gregurick said.
Gregurick pointed to the agency's STRIDES Initiative, which provides NIH investigators with cloud computing resources as well as training and workforce development. Through the program, the agency has experimented with several infrastructure approaches, including hub-and-spoke, distributed, centralized and federated, to build out its health care delivery and research ecosystem.
“A few promising activities include the National COVID Cohort Collaborative (N3C),” Gregurick said. “It's really an amazing activity in terms of what they're doing with the data. So as a highlight, they developed a way to extract and harmonize clinically related data at a scale that's really quite unprecedented for NIH.”
The collaborative worked across more than 72 sites to extract participant data derived from electronic health records (EHR), patient chart data, medical histories, diagnostics, demographics, immunization records, imaging data and pathology data. The aim was to improve access to and use of COVID‑19 clinical data, which could then inform pandemic-related research questions.
“All this data is harmonized ... so that researchers have access to a very large and extensive amount of data sets that can be used at scale,” Gregurick said.
Under NIH’s Rapid Acceleration of Diagnostics (RADx) Program, a hub-and-spoke approach enabled the agency to speed up development, validation and commercialization of point-of-care and home-based testing. Gregurick said the program is spearheading a community-driven, integrated approach for creating data models and common data elements that can be harmonized.
As the agency continues to explore different federated approaches, like its Cloud Platform Interoperability Effort, the goal is to provide researchers with capabilities to find, access and use distributed datasets across NIH-supported platforms and institutes.
“This is a fairly complicated and complex program, but the idea is that the flexibility of creating an integrated and interoperable and somewhat federated approach allows new programs to be added and new capabilities to be developed on demand, so we can start out small and build the program in a really flexible manner,” Gregurick said.
NIH aims to grow its STRIDES Initiative to explore the use of cloud environments through commercial cloud providers. These partnerships will enable access to improved datasets and advanced computational infrastructure, tools and services.
“The overall goal is to provide a modernizing integrated biomedical data ecosystem,” Gregurick said. “That sounds easy, but it's actually quite challenging. Because of the diversity of science across NIH, and the diversity of needs and capabilities, an overall one-size-fits-all strategy is very challenging.”
With the promise of emerging technologies like hypercomputing, Gregurick said the agency will eye “data as a product” and “data as an infrastructure” to enable NIH to provide diagnostic functionality and tools and maintain interoperability.
“We have a significant amount of data and computational capabilities in our STRIDES partnerships. We want to leverage these, but we also want to include ways in which we can provide equality for all our research communities,” Gregurick said.