The National Institutes of Health is preparing for a new data management policy in 2023 that will require researchers to plan for how they're going to manage and share their data. Ahead of this and faced with growing pressure to keep up with the increasing amounts of generated data, NIH leadership is working to broaden a key initiative that has made massive amounts of data available to biomedical researchers over the past few years.
Since 2018, the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative has harnessed cloud environments to help the agency revolutionize how data is accessed and who can access it.
“It enables us to connect our data platforms so we’re creating more than just a lot of data silos,” NIH Office of Data Science Strategy Director Susan Gregurick told GovCIO Media & Research. The STRIDES Initiative is one of NIH's efforts to implement its overall data science strategic plan, a new version of which is expected in mid-2023. “No matter where you are or who you are in doing research, you can connect across our data platforms to find the data and do the analysis that you want. That’s the goal.”
Over the past year, NIH has acquired over 200 petabytes of data, or about two-fifths of Spotify's catalog. By 2025, NIH expects the total amount of genomics data alone to surpass the amounts produced by other large data generators, such as astronomy and YouTube, according to NIH's published strategic plan.
Moving forward with the cloud, NIH wants to put that data into the hands of all researchers who need it.
“One of our goals going forward now that we have STRIDES is to really broaden it and think about how all researchers in biomedical science can benefit from the cloud, but it’s going to take time and effort. Just having it available doesn’t mean it's going to happen, we have to do some work,” Gregurick said.
When researchers are thinking about sharing data, they have options. They can decide to share data on the cloud and enable quick and easy access to it, or they can also share data in repositories, such as a community repository or general repository that will be "cloud enabled," Gregurick added.
For instance, the agency's new Generalist Repository Ecosystem Initiative (GREI) improves access to NIH-funded data through seven large general repositories: Dryad, Dataverse, Figshare, Mendeley Data, Open Science Framework, Vivli and Zenodo.
In fact, the cloud has played a key role in data collection during the COVID-19 pandemic, and it’s also helping the agency better prepare for future public health crises.
The National Center for Advancing Translational Sciences (NCATS) has been gathering data related only to COVID-19 from multiple centers and consolidating it on a single platform in the cloud.
“They are now working on how to identify people who have long COVID and make predictions on who likely will have a bad outcome from COVID depending on underlying health issues,” Gregurick said. “That resource is helping us now with the pandemic and will help us address future health emergencies.”
Gregurick and the team aim to create a cloud lab for researchers who have limited resources to explore the cloud or who have no experience in cloud computing.
“There will be training modules and peer-reviewed data sets. They can play around with analytic processes,” Gregurick said. “They can also bring their own tools and data into the cloud lab to try things out and see what they want to do. This in itself will bring more researchers to use cloud computing.”
NIH's upcoming data management policy will go into effect Jan. 25, 2023. Gregurick believes it will lead to a widespread culture change as it will require researchers to plan for how they’re going to manage and share their data.
The idea is that “data sharing should be the default and that we should maximize it to the best of our ability,” Gregurick said. “We anticipate researchers will be sharing the data and the results from their scientific endeavors, and we want the data that makes your research reproducible even if that data is not published.”
“Hopefully we will enable STRIDES to improve our data integration and big data analytics,” Gregurick added. “I think researchers will be more interested in the cloud as they think about data sharing through the policy.”