Good practices for FAIR data management (4) - an interview with Pascal de Boer (UG DCC & GBB) about setting up a FAIR data repository
Date: 26 May 2023
Author: Leon ter Schure
Part of open science is that researchers make their data FAIR: Findable, Accessible, Interoperable and Reusable. But how to do this? In this series, we ask researchers to tell us more about their data management choices.
Appropriate infrastructure is crucial for practicing open science. In this edition we ask Pascal de Boer to tell us more about the FAIR and open access data repository he co-created for large-scale electron microscopy (EM) images.
Pascal works as data steward at the UG Digital Competence Centre (UG DCC) and the Groningen Biomolecular Sciences and Biotechnology Institute (GBB, Faculty of Science and Engineering). He was formerly a postdoc at the Department of Biomedical Science of Cells and Systems of the UMCG.
What is this repository about?
The repository contains large-scale electron microscopy (EM) images of pancreatic organ donor tissue. In the past, EM produced images that excluded the whole context from which they were obtained. With newer techniques such as large-scale EM, we can record overviews of complete tissue sections, up to square millimeters in size, while retaining high resolution. This makes it possible to study the samples in great detail by scrolling and panning through these tissue maps, similar to using Google Maps. We called this technique nanotomy, for nano-anatomy. These vast amounts of data are recorded automatically, whereas in classical imaging an operator had to sit behind the microscope and record every detailed image manually. Nanotomy maps can be made available relatively easily through repositories, which allows other researchers to reuse the data.
Why did your research team decide to set up this repository?
My previous research team, led by dr. Ben Giepmans, had two main topics of focus: advanced microscopy development and the study of type 1 diabetes. Type 1 diabetes results from the loss, caused by an autoimmune reaction, of insulin-producing cells located in the islets of Langerhans in the pancreas. As a consequence, patients are not able to control their blood sugar levels, which ultimately leads to consistently high levels of glucose in the blood. These patients are dependent on insulin treatment for the rest of their lives and are prone to many diabetes-related complications. From the Network for Pancreatic Organ donors with Diabetes (nPOD) we received pancreatic donor material for nanotomy processing in order to better understand the pathogenesis. EM data is intrinsically information-rich and can be analyzed for multiple research purposes. We set up a repository to make the data openly available to other diabetes researchers worldwide, initially on a server of our own. When we made the repository openly available alongside a research article, it included 64 nanotomy maps, and it is still growing. At the time this was the largest repository of biomedical EM of human subjects available, and perhaps it still is. Thanks to this repository, research groups worldwide no longer need to perform EM themselves on tissue received from nPOD or on other donor material. This saves valuable resources, because EM is laborious and costly. Moreover, all donor subjects are tagged with a unique identifier, which potentially allows EM findings to be combined with results from other labs that analyze pancreas tissue from the same donor(s) with different techniques (genetics, proteomics, etc.). My previous lab is now included as the EM node for the nPOD program, which means all the donor tissue intended for EM is processed, recorded, and shared from there.
The team received an innovation award for making this data repository open access in a FAIR way. What steps did you take to make this repository FAIR and why is this important?
We realized that the repository we were hosting on our own server was not as findable and reusable as we had envisioned. For example, the website was linked to the publication but lacked persistent identifiers, which help to increase findability. Furthermore, proper metadata tagging was absent and the repository mainly contained proprietary file formats, which potentially limits future automatic analyses of the data. For these reasons the team aimed to make the data more FAIR by using existing open access infrastructure, such as the BioImage Archive and the Image Data Resource (IDR). All published images were converted into a standardized open microscopy file format, OME-TIFF, using an OME-TIFF converter, and were furthermore enriched with all the necessary metadata using the bioinformatics tool MOLGENIS, in collaboration with the group of Morris Swertz of the UMCG. Morris Swertz has a long track record of data FAIRification in the field of genetics, and his input was therefore a valuable contribution to this project.
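To make the gaps described above concrete, here is a minimal Python sketch of how one could check an image record for them: a missing persistent identifier, a non-open file format, and missing metadata. This is not the team's actual workflow or the MOLGENIS data model. The `ImageRecord` fields, the `OPEN_FORMATS` set, the `fair_gaps` helper, and the identifier values are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical list of formats treated as "open" in this sketch.
OPEN_FORMATS = {"ome-tiff"}


@dataclass
class ImageRecord:
    """Illustrative metadata record for one nanotomy map (assumed fields)."""
    title: str
    file_format: str
    donor_id: str = ""
    persistent_id: str = ""
    keywords: list = field(default_factory=list)


def fair_gaps(record):
    """Return a list of gaps that limit findability or reuse of a record."""
    gaps = []
    if not record.persistent_id:
        gaps.append("no persistent identifier")
    if record.file_format.lower() not in OPEN_FORMATS:
        gaps.append("proprietary or unlisted file format")
    if not record.donor_id:
        gaps.append("no donor identifier")
    if not record.keywords:
        gaps.append("no descriptive keywords")
    return gaps


# A record as it might look on a self-hosted server (placeholder values).
before = ImageRecord(title="Islet overview", file_format="vendor-raw")

# The same record after conversion and metadata enrichment (placeholder values).
after = ImageRecord(
    title="Islet overview",
    file_format="OME-TIFF",
    donor_id="donor-0001",
    persistent_id="doi:10.0000/example",
    keywords=["pancreas", "electron microscopy"],
)

print(fair_gaps(before))
print(fair_gaps(after))
```

In practice, established repositories such as the ones mentioned above define their own required metadata fields, so a check like this would follow their submission guidelines rather than an ad hoc list.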
EM is often quite laborious and costly. It does, however, provide valuable insights into cellular and molecular aberrations in tissues under disease conditions. Furthermore, as mentioned before, EM is very information-dense, as it reveals many different structures of cells and tissue at the nanometer to micrometer level. Each of these structures can be analyzed for different research questions. Lastly, the material comes from human donors, which makes it intrinsically very valuable, including for the donors and their relatives. This open repository creates maximum value out of the pancreas material and data.
(How) is data from this repository actually being reused by others?
Several research groups worldwide examine the type 1 diabetes nanotomy repository for independent research topics. Moreover, a couple of these groups have already included results from reusing the repository in their own publications. These include a couple of articles about intracellular processes that could contribute to the development of type 1 diabetes, such as autophagy and neoantigen formation.
You are now working as a data steward at the UG Digital Competence Centre & Groningen Biomolecular Sciences and Biotechnology Institute. What kind of support do you provide?
I help researchers with questions about the whole research data life cycle. This includes advice on writing a Research Data Management Plan (RDMP), which is required both by the institute when researchers start a project and by funding agencies once a grant has been awarded. The use of open repositories already has quite a history in the life sciences, such as the NCBI for genetics and the Protein Data Bank for protein structures. As a consequence, infrastructure is in place for these fields. Of course, I also help researchers who work in fields where such repositories are not available or common.
The most common questions that I receive are about archiving data locally for internal reuse. Most research groups use local solutions such as Network Attached Storage (NAS) or several USB devices for long-term data storage. Retrieving data from years ago, produced by a researcher who has already left, can be a big burden for group leaders. My advice is to make use of the UG Research Data Management System (RDMS), which has been developed for long-term storage and data archiving, with the possibility to add metadata that increases findability. I also collaborate closely with the developers of the RDMS to further improve the system.
My advice for researchers regarding data management is to start with the end in mind. It is often quite difficult to adjust your data management practice after a couple of years or when you reach the end of your project. You want to avoid finding yourself in a big data management mess. For this reason, research data management planning is crucial. If you run into trouble or do not know how to start, you can always consult a local data steward, if one is available, or contact the UG Digital Competence Centre (UG DCC) for help.
Interesting Links:
Repositories mentioned are the original nanotomy repository, the BioImage Archive, and the Image Data Resource.
Used software: OME-TIFF converter and MOLGENIS
The UG Digital Competence Centre (UG DCC) supports researchers in managing their research data throughout the entire research (data) life cycle, from grant proposal to FAIR data archiving.
Learn more about the UG’s Open Science Programme
About the author
Leon ter Schure is Lead of the UG Digital Competence Centre (DCC).