Good practices for FAIR data management - an interview with Kasper Meijer on ‘The seafloor from a trait perspective. A comprehensive life history dataset of soft sediment macrozoobenthos’
Date: 04 March 2024
Author: Alba Soares Capellas
Part of open science is that researchers make their data FAIR: Findable, Accessible, Interoperable and Reusable. But how to do this? In this series, we ask researchers to tell us more about their data management choices.
In this edition, we highlight the data publication ‘The seafloor from a trait perspective. A comprehensive life history dataset of soft sediment macrozoobenthos’ that was published in Springer Nature’s Scientific Data.
We asked co-author Kasper Meijer, PhD Candidate at the Groningen Institute for Evolutionary Life Sciences (GELIFES) of the Faculty of Science and Engineering, a few questions.
Your publication in the open-access journal Scientific Data has received significant attention with over 1,000 accesses. Can you discuss the impact you envisioned for the dataset as a result of the publication, and how have you seen it being utilized by the scientific community and beyond?
We initially created the database for our own use. We are interested in the general functioning of seafloor communities and how these are impacted by human influences, specifically in the Dutch Wadden Sea. Generally, the most sensitive communities are made up of species with a long lifespan and slow reproduction. In addition, those species that live on top of the seafloor or are buried only very shallowly are the most sensitive to seafloor disturbance. Looking at distribution patterns of such life history characteristics (functional traits) instead of species distributions can give more insight into the composition of sensitive seafloor communities and their response to seafloor disturbance.
In 2019, we conducted a large sampling campaign within the project Waddenmozaïek to identify macrozoobenthic communities (invertebrates, such as shellfish and marine worms, that are larger than 1 mm) in the Dutch subtidal Wadden Sea (permanently submerged areas). This gave us a huge dataset on which species occur where, but there was no complete database on the life history traits of all these species. A colleague (Joao Bosco Gusmao, co-author) had started a database containing a subset of the species and life history traits we were interested in, so we combined efforts to construct a more complete database with even more species and traits.
An incredible amount of time went into compiling this dataset, and we realised how much easier it would have been if such a dataset had been publicly available from the start, especially given the many uses of a biological trait dataset for all kinds of analysis, such as identifying and monitoring changes under anthropogenic pressures. Such data being readily available is especially important when quick decisions need to be made to guide management. The dataset has been online since November 2023 and is already being utilised by several researchers within our own group, and we hope this will only grow as time goes by. It has already proved useful in several of our own upcoming publications, and it is being considered for inclusion in general monitoring of the Wadden Sea.
In the publication, challenges related to missing data for certain traits were mentioned. Could you elaborate on some of the challenges you encountered during the data collection process and how you addressed them?
A lot of information that we have included in the dataset comes from old studies that focus on a specific species or species group. It collates a lot of fundamental research on the ecology and life history of specific species. These kinds of studies were often conducted on the most prevalent species. In addition, this kind of fundamental research is rarely done anymore. This makes it very hard to find data on certain species that are rarer or more recently discovered, especially for the more difficult-to-measure types of life history traits we are interested in, like lifespan and reproductive potential.
One of the major challenges from a practical perspective is that it sometimes took up to an hour to find the one publication from which such data could be inferred, if it could be found at all. To do this for over 235 species and 16 different life history traits is a major undertaking. When no such publication could be found, the next best step was to infer information from a closely related species where possible or, more broadly, from a higher level of identification, like the family level. The downside to this is that you lose resolution within your dataset, but it is sometimes better than not having any information at all.
There are other methods to impute missing data using statistical techniques. However, these come with their own disadvantages. Some of them require additional extensive knowledge of phylogeny, advanced statistical techniques like structural equation modeling, or making assumptions on the generality of certain traits within the same or related taxonomic group. Therefore, we have chosen to specify when data is missing or when we used information from a closely related species to fill out the dataset so users can make their own choices in how they want to deal with missing information.
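The choice described above (borrow a value from related taxa, but always record that it was borrowed) can be illustrated with a minimal Python sketch. It is not the authors' actual workflow: the record layout, the `fill_from_family` helper, the "only borrow when relatives agree" rule, and the placeholder species, family and trait values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraitRecord:
    species: str             # hypothetical placeholder names, not real taxa
    family: str
    lifespan: Optional[str]  # trait category; None when no source was found


def fill_from_family(records):
    """Return (species, value, source) rows; `source` records how the value was obtained."""
    # Collect the values that were found at species level, grouped by family.
    by_family = {}
    for r in records:
        if r.lifespan is not None:
            by_family.setdefault(r.family, set()).add(r.lifespan)

    rows = []
    for r in records:
        if r.lifespan is not None:
            rows.append((r.species, r.lifespan, "species"))
            continue
        candidates = by_family.get(r.family, set())
        if len(candidates) == 1:
            # Assumed rule: borrow only when all documented relatives agree.
            rows.append((r.species, next(iter(candidates)), "family"))
        else:
            # Leave the gap visible so users can decide how to handle it.
            rows.append((r.species, None, "missing"))
    return rows


records = [
    TraitRecord("species_a", "family_x", "3-10 years"),
    TraitRecord("species_b", "family_x", None),
    TraitRecord("species_c", "family_y", None),
]
rows = fill_from_family(records)
for row in rows:
    print(row)
```

Keeping the `source` column next to each value is what lets a user later drop or re-impute the family-level entries if the loss of resolution matters for their analysis.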
What did you do to make this dataset FAIR (Findable, Accessible, Interoperable, Reusable)? Were there any challenges you encountered during this process?
We wanted to make a dataset that was easily usable by other users and tried to follow FAIR principles as closely as possible. We were therefore looking for a way of publishing a dataset that would increase findability and would allow us to thoroughly explain its contents, uses and pitfalls. We quickly concluded that it would be best to publish the dataset next to a peer-reviewed Data Descriptor in the open access journal Scientific Data. This would help with the findability of the dataset, as well as the accessibility, as it is a purely open-access journal. We particularly liked how this journal did not set any text limitations to the Methods section, which enabled us to describe the dataset and its contents in as much detail as possible.
To ensure interoperability, we structured the dataset in a scientifically widely used format for biological trait analysis, and we could highlight a few studies that describe a common workflow with these types of data in the Data usage section of the data descriptor. This also facilitates the reusability of the dataset as the data descriptor thoroughly explains the collection and the structure of the dataset. Since the dataset is built on species life history traits, which generally don’t change, this makes the dataset also usable for a wider range of studies not only focused on the Dutch Wadden Sea. By building an app around the database to include updates as new research comes to light, we can ensure its long-term relevance and usability while keeping older versions of the dataset available to trace back or re-do analyses when needed.
I think the biggest challenge was the format of the data descriptor itself. It is not your general scientific article, so it was a fun challenge to write something you’re not familiar with. It really makes you think more about how you structure and describe your dataset to make it more accessible to other users.
In what ways has using DataverseNL as the repository of choice contributed to the accessibility and discoverability of your dataset?
We chose the DataverseNL repository because we routinely use it for data archival when publishing articles. The repository gives you a unique Digital Object Identifier (DOI) for the dataset, which can be used to cite the dataset directly and increase findability when used by others.
The dataset is presented as dynamic and periodically updated. How do you manage updates to the dataset, and what measures are in place to ensure its long-term applicability and relevance?
We have designed and created a Shiny app that hosts the dataset and can be updated with new versions whenever necessary. The app is hosted on a university web server and contains the metadata relevant to the dataset. It allows for dynamic searching of the dataset, and the dataset can be downloaded from it as a CSV or Excel file. Users can select and download either the latest version of the dataset or a previous version for reproducibility of analysis.
The dataset currently includes 235 species, but many more can be included. In addition, more traits can be included in the dataset if necessary for different types of analysis. This allows the Shiny app to be a central hub for distributing a standardised dataset. Other researchers can add their taxa to the dataset by sending us the data and references, which will then be included in the next update. We are also finding more taxa in our own sampling campaigns, which we are also adding to the dataset. This way, the dataset is continuously evolving.
We have already been contacted by several other researchers who had additions to the dataset or knew of literature containing information that we had not found yet. In doing so, we ensure that the dataset is not only applicable to our own study areas and species but also on a larger geographical scale and that new information can be included in the dataset as new research comes to light.
Useful links
The UG’s Digital Competence Centre supports UG researchers throughout the entire research (data) life cycle, from grant proposal to FAIR data archiving.
An overview of the support available at the UG for staff members who want to engage in open science practices. These practices (and the support) include open access, FAIR data, open education, public engagement and more.
Citation
Meijer, K.J., Gusmao, J.B., Bruil, L. et al. The seafloor from a trait perspective. A comprehensive life history dataset of soft sediment macrozoobenthos. Sci Data 10, 808 (2023). https://doi.org/10.1038/s41597-023-02728-5
About the author
Communications Officer at the UG Digital Competence Centre (UG DCC)