BES Guide Data Management 2019
DATA MANAGEMENT
Contents
Preface
Introduction
Planning data management
Creating data
Processing data
Documenting data
Preserving data
Sharing data
Reusing data
Sources and Further Reading
Acknowledgements
This work is licensed under a Creative Commons Attribution 4.0 International License, except where noted
on certain images. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
British Ecological Society
britishecologicalsociety.org
[email protected]
Preface
Introduction
[Figure: the data lifecycle – Plan, Create, Process, Document, Preserve, Share, Reuse (& use)]
It is important to note that research data can and should be used more than
once. Once data satisfy the needs of initial collection, open availability and data
standardization ensure further uses of data in science and other contexts. Data
citation mechanisms enable acknowledgement of the primary contributors of the
data. At large scales of analyses, no project or individual, no matter how well-funded,
is able to generate all the necessary data anew, and this guide offers advice on data
sharing and reuse. For many, data sharing is an uncomfortable first experience, as it
exposes raw, unprocessed evidence, which can compromise academic priority or
even career development. The solution comes from deciding on the right moment to
share the data: publish your paper first, but remember to share the data with or after
the publication of a chapter, book, paper or dissertation.
Why should I manage data?
Data management concerns how you plan for all stages of the data lifecycle and
implement this plan throughout the research project. Done effectively it will ensure
that the data lifecycle is kept in motion. It will also keep the research process efficient
and ensure that your data meet all the expectations set by you, funders, research
institutions and legislation (e.g. copyright, data protection).
Ask yourself, ‘Would a colleague be able to take over my project tomorrow if I
disappeared, or make sense of the data without talking to me?’ Or even ‘Will I be able
to find and reuse my own data or recreate this analysis in 10 or 20 years’ time?’ If you
can answer with yes, then you are managing your data well.
Potential benefits of good data management include:
•ensuring data are accurate, complete, authentic and reliable
•increasing research efficiency
•saving time and money in the long run – ‘undoing’ mistakes is frustrating
•meeting funder requirements
•minimizing the risk of data loss
•preventing duplication by others
•facilitating data sharing
•ensuring data discovery and reuse
Why should I share my data?
It is increasingly common for funders and publishers to mandate data sharing
wherever possible. In addition, some funding bodies require data management
and sharing plans as part of grant applications. Sharing data can be daunting, but
data are valuable resources and their usefulness could extend beyond the original
purpose for which they were created. Benefits of sharing data include:
•increasing the impact and visibility of research
•encouraging collaborations and partnerships with other researchers
•maximizing transparency and accountability
•encouraging the improvement and validation of research methods
•reducing costs of duplicating data collection
•advancing science by letting others use data in innovative ways
There are, however, reasons for not sharing data. These include:
•if the datasets contain sensitive information about endangered or
threatened species
•if the data contain personal information – sharing them may be a breach of
the Data Protection Act in the UK, or equivalent legislation in other countries
•if parts of the data are owned by others – you may not have the rights to
share them
During the planning stages of your project you will determine which of your data can
and should be shared. Journal data archiving policies recognize these reasons for not
sharing. The BES policy, for example, states:
Exceptions, including longer embargoes or an exemption from the requirement, may
be granted at the discretion of the editor, especially for sensitive information such as
confidential social data or the location of endangered species.
Data sharing is one manifestation of a cultural shift towards open science. Other
terms that will become more prevalent as this movement grows include:
Virtual research environments: a relatively new phenomenon, currently used by only
a few universities for collaborative research within the institution. They allow data
sharing among colleagues by providing a private virtual workspace for members
of a research group to share files, manage workflows, track version control, access
resources and communicate with each other.
Open notebooks: lab notebooks that are made publicly available online, including all
the raw data and any other materials that may be generated in the research project
– even ‘failed’ experiments. They are a transparent approach to research, allowing
others to access and feedback on your project in real time, without limitations on
reuse. Open notebooks are not widely used but they are gaining momentum as part
of the movement towards open approaches to research practices and publishing.
Open data: public data that anyone can use and that are licensed in a way that allows
for unrestricted reuse. Advocates of open data are often interested in new computing
techniques that unlock the potential of information held in datasets. The term open
data came into the mainstream in 2009 when governments, including those of the
UK, USA and New Zealand, announced initiatives to open up access to their public
information.
Big data: a term used to describe extremely large, complex datasets that are difficult
to process using traditional methods and require extra computer power. ‘Big data’
as a concept is more subjective than open data because of this dependence on
computers to process them – what seems big today may not seem so big tomorrow
when computing technologies are more advanced.
As more and more researchers share their research and work collaboratively, the
possibilities of combining open data and big data increase, and the results of this
combination have the potential to be very powerful1. In fields such as ecology, open
and big data could contribute to answering questions on climate change, enable
large-scale modeling, and help shape environmental policy.
1 http://www.theguardian.com/public-leaders-network/2014/apr/15/big-data-open-data-transform-government, accessed 10 October 2014.
Planning data management
“I think that having a data management plan is absolutely crucial and planning
for archiving and making your data accessible (after a suitable time period if
you like) is really important for research in the future. There are now some great
resources out there that help you to organize and plan your data collection and
make it more usable for yourself as well as others who may wish to use it in the
future. Make a plan. Stick to the plan from day one and adapt it as your project
evolves and your needs change.”
- Yvonne Buckley, Trinity College Dublin, Ireland
Consider your budget. Data management has costs, and these should be
included within the larger budget of your whole research project. You can use the
data lifecycle to help price each activity needed throughout data management in
terms of people’s time or extra resources required. Costing tools may be available
from universities and other online data management resources.
Talk to your supervisor, colleagues and collaborators. Discuss any challenges they
have already faced in their experience – learn from their mistakes.
Key things to consider when planning:
Time. Writing a data management plan may well take a significant amount of time.
It is not as simple as filling out a template or working through a checklist of things
to include. Planning for data management should be thorough before the research
project starts to ensure that data management is embedded throughout the
research process.
Design according to your needs. Data management should be planned and
implemented with the purpose of the research project in mind. Knowing how your
data will be used in the end will help in planning the best ways to manage data at
each stage of the lifecycle (e.g. knowing how the data will be analysed will affect
how the data will be collected and organized). Consider using international data
standards to ensure that your data will be usable by the most people once the
project has finished.
“As a student I was always told to plan my analysis well in advance and collect
and organize the data toward this. I (and most of my early career colleagues)
ignored this advice. Only after spending hours reorganising different datasets
did I learn my lesson.”
- Kulbhushansingh Suryawanshi, Nature Conservation Foundation, India
Roles and responsibilities. During planning it is important to clearly
assign roles and responsibilities instead of merely presuming them. Others who
may be involved in data management besides you and your collaborators include
external people involved in collecting data, university IT staff who provide storage
and backup services, external data centres or data archives.
Review. Plan how data management will be reviewed throughout the project
and adapted if necessary; this will help to integrate data management into the
research process and ensure that the best data management practices are being
implemented. Reviewing will also help to catch any issues early on, before they turn
into bigger problems.
The data management checklist below, from the UK Data Archive, will help prompt you
on the things you need to think about when planning your data management, and
enable you to keep on top of your data management once the project has started.
Data Management Checklist
• Are you using standardized and consistent procedures to collect, process,
check, validate and verify data?
• Are your structured data self-explanatory in terms of variable names, codes
and abbreviations?
• Which descriptions and contextual documentation can explain what your
data mean, how they were collected and the methods used to create them?
• How will you label and organize data, records and files?
• Will you apply consistency in how data are catalogued, transcribed and
organized, e.g. standard templates or input forms?
• Which international data standard is best suited to your data? Will you store
your data in that standard or standardize the data exports?
• Which data formats will you use? Do formats and software enable sharing
and long-term validity of data, such as non-proprietary software and
software based on open standards?
• When converting data across formats, do you check that no data or internal
metadata have been lost or changed?
• Are your digital and non-digital data, and any copies, held in a safe and
secure location?
• Do you need secure storage for personal or sensitive data?
• If data are collected with mobile devices, how will you transfer and store
the data?
• If data are held in various places, how will you keep track of versions?
• Are your files backed up sufficiently and regularly and are backups stored
safely?
• Do you know what the master version of your data file is?
• Do your data contain confidential or sensitive information? If so, have you
discussed data sharing with the respondents from whom you collected
the data?
• Are you gaining (written) consent from respondents to share data beyond
your research?
• Do you need to anonymize data, e.g. to remove identifying information or
personal data, during research or in preparation for sharing?
• Have you established who owns the copyright of your data?
• Who has access to which data during and after research? Are various access
regulations needed?
• Who is responsible for which part of data management?
• Do you need extra resources to manage data, such as people, time or hardware?
Source: UK Data Archive, ‘Managing and Sharing Data’, May 2011, p. 35, CC BY-SA 3.0, © 2011 University of Essex
Creating data
“There are very few researchers who have collected 10-year datasets, yet the
results that emerge from such data are revealing and impossible to predict
from short-term data. Think of associated data that could help you if you had a
longer time sequence, then begin to collect those data. ”
- Andy Dyer, University of South Carolina Aiken, USA
“Digitize and organize your data immediately after field collection so it is fresh
in your mind and you do not forget about aspects that only the field collector is
aware of.”
- Roberto Salguero-Gomez, University of Queensland, Australia
“The project I work on supports a variety of data collection for both long-term
data collection, as well as a diverse range of PhD projects. I collect behavioural
data through multiple methods, data from laboratory samples and a range
of environmental data. The main challenge of such a large-scale research
project is that numerous volunteers and PhD students work together to collect
the same data. This means there is a constant
need for effective communication and clear ways to record not only the data
themselves, but the fact that they have been collected.”
- Cassandra Raby, University of Liverpool, UK
“If using empirical data collected by someone else, discuss the format of the
output and generate a template of the recording spreadsheet for the data
prior to recording, including the accompanying explanatory notes and data
variable keys. This is particularly important if you are involved in a multi-site
experiment – a common recording template provided to all collaborators will
make collating the data easier.”
- Caroline Brophy, National University of Ireland Maynooth, Ireland
“The region where I live and work, Chilean Patagonia, is remote, pristine,
isolated and has an often harsh climate. To collect data I usually have to drive
on unpaved roads, then hike and climb to the treeline with a cooler box full of
ice packs which I use to conserve the tissue samples I collect. As there are no
universities in the area, I don’t usually have students or assistants to help, and
often do this collection alone. Back in the lab there is limited access to basic
materials needed to perform chemical analysis, and many of the procedures
have never been performed in that lab before, so I must install their protocol
for the first time. All of these logistical limitations in the field and the lab mean I
must be tough, efficient and independent.”
- Frida Piper, Centro de Investigación en Ecosistemas de la Patagonia, Chile
Data may be collected directly in a digital form using devices that feed results
straight into a computer or they may be collected as hand-written notes. Either way,
there will be some level of processing involved to end up with a digital raw dataset.
Key things to consider during data digitization include:
•designing a database structure to organize data and data files
•using a consistent format for each data file – e.g. one row represents a complete
record and the columns represent all the parameters that make up that record
(this is known as spreadsheet format)
•atomizing data – make sure that only one piece of data is in each entry
•using plain text characters (e.g. ASCII, Unicode) to ensure data are readable by
a maximum number of software programmes
•using code – coding assigns a numerical value to variables and allows for
statistical analysis of data. Keep coding simple
•describing the contents of your data files in a ‘readme.txt’ file, or other metadata
standard, including a definition of each parameter, the units used and codes for
missing values
•using international data standards – your data are more likely to merge with
other data at some point than not, and international standards help this process
•keeping raw data raw (see the sketch after this list)
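A minimal R sketch of several of these points (the file, folder and variable names are hypothetical):

# Raw data are read once and never overwritten - 'keeping raw data raw'.
raw <- read.csv("raw/counts_2019.csv", stringsAsFactors = FALSE)

# Spreadsheet format, atomized: one row per record, one value per cell,
# e.g. columns site_id | visit_date | species_code | count

# Simple coding: numerical species codes defined in a key.
species_key <- data.frame(
  species_code = c(1, 2),
  species_name = c("Parus major", "Cyanistes caeruleus")
)

# Derived data go to a separate file; the raw file stays untouched.
processed <- merge(raw, species_key, by = "species_code")
write.csv(processed, "processed/counts_2019_named.csv", row.names = FALSE)

# Describe the contents alongside the data in a plain-text readme.
writeLines(c(
  "counts_2019.csv: one row per visit record",
  "site_id: site identifier; visit_date: date of visit (YYYY-MM-DD)",
  "species_code: see species_key; count: individuals observed; NA = missing"
), "processed/readme.txt")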
“Despite growing recognition and activity around ‘open data’, simply posting
your data online and making them available as-is is not sufficient. The FAIR Data
Principles (doi.org/10.1038/sdata.2016.18) offer a framework for ensuring that
your data are findable, accessible, interoperable and reusable. The principles
also highlight the importance of using accepted biodiversity data standards to
ensure that the data you share have real impact for your colleagues. Data in large
projects and biodiversity portals come from multiple sources; it is important to
consider this when preparing to share and archive your data. In practice, this
includes storing your collection dates in the ISO 8601:2004(E) format (YYYY-MM-
DD), indicating the coordinate system used for geographic data, and paying
attention to metadata. When you are ‘done’ with your data, take a look at it
again from the point of view of an uninformed user: is your data structure self-
explanatory? Can the data story be understood from metadata alone without
an email from you? If the answer to both questions is yes, you are good to go.”
- Dmitry Schigel, Global Biodiversity Information Facility
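To illustrate the ISO 8601 point in the quote above, a minimal R sketch (the day/month/year input format is an assumption):

# Convert field-notebook dates to the ISO 8601 format (YYYY-MM-DD).
dates <- as.Date(c("03/07/2019", "15/07/2019"), format = "%d/%m/%Y")
format(dates, "%Y-%m-%d")  # "2019-07-03" "2019-07-15"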
Processing data
Version control. Once the master file has been finalized, keeping track of ensuing
versions of this file can be challenging, especially if working with collaborators in
different locations. A version control strategy will help locate required versions, and
clarify how versions differ from one another.
Version control best practice includes (see the sketch after this list):
•deciding how many and which versions to keep
•using a systematic file naming convention, with filenames that include the
version number and status of the file, e.g. v1_draft, v2_internal, v3_final
•recording what changes took place to create each version in a separate file,
e.g. a version table
•mapping versions if they are stored in different locations
•synchronizing versions across different locations
•ensuring any cross-referenced information between files is also subject
to version control
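A minimal sketch of such a version table, maintained from R (the file names and entries are illustrative):

# A simple version table kept alongside the data files.
version_table <- data.frame(
  file    = c("survey_v1_draft.csv", "survey_v2_internal.csv", "survey_v3_final.csv"),
  date    = c("2019-02-01", "2019-03-15", "2019-04-02"),
  changes = c("initial digitization",
              "species codes corrected against field sheets",
              "final version archived alongside the paper")
)
write.csv(version_table, "survey_versions.csv", row.names = FALSE)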
“Working with multiple forms of data is challenging but R can transform your
management and analysis of data. My workflows rely on the integration of the
R programming language and scripting in the UNIX environment. Scripts and
version control of data manipulation and analysis are important and make
collaboration transparent and relatively trouble free.”
- Andrew Beckerman, University of Sheffield, UK
“All modifications made to the raw data should be scripted and retained as a
record. Use tools like R for your data manipulation because by using R scripts
you can keep a record of any changes and manipulations that you do. If you
make those changes in Excel you won’t remember what you have done 1, 5, 10
years later.”
- Yvonne Buckley, Trinity College Dublin, Ireland
3 http://www.r-project.org/
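A minimal sketch of the scripted approach Buckley describes, in base R (the file and column names are hypothetical); because every change lives in the script, the full history of manipulations is retained:

# clean_counts.R - every change to the raw data is scripted, never made by hand.
raw <- read.csv("raw/counts_2019.csv")

clean <- raw
clean$count[!is.na(clean$count) & clean$count < 0] <- NA  # impossible negative counts set to missing
clean$site_id <- toupper(clean$site_id)                   # harmonize site identifiers
clean <- clean[!duplicated(clean), ]                      # drop accidental duplicate rows

write.csv(clean, "processed/counts_2019_clean.csv", row.names = FALSE)
# Re-running this script on the raw file reproduces the cleaned dataset exactly.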
Documenting data
Data level
•names, labels and descriptions for variables
•detailed explanation of codes used
•definitions of acronyms or specialist terminology
•reasons for missing values
•derived data created from the raw file, including the code or algorithm used to
create them
If a software package such as R is used for processing data, much of the data-level
documentation can be created and embedded during analysis, as in the sketch below.
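A minimal data-level documentation sketch in R (the variable names are hypothetical):

# Build a data dictionary describing each variable, and archive it with the data.
data_dictionary <- data.frame(
  variable    = c("site_id", "visit_date", "species_code", "count"),
  description = c("unique site identifier",
                  "date of visit, ISO 8601 (YYYY-MM-DD)",
                  "numeric species code; see species_key",
                  "number of individuals observed"),
  missing     = c("none", "none", "none", "NA = not recorded")
)
write.csv(data_dictionary, "processed/data_dictionary.csv", row.names = FALSE)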
Metadata help others discover data through searching and browsing online and
enable machine-to-machine interoperability of data, which is necessary for data
reuse. Metadata are created by using either a data centre’s deposit form, a metadata
editor, or a metadata creator tool, which can be searched for online. Metadata follow
a standard structure and come in three forms (sketched after this list):
•descriptive – fields such as title, author, abstract and keywords
•administrative – rights and permissions and data on formatting
•structural – explanations of e.g. tables within the data
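A minimal sketch of a metadata record covering the three forms, written from R (all values are hypothetical):

# Write a simple plain-text metadata record alongside the dataset.
writeLines(c(
  # descriptive
  "title: Bird counts at upland sites, 2019",
  "author: A. Researcher",
  "abstract: Weekly point counts at twelve upland sites.",
  "keywords: birds; point count; upland",
  # administrative
  "licence: CC BY 4.0",
  "format: CSV, UTF-8 plain text",
  # structural
  "structure: one table, one row per visit record; see data_dictionary.csv"
), "processed/metadata.txt")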
“Curate your master data file while you create it and document it with metadata
at the same time. Creating metadata afterwards is way more painful and you
risk misinterpreting your own data; your memory is not usually as good as you
think!”
- Ignasi Bartomeus, Swedish University of Agricultural Sciences, Sweden
Preserving data
“Lots of people solve data storage by buying space on a cloud service such
as Dropbox, but they might actually be breaching their contract with their
university if they do this, particularly if they store any kind of sensitive material
e.g. student-related, patient-related or any kind of social data. People should
always consult their institution in terms of implementing not only data storage
but all aspects of data management.”
- Rob Freckleton, University of Sheffield, UK
“I recently had a request for the raw data from a 25-year-old experiment and
was able to find the floppy disk but then spent a day trying to figure out how
to translate from an old software format into something readable. The request
was for data that didn’t make it into the published paper but that we had in fact
collected. It felt good to be able to offer it up for use in someone else’s work.”
- Charles Canham, Cary Institute of Ecosystem Studies, USA
Sharing data
Longitudinal datasets that span many years are important in ecology and
evolution. Journal mandates that authors archive their data only guarantee
the preservation of the data used in a particular paper, but researchers should
be aware of the value of archiving and sharing large datasets to drive discovery.
Sharing large datasets has not been common practice in fields such as ecology.
However, as trust grows in ethical guidelines, in journal and funder requirements,
and in community expectations around accessing and correctly citing others’
data, progress can be made towards a more open data future.
Data papers. A data paper is a peer-reviewed paper published in a scholarly
journal describing a particular dataset or a group of datasets. The primary
purpose of a data paper is to present the metadata and describe the data and
the circumstances of their collection, rather than to report hypotheses and
conclusions. By publishing a data paper, you will receive credit through indexing
and citation of the published paper, in the same way as with a research paper.
Publishing biodiversity data through the Global Biodiversity Information
Facility (GBIF). The Global Biodiversity Information Facility (www.gbif.org)
is an open-data research infrastructure, funded by governments, aimed at
providing anyone, anywhere with access to data about all types of life on Earth.
Coordinated through a Secretariat in Copenhagen, Denmark, GBIF enables data-
holding institutions around the world to share information about where and
when species have been recorded. This knowledge derives from many sources,
including museum specimens dating back decades or centuries, current research
data and monitoring programmes, as well as volunteer recording networks and
citizen science initiatives.
By encouraging the use of common data standards and open-source publishing
tools, GBIF enables data from thousands of different collections and projects to
be integrated, discovered and used to support research and policy. Data published
through GBIF can be freely accessed at the global level through GBIF.org and
associated web services, as well as through national and thematic portals making
use of the shared infrastructure.
Through its network of national, regional and thematic nodes (see www.gbif.org/
the-gbif-network), GBIF also acts as a collaborative community, sharing skills and
best practices to encourage the widest possible participation.
Reusing data
By adopting good data management practices, researchers can ensure that high-quality data are
preserved for the research community and will play a role in advancing science
for future generations.
Data openness
•increases the efficiency of research
•promotes scholarly rigor and quality of research
•enables tracking of data use and data citation through DOIs
•expands the spectrum of academic products through data papers
•enables researchers to ask new research questions
•enhances collaboration and community-building
•increases the economic and social impact of research
•supports international conventions and requirements from funding agencies
Citing data. Data accessed from data portals are often free and open, but they are
not free of obligations. Read and respect the data use agreements of the data portals
you use in your research, and follow the citation guidelines of the data portal and
instructions to authors in your journal. Whenever possible, use a DOI to refer to
the unprocessed or downloaded data, and to the processed and archived version.
Good citation practices ensure scientific transparency and reproducibility by
guiding other researchers to the original sources of information. They also reward
data-publishing institutions and individuals by reinforcing the value of sharing
open data and demonstrating its impact to their stakeholders and funders.
Datasets published through GBIF and other portals are authored electronic data
publications and, as such, should be treated as first-class research outputs and
correctly cited.
Sources and Further Reading
Acknowledgements
This booklet would not have been possible without contributions from: Peter Alpert, Andrea Baier, Liz Baker, Ignasi
Bartomeus, Andrew Beckerman, Caroline Brophy, Yvonne Buckley, Rosalie Burdon, Charles Canham, Tim Coulson, Kyle
Copas, Kyle Demes, Stéphane Dray, Andy Dyer, Rob Freckleton, David Gibson, Erika Newton, Bob O’Hara, Catherine Hill,
Tim Hirsch, Will Pearse, Nathalie Pettorelli, Frida Piper, Cassandra Raby, Andrew Rodrigues, Roberto Salguero-Gomez,
Dmitry Schigel, Kulbhushansingh Suryawanshi, Phil Warren and Ken Wilson.
Image credits
p2: Norwegian University of Life Sciences / Snow Leopard Foundation Pakistan
p3: Danielle Green
p6: Markku Larjavaara
p8: David Bird
p9: Kara-Anne Ward
p10: Ute Bradter
p13: Benjamin Blonder
p15: © Jeremy Holloway
p16: Image provided by Koen and Walpole using Circuitscape
p18: Tomáš Václavík
p20: © Kevin Weng. The CC-BY licence does not apply to this image.
p23: Oliver Hyman
p26: Hannah Grist
p30: Adam Seward