Where to archive data
Choosing the right place to archive your data is important, as this will determine to what extent they are Findable, Accessible, Interoperable and Re-usable (FAIR). In most cases, a digital data repository will be suitable, but different services may be appropriate depending on the type and nature of the data. High-volume and non-digital data may require other solutions.
Digital data
As a rule, digital data should be archived to a suitable public data repository. A data repository will ensure your dataset is FAIR by undertaking the following functions:
- It actively preserves data for long-term viability, e.g. replicating and validating data files, migrating to preservation formats;
- It publishes machine-readable metadata to enable online discovery;
- It assigns a persistent unique identifier (e.g. a DOI) to the dataset to make it citable;
- It may quality-control the dataset and enhance metadata, e.g. by applying standard vocabularies (though not all repositories do this);
- It enables online access to data, so that they can be used by other people;
- It applies a licence notice to the dataset, to make terms of use and attribution requirements clear.
We provide guidance on choosing a suitable data repository.
Source code
Software code that supports published results (e.g. model code used to generate output data, or code written for statistical analysis) should be archived to a public data repository, so that it is preserved in the specific version relevant to the reported results, and can be cited by DOI. Small scripts specific to a dataset can be archived in the data repository alongside the data. Code that is an output in its own right, e.g. model code, may be better archived as a standalone item. GitHub provides an easy-to-use function for archiving code files to the Zenodo digital repository. Code files can also be deposited in the University's Research Data Archive, or any other general-purpose repository.
Where it is desirable to release source code so that others can download and run it, and contribute to its ongoing development, it should also be made available as a public code repository. A code repository will provide version control, code review, bug tracking, documentation, user support and other features. The University provides a GitLab code repository service; other popular platforms are GitHub and Bitbucket. Note that sharing code related to published results is not sufficient, as code repository platforms do not undertake long-term preservation or issue DOIs, and links to code repositories are not version-specific. Code supporting published results should always be deposited in a data repository and cited by DOI from the related publication.
Restricted data
Some data may not be suitable for public access, for example because they contain confidential information that cannot be easily removed (such as biometric data or video/image data), or because redacting data to remove sensitive or confidential information would significantly diminish their value. This does not mean the data cannot be archived and made available to others.
Some repositories can manage sensitive data under a controlled access procedure. This may require a prospective data user to make an application to consult a specific dataset, which can be rejected or approved by the data owner or a nominated data steward. The requestor may also need to fulfil certain conditions to be granted access to the data, such as signing a confidentiality agreement. Access to personal data will also be subject to consent from the data subject, so this would need to be considered at the planning and recruitment stage of the research. See the University's guide to Data Protection and Research for more information.
Repositories that provide controlled-access procedures include the UK Data Service ReShare repository, which has a 'safeguarded data' option, and the European Genome-phenome Archive.
The University's Research Data Archive also provides a restricted dataset option. Restricted datasets can be securely preserved on University infrastructure and made accessible to authorised researchers affiliated to a research organisation, subject to approval by a Data Access Committee (including the PI of the original study or a nominated representative), and under the terms of a Data Access Agreement between the University and the recipient organisation.
High-volume data
Some research can generate large volumes of data, at the scale of hundreds of gigabytes (GB) or terabytes (TB), such as computational modelling and various kinds of experimental imaging. If you need to archive these data, there are practical and cost limitations that may constrain your options. Many data repositories cannot effectively handle datasets of this size, although this is not always the case - for example:
- NERC's CEDA Archive routinely manages climate and weather datasets at the TB scale.
- The European Bioinformatics Institute provides repositories for genetic, imaging and general biological study data, which can accept large volumes of data at no charge.
- Some research facilities that support the generation of high-volume data, such as the ISIS Neutron and Muon Source, provide an archive facility for raw data collected on their instruments. In this case you would not need to archive the data yourself as this will be done as part of facility operational procedures.
The University's Research Data Archive can only accept data deposits up to 20 GB in size, but two general-purpose data repositories with larger capacities can be recommended:
- The free-to-use data sharing service Zenodo accepts deposits of up to 50 GB with a maximum of 100 files, and will accept a one-off deposit of up to 200 GB.
- Figshare Plus can be used to share datasets up to several TB in scale for a one-off charge. (The standard Figshare service is free to use for deposits up to 20 GB.)
Bear in mind that you may not necessarily need to archive or maintain all of the raw data collected or generated in a project. There is more information about this on the Data selection web page.
Archiving high-volume data outside a repository
If there is no suitable data repository for your data, we recommend you consider the following solutions, in the order presented. If combined with creation of a metadata record in the Research Data Archive describing the dataset and the means by which it can be accessed, this can enable compliance with the University's data sharing requirements.
- The DTS Offline Data Archive provides a cost-effective, long-term storage solution for the archiving of digital data in a secure environment. This service is designed to archive research data that needs to be preserved for extended periods but does not require immediate, active access and is suitable for NFS (Linux) or SMB (Windows) datastores.
- University cloud storage options offer free high-volume storage. OneDrive provides staff users with 5 TB of storage as standard; Teams sites provide up to 25 TB of storage. These services are not designed as long-term storage solutions, and are not optimal for storage and use of high volumes of data. Data stored in OneDrive are accessible only as long as the account-holder is a member of the University, so data should be backed up to another location where continued access by others is required.
- External hard drives provide inexpensive storage solutions, but you should consider backing up the data in at least one separate location. The hard drive should be stored securely on site and accessible by at least two people. Data would need to be migrated to new media periodically, e.g. every five years.
If data are stored by these means, you are advised to observe the following principles:
- Ensure the data are accessible to/retrievable by at least two people, and that there is a handover policy, so that if a data owner/steward leaves the University, responsibility is transferred and the data continue to be retrievable. It is advisable to have a designated steward for archived data within a research group or department, who maintains a register of archived datasets, their locations, and responsible owners.
- Basic measures should be taken to ensure the integrity and usability of the data. Data files should be write-protected, so that once archived they cannot be further modified. If possible, checksums should be generated for all data files. There should be some documentation of the data, including a file listing or manifest, so that they can be navigated and understood.
- If the data support published results, a metadata record should be created in the University's Research Data Archive describing the data and the means by which they can be accessed. This will enable the data to be cited by DOI from related publications, and provide a means by which others can request access to them. If a request to access the data is received, this can be granted by inviting the requester to view the data on site (if this is feasible), or by arranging (at their expense) to send a copy of the relevant data.
- When data are deleted, any local register and metadata record in the Research Data Archive must be updated accordingly.
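As an illustration of the integrity measures above (checksums, write-protection and a file manifest), the steps could be scripted along the following lines. This is a minimal sketch, not a supported University tool; the function name, folder layout and manifest filename are hypothetical:

```python
import hashlib
from pathlib import Path

def archive_folder(folder: str, manifest_name: str = "MANIFEST.sha256") -> Path:
    """Generate SHA-256 checksums for every file under a folder,
    write-protect the files, and save a manifest for later checks."""
    root = Path(folder)
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(root)}")
            # Clear all write bits so the archived file cannot be modified
            path.chmod(path.stat().st_mode & ~0o222)
    manifest = root / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

The manifest lines follow the `<digest>  <relative path>` convention used by the GNU `sha256sum` utility, so the integrity of the archived files can later be re-verified with `sha256sum -c MANIFEST.sha256` from inside the folder.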
The Research Data Service can advise on and support you in archiving data using the principles outlined above.
Non-digital (offline) data
Non-digital data should be digitised for archiving wherever possible. If for any reason this is not possible or desirable, they should be archived following the principles for high-volume data outlined above. There should be clear documented ownership and local management of the data. If the data are necessary to support published research findings, a record should be published in the University's Research Data Archive describing the data and the means by which they can be accessed, so that they can be cited from the related publication.