Backup

Backups are an essential means of ensuring that data and related files can be restored in case of loss or damage. Among the most common causes of data loss are:

  • Hardware failure;
  • Software malfunction;
  • Malware or hacking;
  • Human error (research data accidentally gets deleted or overwritten or is lost in transport);
  • Theft, natural disaster or fire;
  • Degradation of storage media.

Creating a backup strategy in 10 steps

A backup strategy in short

  • Make at least three backup copies of your data, on at least two different types of storage media
  • Keep the storage devices in separate locations, with at least one off-site
  • Regularly check whether the backups still work
  • Ensure you know the backup process and follow it

Below, the steps to create a backup strategy are outlined in more detail.


Find out whether your institution has a backup strategy. If so, backups may automatically be taken care of for any files stored on institutional servers. However, you should still check whether the backup strategy in place sufficiently meets your requirements.

The three common options for backups are:

  1. Full backups of the entire system and files;
  2. Differential backups, which record everything that has changed since the last full backup. To restore your data and/or system, you need the last full backup and the last differential backup;
  3. Incremental backups, which record only the changes since the last backup of any kind. To restore your data and/or system, you need the last full backup and the entire series of incremental backups made since then.

Differential and incremental backups are also called "intelligent" backups. If only a small percentage of your data changes on a daily basis, it's a waste of time and disk space to run a full backup every day.
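
As a rough illustration of the difference, here is a minimal Python sketch (with hypothetical paths) that copies only the files modified after a given reference time. Passing the time of the last full backup gives a differential backup; passing the time of the most recent backup of any kind gives an incremental one.

```python
import shutil
import time
from pathlib import Path

def backup_changed_files(source: Path, dest: Path, since: float) -> None:
    """Copy every file under `source` modified after the `since` timestamp."""
    for path in source.rglob("*"):
        if path.is_file() and path.stat().st_mtime > since:
            target = dest / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps

# Hypothetical usage: back up everything changed in the last 24 hours.
# backup_changed_files(Path("data"), Path("backup"), time.time() - 24 * 3600)
```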

It is recommended that you make three backup copies. This greatly reduces the risk of data loss, even if one of the backups is damaged or lost. However, if storage capacity is limited and/or sensitive data is involved, it may be necessary to work with fewer copies.

You should clearly state in your backup strategy how often backups will be made. The frequency of backups will depend on the frequency and amount of changes to your data and documents.

We recommend that you store at least some of the backups in physically different places. For example, if you back up to two servers in the same room or building, a fire could destroy both backups at once. Keeping an off-site copy of your backup mitigates this risk.

Backups can be made to networked drives, cloud storage, and to local or portable devices (see 'Storage'). What works best for your project depends on the amount of data that needs to be backed up, the required frequency of backups, the level of automation, and the sensitivity of the data.

Estimate how much data and documentation you will collect and create in your project, then determine the approximate storage capacity needed for backups. If your institution has an IT department, they will be able to help you with this.
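
A back-of-the-envelope calculation is usually enough to get started. The sketch below multiplies a few entirely hypothetical figures; replace them with your own project's numbers.

```python
# Rough estimate of backup storage needs; all figures are hypothetical.
raw_data_gb = 50          # data and documentation collected in the project
growth_factor = 1.5       # head-room for growth during the project
copies = 3                # number of backup copies (see above)
retained_generations = 4  # how many older backups are kept before deletion

needed_gb = raw_data_gb * growth_factor * copies * retained_generations
print(f"Estimated backup capacity: {needed_gb:.0f} GB")  # -> 900 GB
```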

Automating backups helps ensure that backups are created at the right time and saved to the correct location, reducing the risk of human error. Both Windows and macOS include software to support automatic backups, and cloud storage solutions often have backup functionality as well. However, make sure to check regularly that functional backups were indeed created.

  • Windows 10
    Windows 10 includes two different backup programs:
    • File History
      The File History tool automatically saves multiple versions of a given file, so you can “go back in time” and restore a version of a file from before it was changed or deleted. That’s useful for files that change frequently.
    • Windows Backup and Restore
      The Backup and Restore tool creates a single backup of the latest version of your files on a schedule.

Of course, you would still need an off-site backup as well.
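
As one possible way of automating this, the minimal Python sketch below copies a project folder into a new time-stamped backup folder; the paths are hypothetical. In practice, you would schedule such a script with your operating system's scheduler (Task Scheduler on Windows, launchd on macOS, cron on Linux) rather than run it by hand.

```python
import shutil
from datetime import datetime
from pathlib import Path

def make_backup(source: Path, backup_root: Path) -> Path:
    """Copy the source folder into a new time-stamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    target = backup_root / stamp
    shutil.copytree(source, target)  # fails if the target already exists
    return target

# Hypothetical usage (schedule this rather than running it manually):
# make_backup(Path("C:/projects/survey"), Path("D:/backups/survey"))
```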

It is generally recommended that you do not overwrite one backup with another. However, if you have to back up large amounts of data frequently, it may not be feasible to retain all backups for the entire duration of the project.

If sensitive data is involved, make sure that any deleted data are truly gone and cannot be recovered in any way. For suitable procedures, see 'Security'.

Make sure that backups of data containing sensitive information are protected against unauthorised access in the same manner as the original files. For suitable measures, see 'Security'.

A disaster recovery plan defines the steps to take when data loss occurs and thus helps you to restore data as quickly as possible. The plan should also assign responsibilities for data recovery tasks and list the persons (or functions) to contact in the event of data loss.

To ensure that data recovery will run as smoothly as possible in the event of an actual data loss, make sure to regularly test whether restoring lost files from your backups is actually possible.

Never assume that someone else will take care of backups and data recovery. Assign responsibilities for making manual backups, for checking that automatic backups actually happened, for testing data recovery, and for restoring any lost data.

Errors can happen when backups are written or copied. We recommend that you frequently check the integrity of your backed up files. This can be done with so-called checksum tools such as MD5summer.

The UK Data Service (UKDS) compares checksums to digital fingerprints. Checksum tools compute such a fingerprint, a string of characters, from the bit values (the ones and zeros) of a file using a hashing algorithm. Monitoring whether the fingerprint of a given file changes allows you to detect whether the file was altered in any way, intentionally or unintentionally.

Video tutorial on using MD5summer: https://www.youtube.com/watch?v=VcBfkB6N7-k
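
If you prefer scripting such checks over a graphical tool like MD5summer, the minimal Python sketch below computes an MD5 fingerprint with the standard hashlib module; the file names in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path

def md5_checksum(path: Path) -> str:
    """Return the MD5 fingerprint of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage: compare a backup against the original.
# if md5_checksum(Path("survey.sav")) == md5_checksum(Path("backup/survey.sav")):
#     print("Checksums match: the backup is intact.")
```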

Case study

The following scenario is used to illustrate the importance of backups and to highlight some of the key points to consider when planning a backup strategy.

After reading through the scenario, take a few minutes to think about what could have been done to prevent data loss. Then compare your thoughts with the analysis below.


A group of researchers collaboratively works on quantitative survey data. They use a shared working space on a networked drive where a master copy of the data and a working copy are stored. Two backups exist, stored separately from the working and master files: one copy on an external hard drive and one in the university’s own Cloud system.

A new researcher, who is not aware of the way files are named and organised, joins the project and accidentally works on the master copy of the data. In the process, a number of variables are overwritten when the new team member recodes variable values and forgets to save the results into new variables. Fortunately, two backups exist.

The researchers know that copies can sometimes be changed by write or transmission errors, so they use a checksum tool to check whether the two backup copies are identical. They discover that the checksums of the two files differ, which means that one or both of the files were altered in some way.

In this scenario, some things were done well and others went wrong:

  • The master copy is kept as a separate file from working files;
  • Two backup copies on different media and in different locations exist;
  • No frequent integrity checks of the backups were made, and no additional protection for the master copy of the data was in place to prevent it from being overwritten.

1. Versioning and file naming rules

Errors such as accidentally overwriting a file can always happen, but they are less likely to occur if clear rules for versioning and file naming are in place and if folders are clearly labelled. Such policies and guidelines help to avoid confusion about what files contain and where they should be saved. See 'Data authenticity, versions and editions'.

2. Restricting access to important files

As mentioned above, human error is one of the most common causes of data loss. Therefore, consider restricting access to important files, for example with passwords or by using systems that manage read and write permissions. The fewer people who can modify important files, the smaller the risk of data loss caused by human error.
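
As a minimal illustration (on top of whatever access management your storage system offers), the Python sketch below strips the write permissions from a hypothetical master file so that it cannot be overwritten accidentally.

```python
import stat
from pathlib import Path

# Hypothetical master file: remove all write permissions so the file
# cannot be overwritten by mistake. On Windows, this sets the file's
# read-only attribute instead.
master = Path("master/survey_data_master.sav")
master.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```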

3. Creating three backup copies rather than only two

If three copies rather than only two had been created, the chances of identifying the unaltered copy would have been far better: if two out of three copies are identical, it is likely that those two are unharmed. This would have saved the project the laborious work of trying to identify the correct copy.
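
The sketch below illustrates such a majority vote over checksums; the copy names and checksum values are made up, and the md5_checksum helper sketched earlier could supply the real values.

```python
from collections import Counter

# Hypothetical checksums of three backup copies.
checksums = {
    "copy_external_drive": "9e107d9d372bb6826bd81d3542a419d6",
    "copy_cloud":          "9e107d9d372bb6826bd81d3542a419d6",
    "copy_network_drive":  "e4d909c290d0fb1ca068ffaddf22cbd0",
}

# The checksum shared by the majority of copies marks the intact ones.
majority_checksum, votes = Counter(checksums.values()).most_common(1)[0]
if votes >= 2:
    damaged = [name for name, c in checksums.items() if c != majority_checksum]
    print("Likely damaged copies:", damaged or "none")
```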

4. Checking the integrity of files

Errors can happen when backups are written or copied. These can sometimes make a copy entirely unusable, but they may also be small enough to go unnoticed at first and only cause problems further down the line. This could lead to you losing access to the data entirely, for example because software can no longer render the files, or it can introduce errors into the data, negatively affecting the results of your research. Learn more about integrity checks in this video about performing a checksum check for your files (UK Data Service, 2016a).