Automation of testing of operating system backup and recovery

The goal of the project is to fill the gap in testing of the source repository of the Relax-and-Recover disaster recovery tool by automating the recovery process and deploying a Continuous Integration (CI) setup that will automatically test all the proposed changes to the source repository.

Details and background

Backups are an essential part of IT infrastructure. In practice, one often needs to preserve the whole setup of important computers, because reinstalling them after a disaster would be time-consuming, and their downtime can be very disruptive and therefore expensive. Even if all the important data are preserved by a backup tool, restoring the server from the backup to its original state can be difficult, as this task is outside the capabilities of traditional backup software. This is especially the case when the server has a complex disk setup. Automating this task requires a slightly different type of software: a disaster recovery tool.

One popular tool for disaster recovery on Linux is Relax-and-Recover (ReaR). It creates bootable rescue media which can be used to recover the system from a backup while faithfully preserving the original configuration. It can recreate a large variety of configurations and integrates with many backup solutions. Given this task, such a tool needs to be very reliable and well tested. One particular area which deserves testing is the actual recovery process. While problems during the creation of the backup and the bootable media are discovered immediately, problems during the recovery process are normally discovered only in an emergency, when the original server has already been destroyed, and may therefore be impossible to correct. Such a failure can cause serious problems for a user who expected to be able to recover the system after a disaster and is instead left with unusable backups. The only way to significantly reduce the likelihood of such a failure is rigorous testing of the recovery process.

At the same time, testing the recovery process is difficult. One needs to create the rescue image, boot from it, and then guide the tool through the recovery. This is not easy to automate, and therefore it is not being done in the upstream code repository on GitHub, and the package provided by Red Hat as part of Red Hat Enterprise Linux is tested only in a limited number of scenarios.
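To make the recovery test flow more concrete, here is a minimal sketch of how a test harness might prepare the three pieces involved: a ReaR configuration for an ISO rescue image, an empty target disk, and a virtual machine booted from the rescue image. The function names are hypothetical. The use of QEMU is an assumption for illustration and is not prescribed by the project. The settings ISO_DEFAULT=automatic and ISO_RECOVER_MODE=unattended are assumed to enable a non-interactive recovery and should be checked against the default.conf of the installed ReaR version.

```python
from pathlib import Path
from typing import List


def rear_local_conf(backup_url: str) -> str:
    """Return the text of a minimal /etc/rear/local.conf for an ISO rescue image."""
    lines = [
        "OUTPUT=ISO",
        "BACKUP=NETFS",
        f'BACKUP_URL="{backup_url}"',
        # Assumption: these two settings make the rescue ISO start the
        # recovery without user interaction; verify against default.conf.
        "ISO_DEFAULT=automatic",
        "ISO_RECOVER_MODE=unattended",
    ]
    return "\n".join(lines) + "\n"


def create_disk_command(disk: Path, size: str = "20G") -> List[str]:
    """Build the qemu-img call that creates an empty disk to recover onto."""
    return ["qemu-img", "create", "-f", "qcow2", str(disk), size]


def qemu_recovery_command(iso: Path, disk: Path, memory_mb: int = 2048) -> List[str]:
    """Build a QEMU command line that boots the rescue ISO with the empty disk attached."""
    return [
        "qemu-system-x86_64",
        "-m", str(memory_mb),
        "-cdrom", str(iso),
        "-drive", f"file={disk},format=qcow2",
        "-boot", "d",        # boot from the CD-ROM (the rescue image)
        "-display", "none",  # no graphical window, suitable for CI
        "-no-reboot",        # exit instead of rebooting the guest
    ]


if __name__ == "__main__":
    # Print the pieces instead of running them; a real harness would pass the
    # command lists to subprocess.run() and then verify the recovered disk.
    print(rear_local_conf("nfs://backup.example.com/export"))
    print(" ".join(create_disk_command(Path("recovered.qcow2"))))
    print(" ".join(qemu_recovery_command(Path("rescue.iso"), Path("recovered.qcow2"))))
```

The sketch only builds the configuration and command lines. Running "rear mkbackup" on the original system, waiting for the recovery to finish, and checking that the recovered system boots are left to the harness, and these are the steps where most of the automation effort is expected to lie.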

The goal of the project is to fill the gap in testing of the source repository of the ReaR tool by automating the recovery process and deploying a Continuous Integration (CI) setup that will automatically test all the proposed changes to the source repository. The project will consist of:

research on automating the backup and recovery process, including enhancing the ReaR tool itself where its current capabilities are not sufficient
a survey of existing CI solutions that could be used to run this automation
integration of the automation with the chosen CI solution and its deployment to test the GitHub repository (or, alternatively, the packages provided by Linux distributions such as Fedora).

Possible extensions of the project include:

CI testing of the package provided by a Linux distribution such as Fedora (if not done in the task above)
testing of integration with network backup tools (Bacula, Bareos)
CI testing using a static analysis tool such as ShellCheck
writing tests for more scenarios (such as more complex storage setups)
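
As an illustration of the static analysis extension, the following minimal sketch shows how a CI job might collect shell scripts from a checkout and build a ShellCheck invocation. The function names are hypothetical, and the assumption that the scripts to check are the files ending in ".sh" would need to be adjusted to the actual layout of the repository.

```python
from pathlib import Path
from typing import Iterable, List


def find_shell_scripts(root: Path) -> List[Path]:
    """Collect all *.sh files below root, sorted for reproducible output."""
    return sorted(p for p in root.rglob("*.sh") if p.is_file())


def shellcheck_command(scripts: Iterable[Path], severity: str = "warning") -> List[str]:
    """Build a ShellCheck command line for the given scripts."""
    return [
        "shellcheck",
        "--shell=bash",            # treat the scripts as bash
        f"--severity={severity}",  # minimum severity to report
        "--format=gcc",            # one finding per line, easy to parse in CI
        *(str(s) for s in scripts),
    ]


if __name__ == "__main__":
    scripts = find_shell_scripts(Path("."))
    print(" ".join(shellcheck_command(scripts)))
```

A CI job could run the resulting command and fail the build on a non-zero exit status, possibly starting with a high severity threshold and lowering it as existing findings are fixed.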

Literature

W. Preston: Backup & Recovery. O'Reilly, 2009. http://shop.oreilly.com/product/9780596102463.do