Disaster Recovery Testing: The Importance of Documenting and Testing your Plans

by Jeff Hollis | Aug 4, 2020 | ZAG Standards

It’s not if, it’s when. People who work in IT and people who ride motorcycles both know what this phrase means: protection is for the moment a disaster actually hits, however unlikely that may seem, not for the mere possibility of danger. For motorcycle riders this means wearing a helmet, gloves, and heavy clothing for crash protection. For companies, it means preparing Disaster Recovery (DR) documentation, laying out a set of procedures for recovery, and testing to make sure everything works when it is supposed to, so that a disaster does not cripple your business.

Documentation/Runbook

In an emergency drill, gathering areas for employees are documented and established outside of a building. In the same way, a major concept driving your disaster documentation is to record where applications and data will go and how they can be accessed in case of an emergency. Before a disaster, therefore, a gathering place for IT applications should be established at a secondary site away from the physical primary site. Having a secondary site in a different physical location will limit the damage one disaster can cause to your company’s IT systems and data.

Once you have a secondary site recorded, your next step is mapping which applications are most important to your company. Begin by outlining your applications: create a logical diagram of which programs interact with each other and where data is stored. Start small and build this up on an application-by-application basis.

Establish a tier-by-tier sequence for which of these applications, servers, and data need to be restored first. Typically, the most critical IT systems will be in the first tier (Tier 1), followed by those deemed less critical (Tier 2), and so forth. This organization will help confirm services are restored in a step-by-step order.
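As a rough illustration of the tiering idea, the sketch below groups an inventory of systems by tier and lists them in restore order. The system names and tier assignments are made up for the example; your own list would come from the application mapping described above.

```python
from collections import defaultdict

# Hypothetical inventory of (system name, tier). These names and tier
# assignments are examples only, not a recommendation.
INVENTORY = [
    ("file-share", 2),
    ("erp-database", 1),
    ("intranet-wiki", 3),
    ("email", 1),
    ("reporting", 2),
]


def restore_order(inventory):
    """Group systems by tier and return the tiers in ascending order."""
    tiers = defaultdict(list)
    for name, tier in inventory:
        tiers[tier].append(name)
    # Tier 1 comes first; names are sorted only to keep the output stable.
    return [(tier, sorted(tiers[tier])) for tier in sorted(tiers)]


for tier, systems in restore_order(INVENTORY):
    print(f"Tier {tier}: {', '.join(systems)}")
```

A printed copy of this kind of list can sit in the Runbook next to the application diagrams.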

Include a User Acceptance document. This is a list of the programs, processes, and data you will go through as a business to validate when a disaster recovery is complete. Your company will review this list with your IT team and verify services are up when they are restored on the secondary site.

Create a checklist for your recovery process: initiate failover, failover completed, and user acceptance testing. Failover is the process of switching over from a failed site to a backup site; i.e. the primary site failed, so you switch IT systems over to the secondary site.
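One way to picture this checklist is as a small ordered list where each step can only be checked off after the one before it. The sketch below is an assumption about how such a checklist might be tracked; the class name and the timestamping are illustrative and not part of any particular DR tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The three checklist steps named in the text, in order.
STEPS = ("initiate failover", "failover completed", "user acceptance testing")


@dataclass
class RecoveryChecklist:
    """Hypothetical helper that records when each step was checked off."""

    completed: dict = field(default_factory=dict)

    def next_step(self):
        """Return the first step not yet checked off, or None if all are done."""
        for step in STEPS:
            if step not in self.completed:
                return step
        return None

    def check_off(self, step):
        """Mark a step complete, refusing to skip ahead in the sequence."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed[step] = datetime.now(timezone.utc)

    def done(self):
        return self.next_step() is None


checklist = RecoveryChecklist()
checklist.check_off("initiate failover")
print(checklist.next_step())  # the step that still has to happen next
```

The recorded timestamps give you a simple timeline of the recovery to review afterward.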

Document when an outage is deemed long enough to initiate a failover process. Your systems may be running on the secondary site on a long-term basis (greater than 30 days), so you must be ready to switch over completely. Even though this is hard to predict, you must decide when to perform a failover. This process is more complex than flipping a light switch OFF then back ON and should not be taken lightly.
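Writing the failover trigger down as an explicit rule can make the decision less ambiguous under pressure. The sketch below assumes a simple rule: fail over when the outage so far plus the estimated repair time reaches a documented threshold. The four-hour value is a placeholder; the real number is a business decision that belongs in your Runbook.

```python
from datetime import timedelta

# Placeholder threshold for illustration only. The actual value should be
# agreed on by the business and recorded in the Runbook.
FAILOVER_THRESHOLD = timedelta(hours=4)


def should_failover(outage_so_far, estimated_repair_time,
                    threshold=FAILOVER_THRESHOLD):
    """Return True when the expected total outage meets or exceeds the threshold."""
    return outage_so_far + estimated_repair_time >= threshold


# One hour down already, repair estimated at four more hours.
print(should_failover(timedelta(hours=1), timedelta(hours=4)))
```

A rule like this does not replace judgment, but it gives the person on call a documented starting point.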

Once plans and diagrams begin to come together, gather them into a Runbook. Keep a digital copy but also distribute hard copies to both the IT team and business managers.

Testing/Procedures

Before testing, make sure your secondary site has replicating servers, data, and warm machines. For a smaller business running on Azure or AWS, this may be as simple as switching over to the cloud while your primary site systems are repaired. Even though this may seem easy, create documentation, testing, and procedures for this DR process.

Your test is a real-time run-through of the recovery effort. Isolate a DR environment away from the primary site, typically at the secondary site, so normal business functions will not be affected by testing. On command, the IT team will initiate failover with the Tier 1 applications and continue restoring services one tier at a time from the secondary site. After the test environment failover is complete, you will log in and validate that all systems are up in the test environment as per your User Acceptance document. Record the results of the test and analyze the reports for any anomalies or failures that need remediation.
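The flow of a test run can be sketched in a few lines: walk the tiers in order, then run each item from the User Acceptance document and record what passed and what failed. In this sketch the failover itself is only logged, since the real step depends on your replication tooling, and the acceptance checks are stand-in functions invented for the example.

```python
def run_dr_test(tiers, acceptance_checks):
    """Walk the tiers in order, then run every acceptance check.

    tiers: list of lists of system names, Tier 1 first.
    acceptance_checks: dict mapping a check description to a function
        that returns True when the check passes.
    """
    log = []
    for number, systems in enumerate(tiers, start=1):
        for system in systems:
            # Placeholder: a real test would trigger failover here.
            log.append(f"Tier {number}: failover initiated for {system}")

    results = {name: bool(check()) for name, check in acceptance_checks.items()}
    failures = [name for name, passed in results.items() if not passed]
    return log, results, failures


# Hypothetical systems and checks, for illustration only.
example_tiers = [["email", "erp-database"], ["file-share"]]
example_checks = {
    "Users can log in to email": lambda: True,
    "Orders can be entered in ERP": lambda: True,
    "Shared drive is readable": lambda: False,  # simulated failure
}

log, results, failures = run_dr_test(example_tiers, example_checks)
print("Needs remediation:", failures)
```

Anything that lands in the failures list is a candidate for remediation and for an update to the Runbook.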

Testing should occur quarterly, and any new procedures discovered should be written into your DR Runbook. Schedule these tests with both your IT team and employees from other departments within your company.

Procedures/Recovery

What happens when a disaster does occur? Whether an earthquake, hurricane, fire, or flood, your primary site and business systems are down. What happens now?

Follow the instructions and DR plan you have laid out step-by-step. Once failover is initiated per your Runbook standards, your IT team will begin restoring services. Once failover is complete, your business will verify all systems are working according to your User Acceptance document. If your documentation, planning, and testing are sound, your business services will be up and running at the secondary site with minimal impact.

When it comes to disaster recovery, there is no “one size fits all” approach. Qualified engineers should look at your environment, pick it apart, and find out what it takes to construct a good disaster recovery process. Do not rely on people to pass systems knowledge along or remember the recovery processes. When employees leave, they take that knowledge with them and sometimes it is lost forever. No matter how small or well-known, any plan worth having is worth documenting. A good disaster recovery plan that is documented and tested will give your business the competitive edge to stay open while other businesses may be off-line or down indefinitely.