The Problem with Most Disaster Recovery Plans
Your disaster recovery plan probably sucks. That's a strange thing to say, since someone took the time to write it up, a group of auditors most likely blessed it, and everyone felt good about it. Despite all that, the plan probably sucks. There are a few reasons for this: first, it hasn't been tested enough; second, it's out of date; and third, it doesn't account for enough of what could go wrong.
The first problem is that disaster recovery plans are difficult to test. It's relatively easy to hold a tabletop exercise and discuss what could go wrong; actually spinning up a service in a new region, a new account, or (god forbid) another cloud provider and getting everything to work is a different story. In the best case, database failover gets tested, but a restoration of the whole system does not.
The second major issue you'll encounter is that DR plans tend to become outdated quickly. This is the same problem you see with architecture diagrams or any other documentation, except that the disaster recovery plan is the document you'll be relying on at the most critical time. Since DR plans are only used internally, they rarely get enough eyes on them to catch issues as things change. The result is that DR plans rot over time and stop giving an accurate view of the recovery process. A plan might be about 90% right when the process is first written down, but within six months it no longer makes sense because new services have been released and the infrastructure has changed.
This brings us to the third major problem: DR plans usually only consider how the data will get restored. Data is usually the most critical part of the process, but a working system also depends on a whole host of configuration and application code. Most disasters let you recover in place, so this isn't a huge issue, but it can turn into a total mess if you are recovering into a new region, account, or data center. Problems arise from bad assumptions and misunderstood small technical details. For example, IAM roles and users are global, but their policies frequently reference resources that only exist in a single region, which breaks when you restore somewhere else.
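As a rough illustration, here's a minimal sketch, assuming AWS with boto3 and a hypothetical "home" region, that scans customer-managed IAM policies for ARNs pinned to a specific region. It's a starting point for finding region-bound references before a failover, not a complete audit.

```python
# Sketch: find customer-managed IAM policies whose resource ARNs are pinned
# to a specific region. Assumes AWS credentials and boto3 are available; the
# home region below is a hypothetical placeholder.
import boto3

HOME_REGION = "us-east-1"  # hypothetical primary region

def arn_region(arn):
    # ARN format: arn:partition:service:region:account:resource
    parts = arn.split(":")
    return parts[3] if len(parts) > 4 else ""

iam = boto3.client("iam")
paginator = iam.get_paginator("list_policies")

for page in paginator.paginate(Scope="Local"):  # customer-managed policies only
    for policy in page["Policies"]:
        version = iam.get_policy_version(
            PolicyArn=policy["Arn"], VersionId=policy["DefaultVersionId"]
        )
        statements = version["PolicyVersion"]["Document"].get("Statement", [])
        if isinstance(statements, dict):
            statements = [statements]
        for stmt in statements:
            resources = stmt.get("Resource", [])
            if isinstance(resources, str):
                resources = [resources]
            for res in resources:
                if arn_region(res) == HOME_REGION:
                    print(f"{policy['PolicyName']} pins a resource to "
                          f"{HOME_REGION}: {res}")
```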
What to Do About It
Now that we've talked about the problems, we can look at what to do about them. Many of the solutions are simple, but that doesn't make them easy. The first piece of the puzzle is a realistic list of scenarios to build recovery plans around. It's fun to think about what happens when an asteroid strikes, but many smaller-scale disasters also trigger a recovery process and are far more likely, and those are worth thinking through too. For example, if a bad SQL query partially deletes data, you're faced with how to partially recover it, and most traditional recovery processes don't cover this scenario well. Working through these scenarios is what lets you actually recover quickly. You are far more likely to face partial data loss or a configuration mistake that requires recovery than to be dealing with the loss of a region.
Once you have a list of scenarios you're concerned about, the next step is to test the recovery process. Some things that go wrong in reality won't go wrong in testing; for instance, regional failover will be slower during a real regional failure because everyone else is trying to move to a new region at the same time. Even so, actually spinning up your infrastructure and then shadowing traffic to the recovered instances gives you a very good idea of whether the process works. Doing this surfaces two sets of issues: recovering the data, which is what everyone thinks of with disaster recovery (though partial recovery frequently turns out to be a major challenge), and recovering all of the configuration that exists for a system.
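As a rough sketch of what the comparison step of shadowing can look like, assuming hypothetical read-only endpoints on both the primary and recovered stacks and the `requests` library, the idea is to replay a sample of requests against both and compare the responses:

```python
# Sketch: replay a sample of read-only requests against the primary and the
# recovered stack and compare results. The endpoints and paths are
# hypothetical; real shadowing usually happens at the proxy or load-balancer
# layer, but the comparison looks roughly like this.
import requests

PRIMARY = "https://api.example.com"        # hypothetical primary endpoint
RECOVERED = "https://api.dr.example.com"   # hypothetical recovered endpoint

SAMPLE_PATHS = [
    "/healthz",
    "/v1/items/123",
    "/v1/items/456",
]

mismatches = 0
for path in SAMPLE_PATHS:
    primary_resp = requests.get(PRIMARY + path, timeout=5)
    recovered_resp = requests.get(RECOVERED + path, timeout=5)

    same_status = primary_resp.status_code == recovered_resp.status_code
    same_body = primary_resp.text == recovered_resp.text

    if not (same_status and same_body):
        mismatches += 1
        print(f"MISMATCH {path}: "
              f"{primary_resp.status_code} vs {recovered_resp.status_code}")

print(f"{mismatches} mismatches out of {len(SAMPLE_PATHS)} sampled requests")
```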
Partial recovery is hard because most tools assume you are restoring a whole database, or at least a whole table. In theory this is solvable with most logical backup tools, since they can give you each row of data as an insert. Of course, at that point you have the problem of finding the right rows, which is doable but amounts to grepping through your backup to complete the partial restore. Point-in-time backup tools can partially address this, but depending on how they work with your database they can have many of the same problems. It's worth putting time into finding or creating tools that allow for partial restoration of data, especially if you are working in a multitenant environment.
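To make that concrete, here's a minimal sketch of the "grep through your backup" approach, assuming a plain-SQL logical dump taken with per-row INSERT statements (for example, pg_dump --inserts) and a hypothetical table and tenant id you want to recover:

```python
# Sketch: pull the INSERT statements for one table and one tenant out of a
# plain-SQL logical dump. Assumes the dump was taken with per-row INSERTs
# (e.g. pg_dump --inserts); the file, table, and tenant id are hypothetical.
import re

DUMP_FILE = "backup.sql"     # hypothetical dump file
TABLE = "public.orders"      # hypothetical table to restore from
TENANT_ID = "42"             # hypothetical tenant to recover

insert_prefix = f"INSERT INTO {TABLE} "
tenant_pattern = re.compile(rf"\b{re.escape(TENANT_ID)}\b")

with open(DUMP_FILE) as dump, open("partial_restore.sql", "w") as out:
    for line in dump:
        # Keep only rows for the target table that mention the tenant id.
        # A real tool would parse column positions instead of pattern matching.
        if line.startswith(insert_prefix) and tenant_pattern.search(line):
            out.write(line)

# Review partial_restore.sql, then apply it to a scratch database first.
```

A safer variant restores the whole dump into a scratch database and copies the needed rows over with SQL, which avoids the fragile pattern matching entirely; either way, the point is that partial restores need tooling built ahead of time.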
Another major challenge is configuration. Infrastructure as code partially solves this, but there will still be problems when restoring to a new region or account. Often there is configuration that isn't captured in infrastructure as code even though it should be, and even when everything is captured, some configuration is inherently region-specific. All of this demands that the failover location be tested beforehand, including shadowing traffic to that region to surface failures. Another common issue is encryption keys: if the key that protects a backup doesn't exist in the new region, the backup isn't actually restorable there.
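As one example of the encryption-key problem, here's a small sketch, assuming AWS KMS with boto3 and hypothetical key and region names, that checks whether the key protecting your backups is a multi-Region key with a replica in the failover region:

```python
# Sketch: check whether the KMS key used for backups can be used in the
# failover region. Assumes AWS credentials and boto3; the key alias and
# regions are hypothetical placeholders.
import boto3

BACKUP_KEY_ID = "alias/backup-key"   # hypothetical key alias
HOME_REGION = "us-east-1"            # hypothetical primary region
FAILOVER_REGION = "us-west-2"        # hypothetical recovery region

kms = boto3.client("kms", region_name=HOME_REGION)
metadata = kms.describe_key(KeyId=BACKUP_KEY_ID)["KeyMetadata"]

if not metadata.get("MultiRegion"):
    print("Backup key is single-region; backups encrypted with it cannot be "
          "decrypted in the failover region without re-encryption.")
else:
    replicas = metadata.get("MultiRegionConfiguration", {}).get("ReplicaKeys", [])
    replica_regions = {r["Region"] for r in replicas}
    if FAILOVER_REGION in replica_regions:
        print(f"Replica key exists in {FAILOVER_REGION}; backups should be restorable.")
    else:
        print(f"No replica key in {FAILOVER_REGION}; create one before relying "
              "on cross-region restore.")
```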
Testing and Dependency Management
There is no way to sugarcoat it: testing DR infrastructure is hard and expensive. There is a reason most cloud deployments end up in a single region. Recovery has many of the same challenges as going multi-region or other attempts at high availability, and it deserves a similar amount of engineering effort. The problem isn't that recovery takes a long time in and of itself; the bigger problem is that recovery times are unpredictable when the process has to be invented on the fly each time. This means you need a testing pyramid for your infrastructure to ensure that the disaster recovery process works: individual components tested frequently, at least once a day (preferably more), and total system tests planned for and run less often due to their expense and complexity.
The other major issue with total system tests is that they are expensive, both in compute and, more importantly, in time. This becomes even more apparent at scale, when lots of coordination between different groups is required. That need for coordination makes total system tests even more important, since communication breakdowns can significantly slow down a restoration. One problem in many large technology organizations that total system recovery testing uncovers is circular dependencies between systems, which make recovery very difficult because there is no clean order in which to bring things back. Resolving those circular dependencies is usually a huge undertaking in its own right.
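A lightweight way to catch this early is to keep a declared dependency map for your services and check it for cycles as part of DR testing. Here's a minimal sketch with a hypothetical service graph:

```python
# Sketch: detect circular dependencies in a declared service dependency map.
# The services and edges below are hypothetical; in practice the map would be
# generated from service metadata or infrastructure-as-code definitions.

# service -> list of services it needs in order to start
DEPENDENCIES = {
    "auth": ["config"],
    "config": ["auth"],       # circular with auth: recovery order is unclear
    "api": ["auth", "database"],
    "database": [],
}

def find_cycle(graph):
    """Return one dependency cycle as a list of services, or None."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in visiting:                 # back edge means a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for service in graph:
        if service not in visited:
            cycle = dfs(service, [])
            if cycle:
                return cycle
    return None

cycle = find_cycle(DEPENDENCIES)
if cycle:
    print("Circular dependency found: " + " -> ".join(cycle))
else:
    print("No circular dependencies; a recovery order exists.")
```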
Regular testing also makes it more likely that dependency issues get caught early in the life of an application. Catching them early matters because dependency graphs only get more complicated over time without a concerted effort to simplify them. As the number of systems and integrations grows, so does the likelihood of a complicated, hard-to-break chain of dependencies. DR testing can be part of the process of finding hidden dependencies.
Tests of single components are also necessary. There are still far too many incidents where the root cause turns out to be that backups weren't being taken successfully. These tests are more straightforward and should be automated. Many tools can help, but there are still plenty of small surprises that need to be caught and accounted for: making sure backups are rotated, that the keys used to encrypt them still exist, and so on. Building automatic quality checks for these issues makes a successful recovery far more likely.
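As a simple example of that kind of automated check, here's a sketch, assuming backups land in an S3 bucket and boto3 is available (the bucket, prefix, and threshold are hypothetical), that alerts when the newest backup is older than a day:

```python
# Sketch: verify that a recent backup actually exists. Assumes backups are
# written to an S3 bucket under a prefix; the bucket, prefix, and age
# threshold are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "example-backups"       # hypothetical bucket
PREFIX = "db/daily/"             # hypothetical backup prefix
MAX_AGE = timedelta(hours=24)    # how stale a backup is allowed to be

s3 = boto3.client("s3")
newest = None

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if newest is None or obj["LastModified"] > newest["LastModified"]:
            newest = obj

if newest is None:
    print("ALERT: no backups found at all")
elif datetime.now(timezone.utc) - newest["LastModified"] > MAX_AGE:
    print(f"ALERT: newest backup {newest['Key']} is older than {MAX_AGE}")
else:
    print(f"OK: newest backup is {newest['Key']} ({newest['LastModified']})")
```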
Conclusion
Having a plan is critical for disaster recovery, but for the plan to work it needs to be tested; without testing it's just a piece of paper with some ideas about how recovery should go. Reality has lots of details, and they only come out with actual testing. It's OK if some things don't work correctly the first time, but there needs to be a process of continual improvement with disaster recovery, just like anything else in operations.
Getting to the point where you can guarantee success in a disaster recovery scenario is a journey. Almost every problem you encounter will take significant work to solve, so it's better to take the problems one at a time and work through them. The good news is that improving disaster recovery improves resiliency throughout the software development lifecycle: making DR easier makes handling incidents easier and forces you to create repeatable processes. There is significant value in this work beyond just the ability to recover after a disaster.