Back in the day, hurricane season was the time of year when local information technology professionals would become extremely concerned with disaster recovery. If a hurricane hit, what could be done to ensure computer systems could still support the needs of the organization? Nowadays, however, with the increasing reliance on computer systems and the availability of such systems over the internet all day every day, protecting against extended outages is a year-round concern.
While the demands for system availability have increased, the process by which a disaster recovery plan is devised has not changed. A good disaster recovery plan takes into account the business needs of the organization first. Toward that end, all of the critical and important systems of the organization need to be identified. Representatives of the business units must take part in this exercise; it cannot be a task done by IT folks alone.
Then for each system, two factors need to be defined. The first is the amount of downtime the organization can tolerate before the system needs to be up and running again. The fancy name for this is the recovery time objective (abbreviated as RTO since we love our acronyms). For example, how long can our accounting system be down? How long can we go without access to email?
The second factor is called the recovery point objective (RPO). In plain English, RPO is how far back in time you can afford to go when you restore your system, that is, how much recent data you can stand to lose. In practice, the point you actually restore to will be the date and time of your last good backup.
One can see how these two factors go together. For example, if our RTO is 24 hours and our RPO is 48 hours, we’ll be OK as long as the system comes back up within one day, with data that is no more than two days old.
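For readers who like to see the arithmetic, the 24-hour RTO and 48-hour RPO example can be sketched in a few lines of Python. This is only an illustration: the function name `meets_objectives` and the dates are made up for the example, not taken from any real system or tool.

```python
from datetime import datetime, timedelta

def meets_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Return (rto_met, rpo_met) for one outage. Hypothetical helper for illustration."""
    downtime = service_restored - outage_start   # how long the system was down
    data_age = outage_start - last_good_backup   # how much recent data was lost
    return downtime <= rto, data_age <= rpo

# Made-up incident: down for 22 hours, restored from a backup taken 30 hours before the outage.
outage_start = datetime(2024, 9, 3, 8, 0)
service_restored = datetime(2024, 9, 4, 6, 0)
last_good_backup = datetime(2024, 9, 2, 2, 0)

rto_met, rpo_met = meets_objectives(
    outage_start, service_restored, last_good_backup,
    rto=timedelta(hours=24), rpo=timedelta(hours=48),
)
print(rto_met, rpo_met)  # True True
```

Both objectives are met here because 22 hours of downtime is under the 24-hour RTO and 30 hours of lost data is under the 48-hour RPO.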
This example was probably reasonable for most businesses and government agencies just a few years ago. These days, however, the targets are typically much shorter, measured in minutes rather than hours or days. It is not unusual to see a demand for systems to come up in 60 minutes, with data that is maybe 90 minutes old. With metrics this tight, it’s not just “disasters” that could cause the recovery plan to kick in; plain old human error could do it too.
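A tight RPO has a direct consequence for how often backups must run. The sketch below assumes a deliberately simple model that is not from any particular backup product: backups start on a fixed schedule, each one captures the data as of its start time, and a backup only counts once it has finished. Under those assumptions, the oldest your restored data could be is one backup interval plus the time a backup takes to complete. The function names are hypothetical.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Oldest possible restore point under the simplified model described above."""
    # Worst case: the failure hits just before the next backup finishes,
    # so the newest usable copy dates from the start of the previous one.
    return backup_interval + backup_duration

def schedule_meets_rpo(backup_interval, rpo, backup_duration=timedelta(0)):
    """True if the backup schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo

rpo = timedelta(minutes=90)

# A nightly backup cannot satisfy a 90-minute RPO.
print(schedule_meets_rpo(timedelta(hours=24), rpo))  # False

# Hourly backups that each take 15 minutes: worst case is 75 minutes.
print(schedule_meets_rpo(timedelta(minutes=60), rpo, timedelta(minutes=15)))  # True
```

Real systems often use continuous replication rather than scheduled backups to hit targets like these, which is part of why the price goes up.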
Of course, the wrench in all of this is cost. The lower the RTO and RPO, the greater the cost. Meeting metrics such as those described above might require “hot spares,” that is, a duplicate system that is always running, albeit in a different location. This basically means two of everything. Organizations need to take a hard look at such costs and balance them against the fallout of a prolonged system outage.
John Agsalud is an IT expert with more than 25 years of information technology experience. Reach him at jagsalud@live.com.