Years ago, we received a rather disturbing letter from a mainland-based employee working for one of Hawaii’s most important and prominent companies. The letter claimed, among other things, that this company’s IT operations were in such dire straits that a system failure would bring local commerce to a standstill for weeks, and cripple Hawaii’s economy. We reviewed these claims in detail, and came to the conclusion that the author was, more than anything else, a disgruntled employee. However, he still raised good points about planning for, recovering from and even preventing system failures.
Our author was of the mind that the Recovery Point Objective of the company was out of whack. In plain English, RPO is that point in time to which you can effectively restore your system. Typically, this will be the date and time of your last good backup (or set of backups).
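To make the definition concrete, here is a minimal sketch of how the data at risk could be measured against an RPO. The function names, the timestamps and the 24-hour target are all hypothetical, chosen only for illustration; they do not describe any particular company's systems.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure_time: datetime) -> timedelta:
    """Return how much work a restore cannot bring back.

    Anything written after the last good backup is lost when you
    restore from that backup.
    """
    if failure_time < last_good_backup:
        raise ValueError("failure_time precedes last_good_backup")
    return failure_time - last_good_backup


def meets_rpo(last_good_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data lost in this failure stays within the RPO."""
    return data_loss_window(last_good_backup, failure_time) <= rpo


# Hypothetical scenario: nightly backup at 2:00 a.m., failure at 5:30 p.m.
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 17, 30)

print(data_loss_window(last_backup, failure))                # 15:30:00
print(meets_rpo(last_backup, failure, timedelta(hours=24)))  # True
print(meets_rpo(last_backup, failure, timedelta(hours=4)))   # False
```

The same outage that is acceptable under a 24-hour RPO misses a 4-hour RPO, which is why the target has to be set by the business before the backup schedule is chosen.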
In our estimation, the company’s RPO was 24 hours. The company seemed to understand this parameter, and believed that yesterday’s data would suffice in the event of an outage. While the company really didn’t want to discuss the matter in detail, it was clear that it had a set of procedures in place to deal with losing a day’s worth of data.
Further, it was our opinion at the time that the company was trying to improve its RPO. It was experiencing problems with the improvement efforts, which we believed spurred the employee to author his correspondence to us.
RPO is one of two parameters commonly used to gauge system recovery effectiveness. The other is RTO, or Recovery Time Objective. Simply put, RTO is the amount of time an organization can be without its system(s).
RTO obviously affects the methods and technologies used to back up and restore systems. If your RTO is four hours, your backup had better be in a format and on a medium that can be restored in that time frame. Realistically, though, it should be restorable in half that time, to allow for any hiccups that might occur during the process.
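The half-the-time rule of thumb can be checked with simple arithmetic. The sketch below is only an illustration: the backup size, the restore throughput and the function names are assumptions, and a real estimate would also need to account for steps such as retrieving media and verifying the restored data.

```python
from datetime import timedelta


def restore_budget(rto: timedelta, safety_factor: float = 0.5) -> timedelta:
    """Time allowed for the restore itself, leaving the rest of the RTO as slack."""
    return rto * safety_factor


def estimated_restore_time(backup_size_gb: float, throughput_gb_per_hour: float) -> timedelta:
    """Rough restore duration from backup size and sustained restore speed."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput_gb_per_hour must be positive")
    return timedelta(hours=backup_size_gb / throughput_gb_per_hour)


# Hypothetical figures: a 500 GB backup restored at 200 GB per hour.
rto = timedelta(hours=4)
budget = restore_budget(rto)
estimate = estimated_restore_time(500, 200)

print(budget)              # 2:00:00
print(estimate)            # 2:30:00
print(estimate <= rto)     # True: the restore fits the RTO on paper...
print(estimate <= budget)  # False: ...but leaves too little room for hiccups
```

In this made-up case the restore technically fits within the four-hour RTO, yet it overruns the two-hour budget, which would be a signal to look at a faster medium or a smaller restore set.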
RTO also dictates the type of infrastructure you need to consider. Often, the RTO is so short that the organization cannot wait for a traditional backup/restore cycle. In such cases, real-time backups must be employed. These include technologies such as clustering or system relocation, where applications that experience problems are automatically moved to another system.
The author of the letter also claimed that the RTO of the company was set too high at eight hours. The company, however, claimed that it was aware of the RTO and had operational procedures in place to deal with an outage.
Whether or not the company’s claims could be confirmed, it was clear to us that it was aware of its limitations. The company claimed it was following best practices in confirming its business requirements, identifying its limitations, and developing policies and procedures to deal with such outages. All organizations should ensure that they follow similar processes.
———
John Agsalud is an IT expert with more than 20 years of information technology experience. Reach him at johnagsalud@yahoo.com.