Recovery Strategies
Recovery alternatives are the choices an organization has for restoring critical systems and the data in those systems. Recovery strategies can include the following:
Alternate processing sites
Hardware recovery
Software and data recovery
Backup and restoration
Telecommunications recovery
The goal is to create a recovery strategy that balances the cost of downtime, the criticality of the system, and the likelihood of occurrence. As an example, if you have an RTO of less than 12 hours and the resource you are trying to recover is a mainframe computer, a cold-site facility would never work—because you can’t buy a mainframe, install it, and get the cold site up and running in less than 12 hours. Therefore, although cost is important, so are criticality and the time to recover. The total outage time that the organization can endure is referred to as maximum tolerable downtime (MTD). Table 4-5 shows some MTDs used by many organizations.
Table 4-5 Required Recovery Times
Item | Required Recovery Time
Critical | Minutes to hours
Urgent | 24 hours
Important | 72 hours
Normal | 7 days
Nonessential | 30 days
Alternate Processing Sites
For disasters that have the potential to affect the primary facility, plans must be made for a backup process or an alternate site. Some organizations might opt for a redundant processing site. Redundant sites are equipped and configured just like the primary site. They are owned by the organization, and their cost is high. After all, the company must spend a large amount of funds to build and equip a complete, duplicate site. Although the cost might seem high, it must be noted that organizations that choose this option have done so because they have a very short (if any) RPO. A loss of services for even a very short period of time would cost the organization millions. The organization also might be subjected to regulations that require it to maintain redundant processing. Before choosing a location for a redundant site, it must be verified that the site is not subject to the same types of disasters as the primary site. Regular testing is also important to verify that the redundant site still meets the organization’s needs and that it can handle the workload to meet minimum processing requirements.
Alternate Processing Options
Mobile sites are another alternate processing option. Mobile sites are usually tractor-trailer rigs that have been converted into data-processing centers. They contain all the necessary equipment and can be transported to a business location quickly. They can be chained together to provide space for data processing and can provide communication capabilities. Used by the military and large insurance agencies, mobile sites are a good choice in areas where no recovery facilities exist.
Another type of recovery alternative is subscription services, such as hot sites, warm sites, and cold sites.
A hot site facility is ready to go. It is fully configured and equipped with the same system as the production network. It can be made operational within just a few hours. A hot site merely needs staff, data files, and procedural documentation. Hot sites are a high-cost recovery option, but they can be justified when a short recovery time is required. Because a hot site is typically a subscription-based service, a range of fees is associated with it, including a monthly cost, subscription fees, testing costs, and usage or activation fees. Contracts for hot sites need to be closely examined; some might charge extremely high activation fees to prevent users from utilizing the facility for anything less than a true disaster.
Regardless of what fees are involved, the hot site needs to be periodically tested. Tests should evaluate processing abilities as well as security. The physical security of a hot site should be equal to or greater than the physical security at the primary site. Finally, it is important to remember that the hot site is intended for short-term use only. With a subscriber service, other companies might be competing for the same resource. The organization should have a plan to recover primary services quickly or move to a secondary location.
For a slightly less expensive alternative, an organization can choose a warm site. A warm site has data equipment and cables and is partially configured. It could be made operational in anywhere from a few hours to a few days. The assumption with a warm site is that computer equipment and software can be procured in case of a disaster. Although the warm site might have some computer equipment installed, it typically has lower processing power than the equipment at the primary site. The costs associated with a warm site are slightly lower than those of a hot site. The warm site is the most popular subscription alternative.
For organizations that are looking for a cheaper alternative and that have determined that they can tolerate a longer outage, a cold site might be the right choice. A cold site is basically an empty room with only rudimentary electrical, power, and computing capability. It might have a raised floor and some racks, but it is nowhere near ready for use. It might take several weeks to a month to get the site operational. A common misconception with cold sites is that the organization will be able to get the required equipment after a disaster. This might not be true with large disasters. For example, with Hurricanes Katrina, Sandy, and Irma, vendors sold out of equipment and could not meet demand. Backorders could push the operational date of a cold site well past what was planned. Of the three subscription services discussed, cold sites offer the least readiness. Table 4-6 shows some examples of functions and their recovery times.
Table 4-6 Examples of Functions and Recovery Times
Process | Recovery Time | Recovery Strategy
Database | 15 minutes to 1 hour | Database shadowing at a redundant site
Applications | 12–24 hours | Hot site
Help desk | 24–48 hours | Hot site
Purchasing | 24–48 hours | Hot site
Payroll | 1–3 days | Redundant site
Asset inventory | 5–7 days | Warm site
Nonessential services | 30 days | Cold site
Emergency services (for example, for companies that need to set up operations quickly in areas that have been hit by disasters, such as insurance companies, governmental agencies, military, and so on) | Hours to a few days | Mobile site
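The trade-off between cost and time to recover can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the activation times are rough worst-case readings of the site descriptions above (a few hours for a hot site, a few days for a warm site, about a month for a cold site), and the cost ranks and the name cheapest_site_meeting_rto are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteOption:
    name: str
    activation_hours: float  # assumed worst-case time to become operational
    relative_cost: int       # assumed cost rank; higher means more expensive


# Illustrative figures only; real values come from contracts and test results.
OPTIONS = [
    SiteOption("cold site", 30 * 24, 1),
    SiteOption("warm site", 3 * 24, 2),
    SiteOption("hot site", 4, 3),
    SiteOption("redundant site", 0, 4),
]


def cheapest_site_meeting_rto(rto_hours: float) -> str:
    """Return the lowest-cost option that can be running within the RTO."""
    candidates = [o for o in OPTIONS if o.activation_hours <= rto_hours]
    if not candidates:
        raise ValueError("no option can meet the requested RTO")
    return min(candidates, key=lambda o: o.relative_cost).name
```

With these assumed figures, an RTO of 12 hours rules out both the cold site and the warm site, which mirrors the mainframe example given earlier in this section.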
With reciprocal agreements, two organizations pledge assistance to one another in the event of a disaster. These agreements are carried out by sharing space, computer facilities, and technology resources. On paper, this appears to be a cost-effective solution because the primary advantage is its low cost. However, reciprocal agreements have drawbacks and are infrequently used. The parties to such an agreement must trust each other to aid in the event of a disaster. However, the nonvictim might be hesitant to follow through if such a disaster occurs, based on concerns such as the realization that the damaged party might want to remain on location for a long period of time or that the victim company’s presence will degrade the helping company’s network services. Even concerns about the loss of competitive advantage can drive this hesitation. The issue of confidentiality also arises: The damaged organization is placed in a vulnerable position and must entrust the other party with confidential information. Finally, if the parties to the agreement are near each other, there is always the danger that disaster could strike both parties and thereby render the agreement useless. The legal departments of both firms need to look closely at such an agreement. ISACA recommends that organizations considering reciprocal agreements address the following concerns before entering into them:
What amount of time will be available at the host computer site?
Will the host site’s employees be available for help?
What specific facilities and equipment will be available?
How long can emergency operations continue at the host site?
How frequently can tests be scheduled at the host site?
What type of physical security is available at the host site?
What type of logical security is available at the host site?
Is advance notice required for using the site? If so, how much?
Are there any blocks of time or dates when the facility is not available?
When reviewing alternative processing options, subscribers should look closely at any agreements and at the actual facility to make sure it meets the needs of the organization. One common problem is oversubscription. If situations such as Hurricane Harvey occur, there could be more organizations demanding a subscription service than the vendor can supply. The subscription agreement might also dictate when the organization may inhabit the facility. Thus, even though an organization might be in the path of a deadly storm, it might not be able to move into the facility yet because the area has not been declared a disaster area. Procedures and documentation should also be kept at the offsite location, and backups must be available. It’s important to note that backup media should be kept in an area that is not subject to the same type of natural disaster as the primary site. For example, if the primary site is in a hurricane zone, the backup needs to be somewhere less prone to those conditions. If backup media is at another location, agreements should be in place to ensure that the media will be moved to the alternate site so it is available for the recovery process. A final item is that organizations must also have prior financial arrangements to procure needed equipment, software, and supplies during a disaster. This might include emergency credit lines, credit cards, or agreements with hardware and software vendors.
Hardware Recovery
Recovery alternatives are just one of the items that must be considered to cope with a disaster. Hardware recovery is another. Remember that an effective recovery strategy involves more than just corrective measures; it is also about prevention. Hardware failures are some of the most common disruptions that can occur. It is therefore important to examine ways to minimize the likelihood of occurrence and to reduce the effect if it does occur. This process can be enhanced by making well-informed decisions when buying equipment. At purchase time, you should know three important items associated with reliability:
Mean time between failures (MTBF): The MTBF estimates the average operating time between failures of a device that can be repaired. A higher MTBF means the equipment should run longer between failures.
Mean time to failure (MTTF): The MTTF estimates the expected lifetime of a one-time-use item that is typically not repaired.
Mean time to repair (MTTR): The MTTR estimates how long it would take to repair the equipment and get it back into use. For MTTR, lower numbers mean the equipment takes less time to repair and can be returned to service sooner.
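One way to see how MTBF and MTTR interact is the standard reliability-engineering relationship availability = MTBF / (MTBF + MTTR). That formula is not stated in this chapter; it is included here as a commonly used approximation, and the function names below are hypothetical.

```python
HOURS_PER_YEAR = 8760


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable device is expected to be in service."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate annual downtime implied by the availability figure."""
    unavailable = 1 - steady_state_availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR
```

Comparing two candidate devices this way shows why a lower MTTR can matter as much as a higher MTBF when the goal is to limit downtime.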
For critical equipment, an organization might consider some form of service level management. This is simply an agreement between an IT service provider and a customer. The most common example is a service level agreement (SLA), which is a contract with a hardware vendor that provides a certain level of protection. For a fee, the vendor agrees to repair or replace the equipment within the contracted time.
Fault tolerance can be used at the server level or the drive level. At the server level is clustering, technology that groups several servers together yet allows them to be viewed logically as a single server. Users see the cluster as one unit, although it is actually many. The advantage is that if one server in the cluster fails, the remaining active servers will pick up the load and continue operation.
Redundant Array of Independent Disks
Fault tolerance on the drive level is achieved primarily with redundant array of independent disks (RAID), which is used for hardware fault tolerance and/or performance improvements and is achieved by breaking up the data and writing it to multiple disks. RAID has humble beginnings that date back to the 1980s at the University of California. To applications and other devices, RAID appears as a single drive. Most RAID systems have hot-swappable disks, which means the drives can be removed or added while the computer systems are running. If a RAID system uses parity and is fault tolerant, the parity data is used to rebuild the newly replaced drive. Another RAID technique is striping, which means the data is divided and written over several drives. Although write performance remains almost constant, read performance drastically increases. According to ISACA, these are the most common levels of RAID used today:
RAID 0
RAID 3
RAID 5
RAID level descriptions are as follows:
RAID 0: Striped disk array without fault tolerance: Provides data striping and improves performance but provides no redundancy.
RAID 1: Mirroring and duplexing: Duplicates the information on one disk to another. It provides twice the read transaction rate of single disks and the same write transaction rate as single disks yet effectively cuts disk space in half.
RAID 2: Error-correcting coding: Rarely used because of the extensive computing resources needed. It stripes data at the bit level instead of the block level.
RAID 3: Parallel transfer with parity: Uses byte-level striping with a dedicated parity disk. Although it provides fault tolerance, it is rarely used.
RAID 4: Shared parity drive: Similar to RAID 3 but provides block-level striping with a parity disk. If a data disk fails, the parity data is used to create a replacement disk. Its primary disadvantage is that the parity disk can create write bottlenecks.
RAID 5: Block interleaved distributed parity: Stripes both data and parity across all drives. Level 5 has good performance and fault tolerance. It is a popular implementation of RAID. It requires at least three drives.
RAID 6: Independent data disks with double parity: Provides high fault tolerance with block-level striping and parity data distributed across all disks.
RAID 10: A stripe of mirrors: Known to have very high reliability. It requires a minimum of four drives.
RAID 0+1: A mirror of stripes: Not one of the original RAID levels. RAID 0+1 uses RAID 0 to stripe data and creates a RAID 1 mirror. It provides high data rates.
RAID 15: Creates mirrors (RAID 1) and distributed parity (RAID 5). This is not one of the original RAID levels.
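The way parity data rebuilds a replaced drive can be illustrated with XOR, the operation commonly used for single-parity RAID levels. The sketch below is a simplified model that treats each drive as one short byte block; the function names are hypothetical, and real controllers work on stripes and handle many details omitted here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing_block(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


# Three data blocks stand in for three data drives.
data = [b"ACCT", b"2024", b"DATA"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding it from the rest.
rebuilt = rebuild_missing_block([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the lost block, which is why a single-parity array tolerates only one failed drive at a time.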
One final drive-level solution worth mentioning is just a bunch of disks (JBOD). JBOD is similar to RAID 0 but offers few of the advantages. What it does offer is the capability to combine two or more disks of various sizes into one large partition. It also has an advantage over RAID 0: In case of drive failure, only the data on the affected drive is lost; the data on surviving drives remains readable. Even so, JBOD has no fault tolerance, and it does not provide the performance benefits associated with RAID 0.
Software and Data Recovery
Because data processing is essential to most organizations, having the software and data needed to continue this operation is critical to the recovery process. The objectives are to back up critical software and data and be able to restore them quickly. Policy should dictate when backups are performed, where the media is stored, who has access to the media, and what its reuse or rotation policy is. Backup media can include tape reels, tape cartridges, removable hard drives, disks, and cassettes. The organization must determine how often backups should be performed and what type of backup should be performed. These operations will vary depending on the cost of the media, the speed of the restoration needed, and the time allocated for backups. Typically, the following four backup methods are used:
Full backup: All data is backed up. No data files are skipped or bypassed. All items are copied to one tape, set of tapes, or backup medium. If restoration is needed, only one tape or set of tapes is needed. A full backup requires the most time and space on the storage medium but takes the least time to restore.
Differential backup: A full backup is typically done once a week, and a daily differential backup copies only those files that have changed since the last full backup. To restore, you need the last full backup and the most recent differential backup. Each differential backup takes less time than a full backup, but restoration takes longer because both the full and differential backups are needed.
Incremental backup: This method backs up only those files that have been modified since the most recent backup, whether that backup was full or incremental. An incremental scheme requires more backup media at restore time because the last full backup and every incremental backup made since then are required to restore the data.
Continuous backup: Some backup applications perform a continuous backup that keeps a database of backup information. These systems are useful because if a restoration is needed, the application can provide a full restore, a point-in-time restore, or a restore based on a selected list of files.
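The restore requirements of the full, differential, and incremental methods described above can be sketched as a small function that picks which backups are needed. This is a simplified model under stated assumptions: backups are identified only by a day number and a type, a schedule uses either differentials or incrementals between full backups (never both), and the names Backup and restore_set are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "differential", or "incremental"


def restore_set(history: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups needed, in order, to restore data as of target_day."""
    usable = sorted((b for b in history if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup exists on or before the target day")
    last_full = fulls[-1]
    later = [b for b in usable if b.day > last_full.day]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("this sketch assumes one backup type between fulls")
    if differentials:
        # Only the most recent differential is needed on top of the full backup.
        return [last_full, differentials[-1]]
    # Every incremental since the full backup is needed, oldest first.
    return [last_full] + incrementals
```

For a full backup on day 0 followed by daily backups on days 1 through 3, the differential schedule needs two backups to restore day 3, while the incremental schedule needs all four.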
Although tape and optical systems still have significant market share for backup systems, hardware alternatives and cloud-based options are making inroads. One of these technologies is massive array of inactive disks (MAID). MAID offers a hardware storage option for the storage of data and applications. It was designed to reduce the operational costs and improve long-term reliability of disk-based archives and backups. MAID is similar to RAID, except that it provides power management and advanced disk monitoring. The MAID system powers down inactive drives, reduces heat output, reduces electrical consumption, and increases the drive’s life expectancy. This represents real progress over using hard disks to back up data. Storage area networks (SANs) are another alternative. SANs are designed as a subnetwork of high-speed, shared storage devices. Cloud backup is gaining in popularity because it offers value-added functions such as geographical redundancy, advanced search, content management, and automatic offsite storage.
Backup and Restoration
Where backup media are stored can have a big impact on how quickly data can be restored and brought back online. The media should be stored in more than one physical location to reduce the possibility of loss. A tape librarian should manage these remote sites by maintaining the site, controlling access, rotating media, and protecting this valuable asset. Unauthorized access to the media is a huge risk because it could impact the organization’s ability to provide uninterrupted service. Encryption can help mitigate this risk. Transportation to and from the remote site is also an important concern. Consider the following important items:
Secure transportation to and from the site must be maintained.
Delivery vehicles must be bonded.
Backup media must be handled, loaded, and unloaded in an appropriate way.
Drivers must be trained on the proper procedures to pick up, handle, and deliver backup media.
Access to the backup facility should be 24×7 in case of emergency.
Offsite storage should be contracted with a known firm that has control of the facility and is responsible for its maintenance. Physical and environmental controls should be equal to or better than those of the organization’s facility. A letter of agreement should specify who has access to the media and who is authorized to drop off or pick up media. There should also be an agreement on response time that is to be met in times of disaster. Onsite storage should be maintained to ensure the capability to recover critical files quickly. Backup media should be secured and kept in an environmentally controlled facility that has physical control sufficient to protect such a critical asset. This area should be fireproof, with controlled access so that anyone depositing or removing media is logged. Although most backup media is rather robust, it will not last forever and will fail over time. This means that tape rotation is another important part of backup and restoration.
Backup media must be periodically tested. Backups will be of little use if they malfunction during a disaster. Common media-rotation strategies include the following:
Simple: A simple backup rotation scheme is to use one tape for every day of the week and then repeat the next week. One tape can be for Mondays, one for Tuesdays, and so on. You would add a set of new tapes each month and then archive the monthly sets. After a predetermined number of months, you would put the oldest tapes back into use.
Grandfather-father-son: This rotation method includes four tapes for weekly backups, one tape for monthly backups, and four tapes for daily backups. It is called grandfather-father-son because the scheme establishes a kind of hierarchy. Grandfathers are the one monthly backup, fathers are the four weekly backups, and sons are the four daily backups.
Tower of Hanoi: This tape-rotation scheme is named after a mathematical puzzle. It involves using five sets of tapes, each set labeled A through E. Set A is used every other day; set B is used on the first non-A backup day and is used every fourth day; set C is used on the first non-A or non-B backup day and is used every eighth day; set D is used on the first non-A, non-B, or non-C day and is used every 16th day; and set E alternates with set D.
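The Tower of Hanoi schedule described above follows a simple pattern: the set used on a given day depends on how many times 2 divides the day number. The sketch below assumes backup days are numbered from 1 and that day 1 uses set A; the function name hanoi_tape_set is hypothetical.

```python
def hanoi_tape_set(day: int, labels: str = "ABCDE") -> str:
    """Return the tape set to use on a 1-based backup day."""
    if day < 1:
        raise ValueError("backup days are numbered from 1")
    # Count how many times 2 divides the day number (its trailing zero bits).
    trailing_zeros = (day & -day).bit_length() - 1
    # The last set absorbs every deeper level, so it alternates with the set before it.
    return labels[min(trailing_zeros, len(labels) - 1)]


# Build the rotation for the first 16 backup days.
schedule = "".join(hanoi_tape_set(day) for day in range(1, 17))
```

Under these assumptions set A lands on every odd day, set B on days 2, 6, 10, and so on, and sets D and E take turns every eighth day (8, 16, 24, 32), which matches the description of E alternating with D.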
SANs are an alternative to traditional backup. SANs support disk mirroring, backup and restore, archival and retrieval of archived data, and data migration from one storage device to another. SANs can be implemented locally or can use storage at a redundant facility. Another option is a virtual SAN (VSAN), a SAN that offers isolation among devices that are physically connected to the same SAN fabric. A VSAN is sometimes called fabric virtualization.
Traditionally, SANs used Small Computer System Interface (SCSI) for connectivity, but there are more current options in use today. One is iSCSI, which is a SAN standard used for connecting data storage facilities and allowing remote SCSI devices to communicate. Fibre Channel over Ethernet (FCoE) is another SAN interface standard. FCoE is similar to iSCSI; it can operate at speeds of 10Gbps and rides on top of the Ethernet protocol. While it is fast, it has a disadvantage in that it is nonroutable.
One important issue with SAN and backups is location redundancy. This is the concept that content should be accessible from more than one location. An extra measure of redundancy can be provided by means of a replication service so that data is available even if the main storage backup system fails.
Another important item is security of the backups. This is where secure storage management and replication are important. The idea is that systems must be designed to allow a company to manage and handle all corporate data in a secure manner, with a focus on the confidentiality, integrity, and availability of the information. The replication service allows for the data to be duplicated in real time so that additional fault tolerance is achieved.
When you need to make point-in-time backups, you can use SAN snapshots. SAN snapshot software is typically sold with a SAN solution and offers a way to bypass typical backup operations. The snapshot software has the ability to temporarily stop writing to physical disk and make a point-in-time backup copy.
If budget is an issue, an organization can opt for electronic vaulting, which involves transferring data by electronic means to a backup site, as opposed to physical shipment. With electronic vaulting, an organization contracts with a vaulting provider. The organization typically loads a software agent onto systems to be backed up, and the vaulting service accesses these systems and copies the selected files. Moving large amounts of data can slow WAN service.
Another backup alternative is standby database shadowing. A standby database is an exact duplicate of a database maintained on a remote server. In case of disaster, it is ready to go. Changes are applied from the primary database to the standby database to keep records synchronized.
As an alternative to traditional backup techniques, using cloud services for backup may offer a cost-saving alternative. These services should be carefully evaluated, as there are many concerns when using them. Cloud backups can be deployed in a variety of configurations—for example, as an on-premises private cloud or as an offsite public or private cloud.
Telecommunications Recovery
Telecommunications recovery should play a key role in recovery. After all, the telecommunications network is a critical asset and should be given a high priority for recovery. Although these communications networks can be susceptible to the same threats as data centers, they also face some unique threats. Protection methods include redundant WAN links and bandwidth on demand. Whatever the choice, the organization should verify capacity requirements and acceptable outage times. The following are the primary methods for telecommunications network protection:

Redundancy: This involves exceeding what is required or needed. Redundancy can be added by providing extra capacity, providing multiple routes, using dynamic routing protocols, and using failover devices to allow for continued operations.
Diverse routing: This is the practice of routing traffic through different cable facilities. Organizations can obtain both diverse routing and alternate routing, but the cost is not low. Most of these systems use facilities that are buried, and they usually emerge through the basement and can sometimes share space with other mechanical equipment. This adds risk. Many cities have aging infrastructures, which is another potential point of failure.
Alternate routing: This is the ability to use another transmission line if the regular line is busy or unavailable. This can include using a dial-up connection in place of a dedicated connection, a cell phone instead of a land line, or microwave communication in place of a fiber connection.
Long-haul diversity: This is the practice of having different long-distance communication carriers. This recovery facility option helps ensure that service is maintained; auditors should verify that it is present.
Last-mile protection: This is a good choice for recovery facilities in that it provides a second local loop connection and can add to security even more if an alternate carrier is used.
Voice communication recovery: Many organizations are highly dependent on voice communications. Some of these organizations have started making the switch to VoIP because of the cost savings. Some land lines should be maintained to provide recovery capability.
Verification of Disaster Recovery and Business Continuity Process Tasks
As an auditor, you will be tasked with understanding and evaluating business continuity/disaster recovery strategy. An auditor should review a plan and make sure it is current. The auditor should also examine last year’s test to verify the results and look for any problem areas. The business continuity coordinator is responsible for maintaining the results of previous tests. Upon examination, an auditor should confirm that a test met targeted goals or minimum standards. The auditor should also inspect the offsite storage facility and review its security, policies, and configuration. This review should include a detailed inventory that checks data files, applications, system software, system documentation, operational documents, consumables, supplies, and a copy of the business continuity plan.
Contracts and alternative processing agreements should also be reviewed. Any offsite processing facilities should be audited, and the owners should have a reference check. All agreements should be made in writing. The offsite facility should meet the same security standards as the primary facility and should have environmental controls such as raised floors, HVAC controls, fire prevention and detection, filtered power, and uninterruptible power supplies (UPSs). A UPS allows a computer to keep running for at least a short time when the primary power source is lost.
If the location is a shared site, the rules that determine who has access and when they have access should be examined. Another area of concern is the business continuity plan itself. An auditor must make sure the plan is written in easy-to-understand language and that users have been trained. This can be confirmed by interviewing employees.
Finally, insurance should be reviewed. An auditor should examine the level and types of insurance the organization has purchased. Insurance can be obtained for each of the following items:
IS equipment
Data centers
Software recovery
Business interruption
Documents, records, and important papers
Errors and omissions
Media transportation
Insurance is not without drawbacks, which include high premiums, delayed claim payouts, denied claims, and problems proving financial loss. Finally, most policies pay for only a percentage of actual loss and do not pay for lost income, increased operating expenses, or consequential loss.
The purpose of disaster recovery is to get a damaged organization restarted so that critical business functions can resume. When a disaster occurs, the process of progressing from the disaster back to normal operations includes the following:
Crisis management
Recovery
Reconstitution
Resumption
An auditor should be concerned with all laws, mandates, and policies that govern the organization in a disaster situation. As an example, federal and state government entities typically use a Continuity of Operations (COOP) site, which is designed to take on operational capabilities when the primary site is not functioning. The length of time the COOP site is active and the criteria used to determine when the COOP site is enabled depend on the business continuity and disaster recovery plans. An example of the disaster life cycle is shown in Figure 4-5.
Figure 4-5 The Disaster Life Cycle
Both governmental and nongovernmental entities typically use a checklist to manage continuity of operations. Table 4-7 shows a sample disaster recovery checklist.
Table 4-7 Disaster Recovery Checklist
Time | Activity
When disaster occurs | Notify disaster recovery manager and recovery coordinator
Under 2 hours | Assess damage, notify senior management, and determine immediate course of action
Under 4 hours | Contact offsite facility, recover backups, and replace equipment as needed
Under 8 hours | Provide management with updated assessment and begin recovery at updated site
Under 36 hours | Reestablish full processing at alternative site and determine a timeline for return to the primary facility
Protection of life is a priority while working to mitigate damage. The areas impacted the most need attention first. Recovery from a disaster entails sending personnel to the recovery site. Individuals responsible for emergency management need to assess damage and perform triage. When employees and materials are at the recovery site, interim functions can resume operations. This might require installing software and hardware. Backup data or copies of configurations might need to be loaded, and systems might require setup.
When operations are moved from the alternative operations site back to the restored site, the efficiency of the new site must be tested. In other words, processes should be sequentially returned from least critical to most critical. In the event that a few glitches need to be worked out in the new facility, you can be confident that your most critical processes are still in full operation at the alternative site. When those processes are complete, normal operations can resume.
