Maintaining Critical Services

Jan 1, 2099

This chapter is from the book Certified Information Systems Auditor (CISA) Cert Guide.

Learn More Buy

The Business Continuity Planning (BCP) Process

The BCP process can be described as the process of creating systems of prevention and recovery to deal with potential threats to a company. One of the best sources of information about the BCP process is the Disaster Recovery Institute International (DRII), which you can find online at www.drii.org. The process that DRII defines for BCP is much broader in scope than the ISACA process. DRII breaks down the disaster recovery process into 10 domains:

Project initiation and management
Risk evaluation and control
Business impact analysis
Developing business continuity management strategies
Emergency response and operations
Developing and implementing business continuity plans
Awareness and training programs
Exercising and maintaining business continuity plans
Crisis communications
Coordination with external agencies

The BCP process as defined by ISACA has a much narrower scope and focuses on the following seven steps, each of which is discussed in greater detail in the following sections:

Project management and initiation
Business impact analysis
Development and recovery strategy
Final plan design and implementation
Training and awareness
Implementation and testing
Monitoring and maintenance

Project Management and Initiation

Before the BCP process can begin, management must be on board. Management is ultimately responsible and must be actively involved in the process. Management sets the budget, determines the team leader, and gets the process started. The BCP team leader determines who will be on the BCP team. The team’s responsibilities include the following:

Identifying regulatory and legal requirements
Identifying all possible threats and risks
Estimating the probabilities of these threats and their loss potential and ranking them based on the likelihood of the event occurring
Performing a business impact analysis (BIA)
Outlining which departments, systems, and processes must be up and running first
Developing procedures and steps in resuming business after a disaster
Assigning tasks to individuals that they should perform during a crisis situation
Documenting, communicating with employees, and performing training and drills

One of the first steps the team is tasked with is meeting with senior management. The purpose of this meeting is to define the goals and objectives of the BCP process and to discuss a project schedule. This should give everyone present some idea of the scope of the final BCP policy.

It’s important for everyone involved to understand that the BCP is the most important corrective control the organization will have an opportunity to shape. Although the BCP process is primarily corrective, it also has the following elements:

Preventive: Controls to identify critical assets and develop ways to prevent outages
Detective: Controls to alert the organization quickly in case of outages or problems
Corrective: Controls to return to normal operations as quickly as possible

Business Impact Analysis

Chance and uncertainty are part of the world we live in. We cannot predict what tomorrow will bring or whether a disaster will occur—but this doesn’t mean we cannot plan for it. As an example, the city of Galveston, Texas, is in an area prone to hurricanes. Just because the possibility of a hurricane in winter in Galveston is extremely low doesn’t mean that planning can’t take place to reduce the potential negative impact of such an event actually occurring. This is what BIA is about. Its purpose is to think through all possible disasters that could take place, assess the risk, quantify the impact, determine the loss, and develop a plan to deal with the incidents that seem most likely to occur.

As a result, BIA should present a clear picture of what is needed to continue operations if a disaster occurs. The individuals responsible for BIA must look at the organization from many different angles and use information from a variety of inputs. For BIA to be successful, the BIA team must know what the key business processes are. This is something that businesses may already know without recognizing it as such. As an example, a computer company that places a priority on selling computers over the service and repair of computers has already determined its key activity: the selling of the product. As such, this activity needs to have controls in place to continue in the face of negative events. Questions the team must ask when determining critical processes might include the following:

Does the process support health and safety? Items such as the loss of an air traffic control system at a major airport or the loss of power in a hospital operating room could be devastating to those involved and result in loss of life.
Does the loss of the process have a negative impact on income? For example, a company such as eBay would find the loss of Internet connectivity devastating, whereas a small nonprofit organization might be able to live without connectivity for days.
Does the loss of the process violate legal or statutory requirements? For example, a coal-powered electrical power plant might be using scrubbers to clean the air before emissions are released. Loss of these scrubbers might lead to a violation of federal law and result in huge regulatory fines.
How does the loss of the process affect users? Returning to the example of the coal-powered electrical power plant, it is easy to see how problems with the steam-generation process would shut down power generation and leave many residential and business customers without power. This loss of power in the Alaskan winter or in the Houston summer would have a large impact.

As you might be starting to realize, performing BIA is no easy task. It requires not only knowledge of business processes but also a thorough understanding of the organization. This includes IT resources and individual business units, as well as the interrelationships between these pieces. This task requires the support of senior management and the cooperation of IT personnel, business unit managers, and end users. The general steps of BIA are as follows:

Determine data-gathering techniques.
Gather business impact analysis data.
Identify critical business functions and resources.
Verify completeness of data.
Establish recovery time for operations.
Define recovery alternatives and costs.

BIA typically includes both quantitative and qualitative components:

Quantitative analysis deals with numbers and dollar amounts. It involves attempting to assign a monetary value to the elements of risk assessment and to place dollar amounts on the potential impact, including both loss of income and expenses. Quantitative impacts can include all associated costs, including these:
- Lost productivity
- Delayed or canceled orders
- Cost of repair
- Value of the damaged equipment or lost data
- Cost of rental equipment
- Cost of emergency services
- Cost to replace the equipment or reload data
Qualitative assessment is scenario driven and does not involve assigning dollar values to components of the risk analysis. A qualitative assessment ranks the seriousness of impacts into grades or classes, such as low, medium, and high. These are usually associated with items to which no dollar amount can be easily assigned:
- Low: Minor inconvenience; customers might not notice.
- Medium: Some loss of service; might result in negative press or cause customers to lose some confidence in the organization.
- High: Will result in loss of goodwill between the company and a client or an employee; negative press also reduces the outlook for future products and services.
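The two components can be recorded side by side for a single disruption scenario. The following Python sketch is only an illustration: the cost categories come from the list above, but every dollar figure, the Impact enum, and the chosen rating are hypothetical assumptions, not values from the book or from ISACA.

```python
from enum import Enum


class Impact(Enum):
    """Qualitative grades, mirroring the low/medium/high classes above."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical dollar estimates for one disruption scenario (assumed values).
quantitative_costs = {
    "lost_productivity": 40_000,
    "delayed_or_canceled_orders": 25_000,
    "cost_of_repair": 10_000,
    "damaged_equipment_or_lost_data": 15_000,
    "rental_equipment": 5_000,
    "emergency_services": 2_000,
    "replace_equipment_or_reload_data": 8_000,
}


def total_quantitative_impact(costs):
    """Sum all cost categories into one dollar figure for the scenario."""
    return sum(costs.values())


# The qualitative grade is a judgment call made by the BIA team (assumed here).
qualitative_rating = Impact.MEDIUM

total = total_quantitative_impact(quantitative_costs)
print(f"Quantitative impact: ${total:,}")
print(f"Qualitative impact: {qualitative_rating.name}")
```

Keeping the two results separate reflects the text: the dollar total covers what can be priced, while the grade covers items such as goodwill that cannot easily be assigned a dollar amount.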

Although different approaches for calculating loss exist, one of the most popular methods of acquiring data is using a questionnaire. A team may develop a questionnaire for senior management and end users and might hand it out or use it during an interview process. This form might include items such as the recovery point objective (RPO), the recovery time objective (RTO), or even the mean time to recover (MTTR). Figure 4-2 provides an example of a typical BIA questionnaire.

The questionnaire can even be used in a round-table setting. This method of performing information gathering requires the BIA team to bring the required key individuals into a meeting and discuss as a group what impact specific types of disruptions would have on the organization. Auditors play a key role because they might be asked to contribute information such as past transaction volumes or the impact to the business of specific systems becoming unavailable.

Figure 4-2 BIA Questionnaire

Criticality Analysis

How do you classify systems and resources according to their value or order of importance? You determine the estimated loss in the event of a disruption and calculate the likelihood that the disruption will occur. The quantitative method for this process involves three steps:

Estimate potential losses (SLE): This step involves determining the single loss expectancy (SLE), which is calculated as follows:

Single loss expectancy = Asset value × Exposure factor

Items to consider when calculating the SLE include physical destruction from human-caused events, the loss of data, and threats that might cause a delay or disruption in processing. The exposure factor is the percentage of damage that a realized threat would have on a specific asset.
Conduct a threat analysis (ARO): The purpose of a threat analysis is to determine the likelihood that an unwanted event will happen. The goal is to estimate the annualized rate of occurrence (ARO). Simply stated, how many times is this event expected to happen in one year?
Determine annualized loss expectancy (ALE): This third and final step of the quantitative assessment combines the potential loss (SLE) with the expected rate per year (ARO) to determine the magnitude of the risk. This is expressed as the annualized loss expectancy (ALE), which is calculated as follows:

Annualized loss expectancy (ALE) = Single loss expectancy (SLE) × Annualized rate of occurrence (ARO)

For example, suppose that the potential loss due to a hurricane on a business based in Tampa, Florida, is $1 million. An examination of previous weather patterns and historical trends reveals that there has been an average of one hurricane of serious magnitude to hit the city every 10 years, which translates to an annualized rate of occurrence of 1/10, or 0.1 (a 10 percent chance in any given year). This means the assessed risk that the organization will face a serious disruption is $100,000 (= $1 million × 0.1) per year. That value is the annualized loss expectancy and, on average, is the amount per year that the disruption will cost the organization. Placing dollar amounts on such risks can aid senior management in determining what processes are most important and should be brought online first. Qualitatively, these items might be categorized not by dollar amount but by a risk-ranking scale. According to ISACA, the scale shown in Table 4-3 is used to classify systems according to their importance to the organization.

Table 4-3 System Classification

Classification	Description
Critical	These extremely important functions cannot be performed with duplicate systems or processes. These functions are extremely intolerant to disruptions, and any disruption is very costly.
Vital	Although these functions are important, they can be performed by a backup manual process—but not for a long period of time. These systems can tolerate disruptions for typically five days or less.
Sensitive	Although these tasks are important, they can be performed manually at a reasonable cost. However, this is inconvenient and requires additional resources or staffing.
Noncritical	These services are not critical and can be interrupted. They can be restored later with little or no negative effects.
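The three quantitative steps described before Table 4-3 can be sketched in a few lines of Python. This is a minimal illustration using the Tampa hurricane figures from the text. The asset value of $2 million and exposure factor of 0.5 are assumptions, chosen only so that the SLE matches the $1 million potential loss in the example; the function names are likewise hypothetical.

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = asset value x exposure factor (a fraction between 0 and 1)."""
    if not 0.0 <= exposure_factor <= 1.0:
        raise ValueError("exposure factor must be between 0 and 1")
    return asset_value * exposure_factor


def annualized_loss_expectancy(sle, aro):
    """ALE = SLE x ARO, where ARO is the expected number of events per year."""
    if aro < 0:
        raise ValueError("annualized rate of occurrence cannot be negative")
    return sle * aro


# Step 1: estimate potential losses (assumed asset value and exposure factor).
sle = single_loss_expectancy(asset_value=2_000_000, exposure_factor=0.5)

# Step 2: threat analysis. One serious hurricane every 10 years.
aro = 1 / 10

# Step 3: combine the two into the annualized loss expectancy.
ale = annualized_loss_expectancy(sle, aro)

print(f"SLE: ${sle:,.0f}")
print(f"ARO: {aro}")
print(f"ALE: ${ale:,.0f} per year")
```

Note that the ARO is entered as a rate (0.1 events per year), not as a percentage; entering 0.1 percent by mistake would understate the ALE by a factor of 100.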

After addressing all these questions, the BCP team can start to develop recommendations and look at some potential recovery strategies. The BCP team should report these findings to senior management as a prioritized list of key business resources and the order in which restoration should be processed. The report should also offer potential recovery scenarios. Many times it will be the network operations center (NOC) or help desk that first hears of a problem via end users. It’s important to have processes that tie these reports back to BCP teams so that potential problems can be addressed quickly.

Before presenting the report to senior management, however, the team should distribute it to the various department heads. These individuals were interviewed, and the plan affects them and their departments; therefore, they should be given the opportunity to review it and note any discrepancies. The BIA information must be correct and accurate because all future decisions will be based on those findings.
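One possible way to produce the prioritized list mentioned above is to sort resources first by their Table 4-3 classification and then by annualized loss expectancy. The sketch below assumes that tie-breaking rule, which the text does not prescribe; the resource names and ALE figures are hypothetical.

```python
from dataclasses import dataclass

# Restoration priority follows the order of Table 4-3 (lower number restores first).
CLASSIFICATION_ORDER = {
    "Critical": 0,
    "Vital": 1,
    "Sensitive": 2,
    "Noncritical": 3,
}


@dataclass
class BusinessResource:
    name: str
    classification: str  # one of the keys in CLASSIFICATION_ORDER
    ale: float           # annualized loss expectancy in dollars (hypothetical)


def restoration_order(resources):
    """Sort by classification, then by descending ALE within each class."""
    return sorted(
        resources,
        key=lambda r: (CLASSIFICATION_ORDER[r.classification], -r.ale),
    )


# Hypothetical inventory gathered during the BIA.
resources = [
    BusinessResource("Intranet wiki", "Noncritical", 5_000),
    BusinessResource("Payroll", "Vital", 80_000),
    BusinessResource("Online order processing", "Critical", 400_000),
    BusinessResource("Customer support desk", "Vital", 120_000),
    BusinessResource("Expense reporting", "Sensitive", 20_000),
]

for rank, resource in enumerate(restoration_order(resources), start=1):
    print(f"{rank}. {resource.name} ({resource.classification}, ALE ${resource.ale:,.0f})")
```

An unknown classification raises a KeyError rather than being silently ranked, which fits the text's point that the BIA data must be correct before decisions are based on it.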

Development and Recovery Strategy

At this point, the team has completed both the project initiation and BIA. Now it must determine the most cost-effective recovery mechanisms to be implemented based on the critical processes and threats determined during the BIA. An effective recovery strategy should apply preventive, detective, and corrective controls to meet the following objectives:

Remove identified threats.
Reduce the likelihood of identified risks.
Reduce the impact of identified risks.

The recovery strategies should specify the best way to recover systems and processes in case of interruption. Operations can be interrupted in several different ways:

Data interruptions: Caused by the loss of data. Solutions to data interruptions include backup, offsite storage, and remote journaling.
Operational interruptions: Caused by the loss of equipment. Solutions to this type of interruption include hot sites, redundant equipment, and redundant array of independent disks (RAID).
Facility and supply interruptions: Caused by fire, loss of inventory, transportation problems, HVAC problems, and telecommunications failures. Solutions to this type of interruption include redundant communication and transportation systems.
Business interruptions: Caused by the loss of human resources (for example, through strikes), critical equipment, supplies, or office space. Solutions to this type of interruption include redundant sites, alternate locations, and temporary staff.

The selection of a recovery strategy is based on several factors, including cost, criticality of the systems or process, and the time required to recover. To determine the best recovery strategy, follow these steps:

Document all costs for each possible alternative.
Obtain cost estimates for any outside services that might be needed.
Develop written agreements with the chosen vendor for such services.
Evaluate what resumption strategies are possible if there is a complete loss of the facility.
Document your findings and report your chosen recovery strategies to management for feedback and approval.

Normally, any IT system that runs a mission-critical application needs a recovery strategy. There are many to choose from; the appropriate choice is based on the impact to the organization of the loss of the system or process. Recovery strategies include the following:

Continuous processing
Standby processing
Standby database shadowing
Remote data journaling
Electronic vaulting
Mobile site
Hot site
Warm site
Cold site
Reciprocal agreements

All of these options are discussed later in the chapter, in the section “Recovery Strategies.” To get a better idea of how each of these options compares to the cost of implementation, take a moment to review Figure 4-3. At this point, it is important to realize that there must be a balance between the level of service needed and the recovery method.

Figure 4-3 Recovery Options and Costs
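The balance between service level and cost can be illustrated with a small selection routine: among the options that recover within the required time, pick the least expensive. This is a sketch only. The site names come from the list above, but the annual costs and recovery times are hypothetical placeholders, not figures from Figure 4-3, and real selection would weigh more factors than these two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    annual_cost: float     # hypothetical yearly cost in dollars
    recovery_hours: float  # hypothetical time to resume processing


def cheapest_option_meeting_rto(options, rto_hours) -> Optional[RecoveryOption]:
    """Return the lowest-cost option that recovers within the RTO, or None."""
    candidates = [o for o in options if o.recovery_hours <= rto_hours]
    return min(candidates, key=lambda o: o.annual_cost, default=None)


# Assumed figures: faster recovery costs more.
options = [
    RecoveryOption("Hot site", annual_cost=500_000, recovery_hours=4),
    RecoveryOption("Warm site", annual_cost=200_000, recovery_hours=48),
    RecoveryOption("Cold site", annual_cost=50_000, recovery_hours=240),
]

for rto in (1, 8, 72, 336):
    choice = cheapest_option_meeting_rto(options, rto)
    label = choice.name if choice else "no listed option is fast enough"
    print(f"RTO of {rto} hours: {label}")
```

When no option meets the required time, the function returns None instead of the closest match, so the gap is visible and can be reported to management.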

Final Plan Design and Implementation

In the final plan design and implementation phase, the team prepares and documents a detailed plan for recovering critical business systems. This plan should be based on information gathered during the project initiation, the BIA, and the recovery strategies phase. The plan should be a guide for implementation. The plan should address factors and variables such as these:

Selecting critical functions and priorities for restoration
Determining support systems that critical functions need
Estimating potential disasters and calculating the minimum resources needed to recover from the catastrophe
Determining the procedures for declaring a disaster and under what circumstances this will occur
Identifying individuals responsible for each function in the plan
Choosing recovery strategies and determining what systems and equipment will be needed to accomplish the recovery
Determining who will manage the restoration and testing process
Calculating what type of funding and fiscal management is needed to accomplish these goals

The plan should be written in easy-to-understand language that uses common terminology that everyone will understand. The plan should detail how the organization will interface with external groups such as customers, shareholders, the media, and community, region, and state emergency services groups during a disaster. Important teams should be formed so that training can be performed. The final step of the phase is to combine all this information into the business continuity plan and then interface it with the organization’s other emergency plans.

Training and Awareness

The goal of training and awareness is to make sure all employees know what to do in case of an emergency. Studies have shown that training improves response time and helps employees be better prepared. Employees need to know where to call or how to maintain contact with the organization if a disaster occurs. Therefore, the organization should design and develop training programs to make sure each employee knows what to do and how to do it. Training can include a range of specific programs, such as CPR, fire drills, crisis management, and emergency procedures. Employees assigned to specific tasks should be trained to carry out needed procedures. Cross-training of team members should occur, if possible, so that team members are familiar with a variety of recovery roles and responsibilities. Some people might not be able to lead under the pressure of crisis command; others might not be able to report to work. Table 4-4 describes some of the key groups involved in the BCP process and their responsibilities.

Table 4-4 BCP Process Responsibilities

Person or Department	Responsibility
Senior management	Project initiation, ultimate responsibility, overall approval and support
Middle management or business unit managers	Identification and prioritization of critical systems
BCP committee and team members	Planning, day-to-day management, implementation, and testing of the plan
Functional business units	Plan implementation, incorporation, and testing
IT audit	Business continuity plan review, test results evaluation, offsite storage facilities, alternate processing contracts, and insurance coverage

Implementation and Testing

During the implementation and testing phase, the BCP team ensures that the previously agreed-upon steps are implemented. No demonstrated recovery exists until a plan has been tested. Before examining the ways in which the testing can occur, look at some of the teams that are involved in the process:

Incident response team: Team developed as a central clearinghouse for all incidents.
Emergency response team: The first responders for the organization. They are tasked with evacuating personnel and saving lives.
Emergency management team: Executives and line managers who are financially and legally responsible. They must also handle the media and public relations.
Damage assessment team: The estimators. They must determine the damage and estimate the recovery time.
Salvage team: Those responsible for reconstructing damaged facilities. This includes cleaning up, recovering assets, creating documentation for insurance filings or legal actions, and restoring paper documents and electronic media.
Communications team: Those responsible for installing communications (data, voice, phone, fax, radio) at the recovery site.
Security team: Those who manage the security of the organization during a time of crisis. They must maintain order after a disaster.
Emergency operations team: Individuals who reside at the alternative site and manage systems operations. They are primarily operators and supervisors who are familiar with system operations.
Transportation team: Those responsible for notifying employees that a disaster has occurred. They are also in charge of providing transportation, scheduling, and lodging for those who will be needed at the alternative site.
Coordination team: Those tasked with managing operations at different remote sites and coordinating the recovery efforts.
Finance team: Individuals who provide budgetary control for recovery and accurate accounting of costs.
Administrative support team: Individuals who provide administrative support and also handle payroll functions and accounting.
Supplies team: Individuals who coordinate with key vendors to maintain needed supplies.
Relocation team: Those in charge of managing the process of moving from the alternative site to the restored original location.
Recovery test team: Individuals deployed to test the business continuity plan/disaster recovery plan and determine their effectiveness.

Did you notice that the last team listed is the recovery test team? This team consists of individuals who test the business continuity plan; this should be done at least once a year. Without testing, there is no guarantee that the plan will work. Testing helps bring theoretical plans into reality. To build confidence, the BCP team should start with easier parts of the plan and build to more complex items. The initial tests should focus on items that support core processing and should be scheduled during a time that causes minimal disruption to normal business operations. Tests should be observed by an auditor who can witness the process and record accurate test times. Having an auditor is not the only requirement: Key individuals who would be responsible in a real disaster must play a role in the testing process. Testing methods vary among organizations and range from simple to complex. Regardless of the method or types of testing performed, the idea is to learn from the practice and improve the process each time a problem is discovered. As a CISA exam candidate, you should be aware of the three different types of BCP testing, as defined by ISACA:

Paper tests
Preparedness tests
Full operation tests

The following sections describe these basic testing methods.

Paper Tests

The most basic method of BCP testing is the paper test. Although it is not considered a replacement for a full interruption or parallel test, it is a good start. A paper test is an exercise that can be performed by sending copies of the plan to different department managers and business unit managers for review. Each of these individuals can review the plan to make sure nothing has been overlooked and that everything that is being asked of them is possible.

A paper test can also be performed by having the members of the team come together and discuss the business continuity plan. This is sometimes known as walk-through testing. The plans are laid out across the table so that attendees have a chance to see how an actual emergency would be handled. By reviewing the plan in this way, some errors or problems should become apparent. With either method—sending the plan around or meeting to review the plan—the next step is usually a preparedness test.

Preparedness Tests

A preparedness test is a simulation in which team members go through an exercise that reenacts an actual outage or disaster. This type of test is typically used to test a portion of the plan. The preparedness test consumes time and money because it is an actual test that measures the team’s response to situations that might someday occur. This type of testing provides a means of incrementally improving the plan.

TIP

During preparedness tests, team leaders might want to use the term exercise because the term test denotes passing or failing, which can add pressure on team members and can be detrimental to the goals of continual improvement. For example, during one disaster recovery test, the backup media was to be returned from the offsite location to the primary site. When the truck arrived with the media, it was discovered that the tapes had not been properly secured, and they were scattered around the bed of the truck. Even though the test could not continue, it was not a failure because it uncovered a weakness in the existing procedure.

Full Operation Tests

The full operation test is as close to an actual service disruption as you can get. The team should have performed paper tests and preparedness tests before attempting this level of interruption. This test is the most detailed, time-consuming, and thorough of all the tests discussed. A full operation test mimics a real disaster, and all steps are performed to start up backup operations. It involves all the individuals who would be involved in a real emergency, including internal and external organizations. Goals of a full operation test include the following:

Verifying the business continuity plan
Evaluating the level of preparedness of the personnel involved
Measuring the capability of the backup site to operate as planned
Assessing the ability to retrieve vital records and information
Evaluating the functionality of equipment
Measuring overall preparedness for an actual disaster

Monitoring and Maintenance

When the testing process is complete, individuals tend to feel that their job is done. If someone is not made responsible for this process, the best plans in the world can start to become outdated in six months or less. Don’t be surprised to find out that no one really wants to take on the task of documenting procedures and processes. The responsibility of performing periodic tests and maintaining the plan should be assigned to a specific person. While you might normally think of change-management practices being used to determine whether changes made to systems and applications are adequately controlled and documented, these same techniques should be used to address issues that might affect the business continuity plan.

A few additional items must be done to finish the business continuity plan. The primary remaining item is to put controls in place to maintain the current level of business continuity and disaster recovery. This is best accomplished by implementing change-management procedures. If changes to the approved plans are required, you will then have a documented structured way to accomplish this. A centralized command and control structure will ease this burden. Life is not static, and the organization’s business continuity plans shouldn’t be either.

Understanding BCP Metrics

Reviewing the results of the information obtained is the next step of the BIA process. During this step, the BIA team should ask questions such as these:

Are the systems identified critical? All departments like to think of themselves as critical, but that is usually not the case. Some departments can be offline longer than others.
What is the required recovery time for critical resources? If the resource is critical, costs will mount the longer the resource is offline. Depending on the service and the time of interruption, these times will vary.

All this information might seem a little overwhelming; however, it is needed because at the core of the BIA are two critical items:

Recovery point objective (RPO): The RPO defines how current the data must be or how much data an organization can afford to lose. The greater the RPO, the more data loss the process can tolerate.
Recovery time objective (RTO): The RTO specifies the maximum elapsed time to recover an application at an alternate site. The greater the RTO, the longer the process can take to be restored.

The lower the time requirements are, the higher the cost will be to reduce loss or restore the system as quickly as possible. For example, most banks have a very low RPO because they cannot afford to lose any processed information. Think of the recovery strategy calculations as being designed to meet the required recovery time frames: Maximum tolerable downtime (MTD) = RTO + Work recovery time (WRT). (The WRT is the remainder of the MTD used to restore all business operations.) Figure 4-4 presents an overview of how RPO and RTO are related.

Figure 4-4 RPO and RTO
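The relationship MTD = RTO + WRT is simple arithmetic on durations. The Python sketch below works through it in both directions; the 8-hour RTO and 4-hour WRT are hypothetical values chosen for illustration, and the function names are assumptions rather than standard terminology.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto, wrt):
    """MTD = RTO + WRT."""
    return rto + wrt


def work_recovery_time(mtd, rto):
    """WRT is the remainder of the MTD after the RTO has elapsed."""
    if rto > mtd:
        raise ValueError("RTO cannot exceed the maximum tolerable downtime")
    return mtd - rto


# Hypothetical objectives for one business process.
rto = timedelta(hours=8)  # time to recover the application at the alternate site
wrt = timedelta(hours=4)  # time to get business functions running once systems are restored

mtd = maximum_tolerable_downtime(rto, wrt)
print(f"MTD: {mtd}")
print(f"WRT derived from MTD and RTO: {work_recovery_time(mtd, rto)}")
```

Working backward is often the practical direction: the business states the MTD it can survive, and the RTO must then be set low enough to leave room for the work recovery time.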

These items must be considered in addition to RTO and RPO:

Maximum acceptable outage: This value is the time that systems can be offline before causing damage. This value is required in creating RTOs and is also known as maximum tolerable downtime (MTD).
Work recovery time (WRT): The WRT is the time it takes to get critical business functions back up and running once the systems are restored.
Service delivery objective (SDO): This defines the level of service provided by alternate processes while primary processing is offline. This value should be determined by examining the minimum business need.
Maximum tolerable outages: This is the maximum amount of time the organization can provide services at the alternate site. This value can be determined using contractual values.
Core processes: These activities are specifically required for critical processes and produce revenue.
Supporting processes: These activities are required to support the minimum services needed to generate revenue.
Discretionary processes: These include all other processes that are not part of the core or supporting processes and that are not required for any critical processes or functions.