Bus Cont Plan


5 Incident response plan

Incident response: a process that is triggered when something unexpected happens in a way that
threatens continuity.

Disaster recovery: plays a role when a major incident occurs and the business cannot continue its
operations.

Necessary Prerequisites to Build an Incident Response Program


The following capabilities must be in place:

1. Access-control processes and restriction of elevated privileges.
2. Protection against data misuse in motion, in use, and at rest.
3. Hardware hardening, based on established standards.
4. Vulnerability understanding and management.
5. Communication and control network protections (firewalls, etc.).

Incident Response plan

Incident Response Frameworks


The Computer Security Incident Handling Guide (NIST SP 800-61) includes the following:

1. Organizing a Computer Incident Response Capability
2. Handling an Incident
   - Identify
   - Contain
   - Eradicate
   - Recover
3. Post-incident

Organizing a Computer Incident Response Capability
Organizing an incident response program requires defining incidents. Not everything that is unusual
is an incident. Before anomalies are defined as incidents, they must be analyzed and triaged as
events. Organizing the capability covers:

1. Policies and procedures
2. The team
3. Goals, strategy, and objectives
4. The incident response plan
5. Tactical procedures

Incident Response Definitions


1. An event is an observable occurrence in a system or network.

An example of an event is quarantined e-mail that appears to be suspicious. A security analyst assesses
the e-mail and decides either to release it to the recipient or eradicate it.

2. Adverse Event: Event resulting in negative consequences.

System outages, whether malicious or accidental, fall into the adverse event bucket.

3. Incident: a violation of security policies.

Insider threats, such as employees who remove data without authorization, trigger a full-fledged
incident response.
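
The three definitions above form an escalation ladder: every occurrence is an event, some events are
adverse, and policy violations are incidents. A minimal sketch of that triage logic follows. The two
yes/no inputs and the names `Level` and `classify` are assumptions made for illustration; they do not
come from the notes.

```python
from enum import Enum


class Level(Enum):
    EVENT = "event"
    ADVERSE_EVENT = "adverse event"
    INCIDENT = "incident"


def classify(negative_consequence: bool, policy_violation: bool) -> Level:
    """Map an observed occurrence to one of the three definitions."""
    if policy_violation:
        # A policy violation is treated as an incident regardless of impact.
        return Level.INCIDENT
    if negative_consequence:
        return Level.ADVERSE_EVENT
    return Level.EVENT


# A quarantined e-mail that turns out to be harmless stays an event.
quarantined_mail = classify(negative_consequence=False, policy_violation=False)
# An accidental system outage is an adverse event.
outage = classify(negative_consequence=True, policy_violation=False)
# Unauthorized data removal by an insider is an incident.
data_removal = classify(negative_consequence=True, policy_violation=True)
```
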

The team
- The incident response plan identifies the individuals who make up the incident response team and
their roles.
- Usually, someone from cybersecurity, at the manager or director level, owns incident response.
- Management: owns incident response in the sense that it funds it, allocates resources, and
controls policy decisions.
- IT support: Not everyone in IT will respond to incidents. Unique events call for others to
participate, based on expertise.
- Legal department: The general counsel's presence on the extended team or executive response
team is expected. In certain situations, the legal department should be engaged earlier.
- Public affairs and media relations: Large breaches garner media attention, and involvement of
personal information requires disclosure.
- Human resources: This group's input becomes necessary when employee involvement is
suspected.

How Vulnerabilities Become Risks
Vulnerabilities represent weaknesses in information systems. Threat actors seek to uncover and exploit
these in a successful attack. Weak passwords, default accounts with default passwords, and unpatched
systems are examples of vulnerabilities commonly exploited.

For a risk to be present, a threat and a vulnerability must exist. Vulnerabilities that no threat actor or
scenario would exploit are not a cybersecurity risk.

For example, a threat actor (a malicious insider) exploits a vulnerability (default admin
credentials), creating a risk to the confidentiality, integrity, or availability of customer data.

Event Detection and Identification


- Incident response begins with event detection and identification.
- Detection should be deployed based on identified risks and potential attack patterns of known
threats.
- Automation provides several detection and identification capabilities.
- Automation is desirable when it lowers costs, increases efficiency, and is more reliable than
manual processes.
- A significant use case for automation exists when technology correlates and detects behavior
patterns and activity not always seen easily by the human eye.
- Not all detection requires technology.
- End users are an example of how the human element can be very effective, such as noticing
phishing e-mails first when other employees do not observe good e-mail hygiene.
- Endpoint Detection and Response (EDR):
  - A capability to detect changes made to endpoints that are consistent with known indicators of
  attack or behavior and inconsistent with baselines of normal behavior.
  - These solutions act in a front-line detection capacity and are valuable during containment.
  - These solutions allow the team to respond quickly to the event and take appropriate action.
  Examples: FireEye Endpoint Security and Symantec Endpoint Protection.
- Analyzing Traffic

Packet capture helps incident response teams confirm whether suspected events are real.
Organizations implement these solutions based on the incident response and monitoring strategy.
A related example is NetFlow, developed by Cisco, which allows entities to capture data on the
origin, destination, and volume of traffic.

- Security Information and Event Management (SIEM)

A set of tools and services offering a holistic view of an organization's information security. SIEM
tools provide real-time visibility across an organization's information security systems.

Containment
- Containment comes after identifying an event and concluding that action is required to limit its
impact.
- Containment is about limiting the damage done by attackers. This is achieved by keeping the
attacker away from key assets not yet compromised. Containing an event or incident requires
identifying indicators of the attack and identifying them in other systems.
- Once a system is suspected of being compromised, it should be isolated. Some ways to do this
include unplugging the network cable, putting the machine in sleep mode (powering it off causes
the loss of volatile memory and of forensic evidence), or isolating the machine through changes to
DNS and firewall rules so that it cannot receive data.
- Denial of Service (DoS)
  - DoS and distributed denial of service (DDoS) attacks aim to shut down services and disrupt
  business operations. The attacks target web-facing applications and DNS services.
  - Attempting to contain these attacks involves the following important steps:
1. Assess the logs of firewalls, routers, servers, and other affected devices.
2. Pinpoint how the DDoS attack traffic differs from legitimate traffic and review network
traffic looking for DDoS traffic.
3. Block traffic with perimeter devices.
4. Block outbound traffic responding to the DDoS.
5. Blackhole malicious IPs attributed to the attacker.
6. Temporarily disable applications and services affected by the attack.
7. Contact the Internet service provider to confirm whether it sees the attack.
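
Step 2 above (telling attack traffic apart from normal traffic) often starts with counting requests
per source address in the device logs. The sketch below is one simple way to do that, not a
prescribed method: it flags sources whose request count far exceeds the median. The function name,
the `factor` threshold, and the sample addresses (taken from documentation-only IP ranges) are all
assumptions for illustration.

```python
from collections import Counter
from statistics import median


def flag_heavy_sources(source_ips, factor=10):
    """Return source IPs whose request count exceeds `factor` times the median count."""
    counts = Counter(source_ips)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(ip for ip, n in counts.items() if n > factor * baseline)


# Hypothetical log extract: one source IP per request seen at the perimeter.
sample = ["192.0.2.10"] * 3 + ["192.0.2.11"] * 2 + ["198.51.100.7"] * 500
suspects = flag_heavy_sources(sample)
```

Flagged addresses are only candidates for blocking or blackholing (steps 3 and 5); an analyst still
has to confirm that they are not legitimate heavy users.
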
- Lost Assets
  - Assets can be misplaced or stolen by end users and employees, and when these events occur,
  several questions must be answered.
  - Assets can be laptops, tablets, mobile phones, desktops, printers, hard drives, and other types
  of removable or portable storage.
  - Containing these events involves the following important steps:
1. Reporting the loss according to policy.
2. Wiping the device remotely.

Eradication, Recovery, and Post-incident Review


- Eradication: the process of removing all the remnants of a cyberattack.

Eradication starts once the systems known to be compromised can be taken offline. It involves
removing files and reversing the registry and configuration changes that malware and attackers made
during the attack.

Once all the affected machines are identified and isolated and forensic backups are completed, the
company can address weaknesses exploited by the attackers. These vulnerabilities are patched, and
insecure configurations are repaired.

 Eradication Techniques

Malware Artifacts

- Antivirus solutions remove malicious files and repair the changes that malicious software made
to the operating system.
- Some malware can only be removed by:
1. Taking the machine offline by removing the network cable
2. Booting the machine in safe mode
3. Using a malware removal tool
4. Rebooting the machine and confirming that the infection is gone

6 Business continuity strategy
- BCM strategy should be aligned with business and IT strategies to ensure that regulatory and
legal requirements are met. BCM policies and procedures should incorporate the necessary
controls to ensure that data integrity and privacy are not compromised during recovery efforts.
- While developing a business continuity strategy, focus on the following:
1. Business processes and operations
2. Users
3. Datacenter
4. Networks
5. Facilities
6. Supplies
7. Data (off-site storage of backup data and applications)

 The following factors pose a large challenge in the choice of appropriate BCM strategies:
1. Presence in multiple locations
2. Availability of recovery options such as owned, leased, shared or mobile facilities.
3. Increasing number of threats, risks and vulnerabilities
4. Complexity of external dependencies on supply chain channel
- Business continuity strategy is based on worst-case scenarios, and the business continuity team
helps build these scenarios based on past incidents and future predictions. Some business units
propose recovery strategies that are different from those of the rest of the organization.

Business Continuity Objective


Business continuity objectives are the starting point because they convey management's attitude
and commitment toward the BCM program. BCM objectives may include the following:

1. Protection of assets.
2. Measures to limit loss during disruption.
3. Minimizing business loss and loss of customer goodwill.
4. Improving prompt salvage of assets during a disaster.
5. Ensuring orderly evacuation of personnel and moving them to safety.
6. Providing resources for BCM and ensuring proper coordination between BCM teams by structuring
them properly across locations and providing backups in case any team member is not available
during a crisis.
7. Reduction of response time through planning and exercising.

Recovery Options
1. Prevention: a strategy that aims to reduce the chances of the disaster happening. It consists of
deterrent controls that reduce the likelihood of threats occurring. Preventive controls safeguard
vulnerable areas to ward off any threat that occurs and reduce its impact. Having these
measures in place is generally more cost-effective than attempting recovery after the interruption.
- Types of preventive controls that can be adopted by the enterprise:

A. Ensure security of the facilities: an example of a deterrent control that obstructs
unauthorized entry to the installation/facilities by imposing physical access controls such
as guards, biometric access control, and surveillance systems at the location.

B. Personnel procedures: Critical locations can be made restricted zones with entry for
authorized personnel only, and a log of entry by anyone other than authorized personnel has to
be maintained. An identification badge is a good way of identifying personnel and ensuring
that they are confined to their authorized workspaces only.

C. Application controls: Applications help run business processes. Hence proper access control,
antivirus software, encryption algorithms, firewalls for perimeter security, intrusion
detection systems to study anomalous behavior over the network, annual vulnerability
assessments and penetration testing to rule out risk from open ports, and so on may be
deployed as preventive controls.

D. Data storage controls: Off-site storage of backups and a proper predefined backup policy
and procedures for backup, storage, testing, restoration, and purging after retention
dates expire are controls connected to data storage.


2. Response: In this stage, the first responses to an incident should be listed. The first response
to an incident is to notify the right people. The major recipients of BCM communication are:
A. CIOs and CTOs
B. IT directors and data center managers
C. Security and risk management officers
D. Data center architects
E. Application owners
- Notification of impending disaster can be given by issuing prior warnings through the appointed
communication channels to employees, visitors, and customers on the premises.
- Timely notification can ensure an orderly shutdown of machines and systems and, if necessary, an
orderly evacuation in case of risk to the premises. Moving all personnel on the premises to safety
and alerting the police, fire service, and hospitals is one of the first response steps. This is
required only if the interruption is an accident, an act of sabotage, or a natural calamity.
Precise notification procedures must be documented, and call lists of the persons to be contacted
and informed should exist at both the primary site and the backup site to facilitate notification.
Notification can be done using various tools: pager, short message service (SMS), phone, and e-mail.


3. Resumption: involves resuming only the time-sensitive business processes, either immediately
after the interruption or after the declared maximum tolerable downtime (MTD).
- Not all operations are fully recovered at this stage. The focus shifts to the command center once
the BCM teams declare the severity of the disaster and invoke the appropriate plan of action. The
resumption and subsequently the recovery activities are coordinated after this point.
- The command center is a facility located near the primary facility that has adequate
communication facilities, PCs, printers, fax machines, and office equipment to support the
activities of the team.


Choice for alternate processing sites


1) Hot site: a fully functional data center with hardware, software, personnel, and customer data.
It is staffed 24/7 and is ready to be operational within a short span of time. When RTOs and RPOs
are extremely small, the systems need to be up and running quickly. Organizations such as financial
institutions, which hold a lot of customer data and have many customer-facing applications, should
choose a hot site.

2) Warm site: an equipped data center with hardware, software, network services, and
personnel. The element missing here is customer data. An organization can install additional
equipment and introduce customer data when a disaster occurs.

3) Cold site: a type of data center that has its own associated infrastructure, including power,
telecommunications, and environmental controls, designed to support IT systems, applications, and
data. These are installed only when disaster strikes and the DR plan is activated.

4) Mobile site: a portable van or trailer that can be used as an emergency processing center at
the time of disaster. It provides an excellent alternative to the above three options. After a
disaster, the trailer can be moved to the site, all essential equipment and supplies can be loaded
onto it, and connections for power and communication are added before it is made functional.

5) Mirrored site: a mirrored site is identical in all aspects to the primary site, right down to
the information availability. It is equivalent to having a redundant site in normal times and is
naturally the most expensive option.

At the alternate site (or primary site, if still usable), the work environment is restored.
Communication, networks, and workstations are set up, and contact with the external world can be
resumed.

Restoration: the process of repairing and restoring the primary site. At the end of this, business
operations are resumed in totality from the original site or a completely new site. While the
recovery team is supporting operations from the alternate site, restoration of the primary site to
full functionality is initiated.

7 Operations Security
Information Security Domains
1. Application
2. Physical
3. Network
4. Cryptography
5. Access Control
6. Legal Regulations
7. Operational
8. Risk Management
9. Security Architecture
10. BCP/DRP

What is Operations Security?


- Operations Security: ensuring that policies, standards, procedures, and so on are in place so
that normal business functions are secure and that confidentiality, integrity, and availability
(C-I-A) are provided to the routine functions of the business.
- It is all about ensuring that People, Process, and Technology are adequately secured.
- It is the practice of continual maintenance to keep the environment running at a necessary
security level.
- The goal is to reduce the possibility of damage that could result from unauthorized access or
disclosure.

Operations Security vs. Security Operations


Operations Security is primarily concerned with the protection and control of information processing
assets in centralized and distributed environments.

Security Operations are primarily concerned with the daily tasks required to keep security services
operating reliably and efficiently.

Operations security is a quality of other services. Security operations is a service in its own right.

Due Care & Diligence Concepts


- Due care: is about correcting something immediately. The first letters of the two words help to
remember this: DC = do correct. To perform due care, the organization must first perform due
diligence.
- Due diligence: takes longer than just fixing something immediately; it is the investigation
into why that something had to be corrected in the first place.
Examples: The implementation of controls is due care, and verification that those controls have
been implemented is due diligence.

- Issuing policies, standards, baselines, and procedures is part of due diligence. Applying these
types of documents is due care.
- Installing patches to mitigate the latest CVE is due care; understanding the reason for the CVE
and making sure it has been fully understood is due diligence.
- Performing an annual security audit is due diligence, but taking corrective action based on the
results of an audit is due care.

Administrative Management
• It is the most important piece of Operations management.

• One aspect is dealing with personnel issues.

Separation of Duties
Separation of duties is a preventive administrative control put into place to reduce the potential
for fraud. For example, an employee cannot complete a critical financial transaction by herself. She
needs her supervisor's written approval before the transaction can be completed.

- The objective is to ensure that one person acting alone cannot compromise the security of a
system in any way.
- It prevents any person from becoming too powerful within an organization. This policy also
provides singleness of focus. For instance, a network administrator who is concerned with
providing users access to resources should never be the security administrator. This policy also
helps prevent collusion, as there are many individuals with discrete capabilities.
- Collusion is an agreement among multiple people to perform unauthorized or illegal actions. It is
hindered by the separation of duties, restricted job responsibilities, audit logging, and job
rotation. Separation of duties also helps prevent mistakes and minimizes conflicts of interest.

Separation of Privilege
Similar to separation of duties, but it builds upon the principle of least privilege and applies it
to applications and processes. It mandates that users, accounts, and computing processes have only
the minimal rights and access to resources that they absolutely need. It requires the use of
granular rights and permissions.

Segregation of duties
The goal is to ensure that individuals do not have excessive system access that may result in a
conflict of interest.

The most common implementation of a segregation of duties policy is ensuring that security duties
are separate from other duties.

Need-to-know Access
- Access is granted only to data or resources that are needed to perform a task.
- It is commonly associated with the security clearances of subjects.
- Restricting access based on need-to-know helps protect against unauthorized access resulting in
a loss of confidentiality.

Principle of Least Privilege


 Subjects are granted only the privileges necessary to perform the assigned task.

 It protects confidentiality and Integrity.

 Typically, it is focused on user privileges, but it can also be applied to processes and applications.

Two-Person Control
- Also called the two-man rule; it requires the approval of two individuals for a critical task.
- It ensures peer review and reduces the opportunity for collusion and fraud.
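
A minimal sketch of a two-person control check follows. It assumes, for illustration only, that
approvals are recorded as a list of user names and that the requester's own approval does not count
toward the two required; the function name and this rule are not taken from the notes.

```python
def is_approved(requester, approvers, required=2):
    """Return True when enough distinct people other than the requester approved."""
    # A set removes duplicate approvals from the same person.
    distinct = set(approvers) - {requester}
    return len(distinct) >= required


# Two different colleagues approve: the critical task may proceed.
ok = is_approved("alice", ["bob", "carol"])
# The requester approving her own request does not count.
self_approved = is_approved("alice", ["alice", "bob"])
# The same person approving twice counts once.
duplicate = is_approved("alice", ["bob", "bob"])
```
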

Split knowledge
 Combines the concept of separation of duties and two-person control.
 No single person has sufficient privileges to compromise the security of the environment.

Job Rotation
 Employees are rotated through jobs.
 Provides peer-review, controls fraud and enables cross-training.
 It can act as a deterrent and detective control.

Mandatory vacation
- Employees are required to take a one-week or two-week vacation.
 Provides peer-review, helps detect fraud and collusion.
 It can act as a deterrent and detective control.

Clipping Levels
• Predefined thresholds for the number of certain types of errors that will be allowed before the
activity is considered suspicious.

• The goal is to alert if a possible attack is underway within the network.

• In most cases, IDS software is used to track these activities and behavior patterns.

• Important term: the clipping level is the threshold of violation attempts that should be
considered NORMAL and NOT logged.

• Example: you might not log that a user unsuccessfully tried to log in unless they failed more
than 3 times (the first or second failure might have been a typing mistake or Caps Lock being on).

• Why use clipping levels: to avoid too many false positives and to avoid overwhelming the analysis
unit.

• Clipping level thresholds should NOT be known to end users (why?)
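
The failed-login example above can be sketched as a simple counter. This is an illustration under
stated assumptions, not how any particular IDS works: the input is assumed to be a list of user
names, one entry per failed login, and the names `CLIPPING_LEVEL` and `users_over_clipping_level`
are hypothetical.

```python
from collections import Counter

# Failures up to this number are treated as normal (value taken from the example above).
CLIPPING_LEVEL = 3


def users_over_clipping_level(failed_logins, clipping_level=CLIPPING_LEVEL):
    """Return {user: failure_count} for users who exceeded the clipping level."""
    counts = Counter(failed_logins)
    return {user: n for user, n in counts.items() if n > clipping_level}


# alice mistyped twice (normal); bob failed four times (worth logging).
failures = ["alice", "alice", "bob", "bob", "bob", "bob"]
suspicious = users_over_clipping_level(failures)
```
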

Control Mechanisms
Protect information and resources from unauthorized disclosure, modification, and destruction

Main types of mechanisms


1. Physical
2. Administrative
3. Technical

Key Operational Procedures and Controls


• Fault Management

• Configuration Management

• System Hardening

• Change Control

• Trusted Recovery

• Media Management

• Identity and Access Management

• Monitoring
• Security Auditing and Reviews

Fault Management
The goal of high availability is to reduce or eliminate downtime within an organization. Though
there are a variety of ways to provide availability, most revolve around the idea of fault tolerance
and eliminating single points of failure.

 Spares

 Redundant Servers

 UPS

 Clustering

 RAID

 Shadowing, Remote Journaling, Electronic Vaulting

 Back Ups

 Redundancy of Staff

MTBF and MTTR


• MTBF, or Mean Time Between Failures, is the average time elapsed between one failure and the
next. MTBF can be calculated by dividing the total uptime by the total number of breakdowns.

• Mean Time to Repair (MTTR) is the average time needed to repair a failed hardware module after
the failure occurs. In an operational system, repair generally means replacing a failed hardware
part.
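
A small worked example of the two metrics. MTBF follows the formula given above (total uptime
divided by number of breakdowns); MTTR is computed here as the average of the individual repair
durations, which is an assumption about how the repair times are recorded. The figures are made up
for illustration.

```python
def mtbf(total_uptime_hours, breakdowns):
    """Mean Time Between Failures: total uptime divided by number of breakdowns."""
    if breakdowns <= 0:
        raise ValueError("at least one breakdown is needed to compute MTBF")
    return total_uptime_hours / breakdowns


def mttr(repair_hours):
    """Mean Time to Repair: average of the individual repair durations."""
    if not repair_hours:
        raise ValueError("at least one repair is needed to compute MTTR")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical server: 990 hours of uptime, 3 breakdowns, repairs of 2, 4, and 6 hours.
server_mtbf = mtbf(990, 3)      # 330.0 hours
server_mttr = mttr([2, 4, 6])   # 4.0 hours
```
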

Clustering
is a fault tolerant server technology that is similar to redundant servers, except each server takes part in
processing services that are requested. A server cluster is a group of servers that are viewed logically as
one server to users and are managed as a single logical system. Clustering provides for availability and
scalability.

Backups
Full backup
Backs up all the files and folders selected for the backup.

Incremental backup
Backs up all files that have been modified since the last backup of any type.

Differential backup
Backs up all files that have been modified since the last full backup.

Copy backup
Used before upgrades or system maintenance.

Sunday  Monday  Tuesday  Wednesday  Thursday       Backups needed to recover
Full    Full    Full     Full       Server crash   Full (W)
Full    Inc     Inc      Inc        Server crash   Full (S) + Inc (M, T, W)
Full    Diff    Diff     Diff       Server crash   Full (S) + Diff (W)
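
The restore rule behind the table above can be sketched as follows: start from the most recent full
backup, add the most recent differential taken after it (if any), then add every incremental taken
after that point. The function name and the (day, kind) tuple format are assumptions for
illustration.

```python
def backups_needed(schedule):
    """Return the backups to restore, in order, after a crash.

    `schedule` is a chronological list of (day, kind) tuples, where kind is
    "full", "inc", or "diff".
    """
    full_positions = [i for i, (_, kind) in enumerate(schedule) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    start = full_positions[-1]
    restore = [schedule[start]]
    later = schedule[start + 1:]
    diff_positions = [i for i, (_, kind) in enumerate(later) if kind == "diff"]
    if diff_positions:
        # The latest differential already holds every change since the full backup.
        restore.append(later[diff_positions[-1]])
        later = later[diff_positions[-1] + 1:]
    # Each incremental only holds changes since the previous backup, so all are needed.
    restore.extend(b for b in later if b[1] == "inc")
    return restore


inc_week = [("Sun", "full"), ("Mon", "inc"), ("Tue", "inc"), ("Wed", "inc")]
diff_week = [("Sun", "full"), ("Mon", "diff"), ("Tue", "diff"), ("Wed", "diff")]
full_week = [("Sun", "full"), ("Mon", "full"), ("Tue", "full"), ("Wed", "full")]
```

With a Thursday crash, `backups_needed(inc_week)` returns all four backups, while
`backups_needed(diff_week)` returns only Sunday's full and Wednesday's differential.
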

Configuration Management
Configuration management is the process of identifying and documenting hardware components,
software, and the associated settings.

- The goal is to move beyond the original design to a hardened, operationally sound configuration.
- These changes come about as we perform system hardening tasks to secure a system.
- Implemented hand in hand with change control.
- ESSENTIAL to Disaster Recovery

Important in running a network or business, especially when subject to regulation (e.g., SOX)

- There should be a change control policy and process.
- Important during operational use
- Important during the whole lifecycle of a product
- Note that service packs and similar updates are also types of changes.

Change control process

1. Request that a change take place.
2. Approve the change.
3. Document the change.
4. Test and present the change.
5. Implement the change.
6. Report changes to management.

System Hardening
• Removing unnecessary applications and services is always the first step in hardening a system.

• Disable unnecessary ports: Applications and services often have associated ports that are
configured to “listen”. Essentially this provides an attacker with an entry point into the network.

• Ensuring that the latest security patches and service packs are installed is another way to work
towards creating a secure environment.

• Rename default accounts and change default settings.

• Further, keep in mind that systems are rarely configured with the most robust security settings
out of the box. Often, ease of use and performance are the first considerations. Default settings
and accounts, though useful, are well known and easy targets. Everyone knows the
administrative account is often named simply “administrator” and many people may choose a
very simple password, like “password”.

Trusted Recovery
When an operating system or application crashes, it should not put the system in any type of
insecure state.

Types of failures can be classified as:

System Reboot
• Takes place after a system shuts itself down in a controlled manner in response to a kernel
failure.

• It releases resources and returns the system to a more stable and safe state.

After a system crash


1. Enter single user or safe mode:

 In this mode the system does not start users’ services


 File systems typically remain unmounted and only the local console access is available
 Administrator must either be physically at the console or have deployed external technology to
connect the system remotely.

2. Fix Issue and recover files

- In single user mode, the administrator salvages damaged files and attempts to find the cause of
the shutdown to prevent it from happening again.
- The administrator then brings the system out of single user mode.

3. Validate critical files and operations

The administrator must validate the contents of configuration files and ensure that system files
are consistent with their expected state.
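
One common way to check that files match their expected state is to compare cryptographic hashes
against a baseline recorded while the system was known to be good. The sketch below assumes such a
baseline exists as a dictionary of path to SHA-256 digest; the function names and the baseline
format are illustrative assumptions, not part of the notes.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_modified(baseline):
    """Return the paths that are missing or whose digest differs from the baseline.

    `baseline` maps a file path to the digest recorded in a known-good state.
    """
    changed = []
    for path, expected in baseline.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            changed.append(path)
    return changed
```

An empty result means every listed file still matches its recorded digest; any path returned needs
investigation before the system leaves single user mode.
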
Media Management
Media handling must reflect the company's security policy and enforce confidentiality, integrity,
and proper access controls (which also support confidentiality).

 Backup media need to be protected from people and the environment (how?)
 Auditing of media access must be done
 Company may have “media librarian”
 Media reuse issues?
 Media destruction

Media destruction
Sanitization: the process of removing data from media, or destroying the media, when it is no
longer used.

Data remanence: residual information left on media after it has been erased (an object reuse
concern).

Purging: making information unrecoverable even through extraordinary measures.

Zeroization: overwriting the data. Do not rely on a single pass of all zeros or all ones; use
multiple passes.

Degaussing: the media is exposed to the powerful magnetic field of a degausser, which neutralizes
the stored data and renders it unrecoverable.

Physical destruction.
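
The multi-pass overwriting idea above can be illustrated with a short sketch that rewrites a file's
contents in place with random bytes. It is an assumption-laden illustration only: the function name
and the default of three passes are hypothetical, and on SSDs, journaling file systems, or
copy-on-write storage an in-place overwrite may leave old blocks intact, so this is not a substitute
for purging, degaussing, or physical destruction.

```python
import os

CHUNK = 1024 * 1024  # write in 1 MiB pieces to keep memory use small


def overwrite_file(path, passes=3):
    """Overwrite a file's contents in place with random bytes, `passes` times."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(CHUNK, remaining)
                f.write(os.urandom(n))
                remaining -= n
            # Push this pass to the storage device before starting the next one.
            f.flush()
            os.fsync(f.fileno())
```
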

8 Business continuity and disaster recovery audit

Internal Controls
Mechanisms that ensure proper functioning of processes within the company. Every system and process
within a company exists for some specific business purpose. The auditor must look for the existence of
risks to those purposes and then ensure that internal controls are in place to mitigate those risks.

Types of Internal Controls


• Controls can be preventive, detective, or reactive, and they can have administrative, technical,
and physical implementations. Examples of administrative implementations include items such
as policies and processes. Technical implementations are the tools and software that logically
enforce controls (such as passwords).

• Preventive Controls stop a bad event from happening. For example, requiring a user ID and
password for access to a system is a preventive control. It prevents (theoretically) unauthorized
people from accessing the system.
• Detective Controls record a bad event after it has happened. For example, logging all activities
performed on a system will allow you to review the logs to look for inappropriate activities after
the event.

• Reactive Controls (aka Corrective Controls) fall between preventive and detective controls. They
do not prevent a bad event from occurring, but they provide a systematic way to detect when
those bad events have happened and correct the situation, which is why they are sometimes
called corrective controls. For example, you might have a central antivirus system that detects
whether each user’s PC has the latest signature files installed.

Internal Control Examples


Backups and Disaster-Recovery Plans

• If the system or its data were lost, system functionality would be unavailable, resulting in a loss
of your ability to track outstanding receivables or post new payments.

• What are some internal controls that would mitigate this risk?

1. Back up the system and its data periodically.

2. Ship backup tapes offsite.

3. Document a disaster recovery plan.

Internal Audit
The role of internal audit is:
• To provide independent assurance to the audit committee (and senior management) that
internal controls are in place at the company and are functioning effectively.

• To improve the state of internal controls at the company by promoting internal controls and by
helping the company identify control weaknesses and develop cost-effective solutions for
addressing those weaknesses.

Data Center Auditing Essentials


A data center is a facility that is designed to house an organization's critical systems, which
comprise computer hardware, operating systems, and applications.

Test Steps for Auditing Data Centers


The following topic areas should be addressed during the data center audit:

1. Neighborhood and external risk factors


2. Physical access controls
3. Environmental controls
4. Power and electricity
5. Fire suppression
6. Data center operations
7. Data backup and restore.
8. Disaster recovery planning
Neighborhood and External Risk Factors
When auditing a data center facility, you should first evaluate the environment in which the data
center resides. The goal is to identify high-risk threats. For example, the data center you are
auditing may be in the flight path of a regional airport, in a flood zone, or in a high-crime area.

Physical Access Controls


Several information security incidents have occurred in which thieves gained unauthorized access to
sensitive information by defeating physical access control mechanisms.

Therefore, restricting physical access is just as critical as restricting logical access. In a data center
environment, physical access control mechanisms consist of the following:

 Exterior doors and walls


 Access control procedures
 Physical authentication mechanisms
 Security guards

Environmental Controls
• Computer systems require specific environmental conditions such as controlled temperature and
humidity. Data centers are designed to provide this type of controlled environment. When
auditing a data center, you should verify that there is enough HVAC capacity to service the data
center even in the most extreme conditions.

• The IT auditor needs to review the temperature and humidity logs to verify that each falls within
acceptable ranges over a period of time. In general, data center temperatures should range from
65 to 70°F (with temperatures above 85°F damaging computer equipment) and humidity levels
should be between 45 and 55 percent. However, this will vary depending on the specifications of
the equipment.

• Auditor should verify the temperature and humidity alarms to ensure data center personnel are
notified of conditions when either factor falls outside of acceptable ranges. Sensors should be
placed in all areas of the data center where electronic equipment is present. Ensure that sensors
are placed in appropriate locations either by reviewing architecture diagrams or by touring the
facility.

• The auditor should review the HVAC design to verify that all areas of the data center are covered
appropriately. Determine whether the air flow within the data center has been modeled to
ensure adequate and efficient coverage.
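
The log review described in this section can be sketched in Python. This is only an illustration, assuming the readings are available as (temperature in °F, relative humidity in percent) pairs. The function name and the sample readings are hypothetical, and the thresholds are the general ranges quoted in this section, which should be replaced by the equipment's own specifications.

```python
# General ranges quoted in the text; adjust to the equipment's specifications.
TEMP_RANGE_F = (65.0, 70.0)
HUMIDITY_RANGE_PCT = (45.0, 55.0)


def out_of_range_readings(readings):
    """Return the (index, temp_f, humidity_pct) entries outside either range."""
    flagged = []
    for i, (temp_f, humidity_pct) in enumerate(readings):
        temp_ok = TEMP_RANGE_F[0] <= temp_f <= TEMP_RANGE_F[1]
        humidity_ok = HUMIDITY_RANGE_PCT[0] <= humidity_pct <= HUMIDITY_RANGE_PCT[1]
        if not (temp_ok and humidity_ok):
            flagged.append((i, temp_f, humidity_pct))
    return flagged


# Hypothetical sample log: (temperature in °F, relative humidity in %).
sample_log = [(68.0, 50.0), (72.5, 48.0), (66.0, 58.0), (69.0, 46.0)]
for index, temp_f, humidity_pct in out_of_range_readings(sample_log):
    print(f"reading {index}: {temp_f} F, {humidity_pct}% humidity is out of range")
```

An auditor would run a check like this over the whole review period and follow up on every flagged reading.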

Fire Suppression
Since data centers face a significant risk from fire, they typically have sophisticated fire suppression
systems, generally one of two types: gas-based systems and water-based systems.

The auditor should ensure that fire suppression systems are protecting the data center from fire. All data
centers should have a fire suppression system to help contain fires. Most systems are gas-based or
water-based and often use multistage processes, in which the first sensor (usually a smoke sensor)
activates the system and a second sensor (usually a heat sensor) causes a discharge of either water or
gas.
• Gas-Based Systems: Varieties of gas-based fire suppression systems include CO2, FM-200, and
CEA-410. Gas-based systems are expensive and often impractical, but their use does not damage
electronic equipment.

• Water-Based Systems Water-based systems are less expensive and more common but can cause
damage to computer equipment. To mitigate the risk of damaging all the computer equipment in
a data center or in the extended area of a fire, fire suppression systems are designed to drop
water from sprinkler heads only at the location of the fire.

Data Center Operations


Effective data center operations require strict adherence to formally adopted policies, procedures, and
plans. The areas that should be covered include the following:

• Roles and responsibilities of data center personnel
• Segregation of duties of data center personnel
• Facility and equipment maintenance

• Roles and responsibilities of data center personnel


The Auditor should ensure that roles and responsibilities of data center personnel are clearly defined.

• Segregation of duties of data center personnel


The Auditor should verify that duties and job functions of data center personnel are segregated
appropriately.

• Facility and equipment maintenance


The Auditor should verify that data center facility-based systems and equipment are maintained properly
by reviewing maintenance logs for critical systems and equipment.

Disaster Recovery Planning


The goal of disaster recovery planning is to reconstitute systems efficiently following a disaster, such as a
hurricane or flood.

• The Auditor should ensure that a disaster recovery plan (DRP) exists and is comprehensive and
that key employees are aware of their roles in the event of a disaster. If a disaster strikes your
only data center and you don’t have a DRP, the overwhelming odds are that your organization
will suffer a large enough loss to cause bankruptcy. Disaster recovery, therefore, is a serious
matter.

• An auditor who is auditing an organization’s disaster recovery plan should also interview
personnel who participate in disaster recovery.

• The Auditor should verify that the DRP covers all systems and operational areas. It should
include a formal schedule outlining the order in which systems should be restored and detailed
step-by-step instructions for restoring critical systems.
• The Auditor should verify that the DRP identifies a critical recovery time period during which
business processing must be resumed before suffering significant or unrecoverable loss. Validate
that the plan provides for recovery within that time period.

• Ensure that DRPs are updated and tested regularly.

Data Backup and Restore: System backup is regularly performed on most systems. Often, however,
restore is tested for the first time when it is required because of a system corruption or hard-disk failure.
Sound backup and restore procedures are critical for reconstructing systems after a disruptive event.

• The Auditor should ensure that backup procedures and capacity are appropriate for respective
systems. Backup schedules typically are 1 week in duration, with full backups normally occurring
on weekends and incremental or differential backups at intervals during the week.

• Verify that systems can be restored from backup media.

• Ensure that backup media can be retrieved promptly from off-site storage facilities.

• The Auditor should determine whether a Business Impact Analysis (BIA) has been performed on
the application to establish backup and recovery needs. A business impact analysis is the first
major task in a disaster recovery or business continuity planning project. A business impact
analysis helps determine which processes in an organization are the most important.

Criticality Analysis
When all of the BIA information has been collected and charted, the criticality analysis (CA) can be
performed. Criticality analysis is a study of each system and process and a consideration of its impact on
the organization.

Recovery Time Objective (RTO) vs Recovery Point Objective (RPO)


• Recovery time objective (RTO) is the period from the onset of an outage until the resumption of
service. RTO is usually measured in hours or days. Each process and system in the BIA should
have an RTO value.
• A recovery point objective (RPO) is the period for which recent data will be irretrievably lost in a
disaster. Like RTO, RPO is usually measured in hours or days. However, for critical transaction
systems, RPO could even be measured in minutes.

1 Risk Management
What Is Risk?
• Risk: The likelihood that a loss will occur. Losses occur when a threat exposes a vulnerability.
• Threat: Any activity that represents a possible danger.
• Vulnerability: A weakness.
• Loss: A loss results in a compromise to business functions or assets.
 • Tangible
 • Intangible

Risk management process


1. Threat assessment
2. Vulnerability assessment
3. Impact assessment
4. Risk mitigation strategy development

Threat Assessment
• Process of formally evaluating the degree of threat to an information system or enterprise and
describing the nature of the threat.

• Threats are the tactics, techniques, and methods used by threat actors that have the potential to
cause harm to an organization's assets.

For example:

• Threat: An attacker performs an SQL injection.

• Vulnerability: An unpatched system.

• Asset: Web server.

• Consequence: Customers' private data is stolen.

• The process of threat assessment begins with the initial assessment of a threat. It is then
followed by a review of its seriousness and the creation of plans to address the underlying
vulnerability. The last phase is a follow-up assessment and plans for mitigation.

Vulnerability assessment
• The vulnerability assessment analyzes how vulnerable, susceptible, and exposed a business or
system is to a particular threat.

• It is useful to know whether a system is vulnerable to a threat that has a 90% chance of occurring, a
50% chance of occurring, or a 1% chance of occurring. The vulnerability and the likelihood of the
event are closely related, and the results are used as inputs to the impact assessment.
• A server that is outside the firewall is far more vulnerable to external attacks than a server that is
inside the firewall.

Impact assessment
• The impact assessment analyzes how great or small the impact of a threat occurrence will be on
the business or system.

• An earthquake has an enormous impact on a business that is in or near the epicenter of the
quake; it has a lesser impact on businesses further from the epicenter.

Risk mitigation strategy development


• You can reduce, avoid, accept, or transfer risks. Each strategy comes with an associated cost. It’s
far more expensive in many cases to completely avoid a risk than it is to reduce the impact of the
risk.

• Most businesses are more likely to build in state-of-the-art fire suppression systems than to
construct a building with absolutely no flammable materials. The cost of building a completely
fireproof building is far higher than that of installing a high-quality fire suppression system.

• Some risks are worth accepting: we drive cars, we cross busy intersections on foot, we eat
unhealthy food.

What is Risk Management?


It is a process to:

1. Identify all relevant risks
2. Assess / rank those risks
3. Address the risks in order of priority
4. Monitor risks & report on their management

Risk Management – why do we need it?

• Promotes good management

• May be a legal requirement, depending upon industry or sector

• Resources available are limited, so a focused response to Risk Management is needed

Risk Management Elements/Process
1. Assess risks
2. Identify risks to manage
3. Select controls
4. Implement and test controls
5. Evaluate controls

Risk Identification Process

1. Identify threats
2. Identify vulnerabilities
3. Estimate likelihood of a threat exploiting a vulnerability

Organization-Wide Risk Management


Managing information system-related security and privacy risk is a complex undertaking that requires
the involvement of the entire organization—from senior leaders providing the strategic vision and top-
level goals and objectives for the organization, to mid-level leaders planning, executing, and managing
projects, to individuals developing, implementing, operating, and maintaining the systems supporting
the organization’s missions and business functions.
Techniques of Risk Management
1. Avoidance
2. Transfer
3. Mitigation
4. Acceptance
5. Residual Risk
6. Cost-Benefit Analysis

Risk Avoidance
• Risk avoidance is a way for businesses to reduce their level of risk by not engaging in certain
high-risk activities. While it’s impossible to eliminate all risks, a risk avoidance strategy can help
prevent some losses from happening.

• An example: A retailer discontinues collection of personal data such as customer information,
ages and telephone numbers to avoid the risk that such data would be stolen in an information
security incident.

• The key advantage of this technique is that it’s the most successful method of mitigating risk. You
eliminate the possibility of suffering losses by stopping the threat altogether.

Risk Transfer
• You can transfer all or part of the risk to a third party. The two main types of transfer are
insurance and outsourcing. For example, a company may choose to transfer the risk of a collection
project by outsourcing the project.

• The advantage here is that you can take some or most of the burden from risks and share it with
a third party.

Mitigate the Risk


• Risk mitigation is the process of planning for disasters and having a way to lessen negative
impacts.
• Although the principle of risk mitigation is to prepare a business for all potential risks, a proper
risk mitigation plan will weigh the impact of each risk and prioritize planning around that impact.
Risk mitigation focuses on the inevitability of some disasters and is used for those situations
where a threat cannot be avoided entirely. Rather than planning to avoid a risk, mitigation deals
with the aftermath of a disaster and the steps that can be taken prior to the event occurring to
reduce adverse and, potentially, long-term effects.

Acceptance and Residual Risk


• Risk Acceptance, also known as risk retention, is choosing to face a risk. It is impossible to profit
in business or enjoy an active life without choosing to take on risk.

• Residual Risk: Risk treatments don’t necessarily reduce risks to zero. The level of risk remaining
after risk controls have been applied is known as residual risk.
Best Practices for Managing Threats

• Create a security policy.
• Purchase insurance.
• Use access controls.
• Use automation.
• Include input validation.
• Provide training.
• Use antivirus software.
• Protect the boundary.

Risk Analysis
Single Loss Expectancy (SLE)

SLE = Asset Value (AV) × Exposure Factor (EF)

• EF: percentage loss in asset value if a compromise occurs.
• SLE: expected loss in case of a compromise.

Annualized Loss Expectancy (ALE)

ALE = SLE × Annualized Rate of Occurrence (ARO)

• ARO: expected number of compromises per year (an annual probability when less than one).
• ALE: expected loss per year from this type of compromise.

Risk Analysis calculation
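
The two formulas, SLE = AV × EF and ALE = SLE × ARO, can be sketched in Python as follows. The function names and the sample figures are illustrative assumptions and do not come from the source material.

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = AV x EF, where EF is a fraction between 0 and 1."""
    if not 0.0 <= exposure_factor <= 1.0:
        raise ValueError("exposure_factor must be between 0 and 1")
    return asset_value * exposure_factor


def annualized_loss_expectancy(sle, annual_rate_of_occurrence):
    """ALE = SLE x ARO."""
    return sle * annual_rate_of_occurrence


# Hypothetical figures: a $100,000 asset that loses 25% of its value per
# compromise and is compromised on average once every two years (ARO = 0.5).
sle = single_loss_expectancy(100_000, 0.25)
ale = annualized_loss_expectancy(sle, 0.5)
print(f"SLE = ${sle:,.2f}, ALE = ${ale:,.2f}")
```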


Qualitative Risk Assessment Benefits
• Uses the opinions of experts.
• Is easy to complete.
• Uses words that are easy to express and understand.

Categories of Risks
• There are multiple ways in which risks can be categorized.

• Final categories used will depend upon each organization / unit’s circumstances.

• Goal is to cluster risks into standard, meaningful & actionable groupings.

• What follows is one example of a type of categorization.

Financial

1. Reduction in funding
2. Failure to safeguard assets
3. Poor cash flow management
4. Lack of value for money
5. Fraud / theft
6. Poor budgeting

Operational: These risks result from failed or inappropriate policies, procedures, systems or activities
e.g.

1. Failure of an IT system
2. Poor quality of services delivered.
3. Lack of succession planning
4. Health & Safety risks
5. Staff skill levels
6. No process to track contractual commitments.

Reputational: The organization engages in activities that could threaten its good name, e.g.

1. Through association with other bodies
2. Staff / members acting in a criminal or unethical way.
3. Poor stakeholder relations

Risk Register
• A Risk Register is a management tool used to record relevant details relating to risks.

• It is a database of information on risks.

• Best kept simple to begin with!

• Components

• How to report on it


2 Business Impact Analysis
What Is a Business Impact Analysis?
• A study used to identify the impact that can result from disruptions in the business.

• Focuses on the failure of one or more critical IT functions.

Consider the Impact


• The BIA will identify the operational and financial impacts resulting from the disruption of
business functions and processes. Impacts to consider include:

• Lost sales and income

• Delayed sales or income

• Increased expenses (e.g., overtime labor, outsourcing, expediting costs, etc.)

• Regulatory fines

• Contractual penalties or loss of contractual bonuses

• Customer dissatisfaction or defection

• Delay of new business plans

Four purposes of the BIA


• Obtain an understanding of the organization’s most critical objectives

• Inform a management decision on maximum tolerable outage for each function

• Provide the resource information from which an appropriate recovery strategy can be
determined

• Outline dependencies that exist both internally and externally

Understanding impact criticality


Criticality categories
Category 1: Critical functions (mission-critical)
• Mission-critical business processes and functions are those that have the greatest impact on
your company’s operations and need for recovery.

• A network, system, or application outage that is mission-critical would cause extreme
disruption to the business.

Category 2: Essential functions (vital)

• Fall somewhere between mission-critical and important.

• Vital systems might include those that interface with mission-critical systems.

Category 3: Necessary functions (important)

• Systems may include e-mail, Internet access, databases, and other business tools.

Category 4: Desirable functions (minor)

• Deal with small, recurring issues or functions.

• Need to be recovered over the longer term.

• Cause minor disruptions to the business and can easily be restored.

Recovery Time Requirements


1. Maximum tolerable downtime (MTD):
• The maximum time a business can tolerate the absence or unavailability of a particular
business function. The higher the criticality, the shorter the MTD is likely to be.
• Downtime consists of two elements: the systems recovery time and the work recovery time.

2. Recovery Time Objective (RTO):

The time available to recover disrupted systems; the first segment of the MTD.

3. Work Recovery Time (WRT):

The second segment of the MTD: the time needed to get work back to normal once systems are recovered.

4. Recovery Point Objective (RPO):

The amount or extent of data loss that can be tolerated.
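
Because downtime is made up of systems recovery followed by work recovery, the relationship MTD = RTO + WRT gives a simple planning check. The Python sketch below illustrates it; the function names and the hour values are hypothetical.

```python
def fits_within_mtd(rto_hours, wrt_hours, mtd_hours):
    """True if systems recovery plus work recovery fits inside the MTD."""
    return rto_hours + wrt_hours <= mtd_hours


def max_rto(mtd_hours, wrt_hours):
    """Longest RTO that still leaves room for the work recovery time."""
    return mtd_hours - wrt_hours


# Hypothetical business function: MTD of 48 hours, 12 hours of work recovery.
print(max_rto(mtd_hours=48, wrt_hours=12))
print(fits_within_mtd(rto_hours=40, wrt_hours=12, mtd_hours=48))
```

In this example the systems must be back within 36 hours, so an RTO of 40 hours would not fit inside the MTD.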

Recovery Time Objective (RTO)


The targeted duration of time and the service level within which a business process must be restored
after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in
business continuity.

Recovery point objective (RPO)


The maximum acceptable amount of data loss measured in time. It is the age of the files or data in
backup storage required to resume normal operations if a computer system or network failure occurs.

MTD / MAO
The maximum tolerable period of disruption (MTPOD or MTPD) is also known as maximum tolerable downtime
(MTD), maximum tolerable outage (MTO), or maximum allowable outage (MAO).
Methodological steps for developing a business impact analysis.

Define the boundaries of the BIA.


• The starting point prior to the development of the BIA is the identification of the scope.

• Top management should have identified the scope, considering the products and services of the
organization. Several key criteria could be considered to decide the products and services of the
organization that need to be protected to assure continuity; including:

a) market pressure,

b) specific company sites,

c) products and services profitability.

• Once the scope has been established, its boundaries should be outlined and precisely defined
in terms of the activity with which it begins and the activity with which it ends.

Identify activities that support the scope.


• An activity is considered a process or set of processes undertaken by an organization (or on its
behalf) that produces or supports one or more products or services.

• When the scope is delimited, the organization should identify all the activities involved in the
scope that directly contribute to the generation of its products and services. A good tool that
helps in this step is a flowchart.
Assess Financial and operational impacts.
• The next step is to assess the financial and operational impacts that would affect the
organization in the event of a disruption of the activities identified in the preceding step.

• The financial impact assessment is performed before carrying out the operational impact
assessment.

The financial impact assessment


• This measures the extent and severity of the organization’s financial losses.

• A financial impact assessment is carried out for each activity. The question to be asked is “What
would the magnitude and severity of financial loss be if the activities were interrupted following
a disruption?” The losses are estimated per day of disruption.

Financial losses for a specific scope.

The second part of the financial impact assessment ranks each impact in a severity level based on its
monetary loss value. The following scale is recommended:

• Severity level 0: No impact

• Severity level 1: Minor impact

• Severity level 2: Intermediate impact

• Severity level 3: Major impact


Operational Impacts

Identify critical activities.


This step identifies the activities that have to be performed in order to deliver the key products and
services, which enable an organization to meet its most important and time sensitive objectives. The
financial and operational impact rankings assigned in step three provide a basis for identifying critical
activities. An activity is considered critical if any of the following is true:

• A severity level of 2 or 3 is assigned to its financial impact.
• A ranking of high is assigned to at least three of its operational impacts.
• A ranking of high is assigned to at least two of its operational impacts and a ranking of highest is
assigned to at least one.
• A ranking of highest is assigned to at least two of its operational impacts.

The critical activities listed in the next figure were obtained by applying the above selection
criteria to the impact rankings of business activities presented in figures two and three.
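
The four selection criteria can be expressed as a short Python predicate. This is a sketch under stated assumptions: the function name is hypothetical, operational rankings are represented as lowercase strings, and only the labels "high" and "highest" are taken from the criteria (any other label, such as "low", is illustrative).

```python
def is_critical(financial_severity, operational_rankings):
    """Apply the four selection criteria; any one of them makes the activity critical."""
    high = operational_rankings.count("high")
    highest = operational_rankings.count("highest")
    return (
        financial_severity in (2, 3)        # financial severity level 2 or 3
        or high >= 3                        # at least three "high" rankings
        or (high >= 2 and highest >= 1)     # two "high" plus one "highest"
        or highest >= 2                     # at least two "highest" rankings
    )


# Hypothetical activities: (financial severity, operational impact rankings).
activities = {
    "order processing": (3, ["low", "high"]),
    "internal newsletter": (0, ["low", "low", "high"]),
    "customer support": (1, ["high", "high", "highest"]),
}
for name, (severity, rankings) in activities.items():
    print(name, "->", "critical" if is_critical(severity, rankings) else "not critical")
```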

Critical Assets
Assess MTPDs and prioritize critical activities:
• “The maximum tolerable period of disruption (MTPD) is the duration after which the viability of
the organization will be irrevocably threatened if product and service delivery cannot be
resumed”.

• The estimates of MTPD can be based on either financial or operational impacts. The personnel
responsible for assessing the financial and operational impacts are asked the following question:
“What is the maximum period of time that can be tolerated for this process based on the
financial and operational impact levels?” For example, suppose a financial loss of US $25,000 per
day becomes unacceptable once the accumulated loss exceeds US $50,000.

• The MTPD is then two days, because the accumulated losses would exceed US $50,000 if the
disruption continued any longer. This example assumes that the operational
impacts are insignificant relative to the financial losses.

• Usually the analysis requires revising the financial and operational impacts of the disruption to
estimate the MTPD.

• Once the MTPDs are calculated, a priority for their recovery should be established. A critical
activity that has a shorter MTPD compared with another critical activity is assigned a higher
recovery priority.

• Considering today’s connectivity and the dependency on information technology, MTPDs tend to
shrink and will probably be close to zero in the near future.
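
The financial MTPD estimate and the resulting recovery priorities can be sketched in Python. The function names and the activity list are hypothetical. The first call uses the figures from this section (US $25,000 lost per day against a US $50,000 limit), and the sketch assumes, as that example does, that only financial impacts matter.

```python
def mtpd_days(daily_loss, max_acceptable_loss):
    """Largest whole number of days before the accumulated loss exceeds the limit."""
    if daily_loss <= 0:
        raise ValueError("daily_loss must be positive")
    return max_acceptable_loss // daily_loss


def recovery_priorities(mtpd_by_activity):
    """Order activities so that the one with the shortest MTPD is recovered first."""
    return sorted(mtpd_by_activity, key=mtpd_by_activity.get)


print(mtpd_days(25_000, 50_000))

# Hypothetical critical activities and their MTPDs in days.
activities = {"payroll": 7, "order processing": 2, "customer support": 4}
print(recovery_priorities(activities))
```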

MTPDs and recovery priorities

Estimate the resources that each critical activity will require for resumption.
• In this step, the organization needs to estimate the resources required for resumption at the
level of each critical activity. Previously, the firm should have identified the minimum level at
which each critical activity needs to be performed upon resumption.

• The sources that a business can use to determine the minimum levels of performance
acceptable are the contractual agreements and service level agreements for the key products
and services involved in the scope. The minimum resources needed for each activity can be
classified as:

I. critical IT systems and applications, and
II. critical non-IT resources.

• This second category can be subdivided into ‘physical areas’, ‘human competences’, ‘equipment’
and ‘documents’.

Critical activities and resources needed for resumption.

Determine RTOs for critical activities


• “The recovery time objective (RTO) is the target time set for resumption of product, service or
activity delivery after an incident” (Fullick, 2013). The RTO, which is the length of time between a
disruptive event and the recovery of resources, indicates the time available to recover disrupted
resources. The MTPD value expresses the maximum limit for the RTO value.

• Exercising the business continuity management arrangements enables the organization to
validate its RTOs and, therefore, to take corrective actions to reduce them. Cross-functional
teams involved with the critical activities are tasked with estimating the RTOs.

RTOs and RPOs for critical activities


Identify all dependencies relevant to critical activities.
In this step the organization must “consider all dependencies relevant to the critical activities, including
suppliers and outsource partners” (Alexander, 2009). The critical activities that have been considered
usually have some vital inputs that are provided by other company processes or by external
suppliers or outsource partners. The internal processes that supply important inputs to critical activities
must also be considered critical activities. In the case of external suppliers and outsource partners,
contractual agreements requiring them to have a business continuity management system (BCMS) set up
and managed should be in place. It is important to bear in mind that every company is only as resilient
as the weakest link in its supply chain.

Determine recovery point objectives for critical activities.


The recovery point objective (RPO) is the amount of data loss that can be tolerated because of a business
disruption. RPO is measured as the time between the last data backup and the disruptive event. In the BIA
process, RPO is determined for each application by asking the critical activity owners the following
question: “What is the tolerance, in terms of length of time, to loss of data that may occur between any
two backup periods?” The response to this question indicates the value of RPO. In Figure seven there is
an example of RPOs for certain critical activities. In this methodology, the RPO is expected to be less
than the corresponding RTO.
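
Since RPO is measured as the time between the last backup and the disruptive event, a backup schedule can be checked against it. The Python sketch below illustrates this with hypothetical function names, timestamps and targets. It assumes that, in the worst case, the event strikes just before the next backup, so the loss window can approach one full backup interval.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, disruption):
    """Time between the last backup and the disruptive event."""
    if disruption < last_backup:
        raise ValueError("disruption precedes the last backup")
    return disruption - last_backup


def meets_rpo(backup_interval, rpo):
    """True if even a worst-case loss window stays within the RPO."""
    return backup_interval <= rpo


# Hypothetical timestamps and targets.
last_backup = datetime(2024, 1, 8, 2, 0)
disruption = datetime(2024, 1, 8, 14, 30)
print(data_loss_window(last_backup, disruption))
print(meets_rpo(backup_interval=timedelta(hours=24), rpo=timedelta(hours=4)))
```

Here a daily backup cannot satisfy a four-hour RPO, so the backup frequency (or the RPO) would need to be revisited.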

Information gathering methods.


• Obtaining the information needed for the BIA from relevant areas of the organization can be a
complex and frustrating process. A structured methodological strategy should be developed
considering the magnitude of the scope. Three methods are recommended in the technical
literature (Graham, Kaye, 2006) (Hiles, Barnes, 2001):

• Survey: the method uses a set of questions which are prepared in advance and are sent to each
activity owner. The survey allows covering a vast number of respondents. However, this method
has two main constraints: (1) The accuracy of respondents becomes a problem in the event of
lack of internal consistency and reliability of the survey. (2) Survey responses may not be
returned within the time allowed for this purpose.

• Interview: in this method the BIA information is collected by personally interviewing the activity
owners. The questions can be tailored to each particular activity concerned.
Although this method is very accurate and minimizes the possibility of misinterpreting the
questions, it is more expensive than the survey approach and involves the additional effort of
planning, scheduling, and conducting the interview.

• Workshop: this method, which uses group dynamic techniques, allows a group of people
strategically chosen to work together to provide the BIA information needed. Because of group
dynamics, a large amount of data is generated in a short period of time with this method. This
technique also allows the activity owners to have a systematic view of the BIA process and to
clear out any misunderstanding regarding the BIA process. In addition to this, an important side
effect associated with this method is the teamwork spirit it helps to create among owners of
critical activities.
• The choice of the appropriate method for gathering BIA information seems to be influenced by
its cost, efficiency, and by the quality of the information. Sometimes the best methodological
strategy is to combine these three techniques.

Business impact analysis project management


• The BIA methodology is based on a task force approach. All the steps of the methodology are
performed by a cross-functional group made up of the owners of critical activities. To put the
methodology into action, someone at the tactical level with appropriate support should be
appointed as project manager. He/she becomes responsible for the resources that have been
allocated to the BIA project.

• Moreover, someone at the strategic level, with appropriate seniority and authority among other
responsibilities, should be accountable for supporting the BIA process and ensuring that the BIA
methodology is implemented in the most effective and efficient manner. It is important to
understand that a BIA is developed within an organizational context.

• It is highly probable that there will be organizational obstacles that could prevent a BIA project
from accomplishing its goals.

• If external consultants are involved, the project manager should ensure that the consultants
work closely together with the critical activity owners.

BIA Best Practices


• Start with clear objectives
• Maintain focus on objectives
• Use a top-down approach
• Vary data collection methods
• Plan interviews and meetings in advance
• Avoid the quick solution
• Use normal project management methods
• Consider the use of technology resources

3 Cost-Benefit Analysis
Performing a Cost-Benefit Analysis
1. Identify losses you expect before, or without, a countermeasure.
2. Identify the losses you expect after implementing the countermeasure.

Calculating projected benefits: Loss Before Countermeasure ─ Loss After Countermeasure = Projected Benefits

Determining value of countermeasure: Projected Benefits ─ Cost of Countermeasure = Countermeasure Value

Cost benefit analysis


Hackers regularly try to attack an online bookselling company, and on average 2.6 such attacks succeed
every year. Each successful attack results in a loss of about $10,000 to the company. The current
firewall is outdated, and a consulting company has suggested replacing it. Company XYZ has proposed
a new firewall at a cost of $9,000 and a maintenance cost of $5,000. The estimated
useful life of the firewall is 5 years. The company guarantees that the chance that an attacker breaks
through the firewall is reduced to 30%. Conduct a cost-benefit analysis and make a recommendation.
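
One possible way to work this exercise is sketched below in Python. The problem statement leaves several points open, so the sketch makes three assumptions that are not in the source: the $5,000 maintenance cost is paid every year, the $9,000 purchase price is spread evenly over the 5-year useful life, and "reduced to 30%" means that 30% of the currently successful attacks would still succeed. The function name is hypothetical.

```python
def cost_benefit(aro, sle, residual_fraction, purchase_cost,
                 annual_maintenance, useful_life_years):
    """Return (projected_benefits, annual_cost, countermeasure_value) per year."""
    loss_before = aro * sle                        # loss without the countermeasure
    loss_after = loss_before * residual_fraction   # loss with the countermeasure
    projected_benefits = loss_before - loss_after
    # Assumption: the purchase price is spread evenly over the useful life.
    annual_cost = purchase_cost / useful_life_years + annual_maintenance
    return projected_benefits, annual_cost, projected_benefits - annual_cost


benefits, cost, value = cost_benefit(
    aro=2.6, sle=10_000, residual_fraction=0.30,
    purchase_cost=9_000, annual_maintenance=5_000, useful_life_years=5,
)
print(f"Projected benefits:   ${benefits:,.0f} per year")
print(f"Annual cost:          ${cost:,.0f} per year")
print(f"Countermeasure value: ${value:,.0f} per year")
print("Recommend" if value > 0 else "Do not recommend", "the new firewall")
```

Under these assumptions the projected benefits are about $18,200 per year against an annual cost of $6,800, giving a countermeasure value of about $11,400 per year, which would support replacing the firewall. A different reading of the problem statement would change these numbers.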

CBA Report Elements


1. Recommended countermeasure
2. Risk to be mitigated
3. Annual projected benefits
4. Initial costs
5. Annual or recurring costs
6. A comparison of the costs and benefits
7. Recommendation

Asset Value
Asset Value (AV) – includes the following:

• cost of buying/developing hardware, software, service


• cost of installing, maintaining, upgrading hardware, software, service
• cost to train and re-train personnel.

Exposure – percentage loss that would occur from a given vulnerability being exploited.

Cost-Benefit Analysis

Also known as an economic feasibility study, cost-benefit analysis is a quantitative decision-making
process that:

• determines the loss in value if the asset remained unprotected.
• determines the cost of protecting an asset.
• helps prioritize actions and spending on security.
Quantitative Analysis
A widget manufacturer has installed new network servers, changing its network from a peer-to-peer (P2P)
network to a client/server-based network. The network consists of 200 users who make an average of $20
an hour, working on 100 workstations. Previously, none of the workstations on the network had
anti-virus software installed. This was because there was no connection to the Internet and
the workstations did not have USB/disk drives, so the risk of viruses was deemed
minimal. One of the new servers provides a broadband connection to the Internet, which employees can
now use to send and receive email, and surf the Internet.

Example: Determining ALE to Occur from Risks


One of the managers read in a trade magazine that other widget companies have reported an annual
75% chance of viruses infecting their network, and it may take up to 3 hours to restore the system. A
vendor will sell licensed copies of antivirus software for all servers and the 100 workstations at a cost of
$4,700 per year. The company has asked you to determine the annual loss that can be expected from
viruses, and determine if it is beneficial in terms of cost to purchase licensed copies of anti-virus
software.
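
One way to work the numbers is sketched below. It rests on assumptions the scenario does not spell out: all 200 users are idle for the full 3 hours (the worst case), the 75% figure is treated as the annual rate of occurrence, lost wages are the only loss counted, and the antivirus software removes the risk entirely.

```python
USERS = 200
HOURLY_WAGE = 20.0        # dollars per user per hour
RESTORE_HOURS = 3.0       # worst case from the scenario ("up to 3 hours")
ANNUAL_RATE = 0.75        # 75% chance per year, treated as the ARO
ANTIVIRUS_COST = 4_700.0  # dollars per year for all servers and workstations

# Loss from a single outbreak: every user is unproductive while the system is restored.
sle = USERS * HOURLY_WAGE * RESTORE_HOURS

# Annualized loss expectancy: single loss scaled by the yearly likelihood.
ale = sle * ANNUAL_RATE

# Assumption: with antivirus in place the expected loss drops to zero.
net = ale - ANTIVIRUS_COST

print(f"SLE: ${sle:,.0f}")          # SLE: $12,000
print(f"ALE: ${ale:,.0f}")          # ALE: $9,000
print(f"Net benefit: ${net:,.0f}")  # Net benefit: $4,300
```

Under these assumptions the expected annual loss ($9,000) exceeds the licence cost ($4,700), so the purchase looks beneficial. A shorter typical restore time or a less effective product would shrink that margin.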

Other feasibilities
• Organizational feasibility – A firewall may be good from a security point of view, but it may impede
the free flow of data.
• Behavioral feasibility – user acceptance
• Technical feasibility
• Political feasibility

4 Disaster Recovery Plan


• Business continuity planning (BCP) is a methodology used to create and validate a plan for
maintaining continuous business operations before, during, and after disasters and disruptive
events.

• Disaster recovery is a part of business continuity and deals with the immediate impact of an
event. Recovering from a server outage, a security breach, or a hurricane all falls into this category.

• Disaster recovery involves stopping the effects of the disaster as quickly as possible and
addressing the immediate aftermath. This might include shutting down systems that have been
breached, evaluating which systems are impacted by a flood or earthquake, and determining the
best way to proceed.

• Once the effects of the disaster or event have been addressed, business continuity activities
typically begin.
Components Of Business
The components include people, process, and technology. Technology is implemented by people using
specific processes. Technology is only as good as the people who designed and implemented it, and the
processes developed to utilize it.

People in DR planning
• People are the ones who do the actual planning and implementation of a disaster plan.

• Every company is different, and therefore, every DR planning process will have to be different. A
small retail outlet’s IT planning for DR will be very different from a college, hospital, accounting
firm, or a manufacturing facility.

• According to a survey completed in 2010, human error is responsible for 40% of all data loss, as
compared to just 29% for hardware or system failures. People are responsible for designing,
implementing, and monitoring processes intended to safeguard data. However, people make
mistakes every single day.

• Another key aspect of people in DR planning: if a disaster hits your company, people will
respond in a wide variety of ways. Some people, especially those with
emergency preparedness training, will rise to the occasion and start taking effective action
through leadership roles. As was seen in many natural disaster responses over the years, people
are often without food, shelter, power, or cellular service.

Questions:

1. What is the role of lecturers in disaster recovery?
2. What is the role of the IT team in disaster recovery?
3. What is the role of the database team in disaster recovery?

• Process in DR planning has two phases: the planning phase and the implementation phase.

• The processes your company uses to run the day-to-day business are key to the long-term
success of the business. These processes are developed (and hopefully documented) in order to
manage the recurring business tasks. Things outside the normal recurring tasks typically are
handled as exceptions until they recur often enough to create a new process, and the cycle
continues.

• If your business is suddenly hit by a disaster (fire, flood, earthquake, or chemical spill), your
processes are immediately interrupted. Trying to develop effective processes in the middle of an
emergency rarely succeeds. Having simple, well-tested processes to rely on when
disaster strikes is often the difference between eventual recovery and business failure.

Question:

• Why is it difficult to create an effective process for disaster response?

Technology in DR planning

• Part of the reason for DR planning is to look at your use of technology and understand which
elements are vulnerable to which types of disasters.

• A power outage, for example, impacts all the technology in a building. As we look at DR
planning, we’ll also look at various vulnerabilities of different technologies and discuss, in broad
strokes, strategies, tools, and techniques that might be helpful to mitigate or avoid some of
these risks.

The Cost of Planning Versus the Cost of Failure


Disasters can result in enormous business losses—financial, investor confidence, and corporate image.
They can also lead to serious legal issues, especially when more and more private data are being
captured, stored, and transmitted across the public Internet. These losses and legal challenges can have
a small, short-term impact but more often than not, they have a significant, long-term impact, and in
some cases endanger the existence of the company.

The Sony PlayStation Incident


The Sony PlayStation Network had approximately 90 million users as of 2012. Users can purchase and
download games and movies using Netflix and other accounts (D'Angelo, 2012). Users may pay for
content by submitting credit card data online via the PlayStation Network. In April 2011, Sony
discovered an intrusion on its network. In response, Sony blocked users from playing online games or
accessing services for 7 days. Later, Sony again took down the system in an attempt to block any further
hacking attacks. Sony's investigation revealed that hackers obtained names, addresses, email addresses,
dates of birth, PlayStation password and login information, password security questions and online ID, as
well as unencrypted credit card information. In response to this security breach, Sony expects to spend
more than $170 million on its personal information theft protection program, customer support, and
legal costs associated with the breach.

The Cost of Planning Versus The Cost of Failure


• Fire is the most common emergency (disaster) companies face. 40-50% of companies that
experience a major fire go out of business because most do not have BC/DR plans in place.

• Despite the high likelihood that a company will go out of business after a disaster, more than
90% of small businesses lack a disaster recovery plan.

• Even though many companies say they understand the need for a disaster recovery plan, very
few make it a priority.

• There may be substantial financial and legal implications for failing to plan and for failing to take
reasonable precautions. This can add to a company’s burdens after a disaster strikes.

Types of Disasters
• Threats or hazards come in three basic categories: natural hazards, human-caused hazards,
and accidents and technological hazards.

• Natural hazards include weather problems in both hot and cold climates as well as geological
hazards such as earthquakes, tsunamis, volcanic eruption, and land shifting.

• Human-caused hazards can be accidental or intentional. Some intentional human-caused
hazards fall under the category of terrorism, and some are less severe and may be “simply”
criminal or unethical.

• Human-caused hazards include cyber-attacks, rioting, protests, product
tampering, bombs, explosions, and terrorism, to name a few.

• Accidents and technological hazards include such issues as transportation accidents and failures,
infrastructure failures, and hazardous materials accidents, to name a few.

Disaster Recovery Planning Basics



Some types of disasters that organizations can plan for include:

1. Application failure
2. Communication failure
3. Data center disaster
4. Building disaster
5. Campus disaster
6. Citywide disaster
7. Regional disaster
8. National disaster
9. Multinational disaster

Having two servers or routers in the same rack leaves your network vulnerable—the single point of
failure could be as simple as someone tripping and spilling a large cup of coffee on the rack itself.

You might conscientiously make backups, verify the backups, and store them securely but leave them
on-site. The single point of failure could be as minor as something falling on the rack holding your tape
backups or as major as a serious fire in the server room or building.
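
A rough way to look for this kind of exposure is to list each component with its physical location and flag any role whose instances all sit in one place. The sketch below does that; the inventory, role names, and location labels are made up for illustration.

```python
from collections import defaultdict


def single_points_of_failure(components):
    """Return the roles whose instances all share a single location.

    components is an iterable of (role, location) pairs.
    """
    locations_by_role = defaultdict(set)
    for role, location in components:
        locations_by_role[role].add(location)
    return sorted(role for role, locs in locations_by_role.items() if len(locs) == 1)


# Hypothetical inventory: both routers share a rack, and the only backups are on-site.
inventory = [
    ("router", "rack-1"),
    ("router", "rack-1"),
    ("app-server", "rack-1"),
    ("app-server", "rack-2"),
    ("backup-tapes", "server-room"),
]
print(single_points_of_failure(inventory))  # ['backup-tapes', 'router']
```

The check is only as good as the location labels: two racks in the same room still share that room's fire risk, so the same function can be rerun with room, building, or city as the location.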

Disaster Recovery Planning Basics


The basic steps in any Disaster Recovery plan include:
• Project initiation
• Risk assessment
• Business impact analysis
• Mitigation strategy development
• Plan development
• Training, testing, and auditing
• Plan maintenance

Project initiation
Project initiation is one of the most important elements in disaster recovery planning because, without full
organizational support, the plan will be incomplete. As an IT professional, you may be limited in what you can do to
create an organization-wide functional DR plan. For example, if the application server is destroyed and
you have data backups, do you also have a way to access those backups? Do you have a way to allow
users to connect to the application securely? Where are users located? How will business resume? Can it
resume without that application in the near term or not? You are unlikely to be able to answer these
questions on your own.

Risk assessment
Risk assessment is the process of sitting down with key members of your company and looking at the potential
risks your company faces. These risks run from ordinary to extraordinary, from a fire or minor flood in a server
room to a catastrophic loss such as an earthquake or major hurricane, and everything in between.

As an IT professional, you can certainly lend your expertise to this process by helping define the likely impact to
technology components in various types of disasters or events, but you can’t do it alone. For example,
it’s likely that your transportation manager understands the potential business impact of bad weather
around the country, not just in your local area. Your marketing manager might best understand the
potential business risk of a contaminated product or a Web site breach.

Business impact analysis


• Once you’ve outlined your risks, you need to turn your attention to the potential impact of these
various risks.

• For example, you might determine that your Enterprise Resource Planning or your Electronic
Medical Record application cannot be down. Period. E-mail, Web servers, and reporting tools,
however, can go down, even though both events would be disruptive. Once you understand
these parameters, you can develop an IT-based strategy to meet the requirements that result
from this analysis.
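
The outcome of this step can be captured as a simple table of how long each application may be down, which then drives the recovery order. The sketch below uses the applications named above; the downtime tolerances are hypothetical values chosen only to illustrate the ranking.

```python
# Hypothetical maximum tolerable downtime, in hours, for each application.
max_downtime_hours = {
    "ERP": 0,
    "Electronic medical records": 0,
    "E-mail": 8,
    "Web servers": 24,
    "Reporting tools": 72,
}


def recovery_order(tolerances):
    """Return application names with the least downtime-tolerant first.

    Ties are broken alphabetically so the order is deterministic.
    """
    return sorted(tolerances, key=lambda app: (tolerances[app], app))


for app in recovery_order(max_downtime_hours):
    print(f"{app}: up to {max_downtime_hours[app]} h of downtime")
```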
Mitigation strategy development
The mitigation strategy might be quite simple for a small company: keep critical data backed up to a
secure cloud location, keep several copies of backups off-site, and keep several copies of key
information such as employee lists, phone numbers, emergency service phone numbers, key suppliers,
and customers in a binder off-site in a secure but accessible location.

Plan development
After you’ve gone through the analysis steps, you’ll be ready to develop your plan. As with other types of
IT project plans, you’ll want to outline the methodology you’re going to follow so that you improve your
chance of success and reduce your chances for errors and gaps. This includes standard processes such as
developing business and technical requirements, defining scope, budget, timeline, quality metrics, and
so forth.

Training, testing, and auditing


Once the plan has been developed, people need to be trained on how to implement it. In many cases,
scenario-based case studies can be a good first step. Running through appropriate drills, exercises, and
simulations can be of great help, especially for disasters or events that rank high on the list of “likely to
occur.”

Plan maintenance
Finally, plan maintenance is the last step in the DR planning process, and in many companies, it is “last
and least.” Without a plan to maintain your plan, it will become just another project document on a file
server or sitting in a binder on a shelf. If it doesn’t get maintained, updated, and revalidated from time to
time, you’ll find that the plan may be rendered useless if a disaster does strike. Maintenance doesn’t
have to be an enormous task, but it is one that must be done.

Recovery plan considerations


• The recovery time objective (RTO) describes the target amount of time a business application
can be down, typically measured in hours, minutes or seconds.

• The recovery point objective (RPO) describes the age of files that must be recovered
from backup storage for normal operations to resume.
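
Both objectives can be checked against the timeline of a real or simulated outage. The sketch below assumes the data-loss window is the gap between the last good backup and the failure, and that downtime runs from the failure until service is restored; the function name and timestamps are illustrative.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rpo, rto):
    """Return (rpo_met, rto_met) for one outage timeline.

    rpo_met: data written since the last backup is no older than the RPO.
    rto_met: the application was back within the RTO.
    """
    data_loss_window = failure - last_backup
    downtime = restored - failure
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical outage: nightly backup at 02:00, failure at 09:30, restored at 11:00.
rpo_ok, rto_ok = meets_objectives(
    last_backup=datetime(2024, 1, 10, 2, 0),
    failure=datetime(2024, 1, 10, 9, 30),
    restored=datetime(2024, 1, 10, 11, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(rpo_ok, rto_ok)  # False True
```

Here the 1.5 hours of downtime fits the 2-hour RTO, but 7.5 hours of work since the last backup exceeds the 4-hour RPO, which points to more frequent backups rather than faster restores.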

Types of disaster recovery plans


• Virtualized disaster recovery plan - Virtualization provides opportunities to implement disaster
recovery in a more efficient and simpler way. A virtualized environment can spin up new virtual
machine (VM) instances within minutes and provide application recovery. Testing can also be
easier to achieve, but the plan must include the ability to validate that applications can be run in
disaster recovery mode and returned to normal operations within the RPO and RTO.

• Network disaster recovery plan - Developing a plan for recovering a network gets more
complicated as the complexity of the network increases. It is important to detail the step-by-step
recovery procedure, test it properly and keep it updated. Data in this plan will be specific to the
network, such as its performance and the networking staff.

• Cloud disaster recovery plan - Cloud disaster recovery (cloud DR) can range from a file backup in
the cloud to a complete replication. Cloud DR can be space, time and cost-efficient, but
maintaining the disaster recovery plan requires proper management. The manager must know
the location of physical and virtual servers. The plan must address security, which is a common
issue in the cloud that can be alleviated through testing.

• Data center disaster recovery plan - This type of plan focuses exclusively on the data center
facility and infrastructure. An operational risk assessment is a key element in data center DRPs. It
analyzes key components such as building location, power systems and protection, security, and
office space. The plan must address a broad range of possible scenarios.

Disaster recovery plan checklist


1. Determine Recovery Objectives.
2. Identify the stakeholders (Team)
3. Establish Communication Channels
4. Perform Extensive Tests
5. Stay Up to Date.

Example

Consider the DR plan for a modern company running 200 physical and virtual servers in an on-premises
data center. The company relies on its production environment being available 24/7 to customers, which
is why its DR strategy needs to work reliably with minimal downtime. The company uses Amazon
Web Services (AWS) as its target DR infrastructure in order to cut costs and improve its RTO and
RPO.
