
IT Disaster Recovery and Business Continuity For Kuwait Oil Company (KOC)

This document discusses IT disaster recovery and business continuity plans for Kuwait Oil Company (KOC). It provides background on relevant literature around disaster recovery, business continuity, and critical success factors. The paper then describes problems at KOC's IT center, including a lack of proper disaster recovery planning, lack of replication of some critical systems at the recovery site, and unsynchronized databases between the production and recovery sites. The document aims to highlight the impact of these issues and provide a disaster recovery solution to help ensure IT service continuity at KOC.




Conference Paper · April 2012



IT Disaster Recovery and Business Continuity for
Kuwait Oil Company (KOC)

Mohammad Matar Al-shammari
Corporate Information Technology Group
Kuwait Oil Company, KOC
Ahmadi, Kuwait
[email protected]

Falah E. Alsaqre
College of Computer Engineering and Sciences
Gulf University
Sanad, Kingdom of Bahrain
[email protected]

Abstract- The IT center of Kuwait Oil Company (KOC) is one of the largest information centers in the State of Kuwait. It provides many services to KOC departments and other Kuwait oil sector companies (K-Companies). These services require High Availability (HA) and flexibility in their operations, both of which are essential for KOC to continue and prosper. The aim of this paper is to highlight the impact of disaster damage and to provide a disaster recovery solution to assist IT services at KOC. Moreover, it introduces a better understanding of Disaster Recovery Plan (DRP) and Business Continuity Plan (BCP) methodology to enhance the KOC IT center by creating and validating a plan for maintaining continuous business operations before, during, and after disasters and disruptive events.

Keywords- Kuwait Oil Company; Business Continuity Plan; Disaster Recovery Plan; High Availability; VMware; Site Recovery Manager.

I. INTRODUCTION

In today's business environment, IT resources, including data, are the most important assets owned by organizations. Natural disasters such as earthquakes, cyclones, hurricanes, and floods can destroy these assets. They can also be destroyed intentionally by computer viruses, hackers, sabotage and terrorist attacks.

In view of this, organizations need to be prepared and equipped against any disaster in order to maintain their survival and reputation by quickly recovering their data and continuing their operations. Therefore, effective disaster planning is not optional; it is critical for the success of an organization.

The literature on Disaster Recovery (DR) and Business Continuity Plan (BCP) consists broadly of prescriptive suggestions on how to plan and implement DR/BCP. Documents and papers report incidents along with case studies and raise special awareness. Although many articles have been published in recent years, it is still difficult to find relevant information about regional events, e.g. Oman's Cyclone Gonu [1], due to the lack of an appropriate design to suit the environment and facilities available to organizations in different regions. In [2], the differences between disaster and crisis are introduced with comprehensive definitions, types, characteristics, criteria and models of disasters and crises. The need for a comprehensive, integrated, coherent, multi-dimensional and aggressive disaster policy is presented in [3]. The paper reports both strengths and weaknesses of alternative viewpoints of disasters. It also suggests that a broad conceptualization of vulnerability may be best suited to assimilating findings for academia and simplifying policy guidance for professionals in the field. Moreover, the paper recommends ways to assess liabilities and capabilities, reduce risk and susceptibility, and raise resistance and resilience to disasters. In [4], business continuity is achieved by designing duplicate (or near duplicate) hardware infrastructure, data replication and instant (or nearly instant) application availability. Recognizing the lack of homogeneity, consistency and quality control in emergency planning, [5] states 18 principles for standardizing emergency plans for DR and Business Continuity (BC). The emergency plan focuses on local authorities such as local governments, with guidelines for testing, revising and utilizing the plan.

Disaster preparedness and BCP are integrated in one model in [6]. Disaster preparedness consists of preparedness, response and recovery, while business continuity includes response, stabilization, assessment and business continuity. In [7], the outcomes of a survey of 274 executives in India are used to assess the impact of computer disasters on information management. The results show that information management practices at all types of companies are adversely affected by the occurrence of computer disasters. Viruses and hardware faults are the disasters that take place in most organizations.

The influence of the internet and the communication practices utilized during a crisis are discussed in [8]. The paper reports that efficient internet communications such as text messages over Personal Digital Assistants (PDAs) or mobile phones, email, instant messaging, personal web pages and blogs can be exploited during a crisis. In [9], the way the ranking of Critical Success Factors (CSFs) for implementing a BC/DR program has changed relative to previous research, specifically after the events of 9/11, is highlighted. The study identifies several CSFs that are not referenced in previous research.
In [10], a link between emergency response plans, BCP and DR is suggested as a route to effective crisis management.

The role of the insurance industry in Indian disaster management is introduced in [11]. It states that in many developing countries, most of the losses suffered in natural disasters are not insured due to the lack of a proper plan to face the disaster. This situation arises from a lack of purchasing power, of interest in insurance, and of suitable policies. The paper suggests an integrated approach towards disaster management.

II. PROBLEM DESCRIPTION

The IT center of KOC is one of the largest information centers in the State of Kuwait. The center provides many services to KOC departments and other Kuwait Oil Sector Companies (K-Companies). The growing number of services gives rise to the following drawbacks:
• The DRP is not prepared correctly.
• Some systems, such as the Gas Management Information System (GMIS) and the Hospital Information System (HIS), are not replicated at the recovery site.
• Databases are not synchronized between the production and recovery sites.
• The recovery site is activated manually, which is time consuming.
• A BCP and a Business Impact Analysis (BIA) do not exist.

This paper attempts to remedy the aforementioned drawbacks by providing a disaster recovery solution that will assist IT services at KOC. It also introduces a better understanding of the Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP) methodology that supports KOC by creating and validating a plan for maintaining continuous business operations before, during, and after disasters and disruptive events.

III. BUSINESS CONTINUITY AND DISASTER RECOVERY

A. Business Continuity Plan

Business Continuity Plan (BCP) is a methodology used to create and validate a plan for maintaining continuous business operations before, during, and after disasters and disruptive events. It is considered a concept of managing the operational elements that allow a business to function normally in order to generate revenue. BCP guarantees the shift of critical systems to another environment while the original is being repaired. It also puts the right people in the right places and conducts business in different modes by involving the shareholders through different stations until everything returns to normal.

In addition, BCP enhances an organization's ability to continue its operations regardless of the nature of a potential disruption by guiding IT staff to follow an emergency plan in order to recover and resume IT services when operations are unexpectedly disrupted [12]. Specifically, it consists of preplanned procedures that allow an organization to successfully achieve the following:
• Provide an immediate and proper response to emergency situations.
• Protect lives and ensure safety.
• Reduce business impact.
• Resume critical business services.
• Work with vendors during the recovery period.
• Reduce confusion during a disaster.
• Ensure continuity of business services.
• Get "up and running" quickly after a disaster.

These procedures must be preplanned in such a way that they ensure timely and orderly resumption of an organization's business cycle and, at the same time, can be executed with no or minimal interruption to time-sensitive IT service operations.

B. Disaster Recovery Plan (DRP)

A disaster is defined as a destructive or debilitating event that compromises the operational availability of a system for an unacceptable period of time. Disasters differ from general failures in both severity and degree of impact. System failures do not necessarily impair system capability. Disasters that cannot be ameliorated by existing failure prevention systems are generally the result of catastrophic events including, but not limited to, human intervention, severe weather, floods or fire. Since a disaster destroys and interrupts the continuity of business operations within an IT centre, the response is expected to use additional infrastructure.

DRP is a part of BCP and deals with the immediate impact of an event. Recovering from a server outage, a security breach, or a hurricane all fall into this category. A DRP usually includes several deliberate steps that are prepared in the planning stages. These steps must be implemented quickly when a disaster occurs, because the situation during a disaster is almost never exactly as planned. In this way, resources can be controlled in accordance with the prepared steps. In short, DRP means dealing with the effects of a disaster as quickly as possible and addressing the immediate result.

IV. ESTABLISHING HIGH AVAILABILITY

A. High Availability

High Availability (HA) is a first crucial step to ensure business continuity within IT services in case of failures or problems that are not caused by major disasters and can be managed locally according to ordinary operational procedures without resorting to the DRP. One of the early promises of virtualization is the ability to keep virtualized systems online and operating regardless of problems with the underlying hardware, by allowing a Virtual Machine (VM) to run on any host in the virtual environment. HA is a design methodology used to ensure the uptime and availability of virtual machines (VMs).
In general, the HA provided by virtualization technologies mitigates two types of downtime:
• Planned downtime: time scheduled for maintenance and upgrades, during which a system cannot be used for normal productive operations.
• Unplanned downtime: time during which a system cannot be used for normal productive operations due to unforeseen failures in hardware or software components, or operator mistakes.
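
The effect of the two kinds of downtime on availability can be illustrated with a short calculation. The sketch below is not part of the KOC setup. It assumes the common definition of availability as the fraction of a period in which the system is usable; the function name `availability` and the sample hours are hypothetical.

```python
def availability(period_hours: float, planned_hours: float, unplanned_hours: float) -> float:
    """Fraction of the period in which the system is usable for normal operations."""
    downtime = planned_hours + unplanned_hours
    if not 0 <= downtime <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return (period_hours - downtime) / period_hours


# Hypothetical year: 8 h of scheduled maintenance, 0.76 h of unforeseen failures.
year = 365 * 24  # 8760 hours
print(f"Availability: {availability(year, 8.0, 0.76):.4%}")
```

Counting planned downtime against availability is a design choice; some service agreements exclude scheduled maintenance windows, in which case `planned_hours` would be passed as zero.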

B. Virtualization

Virtualization is a technique for simultaneously running multiple operating systems on a single computer, making one computer operate as multiple computers, as shown in Figure 1. It is considered a framework or methodology for dividing the resources of computer hardware into multiple execution environments by applying one or more concepts or technologies such as hardware and software partitioning, time-sharing, partial or complete machine simulation, emulation, quality of service, and many others.

Figure 1. Virtualization environment.

C. VMware Site Recovery Manager

VMware Site Recovery Manager (SRM) is a business continuity and disaster recovery solution that helps IT organizations plan, test, and execute a scheduled migration or emergency failover of datacenter services from one site to another. SRM is supported by VMware vCenter to provide integration with array-based replication, discovery and management of replicated datastores, and automated migration of inventory from one vCenter to another.

The SRM server coordinates the operations of replicated storage arrays and vCenter servers at two sites. This implies that as VMs at the production site are shut down, the VMs at the recovery site start up and use the data replicated from the production site to assume responsibility for providing the same services. Transfer of services from one site to the other is controlled by a recovery plan that specifies the order in which VMs are shut down and started up, the computer resources they are allocated, and the networks they can access. As shown in Figure 2, SRM allows testing a recovery plan by using a temporary copy of the replicated data, and the process does not disrupt ongoing operations at either site.

Figure 2. SRM production and recovery sites.

V. PROPOSED SOLUTION FOR IT DISASTER RECOVERY IN KOC

Site Recovery Manager (SRM) is an end-to-end DR automation solution. SRM automatically customizes VMs so that they can run at the recovery site. It is designed to protect VMs residing in datastores on replicated storage. In the event of a storage array failure or complete site failure, VMs can be failed over to a remote datacenter.

The proposed solution involves implementing SRM with a VM environment for IT services at KOC by providing references for installing and configuring SRM within a test environment at KOC. It also introduces a framework to implement a DRP at the recovery site. Setup, testing, evaluation and failover are taken into account as well.

Figure 3. SRM logical architecture in KOC.

The solution assumes two sites: a production site and a recovery site. Under this assumption, the production site provides business-critical datastore services, whereas the recovery site acts as an alternative facility to which these services can be migrated. The production site is located in the main computer center of KOC, in which a virtual infrastructure supports a critical business need. The recovery site is in the main KOC office, 1.5 miles from the production site. The logical architecture of SRM in KOC is shown in Figure 3. SRM has several requirements for vCenter
configurations at each site as follows:
• Each site must include at least one vCenter datacenter.
• The recovery site must support array-based replication with the production site, and must have the hardware and network resources to support the same VMs and workloads as the production site.
• At least one VM must be located on a replicated datastore at the production site. The datastore must be supported by a storage array that is compatible with SRM.
• The production and recovery sites must be connected via a fiber channel network.
• The recovery site must be able to access the same public and private networks as the production site, though not necessarily in the same range of network addresses.

Therefore, the objectives of the proposed solution are to provide guidance to the IT-services group at KOC in order to:
• Describe disaster scenarios that may affect IT services at KOC, and the high availability that can be replicated to remote recovery sites for evaluation.
• Provide suitable integration of VMware SRM in the IT department at KOC to react to disasters and manage crisis situations.
• Identify the ability to simulate and test the failover process without impact on existing operations, to ensure the processes will be performed correctly during a real disaster.
• Demonstrate the process of restoring normal operation of the original production site after a failover.

A. Setup of Proposed Solution

Figure 4 highlights the configuration of the test environment of the proposed solution in KOC. It requires two servers: production and recovery. Each of them hosts six VMs on one ESXi host.

Figure 4. Environment test architecture.

Implementing the test environment requires the following SRM prerequisite tasks:
• Install a VMware ESXi host at both sites to run multiple VMs.
• Create six VMs at each site.
• Set up and configure the applicable networking at both sites.
• Run VMware vCenter Server and the SRM plug-in in two VMs at each site.
• Configure the databases at each site to support vCenter Server and SRM.

B. Hardware and Software Configurations

The hardware and software for the configuration of the test environment are listed in Table I.

TABLE I. THE VMS IN PRODUCTION AND RECOVERY SITE IN KOC TEST ENVIRONMENT.

Hardware (production site and recovery site)
VM                  VM name             RAM       Hard disk
VM-1                vCenter Server      2048 MB   40 GB
VM-2                SRM Server          2048 MB   40 GB
VM-3                DB Server           2048 MB   40 GB
VM-4, VM-5, VM-6    App1, App2, App3    2048 MB   40 GB

Software (production site and recovery site)
Component               Product / version
VMware                  VMware ESX 4.1; vCenter Server 4.1; SRM 4.1
VMs operating system    Microsoft Windows Server 2008; Microsoft Windows XP 64 bit
Storage Works           HP Storage Works P4000 Virtual SAN Appliance
SRA                     Version: v1.20.10713
Database                Microsoft SQL Server 2008

Both SRM and vCenter Server require a database to store the information necessary for operation. In order to function properly, both the production and recovery sites need a local database. One server is designated to operate as the production site and the other to operate as the recovery site.

After SRM has been completely installed on both sites, the sites must be connected in order to create a site pair, configure the array managers, configure inventory mappings, and create a protection group and a recovery plan at each site. The
SRM client plugin is used to administer SRM. Site pairing uses vCenter administrative privileges at both sites.

VI. IMPLEMENTATION OF A DISASTER RECOVERY PLAN

In SRM, a Recovery Plan (RP) consists of the steps needed to switch datacenter operation from the production site to the recovery site. The RP ensures that both tests and failovers are executed in a repeatable and reliable manner. It also provides a way to test the BCP and DRP in an isolated environment at the recovery site without impacting the protected VMs at the production site.

A. Testing of Disaster Recovery Plan

After SRM is configured on both the production and recovery sites, the recovery plan can be tested without affecting current services at either site. The test runs the recovery plan and, if necessary, configures the two sites for failback so that services can be restored at the production site.

When the test recovery plan is enabled, a test network and a temporary copy of the replicated data at the recovery site are used, and operations at the production site are not disrupted. Testing a recovery plan completes all the required steps with the exception of powering down VMs at the production site and forcing devices at the recovery site to assume mastership of replicated data. A test recovery makes no changes to the production environment at either site.

SRM performs this test with copy-on-write Flashcopies of the mirrored logical drives at the recovery site. The Flashcopy datastores are removed from the recovery host. During the recovery plan configuration, SRM can suspend non-critical VMs at the recovery site to ensure that SRM has enough resources to run the recovery test.

Figure 5. The failover test.

B. Performance Evaluation

Virtualization technology based on the VMware platform conceptually promises powerful, simple and cost-effective solutions to support disaster recovery and business continuity objectives. To verify and validate this assumption when applied to business continuity, the following two proof-of-concept tests are carried out. The failover test solution proposes two environment tests (Test-1 and Test-2) to estimate and analyze the time for switching operation from the production site to the recovery site. The two failovers are applied in different VM environments. Two different policies are used for dynamic virtualized infrastructure operations management, as shown in Figure 5.

a) Test-1

This test is conducted by configuring the virtual servers to operate on a sequential schedule. In effect, VM1 is scheduled to start first, and VM2 starts after VM1 has completely loaded. VM2 is then brought up to its planned capacity, and likewise VM3, VM4 and VM5: each VM starts immediately after the previous one has loaded to its planned capacity. The total elapsed time for the full failover is measured and the results are tabulated in Table II.

TABLE II. TEST-1 ACTUAL TIME.

VM Name    Time (Min)    Actual Time (Min)
VM1        1.19          1.19
VM2        1.12          2.31
VM3        1.15          3.46
VM4        1.05          4.51
VM5        1.10          5.61
Total                    5.61

VM1 completed loading in 1.19 minutes, VM2 took 1.12 minutes, and so on. The total time to load all virtual machines is 5.61 minutes.

b) Test-2

This test is configured in a different way. All virtual servers are scheduled to start in sequence but almost simultaneously. VM1 is scheduled to start first, and VM2 starts 10 seconds into the loading of VM1 and is then brought up to its planned capacity. Similarly, VM3, VM4 and VM5 are each started 10 seconds after the previous VM has started loading. The total elapsed time for the full failover is measured and the results are shown in Table III.

TABLE III. TEST-2 ACTUAL TIME

VM Name    Time (Min)    Actual Time (Min)
VM1        1.41          1.41
VM2        1.35          1.45
VM3        1.42          1.62
VM4        1.53          1.83
VM5        2.02          2.42
Total                    2.42

In the above table, VM1 completed loading in 1.41 minutes, with VM2 starting to load 10 seconds after VM1 and taking 1.35 minutes. This test is configured to run in parallel with a 10-second server startup interval, for a total elapsed time of 2.42 minutes.
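
The arithmetic behind Tables II and III can be reproduced with a short sketch. It is not the authors' measurement procedure, and the function names are hypothetical. It assumes that the "Actual Time" column is the finish time of each VM measured from the start of the failover, and that the 10-second start interval is counted as 0.10 in the tables' own units, which is what the Table III figures imply.

```python
from itertools import accumulate

# Per-VM load times copied from Tables II and III, in the tables' "Min" units.
TEST1_LOAD = [1.19, 1.12, 1.15, 1.05, 1.10]
TEST2_LOAD = [1.41, 1.35, 1.42, 1.53, 2.02]


def sequential_finish(load_times):
    """Test-1 policy: each VM starts only after the previous one has fully loaded."""
    return [round(t, 2) for t in accumulate(load_times)]


def staggered_finish(load_times, interval):
    """Test-2 policy: VM k starts k * interval after the first VM and loads in parallel."""
    return [round(k * interval + t, 2) for k, t in enumerate(load_times)]


test1 = sequential_finish(TEST1_LOAD)
# Assumption: the 10-second interval enters the sum as 0.10, as Table III implies.
test2 = staggered_finish(TEST2_LOAD, 0.10)

print("Test-1 finish times:", test1, "total:", test1[-1])
print("Test-2 finish times:", test2, "total:", max(test2))
```

Under these assumptions the sequential total is the sum of all load times, while the staggered total is set by the VM that finishes last, so it grows only with the start interval and the slowest load.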
Figure 6. The failover actual times in Test 1 and Test 2.

The results from the two tests confirm the intuitive conclusion that the scheduling arrangements in Test-1 have a more favorable outcome, provide quicker failover, and meet the more demanding Recovery Time Objectives (RTO) of Service Level Agreements (SLA) for disaster recovery and business continuity, as depicted in Figures 5 and 6. They indicate that the Test-1 scenario is preferable where enterprise application SLAs are stringent and high availability is necessary. Test-2 is more suitable for providing services in scheduled downtime situations, because scheduled downtimes are bound by a time limit. In Test-2, all virtual servers start at once without wasting time waiting for another server to start and load fully. Moreover, these tests have validated the premise that VMware virtualization solutions offer proven performance benefits, provide flexibility in operations and offer affordable scalability to data center managers.

Figure 7. Comparison of failover actual times in Test 1 and Test 2.

VII. CONCLUSIONS

Based on assumptions about the most likely disaster scenarios in this region, along with the impact analysis and the most suitable risk mitigation options, this paper proposed a virtualization solution to sustain IT services at KOC and keep these services robust and resilient at all times under all foreseeable conditions. The presented disaster recovery solution exploits VMware SRM to reduce the space requirements and maintenance costs at both the production and recovery sites; the level of reduction depends on the server strength. This paper also provides simulated testing and monitoring of the kind performed during a disaster, and indicates that the virtualization solution is cost-effective, simpler and more reliable in meeting business continuity requirements. The solution provides a simplified recovery plan for the virtual environment without any human interaction. It also improves HA and the protection of data integrity through synchronous data replication between the production and recovery sites.

REFERENCES
[1] Anil A. H. Al-Badi and Rafi A. and Ali O. Al-Majeeni and Pam J. Mayhew, "IT disaster recovery: Oman and Cyclone Gonu lessons learned", Information Management & Computer Security, Vol. 17 No. 2, 2009, pp. 114-126.
[2] Shaluf, I.M., Ahmadun, F. and Said, A.M., "A review of disaster and crisis", Disaster Prevention and Management, Vol. 12 No. 1, 2003, pp. 24-32.
[3] McEntire, D.A., "Why vulnerability matters: exploring the merit of an inclusive disaster reduction concept", Disaster Prevention and Management, Vol. 14 No. 2, 2005, pp. 206-22.
[4] Wainwright, V.L., "Business continuity by design", Health Management Technology, Vol. 28, 2007, pp. 20-1.
[5] Alexander, D., "Towards the development of a standard in emergency planning", Disaster Prevention and Management, Vol. 14 No. 2, 2005, pp. 158-75.
[6] Castillo, C., "Disaster preparedness and business continuity planning at Boeing: an integrated model", Journal of Facilities Management, Vol. 3 No. 1, 2005, pp. 8-26.
[7] Kundu, S.G., "Impact of computer disaster on information management: a study", Industrial Management & Data Systems, Vol. 104 No. 2, 2005, pp. 136-43.
[8] Jefferson, T.L., "Using the internet to communicate during a crisis", The Journal of Information and Knowledge Management Systems, Vol. 36 No. 2, 2006, pp. 139-42.
[9] Barbara, M., "Determining the critical success factors of an effective business continuity/disaster recovery program in a post 9/11 world: a multi-method approach", MSc thesis, Concordia University, Montreal, 2006.
[10] Gorge, M., "Crisis management best practice-where do we start from?", Computer Fraud & Security, Vol. 6, 2006, pp. 10-13.
[11] Atmanand, "Insurance and disaster management: the Indian context", Disaster Prevention and Management, Vol. 12 No. 4, 2003, pp. 286-304.
[12] Shon H., "CISSP All-in-One Exam Guide", Mark Bedell, Fourth Edition, 2007.
