MCAD2223 Datamining and Warehousing - Module
1.1 Introduction.
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and scientific exploration.
Data mining has numerous applications across various industries, including marketing,
healthcare, finance, and manufacturing. In marketing, data mining is used to identify
customer preferences and behaviors, and to develop targeted marketing campaigns. In
healthcare, data mining is used to analyze patient data and develop personalized treatment
plans. In finance, data mining is used to detect fraudulent transactions and assess credit
risk.
Some of the commonly used techniques in data mining include clustering, classification,
regression, and association rule mining. Data mining can be performed using a variety of
tools and software packages, including Python, R, SAS, and Tableau. Overall, data mining
plays a critical role in today's data-driven world, allowing organizations to gain insights
from large datasets and make informed decisions based on data-driven evidence.
Data storage became easier as large amounts of computing power became available at low cost; in other words, the falling cost of processing power and storage made data cheap. There was also the introduction of new machine learning methods for knowledge representation, based on logic programming and other approaches, in addition to traditional statistical analysis of data. The new methods tend to be computationally intensive and hence create a demand for more processing power.
It was recognized that information is at the heart of business operations and that decision-makers could make use of the stored data to gain valuable insight into the business. Database
Management Systems gave access to the data stored but this was only a small part of what
could be gained from the data. Traditional on-line transaction processing systems, OLTPs,
are good at putting data into databases quickly, safely and efficiently but are not good at
delivering meaningful analysis in return. Analyzing data can provide further knowledge
about a business by going beyond the data explicitly stored to derive knowledge about the
business. Data Mining, also called data archaeology, data dredging, or data harvesting, is the process of extracting hidden knowledge from large volumes of raw data and using it to
make crucial business decisions. This is where Data Mining or Knowledge Discovery in
Databases (KDD) has obvious benefits for any enterprise.
According to Marcel Holsheimer and Arno Siebes, “Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.”
Data mining refers to extracting or “mining” knowledge from large amounts of data. Many other terms carry a similar meaning, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Typical applications of data mining include:
▪ Design and Construction of data warehouses based on the benefits of data mining.
▪ Multidimensional analysis of sales, customers, products, time and region.
▪ Analysis of effectiveness of sales campaigns.
▪ Customer Retention.
▪ Product recommendation and cross-referencing of items.
One of the most lucrative applications of data mining has been undertaken by social media
companies. Platforms like Facebook, TikTok, Instagram, and Twitter gather reams of data
about their users, based on their online activities.
That data can be used to make inferences about their preferences. Advertisers can target
their messages to the people who appear to be most likely to respond positively.
Data mining on social media has become a big point of contention, with several investigative reports and exposés showing just how intrusive mining users' data can be. At
the heart of the issue, users may agree to the terms and conditions of the sites not realizing
how their personal information is being collected or to whom their information is being
sold.
ETL Tools are meant to extract, transform and load the data into the Data Warehouse for decision making. Before the evolution of ETL tools, this ETL process was done manually, using SQL code written by programmers. The task was tedious in many cases, since it involved many resources, complex coding, and long work hours. On top of that, maintaining the code posed a great challenge for the programmers.
These difficulties are eliminated by ETL tools, which are very powerful and offer many advantages over the old method in all stages of the ETL process: extraction, data cleansing, data profiling, transformation, debugging, and loading into the data warehouse.
Data warehouses are designed to help you analyze data. For example, to learn more about
your company’s sales data, you can build a warehouse that concentrates on sales. Using
this warehouse, you can answer questions like “Who was our best customer for this item
last year?” This ability to define a data warehouse by subject matter, sales in this case,
makes the data warehouse subject oriented. Integration is closely related to subject
orientation. Data warehouses must put data from disparate sources into a consistent format.
They must resolve such problems as naming conflicts and inconsistencies among units of
measure. When they achieve this, they are said to be integrated.
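To make this integration step concrete, here is a minimal pandas sketch; the source tables, column names, and the currency conversion factor are all invented for the illustration and are not part of any particular warehouse design.

```python
import pandas as pd

# Hypothetical extracts from two source systems that name and measure
# the same facts differently.
orders_eu = pd.DataFrame({"cust_id": [1, 2], "amount_eur": [120.0, 80.5]})
orders_us = pd.DataFrame({"CustomerID": [3, 4], "AmountUSD": [99.0, 150.0]})

EUR_TO_USD = 1.10  # assumed fixed rate, purely for illustration

# Resolve naming conflicts and unit inconsistencies into one consistent format.
unified = pd.concat([
    orders_eu.rename(columns={"cust_id": "customer_id"})
             .assign(amount_usd=lambda d: d["amount_eur"] * EUR_TO_USD)
             .loc[:, ["customer_id", "amount_usd"]],
    orders_us.rename(columns={"CustomerID": "customer_id",
                              "AmountUSD": "amount_usd"}),
], ignore_index=True)

print(unified)
```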
2.3.1 Statistics
Statistics has a solid theoretical foundation, but the results from statistics can be overwhelming and difficult to interpret, as they require user guidance about where and how to analyze the data. Data mining, however, allows the expert's knowledge of the data and the advanced analysis techniques of the computer to work together, enabling analysts to detect unusual patterns and to explain those patterns using statistical models.
2.3.3 Algorithms
▪ Genetic Algorithms
Optimization techniques that use processes such as genetic combination,
mutation, and natural selection in a design based on the concepts of natural
evolution.
▪ Statistical Algorithms
Statistics is the science of collecting, organizing, and applying numerical facts. Statistical analysis systems such as SAS and SPSS have long been used by analysts.
2.3.5 Visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive
understanding of the data and as such can work well alongside data mining. Data mining
allows the analyst to focus on certain patterns and trends and explore in-depth using
visualization. On its own data visualization can be overwhelmed by the volume of data in
a database but in conjunction with data mining can help with exploration. Visualization
indicates the wide range of tool for presenting the reports to the end user .The presentation
ranges from simple table to complex graph, using various 2D and 3D rendering techniques
to distinguish information presented.
A second type of pollution that frequently occurs is lack of domain consistency and disambiguation. This type of pollution is particularly damaging, because it is hard to trace, yet it greatly influences the type of patterns found when we apply data mining to the table. In our example we replace the unknown data with NULL. The duration value for patient number 103 is negative; since the value must be positive, the incorrect value is also replaced with NULL.
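A small pandas sketch of this kind of cleaning is shown below; the patient table, its column names, and its values are assumptions invented to mirror the example (an "unknown" entry and a negative duration both replaced with NULL/NaN).

```python
import numpy as np
import pandas as pd

# Toy patient table; the values are invented to mirror the example above.
patients = pd.DataFrame({
    "patient_no": [101, 102, 103],
    "diagnosis":  ["flu", "unknown", "fracture"],
    "duration":   [7, 12, -5],          # duration in days; must be positive
})

# Treat the placeholder "unknown" as a missing value (NULL).
patients["diagnosis"] = patients["diagnosis"].replace("unknown", np.nan)

# A negative duration violates domain consistency, so replace it with NULL too.
patients.loc[patients["duration"] < 0, "duration"] = np.nan

print(patients)
```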
2.7.9 Visualization/Interpretation/Evaluation
Visualization techniques are a very useful method of discovering patterns in datasets, and
may be used at the beginning of a data mining process to get a rough feeling of the quality
of the data set and where patterns are to be found. Interesting possibilities are offered by object-oriented three-dimensional tool kits, such as Inventor, which enable the user to explore three-dimensional structures interactively.
Advanced graphical techniques in virtual reality enable people to wander through artificial data spaces, while the historic development of data sets can be displayed as a kind of animated movie. These simple methods can provide us with a wealth of information.
An elementary technique that can be of great value is the so-called scatter diagram. Scatter
diagrams can be used to identify interesting subsets of the data sets, so that the rest of the data mining process can focus on them. There is a whole field of research dedicated to the
search for interesting projections for data sets that is called projection pursuit.
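As a rough illustration of how a scatter diagram can reveal an interesting subset worth focusing on, the following matplotlib sketch plots two made-up attributes and highlights a denser region; the data and attribute names are assumptions for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-attribute data: a broad cloud plus a small, denser subset.
background = rng.normal(loc=(50, 50), scale=15, size=(300, 2))
cluster = rng.normal(loc=(80, 20), scale=3, size=(40, 2))

plt.scatter(background[:, 0], background[:, 1], s=10, label="all records")
plt.scatter(cluster[:, 0], cluster[:, 1], s=10, label="interesting subset")
plt.xlabel("attribute 1")
plt.ylabel("attribute 2")
plt.legend()
plt.title("Scatter diagram used to spot a subset worth mining further")
plt.show()
```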
2.8.2 Association
Whereas the goal of the modelling operation is to create a generalized description that
characterizes the contents of a database, the goal of Association is to establish relations
between the records in a database.
3.2 Classification
Classification is learning a function that maps a data item into one of several predefined
classes. Examples of classification methods used as part of knowledge discovery
applications include classifying trends in financial markets and automated identification of
objects of interest in large image databases.
Prediction involves using some variables or fields in the database to predict unknown or
future values of other variables of interest. Description focuses on finding human
interpretable patterns describing the data.
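A compact sketch of learning such a classification function is shown below, using scikit-learn's k-nearest-neighbours classifier on its bundled iris dataset; the choice of algorithm and dataset is ours for illustration and is not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labelled dataset: each item belongs to one of three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Learn a function that maps a data item into one of the predefined classes.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Prediction: apply the learned function to unseen items.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```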
Advantages
The advantage of neural networks is that, theoretically, they are capable of approximating any continuous function; thus the researcher does not need to have any hypotheses about the underlying model, or even, to some extent, about which variables matter.
Disadvantages
The final solution depends on the initial conditions of the network. It is virtually impossible
to “interpret” the solution in traditional, analytic terms, such as those used to build theories
that explain phenomena.
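To illustrate both points, here is a hedged scikit-learn sketch in which a small multilayer perceptron (MLPRegressor) approximates a noisy sine curve without any prior hypothesis about its form; the network size, data, and random seeds are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Noisy samples of a continuous function the network knows nothing about.
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A small feed-forward network approximates the underlying function.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                   random_state=1).fit(X, y)

print("fit quality (R^2):", round(net.score(X, y), 3))
# The learned weights, however, offer no directly interpretable "theory"
# of why the curve looks the way it does, which is the disadvantage noted above.
```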
3.4 Decision Tree technique
Decision trees are powerful and popular tools for classification and prediction. Decision
trees represent rules. A decision tree is a classifier in the form of a tree structure where each node is either:
▪ a leaf node, indicating a class of instances, or
▪ a decision node that specifies some test to be carried out on a single attribute value,
with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
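A minimal sketch of building such a tree and using it to classify an instance, with scikit-learn's DecisionTreeClassifier on the iris dataset (the dataset and the depth limit are our own choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Each internal node tests a single attribute; each leaf assigns a class.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree can be printed as human-readable rules.
print(export_text(tree, feature_names=list(iris.feature_names)))

# Classify one instance by walking from the root down to a leaf.
print("predicted class:", iris.target_names[tree.predict(X[:1])[0]])
```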
(b) Two Point Crossover: In this type, two crossover points are selected and the string from
beginning of one chromosome to the first crossover point is copied from one parent, the
part from the first to the second crossover point is copied from the second parent and the
rest is copied from the first parent. This type of crossover is mainly employed in
permutation encoding and value encoding where a single point crossover would result in
inconsistencies in the child chromosomes. For instance, consider the following
chromosomes and crossover points at positions 2 and 5 as shown in figure 3.5.
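The plain-Python sketch below implements two-point crossover as described; the example parents and the crossover points 2 and 5 are invented stand-ins for the chromosomes in figure 3.5.

```python
import random

def two_point_crossover(parent1, parent2, point1, point2):
    """Copy [0, point1) from parent1, [point1, point2) from parent2,
    and the remainder from parent1 again."""
    return parent1[:point1] + parent2[point1:point2] + parent1[point2:]

p1 = list("ABCDEFG")
p2 = list("abcdefg")

# Crossover points at positions 2 and 5, as in the example above.
child = two_point_crossover(p1, p2, 2, 5)
print("".join(child))                      # ABcdeFG

# In a real GA the points are chosen at random for each mating pair.
i, j = sorted(random.sample(range(1, len(p1)), 2))
print("".join(two_point_crossover(p1, p2, i, j)))
```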
3.8.2 Mutation
As new individuals are generated, each character is mutated with a given probability. In a
binary-coded Genetic Algorithm, mutation may be done by flipping a bit, while in a non-
binary-coded GA, mutation involves randomly generating a new character in a specified
position. Mutation produces incremental random changes in the offspring generated
through crossover. When used by itself without any crossover, mutation is equivalent to a
random search consisting of incremental random modification of the existing solution and
acceptance if there is improvement. However, when used in the GA, its behavior changes
radically. In the GA, mutation serves the crucial role of replacing gene values lost from the population during the selection process, so that they can be tried in a new context, or of providing gene values that were not present in the initial population. As with crossover, the type of mutation used depends on the type of encoding employed.
The various types are as follows:
(a) Bit Inversion: This mutation type is employed for a binary encoded problem. Here, a
bit is randomly selected and inverted i.e., a bit is changed from 0 to 1 and vice-versa. For
instance, consider the mutation shown in figure 3.7.
(b) Order Changing: This type of mutation is specifically used in permutation-encoded problems. Here, two random points in the chromosome are chosen and their contents interchanged, as shown in figure 3.8.
(d) Operator Manipulation: This method involves changing the operators randomly in an operator tree, and hence is used with tree-encoded problems, as shown in figure 3.9.
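The following plain-Python sketch shows the bit-inversion and order-changing mutations described above; the chromosomes and the mutation rate are invented for the example, and operator manipulation is omitted because it requires a tree encoding.

```python
import random

def bit_inversion(chromosome, rate=0.05):
    """Flip each bit of a binary-encoded chromosome with probability `rate`."""
    return [1 - bit if random.random() < rate else bit for bit in chromosome]

def order_changing(chromosome):
    """Pick two random positions in a permutation-encoded chromosome
    and swap their contents."""
    mutated = list(chromosome)
    i, j = random.sample(range(len(mutated)), 2)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

random.seed(0)
print(bit_inversion([1, 0, 1, 1, 0, 0, 1, 0], rate=0.2))
print(order_changing([1, 2, 3, 4, 5, 6, 7, 8]))
```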
Another kind of clustering is conceptual clustering: two or more objects belong to the
same cluster if together they define a concept common to all those objects. In other words,
similarity measures.
Manhattan distance function: d(a, b) = |a1 − b1| + |a2 − b2| + … , i.e., the sum of the absolute differences of the attribute values.
When using the Euclidean distance function to compare distances, it is not necessary to calculate the square root, because distances are always positive numbers and so, for two distances d1 and d2, d1 > d2 if and only if d1² > d2². If some of an object’s attributes are measured on different scales, then when using the Euclidean distance function, attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. To prevent this problem, the attribute values are often normalized to lie between 0 and 1.
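A short sketch of both distance functions together with min-max normalization is given below; the two-attribute objects (age and income) are invented to show how a large-scale attribute can dominate the distance before normalization.

```python
import numpy as np

def euclidean(a, b):
    # Squared distances preserve ordering, so the square root can be skipped
    # whenever we only need to compare distances.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def min_max_normalize(X):
    """Rescale every attribute (column) to the range [0, 1] so that
    attributes on large scales do not dominate the distance."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# Two attributes on very different scales: age and income.
X = np.array([[25, 30_000.0],
              [40, 90_000.0],
              [35, 60_000.0]])
Xn = min_max_normalize(X)

print("raw Euclidean distance :", euclidean(X[0], X[1]))
print("normalized distance    :", euclidean(Xn[0], Xn[1]))
print("raw Manhattan distance :", manhattan(X[0], X[1]))
```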
Applications
▪ Clustering algorithms can be applied in many fields, for instance:
▪ Marketing: finding groups of customers with similar behavior given a large
database of customer data containing their properties and past buying records;
▪ Biology: classification of plants and animals given their features;
▪ Libraries: book ordering;
▪ Insurance: identifying groups of motor insurance policy holders with a high
average claim cost; identifying frauds;
▪ City-planning: identifying groups of houses according to their house type, value
and geographical location;
▪ Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones;
▪ WWW: document classification; clustering weblog data to discover groups of
similar access patterns.
Step-02:
Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
Calculate the distance between each data point and each cluster center.
Step-04:
Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
Advantages
It is relatively efficient, with time complexity O(nkt), where:
▪ n = number of instances
▪ k = number of clusters
▪ t = number of iterations.
It often terminates at a local optimum.
Techniques such as Simulated Annealing or Genetic Algorithms may be used to find the
global optimum.
Disadvantages
It requires the number of clusters (k) to be specified in advance.
It cannot handle noisy data and outliers.
It is not suitable for identifying clusters with non-convex shapes.
PROBLEM
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1| (the Manhattan distance).
Solution:
Iteration-01:
We calculate the distance of each point from the center of each of the three clusters, using the given distance function. The following illustration shows the calculation of the distance between point A1(2, 10) and the center of each of the three clusters.
ρ(A1, C1) = |2 – 2| + |10 – 10| = 0
ρ(A1, C2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
ρ(A1, C3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
Cluster-01:
First cluster contains points - A1(2, 10)
Cluster-02:
Second cluster contains points - A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03:
Third cluster contains points - A2(2, 5), A7(1, 2)
Now, we re-compute the new cluster centers. The new center of a cluster is computed by taking the mean of all the points contained in that cluster.
For Cluster-01: We have only one point A1(2, 10) in Cluster-01, so the cluster center remains the same, C1(2, 10).
For Cluster-02: Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03: Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
Iteration-02:
We calculate the distance of each point from the center of each of the three clusters, using the given distance function. The following illustration shows the calculation of the distance between point A1(2, 10) and the center of each of the three clusters.
Calculating the distance between A1(2, 10) and C1(2, 10): ρ(A1, C1) = |2 – 2| + |10 – 10| = 0
Calculating the distance between A1(2, 10) and C2(6, 6): ρ(A1, C2) = |6 – 2| + |6 – 10| = 4 + 4 = 8
Calculating the distance between A1(2, 10) and C3(1.5, 3.5): ρ(A1, C3) = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7
In a similar manner, we calculate the distance of the other points from the center of each of the three clusters. Next, using the resulting table, we decide which point belongs to which cluster: each point belongs to the cluster whose center is nearest to it.
Cluster-01:
First cluster contains points - A1(2, 10), A8(4, 9)
Cluster-02:
Second cluster contains points - A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03:
Third cluster contains points - A2(2, 5), A7(1, 2)
We re-compute the new cluster centers. The new center of a cluster is computed by taking the mean of all the points contained in that cluster.
For Cluster-01: Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02: Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03: Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
The new cluster centers after this iteration are:
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
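The short script below reproduces the two iterations above from scratch, using the same Manhattan distance function and the same initial centers, so the arithmetic can be checked line by line; it is an illustrative sketch rather than a production k-means implementation.

```python
# Reproduce the worked k-means example with the Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]          # initial centers A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(1, 3):                # two iterations, as above
    clusters = [[] for _ in centers]
    for name, p in points.items():
        # Assign each point to the cluster whose center is nearest.
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(name)
    # Re-compute each center as the mean of its assigned points.
    centers = [
        (sum(points[n][0] for n in c) / len(c),
         sum(points[n][1] for n in c) / len(c))
        for c in clusters
    ]
    print(f"Iteration {iteration}: clusters={clusters} centers={centers}")
```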
In the context of hierarchical clustering, the hierarchy graph is called a dendrogram. Fig. 4.3 shows a sample dendrogram that could be produced by a hierarchical clustering algorithm. Unlike with the k-means algorithm, the number of clusters (k) is not specified in hierarchical clustering. After the hierarchy is built, the user can specify the number of clusters required, from 1 to n. The top level of the hierarchy represents one cluster, or k = 1. To examine more clusters, we simply need to traverse down the hierarchy.
Fig. 4.4 shows a simple hierarchical algorithm. The distance function in this algorithm can
determine similarity of clusters through many methods, including single link and group-
average. Single link calculates the distance between two clusters as the shortest distance
between any two objects contained in those clusters. Group-average first finds the average
values for all objects in the group (i.e., cluster) and calculates the distance between clusters
as the distance between the average values.
Each object in X is initially used to create a cluster containing a single object. These clusters are then repeatedly merged, using the distance function, until the full hierarchy has been built.
Example: Suppose the input to the simple agglomerative algorithm described above is the set X, shown in Fig. 4.5 represented in matrix and graph form. We will use the Manhattan distance function and the single link method for calculating the distance between clusters. The set X contains n = 10 elements, x1 to x10, where x1 = (0,0).
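A hedged sketch of single-link agglomerative clustering with Manhattan ("cityblock") distances, using SciPy; the ten 2-D points are invented stand-ins for the set X of Fig. 4.5 (only x1 = (0, 0) is taken from the text).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Ten invented 2-D points standing in for x1..x10 (x1 = (0, 0) as in the text).
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5],
              [5, 6], [9, 0], [9, 1], [10, 0], [10, 1]])

# Single-link agglomerative clustering with the Manhattan distance.
Z = linkage(X, method="single", metric="cityblock")

# The hierarchy can be cut afterwards into any number of clusters, e.g. k = 3.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy graph
# (it requires matplotlib).
```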
2) Medical Diagnosis
Association rules in medical diagnosis can help physicians diagnose and treat patients.
Diagnosis is a difficult process with many potential errors that can lead to unreliable
results. You can use relational association rule mining to determine the likelihood of illness
based on various factors and symptoms. This application can be further expanded using
3) Census Data
The concept of Association Rule Mining is also used in dealing with the massive amount
of census data. If properly aligned, this information can be used in planning efficient public
services and businesses.
Algorithms of Association Rule Mining
Some of the algorithms which can be used to generate association rules are as follows:
▪ Apriori Algorithm
▪ Eclat Algorithm
▪ FP-Growth Algorithm
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table represented in the table 4.3.
Table 4.3 : Frequency table
Step 3
Apply the same threshold support of 50 percent and consider the product pairs whose frequency exceeds the threshold; in our case, that means a frequency of more than 3.
Step 5
Calculate the frequency of the two itemsets, and you will get the given frequency table
Table 4.4:No of Frequencies
If you implement the threshold assumption, you can figure out that the customers' set of
three products is RPO. We have considered an easy example to discuss the apriori
algorithm in data mining. In reality, you find thousands of such combinations.
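The plain-Python sketch below mimics these counting steps on an invented list of transactions over items R, P, O and M; it simply counts every itemset of size 1, 2 and 3 and keeps those meeting the 50 percent support threshold, so it is a simplified illustration rather than a full Apriori implementation.

```python
from itertools import combinations

# Invented market-basket transactions over items R, P, O, M.
transactions = [
    {"R", "P", "O"}, {"R", "P", "O", "M"}, {"R", "P", "O"},
    {"R", "O", "M"}, {"P", "O", "M"}, {"R", "P"},
]
min_support = 0.5                       # 50 percent threshold, as in the text
min_count = min_support * len(transactions)

def frequent_itemsets(size):
    """Count every itemset of the given size and keep the frequent ones."""
    items = sorted({i for t in transactions for i in t})
    counts = {}
    for candidate in combinations(items, size):
        counts[candidate] = sum(set(candidate) <= t for t in transactions)
    return {c: n for c, n in counts.items() if n >= min_count}

for size in (1, 2, 3):
    print(f"frequent {size}-itemsets:", frequent_itemsets(size))
```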
Data mining techniques have found wide application in supply chain analysis. It is of use
to the supplier in the following ways:
(i) It analyses the process data to manage buyer rating.
(ii) It mines payment data to advantageously update pricing policies.
(iii) Demand analysis and forecasting helps the supplier to determine the optimum
levels of stocks and spare parts.
Coming to the purchaser side in supply chain analysis, data mining techniques help them
by:
(i) Knowing vendor rating to choose the beneficial supplier.
(ii) Analyzing fulfillment data to manage volume purchase contracts.
Enterprise Warehouse
An enterprise warehouse collects all of the information and subjects spanning an entire organization. It provides enterprise-wide data integration. The data is integrated from operational systems and external information providers. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Process Manager
Process managers are responsible for maintaining the flow of data both into and out of
the data warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager
What is ETL?
ETL is a process that extracts the data from different source systems,
then transforms the data (like applying calculations, concatenations, etc.)
and finally loads the data into the Data Warehouse system. Full form of
ETL is Extract, Transform and Load.
Step 1) Extraction
In this step of the ETL architecture, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the data warehouse database, rollback would be a challenge.
Step 2) Transformation
It is one of the important ETL concepts, where you apply a set of functions on the extracted data. Data that does not require any transformation is called direct move or pass-through data.
Types of Loading:
Initial Load — populating all the Data Warehouse tables
Incremental Load — applying ongoing changes as and when needed, periodically.
Full Refresh —erasing the contents of one or more tables and reloading with
fresh data.
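As a toy illustration of the three loading modes, the sketch below uses an in-memory SQLite table as a stand-in for a warehouse table; the table name, columns, and rows are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")             # stand-in for the warehouse
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    """Initial load: populate the (empty) warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(rows):
    """Incremental load: apply only new or changed rows."""
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?)", rows)

def full_refresh(rows):
    """Full refresh: erase the table contents and reload fresh data."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

initial_load([(1, 100.0), (2, 250.0)])
incremental_load([(2, 275.0), (3, 90.0)])      # one update, one new row
full_refresh([(1, 100.0), (2, 275.0), (3, 90.0), (4, 40.0)])
print(conn.execute("SELECT * FROM sales").fetchall())
```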
▪ Load verification
▪ Ensure that the key field data is neither missing nor null.
▪ Test modeling views based on the target tables.
▪ Check that combined values and calculated measures are correct.
▪ Data checks in dimension table as well as history table.
▪ Check the BI reports on the loaded fact and dimension table.
ETL Tools
There are many ETL tools available in the market. Here are some of the most prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier
and faster using an array of enterprise features. It can query different types of
data such as documents, relationships, and metadata. A data warehouse needs to integrate systems that differ widely from one another.
Analysts frequently need to group, aggregate and join data. These OLAP
operations in data mining are resource intensive. With OLAP data can be pre-
calculated and pre-aggregated, making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed so that creating and viewing reports becomes easy.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, with data usually obtained from different and unrelated sources. Using a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.
5.8.3 Roll-up:
Roll-up is also known as “consolidation” or “aggregation.” The Roll-up
operation can be performed in 2 ways
Reducing dimensions
Climbing up concept hierarchy.
Concept hierarchy is a system of grouping things based on their order or level.
5.8.4 Drill-down:
Drill-down is the reverse of roll-up: less detailed data is converted into more detailed data, either by moving down the concept hierarchy or by adding a new dimension. Consider the diagram in figure 5.7: Quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also registered. In this example, the dimension "months" is added.
5.8.5 Slice:
Here, one dimension is selected, and a new sub-cube is created. The diagram in figure 5.8 explains how the slice operation is performed:
5.8.6 Dice:
This operation is similar to slice. The difference is that in dice you select two or more dimensions, resulting in the creation of a sub-cube, as shown in figure 5.9.
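A small pandas sketch of these cube operations on an invented sales table is given below; the dimensions, members, and figures are assumptions for the example. Roll-up climbs the month-to-quarter hierarchy, slice fixes one dimension, and dice filters on two.

```python
import pandas as pd

# Invented sales cube flattened into a table: dimensions plus one measure.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":    ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "location": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai"],
    "item":     ["Mobile", "Mobile", "Modem", "Mobile", "Modem", "Modem"],
    "amount":   [120, 80, 60, 150, 70, 90],
})

# Roll-up: climb the concept hierarchy month -> quarter (aggregate upward).
rollup = sales.groupby("quarter", as_index=False)["amount"].sum()

# Slice: select one member of one dimension, producing a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select members of two (or more) dimensions.
dice = sales[(sales["location"] == "Delhi") & (sales["item"] == "Mobile")]

print(rollup, slice_q1, dice, sep="\n\n")
```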
▪ This kind of OLAP helps to economize the disk space, and it also
remains compact which helps to avoid issues related to access speed and
convenience.
▪ Hybrid OLAP (HOLAP) uses cube technology, which allows faster performance for all types of data.
▪ ROLAP data is updated instantly, and HOLAP users have access to this real-time, instantly updated data, while MOLAP brings cleaning and conversion of data, thereby improving data relevance. This brings the best of both worlds.
Advantages of OLAP
▪ OLAP is a platform for all types of business needs, including planning, budgeting, reporting, and analysis.
▪ Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the month level to the quarter level, and from the quarter level to the year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
▪ Level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the month, quarter, and year levels.
Example 1: Suppose a star schema is composed of a Sales fact table, as shown in Figure 6.5, and several dimension tables connected to it for Time, Branch, Item and Location.
Fact Table
Sales is the Fact table.
Dimension Tables
The Time table has columns for day, month, quarter, year, etc.
The Item table has columns item_key, item_name, brand, type and supplier_type.
The Branch table has columns branch_key, branch_name and branch_type.
The Location table has columns of geographic data, including street, city, state, and country. Unit_Sold and Dollars_Sold are the measures.
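To make the schema concrete, the hedged pandas sketch below builds tiny stand-ins for the Sales fact table and two of its dimension tables and answers a "sales by brand and city" question; every table, column, and value here is invented for illustration and only loosely follows the columns listed above.

```python
import pandas as pd

# Miniature fact table: foreign keys to the dimensions plus the measures.
sales = pd.DataFrame({
    "item_key":     [1, 1, 2, 2],
    "location_key": [10, 20, 10, 20],
    "units_sold":   [5, 3, 7, 2],
    "dollars_sold": [50.0, 30.0, 140.0, 40.0],
})

# Dimension tables describing the keys used in the fact table.
item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["pen", "notebook"],
                     "brand": ["Acme", "Zenith"]})
location = pd.DataFrame({"location_key": [10, 20],
                         "city": ["Chennai", "Pune"]})

# A star-schema query: join the fact table to its dimensions, then aggregate.
report = (sales.merge(item, on="item_key")
               .merge(location, on="location_key")
               .groupby(["brand", "city"], as_index=False)
               [["units_sold", "dollars_sold"]].sum())
print(report)
```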
Store_id   Store_name   Location    Region
16         sunny        Bangalore   W
64         san          Mumbai      S

Product_id   Quantity   Value   Sales_date   Store_id
30           5          3.67    3-Aug-13     16
35           4          5.33    3-Sep-13     16
40           5          2.50    3-Sep-13     64
45           7          5.66    3-Sep-13     16
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Total Sales
Total Quantity Sold
Average Sale Amount
Metadata Repository
A metadata repository is an integral part of a data warehouse system. However, managing metadata poses the following challenges −
▪ Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
▪ There are no industry-wide accepted standards. Data management solution vendors
have narrow focus.
▪ There are no easy and accepted methods of passing metadata.
Before proceeding further, some of the backup types are discussed below.
▪ Complete backup − It backs up the entire database at the same time. This backup
includes all the database files, control files, and journal files.
▪ Partial backup − As the name suggests, it does not create a complete backup of the
database. Partial backup is very useful in large databases because they allow a
strategy whereby various parts of the database are backed up in a round-robin
fashion on a day-to-day basis, so that the whole database is backed up effectively
once a week.
▪ Cold backup − Cold backup is taken while the database is completely shut down.
In multi-instance environment, all the instances should be shut down.
▪ Hot backup − Hot backup is taken when the database engine is up and running. The requirements of hot backup vary from RDBMS to RDBMS.
▪ Online backup − It is quite similar to hot backup.
9.7 Backup the data warehouse
Oracle RDBMS is basically a collection of physical database files. Backup and recovery
problems are most likely to occur at this level. Three types of files must be backed up:
database files, control files, and online redo log files. If you omit any of these files, you
have not made a successful backup of the database.
Cold backups shut down the database. Hot backups take backups while the database is
functioning. There are also supplemental backup methods, such as exports. Each type of backup has its own advantages and limitations.
Putting in place a backup and recovery plan for data warehouses is imperative. Even
though most of the data comes from operational systems originally, you cannot always
rebuild data warehouses in the event of a media failure (or a disaster). As operational data
ages, it is removed from the operational databases, but it may still exist in the data
warehouse. Furthermore, data warehouses often contain external data that, if lost, may
have to be purchased.
Some online backup services allow any number of computers to be backed up to a single account, usually for a small one-time fee per additional computer, with all computers sharing a common pool of storage space. Keeping a local backup in addition allows data to be saved well in excess of the storage purchased for online backup; other backup products are designed specifically for local backup.
Create a Hot Backup
The best solution for critical systems is always to restore the backup once it has been made
onto another machine. This will also provide you with a “hot” backup that is ready to be put into service if the primary machine fails.
10.1 RECOVERY STRATEGIES
Various Testing Strategies
Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse −
▪ Unit testing
▪ Integration testing
▪ System testing
Data is loaded from a growing number of diverse sources across the enterprise to create
larger, richer assemblages of both text and numerical information.
Data is loaded into either a high-volume test area or into the user acceptance testing (UAT) environment.
Regression testing: ensures that existing functionality remains intact each time a new
release of ETL code and data is completed.
Performance, load, and scalability tests: ensure that data loads and queries perform within
expected periods and that the technical architecture is scalable.
Acceptance testing: includes verifications of data model completeness to meet the
reporting needs of the project, reviewing table designs, validation of data to be loaded in
the production data warehouse, a review of the periodic data upload procedures, and finally
application reports.
Verifications that need a strategy: For the reason that data warehouse testing is different
from most software testing, a best practice is to break the testing and validation process
into several well-defined, high-level focal areas for data warehouse projects. Doing so
allows targeted planning for each focus area, such as integration and data validation.
Data validations: includes reviewing the ETL mapping encoded in the ETL tool as well as
reviewing samples of the data loaded into the test environment.
Data integration tests: tasks include reviewing and accepting the logical data model
captured with a data modeling tool (e.g., ERWin), and converting the models to actual physical database tables.
Each recovery model addresses a different need. Knowledgeable administrators can use
this recovery model feature to significantly speed up data loads and bulk operations.
However, the amount of exposure to data loss varies with the model chosen. It is imperative
that the risks be thoroughly understood before choosing a recovery model.
Each recovery model addresses a different need. Trade-offs are made depending on the
model you choose. The trade-offs that occur pertain to performance, space utilization (disk
or tape), and protection against data loss. When you choose a recovery model, you are
deciding among the following business requirements:
▪ Performance of large-scale operations (for example, index creation or bulk loads)
▪ Data loss exposure (for example, the loss of committed transactions)
▪ Transaction log space consumption
▪ Simplicity of backup and recovery procedures
Depending on what operations you are performing, one model may be more appropriate
than another. Before choosing a recovery model, consider the impact it will have. The
following table provides helpful information.
▪ Preventive: Ensuring your systems are as secure and reliable as possible, using
tools and techniques to prevent a disaster from occurring in the first place. This
may include backing up critical data or continuously monitoring environments for
configuration errors and compliance violations.
▪ Detective: For rapid recovery, you’ll need to know when a response is necessary.
These measures focus on detecting or discovering unwanted events as they happen
in real time.
▪ Corrective: These measures are aimed at planning for potential DR scenarios,
ensuring backup operations to reduce impact, and putting recovery procedures into
action to restore data and systems quickly when the time comes.
10.4.2 Types of disaster recovery
The types of disaster recovery you’ll need will depend on your IT infrastructure, the type
of backup and recovery you use, and the assets you need to protect. Here are some of the
most common technologies and techniques used in disaster recovery:
▪ Backups: With backups, you back up data to an offsite system or ship an external
drive to an offsite location. However, backups do not include any IT infrastructure,
so they are not considered a full disaster recovery solution.
▪ Backup as a service (BaaS): Similar to remote data backups, BaaS solutions
provide regular data backups offered by a third-party provider.
▪ Disaster recovery as a service (DRaaS): Many cloud providers offer DRaaS, along
with cloud service models like IaaS and PaaS. A DRaaS service model allows you
to back up your data and IT infrastructure and host them on a third-party provider’s
cloud infrastructure. During a crisis, the provider will implement and orchestrate
your DR plan to help recover access and functionality with minimal interruption to
operations.
▪ Point-in-time snapshots: Also known as point-in-time copies, snapshots replicate
data, files, or even an entire database at a specific point in time. Snapshots can be used to restore data, provided the copies are stored in a location unaffected by the disaster.
**************