Data Warehouse and Data Mining - Unit 1
LIFECYCLE OF DATA
The data life cycle provides a high-level overview of the stages involved in the successful management
and preservation of data for use and reuse. Multiple versions of a data life cycle exist, with
differences attributable to variation in practices across domains or communities. The data life cycle
is often described as a cycle because the lessons learned and insights gleaned from one data project
typically inform the next. In this way, the final step of the process feeds back into the first.
No two data projects are identical; each brings its own challenges, opportunities, and potential
solutions that impact its trajectory. Nearly all data projects, however, follow the same basic life
cycle from start to finish. This life cycle can be split into eight common stages, steps, or phases:
Generation, Collection, Processing, Storage, Management, Analysis, Visualization, and Interpretation.
Figure: The data life cycle, drawn as a loop of eight stages: 1. Generation, 2. Collection,
3. Processing, 4. Storage, 5. Management, 6. Analysis, 7. Visualization, 8. Interpretation.
Introduction to Data Warehousing (Chapter 1)
2. Collection
Not all of the data that's generated every day is collected or used. It's up to your data team to
identify what information should be captured and the best means for doing so, and what data is
unnecessary or irrelevant to the project at hand. We can collect data in a variety of ways,
including:
• Forms: Web forms, client or customer intake forms, vendor forms, and human resources
applications are some of the most common ways businesses generate data.
• Surveys: Surveys can be an effective way to gather vast amounts of information from a large
number of respondents.
• Interviews: Interviews and focus groups conducted with customers, users, or job applicants
offer opportunities to gather qualitative and subjective data that may be difficult to capture
through other means.
• Direct Observation: Observing how a customer interacts with your website, application, or
product can be an effective way to gather data that may not be offered through the methods
above.
It's important to note that many organizations take a broad approach to data collection,
capturing as much data as possible from each interaction and storing it for potential use.
3. Processing
Once data has been collected, it must be processed. Data processing can refer to various
activities, including:
• Data wrangling, in which a data set is cleaned and transformed from its raw form into
something more accessible and usable. This is also known as data cleaning or data
remediation.
• Data compression, in which data is transformed into a format that can be more
efficiently stored.
• Data encryption, in which data is translated into another form of code to protect it
from privacy concerns.
Even the simple act of taking a printed form and digitizing it can be considered a form of data
processing.
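The first two processing activities can be illustrated with a minimal Python sketch. The records, the field names, and the wrangle helper are hypothetical and only serve the illustration; compression uses the standard zlib module. Encryption is left out because it would normally rely on a dedicated cryptography library.

```python
import json
import zlib

# Hypothetical raw records, as they might arrive from a web form.
raw_records = [
    {"name": "  Aarav ", "age": "21"},
    {"name": "Umesh", "age": ""},        # missing age
    {"name": "  Aarav ", "age": "21"},   # duplicate entry
]

def wrangle(records):
    """Trim whitespace, convert types, and drop incomplete or duplicate rows."""
    cleaned, seen = [], set()
    for rec in records:
        name = rec["name"].strip()
        age = rec["age"].strip()
        if not name or not age.isdigit():
            continue                      # incomplete record
        row = (name, int(age))
        if row in seen:
            continue                      # duplicate record
        seen.add(row)
        cleaned.append({"name": name, "age": int(age)})
    return cleaned

cleaned = wrangle(raw_records)

# Compression: serialize the cleaned data, then store the bytes more compactly.
payload = json.dumps(cleaned * 100).encode("utf-8")
compressed = zlib.compress(payload)
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))
```

Compression here is lossless: decompressing returns exactly the data that was stored.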
4. Storage
After data has been collected and processed, it must be stored for future use. This is most
commonly achieved through the creation of databases or datasets. These datasets may then be
stored in the cloud, on servers, or using another form of physical storage like a hard drive,
CD, cassette, or floppy disk.
When determining how to best store data for your organization, it's important to build in a
certain level of redundancy to ensure that a copy of your data will be protected and accessible,
even if the original source becomes corrupted or compromised.
5. Management
Data management, also called database management, involves organizing, storing, and
retrieving data as necessary over the life of a data project. While referred to here as a step, it's
an ongoing process that takes place from the beginning through the end of a project. Data
management includes everything from storage and encryption to implementing access logs
and change logs that track who has accessed data and what changes they may have made.
6. Analysis
Data analysis refers to processes that attempt to glean meaningful insights from raw data.
Analysts and data scientists use different tools and strategies to conduct these analyses. Some
of the more commonly used methods include statistical modeling, algorithms, artificial
intelligence, data mining, and machine learning. Exactly who performs an analysis depends
on the specific challenge being addressed, as well as the size of your organization's data team.
Business analysts, data analysts, and data scientists can all play a role.
7. Visualization
Data visualization refers to the process of creating graphical representations of your
information, typically through the use of one or more visualization tools. Visualizing data
makes it easier to quickly communicate your analysis to a wider audience both inside and
outside your organization. The form your visualization takes depends on the data you're
working with, as well as the story you want to communicate. While technically not a required
step for all data projects, data visualization has become an increasingly important part of the
data life cycle.
8. Interpretation
Finally, the interpretation phase of the data life cycle provides the opportunity to make sense
of your analysis and visualization. Beyond simply presenting the data, this is when you
investigate it through the lens of your expertise and understanding. Your interpretation may
not only include a description or explanation of what the data shows but, more importantly,
what the implications may be.
1. Record data
a) Data Matrix
b) Document Data
c) Transaction Data
2. Graph-based data
a) Linked web pages
b) Benzene Molecular Structures
3. Ordered data
a) Sequential Data
b) Genetic Sequence Data
c) Temporal Data
d) Spatial Data
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes, is
called record data. The most basic form of record data has no explicit relationship among records.
Transaction data is a special type of record data in which each record contains a set of items. For
example, consider shopping in a supermarket or a grocery store. For any particular customer, a record
contains the set of items purchased by the customer during that visit to the store. This type of data
is called Market Basket Data. Transaction data is a collection of sets of items, but it can also be
viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are
binary, indicating whether or not an item was purchased.
Tid   Items
1     Pencil, Paper
2     Pencil, Book, Rubber, Ink
3     Paper, Book, Rubber, Ruler
4     Pencil, Paper, Book, Rubber
5     Pencil, Paper, Book, Ruler
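The view of transaction data as records with binary asymmetric attributes can be sketched in Python using the five transactions from the table. The variable names are illustrative only.

```python
# Market basket data from the table above: transaction id -> items purchased.
transactions = {
    1: ["Pencil", "Paper"],
    2: ["Pencil", "Book", "Rubber", "Ink"],
    3: ["Paper", "Book", "Rubber", "Ruler"],
    4: ["Pencil", "Paper", "Book", "Rubber"],
    5: ["Pencil", "Paper", "Book", "Ruler"],
}

# One binary attribute per distinct item, in alphabetical order.
items = sorted({item for basket in transactions.values() for item in basket})

# Each record becomes a row of 0/1 flags: 1 means the item was purchased.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}
```

Only the 1 entries carry information about a purchase, which is why such attributes are called asymmetric.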
The Sparse Data Matrix / Document-Data Matrix
A sparse data matrix (sometimes also called document-data matrix) is a special case of a data matrix
in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are
important.
             Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document 1     3     0      5     0     2      6     0    2      0       2
Document 2     0     7      0     2     1      0     0    3      0       0
Document 3     0     1      0     0     1      3     2    0      4       0
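Because only non-zero values matter, a document-data matrix is usually stored sparsely. The sketch below is one possible Python representation; the order of the term columns is an assumption reconstructed from the table header, and the dictionary-of-dictionaries layout is just one of several common sparse formats.

```python
# Term columns (order assumed from the table header).
terms = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

# Dense rows: one count per term for each document.
dense = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 3, 2, 0, 4, 0],
}

# Sparse form: keep only the non-zero counts, keyed by term.
sparse = {
    doc: {term: count for term, count in zip(terms, row) if count != 0}
    for doc, row in dense.items()
}

stored_values = sum(len(entries) for entries in sparse.values())
total_cells = len(terms) * len(dense)
```

For real document collections with thousands of terms, the share of non-zero cells is typically far smaller than in this toy table, so the saving is much larger.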
If objects have structure, that is, the objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs. For example, the structure of chemical compounds can
be represented by a graph, where the nodes are atoms and the links between nodes are chemical
bonds.
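As a small illustration, the benzene molecule (C6H6) can be written as a graph in Python. This is a simplified sketch: the atom labels C1..C6 and H1..H6 are made up for the example, and bond orders are ignored, so every bond is just an undirected edge.

```python
from collections import defaultdict

# Six carbon atoms form a ring; each carbon is also bonded to one hydrogen.
bonds = [(f"C{i}", f"C{i % 6 + 1}") for i in range(1, 7)]
bonds += [(f"C{i}", f"H{i}") for i in range(1, 7)]

# Adjacency sets: node (atom) -> neighbouring atoms it is bonded to.
graph = defaultdict(set)
for a, b in bonds:
    graph[a].add(b)
    graph[b].add(a)
```

Linked web pages can be modelled the same way, with pages as nodes and hyperlinks as (directed) edges.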
Ordered Data
Ordered data set records are kept in a physical sequence based on a user-specified key without the
necessity of utilizing a set. Ordered data sets can be either disjoint or embedded, but are normally
embedded. For some types of data, the attributes have relationships that involve order in time or
space. Ordered data can be divided into four types:
Sequential Data
Whenever the points in a dataset depend on the other points in the dataset, the data is said to be
sequential data. More generally, sequential data is any kind of data in which the order matters. A
common example is a time series, such as a stock price or sensor data, where each point represents
an observation at a certain point in time. A time series is a sequence taken at successive, equally
spaced points in time, but it is not the only case of sequential data. Consider a retail transaction
data set that also stores the time at which each transaction took place:
Time   Customer   Items Purchased
T1     Aarav      Bag, Book
T2     Umesh      Bag, Pen
T2     Aarav      Pen, Copy
T3     Aadesh     Bag, Copy
T4     Aadesh     Doll
T5     Aarav      Bag, Doll
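The order in such data becomes visible once the transactions are grouped into one time-ordered sequence per customer. A minimal Python sketch, with the time stamps T1..T5 encoded as the integers 1..5 (an encoding chosen only for this example):

```python
from collections import defaultdict

# (time, customer, items purchased), taken from the table above.
rows = [
    (1, "Aarav", ["Bag", "Book"]),
    (2, "Umesh", ["Bag", "Pen"]),
    (2, "Aarav", ["Pen", "Copy"]),
    (3, "Aadesh", ["Bag", "Copy"]),
    (4, "Aadesh", ["Doll"]),
    (5, "Aarav", ["Bag", "Doll"]),
]

# Build one purchase sequence per customer, ordered by time.
sequences = defaultdict(list)
for time, customer, items in sorted(rows, key=lambda r: r[0]):
    sequences[customer].append((time, items))
```

Shuffling the rows of an ordinary record data set loses nothing, whereas here the time order is part of the information.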
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Figure 1.4: Genomic sequence data
Time Series Data (Temporal Data)
Fiscal Year   NEPSE Index
2065/66        749.1
2066/67        477.73
2067/68        362.85
2068/69        389.74
2069/70        518.33
2070/71       1036.1
2071/72        961.2
2072/73       1718.2
2073/74       1582.67
2074/75       1200.09
2075/76*      1102.64
*Nepse Index is of Falgun 21, 2072
Spatial Data
Spatial data, also known as geospatial data, is information about a physical object that can be
represented by numerical values in a geographic coordinate system. Generally speaking, spatial data
represents the location, size and shape of an object on planet Earth, such as a building, lake, or
mountain.
Figure 1.6: Spatial data of Total Precipitable Water (TPW) in the atmosphere over the globe.
As the volume of data increases day by day, the traditional ways and methods used to manage and
manipulate data are becoming obsolete. To overcome this problem, we need a more effective and
advanced data storage system: the data warehouse. A warehouse in general terms is a historic
repository of information collected from multiple sources, stored under a unified schema, and
usually residing at a single site. A data warehouse stores the historical data of an organization so
that the organization can analyze its performance over past time (days, weeks, months or years) and
plan for the future.
A data warehouse may contain multiple databases. Within each database, data is organized into
tables and columns. Within each column, you can define a description of the data, such as integer,
data field, or string. Tables can be organized inside of schemas, which you can think of as folders.
When data is ingested, it is stored in various tables described by the schema. Query tools use the
schema to determine which data tables to access and analyze.
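The way tables, columns, and column types are described by a schema can be tried out with Python's built-in sqlite3 module. This is only a small stand-in for a real warehouse: SQLite has no folder-like schemas, and the table and column names below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a table; each column carries a description of its data (its type).
conn.execute(
    "CREATE TABLE sales ("
    " sale_id INTEGER,"
    " sale_date TEXT,"
    " product TEXT,"
    " amount REAL)"
)

# Ingested data is stored in the table described by the schema.
conn.execute("INSERT INTO sales VALUES (1, '2075-01-15', 'TV', 605.0)")

# A query tool can read the schema to learn which columns exist and their types.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales)")]
```

PRAGMA table_info is specific to SQLite; other database systems expose the same kind of metadata through their own catalog views.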
Figure: Overview of a data warehouse. Data sources (operational databases) feed the data warehouse,
which holds metadata, summary data, and raw data; users access it for analysis and data mining.
A data warehouse (DW) is a digital storage system that connects and harmonizes large amounts of
data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and
analytics, and support regulatory requirements, so companies can turn their data into insight and
make smart, data-driven decisions. Data warehouses store current and historical data in one place
and act as the single source of truth for an organization.
Data warehousing is the process of constructing and using data warehouses. It is the process of
extracting and transferring operational data into informational data and loading it into a central
data store (warehouse).
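The extract, transform, and load steps described above can be sketched in a few lines of Python. Everything here is hypothetical: the two source systems, their field names, and the sales_fact table are invented for the illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational data from two source systems with different formats.
pos_rows = [{"id": 1, "amount": "605.00", "city": "mahendranagar"}]
web_rows = [{"order": 7, "total_paisa": 82500, "city": "Mahendranagar "}]

def extract():
    """Read the rows from each operational source."""
    return pos_rows, web_rows

def transform(pos, web):
    """Bring both sources into one unified format (rupees, clean city names)."""
    unified = []
    for r in pos:
        unified.append(("pos", r["id"], float(r["amount"]), r["city"].strip().title()))
    for r in web:
        unified.append(("web", r["order"], r["total_paisa"] / 100, r["city"].strip().title()))
    return unified

def load(conn, rows):
    """Write the unified rows into the central store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact"
        " (source TEXT, source_id INTEGER, amount REAL, city TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))
total = warehouse.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
cities = [row[0] for row in warehouse.execute("SELECT DISTINCT city FROM sales_fact")]
```

After the transform step both sources agree on units and spelling, which is what makes the data in the warehouse consistent.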
Figure 1.1: OLTP versus data warehouse (figure labels: Read, Add/Change/Delete, Read).
The Operational Database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data frequently changes as
updates are made and reflects the current value of the last transactions. Operational Database
Management Systems, also called OLTP (Online Transaction Processing) databases, are used to
manage dynamic data in real time.
Data warehouse systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.
Some major differences between Data Warehouses and Operational Database Systems are tabulated
below.
Operational Database | Data Warehouse
Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems are mainly updated regularly according to need. | Non-volatile; new data may be added regularly. Once added, data is rarely changed.
It is designed for real-time business dealing and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and high, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions; uses validation data tables. | Loaded with consistent, valid information; requires no real-time validation.
It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
A small amount of data is accessed. | A large amount of data is accessed.
Relational databases are created for On-Line Transactional Processing (OLTP). | Data warehouses are designed for On-Line Analytical Processing (OLAP).
It has a normalized schema. | It has a de-normalized schema.
The E-R model is used for designing. | The star or snowflake model is used for designing.
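The difference in access pattern between the two kinds of system can be seen in a short Python sketch using sqlite3. It is only a toy: a single in-memory table with hypothetical names plays both roles, whereas in practice OLTP and OLAP workloads run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders"
    " (order_id INTEGER PRIMARY KEY, city TEXT, quarter TEXT, amount REAL)"
)

# OLTP style: many short transactions, each adding or changing a single row.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 'Dhangadi', 'Q1', 818)")
with conn:
    conn.execute("INSERT INTO orders VALUES (2, 'Dhangadi', 'Q2', 894)")
with conn:
    conn.execute("INSERT INTO orders VALUES (3, 'Mahendranagar', 'Q1', 605)")
with conn:
    conn.execute("UPDATE orders SET amount = 900 WHERE order_id = 2")

# OLAP style: one read-only query that scans and aggregates many rows.
summary = conn.execute(
    "SELECT city, SUM(amount), COUNT(*) FROM orders GROUP BY city ORDER BY city"
).fetchall()
```

The transactional statements each touch one row identified by its key, while the analytical query reads every row and returns a summary per city.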
The multidimensional data model in a data warehouse is a model that represents data in the form of
data cubes. It allows data to be modeled and viewed in multiple dimensions, and it is defined by
dimensions and facts. A multidimensional data model is generally organized around a central theme
and represented by a fact table. It is typically used in organizations for drawing out analytical
results and generating reports, which can be used as the main source for important decision-
making processes. This model is typically applied to systems that operate with OLAP techniques
(Online Analytical Processing).
The Multi-Dimensional Data Model is a significant improvement in various areas of data science,
such as data warehouse systems and data management techniques. Multi-dimensional models are
found to be competent relational systems, which can serve as a key input for generating analytical
outcomes for business decision-making processes.
Now suppose we want to view the sales data in three dimensions: time, product and location. Time is
considered for four quarters, i.e., Q1, Q2, Q3, and Q4. Four products are considered, i.e.,
Television (TV), Personal Computer (PC), Access Point (AP), and Solid-State Drive (SSD). The
location is considered for the cities Pokhara, Kawasoti, Dhangadi, and Mahendranagar. These 3D data
are shown in the table below. The 3D data of the table are represented as a series of 2D tables.
Table 1.1: 3D view of sales data according to time, product and location
(? marks a value that is not legible in the source)

Location = "Pokhara"
Time     TV     PC    AP   SSD
Q1      854    882    89   623
Q2        ?    890    64   698
Q3        ?      ?    58   788
Q4     1129      ?    63   870

Location = "Kawasoti"
Time     TV     PC    AP   SSD
Q1     1087    968    38   872
Q2     1130   1024    41   925
Q3     1034   1048    45  1002
Q4     1142   1081    54   984

Location = "Dhangadi"
Time     TV     PC    AP   SSD
Q1      818    746    43   591
Q2      894    769    52   682
Q3      940    795    58   728
Q4      978    864    59   784

Location = "Mahendranagar"
Time     TV     PC    AP   SSD
Q1      605    825    14   400
Q2      680    952    31   512
Q3      812   1023    30   501
Q4      927   1038    38   580
Figure 1.9: Multidimensional Data Model (3D data cube of sales data). The axes of the cube are time
(quarters), product (types), and location (cities).
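A data cube of this kind can be sketched in Python as a dictionary whose keys are (time, product, location) coordinates. This is a minimal illustration, not how an OLAP server stores cubes internally; it uses only the Dhangadi and Mahendranagar figures from Table 1.1, and the names PRODUCTS, table, and cube are chosen for the example.

```python
PRODUCTS = ["TV", "PC", "AP", "SSD"]

# The 3D data written as a series of 2D tables: location -> quarter -> units.
table = {
    "Dhangadi": {
        "Q1": [818, 746, 43, 591],
        "Q2": [894, 769, 52, 682],
        "Q3": [940, 795, 58, 728],
        "Q4": [978, 864, 59, 784],
    },
    "Mahendranagar": {
        "Q1": [605, 825, 14, 400],
        "Q2": [680, 952, 31, 512],
        "Q3": [812, 1023, 30, 501],
        "Q4": [927, 1038, 38, 580],
    },
}

# The same data as a cube: each cell is addressed by one value per dimension.
cube = {
    (quarter, product, location): units
    for location, quarters in table.items()
    for quarter, row in quarters.items()
    for product, units in zip(PRODUCTS, row)
}
```

Each cell holds a fact (units sold), and each part of the key is a value of one dimension.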
Working Mechanism of Multidimensional Data Model
Like any other system, the Multidimensional Data Model works through a set of predetermined
steps, in order to keep the pattern the same throughout the industry and to enable the reuse of
already designed or created database systems. To create a Multidimensional Data Model, every
project should go through the phases below:
• Congregating the requirements from the client
Similar to other software applications, a Data Model also requires precise
requirements from the client. Most of the time, the client might not know what could be
accomplished with the selected technology. It is the software professional's duty to
provide clarity on to what extent a requirement can be achieved with the selected
technology, and to collect the complete requirements in detail.
• Categorizing the various modules of the system
After the process of collecting the entire requirement, the next step is to identify and
categorize each of the requirements under the module where they belong. Modularity
helps in better management, and also makes it trouble-free to implement, one at a time.
• Spotting the various dimensions based on which the system needs to be designed
Once the separation of various requirements and moving them to the matching
modules are completed, the next step is to identify the main factors, from the user's
point of view. These factors can be termed as the dimensions, based on which the
multidimensional data model can be created.
• Drafting the real-time dimensions and the corresponding properties
As a part of next step, in the process of the Multi-Dimensional Data Model, the dimensions
identified in the previous step can be further used for recognizing the related properties.
These properties are termed as the 'attributes' in the database systems.
• Discovering the facts from the already listed dimensions and their properties
From the initial requirement gathering, the dimensions can be a mix of dimensions and
facts. It is a significant step to distinguish and segregate the facts from the dimensions.
These facts play a great role in the structure of the Multi-Dimensional Data Models.
• Constructing the Schema to place the data, with respect to the information gathered
from the above steps:
Based on the information collected so far, the elaborate requirements, the dimensions,
the facts, and their respective attributes, a Schema can be constructed. There are many
types of Schemas, from which the most suitable type of schema can be chosen. A few of
the commonly used schema types are the Star Schema, the Galaxy Schema, and the
Snowflake Schema.
Advantages
• Overall, the organizational capacity and structural definition of the Multi-Dimensional
Data Model aid in holding cleaner and more reliable data in the database.
• A clearly defined construction of the data placements makes things uncomplicated in
situations where one team constructs the database, another team works on it, and some
other team works on the maintenance. It serves as a self-learning system as and when
required.
• As the system is fresh and free of junk, the efficiency of the data and the performance of
the database system are found to be advanced and elevated.
Disadvantages
• As the Multi-Dimensional Data Model handles complex systems, these types of
databases are typically complex in nature.
• Being a complex system means the contents of the database are huge in amount as
well. This makes the system highly risky when there is a security breach.
• When the system caches due to the operations on the Multi-Dimensional Data Model,
the performance of the system is affected greatly.
• Though the end product in a Multi-Dimensional Data Model is advantageous, the path
to achieving it is intricate most of the time.
Roll-Up
Figure: Roll-up operation on sales data. The location dimension is rolled up from cities to
provinces (Gandaki and Sudurpashchim); the other axes remain time (quarters) and product (types).
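A roll-up along the location dimension can be sketched in Python as a group-by over a concept hierarchy. The sketch uses the Q1 figures for TV and PC from Table 1.1; the assignment of cities to provinces is an assumption based on the province names that appear in the figure, and roll_up is a hypothetical helper.

```python
from collections import defaultdict

# Concept hierarchy for location: city -> province (assumed mapping).
CITY_TO_PROVINCE = {
    "Pokhara": "Gandaki",
    "Kawasoti": "Gandaki",
    "Dhangadi": "Sudurpashchim",
    "Mahendranagar": "Sudurpashchim",
}

# Q1 sales at the city level: (product, city) -> units sold.
q1_sales = {
    ("TV", "Pokhara"): 854, ("PC", "Pokhara"): 882,
    ("TV", "Kawasoti"): 1087, ("PC", "Kawasoti"): 968,
    ("TV", "Dhangadi"): 818, ("PC", "Dhangadi"): 746,
    ("TV", "Mahendranagar"): 605, ("PC", "Mahendranagar"): 825,
}

def roll_up(cells, hierarchy):
    """Aggregate city-level cells to the next level of the location hierarchy."""
    totals = defaultdict(int)
    for (product, city), units in cells.items():
        totals[(product, hierarchy[city])] += units
    return dict(totals)

by_province = roll_up(q1_sales, CITY_TO_PROVINCE)
```

The result has fewer, coarser cells: eight city-level values collapse into four province-level totals.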
Drill-Down
The drill-down operation (also called roll-down) is the reverse operation of roll-up. Drill-down is
like zooming in on the data cube. It navigates from less detailed data to more detailed data.
Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding
additional dimensions. The following diagram illustrates how drill-down works when drilling down
on time from quarters to months:
Figure 1.11: Illustration of drill-down operation on sales data. The time axis of the cube is
expanded from quarters (Q1 to Q4) to months (January to December); the product axis (TV, PC, AP,
SSD) is unchanged.
Drill-down is performed by stepping down a concept hierarchy for the dimension time. Initially the
concept hierarchy was "day < month < quarter < year." On drilling down, the time dimension is
descended from the level of quarter to the level of month.
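Stepping down the time hierarchy can be sketched in Python as follows. The monthly figures are hypothetical: they are invented so that they add up to the quarterly TV values for Mahendranagar (605 for Q1 and 680 for Q2), since the source gives no monthly data. The helper names are also made up for the example.

```python
# Concept hierarchy for time: month -> quarter.
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Hypothetical monthly TV sales (detailed level of the hierarchy).
monthly_tv = {"Jan": 200, "Feb": 195, "Mar": 210, "Apr": 220, "May": 230, "Jun": 230}

def quarter_view(monthly):
    """Less detailed view: totals per quarter."""
    totals = {}
    for month, units in monthly.items():
        quarter = MONTH_TO_QUARTER[month]
        totals[quarter] = totals.get(quarter, 0) + units
    return totals

def drill_down(monthly, quarter):
    """More detailed view: the monthly cells that make up one quarter."""
    return {m: u for m, u in monthly.items() if MONTH_TO_QUARTER[m] == quarter}

quarters = quarter_view(monthly_tv)
q1_months = drill_down(monthly_tv, "Q1")
```

Drill-down is only possible when the detailed data is kept; the quarterly totals alone cannot be split back into months.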
Slice
The slice operation performs a selection on one particular dimension of a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works when slicing for the first
quarter, i.e., Q1.
Figure: Slice operation on sales data for time = "Q1". The resulting sub-cube is a 2D table of
location by product:
Location        TV     PC    AP   SSD
Pokhara         854    882   89   623
Kawasoti       1087    968   38   872
Dhangadi        818    746   43   591
Mahendranagar   605    825   14   400
Here, slice is performed for the dimension "time" using the criterion time = "Q1". It forms a new
sub-cube that contains only the Q1 data for the remaining dimensions, product and location.
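The slice for time = "Q1" can be sketched in Python on a small cube keyed by (time, product, location). Only a few cells from Table 1.1 are included, and slice_cube is a hypothetical helper name.

```python
# A small part of the sales cube: (time, product, location) -> units sold.
cube = {
    ("Q1", "TV", "Dhangadi"): 818, ("Q1", "PC", "Dhangadi"): 746,
    ("Q1", "TV", "Mahendranagar"): 605, ("Q1", "PC", "Mahendranagar"): 825,
    ("Q2", "TV", "Dhangadi"): 894, ("Q2", "PC", "Dhangadi"): 769,
    ("Q2", "TV", "Mahendranagar"): 680, ("Q2", "PC", "Mahendranagar"): 952,
}

def slice_cube(cube, time):
    """Fix the time dimension to one value; the result has one dimension less."""
    return {
        (product, location): units
        for (t, product, location), units in cube.items()
        if t == time
    }

q1 = slice_cube(cube, "Q1")
```

The sliced result is addressed by (product, location) only, because the time dimension has been fixed to a single value.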
Dice
The dice operation performs a selection on two or more dimensions of a given cube and provides a
new sub-cube. Consider the following diagram that shows the dice operation.
Figure: Dice operation on sales data. The full cube (time Q1 to Q4, products TV, PC, AP, SSD) is
reduced to a sub-cube covering time Q1 and Q2 and products TV and PC.
The dice operation on the cube is based on the following selection criteria, which involve three
dimensions:
• (location = "Mahendranagar" or "Dhangadi") and
• (time = "Q1" or "Q2") and
• (item = "TV" or "PC").
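These three criteria can be applied to a cube with a short Python sketch. The cube holds only a handful of cells from Table 1.1, and dice is a hypothetical helper name.

```python
# A few cells of the sales cube: (time, item, location) -> units sold.
cube = {
    ("Q1", "TV", "Dhangadi"): 818,
    ("Q1", "PC", "Dhangadi"): 746,
    ("Q1", "AP", "Dhangadi"): 43,
    ("Q2", "TV", "Dhangadi"): 894,
    ("Q3", "TV", "Dhangadi"): 940,
    ("Q1", "TV", "Mahendranagar"): 605,
    ("Q2", "PC", "Mahendranagar"): 952,
    ("Q1", "TV", "Kawasoti"): 1087,
}

def dice(cube, times, items, locations):
    """Keep only the cells that satisfy the selection on all three dimensions."""
    return {
        key: units
        for key, units in cube.items()
        if key[0] in times and key[1] in items and key[2] in locations
    }

sub_cube = dice(
    cube,
    times={"Q1", "Q2"},
    items={"TV", "PC"},
    locations={"Mahendranagar", "Dhangadi"},
)
```

Unlike a slice, the diced sub-cube keeps all three dimensions; each dimension is merely restricted to a subset of its values.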
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot operation
between the location and product dimensions.
Figure: Pivot operation on the Q1 slice of the sales data. The view with locations as rows and
products (TV, PC, AP, SSD) as columns is rotated so that products become the rows and locations
become the columns.
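For a 2D slice, the pivot amounts to swapping rows and columns. A minimal Python sketch using the Q1 values from Table 1.1 (the variable names are chosen for the example):

```python
locations = ["Pokhara", "Kawasoti", "Dhangadi", "Mahendranagar"]
products = ["TV", "PC", "AP", "SSD"]

# Q1 slice with one row per location and one column per product.
q1_by_location = [
    [854, 882, 89, 623],    # Pokhara
    [1087, 968, 38, 872],   # Kawasoti
    [818, 746, 43, 591],    # Dhangadi
    [605, 825, 14, 400],    # Mahendranagar
]

# Pivot (rotate): rows become products, columns become locations.
q1_by_product = [list(column) for column in zip(*q1_by_location)]
pivoted = dict(zip(products, q1_by_product))
```

No value changes during a pivot; only the arrangement of the axes in the presentation is different.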
The conceptual data model is a structured business view of the data required to support business
processes, record business events, and track related performance measures. This model focuses on
identifying the data used in the business but not its processing flow or physical characteristics. It
is a concise description of the user's data requirements without taking into account implementation
details. Conventional databases are generally designed at the conceptual level using some variation
of the well-known entity-relationship (ER) model, although the Unified Modeling Language (UML) is
being increasingly used. Conceptual schemas can be easily translated to the relational model by
applying a set of mapping rules. Providing extensions to the ER and the UML models is not really a
solution to the problem, since ultimately they represent a reflection and visualization of the
underlying relational technology concepts and, in addition, reveal their own problems. Therefore,
conceptual data warehousing modeling requires a model that clearly stands above the logical level.
A data warehouse conceptual data model is nothing but the highest-level relationship between the
different entities (in other words, different tables) in the data model.
Following are the features of the conceptual data model:
• This is the initial or high-level relation between different entities in the data model.
• The conceptual model includes the important entities and the relationships among them.
• In the data warehouse conceptual data model we do not specify any attributes of the
entities.
• We also do not define any primary key yet.
The figure 1.15 is an example of a conceptual data model.
Figure 1.15: Example of a conceptual data model. The entities Patient, Date, and Hospital are each
related to a central Fact entity.
From the above figure (see figure 1.15) you can see that the data warehouse conceptual model
describes only the high-level relationships between the entities.
Schemas for Multidimensional Data Models
A schema is a logical description that describes the entire database. In the data warehouse it
includes the name and description of records. It has all data items and also the different
aggregates associated with the data. Just as a database has a schema, it is required to maintain a
schema for a data warehouse as well. There are different schemas based on the setup and the data
which are maintained in a data warehouse.
Fact tables and dimension tables form the basis of any schema in the data warehouse, so it is
important to understand them. A fact table holds the data corresponding to a business process.
Every row represents an event that can be associated with that process. It stores quantitative
information for analysis. A dimension table stores data about how the data in the fact table is
being analyzed. Dimension tables help the fact table gather the different dimensions along which
the measures are to be taken.
The most popular data model for a data warehouse is a multidimensional model, which can exist in
the form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema
The most common modeling paradigm of the data warehouse is the star schema. A star schema is
represented by one large fact table and many dimension tables. The schema diagram looks like a star,
with a central fact table from which lines radiate to the surrounding dimension tables. The fact
data is organized in the fact table, and the dimensional data is organized in the dimension tables.
The fact table is in 3NF and the dimension tables are in denormalized form. Every dimension in a
star schema should be represented by only one dimension table. Each dimension table should be
joined to the fact table. The fact table should have keys and measures.
A star schema for sales data is shown in figure below. Sales are considered along three dimensions:
product, time and location. The schema contains a central fact table for sales that contains key to each
of the three dimensions, along with two measures: rupees_sold and units_sold. To minimize the size of
the fact table, dimension identifiers (e.g., time_key and product_key) are system-generated identifiers.
Figure: Star schema of a sales data warehouse. A central sales fact table (with the measures
rupees_sold and units_sold) is linked by keys to the time, product, and location dimension tables.
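A star schema along these lines can be sketched with Python's sqlite3 module. Only the key names and the measures rupees_sold and units_sold come from the text; the table names, the other columns, the street values, and the rupee amounts are assumptions made for the example. The unit counts are taken from Table 1.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT);

-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    rupees_sold  REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO product_dim VALUES (1, 'TV'), (2, 'PC');
INSERT INTO location_dim VALUES (1, 'Street A', 'Dhangadi'), (2, 'Street B', 'Mahendranagar');

-- Rupee amounts are hypothetical; unit counts follow Table 1.1.
INSERT INTO sales_fact VALUES
    (1, 1, 1, 81800.0, 818),
    (1, 1, 2, 60500.0, 605),
    (2, 1, 1, 89400.0, 894),
    (1, 2, 1, 74600.0, 746);
""")

# Typical star query: join the fact table to its dimensions, then aggregate.
tv_units_by_city = conn.execute("""
    SELECT l.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    JOIN product_dim  AS p ON f.product_key = p.product_key
    WHERE p.product_name = 'TV'
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Every dimension is reached from the fact table with a single join, which is the characteristic shape of a star query.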
Snowflake Schema
The snowflake schema can be considered a variant of the star schema. However, it is a more
complex data model compared to the star schema. In a snowflake schema, there is a single, large,
central fact table and one or more tables for each dimension. In order to eliminate redundancy,
dimension tables are split into several tables. This normalization often results in more
complex queries and reduced query performance. The advantage of the snowflake schema is that it
uses less disk space. Dimensions are also easy to implement when they are added to this schema. The
same set of attributes are published by different sources.
A snowflake schema for sales data is shown in the figure below. Sales are considered along three
dimensions: product, time and location. The fact table is identical to that of the star schema. The
main difference between the two schemas is in the definition of the dimension tables. The single
dimension table for location in the star schema can be normalized into two new tables: location and
city. The city key in the new location table links to the city dimension as shown in the figure
below.
Figure: Snowflake schema of a sales data warehouse. The sales fact table is linked to the time,
product, and location dimension tables, and the location dimension table is linked to a separate
city dimension table.
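The normalization of the location dimension into location and city tables can be sketched with sqlite3. Apart from the idea of a city key in the location table, all names and values below are assumptions made for the example, and the other dimensions are omitted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is split: city attributes move to their own table.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, province TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street       TEXT,
    city_key     INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER
);

INSERT INTO city_dim VALUES (1, 'Dhangadi', 'Sudurpashchim'), (2, 'Mahendranagar', 'Sudurpashchim');
INSERT INTO location_dim VALUES (10, 'Street A', 1), (11, 'Street B', 1), (12, 'Street C', 2);
INSERT INTO sales_fact VALUES (10, 500), (11, 318), (12, 605);
""")

# Reaching the city now takes two joins instead of one.
units_by_city = conn.execute("""
    SELECT c.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    JOIN city_dim     AS c ON l.city_key = c.city_key
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
```

Each city name and province is stored once rather than once per street, which saves space at the cost of an extra join in queries.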
Fact Constellation
Sophisticated applications may require multiple fact tables to share dimension tables. This kind of
schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
A fact constellation schema is shown in the figure below. This schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema. The shipping
table has five dimensions, or keys: product_key, time_key, shipper_key, from_location, and
to_location, and two measures: rupees_cost and units_shipped. A fact constellation schema allows
dimension tables to be shared between fact tables. For example, the dimension tables for time,
product, and location are shared between the sales and shipping fact tables.
Figure 1.8: Fact constellation schema of sales and shipping data warehouse. The sales and shipping
fact tables share the time, product, and location dimension tables; the shipping fact table is also
linked to a shipper dimension table.
• Middle Tier: The middle tier in a data warehouse is an OLAP server, which is
implemented using either the ROLAP, MOLAP, or HOLAP model. For a user, this
application tier presents an abstracted view of the database. This layer also acts as a
mediator between the end user and the database.
• Top Tier: The top tier is a front-end client layer. It consists of the tools and APIs that you
use to connect to the data warehouse and get data out of it. These could be query tools,
reporting tools, managed query tools, analysis tools, and data mining tools.
The data warehouse diagram below illustrates the 3-tier architecture of a data warehouse.
[Figure: top tier (output); middle tier: OLAP server (ROLAP or MOLAP or HOLAP); bottom tier: data warehouse and data marts, loaded by ETL (Extract, Transform, Load and Refresh) from heterogeneous data sources (operational system, ERP, CRM, flat files)]
Figure 1.19: Three-tier data warehouse architecture
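The division of work between the three tiers can be illustrated with a toy sketch in Python. It is not how an OLAP server is built. Each tier is reduced to one function, and the function names, field names (region, amount), and in-memory list used as the warehouse are all assumptions made for illustration.

```python
from collections import defaultdict

def bottom_tier_load(sources):
    """Bottom tier: gather rows from heterogeneous sources into one store."""
    warehouse = []
    for source_name, rows in sources.items():
        for row in rows:
            warehouse.append({
                "source": source_name,
                "region": row["region"].strip().title(),  # unify spelling
                "amount": float(row["amount"]),            # unify type
            })
    return warehouse

def middle_tier_rollup(warehouse, dimension):
    """Middle tier: present an aggregated (abstracted) view of the data."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

def top_tier_report(totals):
    """Top tier: a front-end tool that formats the result for the user."""
    return [f"{key}: {value:.2f}" for key, value in sorted(totals.items())]
```

A client would call the tiers in order: load the sources, roll up by a chosen dimension, then format the report. The end user only sees the output of the last step, which mirrors the mediator role of the middle tier.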
Data warehouse implementation is the process of establishing a data warehouse system in an
organization. It is a series of activities that are essential to create a fully functioning data
warehouse, carried out after classifying, analyzing, and designing the data warehouse with respect to
the requirements provided by the client. Data warehousing is one of the most important components of
the business intelligence process for an organization. The implementation process requires a series
of steps that need to be followed carefully. The steps are as follows:
1. Requirements analysis and capacity planning
The first step in data warehousing involves defining enterprise needs, defining
architectures, carrying out capacity planning, and selecting the hardware and software tools.
This step involves consulting senior management as well as the different stakeholders.
2. Hardware integration
Once the hardware and software have been selected, they need to be put together by integrating the
servers, the storage devices, and the user software tools.
3. Modeling
Modeling is a significant stage that involves designing the warehouse schema and views.
This may involve using a modeling tool if the data warehouse is sophisticated.
4. Physical modeling
For the data warehouse to perform efficiently, physical modeling is needed. This involves
designing the physical data warehouse organization, data placement, and data partitioning, and
deciding on access techniques and indexing.
5. Sources
The information for the data warehouse is likely to come from several data sources. This step
involves identifying and connecting the sources using a gateway, ODBC drivers, or another
wrapper.
6. ETL
The data from the source systems will need to go through an ETL phase. The process of
designing and implementing the ETL phase may involve identifying a suitable ETL tool vendor
and purchasing and implementing the tools. This may include customizing the tool to suit the
needs of the enterprise.
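The sources, ETL, and indexing steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a substitute for an ETL tool: the flat-file layout (order_id, region, amount), the table name sales, the index name, and the cleaning rules are all hypothetical.

```python
import csv
import io
import sqlite3

def extract(flat_file_text):
    """Extract: read rows from a flat-file source (CSV text with a header)."""
    return list(csv.DictReader(io.StringIO(flat_file_text)))

def transform(rows):
    """Transform: clean values and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or malformed amount
        clean.append((row["order_id"].strip(), row["region"].strip().upper(), amount))
    return clean

def load(conn, records):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id TEXT PRIMARY KEY, region TEXT, amount REAL)")
    # Physical modeling decision: index the column used for analysis.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales(region)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(conn, flat_file_text):
    """Run the three phases in order and return the number of rows loaded."""
    records = transform(extract(flat_file_text))
    load(conn, records)
    return len(records)
```

Keeping the three phases as separate functions makes it possible to test the cleaning rules without a database, and to swap the flat-file extractor for another source later.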
DATA MARTS
A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain
repositories of summarized data collected for analysis on a specific section or unit within an
organization, e.g., Marketing, Sales, HR, or Finance. A data mart is often controlled by a single
department in an organization. It usually draws data from only a few sources compared to a data
warehouse. Data marts are small in size and more flexible than a data warehouse.
[Figure: a data warehouse with Manufacturing, Finance, Sales, and Marketing data marts]
• Data marts are simpler to implement when compared to a corporate data warehouse. At
the same time, the cost of implementing a data mart is certainly lower compared with
implementing a full data warehouse.
• Compared to a data warehouse, a data mart is agile. In case of a change in the model, a data mart
can be rebuilt more quickly due to its smaller size.
• A data mart is defined by a single subject matter expert (SME). In contrast, a data
warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence, a data
mart is more open to change compared to a data warehouse.
• Data is partitioned, which allows very granular access control privileges.
• Data can be segmented and stored on different hardware/software platforms.
Types of Data Mart
There are three main types of data mart:
• Dependent: Dependent data marts are created by drawing data from an existing central data
warehouse, which is itself fed by operational sources, external sources, or both.
• Independent: An independent data mart is created without the use of a central data
warehouse.
• Hybrid: This type of data mart can take data from data warehouses or operational
systems.
Dependent Data Mart
A dependent data mart sources an organization's data from a single data warehouse. It is one
example of a data mart that offers the benefit of centralization. If you need to develop one or
more physical data marts, you need to configure them as dependent data marts. A dependent
data mart can be built in two different ways: either the user can access both
the data mart and the data warehouse, depending on need, or access is limited to the data
mart only. The second approach is not optimal, as it produces what is sometimes referred to as a data
junkyard. In a data junkyard, all data begins with a common source, but it is scrapped and mostly junked.
[Figure: operational sources; dependent departmental data marts]
Figure 1.21: Dependent data mart
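A dependent data mart, being a summarized subset of the central warehouse, can be sketched as a table derived from a warehouse table. The sketch below uses Python's built-in sqlite3 module. The warehouse table warehouse_facts, its columns, the mart naming scheme, and the sample rows are assumptions for illustration only.

```python
import sqlite3

def demo_warehouse():
    """Create a tiny central warehouse with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouse_facts (department TEXT, month INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
        [("Sales", 1, 100.0), ("Sales", 1, 50.0), ("Sales", 2, 70.0), ("HR", 1, 30.0)])
    return conn

def build_dependent_mart(conn, department):
    """Derive a summarized, department-specific mart from the warehouse."""
    mart = f"mart_{department.lower()}"
    if not mart.isidentifier():
        # Table names cannot be bound as parameters, so validate before use.
        raise ValueError("department must be a simple name")
    conn.execute(f"DROP TABLE IF EXISTS {mart}")
    conn.execute(f"CREATE TABLE {mart} (month INTEGER, total_amount REAL)")
    conn.execute(
        f"INSERT INTO {mart} "
        "SELECT month, SUM(amount) FROM warehouse_facts "
        "WHERE department = ? GROUP BY month",
        (department,))
    return mart
```

Because the mart is rebuilt from the warehouse, every department sees figures that trace back to the same centralized data, which is the benefit of centralization noted above.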
An independent data mart is created without the use of a central data warehouse. This kind of data
mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data warehouse nor with
any other data mart. In an independent data mart, the data is input separately, and its analyses are also
performed autonomously.
Implementation of independent data marts is antithetical to the motivation for building a data
warehouse. First of all, you need a consistent, centralized store of enterprise data which can be
analyzed by multiple users with different interests who want widely varying information.
[Figure: operational sources; independent data marts]
Figure 1.22: Independent data mart
Hybrid Data Mart
A hybrid data mart combines input from a data warehouse with input from other sources, such as
operational systems. This could be helpful when you want ad hoc integration, for example after a new
group or product is added to the organization. It is the data mart type best suited for multiple
database environments and fast implementation turnaround for any organization. It also requires the
least data cleansing effort. A hybrid data mart also supports large storage structures, and it is
flexible enough for smaller data-centric applications.
[Figure: hybrid data mart (operational sources; departmental data marts)]
METADATA
Metadata is data about the data, or documentation about the information which is required by the
users. In data warehousing, metadata is one of the essential aspects. Several examples of metadata
are listed below:
• A library catalog may be considered metadata. The catalog metadata consists of
several predefined components representing specific attributes of a resource, and each
item can have one or more values. These components could be the name of the author,
the name of the document, the publisher's name, the publication date, and the categories
to which it belongs.
• The table of contents and the index in a book may be treated as metadata for the book.
• Suppose we say that a data item about a person is 70. This must be defined by noting
that it is the person's weight and the unit is kilograms. Therefore, (weight, kilograms) is
the metadata about the data item 70.
• A webpage may include metadata specifying what language it is written in, what tools
were used to create it, and where to go for more on the subject, allowing browsers to
automatically improve the experience of users.
• A digital image may include metadata that describes how large the picture is, the color
depth, the image resolution, when the image was created, and other data.
• A text document's metadata may contain information about how long the document is,
who the author is, when the document was written, and a short summary of the
document.
• Another example of metadata is data about the tables and figures in a report like this
book. A table (which is a record) has a name (e.g., table titles), and there are column
names of the tables that may be treated as metadata. The figures also have titles or names.
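The (weight, kilograms) example from the list above can be made concrete with a small Python sketch. The class and field names are hypothetical and chosen only for illustration. The point is that the bare value 70 becomes interpretable once its metadata travels with it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldMetadata:
    """Data about the data: what a value measures and in which unit."""
    name: str
    unit: str
    description: str = ""

@dataclass(frozen=True)
class DataItem:
    """A raw value together with the metadata needed to interpret it."""
    value: float
    metadata: FieldMetadata

    def describe(self) -> str:
        return f"{self.metadata.name} = {self.value} {self.metadata.unit}"

# The bare number 70 is ambiguous; the metadata says it is a weight in kilograms.
weight = DataItem(70, FieldMetadata("weight", "kilograms", "weight of a person"))
```

Calling weight.describe() combines the value with its metadata into a readable statement.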
A well-designed data warehouse is the foundation for any successful BI or analytics program. Its
main job is to power the reports, dashboards, and analytical tools that have become indispensable to
businesses today. A data warehouse provides the information for your data-driven decisions and
helps you make the right call on everything from new product development to inventory levels.
There are many benefits of a data warehouse. Here are just a few:
• Better business analytics
With data warehousing, decision-makers have access to data from multiple sources and
no longer have to make decisions based on incomplete information.
• Faster queries
Data warehouses are built specifically for fast data retrieval and analysis. With a data
warehouse, we can very rapidly query large amounts of consolidated data with little to
no support from IT.
• Improved data quality
Before being loaded into the data warehouse, data cleansing cases are created by the
system and entered in a worklist for further processing, ensuring data is transformed
into a consistent format to support analytics and decisions based on high-quality,
accurate data.
• Historical insight
By storing rich historical data, a data warehouse lets decision-makers learn from past
trends and challenges, make predictions, and drive continuous business improvement.
Data warehouses have come a long way since their earliest iterations back in the 1980s. They're now
faster, more powerful, and in the cloud. But what hasn't changed is their goal: to unlock the full
value of an organization's data. The latest developments are only making this easier with
automation, empowerment, and openness. Trends in data warehousing are listed below:
• Continued growth in data warehousing
• Data warehousing has become mainstream
• Industries using data warehouses
• Vendor solutions & products
[...] published by the Data Warehousing Institute at that time featured no fewer than [...]
leading products.
Data Warehousing Has Become Mainstream
In the early stage, several significant factors drove many companies to move into data warehousing:
• Fierce competition
• Government deregulation
• Need to revamp internal processes
Although earlier data warehouses concentrated on keeping summary data for high-level analysis, we
now see larger and larger data warehouses being built by different businesses. Now companies have
the ability to capture, cleanse, maintain, and use the vast amounts of data generated by their
business transactions. The quantities of data kept in data warehouses continue to swell to the
terabyte range. Data warehouses storing several terabytes of data are not uncommon in retail and
telecommunications.
Vendor Solutions & Products
As an information technology professional, you are familiar with database vendors and database
products. In the same way, you are familiar with most of the operating systems and their vendors.
How many leading database vendors are there? How many leading vendors of operating systems
are there? A handful? The number of database and operating system vendors pales in comparison
with data warehousing products and vendors. There are hundreds of data warehousing vendors and
thousands of data warehousing products and solutions.
In the beginning, the market was filled with confusion and vendor hype. Every vendor, small or big,
that had any product remotely connected to data warehousing jumped on the bandwagon. Data
warehousing meant what each vendor defined it to be. Each company positioned its own products as
the proper set of data warehousing tools. Data warehousing was a new concept for many of the
businesses that adopted it. These businesses were at the mercy of the marketing hype of the vendors.
With so many vendors and products, how can we classify the vendors and products, and thereby
make sense of the market? It is best to separate the market broadly into two distinct groups. The first
group consists of data warehouse vendors and products catering to the needs of corporate data
warehouses in which all enterprise data is integrated and transformed. This segment has been
referred to as the market for strategic data warehouses. This segment accounts for about a quarter of
the total market. The second segment is looser and more dispersed, consisting of departmental data
marts, fragmented database marketing systems, and a wide range of decision support systems.
Specific vendors and products dominate each segment.
[Figure: the DW market in its beginning stages (state of flux) compared with the DW market currently (more mature and stable): vendor consolidations, new technologies (OLAP, etc.), product sophistication, web-enabled solutions, administrative tools, infrastructure tools]
• Visualization Types
• Advanced Visualization Techniques
o Chart Manipulation
o Drill Down
o Advanced Interaction
4. Web-Enabled Data Warehouse
1. Real-Time Data Warehousing
Business intelligence systems and the supporting data warehouses have been used mainly for
strategic decision making. The data warehouse was kept separate from operational systems.
Recently, industry momentum has been swinging towards using business intelligence for tactical
decision making for day-to-day business operations. Data warehousing is progressing rapidly
to the point that real-time data warehousing is the focus of senior executives.
Traditional data warehousing is passive, providing historical trends, whereas real-time data
warehousing is dynamic, providing the most up-to-date view of the business in real time. A
real-time data warehouse gets refreshed continuously, with almost zero latency.
Real-time information delivery increases productivity tremendously by sharing information
with more people. Companies are, therefore, coming under a lot of pressure to provide
information, in real time, to everyone connected to critical business processes. However,
extraction, transformation, and integration of data for real-time data warehousing have
several challenges.
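The contrast between a passive, periodically loaded warehouse and a continuously refreshed one can be shown with a toy Python sketch. Both classes and their method names are hypothetical, and the sketch deliberately ignores the extraction, transformation, and integration challenges mentioned above. It only shows when new data becomes visible to a query.

```python
class BatchWarehouse:
    """Traditional style: new data waits for the next scheduled load."""

    def __init__(self):
        self.totals = {}
        self._pending = []

    def record(self, product, amount):
        # The event is queued and is not yet visible to queries.
        self._pending.append((product, amount))

    def refresh(self):
        # Periodic load: apply everything queued since the last refresh.
        for product, amount in self._pending:
            self.totals[product] = self.totals.get(product, 0) + amount
        self._pending.clear()


class RealTimeWarehouse:
    """Real-time style: every event updates the queryable view at once."""

    def __init__(self):
        self.totals = {}

    def record(self, product, amount):
        self.totals[product] = self.totals.get(product, 0) + amount
```

After the same event is recorded in both, the real-time totals already include it, while the batch totals stay unchanged until refresh() runs. That delay is the latency a real-time data warehouse tries to bring close to zero.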
2. Multiple Data Types
What are the types of data we call unstructured data? The figure below shows the different types
of data that need to be integrated in the data warehouse to support decision making more
effectively.
[Figure: data warehouse repository with multiple data types, including text, video, and audio]
Figure 1.26: Multiple data types in a data warehouse
Adding Unstructured Data: Some vendors are addressing the inclusion of unstructured data,
especially text and images, by treating such multimedia data as just another data type. These
are defined as part of the relational data and stored as binary large objects (BLOBs) up to 2 GB
in size. User-defined functions (UDFs) are used to define these as user-defined types (UDTs).
Searching Unstructured Data: For free-form text data, retrieval engines pre-index the textual
documents to allow searches by words, character strings, phrases, wild cards, proximity
operators, and Boolean operators. Some engines are powerful enough to substitute
corresponding words and search. A search with the word mouse will also retrieve documents
containing the word mice. Searching audio and video data directly is still in the research
stage. Usually, these are described with free-form text, and then searched using textual search
methods that are currently available.
Spatial Data: Adding spatial data will greatly enhance the value of your data warehouse.
Address, street block, city quadrant, county, state, and zone are examples of spatial data.
Vendors have begun to address the need to include spatial data. Some database vendors are
providing spatial extenders to their products using SQL extensions to bring spatial and
business data together.
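The pre-indexing and word substitution described under Searching Unstructured Data can be sketched as a small inverted index in Python. This is a simplified illustration, not how any particular retrieval engine works. The WORD_FORMS table is a hand-written assumption standing in for the word-substitution logic, and only the Boolean AND and OR operators are shown.

```python
import re
from collections import defaultdict

# Assumed, hand-written table of word forms that should match each other.
WORD_FORMS = {"mice": "mouse"}

def normalize(word):
    """Lower-case a word and map known variants to one common form."""
    word = word.lower()
    return WORD_FORMS.get(word, word)

def build_index(documents):
    """Pre-index: map each normalized word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[normalize(word)].add(doc_id)
    return index

def search_and(index, *words):
    """Boolean AND: documents that contain every word."""
    sets = [index.get(normalize(w), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    """Boolean OR: documents that contain at least one word."""
    result = set()
    for w in words:
        result |= index.get(normalize(w), set())
    return result
```

Because both the indexed text and the query pass through normalize, a search for mouse also finds documents that only contain mice, as in the example from the text.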
3. Data Visualization
Visualization of data in the result sets boosts the process of analysis for the user, especially
when the user is looking for trends over time. Data visualization helps the user to interpret
query results quickly and easily.
[Figure: data visualization, from small data sets to large, complex structures: chart types, drill down, multiple linked charts, advanced interaction, presentation graphics, enterprise charting systems, scientific data, unstructured text data, real-time data]
[Figure: simplified view of a web-enabled data warehouse: the general public, customers, business partners, and employees; clickstream data and requests through extranets and the Web; results through extranets; warehouse repository and webhouse repository]
Exercise
1. Define data. Describe the life cycle of data with a suitable diagram.
2. List the types of data. Describe them with suitable examples.
3. What is a data warehouse? How does it differ from a database? Explain.
4. What are the differences between an operational database and a data warehouse?
5. Define the multidimensional data model. Explain its uses.
6. Describe OLAP operations in the multidimensional data model.
7. What is data warehousing? Describe the architecture of a data warehouse.
8. What do you mean by conceptual modeling of a data warehouse? Explain.
9. How is a data warehouse implemented? Explain. Describe the components of a data warehouse.
10. What is a data mart? How does it differ from a data warehouse? Explain.
11. Describe the needs of data warehousing. Describe trends in data warehousing.
12. Define metadata. How does it differ from a database? Explain.
13. Define real-time data warehousing with a suitable example.
14. What are the stages of data warehousing?
15. What are the steps to build a data warehouse?
16. What is the difference between metadata and a data dictionary?
17. What is the very basic difference between a data warehouse and operational databases?
18. Explain the data warehouse architecture. Differentiate between distributed and virtual data
warehouses.
19. Explain the structure of a data warehouse and how a data warehouse helps in better analysis
of a business.
20. Differentiate between data marts and data cubes.