Unit 2_V2_Data Science
Program Name: B.C.A, Semester VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03
Unit 2
Topics: Data Warehouse, OLAP Operations, Data Cleaning, Data Integration, Data Reduction, Data Transformation.
Data Warehouse:
According to William H. Inmon, a leading architect in the construction of data warehouse systems (American computer scientist, often called the father of the data warehouse), “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.” In simple words, it is a centralized repository that consolidates data from multiple sources to support the management decision-making process.
Data warehousing:
The process of constructing and using data warehouses, as shown in the following figure.
• Usage (OLTP vs. OLAP): OLTP systems support day-to-day operations such as banking, retail, and airline reservations, whereas OLAP systems in the data warehouse support analysis and decision making.
◼ The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse.
◼ The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or
(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
◼ The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected subjects, such as in a marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.
Fact Table (Measures):
o Sales_Amount (Measure)
o Quantity_Sold (Measure)
Dimension Tables (Descriptive Data)
1. Dim_Date (Time-based details)
o Date_Key (Primary Key)
o Date
o Month
o Quarter
o Year
2. Dim_Product (Product details)
o Product_Key (Primary Key)
o Product_Name
o Category
o Brand
3. Dim_Customer (Customer details)
o Customer_Key (Primary Key)
o Customer_Name
o Age
o Gender
o Location
4. Dim_Store (Store details)
o Store_Key (Primary Key)
o Store_Name
o City
o Region
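To make this star schema concrete, the following sketch builds tiny pandas DataFrames for the fact table and two of the dimension tables listed above, then joins and aggregates them. The table name fact_sales and all data values are assumptions for illustration only; the column names follow the schema above.

```python
import pandas as pd

# Dimension tables (descriptive data) -- values are illustrative only.
dim_date = pd.DataFrame({
    "Date_Key": [1, 2],
    "Date": ["2024-01-15", "2024-04-10"],
    "Month": ["Jan", "Apr"],
    "Quarter": ["Q1", "Q2"],
    "Year": [2024, 2024],
})
dim_product = pd.DataFrame({
    "Product_Key": [10, 11],
    "Product_Name": ["Laptop", "Phone"],
    "Category": ["Electronics", "Electronics"],
    "Brand": ["BrandA", "BrandB"],
})

# Fact table: foreign keys to the dimensions plus the numeric measures.
fact_sales = pd.DataFrame({
    "Date_Key": [1, 1, 2],
    "Product_Key": [10, 11, 10],
    "Sales_Amount": [1200.0, 800.0, 1500.0],
    "Quantity_Sold": [2, 1, 3],
})

# A typical star-schema query: join facts to dimensions, then aggregate.
report = (fact_sales
          .merge(dim_date, on="Date_Key")
          .merge(dim_product, on="Product_Key")
          .groupby(["Quarter", "Category"])[["Sales_Amount", "Quantity_Sold"]]
          .sum())
print(report)
```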
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by ‘all’.
The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.
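As a small illustration of the lattice idea, the sketch below enumerates every cuboid for a hypothetical cube with three dimensions (time, item, location): the 3-D base cuboid down to the 0-D apex cuboid ‘all’. The dimension names are assumptions for illustration.

```python
from itertools import combinations

dimensions = ["time", "item", "location"]  # hypothetical 3-D cube

# Every subset of the dimensions is one cuboid in the lattice (2^3 = 8 here).
for k in range(len(dimensions), -1, -1):
    for combo in combinations(dimensions, k):
        name = ", ".join(combo) if combo else "all (apex cuboid)"
        print(f"{k}-D cuboid: {name}")
```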
In multidimensional data modeling for a data warehouse, three common schemas define how fact
and dimension tables are structured:
1. Star schema: A fact table in the middle connected to a set of dimension tables.
2. Snowflake schema: A refinement of the star schema in which some dimension tables are normalized into additional tables, reducing redundancy.
3. Galaxy schema (fact constellation): Multiple fact tables share dimension tables.
Star Schema → Best for fast query performance and simple design.
Snowflake Schema → Best for minimizing redundancy, at the cost of more joins.
Galaxy Schema → Best for complex business models with multiple fact tables.
OLAP Operations
o Drill-down (roll down): Converts less detailed data into more detailed data. It can be done by stepping down a concept hierarchy for a dimension or by adding a new dimension.
o Slice: Extracts a subset of the data for a single dimension value. It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube.
Example: Viewing sales for Q1 2024 in New York for the Electronics category.
o Pivot (rotate): Rotates the data axes to provide an alternative presentation of the same data, for example swapping the rows and columns of a report.
Summary Table:
OLAP Operation | Function | Example
Drill-Down | Breaks data into a finer level | Sales from yearly → monthly
Slice | Selects data for one dimension value | Sales only for Q1 2024
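The OLAP operations above can be imitated on a small pandas DataFrame: a groupby at a finer level acts like drill-down, a filter on one dimension value acts like a slice, and pivot_table rotates the axes like a pivot. The column names and values below are assumptions for illustration, not a real warehouse.

```python
import pandas as pd

sales = pd.DataFrame({
    "Year":     [2024, 2024, 2024, 2024],
    "Quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "Region":   ["New York", "Chicago", "New York", "Chicago"],
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing"],
    "Sales":    [1000, 400, 1200, 500],
})

# Drill-down: yearly totals broken into quarterly totals.
yearly = sales.groupby("Year")["Sales"].sum()
quarterly = sales.groupby(["Year", "Quarter"])["Sales"].sum()

# Slice: fix one dimension value (Quarter = Q1).
q1_slice = sales[sales["Quarter"] == "Q1"]

# Pivot (rotate): quarters as rows, regions as columns.
pivoted = sales.pivot_table(index="Quarter", columns="Region",
                            values="Sales", aggfunc="sum")

print(yearly, quarterly, q1_slice, pivoted, sep="\n\n")
```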
Data Cleaning
Today’s data are highly susceptible to noise, missing values, and inconsistencies because of their typically huge size and heterogeneous sources. Low-quality data will lead to poor mining results. Different data preprocessing techniques (data cleaning, data integration, data reduction, and data transformation), when applied before data mining, improve the overall quality of the patterns mined as well as the time required for the actual mining. The data cleaning stage helps smooth out noise, fill in missing values, remove outliers, and correct inconsistencies in the data.
1) Handling missing values: Missing values are encountered due to data entry errors, system failures, or incomplete records.
Techniques to handle missing values (see the sketch after this list):
i. Ignoring the tuple: Used when the class label is missing. This method is not very effective when many values are missing.
ii. Fill in the missing value manually: This is time consuming.
iii. Using a global constant to fill the missing value: e.g., “Unknown” or ∞.
iv. Use the attribute mean to fill the missing value.
v. Use the attribute mean of all samples belonging to the same class as the given tuple.
vi. Use the most probable value to fill the missing value (e.g., using a decision tree).
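A minimal pandas sketch of techniques (iii), (iv), and (v): filling with a global constant, with the attribute mean, and with the per-class mean. The small DataFrame, its columns, and the sentinel value -1 are assumptions chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# (iii) Global constant (here a sentinel value standing in for "Unknown").
filled_const = df["income"].fillna(-1)

# (iv) Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# (v) Attribute mean computed within the class of each tuple.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(filled_const, filled_mean, filled_class_mean, sep="\n\n")
```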
2) Handling noisy data: Noise is a random error or variance in a measured variable, caused by sensor errors, outliers, rounding errors, or incorrect data entry.
Data Integration
Data mining often works on data integrated from multiple repositories. Careful integration helps improve the accuracy of the data mining results.
Challenges of Data Integration (DI) include entity identification (matching the same real-world entity across sources), schema integration (different attribute names and structures), redundancy, and data value conflicts (e.g., differences in units or representation).
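As a tiny illustration of careful integration, the sketch below merges customer records from two hypothetical sources that name the key differently and encode the same attribute in different units. All table names, columns, and values are assumptions.

```python
import pandas as pd

# Source 1: CRM export (height in metres, key named cust_id).
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"],
                    "height_m": [1.62, 1.75]})

# Source 2: survey data (height in centimetres, key named customer_id).
survey = pd.DataFrame({"customer_id": [1, 2], "age": [29, 41],
                       "height_cm": [162, 175]})

# Resolve the schema conflict (key name) and the unit conflict (m vs cm).
survey = survey.rename(columns={"customer_id": "cust_id"})
survey["height_m"] = survey["height_cm"] / 100.0

# Keeping both height columns (with suffixes) makes value conflicts visible.
integrated = crm.merge(survey[["cust_id", "age", "height_m"]],
                       on="cust_id", suffixes=("", "_survey"))
print(integrated)
```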
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.
2. Numerosity reduction:
Replace the original data volume by alternative, smaller forms of data representation.
Ex: Histograms, Clustering, Sampling, Data cube aggregation.
3. Data compression:
Reduce the size of data.
Wavelet Transform:
DWT (Discrete Wavelet Transform) is a linear signal processing technique that, when applied to a data vector X, transforms it into a numerically different vector X′ of the same length. The DWT is a fast and simple transformation that can translate an image from the spatial domain to the frequency domain.
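A minimal NumPy sketch of a single-level Haar DWT (the simplest wavelet): the data vector X is turned into pairwise averaging (approximation) and differencing (detail) coefficients whose combined length equals that of X. This is a hand-rolled illustration under that assumption, not a production wavelet library.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar DWT: pairwise averages and differences."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth / low-frequency part
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # fluctuation / high-frequency part
    return approx, detail

X = [2, 2, 0, 2, 3, 5, 4, 4]            # data vector of length 8
cA, cD = haar_dwt_level(X)
X_prime = np.concatenate([cA, cD])      # same length as X, numerically different
print(X_prime)
```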
Principal Component Analysis (PCA):
PCA reduces the number of variables or features in a data set while still preserving the most important information, such as major trends or patterns.
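A short scikit-learn sketch of PCA reducing a data set from 4 attributes to 2 principal components while retaining most of the variance. The synthetic data and the choice of 2 components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                              # 100 samples, 4 features
X[:, 2] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)    # make one feature redundant

pca = PCA(n_components=2)               # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2): fewer attributes
print(pca.explained_variance_ratio_)    # share of variance each component keeps
```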
Attribute Subset Selection:
The dataset for analysis may consist of many attributes that are irrelevant to the mining task (e.g., a telephone number may not be important when classifying customers). Attribute subset selection reduces the data set size by removing irrelevant attributes.
Basic heuristic methods of attribute subset selection include:
1. Stepwise forward selection: starts with an empty attribute set and, at each step, adds the best of the remaining attributes.
2. Stepwise backward elimination: starts with the full attribute set and, at each step, removes the worst remaining attribute.
3. Combined method: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision Tree Induction:
In decision tree induction (DTI), a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. Measures such as Information Gain, Gain Ratio, Gini Index, and Chi-square statistics are used to select the best attributes out of the full set of attributes, thereby reducing the number of attributes.
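As a sketch of this idea, the snippet below fits a small scikit-learn decision tree and keeps only the attributes the tree actually used (non-zero importance). The toy data, in which only attributes 0 and 3 carry signal, is an assumption for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                 # 5 candidate attributes
y = (X[:, 0] + X[:, 3] > 0).astype(int)       # only attributes 0 and 3 matter

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes that never appear in the tree get importance 0 -> treat as irrelevant.
selected = [i for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("selected attribute indices:", selected)
```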
Histograms:
A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a popular form of data reduction. Histograms are highly effective at approximating both sparse and dense data, as well as skewed and uniform data.
The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for this data.
Clustering:
Clustering partitions data into clusters/groups whose members are similar (close) to one another. In data reduction, the cluster representations of the data are used to replace the actual data: instead of storing all data points, store only the cluster centroids or a few representative points.
Example:
• Given a dataset with 1 million customer records, k-means clustering can reduce it to 100
clusters, where each centroid represents a group of similar customers.
Example:
• In gene expression data, clustering similar genes can help reduce thousands of variables
into meaningful groups.
Instead of analyzing the entire dataset, work on a sample of clusters that represent the whole
population.
Example:
• Market research: Instead of surveying all customers, businesses analyze a few customer
segments.
Clustering helps detect and remove outliers, reducing noise in the dataset.
Example:
• Fraud detection: Unusual transaction patterns form separate clusters, helping identify
fraudulent activities.
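A short scikit-learn sketch of clustering as data reduction: a larger set of records is replaced by a small number of cluster centroids. The sizes here (1,000 points reduced to 5 centroids) and the feature meanings are assumptions for illustration, not the 1-million-record example above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
customers = rng.normal(size=(1000, 3))        # e.g. age, spend, visits (scaled)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(customers)

# Reduced representation: keep only the centroids (and, if needed, cluster sizes).
centroids = kmeans.cluster_centers_           # shape (5, 3) instead of (1000, 3)
sizes = np.bincount(kmeans.labels_)           # how many records each centroid stands for
print(centroids, sizes, sep="\n")
```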
Sampling:
Used as data reduction technique in which large data are represented as small random samples
(subset).
• Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
• Cluster sample: the tuples in D are grouped into M mutually disjoint “clusters”; then an SRS of s clusters can be obtained, where s < M.
• Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s , as opposed to N , the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.
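These sampling schemes can be sketched with pandas: sampling without replacement (SRSWOR), with replacement (SRSWR), and a stratified sample drawn per stratum via groupby (a cluster sample would similarly draw whole groups at random). The DataFrame D, the age_group strata, and the sample sizes are assumptions for illustration.

```python
import pandas as pd

D = pd.DataFrame({
    "customer_id": range(1, 13),
    "age_group":   ["18-25"] * 6 + ["26-40"] * 4 + ["41-60"] * 2,
    "spend":       [120, 80, 95, 60, 150, 90, 300, 250, 280, 310, 500, 450],
})

s = 4
srswor = D.sample(n=s, replace=False, random_state=0)   # SRSWOR: each tuple drawn at most once
srswr  = D.sample(n=s, replace=True,  random_state=0)   # SRSWR: tuples may be drawn again

# Stratified sample: an SRS (here one tuple) from each age-group stratum,
# so even the smallest stratum is represented.
stratified = (D.groupby("age_group", group_keys=False)
                .apply(lambda g: g.sample(n=1, random_state=0)))

print(srswor, srswr, stratified, sep="\n\n")
```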
Data Transformation
The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.
The measurement unit used can affect data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized. This involves transforming the
data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0].
Normalizing the data attempts to give all attributes an equal weight. For example, changing the measurement unit of height from meters to inches can lead to different results because the attribute then spans a larger range of values. To help avoid this dependence on the choice of units, the data should be normalized.
Normalization attempts to give all attributes equal weight. It is useful in classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. There are different methods of normalization, such as min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-Max Normalization:
Performs a linear transformation of the original data. If min_A and max_A are the minimum and maximum values of attribute A, a value v_i of A is mapped to v_i' in the new range [new_min_A, new_max_A] by:
v_i' = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
For example, if income ranges from $12,000 (min) to $98,000 (max) and is mapped to [0.0, 1.0], then v_i = $73,600 is transformed to v_i' = (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716.
Z-score Normalization:
A value v_i of attribute A is normalized using the mean (Ā) and standard deviation (σ_A) of A:
v_i' = (v_i - Ā) / σ_A
Alternatively, the mean absolute deviation (s_A) can be used in place of the standard deviation, since it is more robust to outliers.
Decimal Scaling:
Normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A:
v_i' = v_i / 10^j
where j is the smallest integer such that max(|v_i'|) < 1.
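A small NumPy sketch of the three normalization methods applied to an income vector. The values are assumptions, chosen so that $73,600 reproduces the 0.716 min-max result shown above.

```python
import numpy as np

income = np.array([12000, 35000, 54000, 73600, 98000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = ((income - income.min()) / (income.max() - income.min())
          * (new_max - new_min) + new_min)

# Z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(minmax)          # 73,600 maps to (73600-12000)/(98000-12000) = 0.716
print(zscore)
print(decimal_scaled)  # j = 5 here, so 98,000 becomes 0.98
```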