

Program Name: B.C.A                     Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                     No. of Credits: 03

Unit 2
Topics:

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization.

Data Warehouse:

According to William H. Inmon, a leading architect in the construction of data warehouse systems (an American computer scientist known as the Father of the Data Warehouse), "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process". In simple words, it is a centralized repository that collects data from multiple sources to support management's decision-making process.

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.


Data warehousing:

The process of constructing and using data warehouses, as shown in the following figure.

Fig 1.1: Data warehouse of a sales organization.

OLTP (Online Transaction Processing) vs. OLAP (Online Analytical Processing)

1. OLTP (Online Transaction Processing)

• Purpose: Handles real-time transaction processing.

• Usage: Used in operational systems like banking, retail, and airline reservations.

• Data Type: Stores current and detailed transactional data.

• Operations: Frequent, short, atomic transactions (INSERT, UPDATE, DELETE).

• Performance: Optimized for fast query processing and high availability.

• Example: A banking system processing multiple account transactions simultaneously.

2. OLAP (Online Analytical Processing)

• Purpose: Supports complex analysis and decision-making.

• Usage: Used in business intelligence, data mining, and reporting.

• Data Type: Stores historical and aggregated data for analysis.

• Operations: Complex queries involving multi-dimensional data (e.g., SUM, AVERAGE, GROUP BY).


• Performance: Optimized for read-heavy operations and large data analysis.

• Example: A sales dashboard analyzing monthly revenue trends.

Feature              OLTP                                    OLAP
users                clerk, IT professional                  knowledge worker
function             day-to-day operations                   decision support
DB design            application-oriented                    subject-oriented
data                 current, up-to-date, detailed,          historical, summarized,
                     flat relational, isolated               multidimensional, integrated, consolidated
usage                repetitive                              ad-hoc
access               read/write, index/hash on primary key   lots of scans
unit of work         short, simple transaction               complex query
# records accessed   tens                                    millions
# users              thousands                               hundreds
DB size              100 MB to GB                            100 GB to TB
metric               transaction throughput                  query throughput, response time


Data Warehousing: Three Tier Architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure.

Fig. Three Tier Architecture of Data warehousing

◼ The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse.

◼ The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or


(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).

◼ The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to specific, selected groups, such as a marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for developing data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in the figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.


Fig: A recommended approach for data warehouse development

Data Warehouse Modeling: Data Cube and OLAP


Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
o A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts. Fact tables contain numerical data, while dimension
tables provide context and background information.
- Dimension tables, such as item(item_name, brand, type) or time(day, week,
month, quarter, year), store descriptive attributes of the entities about which the
organization keeps records (e.g., product, customer, date, store).
- The fact table contains numeric measures (such as dollars_sold, the sale amount in $,
and units_sold) and keys to each of the related dimension tables.
Example Scenario: Sales Data Warehouse
Fact Table (Transactional Data)
• Fact_Sales (Stores measurable business data)
o Sales_ID (Primary Key)
o Date_Key (Foreign Key to Date Dimension)
o Product_Key (Foreign Key to Product Dimension)
o Customer_Key (Foreign Key to Customer Dimension)
o Store_Key (Foreign Key to Store Dimension)


o Sales_Amount (Measure)
o Quantity_Sold (Measure)
Dimension Tables (Descriptive Data)
1. Dim_Date (Time-based details)
o Date_Key (Primary Key)
o Date
o Month
o Quarter
o Year
2. Dim_Product (Product details)
o Product_Key (Primary Key)
o Product_Name
o Category
o Brand
3. Dim_Customer (Customer details)
o Customer_Key (Primary Key)
o Customer_Name
o Age
o Gender
o Location
4. Dim_Store (Store details)
o Store_Key (Primary Key)
o Store_Name
o City
o Region
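As an illustration only, here is a minimal pandas sketch of the star schema above: the table and column names follow the example, while the data values are invented. It shows how the fact table joins to its dimension tables and how a measure is then aggregated by descriptive attributes.

```python
import pandas as pd

# Hypothetical miniature versions of the fact and dimension tables above.
dim_product = pd.DataFrame({
    "Product_Key": [1, 2],
    "Product_Name": ["Laptop", "Headphones"],
    "Category": ["Electronics", "Electronics"],
    "Brand": ["BrandA", "BrandB"],
})
dim_date = pd.DataFrame({
    "Date_Key": [101, 102],
    "Quarter": ["Q1", "Q2"],
    "Year": [2024, 2024],
})
fact_sales = pd.DataFrame({
    "Sales_ID": [1, 2, 3],
    "Date_Key": [101, 101, 102],
    "Product_Key": [1, 2, 1],
    "Sales_Amount": [1200.0, 150.0, 1100.0],
    "Quantity_Sold": [1, 1, 1],
})

# Join the fact table to its dimensions (the star's hub and spokes),
# then aggregate a measure by descriptive attributes.
sales = (fact_sales
         .merge(dim_product, on="Product_Key")
         .merge(dim_date, on="Date_Key"))
print(sales.groupby(["Year", "Quarter", "Category"])["Sales_Amount"].sum())
```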

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is
typically denoted by 'all'.


The lattice of cuboids (a patterned structure, like a fence) forms a data cube, as shown below.


Schemas for Multidimensional Data Models

In multidimensional data modeling for a data warehouse, three common schemas define how fact
and dimension tables are structured:

1. Star schema: A fact table in the middle connected to a set of dimension tables.

Dimensions are directly linked to the fact table.

Pros: Simple, fast query performance.

Cons: Data redundancy in dimensions.

2. Snowflake schema: A refinement of the star schema where some dimensional
hierarchies are normalized into a set of smaller dimension tables, forming a shape
similar to a snowflake. Dimension tables are split into sub-dimensions to reduce
redundancy.

Pros: Saves storage space.

Cons: Complex queries, slower performance.


3. Fact constellations: Multiple fact tables share dimension tables. This can be viewed
as a collection of stars and is therefore called a galaxy schema or fact constellation.
Used when multiple business processes are analyzed together.

Pros: Flexible for large-scale data warehousing.

Cons: Complex structure.


Choosing the Right Schema:

Star Schema → Best for fast query performance and simple design.

Snowflake Schema → Best for storage optimization when normalization is needed.

Galaxy Schema → Best for complex business models with multiple fact tables.

OLAP Operations

o Roll-up (drill-up): summarizes or aggregates data

- by climbing up a concept hierarchy or by dimension reduction.

- In the cube given in the overview section, the roll-up operation is performed by
climbing up the concept hierarchy of the Location dimension (City -> Country).


o Drill-down (roll-down): In the drill-down operation, less detailed data is converted into
more detailed data. It can be done by:

- Moving down in the concept hierarchy

- Adding a new dimension

- In the cube given in the overview section, the drill-down operation is performed by
moving down the concept hierarchy of the Time dimension (Quarter -> Month).

o Slice: Extracts a subset of the data for a single dimension value. It performs a selection on
one dimension of the OLAP cube, resulting in a new sub-cube.


Example: Viewing sales data only for Q1 2024.

o Dice: Extracts a subset of data based on multiple conditions (multiple slices).

Example: Viewing sales for Q1 2024 in New York for Electronics category.

o Pivot (rotate):

- reorients the cube for visualization, turning a 3D view into a series of 2D planes.
Rearranges data for better visualization by switching rows and columns.

- It is also known as the rotation operation, as it rotates the current view to get a
new view of the representation. In the sub-cube obtained after the slice operation,
performing the pivot operation gives a new view of it.


Summary Table:

OLAP Operation   Function                                   Example
Roll-Up          Aggregates data to a higher level          Sales from monthly → yearly
Drill-Down       Breaks data into a finer level             Sales from yearly → monthly
Slice            Selects data for one dimension             Sales only for Q1 2024
Dice             Filters data for multiple conditions       Sales for Q1 2024 & Electronics category
Pivot            Rotates data for different perspectives    Sales by category vs. year
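The OLAP operations above can be imitated on a small cube held in a pandas DataFrame. This is only an illustrative sketch: the dimension values and measures are invented, and a real OLAP server would precompute these aggregates rather than recomputing them per query.

```python
import pandas as pd

# A tiny "cube" stored as a flat table: dimensions Quarter, City, Category; measure Sales.
cube = pd.DataFrame({
    "Quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "City":     ["New York", "Chicago", "New York", "Chicago", "New York", "Chicago"],
    "Category": ["Electronics", "Electronics", "Electronics", "Home", "Home", "Home"],
    "Sales":    [500, 300, 450, 200, 150, 250],
})

# Roll-up: aggregate away the City dimension (summarize to Quarter x Category).
rollup = cube.groupby(["Quarter", "Category"])["Sales"].sum()

# Slice: fix a single dimension value (Quarter = Q1).
slice_q1 = cube[cube["Quarter"] == "Q1"]

# Dice: filter on multiple dimensions at once.
dice = cube[(cube["Quarter"] == "Q1") &
            (cube["City"] == "New York") &
            (cube["Category"] == "Electronics")]

# Pivot: rotate the view so rows are Categories and columns are Quarters.
pivot = cube.pivot_table(index="Category", columns="Quarter",
                         values="Sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")
```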

Data Cleaning

Today's data are highly susceptible to noise, missing values, and inconsistencies because of their typically huge size and heterogeneous sources. Low-quality data will lead to poor mining results.

Data preprocessing techniques (data cleaning, data integration, data reduction, data transformation), when applied before data mining, improve the overall quality of the patterns mined as well as the time required for the actual mining. The data cleaning stage helps smooth out noise, attempts to fill in missing values, removes outliers, and corrects inconsistencies in the data.

Different types of data cleaning tasks:

1) Handling missing values: Missing values are encountered due to data entry errors,
system failures, or incomplete records.
Techniques to handle missing values (a short sketch follows this list):

i. Ignoring the tuple: Used when the class label is missing. This method is not very
effective when a tuple contains many missing values.
ii. Fill in the missing value manually: This is time consuming.
iii. Use a global constant to fill in the missing value: e.g., "Unknown" or ∞.
iv. Use the attribute mean to fill in the missing value.
v. Use the attribute mean for all samples belonging to the same class as the given
tuple.
vi. Use the most probable value to fill in the missing value (e.g., using a decision tree).
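A short sketch, assuming pandas is available, of some of the missing-value strategies listed above; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, 61000, None],
    "class":  ["low", "low", "high", "high", "high"],
})

# i. Ignore the tuple: drop rows with any missing value.
dropped = df.dropna()

# iii. Fill with a global constant.
constant_filled = df.fillna({"income": "Unknown"})

# iv. Fill with the overall attribute mean.
mean_filled = df.assign(income=df["income"].fillna(df["income"].mean()))

# v. Fill with the attribute mean of tuples belonging to the same class.
class_mean_filled = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)
print(class_mean_filled)
```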


2) Handling noisy data: Noise is a random error or variance in a measured variable, caused by
sensor errors, outliers, rounding errors, or incorrect data entry.

Techniques to handle noisy data are:

1. Smoothing: Averages out fluctuations in the data. Techniques for smoothing are:
a) Binning: Smooths sorted data by consulting its neighborhood. The values are
distributed into buckets/bins, so binning performs local smoothing (see the sketch
after this list).

Different binning methods for data smoothing:

i. Smoothing by bin means: Each value in a bin is replaced by the bin mean.
   Ex: BIN 1: 4, 8, 15  →  BIN 1: 9, 9, 9
ii. Smoothing by bin boundaries: The minimum and maximum values of the bin are
   identified, and each value is replaced by the closest boundary value.
   Ex: BIN 1: 4, 8, 15  →  BIN 1: 4, 4, 15
b) Regression: Data smoothing can also be done by regression (linear regression,
multiple linear regression). Here one attribute is used to predict the value of
another.
c) Outlier analysis: Outliers can be detected by clustering; values that fall outside
the clusters are treated as outliers.
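A sketch of the two binning methods described above, in plain Python with equal-depth bins of size 3; the first bin matches the BIN 1 example, and the remaining values are illustrative additions.

```python
# Smoothing by bin means and by bin boundaries (illustrative sketch).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Replace every value in a bin by the bin mean.
smoothed_by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Replace every value by the nearer of the bin's min or max boundary.
smoothed_by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)                    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed_by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smoothed_by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```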

Data Integration

Data mining often works on integrated data from multiple repositories. Careful integration helps
improve the accuracy of the data mining results.

Challenges of Data Integration

1. Entity Identification Problem:
"How can schema and objects from many sources be matched?" This is called the entity
identification problem.
Ex: Cust-id in one table and Cust-no in another table.
Metadata helps in avoiding these problems.
2. Redundancy and correlation analysis:
Redundancy means repetition. Some redundancies can be detected by correlation analysis: given
two attributes, correlation analysis measures how strongly one attribute implies the other
(the chi-square test and the correlation coefficient are examples).
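A brief sketch of redundancy detection by correlation analysis, assuming pandas and SciPy; the attributes and their values are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age":          [23, 35, 45, 52, 61, 30],
    "years_worked": [2, 12, 22, 30, 38, 8],
    "gender":       ["M", "F", "F", "M", "F", "M"],
    "preferred":    ["online", "store", "store", "online", "store", "online"],
})

# Numeric attributes: Pearson correlation coefficient.
# A value near +1 or -1 suggests one attribute is largely redundant given the other.
r = df["age"].corr(df["years_worked"])
print(f"correlation(age, years_worked) = {r:.3f}")

# Nominal attributes: chi-square test on the contingency table.
table = pd.crosstab(df["gender"], df["preferred"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```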


Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data.

Data Reduction Strategies:

1. Dimensionality reduction:
Reduces the number of attributes/variables under consideration.
Ex: attribute subset selection, wavelet transforms, PCA.
2. Numerosity reduction:
Replaces the original data by alternative, smaller representations.
Ex: histograms, sampling, data cube aggregation, clustering.
3. Data compression:
Reduces the size of the data.

Wavelet Transform:

DWT (Discrete Wavelet Transform) is a linear signal processing technique that, when applied to a
data vector X, transforms it into a numerically different vector X' of the same length. The DWT is
a fast and simple transformation that can translate an image from the spatial domain to the
frequency domain.
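A minimal numpy sketch of a single level of the Haar wavelet transform (one of the simplest DWTs), just to show the averaging/differencing idea behind the technique; production code would normally use a dedicated library such as PyWavelets, and the input vector here is illustrative.

```python
import numpy as np

def haar_dwt_step(x):
    """One level of the Haar DWT: pairwise averages (approximation coefficients)
    and pairwise differences (detail coefficients). Assumes len(x) is even."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

data = [2, 2, 0, 2, 3, 5, 4, 4]
approx, detail = haar_dwt_step(data)
print(approx)  # coarse, length-4 summary of the data
print(detail)  # small detail coefficients; many can be truncated for data reduction
```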

Principal Components Analysis(PCA)

PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
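A hedged sketch of PCA with scikit-learn on a small synthetic numeric dataset; the dataset, the injected redundancy, and the choice to keep 2 components are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
X[:, 3] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)    # make one feature nearly redundant

pca = PCA(n_components=2)               # keep only the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2): reduced representation
print(pca.explained_variance_ratio_)    # share of variance each kept component preserves
```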

Attribute Subset Selection:

The data set for analysis may contain many attributes that are irrelevant to the mining task (e.g., a
telephone number may not be important when classifying customers). Attribute subset selection
reduces the data set by removing irrelevant attributes.

Some heuristic methods for attribute subset selection are:

1. Stepwise forward selection (see the sketch after this list):

• Start with an empty set of attributes.
• The best attribute is added to the reduced set.
• At each subsequent iteration, the best of the remaining attributes is added.
2. Stepwise backward elimination:
• Start with the full set of attributes.
• At each step, remove the worst remaining attribute.
3. Combination of forward selection & backward elimination:

• A combined method.
• At each step, the procedure selects the best attribute and removes the worst from the remaining attributes.
4. Decision tree induction:
In decision tree induction, a tree is constructed from the given data. All attributes that do not
appear in the tree are assumed to be irrelevant. Measures such as information gain, gain ratio,
Gini index, and chi-square statistics are used to select the best attributes out of the full set,
thereby reducing the number of attributes.
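A small sketch of stepwise forward selection (method 1 above), assuming scikit-learn; the scoring model (a decision tree), the built-in iris dataset, and the stopping criterion (keep 2 attributes) are illustrative choices, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, remaining = [], list(range(n_features))
while remaining and len(selected) < 2:          # stop after 2 attributes (arbitrary)
    # Try each remaining attribute and keep the one giving the best CV accuracy.
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```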

Histograms:

A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a
popular form of data reduction. Histograms are highly effective at approximating both sparse and
dense data, as well as skewed and uniform data.

The following data are a list of AllElectronics prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for this data.


Fig: Histogram for AllElectronics
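A sketch of how a figure like the one above could be reproduced with matplotlib, using the price list given in the text; the equal-width bucket boundaries (width 10) are an assumption.

```python
import matplotlib.pyplot as plt

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
          25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets: 1-10, 11-20, 21-30.
plt.hist(prices, bins=[1, 11, 21, 31], edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items sold")
plt.title("Histogram of AllElectronics item prices")
plt.show()
```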

Clustering:

Clustering partitions data into clusters/groups of objects that are similar (close) to one another. In
data reduction, the cluster representation of the data is used to replace the actual data: instead of
storing all data points, store only the cluster centroids or representative points.

Example:

• Given a dataset with 1 million customer records, k-means clustering can reduce it to 100
clusters, where each centroid represents a group of similar customers.
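A sketch, assuming scikit-learn, of replacing raw records with cluster centroids as described above; the data are synthetic, and k = 5 is used here (rather than the 100 clusters of the example) only to keep the sketch small.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
customers = rng.normal(size=(10_000, 4))   # 10,000 customer records, 4 numeric attributes

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(customers)

# The reduced representation: 5 centroids plus one small label per record,
# instead of analyzing all 10,000 full rows.
print(kmeans.cluster_centers_.shape)       # (5, 4)
print(np.bincount(kmeans.labels_))         # how many customers each centroid represents
```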

Clustering can identify important features and remove redundant ones.

Example:

• In gene expression data, clustering similar genes can help reduce thousands of variables
into meaningful groups.

Instead of analyzing the entire dataset, work on a sample of clusters that represent the whole
population.


Example:

• Market research: Instead of surveying all customers, businesses analyze a few customer
segments.

Clustering helps detect and remove outliers, reducing noise in the dataset.

Example:

• Fraud detection: Unusual transaction patterns form separate clusters, helping identify
fraudulent activities.

Sampling:

Sampling is used as a data reduction technique in which a large data set is represented by a much
smaller random sample (subset) of the data.

Common ways to sample:

i. Simple random sample without replacement of size s (SRSWOR)

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.

ii. Simple random sample with replacement(SRSWR)

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced.
That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

iii. Cluster sample

The tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M .

iv. Stratified sample

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In this way, the age group having the
smallest number of customers will be sure to be represented

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s , as opposed to N , the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.


Fig. Sampling Techniques
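A sketch of the four sampling schemes above using pandas; D is a DataFrame of invented customer tuples, s is the sample size, and the cluster/stratum definitions are assumptions made for illustration.

```python
import random
import pandas as pd

D = pd.DataFrame({
    "customer_id": range(1, 1001),
    "age_group": ["18-25", "26-40", "41-60", "60+"] * 250,
})
s = 20

# i.  SRSWOR: simple random sample without replacement.
srswor = D.sample(n=s, replace=False, random_state=0)

# ii. SRSWR: simple random sample with replacement (a tuple may be drawn again).
srswr = D.sample(n=s, replace=True, random_state=0)

# iii. Cluster sample: group the tuples into M disjoint clusters, then take an SRS of clusters.
clusters = [g for _, g in D.groupby(D.index // 100)]   # M = 10 clusters of 100 tuples each
random.seed(0)
cluster_sample = pd.concat(clusters[i] for i in random.sample(range(len(clusters)), k=2))

# iv. Stratified sample: an SRS within each stratum (here, each age group).
stratified = D.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(n=s // 4, random_state=0))

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```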

Data Cube Aggregation:

• Aggregate data into one view.

• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data, thereby benefiting
OLAP and data mining.
• Data cubes created at varying levels of abstraction are often referred to as cuboids.
• The cube created at the lowest level of abstraction is the base cuboid.
o Ex: data regarding individual sales or customers.
• The cube created at the highest level of abstraction is the apex cuboid.
o Ex: the total sales for all three years, over all items.

Fig. Data Cube
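A small sketch of data cube aggregation with pandas, assuming quarterly sales records (the figures are invented): aggregating the time dimension from quarters to years produces the smaller, higher-level representation described above, and summing over every dimension gives the apex cuboid.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "item":    ["TV"] * 8,
    "amount":  [200, 220, 250, 300, 210, 230, 260, 320],
})

# Aggregate away the quarter dimension: 8 detailed rows become 2 summary rows.
annual = sales.groupby(["year", "item"], as_index=False)["amount"].sum()

# The apex cuboid ('all'): a single total over every dimension.
apex = sales["amount"].sum()

print(annual)
print("total sales (apex cuboid):", apex)
```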

Data Transformation

The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.

Data Transformation Strategies overview:

1. Smoothing: Performed to remove noise.
Ex: binning, regression, clustering.
2. Attribute construction: New attributes are constructed and added to help the mining process.
3. Aggregation: Data is summarized or aggregated.
Ex: sales data is aggregated into monthly and annual sales. This step is used for constructing
a data cube.
4. Normalization: Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
5. Data discretization: Raw values are replaced by interval labels or conceptual labels (see the
sketch after this list).
Ex: Age
• Interval labels (10-18, 19-50)
• Conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: Attributes are generalized to higher-level
concepts.
Ex: street is generalized to city or country.
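A sketch of discretization (strategy 5 above) with pandas, using the age example; the exact bin edges and the extra "senior" label for ages above 50 are assumptions, since the text only names "youth" and "adult".

```python
import pandas as pd

ages = pd.Series([12, 15, 22, 35, 47, 64])

# Replace raw ages with interval labels...
intervals = pd.cut(ages, bins=[10, 18, 50, 100])

# ...or with conceptual labels ("senior" is an assumed label for the top bin).
concepts = pd.cut(ages, bins=[10, 18, 50, 100],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```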


Data Transformation by Normalization:

The measurement unit used can affect data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized. This involves transforming the
data to fall within a smaller or common range, such as [-1.0, 1.0] or [0.0, 1.0].

Normalizing the data attempts to give all attributes an equal weight. For example, changing the
unit of height from meters to inches leads to different results because of the larger range of that
attribute. To help avoid this dependence on the choice of units, the data should be normalized.

Normalization is useful in classification algorithms involving neural networks or distance
measurements, such as nearest-neighbor classification and clustering. There are different methods
for normalization, such as min-max normalization, z-score normalization, and normalization by
decimal scaling.

Min-Max Normalization:

a) Find the minimum (min_A) and maximum (max_A) values of the attribute A.
b) Transform the data to the range [new_min_A, new_max_A] by computing

   Vi' = ((Vi - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

c) Min-max normalization preserves the relationships among the original data values.

Ex: Suppose the minimum income is Rs. 12,000, the maximum income is Rs. 98,000, and the new
range is [0.0, 1.0]. A value Vi = Rs. 73,600 transforms into

   Vi' = ((73600 - 12000) / (98000 - 12000)) * (1.0 - 0.0) + 0.0 = 0.716
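A tiny pure-Python sketch of min-max normalization that reproduces the worked example above.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example: min = 12,000, max = 98,000, value = 73,600.
print(round(min_max_normalize(73600, 12000, 98000), 3))   # 0.716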

Z-score Normalization:

The values of an attribute A are normalized based on the mean and standard deviation of A:

   Vi' = (Vi - mean_A) / sigma_A,   where mean_A is the mean and sigma_A is the standard deviation of A.

Alternatively, the mean absolute deviation (s_A) can be used in place of the standard deviation
(sigma_A), as it is more robust to outliers.
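A short z-score sketch, assuming numpy; the income values are invented, and the population standard deviation (ddof = 0) is an assumed choice, since some tools use the sample standard deviation instead.

```python
import numpy as np

values = np.array([54000, 16000, 73600, 98000, 12000], dtype=float)

mean_a = values.mean()
sigma_a = values.std()              # population standard deviation (ddof=0)

z_scores = (values - mean_a) / sigma_a
print(z_scores)                     # each value expressed in standard deviations from the mean
```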

Decimal Scaling:

a) Normalizes by moving the decimal point of values.


b) The number of decimal places moved depends on the maximum absolute value of A:

   Vi' = Vi / 10^j,   where j is the smallest integer such that max(|Vi'|) < 1.

Ex: Suppose the values of A range from -986 to 917.
The maximum absolute value is 986, so j = 3.
Divide each value by 1000 (i.e., 10^3).
Therefore -986 normalizes to -0.986 and 917 normalizes to 0.917.
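A pure-Python sketch of decimal scaling that reproduces the example above; the helper name and the log-based way of finding j are illustrative choices.

```python
import math

def decimal_scale(values):
    """Divide every value by 10^j, where j is the smallest integer
    making all scaled absolute values less than 1."""
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1 if max_abs >= 1 else 0
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917])
print(j, scaled)    # 3 [-0.986, 0.917]
```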
