Unit 2: Data Science
Program Name: B.C.A        Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2        No. of Credits: 03
Contact Hours: 42 Hours        Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40        Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs. Data Mining, DBMS vs. Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concepts, Frequent Item Set Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from Your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)
Unit 2
Topics:
Data Warehouse:
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision making process”
Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data.
Data warehousing:
The process of constructing and using data warehouses is shown in the following figure.
(Figure: three-tier data warehouse architecture, from OLTP sources to OLAP analysis.)
◼ The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse
◼ The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or
(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
◼ The top tier is a front-end client layer , which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to specific, selected groups, such as a marketing data mart.
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.
The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.
o Star schema: A fact table in the middle connected to a set of dimension tables
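Below is a rough sketch of how a star-schema query works: a central fact table is joined to its dimension tables and a measure is aggregated by dimension attributes. The table and column names (fact_sales, dim_time, dim_item, etc.) are made up for illustration.

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
dim_item = pd.DataFrame({"item_key": [10, 20], "item_name": ["TV", "Phone"], "brand": ["A", "B"]})
fact_sales = pd.DataFrame({
    "time_key":     [1, 1, 2],
    "item_key":     [10, 20, 10],
    "units_sold":   [5, 3, 7],
    "dollars_sold": [2500, 900, 3500],
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
sales = fact_sales.merge(dim_time, on="time_key").merge(dim_item, on="item_key")
print(sales.groupby(["quarter", "brand"])["dollars_sold"].sum())
```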
OLAP Operations
o Drill down (roll down): In the drill-down operation, less detailed data is converted into more detailed data. It can be done by stepping down a concept hierarchy for a dimension (e.g., from quarter to month) or by introducing an additional dimension.
o Slice and dice: Slice selects a single dimension from the OLAP cube, which results in a new sub-cube; in the cube given in the overview section, a slice is performed on the dimension Time = “Q1”. Dice selects values on two or more dimensions to define a sub-cube.
o Pivot (rotate): Rotates the data axes in the view to provide an alternative presentation of the data, e.g., swapping rows and columns.
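The sketch below illustrates these operations on a tiny, made-up cube using pandas; the dimension values and measures are hypothetical, and a real OLAP server would run the equivalent operations on precomputed cuboids.

```python
import pandas as pd

# Toy "cube": a sales measure indexed by time, location, and item dimensions.
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Mumbai"],
    "item":    ["TV", "Phone", "TV", "Phone", "TV", "TV"],
    "sales":   [100, 80, 60, 90, 120, 70],
})

# Roll-up: aggregate from the month level up to the quarter level.
rollup = cube.groupby(["quarter", "city"])["sales"].sum()

# Drill-down: step back down to the more detailed month level.
drilldown = cube.groupby(["quarter", "month", "city"])["sales"].sum()

# Slice: fix a single dimension value (Time = "Q1") to get a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: fix values on two or more dimensions.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"] == "Delhi")]

# Pivot (rotate): swap the row and column axes of the presentation.
pivot = cube.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")
print(pivot)
```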
Data Cleaning, …
Today’s data are highly susceptible to noise, missing values, and inconsistencies because of their typically huge size and heterogeneous sources. Low-quality data lead to poor mining results. Data preprocessing techniques (data cleaning, data integration, data reduction, and data transformation), applied before data mining, improve both the overall quality of the patterns mined and the time required for the actual mining.
Data cleaning
The data cleaning stage smooths out noise, attempts to fill in missing values, removes outliers, and corrects inconsistencies in the data.
1. Binning: Smooths sorted data by consulting its neighborhood (the values around it). The sorted values are
distributed into buckets/bins, and smoothing is performed locally within each bin, e.g., by bin means, bin medians, or bin boundaries.
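A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the twelve sorted values are only an illustration.

```python
import numpy as np

# Sorted data to be smoothed (illustrative values).
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency binning: partition the sorted values into bins of 4 values,
# then smooth by replacing each value with its bin mean.
bin_size = 4
bins = data.reshape(-1, bin_size)
smoothed_by_means = np.repeat(bins.mean(axis=1), bin_size)

# Smoothing by bin boundaries: replace each value with the closer of the
# bin's minimum and maximum value.
mins = bins.min(axis=1, keepdims=True)
maxs = bins.max(axis=1, keepdims=True)
smoothed_by_boundaries = np.where(bins - mins <= maxs - bins, mins, maxs).ravel()

print(smoothed_by_means)
print(smoothed_by_boundaries)
```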
Data Integration
Data mining often works on data integrated from multiple repositories. Careful integration improves the accuracy of data mining results.
Challenges of DI: the entity identification problem (matching equivalent real-world entities across sources, e.g., recognizing that customer_id in one database and cust_number in another refer to the same attribute), redundancy (an attribute may be derivable from another attribute or table), and the detection and resolution of data value conflicts (the same entity may have different attribute values in different sources due to differences in representation, scaling, or encoding).
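As a small sketch (with made-up source names and columns), the snippet below integrates customer records from two hypothetical repositories, resolving a key-name mismatch as part of entity identification and dropping duplicate tuples afterwards.

```python
import pandas as pd

# Two hypothetical sources describing the same customers with different schemas.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spend": [1200.0, 450.0, 300.0]})

# Entity identification: map the differing key names onto one schema,
# then integrate the sources with a join on the shared key.
billing = billing.rename(columns={"customer_id": "cust_id"})
integrated = crm.merge(billing, on="cust_id", how="outer")

# Simple redundancy handling: drop exact duplicate tuples after integration.
integrated = integrated.drop_duplicates()
print(integrated)
```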
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Wavelet Transform:
DWT (Discrete Wavelet Transform) is a linear signal processing technique that, when applied to a data vector X, transforms it into a numerically different vector X’ of the same length. The DWT is a fast and simple transformation that can translate an image from the spatial domain to the frequency domain.
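A brief sketch of a one-level Haar DWT, assuming the PyWavelets (pywt) package is available; keeping only the strongest coefficients and reconstructing gives the compressed approximation that makes the DWT useful for data reduction.

```python
import numpy as np
import pywt  # PyWavelets

# A toy data vector X.
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# One-level Haar DWT: approximation (cA) and detail (cD) coefficients.
cA, cD = pywt.dwt(x, "haar")

# Data reduction: zero out the weak detail coefficients, then reconstruct
# an approximation X' of the original vector from the rest.
cD_reduced = np.where(np.abs(cD) >= 1.0, cD, 0.0)
x_approx = pywt.idwt(cA, cD_reduced, "haar")
print(x_approx)
```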
PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
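A short sketch of PCA with scikit-learn on synthetic data: five correlated attributes are projected onto two principal components while most of the variance is retained.

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 5 correlated attributes (synthetic data for illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

# Project onto the 2 principal components capturing the major trends.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```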
A data set for analysis may contain many attributes that are irrelevant to the mining task (e.g., a customer’s telephone number may not be important when classifying customers). Attribute subset selection reduces the data set by removing irrelevant or redundant attributes.
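A minimal sketch of attribute subset selection using a simple univariate filter from scikit-learn; the attribute names and class labels are hypothetical, and other approaches (stepwise selection, decision-tree induction) are equally valid.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical customer attributes; 'phone_digits' is irrelevant to the class.
X = pd.DataFrame({
    "age":          [22, 35, 58, 44, 25, 61, 33, 50],
    "income":       [20, 60, 90, 75, 28, 95, 55, 80],
    "phone_digits": [7, 3, 9, 1, 5, 2, 8, 4],
})
y = [0, 1, 1, 1, 0, 1, 0, 1]  # e.g., buys_computer

# Keep the 2 attributes most strongly associated with the class label.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))
```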
Histograms:
A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a popular form of data reduction. Histograms are highly effective at approximating both sparse and dense data, as well as skewed and uniform data.
The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for these data.
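A quick sketch of summarizing the same price list with three equal-width buckets using NumPy; the bucket counts replace the 52 raw values.

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width histogram with buckets of width 10: the count per bucket is a
# much smaller representation of the data than the raw values.
counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi - 1}: {c} items")
```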
Clustering:
Clustering partitions data into clusters/groups whose members are similar or close to one another. In data reduction, the cluster representations of the data (e.g., cluster centroids) are used to replace the actual data.
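A small sketch of reducing a synthetic point set to its cluster centroids with scikit-learn's KMeans; the 300 original points are replaced by 3 representatives plus the counts they stand for.

```python
import numpy as np
from sklearn.cluster import KMeans

# 300 two-dimensional points around three centres (synthetic data).
rng = np.random.default_rng(1)
centres = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in centres])

# Reduced representation: 3 centroids instead of 300 raw points.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)       # compact representation of the data
print(np.bincount(km.labels_))   # how many points each centroid replaces
```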
Sampling:
Sampling is used as a data reduction technique in which a large data set is represented by a much smaller random sample (subset).
Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N; that is, all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
Cluster sample: the tuples in D are grouped into M mutually disjoint “clusters”; an SRS of s clusters can then be obtained, where s < M.
Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample s, as opposed to the data set size N. Hence, sampling complexity is potentially sublinear in the size of the data.
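The sketch below shows the four sampling schemes on a tiny hypothetical customer table using pandas; `age_group` plays the role of the stratum, and the cluster column is contrived purely for illustration.

```python
import pandas as pd

# Hypothetical data set D of N = 12 customer tuples.
D = pd.DataFrame({
    "cust_id":   range(1, 13),
    "age_group": ["youth"] * 3 + ["adult"] * 6 + ["senior"] * 3,
})
s = 4

# SRSWOR: simple random sample of s tuples without replacement.
srswor = D.sample(n=s, replace=False, random_state=0)

# SRSWR: with replacement, so the same tuple may be drawn more than once.
srswr = D.sample(n=s, replace=True, random_state=0)

# Cluster sample: group tuples into M disjoint clusters, then take an SRS of clusters.
D["cluster"] = D.index // 4                    # M = 3 clusters of 4 tuples each
chosen = pd.Series(D["cluster"].unique()).sample(n=2, random_state=0)
cluster_sample = D[D["cluster"].isin(chosen)]

# Stratified sample: an SRS within each stratum (age group), so even the
# smallest group is guaranteed to be represented.
stratified = D.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(n=2, random_state=0)
)
print(stratified)
```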
Data Transformation
The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.
5. Discretization: The raw values of a numeric attribute are replaced by interval or conceptual labels (a short sketch follows this list).
Ex: Age
• Interval labels (10-18, 19-50)
• Conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: Attributes are generalized to higher-level concepts.
Ex: Street is generalized to city or country.
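A quick sketch of both label styles for an age attribute with pandas; the cut points (10, 18, 50, 80) are only an example.

```python
import pandas as pd

ages = pd.Series([12, 16, 23, 35, 47, 61, 70])

# Discretization into interval labels.
intervals = pd.cut(ages, bins=[10, 18, 50, 80])

# The same cut points mapped to conceptual labels (one level of a concept hierarchy).
concepts = pd.cut(ages, bins=[10, 18, 50, 80], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```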
The measurement unit used can affect data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized. This involves transforming the
data to fall within a smaller or common range such as Range = [-1,1], [0.0,1.0].
Normalizing the data attempts to give all attributes an equal weight. For example, changing the unit of a height attribute from meters to inches leads to different results, because the smaller unit gives that attribute a larger range. To help avoid dependence on the choice of units, the data should be normalized.
Normalization attempts to give all attributes equal weight. Normalization is useful in classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. There are different methods for normalization, such as min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-Max Normalization:
Performs a linear transformation on the original data. A value v of attribute A is mapped to v' in the new range [new_min_A, new_max_A]:
v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
For example, if the minimum and maximum values of income are $12,000 and $98,000 and income is mapped to the range [0.0, 1.0], then a value of $73,600 is transformed to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716.
Z-score Normalization:
A value v of attribute A is normalized to v' based on the mean and standard deviation of A:
v' = (v - Ā) / σ_A
where Ā is the mean and σ_A is the standard deviation of attribute A. A variation replaces the standard deviation with the mean absolute deviation s_A, which is more robust to outliers than σ_A.
Decimal Scaling:
Normalizes by moving the decimal point of the values of attribute A:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
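A compact sketch applying all three methods to a small income vector; the $12,000 and $98,000 endpoints are the assumed figures used in the worked example above.

```python
import numpy as np

income = np.array([12_000.0, 35_000.0, 73_600.0, 98_000.0])

# Min-max normalization to the range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
mm = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# Z-score normalization using the mean and standard deviation.
z = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
ds = income / 10 ** j

print(mm)   # 73,600 maps to about 0.716
print(z)
print(ds)
```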