
Program Name: B.C.A                         Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                         No. of Credits: 03
Contact Hours: 42 Hours                     Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40              Summative Assessment Marks: 60

Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1: Understand the concepts of data and pre-processing of data.
CO2: Know simple pattern recognition methods.
CO3: Understand the basic concepts of Clustering and Classification.
CO4: Know the recent trends in Data Science.

Contents (42 Hrs):
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs Data Mining, DBMS vs Data Mining, DM techniques, Problems, Issues and Challenges in DM, DM applications. (8 Hrs)
Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Mining Frequent Patterns: Basic Concepts - Frequent Item Set Mining Methods - Apriori and Frequent Pattern Growth (FP-Growth) algorithms - Mining Association Rules. (8 Hrs)
Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor. Prediction - Accuracy - Precision and Recall. (10 Hrs)
Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)

Fundamentals of Data Science Dr. Chandrajit M, MIT First Grade College


Unit 2
Topics:

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization.

Data Warehouse:

Def 1: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." - W. H. Inmon (father of the data warehouse, American computer scientist)

Def 2: A centralized data location for multiple sources of data.

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision making process”
Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, or concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

Data warehousing:

Data warehousing is the process of constructing and using data warehouses, as shown in the following figure.


Fig 1.1: Data warehouse of a sales organization.


Difference between OLTP and OLAP

Feature              OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized,
                     flat relational, isolated              multidimensional, integrated,
                                                            consolidated
usage                repetitive                             ad hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100 MB to GB                           100 GB to TB
metric               transaction throughput                 query throughput, response time

Data Warehousing: Three Tier Architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure below.

Fig. Three Tier Architecture of Data warehousing


◼ The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.

◼ The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or

(2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).

◼ The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

o Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
o Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
o Virtual warehouse: a set of views over operational databases. Only some of the possible summary views may be materialized.
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.


Fig: A recommended approach for data warehouse development

Data Warehouse Modeling: Data Cube and OLAP


Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
o A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts. Fact tables contain numerical data, while dimension
tables provide context and background information.
- Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year), describe the entities about which the organization keeps records.
- The fact table contains numeric measures (such as dollars_sold, the sale amount in dollars, and units_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by "all".
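The following is a minimal pandas sketch of this idea; the fact table and its values are invented purely for illustration. It shows the 3-D base cuboid, a 2-D cuboid obtained by summarizing over one dimension, and the 0-D apex cuboid, each as a simple aggregation.

```python
import pandas as pd

# Toy sales fact table: one row per (quarter, item, location) combination.
# All values are made up for illustration.
sales = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q2", "Q2"],
    "item":         ["TV", "Phone", "TV", "Phone"],
    "location":     ["Mysore", "Mysore", "Bangalore", "Bangalore"],
    "dollars_sold": [400, 250, 520, 310],
})

# 3-D base cuboid: the lowest level of summarization (all three dimensions kept).
base_cuboid = sales.groupby(["quarter", "item", "location"])["dollars_sold"].sum()

# 2-D cuboid: roll up (summarize) over the location dimension.
cuboid_2d = sales.groupby(["quarter", "item"])["dollars_sold"].sum()

# 0-D apex cuboid ("all"): the highest level of summarization, a single total.
apex_cuboid = sales["dollars_sold"].sum()

print(base_cuboid, cuboid_2d, apex_cuboid, sep="\n\n")
```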


The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.

Schemas for Multidimensional Data Models

Stars, Snowflakes, and Fact Constellations:

o Star schema: a fact table in the middle connected to a set of dimension tables.

o Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

o Fact constellation: multiple fact tables share dimension tables. Viewed as a collection of stars, it is therefore also called a galaxy schema or fact constellation.

A small sketch of a star schema follows this list.
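As a rough illustration (not taken from the text), a star schema can be mimicked with pandas DataFrames: a central fact table holds numeric measures and foreign keys, and dimension tables supply the descriptive context. All table names, column names, and values below are made up for the example.

```python
import pandas as pd

# Dimension tables (context): a surrogate key plus descriptive attributes.
item_dim = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["TV", "Phone"],
    "brand":     ["BrandA", "BrandB"],
})
time_dim = pd.DataFrame({
    "time_key": [10, 11],
    "quarter":  ["Q1", "Q2"],
    "year":     [2024, 2024],
})

# Fact table (numeric measures): foreign keys into each dimension table.
sales_fact = pd.DataFrame({
    "time_key":     [10, 10, 11],
    "item_key":     [1, 2, 1],
    "dollars_sold": [400, 250, 520],
    "units_sold":   [4, 5, 5],
})

# A typical star-schema query joins the central fact table with its dimensions.
report = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(time_dim, on="time_key")
          .groupby(["quarter", "brand"])["dollars_sold"].sum())
print(report)
```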


OLAP Operations

o Roll-up (drill-up): summarizes data by aggregation.

- Performed by climbing up a concept hierarchy or by dimension reduction.
- In the cube given in the overview section, the roll-up operation is performed by climbing up the concept hierarchy of the Location dimension (City -> Country).

o Drill-down (roll-down): the reverse of roll-up; less detailed data is turned into more detailed data. It can be done by:

- moving down in a concept hierarchy, or
- adding a new dimension.
- In the cube given in the overview section, the drill-down operation is performed by moving down the concept hierarchy of the Time dimension (Quarter -> Month).


o Slice and dice: slice selects a single dimension of the OLAP cube (fixing it to one value), which results in a new sub-cube; dice selects on two or more dimensions. In the cube given in the overview section, a slice is performed on the dimension Time = "Q1".

o Pivot (rotate):

- Reorients the cube for visualization, e.g., turning a 3-D cube into a series of 2-D planes.
- It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it. (A small pandas sketch of these operations follows this list.)
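A minimal pandas sketch of roll-up, slice, and pivot on a toy cube. The dimensions and figures are invented for illustration and are not the cube from the overview section.

```python
import pandas as pd

# Toy cube stored as a flat table; the values are invented for illustration.
cube = pd.DataFrame({
    "city":    ["Mysore", "Mysore", "Delhi", "Delhi"],
    "country": ["India", "India", "India", "India"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["TV", "TV", "Phone", "Phone"],
    "sales":   [400, 520, 250, 310],
})

# Roll-up: climb the Location hierarchy (city -> country) and aggregate.
rollup = cube.groupby(["country", "quarter"])["sales"].sum()

# Drill-down would go the other way (e.g., quarter -> month); it requires the
# finer-grained data to be available in the first place.

# Slice: fix a single dimension value (Time = "Q1") to obtain a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Pivot (rotate): reorient the view, e.g., items as rows and quarters as columns.
pivoted = cube.pivot_table(index="item", columns="quarter",
                           values="sales", aggfunc="sum")

print(rollup, slice_q1, pivoted, sep="\n\n")
```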

Data Preprocessing: Data Cleaning, Data Integration, Data Reduction, Data Transformation

Today's data are highly susceptible to noise, missing values, and inconsistencies because of their typically huge size and their origin in multiple, heterogeneous sources. Low-quality data will lead to poor mining results.

Different data preprocessing techniques (data cleaning, data integration, data reduction, data transformation), applied before mining, improve the overall quality of the patterns mined and reduce the time required for the actual mining.


Fig: Forms of Data Preprocessing

Data cleaning

The data cleaning stage smooths out noise, fills in missing values, removes outliers, and corrects inconsistencies in the data.

1) Handling missing values (a small pandas sketch follows this list):

i. Ignoring the tuple: usually done when the class label is missing. This method is not very effective when many values are missing.
ii. Filling in the missing value manually: time consuming.
iii. Using a global constant to fill the missing value, e.g., "unknown" or ∞.
iv. Using the attribute mean to fill the missing value.
v. Using the attribute mean of all samples belonging to the same class as the given tuple.
vi. Using the most probable value to fill the missing value (e.g., predicted with a decision tree).
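A minimal pandas sketch of strategies (iii) to (v), using a made-up table with missing income values:

```python
import numpy as np
import pandas as pd

# Toy customer table with missing values (NaN); the values are illustrative only.
df = pd.DataFrame({
    "income": [30000, np.nan, 52000, 61000, np.nan],
    "class":  ["low", "low", "high", "high", "high"],
})

# (iii) Fill with a global constant.
filled_const = df["income"].fillna(-1)

# (iv) Fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# (v) Fill with the attribute mean of the tuple's own class.
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(filled_const, filled_mean, filled_class_mean, sep="\n\n")
```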

2) Noisy data: Noise is a random error or variance in a measured variable.


Different methods for smoothing are:

1. Binning: smooths sorted data by consulting its neighborhood. The values are distributed into buckets/bins, so binning performs local smoothing.

Different binning methods for data smoothing:

i. Smoothing by bin means: each value in a bin is replaced by the bin mean.
   Ex: BIN 1: 4, 8, 15 -> BIN 1: 9, 9, 9
ii. Smoothing by bin boundaries: the minimum and maximum values of the bin are identified, and each value is replaced by the closest boundary value.
   Ex: BIN 1: 4, 8, 15 -> BIN 1: 4, 4, 15

2. Regression: data smoothing can also be done by regression (linear regression, multiple linear regression), in which one attribute is used to predict the value of another.
3. Outlier analysis: outliers can be detected by clustering; values that fall outside the clusters are treated as outliers.

A small numpy sketch of these smoothing methods follows.
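The sketch below covers the three approaches above, under stated assumptions: the bin reuses the 4, 8, 15 example from the text, the regression data are invented, and the outlier step uses a simple distance-from-the-mean rule as a stand-in for full cluster-based outlier analysis.

```python
import numpy as np

# --- Binning (the example bin from the text: 4, 8, 15) ---
b = np.array([4, 8, 15], dtype=float)

# Smoothing by bin means: every value becomes the bin mean.
by_means = np.full_like(b, b.mean())                                  # -> [9. 9. 9.]

# Smoothing by bin boundaries: each value snaps to the nearer of min/max.
by_bounds = np.where(b - b.min() <= b.max() - b, b.min(), b.max())    # -> [4. 4. 15.]

# --- Regression-based smoothing (illustrative x/y values) ---
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])          # roughly y = 2x plus noise
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept                # fitted values replace noisy y

# --- Outlier flagging: values far from the bulk of the data are flagged ---
values = np.array([10, 11, 9, 10, 12, 55], dtype=float)
z = np.abs(values - values.mean()) / values.std()
outliers = values[z > 2]                          # -> [55.]

print(by_means, by_bounds, y_smoothed, outliers, sep="\n")
```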

Data Integration

Data mining often works on data integrated from multiple repositories. Careful integration helps improve the accuracy of data mining results.

Challenges of Data Integration

1. Entity identification problem:
   "How can schema and objects from many sources be matched?" This is the entity identification problem.
   Ex: cust-id in one table and cust-no in another table.
   Metadata helps in avoiding these problems.
2. Redundancy and correlation analysis:
   Redundancy means repetition. Some redundancy can be detected by correlation analysis: given two attributes, correlation analysis measures how strongly one attribute implies the other (the chi-square test and the correlation coefficient are examples). A small sketch of both follows this list.
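A small sketch of the two checks, assuming illustrative data: the Pearson correlation coefficient for two numeric attributes (via pandas) and a chi-square test on a contingency table of two nominal attributes (via scipy).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Numeric attributes: the correlation coefficient measures how strongly one
# attribute implies the other. Values are illustrative.
df = pd.DataFrame({"age":    [23, 35, 45, 52, 61],
                   "income": [25000, 40000, 52000, 60000, 71000]})
r = df["age"].corr(df["income"])     # close to +1 -> strongly correlated (possible redundancy)

# Nominal attributes: chi-square test on the contingency table.
gender = ["M", "M", "F", "F", "M", "F"]
prefers_fiction = ["yes", "no", "yes", "yes", "no", "yes"]
table = pd.crosstab(pd.Series(gender, name="gender"),
                    pd.Series(prefers_fiction, name="fiction"))
chi2, p_value, dof, expected = chi2_contingency(table)

print(r, chi2, p_value)
```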

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Data Reduction Strategies:

1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.


Ex: Attribute subset selection, Wavelet Transform, PCA.


2. Numerosity reduction:
   Replace the original data with alternative, smaller forms of data representation.
   Ex: Histograms, Sampling, Data cube aggregation.
3. Data compression:
   Reduce the size of the data by applying transformations that encode or compress it.

Wavelet Transform:

DWT (Discrete Wavelet Transform) is a linear signal processing technique that, when applied to a data vector X, transforms it into a numerically different vector X' of the same length. The DWT is a fast and simple transformation that can translate an image from the spatial domain to the frequency domain.
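A minimal sketch of a one-level DWT using the PyWavelets package (an assumption; the package must be installed separately with pip install PyWavelets). It also hints at the data reduction idea of keeping only part of the coefficients.

```python
import pywt

# Discrete Wavelet Transform of a small data vector using the Haar wavelet.
# The input values are illustrative only.
x = [2, 2, 0, 2, 3, 5, 4, 4]
approx, detail = pywt.dwt(x, "haar")

# Data reduction idea: drop (zero out) the detail coefficients and reconstruct
# an approximation of the original vector from the remaining coefficients.
x_approx = pywt.idwt(approx, [0.0] * len(detail), "haar")

print(approx, detail, x_approx)
```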

Principal Components Analysis(PCA)

PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
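A minimal scikit-learn sketch, using random data purely to show the mechanics, that projects a 4-attribute data set onto its two leading principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 100 x 4 matrix standing in for a data set with 4 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Keep only the 2 leading principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (100, 2)

print(X_reduced.shape, pca.explained_variance_ratio_)
```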

Attribute Subset Selection:

The data set for analysis may consist of many attributes that are irrelevant to the mining task (e.g., a telephone number may not be important when classifying customers). Attribute subset selection reduces the data set by removing such irrelevant attributes.

Some heuristic methods for attribute subset selection are (a small sketch of forward selection follows this list):

1. Stepwise forward selection:
   • Start with an empty set of attributes.
   • The best attribute is added to the reduced set.
   • At each subsequent step, the best of the remaining attributes is added.
2. Stepwise backward elimination:
   • Start with the full set of attributes.
   • At each step, remove the worst attribute.
3. Combination of forward selection and backward elimination:
   • A combined method.
   • At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction:
   A tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant.
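A rough sketch of stepwise forward selection using scikit-learn. The greedy loop, the decision tree used as the evaluator, and the stopping rule of two attributes are all illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs, chosen = X.shape[1], []

# Greedy stepwise forward selection: start with an empty set and, at each step,
# add whichever remaining attribute gives the best cross-validated accuracy.
while len(chosen) < 2:                         # stop after selecting 2 attributes
    best_attr, best_score = None, -1.0
    for a in range(n_attrs):
        if a in chosen:
            continue
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, chosen + [a]], y, cv=5).mean()
        if score > best_score:
            best_attr, best_score = a, score
    chosen.append(best_attr)
    print("selected attribute", best_attr, "score", round(best_score, 3))
```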


Histograms:

A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a popular form of data reduction. Histograms are highly effective at approximating both sparse and dense data, as well as skewed and uniform data.

The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for this data; a small matplotlib sketch follows.
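A minimal matplotlib sketch of the histogram. The equal-width buckets of $1-10, $11-20 and $21-30 are an assumption chosen for illustration.

```python
import matplotlib.pyplot as plt

# AllElectronics prices from the text (sorted, rounded to the nearest dollar).
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

# Bucket edges chosen so each bucket covers $1-10, $11-20, $21-30.
plt.hist(prices, bins=[0.5, 10.5, 20.5, 30.5], edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items sold")
plt.title("Histogram for AllElectronics")
plt.show()
```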


Fig.: Histogram for AllElectronics

Clustering:

Clustering partitions data into clusters/groups whose members are similar/close to one another. In data reduction, the cluster representations of the data are used to replace the actual data.

Sampling:

Sampling is used as a data reduction technique in which a large data set is represented by a much smaller random sample (subset) of the data.

Common ways to sample:

i. Simple random sample without replacement of size s (SRSWOR)

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

ii. Simple random sample with replacement of size s (SRSWR)

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

iii. Cluster sample

The tuples in D are grouped into M mutually disjoint "clusters"; an SRS of s clusters can then be obtained, where s < M.

iv. Stratified sample

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, rather than to N, the size of the data set. Hence, sampling complexity is potentially sublinear in the size of the data. (A small pandas sketch of these sampling methods follows.)
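A minimal pandas sketch of SRSWOR, SRSWR, and stratified sampling on a made-up customer table. The grouped .sample() call assumes a reasonably recent pandas version.

```python
import pandas as pd

# Toy data set D of N = 10 tuples; the contents are illustrative only.
D = pd.DataFrame({"cust_id":   range(1, 11),
                  "age_group": ["youth"] * 3 + ["adult"] * 5 + ["senior"] * 2})

# SRSWOR: simple random sample of s = 4 tuples without replacement.
srswor = D.sample(n=4, replace=False, random_state=0)

# SRSWR: with replacement, so the same tuple may be drawn more than once.
srswr = D.sample(n=4, replace=True, random_state=0)

# Stratified sample: an SRS from each stratum (here, 1 tuple per age group),
# so even the smallest age group is guaranteed to be represented.
stratified = D.groupby("age_group", group_keys=False).sample(n=1, random_state=0)

print(srswor, srswr, stratified, sep="\n\n")
```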

Fig. Sampling Techniques


Data Cube Aggregation:

• Aggregates data into a single view.
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data, thereby benefiting OLAP and data mining.
• Cubes created at varying levels of abstraction are often referred to as cuboids.
• The cube created at the lowest level of abstraction is the base cuboid.
  o Ex: data regarding individual sales or customers.
• The cube created at the highest level of abstraction is the apex cuboid.
  o Ex: the total sales over all three years, for all items.

Fig. Data Cube

Data Transformation

The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.

Data Transformation Strategies overview:

1. Smoothing: performed to remove noise. Ex: binning, regression, clustering.
2. Attribute construction: new attributes are added to help the mining process.
3. Aggregation: data are summarized or aggregated. Ex: sales data aggregated into monthly and annual totals. This step is used in constructing a data cube.
4. Normalization: data are scaled so as to fall within a smaller range, e.g., -1.0 to +1.0.
5. Data discretization: raw values are replaced by interval labels or conceptual labels.


   Ex: age
   • interval labels (10-18, 19-50)
   • conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: attributes are generalized to higher-level concepts.
   Ex: street is generalized to city or country.

Data Transformation by Normalization:

The measurement unit used can affect data analysis. To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0].

Normalizing the data attempts to give all attributes an equal weight. For example, changing the unit of a height attribute from meters to inches leads to different results, because the attribute then has a larger range and hence a larger influence.

Normalization is useful in classification algorithms involving neural networks, or in distance-based methods such as nearest-neighbor classification and clustering. There are different methods for normalization, such as min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-Max Normalization:

a) Find the minimum (minA) and maximum (maxA) values of the attribute A in the data.
b) Transform each value Vi into the new range [new_minA, new_maxA] by computing

   Vi' = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

c) Min-max normalization preserves the relationships among the original data values.

Ex: Suppose the minimum income is Rs. 12,000, the maximum income is Rs. 98,000, and the new range is [0.0, 1.0]. A value Vi = Rs. 73,600 transforms into

   Vi' = ((73600 - 12000) / (98000 - 12000)) * (1.0 - 0.0) + 0.0 = 0.716

(A small Python sketch follows.)
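A small Python sketch of the formula, reproducing the income example above:

```python
# Min-max normalization of a single value to the range [new_min, new_max].
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the text: min = 12,000, max = 98,000, value = 73,600.
print(round(min_max(73600, 12000, 98000), 3))   # -> 0.716
```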

Z-score Normalization:

The values of an attribute A are normalized based on the mean and standard deviation of A:

   Vi' = (Vi - Ā) / σA

where Ā is the mean and σA is the standard deviation of A.

The mean absolute deviation of A (sA) may also be used in place of the standard deviation, as it is more robust to outliers than the standard deviation (σA). (A small numpy sketch follows.)
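A minimal numpy sketch with illustrative income values:

```python
import numpy as np

# Z-score normalization: centre on the mean and scale by the standard deviation.
# The income values below are illustrative only.
incomes = np.array([54000, 16000, 73600, 98000, 12000], dtype=float)
z_scores = (incomes - incomes.mean()) / incomes.std()

print(z_scores.round(3))
```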

Decimal Scaling:

a) Normalizes by moving the decimal point of the values of attribute A.
b) The number of decimal places moved depends on the maximum absolute value of A:

   Vi' = Vi / 10^j

where j is the smallest integer such that max(|Vi'|) < 1.

Ex: A ranges from -986 to 917.
    The maximum absolute value is 986, so j = 3.
    Divide each value by 1000 (i.e., 10^3).
    The normalized values therefore range from -0.986 to 0.917.

(A small numpy sketch follows.)
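A minimal numpy sketch reproducing the example:

```python
import numpy as np

# Decimal scaling for the example A = -986 ... 917: find the smallest j such
# that every scaled value has absolute value below 1, then divide by 10^j.
values = np.array([-986, 917], dtype=float)

j = 0
while (np.abs(values) / 10 ** j).max() >= 1:
    j += 1                                  # here j ends up as 3

scaled = values / 10 ** j
print(j, scaled)                            # -> 3 [-0.986  0.917]
```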
