FoDS Notes - Unit 2
A data warehouse is kept separate from the operational DBMS; it stores a huge amount of data that is typically collected
from multiple heterogeneous sources such as files, DBMSs, etc.
Flat Files: A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data: A set of data that defines and gives information about other data.
• Metadata is used in a data warehouse for a variety of purposes:
• It summarizes necessary information about data, for example the
author, the date the data was built or last changed, and the file size.
• Metadata is used to direct a query to the most appropriate data
source.
End-User Access Tools: The principal purpose of a data warehouse is to provide information to
business managers for strategic decision-making. These users interact with the warehouse using
end-user access tools.
Some examples of end-user access tools are:
Reporting and Query Tools
Application Development Tools
Executive Information Systems Tools
Online Analytical Processing Tools
Data Mining Tools
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales instead of the ongoing operations.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10
years).
Non-volatile: A data warehouse is always a physically separate data store. Due to this separation, a
data warehouse does not require transaction processing, recovery, or concurrency control
mechanisms.
It usually requires only two operations in data accessing: initial loading of data and access of data.
[Comparison table: OLTP vs. OLAP]
Dimensional Modelling is a data structure technique optimized for data storage in a Data
warehouse.
The purpose of dimensional modeling is to optimize the database for faster retrieval of data.
A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric
information like values, balances, counts, weights, etc. In contrast, relational models are
optimized for addition, updating, and deletion of data in a real-time Online Transaction
Processing (OLTP) system.
The dimensional and relational models each have their own way of storing data, with specific
advantages. For instance, in the relational model, normalization and ER modelling reduce
redundancy in the data. In contrast, the dimensional model in a data warehouse arranges data in
such a way that it is easier to retrieve information and generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for
transactional (OLTP) systems.
The multi-dimensional data model is a method for organizing data in the database, with proper
arrangement and assembly of the database contents.
It represents data in the form of data cubes.
Data cubes allow data to be modelled and viewed from many dimensions and perspectives.
It is defined by dimensions and facts and is represented by a fact table.
Fact
o Facts are the measurements from the business process.
o For a Sales business process, a measurement would be quarterly sales number
Dimension
o Dimensions provide the context surrounding a business process event. In simple terms,
they give the who, what, and where of a fact. In the Sales business process, for the fact
'quarterly sales number', the dimensions would be
o Who – Customer Names
o Where – Location
o What – Product Name
Fact Table
o A fact table is the primary table in dimensional modelling.
Dimension Table
o A dimension table contains dimensions of a fact.
[Figures: 2D and 3D representations of a data cube]
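As a small illustration of facts, dimensions, and the data cube, the following sketch (Python with pandas; all table names, column names, and values are made up for illustration) builds a tiny fact table with two dimension tables and then pivots the joined data into a 2-D cube of sales by product and location.

    import pandas as pd

    # Hypothetical dimension tables (the who / what context of a sale)
    customers = pd.DataFrame({"cust_id": [1, 2], "cust_name": ["Asha", "Ravi"]})
    products = pd.DataFrame({"prod_id": [10, 20], "prod_name": ["Pen", "Book"]})

    # Hypothetical fact table: one row per sale, with a measure and foreign keys
    sales = pd.DataFrame({
        "cust_id": [1, 1, 2, 2],
        "prod_id": [10, 20, 10, 20],
        "location": ["Mysore", "Mysore", "Bengaluru", "Bengaluru"],
        "amount": [100, 250, 120, 300],
    })

    # Join the fact table with its dimensions (star-schema style),
    # then build a 2-D cube slice: total sales by product and location
    joined = sales.merge(customers, on="cust_id").merge(products, on="prod_id")
    cube_2d = joined.pivot_table(values="amount", index="prod_name",
                                 columns="location", aggfunc="sum")
    print(cube_2d)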
Advantages:
• It is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases (e.g., relational databases).
• The representation of data is better than in traditional databases, because multi-dimensional
databases are multi-viewed and carry different types of factors.
Disadvantage:
• It requires trained professionals to recognize and examine the data in the database.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
1. Ignore the tuples: This approach is not very effective unless the tuple contains several attributes
with missing values; otherwise, useful data in the remaining attributes is discarded.
2. Fill in the missing value: This strategy can be time-consuming and is not always practical. The
missing value can be filled in manually, or by using the attribute mean or the most probable value
(see the sketch after this list).
3. Binning method: This strategy is fairly easy to understand. The data is first sorted and then split
into several equal-sized bins; the values in each bin are then smoothed using their neighbouring
values, for example by bin means, bin medians, or bin boundaries.
4. Regression: The data is smoothed out by fitting it to a regression function. The regression may be
linear (a single independent variable) or multiple (more than one independent variable).
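A minimal sketch of points 2 to 4 above, in Python with numpy/pandas (the column names, values, and the number of bins are assumptions made for illustration): it fills a missing value with the attribute mean, smooths sorted values by equal-frequency bin means, and smooths one attribute with a simple linear regression on another.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, 25, np.nan, 30, 31, 45],
                       "income": [20, 24, 26, 31, 33, 48]})

    # 2. Fill in the missing value with the attribute mean
    df["age"] = df["age"].fillna(df["age"].mean())

    # 3. Binning: sort, split into equal-sized bins, replace values by bin means
    values = np.sort(df["income"].to_numpy())
    bins = np.array_split(values, 3)              # 3 bins, an assumed choice
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print("income smoothed by bin means:", smoothed)

    # 4. Regression: smooth income using a linear fit against age
    slope, intercept = np.polyfit(df["age"], df["income"], deg=1)
    df["income_smoothed"] = intercept + slope * df["age"]
    print(df)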
Data Integration
Data integration is the process of merging heterogeneous data from several sources.
It is a strategy that integrates data from several sources and makes it available to users in a single,
uniform view.
Ex: Integrating data from various patient records and clinics assists clinicians in identifying medical
disorders and diseases, by combining data from many systems into a single view of useful
information.
Because the records are obtained from heterogeneous sources, one issue is how to match the
real-world entities across the data (entity identification). For example, given customer data from
different data sources, one source may identify a customer by a customer identity (e.g., customer_id)
while another uses a customer number (e.g., cust_number). Analyzing the available metadata helps
prevent errors during schema integration.
One of the major issues during data integration is redundancy. Redundant data is data that is
unimportant or no longer required; it may also arise when an attribute can be derived from another
attribute in the data set.
Inconsistencies in attribute or dimension naming further increase redundancy. Correlation analysis
can be used to detect such redundancy.
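Correlation analysis can be sketched as follows in Python with pandas (the attribute names and values are hypothetical): a correlation coefficient close to +1 or -1 between two numeric attributes suggests that one of them may be redundant.

    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # derived from height_cm
        "weight_kg": [55, 60, 72, 80, 95],
    })

    # Pearson correlation matrix between all numeric attributes
    print(df.corr(method="pearson"))
    # height_cm and height_in correlate almost perfectly, so one is redundant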
In addition to redundancy, data integration must also handle duplicate tuples. Duplicate tuples may
appear in the integrated data, for example when denormalized tables are used as sources for
data integration.
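After integration, duplicate tuples can be detected and removed, for example with pandas (a sketch; the column names are assumed):

    import pandas as pd

    merged = pd.DataFrame({
        "cust_id": [1, 1, 2],
        "city": ["Mysore", "Mysore", "Bengaluru"],
    })

    # Keep only one copy of each identical tuple
    deduplicated = merged.drop_duplicates()
    print(deduplicated)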
Another issue is the detection and resolution of data value conflicts: when combining records from
several sources, attribute values for the same real-world entity may differ because they are
represented differently (for example, in different units or scales) in the different data sets.
For example, in different cities the price of a hotel room might be expressed in a different
currency. This type of issue is recognized and fixed during the data integration process.
Data Reduction
Data reduction is a technique used to reduce the size of a dataset while still preserving the most
important information.
This can be beneficial in situations where the dataset is too large to be processed efficiently, or
where the dataset contains a large amount of irrelevant or redundant information.
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving the
overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single
feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless compression
to reduce the size of a dataset.
4. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand (a combined sketch of these reduction techniques follows).
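The sketch below (Python with pandas and scikit-learn; the dataset, target labels, and all parameter choices are assumptions made for illustration) walks through the four reduction techniques above: random sampling, PCA for dimensionality reduction, lossless compression of the serialized data, and univariate feature selection.

    import zlib
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("ABCDE"))
    labels = (df["A"] + df["B"] > 0).astype(int)  # assumed target for selection

    # 1. Data sampling: keep a 10% random subset of the rows
    sample = df.sample(frac=0.1, random_state=0)

    # 2. Dimensionality reduction: project 5 features onto 2 principal components
    reduced = PCA(n_components=2).fit_transform(df)

    # 3. Data compression: lossless compression of the serialized dataset
    raw = df.to_csv(index=False).encode()
    compressed = zlib.compress(raw)
    print(len(raw), "->", len(compressed), "bytes")

    # 4. Feature selection: keep the 2 features most associated with the labels
    selected = SelectKBest(score_func=f_classif, k=2).fit_transform(df, labels)
    print(sample.shape, reduced.shape, selected.shape)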
Data Transformation
Data transformation is a technique used to convert the raw data into a suitable format that efficiently
eases data mining and retrieves strategic information.
Data transformation includes data cleaning techniques and a data reduction technique to convert the
data into the appropriate form.
1. Data Smoothing
• Data smoothing is a process that is used to remove noise from the dataset using some
algorithms.
• The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns.
• Binning: This method splits the sorted data into a number of bins and smooths the
data values in each bin using the neighbouring values around them.
• Regression: This method identifies the relation between two attributes so that,
given one attribute, it can be used to predict the other attribute.
• Clustering: This method groups similar data values into clusters. Values that
lie outside the clusters are known as outliers (see the sketch below).
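A minimal sketch of the clustering idea (Python with scikit-learn; the values, the number of clusters, and the cut-off are assumptions): values that lie unusually far from their cluster centre are flagged as potential outliers.

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([10, 11, 12, 13, 14, 50, 51, 52, 53, 54, 95]).reshape(-1, 1)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

    # Distance of each value to the centre of its own cluster
    centres = kmeans.cluster_centers_[kmeans.labels_].ravel()
    distances = np.abs(values.ravel() - centres)

    threshold = 3 * distances.mean()        # assumed outlier cut-off
    print("potential outliers:", values.ravel()[distances > threshold])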
2. Attribute Construction
• New attributes are constructed from the existing attributes to produce a data set that eases data
mining.
• For example, suppose we have a data set with measurements of different plots, i.e., the height
and width of each plot. We can construct a new attribute 'area' from the attributes 'height' and
'width' (see the sketch below). This also helps in understanding the relations among the attributes in a
data set.
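A tiny sketch of attribute construction in pandas (the plot measurements are made up): the new attribute 'area' is derived from the existing 'height' and 'width' attributes.

    import pandas as pd

    plots = pd.DataFrame({"height": [10, 12, 8], "width": [4, 5, 6]})

    # Construct a new attribute from the existing ones
    plots["area"] = plots["height"] * plots["width"]
    print(plots)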
3. Data Normalization
• Data normalization involves converting all data variables into a given range.
• This involves transforming the data to fall within a smaller or common range such as Range =
[-1,1], [0.0,1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• For example, changing the unit of height from meters to inches can lead to different results,
because the attribute then has a much larger range.
• To help avoid this dependence on the choice of units, the data should be normalized (see the
sketch below).
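A minimal sketch of min-max normalization to the range [0.0, 1.0] in pandas (the attribute names and values are assumed):

    import pandas as pd

    df = pd.DataFrame({"height_m": [1.5, 1.6, 1.75, 1.9],
                       "income": [20000, 35000, 50000, 80000]})

    # Min-max normalization: every attribute is rescaled to [0.0, 1.0]
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)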
4. Data Discretization
• This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze.
• For example, ages can be grouped into intervals such as (1–10, 11–20, ...) or into labels such as
(young, middle age, senior) (see the sketch below).
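A short sketch of discretization with pandas (the interval boundaries and labels are assumed): continuous age values are replaced by interval labels.

    import pandas as pd

    ages = pd.Series([5, 17, 23, 34, 46, 61, 72])

    # Replace continuous ages with labelled intervals
    age_groups = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                        labels=["young", "adult", "middle age", "senior"])
    print(age_groups)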
5. Data Generalization
• It converts low-level data attributes to high-level data attributes.
• This conversion from a lower level to a higher conceptual level is useful to get a clearer picture
of the data.
• For example, age data in a dataset may appear as exact values such as 20 or 30. It can be
transformed to a higher conceptual level as a categorical value (young, old), as sketched below.
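Generalization can be sketched as mapping exact values to a higher conceptual level (pandas; the age cut-off of 40 is an assumption made for illustration):

    import pandas as pd

    ages = pd.Series([20, 30, 45, 62, 71])

    # Generalize exact ages to a higher conceptual level (young / old)
    generalized = ages.apply(lambda a: "young" if a < 40 else "old")
    print(generalized)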
Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones
so that the evaluation and management of data become easy.
In other words, data discretization is a method of converting the attribute values of continuous data
into a finite set of intervals.
Discretization can be supervised (it uses class information) or unsupervised (it does not), and it can
proceed top-down (by splitting) or bottom-up (by merging), depending on the direction in which the
operation proceeds.
Discretization by decision tree analysis: it employs a top-down splitting strategy. It is a supervised
technique that uses class information (a sketch follows).
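A sketch of decision-tree-based discretization with scikit-learn (the data, class labels, and tree depth are assumptions): a shallow tree is fitted on the attribute against the class labels, and the split thresholds it learns during top-down splitting become the interval boundaries.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    ages = np.array([18, 22, 25, 33, 38, 45, 52, 60, 67, 71]).reshape(-1, 1)
    labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # assumed class information

    # Top-down splitting: the tree chooses cut points that best separate classes
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(ages, labels)

    # Internal (non-leaf) node thresholds serve as interval boundaries
    cut_points = sorted(t for t in tree.tree_.threshold if t != -2)
    print("cut points:", cut_points)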