Data Preprocessing, Data Warehousing
Overview
Unit 2 introduces the foundational steps in data mining: data preprocessing, data
warehousing, and OLAP (Online Analytical Processing). These topics are crucial
because raw data is often messy, incomplete, or not structured for analysis, and data min-
ing requires high-quality, well-organized data to produce meaningful insights. Data pre-
processing ensures the data is clean and usable, data warehousing provides a centralized
system to store and manage large volumes of data, and OLAP enables multidimensional
analysis for decision-making. This unit, spanning 11 hours, covers techniques to clean
and transform data, design data warehouses, and perform analytical queries, preparing
students for advanced data mining tasks like those in Unit 4 (Mining Data Streams).
1 Data Preprocessing
1.1 What is Data Preprocessing?
• Definition: Data preprocessing involves cleaning, transforming, and organiz-
ing raw data into a suitable format for mining and analysis.
• Why It's Important: Raw data often contains noise, inconsistencies, missing
values, and irrelevant attributes, which can lead to inaccurate or misleading
results in data mining.
• Data Cleaning:
– What is it?: Fixing or removing incorrect, incomplete, or noisy data.
– Techniques (a Python sketch follows this list):
∗ Handling Missing Values:
· Ignore the Record: Remove rows with missing values if the dataset is
large.
· Fill with Mean/Median: Replace missing values with the average or
median (e.g., replacing missing ages with the average age of cus-
tomers).
· Predict Missing Values: Use algorithms like k-Nearest Neighbors (k-
NN) to predict missing values based on similar records.
∗ Smoothing Noise: Use techniques like binning, regression, or clustering to
smooth noisy data (e.g., averaging out erratic sales figures).
∗ Removing Duplicates: Identify and delete duplicate records (e.g., remov-
ing repeated customer entries).
∗ Correcting Inconsistencies: Standardize data formats (e.g., converting all
dates to "YYYY-MM-DD").
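To make these steps concrete, here is a minimal pandas sketch covering missing values, duplicates, and inconsistent date formats. The customer table and the mean-fill strategy are invented for illustration, and format="mixed" assumes pandas 2.x:

```python
import pandas as pd

# Toy customer table with the problems described above: a missing age,
# a duplicated record, and inconsistent date formats (values invented).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 51],
    "signup_date": ["2024-01-05", "2024/01/05", "2024/01/05", "2024-02-10"],
})

# Handling missing values: fill missing ages with the mean age.
df["age"] = df["age"].fillna(df["age"].mean())

# Removing duplicates: drop repeated customer entries.
df = df.drop_duplicates(subset="customer_id")

# Correcting inconsistencies: standardize all dates to "YYYY-MM-DD".
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

print(df)
```

For predicting missing values rather than filling with the mean, scikit-learn's KNNImputer implements the k-NN approach mentioned above.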
• Data Transformation:
– Techniques (a Python sketch follows this list):
∗ Normalization: Scaling numeric data to a specific range, often [0, 1], to
ensure fair comparisons (e.g., scaling income and age to the same range).
∗ Standardization: Transforming data to have a mean of 0 and a standard
deviation of 1 (e.g., standardizing test scores).
∗ Discretization: Converting continuous data into discrete bins (e.g., group-
ing ages into "Young," "Middle-Aged," "Senior").
∗ Encoding: Converting categorical data into numerical form (e.g., mapping
"Male" to 0 and "Female" to 1).
∗ Histogram Analysis: Using histograms to define bins based on data dis-
tribution.
∗ Clustering: Grouping similar values into clusters (e.g., clustering temper-
atures into "Cold," "Warm," "Hot").
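A short pandas sketch of these transformations on an invented table; scikit-learn's MinMaxScaler and StandardScaler offer the same scaling, but plain pandas is used here to stay self-contained:

```python
import pandas as pd

# Invented table with two numeric attributes and one categorical attribute.
df = pd.DataFrame({
    "income": [30000, 58000, 72000, 120000],
    "age": [22, 35, 47, 68],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Normalization: rescale income to the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Standardization: transform age to mean 0 and standard deviation 1.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Discretization: group ages into labeled bins.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 55, 120],
                         labels=["Young", "Middle-Aged", "Senior"])

# Encoding: map categorical values to numbers.
df["gender_code"] = df["gender"].map({"Male": 0, "Female": 1})

print(df)
```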
2 Data Warehousing
2.1 What is a Data Warehouse?
• Definition: A centralized repository that stores large volumes of historical data
from multiple sources, optimized for analysis and reporting.
• Characteristics:
– Subject-Oriented: Focuses on specific subjects (e.g., sales, customers) rather
than operational processes.
– Integrated: Combines data from different sources into a consistent format.
– Non-Volatile: Data is stable and not updated in real-time (e.g., historical sales
data isn't changed).
– Time-Variant: Stores historical data for long-term analysis (e.g., sales trends
over years).
2.3 ETL (Extract, Transform, Load)
• Definition: The process of extracting data from source systems, transforming it
into a clean, consistent format, and loading it into the warehouse.
Example: A company extracts sales data from its POS system, transforms
it by cleaning duplicates, and loads it into a data warehouse for analysis.
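A toy end-to-end sketch of that ETL flow in Python; the table contents are invented, and SQLite stands in for a real data warehouse:

```python
import sqlite3
import pandas as pd

# Extract: in a real pipeline this would read from the POS system;
# here a small in-memory table stands in for the raw export.
raw = pd.DataFrame({
    "order_id": [101, 101, 102],
    "sale_date": ["2024/03/01", "2024/03/01", "2024-03-02"],
    "amount": [25.0, 25.0, 40.0],
})

# Transform: clean duplicates and standardize the date format.
clean = raw.drop_duplicates()
clean["sale_date"] = pd.to_datetime(clean["sale_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Load: append the cleaned rows into a warehouse fact table
# ("warehouse.db" and "fact_sales" are illustrative names).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```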
2.4 Data Warehouse Schemas
• Star Schema:
– Structure: A central fact table (e.g., sales) connected to multiple dimension
tables (e.g., time, product, customer).
– Pros: Simple and fast for querying.
– Cons: May lead to redundancy in dimension tables.
• Snowflake Schema:
– Structure: Like a star schema, but dimension tables are normalized into sub-
tables (e.g., a "product" table splits into "category" and "subcategory").
– Pros: Reduces redundancy, saves storage.
– Cons: More complex, slower queries due to additional joins.
• Galaxy Schema (Fact Constellation):
– Structure: Multiple fact tables sharing dimension tables (e.g., sales and inven-
tory fact tables sharing a time dimension).
– Pros: Supports complex analysis across multiple subjects.
– Cons: Complex to design and maintain.
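As a concrete illustration of the star schema, here is a small pandas sketch in which one fact table joins to two dimension tables; all table names and values are invented:

```python
import pandas as pd

# Star schema: a central fact table referencing small dimension tables.
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "amount": [250.0, 120.0, 310.0],
})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})
dim_product = pd.DataFrame({"product_id": [10, 11], "name": ["Laptop", "Phone"]})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate sales by year and product.
report = (
    fact_sales
    .merge(dim_time, on="time_id")
    .merge(dim_product, on="product_id")
    .groupby(["year", "name"])["amount"]
    .sum()
)
print(report)
```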
3 OLAP (Online Analytical Processing)
• Market Research: Analyzing customer behavior across regions and time periods.
3.3 OLAP Operations
• Drill-Down: Zooming into more detailed data (e.g., from yearly sales to monthly
sales).
• Roll-Up: Aggregating data to a higher level (e.g., from monthly sales to yearly
sales).
• Slice: Fixing one dimension at a single value to extract a sub-cube (e.g., sales for
a specific year).
• Dice: Selecting specific values on two or more dimensions (e.g., sales for specific
years and regions).
• Pivot: Rotating the data axes to view it from different perspectives (e.g., switching
rows and columns in a report).
Example: Drilling down from total sales in 2024 to sales by quarter, then
slicing to see Q1 sales in the USA.
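A compact pandas sketch of these five operations on an invented sales cube; dedicated OLAP servers execute them natively over pre-aggregated cubes, while here plain DataFrame operations stand in:

```python
import pandas as pd

# Toy sales "cube" with time, region, and measure columns (values invented).
sales = pd.DataFrame({
    "year": [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region": ["USA", "EU", "USA", "EU"],
    "amount": [100, 80, 120, 90],
})

# Roll-up: aggregate from quarterly to yearly totals.
yearly = sales.groupby("year")["amount"].sum()

# Drill-down: break yearly totals back out by quarter.
quarterly = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension at a single value (Q1 only).
q1 = sales[sales["quarter"] == "Q1"]

# Dice: restrict several dimensions at once (Q1 sales in the USA).
q1_usa = sales[(sales["quarter"] == "Q1") & (sales["region"] == "USA")]

# Pivot: rotate the axes so regions become columns.
pivoted = sales.pivot_table(index="quarter", columns="region",
                            values="amount", aggfunc="sum")
print(pivoted)
```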
3.4 Applications of OLAP
• Forecasting: Predicting future trends (e.g., predicting next year's sales based on
historical data).
• Market Analysis: Analyzing customer demographics and buying patterns.
• Budgeting: Planning budgets based on historical spending patterns.
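As a minimal illustration of trend-based forecasting, the sketch below fits a straight line to invented yearly sales figures and extrapolates one year ahead; real forecasting would use proper time-series models:

```python
import numpy as np

# Hypothetical yearly sales history (figures invented).
years = np.array([2020, 2021, 2022, 2023, 2024])
sales = np.array([110.0, 118.0, 131.0, 140.0, 152.0])

# Fit a straight-line trend and extrapolate to the next year.
slope, intercept = np.polyfit(years, sales, deg=1)
forecast_2025 = slope * 2025 + intercept
print(f"Projected 2025 sales: {forecast_2025:.1f}")
```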
Conclusion
Unit 2 lays the groundwork for data mining by covering data preprocessing, data ware-
housing, and OLAP. These topics ensure that raw data is cleaned, organized, and stored
effectively, enabling multidimensional analysis for decision-making. The 11-hour duration
allows for an in-depth exploration of techniques like data cleaning, ETL processes, and
OLAP operations, preparing students for real-world applications in business intelligence,
trend analysis, and forecasting. By mastering these concepts, students build a strong
foundation for advanced data mining tasks, such as mining data streams in Unit 4, and
can handle the complexities of large-scale data analysis in modern systems.