Data Preprocessing, Data Warehousing
Overview
Unit 2 introduces the foundational steps in data mining: data preprocessing, data
warehousing, and OLAP (Online Analytical Processing). These topics are crucial
because raw data is often messy, incomplete, or not structured for analysis, and data min-
ing requires high-quality, well-organized data to produce meaningful insights. Data pre-
processing ensures the data is clean and usable, data warehousing provides a centralized
system to store and manage large volumes of data, and OLAP enables multidimensional
analysis for decision-making. This unit, spanning 11 hours, covers techniques to clean
and transform data, design data warehouses, and perform analytical queries, preparing
students for advanced data mining tasks like those in Unit 4 (Mining Data Streams).
1 Data Preprocessing
1.1 What is Data Preprocessing?
• Definition: Data preprocessing involves cleaning, transforming, and organiz-
ing raw data into a suitable format for mining and analysis.
• Why It's Important: Raw data often contains noise, inconsistencies, missing
values, and irrelevant attributes, which can lead to inaccurate or misleading
results in data mining.
• Data Cleaning:
– What is it?: Fixing or removing incorrect, incomplete, or noisy data.
– Techniques (a Python sketch follows this list):
∗ Handling Missing Values:
· Ignore the Record: Remove rows with missing values if the dataset is
large.
· Fill with Mean/Median: Replace missing values with the average or
median (e.g., replacing missing ages with the average age of cus-
tomers).
· Predict Missing Values: Use algorithms like k-Nearest Neighbors (k-
NN) to predict missing values based on similar records.
∗ Smoothing Noise: Use techniques like binning, regression, or clustering to
smooth noisy data (e.g., averaging out erratic sales figures).
∗ Removing Duplicates: Identify and delete duplicate records (e.g., remov-
ing repeated customer entries).
∗ Correcting Inconsistencies: Standardize data formats (e.g., converting all
dates to "YYYY-MM-DD").
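To make these steps concrete, here is a minimal pandas sketch covering missing values, duplicates, and inconsistent date formats. The customer table and the mean-fill strategy are invented for illustration, and format="mixed" assumes pandas 2.x:

```python
import pandas as pd

# Toy customer table with the problems described above: a missing age,
# a duplicated record, and inconsistent date formats (values invented).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 51],
    "signup_date": ["2024-01-05", "2024/01/05", "2024/01/05", "2024-02-10"],
})

# Handling missing values: fill missing ages with the mean age.
df["age"] = df["age"].fillna(df["age"].mean())

# Removing duplicates: drop repeated customer entries.
df = df.drop_duplicates(subset="customer_id")

# Correcting inconsistencies: standardize all dates to "YYYY-MM-DD".
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

print(df)
```

For predicting missing values rather than filling with the mean, scikit-learn's KNNImputer implements the k-NN approach mentioned above.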
• Data Transformation:
– Techniques (a Python sketch follows this list):
∗ Normalization: Scaling numeric data to a specific range, often [0, 1], to
ensure fair comparisons (e.g., scaling income and age to the same range).
∗ Standardization: Transforming data to have a mean of 0 and a standard
deviation of 1 (e.g., standardizing test scores).
∗ Discretization: Converting continuous data into discrete bins (e.g., group-
ing ages into "Young," "Middle-Aged," "Senior").
∗ Encoding: Converting categorical data into numerical form (e.g., mapping
"Male" to 0 and "Female" to 1).
∗ Histogram Analysis: Using histograms to define bins based on data dis-
tribution.
∗ Clustering: Grouping similar values into clusters (e.g., clustering temper-
atures into "Cold," "Warm," "Hot").
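A short pandas sketch of these transformations on an invented table; scikit-learn's MinMaxScaler and StandardScaler offer the same scaling, but plain pandas is used here to stay self-contained:

```python
import pandas as pd

# Invented table with two numeric attributes and one categorical attribute.
df = pd.DataFrame({
    "income": [30000, 58000, 72000, 120000],
    "age": [22, 35, 47, 68],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Normalization: rescale income to the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Standardization: transform age to mean 0 and standard deviation 1.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Discretization: group ages into labeled bins.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 55, 120],
                         labels=["Young", "Middle-Aged", "Senior"])

# Encoding: map categorical values to numbers.
df["gender_code"] = df["gender"].map({"Male": 0, "Female": 1})

print(df)
```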
2 Data Warehousing
2.1 What is a Data Warehouse?
• Definition: A centralized repository that stores large volumes of historical data
from multiple sources, optimized for analysis and reporting.
• Characteristics:
– Subject-Oriented: Focuses on specific subjects (e.g., sales, customers) rather
than operational processes.
– Integrated: Combines data from different sources into a consistent format.
– Non-Volatile: Data is stable and not updated in real-time (e.g., historical sales
data isn't changed).
– Time-Variant: Stores historical data for long-term analysis (e.g., sales trends
over years).
2.3 ETL (Extract, Transform, Load)
• Definition: The process of extracting data from source systems, transforming it
into a clean, consistent format, and loading it into the warehouse.
Example: A company extracts sales data from its POS system, transforms
it by cleaning duplicates, and loads it into a data warehouse for analysis.
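A toy end-to-end sketch of that ETL flow in Python; the table contents are invented, and SQLite stands in for a real data warehouse:

```python
import sqlite3
import pandas as pd

# Extract: in a real pipeline this would read from the POS system;
# here a small in-memory table stands in for the raw export.
raw = pd.DataFrame({
    "order_id": [101, 101, 102],
    "sale_date": ["2024/03/01", "2024/03/01", "2024-03-02"],
    "amount": [25.0, 25.0, 40.0],
})

# Transform: clean duplicates and standardize the date format.
clean = raw.drop_duplicates()
clean["sale_date"] = pd.to_datetime(clean["sale_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Load: append the cleaned rows into a warehouse fact table
# ("warehouse.db" and "fact_sales" are illustrative names).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```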
2.4 Data Warehouse Schemas
• Star Schema:
– Structure: A central fact table (e.g., sales) connected to multiple dimension
tables (e.g., time, product, customer).
– Pros: Simple and fast for querying.
– Cons: May lead to redundancy in dimension tables.
• Snowflake Schema:
– Structure: Like a star schema, but dimension tables are normalized into sub-
tables (e.g., a "product" table splits into "category" and "subcategory").
– Pros: Reduces redundancy, saves storage.
– Cons: More complex, slower queries due to additional joins.
• Galaxy Schema (Fact Constellation):
– Structure: Multiple fact tables sharing dimension tables (e.g., sales and inven-
tory fact tables sharing a time dimension).
– Pros: Supports complex analysis across multiple subjects.
– Cons: Complex to design and maintain.
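As a concrete illustration of the star schema, here is a small pandas sketch in which one fact table joins to two dimension tables; all table names and values are invented:

```python
import pandas as pd

# Star schema: a central fact table referencing small dimension tables.
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "amount": [250.0, 120.0, 310.0],
})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})
dim_product = pd.DataFrame({"product_id": [10, 11], "name": ["Laptop", "Phone"]})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate sales by year and product.
report = (
    fact_sales
    .merge(dim_time, on="time_id")
    .merge(dim_product, on="product_id")
    .groupby(["year", "name"])["amount"]
    .sum()
)
print(report)
```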
3 OLAP (Online Analytical Processing)
• Market Research: Analyzing customer behavior across regions and time periods.
3.3 OLAP Operations
• Drill-Down: Zooming into more detailed data (e.g., from yearly sales to monthly
sales).
• Roll-Up: Aggregating data to a higher level (e.g., from monthly sales to yearly
sales).
• Slice: Fixing one dimension at a single value to extract a sub-cube (e.g., sales for
a specific year).
• Dice: Selecting specific values on two or more dimensions (e.g., sales for specific
years and regions).
• Pivot: Rotating the data axes to view it from different perspectives (e.g., switching
rows and columns in a report).
Example: Drilling down from total sales in 2024 to sales by quarter, then
slicing to see Q1 sales in the USA.
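A compact pandas sketch of these five operations on an invented sales cube; dedicated OLAP servers execute them natively over pre-aggregated cubes, while here plain DataFrame operations stand in:

```python
import pandas as pd

# Toy sales "cube" with time, region, and measure columns (values invented).
sales = pd.DataFrame({
    "year": [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region": ["USA", "EU", "USA", "EU"],
    "amount": [100, 80, 120, 90],
})

# Roll-up: aggregate from quarterly to yearly totals.
yearly = sales.groupby("year")["amount"].sum()

# Drill-down: break yearly totals back out by quarter.
quarterly = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension at a single value (Q1 only).
q1 = sales[sales["quarter"] == "Q1"]

# Dice: restrict several dimensions at once (Q1 sales in the USA).
q1_usa = sales[(sales["quarter"] == "Q1") & (sales["region"] == "USA")]

# Pivot: rotate the axes so regions become columns.
pivoted = sales.pivot_table(index="quarter", columns="region",
                            values="amount", aggfunc="sum")
print(pivoted)
```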
3.4 Applications of OLAP
• Forecasting: Predicting future trends (e.g., predicting next year's sales based on
historical data).
• Market Analysis: Analyzing customer demographics and buying patterns.
• Budgeting: Planning budgets based on historical spending patterns.
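As a minimal illustration of trend-based forecasting, the sketch below fits a straight line to invented yearly sales figures and extrapolates one year ahead; real forecasting would use proper time-series models:

```python
import numpy as np

# Hypothetical yearly sales history (figures invented).
years = np.array([2020, 2021, 2022, 2023, 2024])
sales = np.array([110.0, 118.0, 131.0, 140.0, 152.0])

# Fit a straight-line trend and extrapolate to the next year.
slope, intercept = np.polyfit(years, sales, deg=1)
forecast_2025 = slope * 2025 + intercept
print(f"Projected 2025 sales: {forecast_2025:.1f}")
```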
Conclusion
Unit 2 lays the groundwork for data mining by covering data preprocessing, data ware-
housing, and OLAP. These topics ensure that raw data is cleaned, organized, and stored
effectively, enabling multidimensional analysis for decision-making. The 11-hour duration
allows for an in-depth exploration of techniques like data cleaning, ETL processes, and
OLAP operations, preparing students for real-world applications in business intelligence,
trend analysis, and forecasting. By mastering these concepts, students build a strong
foundation for advanced data mining tasks, such as mining data streams in Unit 4, and
can handle the complexities of large-scale data analysis in modern systems.