
DATA WAREHOUSING AND DATA MINING

UNIT 3
AKTU
WHAT IS DATA
• Data refers to raw, unprocessed facts, figures, symbols, or observations representing various
attributes or properties.
• It lacks context, meaning, or interpretation on its own.
WHAT IS INFORMATION
• Information results from processing and organizing data in a way that gives it meaning, context, and
relevance.
• It involves presenting data in a structured form that can be easily understood and used.
WHAT IS KNOWLEDGE
• Knowledge is a higher level of understanding that goes beyond information.
• It results from assimilating, interpreting, and contextualizing information.
• Knowledge allows us to recognize patterns, relationships, and connections.

WHAT IS DATA MINING AND ITS KEY FEATURES
Data mining is the process of extracting valuable patterns, relationships, and anomalies from large datasets.
KEY FEATURES
1. Automatic Pattern Discovery: Identifying patterns and trends without human intervention.
2. Prediction and Forecasting: Using historical data to predict future trends and outcomes.
3. Classification: Categorizing data into predefined classes.
4. Clustering: Grouping similar objects together.
5. Association Rules: Finding relationships between variables.
6. Anomaly Detection: Identifying unusual or interesting data records.
7. Sequential Patterns: Recognizing the order of events.
8. Data Cleaning and Preprocessing: Preparing data by handling missing values and inconsistencies.
9. Scalability: Handling large volumes of data efficiently.
10.Visualization: Representing data visually for easier interpretation.
11.Integration with Databases: Seamless integration with DBMS and data warehouses.
12.Interactivity: Providing interactive tools for dynamic data exploration.
13.Performance and Efficiency: Ensuring time-efficient data analysis.
14. Support for Different Data Types: Handling structured, semi-structured, and unstructured data.
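As an illustration of key feature 6 (anomaly detection), here is a minimal Python sketch that flags points far from the mean using z-scores. The threshold and sample readings are illustrative assumptions, not from the slides; with so few points, a lower threshold such as 2 is needed for the outlier to exceed it.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / stdev > threshold]

readings = [10, 12, 11, 10, 13, 11, 95, 12]  # 95 is the injected outlier
print(zscore_anomalies(readings))  # -> [95]
```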
MAJOR ISSUES IN DATA MINING
1.Mining Methodology and User Interaction Issues:
1. Diverse Knowledge Discovery: Different users have varying interests, so data mining must cover a broad
range of knowledge discovery tasks.
2. Interactive Process: Users refine data mining requests based on results, allowing for interactive
exploration.
3. Incorporation of Background Knowledge: Using existing knowledge to guide the discovery process.
4. Data Mining Query Languages: Integration with data warehouse query languages for efficient ad hoc
mining tasks.
5. Presentation and Visualization: Expressing discovered patterns in high-level languages and visual
representations.

2.Performance Issues:
1. Efficiency and Scalability: Algorithms must handle large datasets efficiently.
2. Parallel, Distributed, and Incremental Mining: Addressing factors like data size and complexity.

3.Diverse Data Types Issues:


1. Handling Complex Data: Relational, multimedia, spatial, and temporal data.
2. Data Cleaning: Necessary to handle noise and incomplete data.
3. Pattern Evaluation: Ensuring discovered patterns are interesting and not trivial.

DATA MINING/KNOWLEDGE EXTRACTION PROCESS
This process refers to extracting useful knowledge from databases.
1.Data Cleaning:
1. Remove noisy and irrelevant data.
2. Handle missing values.
3. Detect and correct data discrepancies.
4. Transform data to a consistent format.

2.Data Integration:
1. Combine heterogeneous data from multiple sources into a common format (usually a Data Warehouse).
2. Use tools like data migration, synchronization, and ETL (Extract, Transform, Load) processes.

3.Data Selection:
1. Decide which data is relevant for analysis.
2. Retrieve relevant data using techniques like neural networks, decision trees, Naive Bayes, clustering, or
regression.

4.Data Transformation:
1. Convert data into a suitable form for mining procedures.
2. Two steps:
1. Data Mapping: Assign elements from the source to the destination to capture transformations.
2. Code Generation: Create the actual transformation program.
5. Data Mining:
1. Apply techniques to extract potentially useful patterns from task-relevant data.
2. Use classification or characterization to decide the purpose of the model.

6. Pattern Evaluation:
1. Identify interesting patterns based on given measures.
2. Evaluate patterns using summarization and visualization to make them understandable to users.

7. Knowledge Representation:
1. Present the results in a meaningful way for decision-making.
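To make the Data Transformation step (4) concrete, here is a minimal Python sketch of min-max normalization, one common way to convert data to a consistent form. The [0, 1] target range and the sample values are illustrative assumptions, not prescribed by the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:  # constant column: map everything to new_min
        return [new_min] * len(values)
    return [new_min + (x - old_min) * (new_max - new_min) / span for x in values]

ages = [18, 25, 40, 60]
print(min_max_normalize(ages))  # -> [0.0, 0.1666..., 0.5238..., 1.0]
```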

CLASSIFICATION OF DATA MINING
1. Classification Based on the Mined Databases:
1. Relational Database Mining: Analyzing data from relational databases.
2. Transactional Database Mining: Focusing on transactional data (e.g., sales transactions).
3. Object-Relational Database Mining: Combining object-oriented and relational data.
4. Data Warehouse Mining: Extracting insights from data warehouses.
2. Classification Based on the Type of Mined Knowledge:
1. Characterization: Describing general properties of a dataset.
2. Discrimination: Identifying patterns distinguishing different classes.
3. Association and Correlation Analysis: Discovering relationships between variables.
4. Classification: Assigning labels to instances based on features.
5. Prediction: Forecasting future values.
6. Outlier Analysis: Detecting unusual data points.
7. Evolution Analysis: Analyzing trends over time.
3. Classification Based on the Techniques Utilized:
1. Rule-Based Methods: Using if-then rules for decision-making.
2. Decision Trees: Hierarchical structures for classification.
3. Neural Networks: Mimicking the human brain for pattern recognition.
4. Clustering Algorithms: Grouping similar data points.
5. Genetic Algorithms: Evolving solutions through natural selection.
4. Classification Based on Adapted Applications:
1. Finance: Analyzing stock market trends or credit risk.
2. Telecommunications: Identifying network anomalies.
3. DNA Sequencing: Finding patterns in genetic data.
4. Stock Markets: Predicting stock prices.
5. E-mail: Filtering spam emails.

DATA MINING FUNCTIONALITIES
1. Class/Concept Descriptions:
1. Define features that characterize a class or concept.
2. Example: Grouping products for sale vs. non-sale items.

2. Classification:
1. Assign predefined labels to data instances.
2. Used for spam detection, disease diagnosis, etc.

3. Prediction:
1. Use historical data to predict future outcomes.
2. Forecast business metrics, diagnose diseases, etc.

4. Association Analysis:
1. Identify relationships between items.
2. Commonly used in market basket analysis.

5. Cluster Analysis:
1. Group similar data instances.
2. Discover natural groupings within data.

6. Outlier Analysis:
1. Detect anomalies or outliers.
2. Crucial for fraud detection, quality control.

7. Feature Selection:
1. Choose relevant attributes for modeling.
2. Improves model performance, reduces dimensionality.

8. Dimensionality Reduction:
1. Reduce features while preserving information.
2. Techniques like PCA achieve this.
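To make association analysis (functionality 4) concrete, here is a minimal pure-Python sketch that computes support and confidence for a single candidate rule. The toy transactions and the rule {bread} -> {milk} are illustrative assumptions, not from the slides.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: {bread} -> {milk}
print(support({"bread", "milk"}))       # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))  # 0.666... (2 of 3 bread baskets)
```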
COUPLING OF A DATA MINING SYSTEM WITH A DATABASE/DATA WAREHOUSE SYSTEM
1.No Coupling: Direct retrieval from data sources without using a database or data warehouse. Simple
but inefficient.
2.Loose Coupling: Retrieval from a database or data warehouse, with results stored in the same
system. More efficient than no coupling.
3.Semi-Tight Coupling: Uses both database and data mining engine. Moderate scalability and
performance.
4.Tight Coupling: Centralized repository for integrated data. Efficient retrieval and analysis.
DIFFERENT FORMS OF DATA PROCESSING
1.Batch Processing:
1. Data collected over time and processed in large volumes.
2. Executed as a batch or group.
3. Used for non-real-time tasks like data backups and report generation.

2.Real-Time Processing:
1. Immediate handling of data as it is generated or received.
2. Requires low latency and quick response times.
3. Used in monitoring systems and financial trading.

3. Distributed Processing:
1. Tasks distributed across multiple interconnected computers or servers.
2. Enhances efficiency in large-scale systems and big data applications.
3. Examples: MapReduce in Hadoop, distributed databases.

4. Multiprocessing:
1. Utilizes multiple processors for concurrent task execution.
2. Common in scientific simulations, weather forecasting, and video rendering.

5. Commercial Data Processing:


1. Standardized processing for business applications.
2. Involves handling transactions, financial records, and customer data.

6. Scientific Data Processing:


1. Complex processing for research and analysis.
2. Involves experimental data, simulations, and modeling.

7. Online Processing:
1. Interactive, precomputed data available for immediate use.
2. Used in online applications and real-time interactions.
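As a sketch of multiprocessing (form 4), the following Python example runs a CPU-bound task across four worker processes using the standard-library multiprocessing.Pool. The simulate function is a made-up stand-in for real work such as a simulation or rendering step.

```python
from multiprocessing import Pool

def simulate(seed):
    """Stand-in for a CPU-bound task, e.g. one cell of a scientific simulation."""
    total = 0
    for i in range(1_000_000):
        total += (seed * i) % 97
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:             # four worker processes
        results = pool.map(simulate, range(8))  # tasks run concurrently
    print(results)
```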

SMOOTHING TECHNIQUES IN DATA CLEANING PROCESS
Here are the main smoothing techniques used in data cleaning, summarized briefly:
1. Moving Average
• Simple Moving Average (SMA): Average of the last n data points.
• Weighted Moving Average (WMA): Weighted average, giving more importance to recent data.
2. Exponential Smoothing
• Simple Exponential Smoothing: For data with no trend or seasonality.
• Holt’s Linear Trend Model: For data with a trend.
• Holt-Winters Seasonal Model: For data with both trend and seasonality.
3. Gaussian Smoothing
• Applies a Gaussian function to reduce noise and smooth data.
4. Median Filtering
• Replaces each data point with the median of neighboring points, effective for outlier removal.

5. Kernel Smoothing
• Uses a weighted average of surrounding data points with various kernel functions like Gaussian.
6. Local Regression (LOESS and LOWESS)
• Fits multiple regressions in local neighborhoods to produce a smooth curve.
7. Spline Smoothing
• Uses piecewise polynomials to fit a smooth curve to the data.
8. Binning
• Groups continuous data into bins, reducing minor observation differences.
9. Wavelet Transform
• Decomposes data into different frequency components for smoothing.
10. Fourier Transform Smoothing
• Decomposes time series into sinusoids and filters out high-frequency noise.
• These techniques help in noise reduction, trend analysis, seasonality detection, outlier handling, and
data visualization.
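A minimal Python sketch of two of these techniques: the simple moving average (technique 1) and smoothing by bin means, a form of binning (technique 8). The window size, bin size, and sample series are illustrative assumptions.

```python
def moving_average(series, window=3):
    """Simple moving average: mean of each sliding window of `window` points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def bin_means(series, bin_size=3):
    """Equal-depth binning: replace each value with the mean of its bin."""
    smoothed = []
    for i in range(0, len(series), bin_size):
        b = series[i:i + bin_size]
        smoothed.extend([sum(b) / len(b)] * len(b))
    return smoothed

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(moving_average(noisy))  # [9.0, 14.67, 19.0, 22.0, 23.33, 25.67, 29.0]
print(bin_means(noisy))       # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```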
Strategies for Data Cleaning:

1. Remove Duplicates:
1. Common when collecting data from various sources.
2. Eliminate duplicated entries.
2. Remove Irrelevant Data:
1. Filter out observations not relevant to the analysis.
3. Standardize Capitalization:
1. Ensure consistent text formatting.
4. Convert Data Types:
1. Match data types to their intended use.
5. Clear Formatting:
1. Remove unnecessary spaces and special characters.
6. Fix Errors:
1. Correct inaccuracies and inconsistencies.
7. Language Translation:
1. Translate multilingual data for consistency.
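A minimal pandas sketch covering several of these strategies at once: clearing formatting, standardizing capitalization, converting data types, and removing duplicates. The column names and values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "city":  [" Delhi", "delhi", "Mumbai ", "Mumbai "],
    "sales": ["100", "100", "250", "250"],
})

df["city"] = df["city"].str.strip()    # clear formatting: drop stray spaces
df["city"] = df["city"].str.title()    # standardize capitalization
df["sales"] = df["sales"].astype(int)  # convert data types
df = df.drop_duplicates()              # remove duplicate rows

print(df)  # two clean rows remain: Delhi/100 and Mumbai/250
```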

Data Reduction
• Data reduction transforms and minimizes data volume while preserving its key information, making
analysis more efficient.
Methods of Data Reduction
1.Dimensionality Reduction
1. PCA: Reduces data to orthogonal components.
2. LDA: Reduces dimensions while preserving class separability.
3. Factor Analysis: Models observed variables using potential factors.
4. t-SNE: Preserves structure for high-dimensional data visualization.

2.Numerosity Reduction
1. Regression Models: Use statistical models for data approximation.
2. Histograms: Aggregate data into bins.
3. Clustering: Group similar data into clusters.
4. Sampling: Select a representative subset of data.

3.Data Compression
1. Lossless Compression: Reduces size without losing information (e.g., Huffman Coding).
2. Lossy Compression: Reduces size with some information loss (e.g., JPEG).
4. Aggregation
1. Averaging: Replace groups of data points with their average.
2. Summarization: Use summary statistics like mean and median.

5. Feature Selection
1. Filter Methods: Evaluate features independently (e.g., Chi-square test).
2. Wrapper Methods: Use models to find the best feature subsets (e.g., Recursive Feature Elimination).
3. Embedded Methods: Perform selection during model training (e.g., Lasso).

6. Discretization
1. Binning: Divide continuous data into intervals.
2. Decision Tree-Based Discretization: Use trees to split data into intervals.
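A minimal scikit-learn sketch contrasting the two main reduction families: PCA for dimensionality reduction and random sampling for numerosity reduction. The data shape, component count, and sample size are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # 1,000 rows, 10 features

# Dimensionality reduction: project onto 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (1000, 2) -- fewer columns, same rows

# Numerosity reduction: keep a 10% random sample of the rows
sample_idx = rng.choice(len(X), size=100, replace=False)
X_sample = X[sample_idx]
print(X_sample.shape)   # (100, 10) -- fewer rows, same columns
```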

Benefits of Data Reduction


• Improved Efficiency: Speeds up processing and analysis.
• Lower Storage Costs: Requires less storage space.
• Enhanced Model Performance: Improves model interpretability and performance.
• Noise Reduction: Eliminates irrelevant data.

DIFFERENTIATE BETWEEN DIMENSIONALITY REDUCTION AND NUMEROSITY REDUCTION

| Dimensionality Reduction | Numerosity Reduction |
| --- | --- |
| Data encoding or data transformations are applied to obtain a reduced or compressed form of the original data. | Data volume is reduced by choosing suitable alternative forms of data representation. |
| It can be used to remove irrelevant or redundant attributes. | It is merely a technique for representing the original data in a smaller form. |
| In this method, some data can be lost, though what is lost is irrelevant. | In this method, there is no loss of data. |
| Methods for dimensionality reduction are wavelet transformations and Principal Component Analysis. | Methods for numerosity reduction are regression or log-linear models (parametric), and histograms, clustering, and sampling (non-parametric). |
| The components of dimensionality reduction are feature selection and feature extraction. | It has no components, only methods that ensure reduction of data volume. |
| It leads to less misleading data and more model accuracy. | It preserves the integrity of the data while the data volume is also reduced. |
Concept Hierarchy Generation for Categorical Data
• Concept hierarchy generation organizes categorical data into hierarchical levels to enable better data analysis and
abstraction.
Steps:
1. Identify Attribute: Select the categorical attribute.
2. Determine Levels: Define hierarchical levels from specific to general.
3. Organize Categories: Group categories into these levels.
Methods:
1. Domain Expert Knowledge:
1. Manually defined hierarchies based on expert understanding.
2. Example: City → State → Country.

2. Data-Driven Methods:
1. Generated automatically based on data.
2. Example: Clustering, Attribute-Oriented Induction.

3. Static and Dynamic Generation:


1. Static: Predefined, unchanging hierarchies (e.g., Day → Month → Year).
2. Dynamic: Hierarchies that adapt to data needs.

4. Rule-Based Generation:
1. Hierarchies created using predefined logical rules.
2. Example: Age categories (e.g., <18: Child, 18-65: Adult, ≥65: Senior).

• Examples:
1. Geographical Data: City → State → Country → Continent.
2. Temporal Data: Day → Week → Month → Quarter → Year.
3. Product Categories: Item → Category → Department.
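A minimal Python sketch of rule-based hierarchy generation using the age cut-offs above, plus a geographic hierarchy stored as an explicit mapping. The function name and the example city are illustrative assumptions.

```python
def age_category(age):
    """Rule-based concept hierarchy: raw age -> category level."""
    if age < 18:
        return "Child"
    elif age < 65:
        return "Adult"
    return "Senior"

# Geographic hierarchy as an explicit mapping: City -> (State, Country)
geo_hierarchy = {"Lucknow": ("Uttar Pradesh", "India")}

print([age_category(a) for a in [12, 30, 70]])  # ['Child', 'Adult', 'Senior']
print(geo_hierarchy["Lucknow"])                 # ('Uttar Pradesh', 'India')
```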
DATA MINING TECHNIQUES VS. DATA MINING STRATEGIES

| Aspect | Data Mining Techniques | Data Mining Strategies |
| --- | --- | --- |
| Definition | Specific methods or algorithms used to analyze and extract data. | Overall plans or methodologies guiding the data mining process. |
| Purpose | Extract patterns, relationships, and insights from data. | Plan and manage the entire data mining project lifecycle. |
| Scope | Narrow and focused on particular tasks (e.g., classification). | Broad; covers multiple stages from problem understanding to deployment. |
| Examples | Classification, Clustering, Regression, Association Rule Mining. | CRISP-DM, SEMMA, KDD, Agile Data Mining. |
| Application | Used during the modeling phase of a strategy. | Used to structure and guide the entire data mining process. |
