
VIPDMTheoryChapter1

The document provides an introduction to data mining, explaining its significance due to the explosive growth of data and the necessity for automated analysis. It outlines the knowledge discovery process, including data cleaning, integration, selection, transformation, mining, evaluation, and presentation, as well as the various types of data that can be mined. Additionally, it discusses applications of data mining across different fields and highlights major issues such as efficiency, scalability, and privacy.

Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 What Kind of Data Can Be Mined?
 What Technologies Are Used?
 What Kind of Applications Are Targeted?
 Major Issues in Data Mining
 Summary

Why Data Mining?
 The explosive growth of data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, the Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
 Watch out: is everything “data mining”? The following are not:
 Simple search and query processing
 (Deductive) expert systems

Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Knowledge Discovery Process
 Data cleaning (remove noise and inconsistent data)
 Data integration (multiple data sources may be combined)
 Data selection (data relevant to the analysis task are retrieved from the database)
 Data transformation (data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
 Data mining (intelligent methods are applied to extract data patterns)
 Pattern evaluation (identify the truly interesting patterns representing knowledge, based on interestingness measures)
 Knowledge presentation (visualization and knowledge representation techniques are used to present mined knowledge to users)
Data Mining: Concepts and Techniques, April 9, 2025
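The seven steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the purchase records, the function names, and the choice of co-occurring item pairs as the "pattern" are all assumptions made for the example, not part of the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("ann", "milk"), ("ann", "bread"), ("bob", "milk"), ("bob", None)]
store_b = [("cat", "milk"), ("cat", "bread"), ("dan", "bread"), ("dan", "eggs")]

def clean(records):
    """Data cleaning: drop records with missing fields."""
    return [(c, i) for c, i in records if c is not None and i is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [rec for src in sources for rec in src]

def select(records, items_of_interest):
    """Data selection: keep only task-relevant records."""
    return [(c, i) for c, i in records if i in items_of_interest]

def transform(records):
    """Data transformation: aggregate records into one basket per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: n / n_baskets
            for pair, n in pair_counts.items()
            if n / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
records = select(records, {"milk", "bread", "eggs"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print("\n".join(report))
```

Support (the fraction of baskets containing a pair) stands in here for an interestingness measure; a real system would likely use richer measures and far larger data.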

Example: A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored in a knowledge base

Data Mining
 Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
 The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role for each:]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
 This is a view from typical machine learning and statistics communities
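The three stages of this view can be illustrated with one technique per stage. The sketch below assumes min-max normalization for pre-processing, a nearest-centroid classifier for mining, and accuracy on held-out rows for pattern evaluation. The data, labels, and function names are hypothetical and chosen only to keep the example short.

```python
# Hypothetical toy data: (feature_1, feature_2) rows with a class label.
train = [((1.0, 200.0), "low"), ((2.0, 220.0), "low"),
         ((8.0, 900.0), "high"), ((9.0, 950.0), "high")]
test = [((1.5, 210.0), "low"), ((8.5, 920.0), "high")]

def fit_min_max(rows):
    """Pre-processing: learn the per-feature (min, max) from training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalize(row, bounds):
    """Pre-processing: scale each feature to [0, 1] using the learned bounds."""
    return tuple((x - lo) / (hi - lo) if hi > lo else 0.0
                 for x, (lo, hi) in zip(row, bounds))

def fit_centroids(rows, labels):
    """Mining: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, x in enumerate(row):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}

def predict(row, centroids):
    """Mining: assign the class whose centroid is closest (squared distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(row, centroids[label]))

bounds = fit_min_max([x for x, _ in train])
train_x = [normalize(x, bounds) for x, _ in train]
centroids = fit_centroids(train_x, [y for _, y in train])

# Post-processing: evaluate the learned model on held-out rows.
predictions = [predict(normalize(x, bounds), centroids) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Without normalization, the second feature would dominate the distance computation because of its larger scale, which is one reason this view puts pre-processing before mining.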

Example: Medical Data Mining
 Health care and medical data mining often adopts this view from statistics and machine learning
 Preprocessing of the data (including feature extraction and dimension reduction)
 Classification and/or clustering processes
 Post-processing for presentation

Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks, and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia databases
 Text databases
 The World-Wide Web

Data Warehouse Framework

A Multidimensional Cube

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]

Why Confluence of Multiple Disciplines?
 Tremendous amount of data
 Algorithms must be highly scalable to handle data volumes such as terabytes
 High dimensionality of data
 Micro-array data may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks, and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text, and Web data
 Software programs, scientific simulations
 New and sophisticated applications

Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kind of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary

Applications of Data Mining
 Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
 Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
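As one concrete instance of the Web page analysis item above, PageRank can be approximated by power iteration over a link graph. The sketch below is a simplified version under stated assumptions: a tiny hypothetical graph, a damping factor of 0.85, a fixed number of iterations, and rank from pages without outgoing links spread evenly over all pages. The function name and signature are illustrative only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key in the mapping as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank equally among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph the page with the most incoming links ends up with the highest rank, and the page nobody links to keeps only the baseline share.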

Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results

Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

Summary
 Data mining: discovering interesting patterns and knowledge from massive amounts of data
 A natural evolution of database technology, in great demand, with wide applications
 A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
 Data mining technologies and applications
 Major issues in data mining

Recommended Reference Books
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
 J. Han and M. Kamber. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
 D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag, 2009
 B. Liu. Web Data Mining. Springer, 2006
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
 P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005
