Introduction
UNIT 1 - Chapter 1
Ranjit Reddy M M. Tech., (Ph. D)
Associate Professor
Department of Computer Science & Engineering
Contents/Topics
 What Is Data Mining?
 Motivating Challenges
 The Origins of Data Mining
 Data Mining Tasks
 Summary
January 31, 2016 Data Mining: Concepts and Techniques 3
What Is Data Mining?
 Data Mining: (knowledge discovery from data)
 Extracting or “Mining” knowledge from large amounts of data.
 Searching for knowledge in your data
 Alternative names:
 Knowledge discovery (mining) in databases (KDD)
 knowledge extraction
 data/pattern analysis
 data archeology
 data dredging
 information harvesting
 business intelligence, etc.
Knowledge Discovery (KDD) Process
Knowledge Discovery (KDD) Process steps
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources, such as flat files,
spreadsheets, and relational tables, may be combined)
 3. Data selection (where data relevant to the analysis task are retrieved from the
database)
 4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
 5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)
 7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
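The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real KDD system: the two record sources, the field names, and the `min_total` threshold are all made-up assumptions, and the "mining" step is just a ranking of aggregated totals.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"id": 1, "city": "Pune", "amount": 120.0},
           {"id": 2, "city": None, "amount": 80.0}]   # missing city = noisy record
store_b = [{"id": 3, "city": "Pune", "amount": 200.0},
           {"id": 4, "city": "Delhi", "amount": 50.0}]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate amounts per city."""
    totals = Counter()
    for r in records:
        totals[r["city"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5: a stand-in 'pattern': cities ranked by total amount."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Step 6: keep only patterns above an interestingness threshold."""
    return [(city, total) for city, total in patterns if total >= min_total]

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{city}: {total:.2f}" for city, total in patterns]

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["city", "amount"])
report = present(evaluate(mine(transform(data)), min_total=100.0))
print(report)
```

Each function maps to one step, so a step can be swapped for a real implementation without touching the others.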
Architecture of typical data mining system
 Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
 Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
 Data mining engine: Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated
with the mining module.
 User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. This
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
Motivating Challenges
 Scalability:
 Datasets with sizes of gigabytes, terabytes or even petabytes
 Massive datasets cannot fit into main memory
 Need to develop scalable data mining algorithms to mine massive datasets
 Scalability can also be improved by using sampling or developing parallel and
distributed algorithms.
 High Dimensionality:
 Data sets with hundreds or thousands of attributes.
 Example: A dataset that contains measurements of temperature at various
locations.
 Traditional data analysis techniques were developed for low-dimensional
data and often do not work well on high-dimensional data.
 Need to develop data mining algorithms to handle high dimensionality.
 Heterogeneous and Complex Data:
 Traditional data analysis methods deal with datasets containing attributes of the
same type (continuous or categorical).
 Complex datasets contain images, video, text, etc.
 Need to develop mining methods to handle complex datasets.
 Data Ownership and Distribution:
 Data is not stored in one location or owned by one organization.
 Data is geographically distributed among resources belonging to multiple
entities.
 Need to develop distributed data mining algorithms to handle distributed
datasets.
 Key challenges:
 How to reduce the amount of communication needed for distributed data.
 How to effectively consolidate the data mining results from multiple sources
 How to address data security issues.
 Non-Traditional Analysis:
 The traditional statistical approach is based on a hypothesize-and-test paradigm.
 A hypothesis is proposed, an experiment is designed to gather the data, and then
the data is analyzed with respect to the hypothesis.
 This process is extremely labor-intensive.
 Need to develop mining methods to automate the process of hypothesis
generation and evaluation.
The Origins of Data Mining
 Data mining draws on ideas from several fields:
 Sampling, estimation, and hypothesis testing from statistics.
 Search algorithms, modeling techniques, and learning theories from artificial
intelligence, machine learning, and pattern recognition.
 Database systems are needed to provide support for efficient storage, indexing,
and query processing.
 Techniques from parallel computing address the massive size of some datasets.
 Distributed computing techniques are used to gather information from different
locations.
Data Mining Tasks
 Data mining tasks are divided into two major categories:
 Predictive Tasks: Predict the value of a particular attribute based on the values
of other attributes. The predicted attribute is known as the target or dependent
variable, and the attributes used to make the prediction are known as explanatory
or independent variables.
 Descriptive Tasks: Characterize the general properties of the data in the
database (correlations, trends, clusters, trajectories, and anomalies).
 Four of the core data mining tasks:
 Classification & Regression
 Association Analysis
 Cluster Analysis
 Anomaly Detection
Data Mining Functionalities
 Predictive Modeling: Building a model for the target variable as a function of the
explanatory variables.
 Classification: Used for discrete target variables.
 Ex: Predicting whether a web user will make a purchase at an online bookstore
(the target variable is binary-valued).
 Regression: Used for continuous target variables.
 Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
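To make the difference between the two kinds of predictive model concrete, here is a minimal sketch. The data is invented for illustration, and the two techniques (a least-squares line for regression, a one-nearest-neighbor rule for classification) are simple stand-ins chosen for brevity, not methods named in these slides.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def nearest_neighbor_label(training_pairs, x):
    """Classification: return the label of the closest training example."""
    return min(training_pairs, key=lambda pair: abs(pair[0] - x))[1]

# Continuous target: hypothetical stock price per day.
days = [1, 2, 3, 4]
prices = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(days, prices)
predicted_price = slope * 5 + intercept

# Binary target: hypothetical (pages viewed, made a purchase) pairs.
visits = [(1, False), (2, False), (8, True), (10, True)]
will_buy = nearest_neighbor_label(visits, 7)

print(predicted_price, will_buy)
```

The regression output is a number on a continuous scale, while the classifier can only return one of the labels seen in training.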
 Association Analysis:
 Used to discover patterns that describe strongly associated features in the data.
 The discovered patterns are typically represented in the form of implication rules or
feature subsets.
 The table below illustrates data collected at a supermarket.
 Association analysis can be applied to find items that are frequently bought together
by customers.
 A discovered association rule is {Diapers} → {Milk} (customers who buy diapers
also tend to buy milk).
Market Basket Analysis
Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
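The rule {Diapers} → {Milk} can be checked against the ten market basket transactions with a short sketch. Support and confidence are the usual measures of rule strength. They are not defined in these slides, so the function names and definitions below are assumptions added for illustration.

```python
# The ten market basket transactions, one set of items per transaction.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"Diapers", "Milk"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Milk"}, transactions)
print(rule_support, rule_confidence)  # 0.5 1.0
```

Five of the ten baskets contain both items, and every basket with diapers also contains milk, which is why the rule stands out in this data.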
 Cluster Analysis:
 A cluster is a group of similar objects.
 Objects are clustered or grouped based on the principle of maximizing the
intra-class similarity (within a cluster) and minimizing the inter-class
similarity (between clusters).
Document Clustering
 Each article is represented as a set of word-frequency pairs (w, c), where w is a
word and c is the number of times the word appears in the article.
 There are two natural clusters in the dataset below.
 The first cluster consists of the first three articles (news about the economy).
 The second cluster contains the last three articles (news about health care).
Article   Words
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
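One way to see the two clusters is to compare the six articles pairwise. The sketch below uses cosine similarity on the word counts and links any two articles whose similarity is above zero. Both choices are assumptions made for illustration. The slides do not say which similarity measure or clustering algorithm is intended.

```python
import math

# Word counts for the six articles in the document clustering example.
articles = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

def group_by_similarity(docs, threshold=0.0):
    """Put two documents in the same group if a chain of similar documents links them."""
    clusters = []
    for doc_id in docs:
        linked = [c for c in clusters
                  if any(cosine(docs[doc_id], docs[other]) > threshold for other in c)]
        merged = {doc_id}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return sorted(sorted(c) for c in clusters)

clusters = group_by_similarity(articles)
print(clusters)  # [[1, 2, 3], [4, 5, 6]]
```

In this small dataset the economy articles and the health care articles share no words at all, so every cross-cluster similarity is exactly zero. Real text would need a threshold above zero.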
 Anomaly detection:
 The task of identifying observations whose characteristics are significantly different
from the rest of the data.
 Such observations are known as anomalies or Outliers.
 A good anomaly detector must have a high detection rate and a low false alarm rate.
 Applications: Detection of fraud, network intrusions, etc.
 Ex: Credit Card Fraud Detection:
 A Credit Card Company records the transactions made by every credit card holder,
along with the personal information such as credit limit, age, annual income and
address.
 When a new transaction arrives, it is compared against the profile of the user.
 If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
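The comparison of a new transaction against a user's profile could look like the sketch below. It assumes the profile is simply the list of past transaction amounts and uses a z-score with a cutoff of 3, which is an illustrative choice and not something the slides prescribe. A real detector would use many more attributes, such as the credit limit or income mentioned above.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, z_threshold=3.0):
    """Flag an amount that lies far from the user's past amounts.

    history: past transaction amounts for one card holder (hypothetical profile).
    z_threshold: how many standard deviations count as 'very different'.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical, so any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold

# Hypothetical profile: five past purchases by the same card holder.
past_amounts = [20.0, 35.0, 25.0, 40.0, 30.0]

print(is_suspicious(past_amounts, 32.0))   # False
print(is_suspicious(past_amounts, 500.0))  # True
```

Lowering `z_threshold` raises the detection rate but also the false alarm rate, which is the trade-off mentioned above.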
  • 1. Introduction UNIT 1 - Chapter 1 Ranjit Reddy M M. Tech., (Ph. D) Associate Professor Department of Computer Science & Engineering
  • 2. 2 Contents/Topics  What Is Data Mining?  Motivating Challenges  The Origins of Data Mining  Data Mining Tasks  Summary
  • 3. January 31, 2016 Data Mining: Concepts and Techniques 3 What Is Data Mining?  Data Mining: (knowledge discovery from data)  Extracting or “Mining” knowledge from large amounts of data.  Searching for knowledge in your data  Alternative names:  Knowledge discovery (mining) in databases (KDD)  knowledge extraction  data/pattern analysis  data archeology  data dredging  information harvesting  business intelligence, etc.
  • 5. January 31, 2016 Data Mining: Concepts and Techniques 5 Knowledge Discovery (KDD) Process steps  1. Data cleaning (to remove noise and inconsistent data)  2. Data integration (where multiple data sources may be combined-Flat files, spread sheets and relational tables)  3. Data selection (where data relevant to the analysis task are retrieved from the database)  4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)  5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)  6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)  7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
  • 6. Architecture of typical data mining system
  • 7. January 31, 2016 Data Mining: Concepts and Techniques 7 Architecture of typical data mining system  Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.  Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.  Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
  • 8. January 31, 2016 Data Mining: Concepts and Techniques 8 Architecture of typical data mining system  Data mining engine: Consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.  Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module.  User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. This component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
  • 9. Motivating Challenges
    ◦ Scalability:
      - Datasets can reach sizes of gigabytes, terabytes, or even petabytes.
      - Massive datasets cannot fit into main memory.
      - Scalable data mining algorithms are needed to mine massive datasets.
      - Scalability can also be improved by sampling or by developing parallel and distributed algorithms.
    ◦ High Dimensionality:
      - Datasets may have hundreds or thousands of attributes.
      - Example: a dataset that contains measurements of temperature at various locations.
      - Traditional data analysis techniques that were developed for low-dimensional data often do not work well for high-dimensional data.
      - Data mining algorithms are needed that can handle high dimensionality.
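The sampling idea mentioned under scalability can be illustrated with reservoir sampling, one common way to draw a fixed-size uniform sample from data that is too large to hold in memory. The slide does not name a specific sampling method, so this choice is an assumption; the function name `reservoir_sample` and the seed are illustrative only.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Only k items are held in memory at any time, so the stream
    itself never needs to fit into main memory.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 records from a 100,000-record stream.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

A mining algorithm could then be run on `sample` instead of the full dataset, trading some accuracy for a much smaller memory footprint.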
  • 10. Motivating Challenges
    ◦ Heterogeneous and Complex Data:
      - Traditional data analysis methods deal with datasets whose attributes are all of the same type (continuous or categorical).
      - Complex datasets contain images, video, text, etc.
      - Mining methods are needed that can handle complex datasets.
    ◦ Data Ownership and Distribution:
      - Data is not stored in one location or owned by one organization.
      - Data is geographically distributed among resources belonging to multiple entities.
      - Distributed data mining algorithms are needed to handle distributed datasets.
      - Key challenges:
        - How to reduce the amount of communication needed for distributed data.
        - How to effectively consolidate the data mining results from multiple sources.
        - How to address data security issues.
  • 11. Motivating Challenges
    ◦ Non-Traditional Analysis:
      - The traditional statistical approach is based on a hypothesize-and-test paradigm.
      - A hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
      - This process is extremely labor-intensive.
      - Mining methods are needed to automate the process of hypothesis generation and evaluation.
  • 12. The Origins of Data Mining
    ◦ Data mining draws on ideas such as:
      - Sampling, estimation, and hypothesis testing from statistics.
      - Search algorithms, modeling techniques, and learning theories from artificial intelligence, machine learning, and pattern recognition.
    ◦ Database systems are needed to provide support for efficient storage, indexing, and query processing.
    ◦ Techniques from parallel computing address the massive size of some datasets.
    ◦ Distributed computing techniques are used to gather information from different locations.
  • 13. Data Mining Tasks
    ◦ Data mining tasks are divided into two major categories:
      - Predictive Tasks: Predict the value of a particular attribute based on the values of other attributes. The predicted attribute is known as the target or dependent variable, and the other attributes are known as the explanatory or independent variables.
      - Descriptive Tasks: Characterize the general properties of the data in the database (correlations, trends, clusters, trajectories, and anomalies).
    ◦ Four of the core data mining tasks:
      - Classification & Regression
      - Association Analysis
      - Cluster Analysis
      - Anomaly Detection
  • 15. Data Mining Functionalities
    ◦ Predictive Modeling: Building a model for the target variable as a function of the explanatory variables.
      - Classification: Used for discrete target variables. Ex: Predicting whether a web user will make a purchase at an online book store (the target variable is binary-valued).
      - Regression: Used for continuous target variables. Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
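The difference between the two predictive tasks can be sketched in a few lines. The slide does not prescribe any particular model, so the two techniques below (a least-squares line for regression, a one-nearest-neighbor rule for classification) are assumptions chosen for brevity, and all data values are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbor_label(train: List[Tuple[float, str]], x: float) -> str:
    """Return the label of the training point closest to x (classification)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


# Hypothetical data: day index vs. closing price (continuous target).
days = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(days, prices)
predicted_price = slope * 5.0 + intercept  # forecast for day 5

# Hypothetical data: minutes on the site vs. outcome (binary target).
visits = [(1.0, "no purchase"), (2.0, "no purchase"),
          (9.0, "purchase"), (12.0, "purchase")]
predicted_label = nearest_neighbor_label(visits, 10.0)

print(predicted_price, predicted_label)
```

The regression output is a number on a continuous scale, while the classifier can only return one of the labels seen in training.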
  • 16. Data Mining Functionalities
    ◦ Association Analysis:
      - Used to discover patterns that describe strongly associated features in the data.
      - The discovered patterns are typically represented in the form of implication rules or feature subsets.
      - The table below illustrates data collected at a supermarket.
      - Association analysis can be applied to find items that are frequently bought together by customers.
      - A discovered association rule is {Diapers} → {Milk} (customers who buy diapers also tend to buy milk).
    Market Basket Analysis
      Transaction ID   Items
      1                {Bread, Butter, Diapers, Milk}
      2                {Coffee, Sugar, Cookies, Salmon}
      3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
      4                {Bread, Butter, Salmon, Chicken}
      5                {Eggs, Bread, Butter}
      6                {Salmon, Diapers, Milk}
      7                {Bread, Tea, Sugar, Eggs}
      8                {Coffee, Sugar, Chicken, Eggs}
      9                {Bread, Diapers, Milk, Salt}
      10               {Tea, Eggs, Cookies, Diapers, Milk}
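The rule {Diapers} → {Milk} can be checked directly against the ten transactions in the market basket table. The slide does not say how rule strength is measured; the sketch below assumes the two standard measures, support (the fraction of transactions containing all items in the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The function names are illustrative.

```python
from typing import FrozenSet, List

# The ten transactions from the market basket table.
transactions: List[FrozenSet[str]] = [
    frozenset({"Bread", "Butter", "Diapers", "Milk"}),
    frozenset({"Coffee", "Sugar", "Cookies", "Salmon"}),
    frozenset({"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"}),
    frozenset({"Bread", "Butter", "Salmon", "Chicken"}),
    frozenset({"Eggs", "Bread", "Butter"}),
    frozenset({"Salmon", "Diapers", "Milk"}),
    frozenset({"Bread", "Tea", "Sugar", "Eggs"}),
    frozenset({"Coffee", "Sugar", "Chicken", "Eggs"}),
    frozenset({"Bread", "Diapers", "Milk", "Salt"}),
    frozenset({"Tea", "Eggs", "Cookies", "Diapers", "Milk"}),
]


def support(itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str]) -> float:
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


diapers = frozenset({"Diapers"})
milk = frozenset({"Milk"})

print(support(diapers | milk))    # 0.5: five of ten baskets hold both items
print(confidence(diapers, milk))  # 1.0: every basket with diapers has milk
```

In this table, diapers appear in transactions 1, 3, 6, 9, and 10, and each of those also contains milk, which is why the rule's confidence is 1.0.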
  • 17. Data Mining Functionalities
    ◦ Cluster Analysis:
      - A cluster is a group of similar objects.
      - Objects are clustered or grouped based on the principle of maximizing the intraclass similarity (within a cluster) and minimizing the interclass similarity (between clusters).
    ◦ Document Clustering:
      - Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article.
      - There are 2 natural clusters in the dataset below.
      - The first cluster consists of the first 3 articles (news about the economy).
      - The second cluster contains the last 3 articles (news about health care).
      Article   Words
      1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
      2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
      3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
      4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
      5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
      6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
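The two natural clusters can be recovered from the word-frequency pairs with a very simple procedure. The slide does not name a clustering algorithm or a similarity measure, so the sketch below makes two assumptions: cosine similarity between word-count vectors, and a grouping rule that links any two articles whose similarity exceeds a threshold (effectively connected components, not a full algorithm such as k-means). All function names are illustrative.

```python
import math
from typing import Dict, List

# Word-frequency pairs (w, c) for the six articles in the table.
articles: Dict[int, Dict[str, int]] = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def group_by_similarity(docs: Dict[int, Dict[str, int]],
                        threshold: float = 0.0) -> List[List[int]]:
    """Group articles linked by a chain of similarities above threshold."""
    unvisited = set(docs)
    clusters: List[List[int]] = []
    for start in sorted(docs):
        if start not in unvisited:
            continue
        unvisited.remove(start)
        cluster, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            linked = [d for d in unvisited
                      if cosine(docs[current], docs[d]) > threshold]
            for d in linked:
                unvisited.remove(d)
                cluster.append(d)
                frontier.append(d)
        clusters.append(sorted(cluster))
    return clusters


clusters = group_by_similarity(articles)
print(clusters)  # [[1, 2, 3], [4, 5, 6]]
```

This works here because no economy article shares a word with any health care article, so every cross-cluster similarity is exactly zero; real document collections would need a higher threshold or a more robust algorithm.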
  • 18. Data Mining Functionalities
    ◦ Anomaly Detection:
      - The task of identifying observations whose characteristics are significantly different from the rest of the data.
      - Such observations are known as anomalies or outliers.
      - A good anomaly detector must have a high detection rate and a low false alarm rate.
      - Applications: fraud detection, network intrusion detection, etc.
    ◦ Ex: Credit Card Fraud Detection:
      - A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address.
      - When a new transaction arrives, it is compared against the profile of the user.
      - If the characteristics of the transaction are very different from the previously created profile, the transaction is flagged as potentially fraudulent.
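The compare-against-profile step can be sketched with a simple statistical rule. The slide does not specify how a profile is built or compared, so this is an assumption: the profile is reduced to the mean and standard deviation of past transaction amounts, and a new amount is flagged when its z-score exceeds a threshold. The function name, the threshold of 3.0, and the transaction amounts are all hypothetical.

```python
import statistics
from typing import List


def is_anomalous(history: List[float], amount: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag amount if it lies more than z_threshold sample standard
    deviations from the mean of the card holder's past amounts."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All past amounts identical: any different amount stands out.
        return amount != mean
    return abs(amount - mean) / stdev > z_threshold


# Hypothetical past transaction amounts for one card holder.
history = [42.0, 38.5, 51.0, 47.25, 40.0, 45.5]

print(is_anomalous(history, 44.0))   # a typical amount
print(is_anomalous(history, 900.0))  # far outside the usual range
```

Raising `z_threshold` lowers the false alarm rate at the cost of missing more fraud, which is the trade-off between detection rate and false alarm rate noted on the slide. A realistic profile would also use the other recorded attributes, such as credit limit and location.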