
Data Mining (DM)

2101CS521

Unit-1
Introduction to
Data Mining (DM)

Prof. Jayesh D. vagadiya


Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
[email protected]
9537133260
Topics to be covered
• Motivation for Data Mining
• Data Mining - Definition and Functionalities
• Data Mining – On what kind of data?
• KDD Process (Knowledge Discovery in Databases)
• What Kinds of Patterns Can Be Mined?
• Are All Patterns Interesting?
• Issues in DM
• Types of Attributes
• Mean, Median, mode, Standard Deviation of Data
• Data Matrix vs Dissimilarity Matrix
• Dissimilarity of Numeric Data
Just think: One Second on Internet
 9,003 Tweets
 4,705 Skype Calls
 1,711 Tumblr Posts
 83,378 Google Searches
 84,388 YouTube videos viewed
 996 Instagram photos uploaded
 & many more…
Is all this information really important to us?

#2101CS521 (DM)  Unit 1 – Introduction to


Prof. Jayesh D. Vagadiya 3
Motivation: Why data mining?
 “Necessity is the Mother of all Inventions”
 “It has been estimated that the amount of information in the world
doubles every 10 months.”
 There is a tremendous increase in the amount of data recorded and stored
on digital media as well as individual sources.
 Since the 1960s, database and information technology has evolved
systematically from primitive file processing systems to powerful database
systems.
 The research and development in database systems since the 1970s has
led to the development of relational database systems.
“We are drowning in data, but starving for knowledge!”
“Data rich but Information poor”

Motivation: Why data mining? (Cont..)
Years | Evolutions
Since 1960s | Data collection, database creation, IMS (hierarchical database system by IBM), and network DBMS
1970s | Relational data model, relational DBMS implementation
1980s | RDBMS, advanced data models, application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s | Data mining, data warehousing, multimedia databases, and web databases
2000s | Stream data management and mining, social networks (Facebook, etc.), web technology (XML), and global information systems
At present | Heterogeneous database systems, big data

Data grows exponentially every day, but is all of this data really
important to us?

Motivation for Data Mining : An Example

Data → Knowledge → Action → Goal

Netflix collects user ratings of movies (data) → What types of movies you will like (knowledge) → Recommend new movies to you (action) → Users stay with Netflix (goal)

Gene sequences of cancer patients (data) → Which genes lead to cancer? (knowledge) → Appropriate treatment (action) → Save life (goal)

Road traffic (data) → Which road is likely to be congested? (knowledge) → Suggest better routes to drivers (action) → Save time and energy (goal)
Summary
The overall goal of the data mining process is to extract
information from large data sets or databases and
transform it into an understandable structure for
further use.
What is Data Mining?
 Data mining refers to extracting or “mining” knowledge from large amounts of data.
 “Knowledge mining from data” or “Knowledge mining”
 “Extract knowledge from large data or databases”
 “Knowledge discovery from database (KDD)”
[Figure: Data Mining at the centre, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
What is Data Mining? Definition 2
 The process of automatically discovering useful information from large
data repositories

[Figure: Input Data → Data Preprocessing → Data Mining → Post Processing → Information]
Data Preprocessing: Feature Selection, Dimensionality Reduction, Normalization, Data Subsetting
Post Processing: Filtering Patterns, Visualization, Pattern Interpretation

Data Mining Architecture

[Figure: layers of the data mining architecture, from top to bottom]
Graphical User Interface
Pattern Evaluation
Data Mining Engine (connected to the Knowledge Base)
Database or Data Warehouse Server
Cleaning, Integration & Selection
Database, Data Warehouse, WWW, Other Info Repositories
KDD (Knowledge Discovery in Databases) Process
 Knowledge discovery in databases is an iterative sequence of the
following steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Pattern Evaluation
6. User Interface (Visualization of Pattern or Knowledge)

KDD (Knowledge Discovery in Databases) Process (Cont..)
[Figure: KDD process. Database → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Pattern Evaluation → Knowledge]

KDD (Knowledge Discovery in Databases) Process
(Cont..)
• Data Selection: Where data relevant to the analysis task are retrieved
from the database.
• Data Cleaning: To remove noise and inconsistent data.
• Data Integration: Where multiple data sources may be combined.
• Data Transformation: Where data are transformed or consolidated into
appropriate forms for mining by performing summary or aggregation
operations.
• Data Mining: An essential process where intelligent methods are applied
in order to extract data patterns.
• Pattern Evaluation: To identify the truly interesting patterns
representing knowledge based on some interestingness measures.
• Knowledge Presentation: Where visualization and knowledge
representation techniques are used to present the mined knowledge to
the user.
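The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records, the field names, the "average amount per city" mining step, and the threshold of 100 are all hypothetical and do not come from the slides. Data integration is skipped because there is only one source here.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing (noisy) value.
raw_records = [
    {"customer": "A", "city": "Rajkot", "amount": 120.0},
    {"customer": "B", "city": "Rajkot", "amount": None},
    {"customer": "C", "city": "Surat", "amount": 80.0},
    {"customer": "D", "city": "Rajkot", "amount": 100.0},
    {"customer": "E", "city": "Surat", "amount": 40.0},
]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: consolidate amounts per city."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["city"], []).append(r["amount"])
    return grouped

def mine(grouped):
    """Data mining (a simple stand-in): average amount per city."""
    return {city: mean(amounts) for city, amounts in grouped.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep only patterns that meet the threshold."""
    return {city: avg for city, avg in patterns.items() if avg >= threshold}

target_data = select(raw_records, ["city", "amount"])
knowledge = evaluate(mine(transform(clean(target_data))), threshold=100.0)

# Knowledge presentation: show the mined knowledge to the user.
print(knowledge)
```

Each function corresponds to one KDD step, so a step can be replaced (for example, a real mining algorithm in place of `mine`) without touching the others.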
Data Mining—On what kind of data?
 Relational Databases:
• A database system, also called a database management system (DBMS), consists of
a collection of interrelated data, known as a database (organized as tables in a
relational database), and a set of software programs to manage and access these data.
• E.g. : SQL Server, Oracle etc.
 Data Warehouses:
• A data warehouse is a repository of information collected from multiple sources.
• It is constructed after pre-processing of data. (Data cleaning, Data integration, Data
transformation, Data loading, and Periodic data refreshing etc.)
• E.g. : Stock Market, D-Mart, Big Bazaar etc.

Data Mining—On what kind of data? (Cont..)
 Transactional Databases:
• A transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (TID) and a list
of the items making up the transaction (such as items purchased in a store).
• E.g. : Online shopping on Flipkart, Amazon etc.
 Other Data/Databases
• Spatial data (maps or location-related data)
• Engineering design data (designs of buildings, offices, and other structures)
• Hypertext and multimedia data (including text, image, video, and audio data)
• The World Wide Web (WWW), a huge, widely distributed information repository made
available on the Internet

What Kinds of Patterns Can Be Mined?
 Data mining functionalities can be classified into two categories:
1. Descriptive
2. Predictive

 Descriptive
• This task presents the general properties of data stored in a database.
• The descriptive tasks are used to find out patterns in data.
• E.g.: Cluster, Trends, etc.

 Predictive
• These tasks predict the value of one attribute on the basis of values of other
attributes.
• E.g.: Predicting festival-season customer numbers or product sales at a store

What Kinds of Patterns Can Be Mined? (Cont..)
 Characterization and Discrimination:
• Class characterization focuses on summarizing the characteristics or properties of
a specific class or category within a data set.
• It is used to describe representative attributes, patterns, or behaviors associated
with a particular class.

• Data discrimination, also known as class discrimination or class comparison, focuses
on identifying significant differences between different classes or categories in a data
set.
• It is used to describe which attributes or features are most discriminatory in
distinguishing one class from another.

What Kinds of Patterns Can Be Mined? (Cont..)
 Mining Frequent Patterns:
• Frequent patterns are those patterns that occur frequently in data. The kinds of
frequent patterns are:

• Frequent Item Set
• It refers to a set of items that frequently appear together, for example, milk and
bread.

• Frequent Subsequence
• A sequence of patterns that occurs frequently, such as purchasing a laptop followed
by a digital camera and a memory card.

• Frequent Sub Structure
• A substructure can refer to different structural forms (e.g., graphs, trees, or
lattices) that may be combined with …
What Kinds of Patterns Can Be Mined? (Cont..)
 Association analysis:
• Association analysis is the process of uncovering relationships among data and
determining association rules.
• It is used to discover interesting relationships and associations among items or
events in large datasets.

• Example
• Suppose we have a transactional dataset from an electronics store, and we want to discover
associations between purchased items. Here is a simple example of an association rule
generated from the data:
• buys(X,“computer”) ⇒ buys(X,“software”) [support = 1%,confidence = 50%],
• where X is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a
50% chance that she will buy software as well.
• A 1% support means that 1% of all the transactions under analysis show that computer and
software are purchased together.
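Support and confidence can be computed directly from a set of transactions. The sketch below uses five hypothetical transactions (not the store data behind the 1% and 50% figures above) and two assumed helper functions, `support` and `confidence`.

```python
# Hypothetical transactions (TID -> set of items purchased).
transactions = {
    "T1": {"computer", "software"},
    "T2": {"computer"},
    "T3": {"printer"},
    "T4": {"computer", "software", "printer"},
    "T5": {"software"},
}

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here 2 of the 5 transactions contain both items, and 2 of the 3 transactions with a computer also contain software.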

What Kinds of Patterns Can Be Mined? (Cont..)
 Mining of correlations:
• It is a data mining technique that aims to identify the statistical relationships or
associations between variables in a dataset.
• It measures the strength and direction of the linear relationship between two or more
variables, showing whether they are positively related, negatively related, or unrelated.
• Example:
- Correlation between TV Advertising and Sales: +0.95 (approximate)
- Correlation between Radio Advertising and Sales: +0.85 (approximate)
- Correlation between Online Advertising and Sales: +0.90 (approximate)
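A correlation value like the ones above is usually the Pearson correlation coefficient. The sketch below computes it from scratch; the advertising and sales numbers are hypothetical and are not the data behind the figures quoted above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

tv_spend = [10, 20, 30, 40, 50]   # hypothetical advertising spend
sales = [12, 24, 33, 46, 55]      # hypothetical sales
r = pearson(tv_spend, sales)
print(round(r, 3))
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.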

What Kinds of Patterns Can Be Mined? (Cont..)
 Classification and Regression for Predictive Analysis :
• Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
• The model is derived based on the analysis of a set of training data (i.e., data
objects for which the class labels are known).
• How is the derived model presented?
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks

• Classification
• It predicts the class of objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data, i.e., data objects whose
class labels are known.
• Example: Consider a scenario where you receive a large volume of emails, and you want to
automatically classify them as spam or non-spam.
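A derived model presented as a classification (IF-THEN) rule can be sketched as follows. The rule and the word list are hypothetical and written by hand for illustration only; in practice the rule would be derived from labeled training emails.

```python
# Hypothetical spam indicator words (an assumption, not from the slides).
SPAM_WORDS = {"free", "winner", "prize"}

def classify(email_text):
    """IF the email contains at least two spam words THEN spam ELSE non-spam."""
    words = set(email_text.lower().split())
    if len(words & SPAM_WORDS) >= 2:
        return "spam"
    return "non-spam"

print(classify("Claim your FREE prize now"))
print(classify("Meeting notes for Monday"))
```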

What Kinds of Patterns Can Be Mined? (Cont..)
 Classification and Regression for Predictive Analysis (Cont..) :

• Prediction
• It is used to predict missing or unavailable numerical data values rather than class labels.
• Regression Analysis is generally used for prediction.
• Example: Let's consider a scenario where you want to predict the price of a house based
on its size (in square feet).
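The house price example can be sketched with a straight-line least-squares fit. The sizes and prices are hypothetical, and `fit_line` is an assumed helper written for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes = [1000, 1500, 2000, 2500]   # square feet (hypothetical)
prices = [50, 75, 100, 125]        # price in some currency unit (hypothetical)

slope, intercept = fit_line(sizes, prices)
predicted = slope * 1800 + intercept   # predicted price of an 1800 sq ft house
print(round(predicted, 2))
```

Unlike classification, the output here is a numeric value rather than a class label.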

• Cluster Analysis
• Clustering analyzes data objects without consulting class labels.
• In many cases, class-labeled data may simply not exist at the beginning.
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are rather dissimilar to objects in other clusters.

What Kinds of Patterns Can Be Mined? (Cont..)
 Classification and Regression for Predictive Analysis (Cont..) :
• Example: Imagine you have a customer database containing various attributes such as
age, income, and purchasing behavior. By applying clustering algorithms to this data, you
can identify distinct groups or segments of customers with similar characteristics and
behaviors.
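A minimal sketch of customer grouping is shown below. It assumes k-means as the clustering algorithm (the slides do not name one), a single attribute (income), and hand-picked starting centroids. All numbers are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Simple one-dimensional k-means with fixed starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

incomes = [20, 22, 25, 60, 62, 65]   # hypothetical customer incomes
centroids, clusters = kmeans_1d(incomes, centroids=[20, 65])
print(clusters)
```

No class labels are used. The two groups come only from how close the values are to each other.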

• Outlier Analysis
• A data set may contain objects that do not comply with the general behavior or model of
the data.
• These data objects are outliers.
• Many data mining methods discard outliers as noise or exceptions.
• Example: Consider a dataset that records the attendance of students in a class over a
semester. By examining the dataset, we notice that some data points are significantly
lower than the attendance values of the other students.
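One common way to flag such points is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a cut-off of 2 standard deviations and uses hypothetical attendance percentages. Neither the method nor the data comes from the slides.

```python
from statistics import mean, pstdev

# Hypothetical attendance percentages for eight students.
attendance = [92, 88, 95, 90, 85, 91, 35, 89]

center = mean(attendance)
spread = pstdev(attendance)   # population standard deviation

# A value is flagged when it is more than 2 standard deviations from the mean.
z_scores = [(a - center) / spread for a in attendance]
outliers = [a for a, z in zip(attendance, z_scores) if abs(z) > 2]
print(outliers)
```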

Are All Patterns Interesting?
 A data mining system has the potential to generate thousands or even
millions of patterns, or rules.
 Typically, the answer is no.
 Techniques for evaluating and selecting interesting patterns:

 Objective Measures of Interestingness:


 Objective measures quantify the quality or interestingness of patterns based on
statistical significance or measures derived from the data.
 These measures include support, confidence, lift, and various statistical tests.

 Subjective Measures of Interestingness:


 Subjective measures take into account the user's preferences, domain knowledge,
and specific application requirements.
 Users can specify interestingness thresholds or define constraints to filter and focus
on patterns that meet their criteria.
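Both kinds of measures can be combined when filtering rules. In the sketch below, lift is the objective measure (confidence of the rule divided by the support of its consequent) and the user-chosen thresholds play the subjective role. The rule statistics and thresholds are hypothetical.

```python
# Hypothetical rules: (rule, support, confidence, support of the consequent alone).
rules = [
    ("computer => software", 0.40, 0.67, 0.60),
    ("printer => software", 0.20, 0.50, 0.60),
    ("milk => bread", 0.05, 0.90, 0.30),
]

# User-specified interestingness thresholds (assumptions for this example).
MIN_SUPPORT = 0.10
MIN_LIFT = 1.0

def lift(confidence, consequent_support):
    """Lift > 1 means the consequent is more likely when the antecedent is present."""
    return confidence / consequent_support

interesting = [
    name
    for name, sup, conf, cons_sup in rules
    if sup >= MIN_SUPPORT and lift(conf, cons_sup) > MIN_LIFT
]
print(interesting)
```

The second rule is dropped because its lift is below 1, and the third because its support is below the user's minimum.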
Which Technologies Are Used?
 Data mining has incorporated many techniques from other domains such
as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, visualization, algorithms, high
performance computing, and many application domains
 Statistics:
 A statistical model is a mathematical representation or description of the relationship
between variables in a dataset.
 It consists of a set of mathematical functions or equations that define the behavior of
the objects.
 It provides methods and techniques for summarizing and understanding numerical
information and making predictions based on data.
 Machine Learning:
 It enables computers to learn and make predictions or decisions without being
explicitly programmed.
 It is concerned with creating systems that can automatically learn and improve from
experience or data.
 Supervised learning:
 Supervised learning is basically a synonym for classification. The supervision in the
learning comes from the …
Which Technologies Are Used? (Cont..)
 Machine Learning:
 Unsupervised learning
 Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled.
 Semi-supervised learning
 Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model.
 Active learning
 Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label

 Database Systems and Data Warehouses


 Many data mining tasks need to handle large data sets or even real-time, fast
streaming data. Therefore, data mining can make good use of scalable database
technologies to achieve high efficiency and scalability on large data sets.
 Recent database systems have built systematic data analysis capabilities on
database data using data warehousing and data mining facilities.
Which Technologies Are Used? (Cont..)
 Information Retrieval:
 Information retrieval (IR) is the science of searching for documents or information in
documents.
 Documents can be text or multimedia, and may reside on the Web.
 The differences between traditional information retrieval and database systems are
 the data under search are unstructured
 the queries are formed mainly by keywords, which do not have complex structures

Which Kinds of Applications Are Targeted?
 There are two highly successful and popular application examples of
data mining: business intelligence and search engines.
 Business Intelligence
 Business intelligence (BI) technologies provide historical, current, and predictive
views of business operations.
 Examples include reporting, online analytical processing, business performance
management, competitive intelligence, benchmarking, and predictive analytics.
 Without data mining, many businesses may not be able to perform effective market
analysis, compare customer feedback on similar products, discover the strengths
and weaknesses of their competitors, retain highly valuable customers, and make
smart business decisions.
 Web search engine
 A Web search engine is a specialized computer server that searches for information
on the Web.
 Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines
 Search engines pose grand challenges to data mining.
Which Kinds of Applications Are Targeted? (Cont..)
 Web search engine
 First, they have to handle a huge and ever-growing amount of data.
 Second, Web search engines often have to deal with online data. A search engine
may be able to afford constructing a model offline on huge data sets.
 Another challenge is maintaining and incrementally updating a model on fast-
growing data streams.
 Third, Web search engines often have to deal with queries that are asked only a very
small number of times.

Data Mining Issues
 Data mining issues can be classified into five categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability (Algorithms)
4. Diversity of Database Types
5. Data Mining and Society

1. Mining Methodology (Data Mining Issues)
 Mining various and new kinds of knowledge
• Data mining covers a wide spectrum of data analysis and knowledge discovery tasks.
These tasks may use the same database in different ways and require the
development of numerous data mining techniques.
 Mining knowledge in multidimensional space
• When searching for knowledge in large data sets, we can explore the data in
multidimensional space.
• That is, we can search for interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as (exploratory)
multidimensional data mining.
 Data mining—an interdisciplinary effort
• The power of data mining can be substantially enhanced by integrating new methods
from multiple disciplines.
• For example, to mine data with natural language text, it makes sense to fuse data
mining methods of information retrieval and natural language processing.
 Handling uncertainty, noise, or incompleteness of data
2. User Interaction (Data Mining Issues)
 Interactive mining
• The data mining process should be highly interactive. Thus, it is important to build
flexible user interfaces and an exploratory mining environment, facilitating the user’s
interaction with the system.
 Incorporation of background knowledge
• Background knowledge, constraints, rules, and other information regarding the
domain under study should be incorporated into the knowledge discovery process.
 Presentation and visualization of data mining results
• How can a system present data mining results vividly (as a clear picture in the mind)
and flexibly, so that the discovered knowledge can be easily understood and directly
used by humans?

3. Efficiency and Scalability (Data Mining Issues)
 Efficiency and scalability of data mining algorithms
• Data mining algorithms must be efficient and scalable in order to effectively extract
information from the huge amounts of data that lie in many data repositories or in dynamic
data streams.
• In other words, the running time of a data mining algorithm must be predictable,
short, and acceptable by applications.
• Efficiency, scalability, performance, optimization and the ability to execute in real
time are key criteria for new mining algorithms.
 Parallel, distributed, and incremental mining algorithms
• The giant size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate
the development of parallel and distributed data-intensive mining algorithms.

4. Diversity of Database Types (Data Mining Issues)
 Handling complex types of data
• The challenge is how to uncover knowledge from stream, time-series, sequence,
graph, social network, and multi-relational data.
• A database or dataset can contain many different types of attributes and many
different types of data.
 Mining dynamic, networked, and global data repositories
• Data from multiple sources are connected by the Internet and various kinds of
networks, such as distributed and heterogeneous global information systems.
• Discovering knowledge from different sources of structured, semi-structured, or
unstructured data is challenging.

5. Data Mining and Society (Data Mining Issues)
 Social impacts of data mining
• With data mining penetrating our everyday lives, it is important to study the
impact of data mining on society.
• How can we use data mining technology to benefit our society?
• How can we guard against its misuse?
 Privacy-preserving data mining
• Data mining will help in scientific discovery, business management, economy
recovery, and security protection (e.g., the real-time discovery of intruders and cyber
attacks).
• However, it poses the risk of disclosing an individual’s personal information.
 Invisible data mining
• We cannot expect everyone in society to learn and master data mining
techniques.
• For example, when purchasing items online, users may be unaware that the store is
likely collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.
What is an Attribute?
 The attribute can be defined as a field for storing the data that represents
the characteristics of a data object.
 It can also be viewed as a property, characteristic, feature, or column of a
data object.
 The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
 It represents the different features of an object (real world entity), for example:
👨 Person → Name, Age, Qualification, Birthdate etc.
💻 Computer → Brand, Model, Processor, RAM etc.
📚 Book → Book Name, Author, Price, ISBN etc.
 An attribute set defines an object.
 An object is also referred to as a record, instance, or entity.

Attribute Types
 Attribute types can be divided into mainly two categories.
1. Quantitative
   1. Discrete
   2. Continuous
2. Qualitative
   1. Nominal
   2. Ordinal
   3. Binary
      1. Symmetric
      2. Asymmetric

1. Quantitative Attribute (Attribute Types)
 Quantitative is an adjective that simply means something that can be
measured.
 It is a special attribute that is used to compare two values, i.e., it is used
to compare a user-defined value against an upper limit and a lower limit.
 Example
 We can count the number of sheep on a farm or measure the liters of milk produced
by a cow.
 Consider a query to find all patients with low or high blood glucose levels. In
database, for each patient a lower value and an upper value for blood glucose level
is stored in the Result class.
 To find patients with low/high level of blood glucose, without QA you would have to
specify a limit on the Low attribute or the High attribute of the Result class.
 While defining limit you can use Between, Equals, Less than, Less than or Equal to,
Greater than, Greater than or Equal as relational operators.

1. Quantitative Attribute Cont.. (Attribute Types)
 1) Discrete Attribute
 A discrete attribute has a finite or countably infinite set of values, which may or may
not be represented as integers.
 The attributes hair_color, smoker, medical_test, and drink_size each have a finite
number of values, and so are discrete.
 CustomerID in a table has countably infinite set of values because over a time period
it grows.

 2) Continuous Attribute
 A continuous attribute has real numbers as attribute values.
 The attributes temperature, height, and weight are examples of continuous
attributes.
 Practically, real values can only be measured and represented using a finite number
of digits.
 Continuous attributes are typically represented as floating-point variables.

2. Qualitative Attribute (Attribute Types)
 Qualitative data deals with characteristics and descriptors that can't be
easily measured, but can be observed subjectively—such as smells,
tastes, textures, attractiveness, and color.
 Simple arithmetic attributes that is named or described in words.
 It is represented in integer or real values.
 Results of qualitative attribute are often quoted on scales.
 Below are the qualitative Attributes.
 Nominal
 Ordinal
 Binary
 Symmetric
 Asymmetric

2. Qualitative Attribute Cont.. (Attribute Types)
1) Nominal Attribute
 Nominal attributes are named attributes which can be separated into discrete
(individual) categories which do not overlap.
 Nominal attribute values are also called distinct values.
 Example

2. Qualitative Attribute Cont.. (Attribute Types)
2) Ordinal Attribute
 For an ordinal attribute, the order of the values is important and significant, but
the differences between them are not really known.
 Example
 Rankings → 1st, 2nd, 3rd
 Ratings → star ratings (e.g., 2 star, 3 star, 5 star)
 We know that a 5 star is better than a 2 star or 3 star, but we cannot
quantify how much better it is.
3) Binary Attribute
 Binary attributes are categorical attributes with only two possible values: (yes or
no), (true or false), (0 or 1).
 A symmetric binary attribute is one in which both values are equally valuable
(male or female). The male value here is not more important than the female value.
 An asymmetric binary attribute is one in which the two states are not equally important.
For example, in a medical test (positive or negative), the positive result is more
significant than the negative one.
Extra (Attribute Types)
Interval Attribute
 An interval attribute comes in the form of a numerical value where the difference
between points is meaningful.
 Example
 Temperature → 10°-20°, 30°-50°, 35°-45°
 Calendar Dates → 15th – 22nd, 10th – 30th
 Interval attributes have no true (absolute) zero value.

Ratio Attribute
 A ratio attribute looks like an interval attribute, but it must have a true
(absolute) zero value.
 It tells us about the order and the exact value between units or data.
 Example
 Age Group → 10-20, 30-50, 35-45 (in years)
 Mass → 20-30 kg, 10-15 kg
 It does have a true (absolute) zero, so it is possible to compute ratios.
Mean (Average)
 Mean is the average of a dataset.
 The mean is the total of all the values, divided by the number of values.
 Formula to find mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$
 Example
 Find out mean for 12, 15, 11, 11, 7, 13 (Here total data is = 6)

First, find the sum of the data.
12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the total number of data.
69 / 6 = 11.5 (Mean)
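The same calculation as a short Python sketch, using the six values from the example (the variable names are chosen only for this illustration):

```python
data = [12, 15, 11, 11, 7, 13]

total = sum(data)          # first, find the sum of the data
mean = total / len(data)   # then divide by the total number of data
print(total, mean)
```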

Median {Centre Or Middle Value}
 The median is the middle number in a list of numbers ordered from lowest
to highest.
 If count is Odd then the middle number is the Median.
 Example
 Find out Median for 12, 15, 11, 11, 7, 13, 15 (Here total data is = 7 {odd})
First, arrange the data in ascending order.
7, 11, 11, 12, 13, 15, 15
Partition the data into two equal halves around the middle value.
7, 11, 11 | 12 | 13, 15, 15
Median = 12

Median {Centre Or Middle Value} (Cont..)
 If count is Even then the average (mean) of the middle two numbers is the Median.
 Example
 Find out Median for 12, 15, 11, 11, 7, 13 (Here total data is = 6 {even})

First, arrange the data in ascending order.
7, 11, 11, 12, 13, 15
Calculate the average (mean) of the two numbers in the middle.
(11 + 12)/2 = 11.5
Median = 11.5
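Both cases (odd and even count) can be handled by one small function. This is a sketch using the two example data sets from the Median slides. The function name `median` is chosen for this illustration.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 15, 11, 11, 7, 13, 15]))   # odd count
print(median([12, 15, 11, 11, 7, 13]))       # even count
```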
Mode
 The mode is the number that occurs most often within a set of numbers.
 Example

12, 15, 11, 11, 7, 13 → Mode: 11 (Unimodal)
12, 15, 11, 11, 7, 12, 13 → Mode: 11, 12 (Bimodal)
12, 12, 15, 11, 11, 7, 13, 7 → Mode: 7, 11, 12 (Trimodal)
12, 15, 11, 10, 7, 14, 13 → No Mode

 If more than three numbers are tied for the most occurrences within a set of
numbers, the set is called multimodal.
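A sketch that finds all modes of a data set is shown below. The four lists are the slide's example data sets as read from the slide layout (unimodal, bimodal, trimodal, and no mode). The function name `modes` and the choice to return an empty list for "No Mode" are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return every value with the highest count, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([12, 15, 11, 11, 7, 13]))          # unimodal
print(modes([12, 15, 11, 11, 7, 12, 13]))      # bimodal
print(modes([12, 12, 15, 11, 11, 7, 13, 7]))   # trimodal
print(modes([12, 15, 11, 10, 7, 14, 13]))      # no mode
```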
Range
 The range of a set of data is the difference between the largest and the
smallest number in the set.
 Example
 Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
First, arrange the data in ascending order.
26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55
 In our example the largest number is 55 and the smallest number is 26, so subtract 26 from 55.
55 – 26 = 29 (Range)
Standard Deviation (σ)
 The Standard Deviation is a measure of how numbers are spread out.
 Its symbol is σ (the Greek letter sigma)
 In statistics, the standard deviation is a measure of the amount of
variation or dispersion of a set of values.
 A low standard deviation indicates that the values tend to be close to the
mean of the set, while a high standard deviation indicates that the values
are spread out over a wider range.
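 Formula to find standard deviation: $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2}$, where $\bar{X}$ is the mean and $n$ is the number of values in the sample.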
Standard Deviation (σ) Cont..
 The standard deviation is the square root of the sample variance.
 The Variance is defined as:
 The average of the squared differences from the Mean.
 To calculate the variance follow these steps:
1. Calculate the mean, $\bar{X}$.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences and add them as a new column in the table.
4. Add up the squared differences and divide by n − 1, where n is the number of items in the sample; this is the sample variance. (For a full population, divide by n instead.)
5. To get the standard deviation, take the square root of the variance.
Standard Deviation (σ) Cont..
 The owner of the Indian restaurant is interested in how much people spend at the restaurant.
 He examines 8 randomly selected receipts for parties and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39
1. Find the mean (the mean is 49.5 for the given data).
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
X        X - Mean (Step 2)       (X - Mean)² (Step 3)
44       44 - 49.5 = -5.5        30.25
50       50 - 49.5 = 0.5         0.25
38       38 - 49.5 = -11.5       132.25
96       96 - 49.5 = 46.5        2162.25
42       42 - 49.5 = -7.5        56.25
47       47 - 49.5 = -2.5        6.25
40       40 - 49.5 = -9.5        90.25
39       39 - 49.5 = -10.5       110.25
Total                            2588
4. S² = 2588 / (8 − 1) = 369.71
5. σ = √369.71 = 19.23 ≈ 19
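The five steps can be traced in Python using the same eight receipts. This is a sketch that assumes the sample form of the variance (dividing by n − 1); the variable names are illustrative.

```python
import math

receipts = [44, 50, 38, 96, 42, 47, 40, 39]
n = len(receipts)

mean = sum(receipts) / n               # Step 1: mean
diffs = [x - mean for x in receipts]   # Step 2: subtract the mean from each value
squared = [d ** 2 for d in diffs]      # Step 3: square each difference
variance = sum(squared) / (n - 1)      # Step 4: sample variance (divide by n - 1)
std_dev = math.sqrt(variance)          # Step 5: square root of the variance

print(mean)                # 49.5
print(sum(squared))        # 2588.0
print(round(variance, 2))  # 369.71
print(round(std_dev, 2))   # 19.23
```

Python's `statistics.stdev` uses the same n − 1 divisor, while `statistics.pstdev` divides by n for a full population.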
Symmetric vs Skewed Data
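[Figure: three distribution curves. Symmetric: mean, median, and mode coincide. Positively skewed: mode, then median, then mean from left to right. Negatively skewed: mean, then median, then mode from left to right.]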
Quantiles
 Quantiles are statistical measures that divide a dataset into equal-sized groups, providing information about the distribution of the data.
 Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
 The 2-quantile is the data point dividing the lower and upper halves of the
data distribution.
 It corresponds to the median.
 The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution.
 They are more commonly referred to as quartiles.
 The 100-quantiles are more commonly referred to as percentiles.
Quantiles
[Figure: a distribution divided into four parts of 25% each by Q1 (25th percentile), Q2 (median), and Q3 (75th percentile).]
 First Quartile (Q1) or 25th Percentile: Q1 is the value below which 25% of
the data falls. This means that 25% of the data points in the dataset are
less than or equal to Q1.
 Second Quartile (Q2) or Median or 50th Percentile: Q2 is the value that
separates the dataset into two equal halves.
 Third Quartile (Q3) or 75th Percentile: Q3 is the value below which 75% of
the data falls. This means that 75% of the data points in the dataset are
less than or equal to Q3.
Quantiles
 The distance between the first and third quartiles is called the
interquartile range (IQR) and is defined as IQR = Q3 − Q1.
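The quartiles and the IQR can be computed with Python's standard `statistics.quantiles`. The sketch below reuses the twelve values from the Range example; the choice of `method="inclusive"` is an assumption, since textbooks and libraries use slightly different quartile conventions and may give slightly different Q1 and Q3.

```python
import statistics

data = [26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # Q2 equals the median of the data
print(iqr)
```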
Five-Number Summary
 The five-number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
 Boxplots are a popular way of visualizing a distribution.
Boxplots
 Boxplots provide a means of depicting groups of numbers through their
quartiles.
 Quartiles are the three points that divide a group into four equal parts.
 In a boxplot, the data is divided into 4 parts using 3 points (25th percentile, median, 75th percentile).
[Figure: boxplot drawn on an axis from -5 to 5. The box runs from Q1 (25th percentile) to Q3 (75th percentile), its length is the Interquartile Range (IQR), and the line inside the box marks the median Q2 (50th percentile). Whiskers extend to the "Minimum" (Q1 – 1.5 * IQR) and the "Maximum" (Q3 + 1.5 * IQR). Points beyond the whiskers are outliers.]
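The quantities a boxplot draws can be sketched in Python. The example reuses the restaurant receipts from the standard deviation example; the helper name `five_number_summary` and the `method="inclusive"` quartile convention are assumptions made for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


receipts = [44, 50, 38, 96, 42, 47, 40, 39]
minimum, q1, q2, q3, maximum = five_number_summary(receipts)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # whisker limit labelled "Minimum" on the boxplot
upper_fence = q3 + 1.5 * iqr  # whisker limit labelled "Maximum" on the boxplot

# Values outside the fences are drawn as individual outlier points.
outliers = [x for x in receipts if x < lower_fence or x > upper_fence]
print(minimum, q1, q2, q3, maximum)
print(outliers)  # [96]
```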
Data Matrix vs Dissimilarity Matrix
 Data Matrix:
 A data matrix, also known as a feature matrix or attribute matrix, is a structured
representation of a dataset where rows represent observations or data points, and
columns represent attributes or variables.
 Each cell in the matrix contains the value of a specific attribute for a particular data
point.
 A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix.

 Dissimilarity Matrix :
 A dissimilarity matrix, also known as a distance matrix, is a
square matrix that quantifies the dissimilarity or distance between pairs of data
points in a dataset.
 Each cell in the matrix represents the dissimilarity measure between two data points.
 The dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a
one-mode matrix.
Data Matrix vs Dissimilarity Matrix
Data Matrix
Student_id   Name   Gender
1            ABC    Male
2            XYZ    Female
3            PQR    Male

Dissimilarity Matrix
     A     B     C
A    0.0   3.2   5.3
B    3.2   0.0   4.3
C    5.3   4.3   0.0
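To show how a dissimilarity matrix is derived from a data matrix, here is a minimal Python sketch. The numeric data matrix below is hypothetical (it is not the student table above, and it does not reproduce the 3.2, 5.3, 4.3 values), and Euclidean distance is assumed as the dissimilarity measure.

```python
import math

# Hypothetical data matrix: 3 objects (rows) described by 2 numeric attributes (columns).
data_matrix = {
    "A": [1.0, 2.0],
    "B": [4.0, 6.0],
    "C": [1.0, 6.0],
}


def dissimilarity_matrix(rows):
    """Return a square table of pairwise Euclidean distances between objects."""
    names = list(rows)
    return {a: {b: math.dist(rows[a], rows[b]) for b in names} for a in names}


matrix = dissimilarity_matrix(data_matrix)
for name, row in matrix.items():
    print(name, [round(d, 1) for d in row.values()])
```

The result is square and symmetric, with zeros on the diagonal, because every object has zero dissimilarity to itself.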
Dissimilarity of Numeric Data
 The dissimilarity of numeric data, also known as the distance measure,
quantifies the difference between two objects described by numeric
values.
 There are several popular dissimilarity measures for numeric data, each
with its own characteristics and suitability for different scenarios.
 Here are some commonly used measures:
 Euclidean Distance:
 For two vectors x and y, each with n numeric values, the Euclidean distance is
calculated as the square root of the sum of squared differences of corresponding
values: $d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$.
 It is used in applications like clustering, k-nearest neighbors, and regression analysis.
 Example: vectors: x = [2, 4, 6, 8, 10] y = [3, 5, 7, 9, 11]
 sqrt((2-3)^2 + (4-5)^2 + (6-7)^2 + (8-9)^2 + (10-11)^2)
 sqrt(1 + 1 + 1 + 1 + 1)
 sqrt(5) ≈ 2.236
Dissimilarity of Numeric Data
 Manhattan Distance:
 Also known as the city block distance or L1 distance, it measures the sum of absolute
differences between corresponding values of two vectors.
 It is often preferred when a single large difference should not dominate the result, because the differences are not squared as they are in the Euclidean distance.
 Example: vectors: x = [2, 4, 6, 8, 10] y = [3, 5, 7, 9, 11]
 |2-3| + |4-5| + |6-7| + |8-9| + |10-11|
 1 + 1 + 1 + 1 + 1 = 5
 Minkowski Distance:
 It is a generalization of the Euclidean and Manhattan distances and is defined by the
parameter p. For p = 1, it is equivalent to the Manhattan distance, and for p = 2, it is
equivalent to the Euclidean distance.
 $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$
 Example: vectors: x = [2, 4, 6, 8, 10], y = [3, 5, 7, 9, 11], (p = 2)
 sqrt(|2-3|^2 + |4-5|^2 + |6-7|^2 + |8-9|^2 + |10-11|^2)
 sqrt(1 + 1 + 1 + 1 + 1)
 sqrt(5) ≈ 2.236
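Because the Minkowski distance generalizes the other two, a single Python function can reproduce the Manhattan (p = 1) and Euclidean (p = 2) results for the example vectors. This is a minimal sketch; the function name `minkowski` is chosen here for illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance: (sum of |x_i - y_i|^p) raised to the power 1/p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x = [2, 4, 6, 8, 10]
y = [3, 5, 7, 9, 11]

print(minkowski(x, y, 1))            # 5.0   (Manhattan distance)
print(round(minkowski(x, y, 2), 3))  # 2.236 (Euclidean distance)
```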
Dissimilarity of Numeric Data
 Supremum Distance:
 It measures the maximum absolute difference between corresponding elements of
two vectors.
 It is the limit of the Minkowski distance as p grows without bound, and is also called the Chebyshev or L∞ distance.
 Example: vectors: x = [22, 1, 42, 10] y = [20, 0, 36, 8]
 max(|22-20|, |1-0|, |42-36|, |10-8|)
 max(2, 1, 6, 2)
 6
Exercise
 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36,
8):
 (a) Compute the Euclidean distance between the two objects.
 (b) Compute the Manhattan distance between the two objects.
 (c) Compute the Minkowski distance between the two objects, using p = 3.
 (d) Compute the supremum distance between the two objects.
 So, the dissimilarity measures for the given objects are:
 (a) Euclidean Distance ≈ 6.708
 (b) Manhattan Distance = 11
 (c) Minkowski Distance (p = 3) ≈ 6.153
 (d) Supremum Distance = 6
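The four parts of the exercise can be worked through in Python. This is a sketch with illustrative variable names, using the Minkowski parameter 3 for part (c).

```python
import math

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

# Absolute differences of corresponding values: [2, 1, 6, 2]
diffs = [abs(a - b) for a, b in zip(x, y)]

euclidean = math.sqrt(sum(d ** 2 for d in diffs))    # (a) sqrt(4 + 1 + 36 + 4)
manhattan = sum(diffs)                               # (b) 2 + 1 + 6 + 2
minkowski_3 = sum(d ** 3 for d in diffs) ** (1 / 3)  # (c) cube root of 8 + 1 + 216 + 8
supremum = max(diffs)                                # (d) largest single difference

print(round(euclidean, 3))    # 6.708
print(manhattan)              # 11
print(round(minkowski_3, 3))  # 6.153
print(supremum)               # 6
```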
IMP Questions
1. Explain KDD Process with diagram.
2. Explain types of Attributes with example.
3. What Kinds of Patterns Can Be Mined?
4. Are All Patterns Interesting?
5. Explain Data Mining Issues.
6. Explain mean, median, mode, range and standard deviation with
example.
7. Explain the five-number summary with a Boxplot diagram.
8. Explain Data Matrix vs Dissimilarity Matrix.
9. Explain methods to find dissimilarity of Numeric Data with example.