
Machine Learning

Introduction
Shailaja K.P
 Introduction
 With Machine Learning, we can gain insight from a dataset.
 We are going to ask the computer to make some sense of the data; this is what we call learning.
 Machine Learning is actively being used today in many places.
 What is Machine Learning?
 Most of the time, the insight or knowledge we are trying to get from the data won't be obvious just from looking at it.
 For example, in detecting spam mail, looking at a single word will not help; we need to look at a set of words together, combined with the length of the mail and other factors, to decide whether the mail is spam or not.
 Machine Learning lies at the intersection of computer science, engineering, and statistics, and often appears in other disciplines as well.
 It can also be applied to many fields from politics to geosciences.
 It is a tool that can be applied to many problems
 Any field that needs to interpret and act on data uses machine learning
techniques.
 Machine learning uses statistics.
 So why do we need statistics?
 In engineering we apply science to solve the problem.
 We solve deterministic problems, where our solution solves the problem every time.
 For example, if we are writing software to control a vending machine, the solution works in every environment, regardless of the money entered or the button pressed.
 There are many problems where the solution is not deterministic.
 That is, we don’t know enough about the problem or don’t
have sufficient computing power to properly model the problem.
 For these we need statistics.
 For example, human motivation is a problem that is currently too difficult to model.
 Sensors and the data deluge
 We have a tremendous amount of human-created data from the World Wide
Web, but recently more nonhuman sources of data have been coming online.
 The technology behind the sensors isn’t new, but connecting them to the web
is new.
 It’s estimated that physical sensors will soon create 20 percent of nonvideo internet traffic.
 The following is an example of an abundance of free data, a worthy cause,
and the need to sort through the data.
 In 1989, the Loma Prieta earthquake struck northern California, killing 63
people, injuring 3,757, and leaving thousands homeless.
 A similarly sized earthquake struck Haiti in 2010, killing more than 230,000
people.
 Shortly after the Loma Prieta earthquake, a study was published using low-
frequency magnetic field measurements claiming to foretell the earthquake.
 A number of subsequent studies showed that the original study was flawed
for various reasons.
 Suppose we want to redo this study and keep searching for ways to predict
earthquakes so we can avoid the horrific consequences and have a better
understanding of our planet.
 What would be the best way to go about this study? We could
buy magnetometers with our own money and buy pieces of land to place
them on.
 We could ask the government to help us out and give us money and land on
which to place these magnetometers.
 Who’s going to make sure there’s no tampering with the magnetometers, and
how can we get readings from them? There exists another low-cost solution.
 Mobile phones or smartphones today ship with three-axis
magnetometers.
 The smartphones also come with operating systems where you can
execute your own programs; with a few lines of code you can get readings
from the magnetometers hundreds of times a second.
 Also, the phone already has its own communication system set up; if you
can convince people to install and run your program, you could record a
large amount of magnetometer data with very little investment.
 In addition to the magnetometers, smartphones carry a large number of
other sensors including three-axis accelerometers, temperature sensors,
and GPS receivers, all of which you could use to support your primary
measurements.
 Key Terminology
 Consider an example of building a bird classification system.
 This sort of system, often associated with machine learning, is called an expert system.
 By creating a computer program to recognize birds, we’ve replaced an
ornithologist with a computer.
 The ornithologist is a bird expert, so we’ve created an expert system.
 The table below shows some values for four parts of various birds that we decided to measure.
 We chose to measure weight, wingspan, whether it has webbed feet, and the
color of its back.
 The four things we’ve measured are called features; these are also called
attributes.
 Each of the rows in the table is an instance made up of features.
 The first two features in the table are numeric and can take on decimal values.
 The third feature (webbed feet) is binary: it can only be 1 or 0.
 The fourth feature (back color) is an enumeration over the color palette.
 One task in machine learning is classification; this is illustrated using the
table to find the information about an Ivory-billed Woodpecker.
 We want to identify this bird out of a bunch of other birds.
 We could set up a bird feeder and then hire an ornithologist (bird expert) to
watch it and identify an Ivory-billed Woodpecker.
 This would be expensive, and the person could only be in one place at a time.
 We could also automate this process: set up many bird feeders with cameras
and computers attached to them to identify the birds that come in.
 We could put a scale on the bird feeder to get the bird’s weight and write
some computer vision code to extract the bird’s wingspan, feet type, and
back color.
 For the moment, assume we have all that information.
 How do we then decide if a bird at our feeder is an Ivory-billed Woodpecker
or something else? This task is called classification, and there are many
machine learning algorithms that are good at classification.
 We’ve decided on a machine learning algorithm to use for classification.
 What we need to do next is train the algorithm, or allow it to learn.
 To train the algorithm, we feed it quality data known as a training set.
 A training set is the set of training examples we’ll use to train our machine
learning algorithms.
 In table our training set has six training examples.
 Each training example has four features and one target variable; this is
depicted in figure below.
 The target variable is what we are trying to predict with our machine learning
algorithms.
 In classification the target variable takes on a nominal value, and in the task
of regression its value could be continuous.
 In a training set the target variable is known.
 The machine learns by finding some relationship between the features and
the target variable.
 In the classification problem the target variables are called classes, and
there is assumed to be a finite number of classes.
 To test machine learning algorithms what’s usually done is to have a training
set of data and a separate data set, called a test set.
 Initially the program is fed the training examples; this is when the machine
learning takes place.
 Next, the test set is fed to the program.
 The target variable for each example from the test set isn’t given to the
program, and the program decides which class each example should belong
to.
 The known target variable, or class, of each test example is then compared to the predicted value, and we can get a sense of how accurate the algorithm is.
 In our bird classification example, assume we’ve tested the program and it
meets our desired level of accuracy.
 What the machine has learnt is called knowledge representation.
 Some algorithms have knowledge representation that’s more readable by
humans than others.
 The knowledge representation may be in the form of a set of rules; it may
be a probability distribution or an example from the training set.
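 As a small illustration of these terms, the Python sketch below represents a training set of bird measurements as feature vectors plus a target variable; the measurement values and species names are made up for illustration.

```python
# A minimal sketch of the terminology above; the measurements and species
# names are illustrative values, not real data.

# Each training example: four features (weight in grams, wingspan in cm,
# webbed feet as 0/1, back color) and one target variable (the species).
training_set = [
    ((1000.1, 125.0, 0, "brown"),  "Buteo jamaicensis"),
    ((3000.7, 200.0, 0, "gray"),   "Sagittarius serpentarius"),
    ((4100.0, 136.0, 1, "black"),  "Gavia immer"),
    ((3.0,    11.0,  0, "green"),  "Calothorax lucifer"),
    ((570.0,  75.0,  1, "black"),  "Campephilus principalis"),
    ((8.0,    57.0,  0, "orange"), "Calothorax lucifer"),
]

features = [example for example, label in training_set]  # the attribute vectors
targets  = [label for example, label in training_set]    # the target variable (classes)

# A separate test set holds examples whose known class is withheld from the
# algorithm during training and only used afterwards to measure accuracy.
test_set = [((450.0, 70.0, 1, "black"), "Campephilus principalis")]
```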
 Key tasks of machine learning
 One of the key jobs of machine learning is classification; it sets a framework that allows us to easily turn a machine learning algorithm into a solid working application.
 In classification, our job is to predict what class an instance of data should fall
into.
 Another task in machine learning is regression.
 Regression is the prediction of a numeric value.
 Classification and regression are examples of supervised learning.
 This set of problems is known as supervised because we’re telling the
algorithm what to predict.
 The opposite of supervised learning is a set of tasks known as unsupervised
learning.
 In unsupervised learning, there’s no label or target value given for the data.
 A task where we group similar items together is known as clustering.
 In unsupervised learning, we may also want to find statistical values that
describe the data.
 This is known as density estimation.
 Another task of unsupervised learning may be reducing the data from many
features to a small number so that we can properly visualize it in two or three
dimensions.
 How to choose the right algorithm
 With all the different algorithms given, how can you choose which one to use?
 First, you need to consider your goal. What are you trying to get out of this?
 What data do you have or can you collect? Those are the big questions.
 First consider the goal.
 If you’re trying to predict or forecast a target value, then you need to look into
supervised learning.
 If you’ve chosen supervised learning, what’s your target value? Is it a discrete
value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, then you want to look
into classification.
 If the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to 999, or -∞ to +∞, then you need to look into regression.
 If you’re not trying to predict a target value, then you need to look into
unsupervised learning.
 Are you trying to fit your data into some discrete groups? If so and that’s all
you need, you should look into clustering.
 Do you need to have some numerical estimate of how strong the fit is into
each group? If you answer yes, then you probably should look into a density
estimation algorithm.
 The second thing you need to consider is your data.
 You should spend some time getting to know your data, and the more you
know about it, the better you’ll be able to build a successful application.
 Things to know about your data are these: Are the features nominal or continuous? Are there missing values in the features?
 If there are missing values, why are there missing values? Are there outliers
in the data?
 All of these features about your data can help you narrow the algorithm
selection process.
 Even with the options narrowed, there’s no single answer to which algorithm is best or which will give you the best results.
 You’re going to have to try different algorithms and see how they perform.
 There are other machine learning techniques that you can use to improve
the performance of a machine learning algorithm.
 Steps in developing a machine learning
application
 Collect data. You could collect the samples by scraping a website and
extracting data, or you could get information from an RSS feed or an API. You
could have a device collect wind speed measurements and send them to you,
or blood glucose levels, or anything you can measure. The number of options
is endless. To save some time and effort, you could use publicly available
data.
 Prepare the input data. Once you have this data, you need to make sure it’s in a usable format. The benefit of having a standard format is that you can mix and match algorithms and data sources.
 Analyze the input data. This is looking at the data from the previous task.
This could be as simple as looking at the data you’ve parsed in a text editor
to make sure steps 1 and 2 are actually working and you don’t have a bunch
of empty values. You can also look at the data to see if you can recognize
any patterns or if there’s anything obvious, such as a few data points that
are vastly different from the rest of the set. Plotting data in one, two, or
three dimensions can also help. But most of the time you’ll have more than three features, and you can’t easily plot the data across all features at one time. You could, however, use some advanced methods that distill multiple dimensions down to two or three so you can visualize the data.
 Train the algorithm. This is where the machine learning takes place. This
step and the next step are where the “core” algorithms lie, depending on the
algorithm. You feed the algorithm good, clean data from the first two steps and extract knowledge or information.
 Test the algorithm. This is where the information learned in the previous
step is put to use. When you’re evaluating an algorithm, you’ll test it to see how well it does.
 Use it. Here you make a real program to do some task, and once again you
see if all the previous steps worked as you expected.
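 As a rough sketch of how these steps might look in code (assuming scikit-learn is installed; the bundled Iris data and the k-nearest-neighbors classifier are arbitrary illustrative choices):

```python
# A hedged sketch of the six steps using scikit-learn; the dataset and the
# classifier are illustrative choices, not the only options.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)         # 1. Collect data (here, a public dataset)
                                           # 2. Prepare the input data (already numeric)
print(X[:3], y[:3])                        # 3. Analyze the input data (quick sanity check)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                # 4. Train the algorithm on the training set

predictions = model.predict(X_test)        # 5. Test the algorithm on the held-out test set
print("accuracy:", accuracy_score(y_test, predictions))

# 6. Use it: the trained model can now be embedded in a real application and
#    called on new, unlabeled measurements.
```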
 Getting to know your Data
 Knowledge about your data is useful for data preprocessing, the first major
task of the data mining process.
 You will want to know the following:
 What are the types of attributes or fields that make up your data?
 What kind of values does each attribute have?
 Which attributes are discrete, and which are continuous-valued?
 What do the data look like?
 How are the values distributed?
 Are there ways we can visualize the data to get a better sense of it all?
 Can we spot any outliers?
 Can we measure the similarity of some data objects with respect to others?
 Gaining such insight into the data will help with the subsequent analysis
 Data Objects and Attribute Types
 Data sets are made up of data objects.
 A data object represents an entity—in a sales database, the objects may be
customers, store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students, professors,
and courses.
 Data objects are typically described by attributes.
 Data objects can also be referred to as samples, examples, instances, data
points, or objects.
 If the data objects are stored in a database, they are data tuples.
 That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
 What Is an Attribute?
 An attribute is a data field, representing a characteristic or feature of a data
object.
 The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
 The term dimension is commonly used in data warehousing.
 Machine learning literature tends to use the term feature, while statisticians
prefer the term variable.
 Data mining and database professionals commonly use the term attribute.
 Attributes describing a customer object can include, for example, customer
ID, name, and address.
 Observed values for a given attribute are known as observations.
 A set of attributes used to describe a given object is called an attribute vector
 The type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric—the attribute can have.

 Nominal Attributes
 Nominal data is a qualitative type of data used to classify and label variables.
 The values of a nominal attribute are symbols or names of things.
 Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
 The values do not have any meaningful order.
 In computer science, the values are also known as enumerations.
 Example of Nominal attributes.
 Suppose that hair color and marital status are two attributes describing
person objects.
 In our application, possible values for hair color are black, brown, blond, red,
gray, and white.
 The attribute marital status can take on the values single, married, divorced,
and widowed.
 Both hair color and marital status are nominal attributes.
 Another example of a nominal attribute is occupation, with the values
teacher, dentist, programmer, farmer, and so on
 Binary Attributes
 A binary attribute is a nominal attribute with only two categories or states: 0 or
1, where 0 typically means that the attribute is absent, and 1 means that it is
present.
 Binary attributes are referred to as Boolean if the two states correspond to true
and false.
 Example of binary attributes: suppose a patient undergoes a medical test that has two possible outcomes.
 The attribute medical test is binary, where a value of 1 means the result of the
test for the patient is positive, while 0 means the result is negative.
 A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome should
be coded as 0 or 1.

 One such example could be the attribute gender having the states male and
female.
 A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a medical
test.
 By convention, we code the most important outcome, which is usually the
rarest one, by 1 (e.g., diabetic) and the other by 0 (e.g., non-diabetic).
 Ordinal Attributes
 An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
 Example of Ordinal attributes. Suppose that drink size corresponds to the size
of drinks available at a fast-food restaurant.
 This ordinal attribute has three possible values: small, medium, and large.
 The values have a meaningful sequence (which corresponds to increasing
drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
 Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and
so on) and professional rank.
 Professional ranks can be enumerated in a sequential order: for example,
assistant, associate, and full for professors.
 Numeric Attributes
 A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.
 Interval-Scaled Attributes
 Interval-scaled attributes are measured on a scale of equal-size units.
 The values of interval-scaled attributes have order and can be positive, 0, or
negative.
 Thus, in addition to providing a ranking of values, such attributes allow us to
compare and quantify the difference between values.
 Example of Interval-scaled attributes.
 A temperature attribute is interval-scaled.
 The temperature in an AC room is 16 degrees Celsius, while the temperature outside the room is 32 degrees Celsius.
 We can say that the temperature outside is 16 degrees higher than inside the room.
 Calendar dates are another example.
 For instance, the years 2002 and 2010 are eight years apart.
 Ratio-Scaled Attributes
 A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
 That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
 In addition, the values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
 Examples of ratio-scaled attributes include age, money, and weight.
 If you are 50 years old and your son is 25 years old, you can claim you are twice his age.
 Discrete versus Continuous Attributes
 There are many ways to organize attribute types.
 Classification algorithms developed from the field of machine learning often
talk of attributes as being either discrete or continuous.
 Each type may be processed differently.
 A discrete attribute has a finite or countably infinite set of values, which may
or may not be represented as integers.
 The attributes hair color, smoker, medical test, and drink size each have a
finite number of values, and so are discrete.
 Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes, or the values 0 to 110 for the attribute age.
 If an attribute is not discrete, it is continuous.
 The terms numeric attribute and continuous attribute are often used
interchangeably in the literature
 In practice, real values are represented using a finite number of digits.
 Continuous attributes are typically represented as floating-point variables.
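 As a small sketch of how these attribute types might be declared in code (the column names and values below are illustrative, and pandas is assumed to be available):

```python
# Illustrative attribute types expressed as pandas dtypes; column names and
# values are made up.
import pandas as pd

df = pd.DataFrame({
    "hair_color":    pd.Categorical(["black", "blond", "red"]),          # nominal
    "smoker":        [1, 0, 1],                                          # binary (0/1)
    "drink_size":    pd.Categorical(["small", "large", "medium"],
                                    categories=["small", "medium", "large"],
                                    ordered=True),                       # ordinal
    "temperature_c": [16.0, 32.0, 21.5],                                 # interval-scaled numeric
    "age_years":     [25, 50, 33],                                       # ratio-scaled numeric
})

print(df.dtypes)  # category columns are discrete; int/float columns are numeric
```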
 Data Visualization
 Data visualization aims to communicate data clearly and effectively through
graphical representation.
 Data visualization has been used extensively in many applications—for
example, at work for reporting, managing business operations, and tracking
progress of tasks.
 More popularly, we can take advantage of visualization techniques to
discover data relationships that are otherwise not easily observable by
looking at the raw data.
 Nowadays, people also use data visualization to create fun and interesting graphics.
 Several representative approaches are considered: pixel-oriented techniques, geometric projection techniques, icon-based techniques, and hierarchical and graph-based techniques.
 Pixel-Oriented Visualization Techniques
 A simple way to visualize the value of a dimension is to use a pixel where the
color of the pixel reflects the dimension’s value.
 For a data set of m dimensions, pixel-oriented techniques create m windows
on the screen, one for each dimension.
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
 The colors of the pixels reflect the corresponding values.
 Inside a window, the data values are arranged in some global order shared
by all windows.
 The global order may be obtained by sorting all data records in a way that’s
meaningful for the task at hand.
 Example of Pixel-oriented visualization.
 All-Electronics maintains a customer information table, which consists of four
dimensions: income, credit limit, transaction volume, and age.
 Can we analyze the correlation between income and the other attributes by
visualization?
 We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows, as shown in
Figure below.
 Using pixel based visualization, we can easily observe the following: credit
limit increases as income increases; customers whose income is in the
middle range are more likely to purchase more from All-Electronics; there is
no clear correlation between income and age.
 Filling a window by laying out the data records in a linear way may not work
well for a wide window.
 Note that the windows do not have to be rectangular.
 For example, the circle segment technique uses windows in the shape of
segments of a circle, as illustrated in Figure below.

 This technique can ease the comparison of dimensions because the dimension windows are located side by side and form a circle.
 Geometric Projection Visualization Techniques
 A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space.
 For example, they do not show whether there is a dense area in a
multidimensional subspace.
 Geometric projection techniques help users find interesting projections of
multidimensional data sets.
 The central challenge the geometric projection techniques try to address is
how to visualize a high-dimensional space on a 2-D display.
 A scatter plot displays 2-D data points using Cartesian coordinates.
 A third dimension can be added using different colors or shapes to represent
different data points.
 Figure below shows an example, where X and Y are two spatial attributes
and the third dimension is represented by different shapes.

 Through this visualization, we can see that points of types “+” and “×” tend to be co-located (i.e., placed close together in the same region).
 A 3-D scatter plot uses three axes in a Cartesian coordinate system.
 For data sets with more than four dimensions, scatter plots are usually
ineffective.
 The scatter-plot matrix technique is a useful extension to the scatter plot.
 For an n dimensional data set, a scatter-plot matrix is an n×n grid of 2-D
scatter plots that provides a visualization of each dimension with every other
dimension.
 Figure below shows an example, which visualizes the Iris data set.
 The data set consists of 150 samples, 50 from each of three species of Iris flowers.
 There are five dimensions in the data set: length and width of sepal and
petal, and species.
 The scatter-plot matrix becomes less effective as the dimensionality
increases.
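 A scatter-plot matrix like the one described above can be produced with a few lines of code; the sketch below assumes pandas, matplotlib, and scikit-learn (whose bundled copy of the Iris data is used) are available.

```python
# A sketch of a scatter-plot matrix for the Iris data (150 samples, four
# numeric dimensions plus species); library availability is assumed.
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # sepal/petal length and width, plus the species target

# One 2-D scatter plot for every pair of dimensions, colored by species.
scatter_matrix(df.iloc[:, :4], c=df["target"], figsize=(8, 8), diagonal="hist")
plt.show()
```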
 Icon-Based Visualization Techniques
 Icon-based visualization techniques use small icons to represent
multidimensional data values.
 We look at two popular icon-based techniques: Chernoff faces and stick
figures.
 Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
 They display multidimensional data of up to 18 variables (or dimensions) as a
cartoon human face (Figure below).
 Chernoff faces help reveal trends in the data.
 Components of the face, such as the eyes, ears, mouth, and nose, represent
values of the dimensions by their shape, size, placement, and orientation.
 For example, dimensions can be mapped to the following facial
characteristics: eye size, eye spacing, nose length, nose width, mouth
curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye
eccentricity, and head eccentricity.
 Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial
characteristics at once.
 Viewing large tables of data can be tedious.
 By condensing the data, Chernoff faces make the data easier for users to
digest.
 In this way, they facilitate visualization of regularities and irregularities
present in the data, although their power in relating multiple relationships is
limited.
 Another limitation is that specific data values are not shown.
 Therefore, this mapping should be carefully chosen.
 The stick figure visualization technique maps multidimensional data to
five-piece stick figures, where each figure has four limbs and a body.
 Two dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs.
 Figure below shows census data, where age and income are mapped to the
display axes, and the remaining dimensions (gender, education, and so on)
are mapped to stick figures.
 Hierarchical Visualization Techniques
 The visualization techniques discussed so far focus on visualizing multiple
dimensions simultaneously.
 However, for a large data set of high dimensionality, it would be difficult to
visualize all dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions into subsets (i.e.,
subspaces).
 The subspaces are visualized in a hierarchical manner.
 “Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method.
 Given more dimensions, more levels of worlds can be used, which is why the method is called “worlds-within-worlds.”
 An example of hierarchical visualization methods, tree-maps display
hierarchical data as a set of nested rectangles.
 For example, Figure below shows a tree-map visualizing Google news stories.

 All news stories are organized into seven categories, each shown in a large
rectangle of a unique color.
 Within each category (i.e., each rectangle at the top level), the news stories
are further partitioned into smaller subcategories.
 Visualizing Complex Data and Relations
 In the early days, visualization techniques were mainly used for numeric data.
 Recently, more and more non-numeric data, such as text and social networks,
have become available.
 Visualizing and analyzing such data attracts a lot of interest.
 There are many new visualization techniques dedicated to these kinds of data.
 For example, many people on the Web tag various objects such as pictures, blog
entries, and product reviews.
 A tag cloud is a visualization of statistics of user-generated tags.
 Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
 The importance of a tag is indicated by font size or color.
 Figure below shows a tag cloud for visualizing the popular tags used in a Web
site.

 Tag clouds are often used in two ways.


 First, in a tag cloud for a single item, we can use the size of a tag to represent
the number of times that the tag is applied to this item by different users.
 Second, when visualizing the tag statistics on multiple items, we can use the
size of a tag to represent the number of items that the tag has been applied
to, that is, the popularity of the tag.
 In addition to complex data, complex relations among data entries also raise
challenges for visualization.
 For example, Figure below uses a disease influence graph to visualize the
correlations between diseases.
 The nodes in the graph are diseases, and the size of each node is
proportional to the prevalence of the corresponding disease.
 Two nodes are linked by an edge if the corresponding diseases have a strong
correlation.

 In summary, visualization provides effective tools to explore data.


 There are many existing tools and methods.
 Moreover, visualization can be used in data mining in various aspects.
 In addition to visualizing data, visualization can be used to represent the
data mining process, the patterns obtained from a mining method, and user
interaction with the data.
 Visual data mining is an important research and development direction.
 Measuring Data Similarity and Dissimilarity
 In data mining applications, such as clustering, outlier analysis, and nearest-
neighbor classification, we need ways to assess how alike or unalike objects
are in comparison to one another.
 For example, a store may want to search for clusters of customer objects,
resulting in groups of customers with similar characteristics (e.g., similar
income, area of residence, and age).
 Such information can then be used for marketing.
 A cluster is a collection of data objects such that the objects within a cluster
are similar to one another and dissimilar to the objects in other clusters.
 Outlier analysis also employs clustering-based techniques to identify potential
outliers as objects that are highly dissimilar to others.
 Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is assigned
a class label (relating to, say, a diagnosis) based on its similarity toward
other objects in the model.
 Similarity and dissimilarity measures are referred to as measures of proximity.
 Similarity and dissimilarity are related.
 A similarity measure for two objects, i and j, will typically return the
value 0 if the objects are unalike.
 The higher the similarity value, the greater the similarity between
objects. (Typically, a value of 1 indicates complete similarity, that is, the
objects are identical.)
 A dissimilarity measure works the opposite way.
 It returns a value of 0 if the objects are the same.
 The higher the dissimilarity value, the more dissimilar the two objects are.
 We present two data structures that are commonly used in these types of applications: the data matrix (used to store the data objects) and the dissimilarity matrix (used to store dissimilarity values for pairs of objects).
 Data Matrix versus Dissimilarity Matrix
 Suppose that we have n objects (e.g., persons, items, or courses) described
by p attributes (also called measurements or features, such as age, height,
weight, or gender).
 The objects are x1 =(x11,x12,...,x1p), x2 =(x21,x22,...,x2p), and so on, where
xij is the value for object xi of the jth attribute.
 Object xi is referred to as object i.
 The objects may be tuples in a relational database, and are also referred to as
data samples or feature vectors.
 Main memory-based clustering and nearest-neighbor algorithms typically
operate on either of the following two data structures:
 Data matrix(or object-by-attribute structure):
 This structure stores the n data objects in the form of a relational table, or n-
by-p matrix (n objects × p attributes).

x11  ...  x1f  ...  x1p
...  ...  ...  ...  ...
xi1  ...  xif  ...  xip
...  ...  ...  ...  ...
xn1  ...  xnf  ...  xnp      [2.8]

 Each row corresponds to an object.


 Dissimilarity matrix(or object-by-object structure):
 This structure stores a collection of proximities that are available for all pairs of n
objects.
 It is often represented by an n-by-n table:

0
d(2,1)   0
d(3,1)   d(3,2)   0
  :        :      :
d(n,1)   d(n,2)   ...   ...   0      [2.9]

 where d(i, j) is the measured dissimilarity or “difference” between objects i and j.


 In general, d(i, j) is a non-negative number that is close to 0 when objects i and j
are highly similar or “near” each other, and becomes larger the more they differ.
 Note that d(i, i)=0; that is, the difference between an object and itself is 0.
 Furthermore, d(i, j)=d(j, i).
 Measures of similarity can often be expressed as a function of measures of
dissimilarity.
 For example, for nominal data,
sim(i, j)=1−d(i, j), [2.10]
 where sim(i, j) is the similarity between objects i and j.
 A data matrix is made up of two entities or “things,” namely rows (for objects)
and columns (for attributes).
 Therefore, the data matrix is often called a two-mode matrix.
 The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
called a one-mode matrix.
 Many clustering and nearest-neighbor algorithms operate on a dissimilarity
matrix.
 Data in the form of a data matrix can be transformed into a dissimilarity matrix
before applying such algorithms.
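 As a sketch of that transformation (SciPy is assumed to be available, and Euclidean distance is chosen arbitrarily as the dissimilarity measure):

```python
# Turning an n-by-p data matrix into an n-by-n dissimilarity matrix;
# the data values are made up and the metric is an arbitrary choice.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data_matrix = np.array([   # n = 4 objects, p = 2 attributes
    [1.0, 2.0],
    [3.0, 5.0],
    [2.0, 0.0],
    [4.0, 4.0],
])

# pdist gives the condensed pairwise distances; squareform expands them into
# the symmetric n-by-n matrix with d(i, i) = 0 on the diagonal and d(i, j) = d(j, i).
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(dissimilarity_matrix)
```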
 Proximity Measures for Nominal Attributes
 A nominal attribute can take on two or more states.
 For example, map color is a nominal attribute that may have, say, five states: red,
yellow, green, pink, and blue.
 Let the number of states of a nominal attribute be M.
 The states can be denoted by letters, symbols, or a set of integers, such as 1, 2,...,
M.
 “How is dissimilarity computed between objects described by nominal attributes?”
 The dissimilarity between two objects i and j can be computed based on the ratio
of mismatches:
d(i, j) = (p − m) / p      [2.11]

 where m is the number of matches (i.e., the number of attributes for which i and j
are in the same state), and p is the total number of attributes describing the
objects.
 Example- Dissimilarity between nominal attributes.
 Suppose that we have the sample data of Table below, except that only the
object-identifier and the attribute test-1 are available, where test-1 is nominal.

 Let’s compute the dissimilarity matrix, that is, d(i, j) for each pair of objects.

 Since here we have one nominal attribute, test-1, we set p=1 in Eq. (2.11) so
that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
 Thus, we get the dissimilarity matrix
0
1   0
1   1   0
0   1   1   0
 From this, we see that all objects are dissimilar except objects 1 and 4 (i.e.,
d(4,1)=0).
 Alternatively, similarity can be computed as

sim(i, j) = 1 − d(i, j) = m / p      [2.12]
Example:
Nationality    Hair color
American       Red
German         Blonde
Kenyan         Black
Japanese       Brown
American       White
Indian         Black
Kenyan         Black
 Find the dissimilarity matrix.
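 A small sketch of Eq. (2.11) applied to the table above (p = 2 nominal attributes, nationality and hair color):

```python
# d(i, j) = (p - m) / p for the Nationality / Hair color table above.
objects = [
    ("American", "Red"),
    ("German",   "Blonde"),
    ("Kenyan",   "Black"),
    ("Japanese", "Brown"),
    ("American", "White"),
    ("Indian",   "Black"),
    ("Kenyan",   "Black"),
]

p = 2  # total number of attributes describing each object

def nominal_dissimilarity(a, b):
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p

# Print the lower-triangular dissimilarity matrix; d(i, i) = 0 on the diagonal.
for i in range(len(objects)):
    print([nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)])
```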
 Proximity Measures for Binary Attributes
 Let’s look at dissimilarity and similarity measures for objects described by
either symmetric or asymmetric binary attributes.
 A binary attribute has only one of two states: 0 and 1, where 0 means that
the attribute is absent, and 1 means that it is present.
 Given the attribute smoker describing a patient, for instance, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
 Treating binary attributes as if they are numeric can be misleading.
 Therefore, methods specific to binary data are necessary for computing
dissimilarity.
 “So, how can we compute the dissimilarity between two binary attributes?”
 One approach involves computing a dissimilarity matrix from the given binary
data.
 If all binary attributes are thought of as having the same weight, we have the 2×2 contingency table shown below:

                      object j
                      1         0         sum
object i    1         q         r         q + r
            0         s         t         s + t
            sum       q + s     r + t     p

 where q is the number of attributes that equal 1 for both objects i and j
 r is the number of attributes that equal 1 for object i but equal 0 for object j
 s is the number of attributes that equal 0 for object i but equal 1 for object j
 t is the number of attributes that equal 0 for both objects i and j.
 The total number of attributes is p, where p = q+r+s+t.
 For symmetric binary attributes, each state is equally valuable.
 Dissimilarity that is based on symmetric binary attributes is called
symmetric binary dissimilarity.
 If objects i and j are described by symmetric binary attributes, then the dissimilarity between i and j is
d(i, j) = (r + s) / (q + r + s + t)
 For asymmetric binary attributes, the two states are not equally important,
such as the positive (1) and negative (0) outcomes of a disease test.
 Given two asymmetric binary attributes, the agreement of two 1s (a positive
match) is then considered more significant than that of two 0s (a negative
match).
 Therefore, such binary attributes are often considered “monary” (having one
state).
 The dissimilarity based on these attributes is called asymmetric binary
dissimilarity, where the number of negative matches, t, is considered unimportant and is thus ignored in the following computation:
d(i, j) = (r + s) / (q + r + s)      [2.14]
 Complementarily, we can measure the difference between two binary
attributes based on the notion of similarity instead of dissimilarity.
 For example, the asymmetric binary similarity between the objects i and j can
be computed as
sim(i, j) = q / (q + r + s)      [2.15]
 The coefficient sim(i, j) of Eq. (2.15) is called the Jaccard coefficient and is
popularly referenced in the literature.
 Example- Dissimilarity between binary attributes.
 Suppose that a patient record table below contains the attributes name,
gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an
object identifier, gender is a symmetric attribute, and the remaining
attributes are asymmetric binary.

 For asymmetric attribute values, let the values Y (yes) and P (positive) be set
to 1, and the value N (no or negative) be set to 0.
 Suppose that the distance between objects (patients) is computed based only on
the asymmetric attributes.
 According to Eq.(2.14), the distance between each pair of the three patients—
Jack, Mary, and Jim—is

 These measurements suggest that Jim and Mary are unlikely to have a similar
disease because they have the highest dissimilarity value among the three pairs.
 Of the three patients, Jack and Mary are the most likely to have a similar disease.
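 The computation above can be sketched in a few lines of code; since the patient table is not reproduced here, the 0/1 attribute values below are assumed purely for illustration (they are consistent with Jim and Mary being the most dissimilar pair and Jack and Mary the most similar).

```python
# Asymmetric binary dissimilarity, d(i, j) = (r + s) / (q + r + s);
# the patients' attribute values are assumed for illustration only.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],   # fever, cough, test-1, test-2, test-3, test-4
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def asymmetric_binary_dissimilarity(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1/1 matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 in a, 1 in b
    return (r + s) / (q + r + s)                           # negative matches t are ignored

for a, b in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```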
 Dissimilarity of Numeric Data: Minkowski Distance
 Measures that are commonly used for computing the dissimilarity of objects
described by numeric attributes include the Euclidean, Manhattan, and
Minkowski distances.
 In some cases, the data are normalized before applying distance calculations.
 This involves transforming the data to fall within a smaller or common range,
such as[−1,1] or[0.0,1.0].
 Consider a height attribute, for example, which could be measured in either
meters or inches.
 In general, expressing an attribute in smaller units will lead to a larger range
for that attribute, and thus tend to give such attributes greater effect or
“weight.”
 Normalizing the data attempts to give all attributes an equal weight.
 It may or may not be useful in a particular application.
 The most popular distance measure is Euclidean distance (i.e., straight line or
“as the crow flies”).
 Let i=(xi1, xi2,..., xip) and j=(xj1, xj2,..., xjp) be two objects described by p
numeric attributes.
 The Euclidean distance between objects i and j is defined as
d(i, j) = √( (xi1 − xj1)² + (xi2 − xj2)² + ··· + (xip − xjp)² )      [2.16]
 Another well-known measure is the Manhattan (or city block) distance, named so because it is the distance in blocks between any two points in a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks).
 It is defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ··· + |xip − xjp|
 Both the Euclidean and the Manhattan distance satisfy the following
mathematical properties:
 Non-negativity: d(i, j)≥0: Distance is a non-negative number.
 Identity of indiscernibles: d(i, i)=0: The distance of an object to itself is 0.
 Symmetry: d(i, j)=d(j, i): Distance is a symmetric function.
 Triangle inequality: d(i, j)≤d(i, k)+d(k, j): Going directly from object i to
object j in space is no more than making a detour over any other object k.
 A measure that satisfies these conditions is known as metric.
 Example- Euclidean distance and Manhattan distance.
 Let x1 = (1, 2) and x2 = (3, 5) represent two objects as shown in Figure below.

 The Euclidean distance between the two is √(2² + 3²) = 3.61.


 The Manhattan distance between the two is 2+3=5.
 Minkowski distance is a generalization of the Euclidean and Manhattan
distances.
 It is defined as
d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ··· + |xip − xjp|^h )^(1/h)
 where h is a real number such that h≥1


 It represents the Manhattan distance when h=1 (i.e., L1 norm) and Euclidean
distance when h=2 (i.e., L2 norm).
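 A quick sketch of these distances in code for the example points x1 = (1, 2) and x2 = (3, 5):

```python
# Minkowski distance for x1 = (1, 2) and x2 = (3, 5); h = 1 gives Manhattan,
# h = 2 gives Euclidean.
def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print("Manhattan (h=1):", minkowski(x1, x2, 1))             # 2 + 3 = 5
print("Euclidean (h=2):", round(minkowski(x1, x2, 2), 2))   # sqrt(4 + 9) = 3.61
print("Minkowski (h=3):", round(minkowski(x1, x2, 3), 2))   # (2^3 + 3^3)^(1/3) ~ 3.27
```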
 Proximity Measures for Ordinal Attributes
 The values of an ordinal attribute have a meaningful order or ranking among them, yet the magnitude between successive values is unknown.
 An example includes the sequence small, medium, large for a size attribute.
 Ordinal attributes may also be obtained from the discretization of numeric
attributes by splitting the value range into a finite number of categories.
 These categories are organized into ranks.
 That is, the range of a numeric attribute can be mapped to an ordinal
attribute f having Mf states.
 For example, the range of the interval-scaled attribute temperature (in
Celsius) can be organized into the following states: −30 to −10, −10 to 10,
10 to 30, representing the categories cold temperature, moderate
temperature, and warm temperature, respectively.
 Let Mf represent the number of possible states that an ordinal attribute f can have.
 These ordered states define the ranking 1, ..., Mf.
 “How are ordinal attributes handled?”
 The treatment of ordinal attributes is quite similar to that of numeric
attributes when computing dissimilarity between objects.
 Suppose that f is an attribute from a set of ordinal attributes describing n
objects.
 The dissimilarity computation with respect to f involves the following steps:
1. The value of f for the ith object is xif , and f has Mf ordered states,
representing the ranking 1,..., Mf . Replace each xif by its corresponding
rank, rif ∈{1,..., Mf}.
2. Since each ordinal attribute can have a different number of states, it is
often necessary to map the range of each attribute onto [0.0, 1.0] so that
each attribute has equal weight. We perform such data normalization by
replacing the rank rif of the ith object in the fth attribute by
   zif = (rif − 1) / (Mf − 1)
3. Dissimilarity can then be computed using any of the distance measures described in the previous slides for numeric attributes, using zif to represent the f value for the ith object.
 Example- Dissimilarity between ordinal attributes.
 Suppose that we have the sample data shown in the table below, except that this time only the object-identifier and the ordinal attribute, test-2, are available.

 There are three states for test-2: fair, good, and excellent, that is, Mf =3.
 For step 1, if we replace each value for test-2 by its rank, the four objects are
assigned the ranks 3, 1, 2, and 3, respectively.
 Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and
rank 3 to 1.0.
 For step 3, we can use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity matrix:
0
1.0   0
0.5   0.5   0
0     1.0   0.5   0
 Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e.,
d(2,1)= 1.0 and d(4,2)=1.0).
 This makes intuitive sense since objects 1 and 4 are both excellent.
 Object 2 is fair, which is at the opposite end of the range of values for test-2.
 Similarity values for ordinal attributes can be interpreted from dissimilarity as
sim(i,j)=1−d(i,j).
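 The three steps can be sketched in code for test-2; the values excellent, fair, good, excellent for objects 1–4 follow from the ranks 3, 1, 2, 3 given above.

```python
# Ordinal dissimilarity for test-2: rank, normalize to [0.0, 1.0], then use a
# numeric distance (with a single attribute, Euclidean reduces to |zi - zj|).
values = ["excellent", "fair", "good", "excellent"]   # objects 1-4
ranks = {"fair": 1, "good": 2, "excellent": 3}        # step 1: replace values by ranks
Mf = 3

z = [(ranks[v] - 1) / (Mf - 1) for v in values]       # step 2: zif = (rif - 1) / (Mf - 1)

# Step 3: print the lower-triangular dissimilarity matrix on the normalized ranks.
for i in range(len(z)):
    print([round(abs(z[i] - z[j]), 1) for j in range(i + 1)])
```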
 Dissimilarity for Attributes of Mixed Types
 In many real databases, objects are described by a mixture of attribute types.
 In general, a database can contain all of these attribute types.
 “So, how can we compute the dissimilarity between objects of mixed attribute
types?”
 One approach is to group each type of attribute together, performing separate
data mining (e.g., clustering) analysis for each type.
 This is feasible if these analyses derive compatible results.
 However, in real applications, it is unlikely that a separate analysis per attribute
type will generate compatible results.
 A more preferable approach is to process all attribute types together, performing
a single analysis.
 One such technique combines the different attributes into a single dissimilarity
matrix, bringing all of the meaningful attributes onto a common scale of the
interval [0.0, 1.0].
 Suppose that the data set contains p attributes of mixed type.
 The dissimilarity d(i, j) between objects i and j is defined as
d(i, j) = ( Σf δij(f) · dij(f) ) / Σf δij(f)      [2.22]
where the indicator δij(f) = 0 if the measurement of attribute f is missing for object i or j (or if xif = xjf = 0 and attribute f is asymmetric binary), and δij(f) = 1 otherwise; dij(f) is the contribution of attribute f to the dissimilarity between i and j, computed according to its type.
 The steps are identical to what we have already seen for each of the
individual attribute types.
 The only difference is for numeric attributes, where we normalize so that the
values map to the interval [0.0, 1.0].
 Thus, the dissimilarity between objects can be computed even when the
attributes describing the objects are of different types.
 Example- Dissimilarity between attributes of mixed type.
 Let’s compute a dissimilarity matrix for the objects in Table below

 Now we will consider all of the attributes, which are of different types.
 In Example for nominal attributes and ordinal attributes, we worked out the
dissimilarity matrices for each of the individual attributes.
 The procedures we followed for test-1 (which is nominal) and test-2 (which is
ordinal) are the same as outlined earlier for processing attributes of mixed
types.
 Therefore, we can use the dissimilarity matrices obtained for test-1 and test-
2 later when we compute Eq. (2.22).
 Now compute the dissimilarity matrix for the third attribute, test-3 (which is numeric). That is, we must compute dij = |xi − xj| / (maxh xh − minh xh) for each pair of objects.
 Following the case for numeric attributes, we let maxh xh = 64 and minh xh = 22.
 The difference between the two is used in Eq. (2.22) to normalize the values
of the dissimilarity matrix.
 Find all the dij values for attribute test-3 (e.g. d12=(45-22)/(64-22)=0.55)
 The resulting dissimilarity matrix for test-3 is
 We can now use the dissimilarity matrices for the three attributes in our
computation of Eq.(2.22).
 The dissimilarity matrices obtained for test-1, test-2, and test-3 are combined, attribute by attribute, using Eq. (2.22).
 In the resulting dissimilarity matrix, objects 1 and 4 are the most similar, based on their values for test-1 and test-2.
 This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest
value for any pair of different objects.
 Similarly, the matrix indicates that objects 1 and 2 are the least similar.
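 A hedged sketch of Eq. (2.22) for this example is shown below: each attribute contributes a dissimilarity on the [0.0, 1.0] scale and the contributions are averaged (all indicator weights equal 1, since no values are missing). The test-3 values used are assumptions chosen to be consistent with the figures quoted in the text (max = 64, min = 22, d(2,1) = 0.55).

```python
# Combining per-attribute dissimilarities with Eq. (2.22); the attribute values
# below are illustrative assumptions consistent with the worked example.
test1 = ["A", "B", "C", "A"]      # nominal codes for objects 1-4 (illustrative)
test2_rank = [3, 1, 2, 3]         # ordinal ranks for test-2 (Mf = 3)
test3 = [45, 22, 64, 28]          # numeric test-3 values (assumed)

def d_nominal(i, j):
    return 0.0 if test1[i] == test1[j] else 1.0

def d_ordinal(i, j, Mf=3):
    z = lambda r: (r - 1) / (Mf - 1)
    return abs(z(test2_rank[i]) - z(test2_rank[j]))

def d_numeric(i, j):
    return abs(test3[i] - test3[j]) / (max(test3) - min(test3))

def d_mixed(i, j):
    parts = [d_nominal(i, j), d_ordinal(i, j), d_numeric(i, j)]
    return sum(parts) / len(parts)   # Eq. (2.22) with all indicator weights = 1

print("d(4,1) =", round(d_mixed(3, 0), 2))   # expected to be the smallest (most similar pair)
print("d(2,1) =", round(d_mixed(1, 0), 2))   # expected to be the largest (least similar pair)
```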
 Cosine Similarity
 A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as a keyword) or phrase in the document.
 Thus, each document is an object represented by what is called a term-
frequency vector.
 For example, in Table below, we see that Document1 contains five instances of
the word team, while hockey occurs three times.

 The word coach is absent from the entire document, as indicated by a count
value of 0.
 Such data can be highly asymmetric.
 Term-frequency vectors are typically very long and sparse(i.e., they have
many 0 values).
 Applications using such structures include information retrieval, text
document clustering, biological taxonomy, and gene feature mapping.
 The traditional distance measures that we have studied in this chapter do
not work well for such sparse numeric data.
 For example, two term-frequency vectors may have many 0 values in
common, meaning that the corresponding documents do not share many
words, but this does not make them similar.
 We need a measure that will focus on the words that the two documents do
have in common, and the occurrence frequency of such words.
 In other words, we need a measure for numeric data that ignores zero-
matches.
 Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given
vector of query words.
 Let x and y be two vectors for comparison.
 Using the cosine measure as a similarity function, we have
sim(x, y) = (x · y) / (||x|| ||y||)      [2.23]
 where ||x|| is the Euclidean norm of vector x = (x1, x2, ..., xp), defined as √(x1² + x2² + ··· + xp²).
 Conceptually, it is the length of the vector.
 Similarly, ||y|| is the Euclidean norm of vector y.
 The measure computes the cosine of the angle between vectors x and y.
 A cosine value of 0 means that the two vectors are at 90 degrees to each
other (orthogonal) and have no match.
 The closer the cosine value to 1, the smaller the angle and the greater the
match between vectors.
 Example- Cosine similarity between two term-frequency vectors.
 Suppose that x and y are the first two term-frequency vectors in Table 2.5.
 That is, x=(5,0,3,0,2,0,0,2,0,0) and y=(3,0,2,0,1,1,0,1,0,1).
 How similar are x and y?
 Using Eq. (2.23) to compute the cosine similarity between the two vectors,
we get:
x · y = 5×3 + 3×2 + 2×1 + 2×1 = 25
||x|| = √(5² + 3² + 2² + 2²) = √42 ≈ 6.48
||y|| = √(3² + 2² + 1² + 1² + 1² + 1²) = √17 ≈ 4.12
sim(x, y) = 25 / (6.48 × 4.12) ≈ 0.94
 Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.
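 The same computation in code (Eq. 2.23 applied to the two term-frequency vectors given above):

```python
# Cosine similarity, sim(x, y) = (x . y) / (||x|| ||y||), for the example vectors.
import math

x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))       # x . y = 25
norm_x = math.sqrt(sum(a * a for a in x))    # ||x|| = sqrt(42) ~ 6.48
norm_y = math.sqrt(sum(b * b for b in y))    # ||y|| = sqrt(17) ~ 4.12

print(round(dot / (norm_x * norm_y), 2))     # ~ 0.94
```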
 How to calculate the dissimilarity measures for the following nominal attributes?
 Calculate the dissimilarity measure for the ordinal attributes
 Suppose we have a table with five products, each assigned one of three
priorities: Urgent (assigned the ordinal value of 3), High
Priority (assigned the ordinal value of 2), and Low Priority (assigned the
ordinal value of 1). This table also includes their values in numeric
forms. The table is as follows:
Object Identifier   Test I (Nominal)   Test II (Ordinal)   Test III (Numeric)
1                   Product A          Low Priority        45
2                   Product B          Urgent              93
3                   Product B          High Priority       65
4                   Product C          High Priority       74
5                   Product A          Low Priority        23
