Chapter 1: ML
Shailaja K.P
Introduction
With machine learning, we can gain insight from a dataset.
We ask the computer to make some sense of the data; this is what we call learning.
Machine Learning is actively being used today in many places.
What is Machine Learning?
Most of the time, the insight or knowledge we are trying to get from the data won't be obvious just from looking at the data.
For example, in detecting spam mail, looking at a single word will not help; we need to look at the set of words together, combined with the length of the mail and other factors, to say whether the mail is spam or not.
Machine learning lies at the intersection of computer science, engineering, and statistics, and it often appears in other disciplines as well.
It can be applied to many fields, from politics to geosciences.
It is a tool that can be applied to many problems.
Any field that needs to interpret and act on data can use machine learning techniques.
Machine learning uses statistics.
So why do we need statistics?
In engineering, we apply science to solve problems.
Often we solve deterministic problems, where our solution solves the problem all the time.
For example, if we are writing software to control a vending machine, the solution works in every environment, regardless of the money entered or the button pressed.
There are many problems where the solution is not deterministic.
That is, we don’t know enough about the problem or don’t
have sufficient computing power to properly model the problem.
For these problems, we need statistics.
For example, human motivation is a problem that is currently too difficult to model.
Sensors and the data deluge
We have a tremendous amount of human-created data from the World Wide
Web, but recently more nonhuman sources of data have been coming online.
The technology behind the sensors isn’t new, but connecting them to the web
is new.
It's estimated that physical sensors will soon create 20 percent of non-video internet traffic.
The following is an example of an abundance of free data, a worthy cause,
and the need to sort through the data.
In 1989, the Loma Prieta earthquake struck northern California, killing 63
people, injuring 3,757, and leaving thousands homeless.
A similarly sized earthquake struck Haiti in 2010, killing more than 230,000
people.
Shortly after the Loma Prieta earthquake, a study was published using low-
frequency magnetic field measurements claiming to foretell the earthquake.
A number of subsequent studies showed that the original study was flawed
for various reasons.
Suppose we want to redo this study and keep searching for ways to predict
earthquakes so we can avoid the horrific consequences and have a better
understanding of our planet.
What would be the best way to go about this study? We could
buy magnetometers with our own money and buy pieces of land to place
them on.
We could ask the government to help us out and give us money and land on
which to place these magnetometers.
Who’s going to make sure there’s no tampering with the magnetometers, and
how can we get readings from them? There exists another low-cost solution.
Mobile phones or smartphones today ship with three-axis
magnetometers.
The smartphones also come with operating systems where you can
execute your own programs; with a few lines of code you can get readings
from the magnetometers hundreds of times a second.
Also, the phone already has its own communication system set up; if you
can convince people to install and run your program, you could record a
large amount of magnetometer data with very little investment.
In addition to the magnetometers, smartphones carry a large number of
other sensors including three-axis accelerometers, temperature sensors,
and GPS receivers, all of which you could use to support your primary
measurements.
Key Terminology
Consider an example of building a bird classification system.
This sort of system, often associated with machine learning, is called an expert system.
By creating a computer program to recognize birds, we’ve replaced an
ornithologist with a computer.
The ornithologist is a bird expert, so we’ve created an expert system.
The table below shows values for four characteristics of various birds that we decided to measure.
We chose to measure weight, wingspan, whether it has webbed feet, and the
color of its back.
The four things we’ve measured are called features; these are also called
attributes.
Each row in the table is an instance made up of features.
The first two features in the table are numeric and can take on decimal values.
The third feature (webbed feet) is binary: it can only be 1 or 0.
The fourth feature (back color) is an enumeration over the color palette.
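A minimal sketch, assuming pandas, of how such a feature table could be represented in code; the values and species below are hypothetical stand-ins, not the actual table.

```python
# Hypothetical bird feature table: each row is an instance, each column a feature,
# plus the species we would like to predict.
import pandas as pd

birds = pd.DataFrame({
    "weight_g":    [1000.1, 3000.7, 3300.0, 4100.0, 3.0, 570.0],            # numeric
    "wingspan_cm": [125.0, 200.0, 220.3, 136.0, 11.0, 75.0],                 # numeric
    "webbed_feet": [0, 0, 0, 1, 0, 0],                                       # binary: 1 = yes, 0 = no
    "back_color":  ["brown", "gray", "gray", "black", "green", "black"],     # enumeration
    "species":     ["hawk", "secretary bird", "secretary bird",
                    "common loon", "hummingbird", "ivory-billed woodpecker"],
})
print(birds)
```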
One task in machine learning is classification; we can illustrate it using the table and the problem of identifying an Ivory-billed Woodpecker.
We want to identify this bird out of a bunch of other birds.
We could set up a bird feeder and then hire an ornithologist (bird expert) to
watch it and identify an Ivory-billed Woodpecker.
This would be expensive, and the person could only be in one place at a time.
We could also automate this process: set up many bird feeders with cameras
and computers attached to them to identify the birds that come in.
We could put a scale on the bird feeder to get the bird’s weight and write
some computer vision code to extract the bird’s wingspan, feet type, and
back color.
For the moment, assume we have all that information.
How do we then decide if a bird at our feeder is an Ivory-billed Woodpecker
or something else? This task is called classification, and there are many
machine learning algorithms that are good at classification.
We’ve decided on a machine learning algorithm to use for classification.
What we need to do next is train the algorithm, or allow it to learn.
To train the algorithm, we feed it quality data known as a training set.
A training set is the set of training examples we’ll use to train our machine
learning algorithms.
In the table, our training set has six training examples.
Each training example has four features and one target variable; this is
depicted in figure below.
The target variable is what we are trying to predict with our machine learning
algorithms.
In classification the target variable takes on a nominal value, and in the task
of regression its value could be continuous.
In a training set the target variable is known.
The machine learns by finding some relationship between the features and
the target variable.
In the classification problem the target variables are called classes, and
there is assumed to be a finite number of classes.
To test machine learning algorithms what’s usually done is to have a training
set of data and a separate data set, called a test set.
Initially the program is fed the training examples; this is when the machine
learning takes place.
Next, the test set is fed to the program.
The target variable for each example from the test set isn’t given to the
program, and the program decides which class each example should belong
to.
The target variable or class that the training example belongs to is then
compared to the predicted value, and we can get a sense for how accurate
the algorithm is.
In our bird classification example, assume we’ve tested the program and it
meets our desired level of accuracy.
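A minimal sketch of this train/test workflow, assuming scikit-learn; the tiny bird table is hypothetical and only illustrates the mechanics of training, predicting on a held-out test set, and measuring accuracy.

```python
# Train a classifier on hypothetical bird data and measure accuracy on a test set.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "weight_g":    [1000.1, 3000.7, 3300.0, 4100.0, 3.0, 570.0, 560.0, 3900.0],
    "wingspan_cm": [125.0, 200.0, 220.3, 136.0, 11.0, 75.0, 77.0, 135.0],
    "webbed_feet": [0, 0, 0, 1, 0, 0, 0, 1],
    "back_color":  ["brown", "gray", "gray", "black", "green", "black", "black", "black"],
    "species":     ["hawk", "secretary bird", "secretary bird", "loon",
                    "hummingbird", "woodpecker", "woodpecker", "loon"],
})
X, y = data.drop(columns="species"), data["species"]          # features and target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# back_color is nominal, so one-hot encode it; numeric features pass through unchanged.
prep = ColumnTransformer([("color", OneHotEncoder(handle_unknown="ignore"), ["back_color"])],
                         remainder="passthrough")
model = make_pipeline(prep, DecisionTreeClassifier(random_state=0))

model.fit(X_train, y_train)                                   # training: the learning happens here
pred = model.predict(X_test)                                  # classes predicted for the test set
print("accuracy:", accuracy_score(y_test, pred))
```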
What the machine has learnt is called knowledge representation.
Some algorithms have knowledge representation that’s more readable by
humans than others.
The knowledge representation may be in the form of a set of rules; it may
be a probability distribution or an example from the training set.
Key tasks of machine learning
One of the key tasks of machine learning is classification, and understanding such tasks sets a framework that allows us to easily turn a machine learning algorithm into a solid working application.
In classification, our job is to predict what class an instance of data should fall into.
Another task in machine learning is regression.
Regression is the prediction of a numeric value.
Classification and regression are examples of supervised learning.
This set of problems is known as supervised because we’re telling the
algorithm what to predict.
The opposite of supervised learning is a set of tasks known as unsupervised
learning.
In unsupervised learning, there’s no label or target value given for the data.
A task where we group similar items together is known as clustering.
In unsupervised learning, we may also want to find statistical values that
describe the data.
This is known as density estimation.
Another task of unsupervised learning may be reducing the data from many
features to a small number so that we can properly visualize it in two or three
dimensions.
How to choose the right algorithm
With all the different algorithms given, how can you choose which one to use?
First, you need to consider your goal. What are you trying to get out of this?
What data do you have or can you collect? Those are the big questions.
First consider the goal.
If you’re trying to predict or forecast a target value, then you need to look into
supervised learning.
If you’ve chosen supervised learning, what’s your target value? Is it a discrete
value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, then you want to look
into classification.
If the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to 999, or +∞ to -∞, then you need to look into regression.
If you’re not trying to predict a target value, then you need to look into
unsupervised learning.
Are you trying to fit your data into some discrete groups? If so and that’s all
you need, you should look into clustering.
Do you need to have some numerical estimate of how strong the fit is into
each group? If you answer yes, then you probably should look into a density
estimation algorithm.
The second thing you need to consider is your data.
You should spend some time getting to know your data, and the more you
know about it, the better you’ll be able to build a successful application.
Things to know about your data are these: Are the features nominal or continuous? Are there missing values in the features?
If there are missing values, why are they missing? Are there outliers in the data?
All of these features about your data can help you narrow the algorithm
selection process.
Even with the choices narrowed, there's no single answer to which algorithm is best or which will give you the best results.
You’re going to have to try different algorithms and see how they perform.
There are other machine learning techniques that you can use to improve
the performance of a machine learning algorithm.
Steps in developing a machine learning application
Collect data. You could collect the samples by scraping a website and
extracting data, or you could get information from an RSS feed or an API. You
could have a device collect wind speed measurements and send them to you,
or blood glucose levels, or anything you can measure. The number of options
is endless. To save some time and effort, you could use publicly available
data.
Prepare the input data. Once you have this data, you need to make sure it's in a usable format. The benefit of having a standard format is that you can mix and match algorithms and data sources.
Analyze the input data. This is looking at the data from the previous task.
This could be as simple as looking at the data you’ve parsed in a text editor
to make sure steps 1 and 2 are actually working and you don’t have a bunch
of empty values. You can also look at the data to see if you can recognize
any patterns or if there’s anything obvious, such as a few data points that
are vastly different from the rest of the set. Plotting data in one, two, or
three dimensions can also help. But most of the time you’ll have more than
three features, and you can’t easily plot the data across all features at one
time. You could, however, use some advanced methods that distill multiple dimensions down to two or three so you can visualize the data.
Train the algorithm. This is where the machine learning takes place. This
step and the next step are where the “core” algorithms lie, depending on the
algorithm. You feed the algorithm good, clean data from the first two steps and extract knowledge or information.
Test the algorithm. This is where the information learned in the previous
step is put to use. When you're evaluating an algorithm, you'll test it to see how well it does.
Use it. Here you make a real program to do some task, and once again you
see if all the previous steps worked as you expected.
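A minimal sketch, assuming scikit-learn, that maps these steps onto code using the publicly available Iris data set in place of data you collected yourself.

```python
from sklearn.datasets import load_iris                 # 1. Collect data (publicly available)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # 2. Prepare the input data (numeric arrays)
print(X.shape, X[:3])                                  # 3. Analyze the input data: sanity-check it

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                            # 4. Train the algorithm
print("test accuracy:", model.score(X_test, y_test))   # 5. Test the algorithm

new_flower = [[5.1, 3.5, 1.4, 0.2]]                    # 6. Use it: classify a new measurement
print("predicted class:", model.predict(new_flower))
```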
Getting to know your Data
Knowledge about your data is useful for data preprocessing, the first major
task of the data mining process.
You will want to know the following:
What are the types of attributes or fields that make up your data?
What kind of values does each attribute have?
Which attributes are discrete, and which are continuous-valued?
What do the data look like?
How are the values distributed?
Are there ways we can visualize the data to get a better sense of it all?
Can we spot any outliers?
Can we measure the similarity of some data objects with respect to others?
Gaining such insight into the data will help with the subsequent analysis.
Data Objects and Attribute Types
Data sets are made up of data objects.
A data object represents an entity—in a sales database, the objects may be
customers, store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students, professors,
and courses.
Data objects are typically described by attributes.
Data objects can also be referred to as samples, examples, instances, data
points, or objects.
If the data objects are stored in a database, they are data tuples.
That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data
object.
The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
The term dimension is commonly used in data warehousing.
Machine learning literature tends to use the term feature, while statisticians
prefer the term variable.
Data mining and database professionals commonly use the term attribute.
Attributes describing a customer object can include, for example, customer
ID, name, and address.
Observed values for a given attribute are known as observations.
A set of attributes used to describe a given object is called an attribute vector (or feature vector).
The type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric—the attribute can have.
Nominal Attributes
Nominal data is a qualitative type of data used to classify and label variables.
The values of a nominal attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
The values do not have any meaningful order.
In computer science, the values are also known as enumerations.
Example of Nominal attributes.
Suppose that hair color and marital status are two attributes describing
person objects.
In our application, possible values for hair color are black, brown, blond, red,
gray, and white.
The attribute marital status can take on the values single, married, divorced,
and widowed.
Both hair color and marital status are nominal attributes.
Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or
1, where 0 typically means that the attribute is absent, and 1 means that it is
present.
Binary attributes are referred to as Boolean if the two states correspond to true
and false.
Example of binary attributes. Suppose a patient undergoes a medical test that has two possible outcomes.
The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome should
be coded as 0 or 1.
One such example could be the attribute gender having the states male and
female.
A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a medical
test.
By convention, we code the most important outcome, which is usually the
rarest one, by 1 (e.g., diabetic) and the other by 0 (e.g., non-diabetic).
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Example of Ordinal attributes. Suppose that drink size corresponds to the size
of drinks available at a fast-food restaurant.
This ordinal attribute has three possible values: small, medium, and large.
The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and
so on) and professional rank.
Professional ranks can be enumerated in a sequential order: for example,
assistant, associate, and full for professors.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive, 0, or
negative.
Thus, in addition to providing a ranking of values, such attributes allow us to
compare and quantify the difference between values.
Example of Interval-scaled attributes.
A temperature attribute is interval-scaled.
The temperature in an air-conditioned room is 16 degrees Celsius, while the temperature outside the room is 32 degrees Celsius.
We can say that the temperature outside is 16 degrees higher than inside the room; however, because the Celsius scale has no true zero-point, we cannot say that 32 degrees is twice as warm as 16 degrees.
Calendar dates are another example.
For instance, the years 2002 and 2010 are eight years apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
In addition, the values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
Examples of ratio-scaled attributes include age, money, and weight.
If you are 50 years old and your son is 25 years old, you can claim that you are twice his age.
Discrete versus Continuous Attributes
There are many ways to organize attribute types.
Classification algorithms developed from the field of machine learning often
talk of attributes as being either discrete or continuous.
Each type may be processed differently.
A discrete attribute has a finite or countably infinite set of values, which may
or may not be represented as integers.
The attributes hair color, smoker, medical test, and drink size each have a
finite number of values, and so are discrete.
Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes or the values 0 to 110 for the attribute age.
If an attribute is not discrete, it is continuous.
The terms numeric attribute and continuous attribute are often used interchangeably in the literature.
In practice, real values are represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Data Visualization
Data visualization aims to communicate data clearly and effectively through
graphical representation.
Data visualization has been used extensively in many applications—for
example, at work for reporting, managing business operations, and tracking
progress of tasks.
More popularly, we can take advantage of visualization techniques to
discover data relationships that are otherwise not easily observable by
looking at the raw data.
Nowadays, people also use data visualization to create fun and interesting graphics.
Several representative approaches are considered: pixel-oriented techniques, geometric projection techniques, icon-based techniques, and hierarchical and graph-based techniques.
Pixel-Oriented Visualization Techniques
A simple way to visualize the value of a dimension is to use a pixel where the
color of the pixel reflects the dimension’s value.
For a data set of m dimensions, pixel-oriented techniques create m windows
on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
The colors of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared
by all windows.
The global order may be obtained by sorting all data records in a way that’s
meaningful for the task at hand.
Example of Pixel-oriented visualization.
All-Electronics maintains a customer information table, which consists of four
dimensions: income, credit limit, transaction volume, and age.
Can we analyze the correlation between income and the other attributes by
visualization?
We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows, as shown in
Figure below.
Using pixel based visualization, we can easily observe the following: credit
limit increases as income increases; customers whose income is in the
middle range are more likely to purchase more from All-Electronics; there is
no clear correlation between income and age.
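A minimal sketch of the pixel-oriented idea, assuming NumPy and matplotlib; the customer data below is synthetic and only stands in for the All-Electronics table.

```python
# One window (pixel grid) per dimension; records are sorted by income in every window.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 2500                                                      # number of customer records
income = np.sort(rng.normal(50_000, 15_000, n))               # global order: income ascending
credit_limit = income * 0.4 + rng.normal(0, 3_000, n)         # roughly follows income
volume = 100 * np.exp(-((income - 50_000) / 20_000) ** 2) + rng.normal(0, 10, n)
age = rng.integers(20, 70, n).astype(float)                   # unrelated to income

side = int(np.sqrt(n))                                        # lay each window out as a square pixel grid
dims = {"income": income, "credit limit": credit_limit,
        "transaction volume": volume, "age": age}

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, (name, values) in zip(axes, dims.items()):
    ax.imshow(values[: side * side].reshape(side, side),      # one pixel per record
              cmap="viridis", aspect="auto")
    ax.set_title(name)
    ax.set_xticks([]); ax.set_yticks([])
plt.tight_layout()
plt.show()
```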
Filling a window by laying out the data records in a linear way may not work
well for a wide window.
Note that the windows do not have to be rectangular.
For example, the circle segment technique uses windows in the shape of
segments of a circle, as illustrated in Figure below.
Geometric Projection Visualization Techniques
A scatter plot displays 2-D data points using Cartesian coordinates; a third dimension can be shown by using different colors or shapes for the data points.
Through this visualization, we can see, for example, that points of types "+" and "×" tend to be collocated (located close together).
A 3-D scatter plot uses three axes in a Cartesian coordinate system.
For data sets with more than four dimensions, scatter plots are usually ineffective.
The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n dimensional data set, a scatter-plot matrix is an n×n grid of 2-D
scatter plots that provides a visualization of each dimension with every other
dimension.
Figure below shows an example, which visualizes the Iris data set.
The data set consists of 150 samples, 50 from each of three species of Iris flowers.
There are five dimensions in the data set: length and width of sepal and
petal, and species.
The scatter-plot matrix becomes less effective as the dimensionality
increases.
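A minimal sketch of a scatter-plot matrix for the Iris data, assuming pandas, matplotlib, and scikit-learn are available.

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)                    # 150 samples, 4 numeric dimensions + species
scatter_matrix(iris.frame[iris.feature_names],     # every dimension plotted against every other
               c=iris.target, figsize=(8, 8), diagonal="hist")
plt.show()
```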
Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent
multidimensional data values.
We look at two popular icon-based techniques: Chernoff faces and stick
figures.
Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
They display multidimensional data of up to 18 variables (or dimensions) as a
cartoon human face (Figure below).
Chernoff faces help reveal trends in the data.
Components of the face, such as the eyes, ears, mouth, and nose, represent
values of the dimensions by their shape, size, placement, and orientation.
For example, dimensions can be mapped to the following facial
characteristics: eye size, eye spacing, nose length, nose width, mouth
curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye
eccentricity, and head eccentricity.
Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial
characteristics at once.
Viewing large tables of data can be tedious.
By condensing the data, Chernoff faces make the data easier for users to
digest.
In this way, they facilitate visualization of regularities and irregularities
present in the data, although their power in relating multiple relationships is
limited.
Another limitation is that specific data values are not shown.
Therefore, this mapping should be carefully chosen.
The stick figure visualization technique maps multidimensional data to
five-piece stick figures, where each figure has four limbs and a body.
Two dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs.
Figure below shows census data, where age and income are mapped to the
display axes, and the remaining dimensions (gender, education, and so on)
are mapped to stick figures.
Hierarchical Visualization Techniques
The visualization techniques discussed so far focus on visualizing multiple
dimensions simultaneously.
However, for a large data set of high dimensionality, it would be difficult to
visualize all dimensions at the same time.
Hierarchical visualization techniques partition all dimensions into subsets (i.e.,
subspaces).
The subspaces are visualized in a hierarchical manner.
“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method.
Given more dimensions, more levels of worlds can be used, which is why the method is called "worlds-within-worlds."
As another example of hierarchical visualization, tree-maps display hierarchical data as a set of nested rectangles.
For example, Figure below shows a tree-map visualizing Google news stories.
All news stories are organized into seven categories, each shown in a large
rectangle of a unique color.
Within each category (i.e., each rectangle at the top level), the news stories
are further partitioned into smaller subcategories.
Visualizing Complex Data and Relations
In the early days, visualization techniques were mainly used for numeric data.
Recently, more and more non-numeric data, such as text and social networks,
have become available.
Visualizing and analyzing such data attracts a lot of interest.
There are many new visualization techniques dedicated to these kinds of data.
For example, many people on the Web tag various objects such as pictures, blog
entries, and product reviews.
A tag cloud is a visualization of the statistics of user-generated tags.
Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
The importance of a tag is indicated by font size or color.
Figure below shows a tag cloud for visualizing the popular tags used in a Web
site.
Proximity Measures for Nominal Attributes
The dissimilarity between two objects i and j described by nominal attributes can be computed as the ratio of mismatches:

d(i, j) = (p - m) / p,   (2.11)

where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and p is the total number of attributes describing the objects.
Example- Dissimilarity between nominal attributes.
Suppose that we have the sample data of Table below, except that only the
object-identifier and the attribute test-1 are available, where test-1 is nominal.
Since here we have one nominal attribute, test-1, we set p=1 in Eq. (2.11) so
that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
Thus, we get the dissimilarity matrix

0
1   0
1   1   0
0   1   1   0

From this, we see that all objects are dissimilar except objects 1 and 4 (i.e., d(4,1) = 0).
Alternatively, similarity can be computed as

sim(i, j) = m / p = 1 - d(i, j).   (2.12)
Example- The following objects are described by two nominal attributes, nationality and hair color.

Object   Nationality   Hair color
1        American      Red
2        German        Blonde
3        Kenyan        Black
4        Japanese      Brown
5        American      White
6        Indian        Black
7        Kenyan        Black

Find the dissimilarity matrix.
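A minimal sketch that computes this exercise's dissimilarity matrix with Eq. (2.11), where p = 2 because each object is described by two nominal attributes.

```python
import numpy as np

objects = [
    ("American", "Red"),
    ("German",   "Blonde"),
    ("Kenyan",   "Black"),
    ("Japanese", "Brown"),
    ("American", "White"),
    ("Indian",   "Black"),
    ("Kenyan",   "Black"),
]
p = 2                                                              # number of nominal attributes
n = len(objects)
d = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        m = sum(a == b for a, b in zip(objects[i], objects[j]))    # number of matches
        d[i, j] = (p - m) / p                                      # Eq. (2.11)
print(np.round(d, 2))        # e.g., objects 3 and 7 are identical, so their entry is 0
```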
Proximity Measures for Binary Attributes
Let’s look at dissimilarity and similarity measures for objects described by
either symmetric or asymmetric binary attributes.
A binary attribute has only one of two states: 0 and 1, where 0 means that
the attribute is absent, and 1 means that it is present.
Given the attribute smoker describing a patient, for instance, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
Treating binary attributes as if they are numeric can be misleading.
Therefore, methods specific to binary data are necessary for computing
dissimilarity.
“So, how can we compute the dissimilarity between two binary attributes?”
One approach involves computing a dissimilarity matrix from the given binary
data.
If all binary attributes are thought of as having the same weight, we have the 2×2 contingency table below:

                     object j
                     1    0
object i    1        q    r
            0        s    t

where q is the number of attributes that equal 1 for both objects i and j,
r is the number of attributes that equal 1 for object i but 0 for object j,
s is the number of attributes that equal 0 for object i but 1 for object j, and
t is the number of attributes that equal 0 for both objects i and j.
The total number of attributes is p, where p = q + r + s + t.
For symmetric binary attributes, each state is equally valuable.
Dissimilarity that is based on symmetric binary attributes is called symmetric binary dissimilarity.
If objects i and j are described by symmetric binary attributes, then the dissimilarity between i and j is

d(i, j) = (r + s) / (q + r + s + t).   (2.13)
For asymmetric binary attributes, the two states are not equally important,
such as the positive (1) and negative (0) outcomes of a disease test.
Given two asymmetric binary attributes, the agreement of two 1s (a positive
match) is then considered more significant than that of two 0s (a negative
match).
Therefore, such binary attributes are often considered “monary” (having one
state).
The dissimilarity based on these attributes is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and is thus ignored in the computation:

d(i, j) = (r + s) / (q + r + s).   (2.14)
Complementarily, we can measure the difference between two binary
attributes based on the notion of similarity instead of dissimilarity.
For example, the asymmetric binary similarity between the objects i and j can be computed as

sim(i, j) = q / (q + r + s) = 1 - d(i, j).   (2.15)

The coefficient sim(i, j) of Eq. (2.15) is called the Jaccard coefficient and is popularly referenced in the literature.
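A minimal sketch of Eq. (2.14) and the Jaccard coefficient of Eq. (2.15); the two patient vectors below are hypothetical binary attribute values (1 = present, 0 = absent), not the records of the table in the example that follows.

```python
def asymmetric_binary(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    d = (r + s) / (q + r + s)                          # asymmetric binary dissimilarity, Eq. (2.14)
    return d, 1 - d                                    # 1 - d is the Jaccard coefficient, Eq. (2.15)

patient_a = [1, 0, 1, 0, 0, 0]     # hypothetical fever, cough, test-1 ... test-4 values
patient_b = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary(patient_a, patient_b))         # (0.33..., 0.66...)
```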
Example- Dissimilarity between binary attributes.
Suppose that a patient record table below contains the attributes name,
gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an
object identifier, gender is a symmetric attribute, and the remaining
attributes are asymmetric binary.
For asymmetric attribute values, let the values Y (yes) and P (positive) be set
to 1, and the value N (no or negative) be set to 0.
Suppose that the distance between objects (patients) is computed based only on
the asymmetric attributes.
According to Eq. (2.14), the distance between each pair of the three patients (Jack, Mary, and Jim) is

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33,
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67,
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75.
These measurements suggest that Jim and Mary are unlikely to have a similar
disease because they have the highest dissimilarity value among the three pairs.
Of the three patients, Jack and Mary are the most likely to have a similar disease.
Dissimilarity of Numeric Data: Minkowski Distance
Measures that are commonly used for computing the dissimilarity of objects
described by numeric attributes include the Euclidean, Manhattan, and
Minkowski distances.
In some cases, the data are normalized before applying distance calculations.
This involves transforming the data to fall within a smaller or common range,
such as[−1,1] or[0.0,1.0].
Consider a height attribute, for example, which could be measured in either
meters or inches.
In general, expressing an attribute in smaller units will lead to a larger range
for that attribute, and thus tend to give such attributes greater effect or
“weight.”
Normalizing the data attempts to give all attributes an equal weight.
It may or may not be useful in a particular application.
The most popular distance measure is Euclidean distance (i.e., straight line or
“as the crow flies”).
Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p numeric attributes.
The Euclidean distance between objects i and j is defined as

d(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xip - xjp)^2 ).   (2.16)

The Manhattan (city block) distance uses absolute differences instead of squared ones, and the Minkowski distance, d(i, j) = ( |xi1 - xj1|^h + ... + |xip - xjp|^h )^(1/h), generalizes both: h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.
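A minimal sketch, assuming NumPy, of the Euclidean distance and its Minkowski generalization for two numeric objects; the values are illustrative.

```python
import numpy as np

def minkowski(i, j, h):
    i, j = np.asarray(i, float), np.asarray(j, float)
    return np.sum(np.abs(i - j) ** h) ** (1.0 / h)

i, j = [22, 1, 42, 10], [20, 0, 36, 8]
print("Euclidean:", minkowski(i, j, 2))       # h = 2 gives Eq. (2.16)
print("Manhattan:", minkowski(i, j, 1))       # h = 1 gives the city-block distance
print("Minkowski (h=3):", minkowski(i, j, 3))
```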
Proximity Measures for Ordinal Attributes
Ordinal attributes are handled by replacing each value with its rank r in {1, ..., Mf} (step 1), normalizing each rank onto [0.0, 1.0] by computing z = (r - 1) / (Mf - 1) (step 2), and then computing the dissimilarity with a numeric distance measure on the normalized values (step 3).
Example- Dissimilarity between ordinal attributes.
There are three states for test-2: fair, good, and excellent; that is, Mf = 3.
For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and
rank 3 to 1.0.
For step 3, we can use, say, the Euclidean distance (Eq. 2.16) on the normalized values 1.0, 0.0, 0.5, and 1.0, which results in the following dissimilarity matrix:

0
1.0   0
0.5   0.5   0
0     1.0   0.5   0

Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1) = 1.0 and d(4,2) = 1.0).
This makes intuitive sense since objects 1 and 4 are both excellent.
Object 2 is fair, which is at the opposite end of the range of values for test-2.
Similarity values for ordinal attributes can be interpreted from dissimilarity as
sim(i,j)=1−d(i,j).
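A minimal sketch of the three ordinal-attribute steps applied to the test-2 values above.

```python
states = ["fair", "good", "excellent"]                    # ordered states, Mf = 3
rank = {s: r for r, s in enumerate(states, start=1)}      # step 1: value -> rank
Mf = len(states)

values = ["excellent", "fair", "good", "excellent"]       # test-2 for objects 1..4
z = [(rank[v] - 1) / (Mf - 1) for v in values]            # step 2: normalize to [0.0, 1.0]

# step 3: numeric distance on the normalized values (Euclidean distance in one
# dimension is just the absolute difference)
d = [[abs(a - b) for b in z] for a in z]
for row in d:
    print([round(x, 2) for x in row])
```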
Dissimilarity for Attributes of Mixed Types
In many real databases, objects are described by a mixture of attribute types.
In general, a database can contain all of these attribute types.
“So, how can we compute the dissimilarity between objects of mixed attribute
types?”
One approach is to group each type of attribute together, performing separate
data mining (e.g., clustering) analysis for each type.
This is feasible if these analyses derive compatible results.
However, in real applications, it is unlikely that a separate analysis per attribute
type will generate compatible results.
A preferable approach is to process all attribute types together, performing a single analysis.
One such technique combines the different attributes into a single dissimilarity
matrix, bringing all of the meaningful attributes onto a common scale of the
interval [0.0, 1.0].
Suppose that the data set contains p attributes of mixed type.
The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = ( Σ_{f=1..p} δij(f) dij(f) ) / ( Σ_{f=1..p} δij(f) ),   (2.22)

where the indicator δij(f) = 0 if the value of attribute f is missing for object i or object j, or if xif = xjf = 0 and attribute f is asymmetric binary; otherwise, δij(f) = 1. The term dij(f) is the contribution of attribute f to the dissimilarity between i and j.
The steps are identical to what we have already seen for each of the
individual attribute types.
The only difference is for numeric attributes, where we normalize so that the
values map to the interval [0.0, 1.0].
Thus, the dissimilarity between objects can be computed even when the
attributes describing the objects are of different types.
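A minimal sketch of Eq. (2.22); the per-attribute dissimilarities and indicators below are illustrative placeholders for a single pair of objects.

```python
import numpy as np

def mixed_dissimilarity(d_per_attr, delta):
    d_per_attr, delta = np.asarray(d_per_attr, float), np.asarray(delta, float)
    return (delta * d_per_attr).sum() / delta.sum()        # Eq. (2.22)

d_f = [1.0, 0.5, 0.55]     # dissimilarities for a nominal, an ordinal, and a numeric attribute
delta_f = [1, 1, 1]        # all three attributes are usable for this pair of objects
print(mixed_dissimilarity(d_f, delta_f))                   # (1.0 + 0.5 + 0.55) / 3, about 0.68
```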
Example- Dissimilarity between attributes of mixed type.
Let’s compute a dissimilarity matrix for the objects in Table below
Now we will consider all of the attributes, which are of different types.
In Example for nominal attributes and ordinal attributes, we worked out the
dissimilarity matrices for each of the individual attributes.
The procedures we followed for test-1 (which is nominal) and test-2 (which is
ordinal) are the same as outlined earlier for processing attributes of mixed
types.
Therefore, we can use the dissimilarity matrices obtained for test-1 and test-
2 later when we compute Eq. (2.22).
Now we compute the dissimilarity matrix for the third attribute, test-3 (which is numeric); that is, we must compute dij(3) for each pair of objects.
Following the case for numeric attributes, we let max_h(xh) = 64 and min_h(xh) = 22.
The difference between the two is used to normalize the values of the dissimilarity matrix, as required by Eq. (2.22).
We find all the dij values for attribute test-3 in this way; for example, d(2,1) = (45 - 22) / (64 - 22) ≈ 0.55.
This gives the dissimilarity matrix for test-3.
We can now combine the dissimilarity matrices obtained for test-1, test-2, and test-3 in our computation of Eq. (2.22), which yields a single dissimilarity matrix for the mixed-type data.
From their values for test-1 and test-2, we can intuitively guess that objects 1 and 4 are the most similar.
This is confirmed by the resulting dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects.
Similarly, the matrix indicates that objects 1 and 2 are the least similar.
Cosine Similarity
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as a keyword) or phrase in the document.
Thus, each document is an object represented by what is called a term-
frequency vector.
For example, in Table below, we see that Document1 contains five instances of
the word team, while hockey occurs three times.
The word coach is absent from the entire document, as indicated by a count
value of 0.
Such data can be highly asymmetric.
Term-frequency vectors are typically very long and sparse(i.e., they have
many 0 values).
Applications using such structures include information retrieval, text
document clustering, biological taxonomy, and gene feature mapping.
The traditional distance measures that we have studied in this chapter do
not work well for such sparse numeric data.
For example, two term-frequency vectors may have many 0 values in
common, meaning that the corresponding documents do not share many
words, but this does not make them similar.
We need a measure that will focus on the words that the two documents do
have in common, and the occurrence frequency of such words.
In other words, we need a measure for numeric data that ignores zero-
matches.
Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given
vector of query words.
Let x and y be two vectors for comparison.
Using the cosine measure as a similarity function, we have

sim(x, y) = (x · y) / (||x|| ||y||),   (2.23)

where ||x|| is the Euclidean norm of vector x = (x1, x2, ..., xp), defined as sqrt(x1^2 + x2^2 + ... + xp^2).
Conceptually, it is the length of the vector.
Similarly, ||y|| is the Euclidean norm of vector y.
The measure computes the cosine of the angle between vectors x and y.
A cosine value of 0 means that the two vectors are at 90 degrees to each
other (orthogonal) and have no match.
The closer the cosine value to 1, the smaller the angle and the greater the
match between vectors.
Example- Cosine similarity between two term-frequency vectors.
Suppose that x and y are the first two term-frequency vectors in Table 2.5.
That is, x=(5,0,3,0,2,0,0,2,0,0) and y=(3,0,2,0,1,1,0,1,0,1).
How similar are x and y?
Using Eq. (2.23) to compute the cosine similarity between the two vectors, we get:

x · y = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||x|| = sqrt(5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2) = sqrt(42) ≈ 6.48
||y|| = sqrt(3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2) = sqrt(17) ≈ 4.12
sim(x, y) = 25 / (6.48 × 4.12) ≈ 0.94

Therefore, under the cosine similarity measure, the two documents are quite similar.
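A minimal sketch, assuming NumPy, that reproduces this cosine-similarity computation.

```python
import numpy as np

x = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
y = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

sim = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))   # Eq. (2.23)
print(round(sim, 2))                    # 0.94, so the documents are quite similar
```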