AIML mod 5
Cluster analysis has been widely used in many applications such as business intelligence, pattern
recognition, image processing, bioinformatics, web technology, search engines, and text mining.
1. Business intelligence: Cluster analysis helps in target marketing, where marketers discover
groups of customers and categorize them based on their purchasing patterns. The information retrieved can
be used for market segmentation, product positioning (i.e., allocating products to specific market areas),
new product development, grouping of shopping items, and selecting test markets.
2. Pattern recognition: Here, the clustering methods group similar patterns into clusters whose
members are more similar to each other. In other words, the similarity of members within a
cluster is much higher than their similarity to members outside of it.
3. Image processing: Extracting and understanding information from images is very important
in image processing. The images are initially segmented and the different objects of interest
in them are then identified. This involves dividing an image into areas of similar attributes,
which is one of the most important and challenging tasks in image processing where clustering can
be applied. Image processing has applications in many areas such as the analysis of remotely
sensed images.
2. Ability to deal with different types of attributes: The data may be numeric, binary, nominal
(categorical), or ordinal. Nominal data is in alphabetical form and not in integer form. Binary
attributes are of two types: symmetric binary and asymmetric binary. In symmetric binary attributes,
both values are equally important; for example, gender (male and female). In asymmetric binary
attributes, the two values are not equally important; for example, result (pass and fail). The clustering
algorithm should also work for complex data types such as graphs, sequences, images, and
documents.
3. Discovery of clusters with arbitrary shape: Generally, clustering algorithms are designed to
determine spherical clusters. Due to the characteristics and diverse nature of the data used,
clusters may be of arbitrary shapes and may be nested within one another.
Traditional clustering algorithms, such as k-means and k-medoids, fail to detect non-
spherical shapes. Thus, it is important to have clustering algorithms that can detect clusters
of arbitrary shape.
4. Avoiding domain knowledge to determine input parameters: Many algorithms require
domain knowledge, such as the desired number of clusters, in the form of input parameters. The
clustering results may therefore become sensitive to these parameters. Such parameters are often
hard to determine for high-dimensionality data. The requirement for domain knowledge affects
the quality of clustering and burdens the user.
For example, in the k-means algorithm, the metric used to compare results for different values
of k is the mean distance between data points and their cluster centroid. Increasing the
number of clusters always reduces this distance, to the extreme of reaching zero when k equals
the number of data points, so this metric alone cannot be used to choose k. Instead, to roughly
determine k, the mean distance to the centroid is plotted as a function of k, and the "elbow point",
where the rate of decrease sharply shifts, is chosen. This is shown in Fig. 13.2 (a small code
sketch of this heuristic is given after this list).
5. Handling noisy data: Real-world data, which form the input of clustering algorithms, are
mostly affected by noise, and this results in poor-quality clusters. Noise is an unavoidable
problem that affects the data collection and data preparation processes. Therefore, the
algorithms we use should be able to deal with noise. There are two types of noise:
i. Attribute noise: implicit errors introduced by measurement tools, for example
those induced by different types of sensors.
ii. Random errors introduced by batch processes or experts when the data is
gathered, for example during the document digitization process.
6. Incremental clustering: The database used for clustering needs to be updated by adding
new data (incremental updates). Some clustering algorithms cannot incorporate
incremental updates but have to recompute a new clustering from scratch. The algorithms
which can accommodate new data without reconstructing the clusters are called
incremental clustering algorithms. It is more effective to use incremental clustering
algorithms.
7. Insensitivity to input order: Some clustering algorithms are sensitive to the order in
which data objects are entered. Such algorithms are not ideal, since we usually have little
control over the order in which data objects are presented. Clustering algorithms should be
insensitive to the input order of data objects.
8. Handling high-dimensional data: A dataset can contain numerous dimensions or
attributes. Generally, clustering algorithms are good at handling low-dimensional data
such as datasets involving only two or three dimensions. Clustering algorithms which can
handle high-dimensional space are more effective.
9. Handling constraints: Constrained clustering can be considered to contain a set of must-
link constraints, cannot-link constraints, or both. In a must-link constraint, two instances
in the must-link relation should be included in the same cluster. On the other hand, a
cannot-link constraint specifies that the two instances cannot be in the same cluster. These
sets of constraints act as guidelines to cluster the entire dataset. Some constrained
clustering algorithms cancel the clustering process if they cannot form clusters which
satisfy the specified constraints. Others try to minimize the amount of constraint violation
if it is impossible to find a clustering which satisfies the constraints. Constraints can be
used to select a clustering model to follow among different clustering methods. A
challenging task is to find data groups with good clustering behaviour that satisfy
constraints.
10. Interpretability and usability: Users require the clustering results to be interpretable,
usable, and comprehensible. Clustering is always tied to specific semantic interpretations
and applications.
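The elbow heuristic mentioned in requirement 4 can be sketched as follows. This is a minimal illustration, assuming scikit-learn and matplotlib are available; the small 1-D dataset is a placeholder and not part of the text.

```python
# Sketch of the "elbow" heuristic for choosing k (see requirement 4 and Fig. 13.2).
# Assumes scikit-learn and matplotlib are installed; X is any 2-D array of samples.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[2], [3], [4], [6], [12], [14], [15], [16],
              [21], [23], [25], [30], [31], [35], [38]], dtype=float)

ks = range(1, 9)
mean_dists = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Mean distance of each point to its assigned centroid.
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    mean_dists.append(d.mean())

plt.plot(list(ks), mean_dists, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("mean distance to centroid")
plt.title("Elbow plot (cf. Fig. 13.2)")
plt.show()
```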
13.2 Types of Clustering
Q. Explain the classification of clustering algorithms
Hierarchical clustering builds a hierarchy of clusters from a data-set. It does not require
prespecifying the number of clusters to be generated. The result of hierarchical clustering is a
tree-based representation of the objects, which is also known as a dendrogram. Observations can be
subdivided into groups by cutting the dendrogram at a desired similarity level. We classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches:
1. Agglomerative approach: This approach is also known as the bottom-up approach. In this
approach, we start with each object forming a separate group. It keeps on merging the objects
or groups that are close to one another. It keeps on doing so until all of the groups are merged
into one or until the termination condition holds.
2. Divisive approach: This approach is also known as the top-down approach. In this approach,
we start with all of the objects in the same cluster. In each successive iteration, a cluster is split
up into smaller clusters, until each object is in its own cluster or the termination condition
holds. Hierarchical methods are rigid, because once a merging or a splitting is done, it can never
be undone.
13.2.3 Density-Based Methods
Density-based clustering algorithms find clusters of arbitrary (nonlinear) shape based on density. Density-
Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based
algorithm. It uses the concepts of density reachability and density connectivity.
1. Density reachability: A point "p" is said to be density reachable from a point "q" if "p" is within
distance ε from "q" and "q" has a sufficient number of points in its neighborhood that are
within distance ε.
2. Density connectivity: Points "p" and "q" are said to be density-connected if there exists a
point "r" which has a sufficient number of points in its neighborhood and both "p" and "q" are within
distance ε of it. This is called the chaining process: if "q" is a neighbor of "r", "r" is a neighbor of "s", "s"
is a neighbor of "t", and "t" is a neighbor of "p", this implies that "q" is connected to "p".
13.2.4 Grid-Based Methods
The grid-based clustering approach differs from the conventional clustering algorithms in that it is
concerned not with the data points but with the value space that surrounds the data points. In general, a
typical grid-based clustering algorithm consists of the following five basic steps (a small sketch of the
first steps is given after this list):
1. Create the grid structure, i.e., partition the data space into a finite number of cells.
2. Calculate the cell density for each cell.
3. Sort the cells according to their densities.
4. Identify cluster centers.
5. Traverse neighbor cells.
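A minimal numpy sketch of the first three steps (grid creation, cell densities, sorting by density) follows; the data and the 5 x 5 grid are illustrative assumptions, and cluster-center identification and neighbor traversal are not shown.

```python
# Minimal sketch of the first steps of a grid-based method using numpy:
# partition the value space into cells, count the points per cell, and sort
# the cells by density. The two-blob data and the 5 x 5 grid are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[[0, 0]] * 50 + [[5, 5]] * 50, scale=0.5)  # two dense regions

# Step 1: create the grid structure (5 x 5 cells over the data range).
counts, edges = np.histogramdd(X, bins=(5, 5))

# Step 2: cell density = number of points falling in each cell.
# Step 3: sort cells by density (densest first).
order = np.dstack(np.unravel_index(np.argsort(counts, axis=None)[::-1],
                                   counts.shape))[0]
for i, j in order[:3]:
    print(f"cell ({i}, {j}) contains {int(counts[i, j])} points")
```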
3. Assign each data point to the cluster whose center is at the minimum distance from it.
4. Recalculate the new cluster centers using Eq. (13.2).
By assigning each data point to the cluster center whose distance from it is the minimum of all the cluster
centers, we get the following table.
C1 (m1 = 2): {2, 3, 4, 6}
C2 (m2 = 16): {12, 14, 15, 16, 21, 23, 25}
C3 (m3 = 38): {30, 31, 35, 38}
New cluster centers:
m1 = (2 + 3 + 4 + 6)/4 = 3.75
m2 = (12 + 14 + 15 + 16 + 21 + 23 + 25)/7 = 18
m3 = (30 + 31 + 35 + 38)/4 = 33.50
Similarly, using the new cluster centers we can calculate the distance of each point from them and allocate
the points to clusters based upon the minimum distance.
Iteration 2
Data Point | Distance from C1 (3.75) | Distance from C2 (18) | Distance from C3 (33.50) | Minimum distance | Belongs to cluster
2          | (2 - 3.75)^2 = 3.06     | (2 - 18)^2 = 256      | (2 - 33.50)^2 = 992.25   | 3.06             | C1
4          | (4 - 3.75)^2 = 0.06     | (4 - 18)^2 = 196      | (4 - 33.50)^2 = 870.25   | 0.06             | C1
6          | (6 - 3.75)^2 = 5.06     | (6 - 18)^2 = 144      | (6 - 33.50)^2 = 756.25   | 5.06             | C1
31         | (31 - 3.75)^2 = 742.56  | (31 - 18)^2 = 169     | (31 - 33.50)^2 = 6.25    | 6.25             | C3
38         | (38 - 3.75)^2 = 1173.06 | (38 - 18)^2 = 400     | (38 - 33.50)^2 = 20.25   | 20.25            | C3
30         | (30 - 3.75)^2 = 689.06  | (30 - 18)^2 = 144     | (30 - 33.50)^2 = 12.25   | 12.25            | C3
Since there is no difference in the clusters formed and the cluster centers between iterations 1 and 2,
we stop the procedure.
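The iteration above can be reproduced with a short from-scratch sketch (written for readability rather than efficiency), using the same 1-D data and the initial centers 2, 16 and 38.

```python
# From-scratch sketch of the k-means iteration worked through above,
# using the same 1-D data and the initial centers 2, 16 and 38.
import numpy as np

data = np.array([2, 3, 4, 6, 12, 14, 15, 16, 21, 23, 25, 30, 31, 35, 38], dtype=float)
centers = np.array([2.0, 16.0, 38.0])

while True:
    # Assign each point to the nearest center (squared distance).
    labels = np.argmin((data[:, None] - centers[None, :]) ** 2, axis=1)
    # Recompute each center as the mean of its assigned points (Eq. 13.2).
    new_centers = np.array([data[labels == k].mean() for k in range(len(centers))])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)                                      # final centers: 3.75, 18.0, 33.5
print([list(data[labels == k]) for k in range(3)])  # final clusters
```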
Q. Apply the k-means algorithm to the given data for k = 2 (that is, 2 clusters). Use C1(80) and C2(250)
as the initial cluster centers. Data: 234, 123, 456, 23, 34, 56, 78, 90, 150, 116, 117, 118, 199.
Iteration 1
The distance of each data point from C1(80) and C2(250) is computed and the points are grouped as follows:
C1 (m1 = 80): {123, 23, 34, 56, 78, 90, 116, 117, 118}
C2 (m2 = 250): {150, 199, 234, 456}
New cluster centers:
m1 = 83.9    m2 = 259.75
Iteration 2
Using the new centers C1(83.9) and C2(259.75), each point is assigned to the nearest center:
C1 (m1 = 83.9): {123, 23, 34, 56, 78, 90, 116, 117, 118, 150}
C2 (m2 = 259.75): {199, 234, 456}
New cluster centers:
m1 = 90.5    m2 = 296.3
Iteration 3
Using the centers C1(90.5) and C2(296.3), each point is again assigned to the nearest center; for example,
56, 78, 90, 116, 117, 118 and 150 remain in C1, while 199 remains in C2.
C1 (m1 = 90.5): {123, 23, 34, 56, 78, 90, 116, 117, 118, 150}
C2 (m2 = 296.3): {199, 234, 456}
New cluster centers:
m1 = 90.5    m2 = 296.3
Since there is no change in the cluster centers from iteration 2 to iteration 3, we stop the procedure.
Q. Apply k-means clustering to the dataset given in Table 13.1 for two clusters. Tabulate all
the assignments.
Sample No. X Y
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
Solution:
Taking C1 = (185, 72) and C2 = (170, 56), i.e., samples 1 and 2, as the initial cluster centers, the distance
of each sample from the two centers is computed and the sample is assigned to the nearer center. For
example, for sample 4, (179, 68):
Distance from C1(185, 72) = sqrt((185 - 179)^2 + (72 - 68)^2) = 7.21
Distance from C2(170, 56) = sqrt((170 - 179)^2 + (56 - 68)^2) = 15
so sample 4 is assigned to C1. The assignments are:
Sample No.   X     Y    Assignment
1            185   72   C1
2            170   56   C2
3            168   60   C2
4            179   68   C1
5            182   72   C1
6            188   77   C1
After the second iteration, the assignments have not changed, and hence the algorithm is stopped and the
points are clustered.
Equation 13.2 (cluster center update): each new cluster center is the mean of the points assigned to that
cluster, i.e., mi = (1/|Ci|) * Σ x over all points x in Ci.
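As a cross-check, the 2-D example above can be reproduced with scikit-learn by initialising the two centers at samples 1 and 2; this is a verification sketch, not part of the original solution.

```python
# Sketch verifying the 2-D k-means example above with scikit-learn,
# initialising the two centers at samples 1 and 2 as in the worked solution.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
init = np.array([[185, 72], [170, 56]], dtype=float)   # C1, C2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # 0 = C1, 1 = C2 for each sample
print(km.cluster_centers_)  # final centers
```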
The most common algorithm for k-medoids clustering is Partitioning Around Medoids (PAM).
PAM uses a greedy search, which is faster than an exhaustive search but may not find the optimum
solution. It works as follows:
1. Initialize: select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. While the cost of the configuration decreases, for each medoid m and for each non-medoid
data point O':
Swap m and O', and recompute the cost (sum of distances of points to their medoid).
If the total cost of the configuration increased in the previous step, undo the swap.
Solution:
Step 1: Two observations, C1 = X2 = (3, 4) and C2 = X6 = (6, 4), are randomly selected as medoids (cluster
centers).
Step 2: Manhattan distances are calculated to each center to associate each data object to its nearest
medoid.
The Manhattan distance formula is used to calculate the distances between a medoid (cluster center) and the
non-medoid points (those which are not cluster centers):
Distance = |X1 - X2| + |Y1 - Y2|
1. Distance between point ( 2, 6 ) and C1 medoid ( 3, 4 ) = | ( 2 - 3 ) | + | ( 6 – 4 ) | = 1 + 2 = 3
Distance between point ( 2, 6 ) and C2 medoid ( 6, 4 ) = | ( 2 - 6) | + | ( 6 - 4 ) | = 4 + 2 = 6
2. Distance between point ( 3, 4 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 4 – 4 ) | = 0 + 0 = 0
Distance between point ( 3, 4 ) and C2 medoid ( 6, 4 ) = | ( 3 - 6) | + | ( 4 - 4 ) | = 3 + 0 = 3
3. Distance between point ( 3, 8 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 8 – 4 ) | = 0 + 4 = 4
Distance between point ( 3, 8 ) and C2 medoid ( 6, 4 ) = | ( 3 - 6) | + | ( 8 - 4 ) | = 3 + 4 = 7
4. Distance between point ( 4, 2 ) and C1 medoid ( 3, 4 ) = | ( 4 - 3 ) | + | ( 2 – 4 ) | = 1 + 2 = 3
Distance between point ( 4, 2 ) and C2 medoid ( 6, 4 ) = | ( 4 - 6) | + | ( 2 - 4 ) | = 2 + 2 = 4
5. Distance between point ( 6, 2 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 2 – 4 ) | = 3 + 2 = 5
Distance between point ( 6, 2 ) and C2 medoid ( 6, 4 ) = | ( 6 - 6) | + | ( 2 - 4 ) | = 0 + 2 = 2
6. Distance between point ( 6, 4 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 4 – 4 ) | = 3 + 0 = 3
Distance between point ( 6, 4 ) and C2 medoid ( 6, 4 ) = | ( 6 - 6) | + | ( 4 - 4 ) | = 0 + 0 = 0
Data Object         Distance to    Distance to    Cost (distance to the
(Sample Point)      C1 (3, 4)      C2 (6, 4)      closer medoid)
X1 (2, 6)           3              6              3
X2 (3, 4)           0              3              0
X3 (3, 8)           4              7              4
X4 (4, 2)           3              4              3
X5 (6, 2)           5              2              2
X6 (6, 4)           3              0              0
Cost of C1 = 3 + 0 + 4 + 3 = 10    Cost of C2 = 2 + 0 = 2    Total cost = 12
Each point is assigned to the cluster of the medoid whose dissimilarity is smaller: points 1, 2, 3 and 4 go to
cluster C1, and points 5 and 6 go to cluster C2.
Step 3: We select a non-medoid point O' = (6, 2) and compute the total cost with C1(3, 4) and O'(6, 2) as
medoids. The cost of swapping the medoid from C2(6, 4) to O'(6, 2) is 11. Since this cost is less than the
previous cost of 12, this is considered a better cluster assignment, and the swap is done.
Step 4: We select another non-medoid O'. Let us assume O' = (4, 2), so the candidate medoids are C1(3, 4) and
O'(4, 2). With C1 and O' as the medoids, we calculate the total cost involved. For example:
1. Distance between point ( 2, 6 ) and C1 medoid ( 3, 4 ) = | ( 2 - 3 ) | + | ( 6 - 4 ) | = 1 + 2 = 3
Computing the distances for all points in the same way, the cost of swapping the medoid from C2(6, 2) to
O'(4, 2) is 12. Since this cost is more than 11, this cluster assignment is not considered and the swap is not done.
Thus, we try the other non-medoid points to get the minimum cost; the assignment with the minimum cost is
considered the best. For some applications, k-medoids shows better results than k-means. The most time-
consuming part of the k-medoids algorithm is the calculation of the distances between objects. The distance
matrix can be computed in advance to speed up the process.
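The cost computations in this example can be reproduced with the short sketch below, which evaluates the total Manhattan distance to the nearest medoid for the three configurations discussed above.

```python
# Sketch of the PAM cost used in the example: the total Manhattan distance
# of every point to its nearest medoid, evaluated for three configurations.
import numpy as np

X = np.array([[2, 6], [3, 4], [3, 8], [4, 2], [6, 2], [6, 4]], dtype=float)

def total_cost(medoids):
    """Sum of Manhattan distances from each point to its closest medoid."""
    d = np.abs(X[:, None, :] - np.asarray(medoids, dtype=float)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(total_cost([(3, 4), (6, 4)]))   # initial medoids X2, X6      -> 12.0
print(total_cost([(3, 4), (6, 2)]))   # swap C2 with O' = (6, 2)    -> 11.0
print(total_cost([(3, 4), (4, 2)]))   # swap with O' = (4, 2)       -> 12.0
```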
2. Find and merge the next two closest points, where a point is either an individual object or a cluster
of objects
3. If more than one cluster remains, return to step 2.
The distance between two clusters Ci and Cj can be measured in one of the following ways:
Minimum distance: dmin(Ci, Cj) = min |p - p'|, over all points p in Ci and p' in Cj
Maximum distance: dmax(Ci, Cj) = max |p - p'|, over all points p in Ci and p' in Cj
Average distance: davg(Ci, Cj) = (1 / (ni * nj)) Σ |p - p'|, summed over all points p in Ci and p' in Cj
When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is
called a nearest-neighbor clustering algorithm. If the clustering process is terminated when the distance
between the nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm.
An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a
minimum spanning tree algorithm, since a spanning tree of a graph is a tree that connects all vertices and a
minimal spanning tree is one with the least sum of edge weights.
An algorithm that uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters is
called a farthest-neighbor clustering algorithm. If clustering is terminated when the maximum distance
exceeds a user-defined threshold, it is called a complete-linkage algorithm.
The minimum and maximum measures tend to be sensitive to outliers or noisy data. The third measure, the
average distance, is therefore suggested to reduce the outlier problem. Another advantage is that it can handle
categorical data as well.
Algorithm: The agglomerative algorithm is carried out in three steps, and the flowchart is shown in Fig.
13.4.
1. Convert the object attributes to a distance matrix.
2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one:
• Merge the two closest clusters.
• Update the distance matrix.
Solution:
To compute the distance matrix, the Euclidean distance between every pair of points is calculated:
d(Pi, Pj) = sqrt((xi - xj)^2 + (yi - yj)^2)
Here the minimum value is 0.10, which is the distance between P3 and P6, and hence we combine P3 and P6.
To update the distance matrix,
d((P3, P6), P1) = min(d(P3, P1), d(P6, P1)) = min(0.22, 0.24) = 0.22
d((P3, P6), P2) = min(d(P3, P2), d(P6, P2)) = min(0.14, 0.24) = 0.14
d((P3, P6), P4) = min(d(P3, P4), d(P6, P4)) = min(0.13, 0.22) = 0.13
d((P3, P6), P5) = min(d(P3, P5), d(P6, P5)) = min(0.28, 0.39) = 0.28
Here the minimum value is 0.13, which is the distance between (P3, P6) and P4, and hence we combine (P3,
P6) and P4. To update the distance matrix,
d(((P3, P6), P4), P1) = min(d((P3, P6), P1), d(P4, P1)) = min(0.22, 0.37) = 0.22
d(((P3, P6), P4), P2) = min(d((P3, P6), P2), d(P4, P2)) = min(0.14, 0.19) = 0.14
d(((P3, P6), P4), P5) = min(d((P3, P6), P5), d(P4, P5)) = min(0.28, 0.23) = 0.23
The distance matrix is updated after merging the two closest members (P3, P6) and P4.
Here the minimum value is 0.14, which is the distance between P2 and P5, and hence we combine P2 and P5. To
update the distance matrix,
d((P2, P5), P1) = min(d(P2, P1), d(P5, P1)) = min(0.23, 0.34) = 0.23
d((P2, P5), (P3, P6, P4)) = min(d(P2, (P3, P6, P4)), d(P5, (P3, P6, P4))) = min(0.14, 0.23) = 0.14
Here the minimum value is 0.14, which is the distance between (P3, P6, P4) and (P2, P5), and hence we combine
them. To update the distance matrix,
d((P2, P5, P3, P6, P4), P1) = min(d((P2, P5), P1), d((P3, P6, P4), P1)) = min(0.23, 0.22) = 0.22
Finally, the two remaining clusters (P2, P5, P3, P6, P4) and P1 are merged at distance 0.22, completing the hierarchy.
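The single-linkage merges above can be cross-checked with SciPy by supplying the pairwise distances quoted in the example in condensed form; the merge heights should come out as 0.10, 0.13, 0.14, 0.14 and 0.22, although tie-breaking may order the two 0.14 merges differently.

```python
# Cross-check of the single-linkage example using the pairwise distances
# quoted above, listed in SciPy's condensed order:
# d12, d13, d14, d15, d16, d23, d24, d25, d26, d34, d35, d36, d45, d46, d56
from scipy.cluster.hierarchy import linkage

condensed = [0.23, 0.22, 0.37, 0.34, 0.24,
             0.14, 0.19, 0.14, 0.24,
             0.13, 0.28, 0.10,
             0.23, 0.22,
             0.39]
Z = linkage(condensed, method="single")
print(Z[:, 2])   # merge heights: 0.10, 0.13, 0.14, 0.14, 0.22
```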
Solution:
To compute the distance matrix, the Euclidean distance between every pair of points is calculated.
Here the minimum value is 0.5, which is the distance between P4 and P6, and hence we combine them. To
update the distance matrix (complete linkage uses the maximum distance),
d((P4, P6), P1) = max(d(P4, P1), d(P6, P1)) = max(3.6, 3.2) = 3.6
d((P4, P6), P2) = max(d(P4, P2), d(P6, P2)) = max(2.92, 2.5) = 2.92
d((P4, P6), P3) = max(d(P4, P3), d(P6, P3)) = max(2.24, 2.5) = 2.5
d((P4, P6), P5) = max(d(P4, P5), d(P6, P5)) = max(1.0, 1.12) = 1.12
Again, we find the minimum element in the distance matrix and merge the two closest clusters. The minimum
value is 0.71, and hence we combine P1 and P2. To update the distance matrix,
d((P1, P2), P3) = max(d(P1, P3), d(P2, P3)) = max(5.66, 4.95) = 5.66
d((P1, P2), (P4, P6)) = max(d(P1, (P4, P6)), d(P2, (P4, P6))) = max(3.6, 2.92) = 3.6
d((P1, P2), P5) = max(d(P1, P5), d(P2, P5)) = max(4.24, 3.53) = 4.24
Again, the minimum element in the distance matrix is 1.12, and hence we combine (P4, P6) and P5. To update
the distance matrix,
d((P4, P6, P5), (P1, P2)) = max(d((P4, P6), (P1, P2)), d(P5, (P1, P2))) = max(3.6, 4.24) = 4.24
d((P4, P6, P5), P3) = max(d((P4, P6), P3), d(P5, P3)) = max(2.5, 1.41) = 2.5
Again, the minimum element in the distance matrix is 2.5, and hence we combine (P4, P6, P5) and P3. To
update the distance matrix,
d((P4, P6, P5, P3), (P1, P2)) = max(d((P4, P6, P5), (P1, P2)), d(P3, (P1, P2))) = max(4.24, 5.66) = 5.66
Solution:
The distance matrix is the same as in the previous example.
Merging the two closest members by finding the minimum element in the distance matrix and forming the
clusters, we get:
Here the minimum value is 0.5, which is the distance between P4 and P6, and hence we combine them. To
update the distance matrix (average linkage uses the average distance),
d((P4, P6), P1) = average(d(P4, P1), d(P6, P1)) = average(3.6, 3.2) = 3.4
d((P4, P6), P2) = average(d(P4, P2), d(P6, P2)) = average(2.92, 2.5) = 2.71
d((P4, P6), P3) = average(d(P4, P3), d(P6, P3)) = average(2.24, 2.5) = 2.37
d((P4, P6), P5) = average(d(P4, P5), d(P6, P5)) = average(1.0, 1.12) = 1.06
The minimum element in the distance matrix is now 0.71, and hence we combine P1 and P2. To update the
distance matrix:
d((P1, P2), P3) = average(d(P1, P3), d(P2, P3)) = average(5.66, 4.95) = 5.31
d((P1, P2), (P4, P6)) = average(d(P1, (P4, P6)), d(P2, (P4, P6))) = average(3.4, 2.71) = 3.06
d((P1, P2), P5) = average(d(P1, P5), d(P2, P5)) = average(4.24, 3.53) = 3.89
The minimum element in the distance matrix is now 1.06, and hence we combine (P4, P6) and P5. To update
the distance matrix:
d((P4, P6, P5), (P1, P2)) = average(d((P4, P6), (P1, P2)), d(P5, (P1, P2))) = average(3.06, 3.89) ≈ 3.47
d((P4, P6, P5), P3) = average(d((P4, P6), P3), d(P5, P3)) = average(2.37, 1.41) = 1.89
The minimum element in the distance matrix is now 1.89, and hence we combine (P4, P6, P5) and P3. To
update the distance matrix:
d((P4, P6, P5, P3), (P1, P2)) = average(d((P4, P6, P5), (P1, P2)), d(P3, (P1, P2))) = average(3.47, 5.31) ≈ 4.39
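Both the complete-linkage and the average-linkage examples can be cross-checked with SciPy from the pairwise distances quoted above. Note that SciPy's 'weighted' method (WPGMA) matches the update rule used here, namely the average of the two existing cluster-to-cluster distances, whereas SciPy's 'average' method (UPGMA) averages over all point pairs and gives slightly different values.

```python
# Cross-check of the complete- and average-linkage examples, using the
# pairwise distances quoted above in SciPy's condensed order:
# d12, d13, d14, d15, d16, d23, d24, d25, d26, d34, d35, d36, d45, d46, d56
from scipy.cluster.hierarchy import linkage

condensed = [0.71, 5.66, 3.60, 4.24, 3.20,
             4.95, 2.92, 3.53, 2.50,
             2.24, 1.41, 2.50,
             1.00, 0.50,
             1.12]
print(linkage(condensed, method="complete")[:, 2])  # 0.5, 0.71, 1.12, 2.5, 5.66
print(linkage(condensed, method="weighted")[:, 2])  # 0.5, 0.71, 1.06, 1.89, ~4.39
```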
INTRODUCTION
The most basic instance-based method is the k-Nearest Neighbor learning algorithm. This
algorithm assumes all instances correspond to points in the n-dimensional space R^n.
The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance.
Let an arbitrary instance x be described by the feature vector
(a1(x), a2(x), ..., an(x))
where ar(x) denotes the value of the rth attribute of instance x. The distance between two
instances xi and xj is then defined as
d(xi, xj) = sqrt( Σ from r = 1 to n of (ar(xi) - ar(xj))^2 )
Below figure illustrates the operation of the k-Nearest Neighbor algorithm for the case where
the instances are points in a two-dimensional space and where the target function is Boolean
valued.
The positive and negative training examples are shown by “+” and “-” respectively. A
query point xq is shown as well.
The 1-Nearest Neighbor algorithm classifies xq as a positive example in this figure,
whereas the 5-Nearest Neighbor algorithm classifies it as a negative example.
Below figure shows the shape of this decision surface induced by 1- Nearest Neighbor over
the entire instance space. The decision surface is a combination of convex polyhedra
surrounding each of the training examples.
For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example. Query points
outside the polyhedron are closer to some other training example. This kind of diagram
is often called the Voronoi diagram of the set of training examples.
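A minimal from-scratch k-NN sketch over points in R^n with the standard Euclidean distance is shown below. The tiny Boolean-valued training set is an illustrative assumption, chosen so that, as in the figure described above, the 1-nearest neighbor classifies the query as positive while the 5 nearest neighbors classify it as negative.

```python
# Minimal k-Nearest Neighbor sketch using the standard Euclidean distance.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k):
    # Euclidean distance from the query to every training instance.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority class among the k

X_train = np.array([[3.2, 3.1], [0, 0], [0, 1],        # "+" examples
                    [4, 4], [4, 3], [3, 4], [2, 4]],    # "-" examples
                   dtype=float)
y_train = np.array(["+", "+", "+", "-", "-", "-", "-"])
xq = np.array([3.0, 3.0])

print(knn_classify(X_train, y_train, xq, k=1))   # '+' : the single nearest example
print(knn_classify(X_train, y_train, xq, k=5))   # '-' : majority of the 5 nearest
```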
Terminology
The phrase "locally weighted regression" is called local because the function is
approximated based only on data near the query point, weighted because the
contribution of each training example is weighted by its distance from the query point,
and regression because this is the term used widely in the statistical learning community
for the problem of approximating real-valued functions.
Given a new query instance xq, the general approach in locally weighted regression is
to construct an approximation f̂ that fits the training examples in the neighborhood
surrounding xq. This approximation is then used to calculate the value f̂(xq), which is
output as the estimated target value for the query instance.
Consider locally weighted regression in which the target function f is approximated near
xq using a linear function of the form
f̂(x) = w0 + w1 a1(x) + ... + wn an(x)
where ai(x) denotes the value of the ith attribute of the instance x.
Gradient descent methods can be used to choose weights that minimize the squared error summed
over the set D of training examples. We need to modify this procedure to derive a local approximation
rather than a global one. The simple way is to redefine the error criterion E to emphasize fitting the
local training examples. Three possible criteria are given below.
1. Minimize the squared error over just the k nearest neighbors:
E1(xq) = 1/2 Σ (f(x) - f̂(x))^2, summed over the k nearest neighbors x of xq
2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq:
E2(xq) = 1/2 Σ (f(x) - f̂(x))^2 K(d(xq, x)), summed over all x in D
3. Combine 1 and 2:
E3(xq) = 1/2 Σ (f(x) - f̂(x))^2 K(d(xq, x)), summed over the k nearest neighbors x of xq
Rederiving the gradient descent rule using the third criterion gives the training rule
Δwj = η Σ K(d(xq, x)) (f(x) - f̂(x)) aj(x), summed over the k nearest neighbors x of xq
The differences between this new rule and the rule given by Equation (3) are that the
contribution of instance x to the weight update is now multiplied by the distance penalty
K(d(xq, x)), and that the error is summed over only the k nearest training examples.
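A minimal sketch of locally weighted linear regression is given below: for a query point, a linear model is fitted by weighted least squares, with each training example weighted by a Gaussian kernel K(d(xq, x)). The data, the kernel width tau and the one-dimensional setting are illustrative assumptions.

```python
# Sketch of locally weighted linear regression: fit a linear model at each
# query point by weighted least squares, with Gaussian kernel weights.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60)
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)    # noisy target values

def lwr_predict(xq, X, y, tau=0.5):
    # Gaussian kernel: the weight decreases with distance from the query point.
    w = np.exp(-((X - xq) ** 2) / (2 * tau ** 2))
    A = np.column_stack([np.ones_like(X), X])         # design matrix [1, x]
    W = np.diag(w)
    # Weighted least squares: solve (A^T W A) beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta[0] + beta[1] * xq                     # local linear prediction at xq

print(lwr_predict(2.0, X, y))   # should be close to sin(2.0) ≈ 0.91
```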
The learned hypothesis is a function of the form (Equation 1)
f̂(x) = w0 + Σ from u = 1 to k of wu Ku(d(xu, x))
where each xu is an instance from X and the kernel function Ku(d(xu, x)) is
defined so that it decreases as the distance d(xu, x) increases.
Here k is a user-provided constant that specifies the number of kernel functions to be
included.
Although f̂(x) is a global approximation to f(x), the contribution from each of the Ku(d(xu, x)) terms
is localized to a region near the point xu.
A common choice is to take each function Ku(d(xu, x)) to be a Gaussian function centred at the point xu with
some variance σu^2:
Ku(d(xu, x)) = exp(-d^2(xu, x) / (2 σu^2))
The functional form of Equation (1) can approximate any function with arbitrarily small error,
provided a sufficiently large number k of such Gaussian kernels is used and provided the width
σ^2 of each kernel can be separately specified.
The function given by Equation (1) can be viewed as describing a two-layer network where
the first layer of units computes the values of the various Ku(d(xu, x)) and where the
second layer computes a linear combination of these first-layer unit values.
Given a set of training examples of the target function, RBF networks are typically trained in
a two-stage process.
1. First, the number k of hidden units is determined and each hidden unit u is defined by
choosing the values of xu and σu^2 that define its kernel function Ku(d(xu, x)).
2. Second, the weights wu are trained to maximize the fit of the network to the training
data, using the global error criterion
E = 1/2 Σ (f(xd) - f̂(xd))^2, summed over all training examples xd in D
Because the kernel functions are held fixed during this second stage, the linear weight
values wu can be trained very efficiently.
Several alternative methods have been proposed for choosing an appropriate number of hidden
units or, equivalently, kernel functions.
One approach is to allocate a Gaussian kernel function for each training example
(xi, f(xi)), centring this Gaussian at the point xi.
Each of these kernels may be assigned the same width σ^2. Given this approach, the RBF
network learns a global approximation to the target function in which each training
example (xi, f(xi)) can influence the value of f̂ only in the neighbourhood of xi.
A second approach is to choose a set of kernel functions that is smaller than the number
of training examples. This approach can be much more efficient than the first approach,
especially when the number of training examples is large.
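The two-stage training procedure described above can be sketched as follows, assuming k-means is used to choose the kernel centers, a common width sigma is used for every kernel, and the linear output weights are fitted by least squares; the data, k and sigma are illustrative assumptions.

```python
# Sketch of two-stage RBF network training: (1) choose k Gaussian kernel
# centers (here via k-means) with a common width sigma, (2) fit the linear
# output weights by least squares while the kernels stay fixed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                                   # target function f

k, sigma = 10, 0.5
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

def design(X):
    # Phi[i, u] = Ku(d(xu, xi)) = exp(-d^2 / (2 sigma^2)), plus a bias column for w0.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack([np.ones(len(X)), np.exp(-d2 / (2 * sigma ** 2))])

# Stage 2: kernels fixed, solve for the linear weights w by least squares.
w, *_ = np.linalg.lstsq(design(X), y, rcond=None)

x_test = np.array([[1.0]])
print(design(x_test) @ w, np.sin(1.0))   # network output vs true value at x = 1.0
```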
Summary
Radial basis function networks provide a global approximation to the target function,
represented by a linear combination of many local kernel functions.
The value for any given kernel function is non-negligible only when the input x falls
into the region defined by its particular centre and width. Thus, the network can be
viewed as a smooth linear combination of many local approximations to the target
function.
One key advantage of RBF networks is that they can be trained much more efficiently
than feedforward networks trained with BACKPROPAGATION.
Case-based reasoning (CBR) is a learning paradigm based on lazy learning methods; it
classifies new query instances by analysing similar instances while ignoring
instances that are very different from the query.
In CBR, instances are not represented as real-valued points; instead, they
use a rich symbolic representation.
CBR has been applied to problems such as conceptual design of mechanical devices
based on a stored library of previous designs, reasoning about new legal cases based on
previous rulings, and solving planning and scheduling problems by reusing and
combining portions of previous solutions to similar problems
The CADET system employs case-based reasoning to assist in the conceptual design of
simple mechanical devices such as water faucets.
It uses a library containing approximately 75 previous designs and design fragments to
suggest conceptual designs to meet the specifications of new design problems.
Each instance stored in memory (e.g., a water pipe) is represented by describing both its
structure and its qualitative function.
New design problems are then presented by specifying the desired function and
requesting the corresponding structure.