
Supervised and Unsupervised Pattern Recognition, and their Performance

Luciano da Fontoura Costa
[email protected]
São Carlos Institute of Physics – DFCM/USP

25th May 2022

To cite this version: Luciano da Fontoura Costa. Supervised and Unsupervised Pattern Recognition, and their Performance. 2022. hal-03681008.

HAL Id: hal-03681008 (https://hal.science/hal-03681008), preprint submitted on 30 May 2022.

Abstract

Pattern recognition, be it supervised or not, has attracted growing attention because of its several important applications. One issue of particular importance concerns the validation of the recognition quality, e.g. in terms of correct classifications and stability, which is often estimated by cross-validation methods. A model-based approach is adopted here, in which the data categories are understood statistically in terms of respective random variables associated with the features, as well as the associated probability density functions. This allows both the supervised and unsupervised pattern recognition cases to be addressed in a principled manner, while the important issues of bias, undersampling, underlearning and overfitting are all addressed and revisited. Several important and even surprising results are reported, including the interpretation of overfitting as not being necessarily unwanted; the characterization of the phenomenon of underlearning, in which several unstable working decision boundaries can be obtained, as being a consequence of biased sampling and/or undersampling; as well as the approach to unsupervised learning as involving two related but not necessarily identical issues, namely choosing how to interrelate the clusters and deciding whether a group can be considered a cluster. To complement this development, we briefly consider the application of the coincidence similarity index to some of the covered problems, and present the possibility of using the important problem of image segmentation as a laboratory for better understanding and developing pattern recognition concepts and methods.

1 Introduction

Pattern recognition means the action of, given a set of entities represented by respective measurements (or features), assigning existing (supervised recognition) or novel (unsupervised recognition) categories to them. The already substantial importance of this area (e.g. [1, 2, 3, 4, 5, 6]) has increased steadily along the last decades as a consequence of respective performance advancements combined with an expansion and intensification of its applications in the most diverse scientific and technological areas. In particular, several of the activities traditionally performed by humans have been progressively assisted or even substituted by artificial intelligence resources, which rely intensely on pattern recognition.

By entity it is henceforth meant the objects, individuals, or any other type of patterns to be identified. The collection of the properties (features) characterizing the entities will be referred to as the respective dataset, with its data elements corresponding to the entities to be studied/classified.

Despite its relatively simple conceptual characterization, pattern recognition involves several concepts and methods, extending from multivariate statistics (e.g. [7, 8]) to neuronal networks (e.g. [9]). In addition, several sequential and/or parallel processing stages are typically involved in the implementation of pattern recognition systems. Simplistically, the basic steps of a pattern recognition pipeline are shown in Figure 1. These include: (a) acquisition of measurements (features) of the entities to be recognized; (b) pre-processing, possibly involving normalization; (c) the recognition proper; and (d) respective validation, often performed by cross-validation methods.

Figure 1: The pattern recognition pipeline. Feature extraction: a set of measurements is taken from the entities to be recognized, yielding the respective features. Pre-processing: the features are pre-processed in order to curate their quality and also for normalization purposes. Pattern recognition: methods are applied in order to assign categories to each entity. Validation: the validation of the whole approach is then performed.
Each of these main stages is characterized by substantial challenges. At the acquisition level, one problem of particular relevance concerns which measurements are to be adopted for characterizing the entities. At the pre-processing stage, approaches have to be chosen that are able to improve the data quality (e.g. remove noise), as well as means to properly normalize the measurements. Major issues related to the third stage include the choice of the recognition methods to be applied. Then, at the validation stage, metrics and approaches have to be defined and adopted in order to validate the recognition according to the obtained results.

While all the main issues involved in the above discussed pattern recognition stages have received substantial attention in the respective literature, the problem of characterizing the performance of the obtained recognition framework and results, so as to validate the adopted approach, remains an important issue worthy of continuing attention.

The validation of a pattern recognition approach depends on whether it is supervised or unsupervised, and both cases are considered in the present work.

In the former case, validation has typically been performed by using cross-validation approaches (e.g. [5, 1]). In their simplest implementations, this type of validation involves separating the available data with identified categories into a training and a testing set. The supervised recognition approach is then optimized for the identification of the training set, and its performance is then quantified from the results obtained by its application to the test set. There are several variations of this basic principle aimed at improving the validation comprehensiveness and/or accuracy, such as considering several training and test sets, dividing the groups in non-equal proportions, including additional sets and validation stages, etc.
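As a minimal sketch of the simple holdout scheme just described, the following code splits a labelled dataset into training and test sets and quantifies performance on the latter; the synthetic data, the 70/30 proportion, and the nearest-neighbour classifier are all arbitrary illustrative choices, not the specific methods adopted in this work.

    # Minimal holdout cross-validation sketch (synthetic data, 70/30
    # split, and k-nearest neighbours are illustrative assumptions).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4))            # 150 entities, 4 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # two hypothetical categories

    # Separate the labelled data into training and test sets.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    # Optimize (train) the classifier on the training set only...
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

    # ...and quantify its performance on the held-out test set.
    print("test accuracy:", clf.score(X_te, y_te))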
The results from cross-validations of supervised pattern recognition approaches indicate how many correct and incorrect classifications were obtained respectively to each of the involved categories. Ideally, there should be no incorrect classifications, with all the test data elements being correctly identified. When a pattern recognition approach passes a strict validation, we have an indication that it may work properly in identifying the categories of new data.

The failure of an approach in the respective cross-validation indicates that there could be problems virtually anywhere in the framework shown in Figure 1. Table 1 summarizes the main aspects that typically play an important role in defining the performance of a pattern recognition approach.

Aspect | Main characteristics
Classification regions | Densities specifying the categories within the feature space.
Uniform / non-uniform regions | The type of the density.
Sampling | Used to represent the category regions; can be sparse or biased.
Wrong samples | Some samples can have wrong categories.
Statistical fluctuations | Deviations from the respective densities caused by sampling.
Dimensionality | Determined by the number of required features.
Decision boundaries | Defined by the recognition methodology.
Confidence | How accurate the regions, anchor samples, boundaries and features are.
Adopted features (a kind of sampling) | The set of features adopted for representing the entities.
Supervised / unsupervised | The type of pattern recognition method.
Cross-validation | Performed to quantify the performance of the recognition.
Clustered or not | Categories can be clustered or not.

Table 1: A glossary of the many important aspects influencing pattern recognition.
The relatively large number of aspects of different natures involved in the performance of pattern recognition (already hinted at by the validation task encompassing all stages in Figure 1), allied to the fact that these aspects tend to influence one another, provides a cogent indication of the complexity of the validation problem. In this work, we develop a model-based approach to studying the several performance limitations involved in supervised and unsupervised pattern recognition, while trying to consider in an integrated manner all the aspects identified in Table 1. Special attention is placed on the statistical modeling of the categories in the respective feature spaces, which paves the way to identifying several important concepts and potential issues in pattern recognition, including the potentially dramatic effect of the increase of the feature space dimensionality on the recognition results, especially from the perspective of the curse of dimensionality. The issue of undersampling therefore receives special attention along this work, from which we characterize the phenomenon of underlearning, namely the obtention of several provisionally working but unstable recognition configurations that do not withstand systematic cross-validations. The phenomenon commonly known as overfitting also receives special attention, and it is argued that it does not intrinsically constitute a shortcoming, but actually an asset of a pattern recognition approach.

As an example of the important interrelationship between the aspects involved in pattern recognition, we have that the two most often sought properties are in tension: selectivity is generally obtained at the expense of generalization. Usually, a balance needs to be achieved regarding these two requirements, which should take into account the nature of the data and the questions of interest.

Supervised and unsupervised pattern recognition are both approached in the present work, with the former being addressed first. The case of supervised recognition is developed while considering the above outlined concepts and aspects, with emphasis on undersampling, underlearning, and overfitting, helping us to identify and approach some of the main reasons that can undermine supervised pattern recognition, with several important and potentially surprising results. Unsupervised recognition is then treated with emphasis on two key issues: (a) the quantification of the separation between groups of data elements; and (b) the criterion adopted for deciding on the existence of clusters.

To complement our development, we also propose the consideration of image segmentation, an important and challenging issue in itself, as a laboratory for better understanding, developing, and evaluating pattern recognition approaches and systems.

Though the focus is kept on presenting the several concepts and methods in a relatively accessible manner, an enhanced understanding of this work will be helped by some previous experience in pattern recognition and/or related areas, particularly multivariate statistics (e.g. [7]), stochastic geometry (e.g. [10]), and set/multiset theory (e.g. [11, 12, 13, 14, 15]). In order to emphasize the main concepts and results along the development of the present work, several snippets have been respectively included.

It should also be kept in mind that the presented concepts and methods are still subject to further complementation and validation, so that they should be treated as preliminary. In addition, the application of any pattern recognition approach to real-world data as approached here should be understood mostly as a means for providing insights on the data and group interrelationships to be further investigated and validated, not as providing a basis for absolute decisions regarding the separation or existence of clusters.

2 Categories, Statistical Modeling and Sampling

Groups (ensembles) of entities can be modeled in terms of their respective measurements, which are considered as random variables. In these cases, the respective joint probability density function (or field), or density for brevity, provides all available statistical information about the properties of those variables, and therefore about the entities as far as their features are concerned.

Densities can be understood as mappings from each point (an entity represented in terms of its feature vector) in a given support (a region of the feature space) into respective non-negative values. In addition, the integral of the density over the support needs to be identical to one. In a given pattern recognition problem, it is also important to identify from the outset the bounds of the respective feature space, which we will henceforth refer to as the respective universe Ω. This set can be determined from the minimum and maximum values of each involved feature. Observe that each feature defines an associated axis in the respective feature space where the entities are to be represented.

There are two main types of probability density functions: (i) uniform; and (ii) non-uniform. The first case is characterized by a constant value assigned to all points in the whole support. Non-uniform densities have varying values assigned to those points. An example of a non-uniform density is the normal distribution. Figure 2(a) illustrates a uniform density defined on a disk in the R^2 space.

From the perspective of this article, uniform densities can be treated in a simplified manner, as corresponding to all the points in their respective support. As such, uniform densities provide a particularly effective means for approaching several of the intricacies of pattern recognition and its performance characterization. For instance, in Figure 2(a), it is enough to represent the region associated with the support of a uniform density instead of a three-dimensional representation where the constant density would also be shown.

It is not always the case that the density and its support are available. Indeed, oftentimes we only have samples from a given density, obtained from inaccessible density formulae. This is illustrated in Figure 2 in terms of three possible samplings of the density in (a): sparser (b); denser (c); and biased (d).
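These three sampling regimes can be sketched as follows; the code draws points uniformly from a disk support analogous to Figure 2, with sample counts and the right-half bias being arbitrary illustrative choices.

    # Minimal sketch of the samplings of Figure 2: a uniform density on
    # a disk support, sampled sparsely, densely, and with a bias (the
    # right-half restriction below is a hypothetical form of bias).
    import numpy as np

    rng = np.random.default_rng(1)

    def sample_disk(n, radius=1.0):
        """Rejection-sample n points uniformly from a disk in R^2."""
        pts = []
        while len(pts) < n:
            p = rng.uniform(-radius, radius, size=2)
            if p @ p <= radius**2:
                pts.append(p)
        return np.array(pts)

    sparse = sample_disk(30)    # sparser sampling, cf. Figure 2(b)
    dense = sample_disk(500)    # denser sampling, cf. Figure 2(c)
    biased = sample_disk(200)
    biased = biased[biased[:, 0] > 0]   # biased sampling, cf. Figure 2(d)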
Figure 2: A uniform density, continuous on a disk support (a), and possible respective discrete samplings characterized by being relatively sparser (b), denser (c), and biased (d).

The amount and quality of samples is of critical importance in pattern recognition. Even if all samples are correct, in practice they will always be available in limited numbers, implying that the original density will never be perfectly sampled and represented. This loss of information impacts the characterization of the density in several manners, including unavoidable statistical fluctuations, i.e. the fact that (typically small-scale) spatial differences will always be found in the sample distribution. As illustrated in Figure 3, these fluctuations can lead to spurious patterns.

Figure 3: A uniform random field of points. As a consequence of the random fluctuations implied by the undersampling of the otherwise completely uniform density, patterns appear that can eventually be taken for clusters.

Generally speaking, the larger the number of samples, the better. The situations in which the number of samples is not enough for a proper representation of the original continuous densities are henceforth called undersampling. It is also possible that the available samples are biased in several manners, such as that depicted in Figure 2(d). Needless to say, biases can have a critical impact on the recognition results.

Sampling is also required for approximating non-uniform densities, as illustrated in Figure 4, where a varying density on a disk support (a) is sampled by a limited number of samples (b). Observe that the density of the samples tends to reflect the respective original density at each of the points in the support.

Figure 4: A non-uniform density on a disk support (a), and a possible respective sampling (b).

Another critical influence of sampling on pattern recognition concerns the fact that the higher the dimensionality of the feature space, the larger the number of points required for a relatively dense and significant representation of the densities. Actually, it is relatively straightforward to infer that, in the case of M features, the number of samples should increase with the respective M-th power. Therefore, the densities involved in pattern recognition problems with a large number of features are often undersampled, because the very large number of required samples may not be available, or cannot be computationally handled, which gives rise to the so-called curse of dimensionality.

We conclude this section with our first snippet:

1 - The approximation of continuous regions by respective samples depends strongly on the dimensionality and is crucial for pattern recognition. The higher the dimensionality, the more samples are needed. Biasing and undersampling undermine the representation of the densities and can lead to recognition mistakes.
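The M-th power growth can be made concrete with a back-of-the-envelope computation; the resolution of k = 10 bins per axis below is an arbitrary illustrative choice.

    # Back-of-the-envelope illustration of the M-th power argument: to
    # keep a fixed resolution of k bins per feature axis, the number of
    # samples needed to populate the feature space grows as k**M.
    k = 10  # hypothetical resolution (bins per axis)
    for M in (1, 2, 4, 8, 16):
        print(f"M = {M:2d} features -> about {k**M:.1e} samples needed")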
3 Supervised Pattern Recognition

As implied by its own name, supervised pattern recognition refers to the assignment of categories to data elements under supervision of several types, including sets of pre-classified data elements, or prototypes of each category (e.g. the center of mass of the groups). In this section, we will approach this type of recognition, as well as its performance issues, mainly from the perspective of the concepts presented in Section 2.

Basically, supervised pattern recognition involves two stages: (i) training, and (ii) application to the classification of new data. It is the former stage that makes this type of recognition supervised. Let us illustrate this basic principle with the help of the example in Figure 5, which involves two uniform category regions.

In this particular case, the optimal decision boundary resulted aligned with the original region boundaries, which is characteristic of adjacent category regions. Provided there are no errors in new data elements, perfect performance will characterize subsequent classifications.

Another, more frequent, supervised classification situation, involving non-adjacent regions, is presented in Figure 6.

Some additional examples of relatively compact, well-separated category regions are illustrated in Figure 7.

Of particular importance is the fact that compact, well-separated regions make the classification much easier, while also reducing the chances of underlearning as implied by undersampling.

Figure 8 presents another supervised recognition example, involving non-uniform regions in a one-dimensional space.

Provided the densities are fully known, Bayesian decision theory indicates the means for identifying the optimal decision boundaries respectively to minimizing the number of misclassifications, whose probability corresponds to the areas where one density overlaps the other. More specifically, let M categories c = 1, 2, ..., M be represented by respective conditional densities p(x | c), and let the mass probability of each category be P(c). Then, the Bayesian classification criterion consists of applying:

    C(x) = arg max_{c=1,...,M} { P(c) p(x | c) }    (1)

In the case of Figure 8, this criterion yields the optimal border as corresponding to the intersection between the two densities, i.e. x = b.
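The criterion in Eq. (1) can be sketched numerically for two equiprobable one-dimensional normal categories, analogous to the situation of Figure 8; the means and standard deviations below are arbitrary illustrative choices.

    # Numerical sketch of Eq. (1) for two equiprobable one-dimensional
    # normal categories (means/deviations are illustrative assumptions).
    import numpy as np
    from scipy.stats import norm

    P = np.array([0.5, 0.5])                  # mass probabilities P(c)
    dens = [norm(0.0, 1.0), norm(3.0, 1.5)]   # conditional densities p(x|c)

    def classify(x):
        """Assign the category c maximizing P(c) * p(x|c)."""
        return int(np.argmax([P[c] * dens[c].pdf(x) for c in range(len(P))]))

    # The optimal border b lies where the weighted densities intersect;
    # here it is located numerically on a grid between the two means.
    xs = np.linspace(0.0, 3.0, 10_000)
    diff = P[0] * dens[0].pdf(xs) - P[1] * dens[1].pdf(xs)
    b = xs[np.argmin(np.abs(diff))]
    print("estimated border b =", round(b, 3))
    print("classify(b - 1) ->", classify(b - 1))   # category 0
    print("classify(b + 1) ->", classify(b + 1))   # category 1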
When the region densities cannot be accurately determined, other approaches need to be applied, and that is precisely where the recognition problems start, because of the respective loss of information. There are two main situations yielding inaccurate densities: either we do not know the densities, having only approximations or hypotheses, or only respective samples are available, possibly in limited numbers. At least the two following approaches can be attempted in the latter case: (a) estimate the densities from the samples; and (b) use the samples directly for the recognition.

In both cases, when only a limited number of samples is available, there will always be the possibility of undersampling, which can have critical impacts on the classification.

Figure 5: The basic principle underlying supervised pattern recognition, respectively to two adjacent uniform regions in a two-dimensional feature space: (a) objects belonging to two categories, defined by their respective regions delimited by the blue and green contours, are to be recognized; (b) samples of the two categories are taken and used to train a respective classifier, which yields the decision boundary shown in orange; and (c) new data can now be classified depending on which region they fall into.

Figure 6: The basic principle underlying supervised pattern recognition, respectively to two non-adjacent uniform regions in a two-dimensional feature space: (a) objects belonging to two categories, defined by their respective regions delimited by the blue and green contours, are to be recognized; (b) samples of the two categories are taken and used to train a respective classifier, which yields the decision boundary shown in orange; and (c) new data can now be classified depending on which region they fall into. Remarkably, a wide range of possible decision boundaries, instead of the single one obtained in the example in Fig. 5, are now possible. This represents neither underlearning nor overfitting.

Figure 7: Additional examples of relatively compact, well-separated category regions. Many decision boundaries can be found in these cases that ensure fully correct classifications.

However, before addressing undersampling in a more systematic way, it is interesting to discuss the frequently considered problem of overfitting. Basically, this phenomenon consists of the obtained decision boundaries being 'too' closely adapted to the samples, as illustrated in Figure 9(a).

Here, we have two adjacent category regions that have been fully separated at the expense of a relatively intricate decision boundary. Observe that all points have been correctly classified in this case. Now, suppose a new data element becomes available and is mapped into the previous space as shown in Figure 9(b). Given that this new element resulted within the blue region, a misclassification will be respectively implied. However, the system can be retrained so that the new decision boundary shown in (c) is obtained, again ensuring correct classifications throughout, but at the cost of an even more intricate decision boundary, therefore enhancing the overfitting. Interestingly, it is possible to show that decision boundaries can be found in any supervised recognition problem that will yield full adherence to the involved categories, therefore implying no classification errors.

Now, an important point concerns the fact that, provided there are no errors in the supplied categories of all the samples, the phenomenon of overfitting is not intrinsically unwanted, but actually necessary to properly represent the categories in the feature space. Actually, in case the configuration in Figure 9(c) corresponds to all possible samples constituting the respective category, the decision boundary in that figure actually corresponds to an optimal solution. In conclusion, generally speaking, overfitting does not constitute a shortcoming of the approach, but actually one of its assets. In summary, we may conclude that:

2 - The overfitting respective to the correct classification of the whole set of correctly labeled samples is not necessarily unwanted, but actually welcome, irrespectively of the level of intricacy or tight adherence implemented by the respective decision boundaries.

Figure 9 also illustrates some possible results of applying cross-validation to the situation in (a). Figure 9(d) presents the same case, but after the removal of some of its points. In this specific case, retraining under these circumstances will yield a decision boundary similar to the previous one, yielding no classification errors. Figure 9(e) presents another example in which several points were removed from (a), but now a new decision boundary is obtained. In case the removed points are then tested in this new region, misclassifications will take place (f). It is important to identify what can be learnt from these cross-validation experiments as applied to the specific overfitted situation in (a). The key aspect here is that the decision boundary in (e) is in fact not accurate, hence the implied misclassification errors. In other words, the failure of this approach under cross-validation only indicates that the boundary obtained with fewer points is less precise. To any extent, adjacent category regions with intricate interrelationships will tend to be identified as overfitted, but this is as it should be.
We can infer from the above considerations that:

3 - The higher the required performance in terms of accuracy and correct classifications, the higher the overfitting, implying more intricate decision boundaries, especially in the case of adjacent regions with jagged interrelationships, and also as a consequence of the sampling of the category regions or densities. The identification of this phenomenon by cross-validation does not necessarily imply a shortcoming, though it can provide useful information about the structure of the categories and samples.

Figure 8: Entities belonging to two categories statistically modeled by the densities shown in this figure are to be recognized. In case the two categories have the same mass probability (i.e. the two types of entities are equiprobable), Bayesian decision theory specifies the optimal decision boundary in terms of the category corresponding to the maximum density values at the respective universe points. In the case of this example, the optimal decision boundary is defined by the value x = b.

Now, let us address another problem, namely sample biasing, as illustrated in Figure 10, which shows two biased samplings of the situation shown in Figure 5(a). As a consequence, the two samples became well-separated, allowing a wide range of exactly separating decision boundaries, a few of which are illustrated in orange in the figure. Though these boundaries work for the given samples, they will soon fail when more samples are drawn from the respective regions.

The possibility of having several provisionally adequate decision boundaries that are prone to become unstable with new samples (or under cross-validation) is henceforth called underlearning, in the sense that the recognition system has not yet reached its proper training as a consequence of biased sampling, which leads us to the next snippet:

4 - Underlearning, characterized by many provisionally working decision boundaries that are not similar to the correct one, happens when the sampling does not properly represent the regions.

Now, let us return to the undersampling problem briefly introduced above. We will start by performing an experimental study of how uniformly random points, randomly separated into two groups, relate to one another, in terms of the Euclidean distances between their samples, as the dimensionality of the feature space is increased. First, all features will be constrained within the interval [0, 1], as is the case of features pre-processed by minmax normalization. Figure 11(a) presents the average ± standard deviation of the minimum distance between 1000 randomly generated pairs of random categories.

Figure 11(b) presents the average ± standard deviation of the Euclidean distances between two randomly assigned groups of 20, 25, ..., 50 points in feature spaces of dimension D, with all features varying from −2 to 2, as is typically the case with standardized features. Interestingly, a much wider artificial separation can be respectively observed.

Of critical interest in the obtained results is the fact that a non-null separation is observed between the two randomly assigned groups of uniform samples, and that this separation tends to increase, on average, with the dimensionality of the feature space. At the same time, relatively comparable standard deviations have been observed for most dimensions, except for the smallest cases. Observe also that the average distances tend to decrease slowly with the increase of available samples, but they fall short of the distances observed for 1 or 2 dimensions. Ultimately, these results are a consequence of the curse of dimensionality, which implies a sparse representation, by samples, of the category regions.

This result plainly indicates that relatively well-separated groups can be obtained out of uniformly random points, especially for highly dimensional feature spaces and when relatively few samples are available. A similar phenomenon can take place for samples obtained from generic respective category regions, even if they are actually not well separated in the original, continuous representation. Given that artificially well-separated groups can appear in these cases, a phenomenon of underlearning directly analogous to our discussion regarding biased samples will occur. Actually, the undersampling that often takes place in highly dimensional feature spaces can be considered as a kind of biased sampling.
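The randomized experiment just described can be sketched as a simple Monte Carlo simulation; the group size of 40 points and the dimensions probed below are arbitrary illustrative choices.

    # Monte Carlo sketch of the experiment described above: points drawn
    # uniformly in [0, 1]^D are randomly split into two groups, and the
    # minimum Euclidean distance between the groups is averaged over
    # 1000 realizations (group size and dimensions are illustrative).
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(2)

    def min_intergroup_distance(n_points, D):
        pts = rng.uniform(0.0, 1.0, size=(n_points, D))
        half = n_points // 2
        idx = rng.permutation(n_points)      # random group assignment
        return cdist(pts[idx[:half]], pts[idx[half:]]).min()

    for D in (1, 2, 4, 8, 16, 32):
        d = [min_intergroup_distance(40, D) for _ in range(1000)]
        print(f"D = {D:2d}: <d_min> = {np.mean(d):.3f} +/- {np.std(d):.3f}")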
Figure 9: Illustration of the phenomenon of overfitting and its relationship to cross-validations.

Figure 10: Example of biased sampling of the regions in Figure 5(a) leading to underlearning, in the sense that a wide range of decision boundaries can be obtained that implement fully correct classifications for this particular sample configuration, but which are unstable and bound to misclassify new samples. The correct decision boundary, as implied by the original category regions in Figure 5(a), is shown in cyan.

Cross-validation provides a valuable means for identifying underlearning as a consequence of high-dimensionality-related undersampling, because distinct choices of the two randomly assigned groups will yield distinct decision boundaries, which leads us to the next snippet:

5 - Cross-validation can reveal underlearning caused by sparse sampling (curse of dimensionality), especially in the case when many features are adopted.

However, there is an important exception to the influence of undersampling on supervised recognition in high dimensions, and it has to do with a phenomenon that we shall call compact sampling, in order to refer to the sampling of relatively compact, well-separated original category regions. Interestingly, under these circumstances, the sampling, even if sparse, becomes restricted to the original regions, therefore reflecting the distances between those regions. Provided those distances, e.g. on average, are larger than the distances between the groups induced as a consequence of the respective high dimensionality, the effect of underlearning can become substantially minimized. It then becomes possible, provided all the other involved aspects (e.g. classification method, quality of samples, etc.) are proper, to infer effective decision boundaries even in the case of highly dimensional feature spaces.

There are at least two possible means to identify compact sampling: (i) applying cross-validation; and (ii) comparing the distances between the given sampled groups with those obtained for similar random configurations (i.e. same number of samples, normalization, and dimensionality). In the latter case, a significant distinction between the distances implied by the original data and those obtained by randomly assigned groups will suggest that the supervised recognition can proceed with relatively little underlearning. An important conclusion to be drawn from the above reasonings can therefore be summarized as:
6 - In the case of highly dimensional feature spaces, it is possible to have underlearning minimized provided the original regions are relatively compact and well-separated. This can be verified by cross-validation or by comparison between the original and random distances.

Figure 11: The average ± standard deviation of the Euclidean distances between two groups of 20, 25, ..., 50 points in feature spaces of dimension D, with all coordinates varying from 0 to 1 (a) and from −2 to 2 (b), increases steadily, though in a sublinear manner. Also interesting is that the standard deviations do not tend to vary significantly with the dimensionality, and that similar shapes have been obtained in (a) and (b). Results obtained from 1000 random experiments.

A relatively simple and quick approximate method for investigating underlearning in high dimensions is as follows (a code sketch is given at the end of this section). Given samples of M categories, obtain the average ± standard deviation of the distances between each pair of groups. The overall interrelationship between the original groups can then be roughly estimated by inspecting the respective representation of the distances as a network, where each node corresponds to a category and the links between the nodes are proportional to the respectively obtained average distances. Another, reference, network is obtained from randomized groups with the same number of elements and dimensionality. These two networks can then be compared qualitatively and/or quantitatively. In case the two networks are similar, it is very likely that undersampling is taking place, which can be further investigated by cross-validation. Given that the minimum distance between the samples in each pair of clusters is too strict and relatively unstable (the change of a single sample can strongly impact the result), it is also interesting to consider the distances between the centers of mass of the real and random groups.

Let us illustrate this method respectively to a dataset containing 3 types of handwritten characters ('c', 'e', and 'o') [16], each being represented by 50 samples. Each data element is characterized in terms of four respective features, which are henceforth taken in their standardized version. The average distance obtained from the randomly assigned pairs of groups with the same dimensionality and number of samples was ⟨dr⟩ = 0.580765942. Figure 12 shows the principal component (e.g. [17]) projection of the handwritten characters dataset (a), as well as the distance networks respectively to the real data (b) and the randomly assigned simulation (c).

The results of the same experiment as above, but now performed respectively to the features normalized in the interval [0, 1], are presented in Figure 13, being characterized by similar ratios between the real and random average distances.

7 - If cross-validation does not hold, then: (a) the selectivity needs to be increased; (b) the sampling is not enough to properly represent the reference regions; (c) the data has low quality; (d) the recognition method is unstable/unsuitable; or (e) the groups of samples cannot be separated.

It is interesting to observe that, though the above discussion focused on the effect of biased sampling and undersampling in relatively high dimensions, the distance values obtained for the random configurations even for dimensions as small as 2 or 3 indicate that artifacts in data separation can take place even in these situations.
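The comparison between real and randomized group distances described above can be sketched as follows; the synthetic stand-in data, the use of average (rather than minimum or center-of-mass) distances, and the single random reassignment are illustrative assumptions.

    # Sketch of the real-versus-random distance comparison described in
    # this section. X and labels are synthetic stand-ins for a real
    # dataset (e.g. the handwritten characters with 3 x 50 samples).
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(3)
    X = rng.normal(size=(150, 4))        # hypothetical standardized features
    labels = np.repeat([0, 1, 2], 50)    # hypothetical category labels

    def pairwise_group_distances(X, labels):
        """Average inter-group distance for each pair of categories."""
        cats = np.unique(labels)
        out = {}
        for i, a in enumerate(cats):
            for b in cats[i + 1:]:
                out[(a, b)] = cdist(X[labels == a], X[labels == b]).mean()
        return out

    real = pairwise_group_distances(X, labels)
    rand = pairwise_group_distances(X, rng.permutation(labels))
    for pair in real:
        print(pair, f"real: {real[pair]:.3f}   random: {rand[pair]:.3f}")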
4 Clustering, or Unsupervised Pattern Recognition

Having discussed supervised pattern recognition from a model-based perspective with special attention on performance, we now turn our attention to the relatively more challenging task of unsupervised pattern recognition, or clustering for short, which is characterized by the absence of prototypes or even of information about the number of expected clusters.

The following snippet provides an intuitive definition of a cluster:

8 - Given samples in a feature space, a respective cluster is a subset of these samples which are more similar to one another than to the remainder of the samples.

One of the first important aspects to be observed concerning clustering is that it involves two related, but distinct, requirements: (a) how to quantify the separation between the clusters; and (b) how to decide on the existence of one or more clusters. This critically important aspect is not always observed, and can lead to respective misunderstandings. So, we have the following snippet:

9 - Unsupervised classification requires a choice of how to quantify the separation between the clusters, as well as a decision on the existence of clustering.

Pertaining to issue (a) above, there are several possible approaches that can be used to that end, including the minimal and maximum distances and the distance between centers of mass, among other possibilities of metrics and indices (e.g. [18, 19, 20, 21]), including but not limited to: the Euclidean distance, cosine distance, Pearson correlation coefficient, Mahalanobis distance, and Manhattan distance, as well as similarity indices including the Jaccard, interiority, and coincidence indices. Observe that we distinguish between metrics and indices in order to indicate that the former obey the metric requirements, while the latter do not.

Figure 12: The handwritten characters database presented in terms of its PCA (a), and the respective networks of average distances for the real data (b) and the respective randomly assigned simulation (c). One of the real-data links resulted about twice the random counterpart, another is comparable, and the third is about half. Given that the minimum distance between clusters is a too strict indication of the separation between two groups, the original real data can be considered as being relatively far from underlearning.

Figure 13: The comparison between the real and random distances for the handwritten characters dataset considering the interval [0, 1]. These results have proportions similar to those in Fig. 12.

For simplicity's sake, the present work concentrates on Euclidean distance agglomeration (single-linkage), though other agglomerative approaches, including the complete-linkage, average-linkage and Ward methods, are also illustrated, and the coincidence similarity is considered, for comparison purposes, in a subsequent section.

In the case of agglomerative clustering approaches, the successive merging of the clusters gives rise to a respective dendrogram, which provides a comprehensive graphical representation of the interrelationships of the unfolding clustering respectively to the adopted metric or similarity index. Figure 14 presents the dendrograms obtained from the handwritten characters dataset by using single-linkage, complete-linkage, average-linkage and Ward agglomerative clustering.

Observe that the y-axis in each of the dendrograms in Figure 14 corresponds to the respectively adopted linkage criterion and metric/index. In the case of metrics, the y-axis corresponds to a distance that increases from the bottom to the top. Interestingly, quite distinct clustering structures have been obtained by each method, which motivates the question of which of them could be more relevant for this specific dataset.
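The four agglomerative variants compared in Figure 14 can be reproduced with standard hierarchical clustering routines; the sketch below assumes SciPy and Matplotlib are available and uses a synthetic placeholder for the feature matrix.

    # Sketch of the four agglomerative variants of Figure 14 using
    # SciPy's hierarchical clustering; X is a synthetic placeholder for
    # the (standardized) feature matrix of the dataset under analysis.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(4)
    X = rng.normal(size=(150, 4))

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, method in zip(axes, ("single", "complete", "average", "ward")):
        Z = linkage(X, method=method, metric="euclidean")
        dendrogram(Z, ax=ax, no_labels=True)
        ax.set_title(f"{method}-linkage")
    plt.show()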
Figure 14: Dendrograms obtained from the handwritten characters dataset by using single-linkage, complete-linkage, average-linkage and Ward agglomerative clustering. Observe the completely distinct cluster interrelationships suggested by each of these distinct approaches. Which is the most adequate for the handwritten characters dataset?

Issue (b) above, namely deciding on the existence of one or more clusters, is directly related to the interrelationship, especially the separation, between the candidate groups, and can be approached in those terms. For instance, it is possible (e.g. [22]) to consider the length of the branches leading to a branch, multiplied by the number of samples in that branch, as an indication of how much that possible cluster stands out among the others.

While the above mentioned type of approach is interesting and often leads to suitable results, there is an important issue that is not so often realized or discussed, and it has to do with the fact that the scaling of the y-axis variable, henceforth referred to as y, has a somewhat arbitrary nature. For instance, in the case of the average-linkage method, instead of taking the respective average Euclidean distances between the groups as y, it would also be possible to consider any monotonic transformation of y, for instance by taking it to the 5-th power, to the 0.2 power, or through a sharp sigmoid, as depicted in Figure 15 (a code sketch follows below).

It is particularly interesting to compare these transformed dendrograms with those in Figure 14. Though they are completely distinct as far as the relative positions along the vertical axes where the mergings occur, the merging sequence is completely identical in all the average-linkage cases presented above. At the same time, the type of illustrated transformation constitutes a particularly useful resource for zooming in and out of the several scales along the y-axis. For instance, in case we are especially interested in studying the cluster relationships at the finer merging scale, we could resort to a transformation similar to that obtained by the sharp sigmoid, and so on. Another relevant observation about the dendrograms obtained for the handwritten characters dataset consists in the fact that none of them, original or transformed, provided a pronounced indication, as far as the relative lengths and widths of their branches are concerned, about the original separation between the three main types of characters in this specific dataset.

The above example illustrates the difficulty of using the lengths of the branches as a criterion for deciding on whether the involved clusters should be separated or not.
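Such monotonic remappings of the merge heights can be sketched as follows; the synthetic data, the average-linkage choice, and the sigmoid steepness are illustrative assumptions.

    # Sketch of the monotonic y-axis transformations of Figure 15: the
    # merge heights (third column of the linkage matrix) are remapped
    # while the merging sequence itself is left untouched.
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 4))          # placeholder feature matrix
    Z = linkage(X, method="average")       # average-linkage clustering

    def transform_heights(Z, f):
        """Apply a monotonic function f to the merge heights of Z."""
        Zt = Z.copy()
        Zt[:, 2] = f(Zt[:, 2])
        return Zt

    Z_large = transform_heights(Z, lambda y: y**5)    # emphasizes large scales
    Z_small = transform_heights(Z, lambda y: y**0.2)  # emphasizes small scales
    Z_sigm = transform_heights(Z, lambda y: 1 / (1 + np.exp(-5 * (y - y.mean()))))
    # The row order of Z (the merging sequence) is identical in all variants.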
Figure 15: Monotonic transformations of the average-linkage dendrogram obtained for the handwritten characters, taking the average distances y to the 5-th power (a), to the 0.2 power (b), or through a sharp sigmoid (c). Completely distinct dendrograms can therefore be obtained, emphasizing respectively the large, medium, and small scales of the clustering structure of the dataset. Importantly, the sequence of mergings is completely preserved in each of these transformations, while only the y-axis is 'elastically' modified.

Actually, there is an alternative approach that does not depend on the length of the branches. It consists of using other criteria for that purpose, in particular one of the several approaches to quantifying the separation between clusters, including those based on scatter matrices (e.g. [4]) or even network modularity (e.g. [16]).

Now that we have considered some of the most basic aspects of unsupervised clustering, we can attempt to approach the issue of the respective performance. There is a particularly direct manner to do this, by using a set of classified samples that are then treated as if they were not, being therefore classified in an unsupervised manner. The original categories can then be taken into account in order to evaluate the recognition results in terms of misclassifications, as well as all the aspects discussed respectively to supervised pattern recognition. Actually, the effects of most of those aspects are precisely the same whether we are dealing with supervised or unsupervised learning. For instance, the presence of biased samples in relatively high dimensions, and/or undersampling, is likely to induce underlearning, implying clusters to be found where there are none.

One important difference respectively to unsupervised recognition is that cross-validation, e.g. by k-folding, cannot generally be performed in the same way, as there is no training stage involved in that case. Those methods need to be adapted, for instance by identifying clusters respectively to a portion of the reference samples, and then comparing them with the results obtained by the same unsupervised method respectively to other sets of samples. However, it should be observed that this is not a complete test, because the procedure may induce similarly biased results in all cases. More comprehensive validation approaches need to consider previously labelled sets of samples, so that the obtained clusters can be confronted with the original ones. As a summary of our brief discussion of unwanted effects on the performance of unsupervised recognition methods and their respective validation, it can be said that:

10 - The overfitting, undersampling, and underlearning phenomena equally affect the supervised and unsupervised cases.

5 Similarity-Based Pattern Recognition

In this section we consider some possibilities of using the real-valued coincidence similarity index [23, 21, 16, 24] as the basis for comparing and interrelating groups of samples in both supervised and unsupervised pattern recognition.

Basically, the coincidence similarity corresponds to the product between the real-valued Jaccard and interiority indices, which are based on multiset theory (e.g. [11, 12, 13, 14, 15]). It is primarily aimed at performing similarity comparisons between two patterns, e.g. as represented by respective feature vectors. In the present work we limit our attention to the parameterless version of the coincidence similarity index which, in this particular case, yields results comprised in the interval [−1, 1]. The higher the coincidence similarity value, the more similar two patterns can be said to be. So far, the coincidence index has been successfully applied to several applications, including template matching [21] and the translation of datasets into respective complex networks [16].
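The product structure just described can be sketched as follows; the particular handling of signed values is our reading of the definitions in [21, 23] and should be checked against those references before serious use.

    # Sketch of the coincidence similarity: the product of a real-valued
    # Jaccard index and the interiority index. The handling of signed
    # values below is our reading of [21, 23], not a verified reference
    # implementation.
    import numpy as np

    def coincidence(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        ax, ay = np.abs(x), np.abs(y)
        # Real-valued Jaccard: shared magnitude counts positively where
        # the signs agree and negatively where they oppose.
        common = (np.sign(x) * np.sign(y) * np.minimum(ax, ay)).sum()
        jaccard = common / np.maximum(ax, ay).sum()
        # Interiority: how much the 'smaller' pattern lies inside the larger.
        interiority = np.minimum(ax, ay).sum() / min(ax.sum(), ay.sum())
        return jaccard * interiority   # comprised in [-1, 1]

    print(coincidence([1.0, 2.0, 0.5], [1.0, 2.0, 0.5]))     # identical ->  1.0
    print(coincidence([1.0, 2.0, 0.5], [-1.0, -2.0, -0.5]))  # opposed  -> -1.0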
First, we present in Figure 16 the average ± standard deviation of the coincidence values between random groups of 20, 25, ..., 50 points in feature spaces of dimension D, with all features varying from 0 to 1.

Figure 16: The average ± standard deviation of the coincidence values between two groups of 20, 25, ..., 50 points in feature spaces of dimension D, with all features varying from 0 to 1. Comparatively to the respective Euclidean distance counterpart in Fig. 11, it can be said that the obtained coincidence values increase more steeply along the smaller dimensions, becoming relatively stable for the larger dimensions.

A similar shape of the coincidences in terms of the dimensions can be verified when the features are comprised in the interval [−2, 2], as shown in Figure 17.

Figure 17: The average ± standard deviation of the coincidence values between two groups of 20, 25, ..., 50 points in feature spaces of dimension D, with all features varying from −2 to 2.

Figure 18 depicts the networks of average similarity obtained respectively to the handwritten characters dataset. Interestingly, a more uniform distribution of coincidences was obtained for the real data, two of which are higher (in absolute value) than the random reference, while the remaining distance is similar. This suggests that, at least for this specific example, the coincidence similarity can lead to less intense underlearning as a consequence of biased sampling or undersampling.

Figure 18: Networks of average similarity for the real data (a) and the respective randomly assigned simulation (b), respectively to the handwritten characters database. Comparatively to the Euclidean distance based results shown in Fig. 12, two of the real pairwise distances resulted larger than the random counterparts, while the other distance resulted very similar. This suggests a better underlearning resilience of the coincidence similarity representation of the data elements, at least for the case of the specific data in this example. Observe that the magnitude of a negative coincidence similarity quantifies the dissimilarity between the respective groups.

To conclude this section, we present in Figure 19 the dendrogram obtained for the handwritten characters dataset by the single-linkage method adapted to the coincidence similarity. More specifically, this index is used instead of the Euclidean distance or the other options already discussed. As a consequence, the y-axis has to be modified so as to make the dendrogram comparable to those obtained by the other methods. In this work, this has been done by taking the complement of the coincidence values along the respective y-axis, i.e.:

    ỹ = max{y} − y    (2)

Interestingly, a well-balanced dendrogram has been obtained, in which not only can details be appreciated about the pattern relationships at the fine and medium comparison scales, but the intrinsic subdivision into three categories corresponding to the three types of handwritten characters can also be more effectively perceived in the relatively long branches leading to the three groups. Remarkably, this coincidence-based single-linkage does not suffer from the same level of chaining (successive incorporations of samples into a same group) as its respective Euclidean distance-based counterpart.
Figure 19: Dendrogram obtained for the handwritten characters dataset through single-linkage of the coincidence similarities between the clusters. A well-balanced distribution of mergings is obtained at all scales, while moderately emphasizing the intrinsic subdivision into the three respective types of handwritten characters. Remarkably, virtually none of the intense chaining characterizing the Euclidean distance-based counterpart can be observed.

It should be observed that the coincidence similarity index can also be adapted to the other clustering approaches, substituting the Euclidean distance or correlations whenever necessary.

6 Image Segmentation as a Laboratory

Image analysis and computer vision constitute important branches of artificial intelligence (e.g. [25, 5, 4, 6]) as a consequence of their impressive potential for automating and enhancing activities typically performed by humans, including prospection, surveillance, quality control, and astronomy, to name but a few possibilities.

One of the first steps along the image analysis pipeline, the task of image segmentation (e.g. [25, 4]), is as critical as it is challenging. Basically, given an image, to segment it typically means identifying its portions of special relevance for being possibly related to specific objects in the image, or portions of these objects. This seemingly simple endeavor is complicated by several effects, including noise, shadows, reflections, occlusions, and transparency, among several other unwanted interferences. The importance and challenge of image segmentation are directly reflected in the large number of related studies, based on the most varied areas and concepts.

As a consequence of some special characteristics, we argue here that the problem of image segmentation can provide a particularly interesting and effective laboratory not only for better understanding supervised and unsupervised pattern recognition, but also for developing and comparing respective concepts and methods. That is believed to be so as a consequence of the following aspects: (i) the original data to be classified (pixels) can be immediately inferred from the images; (ii) in the case of supervised recognition, the choice of prototypes can be easily performed, e.g. by clicking on specific points of the image; (iii) images in general have great complexity and intricacy, providing a comprehensive resource for testing methods; and (iv) the effects of the pattern recognition can be immediately perceived in terms of the highlighted segmented regions, especially the identification of possible underlearning caused by biased sampling and undersampling.

Figure 20 illustrates the above possibilities with respect to the supervised segmentation of a color image of a landscape (a) including natural and human-made objects and structures, while also incorporating varying levels of luminosity, shadows, and diverse types of backgrounds and textures. More specifically, segmentation results obtained by using the Pearson correlation coefficient and the coincidence similarity are shown in (b) and (c), respectively, with respect to five prototype points marked with red crosses. In the former case, the segmentation generalized too much in detriment of the selectivity, which implied several structures and textures to be merged. The results obtained by the coincidence similarity resulted substantially more adherent to the structures from which the prototypes were taken, with a moderate loss of generalization. In addition, given that a relatively high number of features was involved, namely 25 RGB pixels sorted by intensity within each color channel, the good adherence to the respective objects can be taken as an indication that underlearning is not taking place.
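This prototype-based segmentation can be sketched as follows; the 5×5 window (yielding the 25 sorted pixels per channel mentioned above), the correlation threshold, and the brute-force scan are illustrative assumptions rather than the exact procedure used for Figure 20.

    # Rough sketch of the prototype-based segmentation described above:
    # each pixel is represented by the sorted intensities of its 5x5
    # neighbourhood in each colour channel (25 values per channel) and
    # compared with prototype pixels via Pearson correlation. The
    # threshold and scanning strategy are illustrative assumptions.
    import numpy as np

    def pixel_features(img, r, c, w=2):
        """Sorted intensities of the (2w+1)x(2w+1) window, per channel."""
        patch = img[r - w : r + w + 1, c - w : c + w + 1, :]
        return np.concatenate([np.sort(patch[..., ch].ravel())
                               for ch in range(img.shape[2])])

    def segment(img, prototypes, threshold=0.98, w=2):
        """Mark pixels whose features correlate strongly with a prototype."""
        h, wd = img.shape[:2]
        protos = [pixel_features(img, r, c, w) for r, c in prototypes]
        mask = np.zeros((h, wd), dtype=bool)
        for r in range(w, h - w):
            for c in range(w, wd - w):
                f = pixel_features(img, r, c, w)
                mask[r, c] = any(np.corrcoef(f, p)[0, 1] > threshold
                                 for p in protos)
        return mask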
Figure 20: Original image (a) and respective segmentations by using the Pearson correlation coefficient (b) between the selected features of the prototypes and those of all pixels in the image. The prototypes, marked by red crosses, refer to the sandstone wall (2 samples) and the further away bridge (3 samples). The obtained regions, delimited by respective red contours, can be observed not to adhere selectively to any of the types of structures in this image. In this case, the generalization prevailed strongly in detriment of the selectivity to the types of structures. The results obtained by the coincidence similarity (c) are characterized by a precise adherence to the types of structures in the image while maintaining an excellent generalization ability. These enhanced results are a direct consequence of important specific properties of the coincidence similarity operation, including its high selectivity/sensitivity while being substantially robust to localized feature perturbations [24].

7 Concluding Remarks

Pattern recognition has progressed all the way from early promising approaches toward becoming one of the central current research subjects. This has been motivated by its many important applications to virtually every scientific and technological area and aspect. Yet, the question of evaluating the performance of pattern recognition, while also identifying the main causes that can undermine it and devising possibilities of improvement, remains an ever important subject.

It should be kept in mind that all results in the present work are preliminary and still being complemented and evaluated. In addition, the application of pattern recognition as approached here should be taken as a resource for gathering insights about the analyzed problem from the point of view of the interrelationships between its components, which can lead to better understanding, not as an absolute or definitive result. Indeed, the application and interpretation of pattern recognition should closely take into account the nature of the data, the questions to be addressed, as well as the limitations of the features, the classification methods, and all other involved aspects.

This work addressed the issue of identifying the aspects that influence the performance of supervised and unsupervised pattern recognition from the perspective of the statistical modeling of the original categories. Several related factors were addressed, with special attention given to the phenomena of biased sampling, undersampling, the underlearning caused by the former, as well as overfitting. Several important effects were identified and discussed with the help of some real-world data examples. Snippets have also been included in order to emphasize the 10 main points discussed and addressed here, providing a good concluding summary.

The developed concepts and methods as reported in the present work pave the way to several future developments. While the range of possibilities is particularly ample, some of the potentially most promising prospects are briefly presented in the following. Even though we considered several possible aspects influencing the performance of supervised and unsupervised recognition, it would be interesting to approach the issue of feature normalization in greater depth, as this aspect can also strongly influence the recognition results. Several possibilities have also been established respectively to the phenomenon of underlearning, which has been argued to play a critically important role not only in the case of highly dimensional feature spaces, but even for moderate dimensions. In particular, it would be interesting to derive more complete tables of the artifact distances, not only in terms of additional numbers of samples and dimensions, but also respectively to other normalizing intervals. Regarding the identification of clustering as involving two related tasks that can perhaps be performed more effectively separately, it would be interesting to evaluate, in a more systematic and comparative fashion, how it would perform respectively to several types of synthetic and real-world datasets, including diverse types of noise and interferences. Another promising research line consists in considering further multiset-based similarity indices, and especially the coincidence approach, respectively to several other types of data and possible applications to supervised and unsupervised pattern recognition. Among many other related developments, it would be interesting to consider the parametric version of the coincidence index, which allows for enhanced versatility in its applications.
Acknowledgments

Luciano da F. Costa thanks CNPq (grant no. 307085/2018-0) and FAPESP (grant 15/22308-2).

Note: As all other preprints by the author, this work is possibly being considered by a scientific journal. It is copyrighted, and the respective modification, commercial use, or distribution of any of its parts is not allowed.

References

[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, 2000.

[2] K. Fukunaga. Statistical Pattern Recognition. Morgan Kaufmann, San Diego, 1990.

[3] K. Koutroumbas and S. Theodoridis. Pattern Recognition. Academic Press, 2008.

[4] L. da F. Costa. Shape Classification and Analysis: Theory and Practice. CRC Press, Boca Raton, 2nd edition, 2009.

[5] B. K. P. Horn. Robot Vision. McGraw Hill, Cambridge, 1986.

[6] E. R. Davies. Machine Vision. Morgan Kaufmann, Amsterdam, 2005.

[7] R. A. Johnson and D. W. Wichern. Applied Multivariate Analysis. Prentice Hall, 2002.

[8] N. Mukhopadhyay. Probability and Statistical Inference. CRC Press, New York, 2000.

[9] S. Haykin. Neural Networks and Learning Machines. McGraw-Hill Education, 9th edition, 2013.

[10] D. Stoyan, W. S. Kendall, J. Mecke, and L. Ruschendorf. Stochastic Geometry and its Applications. Wiley, Chichester, 1995.

[11] W. D. Blizard. Multiset theory. Notre Dame Journal of Formal Logic, 30:36–66, 1989.

[12] P. M. Mahalakshmi and P. Thangavelu. Properties of multisets. International Journal of Innovative Technology and Exploring Engineering, 8:1–4, 2019.

[13] D. Singh, M. Ibrahim, T. Yohana, and J. N. Singh. Complementation in multiset theory. International Mathematical Forum, 38:1877–1884, 2011.

[14] J. Hein. Discrete Mathematics. Jones & Bartlett Pub., 2003.

[15] D. E. Knuth. The Art of Computer Programming. Addison-Wesley, 1998.

[16] L. da F. Costa. Coincidence complex networks. J. Phys.: Complexity, 3:015012, 2022. https://iopscience.iop.org/article/10.1088/2632-072X/ac54c3

[17] F. Gewers, G. R. Ferreira, H. F. Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, and L. da F. Costa. Principal component analysis: A natural approach to data exploration. ResearchGate, 2019. https://www.researchgate.net/publication/324454887_Principal_Component_Analysis_A_Natural_Approach_to_Data_Exploration. Accessed 1 October 2020.

[18] S.-H. Cha. Comprehensive survey on distance/similarity measures between probability density functions. Intl. J. Math. Models and Meths. in Appl. Sci., 1(4):300–307, 2007.

[19] C. E. Akbas, A. Bozkurt, M. T. Arslan, H. Aslanoglu, and A. E. Cetin. L1 norm based multiplication-free cosine similarity measures for big data analysis. In IEEE Computational Intelligence for Multimedia Understanding (IWCIM), France, Nov. 2014.

[20] M. K. Vijaymeena and K. Kavitha. A survey on similarity measures in text mining. Machine Learning and Applications, 3(1):19–28, 2016.

[21] L. da F. Costa. On similarity. Physica A: Statistical Mechanics and its Applications, 127456, 2022. https://www.sciencedirect.com/science/article/pii/S037843712200334X

[22] E. K. Tokuda, C. H. Comin, and L. da F. Costa. Revisiting agglomerative clustering. Physica A, 585:126433, 2022. https://www.sciencedirect.com/science/article/pii/S0378437121007068

[23] L. da F. Costa. Further generalizations of the Jaccard index. ResearchGate, 2021. https://www.researchgate.net/publication/355381945_Further_Generalizations_of_the_Jaccard_Index. Accessed 21 August 2021.

[24] L. da F. Costa. Multiset neurons. ResearchGate, 2021. https://www.researchgate.net/publication/356042155_Multiset_Neurons

[25] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Pearson, New York, 2018.
