This paper introduces a density-based clustering procedure for datasets with variables of mixed type. The proposed procedure, which is closely related to the concept of shared neighbourhoods, works particularly well in cases where the individual clusters differ greatly in terms of the average pairwise distance of the associated objects. Using a number of concrete examples, it is shown that the proposed clustering algorithm succeeds in identifying subgroups of objects with statistically significant distributional characteristics.
A COMPARATIVE STUDY ON DISTANCE MEASURING APPROACHES FOR CLUSTERING (IJORCS)
Clustering plays a vital role in various areas of research such as data mining, image retrieval, and bio-computing, among many others. The distance measure plays an important role in clustering data points, and choosing the right distance measure for a given dataset is a major challenge. In this paper, we study various distance measures and their effect on different clustering algorithms. This paper surveys existing distance measures for clustering and presents a comparison between them based on application domain, efficiency, benefits, and drawbacks. This comparison helps researchers make a quick decision about which distance measure to use for clustering. We conclude this work by identifying trends and challenges of research and development in clustering.
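As a rough illustration of the measures such surveys compare, the following sketch implements four classic distances on plain numeric vectors. The function names are ours, not from the paper; this is a minimal stdlib-only sketch.

```python
import math

def euclidean(a, b):
    # L2 norm: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1 norm: summed absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # L-infinity norm: largest single-coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus cosine similarity; 0 for parallel vectors, 1 for orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)
```

Note that cosine distance depends only on direction, not magnitude, which is why it behaves so differently from the norms above in clustering applications.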
The document discusses clustering documents using a multi-viewpoint similarity measure. It begins with an introduction to document clustering and common similarity measures like cosine similarity. It then proposes a new multi-viewpoint similarity measure that calculates similarity between documents based on multiple reference points, rather than just the origin. This allows a more accurate assessment of similarity. The document outlines an optimization algorithm used to cluster documents by maximizing the new similarity measure. It compares the new approach to existing document clustering methods and similarity measures.
Semi-Supervised Discriminant Analysis Based On Data Structure (IOSR-JCE)
This document summarizes a research paper that proposes a new semi-supervised dimensionality reduction algorithm called Semi-supervised Discriminant Analysis based on Data Structure (SGLS). SGLS aims to learn a low-dimensional embedding of high-dimensional data by integrating both global and local data structures, while also utilizing pairwise constraints to maximize discrimination. The algorithm formulates an optimization problem that incorporates a regularization term to preserve local geometry, while maximizing the distance between cannot-link pairs and minimizing the distance between must-link pairs in the embedding space. Experimental results on benchmark datasets demonstrate the effectiveness of SGLS compared to other semi-supervised dimensionality reduction methods.
Semi-supervised spectral clustering using shared nearest neighbor for data wi... (IAESIJAI)
In the absence of supervisory information in spectral clustering algorithms, it is difficult to construct suitable similarity graphs for data with complex shapes and varying densities. To address this issue, this paper proposes a semi-supervised spectral clustering algorithm based on shared nearest neighbor (SNN). The proposed algorithm combines the idea of semi-supervised clustering, adding SNN to the calculation of the distance matrix, and using pairwise constraint information to find the relationship between two data points, while providing a portion of supervised information. Comparative experiments were conducted on artificial data sets and University of California Irvine machine learning repository datasets. The experimental results show that the proposed algorithm achieves better clustering results compared to traditional K-means and spectral clustering algorithms.
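The shared-nearest-neighbor (SNN) idea above can be sketched as follows: two points are similar to the extent that their k-nearest-neighbor lists overlap. This is an illustrative brute-force version (sorting all pairwise distances), not the authors' implementation:

```python
import math

def snn_similarity(points, k):
    """Shared-nearest-neighbor similarity matrix: entry (i, j) is the
    number of common members of the two points' k-nearest-neighbor sets."""
    n = len(points)
    knn = []
    for i in range(n):
        # sort all indices by distance to point i; skip index 0 (the point itself)
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        knn.append(set(order[1:k + 1]))
    return [[len(knn[i] & knn[j]) for j in range(n)] for i in range(n)]
```

Because the similarity counts shared neighbors rather than raw distance, it remains meaningful when cluster densities vary, which is the motivation given in the abstract.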
This document discusses various clustering techniques used in data mining. It begins by defining clustering as an unsupervised learning technique that groups similar objects together. It then discusses advantages of clustering such as quality improvement and reuse opportunities. Several clustering methods are described such as K-means clustering, which aims to partition observations into k clusters where each observation belongs to the cluster with the nearest mean. The document concludes by discussing advantages of K-means clustering such as its linear time complexity and its use for spherical cluster shapes.
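The k-means procedure described above (assign each observation to the cluster with the nearest mean, then recompute the means) can be sketched with plain Lloyd iteration; points are assumed to be tuples of numbers, and the initialization is simple random sampling:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-mean assignment and
    mean recomputation until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: index of the nearest center for each point
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # update step: mean of the points assigned to each cluster
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Each iteration costs O(n·k) distance computations, which is the linear time complexity the summary refers to.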
Combined cosine-linear regression model similarity with application to handwr... (IJECE, IAES)
This document presents a combined cosine-linear regression model for calculating similarity between handwritten word images. It first provides an overview of various commonly used similarity and distance measures such as Euclidean, Manhattan, Minkowski, Cosine, Jaccard, and Chebyshev distances. It then compares the performance of these measures on a handwritten Arabic document dataset, finding that cosine distance performs best. However, cosine distance is affected by the size of the visual codebook used. The document proposes a floating threshold based on a linear regression model that considers both the codebook size and the number of image features, in order to better measure similarity between word images. Experiments on a historical Arabic document collection demonstrate the effectiveness of this combined cosine-linear regression model.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (IJDKP)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high dimensional data. Many significant subspace clustering algorithms exist, each having different characteristics caused by the use of different techniques, assumptions, heuristics, etc. A comprehensive classification scheme is essential that considers all such characteristics to divide subspace clustering approaches into families. The algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers better understand the quality criteria to be used and the similar algorithms against which to compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family). The characteristics of a SCAF are based on classes such as cluster orientation, overlap of dimensions, etc. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "Axis parallel, overlapping, density based" SCAF.
Improved probabilistic distance based locality preserving projections method ... (IJECE, IAES)
In this paper, dimensionality reduction in large datasets is achieved using the proposed distance-based Non-integer Matrix Factorization (NMF) technique, which is intended to solve the data dimensionality problem. Here, NMF and distance measurement aim to resolve the non-orthogonality problem due to increased dataset dimensionality. The method initially partitions the datasets, organizes them into a defined geometric structure, and avoids capturing the dataset structure through a distance-based similarity measurement. The proposed method is designed to fit dynamic datasets and includes the intrinsic structure using data geometry. The complexity of the data is further avoided using an Improved Distance based Locality Preserving Projection. The proposed method is evaluated against existing methods in terms of accuracy, average accuracy, mutual information, and average mutual information.
Many approaches and systems based on fuzzy theory have been proposed to solve complex decision-making problems. In 1998, Smarandache introduced the concept of the single-valued neutrosophic set as a complete development of fuzzy theory. In this paper, we study the distance measure between single-valued neutrosophic sets based on the H-max measure of Ngan et al. [8]. The proposed measure is also a distance measure between picture fuzzy sets, which were introduced by Cuong in 2013 [15]. Based on the proposed measure, an Adaptive Neuro Picture Fuzzy Inference System (ANPFIS) is built and applied to decision making for the link states in interconnection networks. In an experimental evaluation on real datasets taken from the UPV (Universitat Politècnica de València) university, the performance of the proposed model is better than that of related fuzzy methods.
Accurate time series classification using shapelets (IJDKP)
Time series data are sequences of values measured over time. One of the most recent approaches to classification of time series data is to find shapelets within a data set. Time series shapelets are time series subsequences which represent a class. In order to compare two time series sequences, existing work uses the Euclidean distance measure. The problem with Euclidean distance is that it requires data to be standardized if scales differ. In this paper, we perform classification of time series data using time series shapelets together with the Mahalanobis distance measure. The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance (residual) from a common point. It is used to identify and gauge the similarity of an unknown sample set to a known one, and it differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. We show that the Mahalanobis distance yields higher accuracy than the Euclidean distance measure.
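For concreteness, here is a small stdlib-only sketch of the Mahalanobis distance for 2-D data, inverting the 2x2 covariance matrix in closed form. It is illustrative only (not the paper's shapelet pipeline), but it exhibits the scale-invariance property claimed above:

```python
def sample_mean_cov(points):
    """Sample mean and unbiased covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance sqrt((x-m)^T S^-1 (x-m)) for 2-D data,
    inverting the 2x2 covariance matrix S directly."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # quadratic form dx^T inv dx
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1]) +
         dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return q ** 0.5
```

With the identity covariance the measure reduces to Euclidean distance; rescaling one axis rescales its contribution away, which is the scale-invariance the abstract contrasts with Euclidean distance.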
STUDY ANALYSIS ON TRACKING MULTIPLE OBJECTS IN PRESENCE OF INTER OCCLUSION IN... (aciijournal)
Object tracking algorithms are used to track multiple objects in video streams. This paper presents a mutual tracking algorithm that improves estimation accuracy and robustness in cluttered environments compared with a Kalman filter, and uses this algorithm to avoid the problem of identity switches during continuing occlusions. First, the algorithm applies a collision-avoidance model to separate nearby trajectories. When inter-occlusion occurs, the aggregate model splits into several parts and only the visible parts are used to perform tracking. The algorithm reinitializes the particles when the tracker is fully occluded. Experimental results on an unmanned level crossing (LC) demonstrate the feasibility of the proposal. In addition, a comparison with Kalman filter trackers has also been performed.
Study Analysis on Teeth Segmentation Using Level Set Method (aciijournal)
This document presents a study analyzing a mutual tracking algorithm for tracking multiple objects in unmanned level crossings (LC) in the presence of inter-occlusion. The mutual tracking algorithm improves upon Kalman filtering approaches by monitoring the distance between target estimates to handle occlusion and identity switching. When occlusion is detected, it distinguishes visible from occluded parts of targets to accurately update tracking. Experimental results on unmanned LC videos show the mutual tracking algorithm more effectively handles occlusions compared to Kalman filtering.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
This document discusses multidimensional clustering methods for data mining and their industrial applications. It begins with an introduction to clustering, including definitions and goals. Popular clustering algorithms are described, such as K-means, fuzzy C-means, hierarchical clustering, and mixture of Gaussians. Distance measures and their importance in clustering are covered. The K-means and fuzzy C-means algorithms are explained in detail. Examples are provided to illustrate fuzzy C-means clustering. Finally, applications of clustering algorithms in fields such as marketing, biology, and earth sciences are mentioned.
Hierarchical clustering is a method of partitioning a set of data into meaningful sub-classes or clusters. It involves two approaches - agglomerative, which successively links pairs of items or clusters, and divisive, which starts with the whole set as a cluster and divides it into smaller partitions. Agglomerative Nesting (AGNES) is an agglomerative technique that merges clusters with the least dissimilarity at each step, eventually combining all clusters. Divisive Analysis (DIANA) is the inverse, starting with all data in one cluster and splitting it until each data point is its own cluster. Both approaches can be visualized using dendrograms to show the hierarchical merging or splitting of clusters.
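The agglomerative (AGNES-style) loop described above can be sketched as repeatedly merging the pair of clusters with the least dissimilarity until the desired number of clusters remains. Single linkage (minimum pairwise distance) is assumed here as the dissimilarity; the function name and the early stop at a target cluster count are our own simplifications:

```python
import math

def agnes(points, target_k, linkage=min):
    """AGNES-style agglomerative clustering: start from singleton
    clusters and merge the least-dissimilar pair at each step."""
    clusters = [[p] for p in points]

    def dissim(a, b):
        # apply the linkage rule over all cross-cluster pairwise distances
        return linkage(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_k:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dissim(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Running the loop all the way to one cluster and recording each merge distance yields exactly the dendrogram structure the summary mentions; DIANA performs the inverse, splitting instead of merging.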
This document summarizes a research paper on using a k-means clustering method to detect brain tumors in MRI images. The paper introduces brain tumors and MRI imaging. It then describes using k-means clustering for tumor segmentation, which groups similar image patterns into clusters to identify the tumor region. The paper presents results of applying k-means to two MRI images, including statistical measures of segmentation accuracy, tumor area comparison, and timing. The k-means method achieved average rand index of 0.8358, low average errors, and tumor areas close to manual segmentation in under 3 seconds, demonstrating potential for accurate and efficient brain tumor detection.
This document proposes a new similarity measure for comparing spatial MDX queries in a spatial data warehouse to support spatial personalization approaches. The proposed similarity measure takes into account the topology, direction, and distance between the spatial objects referenced in the MDX queries. It defines the topological distance between spatial scenes referenced in queries based on a conceptual neighborhood graph. It also defines the directional distance between queries based on a graph of spatial directions and transformation costs. The similarity measure will be included in a recommendation approach the authors are developing to recommend relevant anticipated queries to users based on their previous queries.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES (ijcseit)
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence alignment, an important one being the progressive alignment approach studied in this work. Progressive alignment involves three steps: find the distance between each pair of sequences; construct a guide tree based on the distance matrix; and finally, based on the guide tree, align the sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: the UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance and more accurate in terms of overall cost minimization.
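The guide-tree step can be illustrated with a toy UPGMA: repeatedly merge the closest pair of clusters, and set the new cluster's distance to every other cluster as the size-weighted average of the merged pair's distances. This is a didactic sketch operating on a precomputed distance matrix, not the tooling the paper evaluated:

```python
def upgma(dist, labels):
    """UPGMA guide-tree construction from a symmetric distance matrix.
    Returns the tree as nested tuples of labels."""
    n = len(labels)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(n) if i < j}
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    next_id = n
    while len(clusters) > 1:
        (i, j), _ = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ti, ni), (tj, nj) = clusters[i], clusters[j]
        clusters[next_id] = ((ti, tj), ni + nj)
        for k in list(clusters):
            if k in (i, j, next_id):
                continue
            # size-weighted average of the merged pair's distances to k
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(min(next_id, k), max(next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        del clusters[i], clusters[j], d[(i, j)]
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

Neighbor joining differs in that it corrects each pairwise distance by the clusters' average divergence before choosing the pair to merge, which is what lets it recover unequal evolutionary rates.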
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection (1crore projects)
Finding Relationships between the Our-NIR Cluster Results (CSCJournals)
The problem of evaluating node importance in clustering has been an active research area, and many methods have been developed. Most clustering algorithms rely on general similarity measures; in real situations, however, data often changes over time. Clustering such data naively not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results. We previously proposed the Our-NIR method, which improves on the method proposed by Ming-Syan Chen, as demonstrated by its node-importance results, which are very useful in clustering categorical data; it still has deficiencies, however, in data labeling and outlier detection. In this paper, we modify the Our-NIR method for evaluating node importance by introducing a probability distribution, which yields better results by comparison.
A Novel Algorithm for Design Tree Classification with PCA (Editor Jacotech)
This document summarizes a research paper titled "A Novel Algorithm for Design Tree Classification with PCA". It discusses dimensionality reduction techniques like principal component analysis (PCA) that can improve the efficiency of classification algorithms on high-dimensional data. PCA transforms data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, called the first principal component. The paper proposes applying PCA and linear transformation on an original dataset before using a decision tree classification algorithm, in order to get better classification results.
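To illustrate the PCA step described above, here is a stdlib-only sketch that finds the first principal component (the direction of greatest variance) by power iteration on the sample covariance matrix, then projects the centered data onto it. The paper does not prescribe this particular numerical method; it is one simple way to compute the transform:

```python
import math

def first_principal_component(data, iters=200):
    """Power iteration on the sample covariance matrix: returns the
    first principal direction and the projections (scores) onto it."""
    n, dim = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dim)]
    centered = [[row[d] - means[d] for d in range(dim)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iters):
        # multiply by the covariance matrix and renormalise
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # score of each sample: projection of the centered row onto v
    scores = [sum(r[d] * v[d] for d in range(dim)) for r in centered]
    return v, scores
```

Keeping only the first few such components before running the decision tree is the dimensionality reduction the paper proposes.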
This document summarizes a research paper that proposes a new method to accelerate the nearest neighbor search step of the k-means clustering algorithm. The k-means algorithm is computationally expensive due to calculating distances between data points and cluster centers. The proposed method uses geometric relationships between data points and centers to reject centers that are unlikely to be the nearest neighbor, without decreasing clustering accuracy. Experimental results showed the method significantly reduced the number of distance computations required.
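The center-rejection idea can be sketched with the triangle-inequality bound used in Elkan-style accelerated k-means: if another center c is at least twice as far from the current best center as the best center is from the query point, then c cannot be nearer, so its distance need not be computed. The exact geometric test in the paper may differ; this is an assumed illustrative variant:

```python
import math

def nearest_center(x, centers, center_dist):
    """Nearest-center search with triangle-inequality pruning.
    center_dist is the precomputed center-to-center distance matrix.
    Returns (best index, best distance, number of distances computed)."""
    best = 0
    best_d = math.dist(x, centers[0])
    computed = 1
    for c in range(1, len(centers)):
        # if d(c_best, c) >= 2 * d(x, c_best), then by the triangle
        # inequality d(x, c) >= d(c_best, c) - d(x, c_best) >= d(x, c_best)
        if center_dist[best][c] >= 2 * best_d:
            continue
        d = math.dist(x, centers[c])
        computed += 1
        if d < best_d:
            best, best_d = c, d
    return best, best_d, computed
```

The pruning test touches only the precomputed center-to-center matrix, so each skipped center saves one point-to-center distance computation without ever changing the answer.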
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O... (ijscmcj)
Image segmentation is a critical step in computer vision tasks, constituting an essential issue for pattern recognition and visual interpretation. In this paper, we study the behavior of entropy in digital images through an iterative algorithm of mean shift filtering. The order of a digital image in gray levels is defined. The behavior of the Shannon entropy is analyzed and then compared, taking into account the number of iterations of our algorithm, with the maximum entropy that could be achieved under the same order. The equivalence classes this induces allow us to interpret entropy as a hyper-surface in real m-dimensional space. The difference between the maximum entropy of order n and the entropy of the image is used to group the iterations, in order to characterize the performance of the algorithm.
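As a concrete reference point, the Shannon entropy of a gray-level image is the entropy of its normalized gray-level histogram; for an image of order n (n distinct gray levels) the maximum achievable value is log2(n), the quantity the abstract compares against. A minimal sketch:

```python
import math

def shannon_entropy(image):
    """Shannon entropy (in bits) of a gray-level image, computed from
    the normalised histogram of its pixel values."""
    pixels = [v for row in image for v in row]
    n = len(pixels)
    hist = {}
    for v in pixels:
        hist[v] = hist.get(v, 0) + 1
    # H = -sum p log2 p over occupied gray levels
    return -sum((c / n) * math.log2(c / n) for c in hist.values())
```

A constant image has entropy 0, and an image whose pixels are spread uniformly over n gray levels reaches the maximum log2(n); mean shift filtering tends to drive the entropy downward as the image is smoothed.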
Similar to Connectivity-Based Clustering for Mixed Discrete and Continuous Data
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence alignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications rangingfrom phylogenetic analyses to domain identification. There are several ways to perform multiple sequencealignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging
from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence
alignment, an important way of which is the progressive alignment approach studied in this work.
Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept
of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms
of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our
experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection1crore projects
IEEE PROJECTS 2015
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: [email protected]
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
Finding Relationships between the Our-NIR Cluster ResultsCSCJournals
The problem of evaluating node importance in clustering has been active research in present days and many methods have been developed. Most of the clustering algorithms deal with general similarity measures. However In real situation most of the cases data changes over time. But clustering this type of data not only decreases the quality of clusters but also disregards the expectation of users, when usually require recent clustering results. In this regard we proposed Our-NIR method that is better than Ming-Syan Chen proposed a method and it has proven with the help of results of node importance, which is related to calculate the node importance that is very useful in clustering of categorical data, still it has deficiency that is importance of data labeling and outlier detection. In this paper we modified Our-NIR method for evaluating of node importance by introducing the probability distribution which will be better than by comparing the results.
This document summarizes a research paper titled "A Novel Algorithm for Design Tree Classification with PCA". It discusses dimensionality reduction techniques like principal component analysis (PCA) that can improve the efficiency of classification algorithms on high-dimensional data. PCA transforms data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, called the first principal component. The paper proposes applying PCA and linear transformation on an original dataset before using a decision tree classification algorithm, in order to get better classification results.
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
This document summarizes a research paper titled "A Novel Algorithm for Design Tree Classification with PCA". It discusses dimensionality reduction techniques like principal component analysis (PCA) that can improve the efficiency of classification algorithms on high-dimensional data. PCA transforms data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, called the first principal component. The paper proposes applying PCA and linear transformation on an original dataset before using a decision tree classification algorithm, in order to get better classification results.
This document summarizes a research paper that proposes a new method to accelerate the nearest neighbor search step of the k-means clustering algorithm. The k-means algorithm is computationally expensive due to calculating distances between data points and cluster centers. The proposed method uses geometric relationships between data points and centers to reject centers that are unlikely to be the nearest neighbor, without decreasing clustering accuracy. Experimental results showed the method significantly reduced the number of distance computations required.
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...ijscmcj
Image segmentation is a critical step in computer vision tasks constituting an essential issue for pattern
recognition and visual interpretation. In this paper, we study the behavior of entropy in digital images
through an iterative algorithm of mean shift filtering. The order of a digital image in gray levels is defined.
The behavior of Shannon entropy is analyzed and then compared, taking into account the number of
iterations of our algorithm, with the maximum entropy that could be achieved under the same order. The
use of equivalence classes it induced, which allow us to interpret entropy as a hyper-surface in real m-
dimensional space. The difference of the maximum entropy of order n and the entropy of the image is used
to group the the iterations, in order to caractrizes the performance of the algorithm
This presentation provides a detailed overview of air filter testing equipment, including its types, working principles, and industrial applications. Learn about key performance indicators such as filtration efficiency, pressure drop, and particulate holding capacity. The slides highlight standard testing methods (e.g., ISO 16890, EN 1822, ASHRAE 52.2), equipment configurations (such as aerosol generators, particle counters, and test ducts), and the role of automation and data logging in modern systems. Ideal for engineers, quality assurance professionals, and researchers involved in HVAC, automotive, cleanroom, or industrial filtration systems.
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
In this lecture, I explain the fundamentals of electromagnetic compatibility (EMC), the basic coupling model and coupling paths via cables, electric fields, magnetic fields and wave fields. We also look at electric vehicles as an example of systems with many conducted EMC problems due to power electronic devices such as rectifiers and inverters with non-linear components such as diodes and fast switching components such as MOSFETs or IGBTs. After a brief review of circuit analysis fundamentals and an experimental investigation of the frequency-dependent impedance of resistors, capacitors and inductors, we look at a simple low-pass filter. The input impedance from both sides as well as the transfer function are measured.
ISO 4020-6.1 – Filter Cleanliness Test Rig: Precision Testing for Fuel Filter Integrity
Explore the design, functionality, and standards compliance of our advanced Filter Cleanliness Test Rig developed according to ISO 4020-6.1. This rig is engineered to evaluate fuel filter cleanliness levels with high accuracy and repeatability—critical for ensuring the performance and durability of fuel systems.
🔬 Inside This Presentation:
Overview of ISO 4020-6.1 testing protocols
Rig components and schematic layout
Test methodology and data acquisition
Applications in automotive and industrial filtration
Key benefits: accuracy, reliability, compliance
Perfect for R&D engineers, quality assurance teams, and lab technicians focused on filtration performance and standard compliance.
🛠️ Ensure Filter Cleanliness — Validate with Confidence.
UNIT-1-PPT-Introduction about Power System Operation and ControlSridhar191373
Power scenario in Indian grid – National and Regional load dispatching centers –requirements of good power system - necessity of voltage and frequency regulation – real power vs frequency and reactive power vs voltage control loops - system load variation, load curves and basic concepts of load dispatching - load forecasting - Basics of speed governing mechanisms and modeling - speed load characteristics - regulation of two generators in parallel.
Optimize Indoor Air Quality with Our Latest HVAC Air Filter Equipment Catalogue
Discover our complete range of high-performance HVAC air filtration solutions in this comprehensive catalogue. Designed for industrial, commercial, and residential applications, our equipment ensures superior air quality, energy efficiency, and compliance with international standards.
📘 What You'll Find Inside:
Detailed product specifications
High-efficiency particulate and gas phase filters
Custom filtration solutions
Application-specific recommendations
Maintenance and installation guidelines
Whether you're an HVAC engineer, facilities manager, or procurement specialist, this catalogue provides everything you need to select the right air filtration system for your needs.
🛠️ Cleaner Air Starts Here — Explore Our Finalized Catalogue Now!
Department of Environment (DOE) Mix Design with Fly Ash.MdManikurRahman
Concrete Mix Design with Fly Ash by DOE Method. The Department of Environmental (DOE) approach to fly ash-based concrete mix design is covered in this study.
The Department of Environment (DOE) method of mix design is a British method originally developed in the UK in the 1970s. It is widely used for concrete mix design, including mixes that incorporate supplementary cementitious materials (SCMs) such as fly ash.
When using fly ash in concrete, the DOE method can be adapted to account for its properties and effects on workability, strength, and durability. Here's a step-by-step overview of how the DOE method is applied with fly ash.
Video Games and Artificial-Realities.pptxHadiBadri1
🕹️ #GameDevs, #AIteams, #DesignStudios — I’d love for you to check it out.
This is where play meets precision. Let’s break the fourth wall of slides, together.
Peak ground acceleration (PGA) is a critical parameter in ground-motion investigations, in particular in earthquake-prone areas such as Iran. In the current study, a new method based on particle swarm optimization (PSO) is developed to obtain an efficient attenuation relationship for the vertical PGA component within the northern Iranian plateau. The main purpose of this study is to propose suitable attenuation relationships for calculating the PGA for the Alborz, Tabriz and Kopet Dag faults in the vertical direction. To this aim, the available catalogs of the study area are investigated, and finally about 240 earthquake records (with a moment magnitude of 4.1 to 6.4) are chosen to develop the model. Afterward, the PSO algorithm is used to estimate model parameters, i.e., unknown coefficients of the model (attenuation relationship). Different statistical criteria showed the acceptable performance of the proposed relationships in the estimation of vertical PGA components in comparison to the previously developed relationships for the northern plateau of Iran. Developed attenuation relationships in the current study are independent of shear wave velocity. This issue is the advantage of proposed relationships for utilizing in the situations where there are not sufficient shear wave velocity data.
Module4: Ventilation
Definition, necessity of ventilation, functional requirements, various system & selection criteria.
Air conditioning: Purpose, classification, principles, various systems
Thermal Insulation: General concept, Principles, Materials, Methods, Computation of Heat loss & heat gain in Buildings
Kevin Corke Spouse Revealed A Deep Dive Into His Private Life.pdfMedicoz Clinic
Kevin Corke, a respected American journalist known for his work with Fox News, has always kept his personal life away from the spotlight. Despite his public presence, details about his spouse remain mostly private. Fans have long speculated about his marital status, but Corke chooses to maintain a clear boundary between his professional and personal life. While he occasionally shares glimpses of his family on social media, he has not publicly disclosed his wife’s identity. This deep dive into his private life reveals a man who values discretion, keeping his loved ones shielded from media attention.
International Journal on Cybernetics & Informatics (IJCI), Vol. 13, No. 3, June 2024
Bibhu Dash et al.: SCDD-2024, pp. 17-30, 2024. DOI: 10.5121/ijci.2024.130303
CONNECTIVITY-BASED CLUSTERING FOR
MIXED DISCRETE AND CONTINUOUS DATA
Mahfuza Khatun¹ and Sikandar Siddiqui²
¹ Jahangirnagar University, Savar, 1342 Dhaka, Bangladesh
² Deloitte Audit Analytics GmbH, Europa-Allee 91, 60486 Frankfurt, Germany
ABSTRACT
This paper introduces a density-based clustering procedure for datasets with variables of
mixed type. The proposed procedure, which is closely related to the concept of shared
neighbourhoods, works particularly well in cases where the individual clusters differ
greatly in terms of the average pairwise distance of the associated objects. Using a number
of concrete examples, it is shown that the proposed clustering algorithm succeeds in
allowing the identification of subgroups of objects with statistically significant
distributional characteristics.
KEYWORDS
Cluster analysis, mixed data, distance measures
1. INTRODUCTION
In the field of applied statistics, “clustering” and “cluster analysis” are collective terms for all
procedures by which individual objects can be aggregated into groups of mutually similar
entities.
One set of methods frequently used to perform this task, commonly referred to as centroid-based
approaches, models a cluster as a group of entities scattered around a common central point.
However, this may be counter-intuitive, since many observers would also tend to group entities
together if they are scattered along a common, not necessarily linear, path or surface rather than
around a common central point.
In response to this challenge, another class of clustering methods, often summarized under the
general term of “connectivity-“ or “density-based” approaches (Kriegel et al. [13]), has been
developed. In this context, clusters are defined as groups of entities located in regions of high
density and separated by near-empty areas within the sample space. Widely used examples of
such connectivity-based clustering procedures are DBSCAN (Ester et al., [5]) and OPTICS
(Ankerst et al., [2]); see also Oduntan [15].
The current paper addresses a key challenge that may occur in this context: the possibility that
the variables in the relevant datasets may be of mixed type, i.e. some of them may be continuous,
while others may be either ordered and discrete (as in the case of school grades) or unordered
and discrete (as in the case of country identifiers). This raises the question of how the pairwise
distances between the individual entities, which are key inputs for separating low- and
high-density regions, are to be measured. A possible solution to this problem is proposed in
Section 2 of this paper. The proposed clustering procedure itself, which is closely related to the
concept of shared neighbourhoods presented in Houle et al. [10], is described in Section 3.
Section 4 summarizes the outcomes of a number of exemplary applications. Section 5 concludes.
2. MEASURING DISTANCES BETWEEN ENTITIES
2.1. Starting Point
As in Khatun and Siddiqui [12], we assume that there is a dataset consisting of N entities i = 1,
…, N, each of which is characterised by a tuple q_i = { x_i, v_i, z_i } of features:

- x_i is a realisation of a (K_1 × 1) column vector X of numerical variables that either are
  continuous or are treated as continuous for practical reasons,
- v_i is a realisation of a (K_2 × 1) vector V of ordered discrete variables, numbered
  consecutively in steps of 1 beginning with 1, and
- z_i is a realisation of a (K_3 × 1) vector Z of unordered discrete variables.
2.2. Pairwise Distance with Respect to Continuous Variables
The distance between two entities i and j with respect to the values of X is measured by the
Manhattan Distance (see, e.g., Yang [20]) between the standardized x_i and x_j values as follows:

d_X(i, j) = Σ_{s=1}^{K_1} |x_{i,s} − x_{j,s}| / r(X_s)   (1)

where r(X_s) denotes the range of the observed values of X_s, i.e. the difference between the sample
maximum and the sample minimum.
In this context, the Manhattan Distance is preferred to the more commonly used Euclidean
distance because when using the former, the contrast between the distances from different data
points shrinks less rapidly as the dimension K1 of X grows; see Aggarwal, Hinneburg, and Keim
[1]. In addition, the application of the distance measure (1) simplifies the consolidation of
distance measures for the different variable types involved, as will become obvious below.
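The dimensionality argument above can be checked with a short simulation (a sketch, not part of the paper; the helper name pairwise_cv and the uniform test data are assumptions). The coefficient of variation of the pairwise distances serves as a simple proxy for distance contrast: it shrinks as the dimension grows for both norms, but stays higher under the Manhattan norm than under the Euclidean norm.

```python
import numpy as np

def pairwise_cv(points, ord):
    """Coefficient of variation (std/mean) of all pairwise distances
    under the given Minkowski order (1 = Manhattan, 2 = Euclidean)."""
    n = len(points)
    dists = np.array([np.linalg.norm(points[i] - points[j], ord=ord)
                      for i in range(n) for j in range(i + 1, n)])
    return dists.std() / dists.mean()

rng = np.random.default_rng(0)
for dim in (4, 64, 256):
    pts = rng.random((100, dim))  # 100 points, uniform in the unit hypercube
    print(f"dim={dim:3d}  CV(L1)={pairwise_cv(pts, 1):.3f}  "
          f"CV(L2)={pairwise_cv(pts, 2):.3f}")
```

For uniform data, both coefficients decay roughly like 1/√dim, but the Manhattan value is larger by a constant factor, which is the sense in which the contrast "shrinks less rapidly".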
2.3. Pairwise Distance with Respect to the Ordered Discrete Variables
The distance between i and j with respect to the values taken by the components of V can be
measured by

d_V(i, j) = Σ_{s=1}^{K_2} |v_{i,s} − v_{j,s}| / m_s   (2)

where m_s is the number of possible realisations of the s-th ordered discrete variable.
This way of proceeding is justified as follows: If V_s is an ordered, discrete variable ranging from
1 to m_s in steps of 1, then the actual value v_{i,s} can be assumed to depend on the value taken
by a latent (= unobservable) variable v*_{i,s} ∈ ]0; 1] as follows:

v_{i,s} = j   if   v*_{i,s} ∈ ](j−1)/m_s ; j/m_s]   (3)
The distance between the mid-points of two neighbouring sub-intervals given in (3) equals 1/m_s.
2.4. Pairwise Distance based on Unordered Discrete Characteristics
With regard to Z, the distance between two entities i and j can be quantified by their separateness
or lack of overlap (see, e.g., Stanfill and Waltz [16]), based on the Hamming Distance (Hamming
[9]):

d_Z(i, j) = Σ_{s=1}^{K_3} I(z_{i,s} ≠ z_{j,s}) / m_s   (4)

In the above equation, I(·) is the indicator function that equals 1 if the condition in brackets is
fulfilled and 0 if not. The scalar m_s stands for the number of distinct possible realisations of the
s-th unordered discrete variable.

The distance prevailing between two specific realisations z_{i,s} and z_{j,s} of the s-th unordered
discrete variable Z_s thus equals 0 if z_{i,s} equals z_{j,s}, and 1/m_s if not. The distance
between pairs of observations with different values of Z_s, as measured by (4), thus shrinks as the
number of possible realisations of the relevant variable increases. This normalisation rule is
motivated by the idea that the finer the classification scheme according to which individual
entities are grouped, the smaller the average number of entities per group, and the less certain we
can be that differences in group membership reflect actual disparities between the entities rather
than merely random "noise".
2.5. Overall Pairwise Distance Between Two Entities
The overall distance score between two entities i and j can then be calculated by summing up the
distance measures for the different types of variables given in (1), (2), and (4):

d(i, j) = d_X(i, j) + d_V(i, j) + d_Z(i, j)   (5)
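As an illustration, the partial distances in (1), (2) and (4) and their sum in (5) might be computed as follows. This is a Python sketch, not the paper's GAUSS implementation; the function name mixed_distance and the tuple-based record layout are illustrative assumptions.

```python
import numpy as np

def mixed_distance(q_i, q_j, x_range, m_ord, m_unord):
    """Overall distance (5) between two entities described by tuples
    q = (x, v, z) of continuous, ordered-discrete and unordered-discrete
    features. x_range[s] is the observed range r(X_s); m_ord[s] and
    m_unord[s] are the numbers of possible realisations m_s."""
    x_i, v_i, z_i = q_i
    x_j, v_j, z_j = q_j
    # (1) Manhattan distance between range-standardized continuous values
    d_x = np.sum(np.abs(np.asarray(x_i) - np.asarray(x_j)) / np.asarray(x_range))
    # (2) ordered discrete: absolute level difference, scaled by 1/m_s
    d_v = sum(abs(a - b) / m for a, b, m in zip(v_i, v_j, m_ord))
    # (4) unordered discrete: Hamming-type mismatch, scaled by 1/m_s
    d_z = sum((a != b) / m for a, b, m in zip(z_i, z_j, m_unord))
    # (5) overall distance is the sum of the three components
    return d_x + d_v + d_z

# toy example: two entities with one variable of each type
q1 = ([0.2], [1], ["DE"])
q2 = ([0.7], [3], ["BD"])
print(mixed_distance(q1, q2, x_range=[1.0], m_ord=[5], m_unord=[7]))
# ≈ 1.043 (= 0.5 + 2/5 + 1/7)
```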
3. CLUSTERING PROCEDURE
As mentioned in the introduction, the clustering procedure proposed in this section is based on
the concept of shared neighbourhood, as presented in the seminal paper by Houle et al. [10]. This
approach is adopted here because, according to the authors, it is less affected by the “curse of
dimensionality” (Bellman [3]). What is meant by this term is that as the number of variables
under consideration grows, the size differences between the pairwise distances of the individual
data points decrease, which makes it increasingly impossible to form meaningful clusters based
on such distances.
In line with the above specifications, it is possible to calculate, for each entity i in the sample, the
distance between itself and each of the remaining (N−1) entities, and to sort the results of these
calculations in ascending order. Let δ_i(1) ≤ δ_i(2) ≤ … ≤ δ_i(N−1) denote the sorted distances from i,
and let g > 0 be a user-specified integer number. Then, the adjacency set S_i of entity i is defined
as the set of all entities j ≠ i for which the inequality d(i, j) ≤ δ_i(g) holds:

S_i := { j ≠ i | d(i, j) ≤ δ_i(g) }   (6)
Two entities i and j are considered mutually interlinked if their adjacency sets S_i and S_j have at
least one element in common:

S_i ⚭ S_j  :⇔  S_i ∩ S_j ≠ {}   (7)
The proposed clustering procedure can then be summarized as follows:
Listing 1: Clustering Procedure

Step 1: Gather all entities in the sample in the subset of hitherto unassigned entities.
Step 2: Initialize the cluster index c as 0.
Step 3: Increase the cluster index c by 1.
Step 4: Set the number of elements in cluster c, denoted by n_c, to 0.
Step 5: Set i to 1.
Step 6: If
        entity number i has not yet been assigned to a cluster, and
        n_c exceeds 0, and
        entity number i and at least one element of cluster c are mutually interlinked,
    then
        assign entity number i to cluster c,
        remove entity number i from the subset of hitherto unassigned entities, and
        increase n_c by 1.
Step 7: If
        entity number i has not yet been assigned to a cluster, and
        n_c equals 0,
    then
        assign entity number i to cluster c,
        remove entity number i from the subset of hitherto unassigned entities, and
        increase n_c by 1.
Step 8: Increase i by 1.
Step 9: If i ≤ N, continue with Step 6.
Step 10: If i > N, and if there is at least one object in the subset of hitherto unassigned
        entities that is mutually interlinked with at least one element of cluster c, continue
        with Step 5.
Step 11: If i > N, if the subset of hitherto unassigned entities is not empty, and if none of its
        elements is mutually interlinked with any element of cluster c, continue with Step 3.
Step 12: If i > N and the subset of hitherto unassigned entities is empty, terminate.
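The steps of Listing 1 can be sketched compactly in Python (the paper supplies GAUSS code; this re-implementation is an assumption-laden sketch that takes a precomputed distance matrix D as input and is not the authors' code):

```python
import numpy as np

def adjacency_sets(D, g):
    """S_i from (6): all j != i whose distance to i does not exceed the
    g-th smallest of i's distances to the other entities."""
    N = len(D)
    S = []
    for i in range(N):
        others = [j for j in range(N) if j != i]
        thresh = sorted(D[i][j] for j in others)[g - 1]   # delta_i(g)
        S.append({j for j in others if D[i][j] <= thresh})
    return S

def interlinked(S, i, j):
    """(7): i and j are mutually interlinked if S_i and S_j overlap."""
    return bool(S[i] & S[j])

def cluster(D, g):
    """Listing 1: grow one cluster at a time by repeated sweeps."""
    N = len(D)
    S = adjacency_sets(D, g)
    label = [None] * N              # cluster index c(i) per entity
    c = 0
    while None in label:            # Step 12: stop when all are assigned
        c += 1                      # Steps 3-4: open a new, empty cluster
        members = []
        changed = True
        while changed:              # Steps 5-10: sweep until no one joins
            changed = False
            for i in range(N):
                if label[i] is not None:
                    continue
                # Step 7 (seed) or Step 6 (attach an interlinked entity)
                if not members or any(interlinked(S, i, j) for j in members):
                    label[i] = c
                    members.append(i)
                    changed = True
    return label

# two well-separated groups on the real line
p = np.array([0., 1., 2., 10., 11., 12.])
D = np.abs(np.subtract.outer(p, p))
print(cluster(D, g=2))   # → [1, 1, 1, 2, 2, 2]
```

Since cluster c absorbs every still-unassigned entity interlinked with any current member until closure, the resulting clusters are the connected components of the "mutually interlinked" graph.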
An implementation of this procedure in the matrix language GAUSS, together with the datasets
for the examples from the following section, is available on the WWW via
https://drive.google.com/drive/folders/1putjdHHMJjg2TafWenFl9lMBr3ShxC93?usp=drive_link.
The above procedure will cause each of the entities in the sample to be unequivocally assigned to
a single cluster. However, particularly when the number g of neighbouring entities from which
the adjacency set of each entity i is derived is small (say, e.g. 1 or 2), some or even many objects
will be assigned to “degenerate” clusters comprising only a single object. On the other hand,
raising the value of g above a certain threshold (the level of which depends on the distributional
characteristics of the underlying dataset) will cause all objects to be gathered in a single,
maximally heterogeneous group. It thus becomes obvious that the choice of g implies a trade-off
between the potentially conflicting objectives of within-cluster homogeneity on one hand and
inclusiveness (i.e. the assignment of as many entities as possible to valid, or “non-degenerate”,
clusters) on the other.
The proposed solution to this problem is to compare different possible outcomes of the clustering
procedure with a measure of separation accuracy that can be calculated as follows: Let n_min
denote a user-defined minimum cluster size, c(i) the index number of the cluster to which object i
has been assigned, and n_c(i) the number of objects in that cluster. Then, the quantity

a_i := min { d(i, j) | j ≠ i, c(j) = c(i) }   (8)

equals the distance between object i and its closest neighbour within its cluster, provided that n_c(i)
at least reaches the specified minimum size, whereas

b_i := min { d(i, j) | c(j) ≠ c(i) }   (9)

equals the distance between object i and its closest neighbour outside its cluster whenever n_c(i)
does not fall short of n_min. Then, for any given value of n_min, the particular value g* of g that
maximizes the quantity

A(g) := (1/|M|) Σ_{i∈M} (b_i − a_i),   M := { i | n_c(i) ≥ n_min }   (10)

can be looked upon as the one that maximizes the separation accuracy associated with the chosen
value of n_min.
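The selection of g along these lines can be sketched as follows. Equations (8) to (10) are not legible in this copy, so the score below, the mean gap between each object's nearest out-of-cluster and nearest within-cluster distance over objects in sufficiently large clusters, is one plausible operationalization rather than necessarily the authors' exact definition; the function names are illustrative.

```python
import numpy as np

def separation_accuracy(D, labels, n_min):
    """Mean over qualifying objects of (nearest distance to another
    cluster) minus (nearest distance within the own cluster); one
    plausible reading of eqs. (8)-(10). `labels` is a list of cluster
    indices, `D` a symmetric distance matrix."""
    N = len(labels)
    sizes = {c: labels.count(c) for c in set(labels)}
    gaps = []
    for i in range(N):
        if sizes[labels[i]] < n_min:
            continue  # object sits in a "degenerate" (too small) cluster
        same = [D[i][j] for j in range(N) if j != i and labels[j] == labels[i]]
        other = [D[i][j] for j in range(N) if labels[j] != labels[i]]
        if same and other:
            gaps.append(min(other) - min(same))
    return float(np.mean(gaps)) if gaps else float("-inf")

def best_g(D, n_min, g_values, cluster_fn):
    """g* = the value of g maximizing the separation-accuracy score,
    where cluster_fn(D, g) is any function returning cluster labels."""
    return max(g_values,
               key=lambda g: separation_accuracy(D, cluster_fn(D, g), n_min))
```

A well-separated labelling yields a large positive score, while an arbitrary labelling yields a score near or below zero, which is what makes the quantity usable as a model-selection criterion for g.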
4. EXEMPLARY APPLICATIONS
4.1. Two-Dimensional Datasets with Continuous Variables Only
Although the main purpose of the proposed method is to allow the clustering of mixed data, some
of its key characteristics are probably best understood when applying it to a small set of
deliberately simple cases. The first important feature is that, unlike the centroid-based
procedures, the procedure from Sections 2 and 3 can identify clusters of arbitrary shape rather
than being limited to ones with round or oval profiles. In an exemplary manner, this is shown for
the “Aggregation” dataset examined by Gionis, Mannila and Tsaparas (2007). Figure 1 displays
the outcome of an application of the proposed procedure in this case:
Figure 1
In the proposed procedure, the measure of neighbourliness that is used to decide whether two or
more data points belong together or are considered separate is not defined by a fixed distance
threshold; rather it is based on the presence or absence of at least one common neighbour. In a
setup where the individual clusters differ greatly in terms of the average pairwise distance of the
associated objects, this feature enables the procedure from Section 3 to identify such
accumulations of objects nevertheless, as is shown by its application to the “toy dataset” dealt
with in Jain and Law [11]:
Figure 2
However, the application of the proposed method to the "R15" dataset studied in Veenman,
Reinders and Backer [18] also points to the flip side of the apparent advantages of our algorithm:
Because of its emphasis on common close neighbours, it tends to merge groups of objects
surrounding two or more different central points into a single group whenever there is some
degree of overlap between them. This can sometimes lead to counterintuitive results, as Figure 3
indicates: here, our algorithm forms a single cluster out of eight point clouds near the centre of
the diagram, although most human observers would probably have perceived them as separate
groups.
The outcome for the dataset "Unbalance" examined by Rezaei and Fränti [14] (see Figure 4)
further underscores this point. Here, most human observers would probably have split Cluster 1
into four and Cluster 2 into three separate "sub-clusters".
The above findings show that the proposed approach is far from a universal, objective solution to
clustering problems. It should rather be considered one out of several related approaches which,
in view of the variety of possible data and research objectives, can produce results with very
different degrees of plausibility from case to case.
Figure 3
Figure 4
The four examples in this subsection relate to synthetically generated bivariate data involving
continuous variables only. The purpose of the two following subsections is to demonstrate that
the proposed algorithm can also produce plausible results in real-world settings involving more
variables, and variables of different types.
4.2. Country Grouping by Macroeconomic Indicators
The above procedure can be applied to form groups of countries based on similarity comparisons
with regard to their geographic location as well as a number of macroeconomic indicators. In the
particular case examined here, we use the seven World Bank [19] regions as geographical
assignment indicators and a set of four macroeconomic variables, which include (i) GDP per
capita, as well as (ii) government debt, (iii) the current account balance, and (iv) the government
budget balance, the last three of which are expressed as a percentage of GDP. This choice is
motivated by the fact that these four variables are among the indicators most commonly used
to assess the resilience of countries to adverse economic shocks; see, e.g.,
Briguglio et al. [4] for a more comprehensive treatment of this issue. The common source for all
the data in use is Trading Economics [17], and the reference date is generally the year-end of
2022. If no data is available for this date, the most recent earlier key date has been chosen. After
removing sovereign states with missing data, we end up with a sample of 163 countries.
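Since the attributes just described mix a categorical region indicator with continuous macroeconomic variables, pairwise dissimilarities can be computed along the lines of Gower's [7] general coefficient. The sketch below only illustrates that idea with our own function and parameter names, not the exact weighting used in this paper: continuous columns contribute range-normalised absolute differences, categorical columns a simple 0/1 mismatch.

```python
import numpy as np

def gower_distance(cont, cat):
    """Gower-type dissimilarity matrix for mixed data.

    cont: (n, p) float array of continuous attributes
    cat:  (n, q) array of categorical attributes (e.g. region labels)
    """
    n, p = cont.shape
    q = cat.shape[1]
    rng = cont.max(axis=0) - cont.min(axis=0)
    rng[rng == 0] = 1.0                      # guard against constant columns
    d = np.zeros((n, n))
    for k in range(p):                       # range-normalised differences
        d += np.abs(cont[:, None, k] - cont[None, :, k]) / rng[k]
    for k in range(q):                       # 0/1 mismatch for categories
        d += (cat[:, None, k] != cat[None, :, k]).astype(float)
    return d / (p + q)
```

Feeding the resulting matrix into a shared-neighbour procedure only requires the ranks of its entries, so the exact scaling of the continuous variables matters less than with a fixed distance threshold.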
With the minimum cluster size min set to 2, application of the above procedure to the dataset
compiled accordingly yields an optimum number of close neighbours (g*) of 2 and leads to two
large clusters, one further small cluster of only two countries (Cambodia and the Maldives), and a
total of six “outliers” (Afghanistan, Bhutan, Cyprus, Guinea, Indonesia, and Lebanon) that cannot
be assigned to a cluster that reaches or exceeds the above size threshold. Table 1 enumerates
the countries assigned to Clusters 1 and 2 by region. Descriptive statistics of the four continuous
variables in use are given in Tables 2 to 5.
Table 1: Distribution of countries across Clusters 1 and 2
East Asia & Pacific
  Cluster 1: Fiji, Japan
  Cluster 2: Australia, Brunei, China, Laos, Malaysia, Mongolia, Myanmar, New Zealand, Papua New Guinea, Philippines, Singapore, South Korea, Thailand, Vietnam

Europe & Central Asia
  Cluster 1: Albania, Armenia, Belgium, Bosnia and Herzegovina, France, Georgia, Greece, Italy, Macedonia, Moldova, Montenegro, Portugal, Serbia, Spain, Ukraine, United Kingdom, Uzbekistan
  Cluster 2: Austria, Azerbaijan, Belarus, Bulgaria, Czech Republic, Croatia, Denmark, Estonia, Finland, Germany, Hungary, Kazakhstan, Kosovo, Kyrgyzstan, Iceland, Latvia, Lithuania, Luxembourg, Netherlands, Norway, Poland, Romania, Russia, Slovakia, Slovenia, Sweden, Switzerland, Tajikistan, Turkey, Turkmenistan

Latin America & Caribbean
  Cluster 1: Argentina, Bahamas, Belize, Bolivia, Brazil, Chile, Colombia, Costa Rica, Dominican Republic, Ecuador, El Salvador, Guatemala, Haiti, Honduras, Jamaica, Mexico, Nicaragua, Panama, Paraguay, Peru, Suriname, Trinidad and Tobago, Uruguay, Cayman Islands
  Cluster 2: Guyana

Middle East & North Africa
  Cluster 1: Algeria, Bahrain, Djibouti, Egypt, Iran, Iraq, Jordan, Libya, Palestine, Tunisia
  Cluster 2: Israel, Kuwait, Malta, Saudi Arabia, United Arab Emirates, Oman, Qatar

North America
  Cluster 2: Canada, United States

South Asia
  Cluster 1: Bangladesh, Nepal, India, Pakistan, Sri Lanka

Sub-Saharan Africa
  Cluster 1: Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Cape Verde, Chad, Comoros, Congo, Ethiopia, Equatorial Guinea, Gabon, Gambia, Ghana, Guinea, Guinea Bissau, Ivory Coast, Kenya, Lesotho, Madagascar, Malawi, Mauritius, Mozambique, Namibia, Nigeria, Republic of the Congo, Sierra Leone, Rwanda, South Africa, Sudan, Swaziland, Tanzania, Togo, Uganda, Zambia, Zimbabwe
  Cluster 2: Central African Republic, Liberia, Mauritania, Niger, Senegal, Seychelles
Table 2: Descriptive Statistics by Cluster, GDP per capita
Cluster # Mean Std. dev. Min. Max # obs.
1 13829.40 13347.69 705.00 63670.00 95
2 36308.82 26802.29 838.00 115683.00 60
3 11560.00 10189.41 4355.00 18765.00 2
Sample 22067.75 22162.15 705.00 115683.00 163
Table 3: Descriptive Statistics by Cluster, Government debt as a percentage of GDP
Cluster # Mean Std. dev. Min. Max # obs.
1 71.38 40.39 14.60 264.00 95
2 46.80 24.67 1.90 160.00 60
3 47.35 14.92 36.80 57.90 2
Sample 62.32 37.89 1.90 264.00 163
Table 4: Descriptive Statistics by Cluster, Current account balance as a percentage of GDP
Cluster # Mean Std. dev. Min. Max # obs.
1 -3.15 6.56 -23.8 19.2 95
2 0.42 12.36 -26.8 30.5 60
3 -21.75 7.28 -26.9 -16.6 2
Sample -2.21 10.17 -33.80 30.50 163
Table 5: Descriptive Statistics by Cluster, Government budget balance as a percentage of GDP
Cluster # Mean Std. dev. Min. Max # obs.
1 -4.28 5.07 -35.00 6.50 95
2 -2.41 4.35 -19.30 11.60 60
3 -10.80 4.95 -14.30 -7.30 2
Sample -3.62 4.88 -35.00 11.60 163
It turns out that the entities assigned to Cluster 1, on average, have a lower GDP per capita and a
higher ratio of government debt to GDP than those gathered in Cluster 2. In both of these cases, a
standard two-sample t-test leads to a rejection of the null hypothesis of equal means at a
confidence level exceeding 99%. Moreover, the average government budget balance and the
average current account balance, both expressed as a percentage of GDP, are lower in Cluster 1
than in Cluster 2. In both of these cases, the absolute values of the associated t-statistics exceed
the 95% critical value for a two-sided test by far.
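The two-sample comparisons above can be reproduced directly from the published summary statistics. As an illustration, the sketch below uses the Welch form of the t statistic, which does not assume equal cluster variances (the function name is ours, and the Welch variant is our assumption, since the paper only speaks of a standard two-sample t-test):

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic computed from per-cluster summary statistics."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# GDP per capita, Cluster 1 vs. Cluster 2 (values from Table 2)
t_gdp = welch_t(13829.40, 13347.69, 95, 36308.82, 26802.29, 60)
```

The resulting |t| of roughly 6 lies far beyond the 1% critical value of the t distribution, consistent with the rejection reported above.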
Clusters 1 and 2 also exhibit considerable differences between the correlation patterns prevailing
between the continuous variables involved, only the most striking of which are mentioned in the
following:
- In Cluster 1, the sample correlation coefficient between GDP per capita and the
government budget balance per unit of GDP exceeds 0.5 and is statistically significant at
the 95% level. In Cluster 2, the same coefficient is below 0.06 and statistically
insignificant.
Figure 5
- In Cluster 1, the correlation coefficient between GDP per capita and the government’s
budget balance takes a negative value (-0.20), whereas in Cluster 2, the corresponding
coefficient has the opposite sign (0.297). A t-test derived from a corresponding bivariate
linear regression in which the slope parameters were allowed to differ between Cluster 1
and Cluster 2 resulted in the null hypothesis of identical parameter values being rejected
at a 99% confidence level.
Figure 6
The above results indicate that, in this particular case, the proposed clustering procedure
succeeds in identifying subgroups of entities with statistically significant differences in their
distributional characteristics that might not have been detectable with other tools of exploratory
data analysis, such as histograms and scatterplots.
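The test for differing slope parameters mentioned above can be implemented as a pooled regression with a cluster dummy and an interaction term; the t statistic of the interaction coefficient then tests the null of identical slopes. The sketch below uses our own variable names and synthetic data rather than the country sample:

```python
import numpy as np

def slope_difference_t(x1, y1, x2, y2):
    """t statistic of the interaction coefficient in the pooled regression
    y = b0 + b1*x + b2*d + b3*(d*x), where d marks membership in group 2;
    b3 = 0 corresponds to identical slopes in the two groups."""
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    d = np.concatenate([np.zeros(len(x1)), np.ones(len(x2))])
    X = np.column_stack([np.ones_like(x), x, d, d * x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1]) # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # classical OLS covariance
    return beta[3] / np.sqrt(cov[3, 3])

# synthetic example: slopes 2 and -1 should be told apart easily
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t_int = slope_difference_t(x, 2 * x + rng.normal(0, 0.1, 50),
                           x, -x + rng.normal(0, 0.1, 50))
```

A strongly negative t_int here reflects the slope in the second group being smaller than in the first.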
4.3. Clusters of Credit Card Applicants
Another dataset to which the procedure from Sections 2 and 3 can be applied is the sample of
applicants for a specific type of credit card provided in the online complements of Greene’s [8]
econometrics textbook. The dataset consists of 1,319 observations on 12 variables. The variables
used in the current example are listed in Table 6.
Table 6: Variables Used in the Credit Card Example
Name Type Description
card Discrete, unordered = 1 if application was accepted, = 0 otherwise
reports Treated as continuous Number of major derogatory reports
age Continuous Age of the applicants in years
income Continuous Annual income, in multiples of USD 10,000
owner Discrete, unordered = 1 if applicants own their home, = 0 otherwise
selfemp Discrete, unordered = 1 if applicants are self-employed, = 0 otherwise
dependents Treated as continuous Number of dependents
In this case, the application of the proposed procedure with a minimum cluster size of 2 leads to
the formation of two large clusters without any outliers. Cluster 1, the larger of the two,
comprises little more than two-thirds of the sample. Cluster-specific descriptive statistics for the
variables that either are continuous or are treated as such are given in Tables 7 to 10 below.
Table 7: Descriptive Statistics by Cluster, Variable “reports”
Cluster # Mean Std. dev. Min. Max # obs.
1 0.6212 1.5917 0.0000 14.0000 887
2 0.1180 0.3942 0.0000 3.0000 432
Sample 0.4564 1.3453 0.0000 14.0000 1319
Table 8: Descriptive Statistics by Cluster, Variable “age”
Cluster # Mean Std. dev. Min. Max # obs.
1 35.4493 10.1895 0.1667 83.5000 887
2 28.6215 8.3509 0.5000 67.1667 432
Sample 33.2131 10.1428 0.1667 83.5000 1319
Table 9: Descriptive Statistics by Cluster, Variable “income”
Cluster # Mean Std. dev. Min. Max # obs.
1 3.6686 1.8618 0.2100 13.5000 887
2 2.7427 1.0347 1.3200 10.9999 432
Sample 3.3654 1.6939 0.2100 13.5000 1319
Table 10: Descriptive Statistics by Cluster, Variable “dependents”
Cluster # Mean Std. dev. Min. Max # obs.
1 1.3766 1.3372 0.0000 6.0000 887
2 0.2083 0.4066 0.0000 1.0000 432
Sample 0.9939 1.2477 0.0000 6.0000 1319
The outcome of a two-sample t-test indicates that, for each of the four variables named above, the
cluster-specific differences in the means are statistically significant at a 99% confidence level.
Table 11 below reports cluster-specific relative frequencies for the three binary indicator
variables in use. Here, too, the absolute values of the t-statistics relating to the differences
between the two clusters exceed the critical values for the 99% confidence level by far.
12. International Journal on Cybernetics & Informatics (IJCI) Vol.13, No.3, June 2024
28
Table 11: Relative Frequencies of the Binary Indicator Variables by Cluster
Cluster # % applications accepted % self-employed % homeowners
1 66.63% 10.26% 65.50%
2 100.00% 0.00% 0.00%
Sample 77.56% 6.90% 44.04%
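For binary indicators such as these, the reported differences can be illustrated with the closely related pooled two-proportion z statistic; the figures below are taken from Table 11 (homeowner shares), while the function itself is our own sketch rather than the test actually used in the paper:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-sample z statistic for the difference of two proportions."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# share of homeowners: 65.50% of 887 in Cluster 1 vs. 0.00% of 432 in Cluster 2
z_owner = two_proportion_z(0.6550, 887, 0.0000, 432)
```

The resulting z of roughly 22 dwarfs the 1% critical value of about 2.58, in line with the statement above.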
In most of the above cases, the cluster-specific pairwise correlation patterns between the four
numerical variables involved (“reports”, “age”, “income”, and “dependents”) do not show any
pronounced differences. An exception, however, is the relationship between the income variable
and the number of dependents, which is positive and statistically significant at a 95% confidence
level in Cluster 1 but close to zero in Cluster 2. Hence, the supposition that the proposed
clustering method can be used to identify subgroups of entities with significantly different
statistical characteristics is also confirmed by the results obtained in this case.
4.4. Precautionary Remarks
Given the obvious suitability of the proposed approach in the context of the above examples, it
appears necessary to mention a number of problems that cannot be resolved by its application:
- The performance of the algorithm presented and the characteristics of the outcomes
obtained are very sensitive to the choice of the minimum cluster size and of the number of
neighbouring entities from which the adjacency set of each entity is derived. If the latter
is chosen too small, the proposed method may result in the formation of “degenerate”
clusters with only very few elements.
- Many datasets to which this method can, in principle, be applied may contain one or
more “irrelevant” variables that do not contain any information on the basis of which
objects can be meaningfully divided into groups of interconnected elements.
- Especially in datasets with many variables, strong dependency relationships between
individual attributes, or subsets thereof, can prevent the application of distance-based
grouping procedures of the kind described here.
- The problem persists that notions like “distance” or “neighbourhood” become less
significant as the dimension of a dataset increases. The method proposed in this paper
may help mitigate this under favourable conditions, but it is by no means a complete
solution.
- The performance of the proposed algorithm is very sensitive to the structure and
distribution of the data to which it is applied. Hence, it remains an open question to what
extent the proposed algorithm can be generalized across different datasets.
5. CONCLUSIONS
In this paper, a distance-based clustering procedure for data of mixed type has been proposed.
The feasibility of the proposed method and its ability to produce empirically plausible results
were demonstrated using application examples of varying nature and complexity.
However, the curse of dimensionality and the possible presence of irrelevant or strongly
interrelated (groups of) variables remain issues that can lead to great difficulties for such
applications. Hence, augmenting the proposed technique with feature selection and/or
dimensionality reduction techniques that may help mitigate this problem is a promising area for
future research.
ACKNOWLEDGEMENTS
The authors are grateful for several helpful comments and suggestions by two anonymous
referees.
The project underlying this publication was funded by the German Federal Ministry of Economic
Affairs and Climate Action under project funding reference number 01MK21002G.
REFERENCES
[1] Aggarwal, C.C., Hinneburg, A., & Keim, D. A. (2001). On the Surprising Behavior of Distance
Metrics in High Dimensional Space. In J. van den Bussche, & V. Vianu (Eds.), Database Theory —
ICDT 2001. Berlin (Springer).
[2] Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering Points To
Identify the Clustering Structure. In: ACM SIGMOD Record, 28(2), 49-60.
https://doi.org/10.1145/304181.304187
[3] Bellman, R. E. (1961). Adaptive Control Processes: a Guided Tour. Princeton (University Press).
[4] Briguglio, L., Cordina, G., Farrugia, N. & Vella, S. (2008). Economic vulnerability and resilience
concepts and measurements. Helsinki (United Nations University).
[5] Ester, M., Kriegel, H.P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering
clusters in large spatial databases with noise. In E. Simoudis, J. Han & U.M. Fayyad (Eds.),
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining
(KDD-96), Washington, D.C. (AAAI Press).
[6] Gionis, A., Mannila, H., & Tsaparas, P. (2007). Clustering aggregation. ACM Transactions on
Knowledge Discovery from Data, 1(1), 1-30. https://doi.org/10.1145/1217299.1217303
[7] Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27,
623-637. https://doi.org/10.2307/2528823
[8] Greene, W.H. (2003). Econometric Analysis. Upper Saddle River, NJ (Prentice Hall).
[9] Hamming, R.W. (1950). Error detecting and error correcting codes. Bell System Technical Journal,
29 (2), 147–160. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
[10] Houle, M. E., Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2010). Can shared-neighbor
distances defeat the curse of dimensionality? In: M. Gertz, & B. Ludäscher (Eds.), Scientific and
Statistical Database Management (SSDBM 2010). Lecture Notes in Computer Science, vol 6187.
Berlin and Heidelberg (Springer).
[11] Jain, A., & Law, M. (2005). Data Clustering: A User’s Dilemma. In S.K. Pal, S. Bandyopadhyay &
S. Biswas (Eds.), Pattern Recognition and Machine Intelligence. (Lecture Notes in Computer
Science, vol 3776), Berlin (Springer). https://doi.org/10.1007/11590316_1
[12] Khatun, M., & Siddiqui, S. (2023). Estimating Conditional Event Probabilities with Mixed
Regressors: a Weighted Nearest Neighbour Approach. Statistika 103(2), 226-234.
[13] Kriegel, H.-P., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based Clustering. WIREs Data
Mining and Knowledge Discovery, 1(3), 231–240. https://doi.org/10.1002/widm.30
[14] Rezaei, M., & Fränti, P. (2016). Set-matching measures for external cluster validity. IEEE Trans. on
Knowledge and Data Engineering 28 (8), 2173-2186. https://doi.org/10.1109/TKDE.2016.2551240
[15] Oduntan, O. I. (2020). Blending Multiple Algorithmic Granular Components: A Recipe for
Clustering. Department of Computer Science, The University of Manitoba, Winnipeg, Canada.
[16] Stanfill, C., & Waltz, D. (1986). Toward memory-based reasoning. Commun. ACM, 29(12), 1213–
1228. https://doi.org/10.1145/7902.7906
[17] Trading Economics (2023). Indicators. Retrieved July 1, 2023 from
https://tradingeconomics.com/indicators
[18] Veenman, C.J., Reinders, M.J.T., & Backer, E. (2002). A maximum variance cluster algorithm.
IEEE Trans. Pattern Analysis and Machine Intelligence, 24(9), 1273-1280.
https://doi.org/10.1109/TPAMI.2002.1033218
[19] World Bank (2023). World Bank Units. Retrieved July 1, 2023 from
https://www.worldbank.org/en/about/unit
[20] Yang, X.-S. (2019). Introduction to Algorithms for Data Mining and Machine Learning. Amsterdam
(Elsevier).