Clustering represents a fundamental procedure to provide users with meaningful insights from an original data set. The quality of the resulting clusters is largely dependent on the correct estimation of their number, K∗, which must be provided as an input parameter in many clustering algorithms. Only very few techniques provide an automatic detection of K∗ and are usually based on cluster validity indexes which are expensive with regard to computation time. Here, we present a new algorithm which allows one to obtain an accurate estimate of K∗, without partitioning data into the different clusters. This makes the algorithm particularly efficient in handling large-scale data sets from both the perspective of time and space complexity. The algorithm, indeed, highlights the block structure which is implicitly present in the similarity matrix, and associates K∗ to the number of blocks in the matrix. We test the algorithm on synthetic data sets with or without a hierarchical organization of elements. We explore a wide range of K∗ and show the effectiveness of the proposed algorithm to identify K∗, even more accurate than existing methods based on standard internal validity indexes, with a huge advantage in terms of computation time and memory storage. We also discuss the application of the novel algorithm to the de-clustering of instrumental earthquake catalogs, a procedure finalized to identify the level of background seismic activity useful for seismic hazard assessment.
Determining the number of clusters, before finding clusters, from the susceptibility of the similarity matrix
E. Lippiello
;S. Baccari;
2023
Abstract
Clustering represents a fundamental procedure to provide users with meaningful insights from an original data set. The quality of the resulting clusters is largely dependent on the correct estimation of their number, K∗, which must be provided as an input parameter in many clustering algorithms. Only very few techniques provide an automatic detection of K∗ and are usually based on cluster validity indexes which are expensive with regard to computation time. Here, we present a new algorithm which allows one to obtain an accurate estimate of K∗, without partitioning data into the different clusters. This makes the algorithm particularly efficient in handling large-scale data sets from both the perspective of time and space complexity. The algorithm, indeed, highlights the block structure which is implicitly present in the similarity matrix, and associates K∗ to the number of blocks in the matrix. We test the algorithm on synthetic data sets with or without a hierarchical organization of elements. We explore a wide range of K∗ and show the effectiveness of the proposed algorithm to identify K∗, even more accurate than existing methods based on standard internal validity indexes, with a huge advantage in terms of computation time and memory storage. We also discuss the application of the novel algorithm to the de-clustering of instrumental earthquake catalogs, a procedure finalized to identify the level of background seismic activity useful for seismic hazard assessment.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.