Co-clustering algorithms for distributional data with automated variable weighting
Antonio Balzanella; Antonio Irpino; Rosanna Verde
2021
Abstract
This paper is concerned with the co-clustering of distribution-valued data, that is, the simultaneous partitioning of the rows and columns of an input data table whose elements are distributions (or histograms) representing aggregate data. The first proposed method extends the double k-means algorithm to distributional data. The Wasserstein distance, also known as Mallows' distance, is used to compare distributions. To account for the differing relevance of the variables that characterize the clusters, four variants of adaptive distributional double k-means are proposed. Accordingly, an additional step is introduced into the co-clustering procedure to compute the relevance weights associated with the variables. In particular, the four algorithms provide, respectively: i) a single set of weights for the variables; ii) different sets of weights for the variables, one for each cluster (cluster-wise); iii) a double set of weights for the variables, according to the decomposition of the Wasserstein distance into two components; iv) different double sets of weights for the variables and distance components, one for each cluster (cluster-wise). Applications using simulated and real data demonstrate the effectiveness of the proposed algorithms and show how the relevance weights contribute to the co-clustering procedure according to the structure of the data.
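The two-component decomposition mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each distribution is available as a one-dimensional sample, that the squared L2 Wasserstein distance is approximated by averaging squared differences of quantiles on a regular grid, and that the two components are a location term (squared difference of means) and a variability/shape term (distance between centered quantile functions). The function names `quantile_grid` and `squared_wasserstein_components` and the parameter `n_levels` are hypothetical.

```python
import numpy as np


def quantile_grid(sample, n_levels=1000):
    """Evaluate the empirical quantile function on a midpoint grid in (0, 1)."""
    levels = (np.arange(n_levels) + 0.5) / n_levels
    return np.quantile(np.asarray(sample, dtype=float), levels)


def squared_wasserstein_components(sample_a, sample_b, n_levels=1000):
    """Approximate the squared L2 Wasserstein distance and its two components.

    Returns (total, location, shape), where, on the quantile grid,
    total = location + shape up to floating-point rounding.
    """
    qa = quantile_grid(sample_a, n_levels)
    qb = quantile_grid(sample_b, n_levels)

    # Squared L2 distance between the two quantile functions.
    total = np.mean((qa - qb) ** 2)

    # Location component: squared difference between the (grid) means.
    mean_a, mean_b = qa.mean(), qb.mean()
    location = (mean_a - mean_b) ** 2

    # Variability/shape component: distance between centered quantile functions.
    shape = np.mean(((qa - mean_a) - (qb - mean_b)) ** 2)

    return total, location, shape


# Illustrative data (assumption): two Gaussian samples with different
# location and spread.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=2.0, scale=3.0, size=500)
total, location, shape = squared_wasserstein_components(a, b)
print(total, location, shape)
```

In this sketch, two samples that differ only by a constant shift produce a shape component close to zero, so the whole distance is carried by the location term. Weighting the two components separately, as variants iii) and iv) describe, would let a variable count for its location, its variability, or both.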