Histogram-based clustering of multiple data streams

Balzanella A.; Verde R.
2019

Abstract

This paper introduces a strategy for the online clustering of multiple data streams. We assume that several sources record, over time, data about some physical phenomena. Each source provides repeated measurements at such a high frequency that the full data cannot be stored on easily accessible media; instead, the data are available only in batches. Our aim is to partition the sources (e.g. sensors) into homogeneous clusters by analysing the incoming streams. The proposed strategy processes each incoming batch independently: the batch is first summarized by histograms, and a local clustering of those histograms then provides a further summarization. To keep track of the proximities among the data streams over time, the local clustering outputs are used to update a proximity matrix. The final partition of the streams is obtained by a clustering based on this proximity matrix. Through applications to real and simulated data, we show the effectiveness of the strategy in finding homogeneous groups of data stream sources.
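
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the histogram distance, the local clustering algorithm, the proximity update rule, or the final clustering method, so every concrete choice below is an assumption made for illustration only: a CDF-based (1-Wasserstein-style) distance on equal-width bins, a simple leader clustering per batch, a co-clustering count as the proximity matrix, and a thresholded connected-components step as the final clustering. All function names and parameters (`radius`, `min_share`) are hypothetical.

```python
import random

def histogram(values, edges):
    """Relative-frequency histogram of `values` over fixed bin `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Count inner edges at or below v; out-of-range values fall in the end bins.
        k = sum(1 for e in edges[1:-1] if v >= e)
        counts[k] += 1
    return [c / len(values) for c in counts]

def cdf_distance(p, q, width=1.0):
    """Sum of absolute CDF differences times bin width (assumed distance)."""
    dist, fp, fq = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        fp += a
        fq += b
        dist += abs(fp - fq)
    return width * dist

def local_clustering(hists, radius):
    """Leader clustering (assumed): join the first leader within `radius`."""
    leaders, labels = [], []
    for h in hists:
        for c, leader in enumerate(leaders):
            if cdf_distance(h, leader) <= radius:
                labels.append(c)
                break
        else:
            leaders.append(h)
            labels.append(len(leaders) - 1)
    return labels

def update_proximity(prox, labels):
    """Add 1 to prox[i][j] whenever streams i and j share a local cluster."""
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] == labels[j]:
                prox[i][j] += 1

def final_partition(prox, n_batches, min_share=0.5):
    """Connected components of the graph linking streams that were
    co-clustered in at least `min_share` of the batches (assumed rule)."""
    n = len(prox)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and prox[i][j] / n_batches >= min_share:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def cluster_streams(batches, edges, radius, min_share=0.5):
    """Process batches one at a time; each batch holds one list of values per stream."""
    prox, n_batches = None, 0
    for batch in batches:
        if prox is None:
            prox = [[0] * len(batch) for _ in batch]
        hists = [histogram(values, edges) for values in batch]
        update_proximity(prox, local_clustering(hists, radius))
        n_batches += 1  # the raw batch can be discarded after this point
    return final_partition(prox, n_batches, min_share)

# Simulated example: four sources, two around 0 and two around 5.
random.seed(0)
edges = list(range(-4, 10))  # 13 unit-width bins on [-4, 9]
means = [0.0, 0.0, 5.0, 5.0]
batches = ([[random.gauss(m, 1.0) for _ in range(200)] for m in means]
           for _ in range(10))
labels = cluster_streams(batches, edges, radius=1.0)
print(labels)
```

Only the proximity matrix and a batch counter persist between batches, which reflects the constraint stated in the abstract that the raw data cannot be stored. In this simulated setting the first two sources and the last two sources should end up in separate groups.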

Use this identifier to cite or link to this document: https://hdl.handle.net/11591/442180