Clustering of multivariate Big Data

Clustering is the division of a collection of data into groups, or clusters, such that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

When the data are not very high-dimensional but are too numerous to fit in memory, either because they belong to a huge dataset or because they arrive in streams and must be processed immediately or be lost, specific algorithms are needed that analyse the data progressively, keep only a small set of summary statistics in memory, and discard the already processed data to free memory.

Situations like this, in which clustering plays a fundamental role, recur in many applications: customer segmentation on e-commerce web sites, image analysis of video frames for object recognition, recognition of human movements from sensors placed on the body or in a smartphone, segmentation of a geographic region monitored by climatic sensors, and so on.

The key element of algorithms designed for this type of big data is a method to update the summary statistics retained in memory as each new observation, or group of observations, is processed. A first and widely recognized method for clustering big data is the Bradley-Fayyad-Reina (BFR) algorithm [2], an extension of the classical K-means algorithm. Its main weakness lies in the assumption that the covariance matrix of each cluster is diagonal, i.e. that the components of the analysed multivariate data are uncorrelated. Under this assumption, at each step of the algorithm only the mean and variance of each component of each cluster must be retained.
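To make the updating mechanism concrete, here is a minimal sketch in Python (our own illustration, not the code of [2]) of the per-cluster summary that BFR maintains under the diagonal-covariance assumption: a count, a componentwise sum, and a componentwise sum of squares, from which means and variances are recovered.

```python
import numpy as np

class DiagonalClusterSummary:
    """BFR-style summary of one cluster: count, componentwise sum
    and sum of squares, retained in place of the raw points."""

    def __init__(self, dim):
        self.n = 0                    # number of points absorbed so far
        self.sum = np.zeros(dim)      # componentwise sum of the points
        self.sumsq = np.zeros(dim)    # componentwise sum of squares

    def update(self, batch):
        """Absorb a batch of points (shape: n_points x dim); the raw
        points can then be discarded to free memory."""
        batch = np.atleast_2d(batch)
        self.n += batch.shape[0]
        self.sum += batch.sum(axis=0)
        self.sumsq += (batch ** 2).sum(axis=0)

    @property
    def mean(self):
        return self.sum / self.n

    @property
    def variance(self):
        # Diagonal of the covariance matrix; the off-diagonal
        # terms are assumed to be zero in BFR.
        return self.sumsq / self.n - self.mean ** 2
```

Because the retained quantities are sums, the update is exact regardless of how the stream is split into batches.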

In [1] we describe an extension of the BFR algorithm to the case of clusters having a “full” covariance matrix. Since our method must also retain the covariance terms of each cluster, the computational cost increases, but the increase is easily controlled and remains affordable as long as the processed data are not extremely high-dimensional.
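As a rough illustration of what retaining the full covariance entails (a sketch with our own naming, not the algorithm of [1]), the summary can additionally keep the d-by-d sum of outer products, raising the memory cost per cluster from O(d) to O(d²):

```python
import numpy as np

class FullCovClusterSummary:
    """Cluster summary extended with the sum of outer products,
    so that the full covariance matrix can be updated incrementally."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.outer = np.zeros((dim, dim))  # running sum of x x^T

    def update(self, batch):
        batch = np.atleast_2d(batch)
        self.n += batch.shape[0]
        self.sum += batch.sum(axis=0)
        self.outer += batch.T @ batch      # adds each point's outer product

    @property
    def mean(self):
        return self.sum / self.n

    @property
    def covariance(self):
        m = self.mean
        return self.outer / self.n - np.outer(m, m)
```

The quadratic growth in d is the controlled cost increase mentioned above: negligible for moderate dimensions, prohibitive only for extremely high-dimensional data.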

The method is based on the Mahalanobis distance, and we address the estimation of the covariance matrices using a Steinian shrinkage method [3]. Our algorithm has shown good performance when tested on synthetic data.
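To illustrate how these two ingredients combine (again only a sketch: the simple shrinkage toward a scaled identity below stands in for the nonparametric Stein-type estimator of [3], which selects target and intensity from the data), a point can be compared to a cluster as follows:

```python
import numpy as np

def shrunk_covariance(sample_cov, rho):
    """Shrink the sample covariance toward a scaled identity,
    with rho in [0, 1]; a placeholder for the estimator of [3]."""
    d = sample_cov.shape[0]
    target = np.trace(sample_cov) / d * np.eye(d)
    return (1.0 - rho) * sample_cov + rho * target

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a cluster with the given
    mean and (shrunk, hence well-conditioned) covariance matrix."""
    diff = np.asarray(x) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

A new point is assigned to the cluster at the smallest Mahalanobis distance, provided that distance is small enough; shrinkage keeps the estimated covariance invertible even when a cluster has absorbed only a few points.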

Authors

Giacomo Aletti, Alessandra Micheletti

ADAMSS Centre and Department of Environmental Science and Policy

Università degli Studi di Milano


References

[1] Aletti, G., Micheletti, A.: A clustering algorithm for multivariate big data with correlated components. In: Proceedings of SIS2017 (2017)

[2] Bradley, P.S., Fayyad, U., Reina, C.: Scaling clustering algorithms to large databases. In: KDD-98 Proceedings, pp. 1-7. American Association for Artificial Intelligence (1998)

[3] Touloumis, A.: Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings. Comput. Statist. Data Anal. 83, 251-261 (2015)
