prozac.txt (29.11 kB)

Dataset for: Some Remarks on the $R^2$ for Clustering

dataset
posted on 14.05.2018 by Nicola Loperfido, Thaddeus Tarpey
A common descriptive statistic in cluster analysis is the $R^2$, which measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be inflated artificially by linearly transforming the data, either by "stretching" or by projecting. The $R^2$ for clustering will also often be a poor measure of clustering quality in high-dimensional settings. We further investigate the $R^2$ for clustering under misspecified models. Several simulation illustrations highlight weaknesses of the clustering $R^2$, especially in high-dimensional settings. A functional data example shows that the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
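To make the "stretching" remark concrete, the following is a minimal sketch, not the authors' code and not based on the contents of prozac.txt. It assumes the usual definition $R^2 = 1 - \mathrm{WSS}/\mathrm{TSS}$, where WSS is the within-cluster sum of squares about the cluster means and TSS is the total sum of squares about the grand mean. The function name `clustering_r2`, the four toy points, and the stretch factor of 3 are all hypothetical choices made for illustration. As a simplification, the cluster labels are held fixed rather than re-estimated after the transformation.

```python
import numpy as np


def clustering_r2(X, labels):
    """Proportion of total variance explained by the cluster means (1 - WSS/TSS)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares about the grand mean.
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    # Within-cluster sum of squares about each cluster's own mean.
    within_ss = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        within_ss += np.sum((members - members.mean(axis=0)) ** 2)
    return 1.0 - within_ss / total_ss


# Toy data (hypothetical): two clusters separated along the first coordinate.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Linear "stretch" of the first coordinate by a factor of 3.
stretch = np.diag([3.0, 1.0])

r2_original = clustering_r2(X, labels)
r2_stretched = clustering_r2(X @ stretch, labels)

print(f"R^2 before stretching: {r2_original:.2f}")   # 0.50
print(f"R^2 after stretching:  {r2_stretched:.2f}")  # 0.90
```

The grouping of the points is identical in both cases. Only the scale of the separating direction changed, yet the $R^2$ rises from 0.5 to 0.9, which is the kind of inflation the abstract describes.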

History

collectionID

4065347
