prozac.txt (29.11 kB)
Dataset for: Some Remarks on the R2 for Clustering
posted on 2018-05-14, 12:12 authored by Nicola Loperfido, Thaddeus Tarpey

A common descriptive statistic in cluster analysis is the $R^2$, which measures the
overall proportion of variance explained by the cluster means. This note highlights properties of the
$R^2$ for clustering. In particular, we show that the $R^2$ can generally
be artificially inflated by linearly transforming the data, either by "stretching" or by projecting.
Also, the $R^2$ for clustering will often
be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering
for misspecified models.
Several simulation illustrations are provided that highlight weaknesses of the clustering $R^2$, especially in
high-dimensional settings.
A functional data example is given showing that the $R^2$ for clustering can vary dramatically depending on how
the curves are estimated.
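
The "stretching" effect described above can be illustrated with a minimal sketch. It assumes the usual definition of the clustering $R^2$ as one minus the ratio of the within-cluster sum of squares to the total sum of squares, which may differ in detail from the definition used in the paper. The function name `clustering_r2`, the simulated two-cluster data, and the stretch factor are all hypothetical; the example does not use prozac.txt, and it holds the cluster labels fixed rather than re-clustering after the transformation.

```python
import numpy as np


def clustering_r2(X, labels):
    """Proportion of total variance explained by the cluster means.

    Assumed definition: 1 - (within-cluster SS) / (total SS).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    within_ss = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        within_ss += np.sum((members - members.mean(axis=0)) ** 2)
    return 1.0 - within_ss / total_ss


# Hypothetical toy data: two Gaussian clusters in the plane whose means
# differ only along the first coordinate.
rng = np.random.default_rng(0)
n_per_cluster = 200
labels = np.repeat([0, 1], n_per_cluster)
X = rng.normal(size=(2 * n_per_cluster, 2))
X[labels == 1, 0] += 3.0

r2_original = clustering_r2(X, labels)

# Linear "stretch": rescale the coordinate that separates the clusters.
# The labels are kept fixed (an assumption made for simplicity).
X_stretched = X * np.array([5.0, 1.0])
r2_stretched = clustering_r2(X_stretched, labels)

print(f"R^2 before stretching: {r2_original:.3f}")
print(f"R^2 after stretching:  {r2_stretched:.3f}")
```

In this sketch the stretch only changes the scale of one coordinate, so the cluster assignment is no better than before, yet the reported $R^2$ goes up because the separating direction now dominates the total variance.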