## Dataset for: Some Remarks on the R^{2} for Clustering

dataset

posted on 14.05.2018 by Nicola Loperfido, Thaddeus Tarpey

A common descriptive statistic in cluster analysis is the $R^2$, which measures the
overall proportion of variance explained by the cluster means. This note highlights properties of the
$R^2$ for clustering. In particular, we show that the $R^2$ can generally
be artificially inflated by linearly transforming the data, either by "stretching" or by projecting.
Moreover, the $R^2$ for clustering will often
be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering
for misspecified models.
Several simulation illustrations highlight weaknesses of the clustering $R^2$, especially in
high-dimensional settings.
A functional data example shows that the $R^2$ for clustering can vary dramatically depending on how
the curves are estimated.
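
The "stretching" effect described above can be illustrated with a minimal sketch. It is not the authors' code and does not use the posted dataset. It assumes the usual definition of the clustering $R^2$ as one minus the ratio of the within-cluster sum of squares to the total sum of squares, uses a plain Lloyd-style k-means as the clustering method, and takes an arbitrary stretch matrix `diag(10, 1)`. The helper names `kmeans_labels` and `clustering_r2` are hypothetical. The simulated data are a single spherical Gaussian, so there is no cluster structure to recover in either case.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_init=5, n_iter=100):
    """Plain Lloyd-style k-means (illustrative helper); returns the best labels found."""
    best_labels, best_wss = None, np.inf
    for _ in range(n_init):
        # Start from k distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Squared distance from every point to every center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute centers; keep the old center if a cluster is empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        wss = ((X - centers[labels]) ** 2).sum()
        if wss < best_wss:
            best_labels, best_wss = labels, wss
    return best_labels


def clustering_r2(X, labels):
    """R^2 = 1 - (within-cluster SS) / (total SS), an assumed standard definition."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = sum(
        ((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
        for j in np.unique(labels)
    )
    return 1.0 - within_ss / total_ss


rng = np.random.default_rng(0)

# One spherical Gaussian cloud: no true clusters.
X = rng.standard_normal((2000, 2))

# Assumed linear "stretch": scale the first coordinate by 10.
A = np.diag([10.0, 1.0])
X_stretched = X @ A

r2_original = clustering_r2(X, kmeans_labels(X, 2, rng))
r2_stretched = clustering_r2(X_stretched, kmeans_labels(X_stretched, 2, rng))

print(f"R^2 with k=2, original data:  {r2_original:.3f}")
print(f"R^2 with k=2, stretched data: {r2_stretched:.3f}")
```

Under these assumptions the second value should come out noticeably larger than the first, even though the stretched data contain no more cluster structure than the original cloud.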