Dataset for: On the analysis of two-phase designs in cluster-correlated settings

In public health research, information that is readily available may be insufficient to address the primary question(s) of interest. One cost-efficient way forward, especially in resource-limited settings, is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s). At phase II, detailed covariate data are ascertained on a sub-sample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. As such, when participants are naturally clustered (e.g., patients within clinics), these methods may yield invalid inference. To address this, we develop a novel analysis approach based on inverse-probability weighting (IPW) that permits researchers to specify some working covariance structure, appropriately accounts for the sampling design and ensures valid inference via a robust sandwich estimator. In addition, to enhance statistical efficiency, we propose a calibrated IPW estimator that makes use of information available at phase I but not used in the design. A comprehensive simulation study is conducted to evaluate small-sample operating characteristics, including the impact of using naïve methods that ignore correlation due to clustering, as well as to investigate design considerations. Finally, the methods are illustrated using data from a one-time survey of the national anti-retroviral treatment program in Malawi.
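
To make the general idea concrete, the following is a minimal sketch of an IPW logistic regression under a working-independence covariance structure, with a cluster-robust sandwich variance. It is not the estimator developed in the paper: it treats the inverse sampling probabilities as fixed weights, uses Bernoulli rather than fixed-size stratified sampling at phase II, and omits the calibration step. The function name `ipw_logistic_cluster_robust`, the simulated data and all parameter values are assumptions chosen purely for illustration.

```python
import numpy as np

def ipw_logistic_cluster_robust(X, y, weights, cluster, n_iter=50, tol=1e-10):
    """Hypothetical helper: weighted logistic regression (working independence)
    with a cluster-robust sandwich covariance. Weights are treated as fixed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    cluster = np.asarray(cluster)

    # Newton-Raphson on the weighted score equations.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - mu))
        bread = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    # "Bread": weighted information matrix at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    bread = (X * (w * mu * (1.0 - mu))[:, None]).T @ X

    # "Meat": outer products of weighted score contributions summed within cluster.
    U = X * (w * (y - mu))[:, None]
    labels, inverse = np.unique(cluster, return_inverse=True)
    U_k = np.zeros((labels.size, X.shape[1]))
    np.add.at(U_k, inverse, U)
    meat = U_k.T @ U_k

    bread_inv = np.linalg.inv(bread)
    cov = bread_inv @ meat @ bread_inv
    return beta, cov

# Simulated illustration (all values are assumptions, not from the study data).
rng = np.random.default_rng(0)
K, m = 200, 20                               # clusters and members per cluster
cluster = np.repeat(np.arange(K), m)
b = rng.normal(0.0, 0.5, K)[cluster]         # cluster-level random intercept
x = rng.normal(size=K * m)
p = 1.0 / (1.0 + np.exp(-(-1.5 + 0.8 * x + b)))
y = rng.binomial(1, p)

# Phase II: outcome-dependent sampling with known selection probabilities.
pi = np.where(y == 1, 0.8, 0.2)
sel = rng.random(K * m) < pi

X = np.column_stack([np.ones(sel.sum()), x[sel]])
beta, cov = ipw_logistic_cluster_robust(X, y[sel], 1.0 / pi[sel], cluster[sel])
se = np.sqrt(np.diag(cov))
print("estimates:", beta)
print("cluster-robust SEs:", se)
```

Summing the weighted score contributions within each cluster before forming the outer product is what lets the variance estimate reflect within-cluster correlation; dropping that step would correspond to the kind of naïve analysis that ignores clustering.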