Data from: Empirical extensions of the lasso penalty to reduce the false discovery rate in high-dimensional Cox regression models

Correct selection of prognostic biomarkers among multiple candidates is becoming increasingly challenging as the dimensionality of biological data becomes higher. Therefore, minimizing the false discovery rate (FDR) is of primary importance, while a low false negative rate (FNR) is a complementary measure. The lasso is a popular selection method in Cox regression, but its results depend heavily on the penalty parameter λ. Usually, λ is chosen using maximum cross-validated log-likelihood (max-cvl). However, this method has often a very high FDR. We review methods for a more conservative choice of λ. We propose an empirical extension of the cvl by adding a penalization term, which trades-off between the goodness-of-fit and the parsimony of the model, leading to the selection of fewer biomarkers and, as we show, to the reduction of the FDR without large increase in FNR. We conducted a simulation study considering null and moderately sparse alternative scenarios and compared our approach to the standard lasso and ten other competitors: AIC, AICC, BIC, eBIC, HQIC, RIC, one-standard-error rule, adaptive lasso, stability selection and percentile lasso. Our extension achieved the best compromise across all the scenarios between a reduction of the FDR and a limited raise of the FNR, followed by the AIC, the RIC and the adaptive lasso which performed well in some settings. We illustrate the methods using gene expression data of 523 breast cancer patients. In conclusion, we propose to apply our extension to the lasso whenever a stringent FDR with a limited FNR is targeted.