Abstract
Background: It is widely acknowledged that newly developed diagnostic or prognostic prediction models should be externally validated to assess their performance. It is recommended to test the model in ‘different but related’ subjects, but criteria for ‘different but related’ are lacking.
Objectives: To propose a framework of methodological steps for analyzing and interpreting the results of validation studies of prediction models.
Methods: We identify whether the validation sample evaluates the model’s reproducibility or transportability by quantifying case mix differences with the development sample. To this end, we use an adaptation of the Mahalanobis distance metric and compare the distributions of the linear predictor. We quantify the model’s performance with standard metrics for discrimination and calibration. Finally, we illustrate this approach with three validation datasets for a previously developed prediction model for deep venous thrombosis.
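To illustrate the case mix comparison, a minimal Python sketch is given below. The function names and the choice of a Kolmogorov–Smirnov test for the linear predictor are illustrative assumptions; the exact adaptation of the Mahalanobis metric used in the study may differ.

```python
import numpy as np
from scipy import stats

def mahalanobis_case_mix(X_dev, X_val):
    """Distance between the mean predictor vectors of the development and
    validation samples, scaled by their pooled covariance (a case-mix summary).
    Hypothetical helper, not taken from the study."""
    mu_dev, mu_val = X_dev.mean(axis=0), X_val.mean(axis=0)
    n_dev, n_val = len(X_dev), len(X_val)
    # Pooled covariance matrix across both samples
    S = ((n_dev - 1) * np.cov(X_dev, rowvar=False) +
         (n_val - 1) * np.cov(X_val, rowvar=False)) / (n_dev + n_val - 2)
    diff = mu_dev - mu_val
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

def compare_linear_predictors(lp_dev, lp_val):
    """Two-sample comparison of the linear predictor distributions
    (assumed test; the study may use a different statistic)."""
    return stats.ks_2samp(lp_dev, lp_val)
```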
Results: The first validation study had a similar case mix distribution (p = 0.752) and should therefore be interpreted as evaluating model reproducibility. Model performance was adequate (C = 0.78, calibration slope = 0.90), except for the model intercept (calibration-in-the-large = −0.5, p < 0.0001). In the other two validation studies, we found substantial case mix differences (p < 0.0001) and reduced model calibration (such as non-linear calibration slopes). These validation samples evaluated the model’s transportability and revealed the need for more extensive updating strategies.
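The performance metrics reported above (C-statistic, calibration slope, calibration-in-the-large) can be computed from the linear predictor and observed outcomes using standard logistic-regression formulations, sketched below. The helper name validation_metrics is an assumption for illustration, not the study's code.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def validation_metrics(lp_val, y_val):
    """Standard external-validation metrics from the model's linear
    predictor (lp_val) and binary outcomes (y_val) in the validation sample."""
    # Discrimination: C-statistic (area under the ROC curve)
    c_stat = roc_auc_score(y_val, lp_val)
    # Calibration slope: coefficient of the linear predictor in a
    # logistic recalibration model; values near 1 indicate good calibration
    slope_fit = sm.GLM(y_val, sm.add_constant(lp_val),
                       family=sm.families.Binomial()).fit()
    cal_slope = slope_fit.params[1]
    # Calibration-in-the-large: intercept of a logistic model with the
    # linear predictor as offset; tests for systematic over/underprediction
    citl_fit = sm.GLM(y_val, np.ones((len(y_val), 1)), offset=lp_val,
                      family=sm.families.Binomial()).fit()
    citl, citl_p = citl_fit.params[0], citl_fit.pvalues[0]
    return c_stat, cal_slope, citl, citl_p
```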
Conclusions: The proposed framework improves the interpretability of validation studies of prediction models. The steps are straightforward to implement and may enhance the transparency of prediction research.