Why do diagnostic tests differ in performance between different settings?

Authors
Willis B1, Hyde C2
1University of Birmingham, UK
2University of Exeter, UK
Abstract
Background: Whether the findings from a diagnostic test accuracy (DTA) study may be applied to another setting is important to evidence-based practice. First, the study should have internal validity: the results are a true representation of the test’s performance within the study setting. However, internal validity does not guarantee external validity, that is, that the results may be generalised to other clinical settings. Further, we know from meta-analyses that widespread variation between DTA studies is commonplace.

Objectives: To identify the different sources of variation between DTA studies before considering the implications for diagnostic research.

Methodology: The literature was searched for methodological and primary studies that evaluated a test’s performance across different settings.

Results: DTA studies are affected by both artefactual and real variation. Artefactual variation arises predominantly when the design differs between studies, which affects the internal validity of the results. However, even ‘internally valid’ studies evaluating the same test and target disorder may report different test accuracies because of real variation. This has three sources. The test’s execution may vary between studies due to poor reliability, cognitive errors by the operators and changes in prevalence. Similarly, cognitive biases and the disease prevalence may affect the test’s threshold. Patient spectrum, which reflects the mix of patients with and without disease, may also change between studies.

Conclusion: A number of conditions need to be met if the results of a DTA study are to be applicable in practice. Although it is unclear to what extent these conditions vary in practice, there is an obvious difficulty in ensuring they are similar to those of the reported study. This has implications for the concept of external validity. If test settings, test execution, and thresholds do vary in practice, then designing a study to be applicable in multiple settings may be unattainable in a large number of cases.