Reliability of a Scale for Measuring the Methodological Quality of Clinical Trials

Authors
Moseley A, Maher C, Herbert R, Sherrington C
Abstract
Introduction: An internet-based, searchable database of all randomised clinical trials relevant to physiotherapy, the Physiotherapy Evidence Database (PEDro), is currently being developed. In addition to cataloguing bibliographic details, author abstracts and codes to facilitate searching, we are rating the methodological quality of each trial using the PEDro scale. This scale is based on the Delphi list developed by Verhagen et al (1998), a 9-item list established by expert consensus (items: eligibility criteria specified; subjects randomly allocated to groups; concealed allocation; groups similar at baseline; blinding of subjects, therapists and assessors; intention to treat analysis; point measures and measures of variability reported). Two additional items not on the Delphi list (outcome measures obtained from more than 85% of subjects, and reporting of results of between-group statistical comparisons) have been included to give the 11-point PEDro scale.
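
As a rough illustration of how a total score on such a scale can be computed, here is a minimal Python sketch. The item names follow the abstract; treating each item as a yes/no judgement and the total as a simple count is an assumption made for illustration, not the published PEDro scoring rules.

    # Items of the 11-point scale, named as in the abstract.
    PEDRO_ITEMS = [
        "eligibility criteria specified",
        "subjects randomly allocated to groups",
        "concealed allocation",
        "groups similar at baseline",
        "blinding of subjects",
        "blinding of therapists",
        "blinding of assessors",
        "intention to treat analysis",
        "point measures and measures of variability reported",
        "outcome measures obtained from more than 85% of subjects",
        "between-group statistical comparisons reported",
    ]

    def pedro_total(ratings):
        """Total score = number of items the trial is judged to satisfy.

        `ratings` maps item name -> True/False; unrated items count as no.
        """
        return sum(bool(ratings.get(item)) for item in PEDRO_ITEMS)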

Objectives: To evaluate the inter-rater reliability of the PEDro scale.

Methods: Ten judges independently rated 25 clinical trials randomly selected from the 1800 papers archived in the PEDro database. The reliability of individual items was evaluated by calculating kappa statistics (K) and the observed agreement between ratings; the reliability of the total score was evaluated by calculating the intraclass correlation coefficient (ICC) and the percent close agreement.
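
For concreteness, the per-item statistics can be sketched in Python for a single pair of raters; with 10 judges these would be computed over all rater pairs, or with a multi-rater kappa such as Fleiss'. The data and function names below are illustrative, not the study's analysis code.

    def observed_agreement(a, b):
        """Proportion of trials on which two raters gave the same item score."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        """Chance-corrected agreement: K = (p_o - p_e) / (1 - p_e)."""
        n = len(a)
        p_o = observed_agreement(a, b)
        # Expected agreement if the two raters scored independently.
        p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
        if p_e == 1.0:
            # Both raters always gave the same single score, so K is undefined.
            # More generally, K is unstable when one score dominates -- the
            # "extremes in the prevalence of scores" noted in the Results.
            return float("nan")
        return (p_o - p_e) / (1 - p_e)

    # Two hypothetical raters scoring one yes/no item across 25 trials.
    rater1 = [1, 1, 0, 1, 0] * 5
    rater2 = [1, 0, 0, 1, 0] * 5
    print(observed_agreement(rater1, rater2))  # 0.8
    print(cohens_kappa(rater1, rater2))        # about 0.62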

Results: Observed agreement for individual items ranged from 69% (groups similar at baseline) to 96% (blinding of therapists), with a mean of 86%. Meaningful K-values could not be calculated for random allocation to groups, blinding of therapists, and intention to treat analysis because of extremes in the prevalence of scores. For the remaining 8 items, K-values ranged from 0.38 to 0.72, with a mean of 0.57; the items with the lowest K-values were groups similar at baseline and outcome measures obtained from more than 85% of subjects, and those with the highest were blinding of subjects and blinding of assessors. The ICC for the total score was 0.53 (95% CI: 0.38-0.70), which is comparable to the reliability reported by Jadad et al (1996). Raters' total scores were within 2 points of each other 91% of the time.
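
The two total-score statistics can be sketched similarly. The abstract does not state which form of ICC was used, so the one-way random-effects form, ICC(1,1), is assumed here, along with the 2-point tolerance for "close" agreement reported above.

    from itertools import combinations

    def percent_close_agreement(scores, tolerance=2):
        """Share of rater pairs, over all trials, whose total scores differ
        by at most `tolerance` points; scores[i] lists each judge's total
        score for trial i."""
        close = total = 0
        for trial in scores:
            for x, y in combinations(trial, 2):
                total += 1
                close += abs(x - y) <= tolerance
        return close / total

    def icc_oneway(scores):
        """ICC(1,1) from one-way ANOVA: (MSB - MSW) / (MSB + (k-1)*MSW)."""
        n, k = len(scores), len(scores[0])
        grand = sum(sum(t) for t in scores) / (n * k)
        msb = k * sum((sum(t) / k - grand) ** 2 for t in scores) / (n - 1)
        msw = sum((x - sum(t) / k) ** 2 for t in scores for x in t) / (n * (k - 1))
        return (msb - msw) / (msb + (k - 1) * msw)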

Discussion: In summary, the inter-rater reliability of the PEDro scale was acceptable. To increase the accuracy of quality ratings on the PEDro database, each trial will be independently rated by two reviewers, and a third rater will arbitrate where the two disagree.

References:
Verhagen et al (1998). Journal of Clinical Epidemiology 51(12): 1235-1241.
Jadad et al (1996). Controlled Clinical Trials 17: 1-12.