Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies

Abstract

Objective

To compare different reliability coefficients (exact agreement and three variants of kappa: generalised kappa, Cohen's kappa, and the Prevalence-Adjusted and Bias-Adjusted Kappa (PABAK)) for four physiotherapists conducting visual assessments of scapulae.
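The abstract does not give formulas for these coefficients, so the following is only a minimal sketch of the standard two-rater definitions: exact agreement is the observed proportion of matching ratings, Cohen's kappa corrects that proportion for chance agreement estimated from each rater's marginal frequencies, and PABAK depends only on the observed agreement and the number of categories. The function names and the ratings are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def exact_agreement(a, b):
    """Proportion of subjects on which two raters give the same rating."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters: (Po - Pe) / (1 - Pe)."""
    n = len(a)
    p_o = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal proportions.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


def pabak(a, b, n_categories=2):
    """PABAK: (k * Po - 1) / (k - 1), which reduces to 2 * Po - 1 for k = 2."""
    p_o = exact_agreement(a, b)
    return (n_categories * p_o - 1.0) / (n_categories - 1.0)


# Hypothetical binary ratings (1 = normal posture, 0 = altered posture)
# with a very high prevalence of one category.
rater_1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
rater_2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

print("exact agreement:", exact_agreement(rater_1, rater_2))
print("Cohen's kappa:  ", cohen_kappa(rater_1, rater_2))
print("PABAK:          ", pabak(rater_1, rater_2))
```

With ratings this skewed towards one category, Cohen's kappa can be low or even negative despite high exact agreement, whereas PABAK tracks the observed agreement directly. This is one reason the coefficients can diverge for the same therapist pair.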

Design

Inter-therapist reliability study.

Setting

Research laboratory.

Participants

Thirty individuals with no history of neck or shoulder pain and no obvious significant postural abnormalities were recruited.

Main outcome measures

Ratings of scapular posture were recorded in multiple biomechanical planes under four test conditions (at rest, and while under three isometric conditions) by four physiotherapists.

Results

The magnitude of discrepancy between the two therapist pairs ranged from 0.04 to 0.76 for Cohen's kappa and from 0.00 to 0.86 for PABAK. In comparison, the generalised kappa provided a score between the two paired kappa coefficients. The differences between the mean generalised kappa and the mean Cohen's kappa (0.02), and between the mean generalised kappa and PABAK (0.02), were negligible. However, the magnitude of difference between the generalised kappa and the paired kappa within each plane and condition was substantial: 0.02 to 0.57 for Cohen's kappa and 0.02 to 0.63 for PABAK.

Conclusions

Calculating coefficients for therapist pairs alone may result in inconsistent findings. In contrast, the generalised kappa provided a coefficient close to the mean of the paired kappa coefficients. These findings support the assertion that generalised kappa may better represent reliability between three or more raters, and that reliability studies calculating agreement between only two raters should be interpreted with caution. However, generalised kappa may mask more extreme cases of agreement (or disagreement) that paired comparisons may reveal.
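The relationship between a multi-rater coefficient and the paired coefficients can be illustrated with a short sketch. It assumes that the generalised kappa corresponds to Fleiss' multi-rater kappa, which the abstract does not specify, and it uses hypothetical ratings from four raters rather than data from the study. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters: (Po - Pe) / (1 - Pe)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one row per subject, one rating per rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({r for row in ratings for r in row})
    counts = [Counter(row) for row in ratings]
    # Per-subject agreement: proportion of rater pairs that agree on the subject.
    p_i = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the pooled category proportions across all raters.
    p_j = [sum(c[k] for c in counts) / (n_subjects * n_raters) for k in categories]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical ratings: 6 subjects (rows) rated by 4 raters (columns).
ratings = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]

# Cohen's kappa for every pair of raters (6 pairs for 4 raters).
raters = list(zip(*ratings))
pairwise = {
    (i + 1, j + 1): cohen_kappa(raters[i], raters[j])
    for i, j in combinations(range(len(raters)), 2)
}
mean_pairwise = sum(pairwise.values()) / len(pairwise)
generalised = fleiss_kappa(ratings)

for pair, kappa in pairwise.items():
    print(f"raters {pair}: Cohen's kappa = {kappa:.2f}")
print(f"mean paired kappa = {mean_pairwise:.2f}")
print(f"generalised (Fleiss) kappa = {generalised:.2f}")
```

In this toy data set the paired coefficients vary widely between rater pairs, while the multi-rater coefficient falls within their range. That pattern matches the one described above: a single summary value that sits near the centre of the paired values but hides the spread between them.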

Citation

Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies. Physiotherapy, March 2014 (Vol. 100, Issue 1, Pages 27-35, DOI: 10.1016/j.physio.2013.08.002)