Why kappa, and not just percent agreement
When two reviewers screen the same records, you could simply report the percentage they classified the same way. The problem is that when both reviewers exclude most records, they will agree often by chance alone, so a high raw agreement can hide poor reliability. Cohen's kappa corrects for that chance agreement, which is why journals and methodologists commonly expect it as the measure of inter-rater reliability rather than a bare percentage.
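To make the chance correction concrete, here is a minimal Python sketch of the standard Cohen's kappa calculation for two reviewers making include/exclude decisions. The function name, argument names, and the counts in the example are illustrative assumptions, not the calculator's own code or real screening data.

```python
def cohen_kappa(both_include, a_only, b_only, both_exclude):
    """Return (observed, chance, kappa) for a 2x2 include/exclude table.

    Hypothetical helper for illustration. Arguments are record counts:
    both_include - both reviewers included the record
    a_only       - reviewer A included, reviewer B excluded
    b_only       - reviewer B included, reviewer A excluded
    both_exclude - both reviewers excluded the record
    """
    n = both_include + a_only + b_only + both_exclude
    if n == 0:
        raise ValueError("no records to compare")

    # Observed agreement: share of records with the same decision.
    observed = (both_include + both_exclude) / n

    # Chance agreement: expected overlap given each reviewer's include rate.
    a_rate = (both_include + a_only) / n
    b_rate = (both_include + b_only) / n
    chance = a_rate * b_rate + (1 - a_rate) * (1 - b_rate)

    if chance == 1:
        # Both reviewers made the same single decision for every record.
        raise ValueError("kappa is undefined when chance agreement is 1")

    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


# Made-up pilot of 100 records in which both reviewers exclude most of them.
observed, chance, kappa = cohen_kappa(
    both_include=2, a_only=4, b_only=4, both_exclude=90
)
print(f"observed={observed:.2f} chance={chance:.2f} kappa={kappa:.2f}")
# prints: observed=0.92 chance=0.89 kappa=0.29
```

In this made-up pilot the reviewers agree on 92% of records, yet almost all of that agreement is what chance would produce when both exclude about 94% of records, so kappa comes out at only about 0.29.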
How to read your result
The calculator reports kappa alongside the observed and chance agreement so you can see how much of the raw agreement goes beyond what chance alone would produce. The interpretation band uses the widely cited Landis and Koch benchmarks:
- Below 0.00: poor, worse than chance
- 0.00 to 0.20: slight
- 0.21 to 0.40: fair
- 0.41 to 0.60: moderate
- 0.61 to 0.80: substantial
- 0.81 to 1.00: almost perfect
These bands are conventions, not hard rules. A low kappa is a prompt to revisit your eligibility criteria and re-pilot, since the usual cause is an ambiguous criterion that the two reviewers interpreted differently. Our guide to running a reliable two-reviewer screening process covers how to calibrate before you screen at scale.
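As a rough sketch, the Landis and Koch bands listed above can be written as a small lookup in Python. The function name is hypothetical, and treating each upper bound as inclusive (so a value such as 0.205 falls into "fair") is an assumption made here to close the gaps between the printed ranges; the calculator may round differently.

```python
def landis_koch_band(kappa):
    """Map a kappa value to its Landis and Koch label (illustrative helper)."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"

    # Upper bounds are treated as inclusive; this is an assumption.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


print(landis_koch_band(0.29))  # prints: fair
print(landis_koch_band(0.85))  # prints: almost perfect
```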
Where agreement fits in the review
Reviewer agreement is a checkpoint inside the screening stage, the same stage whose totals you report in your PRISMA 2020 flow diagram. A calibrated, well-documented screening process is what makes the records screened and records excluded counts in that diagram defensible. The kappa from your pilot is also a methods detail that fits under the selection process item of the PRISMA reporting checklist.