
Understanding Interobserver Agreement: The Kappa Statistic

Anthony J. Viera, MD; Joanne M. Garrett, PhD
From the Robert Wood Johnson Clinical Scholars Program, University of North Carolina

Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described. (Fam Med 2005;37(5):360-3.)



In reading medical literature on diagnosis and interpretation of diagnostic tests, our attention is generally focused on items such as sensitivity, specificity, predictive values, and likelihood ratios. These items address the validity of the test. But if the people who actually interpret the test cannot agree on the interpretation, the test results will be of little use.

Let us suppose that you are preparing to give a lecture on community-acquired pneumonia. As you prepare for the lecture, you read an article titled "Diagnosing Pneumonia by History and Physical Examination," published in the Journal of the American Medical Association in 1997. You come across a table in the article that shows agreement on physical examination findings of the chest. You see that there was 79% agreement on the presence of wheezing with a kappa of 0.51 and 85% agreement on the presence of tactile fremitus with a kappa of 0.01. How do you interpret these levels of agreement taking into account the kappa statistic?

Accuracy Versus Precision

When assessing the ability of a test (radiograph, physical finding, etc) to be helpful to clinicians, it is important that its interpretation is not a product of guesswork. This concept is often referred to as precision (though some incorrectly use the term accuracy). Recall the analogy of a target and how close we get to the bull's-eye (Figure 1). If we actually hit the bull's-eye (representing agreement with the gold standard), we are accurate. If all our shots land together, we have good precision (good reliability). If all our shots land together and we hit the bull's-eye, we are accurate as well as precise.

It is possible, however, to hit the bull's-eye purely by chance. Referring to Figure 1, only the center black dot in target A is accurate, and there is little precision (poor reliability about where the shots land). In B, there is precision but not accuracy. C demonstrates neither accuracy nor precision. In D, the black dots are both accurate and precise.

The lack of precision in A and C could be due to chance, in which case the bull's-eye shot in A was just lucky. In B and D, the groupings are unlikely to be due to chance.

Precision, as it pertains to agreement between observers (interobserver agreement), is often reported as a kappa statistic. Kappa is intended to give the reader a quantitative measure of the magnitude of agreement between observers. It applies not only to tests such as radiographs but also to items like physical exam findings, eg, presence of wheezes on lung examination as noted earlier. Comparing the presence of wheezes on lung examination to the presence of an infiltrate on a chest radiograph assesses the validity of the exam finding to diagnose pneumonia. Assessing whether the examiners agree on the presence or absence of wheezes (regardless of validity) assesses precision (reliability).

The Kappa Statistic

Interobserver variation can be measured in any situation in which two or more independent observers are evaluating the same thing. For example, let us imagine a study in which two family medicine residents are evaluating the usefulness of a series of 100 noon lectures. Resident 1 and Resident 2 agree that the lectures are useful 15% of the time and not useful 70% of the time (Table 1).

If the two residents randomly assign their ratings, however, they would sometimes agree just by chance. Kappa gives us a numerical rating of the degree to which this is the case. The calculation is based on the difference between how much agreement is actually present ("observed" agreement) compared to how much agreement would be expected to be present by chance alone ("expected" agreement).

The data layout is shown in Table 1. The observed agreement is simply the percentage of all lectures for which the two residents' evaluations agree, which is the sum of a + d divided by the total n in Table 1. In our example, this is (15 + 70)/100, or 0.85. We may also want to know how different the observed agreement (po) is from the expected agreement (pe). Kappa is a measure of this difference, standardized to lie on a -1 to 1 scale, where 1 is perfect agreement, 0 is exactly what would be expected by chance, and negative values indicate agreement less than chance, ie, potential systematic disagreement between the observers.
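In symbols, and using the cell labels and totals from the data layout in Table 1 (po for observed agreement, pe for expected agreement), the quantities described above are:

    \[
    p_o = \frac{a + d}{n}, \qquad
    p_e = \frac{n_1}{n}\cdot\frac{m_1}{n} + \frac{n_0}{n}\cdot\frac{m_0}{n}, \qquad
    \kappa = \frac{p_o - p_e}{1 - p_e}
    \]

These are the same formulas applied numerically in the footnote of Table 1.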

In this example, the kappa is 0.57. (For calculations, see Table 1.)

Figure 1. Accuracy and Precision (targets A, B, C, and D)

Table 1. Interobserver Variation: Usefulness of Noon Lectures

                                 Resident 1: Lectures Helpful?
                                 Yes      No      Total
Resident 2:           Yes         15       5        20
Lectures Helpful?     No          10      70        80
                      Total       25      75       100

Data layout

                                 Observer 1 Result
                                 Yes      No      Total
Observer 2 Result     Yes         a        b       m1
                      No          c        d       m0
                      Total       n1       n0      n

(a) and (d) represent the number of times the two observers agree, while (b) and (c) represent the number of times the two observers disagree. If there are no disagreements, (b) and (c) would be zero, and the observed agreement (po) is 1, or 100%. If there are no agreements, (a) and (d) would be zero, and the observed agreement (po) is 0.

Calculations:

Observed agreement: po = (a + d)/n = (15 + 70)/100 = 0.85
Expected agreement: pe = [(n1/n) * (m1/n)] + [(n0/n) * (m0/n)]
In this example:    pe = [(20/100) * (25/100)] + [(75/100) * (80/100)] = 0.05 + 0.60 = 0.65
Kappa:              K = (po - pe)/(1 - pe) = (0.85 - 0.65)/(1 - 0.65) = 0.57

Interpretation of Kappa

What does a specific kappa value mean? We can use the value of 0.57 from the example above. Not everyone would agree about whether 0.57 constitutes "good" agreement. However, a commonly cited scale is represented in Table 2. It turns out that, using this scale, a kappa of 0.57 is in the "moderate" agreement range between our two observers. Remember that perfect agreement would equate to a kappa of 1, and chance agreement would equate to 0. Table 2 may help you visualize the interpretation of kappa. So, residents in this hypothetical study seem to be in moderate agreement that noon lectures are not that helpful.

Table 2. Interpretation of Kappa

Kappa          Interpretation
< 0            Poor (less than chance) agreement
0.01-0.20      Slight agreement
0.21-0.40      Fair agreement
0.41-0.60      Moderate agreement
0.61-0.80      Substantial agreement
0.81-0.99      Almost perfect agreement

When interpreting kappa, it is also important to keep in mind that the estimated kappa itself could be due to chance. To report a P value of a kappa requires calculation of the variance of kappa and deriving a z statistic, which are beyond the scope of this article.
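As a quick arithmetic check on Table 1, here is a minimal Python sketch that computes the observed agreement, expected agreement, and kappa from the four cell counts; the helper name cohen_kappa_2x2 is my own labeling, not something from the article.

    # Cohen's kappa for two raters and a 2x2 table, following the
    # formulas in Table 1. Cells: a, b = Observer 2 "Yes" row;
    # c, d = Observer 2 "No" row; columns are Observer 1's ratings.
    def cohen_kappa_2x2(a, b, c, d):
        n = a + b + c + d
        po = (a + d) / n                       # observed agreement
        m1, m0 = a + b, c + d                  # Observer 2 row totals
        n1, n0 = a + c, b + d                  # Observer 1 column totals
        pe = (n1 / n) * (m1 / n) + (n0 / n) * (m0 / n)   # expected agreement
        return (po - pe) / (1 - pe)

    # Table 1 counts: 15 yes/yes, 5 yes/no, 10 no/yes, 70 no/no
    print(round(cohen_kappa_2x2(15, 5, 10, 70), 2))      # prints 0.57

The same number should come out of standard library routines (for example, sklearn.metrics.cohen_kappa_score applied to the two residents' raw ratings); the hand calculation simply makes the chance correction visible.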

A confidence interval for kappa, which may be even more informative, can also be calculated. Fortunately, computer programs are able to calculate kappa as well as the P value or confidence interval of kappa at the stroke of a few keys. Remember, though, the P value in this case tests whether the estimated kappa is not due to chance. It does not test the strength of agreement. Also, P values and confidence intervals are sensitive to sample size, and with a large enough sample size, any kappa above 0 will become statistically significant.

Weighted Kappa

Sometimes, we are more interested in the agreement across major categories in which there is meaningful difference. For example, let's suppose we had five categories of helpfulness of noon lectures: very helpful, somewhat helpful, neutral, somewhat a waste, and complete waste. In this case, we may not care whether one resident categorizes a lecture as very helpful while another categorizes it as somewhat helpful, but we might care if one resident categorizes it as very helpful while another categorizes it as complete waste.

Using a clinical example, we may not care whether one radiologist categorizes a mammogram finding as normal and another categorizes it as benign, but we do care if one categorizes it as normal and the other as cancer. A weighted kappa, which assigns less weight to agreement as categories are further apart, would be reported in such instances (a brief computational sketch follows). In our previous example, a disagreement of normal versus benign would still be credited with partial agreement, but a disagreement of normal versus cancer would be counted as no agreement. The determination of weights for a weighted kappa is a subjective issue on which even experts might disagree in a particular situation.
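To make the weighting concrete, here is a small Python sketch of a linearly weighted kappa for ordered categories. The three-level mammogram scale (normal, benign, cancer), the example ratings, and the choice of linear weights are all illustrative assumptions, not data or recommendations from the article.

    # Linearly weighted kappa for two raters over k ordered categories.
    # Disagreement weight w[i][j] = |i - j| / (k - 1): adjacent categories
    # count as partial agreement, opposite ends as complete disagreement.
    def weighted_kappa(rater1, rater2, k):
        n = len(rater1)
        obs = [[0] * k for _ in range(k)]          # observed counts
        for i, j in zip(rater1, rater2):
            obs[i][j] += 1
        row = [sum(obs[i]) for i in range(k)]      # rater 1 marginal totals
        col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater 2 totals
        exp = [[row[i] * col[j] / n for j in range(k)] for i in range(k)]
        w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
        num = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
        den = sum(w[i][j] * exp[i][j] for i in range(k) for j in range(k))
        return 1 - num / den

    # Hypothetical readings: 0 = normal, 1 = benign, 2 = cancer
    r1 = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
    r2 = [0, 1, 1, 1, 2, 2, 0, 0, 2, 0]
    print(round(weighted_kappa(r1, r2, 3), 2))     # about 0.78 for these ratings

With quadratic rather than linear weights, the penalty for far-apart categories grows faster; sklearn.metrics.cohen_kappa_score(r1, r2, weights='linear') should reproduce the value above.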

The Kappa Paradox

Returning to our original example on chest findings in pneumonia, the agreement on the presence of tactile fremitus was high (85%), but the kappa of 0.01 would seem to indicate that this agreement is really very poor. The reason for the discrepancy between the unadjusted level of agreement and kappa is that tactile fremitus is such a rare finding, illustrating that kappa may not be reliable for rare observations. Kappa is affected by the prevalence of the finding under consideration much like predictive values are affected by the prevalence of the disease under consideration. For rare findings, very low values of kappa may not necessarily reflect low rates of overall agreement.

Returning for a moment to our hypothetical study of the usefulness of noon lectures, let us imagine that the prevalence of a truly helpful noon lecture is very low, but the residents know it when they experience it. Likewise, they know (and will say) that most others are not helpful. The data layout might look like Table 3. The observed agreement is high at 85%. However, the kappa (calculation shown in Table 3) is low at .04, suggesting only poor to slight agreement when accounting for chance. One method to account for this paradox, put simply, is to distinguish between agreement on the two levels of the finding (eg, agreement on positive ratings compared to agreement on negative ratings). Feinstein and Cicchetti have published detailed papers on this paradox and methods to resolve it.5,6 For now, understanding of kappa and recognizing this important limitation will allow the reader to better analyze articles reporting interobserver agreement.

Table 3. Usefulness of Noon Lectures, With Low Prevalence of Helpful Lectures
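To see the prevalence effect in numbers, the sketch below redefines the small 2x2 kappa helper from the earlier example (so it runs on its own) and compares the Table 1 counts with a hypothetical low-prevalence table chosen to match the 85% agreement and kappa of about .04 described in the text; those low-prevalence counts are illustrative, not the actual cells of Table 3.

    # Prevalence paradox: the same 85% observed agreement can yield a very
    # different kappa when one of the ratings is rare.
    def cohen_kappa_2x2(a, b, c, d):
        n = a + b + c + d
        po = (a + d) / n
        pe = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
        return (po - pe) / (1 - pe)

    tables = {
        "helpful lectures common (Table 1)":    (15, 5, 10, 70),
        "helpful lectures rare (hypothetical)": (1, 7, 8, 84),
    }
    for label, (a, b, c, d) in tables.items():
        po = (a + d) / (a + b + c + d)
        print(f"{label}: observed agreement = {po:.2f}, "
              f"kappa = {cohen_kappa_2x2(a, b, c, d):.2f}")
    # Both tables show 0.85 observed agreement, but kappa falls from 0.57 to 0.04.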

