
Visualizing Categorical Data: Data, Stories, and Pictures


Michael Friendly, York University, friendly@yorku.ca



Abstract

Categorical data, frequency data, and discrete data are most often presented in tables, and analyses using loglinear models and logistic regression are most often presented in terms of parameter estimates. Over the past decade, I and others have developed novel visualization methods for categorical data, designed to provide exploratory and confirmatory graphic displays analogous to those used readily and easily for quantitative data. These graphical methods are described in Visualizing Categorical Data. The book also provides a large collection of macros designed to make these methods readily and easily used. This paper provides an overview of these graphical methods and macros, as told through data, their stories, and associated graphical displays.

Keywords: categorical data, graphics, mosaic displays, mosaic matrices, correspondence analysis, loglinear models

1 Introduction

Over the last decade a modest revolution has been brewing in the analysis of categorical data, as graphical methods and techniques of data visualization, so commonly used for quantitative data, have begun to be developed for frequency data and discrete data.

At SUGI 17 (Friendly, 1992a) I described some initial steps in the development of new graphical methods for categorical data, with the goals of (a) providing visualization techniques for data exploration and model fitting comparable in scope to those used for quantitative data, and (b) implementing these methods in readily available software. These goals have now been largely met. The methods are described and illustrated in a new book, Visualizing Categorical Data (VCD), now in production.

The book includes nearly 40 general macros and programs (see Appendix A), covering most aspects of categorical data analysis. This paper provides an overview of some of these graphical methods and macros, using examples from the book, as told through data, their stories, and associated graphical displays. (Most of the graphs are in color; see the CD version of the Proceedings.)

2 Disputed authorship: The Federalist Papers

In 1787-88, Alexander Hamilton, John Jay, and James Madison wrote a series of newspaper essays to persuade the voters of New York State to ratify the Constitution. The essays were titled The Federalist Papers and all were signed with a pseudonym. Of the 77 papers published, the author(s) of 65 are known, but both Hamilton and Madison later claimed sole authorship of the remaining 12.

Mosteller and Wallace (1984) investigated the use of statistical methods to identify authors of disputed works based on the frequency distributions of certain key function words, and concluded that Madison had indeed authored the 12 disputed papers.

Table 1 shows the distribution of the occurrence of one of these marker words, the word "may", in 262 blocks of text (each about 200 words long) from issues of the Federalist Papers and other essays known to be written by James Madison.

An important part of the analysis by Mosteller and Wallace was to establish the theoretical form of these frequency distributions, so that the known works could be compared in terms of estimated parameters, rather than through the entire distributions. A simple argument for the occurrence of rare events leads to a suggestion that the distribution of such words might be Poisson; however, numerical fitting led to the conclusion that the Negative Binomial gave better fits. We concentrate here on visualization methods to determine the theoretical form of a discrete distribution.

Table 1: Number of occurrences ($k$) and number of blocks of text ($n_k$) of the word "may" in Federalist Papers and essays written by James Madison
[Table body not preserved; $k$ starts at 0 and the figure axes run from 0 to 6.]

Hanging rootograms

[Figure 1: Histogram for Madison data, with Poisson fit. Vertical axis: Frequency; horizontal axis: Number of Occurrences.]

Discrete frequency distributions are often graphed as histograms, with a theoretical fitted distribution superimposed.

Figure 1, for example, shows the data in Table 1 together with the fitted frequencies under a Poisson model. It is hard to compare the observed and fitted frequencies visually, because (a) we must assess deviations against a curvilinear relation, and (b) the largest frequencies dominate the display.

The hanging rootogram (Tukey, 1977) solves these problems by (a) shifting the histogram bars to coincide with the fitted curve, so that deviations may be judged by deviations from a horizontal line, and (b) plotting on a square-root scale, so that smaller frequencies are emphasized. Figure 2 shows more clearly that the observed frequencies differ systematically from those predicted under a Poisson model.

[Figure 2: Suspended rootogram for Madison data. Vertical axis: Sqrt(frequency); horizontal axis: Number of Occurrences.]

In VCD, several macros are presented for fitting a variety of discrete distributions.
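The arithmetic behind a hanging rootogram can be sketched in a few lines. The following is a minimal Python illustration, not the SAS macros from the book. The counts are an assumption: they are the values commonly reported for this data set (for instance, the Federalist data in the R package vcd), since the body of Table 1 is not preserved here; they do sum to the 262 blocks mentioned in the text. The function names are hypothetical.

```python
import math

# Number of text blocks containing k = 0..6 occurrences of "may".
# ASSUMPTION: values as commonly reported for this data set; the body of
# Table 1 is not preserved in this text. They total the 262 blocks cited.
observed = [156, 63, 29, 8, 4, 1, 1]

def poisson_fit(counts):
    """Return (lambda_hat, fitted frequencies) for a Poisson fitted by the mean."""
    n_total = sum(counts)
    lam = sum(k * n for k, n in enumerate(counts)) / n_total
    fitted = [n_total * math.exp(-lam) * lam ** k / math.factorial(k)
              for k in range(len(counts))]
    return lam, fitted

def hanging_rootogram(counts, fitted):
    """Return (tops, bottoms) of the bars on the square-root scale.

    Each bar hangs from the fitted curve: its top is sqrt(fitted) and it
    extends downward by sqrt(observed), so bottom = sqrt(fitted) - sqrt(observed).
    A bottom of exactly 0 would mean observed and fitted agree.
    """
    tops = [math.sqrt(f) for f in fitted]
    bottoms = [t - math.sqrt(n) for t, n in zip(tops, counts)]
    return tops, bottoms

lam, fitted = poisson_fit(observed)
tops, bottoms = hanging_rootogram(observed, fitted)
for k, (n, f, b) in enumerate(zip(observed, fitted, bottoms)):
    print(f"k={k}  observed={n:3d}  fitted={f:7.2f}  bar bottom={b:+.2f}")
```

Under these assumed counts, the bar for k = 1 stops well above the zero line while the bar for k = 0 extends below it, which is the kind of systematic departure from the Poisson that the rootogram is meant to expose.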

The GOODFIT macro carries out goodness-of-fit tests; the ROOTGRAM macro provides a variety of displays including those of Figures 1 and 2. For example, Figure 2 is produced as

    %goodfit(data=madison, var=count, freq=blocks, dist=poisson, out=fit);
    %rootgram(data=fit, var=count, obs=blocks);

Ord plots

A simple plot suggested by Ord (1967) may be used to diagnose the form of a discrete distribution. Ord showed that, for each of the Poisson, Binomial, Negative Binomial, and Logarithmic Series distributions, a plot of $k p_k / p_{k-1}$ against $k$ is linear, and these distributions were distinguished by the signs of the slope and intercept.

Figure 3 shows the Ord plot for the Madison data, which diagnoses the distribution as a Negative Binomial, based on the positive slope of the thicker line (found by weighted least squares).
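The points and the weighted line of an Ord plot are simple to compute by hand. The sketch below is in Python rather than SAS and is not the book's macro. Two things are assumptions: the counts (the values commonly reported for the Madison data, not preserved in this text) and the weights $\sqrt{n_k - 1}$, a common choice for Ord plots, since the text says only "weighted least squares". Function names are hypothetical.

```python
import math

# ASSUMPTION: counts of blocks with k = 0..6 occurrences of "may", as
# commonly reported for the Madison data; not preserved in this text.
observed = [156, 63, 29, 8, 4, 1, 1]

def ord_points(counts):
    """Return (k, k*n_k/n_{k-1}, weight) for each k >= 1 with positive counts.

    k*p_k/p_{k-1} equals k*n_k/n_{k-1} because the total N cancels.
    ASSUMPTION: weight sqrt(n_k - 1), which gives zero weight to cells
    that contain a single observation.
    """
    pts = []
    for k in range(1, len(counts)):
        if counts[k] > 0 and counts[k - 1] > 0:
            ratio = k * counts[k] / counts[k - 1]
            pts.append((k, ratio, math.sqrt(counts[k] - 1)))
    return pts

def weighted_line(pts):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w for _, _, w in pts)
    xbar = sum(w * x for x, _, w in pts) / sw
    ybar = sum(w * y for _, y, w in pts) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for x, y, w in pts)
    sxx = sum(w * (x - xbar) ** 2 for x, _, w in pts)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

intercept, slope = weighted_line(ord_points(observed))
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

With these inputs the fitted slope is positive. For a Negative Binomial with success probability $p$, the slope of this line is $1 - p$, so the slope also yields a rough estimate of $p$.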

The Ord plot is produced using the ORDPLOT macro, used as

    %ordplot(data=madison, count=Count, freq=blocks);

[Figure 3: Ord plot for Madison data. Vertical axis: Ratio, (k n(k) / n(k-1)); horizontal axis: k (Occurrences of "may"). The plot is annotated with the fitted slope, the diagnosed type (Negative binomial), and the parameter estimate p; the numeric values are not preserved.]

Robust distribution plots

One disadvantage of the Ord plot is lack of resistance, since a single discrepant frequency, $n_k$, affects the points for both $k$ and $k+1$. Robust distribution plots, following methods described by Hoaglin and Tukey (1985), are also provided.

Figure 4 shows the Negative Binomial distribution plot, produced using the DISTPLOT macro, as follows:

    %distplot(data=madison, count=count, freq=blocks, dist=negbin);

[Figure 4: Robust distribution plot for Madison data for the negative binomial. Vertical axis: count metameter; horizontal axis: Number of Occurrences. The plot is annotated with slope(b), a/log(p), and 1-e(b); the numeric values are not preserved.]

This plot has the property that the circled points are linear in $k$ when the data follow the assumed distribution, as in the Ord plot.

However, the ordinate (the count metameter) depends only on $n_k$, and the confidence bars are calculated to take into account the variability of individual counts, $n_k$, in the observed data.

3 Gender bias in admission to Berkeley?

Bickel et al. (1975) analyzed data on admissions to graduate departments at U. C. Berkeley in 1973. Aggregate data for the six largest departments are shown in Table 2, classified by admission and gender. The issue was whether these data showed evidence of gender bias in admissions.

Table 2: Admissions to Berkeley graduate programs

               Admitted  Rejected  Total
    Males          1198      1493   2691
    Females         557      1278   1835
    Total          1755      2771   4526

Fourfold displays

Table 2 is an example of a $2 \times 2$ table. For such data, the odds ratio, $\theta = n_{11} n_{22} / n_{12} n_{21}$, is a natural measure of the strength of association between the two variables. The fourfold display depicts these frequencies by quarter circles, whose radius is proportional to $\sqrt{n_{ij}}$, so the area is proportional to the cell count (Fienberg, 1975; Friendly, 1994a,c).

The cell frequencies are usually scaled to equate the marginal totals, and so that the ratio of diagonally opposite segments depicts the odds ratio. Confidence rings for the observed $\theta$ allow a visual test of the hypothesis $H_0: \theta = 1$ corresponding to no association. They have the property that the rings for adjacent quadrants overlap iff the observed counts are consistent with the null hypothesis.

Figure 5 shows the aggregate data from Table 2. The sample odds ratio, Odds(Admit | Male) / Odds(Admit | Female), is 1.84, indicating that males were almost twice as likely to be admitted. The confidence rings in the figure do not overlap, showing that this association is highly significant. Does this constitute evidence for gender bias in admission?

[Figure 5: Fourfold display for Berkeley admissions data, margins equated. Quadrants: Sex (Male, Female) by Admit? (Yes, No). Cell counts shown: 1198, 1493, 557, 1278.]

The admissions data shown in Figure 5 came from the six largest departments at Berkeley. To determine the source of the apparent sex bias in favor of males, we make a new plot, Figure 6, stratified by department.

Figure 6 shows that, for five of the six departments, the odds of admission are approximately the same for both men and women applicants. Department A appears to differ from the others, with women approximately 2.86 (= (313/19)/(512/89)) times as likely to be admitted as men.

The resolution of this contradiction can be found in the large differences in admission rates among departments. Men and women apply to different departments differentially, and in these data women happen to apply in larger numbers to departments that have a low acceptance rate.
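The odds ratios quoted in this section can be checked directly from the cell counts. The Python sketch below is illustrative only. The assignment of the four Figure 5 counts to cells, and of the Department A counts to admitted and rejected, is an assumption based on how these data are usually tabulated. The interval is the standard large-sample interval for the log odds ratio, which is a different construction from the confidence rings of the fourfold display. Function names are hypothetical.

```python
import math

# Aggregate counts shown in Figure 5 (six largest departments).
# ASSUMPTION: order is male admitted, male rejected, female admitted,
# female rejected, following the usual tabulation of these data.
MALE_ADMIT, MALE_REJECT, FEMALE_ADMIT, FEMALE_REJECT = 1198, 1493, 557, 1278

def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio theta = (n11 * n22) / (n12 * n21) for a 2 x 2 table."""
    return (n11 * n22) / (n12 * n21)

def log_odds_interval(n11, n12, n21, n22, z=1.96):
    """Approximate 95% interval for theta from the large-sample
    standard error of log(theta): sqrt(sum of 1 / n_ij)."""
    log_theta = math.log(odds_ratio(n11, n12, n21, n22))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return math.exp(log_theta - z * se), math.exp(log_theta + z * se)

cells = (MALE_ADMIT, MALE_REJECT, FEMALE_ADMIT, FEMALE_REJECT)
theta = odds_ratio(*cells)
lo, hi = log_odds_interval(*cells)
print(f"aggregate odds ratio = {theta:.2f}, approx. 95% interval ({lo:.2f}, {hi:.2f})")

# Department A, using the four counts quoted in the text.
# ASSUMPTION: women 89 admitted / 19 rejected, men 512 admitted / 313 rejected.
theta_a = odds_ratio(89, 19, 512, 313)
print(f"Department A odds ratio (women vs. men) = {theta_a:.2f}")

# Quarter-circle radii for an unstandardized fourfold display:
# radius proportional to sqrt(n_ij), so area is proportional to n_ij.
radii = [math.sqrt(n) for n in cells]
```

The aggregate interval lies entirely above 1, which agrees with the non-overlapping confidence rings in Figure 5, while the Department A ratio points the other way, in favor of women.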