Comparison of Distance Measures in Cluster …

Journal of Data Science 3 (2005), 85-100

Comparison of Distance Measures in Cluster Analysis with Dichotomous Data

Holmes Finch
Ball State University

Abstract: The current study examines the performance of cluster analysis with dichotomous data using distance measures based on response pattern similarity. In many contexts, such as educational and psychological testing, cluster analysis is a useful means for exploring datasets and identifying underlying groups among individuals. However, standard approaches to cluster analysis assume that the variables used to group observations are continuous in nature. This paper focuses on four methods for calculating distance between individuals using dichotomous data, and the subsequent introduction of these distances to a clustering algorithm such as Ward's.

The four methods in question are potentially useful for practitioners because they are relatively easy to carry out using standard statistical software such as SAS and SPSS, and have been shown to have potential for correctly grouping observations based on dichotomous data. Results of both a simulation study and application to a set of binary survey responses show that three of the four measures behave similarly, and can yield correct cluster recovery rates of between 60% and 90%. Furthermore, these methods were found to work better, in nearly all cases, than using the raw data with Ward's method.

Key words: Cluster analysis, dichotomous data, distance measures.

Introduction

Cluster analysis (CA) is an analytic technique used to classify observations into a finite and, ideally, small number of groups based upon two or more variables.

In some cases there are hypotheses regarding the number and makeup of such groups, but more often there is little or no prior information concerning which individuals will be grouped together, making CA an exploratory analysis. There are a number of clustering algorithms available, all having as their primary purpose the measurement of mathematical distance between individual observations, and groups of observations. Distance in this context can be thought of in the Euclidean sense, or some other, comparable conceptualization (Johnson and Wichern, 1992).

One of the primary assumptions underlying these standard methods for calculating distance is that the variables used to classify individuals into groups are continuous in nature (Anderberg, 1973).

However, some research situations, such as those involving testing data, may involve other types of variables, including ordinal or nominal. For example, in some situations, researchers are interested in grouping sets of test examinees based on their dichotomously scored responses (correct or incorrect) to individual test items, rather than on the total score for the exam, especially for identifying cases of answer copying (Wollack, 2002). The clustering of observations based on dichotomous variables can be readily extended beyond the realm of psychological testing to any situation in which the presence or absence of several traits is coded and researchers want to group the observations based on these binary variables.

These could include economic analyses where individual firms are classified in terms of the presence or absence of various management practices, or situations where binary coding is used to describe industrial processes. In such situations, the standard Euclidean measures of distance are inappropriate for assessing the dissimilarity between two observations because the variables of interest are not continuous, and thus some alternative measure of separation must be used (Dillon and Goldstein, 1984). It is the goal of this paper to investigate four measures of distance designed for clustering using dichotomous data, and to compare their performance in correctly classifying individuals using simulated test data.

A fifth approach, using the raw data rather than these distance measures, will also be included. The paper begins with a description of the four distance measures, followed by a discussion of the study design and the Monte Carlo simulation. Next is the presentation and discussion of the results, followed by a description of the implications for practitioners using dichotomous variables for clustering, and finally, weaknesses of the study.

Distance Measures for Dichotomous Variables

There are several techniques for conducting CA with binary data, all of which involve calculating distances between observations based upon the observed variables and then applying one of the standard CA algorithms to these distances. A popular group of these measures designed for binary data is known collectively as matching coefficients (Dillon and Goldstein, 1984).
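The two-step procedure described above (compute pairwise distances, then hand them to a standard clustering algorithm) might be sketched as follows. This is only an illustration under stated assumptions, not the SAS or SPSS procedure used in the study: the function names `matching_distance` and `ward_clusters` are hypothetical, the distance is taken to be the proportion of mismatched traits, the distances are treated as if they were Euclidean when applying the Lance-Williams update for Ward's method, and the toy data are invented.

```python
from itertools import combinations


def matching_distance(x, y):
    """Proportion of traits on which two 0/1 response vectors disagree."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)


def ward_clusters(data, n_clusters, distance=matching_distance):
    """Merge observations until n_clusters remain (Ward's method).

    Assumption: the supplied distances are treated as Euclidean, so the
    Lance-Williams update is applied to their squares.
    """
    clusters = {i: [i] for i in range(len(data))}  # cluster id -> member rows
    d2 = {
        frozenset((i, j)): distance(data[i], data[j]) ** 2
        for i, j in combinations(range(len(data)), 2)
    }
    next_id = len(data)
    while len(clusters) > n_clusters:
        pair = min(d2, key=d2.get)  # closest pair of clusters
        i, j = sorted(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        d_ij = d2.pop(pair)
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            d_ik = d2.pop(frozenset((i, k)))
            d_jk = d2.pop(frozenset((j, k)))
            # Lance-Williams update for Ward's criterion
            updated[k] = (
                (ni + nk) * d_ik + (nj + nk) * d_jk - nk * d_ij
            ) / (ni + nj + nk)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for k, value in updated.items():
            d2[frozenset((next_id, k))] = value
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Invented toy data: six examinees, six dichotomously scored items.
data = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
]
print(ward_clusters(data, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Passing a different function as `distance` would allow another dissimilarity to be swapped in without touching the clustering step.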

There are several types of matching coefficients, all of which take as their main goal the measurement of response set similarity between any two observations. The logic underlying these techniques is that two individuals should be viewed as similar to the degree that they share a common pattern of attributes among the binary variables (Snijders, Dormaar, van Schurr, Dijkman-Caes and Driessen, 1990). In other words, observations with more similar patterns of responses on the variables of interest are seen as closer to one another than are those with more disparate response patterns. An advantage of these measures is that they are easy to carry out using available statistical software such as SAS or SPSS.

In order to discuss how these methods work, it is helpful to refer to an example. In this case, Table 1 below will be used to demonstrate how each of the four measures is calculated.

The rows represent the presence or absence (1, 0) of a set of K traits for a single observation i, and the columns represent the presence or absence of the same set of K traits for a second observation, j, where i ≠ j.

Table 1: 2 × 2 response table

                 Subject 2
Subject 1        1      0
    1            a      b
    0            c      d

Cell a includes the count of the number of the K variables for which the two subjects both have the attribute present. In a testing context, having the attribute present would mean correctly answering the item. In turn, cell b represents the number of variables for which the first subject has the attribute present and the second subject does not, and cell c includes the number of variables for which the second subject has the attribute present and the first subject does not.

Finally, cell d includes the count of the number of the K variables for which neither subject has the attribute present. The indices described below differ in the ways that they manipulate these cell counts. While there are a number of distance metrics available for dichotomous variables (Hands and Everitt, 1987), this paper will examine the four most widely discussed in the literature (Anderberg, 1973; Lorr, 1983; Dillon and Goldstein, 1984; Snijders, Dormaar, van Schurr, Dijkman-Caes and Driessen, 1990). It is recognized that other approaches are available; however, in the interest of focusing this research on methods that have been cited previously in the literature as being useful, and that are available to practitioners, only these four will be included in the current study.
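The cell counts a, b, c and d of Table 1 can be tallied directly from two response vectors. The following is a minimal sketch; the function name `cell_counts` and the two example response vectors are invented for illustration and do not come from the study.

```python
def cell_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 response vectors.

    x holds the K responses of subject 1, y those of subject 2.
    """
    if len(x) != len(y):
        raise ValueError("both observations must cover the same K traits")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both present
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # subject 1 only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # subject 2 only
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # neither
    return a, b, c, d


# Hypothetical responses of two examinees to K = 6 items (1 = correct).
subject_1 = [1, 1, 0, 1, 0, 0]
subject_2 = [1, 0, 0, 1, 1, 0]
print(cell_counts(subject_1, subject_2))  # (2, 1, 1, 2)
```

The four counts always sum to K, which gives a quick sanity check on coded data.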

The first of these measures of distance is the Russell/Rao index (Rao, 1948). It can be expressed in terms of the cells of Table 1 as:

$$\frac{a}{a+b+c+d}$$

This index is simply the proportion of cases in which both observations had the trait of interest. In contrast is the Jaccard coefficient, introduced by Sneath (1957), which has a similar structure but excludes from the denominator the cases where neither subject has the trait of interest (cell d):

$$\frac{a}{a+b+c}$$

A third variation on this theme, called the matching coefficient (Sokal and Michener, 1958), includes both matched cells a and d: the number of cases where both subjects have the attribute present, and the number of cases where neither subject has the attribute present:

$$\frac{a+d}{a+b+c+d}$$

The final index to be examined here is Dice's coefficient (Dice, 1945).
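As a small worked illustration, the three indices whose formulas are stated above can be computed from the Table 1 cell counts as sketched below. The function names and the example counts are hypothetical, and returning 0.0 from the Jaccard function when a + b + c = 0 is an assumed convention rather than something the paper specifies. Dice's coefficient is left out because its formula is not given in this part of the text.

```python
def russell_rao(a, b, c, d):
    """Proportion of all K traits that both subjects have present."""
    return a / (a + b + c + d)


def jaccard(a, b, c, d):
    """Like Russell/Rao, but joint absences (cell d) leave the denominator."""
    denom = a + b + c
    # Assumed convention: a pair with no trait present anywhere scores 0.0.
    return a / denom if denom else 0.0


def simple_matching(a, b, c, d):
    """Proportion of traits on which the two subjects agree (cells a and d)."""
    return (a + d) / (a + b + c + d)


# Hypothetical cell counts for one pair of subjects with K = 6 traits.
a, b, c, d = 2, 1, 1, 2
print(round(russell_rao(a, b, c, d), 3))      # 0.333
print(round(jaccard(a, b, c, d), 3))          # 0.5
print(round(simple_matching(a, b, c, d), 3))  # 0.667
```

With these counts the three indices disagree noticeably, which reflects how differently they treat the joint absences in cell d.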
