Example: barber

Understanding of Internal Clustering Validation Measures

2010 IEEE International Conference on Data Mining Understanding of Internal Clustering Validation Measures Yanchi Liu1,2 , Zhongmou Li2 , Hui Xiong2 , Xuedong Gao1 , Junjie Wu3. 1. School of Economics and Management, University of Science and Technology Beijing, China 2. MSIS Department, Rutgers Business School, Rutgers University, USA. 3. School of Economics and Management, Beihang University, China Abstract Clustering Validation has long been recognized without any additional information. In practice, external as one of the vital issues essential to the success of clus- information such as class labels is often not available in tering applications. In general, Clustering Validation can be many application scenarios.

Understanding of Internal Clustering Validation Measures Yanchi Liu1, 2, Zhongmou Li , Hui Xiong , Xuedong Gao1, Junjie Wu3 1School of Economics and Management, University of Science and Technology Beijing, China liuyanchi@manage.ustb.edu.cn, gaoxuedong@manage.ustb.edu.cn 2MSIS Department, Rutgers Business School, Rutgers University, USA mosesli@pegasus.rutgers.edu, …

Tags:

  Validation

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of Understanding of Internal Clustering Validation Measures

1 2010 IEEE International Conference on Data Mining Understanding of Internal Clustering Validation Measures Yanchi Liu1,2 , Zhongmou Li2 , Hui Xiong2 , Xuedong Gao1 , Junjie Wu3. 1. School of Economics and Management, University of Science and Technology Beijing, China 2. MSIS Department, Rutgers Business School, Rutgers University, USA. 3. School of Economics and Management, Beihang University, China Abstract Clustering Validation has long been recognized without any additional information. In practice, external as one of the vital issues essential to the success of clus- information such as class labels is often not available in tering applications. In general, Clustering Validation can be many application scenarios.

2 Therefore, in the situation that categorized into two classes, external Clustering Validation and Internal Clustering Validation . In this paper, we focus on there is no external information available, Internal Validation Internal Clustering Validation and present a detailed study of Measures are the only option for cluster Validation . 11 widely used Internal Clustering Validation Measures for crisp In literature, a number of Internal Clustering Validation Clustering . From ve conventional aspects of Clustering , we Measures for crisp Clustering have been proposed, such as investigate their Validation properties. Experiment results show , , , and . However, current existing that is the only Internal Validation measure which performs well in all ve aspects, while other Measures have Measures can be affected by various data characteristics.

3 For certain limitations in different application scenarios. example, noise in data can have a signi cant impact on the performance of an Internal Validation measure, if minimum I. I NTRODUCTION or maximum pairwise distances are used in the measure. The performance of existing Measures in different situations Clustering , one of the most important unsupervised learn- remains unknown. Therefore, we present a detailed study ing problems, is the task of dividing a set of objects into of 11 widely used Internal Validation Measures , as shown clusters such that objects within the same cluster are similar in Table I. We investigate their Validation properties in ve while objects in different clusters are distinct.

4 Clustering different aspects: monotonicity, noise, density, subclusters is widely used in many elds, such as image analysis and skewed distributions. For each aspect, we generate and bioinformatics. As an unsupervised learning task, it is synthetic data for experiments. These synthetic data well necessary to nd a way to validate the goodness of partitions represent the properties. Finally, the experiment results show after Clustering . Otherwise, it would be dif cult to make use that is the only Internal Validation measure which of different Clustering results. performs well in all ve aspects, while other Measures have Clustering Validation , which evaluates the goodness of certain limitations in different application scenarios, mainly Clustering results [1], has long been recognized as one of the in aspects of noise and subclusters.

5 Vital issues essential to the success of Clustering applications [2]. External Clustering Validation and Internal Clustering val- II. I NTERNAL C LUSTERING Validation M EASURES. idation are the two main categories of Clustering Validation . The main difference is whether or not external information In this section, we introduce some basic concepts of is used for Clustering Validation . An example of external Internal Validation Measures , as well as a suite of 11 widely Validation measure is entropy, which evaluates the purity used Internal Validation indices. of clusters based on the given class labels [3]. As the goal of Clustering is to make objects within the Unlike external Validation Measures , which use external same cluster similar and objects in different clusters distinct, information not present in the data, Internal Validation mea- Internal Validation Measures are often based on the following sures only rely on information in the data.

6 The Internal two criteria [4] [5]. Measures evaluate the goodness of a Clustering structure I. Compactness. It Measures how closely related the without respect to external information [4]. Since external objects in a cluster are. A group of Measures evaluate cluster Validation Measures know the true cluster number in compactness based on variance. Lower variance indicates advance, they are mainly used for choosing an optimal better compactness. Also, there are numerous Measures Clustering algorithm on a speci c data set. On the other hand, estimate the cluster compactness based on distance, such Internal Validation Measures can be used to choose the best as maximum or average pairwise distance, and maximum or Clustering algorithm as well as the optimal cluster number average center-based distance.

7 1550-4786/10 $ 2010 IEEE 911. DOI Table I. I NTERNAL C LUSTERING Validation M EASURES. Measure Notation De nition Optimal value 1. 1 Root-mean-square std dev { 2 /[ ( 1)]} 2 Elbow . 2 R-squared ( 2 )/ 2. 2. Elbow 2.. 3 Modi ed Hubert statistic ( 1) ( , ) , ( , ) Elbow . 2 ( , )/( 1). 4 Calinski-Harabasz index Max 2 ( , )/( ).. ( , ). 5 index ( 1 ( , ) max , ( , )) Max . min , ( , ). 6 Dunn's indices min {min ( max {max . )} Max , ( , )}. 1. 1. ( ) ( ). 7 Silhouette index {. max[ ( ), ( )] } Max . ( ) = 1 1 1. , = ( , ), ( ) = min , = [ ( , )]. 1.. 1. 1.. 8 Davies-Bouldin index max , = {[ ( , ) + ( , )]/ ( , )} Min . 9 Xie-Beni index [ 2 ( , )]/[ , = 2 ( , )] Min . 10 SD validity index ( ) ( ) + ( ) Min , ( , ) 1.

8 ( ) = 1 ( ) / ( ) , ( ) = , ( , ) ( ( , )). 11 S Dbw validity index ( ) + ( ) Min . 1. ( , ). ( ) = ( 1) [ , = .. { ( , ), ( , )}. ].. : data set; : number of objects in ; : center of ; : attributes number of ; : number of clusters; : the i th cluster; : number of objects in ;. 1. : center of ; ( ): variance vector of ; ( , ): distance between x and y; = ( ) 2. II. Separation. It Measures how distinct or well-separated [7]. The Modi ed Hubert statistic ( ) [8] evaluates the a cluster is from other clusters. For example, the pairwise difference between clusters by counting the disagreements distances between cluster centers or the pairwise minimum of pairs of data objects in two partitions. distances between objects in different clusters are widely The Calinski-Harabasz index ( ) [9] evaluates the used as Measures of separation.

9 Also, Measures based on cluster validity based on the average between- and within- density are used in some indices. cluster sum of squares. Index ( ) [1] Measures sep- The general procedure to determine the best partition and aration based on the maximum distance between cluster optimal cluster number of a set of objects by using Internal centers, and Measures compactness based on the sum of Validation Measures is as follows. distances between objects and their cluster center. Dunn's Step 1: Initialize a list of Clustering algorithms which will index ( ) [10] uses the minimum pairwise distance between be applied to the data set. objects in different clusters as the inter-cluster separation Step 2: For each Clustering algorithm, use different com- and the maximum diameter among all clusters as the intra- binations of parameters to get different Clustering results.

10 Cluster compactness. These three indices take a form of Step 3: Compute the corresponding Internal Validation = ( )/( ), where and index of each partition obtained in step 2. are weights. The optimal cluster number is determined by Step 4: Choose the best partition and the optimal cluster maximizing the value of these indices. number according to the criteria. Table I shows a suite of 11 widely used Internal Validation The Silhouette index ( ) [11] validates the Clustering Measures . To the best of our knowledge, these Measures performance based on the pairwise difference of between- represent a good coverage of the Validation Measures avail- and within-cluster distances. In addition, the optimal cluster able in different elds, such as data mining, information number is determined by maximizing the value of this index.


Related search queries