Transcription of A Density-Based Algorithm for Discovering …
1 From: KDD-96 Proceedings. Copyright 1996, AAAI ( ). All rights reserved. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databaseswith Noise Martin Ester, Hans-Peter Kriegel, Jiirg Sander, Xiaowei Xu Institute for ComputerScience, University of Munich Oettingenstr. 67, D-80538 Miinchen, Germany {ester I kriegel I sander I xwxu} Abstract are often not knownin advancewhendealing with large Clusteringalgorithmsare attractive for the task of class iden- databases. tification in spatial , the applicationto (2) Discoveryof clusters with arbitrary shape, because the large spatial databasesrises the followingrequirements for shapeof clusters in spatial databases maybe spherical, clustering algorithms: minimalrequirements of domain drawn-out,linear, elongatedetc.
2 Knowledgeto determinethe input parameters,discoveryof clusters witharbitrary shapeandgoodefficiencyonlarge da- (3) Goodefficiency on large databases, on databases of tabases. Thewell-known clustering algorithmsoffer no solu- significantly morethan just a fewthousandobjects. tion to the combination of these this paper, The well-knownclustering algorithms offer no solution to wepresent the newclustering algorithmDBSCAN relying on the combination of these requirements. In this paper, we a density -basednotion of clusters whichis designedto dis- present the new clustering Algorithm DBSCAN.
3 It requires coverclusters of arbitrary shape. DBSCAN requires only one only one input parameterand supports the user in determin- input parameterand supportsthe user in determiningan ap- ing an appropriatevalue for it. It discoversclusters of arbi- propriatevalue for it. Weperformed an experimental evalua- tion of the effectiveness and efficiency of DBSCAN using trary shape. Finally, DBSCAN is efficient evenfor large spa- synthetic data and real data of the SEQUOIA 2000bench- tial databases. The rest of the paper is organizedas follows. of our experimentsdemonstratethat (1) Wediscuss clustering algorithms in section 2 evaluating DBSCAN is significantly moreeffective in discoveringclus- them according to the above requirements.
4 In section 3, we ters of arbitrary shapethan the well-known algorithmCLAR- present our notion of clusters whichis based on the concept ANS,and that (2) DBSCAN outperforms CLARANS by of density in the database. Section 4 introduces the algo- factor of morethan 100in termsof efficiency. rithm DBSCAN which discovers such clusters in a spatial Keywords: Clustering Algorithms,Arbitrary Shapeof Clus- database. In section 5, we performedan experimentalevalu- ters, Efficiencyon LargeSpatial Databases,HandlingNlj4- ation of the effectiveness and efficiency of DBSCAN using 275oise.
5 Synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summaryand somedirections for 1. Introduction future research. Numerousapplications require the managementof spatial data, data related to space. Spatial DatabaseSystems 2. Clustering Algorithms (SDBS)(Gueting 1994) are database systems for the man- agementof spatial data. Increasingly large amountsof data There are two basic types of clustering algorithms (Kaufman are obtained fromsatellite images, X-ray crystallography or & Rousseeuw1990): partitioning and hierarchical algo- other automatic equipment.
6 Therefore, automated know- rithms. Partitioningalgorithmsconstruct a partition of a da- ledge discovery becomesmore and more important in spatial tabase D ofn objects into a set ofk clusters, k is an input pa- databases. rameter for these algorithms, somedomainknowledgeis Several tasks of knowledgediscovery in databases (KDD) required whichunfortunately is not available for manyap- have been defined in the literature (Matheus, Chan&Pi- plications. The partitioning algorithmtypically starts with atetsky-Shapiro 1993). The task considered in this paper is an initial partition of Dand then uses an iterative control class identification, the groupingof the objects of a data- strategy to optimize an objective function.
7 Eachcluster is base into meaningfulsubclasses. In an earth observation da- representedby the gravity center of the cluster (k-meansal- tabase, , we might want to discover classes of houses gorithms)or by one of the objects of the cluster located near along someriver. its center (k-medoidalgorithms). Consequently,partitioning Clustering algorithms are attractive for the task of class algorithmsuse a two-step procedure. First, determinek rep- identification. However, the application to large spatial data- resentatives minimizingthe objective function.
8 Second,as- bases rises the following requirementsfor clustering algo- sign each object to the cluster with its representative "clos- rithms: est" to the consideredobject. Thesecondstep implies that a (1) Minimal requirements of domain knowledge to deter- partition is equivalent to a voronoidiagramand each cluster minethe input parameters, because appropriate values is containedin oneof the voronoicells. Thus,the shapeof all 226 KDD-96. clusters found by a partitioning algorithmis convexwhichis moderatevalues for n, but it is prohibitive for applicationson veryrestrictive.
9 Large databases. Ng & Han (1994) explore partitioning algorithms for Jain (1988) explores a density based approach to identify KDDin spatial databases. An Algorithm called CLARANS clusters in k-dimensionalpoint sets. The data set is parti- (Clustering Large Applications based on RANdomized tioned into a numberof nonoverlappingcells and histograms Search) is introduced which is an improvedk-medoidmeth- are constructed. Cells with relatively high frequencycounts od. Compared to former k-medoid algorithms, CLARANS of points are the potential cluster centers and the boundaries is moreeffective and moreefficient.
10 Anexperimentalevalu- betweenclusters fall in the "valleys" of the histogram. This ation indicates that CLARANS runs efficiently on databases methodhas the capability of identifying clusters of any of thousands of objects. Ng&Han(1994) also discuss meth- shape. However,the space and run-time requirements for ods to determinethe "natural" numberknat of clusters in a storing and searching multidimensional histograms can be database. They propose to run CLARANS once for each k if the space and run-time requirements are from2 to n. For each of the discovered clusterings the sil- optimized, the performance of such an approach crucially houette coefficient (Kaufman&Rousseeuw1990) is calcu- dependson the size of the cells.