### Transcription of A Density-Based Algorithm for Discovering …

From: KDD-96 Proceedings. Copyright 1996, AAAI. All rights reserved.

**A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise**

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
Institute for Computer Science, University of Munich
Oettingenstr. 67, D-80538 München, Germany
{ester | kriegel | sander | xwxu}

**Abstract**

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

**1. Introduction**

Numerous applications require the management of spatial data, i.e. data related to space. Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data. Increasingly large amounts of data are obtained from satellite images, X-ray crystallography or other automatic equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases.

Several tasks of knowledge discovery in databases (KDD) have been defined in the literature (Matheus, Chan & Piatetsky-Shapiro 1993). The task considered in this paper is class identification, i.e. the grouping of the objects of a database into meaningful subclasses. In an earth observation database, e.g., we might want to discover classes of houses along some river.

Clustering algorithms are attractive for the task of class identification. However, the application to large spatial databases raises the following requirements for clustering algorithms:

(1) Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.

(2) Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, linear, elongated etc.

(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.

The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN. It requires only one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. Finally, DBSCAN is efficient even for large spatial databases.

The rest of the paper is organized as follows. We discuss clustering algorithms in section 2, evaluating them according to the above requirements. In section 3, we present our notion of clusters, which is based on the concept of density in the database. Section 4 introduces the algorithm DBSCAN, which discovers such clusters in a spatial database. In section 5, we performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summary and some directions for future research.

**2. Clustering Algorithms**

There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. k is an input parameter for these algorithms, i.e. some domain knowledge is required which unfortunately is not available for many applications. The partitioning algorithm typically starts with an initial partition of D and then uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means algorithms) or by one of the objects of the cluster located near its center (k-medoid algorithms). Consequently, partitioning algorithms use a two-step procedure. First, determine k representatives minimizing the objective function. Second, assign each object to the cluster with its representative "closest" to the considered object.
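The two-step procedure can be sketched as follows — a minimal, illustrative k-means-style fragment (the toy points, initial representatives, and all function names here are hypothetical, not taken from the paper):

```python
import math

# Hypothetical toy 2D objects: two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
representatives = [(1.0, 1.0), (5.0, 5.0)]  # from an initial partition, k = 2

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def assign(points, reps):
    """Second step: assign each object to the cluster whose
    representative is "closest" to the considered object."""
    clusters = [[] for _ in reps]
    for p in points:
        i = min(range(len(reps)), key=lambda j: dist(p, reps[j]))
        clusters[i].append(p)
    return clusters

def gravity_center(cluster):
    """k-means flavor: a cluster is represented by its gravity center."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

# Iterative control strategy: alternate the two steps until stable.
for _ in range(10):
    clusters = assign(points, representatives)
    representatives = [gravity_center(c) for c in clusters if c]
```

Because each object goes to its nearest representative, the induced partition is exactly the Voronoi partition of the representatives — which is why the clusters found this way are convex.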

The second step implies that a partition is equivalent to a voronoi diagram and each cluster is contained in one of the voronoi cells. Thus, the shape of all clusters found by a partitioning algorithm is convex, which is very restrictive.

Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the "natural" number knat of clusters in a database. They propose to run CLARANS once for each k from 2 to n. For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the "natural" clustering.

[...] moderate values for n, but it is prohibitive for applications on large databases.

Jain (1988) explores a density based approach to identify clusters in k-dimensional point sets. The data set is partitioned into a number of nonoverlapping cells and histograms are constructed. Cells with relatively high frequency counts of points are the potential cluster centers, and the boundaries between clusters fall in the "valleys" of the histogram. This method has the capability of identifying clusters of any shape. However, the space and run-time requirements for storing and searching multidimensional histograms can be enormous. Even if the space and run-time requirements are optimized, the performance of such an approach crucially depends on the size of the cells.
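Jain's grid-and-histogram idea discussed above can be sketched minimally as follows (an illustration of the cell-counting principle only, with hypothetical data and names; the actual method is more elaborate):

```python
from collections import Counter

# Hypothetical 2D point set: two dense regions and one stray point.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),
          (3.1, 3.2), (3.3, 3.1), (7.0, 0.5)]

cell_size = 1.0   # results crucially depend on this choice of cell size
threshold = 2     # what counts as a "relatively high" frequency

# Histogram over nonoverlapping cells: count the points falling in each cell.
counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

# Cells with high counts are potential cluster centers; sparse cells
# (the "valleys" of the histogram) separate them.
dense_cells = {cell for cell, c in counts.items() if c >= threshold}
```

Note how shrinking or growing `cell_size` changes which cells qualify — the sensitivity the text points out.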

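For reference, the silhouette coefficient used above to pick the "natural" clustering can be computed per object as s(o) = (b - a) / max(a, b), where a is the mean distance from o to the other members of its own cluster and b is the smallest mean distance from o to another cluster. A minimal sketch with hypothetical data:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def silhouette(o, own, other_clusters):
    """s(o) = (b - a) / max(a, b): values near 1 mean o fits its own
    cluster well, values near -1 mean another cluster fits better."""
    a = sum(dist(o, p) for p in own if p != o) / (len(own) - 1)
    b = min(sum(dist(o, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

cluster_1 = [(0.0, 0.0), (0.0, 1.0)]
cluster_2 = [(5.0, 0.0), (5.0, 1.0)]
s = silhouette(cluster_1[0], cluster_1, [cluster_2])  # well separated, so s is close to 1
```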
Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS. CLARANS assumes that all objects to be clustered can reside in main memory at the same time, which does not hold for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address both of these problems by focusing the clustering process on the relevant parts of the database. First, the focus is small enough to be memory resident, and second, the run time of CLARANS on the objects of the focus is significantly less than its run time on the whole database.

Hierarchical algorithms create a hierarchical decomposition of D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. The dendrogram can either be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step.

In contrast to partitioning algorithms, hierarchical algorithms do not need k as an input. However, a termination condition has to be defined indicating when the merge or division process should be terminated. One example of a termination condition in the agglomerative approach is the critical distance Dmin between all the clusters of Q. So far, the main problem with hierarchical clustering algorithms has been the difficulty of deriving appropriate parameters for the termination condition, e.g. a value of Dmin [...]

**3. A Density Based Notion of Clusters**

When looking at the sample sets of points depicted in figure 1, we can easily and unambiguously detect clusters of points and noise points not belonging to any of those clusters.

[figure 1: Sample databases — database 1, database 2, database 3]

The main reason why we recognize the clusters is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters.

In the following, we try to formalize this intuitive notion of "clusters" and "noise" in a database D of points of some k-dimensional space S. Note that both our notion of clusters and our algorithm DBSCAN apply as well to 2D or 3D Euclidean space as to some high dimensional feature space. The key idea is that for each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold.
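This key idea can be stated directly in code. A minimal sketch of the density condition follows; the parameter names `eps` and `min_pts`, the helper names, and the toy database are illustrative assumptions (the paper's formal definitions build on this condition):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eps_neighborhood(p, db, eps):
    """All points of the database within distance eps of p (p included)."""
    return [q for q in db if dist(p, q) <= eps]

def exceeds_density_threshold(p, db, eps, min_pts):
    """The neighborhood of radius eps around p must contain at least
    min_pts points for p to lie inside a cluster."""
    return len(eps_neighborhood(p, db, eps)) >= min_pts

# Hypothetical toy database: one small dense group plus a noise point.
db = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
```

With `eps = 0.2` and `min_pts = 3`, the three points near the origin satisfy the condition while the isolated point does not — matching the intuition that noise lies in regions of lower density.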