A Density-Based Algorithm for Discovering …

From: KDD-96 Proceedings. Copyright 1996, AAAI ( ). All rights reserved. A Density-Based Algorithm for Discovering Clusters in Large spatial Databaseswith Noise Martin Ester, Hans-Peter Kriegel, Jiirg Sander, Xiaowei Xu Institute for ComputerScience, University of Munich Oettingenstr. 67, D-80538 Miinchen, Germany {ester I kriegel I sander I xwxu} Abstract are often not knownin advancewhendealing with large Clusteringalgorithmsare attractive for the task of class iden- databases. tification in spatial , the applicationto (2) Discoveryof clusters with arbitrary shape, because the large spatial databasesrises the followingrequirements for shapeof clusters in spatial databases maybe spherical, clustering algorithms: minimalrequirements of domain drawn-out,linear, elongatedetc.

Knowledgeto determinethe input parameters,discoveryof clusters witharbitrary shapeandgoodefficiencyonlarge da- (3) Goodefficiency on large databases, on databases of tabases. Thewell-known clustering algorithmsoffer no solu- significantly morethan just a fewthousandobjects. tion to the combination of these this paper, The well-knownclustering algorithms offer no solution to wepresent the newclustering algorithmDBSCAN relying on the combination of these requirements. In this paper, we a density -basednotion of clusters whichis designedto dis- present the new clustering Algorithm DBSCAN. It requires coverclusters of arbitrary shape. DBSCAN requires only one only one input parameterand supports the user in determin- input parameterand supportsthe user in determiningan ap- ing an appropriatevalue for it.

It discoversclusters of arbi- propriatevalue for it. Weperformed an experimental evalua- tion of the effectiveness and efficiency of DBSCAN using trary shape. Finally, DBSCAN is efficient evenfor large spa- synthetic data and real data of the SEQUOIA 2000bench- tial databases. The rest of the paper is organizedas follows. of our experimentsdemonstratethat (1) Wediscuss clustering algorithms in section 2 evaluating DBSCAN is significantly moreeffective in discoveringclus- them according to the above requirements. In section 3, we ters of arbitrary shapethan the well-known algorithmCLAR- present our notion of clusters whichis based on the concept ANS,and that (2) DBSCAN outperforms CLARANS by of density in the database.

Section 4 introduces the algo- factor of morethan 100in termsof efficiency. rithm DBSCAN which discovers such clusters in a spatial Keywords: Clustering Algorithms,Arbitrary Shapeof Clus- database. In section 5, we performedan experimentalevalu- ters, Efficiencyon LargeSpatial Databases,HandlingNlj4- ation of the effectiveness and efficiency of DBSCAN using 275oise. synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summaryand somedirections for 1. Introduction future research. Numerousapplications require the managementof spatial data, data related to space. spatial DatabaseSystems 2. Clustering Algorithms (SDBS)(Gueting 1994) are database systems for the managementof spatial data.

Increasingly large amountsof data There are two basic types of clustering algorithms (Kaufman are obtained fromsatellite images, X-ray crystallography or & Rousseeuw1990): partitioning and hierarchical algo- other automatic equipment. Therefore, automated know- rithms. Partitioningalgorithmsconstruct a partition of a da- ledge discovery becomesmore and more important in spatial tabase D ofn objects into a set ofk clusters, k is an input pa- databases. rameter for these algorithms, somedomainknowledgeis Several tasks of knowledgediscovery in databases (KDD) required whichunfortunately is not available for manyap- have been defined in the literature (Matheus, Chan&Pi- plications.)

The partitioning algorithmtypically starts with atetsky-Shapiro 1993). The task considered in this paper is an initial partition of Dand then uses an iterative control class identification, the groupingof the objects of a data- strategy to optimize an objective function. Eachcluster is base into meaningfulsubclasses. In an earth observation da- representedby the gravity center of the cluster (k-meansal- tabase, , we might want to discover classes of houses gorithms)or by one of the objects of the cluster located near along someriver. its center (k-medoidalgorithms). Consequently,partitioning Clustering algorithms are attractive for the task of class algorithmsuse a two-step procedure.

First, determinek rep- identification. However, the application to large spatial data- resentatives minimizingthe objective function. Second,as- bases rises the following requirementsfor clustering algo- sign each object to the cluster with its representative "clos- rithms: est" to the consideredobject. Thesecondstep implies that a (1) Minimal requirements of domain knowledge to deter- partition is equivalent to a voronoidiagramand each cluster minethe input parameters, because appropriate values is containedin oneof the voronoicells. Thus,the shapeof all 226 KDD-96. clusters found by a partitioning algorithmis convexwhichis moderatevalues for n, but it is prohibitive for applicationson veryrestrictive.

Large databases. Ng & Han (1994) explore partitioning algorithms for Jain (1988) explores a density based approach to identify KDDin spatial databases. An Algorithm called CLARANS clusters in k-dimensionalpoint sets. The data set is parti- (Clustering Large Applications based on RANdomized tioned into a numberof nonoverlappingcells and histograms Search) is introduced which is an improvedk-medoidmeth- are constructed. Cells with relatively high frequencycounts od. Compared to former k-medoid algorithms, CLARANS of points are the potential cluster centers and the boundaries is moreeffective and moreefficient. Anexperimentalevalu- betweenclusters fall in the "valleys" of the histogram.

This ation indicates that CLARANS runs efficiently on databases methodhas the capability of identifying clusters of any of thousands of objects. Ng&Han(1994) also discuss meth- shape. However,the space and run-time requirements for ods to determinethe "natural" numberknat of clusters in a storing and searching multidimensional histograms can be database. They propose to run CLARANS once for each k if the space and run-time requirements are from2 to n. For each of the discovered clusterings the sil- optimized, the performance of such an approach crucially houette coefficient (Kaufman&Rousseeuw1990) is calcu- dependson the size of the cells. lated, and finally, the clustering with the maximum silhou- ette coefficient is chosen as the "natural" clustering.

3. A density based Notion of Clusters Unfortunately, the run time of this approachis prohibitive for large n, because it implies O(n) calls of CLARANS. Whenlooking at the sample sets of points depicted in CLARANS assumesthat all objects to be clustered can re- figure 1, we can easily and unambiguously detect clusters of side in main memoryat the same time which does not hold points and noise points not belongingto any of those clus- for large databases. Furthermore, the run time of CLARANS ters. is prohibitive on large databases. Therefore, Ester, Kriegel &Xu(1995) present several focusing techniques which ad- ..:-'.".'" :~..-,..:,, . dress both of these problemsby focusing the clustering pro- - ,;~.

A Density-Based Algorithm for Discovering …

Tags:

Information

Transcription of A Density-Based Algorithm for Discovering …

Related search queries

A Density-Based Algorithm for Discovering …

Tags:

Information

Documents from same domain

Related documents

Related search queries