A Density-Based Algorithm for Discovering …

From: KDD-96 Proceedings. Copyright 1996, AAAI ( ). All rights reserved. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databaseswith Noise Martin Ester, Hans-Peter Kriegel, Jiirg Sander, Xiaowei Xu Institute for ComputerScience, University of Munich Oettingenstr. 67, D-80538 Miinchen, Germany {ester I kriegel I sander I xwxu} Abstract are often not knownin advancewhendealing with large Clusteringalgorithmsare attractive for the task of class iden- databases. tification in spatial , the applicationto (2) Discoveryof clusters with arbitrary shape, because the large spatial databasesrises the followingrequirements for shapeof clusters in spatial databases maybe spherical, clustering algorithms: minimalrequirements of domain drawn-out,linear, elongatedetc.

Knowledgeto determinethe input parameters,discoveryof clusters witharbitrary shapeandgoodefficiencyonlarge da- (3) Goodefficiency on large databases, on databases of tabases. Thewell-known clustering algorithmsoffer no solu- significantly morethan just a fewthousandobjects. tion to the combination of these this paper, The well-knownclustering algorithms offer no solution to wepresent the newclustering algorithmDBSCAN relying on the combination of these requirements. In this paper, we a density -basednotion of clusters whichis designedto dis- present the new clustering Algorithm DBSCAN.

It requires coverclusters of arbitrary shape. DBSCAN requires only one only one input parameterand supports the user in determin- input parameterand supportsthe user in determiningan ap- ing an appropriatevalue for it. It discoversclusters of arbi- propriatevalue for it. Weperformed an experimental evalua- tion of the effectiveness and efficiency of DBSCAN using trary shape. Finally, DBSCAN is efficient evenfor large spa- synthetic data and real data of the SEQUOIA 2000bench- tial databases. The rest of the paper is organizedas follows. of our experimentsdemonstratethat (1) Wediscuss clustering algorithms in section 2 evaluating DBSCAN is significantly moreeffective in discoveringclus- them according to the above requirements.

In section 3, we ters of arbitrary shapethan the well-known algorithmCLAR- present our notion of clusters whichis based on the concept ANS,and that (2) DBSCAN outperforms CLARANS by of density in the database. Section 4 introduces the algo- factor of morethan 100in termsof efficiency. rithm DBSCAN which discovers such clusters in a spatial Keywords: Clustering Algorithms,Arbitrary Shapeof Clus- database. In section 5, we performedan experimentalevalu- ters, Efficiencyon LargeSpatial Databases,HandlingNlj4- ation of the effectiveness and efficiency of DBSCAN using 275oise.

Synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summaryand somedirections for 1. Introduction future research. Numerousapplications require the managementof spatial data, data related to space. Spatial DatabaseSystems 2. Clustering Algorithms (SDBS)(Gueting 1994) are database systems for the managementof spatial data. Increasingly large amountsof data There are two basic types of clustering algorithms (Kaufman are obtained fromsatellite images, X-ray crystallography or & Rousseeuw1990): partitioning and hierarchical algo- other automatic equipment.

Therefore, automated know- rithms. Partitioningalgorithmsconstruct a partition of a da- ledge discovery becomesmore and more important in spatial tabase D ofn objects into a set ofk clusters, k is an input pa- databases. rameter for these algorithms, somedomainknowledgeis Several tasks of knowledgediscovery in databases (KDD) required whichunfortunately is not available for manyap- have been defined in the literature (Matheus, Chan&Pi- plications. The partitioning algorithmtypically starts with atetsky-Shapiro 1993). The task considered in this paper is an initial partition of Dand then uses an iterative control class identification, the groupingof the objects of a data- strategy to optimize an objective function.

Eachcluster is base into meaningfulsubclasses. In an earth observation da- representedby the gravity center of the cluster (k-meansal- tabase, , we might want to discover classes of houses gorithms)or by one of the objects of the cluster located near along someriver. its center (k-medoidalgorithms). Consequently,partitioning Clustering algorithms are attractive for the task of class algorithmsuse a two-step procedure. First, determinek rep- identification. However, the application to large spatial data- resentatives minimizingthe objective function.

Second,as- bases rises the following requirementsfor clustering algo- sign each object to the cluster with its representative "clos- rithms: est" to the consideredobject. Thesecondstep implies that a (1) Minimal requirements of domain knowledge to deter- partition is equivalent to a voronoidiagramand each cluster minethe input parameters, because appropriate values is containedin oneof the voronoicells. Thus,the shapeof all 226 KDD-96. clusters found by a partitioning algorithmis convexwhichis moderatevalues for n, but it is prohibitive for applicationson veryrestrictive.

Large databases. Ng & Han (1994) explore partitioning algorithms for Jain (1988) explores a density based approach to identify KDDin spatial databases. An Algorithm called CLARANS clusters in k-dimensionalpoint sets. The data set is parti- (Clustering Large Applications based on RANdomized tioned into a numberof nonoverlappingcells and histograms Search) is introduced which is an improvedk-medoidmeth- are constructed. Cells with relatively high frequencycounts od. Compared to former k-medoid algorithms, CLARANS of points are the potential cluster centers and the boundaries is moreeffective and moreefficient.

Anexperimentalevalu- betweenclusters fall in the "valleys" of the histogram. This ation indicates that CLARANS runs efficiently on databases methodhas the capability of identifying clusters of any of thousands of objects. Ng&Han(1994) also discuss meth- shape. However,the space and run-time requirements for ods to determinethe "natural" numberknat of clusters in a storing and searching multidimensional histograms can be database. They propose to run CLARANS once for each k if the space and run-time requirements are from2 to n. For each of the discovered clusterings the sil- optimized, the performance of such an approach crucially houette coefficient (Kaufman&Rousseeuw1990) is calcu- dependson the size of the cells.

A Density-Based Algorithm for Discovering …

Tags:

Information

Advertisement

Transcription of A Density-Based Algorithm for Discovering …

Related search queries

A Density-Based Algorithm for Discovering …

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries