A Density-Based Algorithm for Discovering Clusters in ...

A Density-Based Algorithm for Discovering Clustersin Large Spatial Databases with NoiseMartin Ester, Hans-Peter Kriegel, Jiirg Sander, Xiaowei XuInstitute for Computer Science, university of MunichOettingenstr. 67, D-80538 Miinchen, Germany{ester I kriegel I sander I xwxu } algorithms are attractive for the task of class iden-tification in spatial databases. However, the application tolarge spatial databases rises the following requirements forclustering algorithms: minimal requirements of domainknowledge to determine the input parameters, discovery ofclusters with arbitrary shape and good efficiency on large da-tabases. The well-known clustering algorithms offer no solu-tion to the combination of these requirements. In this paper,we present the new clustering Algorithm DBSCAN relying ona Density-Based notion of Clusters which is designed to dis-cover Clusters of arbitrary shape.

DBSCAN requires only oneinput parameter and supports the user in determining an ap-propriate value for it. We performed an experimental evalua-tion of the effectiveness and efficiency of DBSCAN usingsynthetic data and real data of the SEQUOIA 2000 bench-mark. The results of our experiments demonstrate that (1)DBSCAN is significantly more effective in Discovering clus-ters of arbitrary shape than the well-known Algorithm CLAR-ANS, and that (2) DBSCAN outperforms CLARANS by factor of more than 100 in terms of : Clustering Algorithms, Arbitrary Shape of Clus-ters, Efficiency on Large Spatial Databases, Handling IntroductionNumerous applications require the management of spatialdata, data related to space. Spatial Database Systems(SDBS) (Gueting 1994) are database systems for the man-agement of spatial data.

Increasingly large amounts of dataare obtained from satellite images, X-ray crystallography orother automatic equipment. Therefore, automated know-ledge discovery becomes more and more important in tasks of knowledge discovery in databases (KDD)have been defined in the literature (Matheus, Chan& Pi-atetsky-Shapiro 1993). The task considered in this paper isclass identification, the grouping of the objects of a data-base into meaningful subclasses. In an earth observation da-tabase, , we might want to discover classes of housesalong some algorithms are attractive for the task of classidentification. However, the application to large spatial data-bases rises the following requirements for clustering algo-rithms:(1)Minimal requirements of domain knowledge to deter-mine the input parameters, because appropriate valuesare often not known in advance when dealing with largedatabases.

(2)Discovery of Clusters with arbitrary shape, because theshape of Clusters in spatial databases may be spherical,drawn-out, linear, elongated etc.(3)Good efficiency on large databases, on databases ofsignificantly more than just a few thousand well-known clustering algorithms offer no solution tothe combination of these requirements. In this paper, wepresent the new clustering Algorithm DBSCAN. It requiresonly one input parameter and supports the user in determin-ing an appropriate value for it. It discovers Clusters of arbi-trary shape. Finally, DBSCAN is efficient even for large spa-tial databases. The rest of the paper is organized as discuss clustering algorithms in section 2 evaluatingthem according to the above requirements. In section 3, wepresent our notion of Clusters which is based on the conceptof density in the database.

Section 4 introduces the algo-rithm DBSCAN which discovers such Clusters in a spatialdatabase. In section 5, we performed an experimental evalu-ation of the effectiveness and efficiency of DBSCAN usingsynthetic data and data of the SEQUOIA 2000 6 concludes with a summary and some directions forfuture Clustering AlgorithmsThere are two basic types of clustering algorithms (Kaufman& Rousseeuw 1990): partitioning and hierarchical algo-rithms. Partitioning algorithms construct a partition of a da-tabase D ofn objects into a set ofk Clusters , k is an input pa-rameter for these algorithms, some domain knowledge isrequired which unfortunately is not available for many ap-plications. The partitioning Algorithm typically starts withan initial partition of D and then uses an iterative controlstrategy to optimize an objective function.

Each cluster isrepresented by the gravity center of the cluster (k-means al-gorithms) or by one of the objects of the cluster located nearits center (k-medoid algorithms). Consequently, partitioningalgorithms use a two-step procedure. First, determine k rep-resentatives minimizing the objective function. Second, as-sign each object to the cluster with its representative "clos-est" to the considered object. The second step implies that apartition is equivalent to a voronoi diagram and each clusteris contained in one of the voronoi cells. Thus, the shape of all226 KDD-96 From: KDD-96 Proceedings. Copyright 1996, AAAI ( ). All rights reserved. Clusters found by a partitioning Algorithm is convex which isvery & Han (1994) explore partitioning algorithms forKDD in spatial databases.

An Algorithm called CLARANS(Clustering Large Applications based on RANdomizedSearch) is introduced which is an improved k-medoid meth-od. Compared to former k-medoid algorithms, CLARANSis more effective and more efficient. An experimental evalu-ation indicates that CLARANS runs efficiently on databasesof thousands of objects. Ng &Han (1994) also discuss meth-ods to determine the "natural" number knat of Clusters in adatabase. They propose to run CLARANS once for each kfrom 2 to n. For each of the discovered clusterings the sil-houette coefficient (Kaufman & Rousseeuw 1990) is calcu-lated, and finally, the clustering with the maximum silhou-ette coefficient is chosen as the "natural" , the run time of this approach is prohibitivefor large n, because it implies O(n) calls of assumes that all objects to be clustered can re-side in main memory at the same time which does not holdfor large databases.

Furthermore, the run time of CLARANSis prohibitive on large databases. Therefore, Ester, Kriegel&Xu (1995) present several focusing techniques which ad-dress both of these problems by focusing the clustering pro-cess on the relevant parts of the database. First, the focus issmall enough to be memory resident and second, the runtime of CLARANS on the objects of the focus is significant-ly less than its run time on the whole algorithms create a hierarchical decomposi-tion of D. The hierarchical decomposition is represented bya dendrogram, a tree that iteratively splits D into smallersubsets until each subset consists of only one object. In sucha hierarchy, each node of the tree represents a cluster of dendrogram can either be created from the leaves up tothe root (agglomerative approach) or from the root down tothe leaves (divisive approach) by merging or dividing clus-ters at each step.

In contrast to partitioning algorithms, hier-archical algorithms do not need k as an input. However, a ter-mination condition has to be defined indicating when themerge or division process should be terminated. One exam-ple of a termination condition in the agglomerative approachis the critical distance Dmin between all the Clusters of far, the main problem with hierarchical clustering al-gorithms has been the difficulty of deriving appropriate pa-rameters for the termination condition, a value of Dminwhich is small enough to separate all "natural" Clusters and,at the same time large enough such that no cluster is split intotwo parts. Recently, in the area of signal processing the hier-archical Algorithm Ejcluster has been presented (Garcfa,Fdez-Valdivia, Cortijo & Molina 1994) automatically deriv-ing a termination condition.

Its key idea is that two points be-long to the same cluster if you can walk from the first pointto the second one by a "sufficiently small" step. Ejclusterfollows the divisive approach. It does not require any inputof domain knowledge. Furthermore, experiments show thatit is very effective in Discovering non-convex Clusters . How-ever, the computational cost of Ejcluster is O(n2) due to thedistance calculation for each pair of points. This is accept-able for applications such as character recognition withmoderate values for n, but it is prohibitive for applications onlarge (1988) explores a density based approach to identifyclusters in k-dimensional point sets. The data set is parti-tioned into a number of nonoverlapping cells and histogramsare constructed. Cells with relatively high frequency countsof points are the potential cluster centers and the boundariesbetween Clusters fall in the "valleys" of the histogram.

A Density-Based Algorithm for Discovering Clusters in ...

Tags:

Information

Advertisement

Transcription of A Density-Based Algorithm for Discovering Clusters in ...

Related search queries

A Density-Based Algorithm for Discovering Clusters in ...

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries