k-Shape: Efﬁcient and Accurate Clustering of Time Series

K-Shape: Efficient and Accurate Clustering of Time SeriesJohn PaparrizosColumbia GravanoColumbia proliferation and ubiquity of temporal data across manydisciplines has generated substantial interest in the analysisand mining of time Series . Clustering is one of the most pop-ular data mining methods, not only due to its exploratorypower, but also as a preprocessing step or subroutine forother techniques. In this paper, we presentk-Shape, a novelalgorithm for time- Series relies on a scal-able iterative refinement procedure, which creates homoge-neous and well-separated clusters. As its distance measure,k-Shape uses a normalized version of the cross-correlationmeasure in order to consider the shapes of time Series whilecomparing them.

Based on the properties of that distancemeasure, we develop a method to compute cluster centroids,which are used in every iteration to update the assignmentof time Series to clusters. To demonstrate the robustness ofk-Shape, we perform an extensive experimental evaluationof our approach against partitional, hierarchical, and spec-tral Clustering methods, with combinations of the most com-petitive distance outperforms all scalableapproaches in terms of accuracy. Furthermore,k-Shape alsooutperforms all non-scalable (and hence impractical) com-binations, with one exception that achieves similar accuracyresults. However, unlikek-Shape, this combination requirestuning of its distance measure and is two orders of mag-nitude slower thank-Shape.

Overall,k-Shape emerges asa domain-independent, highly Accurate , and highly efficientclustering approach for time Series with broad INTRODUCTIONT emporal, or sequential, data mining deals with problemswhere data are naturally organized in sequences [34]. Werefer to such data sequences as time- Series sequences if theycontain explicit information about timing ( , stock, au-dio, speech, and video) or if an ordering on values can beinferred ( , streams and handwriting). Large volumes oftime- Series sequences appear in almost every discipline, in-cluding astronomy, biology, meteorology, medicine, finance,robotics, engineering, and others [1, 6, 25, 27, 36, 52, 70, 76].

The ubiquity of time Series has generated a substantial inter-Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from 15,May 31 June 4, 2015, Melbourne, Victoria, 2015 ACM 978-1-4503-2758-9/15/05.

$ AECG (local)Types of sequence alignmentClass BClass A30405060708090100 Class BLinear drift (global)Class AClass BFigure 1: ECG sequence examples and types of alignmentsfor the two classes of the ECGFiveDays dataset [1].est in querying [2, 19, 45, 46, 48, 62, 74, 79], indexing [9, 13,41, 42, 44, 77], classification [37, 56, 68, 88], Clustering [43,54, 64, 87, 89], and modeling [4, 38, 86] of such all techniques applied to time- Series data, cluster-ing is the most widely used as it does not rely on costlyhuman supervision or time-consuming annotation of Clustering , we can identify and summarize interestingpatterns and correlations in the underlying data [33].

In thelast few decades, Clustering of time- Series sequences has re-ceived significant attention [5, 16, 25, 47, 60, 64, 66, 87, 89],not only as a powerful stand-alone exploratory method, butalso as a preprocessing step or subroutine for other time- Series analysis techniques, including Clustering ,critically depend on the choice of distance measure. A keyissue when comparing two time- Series sequences is how tohandle the variety of distortions, as we will discuss, that arecharacteristic of the sequences. To illustrate this point, con-sider the well-known ECGFiveDays dataset [1], with ECGsequences recorded for the same patient on two differentdays.

While the sequences seem similar overall, they exhibitpatterns that belong in one of the two distinct classes (seeFigure 1): Class A is characterized by a sharp rise, a drop,and another gradual increase while Class B is characterizedby a gradual increase, a drop, and another gradual , ashape-basedclustering method should generate apartition similar to the classes shown in Figure 1, where se-quences exhibiting similar patterns are placed into the samecluster based on theirshapesimilarity, regardless of differ-ences in amplitude and phase. As the notion of shape cannotbe precisely defined, dozens of distance measures have beenproposed [11, 12, 14, 19, 20, 22, 55, 75, 78, 81] to offer invari-ances to multiple inherent distortions in the data.

However,it has been shown that distance measures offering invari-ances to amplitude and phase perform exceptionally well[19, 81] and, hence, such distance measures are used forshape-based Clustering [53, 59, 64, 87].Due to these difficulties and the different needs for invari-ances from one domain to another, more attention has beengiven to the creation of new distance measures rather thanto the creation of new Clustering algorithms. It is generallybelieved that the choice of distance measure is more im-portant than the Clustering algorithm itself [7]. As a conse-quence, time- Series Clustering relies mostly on classic cluster-ing methods, either by replacing the default distance mea-sure with one that is more appropriate for time Series , orby transforming time Series into flat data so that existingclustering algorithms can be directly used [83].

However, thechoice of Clustering method can affect: (i) accuracy, as ev-ery method expresses homogeneity and separation of clus-ters differently; and (ii) efficiency, as the computational costdiffers from one method to another. For example, spectralclustering [21] or certain variants of hierarchical Clustering [40] are more appropriate to identify density-based clusters( , areas of higher density than the remainder of the data)than partitional methods such ask-means [50] ork-medoids[40]. On the other hand,k-means is more efficient than hi-erarchical, spectral, ork-medoids , state-of-the-art approaches for shape-basedclustering, which use partitional methods with distance mea-sures that are scale- and shift-invariant, suffer from twomain drawbacks: (i) these approaches cannot scale to large-volumes of data as they depend on computationally expen-sive methods or distance measures [53, 59, 64, 87]; (ii) theseapproaches have been developed for particular domains [87]or their effectiveness has only been shown for a limitednumber of datasets [53, 59].

Moreover, the most success-ful shape-based Clustering methods handle phase invariancethrough a local, non-linear alignment of the sequence coor-dinates, even though a global alignment is often example, for the ECG dataset in Figure 1, an efficientlinear drift can reveal the underlying differences in patternsof sequences of two classes, whereas an expensive non-linearalignment might match every corresponding increase or dropof each sequence , making it difficult to distinguish the twoclasses (see Figure 1). Importantly, to the best of our knowl-edge, these approaches have never been extensively evalu-ated against each other, against other partitional methods,or against different approaches such as hierarchical or spec-tral methods.

k-Shape: Efﬁcient and Accurate Clustering of Time Series

Tags:

Information

Transcription of k-Shape: Efﬁcient and Accurate Clustering of Time Series

Related search queries

k-Shape: Efﬁcient and Accurate Clustering of Time Series

Tags:

Information

Documents from same domain

Related documents

Related search queries