Fast Algorithms for Mining Association Rules

Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan S&ant*. IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120. Abstract tires and auto accessoriesalso get automotive services We consider the problem of discovering Association Rules done. Finding all such Rules is valuable for cross- between items in a large database of sales transactions . marketing and attached mailing applications. Other We present two new Algorithms for solving thii problem applications include catalog design, add-on sales, that are fundamentally different from the known algo- store layout, and customer segmentation based on rithms.

Empirical evaluation shows that these Algorithms buying patterns. The databases involved in these outperform the known Algorithms by factors ranging from applications are very large. It is imperative, therefore, three for small problems to more than an order of mag- to have fast Algorithms for this task. nitude for large problems. We also show how the best features of the two proposedalgorithms can be combined The following is a formal statement of the problem into a hybrid algorithm, called AprioriHybrid. Scale-up [4]: Let Z = {ir,iz, .. , im} be a set of literals, experiments show that AprioriHybrid scales linearly with called items.

Let 2) be a set of transactions , where the number of transactions . AprioriHybrid also has ex- each transaction T is a set of items such that T c cellent scale-up properties with respect to the transaction Z. Associated with each transaction is a unique size and the number of items in the database. identifier, called its TID. We say that a transaction T contains X, a set of some items in Z, if X c T. 1 Introduction An Association rzle is an implication of the form Progress in bar-code technology has made it possi- X q Y, where X C Z, Y c 2, and X rl Y = 0. ble for retail organizations to collect and store mas- The rule X a Y holds in the transaction set D with sive amounts of sales data, referred to as the basket confidence c if c% of transactions in D that contain data.

A record in such data typically consists of the X also contain Y. The rule X _ Y has support s transaction date and the items bought in the trans- in the transaction set V if s% of transactions in V. action. Successful organizations view such databases contain X U Y. Our Rules are somewhat more general as important pieces of the marketing infrastructure. than in [4] in that we allow a consequent to have more They are interested in instituting information-driven than one item. marketing processes, managed by database technol- Given a set of transactions D, the problem of min- ogy, that enable marketers to develop and implement ing Association Rules is to generate all Association Rules customized marketing programs and strategies [S].

That have support and confidence greater than the The problem of Mining Association Rules over basket user-specified minimum support (called minsup) and data was introduced in [4]. An example of such a minimum confidence (called minconf) respectively. rule might be that 98% of customers that purchase Our discussion is neutral with respect to the repre- sentation of V. For example, V could be a data file, *Visiting from the Department of Computer Science, Uni- versity of Wisconsin, Madison. a relational table, or the result of a relational expres- permission to copy without fee all or part of this material sion.

Is granted provided that the copies are not made OT distributed An algorithm for finding all Association Rules , for direct commercial advantage, the VLDB copyright notice and the title of the publication and itr date appear, and notice henceforth referred to as the AIS algorithm, was pre- is given that copying is by permission of the Very Large Data sented in [4]. Another algorithm for this task, called Base Endowment. To copq otherwise, or to republish, nquins the SETM algorithm, has been proposed in [13]. In a fee and/or special permisrion from the Endowment. this paper, we present two new Algorithms , Apriori Proceedings of the 20th VLDB Conference and AprioriTid, that differ fundamentally from these Santiago, Chile, 1994 Algorithms .

We present experimental results showing 487. that the proposed Algorithms always outperform the for an itemset is the number of transactions that earlier Algorithms . The performance gap is shown to contain the itemset. Itemsets with minimum sup- increase with problem size, and ranges from a fac- port are called large itemsets, and all others small tor of three for small problems to more than an or- itemsets. In Section 2, we give new Algorithms , der of magnitude for large problems. We then dis- Apriori and AprioriTid, for solving this problem. cuss how the best features of Apriori and Apriori- Tid can be combined into a hybrid algorithm, called 2.

Use the large itemsets to generate the desired Rules . Here is a straightforward algorithm for this task. AprioriHybrid. Experiments show that the Apriori- For every large itemset 1, find all non-empty subsets Hybrid has excellent scale-up properties, opening up of 1. For every such subset a, output a rule of the feasibility of Mining Association Rules over very the form a =+ (1 - a) if the ratio of support(l). large databases. to support(a) is at least minconf We need to The problem of finding Association Rules falls within consider all subsets of 1 to generate Rules with the purview of database Mining [3] [12], also called multiple consequents.

Due to lack of space, we do knowledge discovery in databases [21]. Related, but not discuss this subproblem further, but refer the not directly applicable, work includes the induction reader to [5] for a fast algorithm. of classification Rules [S] [ll] [22], discovery of causal Rules [19], learning of logical definitions [18], fitting In Section 3, we show the relative performance of functions to data [15], and clustering [9] [lo]. The of the proposed Apriori and AprioriTid Algorithms closest work in the machine learning literature is the against the AIS [4] and SETM [13] Algorithms . KID3 algorithm presented in [20].

If used for finding To make the paper self-contained, we include an all Association Rules , this algorithm will make as many overview of the AIS and SETM Algorithms in this passes over the data as the number of combinations section. We also describe how the Apriori and of items in the antecedent, which is exponentially AprioriTid Algorithms can be combined into a hybrid large. Related work in the database literature is algorithm, AprioriHybrid, and demonstrate the scale- the work on inferring functional dependencies from up properties of this algorithm. We conclude by data [16]. Functional dependencies are Rules requiring pointing out some related open problems in Section 4.

Fast Algorithms for Mining Association Rules

Tags:

Information

Transcription of Fast Algorithms for Mining Association Rules

Related search queries

Fast Algorithms for Mining Association Rules

Tags:

Information

Documents from same domain

Related documents

Related search queries