Transcription of Data Mining Association Analysis: Basic Concepts and ...
1 Data Mining Association Analysis: Basic Concepts and AlgorithmsLecture Notes for Chapter 6 Introduction to Data MiningbyTan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Association Rule MiningOGiven a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transactionMarket-Basket transactionsTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Association Rules{Diaper} {Beer},{Milk, Bread} {Eggs,Coke},{Beer, Bread} {Milk},Implication means co-occurrence, not causality! Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Definition: Frequent ItemsetOItemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k itemsOSupport count ( ) Frequency of occurrence of an itemset ({Milk, Bread,Diaper}) = 2 OSupport Fraction of transactions that contain an itemset s({Milk, Bread, Diaper}) = 2/5 OFrequent Itemset An itemset whose support is greater than or equal to a minsupthresholdTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4 Definition: Association RuleExample.
2 Beer}Diaper,Milk{ |T|)BeerDiaper,,Milk(=== )Diaper,Milk()BeerDiaper,Milk,(=== cOAssociation Rule An implication expression of the form X Y, where X and Y are itemsets Example:{Milk, Diaper} {Beer}ORule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions thatcontain XTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Association Rule Mining TaskOGiven a set of transactions T, the goal of Association rule Mining is to find all rules having support minsupthreshold confidence minconfthresholdOBrute-force approach: List all possible Association rules Compute the support and confidence for each rule Prune rules that fail the minsupand minconfthresholds Computationally prohibitive!
3 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6 Mining Association RulesExample of Rules:{Milk,Diaper} {Beer} (s= , c= ){Milk,Beer} {Diaper} (s= , c= ){Diaper,Beer} {Milk} (s= , c= ){Beer} {Milk,Diaper} (s= , c= ) {Diaper} {Milk,Beer} (s= , c= ) {Milk} {Diaper,Beer} (s= , c= )TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support butcan have different confidence Thus, we may decouple the support and confidence requirements Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 Mining Association RulesOTwo-step approach.
4 Itemset Generation Generate all itemsets whose support Generation Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemsetOFrequent itemset generation is still computationally expensive Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8 Frequent Itemset GenerationnullABACADAEBCBDBECDCEDEABCDEA BCABDABEACDACEADEBCDBCEBDECDEABCDABCEABD EACDEBCDEABCDEG iven d items, there are 2dpossible candidate itemsets Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9 Frequent Itemset GenerationOBrute-force approach: Each itemset in the lattice is a candidatefrequent itemset Count the support of each candidate by scanning the database Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2d!
5 !!TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke NTransactionsList ofCandidatesMw Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10 Computational ComplexityOGiven d unique items: Total number of itemsets = 2d Total number of possible Association rules: 1231111+ = =+ = = dddkkdjjkdkdRIf d=6, R = 602 rules Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11 Frequent Itemset Generation StrategiesOReduce the number of candidates(M) Complete search: M=2d Use pruning techniques to reduce MOReduce the number of transactions (N) Reduce size of N as the size of itemset increases Used by DHP and vertical-based Mining algorithmsOReduce the number of comparisons(NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12 Reducing Number of CandidatesOApriori principle: If an itemset is frequent, then all of its subsets must also be frequentOApriori principle holds due to the following property of the support measure: Support of an itemset never exceeds the support of its subsets This is known as the anti-monotoneproperty of support)()()(.
6 ,YsXsYXYX Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13 Found to be InfrequentnullABACADAEBCBDBECDCEDEABCDEA BCABDABEACDACEADEBCDBCEBDECDEABCDABCEABD EACDEBCDEABCDEI llustrating Apriori PrinciplenullABACADAEBCBDBECDCEDEABCDEAB CABDABEACDACEADEBCDBCEBDECDEABCDABCEABDE ACDEBCDEABCDEP runed supersets Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14 Illustrating Apriori PrincipleItemCountBread4 Coke2 Milk4 Beer3 Diaper4 Eggs1 ItemsetCount{Bread,Milk}3{Bread,Beer}2{B read,Diaper}3{Milk,Beer}2{Milk,Diaper}3{ Beer,Diaper}3 Ite m se t Count {B re a d ,M ilk ,D ia p e r} 3 Items (1-itemsets)Pairs (2-itemsets)(No need to generatecandidates involving Cokeor Eggs)Triplets (3-itemsets)Minimum Support = 3If every subset is considered, 6C1+ 6C2+ 6C3= 41 With support-based pruning,6 + 6 + 1 = 13 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Apriori AlgorithmOMethod: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16 Reducing Number of ComparisonsOCandidate counting.
7 Scan the database of transactions to determine the support of each candidate itemset To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed bucketsTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke NTransactionsHash StructurekBuckets Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Generate Hash Tree2 3 45 6 71 4 51 3 61 2 44 5 71 2 54 5 81 5 93 4 53 5 63 5 76 8 93 6 73 6 81,4,72,5,83,6,9 Hash functionSuppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}You need: Hash function Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18 Association Rule Discovery: Hash tree15 914513 63 453 6 73 6 83 5 63 5 76 8 92 3 45 6 712 445 712 545 81,4,72,5,83,6,9 Hash FunctionCandidate Hash TreeHash on 1, 4 or 7 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 19 Association Rule Discovery.
8 Hash tree1 591 4 51 3 63 4 53 6 73 6 83 563 576 8923 456 71 244 571 254 581,4,72,5,83,6,9 Hash FunctionCandidate Hash TreeHash on 2, 5 or 8 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20 Association Rule Discovery: Hash tree1 5 91 4 51 3 634 536736835 635 768 92 3 45 6 71 2 44 5 71 2 54 5 81,4,72,5,83,6,9 Hash FunctionCandidate Hash TreeHash on 3, 6 or 9 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 Subset Operation1 2 3 5 6 Transaction, t2 3 5 613 5 625 61 33 5 61 261 55 62 362 55 631 2 31 2 51 2 61 3 51 3 61 5 62 3 52 3 62 5 63 5 6 Subsets of 3 itemsLevel 1 Level 2 Level 363 5 Given a transaction t, what are the possible subsets of size 3? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22 Subset Operation Using Hash Tree1 5 91 4 51 3 63 4 53 6 73 6 83 5 63 5 76 8 92 3 45 6 71 2 44 5 71 2 54 5 81 2 3 5 61 +2 3 5 63 5 62 +5 63 +1,4,72,5,83,6,9 Hash Functiontransaction Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23 Subset Operation Using Hash Tree1 5 91 4 51 3 63 4 53 6 73 6 83 5 63 5 76 8 92 3 45 6 71 2 44 5 71 2 54 5 81,4,72,5,83,6,9 Hash Function1 2 3 5 63 5 61 2 +5 61 3 +61 5 +3 5 62 +5 63 +1 +2 3 5 6transaction Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24 Subset Operation Using Hash Tree1 5 91 4 51 3 63 4 53 6 73 6 83 5 63 5 76 8 92 3 45 6 71 2 44 5 71 2 54 5 81,4,72,5,83,6.
9 9 Hash Function1 2 3 5 63 5 61 2 +5 61 3 +61 5 +3 5 62 +5 63 +1 +2 3 5 6transactionMatch transaction against 11 out of 15 candidates Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25 Factors Affecting ComplexityOChoice of minimum support threshold lowering support threshold results in more frequent itemsets this may increase number of candidates and max length of frequent itemsetsODimensionality (number of items) of the data set more space is needed to store support count of each item if number of frequent items also increases, both computation and I/O costs may also increaseOSize of database since Apriori makes multiple passes, run time of algorithm may increase with number of transactionsOAverage transaction width transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
10 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26 Compact Representation of Frequent ItemsetsOSome itemsets are redundant because they have identical support as their supersetsONumber of frequent itemsetsONeed a compact representationTID A1A2A3A4A5A6A7A8A9A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C101111111111100000000000000000000211111 1111100000000000000000000311111111110000 0000000000000000411111111110000000000000 000000051111111111000000000000000000006 0000000000111111111100000000007 0000000000111111111100000000008 0000000000111111111100000000009 0000000000111111111100000000001000000000 0011111111110000000000110000000000000000 0000111111111112000000000000000000001111 1111111300000000000000000000111111111114 0000000000000000000011111111111500000000 0000000000001111111111 = =101103kk Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 27 Maximal Frequent ItemsetnullABACADAEBCBDBECDCEDEABCDEABCA BDABEACDACEADEBCDBCEBDECDEABCDABCEABDEAC DEBCDEABCDEB orderInfrequent ItemsetsMaximal ItemsetsAn itemset is maximal frequent if none of its immediate supersets is frequent Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28 Closed ItemsetOAn itemset is closed if none of its immediate supersets has