Transcription of An Apriori-based Algorithm for Mining - Osaka University
[...] efficiency has been con[...] "Graph structure" [...] field of chemistry, CASE and MultiCASE systems have often been used to discover characteristic substructures of chemical compounds [8], [9]. Though these systems can efficiently find the substructures, [...] [14]. Though the proposed algorithm is very efficient to mine frequent schemas from massive data, [...], the propositional classification techniques, [...], the regression tree techniques, [...], M5, and the inductive logic programming (ILP) techniques have been applied in the carcinogenesis predictions of chemical compounds [10], [7]. However, these approaches can discover only limited types of characteristic substructures, because the graph structures must be pre-characterized by some specific [...], a technique to mine the frequent substructures characterizing the carcinogenesis of chemical compounds has been proposed without requiring any conversion of substructures to specific features by Dehaspe et al. [3]. They used [...] [11]. Since the efficiency achieved by this approach is much better than the former ILP approaches, [...], the full search space was still so large that the search had to be limited within the 6th level where the substructures are represented with 6 predicates at maximum, and they reported that signi[...] (GBI) is an approach to seek the frequent patterns by iteratively chunking the vertex pairs that frequently appear [12].

(Author footnote: currently at Tokyo Research Institute, IBM, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa, 242-8502, [...])
SUBDUE is another approach to seek the characteristic graph patterns to efficiently compress the original graph in terms of the MDL principle [2]. [...], they may miss some significant patterns, [...], each work mines some charac[...] 1) to propose a novel approach named "Apriori-based Graph Mining", AGM for short, to mine the frequent substructures and the association rules from the general class of graph structured data in a more efficient manner than the preceding work, and 2) to assess the performance of the approach for artificially simulated data and also for the carcinogenesis data of Oxford University and the National Toxicology Program (NTP) [13].

2 Principle of Mining Graph Substructures

The methods studied in the mathematical graph isomorphism problem are not directly applicable to our case, because these methods only check whether two given graphs are isomorphic [4]. We introduce the mathematical graph representation of "adjacency matrix" and combine it with an efficient levelwise search of the frequent canonical matrix code [5]. The levelwise search is based on the extension of the Apriori algorithm of the basket analysis [1]. [...]

Definition 1 (Graph having Labels) Given a set of vertices $V(G) = \{v_1, v_2, \ldots, v_k\}$, a set of edges connecting some vertex pairs in $V(G)$, $E(G) = \{e_h = (v_i, v_j) \mid v_i, v_j \in V(G)\}$, a set of vertex labels $L(V(G)) = \{lb(v_i) \mid \forall v_i \in V(G)\}$ and a set of edge labels $L(E(G)) = \{lb(e_h) \mid \forall e_h \in E(G)\}$, then a graph $G$ is represented as $G = (V(G), E(G), L(V(G)), L(E(G)))$. [...]

[...], 2000, Lyon, France (to appear)

This graph $G$ is represented by an adjacency matrix $X$, which is a very well known representation in mathematical graph theory [4].
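As an illustration only, the following minimal Python sketch shows one way a labeled graph in the sense of Definition 1 could be stored as an adjacency matrix. It assumes that every edge label is mapped to a positive integer and that 0 marks a missing edge. The function name `adjacency_matrix`, its parameters and the small example graph are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple


def adjacency_matrix(
    num_vertices: int,
    edges: Dict[Tuple[int, int], str],
    edge_label_num: Dict[str, int],
    directed: bool = False,
) -> List[List[int]]:
    """Build a k x k matrix whose (i, j) entry is the number assigned to the
    label of edge (v_i, v_j), or 0 when no such edge exists.

    Vertices are indexed from 0 here (an implementation choice);
    edge_label_num is an assumed label-to-integer mapping.
    """
    x = [[0] * num_vertices for _ in range(num_vertices)]
    for (i, j), label in edges.items():
        x[i][j] = edge_label_num[label]
        if not directed:
            # An undirected edge appears symmetrically in the matrix.
            x[j][i] = edge_label_num[label]
    return x


# Hypothetical example: three vertices labeled C, C, O joined by a single
# bond (0-1) and a double bond (1-2).
vertex_labels = ["C", "C", "O"]
bond_num = {"single": 1, "double": 2}
bonds = {(0, 1): "single", (1, 2): "double"}
X = adjacency_matrix(len(vertex_labels), bonds, bond_num)
```

The vertex labels are kept in a separate list in this sketch; only the edge labels are encoded in the matrix entries.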
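This transformation from $G$ to $X$ does not require much computational effort. [...]

Definition 2 (Adjacency Matrix) Given a graph $G = (V(G), E(G), L(V(G)), L(E(G)))$, the adjacency matrix $X$ has the following $(i,j)$-element, $x_{ij}$,

$$x_{ij} = \begin{cases} \mathrm{num}(lb) & e_h = (v_i, v_j) \in E(G) \text{ and } lb = lb(e_h), \\ 0 & (v_i, v_j) \notin E(G), \end{cases}$$

where $\mathrm{num}(lb)$ [...], a number $\mathrm{num}(lb)$ is assigned to the $i$-th row ($i$-th column) of the matrix where $v_i \in V(G)$ and $lb = lb(v_i)$.

Definition 3 (Size of a Graph) The "size" of a graph $G$ is the number of vertices in $V(G)$, [...], $k$ in Definition [...]

Definition 4 (Graph Transaction and Graph Data) A graph $G = (V(G), E(G), L(V(G)), L(E(G)))$ is a transaction, and graph data $GD$ is a set of the transactions, where $GD = \{G_1, G_2, \ldots,$ [...]

[...] definition is either '0' or '1', whereas each element in De[...], and enables an e[...] ($i$-th column). To reduce the variants of the representations and increase the efficiency of the code matching described later, [...], and the graph as $G(X_k)$.

Definition 5 (Vertex-sorted Adjacency Matrix) The adjacency matrix $X_k$ of the graph $G(X_k)$ is vertex-sorted if

$$\mathrm{num}(lb(v_i)) \le \mathrm{num}(lb(v_{i+1})) \quad \text{for } i = 1, 2, \ldots, k-1.$$

In the standard basket analysis, items within an itemset are kept in lexicographic order [1]. This enables an e[...]

Definition 6 (Code of Adjacency Matrix) In case of an undirected graph, the code $code(X_k)$ of a vertex-sorted adjacency matrix $X_k$,

$$X_k = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,k} \\ x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{k,1} & x_{k,2} & x_{k,3} & \cdots & x_{k,k} \end{pmatrix},$$

[...] is defined as

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,2}\, x_{1,3}\, x_{2,3}\, x_{3,3}\, x_{1,4} \cdots x_{k-1,k}\, x_{k,k};$$

[...], it is defined as

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,1}\, x_{2,2}\, x_{1,3}\, x_{3,1}\, x_{2,3}\, x_{3,2} \cdots x_{k-1,k}\, x_{k,k-1}\, x_{k,k},$$

where the digits are obtained similarly to the undirected case, but the diagonally symmetric element $x_{ji}$ is added after each $x_{ij}$ when $i \ne j$. [...]

Definition 7 (Induced Subgraph) Given a graph $G = (V(G), E(G), L(V(G)), L(E(G)))$, an induced subgraph of $G$, $G_s = (V(G_s), E(G_s), L(V(G_s)), L(E(G_s)))$, [...]

$$V(G_s) \subset V(G), \quad E(G_s) \subset E(G), \quad \forall u, v \in V(G_s),\ (u,v) \in E(G_s) \Leftrightarrow (u,v) \in E(G).$$

When $G_s$ is an induced subgraph of $G$, it is denoted as $G_s \subset G$. [...] definitions of "support" and "confidence" [...]

Definition 8 (Support and Confidence) Given a graph $G_s$, the support of $G_s$ is defined as

$$sup(G_s) = \frac{\text{number of graph transactions } G \text{ where } G_s \subset G \in GD}{\text{total number of graph transactions } G \in GD}.$$

Given two induced subgraphs $G_b$ and $G_h$, the confidence of the association rule $G_b \Rightarrow G_h$ is defined as

$$conf(G_b \Rightarrow G_h) = \frac{\text{number of graphs } G \text{ where } G_b \cup G_h \subset G \in GD}{\text{number of graphs } G \text{ where } G_b \subset G \in GD}.$$

If the value of $sup(G_s)$ is more than a threshold value minsup, $G_s$ is called a "frequent induced subgraph".

Similarly to the Apriori algorithm, [...] $G(X_k)$ and $G(Y_k)$ [...] $G(X_k)$ and $G(Y_k)$ [...] elements of the matrices except for the elements of the $k$-th row and the $k$-th column, then they are joined to generate $Z_{k+1}$ [...]

$$X_k = \begin{pmatrix} X_{k-1} & x_1 \\ x_2^T & x_{kk} \end{pmatrix}, \quad Y_k = \begin{pmatrix} X_{k-1} & y_1 \\ y_2^T & y_{kk} \end{pmatrix},$$

$$Z_{k+1} = \begin{pmatrix} X_{k-1} & x_1 & y_1 \\ x_2^T & x_{kk} & z_{k,k+1} \\ y_2^T & z_{k+1,k} & y_{kk} \end{pmatrix} = \begin{pmatrix} X_k & \begin{matrix} y_1 \\ z_{k,k+1} \end{matrix} \\ \begin{matrix} y_2^T & z_{k+1,k} \end{matrix} & y_{kk} \end{pmatrix}, \qquad (1)$$

where $X_{k-1}$ is the adjacency matrix representing the graph whose size is $k-1$, $x_i$ and $y_i$ $(i = 1, 2)$ are $(k-1) \times 1$ column vectors. [...] the "first matrix" and $Y_k$ the "second matrix". The following relations hold among the vertex-sorted adjacency matrices $X_k$, $Y_k$ and $Z_{k+1}$ [...]

$$\begin{aligned}
lb(v_i;\ v_i \in V(G(X_k))) &= lb(v_i;\ v_i \in V(G(Y_k))) = lb(v_i;\ v_i \in V(G(Z_{k+1}))), \\
lb(v_i;\ v_i \in V(G(X_k))) &\le lb(v_{i+1};\ v_{i+1} \in V(G(X_k))), \\
lb(v_k;\ v_k \in V(G(X_k))) &= lb(v_k;\ v_k \in V(G(Z_{k+1}))), \\
lb(v_k;\ v_k \in V(G(Y_k))) &= lb(v_{k+1};\ v_{k+1} \in V(G(Z_{k+1}))), \\
lb(v_k;\ v_k \in V(G(X_k))) &\le lb(v_k;\ v_k \in V(G(Y_k))).
\end{aligned} \qquad (2)$$

Here, $i = 1, \ldots, k-1$. [...] $z_{k,k+1}$ and $z_{k+1,k}$ [...] $\mathrm{num}(lb)$ corresponding to each edge label $lb$ or 0 corresponding to the case that no edge exists between $v_k$ and $v_{k+1}$. [...], $z_{k,k+1}$ and $z_{k+1,k}$ [...] $Z_{k+1}$s for all possible value pairs of $z_{k,k+1}$ and $z_{k+1,k}$. [...] $G(X_k)$ and $G(Y_k)$ are the same, exchanging $X_k$ and $Y_k$ ([...], taking $Y_k$ as the first matrix and $X_k$ as the second matrix), [...], the two adjacency matrices are joined only when Eq. (3) is satisfied. [...] "normal form".

$$code(\text{the first matrix}) \le code(\text{the second matrix}) \qquad (3)$$

In the standard basket analysis, the $(k+1)$-itemset becomes a candidate frequent itemset only when all the $k$-sub-itemsets are con[...], the graph $G$ of size $k+1$ is a candidate of frequent induced subgraphs only when all adjacency matrices generated by removing from the graph $G$ the $i$-th vertex $v_i$ $(1 \le i \le k+1)$ and all its connected links are con[...] (smaller) $k$-levels, if the adjacency matrix of the graph generated by removing the $i$-th vertex $v_i$ is a non-normal form, [...], an adjacency matrix of the size $1 \times 1$ is set for each vertex $v_i \in G(X_k)$. Then, the pairs of the matrices for the vertices $v_i, v_j \in G(X_k)$ satisfying the constraints of Eq. (2) and (3) are joined by the operation of Eq. (1). At this time, the values of the elements for $(v_i, v_j)$ and $(v_j, v_i)$ in the original $X_k$ are substituted for the non-diagonal elements $z_{1,2}$ and $z_{2,1}$ respectively to reconstruct the structure of $G(X_k)$. Subsequently, the pairs of the obtained $2 \times 2$ matrices are further joined according to the constraints of Eq. (1), (2) and (3). The values of the elements $z_{2,3}$ and $z_{3,2}$ [...] reflects the structure of $G(X_k)$, and is constructed by following the constraints, [...] "normalization".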
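The code of Definition 6 and the join of Eq. (1) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: matrices are plain nested lists, the code is returned as a list of digits compared lexicographically (an assumption standing in for the paper's code number), and the vertex-label conditions of Eq. (2) and the ordering test of Eq. (3) are left out. The names `code`, `joinable` and `join` are hypothetical.

```python
from itertools import product
from typing import Iterator, List, Sequence

Matrix = List[List[int]]


def code(x: Matrix, directed: bool = False) -> List[int]:
    """Digits of Definition 6: scan the upper triangle column by column.
    In the directed case, x[j][i] follows each off-diagonal x[i][j]."""
    digits: List[int] = []
    for j in range(len(x)):
        for i in range(j + 1):
            digits.append(x[i][j])
            if directed and i != j:
                digits.append(x[j][i])
    return digits


def joinable(x: Matrix, y: Matrix) -> bool:
    """True when x and y agree everywhere except the last row and column."""
    k = len(x)
    return len(y) == k and all(
        x[i][j] == y[i][j] for i in range(k - 1) for j in range(k - 1)
    )


def join(
    x: Matrix, y: Matrix, edge_nums: Sequence[int], directed: bool = False
) -> Iterator[Matrix]:
    """Yield every Z_{k+1} of Eq. (1), one per choice of the two new elements
    z_{k,k+1} and z_{k+1,k} (equal to each other in the undirected case)."""
    if not joinable(x, y):
        return
    k = len(x)
    values = [0, *edge_nums]  # 0 means no edge between v_k and v_{k+1}
    pairs = product(values, repeat=2) if directed else ((v, v) for v in values)
    for z_upper, z_lower in pairs:
        # Rows 1..k-1: the shared block X_{k-1}, then x_1, then y_1.
        z = [x[i][:] + [y[i][k - 1]] for i in range(k - 1)]
        # Row k: x_2^T, x_kk, z_{k,k+1}.
        z.append(x[k - 1][:] + [z_upper])
        # Row k+1: y_2^T, z_{k+1,k}, y_kk.
        z.append(y[k - 1][: k - 1] + [z_lower, y[k - 1][k - 1]])
        yield z


# Hypothetical usage: join two size-2 undirected graphs sharing vertex v_1.
first = [[0, 1], [1, 0]]
second = [[0, 2], [2, 0]]
candidates = list(join(first, second, edge_nums=[1, 2]))
```

In the undirected example each candidate differs only in whether, and with which label, the two newly paired vertices are connected.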
In the intermediate levels, the normal forms of all induced subgraphs of $G(X_k)$ [...] $(T_k)$ [...] [6].

Canonical Form

After all candidate induced subgraphs are derived, [...] efficiently, [...], canonical form is defined for normal forms of adjacency matrices representing an identical induced subgraph, and an e[...]

Definition 9 (Canonical Form) Given a set $NF(G)$ of all normal forms of adjacency matrices representing an identical graph $G$, its canonical form $X_c$ is defined as $X$ having the minimum code number in $NF(G)$, [...],

$$X_c = \arg\min_{X \in NF(G)} code(X).$$

We assume that all the transformation matrices $S_{k-1}$ [...] be the matrix obtained by removing the $m$-th vertex $v_m$ $(1 \le m \le k)$ from $G(X_k)$. $X^m_{k-1}$ is transformed to one of its normal forms, $X'^m_{k-1}$, by the aforementioned normalization, and thus its transformation matrix $T^m_{k-1}$ [...], let $S_{k-1}$ of $X'^m_{k-1}$ be $S^m_{k-1}$, then the transformed canonical form is represented by $(T^m_{k-1} S^m_{k-1})^T X^m_{k-1} T^m_{k-1} S^m_{k-1}$. [...] $S^m_k$, $T^m_k$ to transform $X_k$ to $X_{ck}$ are obtained from $S^m_{k-1}$, $T^m_{k-1}$ [...] [6].

$$s_{ij} = \begin{cases} s^m_{ij} & 0 \le i \le k-1 \text{ and } 0 \le j \le k-1, \\ 1 & i = k \text{ and } j = k, \\ 0 & \text{otherwise}, \end{cases} \qquad t_{ij} = \begin{cases} t^m_{ij} & i < m \text{ and } j \ne k, \\ t^m_{i-1,j} & i > m \text{ and } j \ne k, \\ 1 & i = m \text{ and } j = k, \\ 0 & \text{otherwise}, \end{cases}$$

$$X_{ck} = \arg\min_{m = 1, \ldots, k} code\left((T^m_k S^m_k)^T X_k (T^m_k S^m_k)\right),$$

where $s_{ij}$, $s^m_{ij}$, $t_{ij}$ and $t^m_{ij}$ are the elements of the matrices $S^m_k$, $S^m_{k-1}$, $T^m_k$ and $T^m_{k-1}$ [...]

[...], the association rules among them whose confidence values are more than a given con[...], one for the directed graph and the other for the undirected graph, [...] Figures 1, 2, 3 and 4 show the results of computation time for different numbers of transactions, numbers of vertex labels, minimum support thresholds and average transaction sizes for both directed and undirected graphs, [...], the proposed algorithm does not show intractable computational complexity except in the cases of graphs of large size in the database.
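Returning to Definition 9, the idea of picking the matrix with the minimum code can be illustrated with a brute-force Python sketch. Note the substitution: instead of the normal forms and the transformation matrices $S$ and $T$ described above, this sketch enumerates every vertex permutation that keeps the matrix vertex-sorted and takes the minimum code over all of them. That set is only a stand-in for $NF(G)$ and need not coincide with it, and the search is exponential in the graph size. The names `code` and `min_code_form` are hypothetical, and the code is again handled as a digit list compared lexicographically.

```python
from itertools import permutations
from typing import List, Tuple

Matrix = List[List[int]]


def code(x: Matrix, directed: bool = False) -> List[int]:
    """Digits of Definition 6, read column by column over the upper triangle."""
    digits: List[int] = []
    for j in range(len(x)):
        for i in range(j + 1):
            digits.append(x[i][j])
            if directed and i != j:
                digits.append(x[j][i])
    return digits


def min_code_form(
    vertex_nums: List[int], x: Matrix, directed: bool = False
) -> Tuple[List[int], Matrix]:
    """Return (code, matrix) with the smallest code among all vertex-sorted
    re-orderings of x. vertex_nums[i] is the number of the label of v_i."""
    k = len(x)
    best = None
    for perm in permutations(range(k)):
        labels = [vertex_nums[p] for p in perm]
        if any(labels[i] > labels[i + 1] for i in range(k - 1)):
            continue  # this ordering is not vertex-sorted (Definition 5)
        m = [[x[perm[i]][perm[j]] for j in range(k)] for i in range(k)]
        c = code(m, directed)
        if best is None or c < best[0]:
            best = (c, m)
    return best


# Two adjacency matrices of the same three-vertex path, numbered differently.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # middle vertex is v_2
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # middle vertex is v_1
code_a, _ = min_code_form([1, 1, 1], path_a)
code_b, _ = min_code_form([1, 1, 1], path_b)
```

Because both matrices describe the same graph, the two minimum codes agree, which is the property that lets a single code identify an induced subgraph during counting.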
[Figures 1 to 4: computation time plots; one recoverable axis label reads "Number of graphs in database".]

[...] [13]. The task is to [...] (NTP) [...]

[Table legend: [...]: level (number of vertices included in frequent subgraph); NOC: number of candidates; NOFS: number of frequent graphs.]

[...] the 300 compounds were selected for the analysis, [...], H, O, Cl, F, S and some cations, and the types of bonds are single, double, [...]c structures at a speci[...], an isolated vertex labeled by the carcinogenesis class of the compound, [...], "class vertex", [...] (NOC) and that of the discovered frequent induced subgraphs (NOF) for each level of the search, [...], and was almost 8 days for 10%, while it was only about 40 minutes for 20%.
The size of the largest frequent induced subgraph discovered in the case of 10% [...], the confidence deviation [...] of an association rule $G_b \Rightarrow G_h$ is given as follows.

[...] $= (conf(G_b \Rightarrow G_h)$ [...] $(G_b \Rightarrow G_h)$ [...], $fr_p$ is the fraction of positive compounds in the data, [...], and $fr_n$ is that of negative compounds, [...] (= 100% [...]). [...] threshold [...], a set of association rules each having a confidence deviation of more than the threshold is defined, [...], the rule set derived for the 10% threshold contains some rules having significant con[...], the exhaustive search for a low support threshold is considered to be very effective to mine valuable rules.

[Figure 5: Relation of [...]]

[Figure 6: example association rules whose consequents are "Negative" or "Positive", each listed with its sup and conf values.]

[...] first rule is very simple, [...], the symbol X of a vertex and ? [...], [3]. This fact shows the practical e[...], a novel approach has been developed that can e[...]

References

[...] Journal of Artificial Intelligence Research, [...]
[...], L., Toivonen, [...] (KDD-98), [...]
[...], University of Alberta, Edmonton, Alberta, [...]
[...], A., Washio, [...]: Proceedings of the Second International Conference, DS'99, [...]
[...] (in Japanese), [...]
[...], R., Muggleton, S., Srinivasan, [...]
[...], QSAR, [...]
[...], S., Pfahringer, [...] (KDD-97), [...]
[...], T., Horiuchi, T., Motoda, [...] Pacific-Asia Conference of Knowledge Discovery and Data Mining (PAKDD2000), [...]
[...], A., King, [...], Muggleton, [...] Artificial Intelligence (IJCAI-97), [...]
[...] (KDD-97), [...]-274.