Transcription of An Apriori-based Algorithm for Mining - Osaka University
1 [...] efficiency has been con[...] "Graph structure" [...] field of chemistry, CASE and MultiCASE systems have often been used to discover characteristic substructures of chemical compounds [8], [9]. Though these systems can efficiently find the substructures, [...] [14]. Though the proposed algorithm is very efficient to mine frequent schemas from massive data, [...], the propositional classification techniques, [...], the regression tree techniques, [...], M5, and the inductive logic programming (ILP) techniques have been applied in the carcinogenesis predictions of chemical compounds [10], [7]. However, these approaches can discover only limited types of characteristic substructures, because the graph structures must be pre-characterized by some specific [...], a technique to mine the frequent substructures characterizing the carcinogenesis of chemical compounds has been proposed without requiring any conversion of substructures to specific features by Dehaspe et al.
2 [3]. They used [...] [11]. Since the efficiency achieved by this approach is much better than the former ILP approaches, [...], the full search space was still so large that the search had to be limited within the 6th level, where the substructures are represented with 6 predicates at maximum, and they reported that signi[...] (GBI) is an approach to seek the frequent patterns by iteratively chunking the vertex pairs that frequently appear [12]. SUBDUE is another approach to seek the characteristic graph patterns to efficiently compress the original graph in terms of the MDL principle [2].

(Author footnote: Currently at Tokyo Research Institute, IBM, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa, 242-8502, [...])
3 [...], they may miss some significant patterns, [...], each work mines some charac[...] 1) to propose a novel approach named "Apriori-based Graph Mining", AGM for short, to mine the frequent substructures and the association rules from the general class of graph structured data in a more efficient manner than the preceding work, and 2) to assess the performance of the approach for the artificially simulated data and also for the carcinogenesis data of Oxford University and National Toxicological Program (NTP) [13].

2 Principle of Mining Graph Substructures

The methods studied in the mathematical graph isomorphism problem are not directly applicable to our case, because those methods only check whether two given graphs are isomorphic [4].
4 We introduce the mathematical graph representation of the "adjacency matrix" and combine it with an efficient levelwise search of the frequent canonical matrix code [5]. The levelwise search is based on the extension of the Apriori algorithm of the basket analysis [1].

Definition 1 (Graph having Labels) Given a set of vertices $V(G) = \{v_1, v_2, \ldots, v_k\}$, a set of edges connecting some vertex pairs in $V(G)$, $E(G) = \{e_h = (v_i, v_j) \mid v_i, v_j \in V(G)\}$, a set of vertex labels $L(V(G)) = \{lb(v_i) \mid \forall v_i \in V(G)\}$ and a set of edge labels $L(E(G)) = \{lb(e_h) \mid \forall e_h \in E(G)\}$, then a graph $G$ is represented as $G = (V(G), E(G), L(V(G)), L(E(G)))$ [...]

[...] 2000, Lyon, France (to appear)

This graph $G$ is represented by an adjacency matrix $X$, which is a very well known representation in mathematical graph theory [4].
5 This transformation from $G$ to $X$ does not require much computational e[...]

Definition 2 (Adjacency Matrix) Given a graph $G = (V(G), E(G), L(V(G)), L(E(G)))$, the adjacency matrix $X$ has the following $(i,j)$-element, $x_{ij}$:

$$x_{ij} = \begin{cases} num(lb), & e_h = (v_i, v_j) \in E(G) \text{ and } lb = lb(e_h), \\ 0, & (v_i, v_j) \notin E(G), \end{cases}$$

where $num(lb)$ [...], a number $num(lb)$ is assigned to the $i$-th row ($i$-th column) of the matrix where $v_i \in V(G)$ and $lb = lb(v_i)$.

Definition 3 (Size of a Graph) The "size" of a graph $G$ is the number of vertices in $V(G)$, [...], $k$ in De[...]

Definition 4 (Graph Transaction and Graph Data) A graph $G = (V(G), E(G), L(V(G)), L(E(G)))$ is a transaction, and graph data $GD$ is a set of the transactions, where $GD = \{G_1, G_2, \ldots,$ [...]

[...] definition is either '0' or '1', whereas each element in De[...], and enables an e[...] ($i$-th column).
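Definition 2 can be illustrated with a small sketch. The function name `adjacency_matrix`, the 0-based vertex indices, the dictionary-based edge list, and the example labels are assumptions made for illustration only; the only property taken from the definition is that $x_{ij}$ holds $num(lb)$ for a labeled edge and 0 when no edge exists, so `num` is assumed to map every edge label to a positive integer.

```python
from typing import Dict, List, Tuple


def adjacency_matrix(
    vertex_labels: List[str],
    edges: Dict[Tuple[int, int], str],
    num: Dict[str, int],
    directed: bool = False,
) -> List[List[int]]:
    """Build X with x[i][j] = num(lb) if edge (v_i, v_j) carries label lb, else 0."""
    k = len(vertex_labels)  # size of the graph (Definition 3)
    x = [[0] * k for _ in range(k)]
    for (i, j), lb in edges.items():
        x[i][j] = num[lb]
        if not directed:
            x[j][i] = num[lb]  # undirected graphs give a symmetric matrix
    return x


# Hypothetical example: a C-C-O chain whose two edges share one label.
vertex_labels = ["C", "C", "O"]
edges = {(0, 1): "single", (1, 2): "single"}
num = {"single": 1, "double": 2}  # assumed label numbering, 0 is reserved for "no edge"

X = adjacency_matrix(vertex_labels, edges, num)
size = len(X)
```

Vertex labels are kept in a separate list here, which is one possible design; the matrix itself then only encodes edges.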
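6 To reduce the variants of the representations and increase the efficiency of the code matching described later, [...], and the graph as $G(X_k)$.

Definition 5 (Vertex-sorted Adjacency Matrix) The adjacency matrix $X_k$ of the graph $G(X_k)$ is vertex-sorted if

$$num(lb(v_i)) \le num(lb(v_{i+1})) \quad \text{for } i = 1, 2, \ldots, k-1.$$

In the standard basket analysis, items within an itemset are kept in lexicographic order [1]. This enables an e[...]

Definition 6 (Code of Adjacency Matrix) In case of an undirected graph, the code $code(X_k)$ of a vertex-sorted adjacency matrix $X_k$,

$$X_k = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,k} \\ x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{k,1} & x_{k,2} & x_{k,3} & \cdots & x_{k,k} \end{pmatrix},$$

[...] is defined as

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,2}\, x_{1,3}\, x_{2,3}\, x_{3,3}\, x_{1,4} \cdots x_{k-1,k}\, x_{k,k};$$

[...], it is defined as

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,1}\, x_{2,2}\, x_{1,3}\, x_{3,1}\, x_{2,3}\, x_{3,2} \cdots x_{k-1,k}\, x_{k,k-1}\, x_{k,k},$$

where the digits are obtained similarly to the undirected case, but the diagonally symmetric element $x_{j,i}$ is added after each $x_{i,j}$ when $i \neq j$ [...]

Definition 7 (Induced Subgraph) Given a graph $G = (V(G), E(G),$ [...]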
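7 [...] $L(V(G)), L(E(G)))$, an induced subgraph of $G$, $G_s = (V(G_s), E(G_s), L(V(G_s)), L(E(G_s)))$, [...]

$$V(G_s) \subseteq V(G), \quad E(G_s) \subseteq E(G), \quad \forall u, v \in V(G_s):\ (u,v) \in E(G_s) \Leftrightarrow (u,v) \in E(G).$$

When $G_s$ is an induced subgraph of $G$, it is denoted as $G_s \subset G$ [...] definitions of "support" and "confidence" [...]

Definition 8 (Support and Confidence) Given a graph $G_s$, the support of $G_s$ is defined as

$$sup(G_s) = \frac{\text{number of graph transactions } G \text{ where } G_s \subset G \in GD}{\text{total number of graph transactions } G \in GD}.$$

Given two induced subgraphs $G_b$ and $G_h$, the confidence of the association rule $G_b \Rightarrow G_h$ is defined as

$$conf(G_b \Rightarrow G_h) = \frac{\text{number of graphs } G \text{ where } G_b \cup G_h \subset G \in GD}{\text{number of graphs } G \text{ where } G_b \subset G \in GD}.$$

If the value of $sup(G_s)$ is more than a threshold value minsup, $G_s$ is called a "frequent induced subgraph".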
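The support and confidence of Definition 8 can be made concrete with a brute-force sketch for undirected graphs. Everything named here (`is_induced_subgraph`, `support`, `confidence`, the `(labels, matrix)` tuple layout, the toy data) is an assumption for illustration. The exhaustive vertex-mapping test is not the levelwise method of the text and is only practical for very small graphs. Because the excerpt does not spell out how $G_b \cup G_h$ is formed, the union pattern is passed in by the caller.

```python
from itertools import permutations


def is_induced_subgraph(p_labels, p_mat, g_labels, g_mat):
    """True if the pattern occurs in the graph as a labeled induced subgraph."""
    k, n = len(p_labels), len(g_labels)
    for m in permutations(range(n), k):  # injective map: pattern vertex a -> m[a]
        labels_ok = all(p_labels[a] == g_labels[m[a]] for a in range(k))
        # Induced: edges AND non-edges among the mapped vertices must agree.
        edges_ok = all(
            p_mat[a][b] == g_mat[m[a]][m[b]] for a in range(k) for b in range(k)
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, graph_data):
    hits = sum(is_induced_subgraph(*pattern, *g) for g in graph_data)
    return hits / len(graph_data)


def confidence(body, union, graph_data):
    n_body = sum(is_induced_subgraph(*body, *g) for g in graph_data)
    n_union = sum(is_induced_subgraph(*union, *g) for g in graph_data)
    return n_union / n_body if n_body else 0.0


# Hypothetical graph data GD: three small transactions as (labels, matrix).
chain = (["C", "C", "O"], [[0, 1, 0], [1, 0, 1], [0, 1, 0]])
c_o = (["C", "O"], [[0, 1], [1, 0]])
c_c = (["C", "C"], [[0, 1], [1, 0]])
GD = [chain, c_o, c_c]

sup_co = support(c_o, GD)             # C-O occurs in 2 of 3 transactions
conf_rule = confidence(c_c, chain, GD)  # 1 of the 2 graphs containing C-C
```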
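8 Similarly to the Apriori algorithm, [...] $G(X_k)$ and $G(Y_k)$ [...] $G(X_k)$ and $G(Y_k)$ [...] elements of the matrices except for the elements of the $k$-th row and the $k$-th column, then they are joined to generate $Z_{k+1}$ [...]

$$X_k = \begin{pmatrix} X_{k-1} & x_1 \\ x_2^T & x_{kk} \end{pmatrix}, \quad Y_k = \begin{pmatrix} X_{k-1} & y_1 \\ y_2^T & y_{kk} \end{pmatrix},$$

$$Z_{k+1} = \begin{pmatrix} X_{k-1} & x_1 & y_1 \\ x_2^T & x_{kk} & z_{k,k+1} \\ y_2^T & z_{k+1,k} & y_{kk} \end{pmatrix} = \begin{pmatrix} X_k & \begin{matrix} y_1 \\ z_{k,k+1} \end{matrix} \\ \begin{matrix} y_2^T & z_{k+1,k} \end{matrix} & y_{kk} \end{pmatrix}, \tag{1}$$

where $X_{k-1}$ is the adjacency matrix representing the graph whose size is $k-1$, $x_i$ and $y_i$ ($i = 1, 2$) are $(k-1)$ [...] "first matrix" and $Y_k$ the "second matrix". The following relations hold among the vertex-sorted adjacency matrices $X_k$, $Y_k$ and $Z_{k+1}$ [...]

$$\begin{aligned}
lb(v_i;\ v_i \in V(G(X_k))) &= lb(v_i;\ v_i \in V(G(Y_k))) = lb(v_i;\ v_i \in V(G(Z_{k+1}))), \\
lb(v_i;\ v_i \in V(G(X_k))) &\le lb(v_{i+1};\ v_{i+1} \in V(G(X_k))), \\
lb(v_k;\ v_k \in V(G(X_k))) &= lb(v_k;\ v_k \in V(G(Z_{k+1}))), \\
lb(v_k;\ v_k \in V(G(Y_k))) &= lb(v_{k+1};\ v_{k+1} \in V(G(Z_{k+1}))), \\
lb(v_k;\ v_k \in V(G(X_k))) &\le lb(v_k;\ v_k \in V(G(Y_k))).
\end{aligned} \tag{2}$$

Here, $i = 1, \ldots, k-1$ [...] $z_{k,k+1}$ and $z_{k+1,k}$ [...] $num(lb)$ corresponding to each edge label $lb$, or 0 corresponding to the case that no edge exists between $v_k$ and $v_{k+1}$ [...], $z_{k,k+1}$ and $z_{k+1,k}$ [...] $Z_{k+1}$s for all possible value pairs of $z_{k,k+1}$ and $z_{k+1,k}$.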
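The join of Eq. (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `join`, the list-of-lists matrices, and the `edge_values` parameter (0 for "no edge" plus every $num(lb)$) are hypothetical, vertex labels are left out, and the constraints of Eq. (2) and Eq. (3) are not checked. For undirected graphs the sketch assumes $z_{k,k+1} = z_{k+1,k}$.

```python
from itertools import product
from typing import List, Sequence

Matrix = List[List[int]]


def join(x: Matrix, y: Matrix, edge_values: Sequence[int],
         directed: bool = False) -> List[Matrix]:
    """Return every Z_{k+1} built from X_k and Y_k that share the block X_{k-1}."""
    k = len(x)
    # Joinable only if the upper-left (k-1) x (k-1) blocks are identical.
    if len(y) != k or any(x[i][:k - 1] != y[i][:k - 1] for i in range(k - 1)):
        return []
    if directed:
        pairs = product(edge_values, repeat=2)   # z_{k,k+1} and z_{k+1,k} vary freely
    else:
        pairs = ((v, v) for v in edge_values)    # symmetric matrix: both are equal
    results = []
    for z_a, z_b in pairs:                       # z_a = z_{k,k+1}, z_b = z_{k+1,k}
        z = [row[:] + [0] for row in x]          # X_k extended by one column
        for i in range(k - 1):
            z[i][k] = y[i][k - 1]                # column vector y_1
        z[k - 1][k] = z_a
        # Last row: y_2^T, then z_{k+1,k}, then y_kk.
        z.append(y[k - 1][:k - 1] + [z_b, y[k - 1][k - 1]])
        results.append(z)
    return results


# Hypothetical example: two identical 2-vertex graphs, one edge label (num = 1).
X2 = [[0, 1], [1, 0]]
Y2 = [[0, 1], [1, 0]]
candidates = join(X2, Y2, edge_values=(0, 1))
```

With one edge label the undirected join yields two candidates, one per choice of the undetermined element; a directed join over the same values would yield four.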
9 [...] $G(X_k)$ and $G(Y_k)$ are the same, exchanging $X_k$ and $Y_k$ ([...], taking $Y_k$ as the first matrix and $X_k$ as the second matrix), [...], the two adjacency matrices are joined only when Eq. (3) is satisfied [...] "normal form".

$$code(\text{the first matrix}) \le code(\text{the second matrix}) \tag{3}$$

In the standard basket analysis, the $(k+1)$-itemset becomes a candidate frequent itemset only when all the $k$-sub-itemsets are con[...], the graph $G$ of size $k+1$ is a candidate of frequent induced subgraphs only when all adjacency matrices generated by removing from the graph $G$ the $i$-th vertex $v_i$ ($1 \le i \le k+1$) and all its connected links are con[...] (smaller) $k$-levels, if the adjacency matrix of the graph generated by removing the $i$-th vertex $v_i$ is non-normal form, [...], an adjacency matrix of the size $1 \times 1$ is set for each vertex $v_i \in G(X_k)$.
10 Then, the pairs of the matrices for the vertices $v_i, v_j \in G(X_k)$ satisfying the constraints of Eq. (2) and (3) are joined by the operation of Eq. (1). At this time, the values of the elements for $(v_i, v_j)$ and $(v_j, v_i)$ in the original $X_k$ are substituted into the non-diagonal elements $z_{1,2}$ and $z_{2,1}$ respectively, to reconstruct the structure of $G(X_k)$. Subsequently, the pairs of the obtained $2 \times 2$ matrices are further joined according to the constraints of Eq. (1), (2) and (3). The values of the elements $z_{2,3}$ and $z_{3,2}$ [...]ects the structure of $G(X_k)$, and is constructed by following the constraints, [...] "normalization". In the intermediate levels, the normal forms of all induced subgraphs of $G(X_k)$ [...] $(T_k)$ [...] [6].
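The joins above depend on the matrix code of Definition 6 and on the ordering test of Eq. (3), which can be sketched as below. The names `code` and `may_join` are hypothetical, and representing the code as a tuple compared lexicographically is an assumption: it agrees with comparing the digit strings as numbers only when both matrices have the same size and every element is a single digit.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def code(x: Matrix, directed: bool = False) -> Tuple[int, ...]:
    """Read the matrix column by column, as in Definition 6.

    Undirected: x11 x12 x22 x13 x23 x33 ...
    Directed:   x11 x12 x21 x22 x13 x31 x23 x32 x33 ...
    """
    digits = []
    for j in range(len(x)):
        for i in range(j):
            digits.append(x[i][j])
            if directed:
                digits.append(x[j][i])  # diagonally symmetric element follows x_ij
        digits.append(x[j][j])
    return tuple(digits)


def may_join(first: Matrix, second: Matrix, directed: bool = False) -> bool:
    """Ordering condition of Eq. (3): code(first matrix) <= code(second matrix)."""
    return code(first, directed) <= code(second, directed)


# Hypothetical example: the 3-vertex chain used as an undirected graph.
chain_code = code([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
arc_code = code([[0, 1], [0, 0]], directed=True)
```

Enforcing the ordering means that an unordered pair of matrices is joined in one direction only, which avoids generating the same candidate twice.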