Transcription of Inferring user traits via unsupervised methods
Characterizing the Ethereum address space
Inferring user traits via unsupervised methods

James Payette (1), Samuel Schwager (2), Joseph Murphy (3)
(1) Department of Computer Science, [...] of MCS, [...] of Physics, [...]

Sections: [...] Acquisition, Data and Feature Set, Models and Analysis, Results and Discussion, Ongoing Investigations, References

[...] yet anonymous ledgers, or blockchains, [...] known only by their addresses, would have enormous security implications [1]. We examine the Ethereum blockchain with the objective of clustering addresses into distinct behavior groups.

Figure caption: Example transaction on the Ethereum blockchain [2].
The Ethereum address space

Successful, efficient data acquisition was a major milestone for our project. [...], we recursively scraped data from the publicly available blockchain, eventually aggregating a dataset of 250,000 unique addresses.

Queried the Etherscan API for an address's Ethereum balance and all of its transactions ([...]). We tried to select features that, when aggregated, [...]: total Ether, number of transactions, transactions per month, average Ether transaction, [...]

The main objective of our quantitative analysis was to use clustering evaluation metrics and Principal Component Analysis (PCA) to determine an informed estimate for the optimal number of clusters to examine as behavior groups.
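The per-address features named above (total Ether, number of transactions, transactions per month, average Ether transaction) can be illustrated with a minimal Python sketch. The record layout (`value_ether`, `timestamp`), the function name `address_features`, the 30-day month, and the reading of "total Ether" as the sum of transaction values are all assumptions made for illustration; they are not taken from the Etherscan API or from the authors' code.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a month is 30 days


def address_features(transactions):
    """Aggregate one address's transactions into a small feature dict.

    `transactions` is a hypothetical list of dicts with keys
    "value_ether" (float) and "timestamp" (Unix seconds).
    """
    n = len(transactions)
    if n == 0:
        return {"total_ether": 0.0, "n_transactions": 0,
                "tx_per_month": 0.0, "avg_ether_per_tx": 0.0}

    # Assumption: "total Ether" means the sum of transaction values.
    total = sum(tx["value_ether"] for tx in transactions)
    times = [tx["timestamp"] for tx in transactions]

    # Active span in months, floored at one month so that short-lived
    # addresses do not get an inflated transactions-per-month rate.
    months_active = max((max(times) - min(times)) / SECONDS_PER_MONTH, 1.0)

    return {
        "total_ether": total,
        "n_transactions": n,
        "tx_per_month": n / months_active,
        "avg_ether_per_tx": total / n,
    }


# Hypothetical transactions for a single address, spread over two months.
example_txs = [
    {"value_ether": 1.0, "timestamp": 0},
    {"value_ether": 2.0, "timestamp": SECONDS_PER_MONTH},
    {"value_ether": 3.0, "timestamp": 2 * SECONDS_PER_MONTH},
]
print(address_features(example_txs))
```

One such dict per address gives one row of the feature matrix. Because the features sit on very different scales, they would usually be standardized before PCA or K-means is applied.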
PCA finds that only 33% of the variance is explained by the first two components
K-means clustering used over other methods for its scalability, versatility
Use unsupervised metric Calinski-Harabasz score as measure of cluster definition
Elbow of Calinski-Harabasz plot gives insight on optimal number of clusters [3]
Further investigate optimal number of clusters via Silhouette scores

Determining the optimal number of K-means clusters is not always a well-defined problem [3], [4]. Employing various evaluation techniques, [...]

Acknowledgements
[...]

[1] Monaco, John V. "Identifying bitcoin users by transaction behavior." SPIE Defense + [...], 2015.
[2] Wood, Gavin. "Ethereum: A secure decentralised generalised transaction ledger." Ethereum Project Yellow Paper 151 (2014).
[3] Kodinariya, Trupti M., [...] "Review on determining number of Cluster in K-Means Clustering." [...] (2013): 90-95.
[4] Tibshirani, Robert, Guenther Walther, and Trevor Hastie. "[...]" Journal of the Royal Statistical Society: Series B (Statistical Methodology) [...] (2001): 411-423.
[5] Meiklejohn, Sarah, et al.
"A fistful of bitcoins: characterizing payments among men with no names." [...]

[...]: Silhouette scores range from -1 to 1 (negative values suggest misclassification). Scores closer to 1 indicate a confident cluster mapping ([...], far from neighbors). Left: Silhouette scores of clusters with size > 100, average score (dotted red line). [...]

[...], we will qualitatively analyze the clusters based on their locations in feature space to characterize their traits [5]. The qualitative analysis will be included in our final report (in preparation). Long-term applications of this work include exploring generative models to learn specific behavior group characteristics in order to impersonate [...]

[...] is the sum of squared distances of samples to their closest cluster center. Elbow similar to CH [...], elbows [...], considering the Silhouette analysis, [...], as there is likely a biased distribution of users.
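The evaluation quantities discussed here (the sum of squared distances to cluster centers, the Calinski-Harabasz score, and the Silhouette score) can be written out directly. The sketch below is pure Python; its helper names and four-point toy dataset are hypothetical and not from the poster. It assumes each sample's cluster center is the centroid of its assigned cluster, which holds for a converged K-means fit. scikit-learn exposes equivalent quantities as `KMeans.inertia_`, `calinski_harabasz_score`, and `silhouette_score`.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def inertia(points, labels):
    """Sum of squared distances of samples to their cluster centroid."""
    clusters = group_by_label(points, labels)
    return sum(dist(p, centroid(members)) ** 2
               for members in clusters.values() for p in members)


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each divided by
    its degrees of freedom. Needs 2 <= k < n and nonzero within-cluster
    dispersion."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(m) * dist(centroid(m), overall) ** 2
                  for m in clusters.values())
    within = inertia(points, labels)
    return (between / (k - 1)) / (within / (n - k))


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all samples, where a is the
    mean distance to the sample's own cluster and b the mean distance
    to the nearest other cluster. Singleton clusters score 0."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]  # labels that follow the visible grouping
bad = [0, 1, 0, 1]   # labels that cut across it

for name, labels in (("good", good), ("bad", bad)):
    print(name, inertia(points, labels),
          calinski_harabasz(points, labels),
          mean_silhouette(points, labels))
```

On this toy data the labeling that follows the visible grouping gives a low sum of squared distances, a high Calinski-Harabasz score, and a mean Silhouette score near 1, while the labeling that cuts across it gives a negative mean Silhouette score. Choosing the number of clusters, as described above, means running K-means for a range of k and looking for the elbow in these curves.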