Transcription of Detecting Outliers - Brendan Gregg's Homepage
Rg/blogs/Brendan/2013/07/01/Detecting-outliers/

Detecting Outliers

In computer performance, we're especially concerned about latency outliers: very slow database queries, application requests, disk I/O, etc. The term outlier is subjective: there is no rigid mathematical definition. From [Grubbs 69]:

    An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.

Outliers are commonly detected by comparing the maximum value in a data set to a custom threshold, such as 50 or 100 ms for disk I/O. This requires the metric to be well understood beforehand, as is usually the case for application latency and other key metrics. However, we are also often faced with a large number of unfamiliar metrics, where we don't know the thresholds in advance.

There are a number of proposed tests for outliers which don't rely on thresholds.
If such a test works, outliers can be detected from any performance metric.

I'll explain outliers using a visualization, and propose a simple test for their detection. I'll then use it on synthetic and then real world distributions. The results are surprising.

This is disk I/O latency from a production cloud server as a frequency trail, showing 10,000 I/O latency measurements from the block device interface level:

Outliers can be seen as distant points on the right.

Problem

Now consider the following 25 synthetic random distributions, which are shown as filled frequency trails. These have been colored different shades of yellow to help differentiate overlaps.
The purpose is to compare the distance from the bulk of the data to the outliers, which look like grains of sand.

Many of these appear to have outliers: values that deviate markedly from other members of the sample. Which ones?

Six Sigma Test

This identifies the presence of outliers based on their distance from the bulk of the data, and should be relatively easy to understand and implement. First, calculate the max sigma:

    maxσ = (max(x) - μ) / σ

This is how far the max is above the mean, μ, in units of standard deviation, σ (sigma).

The six sigma test is then:

    outliers = (maxσ >= 6)

If any measurement exceeds six standard deviations, we can say that the sample contains outliers.

Applied to the earlier disk I/O data set, the plot can be annotated with 6σ, along with the mean, standard deviation, and 99th percentile for comparison.

Visualizing Sigma

Here are the earlier distributions with their max sigma values on the right:

You can use this to understand how max sigma scales, and what 6σ will and won't identify.
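
The max sigma calculation and six sigma test described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the plots: the function names `max_sigma` and `has_outliers`, the choice of the population standard deviation, and the sample data are all assumptions made for the example.

```python
import statistics


def max_sigma(samples):
    """Return how far the maximum is above the mean, in standard deviations."""
    mean = statistics.fmean(samples)
    # Assumption: population standard deviation; the sample form
    # (statistics.stdev) would give slightly smaller max sigma values.
    stddev = statistics.pstdev(samples)
    if stddev == 0:
        return 0.0  # all values identical: no spread, so nothing stands out
    return (max(samples) - mean) / stddev


def has_outliers(samples, threshold=6.0):
    """Six sigma test: True if the maximum is at least `threshold` sigma out."""
    return max_sigma(samples) >= threshold


# Hypothetical latencies in ms: 99 fast I/O and one slow one.
latencies_ms = [1.0] * 99 + [100.0]
print(round(max_sigma(latencies_ms), 2))  # 9.95
print(has_outliers(latencies_ms))         # True
```

The `threshold` parameter is there so that a different sigma value can be tried, as discussed for the synthetic distributions.
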
There is also a version with 100 distributions, and non-colored white and black versions.

Here is another set, which has different distribution types and numbers of modes.

The six sigma test appears to work well for these synthetic distributions. If you wish to use a different sigma value, you can use these plots to help guide your choice.

Disk I/O Latency Outliers

Now for real data. The following are 35 disk I/O latency distributions, each with 50,000 I/O, sorted on max sigma, and with the x-axis scaled for each frequency trail:

One characteristic that may stand out is that many of these distributions aren't normal: they are combinations of bimodal and log-normal.
This is expected: the lower latency mode is for disk cache hits, and the higher latency mode is for disk cache misses, which also has queueing creating a tail. The presence of two modes and a tail increases the standard deviation, and thus lowers max sigma.

All of these distributions still have outliers according to the six sigma test. And this is just the top 35: see the full 200 disk I/O distributions (white, black), from 200 random production servers.

100% of these servers have latency outliers.

I've tackled many disk I/O latency outlier issues in the past, but haven't had a good sense for how common outliers really are. For my datacenter, disks, environment, and during a 50,000 I/O span, this visualization shows that latency outliers are very common.

MySQL Latency Outliers

Here are 35 MySQL command latency distributions, from about 5,000 measurements each:

This is from 100 random MySQL production servers, where 96% have 6σ outliers.

node.js Latency Outliers

Here are 35 node.js HTTP server response time distributions, from about 5,000 measurements each:

This is from 100 random node.js production servers, where 98% have 6σ outliers.

The Implications of Outliers

The presence of outliers in a dataset has some important implications:

1. There may be much greater values (outliers) than the average and standard deviation suggest. Use a way to examine them, such as a visualization or listing them beyond a threshold (eg, 6σ). At the very least, examine the maximum value.
2. You can't trust the average or standard deviation to reflect the bulk of the data, as they may be slightly influenced by outliers. For the bulk of the data, you can try using robust statistics such as the median and the median absolute deviation (MAD).

In a recent severe case, the mean application response time was over 3 ms. However, explaining this value alone was futile. Upon studying the distribution, I saw that most requests were around 1 ms, as was the median, but there were outliers taking up to 30 seconds!

While outliers can be a performance problem, they aren't necessarily so. Here are the same 200 disk I/O distributions, numbered and sorted based on their max latency in milliseconds (white, black). Only 80% of these have latency outliers based on a 50 ms threshold.
For some distributions, 1 ms exceeds 6σ, as the bulk of the I/O were much faster.

Next Steps

After identifying the presence of outliers, you can examine them visually using a histogram, frequency trail, scatter plot, or heat map. For all of these, a labeled axis can be included to show the value range, indicating the maximum value.

Their values can also be studied individually, by only listing those beyond 6σ in the sample. Extra information can then be collected, which would have been too much detail for the entire data set.

What Causes Outliers?

Outliers, depending on their type, may have many causes. To give you an idea for latency outliers:

- Network or host packet drops, and TCP timeout-based retransmits.
- DNS timeouts.
- Paging or swapping.
- Lock contention.
- Application software scalability issues.
- Errors and retries.
- CPU caps, and scheduler latency.
- Preemption by higher priority work (kernel/interrupts).
- Some guy shouting at your disks.

I'd love to analyze and show what the previously shown outliers were caused by, but I'll have to save that for later posts (this is long enough).

Implementing Sigma Tests

The latency measurements used here were traced using DTrace, and then post-processed.

There are a number of ways to implement this in real-time. By use of cumulative statistics, the mean and standard deviation can be known for the entire population since collection began.
The max can then be compared to these cumulative metrics when each event (I/O or request) completes, and then the max sigma can be calculated and maintained in a counter.

For example: the disk I/O statistics reported by iostat(1), which instrument the block device layer, are maintained in the kernel as a group of statistics which are the totals since boot. For the Linux kernel, these are the eleven /proc/diskstats as documented in Documentation/iostats.txt, and maintained in the kernel as struct disk_stats. A member to support calculating the standard deviation can be added, which has the cumulative square of the difference to the mean, as well as max sigma and maximum members.
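
As a rough user-space sketch of the cumulative approach described above, the following Python class keeps a running count, mean, cumulative squared difference to the mean, maximum, and max sigma, and updates them as each event completes. It uses Welford's online algorithm for the running mean and variance, which is my choice for the illustration and not necessarily how a kernel implementation would do it. The class name `SigmaStats` and its members are hypothetical, and it is not a patch to struct disk_stats.

```python
import math


class SigmaStats:
    """Cumulative statistics, updated once per completed event."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0              # cumulative squared difference to the mean
        self.maximum = -math.inf
        self.max_sigma = 0.0

    def stddev(self):
        # Population standard deviation since collection began (assumption).
        return math.sqrt(self.m2 / self.count) if self.count else 0.0

    def update(self, value):
        # Welford's online update of the mean and squared differences.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        if value > self.maximum:
            self.maximum = value
        # Recompute max sigma against the current cumulative mean and stddev.
        stddev = self.stddev()
        if stddev > 0:
            self.max_sigma = (self.maximum - self.mean) / stddev


# Hypothetical event stream: 99 fast I/O (1 ms) followed by one slow I/O.
stats = SigmaStats()
for latency_ms in [1.0] * 99 + [100.0]:
    stats.update(latency_ms)
print(round(stats.max_sigma, 2), stats.max_sigma >= 6)  # 9.95 True
```

Because max sigma is recomputed on every event, it reflects the maximum relative to the statistics so far. It can fall again as more data arrives and the mean and standard deviation shift.
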