Linear Regression via Maximization of the Likelihood


Elements of Machine Learning, Princeton University

In least squares regression, we introduced the idea of a function \ell(\hat{y}, y) that is bigger when our machine learning model produces an estimate \hat{y} that is far from the true value y. Specifically, we used a squared loss:

\ell(\hat{y}, y) = (\hat{y} - y)^2 .   (1)

In this note we take a different, probabilistic view of the same problem. We start with the simplest possible setting: we observe data \{y_n\}_{n=1}^N where y_n \in \mathbb{R}, and we assume that they are all independently and identically distributed according to a Gaussian distribution with unknown mean \mu and variance \sigma^2:

y_n \mid \mu, \sigma^2 \sim \mathcal{N}(y_n \mid \mu, \sigma^2) .   (2)

The probability density function associated with this conditional distribution is the familiar univariate Gaussian:

\Pr(y_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - \mu)^2 \right\} .   (3)

The data are independent, however, and so we write the conditional distribution for all of them as a product:

\Pr(\{y_n\}_{n=1}^N \mid \mu, \sigma^2) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - \mu)^2 \right\} .   (4)

This function, which we are here thinking of as being parameterized by \mu, is the likelihood of the data. In maximizing it, we are asking: what \mu would assign the highest probability to the data we've seen? This inductive criterion of selecting model parameters based on their ability to probabilistically explain the data is what we refer to as maximum likelihood estimation (MLE).
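As a concrete illustration of equations (3) and (4), the following minimal sketch (not part of the original note; the data values, the candidate means, and the known \sigma^2 = 1 are all made up for illustration) evaluates the per-datum Gaussian density and multiplies them to get the likelihood of a small data set under a few candidate values of \mu:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Univariate Gaussian density, as in equation (3)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Hypothetical observations, assumed i.i.d. Gaussian with known sigma^2 = 1.
y = np.array([9.2, 10.1, 9.8, 10.5, 9.6])
sigma2 = 1.0

# The likelihood of equation (4) is the product of the per-datum densities.
for mu in (8.0, 9.0, 10.0):
    likelihood = np.prod(gaussian_pdf(y, mu, sigma2))
    print(f"mu = {mu:4.1f}  ->  Pr(data | mu, sigma^2) = {likelihood:.3e}")
```

Running this prints the largest likelihood for the candidate closest to the sample mean, which is exactly the behavior MLE exploits.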

Maximum likelihood estimation has wonderful properties that are out of scope for this course. At the end of the day, however, we can think of it as giving us a different (negative) loss function:

\mu_{\text{MLE}} = \arg\max_{\mu} \Pr(\{y_n\}_{n=1}^N \mid \mu, \sigma^2) = \arg\max_{\mu} \prod_{n=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - \mu)^2 \right\} .   (5)

In practice, this isn't quite the problem we solve: we actually prefer to maximize the log likelihood, because it turns all of our products into sums, which are easier to manipulate and differentiate. Moreover, when we take the product of many things that may be less than 1, the floating point numbers on our computer may become very close to zero and the maximization may not be numerically stable. Taking the log, our (negative) loss function becomes

L(\mu) = \log \Pr(\{y_n\}_{n=1}^N \mid \mu, \sigma^2) = \sum_{n=1}^N \log\left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - \mu)^2 \right\} \right]   (6)
= -N \log \sigma - \frac{N}{2} \log 2\pi - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 .   (7)

Figure 1 shows the likelihood function L(\mu) for one such data set; we can find its maximizer by following the same kind of procedure that we used for least squares regression: differentiate, set to zero, and solve for \mu:

\frac{d}{d\mu} L(\mu) = \frac{1}{\sigma^2} \sum_{n=1}^N (y_n - \mu) = 0   (8)
\frac{1}{\sigma^2} \sum_{n=1}^N y_n - \frac{N\mu}{\sigma^2} = 0   (9)
\mu = \frac{1}{N} \sum_{n=1}^N y_n .   (10)

[Figure 1: The black dots are ten (N = 10) data from a Gaussian distribution with \sigma^2 = 1; the curve shows the likelihood as a function of \mu.]

That is, the sample mean is the maximum likelihood estimate in this model (regardless of \sigma^2). We can now move on to regression. Here we assume our data are tuples of the form \{x_n, y_n\}_{n=1}^N, where x_n \in \mathbb{R}^D and y_n \in \mathbb{R}. Rather than starting from a loss function, however, we're now going to say that our data arise from a process like

y = x^T w + \epsilon   (11)

where \epsilon is a noise term, and we can think about different noise models for \epsilon.
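The derivation of equation (10) can be checked numerically. The sketch below (not from the original note; the synthetic data, the seed, and the grid range are illustrative choices) compares the closed-form sample mean against a crude grid search over the log likelihood of equation (7):

```python
import numpy as np

def log_likelihood(mu, y, sigma2):
    """Log likelihood L(mu) of equation (7)."""
    n = len(y)
    return (-n * np.log(np.sqrt(sigma2))
            - 0.5 * n * np.log(2 * np.pi)
            - np.sum((y - mu) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10)   # N = 10 synthetic draws
sigma2 = 1.0

# Closed form of equation (10): the sample mean.
mu_mle = np.mean(y)

# Grid search over mu should land in (approximately) the same place.
grid = np.linspace(-5.0, 5.0, 10001)
mu_grid = grid[np.argmax([log_likelihood(m, y, sigma2) for m in grid])]

print(mu_mle, mu_grid)   # the two estimates agree to the grid resolution
```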

Generalizing the previous section, a very natural idea is to say that this noise comes from a zero-mean Gaussian distribution with variance \sigma^2, i.e.,

\epsilon \mid \sigma^2 \sim \mathcal{N}(\epsilon \mid 0, \sigma^2) .   (12)

Adding a constant to a Gaussian just has the effect of shifting its mean, so the resulting conditional probability distribution for our generative probabilistic process is

\Pr(y_n \mid x_n, w, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - x_n^T w)^2 \right\} .   (13)

This looks just like before, except now rather than conditioning on \mu, we condition on the inputs x_n and the weights w:

\Pr(\{y_n\}_{n=1}^N \mid \{x_n\}_{n=1}^N, w, \sigma^2) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y_n - x_n^T w)^2 \right\} .   (14)

Again, this is a function of w, and maximizing it is again maximum likelihood estimation, but with a new twist: we can write the whole collection of observations compactly using the multivariate Gaussian density

\mathcal{N}(z \mid \mu, \Sigma) = |\Sigma|^{-1/2} (2\pi)^{-D/2} \exp\left\{ -\frac{1}{2}(z - \mu)^T \Sigma^{-1} (z - \mu) \right\} .   (15)

The covariance matrix \Sigma must be square, symmetric, and positive definite. When \Sigma is diagonal, the dimensions of z are independent and the density factorizes into a product of univariate Gaussians. Stacking the targets into a vector y \in \mathbb{R}^N and the inputs into a matrix X \in \mathbb{R}^{N \times D}, the likelihood of equation (14) can therefore be written as:

\Pr(y \mid X, w, \sigma^2) = \mathcal{N}(y \mid Xw, \sigma^2 I) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2}(Xw - y)^T (Xw - y) \right\} .   (16)

We can now think about how to maximize this likelihood with respect to w. As before, it is helpful to take the natural log first:

\log \Pr(y \mid X, w, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Xw - y)^T (Xw - y) .   (17)

The additive term doesn't depend on w, so we can drop it from the maximization:

w_{\text{MLE}} = \arg\max_{w} \left\{ -\frac{1}{2\sigma^2}(Xw - y)^T (Xw - y) \right\} .   (18)

The \frac{1}{2\sigma^2} does not change the solution to this problem, and of course we can change the sign and make this maximization into a minimization:

w_{\text{MLE}} = \arg\min_{w} (Xw - y)^T (Xw - y) .   (19)
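A small numerical sketch of equation (19), with made-up synthetic data (the dimensions, true weights, noise level, and seed below are illustrative assumptions, not part of the original note): solving the minimization with an off-the-shelf least-squares routine gives the same answer as the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data from the generative process of equation (11):
# y = X w_true + eps, with eps ~ N(0, sigma^2 I).
N, D = 50, 3
w_true = np.array([1.5, -2.0, 0.5])
sigma2 = 0.25
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Equation (19): w_MLE minimizes (Xw - y)^T (Xw - y), i.e. ordinary least squares.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalent closed form via the normal equations, (X^T X)^{-1} X^T y.
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

print(w_mle)
print(w_normal_eq)   # identical up to numerical precision
```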

This is exactly the same optimization problem that we solved for least-squares linear regression! While it seems like the loss function view and the maximum likelihood view are different, this reveals that they are often the same under the hood: least squares can be interpreted as assuming Gaussian noise, and particular choices of likelihood can be interpreted directly as (usually exponentiated) loss functions.

One thing that is different about maximum likelihood, however, is that it gives us more than a point prediction. For example, after finding w_{\text{MLE}}, if we have a query input x_{\text{pred}} for which we don't know the y, we could compute a guess via \hat{y}_{\text{pred}} = x_{\text{pred}}^T w_{\text{MLE}}, or we could actually construct a whole distribution:

\Pr(y_{\text{pred}} \mid x_{\text{pred}}, w_{\text{MLE}}, \sigma^2) = \mathcal{N}(y_{\text{pred}} \mid x_{\text{pred}}^T w_{\text{MLE}}, \sigma^2) .   (20)

This sounds great, but without an estimate of the noise variance \sigma^2, it won't be any good. Fortunately, maximum likelihood estimation tells us how to do that one also, and we can start out by assuming that we have already found w_{\text{MLE}}:

\sigma^2_{\text{MLE}} = \arg\max_{\sigma^2} \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Xw_{\text{MLE}} - y)^T (Xw_{\text{MLE}} - y) \right\} .   (21)

Solving this maximization problem is again just a question of differentiating and setting to zero:

\frac{\partial}{\partial \sigma^2} \left[ -\frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2}(Xw_{\text{MLE}} - y)^T (Xw_{\text{MLE}} - y) \right] = 0   (22)
-\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}(Xw_{\text{MLE}} - y)^T (Xw_{\text{MLE}} - y) = 0   (23)
-N + \frac{1}{\sigma^2}(Xw_{\text{MLE}} - y)^T (Xw_{\text{MLE}} - y) = 0   (24)
\sigma^2_{\text{MLE}} = \frac{1}{N}(Xw_{\text{MLE}} - y)^T (Xw_{\text{MLE}} - y) .   (25)
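The following sketch puts equations (19), (20), and (25) together (again with made-up synthetic data; the dimensions, weights, noise scale, seed, and query point x_pred are illustrative assumptions): fit w_{\text{MLE}}, estimate the noise variance as the mean squared residual, and report the predictive mean and variance for a new input.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training data under the same generative assumptions as above.
N, D = 100, 2
w_true = np.array([0.7, -1.2])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=0.5, size=N)

# Weight estimate from equation (19).
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equation (25): the MLE of the noise variance is the mean squared residual.
resid = X @ w_mle - y
sigma2_mle = (resid @ resid) / N

# Equation (20): predictive distribution for a new input x_pred is
# N(y_pred | x_pred^T w_MLE, sigma^2_MLE).
x_pred = np.array([1.0, 2.0])        # made-up query point
mean_pred = x_pred @ w_mle
print(f"predictive mean {mean_pred:.3f}, predictive variance {sigma2_mle:.3f}")
```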

17 September 2018: Initial version.

