Transcription of Introduction to reinforcement learning
Introduction to reinforcement learning
Pantelis P. Analytis
March 12, 2018

1 Introduction
2 Classical and operant conditioning
3 Modeling human learning
4 Ideas for semester projects

What's reinforcement learning?

Classical conditioning
Conditioned stimulus (a sound), unconditioned stimulus (the taste of food), unconditioned response (unlearned behavior such as salivation).

Behaviorism in psychology
Psychology was under the grip of behaviorism from the 20s to the [...] on expressed behavior rather than on [...]

The Rescorla-Wagner model
$$\Delta V^{n+1}_X = \alpha_X \beta (\lambda - V_{tot})$$
$$V^{n+1}_X = V^{n}_X + \Delta V^{n+1}_X$$
$\Delta V_X$ is the change in the strength, on a single trial, of the association between the CS labelled X and the US
$\alpha$ is the salience of X (bounded by 0 and 1)
$\beta$ is the rate parameter for the US (bounded by 0 and 1), sometimes called its association value
$\lambda$ is the maximum conditioning possible for the US
$V_X$ is the current associative strength of X
$V_{tot}$ is the total associative strength of all stimuli present, that is, X plus any others

The Rescorla-Wagner model: predictions
The model captures acquisition and extinction of associations through a process of surprise.
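
A minimal sketch of the Rescorla-Wagner update, to make the role of surprise concrete. It assumes that every cue present on a trial is updated with the same prediction error, and that the target is 0 on trials without the US. The function name `rescorla_wagner`, the cue names and all parameter values are illustrative assumptions, not taken from the slides.

```python
def rescorla_wagner(trials, alpha, beta=0.5, lam=1.0):
    """Run the Rescorla-Wagner update over a sequence of trials.

    trials: list of (cues_present, us_present) pairs
    alpha:  dict mapping each cue name to its salience (0..1)
    beta:   rate parameter for the US (0..1)
    lam:    maximum conditioning the US supports; 0 is used when the US is absent
    Returns a dict with the final associative strength V of every cue.
    """
    V = {cue: 0.0 for cue in alpha}
    for cues, us_present in trials:
        target = lam if us_present else 0.0
        v_tot = sum(V[c] for c in cues)  # summed strength of the cues present
        for c in cues:
            # Every present cue moves in proportion to the shared surprise.
            V[c] += alpha[c] * beta * (target - v_tot)
    return V


alpha = {"A": 0.5, "B": 0.5}  # illustrative saliences (assumption)

# Blocking: A is trained alone first, then A and B together.
blocked = rescorla_wagner([(["A"], True)] * 50 + [(["A", "B"], True)] * 50, alpha)
# Control: A and B are trained together from the start.
control = rescorla_wagner([(["A", "B"], True)] * 50, alpha)
# Extinction: A is trained, then presented without the US.
extinct = rescorla_wagner([(["A"], True)] * 50 + [(["A"], False)] * 50, alpha)

print(blocked["B"], control["B"], extinct["A"])
```

In this sketch, cue B gains almost no strength when A already predicts the US (blocking), gains about half of the total strength in the control condition, and the strength of A returns towards 0 once the US is withheld (extinction).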
First model to incorporate several [...], the model captures interactions between [...] cue may block the association of another with the US.
Extinction might not occur if an inhibitor is [...] time the model converges to optimal least [...]
[...]: Blocking, overshadowing and weakening [...]

The first learning experiments
Thorndike studied the time that animals took to escape from his illustrious [...]

Thorndike's law of effect
Thorndike's law of effect: Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur.
Those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond (Thorndike, 1911).

Learned helplessness
The organisms learn that it is impossible to escape, and even when the hindrance is removed they do not attempt to [...]

The first learning experiments
Operant conditioning can be described as a process that attempts to modify behavior through the use of positive and negative reinforcement.
Through operant conditioning, an individual makes an association between a particular behavior and a consequence (Skinner, 1938).

Tolman's cognitive maps
3 groups of rats, running in a maze for 17 [...] group got a reward, the second got no reward, the third got a reward on the 11th [...]

Implicit learning
The group that was rewarded only on the 11th day improved rapidly and surpassed in terms of performance the group that was rewarded from the [...]

There are two strategies to solve RL problems.
Organisms can memorize rewards or construct a contingency map and plan ahead [...]

The Iowa gambling task (Bechara et al. 1997)
Participants are presented 4 decks on the computer and they are told that each deck will reward them or [...] trials in total, unbeknownst to the participants. The participants start with $2000 and are asked to maximize their [...]

The Iowa gambling task
Participants are presented 4 decks on the computer and they are told that each deck will reward them or [...]
Decks A and B bring higher immediate rewards, but have negative expected value, while C and D have lower immediate rewards but positive expected [...]

Modeling human learning: expectation
The delta rule is a popular model-free learning rule:
$$E_j(t) = E_j(t-1) + \delta_j(t)\,\eta\,[R_j(t) - E_j(t-1)],$$
where $\eta$ is the learning rate and $\delta_j(t)$ is an indicator variable, being 1 if alternative $j$ was chosen on trial $t$, and 0 otherwise.
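
A minimal sketch of the delta rule for a four-deck task, assuming a fixed learning rate. The function name `delta_rule_update`, the symbol `eta` for the learning rate, and the reward values are illustrative assumptions.

```python
def delta_rule_update(E, chosen, reward, eta=0.1):
    """Return the expectations after one trial of the delta rule.

    Only the chosen alternative moves, by a fraction eta of its prediction
    error; the expectations of unchosen alternatives stay as they were.
    """
    return [e + eta * (reward - e) if j == chosen else e
            for j, e in enumerate(E)]


E = [0.0, 0.0, 0.0, 0.0]   # one expectation per deck
for _ in range(3):          # deck index 2 pays 100 three times in a row
    E = delta_rule_update(E, chosen=2, reward=100.0, eta=0.1)
print(E)
```

Each update closes a fixed fraction of the gap between the expectation and the observed reward, so the expectation of the chosen deck approaches the reward geometrically.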
We opted for a simple fixed learning rate, [...]

Modeling human learning: expectation
The decay rule is another popular model-free learning rule, according to which expected values of the unchosen alternatives decay towards 0 (Erev and Roth, 1998):
$$E_j(t) = \eta E_j(t-1) + \delta_j(t) R_j(t),$$
with decay parameter $0 \le \eta$ [...], where $\delta_j(t)$ is 1 if alternative $j$ was chosen on trial $t$, and 0 otherwise.

Modeling human learning: choice rules
$\epsilon$-greedy rule
$$P(C(t) = j) = \begin{cases} (1 - \epsilon)/K_{max} & \text{if } E_j(t) \ge E_k(t),\ \forall k \ne j \\ \epsilon/(K - K_{max}) & \text{otherwise} \end{cases}$$
where $K$ is the number of arms and $K_{max}$ is the number of arms with the same maximum [...]
Softmax rule
$$P(C(t) = j) = \frac{\exp(\theta E_j(t))}{\sum_{k=1}^{K} \exp(\theta E_k(t))}$$

The Iowa gambling task: behavioral results
Participants are presented 4 decks on the computer and they are told that each deck will reward them or [...]
Decks A and B bring higher immediate rewards, but have negative expected value, while C and D have lower immediate rewards but positive expected [...]

The Iowa gambling task.
Simulating models
The models were fitted on human data using maximum likelihood [...]

Prediction competitions

Replicating well known findings

Studying widely used websites
Can you develop a model of likes and comments on Instagram or Twitter?
How does attention interact with liking in websites like Facebook?

Using big data from KDD competitions
KDD regularly organizes competitions.
Data from past events are available [...]

Dataset repositories
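
Returning to the choice rules from the modeling slides, a minimal sketch of how the epsilon-greedy and softmax rules turn expectations into choice probabilities. The function names, the parameter names `epsilon` and `theta`, and the example values are illustrative assumptions. The slide formula is undefined when all arms tie (K equals K_max); the sketch assumes a uniform choice in that case.

```python
import math


def epsilon_greedy_probs(E, epsilon):
    """Choice probabilities under the epsilon-greedy rule.

    Probability 1 - epsilon is shared among the arms with the maximum
    expectation; epsilon is shared among the remaining arms.
    """
    K = len(E)
    best = max(E)
    k_max = sum(1 for e in E if e == best)
    if k_max == K:
        # Assumption: with all arms tied the formula divides by zero,
        # so fall back to a uniform choice.
        return [1.0 / K] * K
    return [(1 - epsilon) / k_max if e == best else epsilon / (K - k_max)
            for e in E]


def softmax_probs(E, theta):
    """Choice probabilities under the softmax rule with sensitivity theta."""
    scaled = [theta * e for e in E]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


E = [1.0, 3.0, 3.0, 0.0]  # illustrative expectations for four decks
print(epsilon_greedy_probs(E, epsilon=0.2))
print(softmax_probs(E, theta=1.0))
```

With `theta` set to 0 the softmax rule chooses uniformly at random, and larger values concentrate the probability on the arms with the highest expectation.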