
Attention and Transformers Lecture 11




Transcription of Attention and Transformers Lecture 11

Slide 1 - Lecture 11: Attention and Transformers. Fei-Fei Li, Ranjay Krishna, Danfei Xu. May 06, 2021.

Slide 2 - Administrative: Midterm. The midterm was this Tuesday. We will be grading this week, and you should have grades by next week.

Slide 3 - Administrative: Assignment 3. A3 is due Friday, May 25th, 11:59pm. It covers lots of applications of ConvNets, and also contains an extra credit notebook, which is worth an additional 5% of the A3 grade. Extra credit will not be used when curving the class grades.

Slide 4 - Last Time: Recurrent Neural Networks.

Slide 5 - Last Time: Variable-length computation graph with shared weights. [Figure: an unrolled RNN with inputs x1, x2, ..., shared weights W, outputs y1, y2, y3, ..., yT, and per-step losses L1, L2, L3, ..., LT combined into a total loss L.]

Slide 6 - Let's jump to Lecture 10, slide 43.

Slide 7 - Today's Agenda:

- Attention with RNNs: in computer vision, in NLP
- General attention layer: self-attention, positional encoding, masked attention, multi-head attention
- Transformers

Slide 9 - Image captioning using spatial features. Extract spatial features from a pretrained CNN: the CNN produces a feature grid of shape H x W x D, i.e. feature vectors z_{0,0}, z_{0,1}, ..., z_{2,2} for a 3 x 3 grid. Input: image I. Output: sequence y = y1, y2, ..., yT. (Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.)

Slide 10 - Encoder: h0 = f_W(z), where z is the grid of spatial CNN features and f_W(.) is an MLP producing the initial decoder hidden state h0.

Slide 11 - Decoding begins: from h0 and the [START] token y0, the decoder produces hidden state h1 and the first output token y1 ("person").

Slide 12 - Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c), where the context vector c is often c = h0. Each step feeds the previous token and hidden state, together with the same context vector c, back into the decoder: after "person" comes "wearing".
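The fixed-context decoder just described can be sketched numerically. This is a minimal illustration, not the lecture's actual model: the layer sizes, the random weights, greedy argmax decoding, and the particular way g_V combines its three inputs are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H_DIM, VOCAB, T = 8, 16, 10, 4            # illustrative sizes

z = rng.standard_normal((3, 3, D))           # spatial CNN features (H x W x D)
W_enc = rng.standard_normal((3 * 3 * D, H_DIM))
h0 = np.tanh(z.reshape(-1) @ W_enc)          # encoder: h0 = f_W(z), f_W an MLP
c = h0                                       # fixed context c = h0 (the bottleneck)

W_y = rng.standard_normal((VOCAB, H_DIM))    # token embedding
W_h = rng.standard_normal((H_DIM, H_DIM))
W_c = rng.standard_normal((H_DIM, H_DIM))
W_out = rng.standard_normal((H_DIM, VOCAB))

h, y = h0, 0                                 # token id 0 plays the role of [START]
tokens = []
for t in range(T):
    # g_V: combine previous token, previous state, and the SAME c at every step
    h = np.tanh(W_y[y] + h @ W_h + c @ W_c)
    y = int(np.argmax(h @ W_out))            # greedy decoding (an assumption)
    tokens.append(y)

print(tokens)                                # a length-T sequence of token ids
```

Note how every step reads the same vector c: whatever the image contains must fit into that one vector, which is exactly the bottleneck the next slides address.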

Slide 13 - The rollout continues: "person", "wearing", "hat".

Slide 14 - Decoding stops when the model emits the [END] token, yielding the caption "person wearing hat [END]".

Slide 15 - Problem: the input is "bottlenecked" through c. The model needs to encode everything it wants to say within the single context vector c. This is a problem if we want to generate really long descriptions, e.g. hundreds of words long.

Slide 16 - Image captioning with RNNs and attention. Attention idea: compute a new context vector at every time step, and let each context vector attend to different image regions. (Compare the saccades humans make when viewing a scene.)

Slide 17 - Compute alignment scores (scalars) e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), where f_att(.) is an MLP. This yields an H x W grid of scores e_{1,0,0}, ..., e_{1,2,2}.

Slide 18 - Normalize the scores to get attention weights a_{t,i,j}, with 0 < a_{t,i,j} < 1 and the weights at each time step summing to 1.

Slide 19 - Compute the context vector as the attention-weighted sum of the features: c_t = sum_{i,j} a_{t,i,j} z_{i,j}.

Slide 20 - Each timestep of the decoder uses a different context vector that looks at different parts of the input image. Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t). With c_1, the decoder maps [START] to y1 ("person").
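One attention step (alignment scores, softmax normalization, weighted sum) can be sketched as follows. The exact form of f_att here, a one-hidden-layer MLP over the concatenation [h_{t-1}; z_{i,j}], and all sizes are assumptions; the slides only say that f_att is an MLP producing one scalar per grid position.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, D, H_DIM = 3, 3, 8, 16                  # illustrative sizes

z = rng.standard_normal((H, W, D))            # spatial features z_{i,j}
h_prev = rng.standard_normal(H_DIM)           # decoder state h_{t-1}

# f_att as a tiny MLP over [h_{t-1}; z_{i,j}] (an assumption)
W1 = rng.standard_normal((H_DIM + D, 32))
w2 = rng.standard_normal(32)
feats = np.concatenate(
    [np.broadcast_to(h_prev, (H, W, H_DIM)), z], axis=-1)
e = np.tanh(feats @ W1) @ w2                  # alignment scores, shape H x W

a = np.exp(e - e.max())                       # softmax over all H*W positions:
a /= a.sum()                                  # 0 < a_{t,i,j} < 1, weights sum to 1

c_t = (a[..., None] * z).sum(axis=(0, 1))     # c_t = sum_{i,j} a_{t,i,j} z_{i,j}

print(a.sum(), c_t.shape)
```

The softmax guarantees the weights form a distribution over grid positions, so c_t is a convex combination of the feature vectors, i.e. a "soft" selection of image regions.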

Slide 21 - A new context vector is computed at every time step: from h1, new alignment scores and attention weights are computed over the feature grid, producing c2 for the next step.

Slide 22 - With c2, the decoder produces y2 ("wearing").

Slide 23 - With c3, the decoder produces y3 ("hat").

Slide 24 - With c4, the decoder produces y4 ("[END]"), completing the caption. Each timestep of the decoder uses a different context vector that looks at different parts of the input image.
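The pieces above combine into a decoder loop that recomputes attention, and hence a fresh context vector c_t, at every time step. As before, this is a sketch under assumed weights, sizes, and greedy decoding, not the lecture's model.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, D, H_DIM, VOCAB, T = 3, 3, 8, 16, 10, 4  # illustrative sizes

z = rng.standard_normal((H, W, D))             # spatial CNN features
h = np.tanh(rng.standard_normal(H_DIM))        # stand-in for h0 = f_W(z)

W1 = rng.standard_normal((H_DIM + D, 32))      # assumed f_att MLP weights
w2 = rng.standard_normal(32)
W_y = rng.standard_normal((VOCAB, H_DIM))
W_h = rng.standard_normal((H_DIM, H_DIM))
W_c = rng.standard_normal((D, H_DIM))
W_out = rng.standard_normal((H_DIM, VOCAB))

def attend(h_prev):
    """e_{t,i,j} = f_att(h_{t-1}, z_{i,j}); softmax -> a; c_t = sum a * z."""
    feats = np.concatenate(
        [np.broadcast_to(h_prev, (H, W, H_DIM)), z], axis=-1)
    e = np.tanh(feats @ W1) @ w2
    a = np.exp(e - e.max())
    a /= a.sum()
    return (a[..., None] * z).sum(axis=(0, 1))

y, tokens = 0, []                              # token id 0 stands in for [START]
for t in range(T):
    c_t = attend(h)                            # NEW context vector each step
    h = np.tanh(W_y[y] + h @ W_h + c_t @ W_c)  # y_t = g_V(y_{t-1}, h_{t-1}, c_t)
    y = int(np.argmax(h @ W_out))
    tokens.append(y)

print(tokens)
```

Because c_t depends on h_{t-1}, the model can look at different image regions for each word it generates, removing the single-vector bottleneck of the fixed-context decoder.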

