arXiv:2108.00154v2 [cs.CV] 8 Oct 2021

visual inputs. The reasons are two-fold: (1) input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the em-

  Visual, Attention
