Self-supervised Learning - 國立臺灣大學

Self-supervised Learning
Hung-yi Lee

ELMo (Embeddings from Language Models)
BERT (Bidirectional Encoder Representations from Transformers)
ERNIE (Enhanced Representation through Knowledge Integration)
Big Bird: Transformers for Longer Sequences

BERT: 340M parameters

The models become larger and larger ..
ELMo (94M), BERT (340M), GPT-2 (1542M), Megatron (8B), T5 (11B), Turing NLG (17B), GPT-3 (175B)
GPT-3 is 10 times larger than Turing NLG.
Figure: size comparison of BERT (340M), GPT-3 (175B) and Transformer ( ).

Outline
BERT series
GPT series

Self-supervised Learning
Supervised: the model is trained to match a label.
Self-supervised: the model is trained without labels; part of the input itself serves as the target.

Masking Input
BERT is a Transformer Encoder.
Randomly masking some tokens: each selected token is replaced either by MASK (a special token) or by a Random token.
The output at the masked position passes through a Linear layer and a softmax, giving a distribution over all characters.
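The masking step above can be sketched in a few lines of Python. This is a minimal illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the 15% masking probability, and the 50/50 split between MASK and a random token are assumptions made for the example.

```python
import random

MASK = "[MASK]"  # special token; the exact string is an assumption


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original token}).

    Each selected position is replaced either by the special MASK token
    or by a random token from vocab (a 50/50 split is assumed here).
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # ground truth the model must recover
            if rng.random() < 0.5:
                corrupted[i] = MASK
            else:
                corrupted[i] = rng.choice(vocab)
    return corrupted, targets


sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(sentence))
corrupted, targets = mask_tokens(sentence, vocab, mask_prob=0.5, rng=random.Random(0))
print(corrupted)  # input given to the encoder
print(targets)    # positions and tokens used as training targets
```

Only the positions stored in `targets` contribute to the training loss; all other tokens pass through unchanged.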

The softmax output is compared with the ground truth; training minimizes the cross entropy.

Next Sentence Prediction
Input to BERT: [CLS] w1 w2 (Sentence 1) [SEP] w3 w4 w5 (Sentence 2)
The output at [CLS] passes through a Linear layer to predict Yes/No: does Sentence 2 follow Sentence 1?
This approach is not helpful: Robustly optimized BERT approach (RoBERTa).
SOP: Sentence order prediction. Used in ALBERT.

BERT is trained by self-supervised learning on masked token prediction and next sentence prediction.

Downstream Tasks
The pre-trained BERT becomes the Model for Task 1, the Model for Task 2, the Model for Task 3, ..
Downstream tasks are the tasks we care about. We have a little bit of labeled data for them.

General Language Understanding Evaluation (GLUE)
Corpus of Linguistic Acceptability (CoLA)
Stanford Sentiment Treebank (SST-2)
Microsoft Research Paraphrase Corpus (MRPC)
Quora Question Pairs (QQP)
Semantic Textual Similarity Benchmark (STS-B)
Multi-Genre Natural Language Inference (MNLI)
Question-answering NLI (QNLI)
Recognizing Textual Entailment (RTE)
Winograd NLI (WNLI)
GLUE also has a Chinese version ( ).

BERT and its Family
Figure: GLUE scores.

How to use BERT: Case 1
Input: sequence. Output: class.
BERT reads [CLS] w1 w2 w3 (a sentence); the output at [CLS] passes through a Linear layer to predict the class.
Example:

Sentiment analysis: "this is good" -> positive.
The Linear layer uses random initialization; BERT is initialized by pre-training. This is the model to be learned.
Initializing BERT by pre-training is better than random initialization.
Figure: Pre-train (fine-tune) vs. Random Initialization (scratch).

How to use BERT: Case 2
Input: sequence. Output: a sequence of the same length as the input.
BERT reads [CLS] w1 w2 w3 (a sentence); the output for each token passes through a Linear layer to predict a class.
Example: POS tagging. "I saw a saw" -> N V DET N

How to use BERT: Case 3
Input: two sequences. Output: a class.
Example: Natural Language Inference (NLI)
premise: A person on a horse jumps over a broken down airplane
hypothesis: A person is at a ...
BERT reads [CLS] w1 w2 (Sentence 1) [SEP] w3 w4 w5 (Sentence 2); the output at [CLS] passes through a Linear layer to predict the class.

How to use BERT: Case 4
Extraction-based Question Answering (QA)
Document: D = {d_1, d_2, ..., d_N}
Query: Q = {q_1, q_2, ..., q_M}
The QA model takes D and Q and outputs two integers (s, e).
Answer: A = {d_s, ..., d_e}
Examples: s = 17, e = 17; s = 77, e = 79.

BERT reads [CLS] q1 q2 (question) [SEP] d1 d2 d3 (document).
A randomly initialized vector takes the inner product with the output of each document token; the highest score gives the start position, here s = 2.
A second randomly initialized vector gives the end position in the same way. The answer is "d2 d3".
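The start/end selection described above can be sketched as follows. This is a toy illustration under stated assumptions: the vectors are made-up numbers standing in for BERT outputs, the names `extract_span`, `start_vec` and `end_vec` are hypothetical, and the softmax is included for illustration (it does not change which position scores highest).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def extract_span(doc_vectors, start_vec, end_vec):
    """Pick 1-based (s, e) by inner product + softmax over document positions."""
    start_probs = softmax([dot(start_vec, h) for h in doc_vectors])
    end_probs = softmax([dot(end_vec, h) for h in doc_vectors])
    positions = range(len(doc_vectors))
    s = max(positions, key=start_probs.__getitem__) + 1
    e = max(positions, key=end_probs.__getitem__) + 1
    return s, e  # note: this simple version does not enforce s <= e


# Made-up stand-ins for BERT's output vectors of d1, d2, d3.
doc_vectors = [[0.1, 0.2], [0.8, 0.3], [0.3, 0.9]]
start_vec = [1.0, 0.0]  # randomly initialized, then learned in fine-tuning
end_vec = [0.0, 1.0]    # randomly initialized, then learned in fine-tuning

s, e = extract_span(doc_vectors, start_vec, end_vec)
print(s, e)  # 2 3, so the answer is "d2 d3"
```

During fine-tuning, only the two vectors are new parameters; everything else comes from the pre-trained model.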

s = 2, e = 3. Both vectors are randomly initialized.

That is all about BERT!

Training BERT is challenging!
Training data has more than 3 billion words: 3000 times the Harry Potter series.
Figure: GLUE scores of our ALBERT-base, Google's ALBERT-base and Google's BERT-base.
8 days with TPU v3.

BERT Embryology ( )
When does BERT know POS tagging, syntactic parsing, semantics?
The answer is counterintuitive!

Pre-training a seq2seq model
The Encoder reads a corrupted input; the Decoder, connected through Cross Attention, has to reconstruct the input.

MASS / BART
Ways to corrupt the input A B [SEP] C D E:
MASS: mask some tokens (the mask symbols did not survive in this transcript)
Delete "D": A B [SEP] C E
Permutation: C D E [SEP] A B
Rotation: D E A B [SEP] C
Text Infilling: A B [SEP] E (the mask symbols did not survive in this transcript)

T5: Comparison
Transfer Text-to-Text Transformer (T5)
Colossal Clean Crawled Corpus (C4)

Why does BERT work?
The embedding produced by BERT represents the meaning of the token.
The tokens with similar meaning have similar embedding.
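One common way to quantify "similar embedding" is cosine similarity. The sketch below uses made-up 3-dimensional vectors, not real BERT outputs; the words and numbers are hypothetical and only illustrate the computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: two animals and one action.
embeddings = {
    "fish": [0.9, 0.1, 0.0],
    "rabbit": [0.8, 0.3, 0.1],
    "jump": [0.1, 0.9, 0.2],
}

sim_animals = cosine_similarity(embeddings["fish"], embeddings["rabbit"])
sim_mixed = cosine_similarity(embeddings["fish"], embeddings["jump"])
print(round(sim_animals, 3), round(sim_mixed, 3))
```

With these toy vectors, the two animal words score much higher with each other than with the action word.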

Context is considered.

Why does BERT work?
The same token is fed to BERT in different contexts, and the cosine similarity between the resulting embeddings is computed.

Why does BERT work?
John Rupert Firth: "You shall know a word by the company it keeps."
BERT reads w1 w2 w3 w4; its output for w2 depends on the surrounding tokens.
Word embedding vs. contextualized word embedding.

Why does BERT work?
Applying BERT to protein, DNA, music classification.

Example data: DNA sequence followed by its class (EI, IE or N), in the order extracted from the slide.
CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCA AGGGCCTTCGAGCCAGTCTG  EI
AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCA CCGCCCCTCCGTGCCCCCGC  IE
AACGTGGCCTCCTTGTGCCCTTCCCCACAGTGCCCTCTTC CAGGACAAACTTGGAGAAGT  IE
CCACTCAGCCAGGCCCTTCTTCTCCTCCAGGTCCCCCACG GCCCTTCAGGATGAAAGCTG  IE
CCTGATCTGGGTCTCCCCTCCCACCCTCAGGGAGCCAGGC TCGGCATTTCTGGCAGCAAG  IE
AGCCCTCAACCCTTCTGTCTCACCCTCCAGCCTAAAGCTC CTTGACAACTGGGACAGCGT  IE
CCACTCAGCCAGGCCCTTCTTCTCCTCCAGGTCCCCCACG GCCCTTCAGGATGAAAGCTG  N
CTGTGTTCACCACATCAAGCGCCGGGACATCGTGCTCAAG TGGGAGCTGGGGGAGGGCGC  N
GTGTTACCGAGGGCATTTCTAACAGTCTTCTTACTACGGC CTCCGCCGACCGCGCGCTCG  N
TCTGAGCTCTGCATTTGTCTATTCTCCAGCTGACCCTGGT TCTCTCTCTTAGCTACCTGC

Each base is mapped to an English word: A -> "we", T -> "you", C -> "he", G -> "she".
The mapped sequence is fed to BERT after [CLS].
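The base-to-word mapping above (A to "we", T to "you", C to "he", G to "she") is simple enough to write down directly. The function name `dna_to_tokens` is hypothetical; the sketch only shows how a DNA string could be turned into the token sequence fed to a model pre-trained on English.

```python
# Mapping from the slide: each DNA base becomes an English word.
BASE_TO_WORD = {"A": "we", "T": "you", "C": "he", "G": "she"}


def dna_to_tokens(sequence):
    """Turn a DNA string into [CLS] followed by one English word per base."""
    return ["[CLS]"] + [BASE_TO_WORD[base] for base in sequence]


print(dna_to_tokens("AGAC"))
# ['[CLS]', 'we', 'she', 'we', 'he']
```

The choice of words carries no meaning; the point of the experiment is that the pre-trained model still helps on such input.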

The output at [CLS] passes through a Linear layer to predict the class of the DNA sequence.
The Linear layer uses random initialization; BERT is initialized by pre-training (pre-train on English).
Example: A G A C -> we she we he

Why does BERT work?
Applying BERT to protein, DNA, music classification.

To Learn More ..
BERT (Part 1)
BERT (Part 2)

Multi-lingual BERT (Multi-BERT)
Training a BERT model by many different languages.

Zero-shot Reading Comprehension
Multi-BERT: training on the sentences of 104 languages.
Train on English QA training examples (Doc1 Query1 Ans1, Doc2 Query2 Ans2, ..., Doc5 Query5 Ans5).
Test on Chinese QA test (Doc1 Query1 ?, Doc2 Query2 ?, Doc3 Query3 ?).
English: SQuAD, Chinese: DRCD
F1 score of human performance: value not recoverable from this transcript.

Alignment?
Multi-BERT embeddings of words with the same meaning in different languages (swim, jump, rabbit, fish, highest mountain).
Mean Reciprocal Rank (MRR): Higher MRR, better alignment.
Figure: Google's Multi-BERT vs. our Multi-BERT (200k sentences for each language).
How about 1000k?
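Mean Reciprocal Rank can be computed as below. This is a minimal sketch with made-up rankings: for each query word, candidates are assumed to be sorted from most to least similar, and the score is the average of 1/rank of the correct translation. The function name and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct answer; a missing answer scores 0.

    ranked_candidates[i] lists the candidates for query i, best first.
    gold[i] is the correct candidate for query i.
    """
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)
    return total / len(gold)


# Made-up rankings for three query words.
ranked = [
    ["fish", "swim", "rabbit"],  # correct answer at rank 1
    ["jump", "rabbit", "swim"],  # correct answer at rank 2
    ["swim", "fish", "jump"],    # correct answer at rank 3
]
gold = ["fish", "rabbit", "jump"]

mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 1/3) / 3
print(round(mrr, 3))
```

An MRR of 1.0 would mean every word's nearest neighbour in the other language is its correct translation.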

The training is also challenging ..
Two days .. (the whole training took one week)

Mean Reciprocal Rank (MRR): Higher MRR, better alignment.
Figure: Google's Multi-BERT, our Multi-BERT with 200k sentences for each language, our Multi-BERT with 1000k sentences.
The amount of training data is critical for alignment.

Weird???
Multi-BERT reconstructs its input (e.g. "highest mountain") in the correct language.
If the embedding is language independent .. How to correctly reconstruct?
There must be language information.

Where is Language?
Take the average of Chinese embeddings and the average of English embeddings; the difference between the two averages is taken as the language information.
If this is true .. adding this difference to each embedding of an English sentence such as "there is a cat" gives unsupervised token-level translation.

Outline
BERT series
GPT series

Predict Next Token
The model reads a token sequence beginning with <BOS>.
Each output passes through a Linear Transform and a softmax; training minimizes the cross entropy with the next token, so w_{t+1} is predicted from the model output at w_t.
They can do generation.
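To make "predict next token" and "they can do generation" concrete, here is a deliberately tiny stand-in. It is not a Transformer: it swaps in a bigram count model and greedy decoding, and the `<EOS>` token and all function names are assumptions for the example. The loop structure (feed the tokens so far, pick the next token, repeat) is the part that carries over.

```python
from collections import Counter, defaultdict

BOS, EOS = "<BOS>", "<EOS>"  # EOS is an assumption; the slides only show <BOS>


def train_bigram(corpus):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = [BOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, max_len=10):
    """Greedy generation: always take the most frequent next token."""
    token, output = BOS, []
    for _ in range(max_len):
        if token not in counts:
            break
        token = counts[token].most_common(1)[0][0]
        if token == EOS:
            break
        output.append(token)
    return output


corpus = ["there is a cat", "there is a dog", "there is a cat"]
model = train_bigram(corpus)
print(" ".join(generate(model)))  # there is a cat
```

A GPT-style model replaces the count table with a network that conditions on all previous tokens, and usually samples from the softmax instead of always taking the top choice.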

How to use GPT?
The input contains a task description and a few examples.
"Few-shot" Learning, "One-shot" Learning, "Zero-shot" Learning
(no gradient descent) "In-context" Learning
Figure: average of 42 tasks.

To learn more ..

Self-supervised learning for text, speech and images (NLP, Speech, CV)
Data Centric, Prediction: Position (2015), Jigsaw (2017), Rotation (2018), Cutout (2015), RNNLM (1997), word2vec (2013), audio2vec (2019), BERT (2018), Mockingjay (2020), TERA (2020), APC (2019)
Contrastive: InfoNCE (2017), CPC (2019), MoCo (2019), SimCLR (2020), MoCo v2 (2020), BYOL (2020), SimSiam (2020)

Image
BYOL: Bootstrap your own latent: A new approach to Self-supervised Learning

Speech
Speech version of BERT.
Speech GLUE: SUPERB (Speech processing Universal PERformance Benchmark). Will be available soon.
Downstream: Benchmark with 10+ tasks. The models need to know how to process content, speaker, emotion, and even semantics.
Toolkit: A flexible and modularized framework for self-supervised speech models.

(a joke)
Predict Next Token: They can do generation.

"I forced a bot to watch over 1,000 hours of XXX!"
