arXiv:2006.11477v3 [cs.CL] 22 Oct 2020

Wav2vec : A Framework for Self-SupervisedLearning of speech RepresentationsAlexei BaevskiHenry ZhouAbdelrahman MohamedMichael AIAbstractWe show for the first time that learning powerful representations from speechaudio alone followed by fine-tuning on transcribed speech can outperform the bestsemi-supervised methods while being conceptually simpler. wav2vec masksthe speech input in the latent space and solves a contrastive task defined over aquantization of the latent representations which are jointly learned. Experimentsusing all labeled data of Librispeech achieve WER on the clean/othertest sets. When lowering the amount of labeled data to one hour, wav2vec the previous state of the art on the 100 hour subset while using 100times less labeled data. Using just ten minutes of labeled data and pre-trainingon 53k hours of unlabeled data still achieves WER. This demonstrates thefeasibility of speech recognition with limited amounts of labeled IntroductionNeural networks benefit from large quantities of labeled training data.

However, in many settingslabeled data is much harder to come by than unlabeled data: current speech recognition systemsrequire thousands of hours of transcribed speech to reach acceptable performance which is notavailable for the vast majority of the nearly 7,000 languages spoken worldwide [31]. Learning purelyfrom labeled examples does not resemble language acquisition in humans: infants learn language bylistening to adults around them - a process that requires learning good representations of machine learning, self-supervised learning has emerged as a paradigm to learn general datarepresentations from unlabeled examples and to fine-tune the model on labeled data. This has beenparticularly successful for natural language processing [43,45,9] and is an active research area forcomputer vision [20, 2, 36, 19, 6].In this paper, we present a framework for self-supervised learning of representations from raw audiodata. Our approach encodes speech audio via a multi-layer convolutional neural network and thenmasks spans of the resulting latent speech representations [26,56], similar to masked languagemodeling [9].

The latent representations are fed to a Transformer network to build contextualized rep-resentations and the model is trained via a contrastive task where the true latent is to be distinguishedfrom distractors [54, 49, 48, 28] ( 2).As part of training, we learn discrete speech units [53,32,7,18] via a gumbel softmax [24,5]to represent the latent representations in the contrastive task (Figure 1) which we find to be moreeffective than non-quantized targets. After pre-training on unlabeled speech , the model is fine-tuned1 Code and models are available Under [ ] 22 Oct 2020X<latexit sha1_base64="eFc/CcufjQXkIIgegQKZW dirbK0=">AAAB8nicbVBNS8 NAFHypX7V+VT16 WSyCp5 KIoMeCF49 VbC2koWy2m3bpJht2X4QS+jO8eFDEq7/Gm//GTZu Dtg4sDDPvsfMmTKUw6 LrfTmVtfWNzq7pd29nd2z+oHx51jco04x2mpNK9k BouRcI7 KFDyXqo5jUPJH8 PJTeE/PnFthEoecJryIKajRESCUbSS348pjhmVeW 82qDfcpjsHWSVeSRpQoj2of/WHimUxT5 BJaozvuSkGOdUomOSzWj8zPKVsQkfctzShMTdBPo 88I2dWGZJIafsSJHP190 ZOY2 OmcWgni4hm2 SvE/zw/w+g6yEWSZsgTtvgoyiRBRYr7yVBozlBOL aFMC5uVsDHVlKFtqWZL8 JZPXiXdi6bnNr27y0brvqyjCidwCufgwRW04 Bba0 AEGCp7hFd4cdF6cd+djMVpxyp1j+APn8weYEZF+</latexit> <latexit sha1_base64="eFc/CcufjQXkIIgegQKZW dirbK0=">AAAB8nicbVBNS8 NAFHypX7V+VT16 WSyCp5 KIoMeCF49 VbC2koWy2m3bpJht2X4QS+jO8eFDEq7/Gm//GTZu Dtg4sDDPvsfMmTKUw6 LrfTmVtfWNzq7pd29nd2z+oHx51jco04x2mpNK9k BouRcI7 KFDyXqo5jUPJH8 PJTeE/PnFthEoecJryIKajRESCUbSS348pjhmVeW 82qDfcpjsHWSVeSRpQoj2of/WHimUxT5 BJaozvuSkGOdUomOSzWj8zPKVsQkfctzShMTdBPo 88I2dWGZJIafsSJHP190 ZOY2 OmcWgni4hm2

SvE/zw/w+g6yEWSZsgTtvgoyiRBRYr7yVBozlBOL aFMC5uVsDHVlKFtqWZL8 JZPXiXdi6bnNr27y0brvqyjCidwCufgwRW04 Bba0 AEGCp7hFd4cdF6cd+djMVpxyp1j+APn8weYEZF+</latexit> <latexit sha1_base64="eFc/CcufjQXkIIgegQKZW dirbK0=">AAAB8nicbVBNS8 NAFHypX7V+VT16 WSyCp5 KIoMeCF49 VbC2koWy2m3bpJht2X4QS+jO8eFDEq7/Gm//GTZu Dtg4sDDPvsfMmTKUw6 LrfTmVtfWNzq7pd29nd2z+oHx51jco04x2mpNK9k BouRcI7 KFDyXqo5jUPJH8 PJTeE/PnFthEoecJryIKajRESCUbSS348pjhmVeW 82qDfcpjsHWSVeSRpQoj2of/WHimUxT5 BJaozvuSkGOdUomOSzWj8zPKVsQkfctzShMTdBPo 88I2dWGZJIafsSJHP190 ZOY2 OmcWgni4hm2 SvE/zw/w+g6yEWSZsgTtvgoyiRBRYr7yVBozlBOL aFMC5uVsDHVlKFtqWZL8 JZPXiXdi6bnNr27y0brvqyjCidwCufgwRW04 Bba0 AEGCp7hFd4cdF6cd+djMVpxyp1j+APn8weYEZF+</latexit> <latexit sha1_base64="eFc/CcufjQXkIIgegQKZW dirbK0=">AAAB8nicbVBNS8 NAFHypX7V+VT16 WSyCp5 KIoMeCF49 VbC2koWy2m3bpJht2X4QS+jO8eFDEq7/Gm//GTZu Dtg4sDDPvsfMmTKUw6 LrfTmVtfWNzq7pd29nd2z+oHx51jco04x2mpNK9k BouRcI7 KFDyXqo5jUPJH8 PJTeE/PnFthEoecJryIKajRESCUbSS348pjhmVeW 82qDfcpjsHWSVeSRpQoj2of/WHimUxT5 BJaozvuSkGOdUomOSzWj8zPKVsQkfctzShMTdBPo 88I2dWGZJIafsSJHP190 ZOY2 OmcWgni4hm2 SvE/zw/w+g6yEWSZsgTtvgoyiRBRYr7yVBozlBOL aFMC5uVsDHVlKFtqWZL8 JZPXiXdi6bnNr27y0brvqyjCidwCufgwRW04 Bba0 AEGCp7hFd4cdF6cd+djMVpxyp1j+APn8weYEZF+</latexit>Z<latexit sha1_base64="AbUB20 ZkMBJJrNGS9 CvvOZFQGF8=">AAAB8nicbVDLSgMxFL1TX7W+qi7dBIvgqsxIQd0V 3 LisYh84 HUomzbShmWRIMkIZ+hluXCji1q9x59+YaWehrQcC h3 PuJeeeMOFMG9f9dkpr6xubW+Xtys7u3v5B9fCoo2 WqCG0 TyaXqhVhTzgRtG2Y47 SWK4jjktBtObnK/+0 SVZlI8mGlCgxiPBIsYwcZKfj/GZkwwzx5ng2rNrb tzoFXiFaQGBVqD6ld/KEkaU2 EIx1r7npuYIMPKMMLprNJPNU0wmeAR9S0 VOKY6yOaRZ+jMKkMUSWWfMGiu/t7 IcKz1NA7tZB5RL3u5+J/npya6 CjImktRQQRYfRSlHRqL8fjRkihLDp5 ZgopjNisgYK0yMbaliS/CWT14lnYu616hf3zVqzf uijjKcwCmcgweX0 IRbaEEbCEh4hld4c4zz4rw7H4vRklPsHMMfOJ8/n vuRjA==</latexit>.

C<latexit sha1_base64="MdYkdScTEPCFZ+zkzFYHzx6vsfU=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4 KjNSUHeFblxWsQ+YDiWTZtrQTDIkd4Qy9 DPcuFDErV/jzr8x085 CWw8 EDufcS849 YSK4 Adf9dkobm1vbO+Xdyt7+weFR9fika1 SqKetQJZTuh8 QwwSXrAAfB+olmJA4F64 XTVu73npg2 XMlHmCUsiMlY8ohTAlbyBzGBCSUia82H1 ZpbdxfA68 QrSA0 VaA+rX4 ORomnMJFBBjPE9N4 EgIxo4 FWxeGaSGJYROyZj5lkoSMxNki8hzfGGVEY6 Utk8 CXqi/NzISGzOLQzuZRzSrXi7+5/kpRDdBxmWSApN 0+VGUCgwK5/fjEdeMgphZQqjmNiumE6 IJBdtSxZbgrZ68 TrpXda9Rv71v1 JoPRR1ldIbO0 SXy0 DVqojvURh1 EkULP6BW9 OeC8OO/Ox3K05BQ7p+gPnM8ffAiRdQ==</latexit>Q<latexit sha1_base64="edUm7D+Sm7O5bC2L0byZugOTXbo=">AAAB8nicbVDLSgMxFM3UV62vqks3wSK4 KjNSUHcFNy5bsQ+YDiWTZtrQTDIkd4Qy9 DPcuFDErV/jzr8x085 CWw8 EDufcS849 YSK4 Adf9dkobm1vbO+Xdyt7+weFR9fika1 SqKetQJZTuh8 QwwSXrAAfB+olmJA4F64 XTu9zvPTFtuJKPMEtYEJOx5 BGnBKzkD2 ICE0pE1p4 PqzW37i6A14lXkBoq0 BpWvwYjRdOYSaCCGON7bgJBRjRwKti8 MkgNSwidkjHzLZUkZibIFpHn+MIqIxwpbZ8 EvFB/b2 QkNmYWh3 Yyj2hWvVz8z/NTiG6 CjMskBSbp8qMoFRgUzu/HI64 ZBTGzhFDNbVZMJ0 QTCralii3 BWz15nXSv6l6jfttu1 JoPRR1ldIbO0 SXy0 DVqonvUQh1 EkULP6BW9 OeC8OO/Ox3K05BQ7p+gPnM8fkU6 Rgw==</latexit>MaskedCNNqqqqqL<latexit sha1_base64="Uo4sFAN+3pvZ3 Umv6/JoBHX7l0Q=">AAAB8nicbVDLSgMxFL1TX7W+qi7dBIvgqsxIQd0V 3 LhwUcU+YDqUTJppQzPJkGSEMvQz3 LhQxK1f486/MdPOQlsPBA7n3 EvOPWHCmTau++2U1tY3 NrfK25Wd3b39g+rhUUfLVBHaJpJL1 QuxppwJ2jbMcNpLFMVxyGk3nNzkfveJKs2keDTTh AYxHgkWMYKNlfx+jM2 YYJ7dzQbVmlt350 CrxCtIDQq0 BtWv/lCSNKbCEI619j03 MUGGlWGE01mln2qaYDLBI+pbKnBMdZDNI8/QmVWG KJLKPmHQXP29keFY62kc2sk8ol72cvE/z09 NdBVkTCSpoYIsPopSjoxE+f1oyBQlhk8twUQxmxW RMVaYGNtSxZbgLZ+8 SjoXda9Rv75v1 JoPRR1lOIFTOAcPLqEJt9 CCNhCQ8 Ayv8 OYY58V5dz4 WoyWn2 DmGP3A+fwCJtZF+</latexit>`Contrastive lossContext representationsraw waveformQuantizedrepresentationsLatent speechrepresentationsTransformerFigure 1.

Illustration of our framework which jointly learns contextualized speech representationsand an inventory of discretized speech labeled data with a Connectionist Temporal Classification (CTC) loss [14,4] to be used fordownstream speech recognition tasks ( 3)Previous work learned a quantization of the data followed by a contextualized representations with aself-attention model [5,4], whereas our approach solves both problems end-to-end. Masking partsof the input with Transformer networks for speech has been explored [4,26], but prior work relieseither on a two-step pipeline or their model is trained by reconstructing the filter bank input related work includes learning representations from auto-encoding the input data [52,11] ordirectly predicting future timesteps [8].Our results show that jointly learning discrete speech units with contextualized representationsachieves substantially better results than fixed units learned in a prior step [4]. We also demonstratethe feasibility of ultra-low resource speech recognition: when using only 10 minutes of labeled data,our approach achieves word error rate (WER) on the clean/other test sets of set a new state of the art on TIMIT phoneme recognition as well as the 100 hour clean subsetof Librispeech.

Moreover, when we lower the amount of labeled data to just one hour, we stilloutperform the previous state of the art self-training method of [42] while using 100 times lesslabeled data and the same amount of unlabeled data. When we use all 960 hours of labeled data fromLibrispeech, then our model achieves WER ( 4, 5).2 ModelOur model is composed of a multi-layer convolutional feature encoderf:X 7 Zwhich takes asinput raw audioXand outputs latent speech representationsz1,..,zTforTtime-steps. They arethen fed to a Transformerg:Z 7 Cto build representationsc1,..,cTcapturing information fromthe entire sequence [9,5,4]. The output of the feature encoder is discretized toqtwith a quantizationmoduleZ 7 Qto represent the targets (Figure 1) in the self-supervised objective ( ). Comparedto vq-wav2vec [5], our model builds context representations over continuous speech representationsand self-attention captures dependencies over the entire sequence of latent representations encoder consists of several blocks containing a temporal convolution fol-lowed by layer normalization [1] and a GELU activation function [21].

The raw waveform input tothe encoder is normalized to zero mean and unit variance. The total stride of the encoder determinesthe number of time-stepsTwhich are input to the Transformer ( ).Contextualized representations with output of the feature encoder is fed toa context network which follows the Transformer architecture [55,9,33]. Instead of fixed positionalembeddings which encode absolute positional information, we use a convolutional layer similarto [37,4,57] which acts as relative positional embedding. We add the output of the convolutionfollowed by a GELU to the inputs and then apply layer self-supervised training we discretize the output of the feature encoderzto a finite set of speech representations via product quantization [25]. This choice led to good2results in prior work which learned discrete units in a first step followed by learning contextualizedrepresentations [5]. Product quantization amounts to choosing quantized representations frommultiple codebooks and concatenating them.

GivenGcodebooks, or groups, withVentriese RV d/G, we choose one entry from each codebook and concatenate the resulting vectorse1,..,eGand apply a linear transformationRd7 Rfto obtainq Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way [16,24,35]. We use the straight-through estimator [26] and setupGhard Gumbel softmax operations [24].The feature encoder outputzis mapped tol RG Vlogits and the probabilities for choosing thev-th codebook entry for groupgarepg,v=exp(lg,v+nv)/ Vk=1exp(lg,k+nk)/ ,(1)where is a non-negative temperature,n= log( log(u))anduare uniform samples fromU(0,1).During the forward pass, codewordiis chosen byi=argmaxjpg,jand in the backward pass, thetrue gradient of the Gumbel softmax outputs is TrainingTo pre-train the model we mask a certain proportion of time steps in the latent feature encoder space( ), similar to masked language modeling in BERT [9]. The training objective requires identifyingthe correct quantized latent audio representation in a set of distractors for each masked time step( ) and the final model is fine-tuned on the labeled data ( ).

MaskingWe mask a proportion of the feature encoder outputs, or time steps before feeding them to the contextnetwork and replace them with a trained feature vector shared between all masked time steps; wedo not mask inputs to the quantization module. To mask the latent speech representations output bythe encoder, we randomly sample without replacement a certain proportionpof all time steps to bestarting indices and then mask the subsequentMconsecutive time steps from every sampled index;spans may ObjectiveDuring pre-training, we learn representations of speech audio by solving a contrastive taskLmwhichrequires to identify the true quantized latent speech representation for a masked time step within a setof distractors. This is augmented by a codebook diversity lossLdto encourage the model to use thecodebook entries equally + Ld(2)where is a tuned context network outputctcentered over masked time stept, the modelneeds to identify the true quantized latent speech representationqtin a set ofK+ 1quantizedcandidate representations q Qtwhich includesqtandKdistractors [23,54].

arXiv:2006.11477v3 [cs.CL] 22 Oct 2020

Tags:

Information

Transcription of arXiv:2006.11477v3 [cs.CL] 22 Oct 2020

Related search queries

arXiv:2006.11477v3 [cs.CL] 22 Oct 2020

Tags:

Information

Documents from same domain

Related documents

Related search queries