Jukebox: A Generative Model for Music - OpenAI



Prafulla Dhariwal*¹, Heewoo Jun*¹, Christine Payne*¹, Jong Wook Kim¹, Alec Radford¹, Ilya Sutskever¹

Abstract

We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.

Introduction

Music is an integral part of human culture, existing from the earliest periods of human civilization and evolving into a wide diversity of forms.

It evokes a unique human spirit in its creation, and the question of whether computers can ever capture this creative process has fascinated computer scientists for decades. We have had algorithms generating piano sheet music (Hiller Jr & Isaacson, 1957; Moorer, 1972; Hadjeres et al., 2017; Huang et al., 2017), digital vocoders generating a singer's voice (Bonada & Serra, 2007; Saino et al., 2006; Blaauw & Bonada, 2017) and also synthesizers producing timbres for various musical instruments (Engel et al., 2017; 2019). Each captures a specific aspect of music generation: melody, composition, timbre, and the human voice singing. However, a single system to do it all remains elusive.

The field of generative models has made tremendous progress in the last few years.

One of the aims of generative modeling is to capture the salient aspects of the data and to generate new instances indistinguishable from the true data. The hypothesis is that by learning to produce the data we can learn the best features of the data.¹ We are surrounded by highly complex distributions in the visual, audio, and text domain, and in recent years we have developed advances in text generation (Radford et al.), speech generation (Xie et al., 2017) and image generation (Brock et al., 2019; Razavi et al., 2019). The rate of progress in this field has been rapid, where only a few years ago we had algorithms producing blurry faces (Kingma & Welling, 2014; Goodfellow et al., 2014) but now we can generate high-resolution faces indistinguishable from real ones (Zhang et al., 2019b).

*Equal contribution. ¹OpenAI, San Francisco.

Generative models have been applied to the music generation task too. Earlier models generated music symbolically in the form of a pianoroll, which specifies the timing, pitch, velocity, and instrument of each note to be played (Yang et al., 2017; Dong et al., 2018; Huang et al., 2019a; Payne, 2019; Roberts et al., 2018; Wu et al., 2019). The symbolic approach makes the modeling problem easier by working on the problem in the lower-dimensional space. However, it constrains the music that can be generated to being a specific sequence of notes and a fixed set of instruments to render with. In parallel, researchers have been pursuing the non-symbolic approach, where they try to produce music directly as a piece of audio.

This makes the problem more challenging, as the space of raw audio is extremely high dimensional with a high amount of information content to model. There has been some success, with models producing piano pieces either in the raw audio domain (Oord et al., 2016; Mehri et al., 2017; Yamamoto et al., 2020) or in the spectrogram domain (Vasquez & Lewis, 2019). The key bottleneck is that modeling the raw audio directly introduces extremely long-range dependencies, making it computationally challenging to learn the high-level semantics of music. A way to reduce the difficulty is to learn a lower-dimensional encoding of the audio with the goal of losing the less important information but retaining most of the musical information. This approach has demonstrated some success in generating short instrumental pieces restricted to a set of a few instruments (Oord et al., 2017; Dieleman et al., 2018).

In this work, we show that we can use state-of-the-art deep generative models to produce a single system capable of generating diverse high-fidelity music in the raw audio domain, with long-range coherence spanning multiple minutes. Our approach uses a hierarchical VQ-VAE architecture (Razavi et al., 2019) to compress audio into a discrete space, with a loss function designed to retain the maximum amount of musical information, while doing so at increasing levels of compression. We use an autoregressive Sparse Transformer (Child et al., 2019; Vaswani et al., 2017) trained with maximum-likelihood estimation over this compressed space, and also train autoregressive upsamplers to recreate the lost information at each level of compression.

¹ Richard Feynman famously said, "What I cannot create, I do not understand."

We show that our models can produce songs from highly diverse genres of music like rock, hip-hop, and jazz.

They can capture melody, rhythm, long-range composition, and timbres for a wide variety of instruments, as well as the styles and voices of singers to be produced with the music. We can also generate novel completions of existing songs. Our approach allows the option to influence the generation process: by swapping the top prior with a conditional prior, we can condition on lyrics to tell the singer what to sing, or on MIDI to control the composition. We release our model weights and training and sampling code.

Background

We consider music in the raw audio domain represented as a continuous waveform $\mathbf{x} \in [-1, 1]^T$, where the number of samples $T$ is the product of the audio duration and the sampling rate, typically ranging from 16 kHz to 48 kHz.
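The definition of $T$ as duration times sampling rate can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `num_samples` is hypothetical, and the 44.1 kHz entry is the standard CD rate, added here alongside the 16 kHz and 48 kHz endpoints given in the text.

```python
def num_samples(duration_seconds: float, sample_rate_hz: int) -> int:
    """Return T, the number of waveform samples: duration times sampling rate."""
    return round(duration_seconds * sample_rate_hz)


four_minutes = 4 * 60  # duration in seconds

# 16 kHz and 48 kHz are the ends of the range quoted in the text;
# 44.1 kHz is the standard CD rate, included as an illustrative assumption.
lengths = {
    rate: num_samples(four_minutes, rate)
    for rate in (16_000, 44_100, 48_000)
}

for rate, T in lengths.items():
    print(f"{rate / 1000:g} kHz -> T = {T:,} samples")
```

At the CD rate this gives about 10.6 million samples for four minutes of audio, which is the scale of sequence length the model has to handle.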

For music, CD quality audio, 44.1 kHz samples stored in 16 bit precision, is typically enough to capture the range of frequencies perceptible to humans. As an example, a four-minute-long audio segment will have an input length of ~10 million, where each position can have 16 bits of information. In comparison, a high-resolution RGB image with 1024 × 1024 pixels has an input length of ~3 million, and each position has 24 bits of information. This makes learning a generative model for music extremely computationally demanding with increasingly longer durations; we have to capture a wide range of musical structures from timbre to global coherence while simultaneously modeling a large amount of diversity.

VQ-VAE

To make this task feasible, we use the VQ-VAE (Oord et al., 2017; Dieleman et al., 2018; Razavi et al., 2019) to compress raw audio to a lower-dimensional space. A one-dimensional VQ-VAE learns to encode an input sequence $\mathbf{x} = \langle x_t \rangle_{t=1}^{T}$ using a sequence of discrete tokens $\mathbf{z} = \langle z_s \in [K] \rangle_{s=1}^{S}$, where $K$ denotes the vocabulary size and we call the ratio $T/S$ the hop length. It consists of an encoder $E(\mathbf{x})$ which encodes $\mathbf{x}$ into a sequence of latent vectors $\mathbf{h} = \langle h_s \rangle_{s=1}^{S}$, a bottleneck that quantizes $h_s \mapsto e_{z_s}$ by mapping each $h_s$ to its nearest vector $e_{z_s}$ from a codebook $C = \{e_k\}_{k=1}^{K}$, and a decoder $D(\mathbf{e})$ that decodes the embedding vectors back to the input space. It is thus an auto-encoder with a discretization bottleneck. The VQ-VAE is trained using the following objective:

$$\mathcal{L} = \mathcal{L}_{\text{recons}} + \mathcal{L}_{\text{codebook}} + \beta \mathcal{L}_{\text{commit}} \tag{1}$$

$$\mathcal{L}_{\text{recons}} = \frac{1}{T} \sum_{t} \lVert x_t - D(e_{z_t}) \rVert_2^2 \tag{2}$$

$$\mathcal{L}_{\text{codebook}} = \frac{1}{S} \sum_{s} \lVert \mathrm{sg}[h_s] - e_{z_s} \rVert_2^2 \tag{3}$$

$$\mathcal{L}_{\text{commit}} = \frac{1}{S} \sum_{s} \lVert h_s - \mathrm{sg}[e_{z_s}] \rVert_2^2 \tag{4}$$

where $\mathrm{sg}$ denotes the stop-gradient operation, which passes zero gradient during backpropagation.
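The bottleneck and the objective in Eqs. (1) to (4) can be illustrated on a toy example. The following is a minimal sketch in plain Python, not the paper's implementation: `toy_encode` and `toy_decode` are hypothetical stand-ins for the learned encoder and decoder (they only cut the waveform into frames and glue code vectors back together), and the codebook values and the weight `beta = 0.25` are arbitrary illustrative choices. Because no gradients are computed here, the stop-gradient has no effect and Eqs. (3) and (4) evaluate to the same number.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def toy_encode(x: Sequence[float], hop: int) -> List[List[float]]:
    """Stand-in for E(x): cut the waveform into S = T / hop frames."""
    return [list(x[i:i + hop]) for i in range(0, len(x), hop)]


def quantize(h: Sequence[Vector],
             codebook: Sequence[Vector]) -> Tuple[List[int], List[List[float]]]:
    """Bottleneck: map each h_s to the index z_s of its nearest codebook vector."""
    z = [min(range(len(codebook)), key=lambda k: sq_dist(h_s, codebook[k]))
         for h_s in h]
    return z, [list(codebook[k]) for k in z]


def toy_decode(e_z: Sequence[Vector]) -> List[float]:
    """Stand-in for D(e): concatenate the code vectors back into a waveform."""
    return [v for e in e_z for v in e]


def vq_vae_losses(x, x_hat, h, e_z, beta):
    """Forward values of Eqs. (1)-(4); returns (total, recons, codebook, commit)."""
    T, S = len(x), len(h)
    l_recons = sum((xt - xh) ** 2 for xt, xh in zip(x, x_hat)) / T
    # Eqs. (3) and (4) differ only in where sg[.] blocks the gradient,
    # so their forward values coincide.
    l_codebook = sum(sq_dist(h_s, e) for h_s, e in zip(h, e_z)) / S
    l_commit = l_codebook
    total = l_recons + l_codebook + beta * l_commit
    return total, l_recons, l_codebook, l_commit


# Toy data: T = 6 samples, hop length T/S = 2, so S = 3 tokens; K = 3 codes.
x = [0.0, 0.5, -0.5, -0.25, 1.0, 0.5]
codebook = [[0.0, 0.25], [-0.5, -0.5], [1.0, 0.75]]

h = toy_encode(x, hop=2)
z, e_z = quantize(h, codebook)
x_hat = toy_decode(e_z)
total, l_recons, l_codebook, l_commit = vq_vae_losses(x, x_hat, h, e_z, beta=0.25)

print("tokens z:", z)
print("reconstruction:", x_hat)
print("losses:", l_recons, l_codebook, l_commit, "total:", total)
```

In a real training loop the two terms would be kept separate, since the codebook loss moves the codebook vectors toward the encodings while the commitment loss moves the encoder outputs toward the codebook.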

The reconstruction loss $\mathcal{L}_{\text{recons}}$ penalizes for the distance between the input $\mathbf{x}$ and the reconstructed output $\hat{\mathbf{x}} = D(e_{\mathbf{z}})$, and $\mathcal{L}_{\text{codebook}}$ penalizes the codebook for the distance between the encodings $\mathbf{h}$ and their nearest neighbors $e_{\mathbf{z}}$ from the codebook. To stabilize the encoder, we also add $\mathcal{L}_{\text{commit}}$ to prevent the encodings from fluctuating too much, where the weight $\beta$ controls the amount of contribution of this loss. To speed up training, the codebook loss $\mathcal{L}_{\text{codebook}}$ instead uses EMA updates over the codebook variables. Razavi et al. (2019) extends this to a hierarchical model where they train a single encoder and decoder but break up the latent sequence $\mathbf{h}$ into a multi-level representation $[\mathbf{h}^{(1)}, \cdots, \mathbf{h}^{(L)}]$ with decreasing sequence lengths, each learning its own codebook $C^{(l)}$.
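The text mentions EMA updates for the codebook without spelling them out. One commonly used form keeps, for each code, an exponential moving average of how many encodings were assigned to it and of their sum, and sets the code to the ratio of the two. The sketch below assumes that form; the function name `ema_codebook_update`, the running statistics `counts` and `sums`, and the decay value are illustrative assumptions rather than details taken from the paper.

```python
from typing import List, Sequence, Tuple


def ema_codebook_update(
    codebook: Sequence[Sequence[float]],
    counts: Sequence[float],
    sums: Sequence[Sequence[float]],
    h: Sequence[Sequence[float]],
    z: Sequence[int],
    decay: float = 0.99,
) -> Tuple[List[List[float]], List[float], List[List[float]]]:
    """One EMA step: returns the updated (codebook, counts, sums)."""
    K, dim = len(codebook), len(codebook[0])

    # Per-code statistics of the current batch of encodings.
    batch_counts = [0] * K
    batch_sums = [[0.0] * dim for _ in range(K)]
    for h_s, k in zip(h, z):
        batch_counts[k] += 1
        for d in range(dim):
            batch_sums[k][d] += h_s[d]

    # Exponential moving averages of assignment counts and encoding sums.
    new_counts = [decay * c + (1 - decay) * n
                  for c, n in zip(counts, batch_counts)]
    new_sums = [[decay * s + (1 - decay) * b for s, b in zip(s_row, b_row)]
                for s_row, b_row in zip(sums, batch_sums)]

    # Each code becomes the running mean of the encodings assigned to it;
    # a code with zero running count keeps its old value.
    new_codebook = [
        [s / c for s in s_row] if c > 0 else list(old)
        for s_row, c, old in zip(new_sums, new_counts, codebook)
    ]
    return new_codebook, new_counts, new_sums


# Two scalar codes; both encodings in the batch are assigned to code 0.
codebook = [[0.0], [1.0]]
counts = [1.0, 1.0]
sums = [[0.0], [1.0]]
h = [[0.5], [0.5]]
z = [0, 0]

new_codebook, new_counts, new_sums = ema_codebook_update(
    codebook, counts, sums, h, z, decay=0.5
)
print(new_codebook, new_counts, new_sums)
```

In this toy step code 0 drifts from 0.0 toward the encodings at 0.5, while the unused code 1 stays where it was. No gradient flows through the codebook in this scheme, which is why it can replace the codebook loss term.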

