Abstract
Using deep learning models to generate different art forms has become a popular area of research lately, fuelled by the introduction of generative adversarial networks. In most cases neural network performance is negatively affected by a lack of sufficient training data, which is especially true for music due to inherent complexities such as representation, noise, and temporal relations. This study answers two main questions: first, can raw audio frequencies efficiently and completely represent music of high-resolution genres? Second, if so, are generative adversarial networks with recurrent units powerful enough to learn such representations and generate audio that is both realistic and enjoyable? This study proposes the use of raw audio frequencies from mp3 files as opposed to scarcer representations of music such as MIDI and ABC. To combat the lack of training data, no-copyright electronic dance music from YouTube channels such as NCS will be used. An accompanying long short-term memory based generative adversarial network architecture for music generation is proposed.
Keywords – GAN, Music, LSTM

Introduction
The use of computing for music generation and other art forms has been around for a few decades now, but the notion has returned to popularity with the rise of deep learning and its perceived capability to exhibit creativity. The development of convolutional neural networks (CNNs) has led to some of the best results in generating visual art. Recurrent neural network (RNN) variants, on the other hand, have produced current state-of-the-art results in learning sequential representations such as natural language, audio and video.
In learning the generative distribution for a genre of music, it is necessary to find a representation of the data that is easy to learn and at the same time preserves all the important features for learning. Many such representations exist, such as MIDI, ABC, piano roll, raw audio and raw audio frequencies, each with different advantages and disadvantages. Until recently, the MIDI and ABC formats have been the obvious choice for researchers due to their simplicity, although they suffer from shortcomings such as the lack of sufficient audio training data sets in these formats, the added workload of converting raw audio to these representations, and the fact that not all songs can be represented in MIDI or ABC form.
This has limited music generation to a few simple instrument-based genres such as jazz. Google DeepMind showed in their WaveNet paper [1] that it is possible to generate symbolic music from raw audio using a CNN; this idea inspired the long short-term memory (LSTM) generative adversarial network (GAN) proposed in this research for music generation from raw audio frequencies. This has the added advantage of being able to train networks on many different genres of music, not limited to MIDI-representable audio, hence opening up a source of training data of multiple genres previously unavailable for this task.
Unlike WaveNet, which uses a CNN architecture to produce audio one time step after another, conditioned on the output of the previous step, this research proposes an LSTM-based architecture in both the generator and discriminator of the generative adversarial network. LSTMs are better suited to sequential data than CNNs, and do away with the need to manually condition each generated note on the previous note, as this property is inherent in the nature of an LSTM unit. Kalingeri and Grandhe [2] used raw audio frequencies on a stacked LSTM and CNN network for music generation, but did not use a generative adversarial network. LSTM cells are an improved form of RNN cells, with the ability to learn relations in data that are multiple time steps apart in a sequence. GANs have proven to be very good at generating realistic data from a set of latent variables [6, 14].
These networks are trained as a pair (the generator and the discriminator) in a zero-sum game until a Nash equilibrium is reached, where the generator generates music so realistic that the discriminator cannot tell it apart from samples drawn from the training distribution [5]. For the purpose of this work, these networks have the advantage of doing away with the expensive use of regularization terms, and they tend to yield more creative results at data generation due to the use of latent variables in the input space. Many architectures for music generation exist today, and the majority implement some variation of a CNN, RNN and/or GAN; however, most of these are trained on open-source MIDI files.

Problem Statement
The main goal of this work is to demonstrate that raw audio frequencies can be used to train networks such as a generative adversarial network with LSTM cells, and to generate realistic and enjoyable music.
The simplicity of using MIDI files for training does not compensate for the lack of varied training data. Using raw audio will hopefully widen the genre spectrum of the generated music, and will make the technique more accessible to the general public. This work makes use of current best practices in training generative adversarial networks and recurrent neural networks, but will focus on the proposed model rather than perform a comparison across all existing models. Training will not be performed on MIDI files; instead, the results of this work will be compared to current baseline and state-of-the-art results on MIDI datasets using generated-music quality metrics such as polyphony and tone span.
The fundamental building blocks of music will be presented to give context to the case for using raw audio frequencies. The background of the architectures used in this study will also be explored, starting from a single perceptron [3], through fully connected networks, activation functions, convolutions, recurrent and LSTM cells, to the concept of adversarial training.

Literature Survey
A number of venerable inventions in the deep learning domain are key ingredients in the majority of the work done in neural music generation. LSTM [4] is well suited to and successful in learning sequential data such as audio, and has the capability to recall notes generated a number of time steps back by solving the vanishing gradient problem that other RNNs suffered from. GANs [5] are especially useful for generating realistic data with little danger of overfitting, and have been found to produce rather creative art [6]: GoGAN [7] for image completion, and SeqGAN [8] for both text and music generation (using a MIDI representation). The majority of music-generating networks to date are trained on either jazz or classical music, which are much easier to learn as they contain notes from just a few instruments. In the music generation space alone, WaveNet [1], introduced by the Google DeepMind team, is a completely probabilistic network that uses the same architecture as PixelCNN [9] to generate raw audio files from a dataset of mp3 files of many tagged genres.
Their network was developed mainly for the task of text-to-speech, and uses multiple layers of dilated causal convolutions instead of a more suitable sequence model such as an RNN; they do this to avoid the long training time required for RNNs. However, they do not train the generator in an adversarial manner, and do not report quantitative results of their work on music generation. MidiNet [10] expands on this work using MIDI files instead, and trains CNNs in an adversarial setting. However, that work does not fully depict the power of adversarial generation of music, as the MIDI training data lacked depth and breadth in the number of audio samples per second and in genre.
Both these networks generate notes sequentially by conditioning them on the distribution of previously generated notes. In [11], Olof introduces continuous recurrent GANs (C-RNN-GAN) for the same task: four LSTM cells were stacked in the generator and its adversary, and MIDI files were used for training with four features per note: tone length, intensity, time lapse from the previous note, and the wave frequency.
The MIDI files were all classical music, as in the case of WaveNet, and totalled 3697 songs. This may be enough for training networks on low-resolution sound with only a few instruments, but, as stated in [1], when multiple genres were used there was either too much noise in the data to learn anything meaningful, or the different genre subsets of the data were not large enough to train the networks. Although C-RNN-GAN tends to repeat the same note at exactly the same time step during generation, this work [11] proved that GANs are a viable way to learn generative distributions of sequences and to generate music that sounds pleasing and is varied from the training data. Olof [11] and Kalingeri and Grandhe [2] come closest to the aim of this study, in that this study implements a customized LSTM-GAN as in [11] and trains on raw audio as in [2], with the hope of proving that deep learning architectures can learn more complex music notes than those in jazz and classical music. Given that most audio music available online is in a 16-bit format, which translates to 65,536 possible output notes at each time step, a µ-law reduction technique will be implemented, as done by the DeepMind team in [1], to reduce the size of the training data per track per second without compromising quality. Other notably interesting approaches include [12, 13], which use autoencoders and deep belief nets with RNN cells for this task.

Methodology
All networks and pre-processing will be implemented in Python, as it is rich with open-source libraries and support.
For assembling and training the networks, a combination of Keras and TensorFlow will be used. These provide multiple optimizers, layer types and activation functions to pick from, while still allowing some level of customization to the specific needs of the training data. For the training data, music files will be scraped from no-copyright YouTube channels and a database of electronic dance music will be created.
Other existing mp3 datasets will also be used. The web scraper will be written in Python as well, due to ease of implementation and the availability of tools on GitHub. A Fourier transform will then be applied to convert the audio samples in each time step into a vector of raw audio frequencies. Following the µ-law reduction in [1], experiments will be run with input spaces of different reduced dimensionality, down from the 65,536 initially possible values per time step of a 16 kHz audio file. This is to try to strike a balance between fast training and the quality of the training data.
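As a sketch of the intended pre-processing, the snippet below applies µ-law companding as in [1] followed by a frame-wise Fourier transform, using only NumPy. The frame size of 512 samples, µ = 255, and the 8-bit quantization depth are illustrative assumptions, not final design choices.

```python
import numpy as np

MU = 255  # companding parameter; 8-bit output depth is an assumption


def mu_law_encode(x, mu=MU):
    """Compress a waveform in [-1, 1] into mu + 1 discrete levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] onto integer bins {0, ..., mu}
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)


def mu_law_decode(bins, mu=MU):
    """Invert the companding back to an approximate waveform."""
    compressed = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu


def frames_to_frequencies(audio, frame_size=512):
    """Split audio into frames and return magnitude spectra (raw audio frequencies)."""
    n_frames = len(audio) // frame_size
    frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_size // 2 + 1)


# example: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440 * t)
spectra = frames_to_frequencies(mu_law_decode(mu_law_encode(tone)))
```

The encode/decode round trip loses only the quantization error introduced by the companding, which is the intended trade-off between input-space size and audio quality.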
The baseline model is a GAN with a single LSTM cell in each of the generator and discriminator. This network will be trained using RMSprop, with ReLU activations in the generator net and LeakyReLU in the discriminator, except for the tanh-activated output layer. For this study, more LSTMs will be added to the generator and discriminator to enable generating multiple notes per time step.
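To make the recurrent unit at the heart of this baseline concrete, a single LSTM cell step can be written out in NumPy. The gate layout follows the standard LSTM formulation [4]; the dimensions and the random initialization are purely illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; gates stacked as [input, forget, cell, output]."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # pre-activations for all four gates, shape (4n,)
    i = sigmoid(z[0:n])             # input gate
    f = sigmoid(z[n:2 * n])         # forget gate
    g = np.tanh(z[2 * n:3 * n])     # candidate cell state
    o = sigmoid(z[3 * n:4 * n])     # output gate
    c = f * c_prev + i * g          # new cell state carries long-range memory
    h = o * np.tanh(c)              # new hidden state / output
    return h, c


# illustrative dimensions: 8 latent inputs, 16 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.standard_normal((4 * n_hid, n_in)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for _ in range(5):  # unroll a short sequence of latent inputs
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
```

The additive cell-state update `c = f * c_prev + i * g` is what lets gradients flow across many time steps, which is why the LSTM does not need each note to be conditioned manually on the previous one.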
Stacking LSTMs will also help the network learn the sequential timing of notes so as to generate harmonious music. The input to the generator network G(z) is a vector of latent variables z, and the output is a sequence of audio frequencies. The discriminator net D(x) learns to tell the real data distribution X and the generated data G(z) apart. The standard objective function to optimize when training GANs is given by [5]:

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

In their later work [15], Goodfellow and the OpenAI team show that minimizing an objective computed directly from the output of the discriminator can lead to overfitting the generator. In [15] they propose an objective function that drives the generator G(z) to generate data that matches the statistics of the real data.
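Both quantities can be sketched numerically before any network exists: the standard value function from [5], and the feature-matching statistic from [15] computed over intermediate discriminator features f(·). The discriminator outputs and feature batches below are stand-in NumPy arrays, since the actual networks are yet to be built.

```python
import numpy as np


def gan_value(d_real, d_fake, eps=1e-8):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], as in [5]."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))


def feature_matching_loss(f_real, f_fake):
    """|| E[f(x)] - E[f(G(z))] ||_2^2 over intermediate discriminator features [15]."""
    diff = f_real.mean(axis=0) - f_fake.mean(axis=0)
    return float(diff @ diff)


# stand-in batches: a fairly confident discriminator on real data, unsure on fakes
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.5, 0.4, 0.6])
v = gan_value(d_real, d_fake)

rng = np.random.default_rng(1)
f_real = rng.standard_normal((64, 32))              # 64 samples, 32 features each
fm = feature_matching_loss(f_real, f_real.copy())   # identical statistics -> loss is 0
```

Note that the feature-matching loss depends only on batch means, so a generator minimizing it is pushed toward the statistics of the real data rather than toward fooling the discriminator output directly.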
They choose to use the expected value of the learned features of an intermediate layer of the discriminator network D(x). This study implements the following improved objective function [15]:

|| E_{x~p_data}[f(x)] − E_{z~p_z}[f(G(z))] ||_2^2

where f(x) denotes the activations of an intermediate layer of the discriminator network. Training a deep neural network on a medium-sized dataset can take days on a powerful CPU, so the network will be trained on a GPU if one is available; otherwise Amazon Web Services (AWS) will be used for this project.

Project Plan
The phases of this project include data collection and pre-processing, building and testing the networks, and reading at all stages to ensure the goals of the experiment are being achieved. Given the nature of the data and the size of the network, training is likely to consume a lot of time. Below is a detailed schedule for the stages and milestones of the project.