Using Convolutional Neural Networks to Generate Harmony

Explores the effect of data representation and temporal resolution on the subjective quality and the negative log-likelihood of generated notes.

Luke Griswold

Dec 21, 2019

Using neural networks (NNs) to generate music has been discussed several times on Towards Data Science, usually in the context of sequence generation with Recurrent Neural Networks (RNNs) and, in some cases, Convolutional Neural Networks (CNNs). This article describes an approach that uses convolution to generate harmonies for four instruments, each capable of playing one pitch at any given time.

Transposition of BWV 364 using Music21, Image by MuseScore

The major advantage of using Convolutional Neural Networks is that music, like the images used in image recognition, contains patterns that are invariant under translation: in time (themes and motifs) and in pitch (transposition). Thus, convolution is an ideal operation for finding patterns in music such as chord progressions or cadences when the data is organized as a matrix representing time and pitch. For this project, time is discretized by sampling the pitch of a given instrument (voice) at a fixed frequency. The choice of frequency has consequences for how the model performs, as we shall soon see.

This project eventually settled on an eighth-note time step and a 128-element chromatic pitch vector, matching the pitch range of .midi files. For the data set, 163 Bach chorales were selected by parsing the Music21 corpus for chorales that have four voices and are in 4/4 time. This data representation and model architecture were first used by a Google AI team, who also published an interesting paper about their techniques [3]. This project uses many of those techniques but is implemented in Keras with Music21.

The basic idea of using a CNN is to use several filters for different patterns (cadences, chords, etc.) to generate probability distributions over all the possible pitches for each discrete time step. A piece of music can be generated from either no input or from partial inputs by repeatedly sampling from the unknown notes and generating new probability distributions.

Data Preparation

Music21’s corpus stores music in MusicXML format, so a helper class is needed to turn those files into a NumPy array representing a piano roll (found in utils.py in the GitHub repository). I also wrote a method that transposes a piano roll into a random key by adding or subtracting a number of semitones in the interval [1, 11] from the pitches (the chromatic scale has 12 notes). In this way, batches of inputs can be generated by selecting a random four-measure excerpt from one of the Bach chorales and transposing it to a random key.
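The piano roll and the random transposition can be sketched in plain NumPy. This is a minimal illustration, not the code from utils.py: the function names are hypothetical, the input is assumed to be a list of MIDI pitch numbers already sampled at the chosen time step, and transposition is done by shifting the roll along the pitch axis.

```python
import numpy as np

NUM_PITCHES = 128  # size of the MIDI pitch range


def to_piano_roll(pitches):
    """One-hot encode a sequence of MIDI pitch numbers as a (time, pitch) array."""
    roll = np.zeros((len(pitches), NUM_PITCHES), dtype=np.float32)
    roll[np.arange(len(pitches)), pitches] = 1.0
    return roll


def transpose_roll(roll, semitones):
    """Shift every note by `semitones` along the pitch axis."""
    active = np.nonzero(roll)[1]  # pitch indices of all sounding notes
    if active.size and (active.min() + semitones < 0
                        or active.max() + semitones >= NUM_PITCHES):
        raise ValueError("transposition would leave the MIDI pitch range")
    # np.roll wraps around, which is harmless once the range check has passed
    return np.roll(roll, semitones, axis=1)


def random_transposition(roll, rng):
    """Transpose up or down by a random number of semitones in [1, 11]."""
    shift = int(rng.integers(1, 12))  # upper bound is exclusive
    if rng.random() < 0.5:
        shift = -shift
    return transpose_roll(roll, shift), shift


rng = np.random.default_rng(0)
roll = to_piano_roll([60, 62, 64, 65])  # C, D, E, F
transposed, shift = random_transposition(roll, rng)
print(shift, transposed.argmax(axis=1))
```

Shifting the whole array keeps the intervals between notes intact, which is what makes transposition a cheap form of data augmentation here.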

Piano Roll for a Single Voice as Input

The four piano rolls (one each for the Soprano, Alto, Tenor, and Bass voices) are then stacked together with four mask matrices that indicate which time steps contain known pitches for each voice. The model is trained by taking batches of Bach chorales and erasing the same notes from each piece in the batch. A custom loss function then minimizes the sum of the negative log-likelihoods of the correct pitches at the erased positions, divided by the total number of erased notes so that batches with more erased notes or with different orderings are not weighted more heavily.
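As a rough sketch of that stacking step, the function below builds one model input from four piano rolls and a per-voice mask. The name `build_input` and the (time, pitch, channel) layout are assumptions made for illustration; the repository may arrange the axes differently.

```python
import numpy as np


def build_input(rolls, known):
    """Stack masked piano rolls with their masks.

    rolls: (time, pitch, voices) one-hot piano rolls
    known: (time, voices) boolean array, True where the pitch is given
    Returns an array of shape (time, pitch, 2 * voices).
    """
    # Repeat each voice's mask across the pitch axis so it matches the rolls
    mask = np.broadcast_to(known[:, None, :], rolls.shape).astype(rolls.dtype)
    # Erased time steps become all-zero columns in the roll channels
    return np.concatenate([rolls * mask, mask], axis=-1)


# Example: 32 eighth-note steps, 128 pitches, 4 voices with random pitches
rng = np.random.default_rng(0)
pitches = rng.integers(40, 80, size=(32, 4))
rolls = np.zeros((32, 128, 4), dtype=np.float32)
t, v = np.meshgrid(np.arange(32), np.arange(4), indexing="ij")
rolls[t, pitches, v] = 1.0

known = rng.random((32, 4)) < 0.5  # erase roughly half of the notes
x = build_input(rolls, known)
print(x.shape)  # (32, 128, 8)
```

Passing the mask as extra channels lets the network tell an erased time step apart from a voice that is simply silent.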

Implementing this in Keras required a custom generator to build the batches of inputs, as well as a custom loss function, which makes the model a little difficult to save and load. Luckily, once a model is trained, it can be loaded from .json and .h5 files for inference.

Model Architecture

The paper this model follows used 64 convolution layers with 128 filters. They were eventually able to use dilated convolution in order to save computing power. Since I had to train this model on a single computer without large numbers of GPUs, the architecture I used was 20 convolutional layers with 64 5×5 filters. Like the original paper, BatchNormalization and skip connections were used every two layers. For the convolution layers, ReLU activations were used with padding to keep the original size of the input in terms of time and pitch.

Iterated Model Architecture

The final layer is a convolution layer that outputs four channels and uses the softmax activation. In this way, the output can be trained to become the four voices with probability distributions over the 128 pitches at each time step. The loss is then computed by summing the negative logs of the probabilities corresponding to the actual notes for the erased notes in the input.
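The loss can be written out in NumPy to make the bookkeeping explicit. This is an illustrative sketch with a hypothetical function name and an assumed (time, pitch, voices) layout; the actual Keras loss would use backend tensor operations rather than NumPy, and it divides by the number of erased notes as described under Data Preparation.

```python
import numpy as np


def masked_nll(probs, targets, erased):
    """Mean negative log-likelihood of the true pitches at erased positions.

    probs:   (time, pitch, voices) softmax output, sums to 1 over the pitch axis
    targets: (time, pitch, voices) one-hot piano rolls of the true notes
    erased:  (time, voices) boolean array, True where the input was erased
    """
    # Probability the model assigned to the correct pitch of each note
    p_true = (probs * targets).sum(axis=1)
    nll = -np.log(np.clip(p_true, 1e-12, 1.0))
    # Only erased notes contribute; guard against an empty mask
    return (nll * erased).sum() / max(int(erased.sum()), 1)


# A model that is uniform over 128 pitches scores log(128) per note
T, P, V = 8, 128, 4
probs = np.full((T, P, V), 1.0 / P)
targets = np.zeros((T, P, V))
targets[:, 60, :] = 1.0
erased = np.zeros((T, V), dtype=bool)
erased[:4, 1:] = True
print(masked_nll(probs, targets, erased), np.log(128))
```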

Generating Music with Gibbs Sampling

Several papers have looked at different sampling and resampling methods for music generation. DeepBach’s architecture [2] used a pseudo-Gibbs sampling procedure to rewrite parts of a generated score. The Coconet paper [3] compared several techniques and found that the best generated samples came from using an annealed probability for resampling erased or unknown notes. The equation for this annealed probability is fairly simple:

α_n = max(α_min, α_max − n(α_max − α_min) / (BN))

Here α_n is the probability that a note that was erased in the input will be erased again for the next iteration of the sampling, n is the current step, B is the fraction of the steps during which α_n stays above α_min, and N is the total number of steps, generally set to the number of voices (4) times the number of time steps (32).
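To see how such a schedule behaves, here is a small sketch that assumes the linearly decreasing form α_max − n(α_max − α_min)/(BN), clipped at α_min, so the erasure probability falls from α_max to α_min. The function name and the values α_max = 0.9, α_min = 0.1 and B = 0.75 are illustrative assumptions, not settings taken from this project.

```python
def annealed_alpha(n, total_steps, alpha_max=0.9, alpha_min=0.1, b=0.75):
    """Erasure probability at step n of a linearly annealed schedule.

    The value falls from alpha_max at n = 0 and reaches alpha_min after a
    fraction b of the total steps, where it stays for the remaining steps.
    """
    decayed = alpha_max - n * (alpha_max - alpha_min) / (b * total_steps)
    return max(alpha_min, decayed)


N = 4 * 32  # number of voices times number of time steps
schedule = [annealed_alpha(n, N) for n in range(N)]
print(schedule[0], schedule[N // 2], schedule[-1])
```

With these values the schedule reaches its floor at step 96, so the last quarter of the run resamples only a small share of the notes per step.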

The way Gibbs sampling works, and why it works so well, has to do with the probability distributions converging on a coherent piece of music. If we feed a mostly erased score (or random noise) into the model, the probability distributions will generally be spread out across several pitches, since each pitch depends on what comes before and after it as well as on what is happening in the other voices. In effect, we are trying to model a joint probability distribution without enough context from the other variables.

So, we sample from the probability distributions to obtain pitches at each unknown time step, then erase each of those pitches again with probability α_n. In this way, the block sampling that occurs at the beginning of the Gibbs sampling keeps the music from simply staying on the same note. As the process moves forward, fewer and fewer notes are resampled, which allows the probability distributions over the erased notes to converge on musically coherent pitches.
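The loop can be sketched as follows. Everything here is a simplified illustration rather than the project's code: `predict` stands in for the trained network and is assumed to return a (time, voices, pitch) array of probabilities, the erasure probability is assumed to decrease linearly from `alpha_max` to `alpha_min`, and notes supplied by the user are never erased.

```python
import numpy as np


def gibbs_sample(predict, pitches, known, steps,
                 alpha_max=0.9, alpha_min=0.1, b=0.75, seed=0):
    """Fill in unknown notes by repeated sampling and partial erasure.

    predict(pitches, visible) -> probabilities of shape (time, voices, pitch)
    pitches: (time, voices) integer MIDI pitches
    known:   (time, voices) boolean array of notes fixed by the user
    """
    rng = np.random.default_rng(seed)
    pitches = pitches.copy()
    visible = known.copy()  # notes the model may condition on
    for n in range(steps):
        probs = predict(pitches, visible)
        # Draw a pitch for every note that is currently erased
        for t, v in zip(*np.nonzero(~visible)):
            pitches[t, v] = rng.choice(probs.shape[-1], p=probs[t, v])
        # Erase each non-fixed note again with the annealed probability
        alpha = max(alpha_min,
                    alpha_max - n * (alpha_max - alpha_min) / (b * steps))
        erase = (rng.random(pitches.shape) < alpha) & ~known
        visible = ~erase
    return pitches


def always_middle_c(pitches, visible):
    """Stand-in model: puts all probability on MIDI pitch 60."""
    probs = np.zeros(pitches.shape + (128,))
    probs[..., 60] = 1.0
    return probs


start = np.zeros((8, 4), dtype=int)
fixed = np.zeros((8, 4), dtype=bool)
start[:, 0] = 72  # a held soprano note supplied as the melody
fixed[:, 0] = True
result = gibbs_sample(always_middle_c, start, fixed, steps=32)
print(result[0])
```

The stand-in model makes the control flow easy to check: the fixed soprano line is untouched and every other voice ends up on the only pitch the model allows.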

Effect of Temporal Resolution

Judged by negative log-likelihood (NLL) alone, higher temporal resolution appears to improve the music quality. During training, for example, the minimum loss at sixteenth-note resolution was 0.168, which corresponds to an average probability of about 85% for the correct pitch of each erased note. The best loss at quarter-note resolution was 0.487, or an average probability of about 61%. However, the sixteenth-note model can reach its low loss largely by heavily weighting the pitches immediately before and after the current time step in the same voice, since a quarter note spans four time steps at sixteenth-note resolution. A better metric is whether the Gibbs sampling procedure reduces the average negative log-likelihood.
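The probabilities quoted for each loss follow from exponentiating the negative of the mean NLL, which gives the geometric mean of the probabilities assigned to the correct pitches. A short check, assuming the losses are measured in nats (natural logarithm):

```python
import math


def mean_probability(nll):
    """Geometric-mean probability implied by a mean negative log-likelihood."""
    return math.exp(-nll)


print(round(mean_probability(0.168), 2))  # sixteenth-note model: 0.85
print(round(mean_probability(0.487), 2))  # quarter-note model: 0.61
```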

TOP: 1/16th resolution, melody input. MID: 1/8th resolution, melody input. BOT: 1/8th resolution, random input.

With the trained models, the sixteenth-note resolution appears to be overfit: it does very well on the training data and achieves low NLLs there, but the NLL does not converge during the Gibbs sampling process. The best behavior was exhibited by the eighth-note resolution when a melody was given to the model as input and the other voices were left unknown.

GitHub and References

Code base and music samples for the project can be found at the following GitHub repository.

References:

[1] M. Cuthbert and C. Ariza. music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data. 2010. In Proceedings of the International Society for Music Information Retrieval.

[2] G. Hadjeres, F. Pachet, and F. Nielsen. DeepBach: a Steerable Model for Bach Chorales Generation. 2017. In Proceedings of the 34th International Conference on Machine Learning.

[3] C. Huang, T. Cooijmans, A. Roberts, et al. Counterpoint by Convolution. 2017. In Proceedings of the 18th International Society for Music Information Retrieval Conference.

URL: https://towardsdatascience.com/using-convolutional-neural-networks-to-generate-harmony-1a1cdfd7ec56/