Audio augmentations in TensorFlow

Photo by Richard Horvath / Unsplash

For image-based tasks, researchers and practitioners alike often rely on augmentations. These are (artificial) data transformations, such as rotating, blurring, or resizing. In contrast to modifications of other data types, image augmentations can be quickly understood: often, a glance shows us how a particular image was transformed. Although augmentations are most common in the image domain, they are also applied to other data types, such as audio.

In an earlier blog post, I described a simple GUI to visualize such augmentations right in the browser. It's still running, and you can try it live here. In this post, I follow up on my earlier work and describe two ways to apply the augmentations to your dataset in TensorFlow. The first way modifies the data directly; the second does so during the network's forward pass.

Direct audio augmentations

In the first scenario, we start by generating an artificial audio dataset. The procedure is straightforward and kept as simple as possible, so you can easily follow along. Instead of loading a pre-existing dataset, we just repeat one sample from librosa's library as often as we want:
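As a minimal sketch of this step: the post loads one of librosa's example clips, but to keep the snippet runnable offline, a one-second sine wave stands in for that clip here. The names SAMPLE_RATE, NUM_SAMPLES, and build_dataset are illustrative assumptions, not taken from the original notebook.

```python
import numpy as np
import tensorflow as tf

SAMPLE_RATE = 22050  # assumed sample rate (librosa's default when loading)
NUM_SAMPLES = 8      # hypothetical dataset size

# Stand-in for the librosa example clip: one second of a 440 Hz sine wave.
t = np.arange(SAMPLE_RATE, dtype=np.float32) / SAMPLE_RATE
audio = np.sin(2.0 * np.pi * 440.0 * t).astype(np.float32)


def build_dataset(sample, sample_rate, repeats):
    """Repeat one audio clip `repeats` times and pair each copy with its sample rate."""
    audio_batch = np.repeat(sample[np.newaxis, :], repeats, axis=0)
    rates = np.full((repeats,), sample_rate, dtype=np.int32)
    return tf.data.Dataset.from_tensor_slices((audio_batch, rates))


dataset = build_dataset(audio, SAMPLE_RATE, NUM_SAMPLES)
```

Each element of the dataset is a pair (audio, sample_rate), which is the layout the later augmentation steps rely on.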

As you might have noticed, we have created a Dataset object during this process. For the sake of convenience, I chose to do so; but we could have also worked with pure NumPy arrays. Anyway, the audio data is now stored together with its sample rate.

Now that our small dataset is ready to be used, we can start with applying the augmentations. For this step, we use the audiomentations library (on a side note, this is the library that generates the transformations for the mentioned GUI).

To keep things simple, we import only three modifications, namely PitchShift, Shift, and AddGaussianNoise. The first two shift the pitch (PitchShift) and the data (Shift, which can be thought of as rolling the data around; e.g., a dog's bark will be shifted by +5 seconds). The last transformation makes the signal noisier, increasing the challenge for the neural network later on. Next, we combine all three augmentations into a single pipeline:

Before we can feed our data through the pipeline, we have to write some additional code. We need to do this because we are working with a Dataset object, which, put simply, first traces a function with placeholder tensors and only applies it to real values during actual data loading.

In our case, this additional code tells TensorFlow to temporarily transform the tensors to NumPy arrays, which are accepted by the pipeline:
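One way to write such helpers is with tf.numpy_function, sketched below. To keep the snippet self-contained, numpy_augment is a hypothetical stand-in that only adds light noise; any callable that accepts a NumPy array and a sample rate (such as an audiomentations pipeline) could take its place. The helper names are assumptions as well.

```python
import numpy as np
import tensorflow as tf


def numpy_augment(samples, sample_rate):
    """Hypothetical stand-in for a NumPy-based augmentation pipeline."""
    noise = np.random.normal(0.0, 0.01, size=samples.shape).astype(np.float32)
    return samples + noise


def apply_pipeline(samples, sample_rate):
    # Runs outside the graph, so `samples` is a real NumPy array here.
    return numpy_augment(samples, sample_rate).astype(np.float32)


def tf_apply_pipeline(samples, sample_rate):
    # Wrap the NumPy function so that Dataset.map can call it on tensors.
    augmented = tf.numpy_function(
        apply_pipeline, inp=[samples, sample_rate], Tout=tf.float32
    )
    # numpy_function drops static shape information; restore it.
    augmented.set_shape(samples.shape)
    return augmented, sample_rate


# Usage on a tiny dummy dataset of (audio, sample_rate) pairs.
audio_batch = np.zeros((3, 8000), dtype=np.float32)
rates = np.full((3,), 8000, dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices((audio_batch, rates))
dataset = dataset.map(tf_apply_pipeline)
```

Restoring the shape with set_shape matters for later steps, since layers and batching generally need to know the length of each sample.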

With these helper functions in place, we can now augment our dataset. Additionally, we expand the data's dimension, adding an artificial axis at the end. This turns a single audio sample from shape (num_data_points,) into (num_data_points, 1), indicating that we have mono audio. This would be necessary were we to feed the data through a network:
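The dimension expansion on its own can be sketched like this. The dummy data and the name add_channel_axis are assumptions for illustration; in the full workflow, this map call would follow the one that applies the augmentations.

```python
import numpy as np
import tensorflow as tf


def add_channel_axis(samples, sample_rate):
    # (num_data_points,) -> (num_data_points, 1): a single (mono) channel.
    return tf.expand_dims(samples, axis=-1), sample_rate


# Dummy dataset of four silent clips, 16000 data points each.
audio_batch = np.zeros((4, 16000), dtype=np.float32)
rates = np.full((4,), 16000, dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices((audio_batch, rates))
dataset = dataset.map(add_channel_axis, num_parallel_calls=tf.data.AUTOTUNE)
```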

That's it for this section. After the augmentations are applied, any later operation, be it a network or something else, will get to see the augmented data.

Audio augmentations during the forward pass

In contrast to the first technique, augmenting the audio data within the network places the computational load on the forward pass.

For this, we'll use the kapre library, which provides custom TensorFlow layers. Among these layers is the MelSpectrogram layer, which accepts the raw (that is, unmodified) audio data and computes a Mel-scaled spectrogram on the GPU.

While not directly relevant for data augmentation, this has two benefits:

First, we could optimize the parameters for the spectrogram generation during, e.g., a hyperparameter search, without having to regenerate the spectrograms from the audio for every configuration. Second, the transformation takes place directly on the GPU and is thus faster, both in terms of the raw conversion speed and because the data already resides in device memory.

With these benefits in mind, we can apply the augmentations after the spectrograms have been computed. Although the number of readily available transformations is limited at the time of writing, the process is conceptually straightforward.

First, we start with loading the audio-specific layers, which are those provided by the kapre library. These layers, as described, take the raw audio data and compute a spectrogram representation:
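To make concrete what such a layer computes, here is a rough sketch that uses TensorFlow's built-in tf.signal operations instead of kapre. It is a stand-in for illustration, not kapre's implementation, and the function name and all parameter values are assumptions.

```python
import tensorflow as tf


def log_mel_spectrogram(waveforms, sample_rate=16000, frame_length=1024,
                        frame_step=256, num_mel_bins=64):
    """Turn raw audio of shape (batch, samples, 1) into log-Mel spectrograms."""
    mono = tf.squeeze(waveforms, axis=-1)                     # (batch, samples)
    stft = tf.signal.stft(mono, frame_length=frame_length,
                          frame_step=frame_step)              # (batch, frames, bins)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=frame_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=80.0,
        upper_edge_hertz=7600.0,
    )
    mel = tf.tensordot(power, mel_matrix, 1)                  # (batch, frames, mels)
    log_mel = tf.math.log(mel + 1e-6)                         # avoid log(0)
    return tf.expand_dims(log_mel, axis=-1)                   # add a channel axis


# One second of silence at 16 kHz, batch of two mono clips.
spectrograms = log_mel_spectrogram(tf.zeros((2, 16000, 1)))
```

Because everything is expressed in TensorFlow operations, the conversion runs on the same device as the rest of the model.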

Afterward, we add an augmentation layer from the spec-augment package. This package implements the SpecAugment paper by Park et al. [1], which masks parts of the spectrogram. The masking hides information that the neural network would otherwise rely on, thereby increasing the challenge. This modification, in turn, forces the network to attend to other features, improving its ability to generalize to unseen data:
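The masking idea itself can be sketched without the package. The layer below zeroes one random band of time steps and one random band of frequency bins, and only does so during training. It is a simplified illustration in the spirit of SpecAugment, not the spec-augment package's API; the names BandMasking and mask_bands and the width limits are assumptions.

```python
import tensorflow as tf


def mask_bands(spec, max_time_width=10, max_freq_width=8):
    """Zero one random time band and one random frequency band.

    spec has shape (batch, time, freq, 1); the same mask is shared by the batch.
    """
    shape = tf.shape(spec)
    n_time, n_freq = shape[1], shape[2]
    t_width = tf.random.uniform([], 0, max_time_width + 1, dtype=tf.int32)
    t_start = tf.random.uniform([], 0, n_time - t_width + 1, dtype=tf.int32)
    f_width = tf.random.uniform([], 0, max_freq_width + 1, dtype=tf.int32)
    f_start = tf.random.uniform([], 0, n_freq - f_width + 1, dtype=tf.int32)

    time_idx = tf.range(n_time)
    freq_idx = tf.range(n_freq)
    keep_time = tf.logical_or(time_idx < t_start, time_idx >= t_start + t_width)
    keep_freq = tf.logical_or(freq_idx < f_start, freq_idx >= f_start + f_width)
    keep = tf.logical_and(keep_time[:, tf.newaxis], keep_freq[tf.newaxis, :])
    keep = tf.cast(keep, spec.dtype)[tf.newaxis, :, :, tf.newaxis]
    return spec * keep


class BandMasking(tf.keras.layers.Layer):
    """Applies mask_bands during training and passes data through otherwise."""

    def __init__(self, max_time_width=10, max_freq_width=8, **kwargs):
        super().__init__(**kwargs)
        self.max_time_width = max_time_width
        self.max_freq_width = max_freq_width

    def call(self, inputs, training=None):
        if training:
            return mask_bands(inputs, self.max_time_width, self.max_freq_width)
        return inputs


layer = BandMasking()
spec = tf.ones((2, 59, 64, 1))
masked = layer(spec, training=True)
```

Restricting the masking to training mode keeps evaluation and inference deterministic.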

Finally, we can add more layers on top. For our case, I've added an untrained residual network, with an arbitrary ten classes to classify the data into:
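As a rough illustration of this last step, the sketch below builds a small, untrained residual classifier with ten output classes. It is a deliberately tiny stand-in, not the residual network used in the original notebook; the input shape, filter counts, and function names are assumptions.

```python
import tensorflow as tf


def residual_block(x, filters):
    # A 1x1 convolution matches the channel count for the skip connection.
    shortcut = tf.keras.layers.Conv2D(filters, 1, padding="same")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([shortcut, y])
    return tf.keras.layers.Activation("relu")(y)


def build_classifier(input_shape=(59, 64, 1), num_classes=10):
    """Small residual classifier operating on (time, mel, channel) spectrograms."""
    inputs = tf.keras.Input(shape=input_shape)
    x = residual_block(inputs, 16)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = residual_block(x, 32)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_classifier()
probabilities = model(tf.zeros((2, 59, 64, 1)))
```

In a full model, the spectrogram and masking layers would sit in front of these blocks, so that raw audio enters and class probabilities come out.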

That's it for this part. We now have a deep neural network that augments the audio data during its forward pass.


In this blog post, we have discussed two ways of augmenting audio data: the first approach modifies the audio data directly; the second does so as part of the forward pass of a neural network. These two methods should give you a good idea of how to proceed in your own use case. To get you started quickly, you can find the complete notebook on GitHub.

[1] Park et al., SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, Proc. Interspeech 2019

Pascal Janetzky

Avid reader & computer scientist