Parsing the ESC50 audio dataset with TensorFlow

Parsing the ESC50 audio dataset with TensorFlow
Photo by Richard Horvath / Unsplash

TensorFlow's TFRecord format is handy but inconvenient to handle the first few times. While many datasets come readily from TensorFlow Datasets, some must be parsed manually. The ESC-50 dataset, short for Environmental Sound Classification, is one of them.

Dataset description

It consists of 2000 individual audio files. Each file, changed in the WAV format, is five seconds long and belongs to one of 50 classes. Per class, 40 examples are available. With around 600 MB in size, the dataset is small and handy for quick experimentation. Also, because of the even class distribution, it's balanced. This means one can use accuracy-related metrics, which are generally easily understood.

Downloading and extracting

To download the (raw) dataset, go to its GitHub page, and click on the link. Depending on your connection, it'll take a few seconds to minutes. After the download, unpack the archive. You have two folders, audio and meta, and two files. We are concerned with the audio folder.

Within this folder, the files are named according to this scheme:

{FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav.

The FOLD placeholder stands for the fold that this file should be used in, CLIP_ID stands for the ID of the source, TAKE gives information whether multiple 5-second-segments were taken from the source, and TARGET is the name of the class in numeric format.

Parsing data into TFRecords

If this is your first time working with TFRecords, TensorFlow's native format for efficient data storage, or you need a refresher, have a look at this Google Colab notebook and an accompanying description here.

Generally, we only need the FOLD and TARGET information from the filenames to write the audio data into TFRecord format. To get this done, we first import some packages and a couple of helper functions:

Afterwards, we'll define the function that loads the audio from the disk. We use the librosa library for this; I just found that this is usually the most convenient and fastest way. By splitting the filename, we can extract its class label (the TARGET part). In the end, we return the audio data, the sampling rate, the label, and the file's name:

Afterwards, we define the function that makes the extracted data ready to be stored. Here, we'll use the helper function implemented previously. As I describe in my introductory, hands-on guide to the TFRecord format, these functions are used to make integers, floats, strings, and byte data ready for writing to disk.

We give these features appropriate names to extract them at some later point, e.g., when feeding them to a neural network. Lastly, we wrap them into an Example object, which you can think of as a box that has some content and properties:

With these two core functionalities defined, loading the audio data from disk and getting them into a TFRecord-compatible format, we can now tend to the method combining them. Here, we create a TFRecordWriter object responsible for writing the data to disk and iterate over all audio files that we've found (implemented soon). Then, each audio file is parsed, packed into an Example object, and written to a TFRecord file.

Note that the ESC50 dataset is conveniently small. Per pre-arranged fold, there are only 400 small files. Therefore, we can use on TFRecord file per fold, leading to five files in total. For larger datasets, you want to use more files. For such cases, the documentation has some tips.

Finally, the following main() function is used to iterate over the folds. We do this by using search patterns, which python's glob module can work with:

To round off our small script, we define an argument parser. It takes the path to the audio directory mentioned at the beginning and the output directory. If it does not already exist, it will be created:

Running the script is done by calling python /path/to/script.py

With the script defined to create the TFRecord files, we probably want to read the files back later. So that's what we'll implement now.

Reading data from TFRecords


Getting the data out of the TFRecord format is, once you've done it a couple of times, actually very easy. We only have to inverse the storing procedure. Previously, we put the features, named sr, len, y, and so on, into the box. We thus use the same names to get the data out.

The only caveat is the audio data. Because it is an array, we have to reshape it. That's why we stored the len property. Similarly, the filename has to be parsed to a string:

We use the following function to read the content of one or more TFRecord files. It returns a dataset object:

We can iterate over the first few elements using a for loop, inspecting them for any errors. Each sample consists of four components: The actual audio data, the label, the sampling rate, and the original filename:

That is it! Now it's up to you.

Summary

We parsed the audio files to TFRecord using librosa and several helper functions. Then, we not only stored the data and labels but also saved the file name and the sampling rate. Lastly, we wrote some code to read the data from the TFRecord files, giving us a dataset we can iterate over.

Pascal Janetzky

Pascal Janetzky

Avid reader & computer scientist