Preparing the ImageNet dataset with TensorFlow

Without a doubt, the ImageNet dataset has been a critical factor in developing advanced Machine Learning algorithms. Its sheer size and large number of classes have made it challenging to handle. These challenges led to better data-handling tools and novel neural network architectures.

TensorFlow Datasets is one such data-handling tool. With its help, you can conveniently access a variety of datasets from many categories. In most cases, it downloads the data for you.

However, ImageNet is an exception; it requires a manual setup. While there are some instructions on how to do that, they are somewhat vague. As a result, it took me some time to prepare the dataset, but I ended up with a concise script.

Downloading ImageNet

Before we do any preparation, we need to obtain the dataset. To do so, go to the sign-up page and create an account. After you have done this and applied for access to the ImageNet dataset, proceed to the download page. Under "ImageNet Large-scale Visual Recognition Challenge (ILSVRC)", select the 2012 version. This will lead you to a new page.

On this page, we need to download the "Training Images (Task 1 & 2)" and "Validation Images (all tasks)" files. Because they total roughly 150 GB, this will take some time. If you have access to a university network, I recommend using it; such networks often have excellent download speeds.

Afterwards, you have two files. The first one, ILSVRC2012_img_train.tar, contains the training images, packed as one sub-archive per class, so the labels follow from the file names. The second one, ILSVRC2012_img_val.tar, contains the validation images; their labels are not part of the archive but are supplied by TensorFlow Datasets. With these archives at hand, we can now prepare the dataset for actual use.
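Since the build takes a long time, it can be worth confirming that both archives are in place before starting. The following is a minimal sketch using only the standard library; the helper name find_missing_archives is hypothetical and not part of TensorFlow Datasets.

```python
from pathlib import Path

# File names of the two archives downloaded in the previous step.
EXPECTED_ARCHIVES = ("ILSVRC2012_img_train.tar", "ILSVRC2012_img_val.tar")


def find_missing_archives(manual_dir):
    """Return the names of the expected archives that are not in manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [
        name for name in EXPECTED_ARCHIVES
        if not (manual_dir / name).is_file()
    ]
```

Calling find_missing_archives with your download directory returns an empty list when both files are present.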

Preparing the ImageNet dataset

The full script to prepare the dataset is shown below. Adapt any directory paths to your case:
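The following is a minimal sketch of such a script, assuming the tensorflow-datasets package is installed. Both paths are placeholders, and the names MANUAL_DIR, TEMP_DIR, and prepare_imagenet are only illustrative. The build is started only if the archive directory exists.

```python
from pathlib import Path

# Assumption: placeholder paths, adapt them to your setup.
MANUAL_DIR = Path("/path/to/imagenet_archives")  # holds the two .tar files
TEMP_DIR = Path("/path/to/imagenet_temp")        # scratch space for the build

# Temporary directories used during preparation; safe to delete afterwards.
download_dir = TEMP_DIR / "downloaded"
extracted_dir = TEMP_DIR / "extracted"


def prepare_imagenet():
    """Build the imagenet2012 dataset from the manually downloaded archives."""
    # Imported inside the function because loading TensorFlow Datasets is
    # slow and only needed once the build actually starts.
    import tensorflow_datasets as tfds

    # manual_dir tells TensorFlow Datasets where to look for the archives.
    download_config = tfds.download.DownloadConfig(
        extract_dir=str(extracted_dir),
        manual_dir=str(MANUAL_DIR),
    )

    # The builder "knows" how to set up ImageNet.
    builder = tfds.builder("imagenet2012")
    builder.download_and_prepare(
        download_dir=str(download_dir),
        download_config=download_config,
    )


if __name__ == "__main__":
    if MANUAL_DIR.is_dir():
        prepare_imagenet()
    else:
        print(f"{MANUAL_DIR} does not exist; set MANUAL_DIR first.")
```

Note that the DownloadConfig parameter is called extract_dir; the variable extracted_dir merely holds the path passed to it.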

To install TensorFlow datasets, run

pip install tensorflow-datasets

After the necessary installations and imports, we define the path where the ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar files are located.

Afterwards, we set some configuration parameters. The manual_dir parameter is the key here. It ensures that TensorFlow Datasets searches for the downloaded files at our specified location. The download_dir and extracted_dir are temporary directories used during dataset preparation. You can delete them afterwards.

Lastly, we start the actual building process. The preparation is done with a DatasetBuilder object, which "knows" how to set up a particular dataset. To get the builder for ImageNet, we instantiate it by passing the dataset name, "imagenet2012".

On this object, we then call the actual preparation method, download_and_prepare(). The only thing that we do here is pass our configuration object.

That's all we have to do on the Python side.

Running the script

To run the script, we type

python <script_name>.py

This builds the ImageNet dataset in the default directory, ~/tensorflow_datasets/. To change this, we can call the script with

TFDS_DATA_DIR=<custom_path> python <script_name>.py

Prepending TFDS_DATA_DIR=<custom_path> sets the environment variable that controls the build location to a directory of our choice. This is mainly useful on compute clusters, where multiple workers access the same datasets.
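The effect of the environment variable can be pictured with a small sketch. It only illustrates the behaviour described above and is not TensorFlow Datasets' actual implementation; resolve_data_dir is a hypothetical helper.

```python
import os
from pathlib import Path

# Default build location named above.
DEFAULT_DATA_DIR = "~/tensorflow_datasets"


def resolve_data_dir(environ=None):
    """Return the build directory: TFDS_DATA_DIR if set, else the default."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("TFDS_DATA_DIR") or DEFAULT_DATA_DIR).expanduser()
```

If you prefer to fix the location in the script itself, tfds.builder also accepts a data_dir argument, for example tfds.builder("imagenet2012", data_dir="<custom_path>").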


Unfortunately, we cannot set up the test dataset as conveniently. To keep the comparison fair, no labels are provided for it. Therefore, the only way to assess one's model on the test set is to upload a file that maps each image to its predicted labels to the grading servers. To get the test images, download the test archive from the same download page as before and extract it to a directory of your choice. Any further processing then follows a typical data pipeline, which is beyond the scope of this article. To go further, have a look at Keras' utilities for loading images from a directory.
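The extraction step can be sketched with the standard library alone. The function name extract_test_images is hypothetical, the archive path is whatever you downloaded, and the .JPEG suffix is an assumption based on the naming used in the other ImageNet archives.

```python
import tarfile
from pathlib import Path


def extract_test_images(archive_path, target_dir):
    """Extract the test archive and return the sorted image paths."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        # The "data" filter rejects unsafe members such as absolute paths.
        # Python versions without filter support use the plain call instead.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(target_dir, filter="data")
        else:
            archive.extractall(target_dir)
    # Assumption: test images carry the .JPEG suffix.
    return sorted(target_dir.rglob("*.JPEG"))
```

The returned list of paths can then serve as the starting point for an input pipeline of your choice.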