Writing Machine Learning Code that scales

Writing Machine Learning Code that scales
Photo by Frank Busch on Unsplash

After you have finally created that training script it’s time to scale things up. From a local development environment, be it an IDE or Colab, to a large computer cluster, it’s quite a stretch. The following best practices make this transition easier.

Argument parsers

The first is to use argument parsers. Python provides such functionality with the argparse module. Generally, you want to make the batch size, the number of epochs, and any directories a selectable argument. It’s very frustrating to go through a script manually, find all places where a specific parameter is used, and change it all by hand. Thus, avoid hardcoding such parameters. Instead of writing

you then write

Separate argument parser setup from the training code

That’s a mistake I have made often. I created the argument parser in the training script and verified the training locally. Next, I deployed this code. Only to see it fail on the remote machines: the directories that I had set as default parameters were pointing to my local disk.

Edit/Note: While I still recommend using argument parsers, I found that handling more than ten, twenty parameters is chaotic. In that case, I recommend using them for the main settings only. For the more advanced settings, I recommend using the Gin python library.

Read more about it here:

Pascal Janetzky - Private Site Access

Separate argument parser setup from the training code

Instead of writing code similar to

(that overrides your manually changed parameters on the deployment machine), you have a separate file that is unaffected by updates:

You have this file for your local and deployed project. You then import this method in your main training code; locally the paths point to your local folders, on the remote machines they point to their appropriate directories. This way you won’t have to remember changing the default parameters after going from a local single GPU machine to a multi-worker setup.

Note: Similar to before, I now recommend handling parameter settings with the Gin python library. I wrote a short introduction here.

Separate model creation script

Separating your model creation routines from the main training code results in lean code. Similar to the previous point, you use a separate file to store the model and its configuration, and then import the method in the main code. Combine this with the next point for massive scalability.

Make hyperparameters easily editable

When you hard-code any parameters you will have a hard time updating them. Using 512 kernels instead of 16? Using a larger dense layer?

Every time you try a new configuration you first have to find the relevant code and then change it. That will quickly become annoying. The solution is to make the hyperparameters part of the script’s arguments or to have a separate file where you store them.

The first option would look like:

and the second option would be similar to:

In both cases, you have a central place that manages the default parameters. Scaling from 16 to 512 kernels? That’s only a matter of seconds now.

Note: Similar to before, I now recommend handling parameter settings with the Gin python library. I wrote a short introduction here.

Use device-agnostic code

This is a superb feature I have seen at TensorFlow: Whether it’s a single GPU or 20 workers with 16 GPUs each, write your code that it works anyway — without any modifications. TensorFlow handles this for you with strategy objects. This enables you to seamlessly switch between a single GPU and multiple workers, which is handy when you only have access to a single device when coding but use multiple devices upon deployment.

Instead of writing:

you wrap the model creation and compile routines with the strategy object:

Wrapping your code in the scope of the strategy object takes care of distributing the model and the optimizers. You can use the following code snippet, from Hugginface’s Transformer library, to create such an object:

I have used the same technique to scale from a local non-GPU setup to a multi-GPU cluster and TPU training. There’s something magical about utilizing four, eight, and more GPUs with a single command.

Let your framework do the dataset handling

Instead of manually taking care of loading data, parsing, caching, …, you let your ML framework do the task. Many people have contributed to these projects and made the default pipelines very efficient. They natively support multiprocessing, caching, batching, and so on.

Especially for complex environments, this is vital: How do you feed your data to three workers? Such functionalities are often already implemented. Don’t reinvent the wheel, but focus on the fun part: Creating and training models.

Pascal Janetzky

Pascal Janetzky

Avid reader & computer scientist