The specs are indeed intimidating: up to 32 GPU cores and up to 16 CPU cores. Pair that with 64 GB of RAM, and you're well equipped for any workload. And let's not forget the design. Well, it seems Apple did it again.
Most of the time, we write and debug our code locally. Once the tests pass, we deploy the scripts to a remote environment. If we're fortunate, that environment gives us access to multiple GPUs.
Custom training loops offer great flexibility. You can quickly add new functionality and gain deep insight into how your algorithm works under the hood. However, setting these loops up from scratch over and over is tedious. The general layout is usually the same; only tiny parts change, as the sketch below illustrates.
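As a concrete illustration, here is a minimal sketch of such a loop in PyTorch. The framework choice and everything in the snippet (model, data, hyperparameters) are placeholders of mine, not specifics from this article; the same skeleton applies just as well to, say, a TensorFlow `GradientTape` loop.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and synthetic data -- swap in your own.
model = nn.Linear(10, 2)
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for inputs, targets in loader:
        optimizer.zero_grad()             # reset gradients from the previous step
        outputs = model(inputs)           # forward pass
        loss = loss_fn(outputs, targets)  # compute the loss
        loss.backward()                   # backward pass: compute gradients
        optimizer.step()                  # update the parameters
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

From project to project, typically only the model, the loss, and a few lines inside the inner loop change; the surrounding scaffolding stays the same, which is exactly what makes rewriting it every time so tedious.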
After you have finally created that training script, it's time to scale things up. The jump from a local development environment, be it an IDE or Colab, to a large compute cluster is quite a stretch. The following best practices make this transition easier.