How to (not) fail a Machine Learning project
Four months is a long time. But, so is ten months. Generally, everything further away than four weeks is in the future. This situation can be a blessing and a curse for machine learning projects. A blessing because we have ample time to finish the project. A curse because we don’t have to start (or continue to work) just yet.
Edit #1 (May 2022): After running more than ten thousand experiments for a single project, I have included more lessons learned. Adapt them and enjoy faster progress.
Edit #2 (June 2022): I’ve updated the article with thoughts about how a clean labeling pipeline makes life easier.
Pursuing projects over the long term is thus an art in itself. Thankfully, we rarely have to start from scratch; people before us have often gone down the road already. By learning from their mistakes and improving what they found helpful, we are well equipped to handle daunting projects. Therefore, one of the first steps is to find areas for improvement. Here, Barbara Oakley’s A Mind for Numbers helped me start the habit of listing the lessons I learned that day.
At the risk of sounding cliché, it’s often only through the mistakes we made (or narrowly avoided) that we can do better the next time. With this mindset, we inevitably find areas where we can improve. And after working on a machine learning project for half a year, I found there’s much to be learned.
In the hope of helping other readers and practitioners avoid my accidents, I’ll now present a selection of the most important lessons I learned. Of course, it’s not that you’ll inevitably fail if you make the same mistakes. Also, it’s not that you’ll automatically be successful if you manage to avoid them. Rather, by learning as much as possible and seeing what resonates with your experience, you increase the chances of success by a large margin.
The remainder is divided into two categories. The first one, machine learning, lists lessons learned related to practicing the actual ML tasks. The second one, general, lists takeaways applicable to any larger project.
Good code is good, and faster good code is better. But only to a certain point. After this point, the additional time spent optimizing code does not translate into equally increased returns. Obviously, there is no hard threshold; it depends on the project. However, I trust that many practitioners are familiar with a similar situation.
Use dedicated experiment tracking software
If you have a couple of experiments, tracking them manually by noting the specific configuration is feasible. However, this approach does not scale well. Manually tracking the output of all experiments does not work beyond a few dozen. Here, I recommend using dedicated tracking software. Personally, I have been using Weights & Biases for a long time and have come to appreciate its fantastic user interface. In addition to logging the results, such systems can be set up to track the input to a specific experiment. Such features, in turn, make replicating results easier.
Use fewer datasets at first
In his book A World Without Email, Cal Newport introduces the idea of doing less but better (c.f. p. 215 ff and p. 227 ff). We can employ the same concept for machine learning projects. When it comes to testing new ideas (for research), we often want to evaluate them as thoroughly as possible. Though this is reasonable, it does not apply at every stage. In the beginning, it’s better to handle a few core datasets well than to support dozens in a meager way. Over the project’s duration, you can then decide to add additional datasets.
Do EDA explicitly
Here’s one flawed approach I took a couple of years back. I trained a neural network to classify audio data. However, rather than examining the data — that is, looking at the label distribution, the frequencies, the duration, etc. — I used a much more unstructured and error-prone approach. I manually experimented with different hyperparameters, hoping to find a good set by chance. Obviously, this trial-and-error process works only in the rarest case. In contrast, by doing an exploratory data analysis, we can gain more insights into our datasets early on and save time and cost over the long term.
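As a minimal sketch of what such a first look could be, the following snippet summarizes the label distribution and the clip durations of a dataset. The toy data and the function names are made up for illustration; for real audio data you would read labels and durations from your own metadata.

```python
from collections import Counter
from statistics import mean, median


def summarize_labels(labels):
    """Return the count and relative share of every class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


def summarize_durations(durations_s):
    """Return basic statistics of clip durations given in seconds."""
    return {
        "min": min(durations_s),
        "max": max(durations_s),
        "mean": mean(durations_s),
        "median": median(durations_s),
    }


# Hypothetical toy data: (label, duration in seconds) per audio clip.
clips = [("dog_bark", 4.0), ("siren", 4.0), ("dog_bark", 1.5), ("dog_bark", 2.5)]
label_stats = summarize_labels([label for label, _ in clips])
duration_stats = summarize_durations([duration for _, duration in clips])
print(label_stats)
print(duration_stats)
```

Even such a small summary reveals class imbalance or unusually short clips before any training time is spent.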
Consider copying the data to the pod
This point applies when you a) use Docker (or any other container management software) and b) have a slow file system. If these requirements apply to your situation, you can think about copying the data to the container before experimentation. However, you have to balance the additional overhead of copying with the resulting reduced training times. A middle course is to read the data from the slow file system during the first pass only and cache it on a temporary disk (for example, a Kubernetes emptyDir volume backed by SSD or RAM) for all later passes.
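The middle course can be sketched with a small copy-on-first-access helper. This is only an illustration under assumptions: the function name, the directory names, and the idea of keying the cache by file name are hypothetical, and frameworks often ship their own caching utilities that you might prefer.

```python
import shutil
import tempfile
from pathlib import Path


def cached_path(source: Path, cache_dir: Path) -> Path:
    """Return a local copy of `source`, copying it on first access only.

    Assumption: file names are unique, because the cache is keyed by name.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / source.name
    if not target.exists():
        # First access: pay the copy cost once.
        partial = target.with_name(target.name + ".tmp")
        shutil.copyfile(source, partial)
        partial.replace(target)  # rename, so readers never see a partial file
    return target


# Demo with temporary directories standing in for the slow and fast storage.
with tempfile.TemporaryDirectory() as root:
    slow_file = Path(root) / "network_share" / "clip_0001.wav"
    slow_file.parent.mkdir()
    slow_file.write_bytes(b"fake audio bytes")
    fast_dir = Path(root) / "local_ssd"

    first = cached_path(slow_file, fast_dir)   # copies the file
    second = cached_path(slow_file, fast_dir)  # reuses the cached copy
    same_location = first == second
    same_content = first.read_bytes() == slow_file.read_bytes()
```

The first epoch still pays the price of the slow file system, but every later epoch reads from the fast local copy.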
Automate image building
This scenario is familiar: locally, you have changed your setup, but is this reflected in your containers? It’s annoying to realize mid-training that a dependency is missing. Thus, every time you update some packages locally, have this trigger an image build process. Alternatively, have a script that you start manually and that executes in the background without your supervision.
Decide on metrics to track in advance
Rating the success of a particular configuration is only possible when you have decided in advance which criteria to judge it by. Once the experiments have begun, there is little room to change the metrics post hoc. It therefore pays off to take time in the beginning to carefully choose metrics and have them collected in each run. To be extra safe for later ideas, store the ground truth and the predictions of each experiment. Having this data enables you to calculate additional metrics afterward.
Save predictions and ground-truth data
Even after you have decided on the metrics to track, you might still reach a point down the road when you think to yourself: Having this additional metric would be helpful. The worst option would be restarting all experiments after adding it; this could cost you months of compute time and thousands of dollars. The better approach is to be prepared and at least save the predictions on the test dataset together with the ground-truth data. Then, once you require further insights into your model’s performance, you can calculate them ad hoc. You can even go one step further and save the prediction-ground-truth pairs during training (which might require additional disk space and slow down training) to analyze the training progression in more detail. Downsampling the data, by capturing only a subset of the samples or saving only at fixed intervals, is a good compromise.
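To illustrate, here is a minimal sketch that stores prediction and ground-truth pairs in a CSV file and later computes a metric that was not planned at training time. The file layout and the function names are assumptions made for this example; any format that keeps the pairs together works just as well.

```python
import csv
import tempfile
from pathlib import Path


def save_predictions(path, y_true, y_pred):
    """Write one (ground truth, prediction) pair per row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["y_true", "y_pred"])
        writer.writerows(zip(y_true, y_pred))


def load_predictions(path):
    """Read the pairs back; note that CSV returns all values as strings."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [r["y_true"] for r in rows], [r["y_pred"] for r in rows]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of `positive` samples that the model actually found."""
    hits = sum(t == p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)


with tempfile.TemporaryDirectory() as run_dir:
    path = Path(run_dir) / "test_predictions.csv"
    # At the end of the experiment: store the raw pairs.
    save_predictions(path, ["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"])
    # Months later: compute a metric that was never tracked during training.
    y_true, y_pred = load_predictions(path)
    tracked_accuracy = accuracy(y_true, y_pred)
    late_recall = recall(y_true, y_pred, positive="dog")
```

Because the raw pairs are on disk, the recall for a single class can be computed without touching the model or the compute cluster again.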
Prepare for hyperparameter optimization
KerasTuner, Optuna, Weights & Biases, Ray Tune: the number of tools that find optimal hyperparameters for your problem is large. If you plan to run such an optimization at a (far away) later point, you can be smart early on. Usually, the optimization frameworks provide the next set of parameters through dictionaries or similar structures. Therefore, if you design your code to load hyperparameters from a dictionary, changing the parameters is as simple as passing a different configuration. Taking this into account early on reduces overhead when integrating the frameworks.
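A minimal sketch of this design could look as follows. The parameter names, the default values, and the toy train function are hypothetical; the hand-written loop merely stands in for whatever tuning framework later supplies the dictionaries.

```python
DEFAULT_CONFIG = {"learning_rate": 1e-3, "batch_size": 32, "dropout": 0.1}


def build_config(overrides=None):
    """Merge overrides into the defaults and reject unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"Unknown hyperparameters: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


def train(config):
    """Stand-in for a real training run; returns a toy score.

    The toy score simply peaks at a learning rate of 0.01.
    """
    return -abs(config["learning_rate"] - 0.01)


# A tuner would propose these dictionaries; here they are written by hand.
candidates = [{"learning_rate": 0.1}, {"learning_rate": 0.01}, {"learning_rate": 0.001}]
best_overrides = max(candidates, key=lambda c: train(build_config(c)))
print(best_overrides)
```

Since train only ever sees a dictionary, swapping the manual loop for a tuning framework later does not require changes to the training code itself.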
Checkpoint your training
On March 11, 2022, a collective of more than 900 researchers from all around the world started the training of the world’s largest open-source language model. Running on 384 A100 Nvidia GPUs, the training is estimated to take three months or more (you can follow the progress here). At such high stakes, it is exceptionally costly if the training fails for whatever reason. Therefore, it’s essential to checkpoint the training state regularly to avoid losing progress. That is what the engineers involved did, and even with frequent checkpointing, there’s always the possibility of hardware failures. Of course, losing seven hours of progress hurts, but it’s way better than starting anew.
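The principle behind checkpointing can be sketched without any deep learning framework. In the following toy example, the state is a small dictionary stored as JSON, and the failure is simulated; all names are hypothetical. In practice, you would rely on the checkpointing utilities of your framework, which also handle optimizer state and large weights.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state atomically so a crash never leaves a half-written file."""
    partial = path.with_name(path.name + ".tmp")
    partial.write_text(json.dumps(state))
    os.replace(partial, path)


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "weights": [0.0]}


def train(path: Path, total_epochs: int, fail_at=None) -> dict:
    """Resume from the last checkpoint and save after every epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if epoch == fail_at:
            raise RuntimeError("simulated hardware failure")
        # Stand-in for one epoch of real training.
        state["weights"] = [w + 1.0 for w in state["weights"]]
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "state.json"
    try:
        train(checkpoint, total_epochs=5, fail_at=3)
    except RuntimeError:
        state_after_crash = load_checkpoint(checkpoint)
    final_state = train(checkpoint, total_epochs=5)  # picks up at epoch 3
```

The second call to train does not repeat the first three epochs; it only loses the work done since the last saved checkpoint.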
Have a small subset
Jeremy Howard and Rachel Thomas founded the machine learning research company fast.ai to make neural networks accessible to everyone. On a related note, Howard also provided the Imagenette dataset, which is a manageable subset of the excruciatingly large ImageNet dataset. ImageNet has probably received the most attention of any dataset in the last few years. There are two reasons. The first reason is its status as a challenging benchmark, fueled by yearly competitions. The second reason is the sheer size. This dataset takes more than 250 GB on disk, and one usually needs fast accelerators to tackle the dataset in a reasonable time. That is where the mentioned Imagenette comes into play. Having a small subset allows you to evaluate novel ideas more quickly.
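If no ready-made subset exists for your data, you can build one yourself. The following sketch draws a fixed number of samples per class with a fixed seed, so the subset stays the same across runs. Note that this is just one possible way to create a subset and not how Imagenette was built; the function name and the toy data are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_subset(samples, labels, per_class, seed=0):
    """Pick up to `per_class` samples of every label, reproducibly."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed: the same subset on every run
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        chosen = rng.sample(items, min(per_class, len(items)))
        subset.extend((sample, label) for sample in chosen)
    return subset


# Hypothetical toy dataset: 100 sample ids with an imbalanced label set.
sample_ids = list(range(100))
sample_labels = ["common"] * 90 + ["rare"] * 10
small = stratified_subset(sample_ids, sample_labels, per_class=5)
print(small)
```

Drawing per class keeps rare classes represented, which a plain random sample of the same size might miss.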
Make parallelization easy
If you are lucky and have access to more than one accelerator, making the code ready for parallelization in advance is worth considering. In TensorFlow, you do this by placing all routines that create variables inside the scope of a Strategy object. This is slightly more effort at the moment but makes distributing your code much easier in the long run. In other words, if you know in advance that more accelerators will be available, it pays off to prepare for parallelization.
Use significance tests
Let’s assume the following situation. You have experimented with two configurations and now compare the results. The first run has achieved an accuracy of 75%, while the second run achieved 79%. Looking at this metric alone suggests that the second configuration is better. However, the question is, is it actually better, or was it just luck? We can investigate this case by running a significance test. Here, you compare two hypotheses. Hypothesis one is that there’s no difference, and hypothesis two says there is a difference (for our purpose: without indicating the direction). We can now examine these hypotheses in detail using various tests and, ultimately, conclude whether the difference between our experiments is statistically significant. I recommend the Statology website, which has plenty of hands-on tutorials related to data analysis.
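One of the many possible tests is a paired permutation test, sketched below. It assumes that you have several paired scores per configuration, for example one accuracy per cross-validation fold; a single run per configuration is not enough. Under the hypothesis of no difference, the sign of each per-fold difference is arbitrary, so the test counts how many sign assignments produce a total difference at least as extreme as the observed one. The scores below are invented for illustration.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired scores; returns the p-value.

    Enumerates all 2**n sign patterns, so keep n small (e.g. ten folds).
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-fold accuracies of two configurations on the same ten folds.
config_a = [0.74, 0.76, 0.75, 0.73, 0.77, 0.75, 0.74, 0.76, 0.75, 0.75]
config_b = [0.79, 0.75, 0.80, 0.78, 0.76, 0.81, 0.79, 0.80, 0.78, 0.80]
p_value = paired_permutation_test(config_a, config_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value (commonly below 0.05) suggests that the observed difference is unlikely to be pure luck; which threshold to use remains your decision.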
Check for regression
On February 23, 2022, I trained a ResNet152 on the UrbanSound8K dataset. This dataset is split into ten folds, and a single fold took ~30 minutes for 100 training epochs. In total, the training on the complete dataset took 300 minutes or five hours. Two months later, a single epoch took 15 minutes.
What has happened? This case, as it turned out, was a regression. Observing the February experiment, I noticed that the validation scores were drastically lower than the training scores. To reduce this discrepancy, at that time, I decided to add audio augmentations. Because I had previously worked with the Audiomentations library (and even built a GUI to visualize the transformations), I once again relied on its features. This library offers several audio-related transformations, such as pitch shifting or time-shifting. After adding this functionality, I checked if the validation score had increased. Seeing that this was indeed the case, I focused on other tasks.
However, as I found out two months later, and as is clearly described in the documentation, the transformations run on the CPU and only on a single sample at a time. This detail turned out to be a huge bottleneck, especially given the sizeable UrbanSound8K dataset. So, to cut it short, I decided to re-implement the critical augmentations in pure TensorFlow and made them work on batched data. These modifications then reduced a single epoch to ~30 seconds. So the moral of this excursion is a) check the entire data pipeline, and b) don’t mindlessly add cool stuff.
Monitor your data pipeline
Last year, I had the chance to read Building Machine Learning Pipelines by Catherine Nelson and Hannes Hapke. In their highly practical (and relevant) book, they cover the process of setting up automated model deployment pipelines. Though they use the TensorFlow Extended library as the core pipeline tool, their ideas and tips are not restricted to this framework. The point relevant for a machine learning project is that their advice also applies to any ML data pipeline. For example, a small shape error or a wrong normalization operation can cause havoc down the line. And, as the previous point underlines, any further transformation affects the pipeline’s speed. Therefore, it’s good to regularly check your pipeline’s throughput, at least after you’ve added critical operations such as augmentations.
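Such a throughput check does not need to be elaborate. The sketch below times how many samples per second an arbitrary batch iterator delivers and compares a plain pipeline with one that contains a slow step. The two generator functions are toy stand-ins, and the sleep call merely imitates expensive per-sample CPU work such as an augmentation.

```python
import time


def measure_throughput(batches, batch_size, max_batches=50):
    """Return the samples per second delivered by the iterable `batches`."""
    start = time.perf_counter()
    seen = 0
    for _ in batches:
        seen += 1
        if seen >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return seen * batch_size / elapsed if elapsed > 0 else float("inf")


def plain_pipeline(n_batches, batch_size):
    """Toy pipeline that yields batches without any processing."""
    for _ in range(n_batches):
        yield [0.0] * batch_size


def augmented_pipeline(n_batches, batch_size, cost_per_sample_s=0.0002):
    """Toy pipeline with a slow step that is applied sample by sample."""
    for batch in plain_pipeline(n_batches, batch_size):
        time.sleep(cost_per_sample_s * batch_size)  # imitates slow CPU work
        yield batch


baseline = measure_throughput(plain_pipeline(10, 32), batch_size=32)
with_augmentation = measure_throughput(augmented_pipeline(10, 32), batch_size=32)
print(f"baseline: {baseline:.0f} samples/s, augmented: {with_augmentation:.0f} samples/s")
```

Running such a measurement before and after every change to the pipeline would have revealed the regression described above within seconds instead of months.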
Have a clean and approved labeling process
Data is at the heart of the models we train. If we are lucky, our datasets might be of high quality and, especially important for classification or similar tasks, already labeled. If this is not the case, then we often have to devise a labeling pipeline from scratch or modify existing code. In both cases, having a clean and approved pipeline is important. What do I mean by clean and approved? First, clean is the opposite of complex and messy. What we have to avoid is writing code that is dispersed over multiple files and is challenging to understand. Ideally, the pipeline is fully automated, requiring only minor manual intervention. This leads us to the second point, approved. A pipeline can be created very quickly, but it is its quality that counts. If our labeling process is faulty, the data we use might cause harm down the line. We can prevent this and improve the quality by discussing our pipeline with coworkers and subject matter experts. Exchanging ideas with them helps us correct errors and leads to a much more rounded, accepted pipeline.
Don’t get into perfectionism
Often, we are caught in the moment. The idea for just another improvement strikes our mind, and before we notice, we have spent an hour deep down in the codebase. As good as this sounds, save such time-consuming steps for the end. Especially in the beginning, focus on building the core.
Implement core features first
At the beginning of a project, you have this really long (imaginary) to-do list. This situation can be overwhelming and lead us to implement features in no particular order. This approach, of course, is not sustainable; the result often depends on a few core features. Therefore, it is better to focus on these first and add further ones later.
Document your lessons learned
Chances are, this project won’t be your last. And this is good! What’s not so good is to repeat mistakes. To be a little bit better the next time, it helps to note down the lessons learned. Have a single sheet of paper for each project and add an entry whenever you realize that something could have been easier. You condense your experience and have a customized and inverted getting-started list the next time.
More time does not necessarily help
The invention of the Pomodoro technique underlines one fact: When given only limited time, we can be as productive as with more relaxed constraints. The machine learning domain is no outlier. Allowing yourself only a fixed amount of time to create value puts you under pressure to deliver. Surprisingly, tackling your projects in blocks as short as 25 minutes, followed by 5-minute breaks, can boost productivity. Experiment with this approach, which I learned from A Mind for Numbers, and see where you can apply it.
Practice the art of nonfinishing
Let’s consider the following case: You train a neural network to classify flower images. Carefully observing the accuracy scores, you see room for improvement, so you start by choosing another optimizer and increasing the batch size. In the meantime, your colleagues found that training your own plant classifier can be skipped altogether; you decide on using a pre-trained network instead. But perfectionism has now taken hold of you. Before you ditch the (now obsolete) task, you want to prove that you can reach outstanding test results. In the end, you might then have an excellent solution for a problem that no longer exists. For every party involved, this is not a satisfying situation. So be relentless in not finishing obsolete tasks.
Stick to today’s task list
This one is short. In general, avoid getting lost in a single subtask; instead, strive to work through the to-do list item by item. I recommend reading Cal Newport’s Deep Work (especially p. 223 ff.) for more on this.
Order the daily task list
This is a simple trick. You’ll make progress faster by ordering your tasks by importance. By tackling the least important ones last, you will have spent most of your time and cognitive resources on the main features.
Don’t overload the daily task list
It requires experimentation and self-reflection to roughly assess in advance how much you can tackle in a work session. I often added just another thing, which piled up to very long lists. In the beginning, stick to shorter lists first and then expand as you learn. Psychologically, the effect of seeing a cleared to-do list is highly satisfying. In contrast, it is de-motivating to have undone tasks at the end of the day.
Regularly exchange with co-creators
The Einstellung effect (c.f. A Mind For Numbers, p. 17) states that once you have arrived at a problem-solving approach, you are highly biased to stick with it and pay less attention to alternative solutions. We experience this phenomenon in our everyday life, such as when we look for our keys and frantically search only around the place we deem most probable. However, when we step back instead of getting caught in the moment, we are more open to other ideas, which helps us realize that the key might be in our pocket. Similar situations can arise in projects where your ego leads you to believe in your own approach. Though it takes self-confidence to stop working on one’s solution, getting the intelligent input of peers moves you forward faster.
Embrace constructive criticism
Rarely do you work on a project alone, and nearly every time, you receive outside influence. This influence need not come from co-workers but can also stem from video lectures or blog entries (such as this one). Regardless of the case, constructive input can propel us forward and point to things we have missed. True, it takes self-confidence to have one’s ideas criticized, but such feedback is especially powerful if it comes from those who have been there already. You both honor the idea-giver and progress faster by considering their input and adapting it to your problem.
Guard your off-time
In his book Digital Minimalism, author and Computer Science professor Cal Newport proposes cultivating one’s leisure time with quality activities (p. 165 ff). This idea also relates to working on machine learning projects. Here, it’s not necessarily the raw hour count you put in but the amount of time spent in what Cal Newport popularized as a state of deep work. However, it’s evident that cognitive resources are finite; we need time to recharge. And this recharge is best done by working with one’s hands instead of the brain.
Rather than spending even more time glued to our screens, we should strive for more manual actions. The beauty of this approach is that it uses the diffuse attention that Barbara Oakley describes in A Mind For Numbers (c.f. p. 11 ff). While we are away, seemingly not doing anything project-related, our minds can wander and develop creative ideas.
Don’t pursue too many projects at once
Make this; do that; work there: The more projects we (have to) run in parallel from the same domain (i.e., not one from chess and the other from machine learning), the more easily we can get overwhelmed by the constant demand on our attention. Up to a certain point, such a situation can stimulate our brains; we benefit from different angles. However, there’s a threshold beyond which this cross-pollination stops paying off, namely when we start jumping from one activity to another. This situation keeps us from getting distance from the work because we are already working on the next thing from the same domain. To give our brains space to breathe, we should limit the number of projects we pursue concurrently.
Tackle a problem from multiple angles
Are you stuck on a challenging problem? Try alternating between focused and diffuse thinking. I have learned this from Barbara Oakley’s A Mind For Numbers, which explains these modes in detail. The gist is that we use (and usually know of) only the focused one by default. This mode is active when you are concentrating on a problem and get your hands dirty trying multiple approaches. However, we might get stuck because we are stuck; it’s a circle. By concentrating further, we just add another iteration of the endless loop. What comes to our help is diffuse thinking. After investing considerable amounts of intensive (mind) work, you let the insights learned bounce around your brain. It’s like playing pinball (this analogy is by Barbara Oakley): you put in enough effort to get the ball rolling and let your brain’s bumpers do the rest. We can activate this mode by going for a walk and letting our minds wander, for example.
Be productive, not busy
Productivity is making measurable progress; busyness is inventing measures to hide your non-progress.
Use project management tools
There’s often already a process in larger organizations that states how to organize projects. If that’s not the case, or you are the only member, think about using dedicated project management software. At first, learning yet another tool sounds like more work. However, that’s only for the moment. Cleverly organizing your work frees your cognitive resources for three reasons:
- You collect all project-related data in a single place — no more frantic searching for information.
- All (upcoming) steps are clear at any moment, giving you direction.
- You can visually see the progress by updating the tool throughout.
If digital tools are not your thing, you can replicate the general ideas using a whiteboard or blackboard and manually track your progress. If this topic sounds interesting, check out Personal Kanban by Jim Benson & Tonianne DeMaria Barry and A World Without Email (especially chapters five onward) by Cal Newport.