January 26, 2019
I wrote a short tutorial with Tensorpad to help Machine Learning developers use Cloud GPUs. In it, you’ll build an animal classifier and train it on the Tensorpad platform. Cloud training is an important new component of my workflow, so I wanted to share the motivation behind the tutorial.
I’ve been experimenting with Machine Learning workflows for the past 2 years. It’s been difficult to determine my needs. Which tools? How much data? How much computing power? Notebooks? What version of Python?
At first, I used a Late-2011 MacBook Pro running TensorFlow on Docker via Jupyter Notebooks. It was OK for learning how to develop models but terrible for training them. My laptop constantly crashed whenever I tried to train even simple models. Heck, it could barely transform image batches. It didn’t have enough RAM or a CUDA-supported GPU, and it had only 10 GB of free HDD space.
I decided it’d be a good idea to build a low-end Machine Learning machine. I wanted to train deeper models on larger image datasets and iterate on them faster. It needed plenty of RAM, an SSD, and a GPU with CUDA support. I chose a low-end GPU with the intention of upgrading it in the future. This was a pretty good plan… but I completely underestimated the difficulty of setting up the development environment. I spent days configuring the machine before I could train TensorFlow models on the GPU.
This machine worked really well for about 6 months. Then the scope of my projects started to change. I wanted to handle much more data, train multiple models in parallel, and, most importantly, train faster.
I figured I could either build a high-end machine or shift to cloud training: a classic on-premises vs. hosted debate. After much contemplation, I decided against building a more powerful machine. Why? Setting it up would not be easy, especially with the complex workflows I wanted to implement. It’d be well worth the money to pay someone to handle the infrastructure so I could focus on what I really care about: Machine Learning.
The question became: how do I integrate cloud training into my workflow? I’ve found that building a model locally and then using the cloud for training allows me to iterate much more quickly. This is where Tensorpad comes in. They offer 1080Ti GPU containers with a JupyterLab interface that mirrors my local setup. I can run my notebooks and scripts on their cloud with minimal changes, and I can run different models in different containers.
I wouldn’t say that my workflow is perfect yet. I could still improve parallel training coordination, model management, and data management. But this is a great step in the right direction.
Written by Peter Chau, a Canadian Software Engineer building AIs, APIs, UIs, and robots.