...

When I first started working at Clusterone, I was tasked with building a basic self-steering car demo.

I was both excited and daunted. Excited because it sounds (and is) cool. Daunted because, come on, that’s a hard problem to tackle, even at the small scale and scope I had set for myself (steering only, no time series, image input and not video).

And the demo had to be distributed.

Gulp.

Building TensorFlow models on their own is not easy. Distributed computing is hard. Distributed TensorFlow? I could feel the sleepless nights coming.

I had built some TensorFlow models before for classwork at Stanford, and the experience had been great, though painful. When the stack trace finally makes sense (kind of), and your model runs, and that loss finally goes down without overfitting too much, you realize that deep learning is amazing and TensorFlow is an extraordinary tool.

Writing TensorFlow code, though, requires a lot of time and effort, as does training the model and tuning hyper-parameters. On top of that, I had to worry about the environment setup, moving data around, collaborating with my teammates, and so on. Under time pressure, setting up the right environment (though the course material provided guidelines) is both a critical step and a massive time overhead.

You want to get coding, fast. So I made the typical first-timer mistake: I rushed it.

My team didn’t have a lot of experience with git, so we quickly stopped using it. We had quite a few data versions as we tested various image sizes and data augmentation techniques, so we were moving a lot of data around, and we got lost more than once because our pipeline wasn’t rigorously defined.

We lost time using the wrong script because it wasn’t versioned. We lost money forgetting to shut down our Google Cloud Platform instances.

Some of it was inexperience, but a lot of it could have been fixed either by building a clean environment for ourselves (which we realistically didn’t have time for) or by using a more streamlined platform.

So when I heard about what Clusterone was building, I got very excited. Clusterone recently announced a computation platform that easily runs distributed TensorFlow jobs, with no infrastructure setup, a streamlined workflow, and one-click access to TensorBoard.

Building a self-driving car demo would be a good test of how well the platform worked.


The Demo

Self-steering code on GitHub

The demo was inspired by comma.ai’s “Learning a Driving Simulator”, though I simplified it to handle steering only. The data also comes from that project, as does part of the data input pipeline.

I am reworking that pipeline now to support queuing, run the input as a py_func (it sounds like tf.py_func cannot handle generators as inputs; see more below), and speed things up.

The model is a super-simple 3-layer CNN + 2 fully-connected layers, with dropout. A lot of improvement is possible on that side! The model was written in Keras in the original comma.ai repo, but I rewrote it in TensorFlow, as distributed TensorFlow with Keras was not well supported at the time, even though there seems to have been some progress and someone wrote an example a few days ago.
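
As a rough sketch of that architecture (the layer widths and kernel sizes below are illustrative assumptions, not the exact values in the repo):

```python
import tensorflow as tf

# Illustrative sketch of a 3-conv + 2-dense steering model with dropout.
# Filter counts and kernel sizes are assumptions, not the repo's exact values.
def steering_model(images, keep_prob=0.8):
    x = tf.layers.conv2d(images, 16, 8, strides=4, activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 32, 5, strides=2, activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 64, 5, strides=2, activation=tf.nn.relu)
    x = tf.layers.flatten(x)
    x = tf.nn.dropout(x, keep_prob)
    x = tf.layers.dense(x, 512, activation=tf.nn.relu)
    x = tf.nn.dropout(x, keep_prob)
    return tf.layers.dense(x, 1)  # predicted steering angle
```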

There is no accuracy metric yet (work in progress), but the original code has some pretty cool visualizations of the actual steering vs. the predicted steering. Those were rendered with pygame, which I thought was too heavy, so I rewrote that part so that road images and predicted steering overlays are exported directly to TensorBoard, which I think is kind of cool. (By the way, I hear there might be some TensorBoard video summary plugins coming at some point.)
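
For what it’s worth, here is a minimal sketch of how such an export can look. The tensor names are assumptions, and the actual repo composites the steering overlay onto the frame before logging, which I skip here.

```python
import tensorflow as tf

# Minimal sketch (assumed names): `frames` is a batch of road images with
# shape [batch, height, width, 3] and `predicted_angle` a [batch] tensor.
def add_steering_summaries(frames, predicted_angle, max_outputs=3):
    # Log a few road frames; in the real code the predicted steering
    # overlay is drawn onto each frame before it is logged.
    tf.summary.image("road_frames", frames, max_outputs=max_outputs)
    # Log the distribution of predicted angles alongside the images.
    tf.summary.histogram("predicted_steering_angle", predicted_angle)
    return tf.summary.merge_all()
```

MonitoredTrainingSession (used further down) picks up these summaries and writes them without an explicit summary writer.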



Data and Data Input Pipeline

The dataset can be downloaded here and is also available pre-loaded on Clusterone.

From comma.ai:

"It consists of videos clips of variable size recorded at 20 Hz with a camera mounted on the windshield of an Acura ILX 2016. In parallel to the videos some measurements such as car speed, acceleration, steering angle, GPS coordinates, gyroscope angles were also recorded. The data is stored in .h5 files."

The data input pipeline is just a thin wrapper around comma.ai’s code. As I was trying to improve performance, though, I realized that it will need to be completely refactored in the future, because it relies on a Python generator to yield data rows.

I wanted to wrap it in a py_func, add it to the graph, and feed it into tf.train.batch, since that function automatically enqueues data batches (in essence, asynchronously buffering the data) and enables a speed-up that is even more critical in a distributed setting.

I realized, however, that py_func requires a plain Python function with a return statement as input and does not accept generators, so naively plugging the generator in did not work. Instead, I will have to rewrite some of that code.
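
To make the constraint concrete, here is a minimal sketch of the pattern tf.py_func and tf.train.batch expect: a plain function that returns numpy arrays. The shapes and the stand-in load_row are assumptions, not the comma.ai pipeline.

```python
import numpy as np
import tensorflow as tf

# Hypothetical loader: tf.py_func needs a plain function that *returns*
# numpy arrays, not a generator that yields them.
def load_row():
    image = np.zeros((160, 320, 3), dtype=np.float32)  # stand-in road frame
    angle = np.zeros((1,), dtype=np.float32)            # stand-in steering angle
    return image, angle

image, angle = tf.py_func(load_row, [], [tf.float32, tf.float32])
# py_func loses static shape information, so set it back explicitly.
image.set_shape([160, 320, 3])
angle.set_shape([1])

# tf.train.batch uses queue runners that prefetch batches asynchronously.
images, angles = tf.train.batch([image, angle], batch_size=32, capacity=256)
```

The queue runners that tf.train.batch relies on are started automatically by tf.train.MonitoredTrainingSession, which I use below.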


Lesson learnt: py_func

Doing some reading, I realized that py_func is not meant to be used unless no alternative exists. It essentially embeds a Python function in the graph, meaning that code will probably be slower than the C++ code most low-level TensorFlow functions are written in. Instead, one would have to use TensorFlow’s native data input functions or convert the data to a supported format.

.h5 files do not have a native TensorFlow reader as of yet, and converting the comma.ai dataset (which supports distributed storage thanks to .h5) to another format does not sound like an ideal scenario. The only options left are to stick with a non-TensorFlow input pipeline or to write a custom TensorFlow .h5 reader (if someone has suggestions, I’d like to hear them ;))
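
Until then, the non-TensorFlow option looks roughly like this: read the .h5 files with h5py in a plain Python generator and feed batches through feed_dict. The dataset keys ('X' for camera frames, 'steering_angle' for the log) and the 1:1 alignment between the two files are assumptions for illustration, not verified field names.

```python
import h5py
import numpy as np

# Sketch of a plain-Python batch generator over the .h5 files.
# Keys and alignment are assumptions about the comma.ai file layout.
def batch_generator(camera_path, log_path, batch_size=32):
    with h5py.File(camera_path, "r") as cam, h5py.File(log_path, "r") as log:
        n = cam["X"].shape[0]
        for start in range(0, n - batch_size, batch_size):
            frames = np.asarray(cam["X"][start:start + batch_size],
                                dtype=np.float32)
            angles = np.asarray(log["steering_angle"][start:start + batch_size],
                                dtype=np.float32)
            yield frames, angles

# Usage (paths and placeholder names are hypothetical):
# for frames, angles in batch_generator("camera.h5", "log.h5"):
#     sess.run(train_op, feed_dict={images_ph: frames, angles_ph: angles})
```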



Distributed TensorFlow setup

After writing the model (I won’t talk about it here, as it is very simple), the next step was to make it run in a distributed fashion. That is quite easy when using tf.train.MonitoredTrainingSession, especially since I was running on Clusterone, so I didn’t need to take care of either (a) cluster management or (b) generating the tf.train.ClusterSpec, as both were handled by the Clusterone client.

All I had to do was:

  • call Clusterone’s device_and_target function
  • define the Graph in a with tf.device(...) block
  • set up a tf.train.StopAtStepHook that will act as a counter of the number of iterations.
  • run the training operations inside a with tf.train.MonitoredTrainingSession block. That session essentially takes care of allocating tasks to the workers and parameter servers, saving checkpoints, and writing TensorBoard summaries (you don’t even have to set up a summary writer). See main_tf.py in the repo, and the sketch below.
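
To make that structure concrete, here is a minimal sketch of the training entry point. The device, target, and is_chief values are assumed to come from Clusterone’s device_and_target helper and the cluster configuration; build_model and next_batch are illustrative stand-ins, not the actual code in main_tf.py.

```python
import numpy as np
import tensorflow as tf

def build_model(images):
    # Tiny stand-in for the CNN sketched earlier.
    x = tf.layers.conv2d(images, 16, 5, strides=2, activation=tf.nn.relu)
    x = tf.layers.flatten(x)
    return tf.layers.dense(x, 1)

def next_batch(batch_size=32):
    # Dummy batch; in the real code this comes from the comma.ai pipeline.
    return (np.zeros((batch_size, 160, 320, 3), np.float32),
            np.zeros((batch_size, 1), np.float32))

def train(device, target, is_chief, checkpoint_dir, num_steps=1000):
    with tf.device(device):  # device comes from device_and_target
        global_step = tf.train.get_or_create_global_step()
        images = tf.placeholder(tf.float32, [None, 160, 320, 3])
        angles = tf.placeholder(tf.float32, [None, 1])
        loss = tf.losses.mean_squared_error(angles, build_model(images))
        train_op = tf.train.AdamOptimizer(1e-4).minimize(
            loss, global_step=global_step)

    # StopAtStepHook acts as the iteration counter mentioned above.
    hooks = [tf.train.StopAtStepHook(last_step=num_steps)]

    # MonitoredTrainingSession handles checkpoints and summaries, and talks
    # to the workers and parameter servers through `target`.
    with tf.train.MonitoredTrainingSession(master=target, is_chief=is_chief,
                                           checkpoint_dir=checkpoint_dir,
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            frames, labels = next_batch()
            sess.run(train_op, feed_dict={images: frames, angles: labels})
```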


Results

I was able to train the model in a distributed fashion on 5 GPUs, and got a nicely decreasing loss:

self-driving-car-tensorboard-training-loss

I also got a convenient visualization of the predicted steering on TensorBoard:

self-steering-car-images

Performance

Distributed computing sounds cool, right? But what about the speed-up (and the cost)? Is it worth spending the time to configure all this?

It looks like it is!

self-driving-car-performance

Running on a single GPU got me a ~3x speed-up vs. an 8-CPU machine (AWS 4.xlarge). Running on 5 GPU machines got me a ~14x speed-up in steady state.

Cost-wise, the spend is nearly the same on Clusterone (roughly $.005/step), meaning that fast training didn’t cost me more than slow training would have. That’s great news if I want to get that self-steering model working quickly!



Future work

I’m very aware that this model is extremely simple and the hyper-parameters are not well tuned (on top of there being no accuracy metric defined). All of this is work in progress, and I have a few ideas for improvements:

  • Rewrite the data input pipeline to enable queuing with tf.train.batch and get a speedup
  • Add validation loss, asynchronously
  • Add more TensorBoard visualizations
  • Switch from this naive approach to an RNN and work on image sequences rather than single frames, since a single image doesn’t really make sense for a car.
  • Exploit the data better (we are only using a very small part of the logs right now)
  • Rewrite the model in distributed Keras (although it is simple enough not to really need this)

All in all, building this basic distributed model was a great experience, and I’m looking forward to building more. I also noticed there is a lack of posts and resources on distributed TensorFlow, so if you find interesting articles or repos, you’ll make my day if you send them my way ❤️.


What’s next?

On the side, I’m also helping build a new feature called just init (to be released soon) that will deploy a model scaffolding anyone can use to rapidly build a distributed TensorFlow model on Clusterone. Stay posted!