Scale Your Deep Learning AI with Distributed Machine Learning

Develop powerful distributed machine learning applications with Clusterone, allowing you to get the most out of your implementation.

Machine learning is breaking records everywhere. Ever larger models are trained in ever shorter time and achieve results that were nothing but dreams only a few years ago. A major driver of this progress is distributed machine learning, which allows developers to rapidly scale their systems across many machines.

Which Road to Take in Distributed Machine Learning?

When designing a system for distributed machine learning, there is no single right path. Depending on your application, it might be useful to split the model itself across several devices for training (model parallelism), or to train copies of the same model in parallel so they can work through more training data in less time (data parallelism).

Thankfully, frameworks like TensorFlow include everything you need to parallelize your code. On top of that, services like Clusterone can manage the complex infrastructure of computer clusters for you, making distributed machine learning as easy as programming for a single machine.
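To give a sense of how little code this takes, here is a minimal sketch, assuming TensorFlow 2.x and its tf.distribute API (the model and layer sizes are placeholders): wrapping model construction in a distribution strategy's scope is essentially the only change compared to single-device code.

```python
import tensorflow as tf

# Synchronous data parallelism across all GPUs visible on this machine.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Building and compiling the model inside the strategy scope is the only
# change from single-device code; Keras replicates the variables for us.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(...) then splits each global batch across the replicas.
```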

[Figure: Distributed machine learning speedup]

Model Parallelism vs. Data Parallelism

Let's take a look at two common ways to achieve distributed learning: model parallelism and data parallelism.

In Model Parallelism, the model itself is distributed across multiple devices. Each device runs a part of the model and trains it. This method is applied when the model becomes too large to fit on a single GPU. To efficiently run a distributed model, complex synchronization is required to ensure all parts of the model interact with each other correctly.
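As a rough illustration, the sketch below places the two halves of a small network on two different GPUs. It assumes TensorFlow 2.x and a machine where the devices "/GPU:0" and "/GPU:1" exist; the class and layer names are placeholders, and a real model-parallel setup would also need to overlap computation and communication carefully.

```python
import tensorflow as tf

# Illustrative model parallelism: the two halves of the network live on
# different GPUs, and activations are copied between them on every call.
class TwoDeviceModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.block_a = tf.keras.layers.Dense(4096, activation="relu")
        self.block_b = tf.keras.layers.Dense(10)

    def call(self, inputs):
        with tf.device("/GPU:0"):
            x = self.block_a(inputs)   # first half runs on GPU 0
        with tf.device("/GPU:1"):
            return self.block_b(x)     # second half runs on GPU 1
```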

With Data Parallelism, the model is duplicated onto several machines. Each machine trains on a separate slice of the training data, leading to significant increases in training speed. The most common approach is to use one dedicated machine to store the model parameters.

This parameter server sends the model parameters to multiple worker machines, each of which runs the training procedure on a small batch of data and then sends its parameter updates back to the parameter server.
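The classes and names below are purely illustrative, not a real framework API, but they show the basic protocol: the parameter server owns the weights, and each worker pulls them, computes a gradient on its own mini-batch, and pushes the update back. The sketch uses plain NumPy and a least-squares gradient to stay self-contained.

```python
import numpy as np

class ParameterServer:
    """Owns the model parameters and applies incoming updates."""

    def __init__(self, num_params, lr=0.1):
        self.weights = np.zeros(num_params)
        self.lr = lr

    def pull(self):
        # Workers fetch the current parameters before computing gradients.
        return self.weights.copy()

    def push(self, gradient):
        # Workers send gradients back; the server applies a gradient step.
        self.weights -= self.lr * gradient

def worker_step(server, batch_x, batch_y):
    # One worker iteration: pull parameters, compute the gradient of a
    # least-squares loss on the local mini-batch, push the update back.
    w = server.pull()
    error = batch_x @ w - batch_y
    gradient = batch_x.T @ error / len(batch_y)
    server.push(gradient)
```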

Synchronous or Asynchronous Training?

There are two ways to implement data parallelism. With synchronous training, all workers read the model parameters at the same time, compute their updates, and then wait until every update has reached the parameter server before starting the next training cycle. This sometimes leads to faster convergence for the same number of steps, but it slows down wall-clock training time because fast workers sit idle while they wait for the slowest one.

Using asynchronous training, workers read and write parameters whenever it is convenient for them. This requires slightly more housekeeping to keep the model consistent, since a worker may compute its update against parameters that have already changed in the meantime (so-called stale gradients), but it significantly increases training throughput because no worker ever waits for another.
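Reusing the toy ParameterServer and worker_step from the sketch above (again purely illustrative), the difference between the two schedules comes down to when the server applies the updates: synchronously, all workers compute gradients against the same parameter snapshot and the server applies one averaged update per round; asynchronously, each worker's update is applied as soon as it arrives.

```python
import numpy as np

def synchronous_round(server, worker_batches):
    # All workers read the same snapshot; the server waits for every
    # gradient and applies a single averaged update for the round.
    w = server.pull()
    grads = []
    for batch_x, batch_y in worker_batches:
        error = batch_x @ w - batch_y
        grads.append(batch_x.T @ error / len(batch_y))
    server.push(np.mean(grads, axis=0))

def asynchronous_round(server, worker_batches):
    # Each worker pulls, computes, and pushes on its own schedule, so later
    # workers already see (and may partly overwrite) earlier updates.
    for batch_x, batch_y in worker_batches:
        worker_step(server, batch_x, batch_y)
```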

[Figure: Synchronous vs. asynchronous parallelism]

Ready to dive deep into distributed machine learning? Clusterone offers everything you need to run your machine learning code on a cluster of GPUs in minutes.