Machine learning is gaining traction day by day, largely owing to the ever-increasing amounts of data and compute available for training models. However, as data sizes grow, so does the complexity of training these algorithms at scale, especially in a distributed environment. The complexity is compounded by the fact that many machine learning algorithms are iterative, with each iteration depending on the results of previous ones. Because this dependence limits the parallelism available during training, several asynchronous methods have been proposed as alternatives. In both the synchronous and asynchronous settings, it becomes imperative to find allocation strategies that maximize resource usage in a distributed environment while still maintaining an acceptable level of accuracy.
We start by surveying existing systems for training ML algorithms at scale, noting the advantages and disadvantages of the various approaches. We then study the resource-utilization patterns of these algorithms on various datasets, using TensorFlow as the framework for running experiments in a distributed setting with both synchronous and asynchronous modes. Finally, we discuss models for allocating resources effectively so that these algorithms can be trained efficiently.
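The synchronous/asynchronous distinction at the heart of the experiments can be illustrated with a minimal toy simulation. This is a conceptual sketch in plain Python, not the TensorFlow API used in the study; the function names, the quadratic objective, and the staleness model are all illustrative assumptions. In the synchronous mode every worker's gradient is aggregated before a single update is applied; in the asynchronous mode each worker pushes its update independently, possibly computed from stale parameters.

```python
import random

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)

def synchronous_sgd(workers=4, steps=50, lr=0.1):
    # Synchronous mode: each step waits for all workers, averages their
    # gradients, and applies one combined update to the parameters.
    w = 0.0
    for _ in range(steps):
        g = sum(grad(w) for _ in range(workers)) / workers
        w -= lr * g
    return w

def asynchronous_sgd(workers=4, steps=50, lr=0.1, max_staleness=3, seed=0):
    # Asynchronous mode: each worker applies its update without waiting,
    # using a parameter copy that may be up to `max_staleness` updates old.
    rng = random.Random(seed)
    history = [0.0]          # past parameter values workers may have read
    w = 0.0
    for _ in range(steps * workers):
        staleness = rng.randrange(min(max_staleness, len(history)))
        stale_w = history[-1 - staleness]
        w -= (lr / workers) * grad(stale_w)
        history.append(w)
    return w
```

Both variants converge near the optimum on this toy problem, but the asynchronous run trades coordination overhead for slightly noisier updates, which is exactly the accuracy-versus-parallelism trade-off the resource-allocation models must account for.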