As datasets grow in size, companies and researchers have turned to cloud computing infrastructure to host their computational workloads. Given a workload, choosing an optimal cloud resource configuration requires accurately measuring and/or predicting job completion time for each resource configuration under consideration. However, these measurements can be expensive in both time and cost.
In this work, we train a logistic regression classifier implemented with Spark's MLlib in order to model job completion time across various datasets. Using this methodology, we have built a tool that lets users profile how many epochs a particular cluster configuration can complete on their dataset in one hour. The platform is built with modern web development technologies and provisions machines at minimal cost through the AWS spot market.