Report ID
2004-25
Report Authors
Daniel Nurmi, Rich Wolski, and John Brevik
Report Date
Abstract
In this paper, we describe a system for application checkpoint scheduling in volatile resource environments. Our approach combines historical measurements of resource availability with an estimate of checkpoint/recovery delay to generate checkpoint intervals that minimize overhead. When executing in a desktop computing or resource harvesting context, long-running applications must checkpoint, since resources can be reclaimed by their owners without warning. Our system records the historical availability from each resource and fits a statistical model to the observations using either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM). When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected overhead as a function of the checkpoint interval, and numerically optimizes this quantity. Using Condor as a target platform, we investigate the effectiveness of this technique fitting exponential, Weibull, 2-phase hyperexponential and 3-phase hyperexponential distributions to observed availability data. To verify our method and compare the distributions each against the same conditions, we use observations taken from the Condor pool at the University of Wisconsin and trace-based simulation. We examine the practical value of our approach by observing an implementation of our system when applied to a test application that is then run on the "live" Condor system. Finally, we conclude with a verification of the simulated results against the experimental observations. Our results indicate that application efficiency is relatively insensitive to the choice of distribution (among the ones we investigate) but that induced network load is not.
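The fit-then-optimize pipeline described in the abstract can be illustrated with a minimal sketch. It is not the system from the report: it assumes the simplest of the four distributions (exponential), replaces the Markov state-transition model with a closed-form expected completion time that holds only when failures are exponential and recovery itself is never interrupted, and uses a plain golden-section search as the numerical optimizer. All function names and the sample numbers are hypothetical.

```python
import math


def fit_exponential_rate(durations):
    """MLE of the exponential failure rate from observed availability durations."""
    if not durations or any(d <= 0 for d in durations):
        raise ValueError("durations must be a non-empty list of positive values")
    # For an exponential distribution the MLE of the rate is n / sum(x_i).
    return len(durations) / sum(durations)


def expected_overhead(interval, rate, checkpoint_cost, recovery_cost):
    """Expected wall-clock time per unit of useful work for a given interval.

    Assumption (not from the report): each segment needs interval + checkpoint_cost
    seconds of uninterrupted availability, a failure costs recovery_cost seconds,
    and no failure occurs during recovery. Under exponential failures the expected
    time to finish one segment is (1/rate + R) * (exp(rate * segment) - 1).
    """
    segment = interval + checkpoint_cost
    expected_time = (1.0 / rate + recovery_cost) * math.expm1(rate * segment)
    return expected_time / interval


def best_interval(rate, checkpoint_cost, recovery_cost, lo=1.0, hi=None, tol=1e-3):
    """Golden-section search for the interval that minimizes expected overhead."""
    if hi is None:
        hi = 10.0 / rate  # search up to ten mean availability periods
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0

    def cost(t):
        return expected_overhead(t, rate, checkpoint_cost, recovery_cost)

    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if cost(c) < cost(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


if __name__ == "__main__":
    # Hypothetical availability trace, in seconds (illustrative values only).
    trace = [1800.0, 5400.0, 3600.0, 2700.0, 4500.0]
    rate = fit_exponential_rate(trace)
    interval = best_interval(rate, checkpoint_cost=60.0, recovery_cost=60.0)
    print(f"fitted rate: {rate:.6f} per second")
    print(f"checkpoint interval: {interval:.1f} s")
    print(f"expected overhead: {expected_overhead(interval, rate, 60.0, 60.0):.3f}")
```

Swapping in a Weibull or hyperexponential fit would require a different expected-time computation, since the closed form above relies on the memoryless property of the exponential distribution.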
Document
2004-25.pdf (190.76 KB)