CS190N: Playing a TV Superhero in Real Life

So why should you care about any of this? Aside from the obvious social advantages you now possess when you attend parties where the program "Numb3rs" is discussed, it turns out that the techniques that have been visited upon you have real value in modern computer science research. Making this observation, however, is not without controversy.

At some level, computer science is not a science at all in that it does not attempt to discern relationships between observable natural phenomena. Under such a definition, mathematics is not a science either, but mathematicians have the good sense to keep the word "science" out of the title of their discipline. However, much to the disappointment of many currently practicing computer scientists, computer science is not properly categorized as mathematics either. To explain "The Internet," for example, one would be hard pressed to rely solely on mathematics.

Computer science as misnamed engineering is also problematic. At a high level, engineering is usually thought of as being concerned with essentially "economical" ways of representing and manipulating natural phenomena. For example, a physicist might be able to describe the forces acting on a concrete structure using quantum mechanics, but an engineer would almost certainly head for Newtonian mechanics and/or dynamics as a useful set of "short cuts." Moreover, the engineer is probably more concerned with new ways to use the Newtonian methods to build a better (stronger, cheaper, longer-lasting, etc.) bridge than is the scientist, who might be more interested in a more complete description of how matter is structured.

Computer scientists certainly are interested in engineering concerns, but they must also be concerned with explaining observable phenomena. Battery lifetime in a personal computational device, for example, can depend on how the device is used by its owner. Clearly the problem can be studied as a physical chemistry problem:

What electro-chemical processes yield the greatest energy storage capacity?

or as an engineering problem:

What chemicals can we use to manufacture a battery with a longer lifetime at the same cost as those that are available now?

or as a computer science problem:

How can we build system software that prolongs battery life given the behavior of the device's user?

Is it engineering? Architecture? Social studies? Child and family development? "Stop pressing that button so many times! It won't help!"

Perhaps a better question is "Do you care?" If you do care about how the term "computer science" is defined as it pertains to a domain of discourse (and there are many people in academia and at the funding agencies who do), then we should argue about this issue as philosophically as possible. However, independent of its categorization, this last italicized question provides an entrée into what the future holds for the discipline. In a similar although less dramatic vein, it also provides us with the basis for discussing why it is we chose to teach this class (specifically as a computer science class) and, by implication, why it is you chose to take it.

Where is Charlie When You Need Him?

Consider the following problem you are almost certainly experiencing at this moment if you are working on the final assignment. The machines you are using in the CSIL and GSL to crack passwords and render images can be (and frequently are) rebooted from the console (i.e., the front-panel power switch). People using these machines tend to "clean" each other off through a hard reboot, thereby interrupting your computation. Thus, when you start a computation, perhaps you'd like to know when the machine you are using will next be rebooted.

Now imagine the opening to your own UCSB episode of "Numb3rs..."

    85 machines
    29 students
     1 power switch
     2 hidden images
     3 LCG parameters
     1 project deadline
Oh Charlie!

Seriously, this question turns out to have implications well outside the confines of your personal experience. If we could know when the next machine reboot was coming, we could decide whether to use a particular machine for a computation based on how long that computation will take.

More generally, for almost all programs it is possible to stop the computation, save the memory state, and later resume from where it left off. This process is called "checkpointing," and it is a commonly used methodology for long-running programs. Checkpointing is computationally expensive, however, so you don't want to generate a checkpoint too often. On the other hand, checkpointing too infrequently means losing efficiency to work that must be duplicated after each failure. Thus, the most efficient thing to do would be to checkpoint right before a failure, which means you need to know when a failure is going to happen.
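
As an aside, the classic first-order answer to "how often should I checkpoint?" is Young's approximation, which balances the cost of writing a checkpoint against the mean time between failures. A minimal sketch in Python; the numbers below are invented for illustration, not measured on our machines.

import math

def young_interval(mtbf_seconds, checkpoint_cost_seconds):
    """First-order optimal checkpoint interval (Young's approximation):
    sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost_seconds * mtbf_seconds)

# A 60-second checkpoint against a (hypothetical) 10-hour mean time
# between reboots:
print(young_interval(10 * 3600, 60))  # ~2078 s, i.e. roughly every 35 minutes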

Now that you've had this class, I submit that, as Charlies-in-training, you have the tools necessary to attack this problem.

First you need some data. Here is some for the CSIL and GSL machines.

Along the x-axis is time, represented as Unix timestamps (seconds since the epoch, UTC). The y-axis is on a log scale and shows the time until reboot (in seconds) for the availability period that begins at the corresponding x-axis value. The total time period covered by this data set is July 20th, 2003 through April 26th, 2004. What might you do? Let's start by making a CDF.

Next, you probably want to know what kind of a distribution best fits this data. Clearly, from the CDF, it doesn't look normal. No problem. You know there are other distributions at your disposal. What you want to do is to figure out, for a given distribution, what parameters best describe the data. That is, you might choose a distribution, compute the Maximum Likelihood Estimate of its parameters given the data, and plot.
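
If you want to try this at home, here is a minimal sketch using SciPy's stock maximum-likelihood fitting. The file name uptimes.dat is hypothetical (use whatever holds your intervals, one per line), and this is a reconstruction, not the code that produced the plots below.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Availability intervals (seconds), one per line, in a hypothetical file.
uptimes = np.loadtxt("uptimes.dat")

# Empirical CDF: sorted samples against their cumulative fraction.
xs = np.sort(uptimes)
plt.step(xs, np.arange(1, xs.size + 1) / xs.size, label="empirical CDF")

# Maximum-likelihood fits; pin the location at 0 since uptimes are positive.
for dist, name in [(stats.weibull_min, "Weibull"),
                   (stats.pareto, "Pareto"),
                   (stats.expon, "Exponential")]:
    params = dist.fit(uptimes, floc=0)
    plt.plot(xs, dist.cdf(xs, *params), label=name + " (MLE)")

plt.xscale("log")                # the log x-axis that settles the argument
plt.xlabel("time until reboot (s)")
plt.ylabel("P(X <= x)")
plt.legend()
plt.show()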

Here is such a plot for the Weibull, Pareto, and Exponential distributions.

Not convinced? How about putting the x-axis on a log scale?

Okay, now you'd better be convinced. MLE for the Weibull nails it.

So at this point, you have a good candidate distribution for modeling the time between reboots for a randomly chosen CSIL/GSL machine. That's pretty cool, if you think about it for a minute. For example, knowing this distribution, I suspect you could give me a time that would serve as a lower bound with 95% probability. That is, you could find a number such that for 100 randomly chosen availability intervals, 95 of them would be bigger than this number. Moreover, if you wanted to assert that this distribution will continue to model the data well into the future, those 100 intervals might all be future intervals.

All of this discussion assumes that the intervals are independent, however, which they most probably aren't. How would you go about determining whether they are or are not, by the way? Probably a good thing to know for the final.
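
One simple first check -- not the only one, and you should think about others -- is the lag-1 sample autocorrelation of consecutive intervals. A minimal sketch:

import numpy as np

def lag1_autocorrelation(intervals):
    """Sample autocorrelation at lag 1; near zero is consistent with
    independent intervals, far from zero is evidence against."""
    x = np.asarray(intervals, dtype=float)
    d = x - x.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))

# For n independent samples, r1 is roughly Normal(0, 1/n), so
# |r1| > 2 / sqrt(n) is a crude red flag.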

Thus, you could pick a lower bound on availability and assert that the availability interval beginning at the last reboot time will be at least as long as your lower bound 95 times out of 100. Turns out that number is something like 1800 seconds. Hmm. Makes sense, doesn't it? That is, if you pick a machine at random at the time it reboots, and you run a job that is 30 minutes or less in duration, 95 times out of 100 that job will complete.
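
With the fitted Weibull in hand, that lower bound is just the 0.05 quantile. A sketch with placeholder parameters, not the actual CSIL/GSL fit:

from scipy import stats

# Placeholder shape/scale; substitute whatever your MLE fit returned.
shape, loc, scale = 0.5, 0.0, 40000.0
lower_bound = stats.weibull_min.ppf(0.05, shape, loc, scale)
print(lower_bound)  # 95 out of 100 intervals should exceed this value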

But Wait -- That's the Wrong Answer

Or at least, it is the right answer to the wrong question. Hopefully, going through this exercise has illustrated that to answer Charlie-like questions, we need to be very careful about what our data tells us and how we use it. This answer makes a bunch of unsubstantiated assumptions about the relationship between reboots -- that's true. Even if we could substantiate them, or at least show that violating them does not perturb the answer very much (Monte Carlo, anyone?), we still have two problems. First, we've answered the question for a machine chosen at random. Second, we have answered it only for the case where we start executing at the moment a reboot happens.

"Come on Charlie. There is a reboot out there and we have to find it." -- Don

Let's attack these problems in this order. Given that you have a good model (that you found using MLE), let's ask the question "Do all of the machines have the same distribution?"

You have seen good ways to begin to answer that question, haven't you? Given that we have enough data, for example, we could compare means pairwise. Or we could generate a likelihood score for matching machines.

Turns out that there is a non-parametric method (we didn't get a chance to talk about non-parametric methods, but if the class were longer, we would have) called the Kruskal-Wallis test that you can throw at this data, and it is somewhat better suited to this problem. It is no different in principle from the tests you've seen: you compute a summary statistic from your data and compare that statistic to a critical value from a specific distribution, which turns out to be that old familiar friend, the Chi-squared. There you go.
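
In SciPy the whole test is a single call. A self-contained sketch with synthetic stand-ins for the per-machine samples (the real inputs would be the availability intervals from each machine's trace):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-machine uptime samples.
per_machine = {
    "bugs":    40000 * rng.weibull(0.6, 50),
    "dilbert": 40000 * rng.weibull(0.6, 45),
    "homer":   20000 * rng.weibull(0.9, 60),
}

# Kruskal-Wallis works on ranks, so it makes no distributional assumption;
# the H statistic is compared against a Chi-squared critical value.
H, p = stats.kruskal(*per_machine.values())
print("H = %.2f, p = %.4g" % (H, p))
# A small p (say, < 0.05) rejects the hypothesis that all the machines
# are well-modeled by one common distribution.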

Turns out our good friends Kruskal and Wallis (not guest stars on the show, sadly) tell us that, in fact, the machines are not all well-modeled by the same distribution. Bummer. That means we need to model each machine with its own distribution -- perhaps using a separate MLE. The bad news here, however, is that while we have quite a bit of data for the entire population of machines, we have relatively little data for many of the individual machines. You've seen what happens to parametric methods when your data sets get too small: typically, you lose almost all resolving power.

"I need to expand my data." -- Charlie

Before we revisit this issue, let's look at the other problem as well. Clearly, it is less than satisfactory to require a job launch to occur at the moment of a reboot. What we'd like, I'm sure you'll agree, is to ask the question

From this point in time, what is the lower bound on the availability remaining until a reboot?

To answer this question you need a thing called a conditional distribution. That is, you need a probability distribution that provides something like

The probability that the machine reboots in more than t seconds, given that it has already experienced x seconds of availability

You've seen this kind of a distribution before in the discussion of Bayes' Rule when you were exploring the typing habits of the North American university student. In this case, you want something like

P(T > t | T > x) = 1 - P(T <= t | T > x)

where T is your random variable.

Turns out that this gadget has a name (sort of), which is the "conditional failure distribution function," and it can be calculated as

1 - ∫ f(s) / (1 - F(x)) ds

where f and F refer, respectively, to the density (PDF) and distribution (CDF) functions for the lifetime of the gadgets about which you care, and the limits of integration run from minus infinity to t.

Now.

If you knew the PDF and CDF and you were lucky, the integral would have a nice closed form. What if none of that is true? Recall that for each of your machines you probably don't have enough data to discriminate between MLE parameterizations of different models. No problem. How about you form the empirical CDF using the small number of values (interpolating liberally) and then use, oh, numerical integration to compute the conditional future lifetime?

You could, but in this case there is an easy reduction that can come to your rescue. Notice that, since we are given T > x, the probability that T takes a value below x is zero, so the integrand contributes nothing for s < x and the limits of integration become x to t. Notice also that 1 - F(x) is constant with respect to the integration. Finally, recall that integrating a PDF gives you a CDF, so the integral evaluates to

(F(t) - F(x)) / (1 - F(x))

and the probability you want is one minus that, namely (1 - F(t)) / (1 - F(x)), which is pretty easy to calculate from your empirical data.
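
A minimal sketch of that calculation against the empirical CDF (no interpolation, for brevity):

import numpy as np

def conditional_survival(uptimes, x, t):
    """P(T > t | T > x) = (1 - F(t)) / (1 - F(x)), where F is the
    step-function empirical CDF of the observed availability intervals."""
    u = np.sort(np.asarray(uptimes, dtype=float))
    def F(v):
        return np.searchsorted(u, v, side="right") / u.size
    if F(x) >= 1.0:
        return 0.0  # no observed interval survived past x; nothing to go on
    return (1.0 - F(t)) / (1.0 - F(x))

# e.g. conditional_survival(ups, x=3600, t=3600 + 1800) estimates the
# chance of 30 more minutes given an hour of availability so far.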

Here we go...

So what do you get? Well, here is a look at a sample of the data from the CSIL/GSL, on which we run the following simulation. Each machine trace appears thus:
1075389801 23129.890000 aeonflux.cs.ucsb.edu:9786.upTime.10.uptime.data
1075413122 4223.950000 aeonflux.cs.ucsb.edu:9786.upTime.10.uptime.data
1075421012 65768.870000 aeonflux.cs.ucsb.edu:9786.upTime.10.uptime.data
1075488401 195536.390000 aeonflux.cs.ucsb.edu:9786.upTime.10.uptime.data
1075753082 89174.470000 aeonflux.cs.ucsb.edu:9786.upTime.10.uptime.data
.
.
.
The way to read this is that at each timestamp in the first column, an availability period begins whose length (in seconds) is given in the second column. For example, 1075389801 (Thu Jan 29 07:23:21 2004) starts a period of availability lasting 23129.890000 seconds (about 6.4 hours).
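
Reading a trace is a one-liner per line. A sketch, assuming exactly the three-column format above:

import time

def parse_trace(path):
    """Yield (start_timestamp, uptime_seconds) pairs from one trace file."""
    with open(path) as f:
        for line in f:
            stamp, uptime, _name = line.split()
            yield int(stamp), float(uptime)

# time.ctime(1075389801) renders the first stamp in your local time zone.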

To determine how accurate this technique is, we divide the data up by machine (choosing machines with at least 40 reboots) and pick 1000 random starting times between the earliest and latest times in each trace. Then, for each starting time, we compute the 0.05 quantile from the empirical CDF made from only the data preceding it in the trace. We then record the proportion of "failures" -- cases where the time until the next reboot is shorter than the 0.05 quantile. If the technique is working perfectly, the fraction we get should be very close to 0.05.
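
Here is a sketch of that backtest. It uses the empirical conditional quantile developed above; this is a reconstruction of the methodology just described, and the actual evaluation code may have differed in its details.

import numpy as np

def remaining_bound(history, x, q=0.05):
    """q-quantile of remaining uptime given x seconds already elapsed,
    from the empirical conditional CDF (F(t) - F(x)) / (1 - F(x))."""
    survivors = history[history > x]
    if survivors.size == 0:
        return 0.0  # no past interval survived this long; predict nothing
    return np.quantile(survivors, q) - x

def backtest(starts, ups, trials=1000, q=0.05, seed=0):
    """Fraction of random probe times whose actual remaining uptime falls
    short of the predicted bound; near q means the method is calibrated."""
    rng = np.random.default_rng(seed)
    starts = np.asarray(starts, dtype=float)
    ups = np.asarray(ups, dtype=float)
    ends = starts + ups
    failures = probes = 0
    for t in rng.uniform(starts[0], ends[-1], size=trials):
        i = np.searchsorted(starts, t, side="right") - 1
        if i < 1 or t >= ends[i]:
            continue  # need some history, and t must land inside an interval
        x = t - starts[i]                        # elapsed availability
        bound = remaining_bound(ups[:i], x, q)   # data preceding t only
        probes += 1
        failures += int((ends[i] - t) < bound)
    return failures / probes if probes else float("nan")

The per-machine failure fractions measured on the CSIL/GSL traces: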

machine      failure fraction
bugs         0.020000
dilbert      0.215000
freakazoid   0.058000
homer        0.039000
lisa         0.030000
popeye       0.030000
sylvester    0.055000
Not bad, really. The machine "dilbert" could be a problem, but we might expect to miss one once in a while.

Free Supercomputers

So why do you care? Yes, that is a rhetorical question because I suspect I know the answer: you don't. Perhaps the question to ask is: "Why should you care?" One reason is that you have started your own tech business providing secure back-up storage. Clearly, if you can get a handle on when machines are going to fail, or when their users are going to reboot them, you can optimize your file and network bandwidth usage.

Another reason you might care is because you have something intense to compute. Here is a CDF showing CPU idle time for a machine (scooby.cs.ucsb.edu) I chose at random from the CSIL.

It is a little hard to read, but that is because something like 80% of the values are 2.0, so I had to put it on a log scale. The CPU idle time sensor records the fraction of the machine's CPUs (in total) that are available to you, sampled every ten seconds. Most of the machines in the CSIL include hyperthreading support, which Linux reports as 2 CPUs, so the maximum possible idle fraction is 2.0. Practically, hyperthreading gives you about 1.2 CPUs' worth of processing power if you are lucky, so the way to read this is that the machine is available for running a process, full speed, whenever the CPU reading is larger than 1.2. This graph covers the quarter to date (January 1 through about March 9). From it, I think it is clear that the machine has an idle processor something like 90% of the time. This is a 3.0 GHz machine that is doing nothing during 90% of its usable lifetime. What a shame, since it could be doing something useful like looking for Ramsey counterexamples.

Other people observed this phenomenon long before we did and, to take advantage of these cycles, developed a system called Condor. The idea is to run a daemon on every machine that notices when the machine is idle. You submit your programs to Condor as a batch script, and when it finds an idle machine, it launches them for you and collects their output.

Condor has been running at the University of Wisconsin for some time now. The numbers fluctuate some, but here is a status as of today.

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX   989   146     653       189       1          0        0
       INTEL/WINNT50   106     4      89        13       0          0        0
     SUN4u/SOLARIS28    39    18       1        20       0          0        0
        X86_64/LINUX   126     2     101        23       0          0        0

               Total  1260   170     844       245       1          0        0

Building such a system sounds simple, but it turns out to be very tricky when you start to consider sandboxing, resource discovery, etc. Perhaps the most problematic issue, however, concerns what happens when a machine is suddenly "reclaimed" by its owner. Since Condor runs on the "desktops" of the University of Wisconsin (student labs, research labs, etc.), the owners of those machines need to be able to reclaim them -- in the same way you logically reclaim a CSIL or GSL machine by rebooting it. The rebooting option, though, is inconvenient and dangerous: our system staff tolerates it, but it simply wouldn't work campus-wide. Condor, then, supports an eviction capability that kills your job when the owner of a machine wants it back. If you have a job with the right characteristics, Condor can generate a checkpoint before it kills the job so the job can be restarted (automatically by Condor), but only certain kinds of jobs can be checkpointed, and then they can only be restarted on the same kind of machine.

If you could predict when Condor was going to kill your job, however, you might be able to do your own efficient checkpointing or, at the very least, decide whether you want to use the CPU it has assigned to you or not.

Turns out that we have been monitoring Condor for a while and, using a statistical invention due to your Uncle Norman, we can make a prediction that we assert should have a 95% success rate -- and, when we observe its performance, it does, right on the money. How does it work? As junior Charlies you are certainly qualified to understand the methodology, but the lecture grows long and the "hour" late. We'll leave it to you to contemplate how you can predict when and where the next "murder" of a Condor job will be committed.

Bringing it All Back Home

This diatribe describes one of several ongoing research projects that your hosts this quarter have been pursuing. Without exception, all of them have involved understanding and using many of the techniques we have discussed in this class (much as the failure-prediction discussion outlines). Indeed, your Aunt Heloise has served on many systems-oriented graduate student committees (Masters and Ph.D.), and she noticed that all of them -- at some point -- required one or more of the techniques we have discussed in order to answer the question at hand. All of them.

Moreover, she noticed after a time that in almost every case the student either didn't know what technique was needed (and used a poor substitute instead) or applied the correct technique in cook-book form and then couldn't interpret the results. Think about that for a minute. Your Aunt Heloise has been a committee member or chair for something like 20 or 25 students in the last 5 years or so. She is not a mathematician, but rather a system builder whose own graduate training came when she was a young operating systems and compiler developer. In every case, the analysis and/or implementation demanded some technique we have discussed this quarter, and in almost every case the technique was either misapplied or misinterpreted.

What does that mean? I think it means that there has been a sea change in the way systems research is done, and that in these new seas the boats that are going to float and sail will do so with the understanding that I hope we have imparted to you through our analysis of "Numb3rs."

So while you may, at times, have wondered "Why are we doing this?" (other than for the obvious entertainment value), the answer, I believe, is that by doing so we have better prepared you to be a leader in the field of computer science.

No matter how you define it.