At this point, you should be an expert on how to write threaded codes that use sockets. Now we get to the Grid part. Let's assume, for a moment, that you have written a program consisting of many threads and that uses many sockets to communicate between them. Let's further assume that you care about the performance of the overall collection of the threads and that the machines you are going to use are shared by your colleagues. If you think about it for a minute, you'll realize that your new-found (or possibly recently rediscovered) threads and programming skills give you quite a bit of flexibility about where you can run the various parts of your application. In particular, it should be possible for you to decide on the best subset of the machines that are at your disposal to get all of your threads executed in the shortest possible time.
If this flexibility isn't clear, let's try to make the exposition somewhat less dense with an example. Say, for example, that you have 100 threads to execute in your program. Perhaps each one is assigned the task of computing an individual matrix multiply. Clearly you can execute all 100 threads on one machine, but it is likely that parallelism will pay performance dividends if the arrays are large enough. You've decided, then, that you want to use a set of machines to run your threads. Great -- now which ones? To answer that question, you need some information. Let's say each thread is executing exactly the same pieces of code on the same input size, for clarity. In that case, you need
Still not clear? Consider using just two machines: a local machine and a remote machine. Say your local machine is a 500 MHz workstation that can run your thread in 12 seconds. The remote machine is 2.0 GHz and, for the sake of argument, say it runs each thread in 1/4 the time (3 seconds each). What is your fastest time to completion?
With only this information, the answer is pretty simple: 20 threads go on the local machine and 80 threads go on the remote machine. The expected completion time is 20 * 12 == 80 * 3 == 240 seconds. What if it takes 1.5 seconds to send each input matrix pair to the remote machine, and another 1.5 seconds to get the result back? The answer is still pretty simple. Now each remote thread takes 6 seconds. Therefore, 33 threads go on the local machine and 67 threads go on the remote machine for an expected execution time of 67 * 6 == 402 seconds (the local machine finishes slightly earlier, at 33 * 12 == 396 seconds). This schedule assumes, of course, that communication and computation cannot be overlapped.
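The arithmetic above can be checked with a short sketch. This is a brute-force search, not anything the NWS provides: the function name `best_split` and its parameters are made up for illustration, and it assumes, as the text does, that the threads on one machine run back to back and that communication and computation do not overlap.

```python
def best_split(n_threads, local_secs, remote_secs):
    """Return (n_local, n_remote, completion_secs) with the earliest finish.

    Each machine runs its threads one after another, and the two machines
    work in parallel, so the program finishes when the slower side does.
    """
    best = None
    for n_local in range(n_threads + 1):
        n_remote = n_threads - n_local
        finish = max(n_local * local_secs, n_remote * remote_secs)
        if best is None or finish < best[2]:
            best = (n_local, n_remote, finish)
    return best


if __name__ == "__main__":
    # 12 s per thread locally, 3 s remotely, no communication cost.
    print(best_split(100, 12, 3))
    # Remote threads now cost 3 s of compute plus 1.5 s in each direction.
    print(best_split(100, 12, 3 + 1.5 + 1.5))
```

Trying every split is wasteful for large thread counts, but with 100 threads it is instant and it avoids rounding mistakes when the two sides cannot be balanced exactly.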
However.
What do you do when there are processes coming and going and the network performance is changing because other processes are using it? If the remote machine has a compute-bound process running on it, the scheduler is fair, and you run at the same nice() priority, you'd expect to get half the CPU cycles, making each of your threads take 6 seconds to execute. The fraction of the CPU that you will obtain, then, can be used to calculate the slowdown that your threads will experience. Similarly, the time required to send your inputs and retrieve your results can be calculated as a function of the network performance. The Network Weather Service is designed to provide this information.
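As a rough sketch of that calculation, the snippet below divides the unloaded compute time by the CPU fraction you expect to obtain and adds a transfer cost. The function name `remote_thread_secs` is hypothetical, and treating the transfer time as a fixed additive cost is an assumption made only to keep the example small.

```python
def remote_thread_secs(unloaded_secs, cpu_fraction, transfer_secs=0.0):
    """Estimate the time one remote thread takes under a fair-share model.

    unloaded_secs: compute time with the whole CPU to yourself
    cpu_fraction:  share of the CPU you expect to obtain (0.5 means half)
    transfer_secs: assumed fixed cost of moving inputs and results
    """
    if cpu_fraction <= 0.0:
        raise ValueError("cpu_fraction must be positive")
    return unloaded_secs / cpu_fraction + transfer_secs


if __name__ == "__main__":
    # The 3-second thread from the example, sharing the CPU 50-50.
    print(remote_thread_secs(3.0, 0.5))
    # The same thread with 1.5 s of transfer in each direction.
    print(remote_thread_secs(3.0, 0.5, transfer_secs=3.0))
```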
forecasting -- But wait -- there's more. Let's say you are really clever and you ask the NWS for the CPU percentage that is available on the remote system before you decide how to partition your threads. In our example, the NWS might report "100%" indicating that you will not share the CPU with any other processes. You consider that and decide to run your program with 67 threads on the remote machine as discussed previously. Before you start to run, however, another user of the remote machine starts a process and you only really obtain 50% of the cycles. What has happened, although you probably didn't know it, is that you used the current availability as a prediction of what would be available when your program eventually ran. What you needed, however, was not the current state, but rather a prediction of the availability. If you knew, for example, that it was nearing noon and the likelihood of another process showing up was pretty high because people just getting to work at the crack of noon would be surfing the web and emailing prodigiously, you might be able to anticipate the 50% value.
statistical forecasting -- The NWS tries to make this kind of prediction based on past observed behavior. It constantly takes measurements from the resources it monitors and then uses statistical models to try and make short-term predictions of load and availability.
sensor control -- Sensors report to and are under the control of a distributed sensor control subsystem. We'll talk more about how and why sensors need to be controlled in the next section.
stateless and persistent state -- To make them useful, though, you really should make them stateless. There are two reasons. First, people will do their level best to kill them at every opportunity. For two years we ran NWS sensors at SDSC where the NFS performance was less than stellar. Whenever NFS dropped a packet and did a retry (thereby slowing file-system access) the instant response by SDSC administrators was to kill any and all NWS sensors immediately. Of course, that never improved the problem, but they never seemed to start the sensors up with the same enthusiasm they displayed when killing them off. Anyway, it is important to protect any state you wish to have survive irrational system administrator behavior (or just plain machine instability) in a separate persistent state subsystem.
forecasting -- We'll also talk quite a bit more about forecasting and how it is done, but the key idea here is that it does not depend on how the information is gathered. Rather, any data that is in the persistent database can be used to generate forecasts.
reporting -- Finally, it is important to be able to deliver the performance data in a variety of presentation formats. Currently, the NWS supports a Unix command-line interface, a really bad C language interface, a nascent Java interface, and LDAP (which you will hear about in the next lecture).
Why don't we talk about it as part of the logical architecture? Because logically, it shouldn't matter. The NWS APIs were designed to support location independence (at least, in their default usage) for the various processes. That is, you shouldn't have to care about where a process is, only what it does. In theory, then, you should be able to specify only the names of the actions that you want and the system magically translates those requests into machine addresses and operating system calls.
That's the theory, anyway.
In reality, the way it works is that (at present) all NWS processes must register themselves with the nws_nameserver process. A registration contains several pieces of information:
Yes -- I do know that the centralized nameserver is both a potential performance bottleneck and a single-point-of-failure. I'll explain presently.
virtual NWS worlds -- This architecture (with the caveats listed above excepted) provides users with the ability to set up their own virtual NWS worlds. That is, if you set up your own nws_nameserver process someplace, you can launch and run your own set of NWS processes independent of any other NWS processes that happen to be running. The only issue has to do with well-known addresses and port numbers. For example, we have two nws_nameservers running on pompone.cs.ucsb.edu at present. One is listening to port 8090 and the other is listening to port 8091. You will be using the one on port 8091 exclusively -- the other is used by a bunch of people around the country for their research (please do not molest this server, no matter how tempting it might seem). The next lecture will explain exactly how you will get access to this server. In the meantime, you should know that it is out there and functioning.
While the intention is that it be easy to interface any kind of sensor to the NWS, we have learned two things in building the system:
Be that as it may, at present, the NWS includes
actual execution time = (unloaded execution time) / fraction
leading you to conclude that you would experience a slowdown of 2 if this were a prediction. Here is an example from running the command
nws_extract -N pompone.cs.ucsb.edu:8091 -f time,meas avail ella:8060
Time        Measure
1011322094  1
1011322104  1
1011322114  1
1011322124  1
1011322134  1
1011322144  0.99692
1011322154  0.9453
1011322164  0.9453
1011322174  0.92497
1011322185  0.82119
1011322194  0.82119
1011322204  0.81127
1011322214  0.84726
1011322224  0.84726
1011322234  0.86954
1011322244  0.8812
1011322254  0.89322
1011322265  0.91842
1011322274  0.91842
1011322284  0.96667
Don't worry about the command. If you are curious, look in the users_guide for details. When you access the NWS, you will be using LDAP. This is just the Unix command-line interface at work.
The left column contains time stamps taken on the remote machine in UTC. At time 1011322185, which translates to Thu Jan 17 18:49:45 2002, someone ran something on ella.cs.ucsb.edu that took roughly 20% of the CPU.
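If you want to translate the time stamps yourself, here is a minimal sketch. It assumes the values are seconds since the Unix epoch and that the wall-clock time quoted above is Pacific Standard Time (UTC-8), which is the offset that makes the two agree; the helper name `describe` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the human-readable time in the notes is Pacific Standard Time.
PACIFIC_STANDARD = timezone(timedelta(hours=-8), "PST")


def describe(stamp):
    """Format an epoch time stamp the way the notes quote it."""
    when = datetime.fromtimestamp(stamp, tz=PACIFIC_STANDARD)
    return when.strftime("%a %b %d %H:%M:%S %Y")


if __name__ == "__main__":
    print(describe(1011322185))
```

A fixed offset is used instead of a named time zone so the example does not depend on a time zone database being installed; it would be wrong for time stamps that fall in daylight saving time.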
Incidentally -- I had to look through several examples of machines here before I could find anything but 1.0 at 6:45 PM on a Thursday. Were you working then?
1011322651  0.99933
1011322661  1
1011322671  1
1011322681  1
1011322691  1
1011322701  1
1011322711  1
1011322721  1
1011322731  1
1011322741  1
1011322751  1
1011322761  1
1011322771  1
1011322781  1
1011322791  1
1011322801  1
1011322811  0.98982
1011322821  0.98982
1011322831  0.99933
1011322841  0.99933
It looks unloaded, right? Okay -- now I'll start up a process that uses the CPU.
Time        Measure
1011322981  1
1011322991  1
1011323001  1
1011323011  0.9118
1011323021  0.78817
1011323031  0.69462
1011323041  0.64787
1011323051  0.61066
1011323061  0.58074
1011323071  0.56539
1011323081  0.60718
1011323091  0.64787
1011323101  0.68561
1011323111  0.71824
1011323121  0.70387
1011323131  0.64394
1011323141  0.59698
1011323151  0.56839
1011323161  0.53708
1011323171  0.51894
After a while, the available value goes to around 0.5, which is the number I want to know if I want to start a new process on zonker.cs.ucsb.edu. However, what if I want to know whether I should continue to use zonker.cs.ucsb.edu? That is, let's say I can move from zonker.cs.ucsb.edu to ella.cs.ucsb.edu
Time        Measure
1011323165  1
1011323175  1
1011323186  1
1011323195  1
1011323205  1
1011323215  1
1011323225  1
1011323235  1
1011323245  1
1011323255  1
1011323265  1
1011323275  1
1011323285  1
1011323295  1
1011323305  1
1011323315  1
1011323325  1
1011323335  1
1011323345  1
1011323355  1
where our lonely user has finished surfing the web at 7:10 PM on a Thursday. According to these measurements zonker.cs.ucsb.edu is 50% available and ella.cs.ucsb.edu is 100% free. In that case, you might decide to move to ella.cs.ucsb.edu. But then, after a while, ella.cs.ucsb.edu will look like it is 50% consumed and zonker.cs.ucsb.edu will look 100% free. In that case you might decide to move back. Notice that the only process that is running on either machine is yours.
What is happening here? The answer is that in order to make a migration decision, you actually need a second measure. If you are running on zonker.cs.ucsb.edu you need to know the fraction of the CPU that you will continue to get in order to compare that fraction with the available fraction from ella.cs.ucsb.edu. Think about that for a while.
The NWS provides this second measure as current CPU. Here, now, is the current CPU measurement series from zonker.cs.ucsb.edu with a compute-bound process running:
Time        Measure
1011323591  1
1011323601  1
1011323611  1
1011323621  1
1011323631  1
1011323641  1
1011323651  1
1011323661  1
1011323671  1
1011323681  1
1011323691  1
1011323701  1
1011323711  1
1011323721  1
1011323731  1
1011323741  1
1011323751  1
1011323761  1
1011323771  1
1011323781  1
and here is the available CPU from ella.cs.ucsb.edu:
Time        Measure
1011323636  1
1011323645  1
1011323655  1
1011323665  1
1011323675  1
1011323685  1
1011323695  1
1011323705  1
1011323715  1
1011323725  1
1011323735  1
1011323745  1
1011323755  1
1011323765  1
1011323775  1
1011323785  1
1011323795  1
1011323805  1
1011323815  1
1011323825  1
Now, if you compare the current CPU from zonker.cs.ucsb.edu where you are running with the available CPU from ella.cs.ucsb.edu, you see no advantage in migrating. Again, what this says is that if you continue to run on zonker.cs.ucsb.edu (current CPU) you will continue to get 100% of the CPU. If you start new work on ella.cs.ucsb.edu (available CPU), you will get 100% of that CPU. While you are running on zonker.cs.ucsb.edu, though, only 50% will be available for new work (available CPU).
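The comparison just described fits in a couple of lines. This is only a sketch of the decision rule, with a hypothetical function name; the `min_gain` threshold is an added assumption (moving a process is not free, so you may want to insist on some improvement before migrating).

```python
def should_migrate(current_cpu_here, available_cpu_there, min_gain=0.0):
    """Migrate only if the target offers more CPU than you already have.

    current_cpu_here:    current CPU on the machine you are running on
    available_cpu_there: available CPU on the candidate machine
    min_gain:            assumed minimum improvement worth the move
    """
    return available_cpu_there - current_cpu_here > min_gain


if __name__ == "__main__":
    # Right comparison: current CPU on zonker vs available CPU on ella.
    print(should_migrate(1.0, 1.0))
    # Wrong comparison: available CPU on zonker (0.5) vs available CPU on
    # ella. This is the one that makes you bounce between the two machines.
    print(should_migrate(0.5, 1.0))
```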
A word about multiprocessors -- These availability metrics make sense for single processors, but if the machine has more than one processor, things become a little more ambiguous. In particular, it is difficult to characterize the availability of a multiprocessor with a single number. For example, an 8 processor machine that has 1 process running on each processor might be 50% available (you can run one new process on each processor and share all 8 processors 50-50). What do you report if 4 of the processors are completely idle and 4 are completely consumed? One would suspect 50% as well, but in the first case you'd have 8-way parallelism, and in the second only 4-way.
The current NWS solution (we are looking for a better one) is to report
availableCPU = number of CPUs / (processes + 1)
roughly. As such, available CPU is the fraction of a single CPU that is available from the machine. For example, a completely idle 8 CPU machine offers 800% of one CPU.
The current CPU measure is more difficult still:
currentCPU = (processes < 1.0) ? CPUs : (CPUs / processes);
It is an exercise left to the reader to determine how and why this formulation makes any sense. Periodically, we grapple with these metrics. New ideas are welcome.
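To experiment with the two formulas, here is a direct transcription. The notes do not define `processes` precisely; the sketch assumes it is a count of running processes that may be fractional (something like a load average), which is what the comparison against 1.0 suggests. The function names are made up for illustration.

```python
def available_cpu(cpus, processes):
    """Fraction of one CPU that a newly started process could expect."""
    return cpus / (processes + 1)


def current_cpu(cpus, processes):
    """Fraction of one CPU that an already running process keeps getting."""
    return cpus if processes < 1.0 else cpus / processes


if __name__ == "__main__":
    # A completely idle 8 CPU machine.
    print(available_cpu(8, 0))
    # One CPU with a single compute-bound process on it (yours).
    print(available_cpu(1, 1), current_cpu(1, 1))
```

The single-CPU case reproduces the zonker example: 0.5 is available for new work while the process already running there keeps 1.0.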
Even if you are willing to accept 18.75GB worth of traffic every hour, though, there is a problem. If all of the sensors wake up at the same time, you have 10,000 simultaneous communications taking place. What, then, are you measuring? Right. Each NWS measurement would measure the effect of the other 9999 measurements that are taking place simultaneously. That's probably not what you want.
Consider the following figure:
If hosts are attached to a shared network medium, and each host determines when it will probe the network, the measurements potentially contend.
Here is an example of an NWS trace on a 10mb ethernet.
At the beginning of the trace, only one probe was running. Then, at some point, a second sensor started up and began to probe. Can you tell when that was?
The solution is to arrange for the NWS sensors to access the network according to a mutual exclusion protocol. The idea is that the sensors pass around a token which gives each sensor the "right" to probe the other sensors. We term each collection of sensors that are practicing wholesome mutual exclusion together a clique.
To get periodic measurements, each sensor
controlling the periodicity -- The first problem is that if the token is immediately forwarded, the periodicity becomes a function of the number of hosts in each clique. We would like the user to be able to set the periodicity of the measurements.
token loss -- The second problem is that the network may partition or a machine may die while the sensor process running on it is holding the token. In either case, some (if not all) machines will be deprived of the token and will never take their measurements.
distributed leader election -- The solution to both of these problems is for the hosts within a clique to run a distributed leader election protocol. Initially, one host is designated the leader in the token. It is the responsibility of the leader to launch the token with the prescribed periodicity. Each member of the clique keeps its own timeout on token reception. If that timeout expires, the host "elects" itself leader and regenerates the token with a time stamp. If any host receives two tokens with different time stamps, it destroys the older one. If you think about it for a minute, you can convince yourself that at least one token will continue to circulate as long as all but one of the hosts do not reboot. Indeed, killing a token off is a bit of a chore.
scalability -- The obvious problem with this approach is that it does not scale well. If all hosts in the U.S., for example, were to share a single token, the time between successive measurements at any one host would be incredibly long. We can scale the system, however, by organizing the cliques into a hierarchy.
The main idea is that hosts at distant sites all enjoy pretty much the same network performance from one locale because the intermediate gateways are largely shared.
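The two rules in that paragraph (regenerate on timeout, destroy the older token) can be sketched for a single clique member as follows. This is not NWS code: the class and method names are hypothetical, message passing is left out entirely, and comparing time stamps across hosts assumes their clocks are roughly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    leader: str   # host that launched this token
    stamp: float  # time at which it was launched


class CliqueMember:
    def __init__(self, name, timeout_secs):
        self.name = name
        self.timeout_secs = timeout_secs
        self.last_seen = 0.0      # when a token last arrived here
        self.newest_stamp = None  # stamp of the newest token seen so far

    def on_token(self, token, now):
        """Return the token to forward, or None if it must be destroyed."""
        if self.newest_stamp is not None and token.stamp < self.newest_stamp:
            return None  # a newer token exists, so drop the older one
        self.newest_stamp = token.stamp
        self.last_seen = now
        return token

    def on_tick(self, now):
        """Elect ourselves leader and regenerate the token after a timeout."""
        if now - self.last_seen > self.timeout_secs:
            self.last_seen = now
            self.newest_stamp = now
            return Token(self.name, now)
        return None


if __name__ == "__main__":
    host = CliqueMember("a", timeout_secs=30)
    print(host.on_tick(10))                   # still waiting
    print(host.on_tick(31))                   # timeout: new token from "a"
    print(host.on_token(Token("b", 5), 32))   # stale token is destroyed
```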
So far, we've only talked about measurement. There is a lot more to say and many excellent masters and Ph.D. theses lurk in this general area, but we'll move on and talk, now, about forecasting.
Simple Question: What is ethernet bandwidth? -- The answer is that no ethernet has a bandwidth -- it has many bandwidths. The real question you are asking is: What will the bandwidth be on the ethernet when I use it? Consider the following graph:
At the rightmost point in time, assume that you are thinking of using the ethernet and you want to know what the bandwidth will be. How will you go about it?
statistics -- There are many ways you can think about solving this problem. Most of those ways involve the use of statistics. In a couple of lectures, we'll talk more extensively about statistics and the various techniques that are available to you. Here, though, we'll illustrate what almost everyone does at first. If you are confronted with the trace above, you almost certainly look at the last value you've seen and guess that the next value will be the same (or close).
Error -- Were you right? A better question is: What is the amount you were wrong by? By quantifying the error you give yourself (and more importantly your scheduler) a numerical measure of the "goodness" of a forecast.
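As a small illustration, the sketch below scores the "last value" guess over a measurement series using two common summaries, the mean absolute error and the mean square error. The function name is hypothetical and the sample series is made up.

```python
def last_value_errors(series):
    """Return (MAE, MSE) of predicting each value with the one before it."""
    errors = [series[i] - series[i - 1] for i in range(1, len(series))]
    if not errors:
        raise ValueError("need at least two measurements")
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse


if __name__ == "__main__":
    # A made-up series that bounces between fully and half available.
    print(last_value_errors([1.0, 0.5, 1.0]))
```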
A word about the ping program -- Note that this isn't the same as running ping. Ping measures RTT of ICMP packets and you are looking for a bandwidth prediction. If you use ping you really have two errors: measurement error (since ping doesn't measure bandwidth) and prediction error.
choosing a prediction method -- It turns out that there are a whole bunch of different ways you can predict the next value. One obvious way is to keep a running average. Another is to guess the midpoint between the min and max you've seen so far. Rather than insisting on one method, the NWS provides an API for including many different methods, and then choosing the best one based on past error performance. Here is the forecasting suite currently built into the NWS when used to predict coast-to-coast bandwidth.
The idea is to use the measurement history to see how well individual forecasters did over time (in terms of their measurement error) and then choose the best to make the next forecast. In the figure, the term MSE is an abbreviation of "Mean Square Error" and MAE is short for "Mean Absolute Error." At the bottom of the table are the errors generated by the two automatic-selection techniques (one for MSE and one for MAE). The important thing to note is that the errors are equivalent to the lowest error among all of the forecasters in the suite.
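The selection idea can be sketched with the three simple predictors mentioned earlier (last value, running average, and min/max midpoint). This is a toy version under stated assumptions, not the NWS forecasting API: the names are hypothetical, the suite is far smaller than the real one, and only the MSE-based selection is shown.

```python
def last_value(history):
    return history[-1]


def running_mean(history):
    return sum(history) / len(history)


def min_max_midpoint(history):
    return (min(history) + max(history)) / 2


FORECASTERS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "min_max_midpoint": min_max_midpoint,
}


def adaptive_forecast(series):
    """Return (name, forecast) from the forecaster with the lowest past MSE."""
    squared_error = {name: 0.0 for name in FORECASTERS}
    # Replay the history: at each step every forecaster predicts the next
    # measurement from what came before, and its error is accumulated.
    for i in range(1, len(series)):
        history = series[:i]
        for name, forecaster in FORECASTERS.items():
            squared_error[name] += (forecaster(history) - series[i]) ** 2
    best = min(squared_error, key=squared_error.get)
    return best, FORECASTERS[best](series)


if __name__ == "__main__":
    # Made-up availability trace: a competing process shows up and stays.
    print(adaptive_forecast([1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]))
```

Since every forecaster sees the same number of points, comparing summed squared errors picks the same winner as comparing means.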
Instead, if you decide to use the NWS, there will be several choices available to you.
The first, and by far most entertaining interface is due to Martin Swany and Graziano Obertelli. If you go to this web page you will be presented with a visualizer that will attempt to splash time series graphs up in your browser. It does so by producing encapsulated postscript so you should make sure that your browser will automatically launch a plug-in to display it.
The first screen asks you to select a name server. By default, the name server that shows up is nmi.cs.ucsb.edu. THIS IS NOT YOUR NAME SERVER. It is not. We maintain the NWS for other researchers around the country. You may find any of the UCSB machines you care about in this name server, and you are free to query it, but you should NOT use it for your class project.
To use the web page for this class properly, you should change the name server to
loggerhead.cs.ucsb.edu
You can choose a CPU measurement type, and then a machine, and the page automatically graphs the last 5000 seconds' worth of data. The red graph shows measurements and the green graph (which is superimposed over the red graph) shows forecasts. If you leave the page up for a while, and there is activity, the page will automatically refresh. In other words -- NWS: The Movie.
Another set of interface options that you can play with are the Unix command-line tools. In the directory where this page is located:
/cs/faculty/rich/public_html/cs290I-grid/notes/NWS/bin
you will find nws_search and nws_extract compiled for Linux X86. To try them out, type
nws_extract -N cisa.cs.ucsb.edu -f time,measurement,mse_forecast availableCpu bullwinkle.cs.ucsb.edu
You must type it exactly as shown, all on one line, or it won't work. In particular, if you use bullwinkle.cs.ucsb.edu as the last argument, it won't work. The reason is that when you call gethostname() on the system here, the DNS is not configured to give you back a fully-qualified host name. The NWS name server, then, thinks of the data series as coming from bullwinkle and not bullwinkle.cs.ucsb.edu. Yes -- we can launch people into space for fun, clone sheep as a party game, and view energy that was created just after the Universe was born, but we can't seem to figure out the fully qualified host name. This bothers me. A lot. So please don't ask me about it.
If you do type this magic incantation, you should get something like
Time        Measure  MSE_Fore
1080776737  0.99933  0.99933
1080776747  0.99933  0.99933
1080776757  0.99933  0.99933
1080776767  0.99933  0.99933
1080776777  0.99933  0.99933
1080776787  0.99933  0.99933
1080776797  0.99933  0.99933
1080776807  0.99933  0.99933
1080776817  0.99933  0.99933
1080776827  0.99933  0.99933
1080776837  0.99933  0.99933
1080776847  0.99933  0.99933
1080776857  0.99933  0.99933
1080776867  0.99933  0.99933
1080776877  0.99933  0.99933
1080776887  0.99933  0.99933
1080776897  0.99933  0.99933
1080776907  0.99933  0.99933
1080776917  0.99933  0.99933
1080776927  0.92526  0.92526
Note that you do not have to be logged on to bullwinkle to get this information. When you run nws_extract the tool contacts the nws_nameserver running on cisa.cs.ucsb.edu that is listening to port 8090 (that is what the -N argument specifies). It discovers where bullwinkle.cs.ucsb.edu is storing its data, and contacts that persistent state server. It then fetches the previous trace history, runs it through the forecasters, and displays the last 20 measurements or so. You can configure the output every which way from Sunday. Type man nws_extract for some more explicit details.
What you get back is a list of time stamps, measurements, and forecasts. In this case, the forecasts look pretty good.
To see what the nws_nameserver knows about bullwinkle.cs.ucsb.edu try running
nws_search -N cisa.cs.ucsb.edu "&(objectclass=nwsHost)(name=bullwinkle*)"
Again -- you must type exactly what is shown. It is even case sensitive. Please -- don't ask. What you should get back is
name       :bullwinkle.cs.ucsb.edu:9786
objectclass:nwsHost
hostType   :sensor
ipAddress  :128.111.43.72
owner      :nws
port       :9786
started    :Apr_13,_2004_17:20:54
version    :2.9.1
flags      :debug,thread,experimental
systemType :Linux
releaseName:2.4.25lab-8
machineArch:i686
CPUcount   :1
memory     :503
timestamp  :1081968942
expiration :1081970742
which is the name server's opinion about what is running on bullwinkle.cs.ucsb.edu. There is actually quite a bit of information here. I'll leave it to you to ponder.
One important use of nws_search that you will need to consider, however, is to discover which resources you will be entitled to use. For your second project, using any resource that is not listed in the NWS is wrong. This process is called resource discovery and it is an important part of grid computing.
To see what hosts a name server knows about, type
nws_search -N cisa.cs.ucsb.edu host
and you will get back all of the host records from the server. You'll need to parse this information a bit, but the list is the list of hosts you are entitled to use.
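Here is one way the parsing might look. It is a sketch under assumptions, not a description of the real output: it supposes that nws_search prints one `attribute:value` pair per line, as the bullwinkle record suggests, and that every record begins with a `name` attribute. The function names are made up, and the text to parse is assumed to have been captured already (for example, by redirecting the command's output to a file).

```python
def parse_host_records(text):
    """Split attribute:value lines into one dictionary per host record."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue  # not an attribute line
        key = key.strip()
        # Assumption: a "name" attribute marks the start of a new record.
        if key == "name" or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


def host_names(records):
    """Return the host part of each record's name (drops the port)."""
    return [r["name"].rsplit(":", 1)[0] for r in records if "name" in r]


if __name__ == "__main__":
    sample = """name       :bullwinkle.cs.ucsb.edu:9786
objectclass:nwsHost
hostType   :sensor
CPUcount   :1
"""
    records = parse_host_records(sample)
    print(host_names(records))
    print(records[0]["CPUcount"])
```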