CS290I Condor


Condor is a job submission system that uses the idle cycles of the machines at the University of Wisconsin. The basic idea is that you log into a Condor-enabled machine, you compile your code, and you write a short script that you submit to the Condor system explaining how to run your code. Condor puts that script and your binary in a safe place, and when it notices that there is a workstation someplace at Wisconsin that no one is using, it will copy your binary to that machine and run it there according to the instructions in your script.

Here is how it works. Each machine that allows Condor job submission runs a daemon called

condor_schedd
which is responsible for collecting up your binary and job script. Let's say we have written a hello_world.c program and we want to run it under Condor. The submission script might look like
Executable = ./hello_world
Arguments =
Universe = vanilla
should_transfer_files = yes
WhenToTransferOutput = ON_EXIT_OR_EVICT
Output=/home/rich/example/hello_world.output.$(Process)
Queue 1
where the binary is called hello_world and it is in the same directory as the submission script. The script is ASCII text. Let's say the file name is hw.submitfile.

To submit it you use the condor_submit command.

Here is an example output from my login:
[rich@agt-login example]$ condor_submit hw.submitfile
Submitting job(s).
1 job(s) submitted to cluster 947962.
[rich@agt-login example]$

There are many options that you can include in your submit file and some of these dramatically affect the performance of the job. In this simple version, the output will be sent (as it occurs) to a file called hello_world.output.0 in the example subdirectory of my home directory.

However, once the job is submitted it will wait until Condor finds an idle machine to use to run it. To see what else Condor is doing, type


condor_q

which tells you all of the jobs in your Condor pool and what their status is. If you want the status of a particular job, use the cluster number:

[rich@agt-login example]$ condor_q 947962


-- Submitter: agt-login.cs.wisc.edu : <198.51.254.66:59037> : agt-login.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
947962.0   rich            5/13 12:41   0+00:00:21 I  0   0.0  hello_world

Why is it a cluster number? Because you can ask Condor to spawn multiple jobs in a single submission by changing the Queue argument in the submit script. The collection of jobs is called a "cluster" and each individual job is given an index (starting at zero) within the cluster.

The line shown above tells you that I submitted the job and when, that it has not accumulated much run time yet, and that it is in the "I" state, meaning the job is idle. When it runs, it will go to the "R" state until it exits, and then the output will be put in the output file.

WARNING: There is much more to using Condor than I've illustrated in this example. You MUST read the documentation. No -- wait -- read this sentence again. YOU MUST read the documentation. I've listed several of the documentation pages at the top of this page. Consult them early and often.

The Condor Model

Here is what is happening under the covers. First off, there is a central Condor manager process that runs someplace controlled by Condor that is orchestrating the whole show. At the submitting site, a "scheduling" daemon accepts submission scripts and tells the central manager about them. It also starts a "shadow" process on the submitting machine. More on the shadow in a second. At the execution site, a "start" daemon listens for assignments from the central manager. It is the central manager's responsibility to look for idle machines capable of running scripts.

When an idle machine is discovered, a proxy process is started on the idle machine called the "starter" and it is given the address of the shadow on the submitting machine. The two open a couple of sockets between each other, and then the starter forks your job.

The Vanilla Universe

From here on out, things are a little different depending on what you want to have happen when the remote machine you are using becomes "busy" again. The issue at hand is that Condor guarantees that the owner of that machine can reclaim it from Condor instantly even if there are Condor jobs (your job) running on it. Also, Condor guarantees that your job will not be able to damage the remote machine while it is running there.

In the vanilla universe, these two guarantees are made in the following way. First, by carefully manipulating the shell defaults, your program is prevented from opening any local files. Any attempt to open a local file will simply cause an error. Secondly, the terminal for your job (standard in, standard out, standard error) is connected to the sockets that the starter opened to the shadow. Any output your program produces will be automatically redirected back to the shadow on the submitting machine. That is how your output gets written to a file. The shadow writes it.

Once your job is running, however, there are two reasons it will stop running. The first is that Condor implements time slicing for the "idle" resources it is using. If there are more Condor jobs than machines, it does not wait for the running jobs to complete. Rather, it assigns multiple jobs to the same host and it sends STOP and START signals to control which ones are running and for how long. The other way your program stops is if the owner of the machine decides to reclaim it. The start daemon monitors the machine for keyboard activity, load average, occupied memory, etc. and when it detects that the owner has become active again, it "evicts" the Condor jobs on the machine. In the vanilla universe, the eviction action is to KILL the job.

Again, in the vanilla universe, if the job has not run to completion (voluntarily exited or faulted), Condor will keep track of the script and restart the job from the beginning on some other idle machine. To remove a job from eligibility, you can use the

condor_rm
command. I'll let you read about it, but it tells Condor to kill any running instances of the job and remove any script for the job that may be pending.

The Standard Universe

A kill and restart may be fine for what you are doing (my Ramsey search code for Condor uses the vanilla universe) but it requires that your program be able to save its own state and restart itself from where it left off when Condor kills and restarts it. This process is called checkpointing and in the vanilla universe you must write your own checkpointing routines. Condor will not warn you when it is going to kill your jobs, so knowing when to checkpoint is a statistical game, but the most you lose is the work you've done since your last checkpoint. However, if you stare at Unix and Linux systems long enough you eventually figure out that the core file that gets created when your process "core dumps" actually gives you enough information to restart the program from the point where the core was taken. Moreover, the default action when sending a process a SIGQUIT is to drop a core file.

Thus Condor includes a facility for causing your process to drop a core file and then restarting your process from the core (as opposed to at the beginning) when your process is evicted. To get this functionality, you must run in the standard universe, but there are some restrictions and rules that you must follow.

First, in the standard universe your program can open files. However, in the same way that standard in and standard out are redirected back to the shadow process running on the submitting machine, all the file I/O is also directed back. To pull off this trick, you must compile your program with a version of the compiler that replaces all of the file I/O operations with ones that do the redirection. The command to use is condor_compile. Doing this is trickier than it looks because a reclaim event could take place while your program is in the middle of an I/O. The Condor I/O libraries have been carefully written to make sure everything goes smoothly.

Next, Condor does not have the ability to take a core file from a machine of one type and start it on another. It turns out that the machines in each Condor "pool" all have the same type, but that there are multiple pools. In this class you will only be using one pool, so maybe this isn't a big drawback.

A bigger problem, perhaps, is that your program is prohibited from using threads or from forking. The problem here is that once your job forks, Condor has a hard time making sure it does no evil. Also, each forked job would require its own checkpoint file and that could be big. There are other restrictions for the standard universe and they can be found here.

That should be enough to get you started. You can essentially run forever in the Condor pool if you like since your jobs will not disturb the machine owners at Wisconsin. Go to it.