Here is how it works. Each machine that allows Condor job submission runs a daemon called condor_schedd, which is responsible for collecting your binary and job script. Let's say we have written a hello_world.c program and we want to run it under Condor. The submission script might look like

    Executable = ./hello_world
    Arguments =
    Universe = vanilla
    should_transfer_files = yes
    WhenToTransferOutput = ON_EXIT_OR_EVICT
    Output=/home/rich/example/hello_world.output.$(Process)
    Queue 1

where the binary is called hello_world and it is in the same directory as the submission script. The script is ASCII text. Let's say the file name is hw.submitfile.
To submit it you use the condor_submit command.
Here is an example output from my login:

    [rich@agt-login example]$ condor_submit hw.submitfile
    Submitting job(s).
    1 job(s) submitted to cluster 947962.
    [rich@agt-login example]$

There are many options that you can include in your submit file and some of these dramatically affect the performance of the job. In this simple version, the output will be sent (as it occurs) to a file called hello_world.output.0 in the example subdirectory of my home directory.
However, once the job is submitted it will wait until Condor finds an idle machine on which to run it. To see what else Condor is doing, type

    condor_q

which lists all of the jobs in your Condor pool and what their status is. If you want the status of a particular job, use the cluster number:
    [rich@agt-login example]$ condor_q 947962

    -- Submitter: agt-login.cs.wisc.edu : <198.51.254.66:59037> : agt-login.cs.wisc.edu
     ID       OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     947962.0 rich   5/13 12:41   0+00:00:21 I  0   0.0  hello_world

Why is it a cluster number? Because you can ask Condor to spawn multiple jobs in a single submission by changing the Queue argument in the submit script. The collection of jobs is called a "cluster" and each individual job is given an index (starting at zero) within the cluster.
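As a rough illustration of the numbering, the sketch below mimics how a submission with Queue 3 would yield job IDs and output file names. The helper expand_cluster and its plain string substitution are my own stand-ins, not Condor code; only the $(Process) macro and the cluster.index naming come from the example above.

```python
def expand_cluster(cluster_id, queue_count, output_template):
    """Return (job_id, output_path) pairs for one submission.

    Hypothetical helper: it imitates the naming scheme only and
    does not talk to Condor.
    """
    jobs = []
    for process in range(queue_count):  # indices start at zero
        job_id = f"{cluster_id}.{process}"
        output_path = output_template.replace("$(Process)", str(process))
        jobs.append((job_id, output_path))
    return jobs


if __name__ == "__main__":
    template = "/home/rich/example/hello_world.output.$(Process)"
    for job_id, path in expand_cluster(947962, 3, template):
        print(job_id, path)
```

Each job in the cluster gets its own output file this way, so the jobs do not overwrite one another's output.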
The line shown above tells you that I submitted the job, when I submitted it, that it has not accumulated much run time yet, and that it is in the "I" state, meaning the job is idle. When it runs, it will go to the "R" state until it exits, and then the output will be put in the output file.
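If you want to pick these fields apart in a script, something like the following sketch may help. It assumes the column layout of the sample condor_q output above (ID, OWNER, SUBMITTED, RUN_TIME, ST, PRI, SIZE, CMD); other Condor versions may print different columns, and the names QueueEntry and parse_queue_line are hypothetical.

```python
from typing import NamedTuple

# Only the two states discussed in the text; any other code is passed through.
STATES = {"I": "idle", "R": "running"}


class QueueEntry(NamedTuple):
    cluster: int
    process: int
    owner: str
    submitted: str
    run_time: str
    state: str
    priority: int
    size: float
    command: str


def parse_queue_line(line):
    """Split one job line of the sample condor_q output into named fields."""
    # SUBMITTED is two tokens (date and time); CMD keeps any embedded spaces.
    fields = line.strip().split(None, 8)
    job_id, owner, date, clock, run_time, state, priority, size, command = fields
    cluster, process = job_id.split(".")
    return QueueEntry(int(cluster), int(process), owner, f"{date} {clock}",
                      run_time, STATES.get(state, state),
                      int(priority), float(size), command)


if __name__ == "__main__":
    sample = "947962.0 rich 5/13 12:41 0+00:00:21 I 0 0.0 hello_world"
    print(parse_queue_line(sample))
```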
WARNING: There is much more to using Condor than I've illustrated in this example. You MUST read the documentation. No -- wait -- read this sentence again. YOU MUST read the documentation. I've listed several of the documentation pages at the top of this page. Consult them early and often.
When an idle machine is discovered, a proxy process called the "starter" is started on the idle machine and is given the address of the shadow on the submitting machine. The two open a couple of sockets between each other, and then the starter forks your job.
In the vanilla universe, these two guarantees are made in the following way. First, by carefully manipulating the shell defaults, your program is prevented from opening any local files. Any attempt to open a local file will simply cause an error. Secondly, the terminal for your job (standard in, standard out, standard error) is connected to the sockets that the starter opened to the shadow. Any output your program produces will be automatically redirected back to the shadow on the submitting machine. That is how your output gets written to a file. The shadow writes it.
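The redirection of standard out can be tried outside Condor with a local socket pair. This is only a sketch of the idea, assuming a POSIX system: the real starter and shadow talk over network sockets between two machines, whereas here both ends live in one process, and the names shadow_end, starter_end, and run_with_redirected_stdout are mine.

```python
import socket
import subprocess
import sys


def run_with_redirected_stdout(argv):
    """Run argv with standard out wired to a socket.

    Returns the bytes that arrive at the 'shadow' end of the socket pair.
    """
    shadow_end, starter_end = socket.socketpair()
    with shadow_end, starter_end:
        # The child inherits the starter's socket as its standard out.
        child = subprocess.Popen(argv, stdout=starter_end.fileno())
        # Drop our copy so end-of-file arrives once the child exits.
        starter_end.close()
        chunks = []
        while True:
            data = shadow_end.recv(4096)
            if not data:
                break
            chunks.append(data)
        child.wait()
    return b"".join(chunks)


if __name__ == "__main__":
    job = [sys.executable, "-c", "print('hello, world')"]
    output = run_with_redirected_stdout(job)
    # The 'shadow' side is the one that would write the output file.
    sys.stdout.write(output.decode())
```

The job itself never opens a file; whatever it prints travels over the socket, and the receiving side decides where to store it.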
Once your job is running, however, there are two reasons it will stop running. The first is that Condor implements time slicing for the "idle" resources it is using. If there are more Condor jobs than machines, it does not wait for the running jobs to complete. Rather, it assigns multiple jobs to the same host and it sends STOP and START signals to control which ones are running and for how long. The other way your program stops is if the owner of the machine decides to reclaim it. The start daemon monitors the machine for keyboard activity, load average, occupied memory, etc. and when it detects that the owner has become active again, it "evicts" the Condor jobs on the machine. In the vanilla universe, the eviction action is to KILL the job.
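To see what suspending, resuming, and killing a process looks like from the controlling side, here is a small sketch, assuming a POSIX system. On POSIX the relevant signals are SIGSTOP, SIGCONT, and SIGKILL; how Condor itself decides when to send them is not shown, and the function name suspend_resume_evict is hypothetical.

```python
import os
import signal
import subprocess
import sys


def suspend_resume_evict():
    """Suspend a child process, resume it, then kill it; return what happened."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    events = []

    os.kill(child.pid, signal.SIGSTOP)  # suspend the job
    # WUNTRACED makes waitpid report a stopped (not only an exited) child.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    events.append("stopped" if os.WIFSTOPPED(status) else "unexpected")

    os.kill(child.pid, signal.SIGCONT)  # let it run again
    events.append("continued")

    child.kill()  # sends SIGKILL, as in a vanilla-universe eviction
    child.wait()
    killed = child.returncode == -signal.SIGKILL
    events.append("killed" if killed else "unexpected")
    return events


if __name__ == "__main__":
    print(suspend_resume_evict())
```

Note that the job cannot catch or ignore SIGSTOP or SIGKILL, which is why a killed vanilla job has no chance to save its work.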
Again, in the vanilla universe, if the job has not run to completion (voluntarily exited or faulted), Condor will keep track of the script and restart the job from the beginning on some other idle machine. To remove a job from eligibility, you can use the condor_rm command. I'll let you read about it, but it tells Condor to kill any running instances of the job and remove any script for the job that may be pending.
Thus Condor includes a facility for causing your process to drop a core file and then restarting your process from the core (as opposed to at the beginning) when your process is evicted. To get this functionality, you must run in the standard universe, but there are some restrictions and rules that you must follow.
First, in the standard universe your program can open files. However, in the same way that standard in and standard out are redirected back to the shadow process running on the submitting machine, all the file I/O is also directed back. To pull off this trick, you must compile your program with a version of the compiler that replaces all of the file I/O operations with ones that do the redirection. The command to use is condor_compile. Doing this is trickier than it looks because a reclaim event could take place while your program is in the middle of an I/O. The Condor I/O libraries have been carefully written to make sure everything goes smoothly.
Next, Condor does not have the ability to take a core file from a machine of one type and start it on another. It turns out that the machines in each Condor "pool" all have the same type, but that there are multiple pools. In this class you will only be using one pool so maybe this isn't a big drawback.
A bigger problem, perhaps, is that your program is prohibited from using threads or from forking. The problem here is that once your job forks, Condor has a hard time making sure it does no evil. Also, each forked job would require its own checkpoint file and that could be big. There are other restrictions for the standard universe and they can be found here.
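The difference between restarting from the beginning and restarting from a checkpoint can be seen in a toy example. Be aware that this sketch swaps in a different technique: the program saves its own state with pickle, whereas the standard universe checkpoints the whole process image without the program's help. The function run_job, the evict_at parameter, and the checkpoint file name are all made up for illustration.

```python
import os
import pickle
import tempfile


def run_job(total_steps, checkpoint_path, evict_at=None):
    """Sum the integers below total_steps, saving progress after every step.

    Returns the sum, or None if the run is 'evicted' at step evict_at.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            step, total = pickle.load(f)  # resume where the last run stopped
    else:
        step, total = 0, 0  # no checkpoint: start from the beginning

    while step < total_steps:
        if evict_at is not None and step == evict_at:
            return None  # simulated eviction; the checkpoint stays behind
        total += step
        step += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump((step, total), f)

    os.remove(checkpoint_path)  # finished, so the checkpoint is not needed
    return total


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    print(run_job(10, path, evict_at=4))  # evicted partway through
    print(run_job(10, path))              # picks up at step 4, not step 0
```

Without the checkpoint file, the second call would repeat steps 0 through 3, which is what happens to an evicted vanilla job.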
That should be enough to get you started. You can essentially run forever in the Condor pool if you like since your jobs will not disturb the machine owners at Wisconsin. Go to it.