Batch scheduling

Ever notice how your computer slows down as you open more and more applications? This is especially true for computationally intensive applications. How can you tell whether your optimization efforts are worth anything if the computer is running all kinds of unrelated programs at the same time as your super-optimized library? This is why people often set up computers to use a batch scheduler, also called a batch queue. This is a program running on a “master” or “front-end” server that controls computational jobs on a cluster of “worker” or “compute” nodes. The master ensures that each worker runs at most one job at a time. You can log into the master and submit jobs to the queue. However, you can't log into the compute nodes and run jobs directly; they are protected from logins that would disturb running computations. This ensures a fair distribution of resources among the jobs.

Interactive processes

Interactive jobs are those that don't go through the batch scheduler. They can be handy for debugging code that doesn't take too long to run. You usually shouldn't run them like any other program, though, because then they will run on the front-end node, to which 80 gazillion other people are logged in. They will be very annoyed if your compute-intensive program prevents them from editing their files and running their own jobs! (They can also use the ps command to find out who you are, so don't be surprised if they send you angry e-mails or track you down.)

To get an interactive shell on a remote compute node for debugging or general use, you may use “qsub -I”.
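For example, a minimal interactive session might look like this (the resource request and the one-hour walltime are just illustrative; your cluster's limits may differ):

  # ask the scheduler for one cpu on one node, for up to an hour
  qsub -I -l nodes=1:ppn=1,walltime=01:00:00

The scheduler treats this like any other job: it waits until a node is free, then drops you into a shell on that node. When you exit the shell, the job ends and the node is released.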

Torque Resource Manager and Maui Cluster Scheduler

The CITRIS and PSI clusters run the Torque Resource Manager and Maui Cluster Scheduler. Torque is derived from PBS, so most PBS commands, options, and environment variables work the same way.

To submit a script to Torque, run qsub script.sh. Torque defines environment variables and runs your script on the first assigned node, but it does not launch processes on multiple CPUs for you. You may use gexec or mpirun within your script to launch processes on all of your assigned nodes from that first node. Our PBS installation sets the GEXEC_SVRS environment variable to a list of the nodes you were assigned. If you requested 2 CPUs per node, each node will be listed twice.
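As a sketch, a job script might look like the following. The resource requests, the job name, and the program name my_mpi_program are made up for illustration. Torque also writes the assigned node list to the file named by PBS_NODEFILE, which mpirun can read directly:

  #!/bin/sh
  # request 2 nodes with 2 cpus each, for up to 30 minutes
  #PBS -N myjob
  #PBS -l nodes=2:ppn=2
  #PBS -l walltime=00:30:00

  # qsub starts the script in your home directory; move to the
  # directory the job was submitted from
  cd $PBS_O_WORKDIR

  # launch one process per assigned cpu slot; $PBS_NODEFILE lists
  # each node once per requested cpu, much like GEXEC_SVRS
  mpirun -np 4 -machinefile $PBS_NODEFILE ./my_mpi_program

Submit it with qsub script.sh; qsub prints a job ID that you can use to track or cancel the job.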

The default queue is batch, which contains all nodes. View the output of qstat -a or qstat -q to check the queue status.
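A few handy variations (the flags are standard Torque/PBS; the job ID below is made up):

  qstat -q        # summarize the queues and their limits
  qstat -a        # list all jobs in the system
  qstat -u $USER  # list only your own jobs
  qdel 1234       # cancel job 1234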

 