Archive for the ‘PBS’ Category

Our queue is set up so that job priority is basically first in, first out. However, at times, a user will have many jobs in the queue, but only a couple are running. Then, if more jobs are submitted to the queue, those jobs run before the first ones are done. This happens when something goes wrong with the scheduler and the waiting jobs are given DEFER status. Unfortunately, these jobs remain deferred even after whatever caused the problem has passed. The current solution is to run (as root) /var/maui/bin/releasehold <job number>, which releases the hold. I haven’t yet found a way to do this for all jobs at once, so for now it has to be run for each job.
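
Since releasehold has to be run once per job, a small shell loop can take some of the tedium out of it. This is only a sketch: the release_all function name and the RELEASEHOLD override are made up for illustration, and it assumes you already have the deferred job numbers, one per line, from whatever listing you trust.

```shell
#!/bin/sh
# Path to releasehold on this setup. Override it (e.g. RELEASEHOLD=echo)
# to do a dry run that only prints the job numbers.
RELEASEHOLD="${RELEASEHOLD:-/var/maui/bin/releasehold}"

# Read job numbers from stdin, one per line, and release the hold on each.
release_all() {
    while read -r jobid; do
        [ -n "$jobid" ] || continue   # skip blank lines
        "$RELEASEHOLD" "$jobid"
    done
}
```

For example, after setting RELEASEHOLD=echo, running printf '1234\n1235\n' | release_all just prints the two job numbers, which is a cheap way to check the list before running it for real as root.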

This manual should give some clues as to what we can do to eliminate this problem.

After installing the Maui scheduler, I made a couple of changes to the queue. Here are the current settings:

Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue cdf1
#
create queue cdf1
set queue cdf1 queue_type = Execution
set queue cdf1 from_route_only = True
set queue cdf1 resources_max.cput = 240:00:00
set queue cdf1 resources_min.cput = 00:00:01
set queue cdf1 enabled = True
set queue cdf1 started = True
#
# Create and define queue cdf
#
create queue cdf
set queue cdf queue_type = Route
set queue cdf route_destinations = cdf1
set queue cdf route_held_jobs = True
set queue cdf route_waiting_jobs = True
set queue cdf enabled = True
set queue cdf started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = *.uchicago.edu
set server default_queue = cdf
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server resources_default.walltime = 24:00:00
set server scheduler_iteration = 300
set server node_check_rate = 120
set server tcp_timeout = 6
set server pbs_version = 2.1.1

Trying again to get the Maui scheduler working because the plain pbs scheduler is having lots of problems.

I found a page on the web that said to leave the --with-pbs flag off the configure command. So the configure command I used was:

./configure --prefix=/var/maui
make (ran without any errors)
make install

I forgot that I did have these environment variables set. I don’t know if they did anything.

CPPFLAGS='-I/var/torque/include'
LDFLAGS='-L/var/torque/lib'

Edit the file /var/maui/maui.cfg. Basically, I just took the one from the Tier2 cluster and changed the name of the server. To start the Maui scheduler, be sure to stop pbs_sched and then run: /var/maui/sbin/maui -C /var/maui/maui.cfg

It also created a bunch of files in /usr/spool/maui, which is NOT where I’ve been keeping that stuff. So, I copied it all to /var/maui and made a link from /usr/spool/maui to /var/maui. Edited /etc/rc.d/init.d/torque to reflect the new scheduler and restarted it.

Sometimes the queue would have open machines and jobs waiting, but wouldn’t start the jobs. The comment on these jobs said they weren’t running because they were waiting for starving jobs to finish. By default, the scheduler does not run any other jobs while there are jobs that have been waiting for at least 24 hours (classified as starving). I disabled this behavior by editing /var/spool/pbs/sched_priv/sched_config and changing the help_starving_jobs option to false.
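
The edit itself can be sketched as a small filter. This is an illustration rather than anything from the original setup: the disable_starving function name is made up, and it assumes the option sits at the start of its line, optionally followed by a colon, with the value true right after it.

```shell
#!/bin/sh
# Read a sched_config on stdin and write it to stdout with
# help_starving_jobs switched from true to false.
# All other lines pass through unchanged.
disable_starving() {
    sed 's/^\(help_starving_jobs:\{0,1\}[[:space:]]\{1,\}\)[Tt][Rr][Uu][Ee]/\1false/'
}
```

Something like disable_starving < /var/spool/pbs/sched_priv/sched_config > /tmp/sched_config.new leaves the original untouched, so the result can be looked over before it is copied into place.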

Previously, I had set the ideal_load to 0.3 and the max_load to 1.0. These are too low for the machines in the glass room. I’m changing them all to:

$ideal_load 2.0
$max_load 3.0

To add a new node (say cdf5) to the pbs queue on cdf30, start the qmgr program and issue this command:

c n cdf5 np=2,properties=fast

Only put the np=2 part in if the node has dual processors.
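
When several nodes need adding, the command can be generated instead of typed. A minimal sketch, assuming the spelled-out form "create node" is equivalent to the "c n" abbreviation above; the node_cmd helper is hypothetical and always uses the "fast" property from the example.

```shell
#!/bin/sh
# Print a qmgr "create node" command for one node.
# $1 = node name, $2 = processor count (optional; pass 2 for a dual-processor node).
node_cmd() {
    if [ "${2:-1}" -gt 1 ]; then
        printf 'create node %s np=%s,properties=fast\n' "$1" "$2"
    else
        printf 'create node %s properties=fast\n' "$1"
    fi
}
```

Since qmgr reads commands from standard input, the output can be checked by eye first and then piped in, for example node_cmd cdf5 2 | qmgr.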

I forgot that most people have scripts to submit jobs to the queue, and that these scripts have the queue name hard-coded into them. So, I changed the names of all the queues to match the names on the old system. I also updated the queue names in the earlier blog entry to what they are now.

After upgrading some kernels to 2.4.21-40, the make install_mom install_clients command wouldn’t work. So I recompiled torque on a machine already upgraded to 2.4.21-40. This worked fine, except that it set the name of the server to cdf36 instead of cdf30. So, after running the make command, check that /var/spool/pbs/server_name and /var/torque/server_name contain the correct name.
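
That check is easy to forget, so here is a rough sketch of it as a shell function. The check_server_name name is made up, and it assumes each file holds just the server name on its first line, written the same way as the name you pass in.

```shell
#!/bin/sh
# Compare each server_name file against the expected server name.
# $1 = expected name; remaining arguments = files to check.
# Prints one line per problem and returns non-zero if any file is wrong.
check_server_name() {
    expected=$1
    shift
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "$f: cannot read"
            status=1
            continue
        fi
        actual=$(head -n 1 "$f")
        if [ "$actual" != "$expected" ]; then
            echo "$f: has '$actual', expected '$expected'"
            status=1
        fi
    done
    return "$status"
}
```

On this cluster the call would be check_server_name cdf30 /var/spool/pbs/server_name /var/torque/server_name, which prints nothing when both files are right.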

pbsnodes -a gives a list of all machines in the queue and their state.

If a machine comes up as “Down” when it should be up, do this:

pbsnodes -d

Then, pbsnodes -a should show it correctly.
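
With more than a handful of machines, picking the down ones out of the pbsnodes -a listing by eye gets tedious. The filter below is a sketch under an assumption about the output layout: each node name starts in the first column, and its attributes follow on indented "key = value" lines, one of which is "state". The down_nodes name is made up.

```shell
#!/bin/sh
# Read `pbsnodes -a` style output on stdin and print the names of
# nodes whose state contains "down".
down_nodes() {
    awk '
        # An unindented, non-empty line is taken to be a node name.
        NF > 0 && substr($0, 1, 1) != " " && substr($0, 1, 1) != "\t" { node = $1 }
        # An indented "state = ..." line belongs to the most recent node.
        $1 == "state" && $2 == "=" && $3 ~ /down/ { print node }
    '
}
```

Usage would be pbsnodes -a | down_nodes, which prints one node name per line.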

Maui didn’t seem to work properly after the first compilation, so I’m trying it again.

./configure --prefix=/var/maui --with-pbs=/var/torque
make install
I get tons of errors. One thing I do notice is that it’s trying to link to libraries in /usr/local/maui/include and there’s nothing there. I edited the Makefile to change this to /var/maui/include, but there isn’t anything there yet either.

In the Makefile, edit the line:
export MSCHED_HOME=/var/torque

After this, make got a little further, but I still get errors.