Archive for the ‘PBS’ Category

After the recent reinstallations, torque and maui need to be reinstalled. Since we’ve changed the setup a bit, I think that it’s now ok to take the default installation locations (/usr/local). So the command to configure and install the software is:

cd /system/software/linux/torque-2.0.0
./configure --with-rcp=scp
make
make install (as root)

This puts all the programs in /usr/local/bin and the needed libraries in /usr/local/lib. The spool directory is in /var/spool/torque.
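
A quick sanity check (my own habit, not part of the install docs) to confirm that the binaries and the spool directory landed where expected:

ls /usr/local/sbin/pbs_server /usr/local/sbin/pbs_mom /usr/local/bin/qmgr
ls /var/spool/torque

If the client commands complain about a missing libtorque, /usr/local/lib probably needs to be added to /etc/ld.so.conf (then run ldconfig).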

Maui is done with:

cd /system/software/linux/maui-3.2.6p19
./configure
make
make install (as root)

Maui’s home is /usr/local/maui.

Startup scripts are provided in /system/software/linux/torque-2.3.0/contrib/init.d. We only need pbs_mom and pbs_server because we’ll be using maui for the scheduler. They need to be edited with the correct values.
PBS_DAEMON=/usr/local/sbin/pbs_server
PBS_HOME=/var/spool/torque

Copy pbs_mom and pbs_server to /etc/rc.d/init.d, then run /etc/rc.d/init.d/pbs_server. Once the server process is running, the queues can be created with qmgr.

[root@cpserver init.d]# qmgr
Max open servers: 4
Qmgr: p s
#
# Set server attributes.
#
set server acl_hosts = cpserver
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
Qmgr: c q cp1
Qmgr: s q cp1 queue_type=Execution
Qmgr: s q cp1 from_route_only=True
Qmgr: s q cp1 resources_max.cput=240:00:00
Qmgr: s q cp1 resources_min.cput=00:00:01
Qmgr: s q cp1 enabled=True
Qmgr: s q cp1 started=True
Qmgr: c q cp
Qmgr: s q cp queue_type=Route
Qmgr: s q cp route_destinations=cp1
Qmgr: s q cp route_held_jobs=True
Qmgr: s q cp route_waiting_jobs=True
Qmgr: s q cp enabled=True
Qmgr: s q cp started=True
Qmgr: s s scheduling=True
Qmgr: s s acl_host_enable=True
Qmgr: s s acl_hosts=*.uchicago.edu
Qmgr: s s default_queue=cp
Qmgr: s s query_other_jobs=True
Qmgr: s s resources_default.nodect=1
Qmgr: s s resources_default.nodes=1
Qmgr: s s resources_max.walltime=96:00:00
Qmgr: s s submit_hosts = cpserver
Qmgr: c n cpserver np=2
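
It's probably also worth dumping the whole configuration to a file so it can be replayed if the server database is ever rebuilt (the file name here is just my choice):

qmgr -c 'print server' > /root/torque-server-config.txt

It can be fed back in later with qmgr < /root/torque-server-config.txt.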

Maui’s startup script is provided in /system/software/linux/maui-3.2.6p19/contrib/service-scripts/redhat.maui.d. Edit this file:
MAUI_PREFIX=/usr/local/maui
Also change the user it should run as. We don't have a maui user, so I tried my own username instead. That turned out to be a big problem, so it has to run as root.

and copy to /etc/rc.d/init.d/maui.

Now, chkconfig --add pbs_mom, pbs_server, and maui. Restart them all and submit a test job.
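
For the test job, something trivial is enough. A minimal sketch, submitted as a regular user (pbs normally won't accept jobs from root):

echo "hostname; sleep 30" | qsub -q cp
qstat -a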

The job was accepted into the queue but never executed. Oops, I forgot to edit maui.cfg to add ADMIN1 and ADMIN3 and to change the RMCFG line:

ADMIN1   root maryh
ADMIN3   ALL

#RMCFG[CPS1] TYPE=PBS@RMNMHOST@
RMCFG[base] TYPE=PBS
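
Maui has to be restarted to pick up these edits; with the init script copied earlier, that's just:

/etc/rc.d/init.d/maui restart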

The test job now works, so I can move on to the compute node.

The compute node doesn’t need maui, only torque. So simply run make install on the compute node.

In /var/spool/torque, check that server_name has the proper name. Copy the pbs_mom startup script from the server to this node. Start it up. Back on the server, create a new node in qmgr.

c n cpcompute np=8
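
To confirm the node registered, something like this on the server should show it (the state will stay down until pbs_mom on cpcompute is reachable):

pbsnodes cpcompute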

Create /var/spool/torque/mom_priv/config

$usecp cpserver.uchicago.edu
$ideal_load 8.0
$max_load 10.0
$restricted *.uchicago.edu

This node has eight cores, so the ideal_load is eight.

Finally, go back into qmgr on the server and add the compute host as another submit host:

qmgr
s s submit_hosts += cpcompute

If you want to allow other hosts to submit jobs to the queue, add the following in the queue manager (qmgr).

set server submit_hosts = host1
set server submit_hosts += host2

As long as the qsub programs are installed on host1 and host2, a user will be able to submit jobs from either of those hosts to the queue.
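
For example, from host1 (assuming the torque client commands are installed there and its server_name file points at cpserver), a job could be submitted with something like the following, where myjob.sh is just a placeholder script:

qsub -q cp@cpserver myjob.sh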

I’m installing a new pbs for Ed’s group. I redownloaded the torque and maui software to /system/software/linux. Maui is already running on a machine designated as the queue server. Now, I just need to set up a new machine with eight cores as a compute node.

[cpcompute] ./configure --prefix=/var/torque --exec-prefix=/var/torque --with-rcp=scp --with-server-home=/var/spool/pbs --with-server-name=/var/torque/server_name
[cpcompute] make
[cpcompute] make install_mom install_clients  (as root)
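
With this prefix the daemons end up under /var/torque/sbin rather than /usr/local/sbin, so starting the mom by hand should look roughly like:

/var/torque/sbin/pbs_mom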

Queue settings for the CP group pbs queue:

[root@cp_server bin]# ./qmgr
Max open servers: 4
Qmgr: p s
#
# Set server attributes.
#
set server acl_hosts = cps1
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
Qmgr: c q cp1
Qmgr: s q cp1 queue_type=Execution
Qmgr: s q cp1 from_route_only=True
Qmgr: s q cp1 resources_max.cput=240:00:00
Qmgr: s q cp1 resources_min.cput=00:00:01
Qmgr: s q cp1 enabled=True
Qmgr: s q cp1 started=True
Qmgr: c q cp
Qmgr: s q cp queue_type=Route
Qmgr: s q cp route_destinations=cp1
Qmgr: s q cp route_held_jobs=True
Qmgr: s q cp route_waiting_jobs=True
Qmgr: s q cp enabled=True
Qmgr: s q cp started=True

Qmgr: s s scheduling=True
Qmgr: s s acl_host_enable=True
Qmgr: s s acl_hosts=*.uchicago.edu
Qmgr: s s default_queue=cp
Qmgr: s s query_other_jobs=True
Qmgr: s s resources_default.nodect=1
Qmgr: s s resources_default.nodes=1
Qmgr: s s resources_max.walltime=96:00:00

If a pbs job is unable to write to disk (usually a log file in the user's home area), check the file /var/spool/torque/mom_priv/config and make sure that there's a line for each SUBMISSION system that looks like this:

$usecp uct3-edge3.uchicago.edu:/ /
$usecp uct3-edge5.uchicago.edu:/ /
$usecp uct3-edge7.uchicago.edu:/ /

Then, if any changes were made to this file, restart pbs_mom.
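
Assuming the init script copied from the server is still named pbs_mom, that's just:

/etc/rc.d/init.d/pbs_mom restart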

pbsnodes -a – lists all nodes and their status
qstat -a – lists all jobs currently in the queue and their status

checknode nodename – gives details about a particular node
checkjob job_number – gives details about a particular job
showstart job_number – gives a rough approximation as to when a job may start
showq – gives info about all the jobs; the important stuff is at the end, where it shows blocked jobs. These can be unblocked with:

releasehold job_number – unblock or release a hold

Add the line:

ADMIN3: ALL

to maui.cfg to let everyone run commands that get info about currently running jobs and compute nodes. These users are not allowed to make changes to jobs or priorities. Note that the ALL must be in all caps, or it doesn’t work properly.

Unless specifically allowed, when users run the qstat command, they only see their own jobs. In order to see all the jobs in the queue, no matter who owns them, add the following line in qmgr:

set server query_other_jobs=true

After setting up the cpv queue, whenever I submitted a job I would get an error that said something like:

job rejected by all possible destinations

I had copied the queue settings from the working cdf queue and couldn't figure out what was wrong. Then I found a line on the torque wiki saying that if you get this error, you should set route_destinations to queue@localhost. I tried that and it didn't work. But then I changed it to:

s q route_destinations = cpv@pnn

That did the trick. So there must be a difference in how SLF and SLC determine their hostname or something like that. I’m not looking into it any further because this seems to work just fine.

Queue server: pnn
Queue computes: pnn2 pnn3 pnn4 pnn5 pnn6

Computes:
1. In /support/data1/maryh/torque-2.1.1-27, run:

make install_mom install_clients

2. Edit /var/spool/pbs/mom_priv/config:

$usecp pnn.uchicago.edu
$ideal_load 2.0
$max_load 3.0
$restricted *.uchicago.edu

3. Edit /var/spool/pbs/server_name
pnn

4. Get /etc/rc.d/init.d/torque from another pbs compute node.
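
A rough sketch of enabling that script once it's in place (assuming it registers cleanly with chkconfig):

chkconfig --add torque
/etc/rc.d/init.d/torque start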

Server:
1. Unpack new torque directory in /support/data1/maryh

2. ./configure --prefix=/var/torque --exec-prefix=/var/torque --with-rcp=scp --with-server-home=/var/spool/pbs --with-server-name=/var/torque/server_name

3. make – got errors; needed yum install tclx-devel

4. make install

5. edit /var/torque/server_name with pnn

6. edit /var/spool/pbs/torque.cfg

SERVERHOST pnn
ALLOWCOMPUTEHOSTSUBMIT true

7. Set up Maui
Maui would not compile. The config file had some strange lines in it, so I just copied the Makefile from the cdf30 setup and it compiled fine. Again, I had to copy everything from /usr/local/maui to /var/maui and make a link to /usr/local/maui from /var/maui.

8. Edit /var/maui/maui.cfg, look at cdf30 copy for the two lines.