Archive for July, 2006

After upgrading some machines to SLC3, I would get this error message during boot. The solution was to add the following to the kernel lines in /etc/grub.conf:

apm=off noapmd acpi=off noacpi

The kernel lines then read as:

kernel /vmlinux-2.4.21-27.0.2.EL.cernsmp ro root=LABEL=/ apm=off noapmd acpi=off noacpi
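With several machines to fix, editing each kernel line by hand gets tedious. Here is a rough sketch of how the change could be scripted. The function name add_kernel_args is made up for this example, and it assumes a GRUB legacy config where each kernel line starts with the word "kernel" (optionally indented). Back up the file before trying anything like this.

```shell
# Hypothetical helper (not part of GRUB): append boot parameters to every
# "kernel" line of a GRUB legacy config file.
# Usage: add_kernel_args FILE "param1 param2 ..."
add_kernel_args() {
    conf=$1
    extra=$2
    tmp=$(mktemp) || return 1
    # Touch only lines whose first word is "kernel" and that do not
    # already contain the extra parameters, so a second run is harmless.
    if awk -v extra="$extra" '
        $1 == "kernel" && index($0, extra) == 0 { print $0 " " extra; next }
        { print }
    ' "$conf" > "$tmp"; then
        # Write back with cat so the original file keeps its permissions.
        cat "$tmp" > "$conf"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}

# Example (as root, after making a backup copy of the file):
#   add_kernel_args /etc/grub.conf "apm=off noapmd acpi=off noacpi"
```

Running it twice should not add the parameters a second time, since lines that already contain them are skipped.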

I just returned from a four-day networking class. The old Ethereal software has been replaced with Wireshark, so I installed the latest version on my Mac and am now just watching broadcast traffic to get a feel for what is going on with our network.

First thing I found is that some of our printers were broadcasting Novell (IPX) stuff. We’re not running any Novell stuff, so I turned this off. Next, I found a bunch of packets that looked like this:

234 0.07334 65369.1 0.255 ZIP GetNetInfo Request

I now know that these are AppleTalk packets and we don’t need to be broadcasting them. After a while, I found that the offending device was an old Netgear print server. I logged into the print server, but couldn’t find a way to turn off AppleTalk. So, when I get some time, I’ll just attach the printer directly to the user’s computer and take it off the network. I don’t think that it’s necessary for this printer to be networked.

I forgot that most people have scripts to submit jobs to the queue, and that these scripts have the queue name hard-coded into them. So, I changed the names of all the queues to match the names on the old system. I also updated the blog entry that lists the queue names so that it shows what they are now.

After upgrading some kernels to 2.4.21-40, the make install_mom install_clients command wouldn’t work. So I recompiled torque on a machine already upgraded to 2.4.21-40. This worked fine, except that it has the name of the server as cdf36 instead of cdf30. So, after running the make command, check that /var/spool/pbs/server_name and /var/torque/server_name contain the correct name.
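The server_name check is easy to forget, so here is a small sketch that could be run after each install. The function name check_server_name is invented for this example, and it assumes the server name sits alone on the first line of each file and is written exactly the same way everywhere (short name versus fully qualified name would count as a mismatch).

```shell
# Hypothetical helper: verify that each given file names the expected server.
# Usage: check_server_name EXPECTED FILE...
# Returns non-zero if any file is missing or names a different server.
check_server_name() {
    expected=$1
    shift
    rc=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "MISSING  $f"
            rc=1
        elif [ "$(head -n 1 "$f" | tr -d '[:space:]')" = "$expected" ]; then
            echo "OK       $f"
        else
            echo "MISMATCH $f (found: $(head -n 1 "$f"))"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_server_name cdf30 /var/spool/pbs/server_name /var/torque/server_name
```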

The solution to getting the D-Link dge530t gb card working was to use the driver from the Marvell site and NOT the driver from the D-Link site. The D-Link driver would not compile. The Marvell driver compiled fine, as long as the kernel source rpm was installed. And now that it has been compiled on one machine running SLF305, it can just be copied to the new machine’s modules directory. Then, when the D-Link card is installed, kudzu will find it on the next reboot.

pbsnodes -a gives a list of all machines in the queue and their state.

If a machine comes up as “Down” when it should be up, do this:

pbsnodes -d

Then, pbsnodes -a should show it correctly.
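On a bigger cluster it helps to pull just the down machines out of the pbsnodes -a listing. This is only a sketch: the function name down_nodes is made up, and it assumes the usual listing layout, with each node name starting at the left margin followed by indented "state = ..." lines.

```shell
# Hypothetical filter: read "pbsnodes -a" style output on stdin and print
# the name of every node whose state includes "down".
down_nodes() {
    awk '
        # A non-blank line starting at column one is taken as a node name.
        /^[^ \t]/ { node = $1 }
        # Indented attribute line; also matches combined states such as
        # "down,offline".
        $1 == "state" && $2 == "=" && $3 ~ /down/ { print node }
    '
}

# Example:
#   pbsnodes -a | down_nodes
```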

Maui didn’t seem to work properly after the first compilation, so I’m trying it again.

./configure --prefix=/var/maui --with-pbs=/var/torque
make install
I get tons of errors. One thing I do notice is that it’s trying to link to libraries in /usr/local/maui/include and there’s nothing there. I edited the Makefile to change this to /var/maui/include, but there isn’t anything there yet either.

In the Makefile, edit the line:
export MSCHED_HOME=/var/torque

After this, make got a little further, but I still get errors.

I upgraded cdf29 to the 2.4.21-40 kernel and installed one of the D-Link gb network cards. The card is listed by lspci, but the sk98lin driver will not load for it. I downloaded the latest source and tried compiling it, but it gave me a bunch of errors. Through Google, I found this page:

http://ncdf68.fnal.gov/twiki/bin/view/Main/MoversUpgrade#By_Wayne (Hit cancel on the window that comes up and look under “By Wayne”)

And it looks like Fermilab may have fixed this driver, but since I don’t have an account there, I can’t see for myself. I asked one of the students to download the driver file for me and we’ll see if that works.

I also bought a couple more Intel cards because I know those will work.

The networking problems were caused by a setting that I didn’t make after adding the new storage units. Since the storage units have two ethernet adapters, I put one on the campus 10 subnet and one on my own 192.168 subnet that was going to be set up for gb speeds. The idea was that most of the data transfer would take place on the 192.168 subnet, keeping it off the campus network. Unfortunately, I was unable to get the gb NICs working. I decided to let the cdf users use the storage units through the 10 subnet while I continued to work on the gb network stuff. The problem is that, by default, data on the 128 subnet will go to the switch and then come back on the 10 subnet. This basically overwhelmed the switch, causing all our problems. Ron at Network Services told me to add a route to the 10 subnet, so that the step of going to the switch would be eliminated. So, I added the following:

route add -net 10.135.102.0 netmask 255.255.255.0 eth0

So, now our route table looks like this:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.135.102.0   *               255.255.255.0   U     0      0        0 eth0
10.135.102.0    *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         v102router.uchi 0.0.0.0         UG    0      0        0 eth0

To make the route permanent, do the following:

On SLF305, create /etc/sysconfig/network-scripts/route-eth0 (permissions 755) containing:
10.135.102.0/24 dev eth0
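The route command takes a dotted netmask (255.255.255.0), while the route-eth0 file uses the prefix length (/24). As a sketch of how the two relate, the helpers below convert one form to the other. Both function names are made up for this example, and the conversion only checks each octet on its own, so it does not catch a mask whose bits are out of order.

```shell
# Hypothetical helper: convert a dotted netmask to a prefix length.
# Usage: mask_to_prefix 255.255.255.0
mask_to_prefix() {
    bits=0
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the mask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

# Hypothetical helper: build a route-ethX style line from the same
# network, netmask and interface that were given to "route add -net".
# Usage: route_line NETWORK NETMASK INTERFACE
route_line() {
    prefix=$(mask_to_prefix "$2") || return 1
    echo "$1/$prefix dev $3"
}

# Example:
#   route_line 10.135.102.0 255.255.255.0 eth0
```

For the route added above, route_line 10.135.102.0 255.255.255.0 eth0 prints the same line that goes into route-eth0.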

I am compiling the Maui scheduler to use with the new Torque pbs software. This scheduler is supposed to be better than the default one that comes with Torque, so I thought I’d give it a try.

As user me:
./configure --with-prefix=/var/maui
make
As root, mkdir /var/maui and chown it to me, then:
make install