Archive for May, 2007

My first disk on the raid failed. I only found out because the disk became read-only (I haven’t written a monitoring script yet). When I ran fdisk -l /dev/sda, it showed no partitions.

I ran tw_cli and got errors like this:

//pnn> info

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c0    9550SX-8LP   8       6        1       1        4       4       -        
c1    9550SX-8LP   8       8        1       0        4       4       -        

//pnn> info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY1788981     
p1     OK               u0     233.76 GB   490234752     WD-WCANY1795430     
p2     OK               u0     233.76 GB   490234752     WD-WCANY1851853     
p3     OK               u0     233.76 GB   490234752     WD-WCANY1787889     
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     OK               u0     233.76 GB   490234752     WD-WCANY1788370     
p7     OK               u0     233.76 GB   490234752     WD-WCANY1788683     

I’m not exactly sure why both p4 and p5 showed as not-present, but they did. The disk that was the problem was p4. (I found that out by running the command again, where it showed only p4 as the problem.)

To fix this, run:

/c0 remove p4 (or whichever disk is the problem)

Replace the disk with a new one.

/c0 rescan

Look for a message saying Found /c0/p4. If you don’t see one, do an info c0.

Then, start the rebuild with:

/c0/u0 start rebuild disk=4

After this command, the info c0 shows:

//pnn> info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     4      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY1788981     
p1     OK               u0     233.76 GB   490234752     WD-WCANY1795430     
p2     OK               u0     233.76 GB   490234752     WD-WCANY1851853     
p3     OK               u0     233.76 GB   490234752     WD-WCANY1787889     
p4     DEGRADED         u0     233.76 GB   490234752     WD-WCANY2233803     
p5     OK               u0     233.76 GB   490234752     WD-WCANY1789375     
p6     OK               u0     233.76 GB   490234752     WD-WCANY1788370     
p7     OK               u0     233.76 GB   490234752     WD-WCANY1788683     

This is a 250 GB disk and it looks like it’s going to take about an hour to rebuild.

The write cache should also be turned back on. Use:

tw_cli /c0/u0 set cache=on
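
Since I still haven’t written the monitoring script mentioned at the top, here is a minimal sketch of what one might look like. It assumes the info c0 output keeps the column layout shown in this post. The name check_raid is made up for illustration, and the function reads the tw_cli output on stdin so it can be tried against saved text first.

```shell
#!/bin/bash
# Sketch only: flag units and ports whose Status column is not OK.
# Assumes the column layout of the "info c0" output shown in this post.
# check_raid is a made-up name; it reads tw_cli output from stdin.
check_raid() {
    awk '
        # Unit lines look like: u0  RAID-5  INOPERABLE ...
        $1 ~ /^u[0-9]+$/ && $3 != "OK" { print "unit " $1 " is " $3; bad = 1 }
        # Port lines look like: p4  NOT-PRESENT ...
        $1 ~ /^p[0-9]+$/ && $2 != "OK" { print "port " $1 " is " $2; bad = 1 }
        # Exit 1 if anything was flagged, 0 otherwise
        END { exit (bad ? 1 : 0) }
    '
}

# Example use (needs tw_cli in the PATH):
#   tw_cli info c0 | check_raid || echo "raid problem on c0"
```

Run from cron, the non-zero exit status could be used to trigger a mail to whoever looks after the raid.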

We’re having some problems with our 3ware 9650se raid card. The messages in the log are:

3w-9xxx: scsi9: WARNING: (0x06:0x002C): Unit #0: Command (0x2a) timed out, resetting card.

And we get LOTS of these messages. The 3ware site said to try upgrading the driver. So I downloaded the file and untarred it. To create the driver, I ran:

make -f Makefile.rh

This created a 3w-9xxx.o file which I copied to /lib/modules/2.4.21-47.0.1.ELsmp/kernel/drivers/scsi.

I also upped the number of nfsd running from 64 to 128, though I’m pretty sure that this wasn’t the problem.

We’ll see if the new driver helps.
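
To judge whether the new driver helps, it would be handy to count the reset messages per day. A rough sketch, assuming the messages end up in a syslog-style file with the usual "Mon DD HH:MM:SS" prefix. The default path /var/log/messages is a guess (I haven’t checked which file they go to on this machine) and count_resets is a made-up name.

```shell
#!/bin/bash
# Sketch only: count 3w-9xxx "timed out, resetting card" lines per day.
# Assumes syslog-style lines that start with "Mon DD HH:MM:SS".
# count_resets is a made-up helper name; the default log path is a guess.
count_resets() {
    local logfile=${1:-/var/log/messages}
    grep '3w-9xxx.*timed out, resetting card' "$logfile" |
        awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' |
        sort
}

# Example use:
#   count_resets /var/log/messages
```

The output is one line per day with the number of resets, sorted as plain text rather than by calendar date.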

I took Marty’s old thermal_shutdown script and rewrote it in perl, so that I could use my ups shutdown script with it. I got the stuff for my shutdown script somewhere on the web, but I can’t find it anymore to cite it. So, my apologies to the original author.

#!/usr/bin/perl

# thermal_shutdown - Script to check temperature in glass room
#     If temp too high, email gurus, shutdown computer and shutdown ups
#
#       8 May 2007
#
#       Changed Marty's thermal_shutdown script to perl and added part about ups shutdown
#
#       The perl module Device::SerialPort must be installed.  I installed the rpm
#       perl-Device-SerialPort-1.002-1.2.el4.rf.i386.rpm.  A copy is in 
#       /support/data1/kickstart/additions.

use Device::SerialPort;  # Need to communicate with the serial port ups connection

# Some definitions
$PORT="/dev/ttyS0"; # Port with ups cable
$FILE="/system/monitors/therm_209a.data"; # File with temperature data
$LIMIT=80.0; # Temperature above which everything gets shut down
chomp($THISHOST=`hostname -s`); # Hostname of this computer
chomp($CURRENT=`cat $FILE`); # Current temperature
chomp($DATE=`date`); # Current date

# This entire script only runs if the current temp is greater than the limit

if($CURRENT > $LIMIT)
{
        #################################################################
        # MAIL SECTION:  Send mail to the gurus list with the information
        #################################################################
        my($sendmail) = "/usr/sbin/sendmail";
        my($subject) = "TEMPERATURE SHUTDOWN: $THISHOST";
        my($mailto) = 'chiefs@blog';   # single quotes so perl doesn't interpolate @blog
        my($mailfrom) = "root\@$THISHOST";
        my($message) = "                     
    *****  THERMAL SHUTDOWN   *****
    
        Computer $THISHOST and its ups are now being shut down to protect them from damage.

        The temperature in the glass room has exceeded $LIMIT degrees F.

        The temperature is $CURRENT degrees F on $DATE.

        Please call Physical Plant to have someone come out to check out the system.  If no 
        one responds in 15 minutes, call again, or call Jim.

        The number for Physical Plant is 123-555-1414.

        Jim's phone number:  123-555-7824.


";



        # Open a stream to mail and send everything
        open(MAIL,"|$sendmail -oi -t");
        print MAIL "From: $mailfrom\n";
        print MAIL "To: $mailto\n";
        print MAIL "Subject: $subject\n\n";
        print MAIL "$message\n";
        close(MAIL);

        ##########################################################################
        # SHUTDOWN SECTION:  Send signal to shutdown the ups and then the computer
        ##########################################################################
        # Connection Settings
        $ob = Device::SerialPort->new($PORT) or die "Can't open $PORT: $!\n";
        $ob->baudrate(2400);
        $ob->parity("none");
        $ob->databits(8);

        # Send Y to put the ups in smart mode
        $pass=$ob->write("Y");

        # Send two Ks with > 1.5s delay between to shut down
        $pass=$ob->write("K");
        sleep 2;
        $pass=$ob->write("K");

        undef $ob;

        # Now shut down the computer.  Hopefully, it'll be shut down before the ups goes off
        system("/sbin/shutdown", "-h", "-t0", "now");
}
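
Before letting a script like the one above loose from cron, it helps to test just the temperature comparison without shutting anything down. Here is a small sketch of that check from the shell. The name over_limit is made up, and it assumes the data file holds a single number on its first line, as the perl script does. An empty or garbled file counts as "not over the limit".

```shell
#!/bin/bash
# Sketch only: succeed (exit 0) if the temperature in a file is above a limit.
# over_limit is a made-up name. Assumes the first line of the file is a
# plain number such as 72.5; anything else is treated as "not over".
over_limit() {
    local file=$1 limit=$2
    awk -v limit="$limit" '
        NR == 1 && $1 ~ /^-?[0-9]+(\.[0-9]+)?$/ { if ($1 + 0 > limit + 0) hot = 1 }
        END { exit (hot ? 0 : 1) }
    ' "$file"
}

# Example use (dry run, nothing gets shut down):
#   over_limit /system/monitors/therm_209a.data 80.0 && echo "would shut down"
```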

#!/usr/bin/perl

use Device::SerialPort;

# Port with ups cable
$PORT="/dev/ttyS0";

# Connection Settings
$ob = Device::SerialPort->new($PORT) or die "Can't open $PORT: $!\n";
$ob->baudrate(2400);
$ob->parity("none");
$ob->databits(8);

# Send Y to put the ups in smart mode
$pass=$ob->write("Y");

# Send two Ks with > 1.5s delay between to shut down
$pass=$ob->write("K");
sleep 2;
$pass=$ob->write("K");

undef $ob;

# Now shut down the computer. Hopefully, it'll be shut down before the ups goes off
system("/sbin/shutdown", "-h", "-t0", "now");

The problem I was having was that the ups is in dumb mode when it first starts up. In this mode, it doesn’t recognize most commands. So, the first step in a script is to put the ups in smart mode. This is done by sending a Y. Next, I want to shut the ups off, but I want to put a little delay in to give the computer itself enough time to shut down. The shutdown command is run by sending two Ks with a delay of at least 1.5s between the Ks.
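
The same Y, K, pause, K sequence can also be sketched from the shell for quick testing. This is only an illustration: ups_shutdown_sequence is a made-up name, the stty settings mirror the 2400 baud, 8 data bits and no parity used in the perl script, and stty -F assumes GNU coreutils. Pointing it at a real ups port will turn the ups off, so try it on an ordinary file first.

```shell
#!/bin/bash
# Sketch only: send "Y", then "K", wait, then "K" to the ups port.
# ups_shutdown_sequence is a made-up name. The first argument is the port
# (or a plain file for a dry run), the second is the delay in seconds
# between the two Ks, which has to be more than 1.5 for a real ups.
ups_shutdown_sequence() {
    local port=$1 delay=${2:-2}
    # Only configure real character devices; stty -F is the GNU form.
    if [ -c "$port" ]; then
        stty -F "$port" 2400 cs8 -parenb || return 1
    fi
    # The port is opened once and stays open for the whole sequence.
    {
        printf 'Y'        # put the ups in smart mode
        printf 'K'        # first K
        sleep "$delay"
        printf 'K'        # second K starts the ups shutdown
    } > "$port"
}

# Dry run against a scratch file:
#   ups_shutdown_sequence /tmp/ups_test.out 2
```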

I can also set how long the ups waits before it shuts off. To pick a value, I have to work out roughly how long the computer takes to shut down. The command for this is p, and the default is 20 seconds. The delay is stored in the eeprom. To change it, I used kermit and issued the following:

p-

This cycles p through its settings. It showed 600, so the delay is now set to ten minutes. I added a line to my test script to immediately shut down the computer. The shutdown worked fine, but after ten minutes, the ups was still on. I’m trying it again, but changing the shutdown time back to 20 seconds.

This basically shut the ups off as soon as the computer shut down. If there were any hangups in the shutdown, this would probably cut power to the computer before they were resolved. But I think that’s ok. This script is mainly for when the machines need to be shut down immediately.

Apparently, I was wrong. The apcupsd program does do something to connect to the ups because if I try to use kermit to connect, I get nothing. But, if I start apcupsd and then stop it, then I can use kermit to connect. So, now I need to figure out what it’s doing.

To prepare for the next time that our glass room overheats, we want to run a script that automatically shuts off the ups anytime the temperature rises above a certain level. We already have the temperature monitor set up, so now we just need to figure out what to do to send the ups the signal to shut itself off.


I’ve installed the apcupsd program on one of our computers and connected it to its ups with the provided serial cable. I’m guessing that I don’t really need apcupsd running, but for now I have it. The rpm I had complained that it wouldn’t install without libcrypto.so.4 and libnetsnmp.so.5. We have libcrypto.so.6 and libnetsnmp.so.10 installed. So I just made links from these newer libraries to the older names.

cd /lib
ln -s libcrypto.so.6 libcrypto.so.4
cd /usr/lib
ln -s libnetsnmp.so.10 libnetsnmp.so.5

I removed the apcupsd program because I didn’t need it.

With the ups connected to the serial port, I can use kermit to talk to it.

kermit
c-kermit> set line /dev/ttyS0
c-kermit> set speed 2400
c-kermit> c

Peter Behr pointed out that webmail wasn’t working. After logging in, I’d get a “can’t open page redirect.php” error in Safari. (In Firefox, the page would just be blank.) I looked at some files on the shop computer, where webmail (squirrelmail) was still working ok, and found some differences in the /etc/httpd/conf.d/php.conf file. The solution was to add the following to the php.conf on the hep computer.

AddType application/x-httpd-php .php

I restarted the web server (httpd) and things started working.

I think this must be due to the different versions of php used by the different versions of redhat. The broken computer is running RHEL4 ES release 4 (server), while the working one is running RHEL4 WS release 4 (client).
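
A quick way to check another machine for the same problem is to look for the AddType line and add it if it’s missing. This is only a sketch: ensure_php_addtype is a made-up name, the default path is the php.conf mentioned in this post, and httpd still has to be restarted by hand afterwards.

```shell
#!/bin/bash
# Sketch only: make sure php.conf contains the AddType line for .php files.
# ensure_php_addtype is a made-up name; the default path is the one from
# this post. Commented-out copies of the line do not count as present.
ensure_php_addtype() {
    local conf=${1:-/etc/httpd/conf.d/php.conf}
    local line='AddType application/x-httpd-php .php'
    if grep -Eq '^[[:space:]]*AddType[[:space:]]+application/x-httpd-php[[:space:]]+\.php' "$conf"; then
        echo "already present"
    else
        printf '%s\n' "$line" >> "$conf" && echo "added"
    fi
}

# Example use (as root):
#   ensure_php_addtype /etc/httpd/conf.d/php.conf
```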