Archive for April, 2009

I got an email alert this morning letting me know about a disk failure in one of our raids.

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     DEVICE-ERROR     u0     233.76 GB   490234752     WD-WCANK2941755     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               u0     233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

Here is a log of what I did:

//cdfs1> /c1 remove p3
Exporting port /c1/p3 ... Failed.

Drive not degraded port=3 
//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     NOT-PRESENT      -      -           -             -
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               u0     233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

I then replaced the disk with a working one.

//cdfs1> /c1 rescan
Rescanning controller /c1 for units and drives ...Done.
Found the following unit(s): [/c1/u0].
Found the following drive(s): [none].

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

That’s bad. The new disk in p3 was detected, but p5 had also dropped out of unit u0, leaving the unit INOPERABLE. I tried rescanning a few times, but that didn’t bring p5 back. So I tried just rebuilding disk 3 anyway.

//cdfs1> /c1/u0 start rebuild disk=3
Sending rebuild start request to /c1/u0 on 1 disk(s) [3] ... Failed.

(0x0B:0x0033): Unit busy

That didn’t work either. So I tried removing disk 3 and putting it back in.

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

//cdfs1> /c0 remove p3   <-----OOPS--This should have been /c1 remove p3
Exporting port /c0/p3 ... Done.


//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

//cdfs1> /c1 rescan
Rescanning controller /c1 for units and drives ...Done.
Found the following unit(s): [/c1/u0].
Found the following drive(s): [none].

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

//cdfs1> /c1 remove p3
Exporting port /c1/p3 ... Done.


//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     NOT-PRESENT      -      -           -             -
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

//cdfs1> /c1 rescan
Rescanning controller /c1 for units and drives ...Done.
Found the following unit(s): [/c1/u0].
Found the following drive(s): [/c1/p3].

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

Nope, that didn’t work either. So I decided to remove disk 5 (but I never took it out of the case) and rescan.

//cdfs1> /c1 remove p5
Exporting port /c1/p5 ... Done.


//cdfs1> /c1 rescan
Rescanning controller /c1 for units and drives ...Done.
Found the following unit(s): [/c1/u0].
Found the following drive(s): [none].

//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               u0     233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

Ah, success. I don’t know why disk 5 got goofy all of a sudden, but I could now rebuild the new disk.

//cdfs1> /c1/u0 start rebuild disk=3
Sending rebuild start request to /c1/u0 on 1 disk(s) [3] ... Done.


//cdfs1> info c1

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     0      64K     1629.74   OFF    OFF      OFF      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANK2922638     
p1     OK               u0     233.76 GB   490234752     WD-WCANK2785939     
p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884     
p3     DEGRADED         u0     233.76 GB   490234752     WD-WCANY1569322     
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794     
p5     OK               u0     233.76 GB   490234752     WD-WCANY3726392     
p6     OK               u0     233.76 GB   490234752     WD-WCANK2785937     
p7     OK               u0     233.76 GB   490234752     WD-WCANK2941415     

//cdfs1> 

One more wrinkle: this computer has two RAID controllers, c0 and c1. The errors that I got were all from c1. After the rebuild had been running a while, I went to check on it and found that disk 3 on c0 was now showing up as NOT-PRESENT, because above I had mistakenly run /c0 remove p3 instead of /c1 remove p3. So I rescanned c0 and rebuilt that drive in its unit u0 as well.
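
Spotting that p5 had dropped out of the unit meant reading the port table by eye. A minimal sketch of doing that check automatically, assuming the listing has been saved or piped in as text and has the same column order as the tables above (port, status, unit); flag_ports is a made-up helper name, not part of the RAID tool:

```shell
# List ports from a controller listing (read on stdin) that need attention:
# any port whose status is not OK, or that is OK but not assigned to a unit.
# Assumes the column layout of the listings above: port, status, unit, ...
flag_ports() {
    awk '$1 ~ /^p[0-9]+$/ && ($2 != "OK" || $3 == "-") { print $1, $2, $3 }'
}

# Sample rows copied from the listing taken after the first rescan.
sample='p2     OK               u0     233.76 GB   490234752     WD-WCANK2785884
p3     OK               -      233.76 GB   490234752     WD-WCANY1569322
p4     OK               u0     233.76 GB   490234752     WD-WCANK2922794
p5     OK               -      233.76 GB   490234752     WD-WCANY3726392'

printf '%s\n' "$sample" | flag_ports
```

On the sample rows this flags p3 and p5, the two ports that were present but not part of u0. Running it against the listing for each controller in turn might also have caught the stray removal on c0 sooner.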

Here are the steps we took:
1. Make the following directories. You might have to be root to make them, but then change the owner and group on /opt/atlas to a regular user.

mkdir -p /opt/atlas/apt/rpmdb
mkdir -p /opt/atlas/apt/rpm
chown -R user:group /opt/atlas

Exit root.

2. Get apt and install in /opt/atlas

cd /opt/atlas
rpm -ivh --nodeps --relocate=/=$PWD/apt --dbpath=$PWD/apt/rpmdb http://atlas-computing.web.cern.ch/atlas-computing/links/reposDirectory/apt/apt-0.5_15lorg3.90-1.slc4.atlas.i386.rpm

3. Add the following to ~/.rpmmacros (do this as the regular user, not in root’s account)
%_dbpath /opt/atlas/apt/var/lib/rpm
%_rpmlock_path /opt/atlas/apt/rpm/transaction

4. Get and fix the apt config file

cd /opt/atlas
wget http://pcatd12.cern.ch/releases/download/config/apt.conf
sed "s#INSTALL_ROOT#$PWD#g" apt.conf > apt/etc/apt.conf

5. Add the sources for these files to the sources list

cd /opt/atlas/apt/etc/apt/sources.list.d
wget http://pcatd12.cern.ch/releases/download/config/atlas.list

6. Get the setup file

cd /opt/atlas/apt
wget http://pcatd12.cern.ch/releases/download/config/setup.sh

7. Source the setup file

source setup.sh

8. Update apt lists

apt-get update

9. Install the tdaq software and source (all dependencies will be installed as well)

apt-get install tdaq-01-09-01_i686_slc4_gcc34_opt
apt-get install tdaq_src

The library we are most interested in is:
/opt/atlas/tdaq/tdaq-01-09-01/installed/i686-slc4-gcc34-opt/lib/libROSslink.so

We copied this file to /usr/lib.
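
Step 4 relies on sed swapping a placeholder for the install directory. A small sketch of that substitution on its own, assuming the downloaded apt.conf uses the literal string INSTALL_ROOT wherever a path belongs. The template line below is made up for illustration, since the real file contents are not shown here, and fill_install_root is a hypothetical helper name:

```shell
# Replace every INSTALL_ROOT placeholder on stdin with the given directory.
# '#' is used as the sed delimiter so the slashes in the path need no escaping.
fill_install_root() {
    sed "s#INSTALL_ROOT#$1#g"
}

# Hypothetical template line, not taken from the real apt.conf.
echo 'Dir "INSTALL_ROOT/apt";' | fill_install_root /opt/atlas
```

This would break if the directory name itself contained a # or an &, which is not the case for /opt/atlas.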

We’ve found that the vme software will in fact compile with a 2.6 kernel. The distribution that worked for us was Scientific Linux 4.7. Strangely, Red Hat Enterprise Linux 4 did not work at all: we got lots and lots of error messages when running make. On SL47, we only got a few errors about MODULE_PARM. (I read somewhere that we just had to switch MODULE_PARM to module_param, but it turns out that we didn’t even need to do that.)

The odd problem that we did have was this: after we ran make install, we could load the module (modprobe vme_universe) and things worked fine. However, after a reboot, the module would be loaded, but the hello_vme program would keep giving us the “Cannot initialize the VMEbus” error. After examining more closely what make install was doing, we added the following to /etc/rc.d/rc.local:

# Load VME module
# Also need to create the device files
mkdir -p /dev/bus/vme
mknod --mode=666 /dev/bus/vme/ctl c 221 8
modprobe vme_universe

That fixed things. I don’t know why MAKEDEV isn’t making the device file, but we got around this issue.
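
If the device file ever does start surviving a reboot, running mknod again on an existing path would fail. A hedged sketch of working out which of the rc.local steps are still needed, using the same path and device numbers as above; vme_boot_cmds is a hypothetical helper that only prints commands, so it can be tried without root:

```shell
# Print the commands rc.local would still need to run for the given device directory.
# Nothing is executed here; the output is just a list of commands.
vme_boot_cmds() {
    dev_dir=$1
    [ -d "$dev_dir" ] || echo "mkdir -p $dev_dir"
    # -c is true only if the path exists and is a character device.
    [ -c "$dev_dir/ctl" ] || echo "mknod --mode=666 $dev_dir/ctl c 221 8"
    echo "modprobe vme_universe"
}

vme_boot_cmds /dev/bus/vme
```

The same two tests could guard the real mkdir and mknod lines in rc.local directly.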

We still have a problem getting large USB drives to work. Since we have to make the device files by hand for the vme stuff, we may have to do the same for the USB stuff. (We also need to look into which modules need to be loaded for USB.)

I happened to use an Ubuntu 8.10 disk, but any Linux should work. Boot off the Linux disk and, in the Ubuntu case, choose the first option. It says something like “Try Ubuntu without changing my hard disk”. Once you have the OS loaded, open a terminal. (I don’t remember where I found it, but it’s on one of the menus.) In this terminal, enter:

sudo fdisk -l

This will list your hard disks and their partitions. Ours came up as /dev/sdb1, that is, the first partition on the disk /dev/sdb.

Then fix the mbr with:

sudo lilo -M /dev/sdb mbr

That’s it. We rebooted and Windows came up.
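
Note that fdisk reports a partition (/dev/sdb1) while lilo -M is given the whole disk (/dev/sdb). A tiny sketch of getting from one to the other, assuming traditional sdX or hdX device names where the partition number is simply appended; disk_of_partition is a made-up helper name:

```shell
# Strip the trailing partition number from a device name such as /dev/sdb1.
# Only handles sdX/hdX-style names, where the disk name itself ends in a letter.
disk_of_partition() {
    printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

disk_of_partition /dev/sdb1
```

For /dev/sdb1 this prints /dev/sdb. It is worth double-checking the result against the fdisk listing before writing an MBR to it.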

We have learned some things about the VMIVME-7648 controllers and the vme software that we have. The main issue is that the vmisft software that we have will only compile on a 2.4 kernel. Thus, our previous installation of SLC4.5 (basically RHEL4) with its 2.6 kernel would not work. So the first step is to install something with a 2.4 kernel. We used SLC306 (derived from RHEL3) because we had the disks.

Our first attempt was to do the installation just as we had done previously: connect our multicard reader to a PC with no other drives and install. We had a problem in that our multicard reader always showed a disk for each type of card. This meant that during the installation, an error would come up about /dev/sda not being present. The compact flash card always came up as /dev/sdb. In the SLC45 installation, we could simply cancel this error and move on to the next disk. However, in the SLC306 installation, canceling this error closed the installation program. We got lucky in that a new multicard reader we bought did not have this problem. The compact flash card in that reader showed up as /dev/sda, which solved all our problems.

For reference, the model we used is the Verbatim Universal Card Reader 15-in-1, manufacturer part number 95343.
verbatim_card_reader.jpg

Once the operating system was installed, the compact flash card was installed into the vme controller. When grub started, we needed to stop the boot process and edit the boot options. We needed to add ide=nodma to the end of the kernel line.
grub_options.jpg

Then, during the boot process, it will find that a lot of hardware has changed. Just delete anything that it says is missing and configure any new hardware. This will require the network cards to be reconfigured.

Once the system is up, edit /etc/grub.conf and add ide=nodma to the end of the kernel line so that it is always present. Also, it’s a good idea to disable the second network card if it’s not needed. This is done in /etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth1. Just change ONBOOT=yes to ONBOOT=no for whichever one is not needed.
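
The grub.conf edit can also be scripted. A minimal sketch, assuming kernel lines start with the word kernel after optional indentation; the sample lines below are made up, and add_nodma is a hypothetical helper that reads from stdin rather than touching the real file:

```shell
# Append ide=nodma to every kernel line on stdin that does not already have it.
# All other lines pass through unchanged.
add_nodma() {
    awk '/^[ \t]*kernel[ \t]/ && !/ide=nodma/ { $0 = $0 " ide=nodma" } { print }'
}

# Hypothetical grub.conf excerpt; the kernel path and root device are placeholders.
printf '%s\n' 'title Linux' '        kernel /vmlinuz ro root=/dev/hda1' | add_nodma
```

Because lines that already contain ide=nodma are left alone, running it twice gives the same result. It makes sense to write the output to a separate file and compare it with the original before replacing /etc/grub.conf.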

Next, compile and install the vmisft software. It can be downloaded here.
vmisft software

Unzip and untar the file. Then run make and make install to install the software. Next you need to make sure that the vme_universe module is loaded whenever the computer is turned on. Add the following to /etc/rc.d/rc.local:

# Load vme module
modprobe vme_universe

To check that everything is working, simply type:

modprobe vme_universe

to load the module immediately. Run lsmod to check that the module is loaded. Once it’s loaded, the vmisft directory has a test program. Run:

cd vmisft-7433-3.6/vme_universe/TOOLS
./hello_vme

If everything is working correctly, you’ll get a message like this:

Bus handle value 3
We're ready to start accessing the bus now!

If things are not working, the error message will be:

Cannot initialize the VMEbus
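
For a check that runs unattended, for example after every reboot, the two messages above can be told apart with grep. A rough sketch, assuming the success text is exactly as quoted above; vme_bus_ready is a made-up helper name, and the sample input stands in for real hello_vme output:

```shell
# Read hello_vme output on stdin; succeed only if the success message is present.
vme_bus_ready() {
    grep -q "ready to start accessing the bus"
}

if printf '%s\n' "Bus handle value 3" "We're ready to start accessing the bus now!" | vme_bus_ready; then
    echo "VME bus OK"
else
    echo "VME bus not initialized"
fi
```

In real use the input would come from something like ./hello_vme 2>&1 | vme_bus_ready. The 2>&1 is there because it is not certain whether the program writes to stdout or stderr.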

Some other issues that we had with these controllers had to do with the bios. The following are some screenshots showing how things should be set up. Only screens changed from the default are shown.

main_bios.jpg

cf_disk_bios.jpg

advanced_bios.jpg

io_config_bios.jpg

boot_order_bios.jpg