Daily procedures for CASTOR operators

This web page lists monitoring links that should be looked at and commands that should be executed at least once per day. There are currently two CASTOR instances in production: CASTOR PUBLIC and CASTOR REPACK. CASTOR REPACK is looked after by the tape operations team. There is also the CASTOR PPS (pre-production) instance. Some of the commands act on this instance as well, but it is not a production instance and is therefore not a high priority. Most of the commands listed here act only on the CASTOR PUBLIC instance. Even though many of the commands contain ssh, they should all be executed on the following machine if you want a guarantee that they will work. You are of course welcome to run them from other machines if you know what you are doing.

c2adm.cern.ch

You should have a valid CERN Kerberos ticket before executing any of the listed commands. You can obtain such a ticket by executing kinit. The commands listed here should not be executed as user root. Instead, they should be executed as your normal CERN unix account.

Monitoring links

CASTOR support requests come in through SNOW. Search for the CASTOR functional element (FE). Tickets of interest usually have the "CASTOR 3rd Line Support" assignment group. Please ignore tickets that have the "Castor Tape 3rd Line Support" assignment group as these are handled by the tape operations team.

Check disks (MD devices)

CASTOR storage is organised into disk pools. End users choose which disk pool they use by specifying a CASTOR service class. The CASTOR black and white lists control which users can use which service classes. A CASTOR disk server is organised into filesystem mount points. CASTOR implements redundancy by using software RAID to provide the storage behind each filesystem mount point. Staff in the IT computer center will proactively repair any broken RAID arrays. Such repairs can sometimes take days if hardware is not immediately available.

Data lost from disk after it has been recalled from tape is not true data loss because a user can request the data back from tape. Data lost from disk on its way to tape for archival is true data loss.

To help a RAID system protect the data it contains, the CASTOR filesystem that uses it should be put into the Readonly state when the array is one disk away from total failure. CASTOR currently uses RAID 1, 6 and 60. A filesystem using RAID 1 should be made Readonly after one disk has failed. A filesystem using RAID 6 should be made Readonly after two disks have failed. A filesystem using RAID 60 should be made Readonly after four disks have failed.
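
The Readonly thresholds listed above can be written down as a small lookup. The following is only a sketch: raid_action is a hypothetical helper name, not a CASTOR command, and the level names raid1, raid6 and raid60 are assumed spellings chosen for this example.

```shell
# Hypothetical helper (not part of CASTOR): prints "readonly" when the number
# of failed disks has reached the threshold given in the text for that RAID
# level, and "ok" otherwise.
raid_action() {
    level="$1"
    failed="$2"
    case "$level" in
        raid1)  threshold=1 ;;   # Readonly after one failed disk
        raid6)  threshold=2 ;;   # Readonly after two failed disks
        raid60) threshold=4 ;;   # Readonly after four failed disks
        *) echo "unknown"; return 1 ;;
    esac
    if [ "$failed" -ge "$threshold" ]; then
        echo "readonly"
    else
        echo "ok"
    fi
}

raid_action raid6 1   # prints: ok
raid_action raid6 2   # prints: readonly
```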

The following relatively fast command line lists the status of any RAID system that has at least one failed disk. This command should be executed on c2adm.cern.ch.

2>/dev/null wassh -c 'castor*disk*' -l root -- 'egrep "speed|_" /proc/mdstat | grep -v check' | grep -v WARN
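
To illustrate what the command above is matching: in /proc/mdstat each array has a status string such as [UU] or [U_], where an underscore marks a failed or missing member. The following sketch counts those underscores per array. degraded_arrays is a hypothetical helper name and the sample input is made up. It assumes the common "mdN : active raidX ..." layout, so arrays in other states (for example inactive ones) may report a wrong level.

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints
# "<device> <level> <failed members>" for every array with a failed member.
degraded_arrays() {
    awk '
        /^md/ { dev = $1; level = $4 }          # e.g. "md0 : active raid1 ..."
        /\[[U_]+\]/ {                           # status string, e.g. "[U_]"
            status = $NF
            failed = gsub(/_/, "_", status)     # gsub returns the number of "_"
            if (failed > 0) print dev, level, failed
        }
    '
}

# Made-up sample input in the usual /proc/mdstat layout.
cat <<'EOF' | degraded_arrays
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid6 sdb[0] sdc[1] sdd[2] sde[3] sdf[4]
      11720540160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUUUU_]

md0 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]

md2 : active raid1 sdg1[0] sdh1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>
EOF
# prints:
# md1 raid6 1
# md0 raid1 1
```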

The following relatively slow (and rather long) command line also lists the status of any RAID system that has at least one failed disk, but gives much more user-friendly output. Please note that this command sets the ssh option StrictHostKeyChecking=no. If you do not agree with this then you should remove the option. The command also sets UserKnownHostsFile=/dev/null so that your own ~/.ssh/known_hosts file is not polluted. If you have ever executed this command without the ssh -o 'UserKnownHostsFile=/dev/null' option, and you do not agree with the use of ssh -o 'StrictHostKeyChecking=no', then you might want to clean or simply delete your ~/.ssh/known_hosts file. This command should be executed on c2adm.cern.ch.

DISK_SERVERS=`ssh -o 'LogLevel=ERROR' -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' root@castorpublic printdiskserver | tail -n +3 | head -n -2 | awk '{print $1;}'`; for DISK_SERVER in ${DISK_SERVERS}; do for DEVICE in `ssh -o 'LogLevel=ERROR' -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' root@${DISK_SERVER} grep -B1 _ /proc/mdstat | egrep '^md' | awk '{print $1;}'`; do MOUNT=`ssh -o 'LogLevel=ERROR' -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' root@${DISK_SERVER} df | grep ${DEVICE} | awk '{print $NF;}'`; LAYOUT=`ssh -o 'LogLevel=ERROR' -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' root@${DISK_SERVER} grep -A1 ${DEVICE} /proc/mdstat | tail -n 1 | awk '{print $NF;}'`; ssh -o 'LogLevel=ERROR' -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' root@castorpublic printdiskserver -f ${DISK_SERVER} | grep ${MOUNT}/ | awk '{print "'${DISK_SERVER}' '${DEVICE}' '${LAYOUT}' " $0;}'; done; done

The following command line can be used to make a CASTOR filesystem Readonly. This command should be executed on c2adm.cern.ch.

ssh root@castorpublic modifydiskserver --state Readonly -m FILESYSTEM DISK_SERVER_FQDN

The following command line can be used to put a CASTOR filesystem back into Production once its disks have been repaired. This command should be executed on c2adm.cern.ch.

ssh root@castorpublic modifydiskserver --state Production -m FILESYSTEM DISK_SERVER_FQDN

The following command line lists all of the CASTOR filesystems. This command should be executed on c2adm.cern.ch.

ssh root@castorpublic printdiskserver -f

If you see any filesystems in the FILESYSTEM_READONLY state that have 100% healthy RAID systems then you can put them back into Production (FILESYSTEM_PRODUCTION).

Check for lost files (late migrations)

The archival of a file to tape is an asynchronous operation for the end user. Once they have synchronously written a file to CASTOR disk they must poll CASTOR to determine when the file has finally been written to tape. There are various reasons why a CASTOR disk file may not make it to tape. The following command line lists files that are unexpectedly late in their journey to tape. This command should be executed on c2adm.cern.ch.

ssh root@castorpublic cat /var/spool/castor/latemigrations.`date +%Y%m%d` 2>/dev/null

Any file listed by the above command line should be investigated to find out why it is not being archived to tape.

Check for stuck diskservers

CASTOR disk servers can lose contact with the rest of CASTOR and then be considered missing. The following command line lists missing disk servers. This command should be executed on c2adm.cern.ch.

wassh `echo castor{public,pps,repack} | tr " " ,` -l root -- 'listtransfers | grep missing'

Please note that the above command can display false results. If a disk server is reported as missing, please run the command at least twice before acting on it.
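
One way to apply the "run it at least twice" advice is to keep only the disk servers that were reported in both runs. The following is a sketch under assumptions: confirmed_missing is a hypothetical helper, the host names are made up, and each input file is assumed to hold one disk server name per line that you have extracted from the command output yourself.

```shell
# Hypothetical helper: prints the names that appear in both files.
# $1 and $2 are files with one disk server name per line (assumed format).
confirmed_missing() {
    first_sorted=$(mktemp)
    second_sorted=$(mktemp)
    sort -u "$1" > "$first_sorted"
    sort -u "$2" > "$second_sorted"
    comm -12 "$first_sorted" "$second_sorted"   # lines common to both runs
    rm -f "$first_sorted" "$second_sorted"
}

# Example with made-up host names.
run1=$(mktemp)
run2=$(mktemp)
printf '%s\n' diskserver-a.example.org diskserver-b.example.org > "$run1"
printf '%s\n' diskserver-b.example.org > "$run2"
confirmed_missing "$run1" "$run2"   # prints: diskserver-b.example.org
rm -f "$run1" "$run2"
```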

A missing disk server can be recovered by executing the following command. This command should be executed on c2adm.cern.ch.

ssh root@DISK_SERVER systemctl restart diskmanagerd

Check for late transfers (transfers not from today)

The following command lists transfers that are not from today. A transfer is either a file being written from an end user's machine to CASTOR disk or from CASTOR disk to an end user's machine. This command should be executed on c2adm.cern.ch.

wassh `echo castor{public,pps,repack}|tr " " ,` -l root -- 'listtransfers | grep -v "`date +\"%b %d\"`"'

It is quite normal for transfers to the backup pool to take a relatively long period of time. These transfers should be ignored the first day they are listed. Any other late transfers or backup transfers older than a day should be investigated.
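
The date filter used in the listtransfers command above can be tried out locally on sample text. The following is a sketch: not_from_today is a hypothetical helper and the sample lines are made up, not real listtransfers output. Note that date +"%b %d" zero-pads the day (for example "Jan 05"), so the filter only hides lines whose timestamp is printed in exactly that form.

```shell
# Hypothetical helper: prints only the stdin lines that do not contain
# today's date in "%b %d" form, which is the filter used in the text.
not_from_today() {
    today=$(date +"%b %d")
    grep -v "$today" || true   # grep exits non-zero when nothing is printed
}

# Made-up sample lines: the first carries today's date, the second does not.
printf 'transfer-a %s 10:00\ntransfer-b Xxx 99 10:00\n' "$(date +"%b %d")" | not_from_today
# prints: transfer-b Xxx 99 10:00
```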

A file involved in a late transfer that is not on tape should be looked at with more urgency. If such a transfer fails then the end user will be notified synchronously. In theory this means that this is not a data loss situation for CASTOR because CASTOR told the user that their file never made it to CASTOR disk and so they should try again. That said, some users may not find it easy to resend their data. Some users create their data on the fly and do not have an actual file to copy into CASTOR again. End users suffering from these failures should be notified by e-mail so they don't build up a backlog of data that they may have difficulty resending.

The following commands can be used to determine whether a file is on tape. These commands should be executed on c2adm.cern.ch.

ssh root@castorpublic.cern.ch nsls -l FULL_CASTOR_PATH
ssh root@castorpublic.cern.ch nsls -T FULL_CASTOR_PATH

The nsls -l command will list the letter m for migrated at the beginning of the line if the file has been migrated to tape. The nsls -T command will list the tape copies of the file. Both the nsls -l and nsls -T commands require the full path of the CASTOR file being examined. Unfortunately CASTOR files are identified by their CASTOR nameserver file ID when they are listed in ongoing transfers.
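
The "letter m at the beginning of the line" check can be scripted. The following is a sketch: is_migrated is a hypothetical helper, and the sample lines are made up for illustration rather than copied from real nsls -l output.

```shell
# Hypothetical helper: succeeds if the given nsls -l output line starts with
# the letter m, which the text says marks a file migrated to tape.
is_migrated() {
    case "$1" in
        m*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Made-up sample lines.
for line in \
    "mrw-r--r--   1 someuser somegroup 1048576 Jan 01 12:00 /castor/example/file1" \
    "-rw-r--r--   1 someuser somegroup 1048576 Jan 01 12:00 /castor/example/file2"
do
    if is_migrated "$line"; then
        echo "on tape:     $line"
    else
        echo "not on tape: $line"
    fi
done
```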

The full path of a CASTOR file can be obtained from its CASTOR nameserver file ID using the following command. This command should be executed on c2adm.cern.ch.

ssh root@castorpublic.cern.ch nsgetpath CASTOR_FILE_ID

