July 2008 log
Fixed a number of small but critical issues. Configured PhEDEx to use the storage element properly, so we show up as hosting data in DBS. Commissioned PhEDEx links to FNAL & Nebraska. Put the website into a presentable form and gave a presentation about the work at the All USCMS meeting. Worked on the gLite-UI installation, but did not complete it.
July 31, 2008
MK--Fixed condor job submission from WNs
- Details are in the Admin How-To Guide, under Encountered Errors: Condor.
- Did some updating of Admin How-To guide to reflect recent work.
- Turning rocks-grub service off, rather than removing it using replace-auto-kickstart.xml, doesn't change the need for a password on the WNs, despite having an RSA key. For some unknown reason, the root RSA key is not being distributed to the WNs as a part of the install anymore.
July 29, 2008
MK--Fixed partitioning shoot-node issue
- Problem was that at any given point in the install process, either disk LABELs, /etc/fstab, or /etc/grub.conf would be wrong. The only way to get it working for every point in the install stage was to modify all three in the <post> section of extend-compute.xml. Changed instructions in the Admin How-To guide to reflect the needed modifications.
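A quick way to see whether the label references agree is to list every LABEL= entry in the files involved. This is only a sketch of that check, not the fix placed in extend-compute.xml; the function name list_labels is made up for illustration, and grep -o is assumed to be available (GNU or BSD grep).

```shell
# Print every LABEL=<name> reference found in the given files, one per line,
# prefixed by the file it came from. The files are only read, never changed.
list_labels() {
    for f in "$@"; do
        # Skip comment lines, keep only the LABEL=... token itself
        # (a label value is assumed to end at the first whitespace).
        grep -v '^[[:space:]]*#' "$f" \
            | grep -o 'LABEL=[^[:space:]]*' \
            | sed "s|^|$f: |"
    done
}

# Example on a node (paths are the ones named in this log):
#   list_labels /etc/fstab /etc/grub.conf
# The output can then be compared by eye with what e2label reports
# for each partition.
```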
July 28, 2008
MK--Tried to fix partitioning shoot-node issue, tried RSA key issue, tried RSV issue
- Tried to use e2label in extend-compute.xml to get shoot-node to work on nodes without the default partitioning scheme. Tried issuing e2label manually on a machine which already had non-default partitioning (compute-0-3). Issued shoot-node compute-0-3, but it did not come back up (presumably due to Kernel panic, will verify). Created Rocks distribution with non-default partitioning and e2label calls placed manually in extend-compute.xml, modelled after /etc/fstab on compute-0-3 and identical to the e2label commands issued on compute-0-3 manually. Issued shoot-node compute-0-0 instead of rocks remove host partition compute-0-0, nukeit.sh and cluster-kickstart, so this may be why compute-0-0 did not come back up. Tried proper sequence on compute-0-1, but left before determining whether the node came back up with correct partitions.
- Issued cluster-fork "scp hepcms-0:/root/.ssh/* /root/.ssh" command and WNs didn't ask for password anymore. Temporary fix, doesn't explain why nodes aren't picking up the RSA key correctly when they reinstall.
- After reboot this morning, RSV showed same issue as before. gratia & html consumers were running in condor_prod instance as well as condor_devel instance. Probes running exclusively in condor_devel instance. Killed condor_prod jobs, stopped osg-rsv service, checked condor_q's all empty, started osg-rsv service, saw only jobs in condor_devel, as desired. However, RSV webpage shows there are still some missing probes, even though condor_q shows the appropriate 20 probes. Will check website tomorrow since some probes may take some time to run.
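To tell which WNs still lack the root RSA key after a reinstall, a loop like the one below could be run from the HN. It is a sketch under assumptions: the function name check_keys and the SSH override variable are made up for illustration. BatchMode=yes and ConnectTimeout are standard OpenSSH options; BatchMode makes ssh fail rather than prompt for a password.

```shell
# SSH can be overridden so the loop can be tried without a cluster.
SSH=${SSH:-ssh}

# For each host, run a no-op command with password prompts disabled.
# A failure means the key was not accepted or the host was unreachable.
check_keys() {
    for host in "$@"; do
        if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
               </dev/null >/dev/null 2>&1; then
            echo "$host: key accepted"
        else
            echo "$host: would prompt for password (or is unreachable)"
        fi
    done
}

# Example from the HN:
#   check_keys compute-0-0 compute-0-1 compute-0-2
```

A node reported as failing here would still need the key copied by hand, as in the cluster-fork scp workaround above.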
July 24, 2008
MK--Problem with rsa keys and/or validation, problem with partitions on WNs.
- shoot-node and cluster-kickstart don't accept password set in ssh-add, must supply password manually. shoot-node doesn't allow ssh connection to monitor install process. WN boot sequence issues a complaint about iptables file, but this hasn't been modified recently - suspect issue due to possible changes to grub service in extend-compute.xml.
- After manually copying the rsa key from the HN to compute-0-1, shoot-node did not demand a password. However, VNC server on compute-0-1 still didn't start. Why won't view pop up on shoot-node anymore? ssh connection still fails with error permission denied (publickey). Checked iptables and all appears to be well, so must be a more recent change - grub and removal of rocks-boot-auto package??
- WNs are giving error during boot that iptables-restore is choking. Removed changes to iptables in extend-compute.xml, but this is not the cause of the permission denied error - no errors on reinstall of WN after removing changes, but node still prompts for password and won't pop window during install. Will attempt to distribute rsa key to all nodes (cluster-fork) once I resolve partition/shoot-node issue. Suspect it will not request password, but that it still will not pop window during shoot-node process. Clueless as to why the connection isn't working during shoot-node, really shouldn't be the removal of rocks-boot-auto package, should it?
- Gave compute-0-2 new partitioning scheme. Installed and rebooted once successfully. However, shoot-node failed - install proceeded OK, but rebooted with Kernel panic. Error is that label /1 (the root partition that Rocks likes) was not found. Used Rocks boot disk to reinstall compute-0-2, but still had Kernel panic. Reported in email to Rocks-Discuss. Forced default partitioning scheme, which installed and rebooted once OK.
- compute-0-0 also showed a Kernel panic after shoot-node after the new partitioning scheme successfully installed once (as for compute-0-2). Forced default partitioning scheme, all OK (as for compute-0-2). Issued shoot-node after default partitioning scheme and node reinstalled and rebooted successfully. Then modified replace-auto-partition.xml to do the partitions I want (again). compute-0-0 installed and booted successfully. Issued shoot-node again, and node did not come back up due to Kernel panic.
July 23, 2008
MK--Placed srm configuration on WNs, changed partitioning on WNs
- Nodes didn't come back up due to issues with partitions.
July 22, 2008
MK--Brought HN and all services back up, reinstalled CMSSW
- HN went down due to corruption of LABELs on drives. Unknown what exactly caused the corruption in the first place, although probably had something to do with attempting to remove a network mounted directory. Remounted the root partition in read-write mode (mount -n -o remount,rw /dev/sda1). Edited /etc/fstab to change LABELs to /dev/sdx# and /etc/grub.conf to change LABEL to /dev/sda1 (for root partition). Rebooted successfully then backed up files. Issued e2label /dev/sdx# /part commands, restored /etc/fstab and /etc/grub.conf, rebooted, and all was well. One oddity was that /etc/grub.conf was pointing to LABEL=/, whereas the root partition was actually given the label /1 by /etc/fstab. Changed /etc/grub.conf to use LABEL=/1, unknown if this is what was causing the issue in the first place. Reboot went successfully, OSG services back up and running.
- Reinstalled CMSSW_1_6_12 with no issues. Will be cautious about putting the SITECONF directory there in the future, as it seems once the directory is there, CMSSW refuses to function without it.
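The LABEL-to-device edits made during the HN recovery can be previewed with sed before touching the real files. This is a sketch only: the function name label_to_device is made up, and the only mapping taken from this log is LABEL=/1 to /dev/sda1 for the root partition. It prints the edited text to stdout and leaves the input file alone. Labels containing regex special characters or "|" are not handled.

```shell
# Replace references to LABEL=<label> with a device path.
# Usage: label_to_device FILE LABEL DEVICE   (result goes to stdout)
label_to_device() {
    file=$1 label=$2 device=$3
    # Match the label only when followed by whitespace or end of line,
    # so that LABEL=/ does not also rewrite LABEL=/1.
    sed "s|LABEL=$label\([[:space:]]\)|$device\1|; s|LABEL=$label\$|$device|" "$file"
}

# Example: preview the change, then compare against the original.
#   label_to_device /etc/fstab /1 /dev/sda1 > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
```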
July 21, 2008
MK--Successfully finished PhEDEx configuration for SE mode, HN down due to / partition disk corruption and/or failure
- PhEDEx finally had successful transfers using srm:// instead of file://, files pass tests. Will proceed with Rocks install of PhEDEx.
- Edited JobConfig directory in /software/cmssw/SITECONF/T3_US_UMD, now CMSSW won't work without the site-local-config.xml file.
- HN down after attempted move of /software/cmssw to /home/cmssoft/cmssw. Creating directory /scratch/cmssw didn't show networked /software/cmssw, so rebooted the HN. HN boot sequence stops with error "fsck.ext3: Unable to resolve 'LABEL=/1'"; checking /etc/fstab shows this label corresponds to the root partition. I'm unable to back up critical files as the recovery mode mounts everything read-only, and I'm unable to remount as read-write.
- PhEDEx directory is tarred and ready for backup on compute-0-7, but can't be downloaded, perhaps need to figure out a way to mount USB drive and get the files that way, similarly for HN?
July 16, 2008
MK--Debugged srmcp failures, edited site configuration & home page
- srmcp failures appear to be due to gsiftp third party transfers from FNAL failing. Entirely unknown why: dCache experts claim third party transfers are OK, and PhEDEx at FNAL, which relies on third party transfers, hasn't come to a grinding halt. Some very special set of circumstances must be causing the third party transfers to fail, but I don't know what they are!! Emails have been forwarded to dCache experts, but I'm not anticipating a response. This is a BIG TODO!
- Continued website cleanup, effectively complete until further fixes.
July 15, 2008
MK--Edited user how-to guide, created to-do page, cleaned up admin how-to guide
- Made website generally presentable. Site configuration isn't complete, but presentation coming soon and want site cleaned up.
July 14, 2008
MK--Configured PhEDEx for SE mode, edited User Contacts page
- PhEDEx configuration now passes TestCatalogue tests. download-srm log indicates LoadTest jobs are failing, but the files are showing up where they're supposed to. They have zero size. Checking previous LoadTests, before any changes, indicates that files have zero size, but do not show failure in download-srm. Decided to stick with SE configuration, rather than local configuration, and turned Prod instance on. Requested FNAL link be commissioned.
- User Contacts page should now be complete.
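To see how many LoadTest files arrived with zero size, the destination directory can be scanned with find. A minimal sketch: the function name zero_size_files is made up, and the directory to scan has to be supplied, since the LoadTest destination path is not recorded in this log.

```shell
# List regular files of size zero under a directory, sorted by path.
# Usage: zero_size_files DIR
zero_size_files() {
    # -size 0 matches only empty files (sizes are rounded up to whole blocks)
    find "$1" -type f -size 0 | sort
}

# Example: count the empty files under a download area
#   zero_size_files /path/to/loadtest/area | wc -l
```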
July 13, 2008
MK--Edited switch guide, configured PhEDEx for SE mode
- Inserted guide by Mark Burr for serial switch connection into admin how-to guide.
- PhEDEx using srm://hepcms.... fails TestCatalogue, unknown why.
July 10, 2008
MK--Tested condor, configured Kerberos, installed CVS, edited User How-To
- Condor jobs can be submitted from the HN, but not the WNs. Some permissions problems - can't access files in NFS mounted directories and can't execute simple binaries such as /bin/sleep. TODO
- Configured Kerberos again and tested, no issues. Can't ssh to cmslpc despite apparently valid ticket. TODO
- Installed CVS for CMSSW, able to successfully download from UserCode and from CMSSW area.
- Began serious edits of User How-To.
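The WN submission problem can be reproduced with the smallest possible job, one that only runs /bin/sleep. The sketch below writes such a submit description; the function name write_sleep_test and the file names are made up for illustration, while the submit commands used (universe, executable, arguments, output, error, log, queue) are standard Condor ones.

```shell
# Write a minimal vanilla-universe submit description that runs /bin/sleep
# and print the path of the file that was written.
# Usage: write_sleep_test DIR
write_sleep_test() {
    dir=$1
    cat > "$dir/sleep_test.sub" <<'EOF'
universe   = vanilla
executable = /bin/sleep
arguments  = 10
output     = sleep_test.out
error      = sleep_test.err
log        = sleep_test.log
queue
EOF
    echo "$dir/sleep_test.sub"
}

# Example, run from a directory on a WN (not executed here):
#   condor_submit "$(write_sleep_test .)"
#   condor_q
```

Submitting the same file from the HN and from a WN, once from a local directory and once from an NFS mounted one, should help separate the permissions problem from any Condor configuration problem.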
July 8, 2008
MK--Fixed g++ issue, security for WNs, PhEDEx configuration
- Installed g++ and needed libstdc++-devel library (unknown dependency, but needed and wasn't installed) on all the WNs
- Typo in WN security fixed, WNs now fully secured
- Further read and understood PhEDEx configuration
July 7, 2008
MK--Fought with gLite UI configuration, YAIM, & SE configuration for PhEDEx
- Can't install YAIM via rpm as suggested by gLite-UI tarball installation instructions. No links on the instructions work and every email sent (five!) bounced, despite instructions being updated only a few months ago. May not be able to get any help whatsoever on gLite-UI install. Reverting to gLite-UI standard installation via yum.
- Values for many gLite UI configuration parameters are extremely uncertain.
- server.xml file for PhEDEx allows me to switch from file: to srm:, but I don't know where to specify the webservice-path (srm/v2/server instead of srm/managerv2). Contacted Paul.
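For reference while sorting out the webservice-path question, a full SRM URL is commonly written as srm://host:port/webservice-path?SFN=/file/path. The helper below just assembles that string so the two paths mentioned above can be compared side by side. It is a sketch under assumptions: the function name srm_url is made up, se.example.org and port 8443 are placeholders rather than this site's values, and where PhEDEx expects the path to be configured is still the open question.

```shell
# Build a full SRM URL from its parts.
# Usage: srm_url HOST PORT WEBSERVICE_PATH SFN
srm_url() {
    host=$1 port=$2 ws_path=$3 sfn=$4
    printf 'srm://%s:%s/%s?SFN=%s\n' "$host" "$port" "$ws_path" "$sfn"
}

# Example with placeholder host, port and file path:
#   srm_url se.example.org 8443 srm/v2/server  /store/test.root
#   srm_url se.example.org 8443 srm/managerv2  /store/test.root
```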
July 2, 2008
MK--WN security, srm configuration, CRAB installation
- One of the WN security settings is getting overwritten, possibly by 411. Emailed Rocks listserv to get help (TODO).
- srmcp is not picking up access_latency parameter in srm config file. Emailed OSG listserv to get help (TODO).
- jpackage 1.7 non-free repo install fails; the rpm is no longer being served or the file name has changed. Need to download it to the HN and serve it from there to be sure of future support. (TODO)
July 1, 2008
MK--Continued RSV srm tests, configured WN security, created cron job to clean /tmp on all nodes, began editing user's guide
- Decided to create links in OSG directory instead of modifying PATH variable. Placed alias of srmcp into OSG setup script as well, so no longer editing /etc/skel files to set up the environment correctly for all users. RSV srmcp still failing, possibly due to etc srm config file (TODO).
- Extended HN security settings to WNs and created cron job on all nodes to clean /tmp. Re-installed compute-0-0 and compute-0-1 successfully (TODO: re-install the rest).
- Began editing user's guide, particularly srm commands.
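The /tmp cleanup amounts to a find with an age cutoff. The sketch below shows the idea and is not the cron job actually installed: the function name clean_old_files, the 7-day cutoff, the script path and the schedule in the comments are all assumptions, and -delete requires GNU or BSD find.

```shell
# Delete regular files under DIR that were last modified more than DAYS days ago.
# Usage: clean_old_files DIR DAYS
clean_old_files() {
    dir=$1 days=$2
    # -xdev keeps find on the filesystem that holds DIR
    find "$dir" -xdev -type f -mtime +"$days" -delete
}

# Example call, as it might appear in a script run by cron:
#   clean_old_files /tmp 7
# Hypothetical crontab entry (schedule and script path are assumptions):
#   0 4 * * * /root/bin/clean_tmp.sh
```

Only regular files are removed; directories, including empty ones, are left in place.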