Notes, To Do & Sandbox
If you arrived at this page by Google search, odds are this page won't help you; it's mostly 'notes to self'. Try the admin guide.
Notes
The next shoot-node will:
- The phedex appliance will install a slightly different phedex tarball which contains a previously missing file. However, I plan to pull the phedex appliance on our upgrade and install phedex manually on the grid node, not as a part of the kickstart.
- Link /etc/grid-security/certificates to /sharesoft/osg/ce/globus/share/certificates instead of /share/apps/osg/globus/share/certificates
- Use /sharesoft/osg/app as the CMSSW installation directory in gLite-UI's site-info.def file.
- The interactive nodes will install CRAB_2_6_1 and link it as /scratch/crab/current. They will also install CRAB_2_6_2, though it doesn't work, possibly because of the age of our gLite-UI client.
- Will make /boot/grub/grub.conf a symlink to /boot/grub/grub-orig.conf instead of to grub-orig.conf. Will also make /boot/grub/grub-orig.conf writable.
- Will not yum update to the newest kernel.
- Will not install CRAB_2_6_3.
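The two symlink changes in the list above (grub.conf and the certificates directory) could be made with one small helper. This is a minimal sketch, not the actual kickstart code: relink is a hypothetical helper name, and the chmod mode is a guess at what "writable" should mean here.

```shell
#!/bin/sh
# Sketch: point a symlink at an absolute target, replacing any existing link.
# relink is a hypothetical helper name, not something taken from the kickstart.
relink() {
    target=$1
    link=$2
    # -s symbolic link, -f replace an existing link,
    # -n treat an existing link to a directory as the link itself
    #    (otherwise the new link would be created inside the old directory)
    ln -sfn "$target" "$link"
}

# Intended calls in the kickstart post section (not run here):
#   relink /boot/grub/grub-orig.conf /boot/grub/grub.conf
#   chmod u+w /boot/grub/grub-orig.conf    # assumption: owner-writable is enough
#   relink /sharesoft/osg/ce/globus/share/certificates /etc/grid-security/certificates
```

Note that if /etc/grid-security/certificates already exists as a real directory rather than a link, ln would place the new link inside it, so that case would need to be moved aside by hand first.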
To Do
- Remove old kernels from grid and interactive nodes.
- The ipmi driver failed to start on the nodes (all of them, afaik) after rebooting to the new kernel. Establish whether ipmish calls can still be made to other BMCs.
- Write guides in admin docs on recovering from GN reboot and reinstall. Modify instructions on HN reboot and reinstall.
- Reinstall Dell's OMSA on /scratch on the HN instead of /share/apps.
- I think my config.ini in OSG CE needs to have the jobmanager be jobmanager, not jobmanager-condor. This is so it will go through ManagedFork.
- site_verify thinks we're failing OSG $APP writability, but this might be OK. Perhaps the grid3-locations file needs to be writable by the appropriate account, but I think the whole app directory shouldn't be.
- TODO MAP SAM TO NEW SAM ACCOUNT (ADD SAM ACCOUNT TO DOCS)
- TODO MAP BOCKJOO TO CMSSOFT ACCOUNT.
- TODO MAP LOCAL USERS TO LOCAL ACCOUNTS.
- Have PhEDEx perform auto-proxy renewal using ProxyRenew script inside PHEDEX/Custom/Template
- Remove comments about needing to start PhEDEx services after reinstall (keep regarding reboot).
- Have OMSA on the grid appliance report storage problems
- Change language in admin guide to non-HN from WN where appropriate
- Have the BMCs on all nodes report to and get DHCP addresses from hepcms-hn, NOT hepcms-0!
- Fix text of email alert to direct to hepcms-hn, not hepcms-0.
- I added source of /sharesoft/osg/setup.(c)sh to the ops, mis, cmssoft, sam, and uscms01 user accounts. This might break something!
- Once we get the RSV cert, add a new entry to grid-mapfile-local.
- Remove old settings from sudoers file on HN if no longer needed now that OSG is on GN.
- It seems the external gateway IP address may have changed; it's not clear whether it should be 128.8.164.0 or 128.8.164.1.
- Change Kickstart such that /etc/fstab mounts /data differently on the grid node. Specifically, the grid node does NOT want the nfs mount of /data from itself!
- Edit the user guide to reflect the fact that $SRM_CONFIG no longer has default settings. Possibly transition them to srm-copy entirely.
- Edit user guide to reflect the fact that srm options are no longer in $SRM_CONFIG.
- Add info to admin guide about how I configured condor for PREEMPTION_REQUIREMENTS (see Jan 2009 log).
- To use srm-copy in PhEDEx (which provides the needed third party copy mode for transfers from UNL), I need to adapt FileDownloadVerify to create directories as a part of the pre-validation process. Paul has done this already: http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/SITECONF/T1_US_FNAL/PhEDEx/FileDownloadVerify?view=markup so I just need to get the relevant code and put it into my FileDownloadVerify.
- Mesa may provide the needed OpenGL libraries. glx-utils might as well, though I don't see it among the RedHat rpms.
- Increase the size of the /root partition on WNs and have them install the Workstation packages: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2005-October/014916.html. Then I won't have to install everything manually! Do this when we get interactive nodes or next time we do a WN reinstall?
- Turn off the T2_US_Caltech LoadTests
- Change config.ini se_control_version to 2.2.0
- Add brief section about getting datasets to pass SAM tests.
- Boot sequence appears somewhat farfenoogled - the HN is trying to boot off of PERC6/E (the big disk array) instead of PERC6/i. Can't figure out where this got set, so have to use F11 at boot time to get into the boot menu and select the correct controller. What a pain!
- Completely reinstall the cluster using hepcms-0.umd.edu as the head node name, not HEPCMS-0.UMD.EDU. This has caused all sorts of troubles!!
- Increase size of /var and /root partitions
- Install cmsShow on the cluster
- Follow Burt's instructions for reducing wasted hours from gLite jobs (email)
- Add an afs mount to get files stored at CERN and FNAL.
- Add a section to the admin how-to "Modify Rocks" about installing software on the HN and the WNs.
- PhEDEx:
- FileDownloadVerify agent doesn't give a bad exit code when the file size is wrong, or PhEDEx doesn't do anything about the bad exit code (check if this is still the case in 3.1.1).
- The FNAL srm client forces third party transfers from dCache->BeStMan to be done via pushmode. Brian Bockelman says the LBNL srm client doesn't require pushmode. Since pushmode has a high degree of unreliability, it would be best to switch PhEDEx over to the LBNL client. This is what's done at UNL and works very well. Will need an srmcp_wrapper which calls srm-copy instead (for directory creation).
- This worked:
srm-copy "srm://srm.unl.edu:8443/srm/managerv2?SFN=/pnfs/unl.edu/data4/cms/store/user/mkirn/transfer-tests.txt" "srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/tmp/transfer-tests4.txt" -retry_num=2 -pushmode
- As did the equivalent at FNAL:
srm-copy "srm://cmssrm.fnal.gov:8443/srm/managerv2?SFN=/resilient/kirn/transfer-tests.txt" "srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/tmp/transfer-tests3.txt" -retry_num=2 -pushmode
- However, I can't use delegation false because my srm-copy client is too old (2.2.1.2.e5 doesn't support it, but 2.2.1.2.i5 does).
- Also, I'm not 100% sure that I'm contacting UNL's BeStMan-Gateway - I get the impression that they still have some stuff on dCache. So I'm not sure that this has really tested PhEDEx transfers from UNL. I'm pretty confident that it hasn't.
- I did the equivalent command for UMD:
srm-copy "srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/home/kirn/cluster_test_files/transfer-tests.txt" "srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/tmp/transfer-tests5.txt" -retry_num=2 -pushmode
- Which worked, but we're not using a pure BeStMan-Gateway, which might be why it still worked.
- Have a cron job which makes the phedex tar ball every week.
- Add info on deleting data, in particular, that the directories must be manually cleaned up after a PhEDEx deletion request completes.
- Rename WNs HEPCMS-X in Rocks, so that their internal names match external names, then remove TRUST_UID_DOMAIN from WNs /opt/condor/etc/condor_config.local (or try to).
- yum update, called for the lcg-CA package on WNs during the kickstart install, leaves lots of rpms with the extension .1 because the original rpm is still present. Will yum update use the newer rpm or the older one? Not an apparent problem, since CRAB jobs work, but it should be checked.
- Change compute-0-6 rank # to rank 6 instead of rank 8. Guess: insert-ethers --replace compute-0-6 --rank 6. Not even close to critical, only impacts printing, order of servicing Condor jobs, and cluster-fork ordering.
- Figure out CERN's path to DBS/PhEDEx data, presumably /castor/cern.ch/cms/<DBS path>, but needs to be verified.
- Take /root/OSG/AddAccounts script and write how to recover account names and passwords using this.
- Currently using find to clean /home/uscms01. However, the best solution is to follow this: https://twiki.grid.iu.edu/bin/view/Sandbox/MaradonnaWorkerNodes so that gLite jobs go directly to the WN /tmp instead of /home/uscms01.
- Change kinit_cern and kinit_fnal so the cern.ch and fnal.gov domains are provided automatically.
- Get backup script to auto-copy to the hep-t3 webserver.
- quotas on /home and /data
- Removing /export/home/username doesn't remove the /home/username directory; possibly the NFS services have to be restarted?
- Monitors:
- Nagios to issue alerts?
- Gratia (condor job status?) /share/apps/osg-0.8.0/gratia/var/logs/gratia-probe-condor.log
- Temperature: ipmi?
- RAID (Dell?)
- Determine why WNs aren't accepting ssh-add RSA keys. Every time I log in to a compute node as root for the first time, I receive the statement "/usr/X11R6/bin/xauth: creating new authority file /root/.Xauthority". Should this file have been made already? make -C /var/411 just says "Nothing to be done for `all'", so it's not like 411 isn't aware of the new keys. Try removing all xml files in site-profiles; this will determine whether it's fixable via the distribution or whether something else must be done.
- Upgrade RSA keys to next version (4 I think?)
- Configure RSV to use grid certificate, rather than grid proxy.
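For the kinit_cern and kinit_fnal item in the list above, one way to supply the domain automatically is to build the Kerberos principal from a bare username. A minimal sketch, assuming the realms are CERN.CH and FNAL.GOV; krb_principal is a hypothetical helper, and the existing kinit_cern/kinit_fnal scripts may be structured differently.

```shell
#!/bin/sh
# Sketch: turn a bare username into a full Kerberos principal.
# krb_principal is a hypothetical helper name; realms are assumed, not read from krb5.conf.
krb_principal() {
    user=$1
    site=$2
    case "$site" in
        cern) realm=CERN.CH ;;
        fnal) realm=FNAL.GOV ;;
        *) echo "unknown site: $site" >&2; return 1 ;;
    esac
    echo "$user@$realm"
}

# Possible wrapper bodies (not run here); default to the local username:
#   kinit_cern:  kinit "$(krb_principal "${1:-$USER}" cern)"
#   kinit_fnal:  kinit "$(krb_principal "${1:-$USER}" fnal)"
```

Defaulting to $USER only helps users whose local account name matches their CERN or FNAL account name; everyone else would still pass the username as the first argument.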
Sandbox
Kernel panics/system rescue on HN
Insert SL4.5 install disk 1 and type linux rescue at the boot prompt.
Generally don't need the network enabled.
Mount Linux installation under /mnt/sysimage? Continue
chroot /mnt/sysimage
Execute whatever commands are needed to recover (/boot/grub/grub-orig.conf is a common source of problems preventing boot.)
exit
exit
The system will reboot; remove the CD from the drive while it is rebooting.
Holding condor jobs
To hold all the jobs running on the condor batch system, as root from the HN:
condor_status -schedd
For all nodes listed as the scheduler for running or idle jobs (e.g. compute-x-y):
ssh compute-x-y
condor_hold -name compute-x-y -all
To resume jobs:
condor_status -schedd
For all nodes listed as the scheduler for held jobs (e.g. compute-x-y):
ssh compute-x-y
condor_release -name compute-x-y -all
There must be an easier way to do this, but I don't know what it is! cluster-fork "condor_hold -all" will only hold jobs submitted by the root user.
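The per-schedd steps above can be scripted. A minimal sketch, assuming root on the HN can ssh to every schedd host without a password prompt; schedd_cmds is a hypothetical helper name. It only prints the commands, so they can be reviewed before anything is actually held or released.

```shell
#!/bin/sh
# Sketch: generate the hold/release command for every schedd in the pool.
# schedd_cmds is a hypothetical helper; it reads schedd names from stdin,
# one per line, and prints one command per schedd. $1 is "hold" or "release".
schedd_cmds() {
    action=$1
    case "$action" in
        hold|release) ;;
        *) echo "usage: schedd_cmds hold|release" >&2; return 1 ;;
    esac
    while read -r schedd; do
        [ -n "$schedd" ] || continue
        # ssh -n keeps ssh from swallowing the remaining commands
        # when this output is piped into sh
        echo "ssh -n $schedd condor_$action -name $schedd -all"
    done
}

# Usage on the HN (not run here):
#   condor_status -schedd -format '%s\n' Name | schedd_cmds hold        # review
#   condor_status -schedd -format '%s\n' Name | schedd_cmds hold | sh   # execute
#   condor_status -schedd -format '%s\n' Name | schedd_cmds release | sh
```

Unlike the manual procedure, this touches every schedd that condor_status reports, not only the ones with running, idle, or held jobs; on a schedd with an empty queue the command should simply find nothing to act on.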
Rocks Backup
Doesn't work - seg fault
- mkdir -p /data/users/root/RocksRestore/tmp
- rm -r /var/tmp
- ln -s /data/users/root/RocksRestore/tmp /var/tmp
- cd /export/site-roll/rocks/src/roll/restore
- make roll
BeStMan (via VDT)
BeStMan:
Note: As before, we are installing BeStMan on the same network mount as the CE, so we'll have the CE handle certificates. Therefore, we must wait to turn on the BeStMan services until after the CE install. We also run Globus GridFTP via the CE.
- Install with starting configuration:
cd /share/apps/bestman
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:Bestman
*Note: As of Oct. 29, VDT has not officially released BeStMan-Gateway. To get VDT's BeStMan-Gateway, we had to use the version in development in the test-cache:
pacman -get http://vdt.cs.wisc.edu/test-cache/bestman:Bestman
I will change this guide once it has been officially released. Alternatively, you can download the BeStMan tarball and follow the BeStMan installation steps provided in the archived OSG 0.8 Admin guide.
Answer "yall" (yes to all) when asked if you want to add sites to trusted.caches
Agree to license? y
Update CRLs automatically? n
Cron rotation of VDT log files? y
Where to store CA files? l (lowercase L, local)
Update CA certificates automatically? n
Run Globus GridFTP automatically? n
Run BeStMan automatically? y
- Additional configuration:
Note that we copied the default configuration saved in vdt-install.log and modified it to our needs.
vdt/setup/configure_bestman \
--vdt-install /share/apps/bestman \
--http-port 7070 \
--https-port 8443 \
--globus-tcp-port-range 20000,25000 \
--enable-gateway
- vdt-register-service --name bestman --enable
--enable-sudofsmng ??? - Need to edit sudoers file???
--disable-sudofsmng
vdt-install.log: !!! Important: If this is your only gridmap file, please set it in server-config.wsdd as well.
UMD HEP T3 Computing Cluster