November 2008 log
Primarily debugged PhEDEx and srm transfers; learned that third party pushmode is generally unreliable and that a couple of failed transfers are not cause for alarm. Also updated OSG and debugged CRAB private DBS registration.
November 30, 2008
MK -- Finished debugging third party pushmode srm transfers
- The problems with transfers from UCSD were because one of the storage nodes had certificate problems.
- My manual tests from FNAL and UNL were failing because it appears that third party pushmode is generally unreliable. After subscribing to PhEDEx samples hosted at these sites, it took about half a dozen retries on a handful of files before they would finally download.
- I cannot trust the results of manual tests of third party pushmode. To test if something is going to work, I have to throw something like PhEDEx at it, which will "try, try, try again" over the course of days. Only once a PhEDEx transfer has been stuck for a minimum of 24 hours and retried on the same files a minimum of six times should I take action. In all likelihood, since pushmode relies on the sending site to do the work, a transfer held up on a few files indicates a problem with one of the sending site's storage nodes, possibly with its certificates. A transfer held up on the entire dataset indicates some other problem, probably local to UMD.
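The rule of thumb above can be written down as a small decision helper. This is only a sketch: the StuckFile record, the function names and the returned strings are made up for illustration and are not part of PhEDEx; only the 24 hour and six retry thresholds come from the entry above.

```python
from dataclasses import dataclass

# Thresholds taken from the log entry above.
MIN_STUCK_HOURS = 24
MIN_RETRIES = 6

# Hypothetical outcome labels, for illustration only.
WAIT = "wait: let PhEDEx keep retrying"
SENDING_SITE = "check the sending site's storage nodes and certificates"
LOCAL = "check the local (UMD) setup"


@dataclass
class StuckFile:
    """Hypothetical record of one file that has not transferred yet."""
    name: str
    hours_stuck: float
    retries: int


def needs_action(f):
    """True only if the file is past both thresholds."""
    return f.hours_stuck >= MIN_STUCK_HOURS and f.retries >= MIN_RETRIES


def diagnose(stuck_files, dataset_size):
    """Apply the rule of thumb to the files still stuck in a dataset."""
    actionable = [f for f in stuck_files if needs_action(f)]
    if not actionable:
        return WAIT
    if len(actionable) >= dataset_size:
        # The whole dataset is held up: probably a local problem.
        return LOCAL
    # Only some files are held up: probably the sending site.
    return SENDING_SITE
```

For example, diagnose([StuckFile("a.root", 30, 7)], dataset_size=10) points at the sending site, while a file stuck for 23 hours is left alone regardless of retry count.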
November 29, 2008
MK -- Root partition was completely full
- Caused OSG to fail, not surprisingly. Backups into /root were consuming too much space - just sent them to /data. Will need to check RSV results later to see if OSG has recovered.
- OSG apache couldn't bind to its ports because they were already claimed. I had already run vdt-control --off, so I'm not sure how to get the services to shut off short of a reboot. Will need to do a head node reboot at some point. Dinko is currently running jobs, so need to wait.
- Despite apache thinking it can't start, RSV was able to display results on web page, so suspect it's currently running and don't know how to cleanly kill it without a head node reboot.
- globus-ws is giving its usual problematic complaint:
WARNING: It seems like the container died directly
Please see $GLOBUS_LOCATION/var/container.log for more information
Starting Globus container. PID: 7613
### 2008-11-29 09:37:47 vdt-control(do_init) starting 'globus-ws' failed: 1024
I've seen this before and have never been able to fix it. But I've never tried rebooting the HN to fix it. It seems every time I copy from a backup location, globus-ws just won't start. I thought the initial problems with OSG were caused by the upgrades rather than the full / partition, so I performed a couple of OSG service starts from various backup areas.
- Despite globus-ws thinking it can't start, CRAB jobs did run successfully. Suspect a similar issue to apache: some service somewhere didn't turn off properly and will require a reboot for proper cleanup.
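Since the root cause here was a full / partition, a periodic usage check might catch this earlier. The following is a minimal sketch using only the Python standard library; the function names and the 90% threshold are arbitrary choices for illustration, not something already installed on the cluster.

```python
import shutil


def percent_used(path="/"):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_partitions(paths, threshold=90.0):
    """Return (path, percent) pairs for partitions at or above threshold."""
    full = []
    for p in paths:
        pct = percent_used(p)
        if pct >= threshold:
            full.append((p, pct))
    return full


if __name__ == "__main__":
    # "/" and "/data" are the partitions mentioned in the entry above.
    for path, pct in check_partitions(["/"]):
        print("WARNING: %s is %.1f%% full" % (path, pct))
```

Run from cron, this would print a warning only when a listed partition crosses the threshold.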
November 26, 2008
MK -- Updated OSG vo-package & VDT to VDT 1.10.1n
- According to GOC Ticket #5967:
- pacman -update OSG:vo-package
- Edited edg-mkgridmap.conf to include only desired VOs (mis, uscms01, ops)
- According to these instructions:
- pacman -update Gratia-Condor-Probe
- pacman -update Gratia-Metric-Probe
- pacman -update OSG-RSV
- pacman -update VOMS-Client
- Oddly, osg-rsv became disabled, had to call vdt-register-service --name osg-rsv --enable. On calling vdt-control --on osg-rsv, wouldn't start, log said "Starting OSG-RSV: ERROR: No probes configured!" Contacted Arvind. Turns out I needed to call configure:
- $VDT_LOCATION/monitoring/configure-osg.py -c
- Tested with my usual CRAB job and it worked just fine.
- Hand transferred the files which PhEDEx was choking on using "srmcp srm://... file://...". PhEDEx then checks the transfer and marks it as complete. Sent an email to OSG-storage to figure out why "srmcp srm://... srm://..." chokes on particular files, but not others. If this problem is not resolved, I will have to write a python script which parses download-srm, looking for failed downloads, and attempts the non-third-party version of the transfer manually. Will take a day to write all by itself.
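A rough outline of what such a script might look like follows. The log line format matched here is entirely hypothetical: the real download-srm log format is not reproduced in this entry, so the FAILED_LINE pattern would need to be rewritten against actual log lines. The helper names are also made up. The sketch only builds the srmcp command lines and does not run them, and the exact file:// form expected by the installed srmcp should be checked before use.

```python
import re

# HYPOTHETICAL line format, for illustration only:
#   "... FAILED <source srm URL> <local destination path>"
# Replace this pattern after inspecting the real download-srm log.
FAILED_LINE = re.compile(r"FAILED\s+(?P<src>srm://\S+)\s+(?P<dest>/\S+)")


def failed_transfers(log_lines):
    """Return unique (source URL, local path) pairs for failed lines."""
    seen = set()
    pairs = []
    for line in log_lines:
        match = FAILED_LINE.search(line)
        if match is None:
            continue
        pair = (match.group("src"), match.group("dest"))
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


def srmcp_command(src, dest):
    """Build the non-third-party form: srmcp srm://... file://..."""
    return ["srmcp", src, "file://" + dest]


def retry_commands(log_lines):
    """Build one srmcp command per failed transfer found in the log."""
    return [srmcp_command(src, dest) for src, dest in failed_transfers(log_lines)]
```

The returned lists could be handed to subprocess.call one at a time once the pattern and URL form have been verified by hand on a single file.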
November 21, 2008
MK -- Debugged CRAB DBS registration, installed CMSSW_2_1_17
- CRAB_2_4_3_pre3 contains the needed fixes for doing private DBS registration.
November 20, 2008
MK -- Debugged PhEDEx deletions
- Had incorrect settings in ConfigPart.Download and FileDownloadDelete (in SITECONF). Told ConfigPart.Download to use only srmv2 and told FileDownloadDelete to use srmrm. Based on T2_IT_Pisa FileDownloadDelete (thanks to Nicolo). PhEDEx should now delete on request and also clean up files which don't pass FileDownloadVerify.
November 19, 2008
MK -- Worked with PhEDEx transfers
- Although transfers from UCSD appeared to succeed, files showed up with zero file size. PhEDEx log shows this, yet PhEDEx still convinced itself that the transfer completed. This was due to pushmode=false. Put in PhEDEx request to delete the sample, but now the download-delete log is issuing the error "alert: no pfn for ..." Not sure what script is being executed to remove files, so can't debug. Contacted Paul.
- FNAL transfer started with pushmode=false. I changed to pushmode=true and restarted PhEDEx using my logrotate script. It appears that the files with zero size are going to stay, so I'll have a partly corrupt dataset. Will probably have to do a PhEDEx deletion and transfer again as well. Will wait until the transfer completes (or claims it completes).
- There was an unusual issue with Prod_T3_US_UMD/state/download-srm/archive subdirs being owned by root. I may have accidentally called logrotate at some point as root, which would have put the PhEDEx service as running under root. Will need to confirm that cron really is running logrotate as root once it does so.
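Both symptoms in this entry (zero-size files that PhEDEx counted as transferred, and state subdirectories owned by root) can be found by walking a directory tree. The sketch below uses only the standard library; audit_tree is a made-up name, and the directories to scan are left to the caller.

```python
import os
import stat


def audit_tree(top):
    """Walk `top` and return (zero_size_files, root_owned_paths)."""
    zero_size = []
    root_owned = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if st.st_uid == 0:
                # uid 0 is root
                root_owned.append(path)
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                zero_size.append(path)
    return sorted(zero_size), sorted(root_owned)
```

Pointing it at the dataset directory would list the zero-size files to delete and re-transfer; pointing it at the PhEDEx state directory would list anything left owned by root.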
November 18, 2008
MK -- Debugged srm transfers
- After renewing phedex user proxy, telling PhEDEx to use the new srm client, and setting pushmode=false in config-2.xml, transfer from UCSD completed overnight.
- Third party transfers with FNAL don't universally work, regardless of pushmode or the version of the FNAL client used. Suspect network lag issues or who knows? Abandoning for now, because did retrieve desired sample from UCSD. It may be necessary to fiddle with proxy and settings every time I make a PhEDEx request! For the time being, left pushmode=false in $SRM_CONFIG.
- srmmkdir is creating directories owned by root, rather than the grid mapped user. I used --enable-sudofsmng and it worked just dandy!
November 17, 2008
MK -- Debugged PhEDEx transfers & threw logrotate at PhEDEx logs
- Requested transfer did not go through after several days, so checked PhEDEx logs. Logs indeed showed a problem with the transfer, though gave no details. Checking /store/mc/Summer08 showed directory was made, but no contents. Edited ConfigPart.Common & ConfigPart.Download.save in /state/partition1/phedex/current/SITECONF/T3_US_UMD/PhEDEx to tell it to use the new srmcp client installed by OSG 1.0 (suspect problem with $SRM_CONFIG, which points to the new config). Tail of transfer log shows no errors, but no attempt to transfer so far. (?)
- When I attempted manual third party transfers with FNAL, they seemed to fail no matter what I did.
- Created ~phedex/phedex.logrotate and /var/spool/cron/phedex on compute-0-7 to rotate logs. Command line tests appear OK and links have not (yet?) gone down.
November 14, 2008
MK -- Installed CRAB_2_4_2
- Updated user guide with new SE output details.
November 5, 2008
MK -- Threw logrotate at files in /var/log
- Details in /root/security.txt.
UMD HEP T3 Computing Cluster