Configuration
The UMD HEP T3 cluster is composed of one head node (HN), one grid node (GN), two interactive nodes (INs), and eight worker nodes (WNs). After RAID and formatting, we have ~9TB of disk space for large datasets, ~400GB for network-mounted software such as CMSSW, and ~400GB for users' network-mounted /home. Our cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).
Last edited August 17, 2009
Node Roles
The OSG Site Planning guide played an important role in the design of our cluster. Our head node (HN) distributes the OS and basic configuration to all other nodes via Rocks Kickstart files, and also runs the Squid web proxy for accessing CMSSW's Frontier database. The grid node (GN) runs the OSG computing element (CE), storage element (SE), PhEDEx, and CMSSW. Users log in to and run interactive jobs on the two interactive nodes (INs), which have locally installed gLite-UI & CRAB software. The eight worker nodes (WNs) are members of the Condor pool and service batch jobs submitted either by local users or by grid users within our supported VOs (primarily CMS).
Head node:
external name: hepcms-hn.umd.edu
internal name: HEPCMS-0 (for historical reasons)
- Rocks head
- Condor pool manager
- Stores users' /home area, which is network mounted
- Ganglia monitor and web server
- Provides internal network gateway
- Squid web proxy for Frontier (CMSSW conditions database)
Grid node:
external name: hepcms-0.umd.edu
internal name: grid-0-0
- Job submission point to WNs for condor grid jobs
- Grid storage element (SE) & computing element (CE)
- Services SE requests with BeStMan-Gateway
- Hosts network-mounted OSG worker node client
- Controls big disk via DAS cable and PERC6/E controller
- Hosts network-mounted CMSSW
- Runs PhEDEx
Having one node fulfill the four important roles of CE, SE, PhEDEx service, and CMSSW network mount is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster.
Some implementations of PhEDEx run atop gLite-UI, which may cause problems with the Rocks frontend, OSG CE, or SE. Additionally, some CRAB installations (such as ours) can run atop gLite-UI, which may need to be configured differently for CRAB vs. PhEDEx. Our PhEDEx installation uses simple srm commands instead of the specialized file transfer service (FTS), which requires gLite-UI. A PhEDEx installation which uses gLite-UI should not be placed on the OSG CE or SE, on a Rocks frontend, or on a node with gLite-UI configured for CRAB.
Two interactive nodes:
external names: hepcms.umd.edu points to hepcms-in1.umd.edu & hepcms-in2.umd.edu
internal names: interactive-0-0 & interactive-0-1
- Job submission point to WNs via Condor (interactive users)
- Installs gLite-UI & CRAB in /scratch
- Runs user interactive jobs
Note that gLite-UI does not work well on a Rocks frontend (some tarball installations of gLite-UI seem better behaved). Our CRAB installation, which is based on gLite-UI, therefore cannot be installed on the HN, nor on the GN, where it causes similar problems with the OSG CE & SE. However, CRAB does support job submission to European sites using Condor GlideIn to some CrabServers, which does not require gLite-UI.
Eight worker nodes:
external names: hepcms-1.umd.edu -> hepcms-8.umd.edu
internal names: compute-0-0 -> compute-0-7
- Service CE (Condor) jobs sent via the GN
- Service interactive (Condor) jobs sent via any of the INs
- Stores CMSSW temporary output in /tmp
- Uses the network-mounted OSG WN client for binaries and configuration needed by grid jobs
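The relation between the internal (Rocks) and external worker node names can be sketched in Python. This is only an illustration: the function name external_name is hypothetical, and the offset of one between the Rocks rank and the external hostname number is read off the table in the Network section.

```python
def external_name(rocks_name: str) -> str:
    """Map a Rocks worker name such as 'compute-0-3' to its external hostname."""
    prefix, rack, rank = rocks_name.split("-")
    if prefix != "compute" or rack != "0":
        raise ValueError(f"not a worker node name: {rocks_name!r}")
    # Rocks ranks start at 0, external hostnames start at hepcms-1.
    return f"hepcms-{int(rank) + 1}.umd.edu"


if __name__ == "__main__":
    for n in range(8):
        name = f"compute-0-{n}"
        print(name, "->", external_name(name))
```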
Hardware
HN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (users' area and applications), ~420GB
- PERC6/E : currently unused
GN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (CMSSW & OSG software network mounts), ~420GB
- PERC6/E : controls all 15 physical disks of PowerVault MD1000 (big disk), configured as RAID-6, ~9 TB
INs: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.66GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 146GB primary disk
- 146GB /tmp disk
WNs: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 80GB primary disk
- 250GB /tmp disk
PowerVault MD1000 (aka big disk)
- DAS
- 15 750GB 7.2K RPM SATA 3Gbps hard drives
- Controlled by PERC6/E controller in GN
PowerConnect 6224
- Managed switch
- Stacking capable
- 24 GbE ports
APS 2200 VA
- 120 Volt UPS
- Network controllable (currently not configured)
- Powers the two 2950s (HN, GN), the PowerVault, the two 1950 INs, and the switch
PowerEdge 2160AS KVM switch
- 16 KVM ports via CAT5 cables (requires Dell server interface pod)
- 2 physical KVM control ports, one connected to rack KVM
Partitions
Head node:
/dev/sda 69374, RAID-1 67.75 GB physical disks 0:0:0, 0:0:1 :
root/ 8189 /sda1 ext3
swap 8189 /sda2 swap
/var 4095 /sda3 ext3
/sda4 is the extended partition which includes /sda5
/scratch 48901 /sda5 ext3
/dev/sdb 418168, RAID-5 408.38 GB physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5 :
/export 418168 /sdb1 ext3
- Squid is installed in /scratch/squid, which is not network mounted (Squid doesn't like network mounts or RAID-5). Squid is needed for contacting the Frontier CMS conditions database, which is a part of CMSSW.
- /export contains the users' network mounted /home area as well as the network mounted /share/apps area (Rocks default). /home/install is also used by Rocks as the OS & kickstart distribution point.
- We were not able to find explicit details on how /export is handled on subsequent Rocks upgrades, but we believe that /export is preserved between reinstalls.
- Because the /home/user and /share/apps sub-directories are auto-network-mounted, they may not all be visible on an ls command; they must be explicitly cd'ed into first (i.e., the directories don't mount until they're accessed). ls /export/home from the HN will always show all the users' home directories, similarly for ls /export/apps.
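The mount-on-access behaviour described above can be worked around in scripts by touching each path before listing it. The following Python sketch assumes the automounter mounts a directory as soon as it is looked up; the function name touch_automounts and the example user name are hypothetical.

```python
import os


def touch_automounts(paths):
    """Access each path so the automounter can mount it; return those that exist."""
    available = []
    for path in paths:
        try:
            # Looking the path up directly is what triggers the mount;
            # listing the parent directory alone would not.
            os.stat(path)
        except OSError:
            continue
        available.append(path)
    return available


if __name__ == "__main__":
    # 'alice' is a placeholder user name, not a real account.
    print(touch_automounts(["/home/alice", "/share/apps"]))
```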
Grid node:
/dev/sda 69374, RAID-1 67.75 GB physical disks 0:0:0, 0:0:1 :
root/ 8189 /sda1 ext3
/tmp 8189 /sda2 ext3
swap 4094 /sda3 swap
/sda4 is the extended partition which includes the rest
swap 4095 /sda5 swap
/var 4095 /sda6 ext3
/localsoft 40712 /sda7 ext3
/dev/sdb 418168, RAID-5 408.38 GB physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5 :
/scratch 418168 /sdb1 ext3
/dev/sdc 9744877, RAID-6, 8.9 TB 15 physical disks
(Logical volume)
/data 9744877 /dev/mapper/datastore-cmsdata0 xfs
- /scratch is network mounted as /sharesoft
- OSG is installed in /scratch/osg
- CMSSW is installed in /scratch/cmssw
- PhEDEx is installed in /localsoft/phedex
- /scratch and /localsoft are preserved across Rocks kickstarts, but will be formatted when the partition table changes
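As a consistency check on the listing above, the partition sizes on /dev/sda can be summed and compared with the reported disk size. This Python sketch assumes the bare numbers are in MiB, which is consistent with 69374 corresponding to the quoted 67.75 GB; the names GN_SDA and to_gib are ours.

```python
MIB_PER_GIB = 1024

# Grid node /dev/sda partitions, sizes copied from the listing (assumed MiB).
GN_SDA = {
    "/sda1 root": 8189,
    "/sda2 /tmp": 8189,
    "/sda3 swap": 4094,
    "/sda5 swap": 4095,
    "/sda6 /var": 4095,
    "/sda7 /localsoft": 40712,
}
GN_SDA_TOTAL = 69374  # reported size of /dev/sda


def to_gib(mib):
    """Convert MiB to GiB."""
    return mib / MIB_PER_GIB


if __name__ == "__main__":
    total = sum(GN_SDA.values())
    print("sum of partitions:", total, "matches disk:", total == GN_SDA_TOTAL)
    print(f"/dev/sda: {to_gib(GN_SDA_TOTAL):.2f} GiB")
```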
Interactive nodes:
/dev/sda 134.8 GB :
root/ 7.9 GB /sda1 ext3
swap 7.9 GB /sda2 swap
/var 4.0 GB /sda3 ext3
/sda4 is the extended partition which includes the rest
/scratch 115 GB /sda5 ext3
/dev/sdb 134.8 GB :
/tmp 134.8 GB /sdb1 ext3
- CRAB is installed in /scratch/crab
- gLite-UI is installed in /scratch/gLite
- /scratch is preserved across Rocks kickstarts, but will be formatted when the partition table changes
Worker nodes:
/dev/sda 76293 :
root/ 8192 /sda1 ext3
swap 8192 /sda2 swap
/var 4096 /sda3 ext3
/scratch 55813 /sda4 ext3
/dev/sdb 238418 :
/tmp 238418 /sdb1 ext3
- /scratch is meant for locally installed WN software, currently unused
- /tmp is meant for temporary grid job output. It is used explicitly by CRAB CMSSW jobs and can be used as a temporary location for job output for interactive jobs or condor batch jobs. Output must be transferred by the user out of /tmp as soon as the job completes as this partition is regularly cleaned.
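Because /tmp is cleaned regularly, a job wrapper may want to move its output elsewhere as its last step. The Python sketch below is one possible way to do this, not a site-provided tool: the function name rescue_output, the example directories, and the *.root file pattern are all assumptions.

```python
import shutil
from pathlib import Path


def rescue_output(job_dir, dest_dir, pattern="*.root"):
    """Move files matching pattern from job_dir to dest_dir; return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(Path(job_dir).glob(pattern)):
        if src.is_file():
            target = dest / src.name
            # shutil.move copies and deletes when crossing file systems.
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved


if __name__ == "__main__":
    # Both paths are placeholders; substitute the job's real directories.
    for path in rescue_output("/tmp/alice/job42", "/data/users/alice/job42"):
        print("moved", path)
```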
Big disk array:
The entire disk array is presented to the OS as a single drive. We use RAID-6 so that a single disk failure does not cause a significant performance loss and so that our data survives the failure of two disks. The drive is configured as a logical volume in the OS. Our disk array allows connections to up to two additional arrays in a daisy chain. By using LVM, we can install additional arrays and simply extend the logical volume over the newly available space. We use the XFS file system, which is designed to handle large disk volumes and has been documented to perform well with BeStMan. While we do not use BeStMan in a pure storage resource manager (SRM) capacity, the ability to do so later may become necessary as the size of the volume increases. At present, the disk array is managed by the OS and is network mounted as /data on all nodes. This makes the array much more accessible to users, but is not a scalable solution. After RAID-6 and formatting, our disk array is roughly 9TB in size.
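The quoted array size can be reproduced with simple arithmetic: RAID-6 gives up two disks' worth of space to parity, so 15 drives leave 13 for data. The Python sketch below assumes the 750GB drive size is in decimal gigabytes (10^9 bytes) and ignores file system overhead; the function names are ours.

```python
PARITY_DISKS = {5: 1, 6: 2}  # disks' worth of capacity used for parity


def usable_gb(n_disks, disk_gb, raid_level):
    """Usable capacity in decimal GB for a RAID-5 or RAID-6 set."""
    parity = PARITY_DISKS[raid_level]
    if n_disks <= parity:
        raise ValueError("not enough disks for this RAID level")
    return (n_disks - parity) * disk_gb


def gb_to_tib(gb):
    """Convert decimal gigabytes to binary tebibytes."""
    return gb * 10**9 / 2**40


if __name__ == "__main__":
    gb = usable_gb(15, 750, 6)
    print(f"{gb} GB usable, about {gb_to_tib(gb):.1f} TiB")
```

Under these assumptions the result is 9750 GB, or about 8.9 TiB, in line with the size reported for /dev/sdc.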
Network
For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.
external IP : external hostname : internal IP : Rocks name
--------------------------------------------------------------------
N/A : N/A (switch) : 255.255.254 : network-0-0
128.8.164.11 : hepcms-hn.umd.edu : 10.1.1.1 : HEPCMS-0
128.8.164.12 : hepcms-0.umd.edu : 10.255.255.237 : grid-0-0
128.8.164.13 : hepcms-1.umd.edu : 10.255.255.253 : compute-0-0
128.8.164.14 : hepcms-2.umd.edu : 10.255.255.252 : compute-0-1
128.8.164.15 : hepcms-3.umd.edu : 10.255.255.251 : compute-0-2
128.8.164.16 : hepcms-4.umd.edu : 10.255.255.250 : compute-0-3
128.8.164.17 : hepcms-5.umd.edu : 10.255.255.249 : compute-0-4
128.8.164.18 : hepcms-6.umd.edu : 10.255.255.248 : compute-0-5
128.8.164.19 : hepcms-7.umd.edu : 10.255.255.247 : compute-0-6
128.8.164.20 : hepcms-8.umd.edu : 10.255.255.246 : compute-0-7
128.8.164.21 : hepcms-in1.umd.edu : 10.255.255.236 : interactive-0-0
128.8.164.22 : hepcms-in2.umd.edu : 10.255.255.235 : interactive-0-1
internal network always on eth0
external network always on eth1
External Gateway: 128.8.164.1
Netmask for external internet: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external internet: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1
The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.
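The addresses and netmasks above can be cross-checked with Python's standard ipaddress module. This sketch copies only a few rows of the table for brevity; the NODES dictionary and the function misplaced are illustrative names, not part of any cluster tooling.

```python
import ipaddress

# Derive each network from the HN's address and the netmask listed above.
INTERNAL_NET = ipaddress.ip_interface("10.1.1.1/255.0.0.0").network
EXTERNAL_NET = ipaddress.ip_interface("128.8.164.11/255.255.255.0").network

# Rocks name -> (external IP, internal IP), a subset of the table.
NODES = {
    "HEPCMS-0": ("128.8.164.11", "10.1.1.1"),
    "grid-0-0": ("128.8.164.12", "10.255.255.237"),
    "compute-0-0": ("128.8.164.13", "10.255.255.253"),
    "interactive-0-0": ("128.8.164.21", "10.255.255.236"),
}


def misplaced(nodes):
    """Return (name, ip) pairs whose address lies outside the expected network."""
    bad = []
    for name, (external, internal) in nodes.items():
        if ipaddress.ip_address(external) not in EXTERNAL_NET:
            bad.append((name, external))
        if ipaddress.ip_address(internal) not in INTERNAL_NET:
            bad.append((name, internal))
    return bad


if __name__ == "__main__":
    print("internal network:", INTERNAL_NET)
    print("external network:", EXTERNAL_NET)
    print("misplaced addresses:", misplaced(NODES))
```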
UMD HEP T3 Computing Cluster