187855744 matt(agement)
130572424 fengxs
123467620 msnajdr
72580156 sting-2006
60392992 xiao
41563976 backups
39103304 bh1-2005
33482612 fransp
22769984 petryk
21399404 jinbei
/home2 usage: Tue Apr 25 07:58:21 PDT 2006
141228060 dale
126612416 fransp
48689268 petryk
47373184 roland
40428984 cwlai
10698328 maggie
9198576 fengxs
7558684 scn
node004 has just been rebooted and there is probably some PBS mess to be cleaned up. Please bear with us.
Fortunately, the PBS problem that was afflicting the system yesterday seems to have cleared itself up, so that jobs requesting 8 or more processors will now run.
Report any further difficulties to Matt, but note that turn-around on problem reports may be slow until next Tuesday.
Matt will be very busy until FEBRUARY 8TH so turnaround on cluster problems may suffer during that period. Management appreciates your patience.
PBS is currently refusing to start jobs with anything other than
{1,2,3,4,5,6,7} processors. Multiple
qterm -t quick
service pbs start
incantations have NOT cleared the problem. This presumably CAN be fixed,
but I have to tend to NSERC business until Feb 8.
Until that time, or until you hear further, I suggest one of two options:
1) Cleverly reconfigure your MPI programs so that they can be initiated as
a certain number of 7-processor (e.g.) runs (see the sketch below)
AND/OR
2) Start to use the Anarchy (TM) system from the old cluster to submit
parallel jobs interactively via load averages.
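For route 1), a PBS submission along the following lines should stay within the 7-processor limit. This is only a sketch: the exact resource request syntax may differ on this system, and XXX again stands for the name of your executable.
#!/bin/sh
#PBS -l nodes=7
# run the MPI executable on the processors that PBS assigned
cd $PBS_O_WORKDIR
/opt/gmpi.intel/bin/mpirun -np 7 -machinefile $PBS_NODEFILE ./XXX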
Should you choose route 2), note that 'avail -m', though slow, works,
so that, e.g.
avail -m -n 8 -s
generates a short (-s) listing of the 8 (-n 8) least busy Myrinet (-m)
nodes
so that
alias doit 'avail -m -n \!* -s > mfile; /opt/gmpi.intel/bin/mpirun -np \!* -machinefile mfile XXX'
where XXX is the name of the executable, will provide an alias 'doit' so that
doit 8
runs XXX on 8 processors etc. Don't be surprised if it takes many seconds
before you see the machine file; this is simply reflective of the slowness
of the 'avail' script. Once the machine file is generated, execution of
the MPI program should begin very quickly.
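For those not running csh/tcsh (where the alias and \!* syntax above apply), the same two steps can simply be issued directly; e.g., for an 8-processor run of XXX:
avail -m -n 8 -s > mfile
/opt/gmpi.intel/bin/mpirun -np 8 -machinefile mfile XXX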
Management apologizes profusely for the inconvenience this has caused, which includes, but may not be limited to, a slightly confused PBS state earlier this morning, and possibly the premature termination of some jobs at that time.
All users should check output for integrity and restart/resubmit jobs as necessary.
Management wishes all the best of the season!
The /home2 partition was briefly filled yesterday due to certain technical difficulties. All users who were running jobs yesterday that were writing to that partition should check output for integrity and resubmit/restart jobs as necessary.
Thanks to our next business day parts cross-ship service agreement with Atipa, node057 is coming back from the shop after only 15 business days! We will also be taking this opportunity to install a redundant tape drive in the head node.
Thus, per an earlier message, the entire cluster will be shut down at 9:30 AM on MONDAY, NOVEMBER 15.
All users should ensure that their PBS and interactive jobs on the cluster have been killed by MONDAY, NOVEMBER 15, 9:30 AM, or management will kill them for you at that time. The cluster should be available by noon or shortly thereafter.
Management apologizes in advance for the inconvenience the shutdown will cause. Direct any questions, gripes etc. to Matt.
Management is aware that there is a problem with PBS, but has to go to a party. Please bear with us.
The cluster is again available for general use. Report any problems to Matt or Pal.
Note that there will be another complete shutdown of the cluster in a week or two for installation of a second tape drive on the head node.
Per an earlier message, the entire cluster will be shut down for a reboot, as well as to deal with some hardware issues, on WEDNESDAY, OCTOBER 27.
All users should ensure that their PBS and interactive jobs on the cluster have been killed by WEDNESDAY, OCTOBER 27, 9:30 AM, or management will kill them for you at that time. The cluster should be available again by noon the same day.
There will be another shutdown in approximately two weeks in order to install a redundant tape drive on the head node.
Management apologizes in advance for the inconvenience these shutdowns will cause. Direct any questions, gripes etc. to Matt.
node057 IS UNAVAILABLE UNTIL FURTHER NOTICE.
The PBS server crashed sometime over the past 12 hours and had to be restarted. Some jobs that were running at the time of the crash have apparently restarted (more would have restarted had it not been for certain technical problems) but all users should check job output for integrity and kill/restart jobs as necessary.
Management apologizes for the inconvenience.
Management also suspects that a cluster-wide reboot will soon be a very good idea, so would appreciate it if users are prepared to stop computing on about a day's notice.
vnfe4:/home was completely filled early this morning. Any users running jobs that are performing output to that partition should check their job output for integrity and resubmit as necessary.
Indeed, vnfe4:/home is chronically full these days. Management reminds users that they are ALL responsible for ensuring that usage remains at about 80% of capacity, MAXIMUM.
PLEASE CLEAN UP AND ECONOMIZE DISK USAGE on /home ASAP!!
Also note that WestGrid has a large-scale storage facility at SFU that was custom designed for long-term storage of large quantities of data.
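As a starting point for cleanup, commands along the following lines can help locate and compact large holdings (the file and directory names here are placeholders only):
du -sk /home/$USER/* | sort -n
gzip some_large_output_file
tar czf old_runs.tar.gz old_runs && rm -r old_runs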
Management is sorry to report that PBS was accidentally shut down about 90 minutes ago, leading to the loss of all jobs that were running at that time. It seems that some/most/all jobs were restarted, but users should check their job output and resubmit as necessary.
Management apologizes for the inconvenience this technical problem has caused.
With luck this will be (mostly) transparent to users, but there may be some glitches, so if you encounter problems accessing pages etc. that you used to be able to access, send mail to Matt immediately.
You should NOT have to modify any URLs to continue to access these pages.
Note, however, that the switch-over has not yet taken place. This message will be updated when it actually has.
The Intel compilers have been upgraded to Version 8.0. The update should, in theory, be transparent to users, so contact Matt if you have any problems with their use. Note that the preferred name for the Intel Fortran compiler is now ifort.
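For example, a typical Fortran compilation with the new compiler now looks like (file names are illustrative):
ifort -O3 -o myprog myprog.f90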
FEBRUARY 12: 06:00-06:30
Please refer to the PREVIOUSLY ANNOUNCED PAGE, Running I/O Intensive Jobs on the Cluster for policies that have been in effect since last month. The page is also available via the link in the System Usage section below.
All users should (must!) re-read this page as well as the two pages in the System Overview section below for more details about the upgrade, and CRUCIALLY, for changes in running policy.
Compute nodes node001 through node054 all now have Myrinet cards, and are accessible through the PBS Myrinet queue. There are only three gigabit nodes now, node055, node056 and node057, and those are the ONLY nodes that should be used interactively. However, as described in the New Policy section HERE, users will also be able to submit a reasonable number of serial (single-processor) jobs via the Myrinet queue, particularly until the WestGrid UBC/TRIUMF cluster comes back on-line and stabilizes.
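As an illustration only, a serial submission to the Myrinet queue might look something like the following; the queue name 'myrinet', the walltime, and the executable name are assumptions, so check 'qstat -q' for the actual queue names and adjust accordingly:
#!/bin/sh
#PBS -q myrinet
#PBS -l nodes=1,walltime=24:00:00
# run the serial executable from the submission directory
cd $PBS_O_WORKDIR
./myserialprog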
Please contact Matt immediately if you encounter any problems with the new configuration, but note that I will be out of town with infrequent mail contact from Fri. Feb. 6 at noon, through Tues. Feb. 10 AM.
During this time, e-mail problem reports to Pal Sandhu with a CC to Matt.
With luck, the system will be back up sometime late this afternoon, or early evening, but it could also be down longer. Please monitor this site for updated news on cluster availability.
The previously announced cluster shutdown for the Myrinet upgrade is now scheduled for
THURSDAY, FEBRUARY 5, 9:00 AM --- 5:00 PM
To expedite cluster shutdown, and to minimize the amount of data loss etc., users should ensure that all batch and interactive jobs have been terminated by 8:30 AM on Thursday. Jobs still running at that time will be summarily killed.
Note that the completion time is approximate; the work may take longer, and could even extend into Friday.
Also note that once the upgrade is complete there will be 11 fewer Gigabit nodes available for interactive jobs.
All batch jobs have been terminated, and users running interactively on the Gig nodes should check program output for integrity and restart as necessary.
Management apologizes for the inconvenience.
Sometime in the latter part of next week, probably either Wed. Feb. 4, or Thurs. Feb. 5, the cluster will be shut down in order to install Myrinet equipment in some of the current Gigabit nodes. Further, more specific information will be supplied as it becomes available, but users should be prepared to vacate the cluster on about a day's notice.
Please refer to the new page Running I/O Intensive Jobs on the Cluster for new policies that are to take effect immediately. The page is also available via the link in the System Usage section below.
Contact Matt if anything about the new policies is unclear to you, or if you have concerns/questions about them.
Nodes
node001 node002 node003 node004
have been marked off-line from the point of view of PBS in order to free them up for some system work. Jobs currently running on those machines will be allowed to complete.
The head node hung up this evening, and all PBS jobs that were running were killed. Users running on the Gig nodes should also check program output for integrity and restart jobs as necessary.
Management apologizes for the inconvenience.
As usual, report any problems to Matt at choptuik@physics.ubc.ca
The entire cluster is tentatively scheduled to be shut down WEDNESDAY, NOVEMBER 12 and THURSDAY, NOVEMBER 13.
During this period, all original power supplies will be replaced (we have good reason to believe that the cluster was constructed with a bad batch of supplies---on the order of 15 have already failed), and the bad Myrinet blade will also be replaced. Management requests that all users ensure that their jobs on the cluster, both batch/PBS and interactive, have been stopped by 8:00 AM WEDNESDAY, NOVEMBER 12.
All jobs that are still running at that time will be subject to summary termination. Should the scheduling of this work change, this message will be updated.
Thanks in advance for your cooperation, and please contact Matt at choptuik@physics.ubc.ca should you have questions or concerns about this matter.
node001 node002 node003 node004 node021 node023 node024
These nodes will be taken off-line in a few days in order to facilitate repair of the Myrinet switch.
node001 node002 node003 node004 node021 node022 node023 node024
Once the jobs currently running on these nodes have finished execution, and the blade has been replaced, the nodes will be made available again.
node001 node002 node003 node004 node021 node023 node024
Once the jobs currently running on these nodes have finished execution, the blade will be temporarily pulled from the switch (in order to determine a serial number), after which the nodes will be made available again. The process will have to be repeated once the replacement blade has arrived.
The node has been marked as down from the point of view of PBS, so this unavailability should be transparent to users.
Any users who have launched jobs INTERACTIVELY on the Myrinet nodes should KILL those jobs ASAP and resubmit via the batch queue.
Please report any further problems to Matt, and many thanks to Mark Thachuk for bringing the problem immediately to our attention.
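For reference, an interactive process left on one of the Myrinet nodes can typically be located and killed along these lines (the node name and process ID are placeholders, and this assumes you can ssh or rsh to the compute nodes):
ssh node0NN
ps -u $USER
kill PID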