187855744 matt(agement)
130572424 fengxs
123467620 msnajdr
72580156 sting-2006
60392992 xiao
41563976 backups
39103304 bh1-2005
33482612 fransp
22769984 petryk
21399404 jinbei
/home2 usage: Tue Apr 25 07:58:21 PDT 2006
141228060 dale
126612416 fransp
48689268 petryk
47373184 roland
40428984 cwlai
10698328 maggie
9198576 fengxs
7558684 scn
node004 has just been rebooted and there is probably some PBS mess to be cleaned up. Please bear with us.
Fortunately, the PBS problem that was afflicting the system yesterday seems to have cleared itself up, so that jobs requesting 8 or more processors will now run.
Report any further difficulties to Matt, but note that turn-around on problem reports may be slow until next Tuesday.
Matt will be very busy until FEBRUARY 8TH so turnaround on cluster problems may suffer during that period. Management appreciates your patience.
PBS is currently refusing to start jobs with anything other than
{1,2,3,4,5,6,7} processors. Multiple
qterm -t quick
service pbs start
incantations have NOT cleared the problem. This presumably CAN be fixed,
but I have to tend to NSERC business until Feb 8.
Until that time, or until you hear further, I suggest one of two options:
1) Cleverly reconfigure your MPI programs so that they can be initiated as
a certain number of 7-processor (e.g.) runs (see the sketch below)
AND/OR
2) Start to use the Anarchy (TM) system from the old cluster to submit
parallel jobs interactively via load averages.
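For route 1), a PBS submission along the following lines should stay within the 7-processor limit. This is only a sketch: the exact resource request syntax may differ on this system, and XXX again stands for the name of your executable.
#!/bin/sh
#PBS -l nodes=7
# run the MPI executable on the processors that PBS assigned
cd $PBS_O_WORKDIR
/opt/gmpi.intel/bin/mpirun -np 7 -machinefile $PBS_NODEFILE ./XXX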
Should you choose route 2), note that 'avail -m', though slow, works,
so that, e.g.
avail -m -n 8 -s
generates a short (-s) listing of the 8 (-n 8) least busy Myrinet (-m)
nodes
so that
alias doit 'avail -m -n \!* -s > mfile; /opt/gmpi.intel/bin/mpirun -np \!* -machinefile mfile XXX'
where XXX is the name of the executable, will provide an alias 'doit' so that
doit 8
runs XXX on 8 processors etc. Don't be surprised if it takes many seconds
before you see the machine file; this is simply reflective of the slowness
of the 'avail' script. Once the machine file is generated, execution of
the MPI program should begin very quickly.
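For those not running csh/tcsh (where the alias and \!* syntax above apply), the same two steps can simply be issued directly; e.g., for an 8-processor run of XXX:
avail -m -n 8 -s > mfile
/opt/gmpi.intel/bin/mpirun -np 8 -machinefile mfile XXX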
Management apologizes profusely for the inconvenience this has caused, which includes, but may not be limited to, a slightly confused PBS state earlier this morning, and possibly the premature termination of some jobs at that time.
All users should check output for integrity and restart/resubmit jobs as necessary.
Management wishes all the best of the season!
The /home2 partition was briefly filled yesterday due to certain technical difficulties. All users who were running jobs yesterday that were writing to that partition should check output for integrity and resubmit/restart jobs as necessary.
Thanks to our next business day parts cross-ship service agreement with Atipa, node057 is coming back from the shop after only 15 business days! We will also be taking this opportunity to install a redundant tape drive in the head node.
Thus, per an earlier message, the entire cluster will be shut down at 9:30 AM on MONDAY, NOVEMBER 15.
All users should ensure that their PBS and interactive jobs on the cluster have been killed by MONDAY, NOVEMBER 15, 9:30 AM, or management will kill them for you at that time. The cluster should be available by noon or shortly thereafter.
Management apologizes in advance for the inconvenience the shutdown will cause. Direct any questions, gripes etc. to Matt.
Management is aware that there is a problem with PBS, but has to go to a party. Please bear with us.
The cluster is again available for general use. Report any problems to Matt or Pal.
Note that there will be another complete shutdown of the cluster in a week or two for installation of a second tape drive on the head node.
Per an earlier message, the entire cluster will be shut down for a reboot, as well as to deal with some hardware issues, on WEDNESDAY, OCTOBER 27.
All users should ensure that their PBS and interactive jobs on the cluster have been killed by WEDNESDAY, OCTOBER 27, 9:30 AM, or management will kill them for you at that time. The cluster should be available again by noon the same day.
There will be another shutdown in approximately two weeks in order to install a redundant tape drive on the head node.
Management apologizes in advance for the inconvenience these shutdowns will cause. Direct any questions, gripes etc. to Matt.
node057 IS UNAVAILABLE UNTIL FURTHER NOTICE.
The PBS server crashed sometime over the past 12 hours and had to be restarted. Some jobs that were running at the time of the crash have apparently restarted (more would have restarted had it not been for certain technical problems) but all users should check job output for integrity and kill/restart jobs as necessary.
Management apologizes for the inconvenience.
Management also suspects that a cluster-wide reboot will soon be a very good idea, so would appreciate it if users are prepared to stop computing on about a day's notice.
vnfe4:/home was completely filled early this morning. Any users running jobs that are performing output to that partition should check their job output for integrity and resubmit as necessary.
Indeed, vnfe4:/home is chronically full these days. Management reminds users that they are ALL responsible for ensuring that usage remains at about 80% of capacity, MAXIMUM.
PLEASE CLEAN UP AND ECONOMIZE DISK USAGE on /home ASAP!!
Also note that WestGrid has a large-scale storage facility at SFU that was custom designed for long-term storage of large quantities of data.
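As a starting point for cleanup, commands along the following lines can help locate and compact large holdings (the file and directory names here are placeholders only):
du -sk /home/$USER/* | sort -n
gzip some_large_output_file
tar czf old_runs.tar.gz old_runs && rm -r old_runs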
Management is sorry to report that PBS was accidentally shut down about 90 minutes ago, leading to the loss of all jobs that were running at that time. It seems that some/most/all jobs were restarted, but users should check their job output and resubmit as necessary.
Management apologizes for the inconvenience this technical problem has caused.
With luck this will be (mostly) transparent to users, but there may be some glitches, so if you encounter problems accessing pages etc. that you used to be able to access, send mail to Matt immediately.
You should NOT have to modify any URLs to continue to access these pages.
Note, however, that the switch-over has not yet taken place. This message will be updated when it actually has.
The Intel compilers have been upgraded to Version 8.0. The update should, in theory, be transparent to users, so contact Matt if you have any problems with their use. Note that the preferred name for the Intel Fortran compiler is now ifort.
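For example, a typical Fortran compilation with the new compiler now looks like (file names are illustrative):
ifort -O3 -o myprog myprog.f90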
FEBRUARY 12: 06:00-06:30
Please refer to the PREVIOUSLY ANNOUNCED PAGE, Running I/O Intensive Jobs on the Cluster for policies that have been in effect since last month. The page is also available via the link in the System Usage section below.
All users should (must!) re-read this page as well as the two pages in the System Overview section below for more details about the upgrade, and CRUCIALLY, for changes in running policy.
Compute nodes node001 through node054 all now have Myrinet cards, and are accessible through the PBS Myrinet queue. There are only three gigabit nodes now, node055, node056 and node057, and those are the ONLY nodes that should be used interactively. However, as described in the New Policy section HERE, users will also be able to submit a reasonable number of serial (single-processor) jobs via the Myrinet queue, particularly until the WestGrid UBC/TRIUMF cluster comes back on-line and stabilizes.
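As an illustration only, a serial submission to the Myrinet queue might look something like the following; the queue name 'myrinet', the walltime, and the executable name are assumptions, so check 'qstat -q' for the actual queue names and adjust accordingly:
#!/bin/sh
#PBS -q myrinet
#PBS -l nodes=1,walltime=24:00:00
# run the serial executable from the submission directory
cd $PBS_O_WORKDIR
./myserialprog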
Please contact Matt immediately if you encounter any problems with the new configuration, but note that I will be out of town with infrequent mail contact from Fri. Feb. 6 at noon, through Tues. Feb. 10 AM.
During this time, e-mail problem reports to Pal Sandhu with a CC to Matt.
With luck, the system will be back up sometime late this afternoon, or early evening, but it could also be down longer. Please monitor this site for updated news on cluster availability.
The previously announced cluster shutdown for the Myrinet upgrade is now scheduled for
THURSDAY, FEBRUARY 5, 9:00 AM --- 5:00 PM
To expedite cluster shutdown, and to minimize the amount of data loss etc., users should ensure that all batch and interactive jobs have been terminated by 8:30 AM on Thursday. Jobs still running at that time will be summarily killed.
Note that the completion time is approximate; the work may take longer, and could even extend into Friday.
Also note that once the upgrade is complete there will be 11 fewer Gigabit nodes available for interactive jobs.
All batch jobs have been terminated, and users running interactively on the Gig nodes should check program output for integrity and restart as necessary.
Management apologizes for the inconvenience.
Sometime in the latter part of next week, probably either Wed. Feb. 4, or Thurs. Feb. 5, the cluster will be shut down in order to install Myrinet equipment in some of the current Gigabit nodes. Further, more specific information will be supplied as it becomes available, but users should be prepared to vacate the cluster on about a day's notice.
Please refer to the new page Running I/O Intensive Jobs on the Cluster for new policies that are to take effect immediately. The page is also available via the link in the System Usage section below.
Contact Matt if anything about the new policies is unclear to you, or if you have concerns/questions about them.
Nodes
node001 node002 node003 node004
have been marked off-line from the point of view of PBS in order to free them up for some system work. Jobs currently running on those machines will be allowed to complete.
The head node hung up this evening, and all PBS jobs that were running were killed. Users running on the Gig nodes should also check program output for integrity and restart jobs as necessary.
Management apologizes for the inconvenience.
As usual, report any problems to Matt at choptuik@physics.ubc.ca
The entire cluster is tentatively scheduled to be shut down WEDNESDAY, NOVEMBER 12 and THURSDAY, NOVEMBER 13.
During this period, all original power supplies will be replaced (we have good reason to believe that the cluster was constructed with a bad batch of supplies---on the order of 15 have already failed), and the bad Myrinet blade will also be replaced. Management requests that all users ensure that their jobs on the cluster, both batch/PBS and interactive, have been stopped by 8:00 AM WEDNESDAY, NOVEMBER 12.
All jobs that are still running at that time will be subject to summary termination. Should the scheduling of this work change, this message will be updated.
Thanks in advance for your cooperation, and please contact Matt at choptuik@physics.ubc.ca should you have questions or concerns about this matter.
node001 node002 node003 node004 node021 node023 node024
These nodes will be taken off-line in a few days in order to facilitate repair of the Myrinet switch.
node001 node002 node003 node004 node021 node022 node023 node024
Once the jobs currently running on these nodes have finished execution, and the blade has been replaced, the nodes will be made available again.
node001 node002 node003 node004 node021 node023 node024
Once the jobs currently running on these nodes have finished execution, the blade will be temporarily pulled from the switch (in order to determine a serial number), after which the nodes will be made available again. The process will have to be repeated once the replacement blade has arrived.
The node has been marked as down from the point of view of PBS, so this unavailability should be transparent to users.
Any users who have launched jobs INTERACTIVELY on the Myrinet nodes should KILL those jobs ASAP and resubmit via the batch queue.
Please report any further problems to Matt, and many thanks to Mark Thachuk for bringing the problem immediately to our attention.
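For reference, an interactive process left on one of the Myrinet nodes can typically be located and killed along these lines (the node name and process ID are placeholders, and this assumes you can ssh or rsh to the compute nodes):
ssh node0NN
ps -u $USER
kill PID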