Sun May 31 10:52:06 PDT 2015
The cluster is slowly recovering from the Great Soda Hall Power Shutdown and Machine Room Upgrade. The storage nodes and cluster masters (zen and psi) are back up and I will be restarting compute nodes shortly. Our apologies for the unanticipated extended downtime.
Because the power and network outage ran longer than planned, I was not able to complete all of the scheduled maintenance, so I will need another maintenance window sometime in the near future. Details on that to follow.
One bit of maintenance that did get accomplished is a temporary migration of /work4 from s131:/export/work4 to san2.millennium.berkeley.edu:/mnt/archive/work4 in preparation for a disk upgrade that will double the size of this scratch space. If you are currently mounting it from s131, please temporarily redirect your mount to san2.millennium.berkeley.edu:/mnt/archive/work4. If you have not been actively using this partition in the past, please hold off until after the upgrade is complete.
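To check whether a node is still mounting /work4 from s131, something like the following Python sketch may help. It assumes a Linux node whose mount table (for example the contents of /proc/mounts) lists the source in the first column and the mount point in the second, and that the partition is mounted at /work4; the function names are hypothetical and only for illustration.

```python
# Minimal sketch: find where /work4 is mounted from, given mount-table text.
# Assumptions: /proc/mounts-style lines ("source mountpoint fstype ..."),
# and a mount point of /work4. Function names are illustrative only.

NEW_SOURCE = "san2.millennium.berkeley.edu:/mnt/archive/work4"


def work4_source(mounts_text, mount_point="/work4"):
    """Return the source mounted at mount_point, or None if not mounted."""
    source = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            source = fields[0]  # a later entry shadows an earlier one
    return source


def needs_redirect(mounts_text):
    """True if /work4 is still served by s131 (short or fully qualified name)."""
    source = work4_source(mounts_text)
    if source is None or ":" not in source:
        return False
    host = source.split(":", 1)[0]
    return host.split(".", 1)[0] == "s131"
```

On a cluster node, one could read /proc/mounts into a string and pass it to needs_redirect; if it returns True, the mount should be pointed at the san2 path given above.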
Thu Oct 30 09:40 PDT 2014
/work2 is back. zen and psi should be back to normal.
Thu Oct 30 09:30 PDT 2014
The file server serving /work2 has had a kernel panic and crashed. We are rebooting it now, but it takes time to fsck the disks. In the meantime, cluster nodes may run slowly while waiting for NFS timeouts.
Tue Aug 27 10:00 PDT 2013
We are in the process of upgrading and updating the software configuration of the EECS compute cluster. As part of the transition, we have set up a new head node so that the old and new queues can run side by side for some time.
The new head node is called “zen.millennium.berkeley.edu”. It and its compute nodes are running the latest Ubuntu long-term server release: Ubuntu 12.04 LTS. The queue manager software has also been updated to Torque 2.5.12 and Maui 3.3.1, and some minor differences may be encountered. The new head node will support both a “zen” queue with the old 3GB/2core Dell 1850s and a “psi” queue with the 48GB/8core HP DL1000s and 256GB/24core Dell R810s.
We are opening up the new cluster for unbilled beta-test usage for the next few days, until September 1, 2013, so that users have a chance to check things out and shake out any remaining bugs before billing starts. After September 1, usage on both head nodes will be billed, and more nodes will gradually be migrated from the old “psi” head node to the new “zen” head node as time goes on.
If there are no significant problems with migrating to the new setup, I would like to switch most of the compute nodes over to the new “zen” head node queues before the end of this billing quarter, September 30. (The few remaining 16GB/8core Dell 1950 nodes are likely to remain on the old “psi” head node and “zen” queue for the time being, and may soon be retired.)
Please report any concerns or problems (or gratitude) to firstname.lastname@example.org.
Thanks in advance for your help in checking out the new setup.
Cluster Support <mailto:email@example.com>
- Thu 30 Oct 2014 - /work2 file server had a kernel-panic. rebooted.
- Tue 27 Aug 2013 - Free Beta Test of queues on new zen headnode
- Tue 6 Aug 2013 - Reboot S132 at 10:30am; back at 11:15am. NFS was hung, rendering /work, /usr/mill and /usr/sww inaccessible.
- Wed 24 Jul 2013 - S131 crashed. /work4 inaccessible. rebooting at 9:45am.
- Fri 26 Apr 2013 - Failure of primary NIS and DNS server caused temporary access issues
- Mon 18 Mar 2013 - Network failure partitioning research net has been patched
- Wed 06 Mar 2013 - Ganglia (http://monitor.millennium.berkeley.edu/) is down.
- Sat 15 Dec 2012 - a DNS server, tangelo, went down this morning, resulting in slow/unreliable logins to many systems - rebooted, service restored
- Fri 30 Nov 2012 - /work server hung (also /usr/mill and /usr/sww) - rebooted and remounted on cluster
- Thu 29 Nov 2012 - Power failure in Soda Hall resulted in complete cluster restart
- Sat 9 Jun 2012 - DNS server failed, took down everything with it, recovering.