Sat Apr 26 23:47:18 PDT 2008
Status Update
Since the air conditioning was restored, we've been seeing a
crash on esme at about 2:00 AM each morning. The
crashes appeared to be connected to some interaction between the daily
backups and our NFS configuration. To check that, I temporarily
disabled tape backups on the production servers, and spent all day
Friday testing configuration changes on a pair of test
machines.
And esme did survive Friday night. But the
rack's Ethernet switch did not. Luckily I had a spare, and I went
in and replaced the dead switch. I also updated some of the
configuration files on gytha, and I'm hoping that
we'll see improved stability (although there may be some stuck
machines that will require rebooting).
I have one more set of configuration updates to test and then implement, but I'm going to hold off on those until I'm in the office, probably around lunch on Monday. Before we return to running automated backups, I will also probably try running a backup during the day to see whether it triggers a crash.
Given the failure of the switch, it's possible that the real problem was the switch struggling to handle the large amount of bandwidth consumed by the backups, rather than anything connected directly to the NFS configuration or the backups themselves (none of that configuration had been changed until last night). And if the switch turns out to be the only hardware lost to the HVAC failure, we will have gotten off pretty easy. I'll be keeping an eye on things in case there are additional failures.
In the meantime, I appreciate your patience while I work to restabilize the systems, and while I do some additional testing.
Wed Apr 23 00:00:25 PDT 2008
Status Update
I'm very pleased to announce that F&M did a great job in getting a replacement motor for our HVAC system, and that system is on line and cooling our machine room down to its usual chilly 68° F.
As a result, I have brought most everything back up. The lab
systems (responding to shell.math.hmc.edu),
ponder, the mail and web servers, the Amber cluster,
and even our new mirror server are up and running.
hex is also running, but there's something
preventing SSH connections from working (for me, too). I will look
into that on Wednesday morning and will, I hope, have it working
properly by the afternoon.
Once again, I apologize for the incredible inconvenience of this
outage right during the crunch time at the end of the semester. I
am still watching the systems fairly closely, as I have seen
some weird (that's a technical term) behavior related to NFS
and RPC. I am doing my best to keep things as stable and available
as I can.
Mon Apr 21 15:24:50 PDT 2008
hmcposter LaTeX Class Version 3.0 on Website
Earlier this month,
I announced a new version of the hmcposter class.
Unfortunately, I hadn't updated the symlinks to point to the current
versions, so the new files never appeared on the website.
I took advantage of the opportunity to add two shim classes that will catch attempts to use the older classes and tell you what you need to do to switch to the newer version.
The new version is available from the
poster class page and is also installed on the math cluster (in
/shared/local/share/texmf/tex/latex/hmcposter).
Mon Apr 21 15:11:59 PDT 2008
HVAC Outage Continues; Limited Services Restored
Apparently the manufacturer of our AC system won't be able to get us a replacement motor until May 8. So F&M are looking for an equivalent motor that they can install. That's still going to take until tomorrow.
In the meantime, I have the file, mail, and authentication
servers running. I have also brought ponder on line
for general use. It's okay to (re)boot faculty, lab, and classroom
workstations for use.
The Amber cluster, hex, and our new mirror server (but
not yum.math.hmc.edu) will remain off line
until we get proper cooling back.
If you need information that is stored on
hex or the Amber cluster, please send me e-mail so we
can work out some way of getting you access. That won't happen
before tomorrow, though, and we may have those systems back on line
by then.
Mon Apr 21 09:18:15 PDT 2008
Servers Off Line Until HVAC Is Repaired
We needed to move the rack over to allow access to the HVAC unit. I have taken down the mail, authentication, and file servers until the repairs can be made.
All computing services will be unavailable until our HVAC is back on line. Once we have cooling, I will begin restarting machines.
Apologies for the inconvenience.
Sun Apr 20 22:08:46 PDT 2008
Partial Recovery; Repairs Tomorrow
After much work, the core servers are working as they did before the failure. I am currently processing the backlog of mail (most of which, of course, is spam), and I have turned mail delivery back on.
I have not turned the IMAP and POP servers on, so getting mail will not be possible until I do. I have also shut down most of the Linux machines and Macs, and I won't be turning those back on until after things are more stable.
Note that tomorrow morning around 8:00 AM, we will have an HVAC engineer working to replace the dead motor in our HVAC unit. We may need to shut systems down in order to move things in the machine room to provide access to the HVAC equipment. Please check back here (the core web server is in my office and will remain operational) for more information tomorrow morning.
Please don't rush into starting up office, classroom, or lab machines unless I've said here that things are back up and reliable.
I apologize for this unforeseeable and extremely annoying outage. Please be assured that I am doing everything that I can to bring our systems back on line in a timely but safe manner.
Sun Apr 20 12:04:50 PDT 2008
HVAC Failure in Machine Room
The HVAC unit that cools our machine room has failed, causing the temperature to rise dramatically and several of the machines to shut down to protect themselves.
At this point I have shut down all the machines in the machine room, and relocated the web server to my office. F&M has helped out by getting some powerful fans to cool the room down, but it looks like we won't get the HVAC unit back on line until sometime tomorrow.
I am letting the fans run with the systems off to cool them down as much as possible. I am also verifying the home directory volume to ensure that no serious damage has occurred.
Once it seems like things have cooled down to a reasonable level, I will bring some of the more important servers back on line (e.g., mail server, file server, ponder). Full services will not be restored until we have air conditioning in the room again.
Stay tuned here for more updates.
Fri Apr 4 14:57:51 PDT 2008
Version 3.0 of hmcposter LaTeX Document Class Released
Version 3.0 of the hmcposter LaTeX document class
has been released.
This version of the class supports the creation of posters for Clinic projects and for thesis (and other classes) using a single document class.
More information about the class and how to use it is available
on the hmcposter class page. We also have information about the
printing process and about creating good posters.
The participants' resource page in the Clinic website (Mudd only) and the thesis tools page have been updated with new information about this poster class.
As usual, while we've tested the class, there may still be some problems that we missed. Please report them to us so that we can fix them as soon as possible.
Fri Feb 29 02:02:08 PST 2008
Unexpected Power Outage; Systems Okay?
Apparently there was a glitch in the power feed from Edison to the Claremont Colleges. The Colleges' generators did not kick in, which means that systems without UPS power -- notably the Amber cluster and the Scientific Computing Lab -- lost power and crashed.
The Scientific Computing Lab systems seem to have recovered and are running as expected. Our servers are backed by UPSs and by a local generator, so they did not lose power. The Amber cluster did lose power, and, while some machines restarted when power was restored, some did not. I have manually restarted those systems, and the cluster seems to be up and running normally.
If you're having a power-related problem, please let me know so that I can look into it.
Fri Feb 29 01:51:54 PST 2008
Server Work May Require Workstation Reboots (YMMV)
I ended up doing some fairly significant work in the machine
room Thursday afternoon and evening, which involved rewiring the
entire rack. In order to be sure that some of the systems were
working properly, I rebooted several of the machines in the rack,
including the department's main file server (gytha)
and our parallel compute server (hex). As a result, some
workstations -- especially Mac OS X machines -- may be confused
about their NFS mounts. If you have problems logging in or if you
can log in but you can't access your home directory or applications
or other materials stored in /shared/local, please
reboot the machine and try again.
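
For anyone who would rather check before rebooting, the following is a minimal sketch in Python of one way to test whether a directory such as /shared/local still responds. The function name path_responds and the five-second timeout are illustrative assumptions, not tools installed on the math cluster. The check runs in a child process because reading a stale NFS mount can block indefinitely; on some mounts the stuck child may not be killable, so treat a False result as a hint rather than a diagnosis.

```python
import subprocess
import sys


def path_responds(path, timeout=5.0):
    """Return True if listing `path` succeeds within `timeout` seconds.

    Hypothetical helper for illustration; the name and the default
    timeout are assumptions, not part of any cluster tooling.
    """
    # List the directory in a child process so that a hung NFS mount
    # stalls the child instead of this script.
    cmd = [sys.executable, "-c",
           "import os, sys; os.listdir(sys.argv[1])", path]
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        # Exit status 0 means the listing worked; anything else means
        # the path is missing, unreadable, or otherwise broken.
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best-effort cleanup. We do not wait for the child again,
        # because a process stuck on NFS I/O may never exit.
        proc.kill()
        return False


if __name__ == "__main__":
    # Check the paths given on the command line, or /shared/local.
    for path in sys.argv[1:] or ["/shared/local"]:
        state = "responds" if path_responds(path) else "does NOT respond"
        print(f"{path}: {state}")
```

If the check reports that the directory does not respond, rebooting the workstation as described above remains the simplest fix.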
I'm about to go to bed, but I will be reachable at home or by cell tomorrow if there are any unforeseen issues.