Sat Apr 26 23:47:18 PDT 2008
Status Update
Since the air conditioning was
restored, we've been seeing a crash on
esme at about 2:00 AM each
morning. The crashes appeared to be
connected to some interaction between the
daily backups and our NFS
configuration. To check that, I
temporarily disabled tape backups on
the production servers, and spent all
day Friday testing configuration
changes on a pair of test machines.
And esme did
survive Friday night. But the rack's
Ethernet switch did not. Luckily I had
a spare, and I went in and replaced the
dead switch. I also updated some of the
configuration files on
gytha, and I'm hoping that
we'll see improved stability (although
there may be some stuck machines that
will require rebooting).
I have one more set of configuration updates to test and then implement, but I'm going to hold off on those until I'm in the office, probably around lunch on Monday. I will also probably try running a backup during the day to see whether it will trigger a crash before we return to running automated backups.
Given the failure of the switch, it's possible that the key to the problem was related to the switch handling the large amount of bandwidth consumed by the backups rather than anything connected directly to the NFS configurations or the backups (none of that configuration had been changed until last night). And if losing the switch is the only hardware failure from the HVAC failure, we will have gotten off pretty easy. I'll be keeping an eye on things in case there are additional failures.
In the meantime, I appreciate your patience while I work to restabilize the systems, and while I do some additional testing.
Wed Apr 23 00:00:25 PDT 2008
Status Update
I'm very pleased to announce that F&M did a great job in getting a replacement motor for our HVAC system, and that system is on line and cooling our machine room down to its usual chilly 68° F.
As a result, I have brought most
everything back up. The lab systems
(responding to
shell.math.hmc.edu),
ponder, the mail and web
servers, the Amber cluster, and even
our new mirror server are up and
running.
hex is also running, but
there's something preventing SSH
connections from working (for me, too).
I will look into that on Wednesday
morning and will, I hope, have it
working properly by the afternoon.
Once again, I apologize for the
incredible inconvenience of this outage
right during the crunch time at the end
of the semester. I am still watching
the systems fairly closely, as I have
seen some weird
(that's a
technical term) behavior related to NFS
and RPC. I am doing my best to keep
things as stable and available as I
can.
Mon Apr 21 15:24:50 PDT 2008
hmcposter LaTeX Class Version 3.0 on Website
Earlier this month,
I announced a new version of the
hmcposter class.
Unfortunately, I hadn't updated the
symlinks to the current
versions, so they weren't updated on
the website.
I took advantage of the opportunity to add two shim classes that will catch attempts to use the older classes and tell you what you need to do to switch to the newer version.
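For the curious, a shim class of this kind can be very small. The following is only a sketch of the general idea, not the actual file we installed; the class name oldposter is made up for illustration, and falling back to the article class is an assumption chosen so the run can continue after the error.

```latex
% oldposter.cls: hypothetical shim for a retired class name.
% The name "oldposter" is illustrative only.
\NeedsTeXFormat{LaTeX2e}
\ProvidesClass{oldposter}[2008/04/21 Shim pointing users to hmcposter]

% Stop with an error that explains what to change.
\ClassError{oldposter}%
  {This class has been replaced by the hmcposter class}%
  {Change the document class of your poster to hmcposter and run LaTeX again.}

% Load a standard class so the run does not fail with a cascade
% of unrelated errors if the user continues past the message.
\LoadClass{article}
```

A document that still asks for the old class name then stops with a message naming the replacement, rather than failing with a "file not found" error.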
The new version is available from
the poster class page and is also
installed on the math cluster (in
/shared/local/share/texmf/tex/latex/hmcposter).
Mon Apr 21 15:11:59 PDT 2008
HVAC Outage Continues; Limited Services Restored
Apparently the manufacturer of our AC system won't be able to get us a replacement motor until May 8. So F&M are looking for an equivalent motor that they can install. That's still going to take until tomorrow.
In the meantime, I have the file, mail,
and authentication servers running. I
have also brought ponder
on line for general use. It's okay to
(re)boot faculty, lab, and classroom
workstations for use.
The Amber cluster, hex,
and our new mirror server (but not
yum.math.hmc.edu) will
remain offline until we get proper
cooling back.
If you need information that
is stored on hex or the
Amber cluster, please send me e-mail so
we can work out some way of getting you
access. That won't happen before
tomorrow, though, and we may have those
systems back on line by then.
Mon Apr 21 09:18:15 PDT 2008
Servers Off Line Until HVAC Is Repaired
We needed to move the rack over to allow access to the HVAC unit. I have taken down the mail, authentication, and file servers until the repairs can be made.
All computing services will be unavailable until our HVAC is back on line. Once we have cooling, I will begin restarting machines.
Apologies for the inconvenience.
Sun Apr 20 22:08:46 PDT 2008
Partial Recovery; Repairs Tomorrow
After much work, the core servers are working as they did before the failure. I am currently processing the backlog of mail (most of which, of course, is spam), and I have turned mail delivery back on.
I have not turned the IMAP and POP servers on, so getting mail will not be possible until I do. I have also shut down most of the Linux machines and Macs, and I won't be turning those back on until after things are more stable.
Note that tomorrow morning around 8:00 AM, we will have an HVAC engineer working to replace the dead motor in our HVAC unit. We may need to shut systems down in order to move things in the machine room to provide access to the HVAC equipment. Please check back here (the core web server is in my office and will remain operational) for more information tomorrow morning.
Please don't rush into starting up office, classroom, or lab machines unless I've said here that things are back up and reliable.
I apologize for this unforeseeable and extremely annoying outage. Please be assured that I am doing everything that I can to bring our systems back on line in a timely but safe manner.
Sun Apr 20 12:04:50 PDT 2008
HVAC Failure in Machine Room
The HVAC unit that cools our machine room has failed, causing the temperature to rise dramatically and several of the machines to shut down to protect themselves.
At this point I have shut down all the machines in the machine room, and relocated the web server to my office. F&M has helped out by getting some powerful fans to cool the room down, but it looks like we won't get the HVAC unit back on line until sometime tomorrow.
I am letting the fans run with the systems off to cool them down as much as possible. I am also verifying the home directory volume to ensure that no serious damage has occurred.
Once it seems like things have cooled down to a reasonable level, I will bring some of the more important servers back on line (e.g., mail server, file server, ponder). Full services will not be restored until we have air conditioning in the room again.
Stay tuned here for more updates.
Fri Apr 4 14:57:51 PDT 2008
Version 3.0 of hmcposter LaTeX Document Class Released
Version 3.0 of the
hmcposter LaTeX document
class has been released.
This version of the class supports the creation of posters for Clinic projects and for thesis (and other classes) using a single document class.
More information about the class and
how to use it is available on
the hmcposter class
page. We also have information about
the printing process and about
creating good posters.
The participants' resource page in the Clinic website (Mudd only) and the thesis tools page have been updated with new information about this poster class.
As usual, while we've tested the class, there may still be some problems that we missed. Please report them to us so that we can fix them as soon as possible.