Sat Apr 26 23:47:18 PDT 2008

Status Update

Since the air conditioning was restored, we've been seeing a crash on esme at about 2:00 AM each morning. The crashes appeared to be connected to some interaction between the daily backups and our NFS configuration. To check that, I temporarily disabled tape backups on the production servers, and spent all day Friday testing configuration changes on a pair of test machines.

And esme did survive Friday night. But the rack's Ethernet switch did not. Luckily I had a spare, and I went in and replaced the dead switch. I also updated some of the configuration files on gytha, and I'm hoping that we'll see improved stability (although there may be some stuck machines that will require rebooting).

I have one more set of configuration updates to test and then implement, but I'm going to hold off on those until I'm in the office; probably around lunch on Monday. I will also probably try running a backup during the day to see whether it will trigger a crash before we return to running automated backups.

Given the failure of the switch, it's possible that the real problem was the switch struggling to handle the large amount of bandwidth consumed by the backups, rather than anything connected directly to the NFS configuration or the backups themselves (none of that configuration had been changed until last night). And if losing the switch is the only hardware failure from the HVAC failure, we will have gotten off pretty easy. I'll be keeping an eye on things in case there are additional failures.

In the meantime, I appreciate your patience while I work to restabilize the systems, and while I do some additional testing.


Posted by Claire Connelly | Permalink

Wed Apr 23 00:00:25 PDT 2008

Status Update

I'm very pleased to announce that F&M did a great job in getting a replacement motor for our HVAC system, and that system is on line and cooling our machine room down to its usual chilly 68° F.

As a result, I have brought most everything back up. The lab systems (responding to shell.math.hmc.edu), ponder, the mail and web servers, the Amber cluster, and even our new mirror server are up and running.

hex is also running, but there's something preventing SSH connections from working (for me, too). I will look into that on Wednesday morning and will, I hope, have it working properly by the afternoon.

Once again, I apologize for the incredible inconvenience of this outage right during the crunch time at the end of the semester. I am still watching the systems fairly closely, as I have seen some weird (that's a technical term) behavior related to NFS and RPC. I am doing my best to keep things as stable and available as I can.


Posted by Claire Connelly | Permalink

Mon Apr 21 15:24:50 PDT 2008

hmcposter LaTeX Class Version 3.0 on Website

Earlier this month, I announced a new version of the hmcposter class. Unfortunately, I hadn't updated the symlinks to point to the current versions, so the new version didn't appear on the website.

I took advantage of the opportunity to add two shim classes that will catch attempts to use the older classes and tell you what you need to do to switch to the newer version.
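
For the curious, a shim class of this kind can be very small. What follows is only a sketch under stated assumptions, not the shim actually installed on the cluster: the class name oldposter is a hypothetical stand-in for one of the older classes, and falling back to the standard article class is a choice made purely for illustration.

```latex
% oldposter.cls -- hypothetical shim; "oldposter" stands in for an older class name.
\NeedsTeXFormat{LaTeX2e}
\ProvidesClass{oldposter}[2008/04/21 Shim pointing users to hmcposter]

% Stop with an explanatory error as soon as the old class is requested.
\ClassError{oldposter}{%
  This class has been replaced by hmcposter%
}{%
  Change the class name in your documentclass line to hmcposter\MessageBreak
  and see the hmcposter class page for the current options.%
}

% Assumption: fall back to article so that a run which continues past
% the error still has a document class loaded.
\LoadClass{article}
\endinput
```

A document that still asks for the old class would stop at the error, and the help text (shown when you type h at the TeX prompt) says what to change.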

The new version is available from the poster class page and is also installed on the math cluster (in /shared/local/share/texmf/tex/latex/hmcposter).


Posted by Claire Connelly | Permalink

Mon Apr 21 15:11:59 PDT 2008

HVAC Outage Continues; Limited Services Restored

Apparently the manufacturer of our AC system won't be able to get us a replacement motor until May 8. So F&M are looking for an equivalent motor that they can install. That's still going to take until tomorrow.

In the meantime, I have the file, mail, and authentication servers running. I have also brought ponder on line for general use. It's okay to (re)boot faculty, lab, and classroom workstations for use.

The Amber cluster, hex, and our new mirror server (but not yum.math.hmc.edu) will remain offline until we get proper cooling back.

If you need information that is stored on hex or the Amber cluster, please send me e-mail so we can work out some way of getting you access. That won't happen before tomorrow, though, and we may have those systems back on line by then.


Posted by Claire Connelly | Permalink

Mon Apr 21 09:18:15 PDT 2008

Servers Off Line Until HVAC Is Repaired

We needed to move the rack over to allow access to the HVAC unit. I have taken down the mail, authentication, and file servers until the repairs can be made.

All computing services will be unavailable until our HVAC is back on line. Once we have cooling, I will begin restarting machines.

Apologies for the inconvenience.


Posted by Claire Connelly | Permalink

Sun Apr 20 22:08:46 PDT 2008

Partial Recovery; Repairs Tomorrow

After much work, the core servers are working as they did before the failure. I am currently processing the backlog of mail (most of which, of course, is spam), and I have turned mail delivery back on.

I have not turned the IMAP and POP servers on, so getting mail will not be possible until I do. I have also shut down most of the Linux machines and Macs, and I won't be turning those back on until after things are more stable.

Note that tomorrow morning around 8:00 AM, we will have an HVAC engineer working to replace the dead motor in our HVAC unit. We may need to shut systems down in order to move things in the machine room to provide access to the HVAC equipment. Please check back here (the core web server is in my office and will remain operational) for more information tomorrow morning.

Please don't rush into starting up office, classroom, or lab machines unless I've said here that things are back up and reliable.

I apologize for this unforeseeable and extremely annoying outage. Please be assured that I am doing everything that I can to bring our systems back on line in a timely but safe manner.


Posted by Claire Connelly | Permalink

Sun Apr 20 12:04:50 PDT 2008

HVAC Failure in Machine Room

The HVAC unit that cools our machine room has failed, causing the temperature to rise dramatically and several of the machines to shut down to protect themselves.

At this point I have shut down all the machines in the machine room, and relocated the web server to my office. F&M has helped out by getting some powerful fans to cool the room down, but it looks like we won't get the HVAC unit back on line until sometime tomorrow.

I am letting the fans run with the systems off to cool them down as much as possible. I am also verifying the home directory volume to ensure that no serious damage has occurred.

Once it seems like things have cooled down to a reasonable level, I will bring some of the more important servers back on line (e.g., mail server, file server, ponder). Full services will not be restored until we have air conditioning in the room again.

Stay tuned here for more updates.


Posted by Claire Connelly | Permalink

Fri Apr 4 14:57:51 PDT 2008

Version 3.0 of hmcposter LaTeX Document Class Released

Version 3.0 of the hmcposter LaTeX document class has been released.

This version of the class supports the creation of posters for Clinic projects and for thesis (and other classes) using a single document class.

More information about the class and how to use it is available on the hmcposter class page. We also have information about the printing process and about creating good posters.

The participants' resource page in the Clinic website (Mudd only) and the thesis tools page have been updated with new information about this poster class.

As usual, while we've tested the class, there may still be some problems that we missed. Please report them to us so that we can fix them as soon as possible.


Posted by Claire Connelly | Permalink