Mon Sep 22 17:03:41 PDT 2008

Hex Memory Degradation

Our parallel compute server, hex, has had stability problems since the start of the semester: it either fails to boot or crashes after running for some time. The errors pointed to a faulty memory module, and I believe I have isolated the problematic DIMM. I have pulled it and its pairmate from the machine.

Assuming that the machine doesn't crash in the next 24 hours or so, I will have the faulty RAM exchanged and will reinstall the two DIMMs in the machine. In the meantime, hex should be fully functional, but it will have only 28 GB of RAM until the DIMM is replaced.

(I have also ordered a battery-backup module for hex's array controller, which should reduce the risk of a power-related failure causing the loss of data stored on the machine's /scratch partition. With luck, I'll be able to add the battery module at the same time I reinstall the memory.)

As usual, we apologize for any inconvenience. And, as usual, if you notice strange behavior with any of our systems, please report it to system@math.hmc.edu.


Posted by Claire Connelly | Permalink

Mon Sep 22 16:56:23 PDT 2008

Mirror Temporarily Offline

I rebooted the mirror server (mirror.hmc.edu) to apply some important security updates. Because it has been a while since the machine was last rebooted, it is performing a check on each file system, and because most of the mirror file systems are fairly large, the checks are taking a while to complete.

The mirror server will be back up as soon as the file-system checks are complete, probably later this afternoon or in the early evening. My apologies for the inconvenience (the outage is also keeping me from some work I had planned to do).


Posted by Claire Connelly | Permalink