Mon Sep 22 17:03:41 PDT 2008
Hex Memory Degradation
Our parallel compute server
hex has been having some
stability issues since the start of the
semester, either not booting or
crashing after running for some time.
The errors pointed to a faulty memory
module. I believe I have isolated
the problematic DIMM, and I have pulled
it and its paired DIMM from the machine.
Assuming that the machine doesn't crash
in the next 24 hours or so, I will have
the faulty DIMM exchanged and will then
reinstall both DIMMs in the machine. In the
meantime, hex should be
fully functional, but will only have 28
GB of RAM until we get the DIMM
replaced.
(I have also ordered a battery-backup
module for hex's array
controller, which should reduce the
risk of a power-related failure causing
the loss of data stored on the
machine's /scratch
partition. With luck, I'll be able to
add the battery module at the same time
I reinstall the memory.)
As usual, we apologize for any
inconvenience. And, as usual, if you
notice strange behavior with any of our
systems, please report it to
system@math.hmc.edu.
Mon Sep 22 16:56:23 PDT 2008
Mirror Temporarily Offline
I rebooted the mirror server (mirror.hmc.edu)
to apply some important security
updates. As it's been a while since the
machine was last rebooted, it's
performing a check on each file system.
And, as most of the mirror file systems
are fairly large, it's taking a while
to churn through them.
The mirror server will be back up as soon as the file-system checks are complete, probably later this afternoon or in the early evening. My apologies for the inconvenience (the outage is also keeping me from some work I had planned to do).