I have added an 80 GB disk to the head node of the Amber
cluster, and mounted that disk space on the other machines in the
cluster. Jobs that generate large amounts of output that must be
accessible from multiple machines should write to /scratch
rather than to your home directory. (Jobs that just produce a lot of
output while running, but return some smaller result, can spool
data to /tmp.)
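As a rough illustration of the second pattern, here is a minimal Python sketch that spools bulky intermediate output to a private directory under /tmp and keeps only a small result file. The function name `run_job`, its parameters, the file names, and the toy computation are all hypothetical; in particular, writing the final result to /scratch is just one reasonable choice, not a cluster requirement.

```python
import os
import shutil
import tempfile


def run_job(result_dir="/scratch", tmp_dir="/tmp", n_steps=1000):
    """Spool bulky per-step output locally; keep only a small result.

    All names and defaults here are illustrative assumptions.
    """
    # Private, node-local working directory for the intermediate output.
    work_dir = tempfile.mkdtemp(prefix="job-", dir=tmp_dir)
    try:
        spool_path = os.path.join(work_dir, "steps.log")
        total = 0
        with open(spool_path, "w") as spool:
            for i in range(n_steps):
                total += i * i  # stand-in for the real computation
                spool.write(f"step {i}: running total {total}\n")

        # Only the small final result leaves the node-local disk.
        os.makedirs(result_dir, exist_ok=True)
        result_path = os.path.join(result_dir, "result.txt")
        with open(result_path, "w") as out:
            out.write(f"{total}\n")
        return result_path
    finally:
        # Clean up the spool so /tmp does not fill up over time.
        shutil.rmtree(work_dir, ignore_errors=True)
```

Calling `run_job(result_dir="/scratch/some-subdirectory")` would leave a single small file in the shared area, while the per-step log never leaves the node's local disk. The subdirectory name is a placeholder.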
The /scratch partition includes a copy of the
/hrothgar directory from the old cluster, which has
old user files. For users with the same username on the
department's Linux cluster as on the old Beowulf system, I have
fixed ownership so that you can access your files. If you had an
account with a different username and can't access your files,
please send me e-mail and I'll make the necessary changes.
I have converted and updated the old Beowulf cluster documentation to match the mathematics and computer-science departments' current cluster, Amber.
I have not yet been able to test and verify all of the example code provided on the old site, but the code that does work properly is included, along with links to upstream documentation.
Please note the new policies and account-request process. Amber accounts require a standard math-cluster account (as Amber nodes are, basically, standard math cluster nodes with some additional software).
If you have comments or questions, please send them to
beowulf at math.hmc.edu.
The Amber cluster has been successfully moved into its new home. Tim and I will probably be doing some additional shuffling around over the next few weeks or months, but we should be able either to keep those disruptions short enough to be unnoticeable or to announce them in advance.
There may still be some issues that users might notice that I'm
not seeing; if you have any issues, please report them to
system@math.hmc.edu.
Thanks for your patience and cooperation!
The mathematics and computer-science departments' Beowulf cluster, Amber, is going to be moving from the mathematics department's machine room to the much more commodious CS machine room.
We will be moving the cluster sometime tomorrow, Wednesday, 2005 November 2.
If all goes perfectly, the cluster move will be simple and quick. If things get a bit more complicated, we will have to disassemble and reassemble the cluster. That means disconnecting sixteen computers (power & Ethernet), moving them in groups of three or four, and then reconnecting everything in the new location, which will take at least an hour, maybe longer.
To make the process as easy as possible, we're asking that anyone who is actively using the cluster stop their work by 10:00 AM on Wednesday. We will post here when the cluster is back up.
(People who are authorized to use the Amber cluster have already received e-mail messages at their math addresses with this information, and will also receive a message when the cluster is running again.)
The Amber cluster has sixteen Dell PowerEdge 400SC nodes, each with a 2.8 GHz Pentium 4 processor and 1 or 1.25 GiB of RAM. The nodes communicate over a gigabit Ethernet switch. The cluster is running CentOS 3 with various additional cluster-related software packages (notably LAM/MPI). Use of the cluster is limited to faculty, students, and staff of the colleges who are doing computationally intensive research, especially research that requires or can take advantage of parallel-computing techniques.
Amber cluster nodes were purchased with funds from several CS faculty members. Systems integration and support are provided by the mathematics department.
Late last week F&M was able to take a look at our machine-room air conditioning. It turned out that there was a loose wire in the thermostat that was periodically breaking contact and resetting the system. (My guess is that as the room cooled down, the wire contracted and broke contact. Once the room warmed up again, the wire expanded and the system worked again.)
Whatever the exact details were, the air conditioning is now running again, and I have restarted the Amber cluster. Please let me know if you have any issues with the cluster.
In related news, I have swapped out the thermally compromised drive from our backup array with a new drive I'd purchased for that purpose a few months ago. The array is now working as expected, as is our disk-based backup system.
The air conditioning unit for our machine room is continuing to have problems. I have entered the room twice and found the controller flashing OFF. The buttons on the controller don't seem to work, and I have to power off the whole system before the controller responds again and the air conditioner runs.
I have reported the problem to F&M, but until they can fix it, I will have to keep the Amber cluster offline.
Sorry for any inconvenience.
The chiller is back up, which means that we have air conditioning in the machine room again, so I've restarted the Amber cluster.
Sorry for the disruption in service, but some things are out of my hands. The servers have to come first.
Tom Shaffer, the college's plant engineer, has informed us that the separate air conditioning system that supplies cooling for various labs and computer machine rooms is offline. As a result, I have taken down the Amber cluster until air conditioning is restored.
I may have to take some additional servers offline in the near future, but we'll keep our fingers crossed that it won't come to that.
Some tiny number of you may have noticed that the Amber cluster was offline for a couple of hours. During that time, the cluster was completely disassembled, stacked up in the hallway, and then moved to its new (temporary) location in the department's small machine room.
The cluster was moved because, with the summer heat, my office was running around 85 to 90° Fahrenheit, which isn't healthy for people or computers. This move is temporary because the department's machine room is now completely filled up with machines, leaving little room for humans to move around and do any maintenance.
The plan is still for the cluster to move to the CS machine room by the end of the summer. It will remain on the mathematics department subnet, and will continue to be available to people who are in the amber group.
The old cluster will be retired; at this point I'm thinking that we will probably maintain the head node in some form so that people who haven't already done so can retrieve their data, but the rest of the machines will probably be stripped or scrapped outright. If you're in the market for a Pentium II machine, let me know and we might be able to hook you up.