Mon Nov 17 12:04:33 PST 2008

More on hex

I have pulled out the CPU expansion board from hex and sent it back to the vendor, who is going to try to get a replacement from the manufacturer. In the meantime, hex is running with eight cores (four CPUs) and 16 GB of RAM, and it seems to be stable.

Please let me know about any problems, and check back here for updates.


Posted by Claire Connelly | Permalink | Categories: News, System Maintenance, Amber

Wed Jun 4 15:06:43 PDT 2008

Systems Work: Saturday, June 7, and Sunday, June 8

When

I will be doing some systems work this weekend, June 7-8.

Work will probably begin around 11:00 AM on Saturday, June 7, and will continue for several hours. If necessary, additional work may be done on Sunday, June 8, within a similar block of time.

What Will Be Affected

The work will disrupt most of our networked services, including e-mail, file service, interactive sessions, and the web server for periods of several minutes to an hour over the course of the work.

I also want to make sure that all of our Macs are running the latest security updates, so I will be updating these machines during this time period as well.

What You Should Do

If you're using a Mac or Linux system that mounts file systems from our servers, please do the following before you leave on Friday evening:

  • Save all open files;
  • Close all applications;
  • Log out;
  • Leave your machine running.

Why

This work is necessary for us to ensure the security and improve the stability of the overall system. In particular, I am hoping that ongoing issues with our web server will be resolved as a result of this work.

I will do my best to keep as much of the system functional as possible for as much of the time as I can, but there will still be some outages.

Additional Background

Last semester we had some serious issues with interactions between the NFS support on our new file server and on our workstations and older servers, exacerbated by the HVAC failure. I was able to stabilize things, but we still see some flaky behavior (especially from the web server, which needs to be rebooted periodically).

On the Linux server side, I plan to update to the latest kernel releases and do some experimentation to see if everything will work together happily. I will need to reboot various servers and workstations an arbitrary number of times to explore all the possible interactions.

For Macs, I will install the latest updates, most of which require the machines to be rebooted. As Tiger (Mac OS X 10.4) has problems when an NFS server disappears and reappears, these machines would need to be rebooted anyway.

Comments/Problems/Other Issues

As usual, if you have problems with the scheduling of this work, requests, or any other comments, please let me know.

Updates/Status Reports

As usual, updates on the status of the systems and progress reports will be posted to the "sysblog", on our web server at

http://www.math.hmc.edu/computing/blog/

Thanks for your cooperation!


Posted by Claire Connelly | Permalink | Categories: Mail, News, System Maintenance, Linux, Macintosh, Website, Amber

Fri Mar 17 11:32:25 PST 2006

Scratch Space and Old Beowulf Files Available on Cluster

I have added an 80 GB disk to the head node of the Amber cluster, and mounted that disk space on the other machines in the cluster. Jobs that generate large amounts of output that needs to be accessible from multiple machines should write to /scratch rather than to your home directory. (Jobs that just produce a lot of output while running, but return some smaller result, can spool data to /tmp.)
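
As a rough illustration of the /scratch versus /tmp guidance above, a job script might pick its output directory along these lines. This is only a sketch: the function name, its parameters, and the per-user, per-job directory layout are hypothetical conventions invented for the example, not cluster policy. Only the /scratch and /tmp paths come from this post.

```python
import getpass
import tempfile
from pathlib import Path

SCRATCH = Path("/scratch")  # visible from every node in the cluster
LOCAL_TMP = Path("/tmp")    # local to the node the job runs on


def job_output_dir(job_name, shared, scratch=SCRATCH, local_tmp=LOCAL_TMP, user=None):
    """Create and return a fresh directory for one job's output.

    shared=True  -> output must be readable from other nodes: use /scratch
    shared=False -> spool data only needed while the job runs: use /tmp
    The <base>/<user>/<job_name>-XXXX layout is an assumption for this sketch.
    """
    base = Path(scratch) if shared else Path(local_tmp)
    user_dir = base / (user or getpass.getuser())
    user_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp picks a unique name, so concurrent jobs do not collide
    return Path(tempfile.mkdtemp(prefix=job_name + "-", dir=str(user_dir)))
```

A job whose results are collected from several nodes would call job_output_dir("myjob", shared=True); one that only needs working space while it runs would pass shared=False and copy its small final result elsewhere when it finishes.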

The /scratch partition includes a copy of the /hrothgar directory from the old cluster, which has old user files. For users with the same username on the department's Linux cluster as on the old Beowulf system, I have fixed ownership so that you can access your files. If you had an account with a different username and can't access your files, please send me e-mail and I'll make the necessary changes.


Posted by Claire Connelly | Permalink | Categories: Amber

Mon Jan 23 12:12:38 PST 2006

Amber Cluster Documentation Now Available

I have converted and updated the old Beowulf cluster documentation to match the mathematics and computer-science departments' current cluster, Amber.

I have not yet been able to test and verify all of the example code provided on the old site, but the code that does work properly is included, along with links to upstream documentation.

Please note the new policies and account-request process. Amber accounts require a standard math-cluster account (as Amber nodes are, basically, standard math cluster nodes with some additional software).

If you have comments or questions, please send them to beowulf at math.hmc.edu.


Posted by Claire Connelly | Permalink | Categories: Website, Amber

11.02.2005 14:54

Amber Cluster Move Complete

The Amber cluster has been successfully moved into its new home. Tim and I will probably be doing some additional shuffling around over the next few weeks or months, but we should be able to either make those disruptions short enough to be unnoticeable or announce them in advance.

Users may still notice some issues that I'm not seeing; if you have any issues, please report them to system@math.hmc.edu.

Thanks for your patience and cooperation!


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

11.01.2005 11:19

Amber Cluster Move Scheduled

The mathematics and computer-science departments' Beowulf cluster, Amber, is going to be moving from the mathematics department's machine room to the much more commodious CS machine room.

We will be moving the cluster sometime tomorrow, Wednesday, 2005 November 2.

If all goes perfectly, the cluster move will be simple and quick. If things get a bit more complicated, we will have to disassemble and reassemble the cluster, which means disconnecting sixteen computers (power & Ethernet), moving them in groups of three or four, then reconnecting everything in the new location, which will require at least an hour, maybe longer.

To make the process as easy as possible, we're asking that anyone who is actively using the cluster stop their work by 10:00 AM on Wednesday. We will post here when the cluster is back up.

(People who are authorized to use the Amber cluster have already received e-mail messages at their math addresses with this information, and will also receive a message when the cluster is running again.)

The Amber cluster has sixteen Dell PowerEdge 400SC nodes, each with a 2.8 GHz Pentium 4 processor and 1 or 1.25 GiB of RAM. The nodes communicate over a gigabit Ethernet switch. The cluster is running CentOS 3 with various additional cluster-related software packages (notably LAM/MPI). Use of the cluster is limited to faculty, students, and staff of the colleges who are doing computationally intensive research, especially research that requires or can take advantage of parallel-computing techniques.

Amber cluster nodes were purchased with funds from several CS faculty members. Systems integration and support is provided by the mathematics department.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

10.03.2005 10:13

A/C Working, Cluster Back

Late last week F&M was able to take a look at our machine-room air conditioning. It turned out that there was a loose wire in the thermostat that was periodically breaking contact and resetting the system. (At a guess, it's possible that as the room cooled down, the wire contracted and broke contact. Once the room warmed up again, the wire expanded and the system worked again.)

Whatever the exact details were, the air conditioning is now running again, and I have restarted the Amber cluster. Please let me know if you have any issues with the cluster.

In related news, I have swapped out the thermally compromised drive from our backup array with a new drive I'd purchased for that purpose a few months ago. The array is now working as expected, as is our disk-based backup system.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

09.23.2005 12:13

Air Conditioning Problems Continue

The air conditioning unit for our machine room is continuing to have problems. I have entered the room twice and found the controller flashing OFF. The buttons on the controller don't seem to work, and I have to power off the whole system before the controller responds again and the air conditioner runs.

I have reported the problem to F&M, but until they can fix it, I will have to keep the Amber cluster offline.

Sorry for any inconvenience.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

07.21.2005 10:17

Chiller Back, Amber, Too

The chiller is back up, which means that we have air conditioning in the machine room again, so I've restarted the Amber cluster.

Sorry for the disruption in service, but some things are out of my hands. The servers have to come first.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

07.21.2005 09:27

Amber Cluster Offline Due to Air Conditioning Issues

Tom Shaffer, the college's plant engineer, has informed us that the separate air conditioning system that supplies cooling for various labs and computer machine rooms is offline. As a result, I have taken down the Amber cluster until air conditioning is restored.

I may have to take some additional servers offline in the near future, but we'll keep our fingers crossed that it won't come to that.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber

07.19.2005 17:03

Amber Cluster Move, Part I

Some tiny number of you may have noticed that the Amber cluster was offline for a couple of hours. During that time, the cluster was completely disassembled, stacked up in the hallway, and then moved to its new (temporary) location in the department's small machine room.

The cluster was moved because, with the summer heat, my office was running at around 85 to 90° Fahrenheit, which isn't healthy for people or computers. This move is temporary because the department's machine room is now completely filled up with machines, leaving little room for humans to move around and do any maintenance.

The plan is still for the cluster to move to the CS machine room by the end of the summer. It will remain on the mathematics department subnet, and will continue to be available to people who are in the amber group.

The old cluster will be retired; at this point I'm thinking that we will probably maintain the head node in some form so that people who haven't already done so can retrieve their data, but the rest of the machines will probably be stripped or scrapped outright. If you're in the market for a Pentium II machine, let me know and we might be able to hook you up.


Posted by Claire Connelly | Permalink | Categories: System Maintenance, Amber