Technical Staff Blog

Category archives: Service Instability

Tstaff announcements about service instabilities



A virtual machine server crashed this morning at around 5am, affecting 49 CS Department servers, including list, web, and other important services.  The technical staff spent much of today locating and fixing problems that arose from this event.  All or most services have been restored, but problem mail was interrupted mid-day today, and remains broken ...


During the final weeks of the semester, the technical staff encountered issues with routine maintenance of our GPFS filesystem.  IBM suggested that there was corruption in the filesystem, and recommended that we take it offline to repair it.  We postponed that work until after commencement, to avoid disrupting the final weeks of classes, a conference ...


Emergency Linux Workstation Reboot

Tstaff will be rebooting several department-managed Linux workstations tomorrow, Wednesday morning at 5am. This is part of an ongoing effort to track down some recent filesystem-related problems.

Workstations to be rebooted include all machines in the Sunlab and some other undergrad spaces, in addition to a few faculty and grad student machines. The complete list ...


Reports came in yesterday and today of messages being delayed for many hours. Some were delayed specifically because of account name changes, an issue that was resolved early yesterday. Others were delayed by an after-hours service outage, which has also been resolved.

Email services have been returned to normal and all delayed ...


This is an after-the-fact report; the issue has been resolved. Certain parts of the 4th and 5th floors of CIT experienced a brief network outage on Thursday, October 4th, sometime between 2pm and 3pm. TStaff immediately contacted CIS Networking, who dispatched Comm Ops. It appeared to be a bad uplink on one ...


In hopes of fixing the ongoing FastX issues permanently, we are going to do some reconfiguration of the virtual hosts in the FastX cluster that will require all of them to be shut down. We will take the cluster offline at 8am tomorrow morning (Thursday the 27th). While we don't know how long this ...


FastX continues to be unreliable for most users. Even after a cold reboot of the entire cluster last Friday, most connections are getting rejected, seemingly nondeterministically and accompanied by any one of a few recurring error messages.

We still don't know the underlying cause and have contacted StarNet to see if they have any ...


Users have reported an unusual volume and variety of FastX problems over the last couple of days. So far we have been unable to determine the precise cause, or narrow the problems down to any particular host in the cluster. Therefore, as part of our ongoing attempt to diagnose and fix the problems, we are ...


Late Friday evening into Saturday, CIS experienced a failure of one of the VMware servers in the stack they provide to us. A number of our internal services depend on the virtual machines hosted by that server, which resulted in intermittent issues throughout the department (website, dhcp, vpn, ssh, and others).

CIS has temporarily migrated ...