As you have no doubt heard by now, the CIT experienced a building-wide power outage. This is the first time since we deployed our UPS that we have lost power for longer than the batteries could hold us, so the entire machine room was powered down. As of 12:23pm, we believe most critical services are back online, with the exception of our virtual and grid machines. This should mean it's safe to boot Linux, Windows, and Mac desktops. We are working our way through checks to ensure everything is working properly. We will document ongoing issues in this post.
(12:29): The VPN server is up, but experiencing some sort of network routing issue. We are investigating.
(12:39): The backend servers for the db.cs.brown.edu PostgreSQL cluster are experiencing hardware problems. It may be a while before this service is back up.
(12:43): The VPN service is back and operational.
(12:44): The list service appears to be down. We are investigating.
(13:03): CIS is reporting that a number of network edge switches, those that desktops connect to directly, are not powering back up. They are looking into this issue and we will let you know when we have a better idea of the areas that are affected.
(13:09): We are starting to boot up virtual machines. Some have come back up just fine, others will require manual disk checks.
(13:13): Much of the Sunlab is without power, and the machines that do have power have no networking because the network closet has no power. CIS is aware of the switch problems and Facilities is aware of the power issues. Most of the MSLab is operational.
(13:26): Astaff members are walking the building looking for rooms without power. If you are missing power in your office, please let an Astaff member know. Thanks Astaff!
(14:03): Two GPFS CIFS nodes, used by Windows and Mac clients for SMB connections, were in a bad state. They have been restored.
(14:26): CIS has two switches that appear to have died during the power outage/spike. One switch affects the third floor machines on the northeast side of the building and the other affects the Sunlab. CIS is working on getting replacement switches installed. When they install the hardware, they will need to reboot the other switches linked to these dead switches, which will cause about a five-minute network blip for anyone connected directly to those switches.
(14:44): Most of the Xen virtual machines should be back up and on the network again. Please email email@example.com if you are an administrator of a Xen machine and are still having issues.
(15:30): The hardware that drives our homegrown information signs died during the power event, so we are accelerating plans we already had to bring a new signage system online next week.
(15:47): The grid scheduler nodes and 65 of the compute servers are back up. We are working our way through the rest of the machines to determine why they are not on the network.
(16:37): We discovered that the time was way off on some of our GPFS servers. It was so far off, in fact, that NTP refused to sync! This could mean that timestamps on files may be a bit funky for the next hour. We are still in the process of syncing the remaining machines; this should be complete in the next 15 minutes.
(17:40): We were alerted that the zephyr service was not running. It should be back and operational now.
(18:08): We just resolved an issue with printing to bw4 and bw5 through our authenticated print server. This affected anyone printing from wireless or from outside the department.
(18:57): We finally managed to recover the backend hardware for the db.cs.brown.edu database cluster.
(21:21): The ownCloud file syncing service was having some issues with the backend web server. These have been resolved, so syncing should be working as expected again.
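A note on the 16:37 entry, for anyone curious why NTP would refuse to sync: the reference ntpd daemon has a "panic threshold" (1000 seconds by default) and will not correct a clock whose offset exceeds it unless told to, for example with the -g option. The sketch below only illustrates that check. The function name and the sample values are hypothetical, and we are assuming the default threshold rather than quoting the configuration on our GPFS servers.

```python
# Illustrative sketch only: models ntpd's panic-threshold check.
# Assumption: 1000 s is ntpd's documented default; the real servers
# may have been configured differently.
PANIC_THRESHOLD_SECONDS = 1000.0


def ntp_would_refuse(local_time, reference_time,
                     threshold=PANIC_THRESHOLD_SECONDS):
    """Return True if the clock offset is too large for a normal sync.

    Both times are seconds since the epoch (hypothetical inputs).
    """
    offset = abs(local_time - reference_time)
    return offset > threshold


if __name__ == "__main__":
    # A clock that is 5000 s behind would be refused; one 30 s off would not.
    print(ntp_would_refuse(0.0, 5000.0))
    print(ntp_would_refuse(100.0, 130.0))
```

In a case like this, the clock has to be stepped once to roughly the correct time before normal syncing can take over.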