Technical Staff Blog


At about 1am this morning, disk hardware providing a backing store for our VMware machines went offline. This took nearly every one of our production servers offline, along with all of the hosting class machines provisioned for users and research groups. CIS is investigating the issue and we will post updates as we learn more.

07:20 - The Dell hardware providing disk storage for our virtual machines appears to have experienced some sort of controller failure. This hardware has multiple, redundant controllers, so it appears as though many pieces of hardware may have been affected. CIS is reaching out to Facilities to see whether they noticed any power fluctuations or other issues that could explain the multiple failures.

08:00 - Dell was scheduled to be onsite at 8am, but their arrival has been pushed back to 9am.

09:00 - Dell's analysis of the logs appears to indicate that the controllers on the disk arrays took themselves offline as a result of a vulnerability scan performed by CIS. According to Dell, this is a "feature": when the controllers think they are under attack, they all take themselves offline. Dell is onsite and working to re-enable the controllers and ensure the disks are in working order.

09:15 - Dell has blessed the arrays. Since the operating system disks were forcefully disconnected from the virtual machines for so long, every single one of our virtual machines must now be rebooted. This is some 200 machines! We are going to focus on Tstaff managed services first and then make our way to user managed VMs.

11:00 - Core services (authentication, authorization, DNS, LDAP, email, file system, web) should be operational again. We are turning our focus to the remainder of the Tstaff services.

13:00 - We believe that all of the Tstaff managed services are back and operational. We checked services as we brought servers back online. The CS workstations appear to have weathered the storm just fine. If you are sitting on a Windows machine, you may well need to reboot before it will allow you to log in. Now we are starting to look at user managed VMs. Unfortunately, every single VM will need to be rebooted.

16:00 - We believe that every personally managed VM has now been rebooted. If you are experiencing any weird behavior with your VM, please email and we can take a look to see whether there appear to be any system issues.