Technical Staff Blog

[photo: ibm-gpfs.jpg] The GPFS hardware is scheduled to be moved starting May 23rd @ 06:30am. Please check this page on the day of the move for periodic status updates...

... and so it begins...

05:45: Prep work for the move has begun. We are working to ensure that the most critical services remain active during the file system outage. Part of this involves disabling any services that rely on the file system but live alongside critical services on the same servers. This allows us to unmount all NFS directories from these servers.
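
For the curious, that unmount step boils down to finding every NFS mount on a host and unmounting it once nothing is using it. Here is a minimal sketch of the idea in Python (an illustration, not our actual tooling; the --unmount flag is made up for the example):

    #!/usr/bin/env python3
    """List the NFS mounts on this host and (optionally) unmount them."""
    import subprocess
    import sys

    def nfs_mounts():
        """Return the mount points of all NFS/NFS4 mounts listed in /proc/mounts."""
        mounts = []
        with open("/proc/mounts") as f:
            for line in f:
                _, mountpoint, fstype = line.split()[:3]
                if fstype in ("nfs", "nfs4"):
                    mounts.append(mountpoint)
        return mounts

    if __name__ == "__main__":
        dry_run = "--unmount" not in sys.argv
        for mp in nfs_mounts():
            if dry_run:
                print("would unmount", mp)
            else:
                # umount refuses if the mount is still busy, which is the signal
                # that some service on the box still depends on the file system.
                subprocess.run(["umount", mp], check=False)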

[photo: DSC_1123.jpg] 06:05: One of the challenges with moving to the basement data center is that all of our network VLANs, supporting all of our private address spaces, use IDs that are reserved on the type of switches used in the basement. In addition to moving all of the hardware, we need to change all of those VLANs. The non-active firewall has been updated to reflect the changes that will be in effect after the move. One of the final steps for today will be to fail over our firewalls and, once the new configuration is confirmed, update the previously active firewall as well.

06:34: GPFS software is being shut down. Assuming all of our expected services are still active after the software shutdown, the file system will be offline from this point forward until the hardware move is complete.
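
For those unfamiliar with GPFS, the shutdown itself is done with the standard GPFS administration commands. A rough sketch of the sequence, wrapped in Python purely for illustration (the small mm() helper is ours for the example; this is not our actual run book):

    #!/usr/bin/env python3
    """Unmount the GPFS file systems everywhere, then stop the GPFS daemons."""
    import subprocess

    MMFS_BIN = "/usr/lpp/mmfs/bin"  # standard GPFS install location

    def mm(cmd, *args):
        """Run a GPFS administration command and fail loudly if it errors."""
        subprocess.run([f"{MMFS_BIN}/{cmd}", *args], check=True)

    if __name__ == "__main__":
        mm("mmumount", "all", "-a")   # unmount every GPFS file system on every node
        mm("mmshutdown", "-a")        # stop the GPFS daemon cluster-wide
        mm("mmgetstate", "-a")        # print node states to confirm everything is down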

06:52: All pre-flight checks have been completed. Those services that are still up are listed on our File System Outage Page.

06:55: We are starting to shut down hardware in preparation for the moves.

07:56: All of the GPFS NFS and CIFS gateways have been removed from 531 and transported to the basement data center. Work is beginning on moving the first two of our ten GPFS NSD servers, wrigley and trident.

[photo: DSC_1136.JPG] 08:09: The GPFS NFS and CIFS servers have been racked and we are starting the process of re-connecting 85 network connections. The first two NSD servers, wrigley and trident, have been moved. We are heading up to grab the first of the disk hardware.

10:15: We are starting to unrack the next two NSD servers, cotton and cadbury, along with their stack of disks.

10:30: The first stack of disks has been powered up and everything appears to have come back properly: the monitoring software can see all of the hardware and all of the disks appear to have spun back up. The network cabling for the GPFS NFS and CIFS servers is now complete. The NSD servers, wrigley and trident, are now in place in their new rack.

[photo: DSC_1144.JPG] 11:02: The second stack of disks has been powered up and everything appears to have spun up properly. The monitoring software can see all of the hardware and all of the disks appear to have spun back up. The NSD servers, cotton and cadbury, have been installed in their new rack.

11:10: We are taking a quick lunch break, back at it around noon.

11:50: We are back and starting to dismantle the next two NSDs, blackjack and chiclets, and their associated stack of disks.

[photo: DSC_1232.jpg] 14:05: It looks like we may have lost two power supplies in one of the disk trays, which prevents it from powering up. If it's only a power supply problem, then we should be able to swap out one of these power supplies with a power supply from another unit and get it up and running again.

14:13: We swapped out one of the bad power supplies with one from a different array and all of the disks are now visible. We will reach out to IBM and have them deliver two replacements, but in the meantime we are going to continue moving forward with the moves.

15:13: We are continuing to interact with IBM support to arrange for two replacement power supplies. In the meantime, we have now moved two more NSD nodes, dove and mnm, down into the data center. Along with these nodes, we are moving their associated disk arrays.

[photo: DSC_1221.jpg] 15:33: The disk arrays for dove and mnm have powered up successfully. Next up: two more NSD servers, bigred and razzles, and their associated disk arrays.

15:55: IBM is shipping out two replacement power supplies. We won't have them in hand until first thing tomorrow morning, so cross your fingers we don't lose too many more! This is somewhat frustrating, since we contacted IBM support ahead of time, let them know that we were planning this move, and asked that they stage replacement parts in a nearby location for exactly this reason.

Assuming the last batch of disks comes online, we should be able to run just fine without these power supplies.

14:27: While Tstaff members were purging cables in 531, some of the fiber runs interconnecting our two distribution switches were apparently pulled. This effectively broke the entire CS network. We are working to trace these fiber runs. The network must be fully operational again before we continue our migration work; otherwise we can't use the management software to confirm the disks are happy.

[photo: DSC_1250.jpg] 17:44: Thanks to a quick response from the CIS network folks, the network should be back. We are continuing the migration work.

17:48: All of the disks have been moved and the management software reports they are in a happy state! Other than the two lost power supplies, all of the disks appear to have survived. Now we need to boot up the NSD nodes and make sure they can see the disks. This is the point where we will start bringing GPFS back online. NFS and CIFS will come up at the tail end of this process.
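
Once an NSD server is booted and GPFS is running on it, checking that it "can see the disks" amounts to asking GPFS to map every NSD back to a local device and flagging anything it cannot resolve. A hedged sketch of that check (illustrative only, not our actual procedure; "not found" is the remark mmlsnsd -m prints for an unresolvable disk):

    #!/usr/bin/env python3
    """Report any NSDs that GPFS cannot map to a physical device."""
    import subprocess

    MMFS_BIN = "/usr/lpp/mmfs/bin"

    def unresolved_nsds():
        """Run 'mmlsnsd -m' and return the lines where no local device was found."""
        out = subprocess.run([f"{MMFS_BIN}/mmlsnsd", "-m"],
                             capture_output=True, text=True, check=True).stdout
        return [line for line in out.splitlines() if "not found" in line]

    if __name__ == "__main__":
        missing = unresolved_nsds()
        if missing:
            print("NSDs without a visible device:")
            print("\n".join(missing))
        else:
            print("All NSDs map to a device.")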

18:05: We appear to be missing some network hardware, which is preventing us from booting the NSDs. The CIS network folks have been paged.

18:25: CIS located the necessary switch hardware. We are in the process of connecting up the NSD servers.

18:32: All ten of our NSD servers now have network connections. We are mounting the GPFS file systems on the NSDs and management nodes.
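
Bringing the file systems back amounts to starting the GPFS daemons across the cluster, checking that the nodes come up active, and then mounting everything. A rough sketch of that sequence, again wrapped in Python for illustration (the mm() helper is ours for the example; this is not our actual run book):

    #!/usr/bin/env python3
    """Start GPFS across the cluster and mount every file system."""
    import subprocess

    MMFS_BIN = "/usr/lpp/mmfs/bin"

    def mm(cmd, *args):
        """Run a GPFS administration command, raising if it fails."""
        return subprocess.run([f"{MMFS_BIN}/{cmd}", *args],
                              capture_output=True, text=True, check=True).stdout

    if __name__ == "__main__":
        mm("mmstartup", "-a")                # start the GPFS daemon on every node
        print(mm("mmgetstate", "-a"))        # nodes should report "active" before mounting
        mm("mmmount", "all", "-a")           # mount every GPFS file system everywhere
        print(mm("mmlsmount", "all", "-L"))  # list which nodes have each file system mounted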

[photo: DSC_1255.jpg] 18:42: We booted a number of machines before the networking was ready, which appears to have left GPFS in a strange state. The good news is that we have now successfully mounted all of the GPFS file systems on at least one of the nodes, which means the GPFS file system has survived the move and we just need to clean things up at this point.

18:46: All of the NSDs are able to see all of the GPFS file systems. We are working our way through the management nodes. Once the management nodes are happy, we will start booting up the NFS and CIFS nodes.

19:31: At this point, departmental NFS and CIFS servers are operational. We are still working on the servers providing NFS to the DMZ machines. This service requires some network changes on the firewall which we staged but could not test until the GPFS migration was complete. We are now starting to test these changes.

22:08: Wow, debugging routing on firewalls can be challenging. We are now running on an updated firewall, which allowed us to bring up the last two GPFS servers that provide NFS services to the DMZ. The previously active firewall has not yet been updated, because we had to do some work by hand. At this point though, we need some sleep in order to look at the firewall issues with a fresh set of eyes.

22:24: Until the firewall rule issues are resolved, the static website (the one made up of HTML files on the file system) will be read-only. The web server is currently serving a copy of that site. Changes you make to web files will not be lost, but they will also not be served until the remaining issues are resolved. FTP services, which were down today, will also remain unavailable until our firewalls are fully functional.