Technical Staff Blog

In just under two weeks, Tstaff will be physically moving our GPFS file system hardware from CIT 531 to the datacenter in the basement of the CIT. The purge of the upstairs machine rooms is the culmination of nearly two years of work by Tstaff. We will shut down GPFS at 6:30am to begin the hardware migration, and we hope to have everything moved and the file system back online around 6pm. After GPFS comes back online, Tstaff will work to verify that all services are operational again, which will likely take a few more hours. Please plan on the file system being offline for the entire day.

Tstaff is working on a webpage that details the expected state of every departmental service during the migration. On the day of the move, we plan to communicate our progress as we go. Stay tuned for future announcements on both of these.

There are some risks associated with this move:

  • Many of these disks have been spinning non-stop for more than five years. Hardware that is never shut down can sometimes fail to start again, especially after it has been physically jostled as it will be during its journey to the basement. If we lose too many disks, we could lose an entire GPFS file system.
  • We need to change the network VLAN for every GPFS server moving to the basement, which includes changes to our DHCP server and firewalls as well (see the sketch after this list). While we should be able to recover from any network mishaps, it does add to the complexity of the move.
  • Levels of risk:
    • We have daily offsite tape backups of everything stored in the main GPFS file system (/home, /course, /research).
    • We continuously run offsite tape backups of the data GPFS file system (/data), but because of its size and number of small files, the backups often lag.
    • The /nbu file system is at most risk, because we do not back it up to tape (all is not lost though, see the next section).
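
To make the VLAN change above concrete, here is a minimal sketch of the DHCP side of that work. Everything in it is hypothetical (the subnets, the file names, and the script itself); it only illustrates that each GPFS server's fixed address has to move from the old machine-room subnet to the new basement one.

```python
#!/usr/bin/env python3
# Hypothetical sketch only: the subnets and file names are invented.
# Move each GPFS server's fixed-address entry from the old
# machine-room subnet to the assumed basement subnet.

OLD_SUBNET = "10.1.5."   # assumed VLAN for CIT 531
NEW_SUBNET = "10.1.9."   # assumed VLAN for the basement datacenter

def renumber(line: str) -> str:
    # Only rewrite fixed-address lines; leave the rest of the config alone.
    if "fixed-address" in line:
        return line.replace(OLD_SUBNET, NEW_SUBNET)
    return line

with open("dhcpd.conf") as src, open("dhcpd.conf.new", "w") as dst:
    dst.writelines(renumber(line) for line in src)
```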

To help mitigate these risks:

  • By May 23rd, we should be able to complete a copy of every single file onto a CIS file system.
    • These files will have lost any extended ACLs, which is one of the reasons we aren't simply migrating to a CIS file system right away. They will retain their basic POSIX permissions, though (see the sketch after this list).
    • It has taken us about a month and a half to copy all but 35TB of our files. The remaining two weeks should be just enough time to copy what's left, but that leaves no time for further incremental copies.
  • We are working with IBM to ensure they have spare parts at a dispatch center relatively close by. We have also given them a heads up that this move will be happening and that we may well need technical support should something go wrong.
  • Max purchased a brand new cart, which should help minimize any jarring during the move. He is personally going to handle every single piece of server and disk tray hardware.
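
As a rough illustration of that copy's main limitation, here is a minimal Python sketch (paths invented) of a file-level tree copy in the same spirit: shutil.copy2 preserves POSIX mode bits and timestamps, but GPFS's extended ACLs generally do not survive a plain file-level copy, which is exactly the caveat noted above.

```python
#!/usr/bin/env python3
# Hypothetical sketch only: source and destination paths are invented.
# shutil.copy2 carries POSIX permission bits and timestamps, but
# extended ACLs on the GPFS side generally do not survive a plain
# file-level copy like this one.
import shutil

SRC = "/gpfs/main/home"   # assumed GPFS source tree
DST = "/cis-copy/home"    # assumed CIS-backed destination

shutil.copytree(SRC, DST, copy_function=shutil.copy2,
                dirs_exist_ok=True)  # dirs_exist_ok requires Python 3.8+
```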

If GPFS doesn't come back, we will work with CIS to transition over to using their file server. Unfortunately, this will not be an instantaneous process. We have many machines that expect to mount very particular network shares, and using a CIS server instead of GPFS changes not only the name of the server but also the path to every single share (the sketch below gives a sense of the remapping involved). In this worst-case scenario, we estimate it could be a week or two before Tstaff and CIS have things functional again.
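
Here is a hypothetical sketch of what that remapping might look like for fstab-style mount entries; the server names and volume paths are all invented, and in practice the same change would also have to land in automount maps across many machines.

```python
#!/usr/bin/env python3
# Hypothetical sketch only: every server name and path is invented.
# In the worst case, each share changes both its server and its path.

REMAP = {
    "gpfs:/home":     "cisfiler.example.edu:/vol/cs/home",
    "gpfs:/course":   "cisfiler.example.edu:/vol/cs/course",
    "gpfs:/research": "cisfiler.example.edu:/vol/cs/research",
    "gpfs:/data":     "cisfiler.example.edu:/vol/cs/data",
}

def rewrite_fstab_line(line: str) -> str:
    # Swap the device field of an fstab entry if it names a known share.
    fields = line.split()
    if fields and fields[0] in REMAP:
        fields[0] = REMAP[fields[0]]
        return "\t".join(fields) + "\n"
    return line

with open("/etc/fstab") as f:
    print("".join(rewrite_fstab_line(line) for line in f), end="")
```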

We appreciate your support in this effort and are doing everything we can to minimize the chances of failure.