GPFS Cluster Instability
- Posted by Phirum Peang
- on July 16, 2015
We had a GPFS file system failure on Wednesday, July 15, from 12am to 9:30am. The GPFS cluster entered an unhealthy state at 12am when one of its nodes went into an unknown state. Around 3:15am the CIFS cluster went down. While we were debugging the GPFS file system outage, the primary firewall developed a problem around 8:45am: it got into a bad state and its partner could not take over properly, so manual intervention was required to fix it. Once the firewall was working properly, we were able to bring the GPFS cluster back to normal around 9:30am. After the GPFS file system was recovered, we worked through restoring departmental services. We are still investigating the root cause of the GPFS file system failure and the firewall failover problem.