Technical Staff Blog

Two GPFS NFS servers failed simultaneously at around 1pm today, making file services unavailable for about 40 minutes.  The servers, crows and runts, serve different cluster groups: crows serves the grid, and runts serves the internal department network.  Runts failed with a kernel lock-up, which prevented the normal automatic failover.  The technical staff had to physically power it down before the system could begin to recover.  We have collected system data and are opening a service ticket with IBM.