Compute Cluster Instability - Resolved?
- Posted by Mark Dieterich
- on July 9, 2015
Since the unplanned network outage on July 2nd, our compute grid has been unstable. We've been scouring grid logs, examining network switches, and generally pulling our hair out. The issue could be reproduced simply by restarting the grid master service. We may finally have located the cause: corruption in the grid's underlying configuration files. It appears we somehow ended up with some crufty configuration data in those files that was not reflected in the grid's management interface! We have removed this crufty data, and it appears that the grid master service can now be restarted cleanly. We are cautiously optimistic that this was the cause of the recent instabilities. Tstaff is continuing to monitor the grid, but if you notice any issues we have missed, please send an email to problem@cs.brown.edu with details.