Technical Staff Blog


Because of persistent instability with no consistent timing or other indicators, the Grid Qmaster has been restarted to allow for more thorough testing. There seem to be underlying issues with the new configuration, but it is not clear whether they result from the upgrade to Jessie or from the move to hosting the Qmaster in our VMware environment. Whatever the actual problem is, the result is that the Qmaster's queue lists get emptied of the available host names and have to be reinitialized by hand. Mitigating a possible SYN flood on the Qmaster's daemon port has brought some reprieve, but it has only cut down on the frequency of these crashes.
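
For anyone who wants to spot the emptied-queue symptom without waiting for jobs to stall, here is a minimal sketch in Python. It is not the tooling we use. It assumes the queue configuration is printed in the usual Grid Engine `qconf -sq <queue>` layout, with a `hostlist` attribute that reads NONE when no hosts are assigned and long values wrapped with a trailing backslash. The function names and the sample host names are hypothetical.

```python
from typing import List


def parse_hostlist(qconf_output: str) -> List[str]:
    """Return the entries of the `hostlist` attribute from queue config text.

    Assumes the `qconf -sq <queue>` layout: one "name value" pair per line,
    with long values continued on the next line after a trailing backslash.
    """
    # Undo backslash line continuations so each attribute sits on one line.
    joined = qconf_output.replace("\\\n", " ")
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == "hostlist":
            value = parts[1] if len(parts) > 1 else ""
            # Entries may be separated by spaces or commas.
            entries = value.replace(",", " ").split()
            # Assumption: an empty host list is printed as the word NONE.
            return [] if entries == ["NONE"] else entries
    # No hostlist attribute at all is treated the same as an empty one.
    return []


def queue_needs_reinit(qconf_output: str) -> bool:
    """True when the queue has no hosts and would need to be repopulated."""
    return not parse_hostlist(qconf_output)


if __name__ == "__main__":
    # Hypothetical sample text standing in for real qconf output.
    sample = "qname all.q\nhostlist NONE\nseq_no 0\n"
    print("needs reinit:", queue_needs_reinit(sample))
```

In practice the text would come from running `qconf -sq` for each queue on a host with admin access; the sketch only covers the parsing step so it can be checked against saved output first.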

As always, we thank you for your patience while we are working on resolving this issue. If you have any information relating to these issues or any questions/concerns, please email problem@cs.brown.edu.

1:15am - UPDATE: As it turns out, the crashes are the result of the Qmaster having trouble deleting jobs with the qdel command for specific users. It isn't clear why this process causes the Qmaster to fail.