yes I’ve noticed this one, the reporter also sent me some private emails without any config unfortunately, but I figured what happens in the code. He very likely has a “track” directive in his config and a state change is propagated over multiple servers, resulting in thread_isolate()/thread_release() to be called recursively and thus breaking the rendez-vous point.
I’m seeing two possible fixes for this : the easy but ugly one being to use two different code paths depending on thread_isolated() to decide whether or not to isolate/release. The clean one consisting in implementing a thread-local counter for the isolation. I actually think this one would be clean because we don’t need to be that careful anymore, we’d naturally support recursivity.
While analysing this, I just found (and fixed) an inconsistency caused by the switch back to synchronous changes : the server updates were applied after the propagation and not before. This was made to simplify the code apparently. Now it causes the tracking servers to see incomplete updates and to remain down while the tracked server is up. I’ve moved this update back to its place before the loop.
I’m now checking if I can reliably reproduce the deadlock to be certain I fix it correctly.
In fact it’s not exactly this since there doesn’t seem to be any recursive call while the thread is isolated inside the srv_* code. However what I’m seeing is that we may call pendconn_redistribute() which grabs the server’s lock, and this code is called with the server’s lock already held. This could be the problem. I need to check if it makes sense to take this lock there now.
Francis, if you have a simple config showing the bug in a reproducible manner, I would really appreciate it. I have some ideas about how to deal with it but all of them are quite complex and before spending time on one of them I would like to be sure I’m not missing a part of the problem.
Thanks!