-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix race conditions with conn.in_flight_requests #1757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race conditions with conn.in_flight_requests #1757
Conversation
@@ -612,7 +612,7 @@ def _poll(self, timeout): | |||
conn = key.data | |||
processed.add(conn) | |||
|
|||
if not conn.in_flight_requests: | |||
if not conn.has_in_flight_requests(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition which causes the "Protocol out of sync" (the root cause) happens on this condition.
I wonder if we should just bite the bullet and make the entire class
threadsafe? It seems likely that we will eventually hit other race
conditions wrt _sock, connect, send, recv, close, etc.
|
That is an end goal, but also looks like a big project. Just putting Rlocks everywhere is probably not going to cut it 😆. |
I missed one more race condition around IFR which just popped up in logs, easy addition. |
I believe there is still a race here that could lead to protocol out of sync: Thread A (with conn._lock) -> _protocol.send_request(request) Notice that Thread A has not yet placed an entry on its in_flight_requests dict, so Thread B will think this is a protocol out-of-sync error and close the socket. Granted, this timing is extreme and seems highly unlikely. And, even if this did happen, your changes should prevent the "hanging" problem caused by the race between disconnect + ifr queue. Nonetheless, I wonder if we should be reusing the existing conn._lock here? I think that would synchronize the protocol buffer, network I/O, and ifr tracking. |
@dpkp I don't think it would work to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using the existing _lock instead of adding a new one (_ifr_lock) ? The current lock is intended to synchronize access to the protocol buffer, which itself must be synchronized with the IFR data structure. I'm also slightly concerned that having two locks here may lead to some other deadlock scenario where thread A has _lock and wants _ifr_lock, while thread B has _ifr_lock and wants _lock...
kafka/conn.py
Outdated
# If requests are pending, we should close the socket and | ||
# fail all the pending request futures | ||
if self.in_flight_requests: | ||
self.close(Errors.KafkaConnectionError('Socket not connected during recv with in-flight-requests')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
close
also acquires the ifr_lock, so this is going to fail unless the lock is reentrant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, not another one :(
Thanks!
I hear you, but I am worried about performance if we use RLocks. In older Pythons it would be significantly slower, and we have components still stuck to 2.7 I'll try from scratch and see if I manage to rewrite this using just a non-reentrant |
This was great work! But I'm going to close in favor of the other PRs because I think we have a good path forward there. |
Given how late we are in the python 2.7 lifecycle, I personally think we should be more concerned about getting the semantics correct and not worry too much about the python 2.7 performance. The momentuum to EOL python 2.7 seems to have really picked up over the last 18 months... We've been experiencing this at my dayjob, seems like every other week a new 3p open source library announces they will quit supporting python 2.7 in 2020. So I suspect what will happen is that companies/application owners will be faced with a choice to either migrate to python 3 or simply stop upgrading all their external libraries. And if they choose to stop upgrading, then whether Again, this isn't quite where we are at today, but it's coming very very quickly, so if we go for what's semantically correct, it will make long-term maintenance much easier. |
Resolves #1744.
Based on previous version and investigations in #1746 which can be discarded.
This change is