
Fix race conditions with conn.in_flight_requests #1757

Conversation

isamaru (Contributor) commented Mar 21, 2019

Resolves #1744.

Based on the previous version and the investigations in #1746, which can be discarded.



@@ -612,7 +612,7 @@ def _poll(self, timeout):
             conn = key.data
             processed.add(conn)

-            if not conn.in_flight_requests:
+            if not conn.has_in_flight_requests():
isamaru (Contributor Author) commented:

The race condition that causes the "Protocol out of sync" error (the root cause) happens on this condition.
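For context, here is a minimal sketch (a reconstruction for illustration, not the PR's exact code) of what a locked accessor like has_in_flight_requests() amounts to: the dict is checked while holding the new _ifr_lock instead of being read unsynchronized.

import threading

class Conn(object):
    def __init__(self):
        self._ifr_lock = threading.Lock()
        self.in_flight_requests = {}   # correlation_id -> in-flight entry

    def has_in_flight_requests(self):
        # Check under the lock so a concurrent send/close cannot mutate
        # in_flight_requests mid-check.
        with self._ifr_lock:
            return bool(self.in_flight_requests)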

dpkp (Owner) commented Mar 21, 2019 via email

isamaru (Contributor Author) commented Mar 21, 2019

I wonder if we should just bite the bullet and make the entire class threadsafe?

That is the end goal, but it also looks like a big project. Just putting RLocks everywhere is probably not going to cut it 😆.
In particular, there's quite a lot of future resolving; I'd say those callbacks would need to be moved outside the locked sections.
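As an illustration of that point (a hypothetical sketch, not code from this PR), the usual pattern is to collect completed futures while holding the lock and fire their callbacks only after releasing it, so user callbacks cannot re-enter locked code or hold other threads up:

import threading

class SimpleFuture(object):
    # Hypothetical stand-in for kafka.future.Future.
    def __init__(self):
        self._callbacks = []

    def add_callback(self, fn):
        self._callbacks.append(fn)

    def success(self, value):
        for fn in self._callbacks:
            fn(value)

class Connection(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = []   # list of (request, future) pairs

    def process_responses(self, responses):
        completed = []
        with self._lock:
            # Only mutate shared state while holding the lock ...
            for response in responses:
                _, future = self._in_flight.pop(0)
                completed.append((future, response))
        # ... and resolve futures (running user callbacks) after releasing it.
        for future, response in completed:
            future.success(response)

# Usage sketch:
conn = Connection()
f = SimpleFuture()
f.add_callback(lambda value: print('handled', value))
conn._in_flight.append(('request', f))
conn.process_responses(['response'])   # prints: handled response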

isamaru (Contributor Author) commented Mar 21, 2019

I missed one more race condition around IFR, which just popped up in the logs; it's an easy addition.

jeffwidman changed the title from "kafka-python #1744 Fix race conditions with conn.in_flight_requests" to "Fix race conditions with conn.in_flight_requests" on Mar 21, 2019
dpkp (Owner) commented Mar 22, 2019

I believe there is still a race here that could lead to protocol out of sync:

Thread A (with conn._lock) -> _protocol.send_request(request)
Thread B (with client._lock) -> conn.send_pending_requests(), including network I/O write
Broker -> processes request, sends response
Thread B (with client._lock) -> receives selector.EVENT_READ
Thread B (with conn._ifr_lock) -> checks conn.has_in_flight_requests()

Notice that Thread A has not yet placed an entry in its in_flight_requests dict, so Thread B will think this is a protocol out-of-sync error and close the socket.

Granted, this timing is extreme and seems highly unlikely. And, even if this did happen, your changes should prevent the "hanging" problem caused by the race between disconnect + ifr queue.

Nonetheless, I wonder if we should be reusing the existing conn._lock here? I think that would synchronize the protocol buffer, network I/O, and ifr tracking.
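A rough sketch of that suggestion (illustrative only, not code from the PR, with a stubbed-out protocol object): if send encodes the request into the protocol buffer and records the in-flight entry under the same conn._lock, a reader thread can never observe queued bytes without a matching IFR entry.

import itertools
import threading

class FakeProtocol(object):
    # Hypothetical stand-in for the real protocol/send-buffer object.
    def __init__(self):
        self._ids = itertools.count()
        self.buffer = b''

    def send_request(self, request):
        self.buffer += request        # pretend to encode into the send buffer
        return next(self._ids)        # correlation id

class Conn(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._protocol = FakeProtocol()
        self.in_flight_requests = {}  # correlation_id -> future

    def send(self, request, future):
        with self._lock:
            # Both steps are atomic with respect to other threads:
            correlation_id = self._protocol.send_request(request)
            self.in_flight_requests[correlation_id] = future
        return future

    def has_in_flight_requests(self):
        with self._lock:
            return bool(self.in_flight_requests)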

isamaru (Contributor Author) commented Mar 25, 2019

@dpkp
Good catch; I am actually able to reproduce this one too.
Yes, the solution is to extend self._lock until in_flight_requests has been processed.

I don't think it would work to use self._lock everywhere, particularly in close, since it looks like it can get called from inside sections which are already locked (and I'd like to avoid a more expensive reentrant lock).

dpkp (Owner) left a comment

What do you think about using the existing _lock instead of adding a new one (_ifr_lock) ? The current lock is intended to synchronize access to the protocol buffer, which itself must be synchronized with the IFR data structure. I'm also slightly concerned that having two locks here may lead to some other deadlock scenario where thread A has _lock and wants _ifr_lock, while thread B has _ifr_lock and wants _lock...
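For illustration only (not code from this PR), the deadlock shape being described is the classic lock-ordering problem: if one path takes _lock then _ifr_lock while another takes them in the opposite order, both threads can block forever.

import threading

_lock = threading.Lock()
_ifr_lock = threading.Lock()

def thread_a():
    with _lock:           # A holds _lock ...
        with _ifr_lock:   # ... and waits for _ifr_lock
            pass

def thread_b():
    with _ifr_lock:       # B holds _ifr_lock ...
        with _lock:       # ... and waits for _lock -> potential deadlock
            pass

# With the wrong interleaving, starting both threads never finishes:
# threading.Thread(target=thread_a).start()
# threading.Thread(target=thread_b).start()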

kafka/conn.py Outdated
        # If requests are pending, we should close the socket and
        # fail all the pending request futures
        if self.in_flight_requests:
            self.close(Errors.KafkaConnectionError('Socket not connected during recv with in-flight-requests'))
dpkp (Owner) commented:

close also acquires the ifr_lock, so this is going to fail unless the lock is reentrant
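A minimal sketch of that failure mode (hypothetical names, not the PR's code): if recv already holds _ifr_lock and then calls close, which also acquires _ifr_lock, a plain threading.Lock blocks the thread against itself, whereas a threading.RLock would let the same thread re-acquire it.

import threading

class Conn(object):
    def __init__(self):
        self._ifr_lock = threading.Lock()   # swap for threading.RLock() to avoid the hang
        self.in_flight_requests = {}

    def close(self, error=None):
        with self._ifr_lock:                # acquired a second time by the same thread
            self.in_flight_requests.clear()

    def recv(self):
        with self._ifr_lock:
            if self.in_flight_requests:
                self.close()                # deadlocks here with a non-reentrant Lock

# c = Conn(); c.in_flight_requests[0] = 'future'
# c.recv()   # hangs with Lock, completes with RLock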

isamaru (Contributor Author) commented:

Ah, not another one :(
Thanks!

isamaru (Contributor Author) commented Mar 27, 2019

I hear you, but I am worried about performance if we use RLocks.

In older Pythons it would be significantly slower, and we have components still stuck on 2.7:
https://stackoverflow.com/a/1977542
https://stackoverflow.com/a/5441992

I'll try from scratch and see if I manage to rewrite this using just a non-reentrant _lock, but it will have to include some more substantial changes.
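For what it's worth, the overhead is easy to measure on the interpreter in question with a small micro-benchmark like the one below (illustrative, not from this thread). On modern CPython 3 the gap between Lock and RLock is small; on 2.7, where RLock is implemented in pure Python, it is noticeably larger.

import threading
import timeit

lock = threading.Lock()
rlock = threading.RLock()

def with_lock():
    with lock:
        pass

def with_rlock():
    with rlock:
        pass

N = 1000000
print('Lock  x%d: %.3fs' % (N, timeit.timeit(with_lock, number=N)))
print('RLock x%d: %.3fs' % (N, timeit.timeit(with_rlock, number=N)))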

dpkp (Owner) commented Apr 1, 2019

This was great work! But I'm going to close in favor of the other PRs because I think we have a good path forward there.

dpkp closed this Apr 1, 2019
jeffwidman (Contributor) commented Apr 1, 2019

I am worried about performance if we use RLocks. In older Pythons it would be significantly slower, and we have components still stuck to 2.7

Given how late we are in the python 2.7 lifecycle, I personally think we should be more concerned about getting the semantics correct and not worry too much about the python 2.7 performance.

The momentum to EOL python 2.7 seems to have really picked up over the last 18 months... We've been experiencing this at my day job; it seems like every other week a new third-party open source library announces that it will quit supporting python 2.7 in 2020.

So I suspect what will happen is that companies/application owners will be faced with a choice to either migrate to python 3 or simply stop upgrading all their external libraries. And if they choose to stop upgrading, then whether kafka-python is using normal locks or RLocks won't matter, because they'll still be on an old version.

Again, this isn't quite where we are today, but it's coming very, very quickly, so if we go for what's semantically correct, it will make long-term maintenance much easier.
