
Fix race conditions with conn.in_flight_requests #1757

Conversation

isamaru (Contributor) commented Mar 21, 2019

Resolves #1744.

Based on the previous version and the investigations in #1746, which can be discarded.



@@ -612,7 +612,7 @@ def _poll(self, timeout):
             conn = key.data
             processed.add(conn)

-            if not conn.in_flight_requests:
+            if not conn.has_in_flight_requests():
isamaru (Contributor Author) commented:

The race condition that causes the "Protocol out of sync" error (the root cause) happens on this condition.
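For context, here is a minimal sketch (a reconstruction for illustration, not the PR's exact code) of what a locked accessor like has_in_flight_requests() amounts to: the dict is checked while holding the new _ifr_lock instead of being read unsynchronized.

import threading

class Conn(object):
    def __init__(self):
        self._ifr_lock = threading.Lock()
        self.in_flight_requests = {}   # correlation_id -> in-flight entry

    def has_in_flight_requests(self):
        # Check under the lock so a concurrent send/close cannot mutate
        # in_flight_requests mid-check.
        with self._ifr_lock:
            return bool(self.in_flight_requests)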

dpkp (Owner) commented Mar 21, 2019 via email

isamaru (Contributor Author) commented Mar 21, 2019

I wonder if we should just bite the bullet and make the entire class threadsafe?

That is the end goal, but it also looks like a big project. Just putting RLocks everywhere is probably not going to cut it 😆.
In particular, there's quite a lot of future resolving; I'd say those callbacks would need to be moved outside the locked sections.
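As an illustration of that point (a hypothetical sketch, not code from this PR), the usual pattern is to collect completed futures while holding the lock and fire their callbacks only after releasing it, so user callbacks cannot re-enter locked code or hold other threads up:

import threading

class SimpleFuture(object):
    # Hypothetical stand-in for kafka.future.Future.
    def __init__(self):
        self._callbacks = []

    def add_callback(self, fn):
        self._callbacks.append(fn)

    def success(self, value):
        for fn in self._callbacks:
            fn(value)

class Connection(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = []   # list of (request, future) pairs

    def process_responses(self, responses):
        completed = []
        with self._lock:
            # Only mutate shared state while holding the lock ...
            for response in responses:
                _, future = self._in_flight.pop(0)
                completed.append((future, response))
        # ... and resolve futures (running user callbacks) after releasing it.
        for future, response in completed:
            future.success(response)

# Usage sketch:
conn = Connection()
f = SimpleFuture()
f.add_callback(lambda value: print('handled', value))
conn._in_flight.append(('request', f))
conn.process_responses(['response'])   # prints: handled response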

isamaru (Contributor Author) commented Mar 21, 2019

I missed one more race condition around IFR, which just popped up in the logs; it's an easy addition.

jeffwidman changed the title from "kafka-python #1744 Fix race conditions with conn.in_flight_requests" to "Fix race conditions with conn.in_flight_requests" on Mar 21, 2019
dpkp (Owner) commented Mar 22, 2019

I believe there is still a race here that could lead to protocol out of sync:

Thread A (with conn._lock) -> _protocol.send_request(request)
Thread B (with client._lock) -> conn.send_pending_requests(), including network I/O write
Broker -> processes request, sends response
Thread B (with client._lock) -> receives selector.EVENT_READ
Thread B (with conn._ifr_lock) -> checks conn.has_in_flight_requests()

Notice that Thread A has not yet placed an entry in its in_flight_requests dict, so Thread B will think this is a protocol out-of-sync error and close the socket.

Granted, this timing is extreme and seems highly unlikely. And, even if this did happen, your changes should prevent the "hanging" problem caused by the race between disconnect + ifr queue.

Nonetheless, I wonder if we should be reusing the existing conn._lock here? I think that would synchronize the protocol buffer, network I/O, and ifr tracking.
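A rough sketch of that suggestion (illustrative only, not code from the PR, with a stubbed-out protocol object): if send encodes the request into the protocol buffer and records the in-flight entry under the same conn._lock, a reader thread can never observe queued bytes without a matching IFR entry.

import itertools
import threading

class FakeProtocol(object):
    # Hypothetical stand-in for the real protocol/send-buffer object.
    def __init__(self):
        self._ids = itertools.count()
        self.buffer = b''

    def send_request(self, request):
        self.buffer += request        # pretend to encode into the send buffer
        return next(self._ids)        # correlation id

class Conn(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._protocol = FakeProtocol()
        self.in_flight_requests = {}  # correlation_id -> future

    def send(self, request, future):
        with self._lock:
            # Both steps are atomic with respect to other threads:
            correlation_id = self._protocol.send_request(request)
            self.in_flight_requests[correlation_id] = future
        return future

    def has_in_flight_requests(self):
        with self._lock:
            return bool(self.in_flight_requests)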

isamaru (Contributor Author) commented Mar 25, 2019

@dpkp
Good catch; I am actually able to reproduce this one too.
Yes, the solution is to extend self._lock until in_flight_requests has been processed.

I don't think it would work to use self._lock everywhere, particularly in close, since it looks like it can get called from inside sections which are already locked (and I'd like to avoid a more expensive reentrant lock).

dpkp (Owner) left a comment

What do you think about using the existing _lock instead of adding a new one (_ifr_lock) ? The current lock is intended to synchronize access to the protocol buffer, which itself must be synchronized with the IFR data structure. I'm also slightly concerned that having two locks here may lead to some other deadlock scenario where thread A has _lock and wants _ifr_lock, while thread B has _ifr_lock and wants _lock...
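For illustration only (not code from this PR), the deadlock shape being described is the classic lock-ordering problem: if one path takes _lock then _ifr_lock while another takes them in the opposite order, both threads can block forever.

import threading

_lock = threading.Lock()
_ifr_lock = threading.Lock()

def thread_a():
    with _lock:           # A holds _lock ...
        with _ifr_lock:   # ... and waits for _ifr_lock
            pass

def thread_b():
    with _ifr_lock:       # B holds _ifr_lock ...
        with _lock:       # ... and waits for _lock -> potential deadlock
            pass

# With the wrong interleaving, starting both threads never finishes:
# threading.Thread(target=thread_a).start()
# threading.Thread(target=thread_b).start()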

kafka/conn.py Outdated
        # If requests are pending, we should close the socket and
        # fail all the pending request futures
        if self.in_flight_requests:
            self.close(Errors.KafkaConnectionError('Socket not connected during recv with in-flight-requests'))
dpkp (Owner) commented:

close also acquires the ifr_lock, so this is going to fail unless the lock is reentrant
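A minimal sketch of that failure mode (hypothetical names, not the PR's code): if recv already holds _ifr_lock and then calls close, which also acquires _ifr_lock, a plain threading.Lock blocks the thread against itself, whereas a threading.RLock would let the same thread re-acquire it.

import threading

class Conn(object):
    def __init__(self):
        self._ifr_lock = threading.Lock()   # swap for threading.RLock() to avoid the hang
        self.in_flight_requests = {}

    def close(self, error=None):
        with self._ifr_lock:                # acquired a second time by the same thread
            self.in_flight_requests.clear()

    def recv(self):
        with self._ifr_lock:
            if self.in_flight_requests:
                self.close()                # deadlocks here with a non-reentrant Lock

# c = Conn(); c.in_flight_requests[0] = 'future'
# c.recv()   # hangs with Lock, completes with RLock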

isamaru (Contributor Author) commented:

Ah, not another one :(
Thanks!

isamaru (Contributor Author) commented Mar 27, 2019

I hear you, but I am worried about performance if we use RLocks.

In older Pythons it would be significantly slower, and we have components still stuck on 2.7:
https://stackoverflow.com/a/1977542
https://stackoverflow.com/a/5441992

I'll try from scratch and see if I manage to rewrite this using just a non-reentrant _lock, but it will have to include some more substantial changes.
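For what it's worth, the overhead is easy to measure on the interpreter in question with a small micro-benchmark like the one below (illustrative, not from this thread). On modern CPython 3 the gap between Lock and RLock is small; on 2.7, where RLock is implemented in pure Python, it is noticeably larger.

import threading
import timeit

lock = threading.Lock()
rlock = threading.RLock()

def with_lock():
    with lock:
        pass

def with_rlock():
    with rlock:
        pass

N = 1000000
print('Lock  x%d: %.3fs' % (N, timeit.timeit(with_lock, number=N)))
print('RLock x%d: %.3fs' % (N, timeit.timeit(with_rlock, number=N)))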

dpkp (Owner) commented Apr 1, 2019

This was great work! But I'm going to close in favor of the other PRs because I think we have a good path forward there.

dpkp closed this Apr 1, 2019
jeffwidman (Contributor) commented Apr 1, 2019

I am worried about performance if we use RLocks. In older Pythons it would be significantly slower, and we have components still stuck to 2.7

Given how late we are in the python 2.7 lifecycle, I personally think we should be more concerned about getting the semantics correct and not worry too much about the python 2.7 performance.

The momentum to EOL python 2.7 seems to have really picked up over the last 18 months... We've been experiencing this at my day job; it seems like every other week a new third-party open source library announces that it will quit supporting python 2.7 in 2020.

So I suspect what will happen is that companies/application owners will be faced with a choice to either migrate to python 3 or simply stop upgrading all their external libraries. And if they choose to stop upgrading, then whether kafka-python is using normal locks or RLocks won't matter, because they'll still be on an old version.

Again, this isn't quite where we are today, but it's coming very, very quickly, so if we go for what's semantically correct, it will make long-term maintenance much easier.
