
Change coordinator lock acquisition order #1821


Merged: 1 commit merged into master from coordinator_lock_order on Sep 30, 2019

Conversation

@dpkp (Owner) commented May 29, 2019

Attempt to address heartbeat lock contention by changing the lock acquisition order from client->coordinator to coordinator->client. This aligns with the java client and (I hope) should help improve concurrency coordination between the heartbeat thread loop and the main consumer loop by allowing us to drop the client lock during future processing.
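A minimal sketch of the ordering rule described here, assuming a single poll step and illustrative names (poll_once, coordinator_lock, client_lock are stand-ins, not kafka-python's actual internals):

```python
import threading

# Illustrative locks only; these are not kafka-python's actual attributes.
coordinator_lock = threading.RLock()   # guards coordinator/group state
client_lock = threading.RLock()        # guards the network client

def poll_once(do_network_io, process_response):
    with coordinator_lock:                 # coordinator lock acquired first
        with client_lock:                  # client lock acquired second, I/O only
            responses = do_network_io()
        # The client lock is released here; futures/callbacks are processed
        # while only the coordinator lock is held, so other threads can use
        # the client in the meantime and the lock order is never inverted.
        for response in responses:
            process_response(response)

# Trivial stand-in usage:
poll_once(lambda: ["heartbeat-response"], print)
```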



@jeffwidman (Contributor) left a comment

This looks good on paper. Never know for sure til we get it out in the wild. 😃

@dpkp (Owner, Author) commented May 30, 2019

The thing to be concerned about here is whether there is a code path that attempts to acquire the coordinator lock while holding the client lock. Unfortunately, I think this is tricky: even though we have deferred processing of the futures that handle connection responses into an unlocked pending_completions list, there are plenty of other futures that are not in this list. Most of those will not be generated while the client lock is held (see git grep future.success), but future.failure() is a very real risk.

An obvious place where this might happen is KafkaClient._maybe_connect, which acquires the client lock and then calls into BrokerConnection.connect. If something goes wrong during connect, we call self.close(), which immediately fails all in-flight request futures. Those errbacks then run with the client lock held, and if one of them attempts to acquire the coordinator lock, for example in _handle_join_failure or _handle_heartbeat_failure, we could get a deadlock.
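As a hypothetical illustration of that scenario (the names below are stand-ins, not kafka-python internals), the following sketch deadlocks because one thread takes the client lock then the coordinator lock while the other thread takes them in the opposite order:

```python
import threading
import time

client_lock = threading.Lock()        # stand-in for the client lock
coordinator_lock = threading.Lock()   # stand-in for the coordinator lock

def main_thread_connect_failure():
    with client_lock:                 # e.g. inside a connect attempt
        time.sleep(0.1)               # connect fails, in-flight futures are failed...
        with coordinator_lock:        # ...and an errback (e.g. a join/heartbeat
            pass                      #    failure handler) wants this lock

def heartbeat_thread_loop():
    with coordinator_lock:            # heartbeat takes the coordinator lock first
        time.sleep(0.1)
        with client_lock:             # then blocks forever on the client lock
            pass

t1 = threading.Thread(target=main_thread_connect_failure, daemon=True)
t2 = threading.Thread(target=heartbeat_thread_loop, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1); t2.join(timeout=1)
print("deadlocked:", t1.is_alive() and t2.is_alive())   # prints True
```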

The java client uses its RequestFutureCompletionHandler to shim these connection failures onto the pending completions queue, in order to avoid the problem of errbacks running while the client lock is acquired. I expect that we will need to implement something similar to really get this under control.
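A rough sketch of that approach, using a queue type and method names of my own invention rather than kafka-python's or the java client's actual classes: failures observed while the client lock is held are only recorded, and the errbacks fire later from the poll loop once the lock has been released.

```python
import collections
import threading

class PendingCompletions:
    """Defer future failures so errbacks never run under the client lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queue = collections.deque()

    def defer_failure(self, future, error):
        # May be called while the client lock is held: just record the failure.
        with self._lock:
            self._queue.append((future, error))

    def fire(self):
        # Called from the poll loop after the client lock is released, so the
        # errbacks are free to acquire the coordinator lock without deadlock.
        while True:
            with self._lock:
                if not self._queue:
                    return
                future, error = self._queue.popleft()
            future.failure(error)
```

The design point, as with the java client, is that the connection-failure path never invokes callbacks inline while holding the client lock.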

@dpkp mentioned this pull request Sep 18, 2019
  * Coordinator lock acquired first, client lock acquired second
  * Release client lock to process futures
@dpkp force-pushed the coordinator_lock_order branch from 937388f to 318c261 on September 29, 2019 18:05
@dpkp (Owner, Author) commented Sep 30, 2019

I haven't found any specific paths that would trigger a deadlock with this change. The example I mentioned above would require that we somehow get an in-flight request before the connection has completed, which shouldn't happen. So I'm going to merge and keep an eye out for any other issues.

@dpkp merged commit 0f929bd into master Sep 30, 2019
@dpkp deleted the coordinator_lock_order branch September 30, 2019 00:07
@huangcuiyang (Contributor) commented
I also think the fix you suggested is more thorough and better; my approach is relatively simple and was aimed at quickly fixing a production problem. When the heartbeat thread takes the two locks to call client.poll and read data, the impact on the main consumer thread's performance is small; I have tested the performance. I hope you can solve the design problem of the multi-threaded shared client.poll in a better way.
