
Change coordinator lock acquisition order #1821


Merged: 1 commit merged into master from coordinator_lock_order on Sep 30, 2019

Conversation

@dpkp (Owner) commented May 29, 2019

Attempt to address heartbeat lock contention by changing the lock acquisition order from client->coordinator to coordinator->client. This aligns with the java client and (I hope) should help improve concurrency coordination between the heartbeat thread loop and the main consumer loop by allowing us to drop the client lock during future processing.
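A minimal sketch of the ordering rule described here, assuming a single poll step and illustrative names (poll_once, coordinator_lock, client_lock are stand-ins, not kafka-python's actual internals):

```python
import threading

# Illustrative locks only; these are not kafka-python's actual attributes.
coordinator_lock = threading.RLock()   # guards coordinator/group state
client_lock = threading.RLock()        # guards the network client

def poll_once(do_network_io, process_response):
    with coordinator_lock:                 # coordinator lock acquired first
        with client_lock:                  # client lock acquired second, I/O only
            responses = do_network_io()
        # The client lock is released here; futures/callbacks are processed
        # while only the coordinator lock is held, so other threads can use
        # the client in the meantime and the lock order is never inverted.
        for response in responses:
            process_response(response)

# Trivial stand-in usage:
poll_once(lambda: ["heartbeat-response"], print)
```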



@jeffwidman (Contributor) left a comment

This looks good on paper. Never know for sure til we get it out in the wild. 😃

@dpkp (Owner, Author) commented May 30, 2019

The thing to be concerned about here is whether there is a code path that attempts to acquire the coordinator lock while holding the client lock. Unfortunately, I think this is tricky: even though we have deferred processing of the futures that handle connection responses into an unlocked pending_completions list, there are plenty of other futures that are not in this list. Most of those will not be generated while the client lock is held (see git grep future.success), but future.failure() is a very real risk.

An obvious place where this might happen is KafkaClient._maybe_connect, which acquires the client lock and then calls into BrokerConnection.connect. If something goes wrong during connect, we call self.close(), which immediately fails all in-flight request futures. Those errbacks then run with the client lock held, and if one of them attempts to acquire the coordinator lock, for example in _handle_join_failure or _handle_heartbeat_failure, we could get a deadlock.
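As a hypothetical illustration of that scenario (the names below are stand-ins, not kafka-python internals), the following sketch deadlocks because one thread takes the client lock then the coordinator lock while the other thread takes them in the opposite order:

```python
import threading
import time

client_lock = threading.Lock()        # stand-in for the client lock
coordinator_lock = threading.Lock()   # stand-in for the coordinator lock

def main_thread_connect_failure():
    with client_lock:                 # e.g. inside a connect attempt
        time.sleep(0.1)               # connect fails, in-flight futures are failed...
        with coordinator_lock:        # ...and an errback (e.g. a join/heartbeat
            pass                      #    failure handler) wants this lock

def heartbeat_thread_loop():
    with coordinator_lock:            # heartbeat takes the coordinator lock first
        time.sleep(0.1)
        with client_lock:             # then blocks forever on the client lock
            pass

t1 = threading.Thread(target=main_thread_connect_failure, daemon=True)
t2 = threading.Thread(target=heartbeat_thread_loop, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1); t2.join(timeout=1)
print("deadlocked:", t1.is_alive() and t2.is_alive())   # prints True
```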

The java client uses its RequestFutureCompletionHandler to shim these connection failures onto the pending completions queue, in order to avoid the problem of errbacks running while the client lock is acquired. I expect that we will need to implement something similar to really get this under control.
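A rough sketch of that approach, using a queue type and method names of my own invention rather than kafka-python's or the java client's actual classes: failures observed while the client lock is held are only recorded, and the errbacks fire later from the poll loop once the lock has been released.

```python
import collections
import threading

class PendingCompletions:
    """Defer future failures so errbacks never run under the client lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queue = collections.deque()

    def defer_failure(self, future, error):
        # May be called while the client lock is held: just record the failure.
        with self._lock:
            self._queue.append((future, error))

    def fire(self):
        # Called from the poll loop after the client lock is released, so the
        # errbacks are free to acquire the coordinator lock without deadlock.
        while True:
            with self._lock:
                if not self._queue:
                    return
                future, error = self._queue.popleft()
            future.failure(error)
```

The design point, as with the java client, is that the connection-failure path never invokes callbacks inline while holding the client lock.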

@dpkp mentioned this pull request Sep 18, 2019
  * Coordinator lock acquired first, client lock acquired second
  * Release client lock to process futures
@dpkp force-pushed the coordinator_lock_order branch from 937388f to 318c261 on September 29, 2019 18:05
@dpkp (Owner, Author) commented Sep 30, 2019

I haven't found any specific paths that would trigger a deadlock with this change. The example I mentioned above would require that we somehow get an in-flight request before the connection has completed, which shouldn't happen. So I'm going to merge and keep an eye out for any other issues.

@dpkp merged commit 0f929bd into master Sep 30, 2019
@dpkp deleted the coordinator_lock_order branch September 30, 2019 00:07
@huangcuiyang (Contributor) commented
I also think the fix you suggested is more thorough and better; my approach is relatively simple and was aimed at quickly fixing a production problem. When the heartbeat thread takes the two locks to call client.poll and read data, the impact on the main consumer thread's performance is small; I have tested the performance. I hope you can solve the design problem of the multi-threaded shared client.poll in a better way.
