Worker stuck after "Protocol out of sync" #1744

Closed
isamaru opened this issue Mar 18, 2019 · 8 comments

Comments

@isamaru
Contributor

isamaru commented Mar 18, 2019

After switching to 1.4.5 from an older 1.3 version, we see some workers getting stuck with this pattern in the logs:

  1. "Protocol out of sync" with no previous errors
  2. "Closing connection. KafkaConnectionError: Socket EVENT_READ without in-flight-requests"
  3. "[kafka.client] Node 2 connection failed -- refreshing metadata"
  4. 5 minutes later, the worker starts logging an endless loop of "Duplicate close() with error: [Error 7] RequestTimedOutError: Request timed out after 305000 ms", with no apparent attempt to reconnect

More detailed (INFO) logs here:
kafka-python-logs.txt

We found a worker which was stuck like this for 2 days, processing no messages but not failing outright or even triggering a rebalance of the group, causing lag on its partition. The broker was running and other workers could connect to it during that period.
Note that Node 2 is the leader for the partition the worker is assigned to. The group coordinator is Node 1, which is why the heartbeat keeps beating.

This seems to be the same thing as #1728, which wasn't completely fixed.

#1733 was about fixing one possible cause of that error (i.e. avoiding it). I believe that in our case the error is legitimate (temporary connection problems to the broker).
The real issue is that the worker is not able to recover after this error happens; it becomes stuck instead of either giving up and dying or reconnecting.
I've tried to find the code responsible for reconnecting (which doesn't seem to fire), but I don't understand your codebase that well. I will continue investigating; this is important for us.

(We have deployed this in several environments and see it on both 1.0.1 and 2.1 brokers, which the client identifies as 1.0.0.)

@isamaru
Contributor Author

isamaru commented Mar 18, 2019

Maybe the root cause is an incomplete #1733 after all. I realized that the recurring timeouts can only occur if there is some in-flight request, yet none was present before the connection was closed. Therefore I suspect a race condition between sending a request and storing it in in_flight_requests.

Ideally, the addition to in_flight_requests in _send could be wrapped like so:

        with self._lock:
            correlation_id = self._protocol.send_request(request)
            log.debug('%s Request %d: %s', self, correlation_id, request)
            expect_response = request.expect_response()

            if expect_response:
                sent_time = time.time()
                assert correlation_id not in self.in_flight_requests, 'Correlation ID already in-flight!'
                self.in_flight_requests[correlation_id] = (future, sent_time)

        if not expect_response:
            future.success(None)

Afterwards, the if not conn.in_flight_requests: check in client_async.py#_poll() needs to use the same lock.
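
For illustration, here is a rough sketch of what I mean (this is not the actual _poll() code; only conn._lock and conn.in_flight_requests come from the snippet above, the rest is simplified, and Errors stands for kafka.errors):

    # Sketch only: take the same lock that _send() holds while registering
    # the request, so this check cannot run in the window between
    # send_request() and the in_flight_requests bookkeeping.
    with conn._lock:
        spurious_read = not conn.in_flight_requests

    if spurious_read:
        # Only now is it safe to conclude there really is no in-flight
        # request behind the EVENT_READ and close the connection.
        conn.close(error=Errors.KafkaConnectionError(
            'Socket EVENT_READ without in-flight-requests'))

With the lock held, _poll() blocks until _send() has finished its bookkeeping, so a request that has already hit the wire can no longer be mistaken for a spurious read.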

Still, the missing path for reconnecting remains a problem. I haven't been able to find any code path which would re-enable the connection in the client_async poll loop.

@isamaru
Contributor Author

isamaru commented Mar 18, 2019

After further investigation, I've come to believe that it is indeed the race condition, not network problems, that is causing this. In our case we are committing manually from the main thread.
The deadlock seems to occur because "Protocol out of sync" is raised and the connection is closed before in_flight_requests is populated. close() would normally fail the futures stored in in_flight_requests, but does not do so here because the dict is not yet populated.

When close() is called again 5 minutes later, in_flight_requests is populated, but the short-circuit in conn.py#close() does not clear the futures. Besides the locking I recommend above, we should also clear in_flight_requests and fail its futures there even when short-circuiting.
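
Roughly what I have in mind for conn.py#close(), as a sketch only (the DISCONNECTED state check and the _fail_in_flight_requests helper are my own illustration of the idea, not the existing code; Errors stands for kafka.errors):

    def close(self, error=None):
        with self._lock:
            if self.state is ConnectionStates.DISCONNECTED:
                # Duplicate close(): keep the short-circuit, but still fail
                # any futures that slipped into in_flight_requests after the
                # first close(), instead of letting them time out forever.
                if error is not None:
                    log.warning('%s Duplicate close() with error: %s', self, error)
                self._fail_in_flight_requests(error)
                return
            # ... normal close path, which already fails in-flight requests ...

    def _fail_in_flight_requests(self, error):
        # Hypothetical helper: drain in_flight_requests and fail each future
        # so a blocked caller (e.g. a manual commit) gets an error instead of
        # hanging on a connection that will never answer.
        while self.in_flight_requests:
            _, (future, _sent_time) = self.in_flight_requests.popitem()
            future.failure(error or Errors.KafkaConnectionError(str(self)))

The important part is that even the duplicate-close path drains in_flight_requests, so the endless "Duplicate close() ... RequestTimedOutError" loop from the logs above can no longer keep a dead future alive.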

@dpkp
Owner

dpkp commented Mar 20, 2019

Interesting. Let me think a bit more about this. You went straight from 1.3.x to 1.4.5? Did you ever try 1.4.4 or earlier?

@isamaru
Contributor Author

isamaru commented Mar 20, 2019

We tried to upgrade to 1.4.3 last year, then reverted to 1.3.5 because we had weird cluster issues. Nothing we could pin down specifically, and not this. No one had time to investigate properly back then.

isamaru added a commit to exponea/kafka-python that referenced this issue Mar 21, 2019
kafka-python dpkp#1744 Fix race conditions with conn.in_flight_requests

isamaru added a commit to exponea/kafka-python that referenced this issue Mar 21, 2019
kafka-python dpkp#1744 Fix one more race condition with IFR

isamaru pushed further commits to exponea/kafka-python that referenced this issue on Mar 25 and Mar 27, 2019

@vimal3271

We are also seeing the same issue: the consumer gets stuck after running for one or two days. The problem is made worse by the fact that it does not trigger a rebalance. We are using 1.4.4.

@jeffwidman
Contributor

See #1766 (comment) for a status update on this issue.

@vimal3271 1.4.4 is an older version. One major source of deadlocks was fixed in 1.4.5, but there have been additional fixes on master since then, so please upgrade to master (or wait for the upcoming 1.4.6 release). If you are still seeing issues after that, please file a new ticket. Happy to help; I just don't want to spend time debugging/fixing issues that have already been solved on master.

@dpkp
Owner

dpkp commented Apr 3, 2019

Fixes have been merged to master. I'm going to close, but please reopen if the issue persists.

@dpkp dpkp closed this as completed Apr 3, 2019
@balajibreddi

@isamaru Is the issue resolved? If so, what was the solution? Please help me out here.
I am facing the same issue with the kafka-python library.
