Worker stuck after "Protocol out of sync" #1744
Comments
Maybe the root cause could be an incomplete #1733 after all. I realized that the recurring timeouts only occur if there is some addition to in_flight_requests in …
Still, the missing path for reconnecting remains a problem. I haven't been able to find any code path that would re-enable the connection in the async_client poll loop.
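To illustrate the kind of race suspected around conn.in_flight_requests (a simplified, hypothetical sketch, not the actual kafka-python code): if the sender records a request as in-flight only after the bytes have gone out on the socket, a reader woken up by a fast response can find the in-flight structure empty and conclude the protocol is out of sync.

```python
# Simplified illustration of a check-then-act race on a shared
# in-flight-requests structure. Hypothetical stand-in code, not kafka-python:
# the names in_flight, send_request and handle_readable_socket are invented.
import threading
import time
from collections import deque

in_flight = deque()          # shared between sender and receiver

def send_request(req):
    # Imagine the request bytes are written to the socket here...
    time.sleep(0.001)        # ...and the broker answers very quickly.
    # Only now is the request recorded as in-flight; the window above is
    # where the receiver can observe an "impossible" empty structure.
    in_flight.append(req)

def handle_readable_socket():
    # Mirrors the "Socket EVENT_READ without in-flight-requests" check:
    # a response is readable, but nothing is recorded as awaiting one.
    if not in_flight:
        raise RuntimeError("Protocol out of sync: EVENT_READ without in-flight requests")
    in_flight.popleft()

sender = threading.Thread(target=send_request, args=("metadata-request",))
sender.start()
try:
    handle_readable_socket()  # can run before append() above -> spurious error
except RuntimeError as exc:
    print(exc)
sender.join()
```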
After further investigation, I've come to believe that it is indeed the race condition, not network problems, that is causing this. In our case we are committing manually from the main thread. When close is called 5 minutes later, …
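For reference, the consume pattern described in that comment is roughly the following (a minimal sketch; the topic, brokers, group id and the process() helper are placeholders, and the delay before close() is only indicative):

```python
# Minimal sketch of the reported usage: manual commits from the main thread.
from kafka import KafkaConsumer

def process(message):
    # Placeholder for application-specific work.
    print(message.topic, message.partition, message.offset)

consumer = KafkaConsumer(
    "example-topic",                      # placeholder topic
    bootstrap_servers=["broker-1:9092"],  # placeholder broker
    group_id="example-group",
    enable_auto_commit=False,             # offsets are committed manually below
)

try:
    for message in consumer:
        process(message)
        consumer.commit()                 # manual commit from the main thread
finally:
    consumer.close()                      # in the report, close() happens minutes later
```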
Interesting. Let me think a bit more about this. You went straight from 1.3.X to 1.4.5? Did you ever try 1.4.4 or earlier?
We tried to upgrade to 1.4.3 last year, then reverted to 1.3.5 because we had weird cluster issues. Nothing we could pin down specifically, and not this. No one had time to investigate properly back then.
Referenced commits (kafka-python dpkp#1744):
Fix race conditions with conn.in_flight_requests
Fix one more race condition with IFR
We are also seeing the same issue, where the consumer gets stuck after running for one or two days. The problem is made worse by the fact that it does not cause a rebalance. We are using 1.4.4.
@vimal3271 See #1766 (comment) for a status update on this issue.
Fixes have been merged to master. I'm going to close, but please reopen if the issue persists.
@isamaru Is the issue resolved? If so, what's the solution? Please help me out here.
After switching to 1.4.5 from an older 1.3 version, we see some workers getting stuck with this pattern in the logs:
"Protocol out of sync" with no previous errors
"Closing connection. KafkaConnectionError: Socket EVENT_READ without in-flight-requests"
"[kafka.client] Node 2 connection failed -- refreshing metadata"
"Duplicate close() with error: [Error 7] RequestTimedOutError: Request timed out after 305000 ms", but with no apparent attempts to reconnect
More detailed (INFO) logs here: kafka-python-logs.txt
We found a worker which was stuck like this for 2 days, processing no messages but not failing outright or even triggering a rebalance of the group, causing lag on its partition. The broker was up and other workers could connect to it during that period.
Note that Node 2 is the leader for the partition the worker is assigned to. The group coordinator is Node 1, which is why the heartbeat keeps beating.
This seems to be the same thing as #1728 which wasn't completely fixed.
#1733 was about fixing one possible cause of that error (i.e. avoiding it). I believe that in our case the error is legitimate (temporary connection problems to the broker).
The real issue is that the worker is unable to recover after this error happens: instead of either giving up and dying or reconnecting, it becomes stuck.
I've tried to find the code responsible for reconnecting (which doesn't seem to fire), but I don't understand the codebase that well. I will continue investigating; this is important for us.
(We have deployed this in several environments and see it with both 1.0.1 and 2.1 brokers, identified by the client as 1.0.0.)
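Until a proper reconnect path exists, a possible application-side mitigation (only a sketch, under the assumption that a healthy worker never goes this long without records; the topic, brokers and threshold below are placeholders) is to watch for prolonged silence from poll() and rebuild the consumer:

```python
# Application-side watchdog sketch (a workaround assumption, not a fix in
# kafka-python itself): if poll() returns nothing for too long on a partition
# known to have traffic, tear the consumer down and build a new one.
import time
from kafka import KafkaConsumer

STUCK_AFTER_SECONDS = 600   # assumed threshold; tune to expected traffic

def new_consumer():
    return KafkaConsumer(
        "example-topic",                      # placeholder topic
        bootstrap_servers=["broker-1:9092"],  # placeholder brokers
        group_id="example-group",
        enable_auto_commit=False,
    )

consumer = new_consumer()
last_progress = time.monotonic()

while True:
    records = consumer.poll(timeout_ms=1000)
    if records:
        for tp, batch in records.items():
            for record in batch:
                print(record.topic, record.partition, record.offset)
        consumer.commit()
        last_progress = time.monotonic()
    elif time.monotonic() - last_progress > STUCK_AFTER_SECONDS:
        # The consumer may be stuck after "Protocol out of sync"; start over.
        consumer.close()
        consumer = new_consumer()
        last_progress = time.monotonic()
```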