Correlation IDs do not match [1.4.5] #1748
Comments
Background questions: (1) what broker version(s) are you running, and (2) what is your security configuration?

Because the broker protocol is ordered, the protocol parser buffer should also be an ordered queue. This is also how the official Java client is structured. So I don't plan to accept any changes to the underlying protocol structure / deque. If correlation IDs do not match, there are only a few explanations; the one that fits here is (1): tracking of what we actually sent on the socket gets out of sync (the protocol parser deque).

Going backwards: Thread A calls protocol.send_bytes and reads the accumulated data from the shared send buffer. Before Thread A resets the buffer, Thread B appends the encoded bytes for its own request, which has already been assigned a correlation ID and appended to the in-flight deque. Thread A then resets the buffer, silently dropping Thread B's bytes, so Thread B's request never reaches the broker. Now, when the next request comes along with the next correlation ID and is sent to the broker, the broker responds, we decode the response and compare it against the in-flight-requests queue... and we see a mismatch: where is the response for Thread B's correlation ID? This matches your logs, and I think it would be fixed with a simple additional lock in kafka/conn.py send_pending_requests():
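A minimal sketch of the race and of the locked drain (assumed class and attribute names; this is illustrative, not the actual kafka-python source):

```python
import threading

class SendBuffer:
    """Sketch of the shared buffer that protocol.send_bytes() drains.
    Names are illustrative, not the real kafka-python internals."""

    def __init__(self):
        self._to_send = b''
        self._lock = threading.Lock()

    def queue(self, payload):
        # A request's encoded bytes are appended here after the request
        # (and its correlation ID) is pushed onto the in-flight deque.
        self._to_send += payload

    def send_bytes_racy(self):
        # The race: Thread A reads the buffer, Thread B appends its
        # request bytes, then Thread A resets the buffer. Thread B's
        # bytes are dropped even though its correlation ID is already
        # being tracked as in-flight.
        data = self._to_send
        self._to_send = b''
        return data

    def send_bytes_locked(self):
        # Hot-patch equivalent: make the read-and-reset atomic with
        # respect to concurrent queue() calls by holding a lock.
        with self._lock:
            data = self._to_send
            self._to_send = b''
        return data
```

With the locked variant (or an equivalent lock held in send_pending_requests() around the drain), Thread B's bytes can no longer vanish between the read and the reset.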
If you're able to hot-patch, can you give that a try?
Thanks a lot for looking into it! This actually makes a lot of sense: the errors I see are more consistent with certain outgoing messages being dropped, rather than reordered. There was always a message skipped, never a message "ahead of time". It also explains the connection timeouts I had blamed on the network. I will try the patch immediately and let you know. (I have clusters of 1.0.1 and 2.1, no auth)
Deployed for over 4 hours and no problems so far; 1.4.5 had several failures over a comparable time period.

FYI, I'm reproducing the other race condition described in #1744 even with this patch. I'm preparing a separate fix.
Referenced commit: Fix race condition in protocol.send_bytes (kafka-python, dpkp#1748)
This is a serious error since it can result in data loss. I checked our production logs, and I do see these errors there as well. @dpkp once #1744 is fixed, I think it'd be a good idea to push another release.
This one is definitely important to fix quickly, but I don't believe it would cause data loss unless the producer is not checking future success/failure or not using retries. A CorrelationIdError will cause all in-flight requests to fail, including, in my example above, Thread B's request/response future, which therefore does not get lost. #1744 is a little more pernicious, in my opinion, because it causes a future to get lost and block forever. But anyway, agreed that we should push a new release with these fixes ASAP.
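For reference, a hedged example of the defensive pattern described above, using the public kafka-python producer API (broker address and topic are placeholders):

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

# Enable retries and check the send future, so a request that fails
# in-flight (e.g. after a CorrelationIdError fails the whole queue)
# surfaces as a retried or raised error instead of silent data loss.
producer = KafkaProducer(bootstrap_servers='localhost:9092', retries=5)

future = producer.send('my-topic', b'payload')
try:
    # Blocks until the broker acknowledges the write or the request
    # (and its retries) ultimately fail.
    metadata = future.get(timeout=10)
    print(metadata.topic, metadata.partition, metadata.offset)
except KafkaError as exc:
    # Handle or re-queue the record here.
    print('send failed: %s' % exc)
```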
Ah, right, I forgot about that. Yeah, as long as …
(Original issue description)

I am trying 1.4.5 consumers with `autocommit=off`. I occasionally see a `Correlation IDs do not match` error in the logs, in different scenarios. Sometimes it's OffsetCommitRequest, sometimes HeartbeatRequest.

#1733 uses a map to pair responses with correlation_ids, but only does so in `conn.py`. `protocol/parser.py` still uses a deque, and this is where the error occurs.

Related to my investigation of #1744, where I am playing with more locking on `in_flight_requests` manipulation and closing in `conn.py`. This causes this race condition to occur even more frequently in `parser.py` (which is not that surprising).

Also related to #1529.
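For anyone reading along, a minimal sketch (assumed names; not the actual kafka-python source) of the ordered, deque-based matching in question, showing why a dropped send surfaces as this exact error:

```python
from collections import deque

class CorrelationIdError(Exception):
    pass

class ParserSketch:
    """Illustrative stand-in for the deque-based tracking in
    protocol/parser.py; names are simplified."""

    def __init__(self):
        self.in_flight_requests = deque()  # (correlation_id, request) pairs

    def track(self, correlation_id, request):
        self.in_flight_requests.append((correlation_id, request))

    def process_response(self, recv_correlation_id):
        expect_correlation_id, request = self.in_flight_requests.popleft()
        if recv_correlation_id != expect_correlation_id:
            # If one request's bytes were dropped before reaching the
            # socket, the broker answers a *later* request first and
            # this ordered check fires.
            raise CorrelationIdError(
                'Correlation IDs do not match: sent %d, recv %d'
                % (expect_correlation_id, recv_correlation_id))
        return request
```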