
DP-238 Test a fix for race condition with closed connection #1746


Conversation

isamaru (Contributor) commented Mar 19, 2019


with self._ifr_lock:
    if self.disconnected():
        log.warning("%s: Race condition: connection already closed.")
    self.in_flight_requests[correlation_id] = (future, sent_time)
isamaru (author) commented on the diff above:
There are actually two problems:

  1. The connection is closed with "Protocol out of sync". I suspect I know the root cause, but I am intentionally not fixing it now. When I deploy this version, this log message will tell me whether I'm right, and I will know for sure. That's when I'll fix it.
  2. When the connection is closed, it leaves in_flight_requests hanging.
    The timeline is as follows:
  • send() is called and sends a message
  • before the future is saved to in_flight_requests, another thread calls close() (for example, because of problem 1)
  • close() terminates all futures in in_flight_requests, but there are none there yet
  • the thread which called send() stores its future into the disconnected connection's in_flight_requests
  • the future remains unresolved forever, so the worker deadlocks
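
The timeline above can be reproduced with a minimal sketch (toy Future and Conn classes with assumed names, not kafka-python's real API): close() cancels every in-flight future it can see, but a future stored after close() has already run is never resolved.

```python
import threading

class Future:
    # Toy stand-in for kafka-python's future object
    def __init__(self):
        self.done = threading.Event()
        self.exception = None

    def fail(self, exc):
        self.exception = exc
        self.done.set()

class Conn:
    # Toy stand-in for BrokerConnection: only the deadlock-relevant parts
    def __init__(self):
        self.in_flight_requests = {}

    def close(self):
        # Terminate every future currently in flight...
        for f in self.in_flight_requests.values():
            f.fail(Exception("Cancelled"))
        self.in_flight_requests.clear()

conn = Conn()
f = Future()
conn.close()                    # close() runs first: nothing to cancel yet
conn.in_flight_requests[0] = f  # ...then send()'s thread stores its future
print(f.done.is_set())          # → False: nobody will ever resolve f
```

A worker blocking on f would wait forever, which is exactly the deadlock described above.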

isamaru (author) added:

I will deploy this on production and check the behavior. If my theory is right, I will eventually get to see this sequence of events:

  1. "Protocol out of sync" error
  2. Connection close
  3. Race condition warning
  4. Worker reestablishing the connection successfully

dpkp (Owner) commented Mar 21, 2019

Interesting theory -- let us know the results of your test!

isamaru (author) commented Mar 21, 2019

Reproduced this (while testing together with #1752): it's exactly as I suspected.

2019-03-21 11:21:00,891 WARNING [kafka.client] Protocol out of sync on <kafka.conn.BrokerConnection object at 0x7f89bef0a9b0>, closing
2019-03-21 11:21:00,896 INFO [kafka.conn] <BrokerConnection node_id=1 host=<redacted-broker-1>:9092 <connected> [IPv4 ('10.100.3.234', 9092)]>: Closing connection. KafkaConnectionError: Socket EVENT_READ without in-flight-requests
2019-03-21 11:21:00,897 WARNING [kafka.client] Node 1 connection failed -- refreshing metadata
2019-03-21 11:21:00,919 WARNING [kafka.conn] %s: Race condition: connection already closed.
2019-03-21 11:21:00,919 ERROR [kafka.consumer.fetcher] Fetch to node 1 failed: Cancelled: <BrokerConnection node_id=1 host=<redacted-broker-1>:9092 <disconnected> [IPv4 ('10.100.3.234', 9092)]>
2019-03-21 11:21:01,423 INFO [kafka.conn] <BrokerConnection node_id=1 host=<redacted-broker-1>:9092 <connecting> [IPv4 ('10.100.3.234', 9092)]>: connecting to <redacted-broker-1>:9092 [('10.100.3.234', 9092) IPv4]
2019-03-21 11:21:01,424 INFO [kafka.conn] <BrokerConnection node_id=1 host=<redacted-broker-1>:9092 <connecting> [IPv4 ('10.100.3.234', 9092)]>: Connection complete.

The worker would have deadlocked if the future had not been cancelled by the new if self.disconnected() branch in _send.
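
The fix can be sketched like this (a sketch under assumed names: store_in_flight stands in for the tail of _send, and Future/Conn are toys, not kafka-python's classes). The key point is that the disconnected() check and the dictionary write happen under the same lock that close() takes, so a future can no longer slip in after close() has swept the dictionary.

```python
import threading

class Future:
    # Toy future, resolved only through fail() in this sketch
    def __init__(self):
        self.done = threading.Event()
        self.exception = None

    def fail(self, exc):
        self.exception = exc
        self.done.set()

class Conn:
    def __init__(self):
        self._ifr_lock = threading.Lock()
        self.in_flight_requests = {}
        self._closed = False

    def disconnected(self):
        return self._closed

    def close(self):
        with self._ifr_lock:
            self._closed = True
            # Cancel everything that is already in flight
            for f in self.in_flight_requests.values():
                f.fail(Exception("Cancelled"))
            self.in_flight_requests.clear()

    def store_in_flight(self, correlation_id, future, sent_time):
        # The guard from the diff: take the lock, and if close() already
        # ran, fail the future immediately instead of stranding it
        with self._ifr_lock:
            if self.disconnected():
                future.fail(Exception("Connection already closed"))
                return
            self.in_flight_requests[correlation_id] = (future, sent_time)

conn = Conn()
conn.close()
f = Future()
conn.store_in_flight(0, f, sent_time=0.0)
print(f.done.is_set())  # → True: the future is failed, not stranded
```

Either ordering of the two threads now resolves the future: close() cancels it if it is already stored, and store_in_flight() fails it if close() got there first.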

I'm going to improve this a little and then reopen in a new PR.

dpkp (Owner) commented Mar 21, 2019

Fantastic sleuthing

isamaru (author) commented Mar 21, 2019

Closing, resolved in #1757

isamaru closed this Mar 21, 2019