Redis results backend: apply_async().get() hangs forever after disconnection from redis-server #4857

Open
amitlicht opened this issue Jun 27, 2018 · 20 comments · May be fixed by #9371

@amitlicht

amitlicht commented Jun 27, 2018

Version info

  • celery: 4.2.0
  • kombu: 4.2.0
  • redis-py: 2.10.6
  • gevent: 1.2.2
  • redis server: 3.2.11

Steps to reproduce

Celery configuration: broker is rabbitmq, results backend is local redis (password protected). Both broker & client are gevented.

  1. start redis server
  2. start a celery worker
  3. call task.apply_async() once (task executes and output is returned)
  4. restart redis server
  5. call task.apply_async() again (caller hangs forever; see the sketch below)
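
A minimal sketch of this setup, assuming an app named repro, a trivial add task, and placeholder broker/backend URLs and credentials (none of these names come from the report):

    # Minimal sketch of the reproduction setup. Broker/backend URLs, the
    # password and the add task are illustrative placeholders.
    from celery import Celery

    app = Celery(
        'repro',
        broker='amqp://guest:guest@localhost:5672//',   # rabbitmq broker
        backend='redis://:password@127.0.0.1:6379/0',   # local, password-protected redis
    )

    @app.task
    def add(x, y):
        return x + y

    if __name__ == '__main__':
        # Step 3: executes and returns while redis-server is up.
        print(add.apply_async(kwargs={'x': 2, 'y': 2}).get())
        # Steps 4-5: restart redis-server here; the next .get() is the call
        # that then hangs forever.
        print(add.apply_async(kwargs={'x': 2, 'y': 2}).get())

The worker would be started with something like celery -A repro worker -P gevent, with the caller also running under gevent, to match the gevented setup described above.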

Expected behavior

  • During server downtime: apply_async().get() is expected to raise an error (ConnectionError: Error while reading from socket: ('Connection closed by server.',))
  • After server is up again: apply_async().get() is expected to return to normal

Actual behavior

  • During server downtime: apply_async().get() sometimes raises an error and sometimes just hangs
  • After server is up again: apply_async().get() always hangs. The Celery worker receives the task request (over the rabbitmq broker) and executes it; the output is correctly stored in redis under celery-task-meta-TASKID, but the caller never collects it and never returns (a sketch for verifying this directly in Redis follows below).
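
A sketch of how the "stored but never collected" state can be verified directly in Redis while .get() is hanging; the connection details and the TASKID placeholder are assumptions, not values from the report:

    # While apply_async().get() hangs, confirm the result really is in Redis.
    # Host/port/password are placeholders; TASKID is async_result.task_id.
    from redis import StrictRedis

    r = StrictRedis(host='127.0.0.1', port=6379, password='password', db=0)
    print(r.get('celery-task-meta-TASKID'))  # prints the stored task meta (JSON) or None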

Additional information

  • The issue looks similar to Celery hangs indefinitely (instead of failing) if redis is not started #4556; however, this one is not related to worker bootstrapping but rather to the client hanging after a disconnection.
  • The issue also reproduces (in fact, it first occurred for us) when celery's redis client is disconnected from the server due to network / proxy issues (in our case, a middle-man HA proxy drops the connections).
@amitlicht amitlicht changed the title Redis results backend: apply_async hangs forever after disconnection from redis-server Redis results backend: apply_async().get() hangs forever after disconnection from redis-server Jul 2, 2018
@auvipy
Member

auvipy commented Jul 5, 2018

Let's continue the discussion there instead? Or should this stay open as a separate related issue?

@amitlicht
Author

@auvipy if by "there" you mean in #4556 - while the two look similar, they're not exactly the same issue.

@georgepsarakis
Contributor

georgepsarakis commented Jul 8, 2018

@amitlicht did you try configuring a socket timeout?

@amitlicht
Author

@georgepsarakis
a. Yes, I did; it had no effect on the issue.
b. According to the documentation, the default value for this parameter is 120 seconds, yet our process was stuck forever on .get().

I did try to circumvent the entire internal pub-sub usage within celery by adding our own subscriber on celery-task-meta-TASKID and having it call on_state_change on its own upon receiving messages. This bypass resolved the issue: disconnections were caught and handled properly by redis-py (with practically no additional work from our client code). The internal celery consumer remains stuck, but the async_result is freed by our bypass consumer.

Just throwing an idea out there - perhaps celery's usage of redis-py somehow disables redis-py disconnection handling?

@georgepsarakis
Contributor

@amitlicht sorry I did not clarify earlier, this is the socket connect timeout and is a different setting, with a default value of None. Since it refers to connection specifically, I think it may affect the re-connection and re-subscription process.
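
For clarity, a sketch of the two settings being discussed here, assuming the standard Celery setting names and an existing app instance; the values are only examples:

    # redis_socket_timeout applies to reads/writes on an established connection
    # (documented default: 120 seconds); redis_socket_connect_timeout applies to
    # establishing the TCP connection (default: None).
    app.conf.redis_socket_timeout = 120.0
    app.conf.redis_socket_connect_timeout = 10.0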

@amitlicht
Author

amitlicht commented Jul 9, 2018

@georgepsarakis I did not try redis_socket_connect_timeout, but looking at the documentation this parameter is New in version 5.0.1. As you can see in the original issue description, we encountered the issue in celery 4.2.0, which is the latest stable RC.

@georgepsarakis
Contributor

@amitlicht this is probably a typo in the documentation; the setting is available in the latest release: https://github.com/celery/celery/blob/v4.2.0/celery/backends/redis.py#L173

@amitlicht
Author

@georgepsarakis just tried it with a socket connect timeout of 10 seconds; the issue remains.
I added a print of the redis connection pool's connection_kwargs, here they are:

{u'socket_connect_timeout': 10.0, u'socket_timeout': 120.0, u'db': 0, u'host': u'127.0.0.1', u'password': u'password', u'port': 6395}
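
For anyone wanting to print the same thing, one possible way (assuming the default Redis backend and redis-py's connection pool attributes; task is a bound task object):

    # Inspect the kwargs the Redis backend actually passes to redis-py.
    backend = task._get_app().backend
    print(backend.client.connection_pool.connection_kwargs)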

@georgepsarakis
Contributor

@amitlicht thanks for trying this, I will look into this a bit further.

@georgepsarakis
Contributor

@amitlicht I found this redis-py issue, which mentions that the TCP keep-alive setting may be of use here, since it results in faster detection of the connection error.

I don't think the Redis backend supports this option, but it should not be very difficult for you to try it. If you want help, let me know.
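
A sketch of what the TCP keep-alive suggestion amounts to on a plain redis-py client (redis-py accepts socket_keepalive and socket_keepalive_options); since the backend does not expose this, it would have to be wired in manually, and the option values below are illustrative and Linux-specific:

    import socket
    from redis import StrictRedis

    client = StrictRedis(
        host='127.0.0.1', port=6379, password='password',
        socket_keepalive=True,
        socket_keepalive_options={
            socket.TCP_KEEPIDLE: 60,   # idle seconds before probes start
            socket.TCP_KEEPINTVL: 10,  # seconds between probes
            socket.TCP_KEEPCNT: 3,     # failed probes before the socket is closed
        },
    )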

@amitlicht
Author

What I don't understand is why you would need any kind of special settings to begin with. As I mentioned, I managed to circumvent the issue entirely by just initializing my own redis-py client, creating a pubsub consumer out of it, and subscribing it to the celery-task-meta-TASKID channel. No special config whatsoever; my code at the moment looks something like this:

    # Imports needed by this snippet: gevent for the background greenlet,
    # StrictRedis for our own, independent redis-py client.
    import gevent
    from redis import StrictRedis

    # `task`, `kwargs` and `execution_expiry_seconds` come from the enclosing function.
    async_result = task.apply_async(kwargs=kwargs)
    app = task._get_app()
    backend = app.backend
    results_consumer = backend.result_consumer
    task_id = async_result.task_id
    redis_client = StrictRedis(host='localhost', port=5673, password='password')
    pubsub = redis_client.pubsub(ignore_subscribe_messages=True)
    task_key = backend.get_key_for_task(task_id)
    pubsub.subscribe(task_key)

    def try_get_message():
        # See celery.backends.redis.ResultConsumer#drain_events for more details
        m = pubsub.get_message(timeout=30)
        if m and m['type'] == 'message':
            # Feed the decoded result into celery's own result consumer so that
            # async_result.get() below is released.
            results_consumer.on_state_change(results_consumer._decode_result(m['data']), m)

    def try_get_message_loop():
        while True:
            try_get_message()

    try_get_message_greenlet = gevent.spawn(try_get_message_loop)

    try:
        return async_result.get(timeout=execution_expiry_seconds)
    finally:
        try_get_message_greenlet.kill()
        pubsub.unsubscribe(task_key)
        pubsub.close()

This works as is, without any special configuration on either the server or client side, which makes me think the problem on celery's side is some special configuration to begin with...

@georgepsarakis
Contributor

@amitlicht thanks for the feedback. I just noticed that you are using gevent, which uses a different result drainer. Could the problem originate from here? What do you think?

If possible, please provide a stack trace generated when stopping a client that seems idle; that should help us locate the source of the issue.
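
One way to capture such a stack trace from a client that looks idle, without killing it, is the standard-library faulthandler module (Python 3, or the faulthandler backport package on Python 2):

    # Dump every thread's stack when the process receives SIGUSR1, so a
    # hanging .get() can be inspected with `kill -USR1 <pid>`.
    import faulthandler
    import signal

    faulthandler.register(signal.SIGUSR1, all_threads=True)

Under gevent the dump will typically end in the hub's switch frames, much like the KeyboardInterrupt trace posted further down.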

@acmisiti

Any update on this?

@georgepsarakis
Contributor

@acmisiti please check my previous comment. Can you perhaps provide a stack trace generated when stopping an idle client?

@amitlicht
Author

Providing a stack trace on an idle client is non-trivial since it's all running in a containerized environment. As I said, we went around it using a rather ugly but working hack...
As for @georgepsarakis's original suggestion - I wouldn't know, as I don't know your code. But in general, silently dropping socket.timeout errors sounds like bad practice.

@stringfellow

@georgepsarakis I can provide a stack trace for this, taken from a keyboard interrupt while running a service in a shell that has the same issue. It only affects Celery >= 4.2.0; the Kombu version doesn't seem to matter, and the last known good version is Celery == 4.1.0. That the issue is related to Redis was verified by removing the Redis results backend, after which the issue clears. Below is the stack trace with the following setup:

software -> celery:4.2.2 (windowlicker) kombu:4.3.0 py:2.7.13
            billiard:3.5.0.5 py-amqp:2.4.2
platform -> system:Linux arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:None

Stacktrace:

  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/some/package/that/works.py", line 795, in read_data
    return request.get().get()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/result.py", line 224, in get
    on_message=on_message,
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/backends/async.py", line 188, in wait_for_pending
    for _ in self._wait_for_pending(result, **kwargs):
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/backends/async.py", line 257, in _wait_for_pending
    sleep(0)
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 167, in sleep
    waiter.get()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 899, in get
    return self.hub.switch()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 627, in switch
    switch_out = getattr(getcurrent(), 'switch_out', None)
KeyboardInterrupt

It will just wait forever until the service is killed.

@auvipy
Member

auvipy commented Mar 21, 2019

@stringfellow could you kindly try this with celery 4.3rc3 once?

@stringfellow

@auvipy not easily... after finding this issue and posting the stack trace I removed the Redis backend (this is in production and we weren't using the results anyway). I'm away from work until early April now, so I'm unlikely to have time to do it. Are you not able to reproduce the issue in testing? (I couldn't reproduce it on my local Docker stack.)

@auvipy auvipy added this to the 4.4.x milestone May 1, 2020
@auvipy
Member

auvipy commented May 1, 2020

update?

@auvipy auvipy removed this from the 4.4.x milestone Feb 18, 2021
@auvipy
Member

auvipy commented Feb 18, 2021

does updating gevent fix this in master?

@auvipy auvipy added this to the 5.3 milestone Oct 31, 2021
@Nusnus Nusnus modified the milestones: 5.3, Future Feb 19, 2023