Redis results backend: apply_async().get() hangs forever after disconnection from redis-server #4857

Open
amitlicht opened this issue Jun 27, 2018 · 20 comments · May be fixed by #9371

@amitlicht

amitlicht commented Jun 27, 2018

Version info

  • celery: 4.2.0
  • kombu: 4.2.0
  • redis-py: 2.10.6
  • gevent: 1.2.2
  • redis server: 3.2.11

Steps to reproduce

Celery configuration: broker is rabbitmq, results backend is local redis (password protected). Both broker & client are gevented.

  1. start redis server
  2. start a celery worker
  3. call task.apply_async() once (task executes and output is returned)
  4. restart redis server
  5. call task.apply_async() again (caller hangs forever; see the sketch below)
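
A minimal sketch of this setup, assuming an app named repro, a trivial add task, and placeholder broker/backend URLs and credentials (none of these names come from the report):

    # Minimal sketch of the reproduction setup. Broker/backend URLs, the
    # password and the add task are illustrative placeholders.
    from celery import Celery

    app = Celery(
        'repro',
        broker='amqp://guest:guest@localhost:5672//',   # rabbitmq broker
        backend='redis://:password@127.0.0.1:6379/0',   # local, password-protected redis
    )

    @app.task
    def add(x, y):
        return x + y

    if __name__ == '__main__':
        # Step 3: executes and returns while redis-server is up.
        print(add.apply_async(kwargs={'x': 2, 'y': 2}).get())
        # Steps 4-5: restart redis-server here; the next .get() is the call
        # that then hangs forever.
        print(add.apply_async(kwargs={'x': 2, 'y': 2}).get())

The worker would be started with something like celery -A repro worker -P gevent, with the caller also running under gevent, to match the gevented setup described above.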

Expected behavior

  • During server downtime: apply_async().get() is expected to raise an error (ConnectionError: Error while reading from socket: ('Connection closed by server.',))
  • After server is up again: apply_async().get() is expected to return to normal

Actual behavior

  • During server downtime: apply_async().get() sometimes raises an error and sometimes just hangs
  • After server is up again: apply_async().get() always hangs. The Celery worker receives the task request (over the rabbitmq broker) and executes it; the output is correctly stored in redis under celery-task-meta-TASKID, but the caller never collects it and never returns (a sketch for verifying this directly in Redis follows below).
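
A sketch of how the "stored but never collected" state can be verified directly in Redis while .get() is hanging; the connection details and the TASKID placeholder are assumptions, not values from the report:

    # While apply_async().get() hangs, confirm the result really is in Redis.
    # Host/port/password are placeholders; TASKID is async_result.task_id.
    from redis import StrictRedis

    r = StrictRedis(host='127.0.0.1', port=6379, password='password', db=0)
    print(r.get('celery-task-meta-TASKID'))  # prints the stored task meta (JSON) or None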

Additional information

  • The issue looks similar to Celery hangs indefinitely (instead of failing) if redis is not started #4556; however, this one is not related to worker bootstrapping but rather to the client hanging after a disconnection.
  • The issue also reproduces (in fact, it first occurred for us) when celery's redis client is disconnected from the server due to network / proxy issues (in our case, a middle-man HA proxy drops the connections).
@amitlicht amitlicht changed the title Redis results backend: apply_async hangs forever after disconnection from redis-server Redis results backend: apply_async().get() hangs forever after disconnection from redis-server Jul 2, 2018
@auvipy
Member

auvipy commented Jul 5, 2018

Let's continue the discussion there instead? Or should this stay open as a separate related issue?

@amitlicht
Author

@auvipy if by "there" you mean in #4556 - while the two look similar, they're not exactly the same issue.

@georgepsarakis
Contributor

georgepsarakis commented Jul 8, 2018

@amitlicht did you try configuring a socket timeout?

@amitlicht
Author

@georgepsarakis
a. Yes, I did; it had no effect on the issue.
b. According to the documentation, the default value for this parameter is 120 seconds, yet our process was stuck forever on .get().

I did try to circumvent the entire internal pub-sub usage within celery by adding our own subscriber on celery-task-meta-TASKID and having it call on_state_change on its own upon receiving messages. This bypass resolved the issue: disconnections were caught and handled properly by redis-py (with practically no additional work from our client code). The internal celery consumer remains stuck, but the async_result is freed by our bypass consumer.

Just throwing an idea out there - perhaps celery's usage of redis-py somehow disables redis-py disconnection handling?

@georgepsarakis
Contributor

@amitlicht sorry I did not clarify earlier, this is the socket connect timeout and is a different setting, with a default value of None. Since it refers to connection specifically, I think it may affect the re-connection and re-subscription process.
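
For clarity, a sketch of the two settings being discussed here, assuming the standard Celery setting names and an existing app instance; the values are only examples:

    # redis_socket_timeout applies to reads/writes on an established connection
    # (documented default: 120 seconds); redis_socket_connect_timeout applies to
    # establishing the TCP connection (default: None).
    app.conf.redis_socket_timeout = 120.0
    app.conf.redis_socket_connect_timeout = 10.0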

@amitlicht
Author

amitlicht commented Jul 9, 2018

@georgepsarakis I did not try redis_socket_connect_timeout, but looking at the documentation this parameter is New in version 5.0.1. As you can see in the original issue description, we encountered the issue in celery 4.2.0, which is the latest stable RC.

@georgepsarakis
Contributor

@amitlicht this is probably a typo in the documentation; the setting is available in the latest release: https://github.com/celery/celery/blob/v4.2.0/celery/backends/redis.py#L173

@amitlicht
Author

@georgepsarakis just tried it with a socket connect timeout of 10 seconds; the issue remains.
I added a print of the redis connection pool's connection_kwargs, here they are:

{u'socket_connect_timeout': 10.0, u'socket_timeout': 120.0, u'db': 0, u'host': u'127.0.0.1', u'password': u'password', u'port': 6395}
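
For anyone wanting to print the same thing, one possible way (assuming the default Redis backend and redis-py's connection pool attributes; task is a bound task object):

    # Inspect the kwargs the Redis backend actually passes to redis-py.
    backend = task._get_app().backend
    print(backend.client.connection_pool.connection_kwargs)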

@georgepsarakis
Contributor

@amitlicht thanks for trying this, I will look into this a bit further.

@georgepsarakis
Contributor

@amitlicht I found this redis-py issue, which mentions that the TCP keep-alive setting may be of use here, since it results in faster detection of the connection error.

I don't think the Redis backend supports this option, but it should not be very difficult for you to try it. If you want help, let me know.
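
A sketch of what the TCP keep-alive suggestion amounts to on a plain redis-py client (redis-py accepts socket_keepalive and socket_keepalive_options); since the backend does not expose this, it would have to be wired in manually, and the option values below are illustrative and Linux-specific:

    import socket
    from redis import StrictRedis

    client = StrictRedis(
        host='127.0.0.1', port=6379, password='password',
        socket_keepalive=True,
        socket_keepalive_options={
            socket.TCP_KEEPIDLE: 60,   # idle seconds before probes start
            socket.TCP_KEEPINTVL: 10,  # seconds between probes
            socket.TCP_KEEPCNT: 3,     # failed probes before the socket is closed
        },
    )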

@amitlicht
Author

What I don't understand is why you would need any kind of special settings to begin with. As I mentioned, I managed to circumvent the issue entirely by just initializing my own redis-py client, creating a pubsub consumer out of it, and subscribing it to the celery-task-meta-TASKID channel. No special config whatsoever; my code at the moment looks something like this:

    # Imports needed by this snippet: gevent for the background greenlet,
    # StrictRedis for our own, independent redis-py client.
    import gevent
    from redis import StrictRedis

    # `task`, `kwargs` and `execution_expiry_seconds` come from the enclosing function.
    async_result = task.apply_async(kwargs=kwargs)
    app = task._get_app()
    backend = app.backend
    results_consumer = backend.result_consumer
    task_id = async_result.task_id
    redis_client = StrictRedis(host='localhost', port=5673, password='password')
    pubsub = redis_client.pubsub(ignore_subscribe_messages=True)
    task_key = backend.get_key_for_task(task_id)
    pubsub.subscribe(task_key)

    def try_get_message():
        # See celery.backends.redis.ResultConsumer#drain_events for more details
        m = pubsub.get_message(timeout=30)
        if m and m['type'] == 'message':
            # Feed the decoded result into celery's own result consumer so that
            # async_result.get() below is released.
            results_consumer.on_state_change(results_consumer._decode_result(m['data']), m)

    def try_get_message_loop():
        while True:
            try_get_message()

    try_get_message_greenlet = gevent.spawn(try_get_message_loop)

    try:
        return async_result.get(timeout=execution_expiry_seconds)
    finally:
        try_get_message_greenlet.kill()
        pubsub.unsubscribe(task_key)
        pubsub.close()

This works as is, without any special configuration on either the server or client side, which makes me think the problem on celery's side is some special configuration to begin with...

@georgepsarakis
Contributor

@amitlicht thanks for the feedback. I just noticed that you are using gevent, which uses a different result drainer. Could the problem originate from here? What do you think?

If possible, please provide a stack trace generated when stopping a client that seems idle; that should help us locate the source of the issue.
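
One way to capture such a stack trace from a client that looks idle, without killing it, is the standard-library faulthandler module (Python 3, or the faulthandler backport package on Python 2):

    # Dump every thread's stack when the process receives SIGUSR1, so a
    # hanging .get() can be inspected with `kill -USR1 <pid>`.
    import faulthandler
    import signal

    faulthandler.register(signal.SIGUSR1, all_threads=True)

Under gevent the dump will typically end in the hub's switch frames, much like the KeyboardInterrupt trace posted further down.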

@acmisiti

Any update on this?

@georgepsarakis
Contributor

@acmisiti please check my previous comment. Can you perhaps provide a stack trace generated when stopping an idle client?

@amitlicht
Author

Providing a stack trace on an idle client is non-trivial since it's all running in a containerized environment. As I said, we went around it using a rather ugly but working hack...
As for @georgepsarakis's original suggestion - I wouldn't know, as I don't know your code. But in general, silently dropping socket.timeout errors sounds like bad practice.

@stringfellow

@georgepsarakis I can provide a stack trace for this, taken from a keyboard interrupt while running a service in a shell that has the same issue. It only affects Celery >= 4.2.0; the Kombu version doesn't seem to matter, and the last known good version is Celery == 4.1.0. That the issue is related to Redis was verified by removing the Redis results backend, after which the issue clears. Below is the stack trace with the following setup:

software -> celery:4.2.2 (windowlicker) kombu:4.3.0 py:2.7.13
            billiard:3.5.0.5 py-amqp:2.4.2
platform -> system:Linux arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:None

Stacktrace:

  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/some/package/that/works.py", line 795, in read_data
    return request.get().get()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/result.py", line 224, in get
    on_message=on_message,
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/backends/async.py", line 188, in wait_for_pending
    for _ in self._wait_for_pending(result, **kwargs):
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/celery/backends/async.py", line 257, in _wait_for_pending
    sleep(0)
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 167, in sleep
    waiter.get()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 899, in get
    return self.hub.switch()
  File "/opt/conda/envs/the_service/lib/python2.7/site-packages/gevent/hub.py", line 627, in switch
    switch_out = getattr(getcurrent(), 'switch_out', None)
KeyboardInterrupt

It will just wait forever until the service is killed.

@auvipy
Member

auvipy commented Mar 21, 2019

@stringfellow could you kindly try this with celery 4.3rc3 once?

@stringfellow

@auvipy not easily... after finding this issue and posting the stack trace I removed the Redis backend (this is in production and we weren't using the results anyway). I'm away from work until early April now, so I'm unlikely to have time to do it. Are you not able to reproduce the issue in testing? (I couldn't reproduce it on my local Docker stack.)

@auvipy auvipy added this to the 4.4.x milestone May 1, 2020
@auvipy
Member

auvipy commented May 1, 2020

update?

@auvipy auvipy removed this from the 4.4.x milestone Feb 18, 2021
@auvipy
Member

auvipy commented Feb 18, 2021

does updating gevent fix this in master?

@auvipy auvipy added this to the 5.3 milestone Oct 31, 2021
@Nusnus Nusnus modified the milestones: 5.3, Future Feb 19, 2023