
fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers #9371

Open · wants to merge 21 commits into base: main
Conversation


@linusphan linusphan commented Oct 21, 2024

Fixes #4857

Description

This PR fixes an issue where an error raised inside the greenlet spawned by the greenlet drainer causes the drainer to stop retrieving task results. Currently such errors are only logged, which makes them hard for clients to handle: a client may wait indefinitely for results that will never be fetched, because the drainer's greenlet has already stopped running. This change re-raises the error to the waiting client, enabling it to respond appropriately, for example by exiting or restarting the process.
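
As a rough illustration of the intended behavior (a minimal sketch with hypothetical names, not the actual diff): the spawned greenlet records any exception it dies with, and the waiting side re-raises it instead of treating it as a log-only event.

    import gevent


    class IllustrativeDrainer:
        """Hypothetical stand-in for a greenlet-based drainer; not Celery's real class."""

        def __init__(self, drain_once):
            self._drain_once = drain_once  # callable that pulls results from the backend
            self._error = None
            self._greenlet = gevent.spawn(self._run)

        def _run(self):
            try:
                while True:
                    self._drain_once()
                    gevent.sleep(0)
            except Exception as exc:  # e.g. backend connection retries exhausted
                # Before the fix this kind of error was only logged; remembering it
                # lets the waiting side re-raise it instead of blocking forever.
                self._error = exc

        def wait(self, predicate, poll_interval=0.1):
            # Block until predicate() is true, or re-raise the drainer's error.
            while not predicate():
                if self._error is not None:
                    raise self._error
                gevent.sleep(poll_interval)

With this shape, a caller blocked on a result sees, for example, the backend's connection error instead of waiting forever.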

Here are the steps we took in our testing to reproduce the issue (a minimal reproduction sketch follows the list):

  1. Start up Redis.
  2. Start an API service (acting as the Celery client) that uses gevent workers, so it can handle multiple concurrent connections.
  3. The API has an endpoint that enqueues a task and waits for its result (via the Redis result backend) before sending a response back to the API caller.
  4. Don't start any Celery workers yet.
  5. Reduce the maximum connection retries so that the spawned greenlet raises an error sooner. This can be set in the Celery app configuration, for example:
    result_backend_transport_options={
        "retry_policy": {
            "max_retries": 1,
        }
    },
  6. Enqueue a task successfully using the task's delay method.
  7. Stop Redis and confirm that, after all connection attempts are exhausted, a message is eventually logged indicating that Celery must be restarted.
  8. Turn Redis back on.
  9. Start a worker and observe that it receives and processes the task; however, the API request is still hanging.
  10. Cancel the initial request, or leave it and make a new one: the worker receives and processes it successfully, yet the API client still hangs because the drainer's spawned greenlet has stopped, requiring a Celery restart.
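
To make the setup above concrete, here is a rough sketch of the client side. The Flask app, module name, Redis URLs, and running it under gevent workers (for example gunicorn's gevent worker class) are assumptions for illustration, not part of the PR:

    # repro_app.py -- hypothetical reproduction app.
    from celery import Celery
    from flask import Flask

    celery_app = Celery(
        "repro",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )
    celery_app.conf.update(
        result_backend_transport_options={
            "retry_policy": {
                "max_retries": 1,  # fail fast so the drainer greenlet errors sooner
            }
        },
    )


    @celery_app.task
    def add(x, y):
        return x + y


    flask_app = Flask(__name__)


    @flask_app.route("/add")
    def add_endpoint():
        result = add.delay(2, 2)
        # Before the fix, if Redis goes away long enough for the drainer greenlet
        # to die, this call can block even after Redis and a worker come back;
        # with the fix it should raise instead.
        return str(result.get(timeout=30))

Serving this app with gevent workers and then following steps 6-10 above reproduces the hang this PR addresses.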

linusphan and others added 3 commits October 21, 2024 09:51
…anual restart

Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
@Nusnus Nusnus self-requested a review October 21, 2024 23:27

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.30%. Comparing base (3257487) to head (6f3a280).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
celery/backends/asynchronous.py | 88.09% | 4 Missing and 1 partial ⚠️
celery/backends/redis.py | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9371      +/-   ##
==========================================
+ Coverage   78.24%   78.30%   +0.06%     
==========================================
  Files         153      153              
  Lines       19040    19056      +16     
  Branches     2520     2523       +3     
==========================================
+ Hits        14898    14922      +24     
+ Misses       3856     3848       -8     
  Partials      286      286              
Flag Coverage Δ
unittests 78.28% <84.09%> (+0.06%) ⬆️


Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
@linusphan linusphan force-pushed the fix-result-backend-connection-error-handling branch from 13e673e to 0cfb89d on October 21, 2024 23:43
@Nusnus Nusnus (Member) left a comment


Please add tests to make sure the changes are behaving as expected 🙏

Thank you

@linusphan linusphan marked this pull request as draft October 22, 2024 01:08
@linusphan linusphan (Author) commented Oct 22, 2024

Please add tests to make sure the changes are behaving as expected 🙏

Thank you

Hi @Nusnus, thank you for looking at this PR. 🙏 I'm new here, and Python isn't my primary language, so I may have rushed this. Let me make sure I can run this locally and add the tests. While I look into that, would you be able to review the approach and confirm whether these changes are the right behavior?

linusphan and others added 4 commits October 22, 2024 08:28
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
@linusphan linusphan changed the title fix: prevent infinite loop from result backend connection errors in greenlet drainers fix: prevent celery from hanging due to errors stopping greenlet drainers, causing clients to wait indefinitely for task results Oct 23, 2024
@linusphan linusphan requested a review from Nusnus October 23, 2024 16:51
@linusphan linusphan marked this pull request as ready for review October 23, 2024 16:52
@linusphan (Author)

Hi @Nusnus, I'm noticing that two smoke tests are failing in some environments but succeeding in others here. Are you able to help me understand whether those tests are flaky?

@Nusnus Nusnus (Member) commented Oct 23, 2024

Hi @Nusnus, I'm noticing that two smoke tests are failing in some environments but succeeding in others here. Are you able to help me understand whether those tests are flaky?

Flaky.
Rerunning.

@linusphan linusphan changed the title fix: prevent celery from hanging due to errors stopping greenlet drainers, causing clients to wait indefinitely for task results fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers Oct 23, 2024
linusphan and others added 4 commits October 23, 2024 18:31
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
…est_EventletDrainer

Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
@linusphan (Author)

Quick update: we think the PR is ready for a first-pass review. We won't be pushing any more commits at the moment.

We also suspect the code coverage check is cached or stale: we're seeing a warning stating that "line #L150 was not covered by tests" in celery/backends/asynchronous.py#L150, even though we can confirm there are tests that fail if we update that line to throw an arbitrary exception.

The only uncovered part is the code this discussion thread points at. Happy to add tests if we reach a conclusion about the specific error and whether this is the approach we'd like to take here.

@mothershipper

Hey @Nusnus don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but would love to upstream it (or get this into a state that you'd be willing to take on). Thanks!

@Nusnus Nusnus (Member) commented Nov 9, 2024

Hey @Nusnus don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but would love to upstream it (or get this into a state that you'd be willing to take on). Thanks!

I'll give it another look this week, let's see what I can do to help

@auvipy auvipy (Member) left a comment


Can you also add some integration tests for the change, please?

@mothershipper

@auvipy / @thedrow happy to take a look at the smoke tests, but is there anything in particular you're looking for coverage on?

The only behavior change should be for users who are already on a non-happy path; the happy path shouldn't have changed. My gut says this may be tough to test without stubbing, as it'd require killing Redis from within the testing process (and then restarting it for other tests?). Let me know your thoughts; we'll still take a look today either way!
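
For illustration only, a smoke-style test along the lines discussed above might look roughly like the following. The container name, the use of the Docker SDK, and the broad pytest.raises(Exception) are assumptions for the sketch, not the project's actual test harness, and it presumes a gevent/eventlet environment where the greenlet drainer is in use:

    # Hypothetical smoke-style test sketch; assumes a Redis container named
    # "redis-backend" serving localhost:6379 and the docker Python SDK installed.
    import docker
    import pytest
    from celery import Celery
    from celery.exceptions import TimeoutError as CeleryTimeoutError

    app = Celery(
        "smoke_sketch",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )
    app.conf.update(
        result_backend_transport_options={"retry_policy": {"max_retries": 1}},
    )


    @app.task
    def add(x, y):
        return x + y


    def test_get_raises_instead_of_hanging():
        client = docker.from_env()
        redis_container = client.containers.get("redis-backend")  # assumed name

        # Enqueue while Redis is still up; no worker is running yet.
        result = add.delay(2, 2)

        # Kill the result backend so the drainer's spawned greenlet errors out.
        redis_container.stop()
        try:
            with pytest.raises(Exception) as excinfo:
                result.get(timeout=30)
            # With the fix, the drainer's connection error should surface here,
            # not merely the 30-second TimeoutError from get() itself.
            assert not isinstance(excinfo.value, CeleryTimeoutError)
        finally:
            redis_container.start()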

@mothershipper

I stand corrected; it does look like you already have support for initializing/killing containers in the smoke tests -- we'll see what we can do :)

@Nusnus Nusnus (Member) commented Nov 19, 2024

I stand corrected; it does look like you already have support for initializing/killing containers in the smoke tests -- we'll see what we can do :)

Check out the pytest-celery docs for the smoke tests: https://pytest-celery.readthedocs.io

@Nusnus Nusnus (Member) commented Dec 17, 2024

CI issues fixed; 100% passing now.

@auvipy auvipy (Member) left a comment


Can we improve the test coverage?

@mothershipper

@auvipy I've been struggling to get the smoke tests to run locally, and I don't want to spam you all or force CI to run broken tests while we figure it out. I've got a bit of time over the holiday break to make a second attempt, but I think we may be close to (or hitting) our limit in terms of being able to contribute effectively here :/

For what it's worth, I don't believe the codecov comment on this PR is accurate; I think it was cached from earlier commits to the branch that didn't add tests.

@Nusnus Nusnus (Member) commented Dec 23, 2024

@auvipy I've been struggling to get the smoke tests to run locally

What issues are you having?
Maybe I can help

@mothershipper

@Nusnus how do you all recommend running the smoke tests locally? I was kind of hoping there'd be a make smoke target or equivalent to build and boot Docker with the right flags to run the tests. I don't think anything here is insurmountable; we're just running up against the time we can allocate to this right now.

Where I left off was hitting permission issues in the container when invoking tox; I need to figure out which directories must be mounted as writable and hope that's the last blocker to execution. I haven't spent much time trying to get the tests to run outside Docker, but it seems like some of the deps are missing prebuilt wheels for Apple silicon and need extra system deps -- I didn't really want to install anything into my environment unless really necessary.

@auvipy auvipy added this to the 5.5.1 milestone Feb 5, 2025
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Redis results backend: apply_async().get() hangs forever after disconnection from redis-server
4 participants