
fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers #9371

Open · wants to merge 21 commits into base: main
Conversation


@linusphan linusphan commented Oct 21, 2024

Fixes #4857

Description

This PR fixes an issue where an error raised inside the greenlet spawned by the greenlet drainer causes the drainer to stop retrieving task results. Currently such errors are only logged, which makes them hard for clients to handle: a client may wait indefinitely for results that will never be fetched, because the drainer's greenlet has already stopped running. This change re-raises the error to the waiting client, enabling it to respond appropriately, for example by exiting or restarting the process.
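
As a rough illustration of the intended behavior (a minimal sketch with hypothetical names, not the actual diff): the spawned greenlet records any exception it dies with, and the waiting side re-raises it instead of treating it as a log-only event.

    import gevent


    class IllustrativeDrainer:
        """Hypothetical stand-in for a greenlet-based drainer; not Celery's real class."""

        def __init__(self, drain_once):
            self._drain_once = drain_once  # callable that pulls results from the backend
            self._error = None
            self._greenlet = gevent.spawn(self._run)

        def _run(self):
            try:
                while True:
                    self._drain_once()
                    gevent.sleep(0)
            except Exception as exc:  # e.g. backend connection retries exhausted
                # Before the fix this kind of error was only logged; remembering it
                # lets the waiting side re-raise it instead of blocking forever.
                self._error = exc

        def wait(self, predicate, poll_interval=0.1):
            # Block until predicate() is true, or re-raise the drainer's error.
            while not predicate():
                if self._error is not None:
                    raise self._error
                gevent.sleep(poll_interval)

With this shape, a caller blocked on a result sees, for example, the backend's connection error instead of waiting forever.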

Here are the steps we took in our testing to reproduce the issue (a minimal reproduction sketch follows the list):

  1. Start up Redis.
  2. Start an API service (acting as the Celery client) that uses gevent workers, so it can handle multiple concurrent connections.
  3. The API has an endpoint that enqueues a task and waits for its result (via the Redis result backend) before sending a response back to the API caller.
  4. Don't start any Celery workers yet.
  5. Reduce the maximum connection retries so that the spawned greenlet raises an error sooner. This can be set in the Celery app configuration, for example:
    result_backend_transport_options={
        "retry_policy": {
            "max_retries": 1,
        }
    },
  6. Enqueue a task successfully using the task's delay method.
  7. Stop Redis and confirm that, after all connection attempts are exhausted, a message is eventually logged indicating that Celery must be restarted.
  8. Turn Redis back on.
  9. Start a worker and observe that it receives and processes the task; however, the API request is still hanging.
  10. Cancel the initial request, or leave it and make a new one: the worker receives and processes it successfully, yet the API client still hangs because the drainer's spawned greenlet has stopped, requiring a Celery restart.
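
To make the setup above concrete, here is a rough sketch of the client side. The Flask app, module name, Redis URLs, and running it under gevent workers (for example gunicorn's gevent worker class) are assumptions for illustration, not part of the PR:

    # repro_app.py -- hypothetical reproduction app.
    from celery import Celery
    from flask import Flask

    celery_app = Celery(
        "repro",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )
    celery_app.conf.update(
        result_backend_transport_options={
            "retry_policy": {
                "max_retries": 1,  # fail fast so the drainer greenlet errors sooner
            }
        },
    )


    @celery_app.task
    def add(x, y):
        return x + y


    flask_app = Flask(__name__)


    @flask_app.route("/add")
    def add_endpoint():
        result = add.delay(2, 2)
        # Before the fix, if Redis goes away long enough for the drainer greenlet
        # to die, this call can block even after Redis and a worker come back;
        # with the fix it should raise instead.
        return str(result.get(timeout=30))

Serving this app with gevent workers and then following steps 6-10 above reproduces the hang this PR addresses.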

linusphan and others added 3 commits October 21, 2024 09:51
…anual restart

Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
@Nusnus Nusnus self-requested a review October 21, 2024 23:27

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.30%. Comparing base (3257487) to head (6f3a280).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
celery/backends/asynchronous.py | 88.09% | 4 Missing and 1 partial ⚠️
celery/backends/redis.py | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9371      +/-   ##
==========================================
+ Coverage   78.24%   78.30%   +0.06%     
==========================================
  Files         153      153              
  Lines       19040    19056      +16     
  Branches     2520     2523       +3     
==========================================
+ Hits        14898    14922      +24     
+ Misses       3856     3848       -8     
  Partials      286      286              
Flag Coverage Δ
unittests 78.28% <84.09%> (+0.06%) ⬆️


Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
@linusphan linusphan force-pushed the fix-result-backend-connection-error-handling branch from 13e673e to 0cfb89d on October 21, 2024 23:43
@Nusnus Nusnus (Member) left a comment


Please add tests to make sure the changes are behaving as expected 🙏

Thank you

@linusphan linusphan marked this pull request as draft October 22, 2024 01:08
@linusphan linusphan (Author) commented Oct 22, 2024

Please add tests to make sure the changes are behaving as expected 🙏

Thank you

Hi @Nusnus, thank you for looking at this PR. 🙏 I'm new here, and Python isn't my primary language, so I may have rushed this. Let me make sure I can run this locally and add the tests. While I look into that, would you be able to review the approach and confirm whether these changes are the right behavior?

linusphan and others added 4 commits October 22, 2024 08:28
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
@linusphan linusphan changed the title fix: prevent infinite loop from result backend connection errors in greenlet drainers fix: prevent celery from hanging due to errors stopping greenlet drainers, causing clients to wait indefinitely for task results Oct 23, 2024
@linusphan linusphan requested a review from Nusnus October 23, 2024 16:51
@linusphan linusphan marked this pull request as ready for review October 23, 2024 16:52
@linusphan (Author)

Hi @Nusnus, I'm noticing that two smoke tests are failing in some environments but succeeding in others here. Are you able to help me understand whether those tests are flaky?

@Nusnus Nusnus (Member) commented Oct 23, 2024

Hi @Nusnus, I'm noticing that two smoke tests are failing in some environments but succeeding in others here. Are you able to help me understand whether those tests are flaky?

Flaky.
Rerunning.

@linusphan linusphan changed the title fix: prevent celery from hanging due to errors stopping greenlet drainers, causing clients to wait indefinitely for task results fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers Oct 23, 2024
linusphan and others added 4 commits October 23, 2024 18:31
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
…est_EventletDrainer

Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
Co-authored-by: Jack <57678801+mothershipper@users.noreply.github.com>
Co-authored-by: Linus Phan <13613724+linusphan@users.noreply.github.com>
@linusphan (Author)

Quick update: we think the PR is ready for a first-pass review. We won't be pushing any more commits at the moment.

We also suspect the code coverage check is cached or stale: we're seeing a warning stating that "line #L150 was not covered by tests" in celery/backends/asynchronous.py#L150, even though we can confirm there are tests that fail if we update that line to throw an arbitrary exception.

The only uncovered part is the code this discussion thread points at. Happy to add tests if we reach a conclusion about the specific error and whether this is the approach we'd like to take here.

@mothershipper

Hey @Nusnus don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but would love to upstream it (or get this into a state that you'd be willing to take on). Thanks!

@Nusnus Nusnus (Member) commented Nov 9, 2024

Hey @Nusnus don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but would love to upstream it (or get this into a state that you'd be willing to take on). Thanks!

I'll give it another look this week, let's see what I can do to help

@auvipy auvipy (Member) left a comment


Can you also add some integration tests for the change, please?

@mothershipper

@auvipy / @thedrow happy to take a look at the smoke tests, but is there anything in particular you're looking for coverage on?

The only behavior change should be for users who are already on a non-happy path; the happy path shouldn't have changed. My gut says this may be tough to test without stubbing, as it'd require killing Redis from within the testing process (and then restarting it for other tests?). Let me know your thoughts; we'll still take a look today either way!
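
For illustration only, a smoke-style test along the lines discussed above might look roughly like the following. The container name, the use of the Docker SDK, and the broad pytest.raises(Exception) are assumptions for the sketch, not the project's actual test harness, and it presumes a gevent/eventlet environment where the greenlet drainer is in use:

    # Hypothetical smoke-style test sketch; assumes a Redis container named
    # "redis-backend" serving localhost:6379 and the docker Python SDK installed.
    import docker
    import pytest
    from celery import Celery
    from celery.exceptions import TimeoutError as CeleryTimeoutError

    app = Celery(
        "smoke_sketch",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )
    app.conf.update(
        result_backend_transport_options={"retry_policy": {"max_retries": 1}},
    )


    @app.task
    def add(x, y):
        return x + y


    def test_get_raises_instead_of_hanging():
        client = docker.from_env()
        redis_container = client.containers.get("redis-backend")  # assumed name

        # Enqueue while Redis is still up; no worker is running yet.
        result = add.delay(2, 2)

        # Kill the result backend so the drainer's spawned greenlet errors out.
        redis_container.stop()
        try:
            with pytest.raises(Exception) as excinfo:
                result.get(timeout=30)
            # With the fix, the drainer's connection error should surface here,
            # not merely the 30-second TimeoutError from get() itself.
            assert not isinstance(excinfo.value, CeleryTimeoutError)
        finally:
            redis_container.start()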

@mothershipper

I stand corrected; it does look like you already have support for initializing/killing containers in the smoke tests -- we'll see what we can do :)

@Nusnus Nusnus (Member) commented Nov 19, 2024

I stand corrected; it does look like you already have support for initializing/killing containers in the smoke tests -- we'll see what we can do :)

Check out the pytest-celery docs for the smoke tests: https://pytest-celery.readthedocs.io

@Nusnus Nusnus (Member) commented Dec 17, 2024

CI issues fixed; 100% passing now.

@auvipy auvipy (Member) left a comment


Can we improve the test coverage?

@mothershipper

@auvipy I've been struggling to get the smoke tests to run locally, and I don't want to spam you all or force CI to run broken tests while we figure it out. I've got a bit of time over the holiday break to make a second attempt, but I think we may be close to (or hitting) our limit in terms of being able to contribute effectively here :/

For what it's worth, I don't believe the codecov comment on this PR is accurate; I think it was cached from earlier commits to the branch that didn't add tests.

@Nusnus Nusnus (Member) commented Dec 23, 2024

@auvipy I've been struggling to get the smoke tests to run locally

What issues are you having?
Maybe I can help

@mothershipper

@Nusnus how do you all recommend running the smoke tests locally? I was kind of hoping there'd be a make smoke target or equivalent to build and boot Docker with the right flags to run the tests. I don't think anything here is insurmountable; we're just running up against the time we can allocate to this right now.

Where I left off was hitting permission issues in the container when invoking tox; I need to figure out which directories must be mounted as writable and hope that's the last blocker to execution. I haven't spent much time trying to get the tests to run outside Docker, but it seems like some of the deps are missing prebuilt wheels for Apple silicon and need extra system deps -- I didn't really want to install anything into my environment unless really necessary.

@auvipy auvipy added this to the 5.5.1 milestone Feb 5, 2025
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Redis results backend: apply_async().get() hangs forever after disconnection from redis-server
4 participants