Cannot cleanly shut down an asyncio based server #113538


Closed
CendioOssman opened this issue Dec 28, 2023 · 33 comments
Labels
topic-asyncio, type-bug (An unexpected behavior, bug, or error)

Comments

@CendioOssman
Contributor

CendioOssman commented Dec 28, 2023

Bug report

Bug description:

When writing an asyncio based service, you basically have this sequence:

  1. Create an event loop
  2. Register a SIGTERM handler
  3. Start your server
  4. loop.run_forever()
  5. SIGTERM causes a loop.stop()
  6. Close the server
  7. Close event loop

If there are any connections active at this point, then they don't get discarded until interpreter shutdown, with the result that you get a bunch of ResourceWarnings (and cleanup code might not run).
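
In code, that sequence looks roughly like this (a minimal sketch; loop.add_signal_handler() is Unix-only and error handling is omitted):

import asyncio
import signal

async def handle(reader, writer):
    ...  # per-connection logic

def main():
    loop = asyncio.new_event_loop()                       # 1. create an event loop
    loop.add_signal_handler(signal.SIGTERM, loop.stop)    # 2./5. SIGTERM causes loop.stop()
    server = loop.run_until_complete(
        asyncio.start_server(handle, '127.0.0.1', 8888))  # 3. start the server
    try:
        loop.run_forever()                                # 4. run until stopped
    finally:
        server.close()                                    # 6. stop accepting connections
        loop.close()                                      # 7. active clients are left dangling

main()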

It would be very useful if there were a Server.close_clients() or something like that. Even a Server.all_transports() would be useful, as you could then do something similar to calling Task.cancel() on each task from loop.all_tasks().

We could poke at Server._transports, but that is something internal that might change in the future.

There is Server.wait_closed(), but that hangs until all clients have gracefully disconnected. It doesn't help us when we want to shut down the service now.
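
In other words, the wished-for shutdown would read something like this (hypothetical API at this point in the thread):

server.close()              # stop accepting new connections
server.close_clients()      # hypothetical: close every accepted transport
await server.wait_closed()  # would no longer hang on unresponsive clients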

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Linked PRs

CendioOssman added the type-bug label Dec 28, 2023
@CendioOssman
Contributor Author

This might be a symptom of a more general issue: there is a need to be able to cancel everything that depends on the event loop we are shutting down.

Currently, it is possible to cancel tasks, async generators, and executors. The issue here is (sort-of) about shutting down pending file descriptors. But it might also be necessary to be able to cancel pending futures, and possibly more things?

@gvanrossum
Member

I've always assumed that it is up to the application to keep track of connections (etc.) in a form that is suitable for shutting down. After all, in a typical application there's more to clean up than the client connections -- there may be database connections, and who knows what else to handle.

I would definitely not mess with pending futures -- those are sufficiently low-level that you cannot know who is waiting for them and why, and in which order they should be cancelled.

The key point of control you have is probably cancelling tasks -- tasks can do cleanup in response to a cancellation request, and if your tasks form a tree (or even a DAG), they will be cancelled recursively (if task t1 is waiting for task t2, cancelling t1 first cancels t2). So perhaps your application should keep track of the main task per connection?
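
For instance, a small demo of that recursive cancellation (illustrative, not from the issue):

import asyncio

async def inner():
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        print("inner got to clean up")    # cleanup code runs on cancellation
        raise

async def outer():
    await asyncio.create_task(inner())    # outer (t1) waits on inner (t2)

async def main():
    t1 = asyncio.create_task(outer())
    await asyncio.sleep(0.1)              # let both tasks start
    t1.cancel()                           # cancelling t1 also cancels inner
    await asyncio.gather(t1, return_exceptions=True)

asyncio.run(main())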

@gvanrossum
Member

gvanrossum commented Dec 28, 2023

Another thought is that this might be something that a higher-level framework or library could offer. E.g. gunicorn or aio-http? (Note that I've never used either, so I may be way off base here.)

@CendioOssman
Contributor Author

> I've always assumed that it is up to the application to keep track of connections (etc.) in a form that is suitable for shutting down. After all, in a typical application there's more to clean up than the client connections -- there may be database connections, and who knows what else to handle.

asyncio unfortunately makes this difficult, as it is the one in control of creating the connections. It is done by the Server class, which you cannot override or extend. At best, you can try to track things in the protocol factory callback. But that requires you to also have handling in each protocol class, which is messy and fragile.

Clean up should hopefully be possible using the normal signals from the transports, i.e. connection_lost().
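
Concretely, that tracking ends up looking something like this (a sketch; the transports set is the application's own bookkeeping, and every protocol class has to participate):

import asyncio

transports = set()

class TrackedProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport
        transports.add(transport)             # track every new connection

    def connection_lost(self, exc):
        transports.discard(self.transport)    # the normal cleanup signal

def abort_clients():
    # at shutdown, forcefully drop whatever is still connected
    for transport in list(transports):
        transport.abort()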

> I would definitely not mess with pending futures -- those are sufficiently low-level that you cannot know who is waiting for them and why, and in which order they should be cancelled.

You are probably right. But it does make me uneasy that there isn't a way to centrally force a clean-up on shutdown. If you have to build your own everywhere, then it's easy to make a mistake and overlook something.

> The key point of control you have is probably cancelling tasks -- tasks can do cleanup in response to a cancellation request, and if your tasks form a tree (or even a DAG), they will be cancelled recursively (if task t1 is waiting for task t2, cancelling t1 first cancels t2). So perhaps your application should keep track of the main task per connection?

Unfortunately, there isn't always a task involved. Sometimes there is just a protocol, or a construct using futures and callbacks.

> Another thought is that this might be something that a higher-level framework or library could offer. E.g. gunicorn or aio-http? (Note that I've never used either, so I may be way off base here.)

Maybe. Those two specific projects are not sufficient, though, as they are just for HTTP.

@gvanrossum
Member

Okay, let's dive into the original problem a bit more. I think I would like to see a "scaled-down" example of a realistic server that we might be able to use to demonstrate the problem, reason about it, and evaluate potential solutions. It should be small enough that I can understand the code in 5-10 minutes (my typical attention span :-). Would you be willing to code up a first attempt at such an example? We can iterate on it together.

Regarding Futures, these are used all over the place (internally in asyncio) with specific semantics, sometimes involving cancellation. If we were to start cancelling Futures automatically when a loop is stopped that could have all sorts of unexpected consequences -- probably worse than leaving them alone. So let's declare this out of scope (or if it really bugs you, please open a separate issue, as it has nothing to do with closing servers).

@CendioOssman
Contributor Author

This should be a rough mockup of the situation:

#!/usr/bin/python3

import asyncio

class EchoServerProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        peername = transport.get_extra_info('peername')
        print('Connection from {}'.format(peername))
        self.transport = transport

    def data_received(self, data):
        message = data.decode()
        print('Data received: {!r}'.format(message))

        print('Send: {!r}'.format(message))
        self.transport.write(data)

        print('Close the client socket')
        self.transport.close()

def stop():
    loop = asyncio.get_running_loop()
    loop.stop()

def server_up(fut):
    global server
    server = fut.result()

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

fut = asyncio.ensure_future(loop.create_server(
        lambda: EchoServerProtocol(),
        '127.0.0.1', 8888))
fut.add_done_callback(server_up)

# Simulate termination
loop.call_later(2.5, stop)

try:
    loop.run_forever()
finally:
    server.close()
    # Can't run this as it will hang:
    #loop.run_until_complete(server.wait_closed())
    loop.close()

If you run this in dev mode and connect a client, you see the issue:

$ python3.12 -X dev echosrv.py 
Connection from ('127.0.0.1', 56290)
sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('127.0.0.1', 8888), raddr=('127.0.0.1', 56290)>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib64/python3.12/asyncio/selector_events.py:875: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback

@jsundahl

jsundahl commented Jan 2, 2024

This wasn't a problem until 3.11.5. Here's a simpler example of what I think is the same root issue:

#!/usr/bin/env python3

import asyncio

async def main():
    async def on_connection(reader, writer):
        while True:
            buf = await reader.read(1024)
            print("got bytes", buf)


    server = await asyncio.start_unix_server(on_connection, path="/tmp/sock")
    cr, cw = await asyncio.open_unix_connection(path="/tmp/sock")
    print("sending hi")
    cw.write(b"hi")
    await cw.drain()

    await asyncio.sleep(0.2)

    server.close()
    await server.wait_closed()


asyncio.run(main())

Then, using pyenv to set the Python version and running it, we get the following:

$ echo "3.11.4" > .python-version 
$ ./t.py 
sending hi
got bytes b'hi'

I have yet to look at the underlying code but I'm guessing the server implementation in <=3.11.4 is cancelling the reader.read coroutine and ignoring the cancellation error as I'd expect.

$ echo "3.11.5" > .python-version 
$ ./t.py 
sending hi
got bytes b'hi'
Exception ignored in: <function StreamWriter.__del__ at 0x7fe69d0f2c00>
Traceback (most recent call last):
  File "/home/jsundahl/.pyenv/versions/3.11.5/lib/python3.11/asyncio/streams.py", line 396, in __del__
    self.close()
  File "/home/jsundahl/.pyenv/versions/3.11.5/lib/python3.11/asyncio/streams.py", line 344, in close
    return self._transport.close()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jsundahl/.pyenv/versions/3.11.5/lib/python3.11/asyncio/selector_events.py", line 860, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "/home/jsundahl/.pyenv/versions/3.11.5/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
    self._check_closed()
  File "/home/jsundahl/.pyenv/versions/3.11.5/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

Skipping 3.11.6, as it's identical to 3.11.5.
3.11.7 is slightly different:

$ echo "3.11.7" > .python-version 
$ ./t.py 
sending hi
got bytes b'hi'
Exception in callback StreamReaderProtocol.connection_made.<locals>.callback(<Task cancell...oon/./t.py:6>>) at /home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/streams.py:248
handle: <Handle StreamReaderProtocol.connection_made.<locals>.callback(<Task cancell...oon/./t.py:6>>) at /home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/streams.py:248>
Traceback (most recent call last):
  File "/home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/streams.py", line 249, in callback
    exc = task.exception()
          ^^^^^^^^^^^^^^^^
  File "/home/jsundahl/scripts/callsoon/./t.py", line 8, in on_connection
    buf = await reader.read(1024)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/streams.py", line 708, in read
    await self._wait_for_data('read')
  File "/home/jsundahl/.pyenv/versions/3.11.7/lib/python3.11/asyncio/streams.py", line 540, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

then 3.12.0 and 3.12.1 hang indefinitely on wait_closed and have the same issue in StreamWriter.__del__ that 3.11.5 has on SIGINT.

@jsundahl

jsundahl commented Jan 2, 2024

Looks like this is related to #109538.

@gvanrossum
Member

gvanrossum commented Jan 3, 2024

With @CendioOssman's example (connecting a client that doesn't send anything, just hangs) I get different results -- at first I thought I saw what they saw, but then I couldn't repro it again. Weird, let's say that's my fault.

With @jsundahl's example I get exactly what they got; the main branch behaves the same as 3.12. Let me do some spelunking.

@gvanrossum
Member

A bit more...

  • In 3.10 and before, server.wait_closed() was a no-op, unless you called it before server.close(), in a task (asyncio.create_task(server.wait_closed())). The unclosed connection was just getting abandoned.
  • I attempted to fix this in 3.12.0, and backported the fix (I think this ended up in 3.11.5).
  • Something was wrong with the fix and an improved fix landed in 3.12.1. I think this is what you have in 3.11.7 (3.11.5 regression: StreamWriter.__del__ fails if event loop is already closed #109538).

Alas, the fix waits until all handlers complete, which in this example is never. Hence the hang. I guess the correct way to write the handler is to put a timeout on the await reader.read(1024) call.
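
For the handler in the example above, that would mean something like this (a sketch; the 5 second limit is arbitrary, and asyncio.timeout() requires 3.11+):

async def on_connection(reader, writer):
    try:
        while True:
            async with asyncio.timeout(5):    # don't wait forever on a silent peer
                buf = await reader.read(1024)
            if not buf:                       # EOF: peer closed the connection
                break
            print("got bytes", buf)
    except TimeoutError:
        pass                                  # idle too long, give up
    finally:
        writer.close()
        await writer.wait_closed()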

I have to think about this more, but I'm out of time for today. Sorry.

@CendioOssman
Contributor Author

I think the use case for server.wait_closed() is perfectly reasonable for some scenarios. So I don't think we should change that. If you have the option of trusting your clients to finish in a reasonable time, then it works great. What's missing is a method for when you don't trust your clients.

E.g. a clean version of:

    for fd, transport in loop._transports.items():
        transport.abort()
    loop.run_until_complete(server.wait_closed())

@gvanrossum
Member

One workaround: if you're happy with the old (pre-3.11) behavior, just delete the wait_closed() call. It used to be a no-op, and apparently you were happy with that, so you ought to be happy with not calling it at all. :-)

If you don't trust your clients, you can just exit the process without prejudice, right? Perhaps after a timeout. Any warnings or errors getting logged at that point will help you track down the unreliable bits.

I am reluctant to try backporting any more fixes to 3.12 or 3.11, given how sketchy this has been already.

I don't think we can afford to keep track of all transports -- those are owned by the application, not by the loop. However, you could (after a delay) enumerate all tasks and cancel them. That should clean things up.
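
E.g. something along these lines (a sketch; run it as the last thing before closing the loop):

import asyncio

async def cancel_remaining_tasks():
    # cancel everything still pending on this loop, except ourselves
    current = asyncio.current_task()
    tasks = [t for t in asyncio.all_tasks() if t is not current]
    for t in tasks:
        t.cancel()
    # give the cancelled tasks a chance to run their cleanup handlers
    await asyncio.gather(*tasks, return_exceptions=True)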

@jsundahl

jsundahl commented Jan 3, 2024

@gvanrossum that's only kind of a solution for 3.12. For 3.11 there's still that "event loop is closed" issue. It seems to me like it should be feasible for the server to own its "on_connection" tasks and cancel them on close.

@jsundahl

jsundahl commented Jan 3, 2024

Some way to forcibly cancel this task, defined here:

self._task = self._loop.create_task(res)

@gvanrossum
Member

Okay, now I see one improvement: the task.exception() call in callback() inside StreamReaderProtocol.connection_made() should be guarded by if not task.cancelled(), so that when the task is cancelled, the callback (which was added purely to log the error) doesn't crash and cause more errors to be logged.
This is as simple as adding this at the top of the callback:

                    if task.cancelled():
                        transport.close()
                        return

and this can be backported to 3.12 and 3.11.

I think you will find that with this change, even when that task lingers because the connection is never closed, it doesn't cause any problems.

Shall I proceed with that?

@kumaraditya303 I suppose this was an oversight in your original PR #111601 ?

@jsundahl

jsundahl commented Jan 3, 2024

Yeah that's definitely an improvement that could be made. I don't think that fixes the hang in python 3.12 though, but maybe that's not something to take on for 3.12. I'm going to try to find some time in the next couple days to dig in a bit deeper and get a dev environment set up to play around with this issue.

@gvanrossum
Member

I'm actually getting cold feet about that fix. The task exists because asyncio.streams.start_server() takes a client_connected_cb which may be either a plain function or a coroutine; if it's a coroutine, it's run in its own task.

Not much is said about the case where client_connected_cb fails. If it's a plain function, it will cause connection_made to fail. It looks like connection_made methods of protocols in general are called using loop.call_soon(), so that means that errors are logged but nothing else is done about it. The transport may remain in a bad state?

If it's a coroutine (i.e., async def), it's wrapped in a task. I think #111601 added the callback so that something is logged when the coroutine fails (causing the task to exit with an exception); issue #110894 describes a scenario where the client_connected_cb is a coroutine that always fails immediately, and the transport is never closed. The fix retrieves the exception, logs it, and closes the transport; if there's no exception, it logs nothing and leaves the transport open.

But what should happen if the task is cancelled? The current state of affairs is that the task.exception() call raises CancelledError, the callback then raises that, something logs that, and the transport remains open. I would like to avoid the spurious logged exception, but should it close the transport or not?

@gvanrossum
Member

FWIW the hang is because the callback never returns and the (toy) application has neither a timeout on its reads nor another way to make it exit its infinite loop. One way for the application to have more control would be to use a plain function as the client_connected_cb instead, and let that plain function create its own task that reads from the reader and writes to the writer -- the application can then manage that task as it pleases.
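
For example (a sketch; the handler_tasks registry is the application's own, not part of asyncio):

import asyncio

handler_tasks = set()

async def handle(reader, writer):
    try:
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

def on_connection(reader, writer):
    # plain callback: wrap the real handler in a task the application controls
    task = asyncio.get_running_loop().create_task(handle(reader, writer))
    handler_tasks.add(task)
    task.add_done_callback(handler_tasks.discard)

# pass on_connection to asyncio.start_server(); at shutdown the application
# can cancel and await handler_tasks on whatever schedule it pleases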

The most likely reason the task is cancelled would seem to be when ^C is hit. In this case, if we don't close the transport, we get these messages (in addition to the traceback):

sys:1: ResourceWarning: unclosed <socket.socket fd=8, family=1, type=1, proto=0, laddr=/tmp/sock>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/Users/guido/cpython/Lib/asyncio/selector_events.py:865: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=8>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/Users/guido/cpython/Lib/asyncio/streams.py:414: ResourceWarning: loop is closed
  warnings.warn("loop is closed", ResourceWarning)
/Users/guido/cpython/Lib/asyncio/selector_events.py:865: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>

If we do close the transport, that's reduced to this:

/Users/guido/cpython/Lib/asyncio/streams.py:414: ResourceWarning: loop is closed
  warnings.warn("loop is closed", ResourceWarning)
/Users/guido/cpython/Lib/asyncio/selector_events.py:865: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>

Those remaining messages are about the client (cr, cw) so these are expected. If I arrange to call cw.close() before leaving main(), they go away.

So, concluding, I think that closing the transport when the task is cancelled is the right thing to do. If you agree, please approve the PR.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 4, 2024
…task is cancelled (pythonGH-113690)

(cherry picked from commit 4681a52)

Co-authored-by: Guido van Rossum <guido@python.org>
@gvanrossum
Member

@jsundahl @CendioOssman Is there anything else you think needs to be done here? You have my view on the hang in 3.12.

gvanrossum added a commit that referenced this issue Jan 4, 2024
… task is cancelled (GH-113690) (#113714)

(cherry picked from commit 4681a52)

Co-authored-by: Guido van Rossum <guido@python.org>
CendioOssman added a commit to CendioOssman/cpython that referenced this issue Jan 22, 2024
Give applications the option of more forcefully terminating client
connections for asyncio servers. Useful when terminating a service and
there is limited time to wait for clients to finish up their work.
gvanrossum pushed a commit that referenced this issue Mar 11, 2024
These give applications the option of more forcefully terminating client
connections for asyncio servers. Useful when terminating a service and
there is limited time to wait for clients to finish up their work.
gvanrossum added a commit that referenced this issue Mar 12, 2024
Revert "gh-113538: Add asycio.Server.{close,abort}_clients (#114432)" (#116632)

Reason: The new test doesn't always pass:
#116423 (comment)

This reverts commit 1d0d49a.
itamaro reopened this Mar 12, 2024
@itamaro
Contributor

itamaro commented Mar 12, 2024

Reopening the issue since the PR was reverted

@gvanrossum
Member

For posterity, this was the traceback seen in a test:

======================================================================
ERROR: test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/unittest/async_case.py", line 93, in _callTestMethod
    if self._callMaybeAsync(method) is not None:
       ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/unittest/async_case.py", line 115, in _callMaybeAsync
    return self._asyncioRunner.run(
           ~~~~~~~~~~~~~~~~~~~~~~~^
        func(*args, **kwargs),
        ^^^^^^^^^^^^^^^^^^^^^^
        context=self._asyncioTestContext,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/asyncio/base_events.py", line 721, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_server.py", line 231, in test_abort_clients
    s_sock = s_wr.get_extra_info('socket')
             ^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_extra_info'
----------------------------------------------------------------------

gvanrossum pushed a commit that referenced this issue Mar 18, 2024
These give applications the option of more forcefully terminating client
connections for asyncio servers. Useful when terminating a service and
there is limited time to wait for clients to finish up their work.

This is a do-over with a test fix for gh-114432, which was reverted.
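
With these methods (new in Python 3.13), the shutdown from the original report becomes roughly:

server.close()              # stop accepting new connections
server.close_clients()      # close all accepted transports...
# ...or server.abort_clients() to drop them without flushing buffered data
await server.wait_closed()  # wait for the connections to actually finish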