BlockingIOError: [Errno 11] Resource temporarily unavailable: on GPFS. #87909

Open
pconesamingo mannequin opened this issue Apr 6, 2021 · 12 comments
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes topic-IO type-bug An unexpected behavior, bug, or error

Comments

@pconesamingo
Mannequin

pconesamingo mannequin commented Apr 6, 2021

BPO 43743
Nosy @gpshead, @giampaolo, @alexeicolin, @pmrv
PRs
  • bpo-43743: add comment stating _USE_CP_SENDFILE should not be removed #26024
  • Files
  • sendfile.py
  • shutil.patch: Fix for BlockingIOError in shutil.copy
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-05-06.19:04:38.663>
    created_at = <Date 2021-04-06.08:21:31.814>
    labels = ['type-bug', '3.8', '3.9', '3.10', 'expert-IO', 'invalid']
    title = 'BlockingIOError: [Errno 11] Resource temporarily unavailable: on GPFS.'
    updated_at = <Date 2022-03-13.15:22:25.341>
    user = 'https://bugs.python.org/pconesamingo'

    bugs.python.org fields:

    activity = <Date 2022-03-13.15:22:25.341>
    actor = 'pmrv'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-05-06.19:04:38.663>
    closer = 'gregory.p.smith'
    components = ['IO']
    creation = <Date 2021-04-06.08:21:31.814>
    creator = 'p.conesa.mingo'
    dependencies = []
    files = ['50671', '50672']
    hgrepos = []
    issue_num = 43743
    keywords = ['patch']
    message_count = 9.0
    messages = ['390297', '391649', '392985', '393134', '393364', '393419', '393429', '415037', '415038']
    nosy_count = 6.0
    nosy_names = ['gregory.p.smith', 'giampaolo.rodola', 'alexeicolin', 'p.conesa.mingo', 'PEAR', 'pmrv']
    pr_nums = ['26024']
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue43743'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    Linked PRs

    @pconesamingo
    Mannequin Author

    pconesamingo mannequin commented Apr 6, 2021

    Hi, one of our users reports that this has started to happen on a GPFS file system. Everything has worked fine on NTFS for many years.

    I had a look at my shutil code, and I can see the try/except code trying to fall back to the "slower" copyfileobj(fsrc, fdst).

    But judging by the stack trace below, the "catch" is not happening.

    Any idea how to fix this?

    I guess something like:

    import shutil
    shutil._USE_CP_SENDFILE = False

    should avoid the fast_copy attempt.

    > Traceback (most recent call last):
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/scipion-em-esrf/esrf/workflow/esrf_launch_workflow.py", line 432, in <module>
    >     project.scheduleProtocol(prot)
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/project/project.py", line 633, in scheduleProtocol
    >     pwutils.path.copyFile(self.dbPath, protocol.getDbPath())
    >   File "/opt/px/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/utils/path.py", line 247, in copyFile
    >     shutil.copy(source, dest)
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 415, in copy
    >     copyfile(src, dst, follow_symlinks=follow_symlinks)
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 272, in copyfile
    >     _fastcopy_sendfile(fsrc, fdst)
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 169, in _fastcopy_sendfile
    >     raise err
    >   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 149, in _fastcopy_sendfile
    >     sent = os.sendfile(outfd, infd, offset, blocksize)
    > BlockingIOError: [Errno 11] Resource temporarily unavailable: 'project.sqlite' -> 'Runs/000002_ProtImportMovies/logs/run.db'
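A minimal sketch of the workaround mentioned above, scoped to a single block of code rather than set globally. It assumes `shutil._USE_CP_SENDFILE` is a private module attribute that may change between Python versions; the helper name `sendfile_disabled` is hypothetical and not part of `shutil`.

```python
import contextlib
import shutil


@contextlib.contextmanager
def sendfile_disabled():
    """Temporarily turn off shutil's sendfile fast path (private flag)."""
    # _USE_CP_SENDFILE is private; guard against it being absent so the
    # original state can be restored exactly.
    previous = getattr(shutil, "_USE_CP_SENDFILE", None)
    shutil._USE_CP_SENDFILE = False
    try:
        yield
    finally:
        if previous is None:
            del shutil._USE_CP_SENDFILE
        else:
            shutil._USE_CP_SENDFILE = previous
```

Usage would look like `with sendfile_disabled(): shutil.copy(source, dest)`. Note that the flag is process-wide, so other threads copying files inside the `with` block would also skip the fast path.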

    @pconesamingo pconesamingo mannequin added 3.8 (EOL) end of life type-crash A hard crash of the interpreter, possibly with a core dump topic-IO labels Apr 6, 2021
    @alexeicolin
    Mannequin

    alexeicolin mannequin commented Apr 23, 2021

    Can confirm that this BlockingIOError happens on GPFS (alpine) on Summit supercomputer, tested with Python 3.8 and 3.10a7.

    I found that it happens only for file sizes of 65536 bytes or more. Minimal example:

    This file size works:

    $ rm -f srcfile dstfile && truncate --size 65535 srcfile && python3.10 -c "import shutil; shutil.copyfile(b'srcfile', b'dstfile')"

    This file size (and larger) does not work:

    $ rm -f srcfile dstfile && truncate --size 65536 srcfile && python3.10 -c "import shutil; shutil.copyfile(b'srcfile', b'dstfile')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/.../usr/lib/python3.10/shutil.py", line 265, in copyfile
        _fastcopy_sendfile(fsrc, fdst)
      File "/.../usr/lib/python3.10/shutil.py", line 162, in _fastcopy_sendfile
        raise err
      File "/.../usr/lib/python3.10/shutil.py", line 142, in _fastcopy_sendfile
        sent = os.sendfile(outfd, infd, offset, blocksize)
    BlockingIOError: [Errno 11] Resource temporarily unavailable: b'srcfile' -> b'dstfile'

    I tried patching shutil.py to retry the call on this EAGAIN, but subsequent attempts fail with EAGAIN again indefinitely.

    I also use OP's workaround: set _USE_CP_SENDFILE = False in shutil.py

    @alexeicolin alexeicolin mannequin added 3.10 only security fixes 3.8 (EOL) end of life and removed 3.8 (EOL) end of life labels Apr 23, 2021
    @PEAR
    Mannequin

    PEAR mannequin commented May 5, 2021

    Most probably related: https://www.ibm.com/support/pages/apar/IJ28891

    @gpshead
    Member

    gpshead commented May 6, 2021

    I don't believe CPython should be working around bugs in specific Linux kernel versions in the standard library unless they are extremely pernicious and are not considered bugs, and thus will never be fixed in the OS kernel.

    As the sendfile system call appears to infinitely return one of EAGAIN, EALREADY, EWOULDBLOCK, or EINPROGRESS in this case, there isn't anything CPython could do. A retry/backoff loop won't help.

    This should be worked around at the application level by whatever means are appropriate.
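One possible application-level workaround, as a sketch only: catch the `BlockingIOError` and redo the copy with plain `shutil.copyfileobj`. The function name `copy_with_fallback` is hypothetical, and the sketch uses `shutil.copyfile`, so unlike `shutil.copy` it does not copy permission bits.

```python
import shutil


def copy_with_fallback(src, dst):
    """Copy src to dst, redoing the copy without sendfile on BlockingIOError."""
    try:
        return shutil.copyfile(src, dst)
    except BlockingIOError:
        # The fast path may already have written part of dst. Opening dst
        # with "wb" truncates it, so the plain copy starts from scratch.
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            shutil.copyfileobj(fsrc, fdst)
        return dst
```

Restarting from an empty destination sidesteps the question of what to do with a partially written file, at the cost of copying the data twice in the failing case.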

    @gpshead gpshead closed this as completed May 6, 2021
    @gpshead gpshead added invalid and removed type-crash A hard crash of the interpreter, possibly with a core dump labels May 6, 2021
    @gpshead gpshead added type-bug An unexpected behavior, bug, or error invalid and removed type-crash A hard crash of the interpreter, possibly with a core dump labels May 6, 2021
    @pconesamingo
    Mannequin Author

    pconesamingo mannequin commented May 10, 2021

    So, is it OK not to raise _GiveupOnFastCopy(err) when the fast copy fails?

    I can understand that the fast copy might fail, but then the give-up fallback should happen, and it didn't.

    Additionally, could _USE_CP_SENDFILE optionally be taken from an environment variable, to disable the fast copy once we know it will fail?
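CPython itself reads no such variable, but an application could do the equivalent at startup. A rough sketch, where the variable name `MYAPP_DISABLE_SENDFILE` and the function `apply_sendfile_override` are both hypothetical:

```python
import os
import shutil


def apply_sendfile_override(environ=os.environ):
    """Disable shutil's sendfile fast path if the opt-out variable is set."""
    # MYAPP_DISABLE_SENDFILE is an application-defined name, not something
    # Python or shutil recognises on its own.
    value = environ.get("MYAPP_DISABLE_SENDFILE", "")
    if value.strip().lower() in {"1", "true", "yes"}:
        shutil._USE_CP_SENDFILE = False  # private flag, may change
        return True
    return False
```

Calling `apply_sendfile_override()` once, early in the program, lets users on affected file systems opt out without patching code.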

    @gpshead
    Member

    gpshead commented May 10, 2021

    The logic for bailing out to a slow copy is currently:

    https://github.com/python/cpython/blob/main/Lib/shutil.py#L158

    That condition appears not to be met in Alexei's test, suggesting that either at least one sendfile call succeeded, and thus offset is non-zero, or the lseek failed.

    Run that test under pdb and walk through the code, or under strace to look at the syscalls, and find out.

    The question seems to be whether it should be okay to raise _GiveupOnFastCopy after a partial (incomplete) copy has already occurred via sendfile.
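The bail-out condition described in this comment can be paraphrased roughly as follows. This is a simplified illustration, not the actual `shutil` source: `GiveUpOnFastCopy` and `handle_sendfile_error` are stand-in names, and the real code handles several other errno values first.

```python
import os


class GiveUpOnFastCopy(Exception):
    """Stand-in for shutil's private fallback signal (hypothetical name)."""


def handle_sendfile_error(err, offset, outfd):
    """Re-raise err, or signal a fallback if nothing has been written yet."""
    nothing_sent = offset == 0
    # The current position of the output file is 0 only if no bytes
    # have been written to it through this descriptor.
    output_untouched = os.lseek(outfd, 0, os.SEEK_CUR) == 0
    if nothing_sent and output_untouched:
        raise GiveUpOnFastCopy(err)
    raise err
```

Under this logic, an error on any call after the first successful one has a non-zero `offset`, so it propagates to the caller instead of triggering the slow copy.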

    @giampaolo
    Contributor

    > The question seems to be whether it should be okay to raise _GiveupOnFastCopy after a partial (incomplete) copy has already occurred via sendfile.

    I think it should not. For posterity: my rationale for introducing _USE_CP_SENDFILE was to allow monkey patching for corner cases such as this one (see also bpo-36610 / #57884), but to expose it as a private name because I expected them to be rare and likely due to a broken underlying implementation, as appears to be the case here. FWIW, I deem _USE_CP_SENDFILE usage in production code legitimate, and as such it should stay private but never be removed.

    @giampaolo giampaolo added 3.9 only security fixes labels May 10, 2021
    @pmrv
    Mannequin

    pmrv mannequin commented Mar 13, 2022

    I hope you don't mind me necro posting, but I ran into this issue again and
    have a small patch to solve it.

    I attached an MWE that triggers the BlockingIOError reliably on ext4
    filesystems with Linux 4.12.14 and Python 3.8.12. Running it under strace -e
    sendfile gives the following output:

    # manually calling sendfile to check that it works
    sendfile(5, 4, [0] => [8388608], 8388608) = 8388608
    # sendfile calls originating in shutil.copy
    sendfile(5, 4, [0] => [8388608], 8388608) = 8388608
    sendfile(5, 4, [8388608], 8388608) = -1 EAGAIN (Resource temporarily unavailable)
    Shutil Failed!
    [Errno 11] Resource temporarily unavailable: '/cmmc/u/zora/scratch/sendfile_bug/tmpaqx2o4uj' -> '/cmmc/u/zora/scratch/sendfile_bug/tmpb8rzg8rg'
    +++ exited with 0 +++

    This shows that the first call to sendfile actually copies the whole file and
    the EAGAIN is only triggered on the second, unnecessary, call. I have tested
    with a small C program that it is triggered whenever sendfile's offset + count
    exceeds the file size of in_fd. This is weird behaviour on the kernel's side
    that seems to have changed in newer kernel versions (the issue is not present,
    e.g., on my 5.16.12 laptop).

    Anyway, my patch avoids that second call by keeping track of the file
    size and the bytes written so far. It's against the current Python main
    branch, but if I see correctly this part hasn't changed in years. I have
    checked that the error is not thrown when the patch is applied.

    (I can only attach one file, so patch is attached in a new one.)
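The idea behind the patch can be sketched as below. This is an independent illustration of the technique described above, not the attached `shutil.patch`: the function name `sendfile_copy_bounded` and the injectable `sendfile` parameter are assumptions made so the loop can be exercised without an affected kernel. It assumes a platform where `os.sendfile` exists and accepts regular files as the destination (e.g. Linux).

```python
import os


def sendfile_copy_bounded(src_path, dst_path, blocksize=8 * 1024 * 1024,
                          sendfile=os.sendfile):
    """Copy a file with sendfile, never requesting bytes past the end."""
    with open(src_path, "rb") as fsrc, open(dst_path, "wb") as fdst:
        infd, outfd = fsrc.fileno(), fdst.fileno()
        remaining = os.fstat(infd).st_size
        offset = 0
        while remaining > 0:
            # Cap the request so offset + count never exceeds the file size.
            count = min(blocksize, remaining)
            sent = sendfile(outfd, infd, offset, count)
            if sent == 0:
                break  # source shrank while copying; stop early
            offset += sent
            remaining -= sent
    return offset
```

Because the loop stops as soon as `remaining` reaches zero, the extra call that would otherwise probe for end-of-file is never issued. One trade-off is that a source file that grows during the copy is only copied up to its initial size.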

    @pmrv
    Mannequin

    pmrv mannequin commented Mar 13, 2022

    Here's the small patch. Sadly, I have no overview of which Linux kernel versions are affected. I guess technically you can call this "working around a bug in specific Linux versions", but since it's a very minor change that saves one syscall even on unaffected versions, I feel it's justified. Let me know if you'd like any modifications, however.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @jmozmoz

    jmozmoz commented Jan 21, 2023

    I have exactly the same problem (using JupyterLab), and the patch provided solves it.

    @gpshead gpshead self-assigned this Jan 22, 2023
    @gpshead gpshead reopened this Jan 22, 2023
    gpshead added a commit to gpshead/cpython that referenced this issue Jan 22, 2023
    That triggers an EAGAIN error on a minority of kernels and filesystems.
    @gpshead
    Member

    gpshead commented Jan 22, 2023

    Said patch leads to other test failures, based on the Linux CI on that PR.

    @gpshead gpshead removed their assignment Jan 22, 2023
    @jmozmoz

    jmozmoz commented Jan 22, 2023

    Just for reference, here are the versions used:

    python --version
    Python 3.9.13
    
    cat /etc/centos-release
    CentOS Linux release 8.3.2011
    
    uname -a
    Linux xxx 4.18.0-240.el8.x86_64 #1 SMP Fri Sep 25 19:48:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    
    rpm -qa | grep gpfs
    gpfs.gplbin-4.18.0-240.el8.x86_64-5.1.0-1.x86_64
    gpfs.docs-5.1.0-1.noarch
    gpfs.license.std-5.1.0-1.x86_64
    gpfs.base-5.1.0-1.x86_64
    gpfs.java-5.1.0-1.x86_64
    gpfs.gpl-5.1.0-1.noarch
    gpfs.gskit-8.0.55-12.x86_64
    gpfs.msg.en_US-5.1.0-1.noarch
    gpfs.compression-5.1.0-1.x86_64
    
