
fetch_openml can raise "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process" #21798


Closed
ogrisel opened this issue Nov 26, 2021 · 11 comments


@ogrisel
Member

ogrisel commented Nov 26, 2021

Describe the bug

On Windows, if fetch_openml is run concurrently in 2 processes, for instance when running the tests with pytest-xdist, one sometimes gets errors such as:

[...]
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x1439D0F0>
gzip_response = True

    @pytest.mark.parametrize("gzip_response", [True, False])
        version    = 'active'
C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\sklearn\datasets\_openml.py:449: in _get_data_description_by_id
    url, error_message, data_home=data_home
        data_home  = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml'
        data_id    = 2
        error_message = 'Dataset with data_id 2 not found.'
        url        = 'api/v1/json/data/2'
C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\sklearn\datasets\_openml.py:172: in _get_json_content_from_openml_api
    return _load_json()
        _load_json = <function _get_json_content_from_openml_api.<locals>._load_json at 0x14167C00>
        data_home  = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml'
        error_message = 'Dataset with data_id 2 not found.'
        url        = 'api/v1/json/data/2'
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (), kw = {}
local_path = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml\\openml.org\\api/v1/json/data/2.gz'

    @wraps(f)
    def wrapper(*args, **kw):
        if data_home is None:
            return f(*args, **kw)
        try:
            return f(*args, **kw)
        except HTTPError:
            raise
        except Exception:
            warn("Invalid cache, redownloading file", RuntimeWarning)
            local_path = _get_local_path(openml_path, data_home)
            if os.path.exists(local_path):
>               os.unlink(local_path)
E               PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml\\openml.org\\api/v1/json/data/2.gz'

Full error log:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=35377&view=logs&j=18b0749f-dd9a-5274-d197-77895e43d4e4&t=ba53dc33-2c0b-592b-6f69-b1c7af7ca977

Steps/Code to Reproduce

Run pytest -x -n 4 --pyargs sklearn many times.

Expected Results

No crash: fetch_openml should be safe to run concurrently.

Actual Results

See error report above.

Versions

Python dependencies:
          pip: 21.3.1
   setuptools: 47.1.0
      sklearn: 1.1.dev0
        numpy: 1.21.4
        scipy: 1.7.3
       Cython: 0.29.24
       pandas: None
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\numpy\.libs\libopenblas.VTYUM5MXKVFE4PZZER3L7PNO6YB4XFF3.gfortran-win32.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Nehalem
    num_threads: 2

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\scipy\.libs\libopenblas.VTYUM5MXKVFE4PZZER3L7PNO6YB4XFF3.gfortran-win32.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Nehalem
    num_threads: 2

       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Windows\SYSTEM32\VCOMP140.DLL
        version: None
    num_threads: 2
@ogrisel
Member Author

ogrisel commented Nov 26, 2021

I am not sure how this code works in detail, but to make the fetch_openml code run safely in concurrent processes, one should use atomic operations:

  • download the file to a temporary local file with a random component in its filename;
  • once the download has completed successfully, rename the file to its definitive filename, ignoring any renaming failure if the target filename already exists;
  • if the download fails, delete the temp file and retry: no other process should be using the same temp filename, so the deletion cannot fail.
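The steps above can be sketched with stdlib primitives; `atomic_download` is a hypothetical helper, not the actual sklearn code:

```python
import os
import tempfile
import urllib.request


def atomic_download(url: str, target_path: str) -> None:
    """Hypothetical sketch: download to a uniquely named temp file in the
    target directory, then atomically rename it into place."""
    target_dir = os.path.dirname(target_path)
    if target_dir:
        os.makedirs(target_dir, exist_ok=True)
    # mkstemp creates a file with a random name component in the target
    # directory, so the rename below stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir or ".")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            with urllib.request.urlopen(url) as response:
                tmp_file.write(response.read())
        # os.replace is atomic and overwrites an existing target on both
        # POSIX and Windows, so losing the race to another process is harmless.
        os.replace(tmp_path, target_path)
    except BaseException:
        # The temp name is unique to this process, so unlinking it cannot
        # race with another downloader.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```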

@ogrisel
Member Author

ogrisel commented Nov 26, 2021

The retry mechanism for fetch_openml is also being discussed in #21397. It's likely that a single PR can fix both issues.

@thomasjpfan
Member

I do not think any of the fetch_* functions are safe to run concurrently. Your suggestion in #21798 (comment) works if we are okay with the additional bandwidth of multiple processes downloading from the same source. The alternative is to use a file lock, which has the downside of adding a dependency or code to vendor.
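For illustration, a file lock can be approximated with only the stdlib via exclusive file creation; this is a minimal sketch (a production version, or the third-party filelock package, would also handle stale locks left by crashed processes):

```python
import os
import time


class CacheLock:
    """Crude cross-process lock: O_CREAT | O_EXCL makes file creation
    fail atomically if another process already holds the lock."""

    def __init__(self, lock_path: str, timeout: float = 10.0, poll: float = 0.05):
        self.lock_path = lock_path
        self.timeout = timeout
        self.poll = poll

    def __enter__(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return self
            except FileExistsError:
                # Another process holds the lock; poll until the deadline.
                if time.monotonic() > deadline:
                    raise TimeoutError(f"could not acquire {self.lock_path}")
                time.sleep(self.poll)

    def __exit__(self, *exc):
        os.unlink(self.lock_path)
```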

@siavrez
Contributor

siavrez commented Nov 27, 2021

I cannot reproduce the error on Linux. Is the error Windows-specific? Other than pytest -n itself, is there any scenario in which any of the fetch_* functions would need to run concurrently in sklearn?

@thomasjpfan
Member

thomasjpfan commented Nov 27, 2021

Other than pytest -n itself is there any scenario that any of fetch_* would need concurrent run in sklearn?

If a user spawns multiple Python processes (using joblib, ProcessPoolExecutor, etc) and calls fetch_* concurrently.

@siavrez
Contributor

siavrez commented Nov 27, 2021

Other than pytest -n itself is there any scenario that any of fetch_* would need concurrent run in sklearn?

If a user spawns multiple Python processes (using joblib, ProcessPoolExecutor, etc) and calls fetch_* concurrently.

Any function, class, or IO operation in sklearn (or any other code) might be used concurrently; they are implemented in different ways, and a large proportion of them are probably not thread-safe. My suggestion is the following: instead of changing the original code and moving this level of complexity into fetch_*, deal with it in the part of the code that creates the problem: the tests.

@thomasjpfan
Member

We have already introduced some complexity in our test code for the fetch_* functions other than fetch_openml:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/conftest.py#L81-L82

This code downloads all the necessary files before pytest-xdist distributes the work. To me it still feels like a workaround to get tests to work with pytest-xdist.
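The workaround can be sketched roughly like this; it is a hypothetical simplification, with `DATASET_FETCHERS` and `prefetch_datasets` as illustrative names, not the actual sklearn conftest code:

```python
# conftest.py (hypothetical sketch of the prefetch workaround)


def prefetch_datasets(fetchers):
    """Call each dataset fetcher once so the on-disk cache is populated;
    failures are collected rather than raised, so offline runs can skip
    the corresponding tests instead of crashing."""
    failed = []
    for fetch in fetchers:
        try:
            fetch()
        except Exception:
            failed.append(fetch.__name__)
    return failed


DATASET_FETCHERS = []  # illustrative, e.g. [fetch_20newsgroups, ...]


def pytest_configure(config):
    # pytest-xdist worker processes carry a 'workerinput' attribute,
    # while the controller does not, so only the controller downloads;
    # workers then only ever read the already-populated cache.
    if not hasattr(config, "workerinput"):
        prefetch_datasets(DATASET_FETCHERS)
```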

Any function, class or IO operation in sklearn or any other code might be used concurrently and implemented in different ways and a large proportion of them might not be thread safe.

We have to choose what we want to be thread-safe, and I would prefer to have fetch_* be thread-safe.

Given all that, I think it is important to fix the tests so the CI is stable. I opened #21806 as a quick workaround to fix the tests.

@siavrez
Contributor

siavrez commented Nov 30, 2021

So far the script I've come up with for tempfile naming is:

import os
from threading import get_native_id
from string import ascii_lowercase, digits
from random import Random

def _get_tempfile_local_path(dir_name: str) -> str:
    # Seed the RNG with the native thread id so the name is
    # reproducible within a thread but distinct across threads.
    rng = Random(get_native_id())
    tempfile_name = ''.join(rng.choices(ascii_lowercase + digits, k=10))
    return os.path.join(dir_name, tempfile_name + '.gz')

@adrinjalali
Member

@siavrez Python has tempfile builtin :)

I'd be in favor of making fetch_* multiprocess safe.
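A stdlib-based equivalent of the hand-rolled naming helper might look like this (a sketch; `get_tempfile_local_path` is an illustrative name):

```python
import os
import tempfile


def get_tempfile_local_path(dir_name: str) -> str:
    """Let the tempfile builtin pick the name: mkstemp generates a
    unique random name and creates the file atomically, so concurrent
    processes can never collide on the same temp path."""
    fd, path = tempfile.mkstemp(dir=dir_name, suffix=".gz")
    os.close(fd)  # the downloader reopens the path itself
    return path
```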

@siavrez
Contributor

siavrez commented Dec 2, 2021

The code was for a naming scheme that would be reproducible from the same thread, but that is unnecessary.

@thomasjpfan
Member

I think this issue was fixed in #21833
