Skip to content

[Windows] PermissionError in datasets fetchers when trying to remove the downloaded archive #9820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lucianocrt opened this issue Sep 22, 2017 · 18 comments · Fixed by #9847
Closed
Labels
Milestone

Comments

@lucianocrt
Copy link

Description

I have just intalled scikit-learn though pip. I am trying to load a dataset (california_housing) but I keep getting a permission error when scikit-learn tries to delete the file.
I can delete the file myself from the file explorer, which suggests that it is python that locks the file.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

Expected Results

No error should be thrown and data should be loaded correctly

Actual Results

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\lucia\scikit_learn_data
Traceback (most recent call last):
File "test.py", line 4, in
housing = fetch_california_housing()
File "C:\Users\lucia\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\datasets\california_housing.py", line 109, in fetch_california_housing
remove(archive_path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\lucia\scikit_learn_data\cal_housing.tgz'

Versions

Windows-10-10.0.15063-SP0
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

lucianocrt added a commit to lucianocrt/scikit-learn that referenced this issue Sep 22, 2017
@lucianocrt
Copy link
Author

Figured out what the problem is, suggested a pull request to fix it.

@jnothman
Copy link
Member

jnothman commented Sep 24, 2017 via email

@lesteve
Copy link
Member

lesteve commented Sep 25, 2017

I am going to reopen this issue. @lucianocrt we tend to leave the issue open until the PR has been fixed. By using "Fix #issueNumber" in your PR description, the associated issue gets closed automatically when the PR is merged. For more details, look at this.

@lesteve lesteve reopened this Sep 25, 2017
@lesteve
Copy link
Member

lesteve commented Sep 25, 2017

BTW this could be a problem that exists on Windows in other datasets fetchers, we should probably double-check all of them.

@lesteve lesteve added the Bug label Sep 25, 2017
@lesteve lesteve added this to the 0.19.1 milestone Sep 25, 2017
@lesteve
Copy link
Member

lesteve commented Sep 25, 2017

I quickly checked with a snippet I used for the figshare work, the fetchers affected by this are:

  • fetch_california_housing
  • fetch_rcv1
  • fetch_species_distributions

@massich maybe you could fix them since you worked on the fetchers when moving them to figshare.

I think we should use context managers (e.g. with open('bla') as f:) if available for tar.gz file to make sure we always close the file before removing it. If that does not exist, using try/finally is fine too (as suggested by Joel in the associated PR). You could look at other fetchers while you are at it to double-check why they are not affected by this bug and possibly make the code across fetchers more uniform.

@lesteve lesteve changed the title Error loading california housing dataset [Windows] PermissionError in datasets fetchers when trying to remove the downloaded archive Sep 25, 2017
@jnothman
Copy link
Member

jnothman commented Sep 25, 2017 via email

@jnothman
Copy link
Member

@lesteve, you've marked this for 0.19.1. Is it a regression?

@lesteve
Copy link
Member

lesteve commented Sep 27, 2017

I wanted to double-check this and forgot sorry. This is a regression for fetch_california_housing. For fetch_rcv1 and fetch_species_distribution it is slightly harder to be sure because the URLs do not exist anymore but looking at the source code it is a regression because we were not deleting the downloaded files in 0.18.2.

@jnothman
Copy link
Member

So then we would like this resolved within the next few days ideally as I think we should be releasing. Is @massich taking this up? I'll mark it as needing contributor in any case.

@lesteve
Copy link
Member

lesteve commented Sep 27, 2017

I haven't heard from him yet. I'll chat with him to see whether he has some spare bandwith to tackle this one.

@massich
Copy link
Contributor

massich commented Sep 27, 2017

Hy, I'm a bit out. But I'm looking into this one.

@lesteve
Copy link
Member

lesteve commented Sep 28, 2017

OK thanks, I can do it, if you don't have the time, just le me know. My understanding is that Joel was hoping to do 0.19.1 kind of this week.

@massich
Copy link
Contributor

massich commented Sep 28, 2017

That's fine, if I'm not mistaken something like this should do the trick

fileobj = tempfile.NamedTemporaryFile()
with tarfile.open(mode="r:gz", name=archive_path).extractfile(
        'CaliforniaHousing/cal_housing.data') as f:
    shutil.copyfileobj(f, fileobj)
    fileobj.seek(0)
remove(archive_path)

cal_housing = np.loadtxt(fileobj, delimiter=',')
fileobj.close()

But I think that it should be a way to make it nicer, either move the remove to the end and using a named tempfile to delete afterwards or something. I'll make a PR tonight.

@lesteve
Copy link
Member

lesteve commented Sep 28, 2017

Looks like this could work. Maybe this can be simplified a bit, can you not just do something like:

with tarfile.open(...).extractfile(...) as f:
    np.loadtxt(f, delimiter=',')

@Teghsummy
Copy link

so i have the sane problem - I am assuming we just have to wait for this to be fixed externally for it to work then?

@lesteve
Copy link
Member

lesteve commented Sep 30, 2017

You will have to wait for 0.19.1 to be released. In the mean time as a work-around you can find the file and comment out the line with the remove(...) call.

For example this is the stack-trace I get when trying to use fetch_california_housing on Windows:

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\lesteve\scikit_learn_data
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-1-7f347e637e46> in <module>()
----> 1 from sklearn import datasets; datasets.fetch_california_housing()

C:\Users\lesteve\Miniconda3\lib\site-packages\sklearn\datasets\california_housing.py in fetch_california_housing(data_home, download_if_missing)
    103             name=archive_path).extractfile(
    104                 'CaliforniaHousing/cal_housing.data')
--> 105         remove(archive_path)
    106
    107         cal_housing = np.loadtxt(fileobj, delimiter=',')

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz'

c:\msys64\home\lesteve\dev>ipython -c "from sklearn import datasets; datasets.fetch_california_housing()"
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\lesteve\scikit_learn_data
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-1-7f347e637e46> in <module>()
----> 1 from sklearn import datasets; datasets.fetch_california_housing()

C:\Users\lesteve\Miniconda3\lib\site-packages\sklearn\datasets\california_housing.py in fetch_california_housing(data_home, download_if_missing)
    103             name=archive_path).extractfile(
    104                 'CaliforniaHousing/cal_housing.data')
--> 105         remove(archive_path)
    106
    107         cal_housing = np.loadtxt(fileobj, delimiter=',')

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz'

This tells me that to work around the problem I have to comment out the line remove_archive(path) (line 105) in C:\Users\lesteve\Miniconda3\lib\site-packages\sklearn\datasets\california_housing.py.

Notes:

  • depending on the fetcher, there may be multiple similar calls to remove that you need to comment out.
  • this is just a temporary work-around until 0.19.1 is released. I would not encourage you to patch manually packages like this in general.

@blueskea
Copy link

blueskea commented Jul 6, 2018

Version: scikit-learn (0.19.0)
In File: D:/DvpSoft/Anaconda3/envs/ml/Lib/site-packages/sklearn/datasets/california_housing.py
I comment the line "remove(archive_path)", and It's work.

@lesteve
Copy link
Member

lesteve commented Jul 6, 2018

As I mentioned in my comment above, please use scikit-learn 0.19.1. Patching your code was just a temporary work-around that is no longer needed if you use scikit-learn 0.19.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants