fetch_mldata needs to handle sparse matrices as labels #700

Closed
amueller opened this issue Mar 15, 2012 · 6 comments

Comments

@amueller
Member

This might be a special case but the "yeast" data set returns a sparse matrix as labels.
As this is a standard dataset for multi-label prediction, it would be good if we supported that.
Maybe we could even make an example.

@amueller
Member Author

done in af3b08a

@davidmarek
Contributor

Hi,
I am trying to get familiar with scikit-learn. I have looked at this bug, and I think the easiest way to handle sparse matrices is to check whether dataset['target'] is sparse and, if so, convert it to a dense matrix.

My fix looks like this:

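    # set axes to sklearn conventions
    # (assumes `from scipy.sparse import issparse` at module level)
    if transpose_data:
        dataset['data'] = dataset['data'].T
    if 'target' in dataset:
        if issparse(dataset['target']):
            # toarray() returns an ndarray; todense() returns a numpy.matrix,
            # which stays 2-D after squeeze()
            dataset['target'] = dataset['target'].toarray()
        dataset['target'] = dataset['target'].squeeze()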

Do you think this is the right way to solve this bug? What else is there to do? Add tests?
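
One detail worth checking when densifying a sparse target before calling squeeze(): SciPy's `todense()` returns a `numpy.matrix`, which always stays two-dimensional, whereas `toarray()` returns a plain `ndarray` that squeeze() can flatten. The sketch below uses a small made-up single-column target (not the actual "yeast" data) to show the difference.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical single-column sparse target, standing in for dataset['target']
sparse_target = csr_matrix(np.array([[1], [0], [1]]))

# todense() yields a numpy.matrix: squeeze() cannot drop below 2 dimensions
as_matrix = sparse_target.todense().squeeze()

# toarray() yields an ndarray: squeeze() flattens the column to 1-D
as_array = sparse_target.toarray().squeeze()

print(type(as_matrix).__name__, as_matrix.ndim)
print(type(as_array).__name__, as_array.shape)
```

For a multi-label target with several columns, squeeze() leaves the shape unchanged in both cases, so the difference only matters for single-column targets.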

@amueller
Member Author

You're just half an hour too late; I already fixed the bug. Sorry about that.

My fix was using squeeze only when the target is not sparse, which should be more efficient than converting to dense if there are many outputs.
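
A minimal sketch of that idea, assuming a hypothetical helper named `normalize_target` (the actual change lives in commit af3b08a and may look different): sparse targets are passed through untouched, and only dense targets are squeezed.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def normalize_target(target):
    """Hypothetical helper: squeeze dense targets, leave sparse ones as-is."""
    if issparse(target):
        # Keep the sparse representation; no dense copy is allocated
        return target
    return np.asarray(target).squeeze()


# Made-up example data, not taken from mldata.org
dense_labels = np.array([[0], [1], [1]])
multi_labels = csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

squeezed = normalize_target(dense_labels)
kept = normalize_target(multi_labels)

print(squeezed.shape)
print(issparse(kept), kept.shape)
```

With many label columns that are mostly zero, skipping the dense conversion avoids allocating the full label matrix, which is the efficiency argument made above.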

Let me see whether I can find anything else that you could try.

@amueller
Member Author

You can try looking into #569 if you like. #615 should be very easy, #558 should be moderate, #615 should also be ok.

@davidmarek
Contributor

Thanks, I'll look at those issues. I wasn't sure where the labels are used, or whether any code depends on getting a dense matrix.

@amueller
Member Author

Most of the code assumes dense matrices, but not the fetch_mldata part. So I thought it would be reasonable to let users decide what to do once they have the targets.
I think both solutions have pros and cons, and as this really doesn't come up that often, I just went with the first one that came to mind.
