[WIP] fetch_openml: ability to return DataFrame #11875

jorisvandenbossche
Member

Just a proof of concept I quickly tried out, no docs or tests yet.

Fixes #11818

@jorisvandenbossche changed the title from "[WIP] fetch_openml: abitly to return DataFrame" to "[WIP] fetch_openml: ability to return DataFrame" on Aug 21, 2018
@janvanrijn
Contributor

@jorisvandenbossche are you still working on this? Shall I take over?

@jorisvandenbossche
Member Author

I think it mainly needs some review, although docs and tests could probably already be added. Feel free to do that / push this forward!

@janvanrijn
Contributor

Great, I will make some time for this. Can I push to your PR, or should I fetch your branch and open a new PR?

@jorisvandenbossche
Member Author

I added you as a collaborator on my fork, so you should be able to push to this PR.

X = data.iloc[:, col_slice_x]
X.columns = data_columns

all_numeric = all(features_dict[feature]['data_type'] == 'numeric'
Member

We should also consider 'real' and 'integer'.
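
A minimal sketch of what this suggestion could look like, assuming `features_dict` maps each feature name to a dict with a `'data_type'` key as in the diff. The `NUMERIC_TYPES` set, the `all_numeric` helper, and the sample metadata are hypothetical and are not the PR's actual code.

```python
# Hypothetical feature metadata, shaped like the features_dict in the diff.
features_dict = {
    'sepal_length': {'data_type': 'real'},
    'petal_count': {'data_type': 'integer'},
    'weight': {'data_type': 'numeric'},
    'species': {'data_type': 'nominal'},
}

# Type names treated as numeric; this set is an assumption for illustration.
NUMERIC_TYPES = {'numeric', 'real', 'integer'}


def all_numeric(features_dict, columns):
    """Return True if every listed column has a numeric data type."""
    return all(features_dict[col]['data_type'] in NUMERIC_TYPES
               for col in columns)


numeric_only = all_numeric(features_dict,
                           ['sepal_length', 'petal_count', 'weight'])
with_nominal = all_numeric(features_dict, list(features_dict))
```

Here `numeric_only` is computed over the three numeric columns, while `with_nominal` also includes the nominal `species` column.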


for feature in data_columns:
    data_type = features_dict[feature]['data_type']
    if data_type == 'numeric':
Member

We should add the case for 'real', which should also be np.float64.

Member

Also, we need an elif for 'integer' to get an integer column.

Member

In the case of integer, it will be tricky to manage the missing values, though.
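
One way the three comments above might fit together, as a sketch only: `'numeric'` and `'real'` map to `np.float64`, and `'integer'` maps to `np.int64` unless the column has missing values, in which case it falls back to `np.float64` because a NumPy integer array cannot hold NaN. The `dtype_for` helper, the fallback policy, and the sample frame are assumptions, not what the PR implements.

```python
import numpy as np
import pandas as pd


def dtype_for(data_type, has_missing):
    """Pick a dtype for a feature's data type (illustrative mapping)."""
    if data_type in ('numeric', 'real'):
        return np.float64
    elif data_type == 'integer':
        # int64 cannot represent NaN, so use float64 when values are missing.
        return np.float64 if has_missing else np.int64
    return object


# Hypothetical data: 'rooms' is declared integer but has a missing value.
df = pd.DataFrame({
    'height': [1.5, 2.0, None],
    'count': [1, 2, 3],
    'rooms': [4, None, 2],
})
data_types = {'height': 'real', 'count': 'integer', 'rooms': 'integer'}

for name, data_type in data_types.items():
    has_missing = bool(df[name].isna().any())
    df[name] = df[name].astype(dtype_for(data_type, has_missing))
```

pandas 0.24 and later also provide a nullable `Int64` extension dtype, which could be an alternative to the float fallback if integer semantics with missing values are required.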

@glemaitre
Member

To think about:

  • When a dataset is uploaded, bool columns are encoded as the strings 'True' and 'False'. It might be helpful to decode them and return a boolean column instead of a string column.
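
The decoding idea in the bullet above could be sketched as follows. The `decode_bool_column` helper and the `BOOL_STRINGS` mapping are hypothetical, and this version deliberately leaves a column untouched if it holds anything other than the two strings, including missing values.

```python
import pandas as pd

# Assumed string encoding of booleans, as described in the comment above.
BOOL_STRINGS = {'True': True, 'False': False}


def decode_bool_column(series):
    """Return a bool Series if all values are 'True'/'False', else the input."""
    if series.empty or not series.isin(list(BOOL_STRINGS)).all():
        return series
    return series.map(BOOL_STRINGS).astype(bool)


flags = decode_bool_column(pd.Series(['True', 'False', 'True']))
names = decode_bool_column(pd.Series(['a', 'True']))
```

In this sketch `flags` becomes a boolean column, while `names` is returned unchanged because it contains a value that is not a boolean string.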

@amueller
Member

Anyone working on this right now?

@janvanrijn
Contributor

On my mental to-do list. Apparently the feature is there (by Joris) but needs test cases.


nominal_attributes = dict(nominal_attributes)

data = pd.DataFrame(arff_data)
Member

Are there any risks in pandas doing type inference here before we set the dtypes below?
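
To make the question concrete, here is a small sketch of the inference step, using made-up rows rather than real ARFF output. `pd.DataFrame` guesses a dtype per column before any explicit dtype is set, so a nominal column whose values look like integers is first inferred as `int64`. The `arff_rows` data and the shape of `nominal_attributes` here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical decoded rows; 'grade' is nominal but its values look like ints.
arff_rows = [[5.1, 1, 'a'], [4.9, 2, 'b'], [6.0, 1, 'a']]
columns = ['length', 'grade', 'label']
nominal_attributes = {'grade': [1, 2, 3], 'label': ['a', 'b']}

data = pd.DataFrame(arff_rows, columns=columns)

# pandas has already guessed a dtype per column at this point.
inferred = {name: str(dtype) for name, dtype in data.dtypes.items()}

# Override the guesses for nominal columns with explicit categoricals.
for name, categories in nominal_attributes.items():
    data[name] = pd.Categorical(data[name], categories=categories)
```

One risk this illustrates: if the declared categories had been the strings `'1'`, `'2'`, `'3'` while the inferred values were integers, none of the values would match a category and the column would come out as all missing.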

@glemaitre
Member

@jorisvandenbossche, could we try to finish this during the sprint?

@glemaitre self-requested a review on February 27, 2019 13:47
@jnothman
Member

jnothman commented Feb 28, 2019 via email

@jnothman
Member

jnothman commented Mar 3, 2019

Tests are failing, in case you weren't aware.

@jnothman
Member

Let me know when this wants review.

@amueller
Member

Any progress on this? I'd really like to have it ;)

@rth
Member

rth commented Jul 12, 2019

Continued and resolved in #13902. Closing.

@rth closed this on Jul 12, 2019
Development

Successfully merging this pull request may close these issues.

fetch_openml: Add an option to which returns a DataFrame
6 participants