
[RFC] add pprint for estimators #9099


Closed
wants to merge 2 commits

Conversation

amueller
Member

Continuation of #9039.
For now I haven't replaced __repr__, to make comparison easier.

@amueller
Member Author

FYI, right now numpy arrays are converted to lists for printing. If there are very large numpy arrays in the repr, this could be problematic. It also doesn't allow the user to distinguish numpy arrays from lists in the arguments, so that's maybe suboptimal. But doing something else shouldn't be that hard.
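
For illustration, here's a minimal sketch of how large arrays could be summarized instead of expanded, while staying distinguishable from lists (the helper name and size threshold are made up, not part of this PR):

```python
import numpy as np

def format_value(v, max_elements=16):
    # Hypothetical helper: keep numpy arrays distinguishable from lists,
    # and summarize large arrays by shape/dtype instead of expanding them.
    if isinstance(v, np.ndarray):
        if v.size > max_elements:
            return "array(shape=%s, dtype=%s)" % (v.shape, v.dtype)
        return "array(%s)" % v.tolist()
    return repr(v)

print(format_value([1, 2, 3]))             # [1, 2, 3]
print(format_value(np.arange(3)))          # array([0, 1, 2])
print(format_value(np.zeros((100, 100))))  # array(shape=(100, 100), dtype=float64)
```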

@amueller
Member Author

from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

large = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid={'C': np.logspace(-3, 3, 7)})

from sklearn._pprint import _Formatter
cf = _Formatter(color_changed=True)
f = _Formatter(color_changed=False)

print(f(large) + "\n")
print(cf(large))

[screenshot: output of f(large) and cf(large)]

@amueller
Member Author

amueller commented Jun 10, 2017

The code is horrible right now and could be simplified.
The logic for the pipeline (which is special-cased) duplicates the generic estimator logic; that should probably be consolidated. We could also think about special-casing BaseSearchCV, but it actually looks fine right now.
FeatureUnion might need a similar special case.

Also, this doesn't detect the terminal width etc., so that's broken, but also fixable.

@amueller changed the title from "add pprint for estimators" to "[RFC] add pprint for estimators" on Jun 10, 2017
@jnothman
Member

I'll have to think about whether there's a way to do that kind of layout without special-casing pipeline, which doesn't seem right

@GaelVaroquaux
Member

As mentioned IRL, I think that the functionality is really great (I haven't looked at the code so far)!

+1 for exploring it further

@amueller
Member Author

@jnothman depends on what you want. You can comment it out and it will still work. But I want to control where the line-break is manually, and I want the indentation of the tuples to be different.

I thought about this for the html quite a bit. I think it's worth special-casing pipeline because it's important to show this nicely and it won't be as nice without explicit treatment.
I want the indentation of steps to be such that it lines up with the "steps" string. But in general I want to indent by 4 whenever I do a line-break. If I always aligned with the thing on the line before, you'd reach the end of the line way too quickly.

@amueller
Member Author

without special treatment:
[screenshot: output without special-casing pipeline]
I could always just indent lists by one, but if you have a line that doesn't start with the list, it looks odd. We could change the indentation based on whether the line starts with the list/tuple or not, though.

@jnothman
Member

jnothman commented Jun 12, 2017 via email

@jnothman
Member

jnothman commented Aug 8, 2017

Do you think we should also be ordering these parameters in the order of the signature, rather than sorting, where possible (i.e. for those with positional arg names)? I think this would greatly improve readability too, as there is often topicalisation (i.e. most important first) and coherence (most related together) in the ordering. On the other hand, sorting allows a user to find the parameter they want...
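
As a sketch of the signature-order idea (the estimator class and helper here are toy stand-ins, not sklearn code):

```python
from inspect import signature

class ToyEstimator:
    # Minimal estimator-like class (hypothetical) with an sklearn-style
    # get_params that returns parameters in arbitrary/sorted order.
    def __init__(self, penalty='l2', dual=False, tol=1e-4, C=1.0):
        self.penalty, self.dual, self.tol, self.C = penalty, dual, tol, C

    def get_params(self, deep=False):
        return {'C': self.C, 'dual': self.dual,
                'penalty': self.penalty, 'tol': self.tol}

def params_in_signature_order(estimator):
    # List parameters following the order they appear in __init__,
    # rather than alphabetical order ('self' is skipped automatically
    # because it is not a parameter name returned by get_params).
    init_params = signature(type(estimator).__init__).parameters
    params = estimator.get_params(deep=False)
    return [(name, params[name]) for name in init_params if name in params]

est = ToyEstimator(C=0.1)
print([name for name, _ in params_in_signature_order(est)])
# ['penalty', 'dual', 'tol', 'C'] -- signature order, not alphabetical
```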


@jnothman jnothman left a comment


Halfway through I realised you probably weren't seeking a code review on this one.

I like it!

I'd be interested in seeing format_pipeline be extended to other compositions (VotingClassifier, FeatureUnion).

estimator.__init__)
init_params = signature(init).parameters
for k, v in params.items():
if v != init_params[k].default:

This will break for arrays, sparse matrices, Series, etc.
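
For reference, one way to make the changed-from-default check robust (the helper is a sketch, not code from this PR): `v != default` can return an array for numpy inputs, so fall back to "changed" when the truth value is ambiguous.

```python
import numpy as np

def is_changed(value, default):
    # Sketch: treat a parameter as changed unless it compares cleanly
    # equal to its default; array-likes with ambiguous truth values
    # (numpy arrays, sparse matrices, Series) are treated as changed.
    if value is default:
        return False
    try:
        return bool(value != default)
    except (ValueError, TypeError):
        # e.g. np.ndarray: elementwise != has no single truth value
        return True

print(is_changed(1.0, 1.0))            # False
print(is_changed('l1', 'l2'))          # True
print(is_changed(np.arange(3), None))  # True
```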

self.indent = 0
self.step = 4
self.width = 79
self.set_formater(object, self.__class__.format_object)

double t in formatter

def set_formater(self, obj, callback):
self.types[obj] = callback

def __call__(self, value, **args):

Is this **args really needed? It seems like a hack.

param_repr = self.join_items(items, indent + offset)
return '%s(%s)' % (value.__class__.__name__, param_repr)

def format_pipeline(self, value, indent):

Can we adopt this for all instances of BaseComposition?

+ self.format_all(params[key], indent + offset), key)
for key in params]
else:
items = [str(key) + '='

might as well just define color here and factor this line out.


self.default_color = default_color
self.types = {}
self.htchar = ' '

I don't know what htchar stands for.

@amueller
Member Author

Yeah, the code needs a major cleanup, sorry to waste your time. I should really put a comparison image to master in here to show how much better it actually is ^^. But this is not my highest priority right now.

@amueller
Member Author

And yes, definitely the special casing should be extended to the other composite estimators. And sorting vs not sorting is a good question.

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@amueller
Member Author

The spec would be how the formatting should look?
And yes, it totally makes sense to open this up.

I'd like this to be the spec:
[screenshot: desired formatting]

The problem is that it's very hard to write down a simple spec that works for everything. In particular, we're limited in line-length in jupyter, but we're not limited in nesting depth in sklearn. So if you write RFE(RFE(RFE(RFE(RFE(RFE(RFE(LogisticRegression))))))), I have no idea how to format that. It's unlikely someone will write something like this, but basically fixed indentation increase + limited width + arbitrary nesting depth is not solvable.

I guess we can use ... once the indented line-length becomes too small, or so small that we can't fit the next thing onto it. Or just for a given nesting level? That could produce much longer strings than now, though. Right now we limit the string-length and cut off totally arbitrarily. I don't have a good idea how to do that better.

I think it should be something like this:

  • Indentation should increase in steps of 4 and be nested (indentation in master is entirely broken).
  • For composed estimators, indent the estimator lists by only 1 to match up with the opening ( or [.
  • For composed estimators, put each estimator name and each estimator on a separate line.
  • Use (...) past nesting level 5 (?).
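
As a toy illustration of the last point, a depth-limited formatter could elide everything past a maximum nesting level (the classes here are stand-ins for illustration, not sklearn's):

```python
class RFE:
    # Toy stand-in for sklearn's RFE wrapper, for illustration only.
    def __init__(self, estimator):
        self.estimator = estimator

class LogisticRegression:
    # Toy leaf estimator.
    pass

def format_nested(obj, depth=0, max_depth=5):
    # Sketch of the "use (...) past a nesting level" rule: expand wrapped
    # estimators only up to max_depth, then elide with (...).
    name = type(obj).__name__
    if not isinstance(obj, RFE):
        return name + '()'
    if depth >= max_depth:
        return name + '(...)'
    return '%s(%s)' % (name, format_nested(obj.estimator, depth + 1, max_depth))

est = LogisticRegression()
for _ in range(7):
    est = RFE(est)
print(format_nested(est))  # RFE(RFE(RFE(RFE(RFE(RFE(...))))))
```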

@amueller
Member Author

master for comparison:
[screenshot: master's formatting]

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@amueller
Member Author

This would be plain text (maybe with terminal colors?). So all estimators and all contexts until we have better ones for fancier contexts ;) The current __repr__/pprint thing has obvious bugs that can't be fixed without doing something like this PR.

@jnothman
Member

Well:

RFE(RFE(RFE(RFE(RFE(RFE(RFE(LogisticRegression)))))))

could be written as:

RFE(rfe1)
where rfe1 = RFE(rfe2)
where rfe2 = RFE(rfe3)
where rfe3 = RFE(rfe4)
where rfe4 = RFE(rfe5)
where rfe5 = RFE(rfe6)
where rfe6 = RFE(LogisticRegression)

Perhaps this is a bit extreme, but we could extract and define separately nested estimators that are too long (vertically or horizontally) for the current indent...
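
A rough sketch of this extraction idea, using a toy stand-in for RFE (the alias scheme is illustrative only):

```python
class RFE:
    # Toy stand-in for sklearn's RFE wrapper, for illustration only.
    def __init__(self, estimator):
        self.estimator = estimator

class LogisticRegression:
    # Toy leaf estimator.
    pass

def repr_with_aliases(est):
    # Sketch of the extraction idea: print one wrapper per line, naming
    # each nested estimator with a short (made-up) alias like rfe1.
    lines, i = [], 0
    while isinstance(est, RFE):
        i += 1
        inner = est.estimator
        if isinstance(inner, RFE):
            head = 'RFE(rfe%d)' % i
        else:
            head = 'RFE(%s)' % type(inner).__name__
        prefix = '' if i == 1 else 'where rfe%d = ' % (i - 1)
        lines.append(prefix + head)
        est = inner
    return '\n'.join(lines)

est = LogisticRegression()
for _ in range(7):
    est = RFE(est)
print(repr_with_aliases(est))
```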

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 11, 2018 via email

@jnothman
Member

jnothman commented Jan 11, 2018 via email

@GaelVaroquaux
Member

Once this simple modification is in, I'll try to review this PR and aim for merge. I think that it is so terribly useful and will bring a lot of value to users.

@gxyd
Contributor

gxyd commented Jan 14, 2018

Is this still valid with 'help wanted'? I could go through the discussion (of related PRs as well) and maybe try to get it in, in case this still needs help from a contributor.

@jnothman
Member

jnothman commented Jan 14, 2018 via email

@amueller
Member Author

amueller commented Jun 4, 2018

Putting this one back on my stack... let's see...

@GaelVaroquaux
Member

@amueller : I think that we should not be too ambitious with this PR, and just implement my suggested change: "allow ourselves to extend beyond the line limit when the useful fraction of the line becomes too small", add a few tests, and merge. This is very useful to users already.
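
One possible reading of that rule, as a sketch (the min_useful threshold is an arbitrary assumption, not from this PR):

```python
def effective_width(indent, max_width=79, min_useful=20):
    # Sketch of the suggestion above: when the room left after the
    # indentation drops below min_useful characters, allow the line to
    # run past max_width instead of wrapping into an unreadably narrow
    # column.
    remaining = max_width - indent
    if remaining < min_useful:
        return indent + min_useful
    return max_width

print(effective_width(8))   # 79 (plenty of room, keep the normal limit)
print(effective_width(70))  # 90 (only 9 columns left, so extend the line)
```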

Do you have the bandwidth to do it, or should we open it for contributions during the sprint?

@amueller
Member Author

@GaelVaroquaux I agree. I'm not sure about extending the line length. I have to think about that. But yes, we should keep this one small.
And we can totally open it up during the sprint.

@jnothman
Member

In the first place, I don't think 79 characters is an especially reasonable width when code consistency is not the concern. But if the max width becomes a parameter, I suppose we'd better keep to it.

@amueller
Member Author

replaced by #11705

@amueller amueller closed this Dec 14, 2018