-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[RFC] add pprint for estimators #9099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI right now numpy arrays are converted to lists for printing. If there are very large numpy arrays in the repr, this could be problematic. It also doesn't allow the user to distinguish numpy arrays from lists in the arguments, so that's maybe non-optimal. But doing something else shouldn't be that hard. |
from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
large = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid={'C': np.logspace(-3, 3, 7)})
from sklearn._pprint import _Formatter
cf = _Formatter(color_changed=True)
f = _Formatter(color_changed=False)
print(f(large) + "\n")
print(cf(large)) |
Code is horrible right now and could be simplified. Also, this doesn't detect terminal etc so that's broken, but also fixable. |
I'll have to think about whether there's a way to do that kind of layout without special-casing pipeline, which doesn't seem right |
As mentioned IRL, I think that the functionality is really great (I haven't looked at the code so far)! +1 for exploring it further |
@jnothman depends on what you want. You can comment it out and it will still work. But I want to control where the line-break is manually, and I want the indentation of the tuples to be different. I thought about this for the html quite a bit. I think it's worth special-casing pipeline because it's important to show this nicely and it won't be as nice without explicit treatment. |
if we intend that this does not affect doctests, I think it's something we
can continue to fine-tune. and it's looking very good for a start! I'll try
to review soon
…On 12 Jun 2017 7:38 pm, "Andreas Mueller" ***@***.***> wrote:
without special treatment:
[image: image]
<https://user-images.githubusercontent.com/449558/27027957-8c942cbc-4f63-11e7-8d2d-00761f1bce28.png>
I could always just indent lists by one, but if you have a line that
doesn't start with the list, it looks odd. We could change the indentation
based on whether the line starts with the list/tuple or not, though.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9099 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz67fy_5rXXHK273LvM-5pRsYZmJbPks5sDQcfgaJpZM4N2HOC>
.
|
Do you think we should also be ordering these parameters in order of the signature, rather than sorting, where possible (i.e. for those with positional arg names)?? I think this would greatly improve readability too as there is often topicalisation (i.e. most important first) and coherence (most related together) in the ordering. On the other hand, sorting allows a user to find the parameter they want... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Halfway through I realised you probably weren't seeking a code review on this one.
I like it!
I'd be interested in seeing format_pipeline be extended to other compositions (VotingClassifier, FeatureUnion).
estimator.__init__) | ||
init_params = signature(init).parameters | ||
for k, v in params.items(): | ||
if v != init_params[k].default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break for arrays, sparse matrices, Series, etc.
self.indent = 0 | ||
self.step = 4 | ||
self.width = 79 | ||
self.set_formater(object, self.__class__.format_object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double t in formatter
def set_formater(self, obj, callback): | ||
self.types[obj] = callback | ||
|
||
def __call__(self, value, **args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this **args really needed? It seems a hack.
param_repr = self.join_items(items, indent + offset) | ||
return '%s(%s)' % (value.__class__.__name__, param_repr) | ||
|
||
def format_pipeline(self, value, indent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we adopt this for all instances of BaseComposition?
+ self.format_all(params[key], indent + offset), key) | ||
for key in params] | ||
else: | ||
items = [str(key) + '=' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well just define color here and factor this line out.
|
||
self.default_color = default_color | ||
self.types = {} | ||
self.htchar = ' ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what htchar stands for.
Yeah, the code needs a major cleanup, sorry to waste your time. I should really put a comparison image to master in here to show how much better it actually is ^^. But this is not my highest priority right now. |
And yes, definitely the special casing should be extended to the other composite estimators. And sorting vs not sorting is a good question. |
do we want to write down some specs and open this to a contributor?
…On 15 Aug 2017 3:27 am, "Andreas Mueller" ***@***.***> wrote:
And yes, definitely the special casing should be extended to the other
composite estimators. And sorting vs not sorting is a good question.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9099 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz62TDxXfehMQi9hDW2qFL6PA1h32hks5sYINygaJpZM4N2HOC>
.
|
The specs would be how the formatting should look like? The problem is that it's very hard to write down a simple spec that works for everything. In particular we're limited in line-length in jupyter, but we're not limited in nesting depth in sklearn. So if you write I think it should be something like this:
|
and make this the default __repr__ for which estimators and contexts?
…On 16 Aug 2017 3:19 am, "Andreas Mueller" ***@***.***> wrote:
master for comparison:
[image: image]
<https://user-images.githubusercontent.com/449558/29327174-4d069784-81bc-11e7-8786-dba751d6358c.png>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9099 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz64FdFczQlZ42p9uo9Id21fRgApXsks5sYdMIgaJpZM4N2HOC>
.
|
This would be plain text (maybe with terminal colors?). So all estimators and all contexts until we have better ones for fancier contexts ;) The current |
Well:
could be written as:
Perhaps this is a bit extreme, but we could extract and define separately nested estimators that are too long (vertically or horizontally) for the current indent... |
How about we simply allow ourselves to extend beyond the line limit when the useful fraction of the line becomes too small?
I think that this PR is so useful that we should merge it without too much more changes to it, and possibly revisit after.
Sent from my phone. Please forgive typos and briefness.
…On Jan 11, 2018, 06:37, at 06:37, Joel Nothman ***@***.***> wrote:
Well:
```
RFE(RFE(RFE(RFE(RFE(RFE(RFE(LogisticRegresssion))))))
```
could be written as:
```
RFE(rfe1)
where rf1 = RFE(rfe2)
where rfe2 = RFE(rfe3)
where rfe3 = RFE(rfe4)
where rfe4 = RFE(rfe5)
where rfe5 = RFE(rfe6)
where rfe6 = RFE(LogisticRegresssion)
```
Perhaps this is a bit extreme, but we could extract and define
separately nested estimators that are too long (vertically or
horizontally) for the current indent...
--
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
#9099 (comment)
|
good idea!
|
Once this simple modification is in, I'll try to review this PR and aim for merge. I think that it is so terribly useful and will bring a lot of value to users. |
Is this still valid with 'help wanted'? I could have a go through the discussion (of related PR's as well) and may be try to get it in, in case this still needs help from a contributor. |
it needs the change suggested by Gaël, and a few test cases
|
Putting this one back on my stack... let's see... |
@amueller : I think that we should not be too ambitious with this PR, and just implement my suggested change: "allow ourselves to extend beyond the line limit when the useful fraction of the line becomes too small", add a few tests, and merge. This is very useful to users already. Do you have the bandwidth to do it, or should we open it for contributions during the sprint? |
@GaelVaroquaux I agree. I'm not sure about extending the line length. I have to think about that. But yes, we should keep this one small. |
In the first place, I don't think 79 characters is an especially reasonable width when code consistency is not the concern. But if the max width becomes a parameter, I suppose we'd better keep to it. |
replaced by #11705 |
Continuation of #9039.
Right now I didn't replace the
__repr__
for easier comparison