Expanded ColumnTransformer functionality -- transforming subsets of data #28130
Comments
We have a PR candidate here: #11639. @thomasjpfan, this may be a use case you were searching for.
This issue is thus a duplicate of #11463.
@glemaitre Sorry, I didn't notice that other issue. However, this is not a duplicate, because I am requesting one additional feature -- the ability to pass in a subset of the data.
I have now edited the issue above to focus just on the unique part of this request. Perhaps the Triage label could be re-added?
Hi @rebeccaherman1: what do you mean by "subset of the data"? How is it defined?
@GaelVaroquaux I mean the columns associated with one or more of the transformers within the ColumnTransformer. For example, let's say my data is Mx10, and I define the ColumnTransformer using

    transformers = [('X', StandardScaler(total_variance=True), [0, 1, 2]),
                    ('Y', StandardScaler(total_variance=True), [3, 4, 5, 6, 7]),
                    ('Z', StandardScaler(total_variance=True), [8, 9])]

Then, maybe later, I'd like to pass in data corresponding to just 'X' (original columns [0, 1, 2]) and 'Z' (original columns [8, 9]), without 'Y'. I would then hope to pass in an M'x5 matrix and identify either the original columns that the new data columns correspond to ([0, 1, 2, 8, 9]) or the transformers they correspond to (['X', 'Z']).
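The column bookkeeping described above can be sketched in plain Python. This is only an illustration: `subset_columns` is a hypothetical helper, not a scikit-learn function, and it assumes every transformer's columns are given as a list of integer indices.

```python
def subset_columns(spec, names):
    """Map selected transformer names to column indices.

    spec  : list of (name, columns) pairs, columns being lists of ints
    names : names of the transformers whose data will be passed in

    Returns the original column indices covered by `names`, and the
    positions those columns occupy in the smaller subset matrix.
    """
    by_name = dict(spec)
    original, reindexed, start = [], {}, 0
    for name in names:
        cols = by_name[name]
        original.extend(cols)
        # Columns of the subset matrix are laid out in the order of `names`.
        reindexed[name] = list(range(start, start + len(cols)))
        start += len(cols)
    return original, reindexed


spec = [("X", [0, 1, 2]), ("Y", [3, 4, 5, 6, 7]), ("Z", [8, 9])]
original, reindexed = subset_columns(spec, ["X", "Z"])
print(original)   # [0, 1, 2, 8, 9]
print(reindexed)  # {'X': [0, 1, 2], 'Z': [3, 4]}
```

Either mapping is enough to identify the M'x5 data: `original` says which of the ten fitted columns each new column stands for, while `reindexed` says where each transformer's block sits inside the smaller matrix.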
If we want to support this feature, I'll go with "slicing a column transformer":

    transformers = [('X', StandardScaler(total_variance=True), [0, 1, 2]),
                    ('Y', StandardScaler(total_variance=True), [3, 4, 5, 6, 7]),
                    ('Z', StandardScaler(total_variance=True), [8, 9])]
    ct = ColumnTransformer(transformers)
    ct.fit(X)
    ct_sliced = ct[["X", "Z"]]
    ct_sliced.transform(X_subset)

We already have some precedent for this with
@thomasjpfan This would be wonderful. I tried to look through the code for Would someone else who is more familiar with the code be willing to implement this? Or discuss in more detail how it would be best implemented?
Describe the workflow you want to enable (edited)
the ability to (inverse_)transform data corresponding to a subset of the ColumnTransformer's component transformations
Describe your proposed solution
Data of a smaller size can be passed in with a new keyword that identifies the relevant component transformations by name
Describe alternatives you've considered, if relevant
a function that subsets a ColumnTransformer object, including adjusting the column numbers
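Until something like this exists in scikit-learn, the alternative can be approximated in user code. The sketch below is a workaround under stated assumptions, not a proposed implementation: `SlicedColumnTransformer` is a hypothetical class, columns are assumed to be lists of integer indices, and each transformer is assumed to return as many columns as it receives (true for `StandardScaler`). It uses a plain `StandardScaler()`, since `total_variance` appears to be the option discussed in #27957 rather than a released parameter.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class SlicedColumnTransformer:
    """Hypothetical helper (not part of scikit-learn): applies a subset of
    the fitted transformers of a ColumnTransformer to a narrower matrix."""

    def __init__(self, ct, names):
        # transformers_ holds (name, fitted_transformer, columns) tuples.
        fitted = {name: (trans, cols) for name, trans, cols in ct.transformers_}
        self.blocks_ = []
        start = 0
        for name in names:
            trans, cols = fitted[name]
            stop = start + len(cols)
            # Re-index: this block now occupies columns start..stop-1.
            self.blocks_.append((trans, slice(start, stop)))
            start = stop

    def transform(self, X):
        X = np.asarray(X)
        return np.hstack([t.transform(X[:, s]) for t, s in self.blocks_])

    def inverse_transform(self, X):
        X = np.asarray(X)
        return np.hstack([t.inverse_transform(X[:, s]) for t, s in self.blocks_])


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
ct = ColumnTransformer([
    ("X", StandardScaler(), [0, 1, 2]),
    ("Y", StandardScaler(), [3, 4, 5, 6, 7]),
    ("Z", StandardScaler(), [8, 9]),
]).fit(X)

# New data that only contains the 'X' and 'Z' column groups.
X_subset = X[:5][:, [0, 1, 2, 8, 9]]
sliced = SlicedColumnTransformer(ct, ["X", "Z"])
Xt = sliced.transform(X_subset)
X_back = sliced.inverse_transform(Xt)
```

The fitted transformers are reused rather than refitted, so the subset is scaled with the statistics learned from the full dataset.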
Additional context
In artificial intelligence applications, a researcher may want to transform an entire dataset with column groups for learning, and later transform new data corresponding only to interventions or predictions using the same fitted transformations. Hence the need to be able to subset a ColumnTransformer or pass in only part of the data.
See #27957 for more discussion.