[MRG] Add experimental.ColumnTransformer #9012

Merged
90 commits
1937d56
add heterogeneous ColumnTransformer
amueller Jun 5, 2015
95bf6cb
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 6, 2017
914ba53
Get tests/examples working with current sklearn
jorisvandenbossche Jun 6, 2017
2333e61
Add support for numpy arrays and positional columns in dataframes as …
jorisvandenbossche Jun 6, 2017
464f7e6
add support for selecting multiple columns
jorisvandenbossche Jun 6, 2017
7777e2a
doc corrections
jorisvandenbossche Jun 7, 2017
42ce18c
Change to tuples instead of dict
jorisvandenbossche Jun 7, 2017
4a55b9b
Reimplement as subclass of FeatureUnion
jorisvandenbossche Jun 7, 2017
55a5372
Fix-ups and move tests
jorisvandenbossche Jun 7, 2017
74d0639
update docs
jorisvandenbossche Jun 7, 2017
b6883b9
Support selecting multiple columns from dict + ensure passed subset i…
jorisvandenbossche Jun 7, 2017
1c4f09b
Also support slices for positional subsets
jorisvandenbossche Jun 7, 2017
7cef7df
Fix 2d dict items case
jorisvandenbossche Jun 7, 2017
6ceed19
Refactor column selection based on discussion
jorisvandenbossche Jun 8, 2017
e19e3c1
clean-up + add more tests
jorisvandenbossche Jun 8, 2017
0116ac9
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 9, 2017
c7ea079
Nuke swiss army knife (no dict/recarray support)
jorisvandenbossche Jun 9, 2017
acff9dd
Add catch/reraise error with custom message
jorisvandenbossche Jun 9, 2017
4db243c
update docs
jorisvandenbossche Jun 9, 2017
6ab49a8
undo changes to utils
jorisvandenbossche Jun 9, 2017
2dda954
Move to experimental module
jorisvandenbossche Jun 10, 2017
0d0107f
fixup move to experimental
jorisvandenbossche Jun 10, 2017
267ca85
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 10, 2017
0c7b0d7
Move docs
jorisvandenbossche Jun 10, 2017
c711b55
add support for boolean masks
jorisvandenbossche Jun 10, 2017
0cb9770
Add make_column_transformer factory function
jorisvandenbossche Jun 10, 2017
9d24bb1
doc fixups
jorisvandenbossche Jun 10, 2017
11a5c0c
feedback
jorisvandenbossche Jun 10, 2017
a8efeeb
skip feature_extraction docs if pandas not installed
jorisvandenbossche Jun 10, 2017
20976b1
fix doctests + pep8
jorisvandenbossche Jun 10, 2017
e71a390
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 10, 2017
406b2a9
add to sklearn/setup.py
jorisvandenbossche Jun 14, 2017
ae12bbc
feedback
jorisvandenbossche Jun 14, 2017
70ed541
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 14, 2017
16bfae5
possible fix for get_params / set_params
jorisvandenbossche Jun 15, 2017
7ff02a4
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 16, 2017
a753833
updates for feedback
jorisvandenbossche Jun 16, 2017
bb4d721
Don't subclass FeatureUnion + clone passed transformers
jorisvandenbossche Jun 16, 2017
493116f
add named_transformers_ attribute
jorisvandenbossche Jun 19, 2017
a33ad8c
add test that confirms that transformers now actually get cloned
jorisvandenbossche Jun 19, 2017
18b814d
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 26, 2017
6cedbd7
added some more tests
jorisvandenbossche Jun 26, 2017
0229e5b
doc feedback guillaume
jorisvandenbossche Jun 28, 2017
f9d95eb
Merge remote-tracking branch 'origin/master' into amueller/heterogene…
glemaitre Jul 13, 2017
ca1647e
Solve the issue introduce by git during merging
glemaitre Jul 13, 2017
0707319
Addess Joel comments
glemaitre Jul 17, 2017
88ac893
remove validation from init
glemaitre Jul 17, 2017
91a5312
correct comment in example
glemaitre Jul 17, 2017
deb3b78
Do not modify transformer in init
glemaitre Jul 17, 2017
a6d7b77
Factorize _fit_* functions
glemaitre Jul 18, 2017
d287420
minor updates based on feedback
jorisvandenbossche Aug 21, 2017
2920912
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Aug 21, 2017
7b1ce95
refactor try except block to single helper function
jorisvandenbossche Aug 22, 2017
db9b2de
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Oct 27, 2017
e6d81af
move whatsnew + fix bad merge
jorisvandenbossche Oct 27, 2017
733b111
add passthrough kwarg
jorisvandenbossche Oct 27, 2017
6d639f0
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Nov 21, 2017
8d142fd
fixup basic passthrough implementation and tests
jorisvandenbossche Nov 21, 2017
af257e0
fix doctest
jorisvandenbossche Nov 21, 2017
6705233
use pytest setup to skip docs if no pandas
jorisvandenbossche Nov 22, 2017
2b591e4
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Nov 24, 2017
00aef88
move doc fixture to common conftest.py for docs
jorisvandenbossche Nov 24, 2017
9c2df9c
poc of passthrough=True
jorisvandenbossche Dec 5, 2017
4463fa7
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Dec 14, 2017
8d6e034
Update make_column_transformer to accept tuples instead of dict
jorisvandenbossche Dec 14, 2017
82a5697
some clean-up
jorisvandenbossche Dec 15, 2017
04cf4ff
more thoroughly test + fix passthrough
jorisvandenbossche Dec 16, 2017
db2eabd
add test to cover check of transformers
jorisvandenbossche Dec 16, 2017
c402fb2
feedback Joel
jorisvandenbossche Dec 22, 2017
9ae7753
add note on None transformer and 'remainder'
jorisvandenbossche Dec 22, 2017
c222101
small update to the tests
jorisvandenbossche Jan 12, 2018
26bf288
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jan 12, 2018
8386fae
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Feb 7, 2018
28840ad
flake8
jorisvandenbossche Feb 7, 2018
14c7b1e
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Mar 29, 2018
608ba9a
Move ColumnTransformer from experimental to compose
jorisvandenbossche Mar 29, 2018
22c499c
fix sklearn/__init__.py
jorisvandenbossche Mar 29, 2018
333f878
fixup remaining usage of experimental
jorisvandenbossche Mar 29, 2018
c3f8733
fix doctest example
jorisvandenbossche Mar 29, 2018
4804cd8
switch transformers/columns order in make_column_transformer
jorisvandenbossche Apr 10, 2018
3d3e772
Add special-cased 'drop' and 'passthrough'
jorisvandenbossche Apr 18, 2018
3346268
Implement 'drop'/'passthrough' for remainder instead of passthrough k…
jorisvandenbossche May 1, 2018
7ded77a
remainder -> unspecified
jorisvandenbossche May 1, 2018
4835c29
fix doctests + remaining feedback Joel
jorisvandenbossche May 1, 2018
04bcb1e
pep8
jorisvandenbossche May 1, 2018
3d2a9bc
unspecified -> remainder
jorisvandenbossche May 25, 2018
afb7384
update for feedback
jorisvandenbossche May 25, 2018
d298fc3
switch default from 'drop' to 'passthrough' + add transformer ouput v…
jorisvandenbossche May 25, 2018
4098928
Add NotImplementedError for get_feature_names if columns are passed t…
jorisvandenbossche May 29, 2018
9ab27fb
move docs from feature_extraction.rst -> compose.rst
jorisvandenbossche May 29, 2018
9 changes: 9 additions & 0 deletions doc/conftest.py
Expand Up @@ -55,6 +55,13 @@ def setup_working_with_text_data():
check_skip_network()


def setup_compose():
    try:
        import pandas  # noqa
    except ImportError:
        raise SkipTest("Skipping compose.rst, pandas not installed")


def pytest_runtest_setup(item):
    fname = item.fspath.strpath
    if fname.endswith('datasets/labeled_faces.rst'):
Expand All @@ -67,6 +74,8 @@ def pytest_runtest_setup(item):
        setup_twenty_newsgroups()
    elif fname.endswith('tutorial/text_analytics/working_with_text_data.rst'):
        setup_working_with_text_data()
    elif fname.endswith('modules/compose.rst'):
        setup_compose()


def pytest_runtest_teardown(item):
Expand Down
8 changes: 8 additions & 0 deletions doc/modules/classes.rst
Expand Up @@ -158,8 +158,15 @@ details.
:toctree: generated
:template: class.rst

compose.ColumnTransformer
compose.TransformedTargetRegressor

.. autosummary::
:toctree: generated/
:template: function.rst

compose.make_column_transformer

.. _covariance_ref:

:mod:`sklearn.covariance`: Covariance Estimators
Expand Down Expand Up @@ -1461,6 +1468,7 @@ Low-level methods
utils.testing.assert_raise_message
utils.testing.all_estimators


Recently deprecated
===================

Expand Down
110 changes: 106 additions & 4 deletions doc/modules/compose.rst
Expand Up @@ -304,9 +304,13 @@ FeatureUnion: composite feature spaces
:class:`FeatureUnion` combines several transformer objects into a new
transformer that combines their output. A :class:`FeatureUnion` takes
a list of transformer objects. During fitting, each of these
is fit to the data independently. For transforming data, the
transformers are applied in parallel, and the sample vectors they output
are concatenated end-to-end into larger vectors.
is fit to the data independently. The transformers are applied in parallel,
and the feature matrices they output are concatenated side-by-side into a
larger matrix.

When you want to apply different transformations to each field of the data,
see the related class :class:`sklearn.compose.ColumnTransformer`
(see :ref:`user guide <column_transformer>`).

:class:`FeatureUnion` serves the same purposes as :class:`Pipeline` -
convenience and joint parameter estimation and validation.
Expand Down Expand Up @@ -357,4 +361,102 @@ and ignored by setting to ``None``::
.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_feature_stacker.py`
* :ref:`sphx_glr_auto_examples_hetero_feature_union.py`


.. _column_transformer:

ColumnTransformer for heterogeneous data
========================================

.. warning::

    The :class:`compose.ColumnTransformer <sklearn.compose.ColumnTransformer>`
    class is experimental and the API is subject to change.

Many datasets contain features of different types, say text, floats, and dates,
where each type of feature requires separate preprocessing or feature
extraction steps. Often it is easiest to preprocess data before applying
scikit-learn methods, for example using `pandas <http://pandas.pydata.org/>`__.
Processing your data before passing it to scikit-learn might be problematic for
one of the following reasons:

1. Incorporating statistics from test data into the preprocessors makes
   cross-validation scores unreliable (known as *data leakage*),
   for example in the case of scalers or imputing missing values.
2. You may want to include the parameters of the preprocessors in a
   :ref:`parameter search <grid_search>`.

The :class:`~sklearn.compose.ColumnTransformer` helps perform different
transformations for different columns of the data, within a
:class:`~sklearn.pipeline.Pipeline` that is safe from data leakage and that can
be parametrized. :class:`~sklearn.compose.ColumnTransformer` works on
arrays, sparse matrices, and
`pandas DataFrames <http://pandas.pydata.org/pandas-docs/stable/>`__.
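
A minimal sketch of what "safe from data leakage and parametrizable" can look like in practice. It assumes a recent scikit-learn release in which ``ColumnTransformer`` is importable from ``sklearn.compose``; the toy data, the step names (``preprocess``, ``clf``) and the transformer names (``city``, ``rating``) are invented for illustration and are not part of this PR:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one categorical and one numeric column.
X = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Paris',
             'London', 'Paris', 'London', 'Paris'],
    'rating': [1.0, 4.0, 2.0, 5.0, 1.5, 4.5, 2.5, 3.5],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Each transformer only sees its own columns.
preprocess = ColumnTransformer(
    [('city', OneHotEncoder(handle_unknown='ignore'), ['city']),
     ('rating', StandardScaler(), ['rating'])],
    remainder='drop')

# Because preprocessing lives inside the pipeline, it is re-fit on the
# training part of every cross-validation split.
pipe = Pipeline([('preprocess', preprocess),
                 ('clf', LogisticRegression())])

# Preprocessor parameters are addressed as <step>__<transformer>__<param>.
param_grid = {
    'preprocess__rating__with_mean': [True, False],
    'clf__C': [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(sorted(search.best_params_))
```

The double-underscore path is what makes the preprocessing parameters searchable alongside the classifier's own parameters.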

To each column, a different transformation can be applied, such as
preprocessing or a specific feature extraction method::

  >>> import pandas as pd
  >>> X = pd.DataFrame(
  ...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
  ...      'title': ["His Last Bow", "How Watson Learned the Trick",
  ...                "A Moveable Feast", "The Grapes of Wrath"]})

For this data, we might want to encode the ``'city'`` column as a categorical
variable, but apply a :class:`feature_extraction.text.CountVectorizer
<sklearn.feature_extraction.text.CountVectorizer>` to the ``'title'`` column.
As we might use multiple feature extraction methods on the same column, we give
each transformer a unique name, say ``'city_category'`` and ``'title_bow'``::

  >>> from sklearn.compose import ColumnTransformer
  >>> from sklearn.feature_extraction.text import CountVectorizer
  >>> column_trans = ColumnTransformer(
  ...     [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
  ...      ('title_bow', CountVectorizer(), 'title')])

  >>> column_trans.fit(X) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
      transformers=...)

  >>> column_trans.get_feature_names()
  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  ['city_category__London', 'city_category__Paris', 'city_category__Sallisaw',
  'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
  'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
  'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
  'title_bow__wrath']

  >>> column_trans.transform(X).toarray()
  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
         [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)

In the above example, the
:class:`~sklearn.feature_extraction.text.CountVectorizer` expects a 1D array as
input and therefore the columns were specified as a string (``'city'``).
However, other transformers generally expect 2D data, and in that case you need
to specify the column as a list of strings (``['city']``).
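
The 1D versus 2D distinction can be sketched by mixing both kinds of transformer on the same data. This is only an illustration, assuming a recent scikit-learn release; using ``OneHotEncoder`` for the ``'city'`` column is a substitution made here for the sake of the example and is not what the documentation in this PR does:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"]})

column_trans = ColumnTransformer(
    [
        # OneHotEncoder expects 2D input, so the column is given as a list.
        ('city_category', OneHotEncoder(), ['city']),
        # CountVectorizer expects a 1D sequence of strings, so the column
        # is given as a bare string.
        ('title_bow', CountVectorizer(), 'title'),
    ],
    remainder='drop')

Xt = column_trans.fit_transform(X)
print(Xt.shape)  # 3 city categories + 13 title tokens -> (4, 16)
```

The same rule applies to plain pandas indexing: ``X['city']`` is a 1D Series, while ``X[['city']]`` is a 2D DataFrame with one column.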

Apart from a scalar or a single-item list, the column selection can be specified
as a list of multiple items, an integer array, a slice, or a boolean mask.
Strings can reference columns if the input is a DataFrame; integers are always
interpreted as positional columns.
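
A short sketch of the positional selection forms on a NumPy array, assuming a recent scikit-learn release. The transformer names (``by_list``, ``by_slice``, ``by_mask``) and the data are hypothetical and chosen only to label each selection style:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0., 10., 100.],
              [1., 20., 200.],
              [2., 30., 300.]])

ct = ColumnTransformer(
    [
        # A list of integers selects columns by position (2D result).
        ('by_list', StandardScaler(), [0]),
        # A slice selects a positional range of columns.
        ('by_slice', MinMaxScaler(), slice(1, 2)),
        # A boolean mask has one entry per input column.
        ('by_mask', MinMaxScaler(), np.array([False, False, True])),
    ],
    remainder='drop')

Xt = ct.fit_transform(X)
print(Xt.shape)
print(Xt[:, 1])  # second input column rescaled to the [0, 1] range
```

The output columns follow the order of the transformer list, not the order of the columns in the input.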

The :func:`~sklearn.compose.make_column_transformer` function is available
to more easily create a :class:`~sklearn.compose.ColumnTransformer` object.
Specifically, the names will be given automatically. The equivalent for the
above example would be::

  >>> from sklearn.compose import make_column_transformer
  >>> column_trans = make_column_transformer(
  ...     ('city', CountVectorizer(analyzer=lambda x: [x])),
  ...     ('title', CountVectorizer()))
  >>> column_trans # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
           transformers=[('countvectorizer-1', ...)
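
A hedged sketch of how the automatic naming can be inspected. It assumes a later scikit-learn release, where the tuples passed to ``make_column_transformer`` are ordered ``(transformer, columns)`` rather than ``(columns, transformer)`` as in the snapshot above:

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: (transformer, columns) tuple order, as in later releases.
column_trans = make_column_transformer(
    (CountVectorizer(analyzer=lambda x: [x]), 'city'),
    (CountVectorizer(), 'title'))

# Names are derived from the lowercased class name; a numeric suffix is
# appended when the same class is used more than once.
names = [name for name, _, _ in column_trans.transformers]
print(names)  # ['countvectorizer-1', 'countvectorizer-2']
```

Generated names are convenient for quick experiments, but explicit names passed to ``ColumnTransformer`` are easier to reference in a parameter grid.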

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_column_transformer.py`
2 changes: 1 addition & 1 deletion doc/modules/feature_extraction.rst
Expand Up @@ -916,7 +916,7 @@ Some tips and tricks:
(Note that this will not filter out punctuation.)


The following example will, for instance, transform some British spelling
The following example will, for instance, transform some British spelling
to American spelling::

>>> import re
Expand Down
4 changes: 4 additions & 0 deletions doc/whats_new/v0.20.rst
Expand Up @@ -74,6 +74,10 @@ Preprocessing
the maximum value in the features. :issue:`9151` by
:user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.

- Added :class:`compose.ColumnTransformer`, which allows applying
  different transformers to different columns of arrays or pandas
  DataFrames. By `Andreas Müller`_ and `Joris Van den Bossche`_.

- Added :class:`preprocessing.PowerTransformer`, which implements the Box-Cox
power transformation, allowing users to map data from any distribution to a
Gaussian distribution. This is useful as a variance-stabilizing transformation
Expand Down
89 changes: 22 additions & 67 deletions examples/hetero_feature_union.py → examples/column_transformer.py
@@ -1,7 +1,7 @@
"""
=============================================
Feature Union with Heterogeneous Data Sources
=============================================
==================================================
Column Transformer with Heterogeneous Data Sources
==================================================

Datasets can often contain components that require different feature
extraction and processing pipelines. This scenario might occur when:
Expand All @@ -12,12 +12,12 @@
require different processing pipelines.

This example demonstrates how to use
:class:`sklearn.feature_extraction.FeatureUnion` on a dataset containing
:class:`sklearn.compose.ColumnTransformer` on a dataset containing
different types of features. We use the 20-newsgroups dataset and compute
standard bag-of-words features for the subject line and body in separate
pipelines as well as ad hoc features on the body. We combine them (with
weights) using a FeatureUnion and finally train a classifier on the combined
set of features.
weights) using a ColumnTransformer and finally train a classifier on the
combined set of features.

The choice of features is not particularly helpful, but serves to illustrate
the technique.
Expand All @@ -38,50 +38,11 @@
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key.

The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e.

>> len(data[key]) == n_samples

Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample).

ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc.

>> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data)

ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key

def fit(self, x, y=None):
return self

def transform(self, data_dict):
return data_dict[self.key]


class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer"""

Expand All @@ -104,21 +65,22 @@ def fit(self, x, y=None):
return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
Member
As we are going to be in the "column" namespace, where we support pandas dataframes, should we use a pandas dataframe in this example, rather than a object array?

Member Author
In this example I didn't use pandas, as it seems a bit overhead (it would just be for temporarily putting the two columns in a frame to pass it to a next frame). But we certainly need another example with a pandas dataframe (eg with adults).
But can change it here as well if needed.

        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod
            features[i, 1] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub
            features[i, 0] = sub

        return features

Expand All @@ -127,38 +89,31 @@ def transform(self, posts):
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),

# Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[
# Use ColumnTransformer to combine the features from subject and body
('union', ColumnTransformer(
[
# Pulling features from the post's subject line (first column)
('subject', TfidfVectorizer(min_df=50), 0),

# Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])),

# Pipeline for standard bag-of-words model for body
# Pipeline for standard bag-of-words model for body (second column)
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 1),

# Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])),

]), 1),
],

# weight components in FeatureUnion
# weight components in ColumnTransformer
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
}
)),

# Use a SVC classifier on the combined features
Expand Down
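
The core idea of the updated example can be reduced to a small, self-contained sketch: an object-dtype array with two text columns, each routed to its own vectorizer by integer position, with per-transformer weights. It assumes a recent scikit-learn release; the three toy posts are invented here and stand in for the 20-newsgroups data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Object array: column 0 = 'subject', column 1 = 'body' (toy data).
posts = np.empty(shape=(3, 2), dtype=object)
posts[:, 0] = ['graphics card', 'space shuttle', 'graphics driver']
posts[:, 1] = ['my card is slow', 'launch was delayed',
               'driver keeps crashing']

union = ColumnTransformer(
    [
        # Scalar integer columns hand a 1D array of strings to each vectorizer.
        ('subject', TfidfVectorizer(), 0),
        ('body', TfidfVectorizer(), 1),
    ],
    # Each transformer's output block is multiplied by its weight.
    transformer_weights={'subject': 0.8, 'body': 0.5})

Xt = union.fit_transform(posts)
# The result may be sparse or dense depending on its density.
dense = Xt.toarray() if hasattr(Xt, 'toarray') else np.asarray(Xt)
print(dense.shape)
```

Since ``TfidfVectorizer`` L2-normalizes each row by default, the weights directly set the norm of each block, which is how they control the relative influence of subject and body features.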
4 changes: 4 additions & 0 deletions sklearn/compose/__init__.py
Expand Up @@ -5,8 +5,12 @@

"""

from ._column_transformer import ColumnTransformer, make_column_transformer
from ._target import TransformedTargetRegressor


__all__ = [
    'ColumnTransformer',
    'make_column_transformer',
    'TransformedTargetRegressor',
]