FeatureSelector for Pipeline (New Feature) #3560

@rasbt

Hi,
I was wondering whether it would be worthwhile to add a simple FeatureSelector class that can be used in scikit-learn's Pipeline. For example, a user might want to select particular columns by index (useful in cross-validation, for instance) and compare the result against feature selection or dimensionality reduction techniques.

E.g.,

clf1 = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', FeatureSelector(cols=(1,3))),    # select the 2nd and 4th features (zero-based indices 1 and 3)
    ('classifier', GaussianNB())   
    ]) 

clf2 = Pipeline(steps=[
    ('scaler', StandardScaler()),    
    ('reduce_dim', PCA(n_components=2)),
    ('classifier', GaussianNB())   
    ])

...

The code could be as simple as:

import numpy as np

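class FeatureSelector(object):

    def __init__(self, cols):
        # Zero-based indices of the columns to keep.
        self.cols = cols

    def transform(self, X, y=None):
        # Slice each column as an (n_samples, 1) array, then join them side by side.
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained.
        return self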
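
A slightly shorter variant of the same idea, as a minimal sketch rather than a definitive implementation: NumPy's integer-array indexing can pick all requested columns in one step, which should give the same result as slicing and concatenating. The `np.asarray` call is an assumption added here so that plain nested lists are accepted as well; the small array `X` is made up purely for illustration.

```python
import numpy as np


class FeatureSelector(object):
    """Select a fixed set of columns from a 2-D array (illustrative sketch)."""

    def __init__(self, cols):
        # Zero-based indices of the columns to keep.
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn: the columns are chosen by the user.
        return self

    def transform(self, X, y=None):
        # Integer-array indexing returns the columns in the order given.
        return np.asarray(X)[:, list(self.cols)]


# Toy data, assumed for illustration: 4 samples, 5 features.
X = np.arange(20).reshape(4, 5)
selected = FeatureSelector(cols=(1, 3)).fit(X).transform(X)
print(selected.shape)
```

One design note: utilities that clone estimators (cross-validation helpers, for example) rely on `get_params`, so inheriting from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` instead of `object` would probably be the safer choice for real use.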

PS: How would I add a label to this issue in the tracker? (I read in the scikit-learn docs that labels are recommended.)
