- .. _feature_extraction:
+ .. _feature_extraction:

==================
Feature extraction
@@ -53,8 +53,8 @@ is a traditional numerical feature::
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

- >>> vec.get_feature_names()
- ['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
+ >>> vec.get_feature_names_out()
+ array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)

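For context, the outputs above assume roughly the following setup, sketched
here from the surrounding section (only the tail of the array is visible in
this hunk)::

    >>> measurements = [
    ...     {'city': 'Dubai', 'temperature': 33.},
    ...     {'city': 'London', 'temperature': 12.},
    ...     {'city': 'San Francisco', 'temperature': 18.},
    ... ]

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> vec = DictVectorizer()
    >>> vec.fit_transform(measurements).toarray()
    array([[ 1.,  0.,  0., 33.],
           [ 0.,  1.,  0., 12.],
           [ 0.,  0.,  1., 18.]])
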
:class:`DictVectorizer` accepts multiple string values for one
feature, like, e.g., multiple categories for a movie.
@@ -69,10 +69,9 @@ and its year of release.
  array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
         [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
         [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
- >>> vec.get_feature_names() == ['category=animation', 'category=drama',
- ...                             'category=family', 'category=thriller',
- ...                             'year']
- True
+ >>> vec.get_feature_names_out()
+ array(['category=animation', 'category=drama', 'category=family',
+        'category=thriller', 'year'], ...)

>>> vec.transform({'category': ['thriller'],
...                'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
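
Note that the ``unseen_feature`` key is silently dropped: the columns are
fixed at fit time, so they stay aligned with the fitted feature names (a
quick sanity check, sketched under the setup above)::

    >>> vec.transform({'category': ['thriller']}).shape[1] == len(vec.get_feature_names_out())
    True
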
@@ -111,8 +110,9 @@ suitable for feeding into a classifier (maybe after being piped into a
   with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
- >>> vec.get_feature_names()
- ['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
+ >>> vec.get_feature_names_out()
+ array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
+        'word-2=the'], ...)

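The window dictionaries that produce these six features look roughly like
the following (a reconstruction; the full example sits just above this
excerpt)::

    >>> pos_window = [
    ...     {
    ...         'word-2': 'the',
    ...         'pos-2': 'DT',
    ...         'word-1': 'cat',
    ...         'pos-1': 'NN',
    ...         'word+1': 'on',
    ...         'pos+1': 'PP',
    ...     },
    ... ]
    >>> vec = DictVectorizer()
    >>> pos_vectorized = vec.fit_transform(pos_window)
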
As you can imagine, if one extracts such a context around each individual
word of a corpus of documents the resulting matrix will be very wide
@@ -340,10 +340,9 @@ Each term found by the analyzer during the fit is assigned a unique
integer index corresponding to a column in the resulting matrix. This
interpretation of the columns can be retrieved as follows::

- >>> vectorizer.get_feature_names() == (
- ...     ['and', 'document', 'first', 'is', 'one',
- ...      'second', 'the', 'third', 'this'])
- True
+ >>> vectorizer.get_feature_names_out()
+ array(['and', 'document', 'first', 'is', 'one', 'second', 'the',
+        'third', 'this'], ...)

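The reverse mapping, from a term to its column index, is stored in the
``vocabulary_`` attribute; for instance, ``'document'`` is the second
column::

    >>> vectorizer.vocabulary_.get('document')
    1
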
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
@@ -406,8 +405,8 @@ however, similar words are useful for prediction, such as in classifying
writing style or personality.

There are several known issues in our provided 'english' stop word list. It
- does not aim to be a general, 'one-size-fits-all' solution as some tasks
- may require a more custom solution. See [NQY18]_ for more details.
+ does not aim to be a general, 'one-size-fits-all' solution as some tasks
+ may require a more custom solution. See [NQY18]_ for more details.

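The list itself is importable, so it can be audited against a given task
before being relied on (a small sketch)::

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> 'and' in ENGLISH_STOP_WORDS
    True
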
Please take care in choosing a stop word list.
Popular stop word lists may include words that are highly informative to
@@ -742,9 +741,8 @@ decide better::

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
- >>> ngram_vectorizer.get_feature_names() == (
- ...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
- True
+ >>> ngram_vectorizer.get_feature_names_out()
+ array([' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'], ...)
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
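
The two misspelled variants share four of the eight bigram features, which
is what lets a downstream model still relate them; since the counts here
are all zeros and ones, the overlap can be read off as a dot product of the
two count rows (a quick check)::

    >>> int(counts.toarray()[0] @ counts.toarray()[1])
    4
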
@@ -758,17 +756,15 @@ span across words::
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x4 sparse matrix of type '<... 'numpy.int64'>'
   with 4 stored elements in Compressed Sparse ... format>
- >>> ngram_vectorizer.get_feature_names() == (
- ...     [' fox ', ' jump', 'jumpy', 'umpy '])
- True
+ >>> ngram_vectorizer.get_feature_names_out()
+ array([' fox ', ' jump', 'jumpy', 'umpy '], ...)

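The four 5-grams above reflect how ``char_wb`` pads every word with a space
on each side before slicing, so ``' fox '`` is itself exactly one 5-gram;
the vectorizer was presumably constructed just above this excerpt along the
lines of::

    >>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
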
>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x5 sparse matrix of type '<... 'numpy.int64'>'
   with 5 stored elements in Compressed Sparse ... format>
- >>> ngram_vectorizer.get_feature_names() == (
- ...     ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
- True
+ >>> ngram_vectorizer.get_feature_names_out()
+ array(['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'], ...)

The word boundaries-aware variant ``char_wb`` is especially interesting
for languages that use white-spaces for word separation as it generates