Commit eb9fe80

tylerlanigan authored and jnothman committed
DOC add example regarding feature scaling (scikit-learn#7912)
also add load_wine to datasets
1 parent fe07b8c commit eb9fe80

File tree

7 files changed: +552 -34 lines changed

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+"""
+=========================================================
+Importance of Feature Scaling
+=========================================================
+
+Feature scaling through standardization (or Z-score normalization)
+can be an important preprocessing step for many machine learning
+algorithms. Standardization involves rescaling the features such
+that they have the properties of a standard normal distribution
+with a mean of zero and a standard deviation of one.
+
+While many algorithms (such as SVM, K-nearest neighbors, and logistic
+regression) require features to be normalized, intuitively we can
+think of Principal Component Analysis (PCA) as being a prime example
+of when normalization is important. In PCA we are interested in the
+components that maximize the variance. If one feature (e.g. human
+height) varies less than another (e.g. weight) because of their
+respective scales (meters vs. kilos), PCA might determine that the
+direction of maximal variance more closely corresponds with the
+'weight' axis, if those features are not scaled. As a change in
+height of one meter can be considered much more important than a
+change in weight of one kilogram, this is clearly incorrect.
+
+To illustrate this, PCA is performed comparing the use of data with
+:class:`StandardScaler <sklearn.preprocessing.StandardScaler>` applied,
+to unscaled data. The results are visualized and a clear difference noted.
+Looking at the first principal component of the unscaled data, it can be
+seen that feature #13 dominates the direction, being a whole two orders of
+magnitude above the other features. Contrast this with the first principal
+component of the scaled version of the data: there, the orders of magnitude
+are roughly the same across all the features.
+
+The dataset used is the Wine Dataset available at UCI. This dataset
+has continuous features that are heterogeneous in scale due to the
+differing properties that they measure (e.g. alcohol content and
+malic acid).
+
+The transformed data is then used to train a naive Bayes classifier, and
+a clear difference in prediction accuracies is observed: the dataset that
+is scaled before PCA vastly outperforms the unscaled version.
+
+"""
+from __future__ import print_function
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+from sklearn.decomposition import PCA
+from sklearn.naive_bayes import GaussianNB
+from sklearn import metrics
+import matplotlib.pyplot as plt
+from sklearn.datasets import load_wine
+from sklearn.pipeline import make_pipeline
+print(__doc__)
+
+# Code source: Tyler Lanigan <tylerlanigan@gmail.com>
+#              Sebastian Raschka <mail@sebastianraschka.com>
+
+# License: BSD 3 clause
+
+RANDOM_STATE = 42
+FIG_SIZE = (10, 7)
+
+
+features, target = load_wine(return_X_y=True)
+
+# Make a train/test split using 30% test size
+X_train, X_test, y_train, y_test = train_test_split(features, target,
+                                                    test_size=0.30,
+                                                    random_state=RANDOM_STATE)
+
+# Fit to data and predict using pipelined GNB and PCA.
+unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())
+unscaled_clf.fit(X_train, y_train)
+pred_test = unscaled_clf.predict(X_test)
+
+# Fit to data and predict using pipelined scaling, GNB and PCA.
+std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
+std_clf.fit(X_train, y_train)
+pred_test_std = std_clf.predict(X_test)
+
+# Show prediction accuracies in scaled and unscaled data.
+print('\nPrediction accuracy for the normal test dataset with PCA')
+print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
+
+print('\nPrediction accuracy for the standardized test dataset with PCA')
+print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
+
+# Extract PCA from pipeline
+pca = unscaled_clf.named_steps['pca']
+pca_std = std_clf.named_steps['pca']
+
+# Show first principal components
+print('\nPC 1 without scaling:\n', pca.components_[0])
+print('\nPC 1 with scaling:\n', pca_std.components_[0])
+
+# Scale and use PCA on X_train data for visualization.
+scaler = std_clf.named_steps['standardscaler']
+X_train_std = pca_std.transform(scaler.transform(X_train))
+
+# visualize standardized vs. untouched dataset with PCA performed
+fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)
+
+
+for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
+    ax1.scatter(X_train[y_train == l, 0], X_train[y_train == l, 1],
+                color=c,
+                label='class %s' % l,
+                alpha=0.5,
+                marker=m
+                )
+
+for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
+    ax2.scatter(X_train_std[y_train == l, 0], X_train_std[y_train == l, 1],
+                color=c,
+                label='class %s' % l,
+                alpha=0.5,
+                marker=m
+                )
+
+ax1.set_title('Training dataset after PCA')
+ax2.set_title('Standardized training dataset after PCA')
+
+for ax in (ax1, ax2):
+    ax.set_xlabel('1st principal component')
+    ax.set_ylabel('2nd principal component')
+    ax.legend(loc='upper right')
+    ax.grid()
+
+plt.tight_layout()
+
+plt.show()
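
As a quick illustration of what the StandardScaler step in this example does (not part of the commit): the standardization it applies is equivalent to subtracting each feature's mean and dividing by its standard deviation. A minimal sketch, assuming scikit-learn and NumPy are available:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Z-score normalization by hand: zero mean, unit variance per feature.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler fitted on the same data produces the same result.
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_scaled))  # True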

sklearn/datasets/__init__.py

Lines changed: 6 additions & 5 deletions
@@ -3,18 +3,18 @@
 including methods to load and fetch popular reference datasets. It also
 features some artificial data generators.
 """
-
+from .base import load_breast_cancer
+from .base import load_boston
 from .base import load_diabetes
 from .base import load_digits
 from .base import load_files
 from .base import load_iris
-from .base import load_breast_cancer
 from .base import load_linnerud
-from .base import load_boston
-from .base import get_data_home
-from .base import clear_data_home
 from .base import load_sample_images
 from .base import load_sample_image
+from .base import load_wine
+from .base import get_data_home
+from .base import clear_data_home
 from .covtype import fetch_covtype
 from .kddcup99 import fetch_kddcup99
 from .mlcomp import load_mlcomp
@@ -78,6 +78,7 @@
            'load_sample_images',
            'load_svmlight_file',
            'load_svmlight_files',
+           'load_wine',
            'make_biclusters',
            'make_blobs',
            'make_circles',
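
With load_wine imported in sklearn/datasets/__init__.py and listed in __all__, the new loader is available alongside the other built-in datasets. A minimal usage sketch (not part of the diff), relying only on attributes documented in this commit:

from sklearn.datasets import load_wine

wine = load_wine()              # Bunch with data, target, target_names, feature_names, DESCR
print(wine.data.shape)          # (178, 13): 178 samples, 13 features
print(wine.feature_names[:3])   # ['alcohol', 'malic_acid', 'ash']

X, y = load_wine(return_X_y=True)   # plain (data, target) arrays instead of a Bunch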

sklearn/datasets/base.py

Lines changed: 122 additions & 28 deletions
@@ -242,6 +242,122 @@ def load_files(container_path, description=None, categories=None,
                  DESCR=description)
 
 
+def load_data(module_path, data_file_name):
+    """Loads data from module_path/data/data_file_name.
+
+    Parameters
+    ----------
+    data_file_name : String. Name of csv file to be loaded from
+        module_path/data/data_file_name. For example 'wine_data.csv'.
+
+    Returns
+    -------
+    data : Numpy Array
+        A 2D array with each row representing one sample and each column
+        representing the features of a given sample.
+
+    target : Numpy Array
+        A 1D array holding target variables for all the samples in `data`.
+        For example target[0] is the target variable for data[0].
+
+    target_names : Numpy Array
+        A 1D array containing the names of the classifications. For example
+        target_names[0] is the name of the target[0] class.
+    """
+    with open(join(module_path, 'data', data_file_name)) as csv_file:
+        data_file = csv.reader(csv_file)
+        temp = next(data_file)
+        n_samples = int(temp[0])
+        n_features = int(temp[1])
+        target_names = np.array(temp[2:])
+        data = np.empty((n_samples, n_features))
+        target = np.empty((n_samples,), dtype=np.int)
+
+        for i, ir in enumerate(data_file):
+            data[i] = np.asarray(ir[:-1], dtype=np.float64)
+            target[i] = np.asarray(ir[-1], dtype=np.int)
+
+    return data, target, target_names
+
+
+def load_wine(return_X_y=False):
+    """Load and return the wine dataset (classification).
+
+    .. versionadded:: 0.18
+
+    The wine dataset is a classic and very easy multi-class classification
+    dataset.
+
+    ================= ==============
+    Classes                        3
+    Samples per class     [59,71,48]
+    Samples total                178
+    Dimensionality                13
+    Features          real, positive
+    ================= ==============
+
+    Read more in the :ref:`User Guide <datasets>`.
+
+    Parameters
+    ----------
+    return_X_y : boolean, default=False.
+        If True, returns ``(data, target)`` instead of a Bunch object.
+        See below for more information about the `data` and `target` object.
+
+    Returns
+    -------
+    data : Bunch
+        Dictionary-like object, the interesting attributes are:
+        'data', the data to learn, 'target', the classification labels,
+        'target_names', the meaning of the labels, 'feature_names', the
+        meaning of the features, and 'DESCR', the full description of
+        the dataset.
+
+    (data, target) : tuple if ``return_X_y`` is True
+
+    This is a copy of the UCI ML Wine Data Set, downloaded and modified
+    to fit the standard format, from:
+    https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
+
+    Examples
+    --------
+    Let's say you are interested in the samples 10, 80, and 140, and want to
+    know their class name.
+
+    >>> from sklearn.datasets import load_wine
+    >>> data = load_wine()
+    >>> data.target[[10, 80, 140]]
+    array([0, 1, 2])
+    >>> list(data.target_names)
+    ['class_0', 'class_1', 'class_2']
+    """
+    module_path = dirname(__file__)
+    data, target, target_names = load_data(module_path, 'wine_data.csv')
+
+    with open(join(module_path, 'descr', 'wine_data.rst')) as rst_file:
+        fdescr = rst_file.read()
+
+    if return_X_y:
+        return data, target
+
+    return Bunch(data=data, target=target,
+                 target_names=target_names,
+                 DESCR=fdescr,
+                 feature_names=['alcohol',
+                                'malic_acid',
+                                'ash',
+                                'alcalinity_of_ash',
+                                'magnesium',
+                                'total_phenols',
+                                'flavanoids',
+                                'nonflavanoid_phenols',
+                                'proanthocyanins',
+                                'color_intensity',
+                                'hue',
+                                'od280/od315_of_diluted_wines',
+                                'proline'])
+
+
 def load_iris(return_X_y=False):
     """Load and return the iris dataset (classification).
 
@@ -292,18 +408,7 @@ def load_iris(return_X_y=False):
     ['setosa', 'versicolor', 'virginica']
     """
     module_path = dirname(__file__)
-    with open(join(module_path, 'data', 'iris.csv')) as csv_file:
-        data_file = csv.reader(csv_file)
-        temp = next(data_file)
-        n_samples = int(temp[0])
-        n_features = int(temp[1])
-        target_names = np.array(temp[2:])
-        data = np.empty((n_samples, n_features))
-        target = np.empty((n_samples,), dtype=np.int)
-
-        for i, ir in enumerate(data_file):
-            data[i] = np.asarray(ir[:-1], dtype=np.float64)
-            target[i] = np.asarray(ir[-1], dtype=np.int)
+    data, target, target_names = load_data(module_path, 'iris.csv')
 
     with open(join(module_path, 'descr', 'iris.rst')) as rst_file:
         fdescr = rst_file.read()
@@ -370,18 +475,7 @@ def load_breast_cancer(return_X_y=False):
     ['malignant', 'benign']
     """
     module_path = dirname(__file__)
-    with open(join(module_path, 'data', 'breast_cancer.csv')) as csv_file:
-        data_file = csv.reader(csv_file)
-        first_line = next(data_file)
-        n_samples = int(first_line[0])
-        n_features = int(first_line[1])
-        target_names = np.array(first_line[2:4])
-        data = np.empty((n_samples, n_features))
-        target = np.empty((n_samples,), dtype=np.int)
-
-        for count, value in enumerate(data_file):
-            data[count] = np.asarray(value[:-1], dtype=np.float64)
-            target[count] = np.asarray(value[-1], dtype=np.int)
+    data, target, target_names = load_data(module_path, 'breast_cancer.csv')
 
     with open(join(module_path, 'descr', 'breast_cancer.rst')) as rst_file:
         fdescr = rst_file.read()
@@ -517,12 +611,12 @@ def load_diabetes(return_X_y=False):
 
     (data, target) : tuple if ``return_X_y`` is True
 
-    .. versionadded:: 0.18
+    .. versionadded:: 0.18
     """
     base_dir = join(dirname(__file__), 'data')
     data = np.loadtxt(join(base_dir, 'diabetes_data.csv.gz'))
     target = np.loadtxt(join(base_dir, 'diabetes_target.csv.gz'))
-
+
     if return_X_y:
         return data, target
 
@@ -554,7 +648,7 @@ def load_linnerud(return_X_y=False):
         'targets', the two multivariate datasets, with 'data' corresponding to
         the exercise and 'targets' corresponding to the physiological
        measurements, as well as 'feature_names' and 'target_names'.
-
+
     (data, target) : tuple if ``return_X_y`` is True
 
     .. versionadded:: 0.18
@@ -608,7 +702,7 @@ def load_boston(return_X_y=False):
 
     (data, target) : tuple if ``return_X_y`` is True
 
-    .. versionadded:: 0.18
+    .. versionadded:: 0.18
 
     Examples
     --------
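
Since load_iris and load_breast_cancer are refactored here to use the same load_data helper as the new load_wine, all three small CSV-backed loaders now share one return convention. A short sketch of that shared interface, not part of the commit:

from sklearn.datasets import load_breast_cancer, load_iris, load_wine

for loader in (load_iris, load_breast_cancer, load_wine):
    bunch = loader()                  # Bunch: data, target, target_names, DESCR, ...
    X, y = loader(return_X_y=True)    # or plain (data, target) arrays
    print(loader.__name__, bunch.data.shape, list(bunch.target_names))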

sklearn/datasets/data/breast_cancer.csv

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-569,30,malignant,benign,,,,,,,,,,,,,,,,,,,,,,,,,,,
+569,30,malignant,benign
 17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
 20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
 19.69,21.25,130,1203,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
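
The trimmed header matters because the shared load_data helper treats every field after the first two as a class name (target_names = np.array(temp[2:])), whereas the old breast-cancer-specific code read only first_line[2:4]; the trailing empty fields would otherwise have become bogus target names. A small sketch of that parsing, not part of the commit:

import numpy as np

old_header = '569,30,malignant,benign,,,,,,'.split(',')   # shortened for illustration
new_header = '569,30,malignant,benign'.split(',')

# load_data-style parsing: the first two fields are counts, the rest are class names.
print(int(new_header[0]), int(new_header[1]))   # 569 samples, 30 features
print(np.array(old_header[2:]))                 # ['malignant' 'benign' '' '' ...]
print(np.array(new_header[2:]))                 # ['malignant' 'benign']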
