SK Learn 1
Scikit-learn is one of the best-known Python packages for machine learning. It provides solid, efficient implementations
of a range of machine learning algorithms.
It offers tools for machine learning and statistical modeling, including classification, regression, clustering,
and dimensionality reduction, through a consistent interface in Python.
The names "scikit-learn" and "sklearn" refer to the same package. "scikit-learn" is the official name of the package and
is used when installing it, whereas "sklearn" is the abbreviated name used to import the package in
Python code.
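The difference between the two names can be checked directly. The sketch below assumes scikit-learn is already installed and that Python 3.8 or later is in use (for importlib.metadata). It looks up the installed distribution under its official name and imports the module under its short name.

```python
from importlib.metadata import version  # standard library, Python 3.8+

import sklearn  # the import name is the abbreviated "sklearn"

# The distribution (install) name is "scikit-learn".
dist_version = version("scikit-learn")

# Both names refer to the same installed package.
print(dist_version)
print(sklearn.__version__)
```

Both print statements should report the same version string.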
Installing Scikit-learn
Using pip
Install the latest release from PyPI:
pip install -U scikit-learn
Scikit-learn Version
In [8]:
from sklearn import __version__
print(__version__)
0.24.2
Dataset
A dataset is a collection of data. It has two components: features and a response.
Features: The variables of the data are called its features. They are also known as predictors, inputs, or attributes.
Feature matrix: The collection of features, with one row per sample and one column per feature. By convention, it is stored in a variable named X.
Response: The output variable, which depends on the feature variables. It is also known as the target, label, or
output.
Response/Target Vector: It represents the response column. Generally, there is just one response column, conventionally stored in a variable named y.
Target Names: They represent the possible values taken by the response vector.
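To make these terms concrete, here is a minimal sketch built with NumPy only. The feature names, values, and class names are made up for illustration and do not come from any scikit-learn dataset.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples, 2 features (values are made up).
feature_names = ["height_cm", "weight_kg"]
X = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [175.0, 72.0],
    [182.0, 85.0],
])

# Target vector: one integer class label per sample.
y = np.array([0, 0, 1, 1])

# Target names: position i holds the name of class i.
target_names = np.array(["small", "large"])

print(X.shape)          # (n_samples, n_features)
print(y.shape)          # (n_samples,)
print(target_names[y])  # class name for each sample
```

The feature matrix is two-dimensional and the target vector is one-dimensional. Indexing target_names with y turns the integer labels back into readable class names.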
Built-in Dataset
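Scikit-learn ships with a few small built-in datasets, such as "iris", "digits", "boston", and "diabetes". The "boston" dataset is available in the version shown above (0.24.2) but was removed in scikit-learn 1.2.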
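The other built-in datasets are loaded the same way, through the load_* functions in sklearn.datasets. The sketch below uses "digits" and "diabetes" as examples. It shows the default return value (a Bunch object with data and target attributes) and the return_X_y=True option, which returns the feature matrix and target vector directly.

```python
from sklearn.datasets import load_diabetes, load_digits

# The default call returns a Bunch with .data, .target, .DESCR, and more.
digits = load_digits()
print(digits.data.shape)    # feature matrix
print(digits.target.shape)  # target vector

# return_X_y=True returns the (X, y) pair directly.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)
```

The return_X_y form is convenient when only X and y are needed and the description and names stored in the Bunch are not.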
In [26]:
# Let's load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
In [14]:
print("iris dataset description: \n{}".format(iris.DESCR))
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
.. topic:: References
In [35]:
print("iris data size is {}".format(iris.data.shape))
print("iris target size is {}".format(iris.target.shape))
print("iris data has {} features".format(iris.data.shape[1]))
print("the feature names: \n\t{}".format(iris.feature_names))
print("iris data has {} samples".format(iris.data.shape[0]))
print("the target label names: \n\t{}".format(iris.target_names))
In [42]:
X = iris.data # Feature matrix
y = iris.target # Target Vector
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
y datatype: int64
y dimensions: 1
First 10 rows of Y:
[0 0 0 0 0 0 0 0 0 0]
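The statements that printed the output above are not shown in the cell. A minimal sketch that would print comparable output, assuming the iris object comes from load_iris(), could look like this. The exact spacing of the printed arrays, and the integer dtype on some platforms, may differ.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix
y = iris.target  # target vector

# Slicing with [:10] selects the first ten samples.
print("First 10 rows of X:\n", X[:10])
print("y datatype:", y.dtype)
print("y dimensions:", y.ndim)
print("First 10 rows of Y:\n", y[:10])
```

The first ten labels are all 0 because the samples in the iris data are ordered by class.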
Splitting Dataset