
MACHINE LEARNING MATERIAL

Basic libraries required: pandas, numpy, matplotlib.pyplot and seaborn.
Look at a few data samples with the head() method. Use the info() method to get a quick description of the data. To understand the nature of the numeric attributes, use the describe() method.
Data Visualisation: Enables us to understand the features and their relationships, both among themselves and with the output label.
Relationship between features: Standard correlation coefficient between features.
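A minimal sketch of this first look at a dataset, not from the original notes; the file name data.csv and the seaborn pairplot are assumptions:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (data.csv is a placeholder file name).
df = pd.read_csv("data.csv")

# First look at the data.
print(df.head())        # a few data samples
df.info()               # quick description: columns, dtypes, non-null counts
print(df.describe())    # nature of the numeric attributes

# Standard correlation coefficients between numeric features.
print(df.select_dtypes("number").corr())

# Visualise relationships among features and with the output label.
sns.pairplot(df)
plt.show()
```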
When we look at the test set, we are likely to notice patterns in it, and based on them we may select certain models. This leads to a biased estimate on the test set, which may not generalize well in practice. This is called data snooping bias.
Scikit-Learn provides a few functions for creating test sets based on:
Random sampling, which randomly selects k% of the points for the test set.
Stratified sampling, which samples test examples such that they are representative of the overall distribution.
Random Sampling: The train_test_split() function performs random sampling. Its random_state parameter sets the random seed, which ensures that the same examples are selected for the test set across runs; the test_size parameter specifies the size of the test set; and the shuffle flag specifies whether the data needs to be shuffled before splitting. It can also process multiple datasets with an identical number of rows and select the same indices from all of them, which is useful when the labels are in a different dataframe.
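A minimal sketch of such a split, assuming the toy arrays and the 20% test size below:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels kept separately (they could just as well be
# two dataframes with an identical number of rows).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 20% of the points go to the test set; random_state fixes the seed so the
# same examples land in the test set across runs; shuffle controls whether
# the data is shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```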
Stratified Sampling: The data distribution may not be uniform in real-world data, and random sampling, by its very nature, introduces biases into such data sets. How do we sample? We divide the population into homogeneous groups called strata, and data is sampled from each stratum so as to match the overall data distribution. Scikit-Learn provides the class StratifiedShuffleSplit that helps us with stratified sampling.
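A minimal sketch of a stratified split; the imbalanced toy labels and the 25% test size are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data with an imbalanced class column to stratify on (15 vs 5 examples).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# One stratified split with 25% of the data held out for testing.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

# Class counts in the test set approximately match the overall 3:1 ratio.
print(np.bincount(y_test))
```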
Part 1: Feature Extraction
DictVectorizer: Converts lists of mappings of feature name and feature value into a matrix.
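A minimal sketch of DictVectorizer; the city/temperature mappings are invented for illustration, and get_feature_names_out assumes a reasonably recent scikit-learn:

```python
from sklearn.feature_extraction import DictVectorizer

# Each example is a mapping of feature name -> feature value.
data = [
    {"city": "Chennai", "temperature": 33.0},
    {"city": "Delhi", "temperature": 41.0},
    {"city": "Mumbai", "temperature": 30.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(data)           # feature matrix
print(vec.get_feature_names_out())    # derived column names
print(X)
```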
FeatureHasher: A high-speed, low-memory vectorizer that uses the feature hashing technique. Instead of building a hash table of the features, as the vectorizers do, it applies a hash function to the features to determine their column index in the sample matrices directly. This results in increased speed and reduced memory usage, at the expense of inspectability: the hasher does not remember what the input features looked like and has no inverse_transform method. The output of this transformer is a scipy.sparse matrix.
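A minimal sketch of FeatureHasher; the word-count style mappings and n_features=8 are assumptions:

```python
from sklearn.feature_extraction import FeatureHasher

# Same kind of input as DictVectorizer, but the column index of each feature
# is computed with a hash function instead of a lookup table.
data = [
    {"dog": 1, "cat": 2, "elephant": 4},
    {"dog": 2, "run": 5},
]

hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(data)   # scipy.sparse matrix with 8 columns
print(X.toarray())
# There is no inverse_transform: the original feature names are not stored.
```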
sklearn.feature_extraction.image.* has useful APIs to extract features from image data, and sklearn.feature_extraction.text.* has useful APIs to extract features from text data.
Part 2: Data Cleaning
Handling Missing Values
Missing values occur due to errors in data capture, such as sensor malfunction or measurement errors. Many ML algorithms do not work with missing data and need all features to be present. Discarding records containing missing values would result in a loss of valuable training samples. The sklearn.impute API provides functionality to fill missing values in a dataset.
SimpleImputer: Fills missing values with one of the following strategies: 'mean', 'median', 'most_frequent' and 'constant'.
KNNImputer: Uses a k-nearest-neighbours approach to fill missing values in a dataset. The missing value of an attribute in a specific example is filled with the mean value of the same attribute of its n_neighbors closest neighbours, where the nearest neighbours are decided based on Euclidean distance.
Marking imputed values: It is useful to indicate the presence of missing values in the dataset. MissingIndicator helps us get those indications. It returns a binary matrix in which True values correspond to missing entries in the original dataset.
provides a class StratifiedShuffleSplit Numeric Transformers
that helps us in stratified sampling. Feature Scaling:
Numerical features with different scales a feature which has same value, i.e. zero
leads to slower convergence of iterative variance.
optimization procedures. It is a good
practice to scale numerical features so
that all of them are on the same scale. LabelEncoder: Encodes target labels
1. StandardScaler: with value between 0 and K-1, where K
is number of distinct values.

OrdinalEncoder: Encodes categorical


2. MinMaxScaler: features with value between 0 and
K − 1, where K is number of distinct
values.

OrdinalEncoder can operate multi


dimensional data, while LabelEncoder
can transform only 1D data.
3. MaxAbsScaler: LabelBinarizer: Several regression and
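A minimal sketch comparing the three scalers on an assumed toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

X = np.array([[1.0, -10.0],
              [2.0, 0.0],
              [3.0, 10.0],
              [4.0, 20.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X))    # each column divided by its max |value|
```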
FunctionTransformer: Builds a transformer from an arbitrary user-defined function, for example a log transformation.
Polynomial transformation: PolynomialFeatures generates polynomial and interaction features of the inputs up to a specified degree.
add_dummy_feature: Augments the dataset with a column vector; each value in the column vector is 1.
KBinsDiscretizer: Bins continuous features into discrete intervals.
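A minimal sketch of these four transformations; the toy matrix, degree=2 and n_bins=3 are assumptions:

```python
import numpy as np
from sklearn.preprocessing import (FunctionTransformer, PolynomialFeatures,
                                   KBinsDiscretizer, add_dummy_feature)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Transformer built from an arbitrary callable, here log(1 + x).
print(FunctionTransformer(np.log1p).fit_transform(X))

# Degree-2 polynomial and interaction features: 1, x1, x2, x1^2, x1*x2, x2^2.
print(PolynomialFeatures(degree=2).fit_transform(X))

# Augment the dataset with a leading column of ones.
print(add_dummy_feature(X))

# Discretise each continuous feature into 3 ordinal bins of equal width.
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(kbd.fit_transform(X))
```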
Categorical Transformers
OneHotEncoder: Encodes a categorical feature or label as a one-hot numeric array. It creates one binary column for each of the K unique values; exactly one column has 1 in it and the rest have 0.
LabelEncoder: Encodes target labels with values between 0 and K-1, where K is the number of distinct values.
OrdinalEncoder: Encodes categorical features with values between 0 and K-1, where K is the number of distinct values. OrdinalEncoder can operate on multi-dimensional data, while LabelEncoder can transform only 1D data.
LabelBinarizer: Several regression and binary classification models can be extended to a multi-class setup in one-vs-all fashion. This involves training a single regressor or classifier per class, which requires converting multi-class labels to binary labels, and LabelBinarizer performs this task. If the estimator already supports multi-class data, LabelBinarizer is not needed.
MultiLabelBinarizer: Encodes multi-label targets, where each example can carry a set of labels, as a binary indicator matrix with one column per class.
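A minimal sketch of the encoders; the colour/animal data are invented, and sparse_output assumes scikit-learn 1.2 or newer (older versions call it sparse):

```python
import numpy as np
from sklearn.preprocessing import (OneHotEncoder, LabelEncoder, OrdinalEncoder,
                                   LabelBinarizer, MultiLabelBinarizer)

X = np.array([["red"], ["green"], ["blue"], ["green"]])   # 2D feature column
y = ["cat", "dog", "cat", "bird"]                         # 1D target labels

print(OneHotEncoder(sparse_output=False).fit_transform(X))  # one binary column per value
print(OrdinalEncoder().fit_transform(X))                    # integers 0..K-1, works on 2D data
print(LabelEncoder().fit_transform(y))                      # integers 0..K-1, 1D targets only
print(LabelBinarizer().fit_transform(y))                    # one-vs-all binary label matrix

# Multi-label targets: each example has a *set* of labels.
y_multi = [{"action", "comedy"}, {"drama"}, {"action"}]
print(MultiLabelBinarizer().fit_transform(y_multi))          # binary indicator matrix
```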
PART 3: FEATURE SELECTION
Features that do not contribute significantly can be removed. This decreases the size of the dataset and hence the computational cost of fitting a model.
Filter Based
Removing features with low variance
VarianceThreshold: Removes from the input feature matrix all features with variance below a threshold specified by the user. By default it removes any feature that has the same value in every example, i.e. zero variance.
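A minimal sketch of VarianceThreshold; the toy matrix and the 0.05 threshold are assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Second column is constant (zero variance) and the third barely varies.
X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 0.1],
              [3.0, 5.0, 0.0],
              [4.0, 5.0, 0.1]])

# With the default threshold (0.0) only the constant column is dropped;
# a user-specified threshold removes every feature whose variance is below it.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)
print(selector.get_support())   # [ True False False]
print(X_reduced)
```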
