CH 02 Data Handling Techniques
Features of data preprocessing
• Data imputation
• Data validation
• Observations are not recorded for certain fields for various reasons. For example, there might be a failure in recording the values due to human error.
• Missing data falls into three categories:
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Missing not at random (MNAR)
Missing Completely At Random (MCAR)
• In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no
relationship between the missing data and any other values observed or unobserved (the data which is not
recorded) within the given dataset. That is, missing values are completely independent of other data. There is no
pattern.
• In the case of MCAR data, the value could be missing due to human error, some system/equipment failure, loss
of sample, or some unsatisfactory technicalities while recording the values. For Example, suppose in a library
there are some overdue books. Some values of overdue books in the computer system are missing. The reason
might be a human error, like the librarian forgetting to type in the values. So, the missing values of overdue books
are not related to any other variable/data in the system. However, MCAR should generally not be assumed, since such cases are rare. The advantage of such data is that the statistical analysis remains unbiased.
Missing At Random (MAR)
• MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is
some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is
missing only within sub-samples of the data, and there is some pattern in the missing values.
• For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (The reason being most of the females don’t want to reveal
their age.)
• So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are
related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing
value itself.
• Suppose a poll is taken about overdue books in a library, asking for gender and the number of overdue books. Assume that most of the females answer the poll while men are less likely to answer. The reason why the data is missing can thus be explained by another factor, namely gender. In this case, the statistical analysis might result in bias; an unbiased estimate of the parameters can be obtained only by modeling the missing data.
Missing Not At Random (MNAR)
• Missing values depend on the unobserved data: there is some structure/pattern in the missing data that the other observed data cannot explain.
• If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of
people to provide the required information. A specific group of respondents may not answer some questions in a survey.
• For example, suppose the name and the number of overdue books are asked in the poll for a library. Most of the people having no overdue books are likely to answer the poll, while people having many overdue books are less likely to answer it. So, in this case, the missingness of the number of overdue books depends on the value itself: those with more overdue books tend not to report it.
• Another example is that people having less income may refuse to share some information in a survey or questionnaire.
• In the case of MNAR as well, the statistical analysis might result in bias.
Handling Missing Data
• Why Do We Need to Care About Handling Missing Data?
• Many machine learning algorithms fail if the dataset contains missing values. However, algorithms such as k-nearest neighbours and Naive Bayes can handle data with missing values.
• You may end up building a biased machine learning model, leading to incorrect results if
the missing values are not handled properly.
• Now that you have found the missing data, how do you handle the missing values?
• Analyze each column with missing values carefully to understand the reasons behind the
missing of those values, as this information is crucial to choose the strategy for handling
the missing values.
• If you can make an educated guess about the missing value, you can replace it with some arbitrary value, as in the sketch below, where the missing values of the ‘Dependents’ column are replaced with ‘0’.
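A minimal pandas sketch of this kind of replacement, using a small toy frame (the DataFrame name train_df and its contents are assumptions; the column name ‘Dependents’ comes from the slide):
import pandas as pd

# Toy stand-in for the training data
train_df = pd.DataFrame({'Dependents': ['1', None, '2', None]})

# Replace missing 'Dependents' values with the arbitrary value '0'
train_df['Dependents'] = train_df['Dependents'].fillna('0')
print(train_df)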
Replacing with the mean or median
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
Replacing with the previous value – forward fill
• In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate.
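A minimal forward-fill sketch with pandas, reusing the train_df frame and the ‘Loan_Amount_Term’ column from the example above as assumptions:
# Fill each missing value with the previous observed value in the column
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].ffill()
# .bfill() fills backwards instead, if the first rows are the missing ones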
Feature scaling techniques
1) MinMax Scaler
2) Standard Scaler
3) MaxAbsScaler
4) Robust Scaler
5) Log Transform
6) Power Transformer Scaler
7) Unit Vector Scaler / Normalizer
8) Custom Transformer
Base code in Python
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
We will execute this snippet before using a new scaler every time.
MinMax Scaler
Though (0, 1) is the default range, we can define our range of max and
min values as well.
MinMax Scaler – worked example (Income column)
x_scaled = (x – x_min)/(x_max – x_min), with x_min = 1,800 and x_max = 1,20,000, so x_max – x_min = 1,18,200
Income = 15,000 → (15,000 – 1,800)/1,18,200 = 13,200/1,18,200 = 0.111675
Income = 1,800 → (1,800 – 1,800)/1,18,200 = 0
Income = 1,20,000 → calculate
Income = 10,000 → calculate
The min-max scaler lets you set the range in which you want the variables to be.
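A minimal sketch with scikit-learn's MinMaxScaler, reusing df_scaled, col_names, and features from the base code above (running the base code first is assumed):
from sklearn.preprocessing import MinMaxScaler

# Scale 'Income' and 'Age' into the default (0, 1) range
scaler = MinMaxScaler()            # e.g. MinMaxScaler(feature_range=(0, 5)) for a custom range
df_scaled[col_names] = scaler.fit_transform(features.values)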
Standard Scaler
• For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation is 1 (and hence the variance is 1).
x_scaled = (x – mean)/std_dev
• The Standard Scaler assumes that the distribution of the variable is roughly normal. If the variables are not normally distributed, they can first be transformed to be closer to normal, or a different scaler can be chosen.
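A minimal StandardScaler sketch on the same assumed base-code variables:
from sklearn.preprocessing import StandardScaler

# Centre each feature at mean 0 and rescale to unit standard deviation
scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)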
MaxAbsScaler
• It first takes the absolute value of each value in the column and then takes the maximum of those. Every value is then divided by this maximum absolute value, which scales the data into the range [-1, 1].
• Since Age and Income cannot be negative, let us take one more variable, Balance, which can take negative values.
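A minimal MaxAbsScaler sketch; the signed column ‘Balance’ is assumed to exist in df_scaled:
from sklearn.preprocessing import MaxAbsScaler

# Divide each value by the column's maximum absolute value, giving a [-1, 1] range
scaler = MaxAbsScaler()
df_scaled[['Balance']] = scaler.fit_transform(df_scaled[['Balance']].values)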
Log Transform
• Thus, in our example, the histogram of Income ranges from 0 to 1,20,000.
• While our Income column had extreme values ranging from 1,800 to 1,20,000, the log values now range from approximately 7.5 to 11.7. The log operation thus plays a dual role: it compresses the range of the variable and reduces its skewness.
• If our data has negative values, NaNs, or zeros, we cannot apply the log transform directly, since the log of negative numbers and of zero is undefined (and the log of values between 0 and 1 is negative, which changes the sign of those entries).
Original data and log-transformed data
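A minimal natural-log transform sketch with NumPy on the assumed df_scaled frame; log1p is used so that zero incomes do not break the transform:
import numpy as np

# Natural-log transform of the skewed Income column; log1p computes log(1 + x)
df_scaled['Income_log'] = np.log1p(df_scaled['Income'])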
Unit Vector Scaler / Normalizer
• If we are using the L1 norm, the values in each row are scaled so that the sum of their absolute values along the row = 1.
• If we are using the L2 norm, the values in each row are first squared and added, and then scaled so that the sum of the squares along the row = 1 (each row becomes a unit vector).
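A minimal sketch of scikit-learn's Normalizer on the assumed base-code variables; norm can be 'l1' or 'l2':
from sklearn.preprocessing import Normalizer

# Rescale each row (sample) so that its L2 norm is 1; use norm='l1' for the L1 variant
scaler = Normalizer(norm='l2')
df_scaled[col_names] = scaler.fit_transform(features.values)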
Categorical Feature Handling
• Categorical data can be encoded using Label Encoding or One-Hot Encoding.
• With label encoding, each country is replaced by an integer (e.g., India = 0, Japan = 1, the US = 2). Due to this, there is a very high probability that the model captures a spurious ordinal relationship between the countries, such as India < Japan < the US.
One-Hot Encoding vs Label Encoding
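A minimal sketch contrasting the two encodings; the DataFrame and the 'Country' column below are toy assumptions:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_cat = pd.DataFrame({'Country': ['India', 'Japan', 'US', 'India']})

# Label encoding: each country becomes an integer, which implies an ordering
df_cat['Country_label'] = LabelEncoder().fit_transform(df_cat['Country'])

# One-hot encoding: one binary column per country, no implied ordering
df_cat_onehot = pd.get_dummies(df_cat['Country'], prefix='Country')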
Box Plot
• The box plot is another very simple visualization tool for detecting outliers; it uses the interquartile range (IQR) technique.
Detecting outliers
• For normal distributions: use the empirical relations of the normal distribution. The data points that fall below mean – 3·(sigma) or above mean + 3·(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
• Using the inter-quartile range (IQR): use the proximity rule. The data points that fall below Q1 – 1.5·IQR or above the third quartile Q3 + 1.5·IQR are considered outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively, and IQR = Q3 – Q1.
• Percentile-based approach: for example, data points that lie above the 99th percentile or below the 1st percentile are considered outliers.
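A minimal IQR-rule sketch in pandas, applied to the assumed df frame and its 'Income' column from earlier:
# Flag Income values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df['Income'].quantile(0.25)
q3 = df['Income'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['Income'] < q1 - 1.5 * iqr) | (df['Income'] > q3 + 1.5 * iqr)]
print(outliers)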
Imbalanced data handling
Data imbalance Problem
• Classification problems are quite common in the machine learning world. As we know, in a classification problem we try to predict the class label by studying the input data or predictors, where the target or output variable is categorical in nature.
• Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations and the other has a very low number of observations.
• For example, in a fraud detection dataset, the number of frauds per 100 transactions may be less than 2%, i.e., more than 98% of the transactions are “No Fraud” in nature. Here, the class “No Fraud” is called the majority class, and the much smaller “Fraud” class is called the minority class.
Data imbalance
More such examples of imbalanced data are:
· Disease diagnosis
· Customer churn prediction
· Fraud detection
· Natural disaster
Data imbalance
• The main problem with imbalanced dataset prediction is: how accurately are we actually predicting both the majority and the minority class?
• Let’s explain it with an example of disease diagnosis.
• Let’s assume we are going to predict disease from an existing dataset where, for every 100 records, only 5 patients are diagnosed with the disease. What are the majority class and the minority class?
• The majority class is the 95% with no disease and the minority class is the 5% with the disease.
• Now, the ML model might predict that all 100 out of 100 patients have no disease.
Data imbalance
• Sometimes, when the records of a certain class are much more numerous than those of the other class, our classifier may get biased towards predicting the majority class.
• In this case, the confusion matrix for the classification problem shows how well our model classifies the target classes, and we arrive at the accuracy of the model from the confusion matrix.
• Accuracy is calculated as the total number of correct predictions by the model divided by the total number of predictions. In the above case it is (0 + 95)/(0 + 95 + 0 + 5) = 0.95, or 95%. This means that the model fails to identify the minority class, yet the accuracy score of the model is 95%.
Data imbalance
• Thus our traditional approach of classification and model
accuracy calculation is not useful in the case of the imbalanced
dataset.
ML Classifier with an Imbalanced Data Set
[Figure: a classifier trained on imbalanced data – majority class “healthy”, minority class “heart disease”, and a new case to be classified]
• If the classifier identifies the minority class poorly, i.e. more of this class is wrongfully predicted as the majority class, then false negatives (FN) increase.
• If the classifier predicts the minority class but the prediction is erroneous, then false positives (FP) increase and the precision metric will be low.
Approach to deal with the imbalanced dataset problem
• In rare cases like fraud detection or disease prediction, it is vital
to identify the minority classes correctly. Techniques for the
same are
1. Choose a Proper Evaluation Metric
If the classifier predicts the minority class but the prediction is erroneous and false positives increase, the precision metric will be low, and so will the F1 score. Also, if the classifier identifies the minority class poorly, i.e. more of this class is wrongfully predicted as the majority class, then false negatives will increase, so recall and the F1 score will be low. The F1 score increases only if both the number and the quality of predictions improve.
The F1 score keeps the balance between precision and recall and improves the score only if the classifier identifies more of a certain class correctly.
Choose a proper evaluation metric
• Precision:
• the number of true positives divided by all positive predictions.
• Precision is also called Positive Predictive Value.
• It is a measure of a classifier’s exactness.
• Low precision indicates a high number of false positives.
• Recall:
• the number of true positives divided by the number of positive values in the test data.
• The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s
completeness.
• Low recall indicates a high number of false negatives.
• F1 Score:
• the harmonic mean of precision and recall.
• Area Under ROC Curve (AUROC):
• AUROC represents the likelihood of your model distinguishing observations from two classes.
• In other words, if you randomly select one observation from each class, what’s the
probability that your model will be able to “rank” them correctly?
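To make the contrast concrete, here is a minimal scikit-learn sketch computing these metrics for the earlier 95/5 disease example; y_true and y_pred are assumed toy arrays in which the model predicts "no disease" for every patient:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 5 diseased patients out of 100; the model predicts "no disease" (0) for everyone
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                     # 0.95 - misleadingly high
print(precision_score(y_true, y_pred, zero_division=0))   # 0.0 - no positive predictions at all
print(recall_score(y_true, y_pred))                       # 0.0 - no diseased patient is found
print(f1_score(y_true, y_pred, zero_division=0))          # 0.0
# roc_auc_score would need predicted probabilities/scores rather than hard labels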
2. Resampling (Oversampling and Undersampling)
• When we are using an imbalanced dataset, we can oversample the minority class using replacement; this is called oversampling.
• Similarly, we can randomly delete rows from the majority class to match them with the minority class; this is called undersampling.
• After resampling the data we get a balanced dataset for both the majority and the minority classes. So, when both classes have a similar number of records present in the dataset, we can assume that the classifier will give equal importance to both classes.
[Figure: illustration of undersampling and oversampling]
2. Resampling (Oversampling and Undersampling)
It has been observed that our target class has an imbalance. So,
we’ll try to upsample the data so that the minority class
matches with the majority class.
import pandas as pd
from sklearn.utils import resample

# create two different dataframes of majority and minority class
df_majority = df_train[(df_train['Is_Lead'] == 0)]
df_minority = df_train[(df_train['Is_Lead'] == 1)]

# upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,       # sample with replacement
                                 n_samples=131177,   # to match majority class
                                 random_state=42)    # reproducible results

# combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
After upsampling, the distribution of the two classes is balanced.
2. Resampling (Oversampling and Undersampling)
sklearn.utils.resample can be used both to undersample the majority class and to oversample minority class instances.
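A minimal undersampling sketch with the same resample utility, reusing the df_majority and df_minority frames from the snippet above:
# randomly drop majority rows so the two classes end up the same size
df_majority_downsampled = resample(df_majority,
                                   replace=False,                # sample without replacement
                                   n_samples=len(df_minority),   # match minority class size
                                   random_state=42)
df_downsampled = pd.concat([df_majority_downsampled, df_minority])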
3. SMOTE
• SMOTE (Synthetic Minority Oversampling Technique) creates new synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbours, instead of simply duplicating rows.
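A minimal SMOTE sketch, assuming the imbalanced-learn package is installed and that X and y hold the training features and target:
from imblearn.over_sampling import SMOTE

# generate synthetic minority samples until the two classes are balanced
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)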
Feature Selection
• In real-life machine learning problems, it is almost rare that all the variables in the dataset are useful for building a model; feature selection is used to keep only the most relevant ones.
1. Filter Methods
• Features are selected on the basis of statistical measures.
• The filter method filters out irrelevant features and redundant columns from the model by ranking them using different metrics.
Some common techniques of Filter methods are
• Information Gain
• Chi-square Test
• Fisher’s Score
• Correlation Coefficient
• Variance Threshold
• Mean Absolute Difference (MAD)
2. Wrapper Methods
• In wrapper methodology, the selection of features is done by
considering it as a search problem, in which different combinations are
made, evaluated, and compared with other combinations.
• It trains the algorithm by using the subset of features iteratively.
• On the basis of the output of the model, features are added or
subtracted, and with this feature set, the model is trained again.
Filter vs Wrapper vs Embedded Methods
• Filter methods pick up the intrinsic properties of the features measured via univariate statistics instead of cross-validation performance. These methods are faster and less computationally expensive than wrapper methods. When dealing with high-dimensional data, it is computationally cheaper to use filter methods. Techniques: Information Gain, Chi-square Test, Fisher’s Score, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD).
• Wrapper methods require some method to search the space of all possible subsets of features, assessing their quality by learning and evaluating a classifier with that feature subset. The wrapper methods usually result in better predictive accuracy than filter methods. Techniques: Forward Feature Selection, Backward Feature Elimination, Exhaustive Feature Selection, Recursive Feature Elimination.
• Embedded methods encompass the benefits of both the wrapper and filter methods by including interactions of features while maintaining reasonable computational costs. Embedded methods are iterative in the sense that each iteration of the model training process carefully extracts the features which contribute the most to the training for that particular iteration. Techniques: Regularization (L1), Tree-based methods.
Information Gain
• Information gain measures the reduction in entropy obtained by splitting the dataset on a particular feature; features with a higher information gain are more relevant to the target variable under study.
Correlation Coefficient
• A correlation coefficient is a numerical measure of some
type of correlation, meaning a statistical relationship
between two variables. It lies between -1 to +1.
r = Σ(xi – x̄)(yi – ȳ) / √[ Σ(xi – x̄)² · Σ(yi – ȳ)² ]
• r = correlation coefficient
• xi = values of the x-variable in a sample
• yi = values of the y-variable in a sample
• x̄ and ȳ = means of x and y, respectively
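A minimal pandas sketch of using the correlation coefficient for feature screening; df is assumed to contain only numeric columns, including an assumed target column named 'Target':
# absolute Pearson correlation of every feature with the target, highest first
corr_with_target = df.corr()['Target'].abs().sort_values(ascending=False)
print(corr_with_target)
# features with very low absolute correlation to the target are candidates for removal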
Variance Threshold
• It removes all features whose variance doesn't meet some threshold.
• By default, it removes all zero-variance features, i.e., features with the same value in all samples.
• The assumption made when using this method is that features with higher variance are more likely to contain useful information.
• In the sketch below, get_support returns a Boolean vector where True means the variable does not have zero variance (i.e., the feature is kept).
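A minimal scikit-learn VarianceThreshold sketch, reusing the df and col_names names from the scaling base code as assumptions:
from sklearn.feature_selection import VarianceThreshold

# default threshold 0.0: drop features that are constant across all samples
selector = VarianceThreshold(threshold=0.0)
selector.fit(df[col_names])

print(selector.get_support())   # Boolean mask: True = feature kept
kept = [c for c, keep in zip(col_names, selector.get_support()) if keep]
print(kept)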
PCA and LDA
• When using PCA, we take our original data as input and try to find a combination of the input features that best summarizes the original data distribution, so as to reduce its original dimensionality.
• LDA aims to maximize the distance between the means of the classes and to minimize the spread within each class.
• LDA therefore uses within-class and between-class scatter as measures. This is a good choice because maximizing the distance between the class means when projecting the data into a lower-dimensional space can lead to better classification results.
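A minimal scikit-learn sketch of both reductions; X (feature matrix) and y (class labels) are assumed, and the component counts are illustrative:
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# PCA: unsupervised - keeps the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised - n_components can be at most (number of classes - 1)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)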