Outlier Detection and Capping

There are several techniques to detect and handle outliers in a dataset. The document discusses and demonstrates 1) using z-scores to identify outliers more than 3 standard deviations from the mean, 2) capping outlier values between the 1st and 99th percentiles to remove their influence, and 3) two methods for capping outliers in Python - using np.where() to replace values below/above thresholds and clip() to restrict values within a given range. Boxplots are used before and after handling outliers to visualize the impact of these outlier treatment steps.

Uploaded by

santro985

Handling Outliers

May 30, 2022

[18]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

[6]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")
data.head()

[6]:    age           job  marital  education  default  balance  housing  loan  \
     0   58    management  married   tertiary       no     2143      yes    no
     1   44    technician   single  secondary       no       29      yes    no
     2   33  entrepreneur  married  secondary       no        2      yes   yes
     3   47   blue-collar  married    unknown       no     1506      yes    no
     4   33       unknown   single    unknown       no        1       no    no

        contact  day  month  duration  campaign  pdays  previous  poutcome   y
     0  unknown    5    may       261         1     -1         0   unknown  no
     1  unknown    5    may       151         1     -1         0   unknown  no
     2  unknown    5    may        76         1     -1         0   unknown  no
     3  unknown    5    may        92         1     -1         0   unknown  no
     4  unknown    5    may       198         1     -1         0   unknown  no

1 There are different techniques to detect outliers

1.1 1- Z Score
1.2 Z = (Observation - Mean) / Standard Deviation
1.2.1 The Z score is not advisable if the data is skewed.
We calculate the Z score for each observation. If the Z score is greater than 3 or less than -3, we classify that
observation as an outlier. In other words, any value more than 3 standard deviations above or below the mean is
treated as an outlier.
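As a minimal sketch of the rule above, the cell below computes Z scores on a small made-up series and flags anything beyond 3 standard deviations. The `values` data and the variable names are illustrative assumptions and do not come from the term deposit dataset.

```python
import pandas as pd

# Hypothetical toy data: twenty values close to 40 and one extreme value (120).
values = pd.Series([38, 41, 40, 39, 42, 40, 37, 43, 41, 39] * 2 + [120])

# Z = (Observation - Mean) / Standard Deviation
# Series.std() uses the sample standard deviation (ddof=1) by default.
z_scores = (values - values.mean()) / values.std()

# An observation is flagged when its Z score is above 3 or below -3.
is_outlier = z_scores.abs() > 3

print(values[is_outlier])
```

On this toy series only the value 120 is flagged. Note that a single extreme value also inflates the mean and the standard deviation, which is one reason the rule becomes less reliable on small or skewed samples.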
[10]: sns.displot(x='age',data=data)

[10]: <seaborn.axisgrid.FacetGrid at 0x1f9b2f74af0>

The data is not normally distributed; it is right-skewed.
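Besides reading the plot, skewness can also be checked numerically. The sketch below uses a made-up sample (the `sample` series is an assumption for illustration only) and relies on `Series.skew()`, which returns a positive number for a right-skewed distribution. A mean that is larger than the median points in the same direction.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large.
sample = pd.Series([20, 22, 23, 25, 25, 27, 30, 32, 60, 90])

# Positive skewness indicates a longer tail on the right.
skewness = sample.skew()

# In a right-skewed sample the mean is pulled above the median.
mean_above_median = sample.mean() > sample.median()

print(skewness)
print(mean_above_median)
```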

2 The following methods are used when the data is normally distributed
[4]: high_limit=data.age.mean()+3*data.age.std()
low_limit=data.age.mean()-3*data.age.std()

[5]: print(high_limit,'\n')
print(low_limit)

72.79249633725466

9.079924091402077

[6]: data['age'].describe()

[6]: count    45211.000000
     mean        40.936210
     std         10.618762
     min         18.000000
     25%         33.000000
     50%         39.000000
     75%         48.000000
     max         95.000000
     Name: age, dtype: float64

[20]: sns.boxplot(y='age',data=data) # a boxplot draws outliers as individual points

[20]: <AxesSubplot:ylabel='age'>
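To make explicit which points a boxplot draws as outliers, here is a short worked calculation. It assumes seaborn's default whisker setting (`whis=1.5`, i.e. 1.5 times the interquartile range) and takes the 25% and 75% values from the `describe()` output above. The variable names are only illustrative.

```python
# Quartiles copied from data['age'].describe() above.
q1 = 33.0  # 25%
q3 = 48.0  # 75%

# Interquartile range and the 1.5 * IQR fences (assumes the default whis=1.5).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(lower_fence, upper_fence)  # 10.5 70.5
```

Under this assumption, ages above 70.5 are plotted as individual points, while no age falls below the lower fence because the minimum age is 18.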

3 A very common method to cap outliers is the quantile method.

4 We cap the values between the 1st percentile and the 99th percentile.

5 This method can be used even if the data is skewed.

[21]: # We have to set the limits.
# The lower limit will be the 1st percentile value.
# The upper limit will be the 99th percentile value.

6 We are treating the column age.
[22]: lower_limit=data.age.quantile(0.01)
upper_limit=data.age.quantile(0.99)
print(lower_limit)
print(upper_limit)

23.0
71.0
Now that we have our lower and upper limits, we cap the data between these two values. Capping replaces each
outlier with the nearest limit instead of dropping the row, and we can check the result with the help of a boxplot.
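Before capping, it can help to count how many observations fall outside the limits. The sketch below does this on a made-up series of the integers 1 to 100; the `ages` data and the names `n_below` and `n_above` are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical data: the integers 1..100.
ages = pd.Series(range(1, 101))

# 1st and 99th percentile limits, computed the same way as above.
lower_limit = ages.quantile(0.01)
upper_limit = ages.quantile(0.99)

# Number of observations that capping would change on each side.
n_below = int((ages < lower_limit).sum())
n_above = int((ages > upper_limit).sum())

print(n_below, n_above)
```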

7 To cap the values we can use two methods.

8 1- np.where()

9 2- clip()
We will use both methods, one at a time.

10 1- np.where()
[24]: data.age=np.where(data.age<lower_limit,lower_limit,np.where(data.age>upper_limit,upper_limit,data.age))
data.age.head()

[24]: 0 58.0
1 44.0
2 33.0
3 47.0
4 33.0
Name: age, dtype: float64
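The nested `np.where()` call can be read as: if the value is below the lower limit use the lower limit, otherwise if it is above the upper limit use the upper limit, otherwise keep the value. The sketch below applies it to a small made-up series (the `ages` values are an assumption for illustration) with the limits 23.0 and 71.0 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 18 is below the lower limit and 95 is above the upper limit.
ages = pd.Series([18, 25, 40, 60, 95])
lower_limit, upper_limit = 23.0, 71.0  # limits printed in the cell above

# Inner np.where caps the high values, outer np.where caps the low values.
capped = np.where(ages < lower_limit, lower_limit,
                  np.where(ages > upper_limit, upper_limit, ages))

print(capped)  # [23. 25. 40. 60. 71.]
```

Because the limits are floats, the result is a float array, which would explain why the capped age column above is shown as 58.0 rather than 58.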

We have capped the values. Now let's check the outliers with the help of a boxplot.
[26]: sns.boxplot(y='age',data=data)

[26]: <AxesSubplot:ylabel='age'>

Comparing this boxplot with the one we created earlier, we can see that the outliers have been treated.

11 2- clip()
To use clip(), let's import the data again, then check and cap the outliers.

[28]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")
data.head()

[28]:    age           job  marital  education  default  balance  housing  loan  \
      0   58    management  married   tertiary       no     2143      yes    no
      1   44    technician   single  secondary       no       29      yes    no
      2   33  entrepreneur  married  secondary       no        2      yes   yes
      3   47   blue-collar  married    unknown       no     1506      yes    no
      4   33       unknown   single    unknown       no        1       no    no

         contact  day  month  duration  campaign  pdays  previous  poutcome   y
      0  unknown    5    may       261         1     -1         0   unknown  no
      1  unknown    5    may       151         1     -1         0   unknown  no
      2  unknown    5    may        76         1     -1         0   unknown  no
      3  unknown    5    may        92         1     -1         0   unknown  no
      4  unknown    5    may       198         1     -1         0   unknown  no

[31]: sns.boxplot(y='age',data=data)

[31]: <AxesSubplot:ylabel='age'>

Here we can see that there are a lot of outliers in the age column.
[32]: data.age=data.age.clip(lower=data.age.quantile(0.01),upper=data.age.quantile(0.99))
data.head()

[32]:    age           job  marital  education  default  balance  housing  loan  \
      0   58    management  married   tertiary       no     2143      yes    no
      1   44    technician   single  secondary       no       29      yes    no
      2   33  entrepreneur  married  secondary       no        2      yes   yes
      3   47   blue-collar  married    unknown       no     1506      yes    no
      4   33       unknown   single    unknown       no        1       no    no

         contact  day  month  duration  campaign  pdays  previous  poutcome   y
      0  unknown    5    may       261         1     -1         0   unknown  no
      1  unknown    5    may       151         1     -1         0   unknown  no
      2  unknown    5    may        76         1     -1         0   unknown  no
      3  unknown    5    may        92         1     -1         0   unknown  no
      4  unknown    5    may       198         1     -1         0   unknown  no

Now let’s create a boxplot again to check whether the outliers have been capped or not.
[33]: sns.boxplot(y='age',data=data)

[33]: <AxesSubplot:ylabel='age'>

Here we can see that we have successfully capped the outliers.
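Both methods should give the same capped values. As a rough check, the sketch below wraps the `clip()` approach in a small helper and compares it with the nested `np.where()` approach on made-up data. The function name `cap_by_quantile` and the `ages` series are hypothetical and not part of pandas or of the dataset.

```python
import numpy as np
import pandas as pd

def cap_by_quantile(series, lower_q=0.01, upper_q=0.99):
    """Hypothetical helper: cap a numeric Series at the given quantiles."""
    lower = series.quantile(lower_q)
    upper = series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)

# Hypothetical ages with a few extreme values on both ends.
ages = pd.Series([18, 19, 25, 33, 39, 48, 60, 71, 88, 95])

# Method 2: clip()
capped_clip = cap_by_quantile(ages)

# Method 1: nested np.where() with the same limits
lower_limit = ages.quantile(0.01)
upper_limit = ages.quantile(0.99)
capped_where = np.where(ages < lower_limit, lower_limit,
                        np.where(ages > upper_limit, upper_limit, ages))

# True when both methods produce the same values.
same_result = np.allclose(capped_clip.to_numpy(), capped_where)
print(same_result)
```

The `clip()` version is shorter and returns a Series with the original index, whereas `np.where()` returns a plain NumPy array, so the choice between them is mostly a matter of convenience.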

12 Thank you