Outlier Detection and Capping
data.head()
[10]: <seaborn.axisgrid.FacetGrid at 0x1f9b2f74af0>
2 The following methods are used when the data is normally distributed
[4]: high_limit=data.age.mean()+3*data.age.std()
low_limit=data.age.mean()-3*data.age.std()
[5]: print(high_limit,'\n')
print(low_limit)
72.79249633725466
9.079924091402077
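The limits above follow the three-sigma rule: for roughly normal data, values more than three standard deviations from the mean are treated as outliers. A minimal sketch of how such limits might then be used, assuming a synthetic `ages` series (not the notebook's dataset) with two extreme values added on purpose:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.age (an assumption, not the notebook's data):
# 200 roughly normal ages plus two deliberately extreme values.
rng = np.random.default_rng(0)
ages = pd.Series(np.append(rng.normal(40, 10, 200), [120.0, -30.0]), name="age")

# Same limits as in the cell above: mean +/- 3 standard deviations.
high_limit = ages.mean() + 3 * ages.std()
low_limit = ages.mean() - 3 * ages.std()

# Flag the rows that fall outside the limits.
is_outlier = (ages < low_limit) | (ages > high_limit)

# Option 1: trimming drops the flagged rows.
trimmed = ages[~is_outlier]

# Option 2: capping keeps every row but pulls extremes in to the limits.
capped = ages.clip(lower=low_limit, upper=high_limit)

print(int(is_outlier.sum()), len(trimmed), len(capped))
```

Trimming shortens the data, while capping keeps the row count unchanged; which one is appropriate depends on the analysis.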
[6]: data['age'].describe()
[20]: <AxesSubplot:ylabel='age'>
6 We are treating the column age.
[22]: lower_limit=data.age.quantile(0.01)
upper_limit=data.age.quantile(0.99)
print(lower_limit)
print(upper_limit)
23.0
71.0
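To see what percentile limits mean in practice, the sketch below counts how many values fall outside them. It assumes a toy `ages` series of the values 1 to 100 rather than the notebook's data, so roughly 1% of the values should sit beyond each limit.

```python
import pandas as pd

# Toy stand-in for data.age (an assumption): the values 1..100.
ages = pd.Series(range(1, 101), dtype="float64", name="age")

# 1st and 99th percentiles, computed as in the cell above.
lower_limit = ages.quantile(0.01)
upper_limit = ages.quantile(0.99)

# Count the values that would be capped on each side.
n_below = int((ages < lower_limit).sum())
n_above = int((ages > upper_limit).sum())

print(lower_limit, upper_limit, n_below, n_above)
```

Unlike the mean and standard deviation approach, percentile limits do not assume a normal distribution; they always mark a fixed share of the data as extreme.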
Now that we have our lower and upper limits, we have to cap the data between these two
values. Capping pulls the outliers in to the limits, which we can check with a boxplot.
8 1- np.where()
9 2- clip()
We will use both methods, one at a time.
10 1- np.where()
[24]: data.age=np.where(data.age<lower_limit,lower_limit,np.where(data.age>upper_limit,upper_limit,data.age))
data.age.head()
[24]: 0 58.0
1 44.0
2 33.0
3 47.0
4 33.0
Name: age, dtype: float64
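The nested `np.where()` call reads from the outside in: values below the lower limit become the lower limit, values above the upper limit become the upper limit, and everything else is kept. A small sketch on a hypothetical five-value series, reusing the limits 23.0 and 71.0 printed earlier:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen so that one value falls below and one above the limits.
ages = pd.Series([18, 25, 40, 60, 95], name="age")
lower_limit, upper_limit = 23.0, 71.0  # limits printed earlier in the notebook

# Outer where handles the low side; inner where handles the high side.
capped_values = np.where(
    ages < lower_limit,
    lower_limit,
    np.where(ages > upper_limit, upper_limit, ages),
)

# np.where returns a NumPy array, so wrap it to keep the original index and name.
capped = pd.Series(capped_values, index=ages.index, name="age")

print(capped.tolist())  # [23.0, 25.0, 40.0, 60.0, 71.0]
```

Because the limits are floats, the result is a float column, which matches the `float64` dtype shown in the output above.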
We have capped the values. Now let’s check for outliers with a boxplot.
[26]: sns.boxplot(y='age',data=data)
[26]: <AxesSubplot:ylabel='age'>
Now we can compare this boxplot with the one we created earlier and see that we
have treated the outliers.
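The visual comparison can be backed by a count. By default a seaborn boxplot draws as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles, so counting those points before and after capping gives a numeric check. The helper name `count_boxplot_outliers` and the toy data below are assumptions made for illustration.

```python
import pandas as pd

def count_boxplot_outliers(series: pd.Series, whis: float = 1.5) -> int:
    """Count the points a default boxplot would draw beyond its whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - whis * iqr) | (series > q3 + whis * iqr)
    return int(outside.sum())

# Toy data (an assumption): nine ordinary values and one extreme value.
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype="float64", name="age")

before = count_boxplot_outliers(ages)
after = count_boxplot_outliers(ages.clip(upper=9))

print(before, after)
```

Note that capping at the 1st and 99th percentiles does not guarantee a count of zero: the whisker fences are recomputed from the capped data, so a few points may still be drawn beyond them.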
11 2- clip()
To use clip(), let’s import the data again, then check for and cap the outliers.
data.head()
[31]: sns.boxplot(y='age',data=data)
[31]: <AxesSubplot:ylabel='age'>
Here we can see that there are a lot of outliers in the age column.
[32]: data.age=data.age.clip(lower=data.age.quantile(0.01),upper=data.age.quantile(0.99))
data.head()
Now let’s create a boxplot again to check whether the outliers have been capped or not.
[33]: sns.boxplot(y='age',data=data)
[33]: <AxesSubplot:ylabel='age'>
Here we can see that we have successfully capped the outliers.
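The `clip()` approach can be wrapped in a small reusable function. This is only a sketch: the function name `cap_by_quantile` and its default quantiles are assumptions, and it is shown on a toy series rather than the notebook's data.

```python
import pandas as pd

def cap_by_quantile(series: pd.Series, lower_q: float = 0.01,
                    upper_q: float = 0.99) -> pd.Series:
    """Return a copy of `series` capped at the given lower and upper quantiles."""
    # Compute both limits from the original data before any capping happens.
    lower = series.quantile(lower_q)
    upper = series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)

# Toy stand-in for data.age (an assumption): the values 1..100.
ages = pd.Series(range(1, 101), dtype="float64", name="age")
capped = cap_by_quantile(ages)

print(capped.min(), capped.max(), len(capped))
```

Compared with the nested `np.where()` call, `clip()` returns a pandas Series directly, so the index and column name are preserved without extra wrapping.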
12 Thank you