Modulo 4 - EDA - Ipynb - Colaboratory
REMINDER: Upload the datasets used here to the Colab environment before running the
commands.
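A minimal sketch of how the reminder above could be checked before running the remaining cells. It only looks for the file names this notebook reads later (`metadata.xlsx`, `new_train.csv`, `IQR.png`) in the working directory; the helper name `missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# File names read later in this notebook; adjust if your copies differ.
REQUIRED_FILES = ["metadata.xlsx", "new_train.csv", "IQR.png"]

def missing_files(names, base_dir="."):
    """Return the entries of `names` that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).exists()]

missing = missing_files(REQUIRED_FILES)
if missing:
    print("Upload these files to the Colab session first:", missing)
else:
    print("All required files found.")
```

Running this first turns a later `FileNotFoundError` into an explicit list of what still has to be uploaded.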
Package imports
!pip install sweetviz
Collecting sweetviz
import sweetviz as sv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython import display
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 10000)
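The `pd.set_option` calls above change the display limits for the whole session. As a hedged alternative, the sketch below scopes the same kind of setting to a single block with `pd.option_context`; the `toy` frame is invented purely for illustration.

```python
import pandas as pd

# Hypothetical toy frame, only used to demonstrate the display options.
toy = pd.DataFrame({"text": ["x" * 80], "n": [1]})

default_width = pd.get_option("display.max_colwidth")

# Inside the block the wide limit applies; on exit the old value is restored.
with pd.option_context("display.max_colwidth", 10000):
    inside_width = pd.get_option("display.max_colwidth")
    print(toy)

restored_width = pd.get_option("display.max_colwidth")
```

This can be convenient when only one or two cells need the wider output.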
https://colab.research.google.com/drive/1QXuP1KcCTqO1o-Uk9hV3gBI5w3xxrxw4#printMode=true 1/21
Jardeilsom do Nascimento Oliveira - blakjd2@hotmail.com - IP: 200.232.133.64
28/04/2022 12:37 Modulo 4 - EDA.ipynb - Colaboratory
Loading the dataset
metadata = pd.read_excel('metadata.xlsx')
metadata
Feature Feature_Type
0 age numeric
5 housing categorical,nominal h
6 loan categorical,nominal h
df = pd.read_csv('new_train.csv', sep=',')
df.head()
Basic statistics
# The 'info' method returns several pieces of information about the DataFrame, including the number ...
df.info()
<class 'pandas.core.frame.DataFrame'>
# Number of rows and columns of the DataFrame
df.shape
(32950, 16)
# The len (length) function applied to a DataFrame returns the number of rows
len(df)
32950
# The nunique method returns the number of unique values for each variable (analogous to "remove duplicates" ...
df.nunique()
age 75
job 12
marital 4
education 8
default 3
housing 3
loan 3
contact 2
month 10
day_of_week 5
duration 1467
campaign 40
pdays 27
previous 8
poutcome 3
y 2
dtype: int64
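To make the "remove duplicates" analogy concrete, here is a small sketch on invented data (the `toy` frame is not taken from `new_train.csv`). `nunique` gives the count of distinct values per column, while `drop_duplicates` returns the distinct values themselves.

```python
import pandas as pd

# Invented rows that reuse two of the dataset's column names.
toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular", "cellular"],
    "y": ["no", "no", "yes", "no"],
})

# Number of distinct values per column.
distinct_counts = toy.nunique()

# The distinct values themselves, for a single column.
distinct_contacts = toy["contact"].drop_duplicates().tolist()

print(distinct_counts)
print(distinct_contacts)
```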
Univariate analysis
# Return the first 5 rows (5 is the default; this number can be changed ...
df['age'].head()
0 49
1 37
2 78
3 36
4 59
# Return the last 5 rows (same default as 'head')
df['age'].tail()
32945 28
32946 52
32947 54
32948 29
32949 35
# Sum of all values in a column (here, the "age" column)
df['age'].sum()
1318465
# Minimum value observed in a given column
df['age'].min()
17
# Mean value
df['age'].mean()
40.01411229135053
# Maximum value
df['age'].max()
98
# Boxplot of the data in the "age" column. It shows where the ...
sns.boxplot(x=df["age"])
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a334c6050>
# The histogram also makes the distribution of the data easier to see, which is fundamental in ...
plt.hist(df['age'], 50, facecolor='b')
plt.show()
df.describe(include='int64')
df.describe(include='object')
unique 12 4 8 3 3 3 2 10
Missing-value analysis
df.isnull().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
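A zero count from `isnull()` only rules out true nulls. The sketch below, on invented data, reports nulls as a percentage and also counts a placeholder label. The label `"unknown"` is an assumption here (the notebook output does not show the category values), so check the actual levels with `value_counts()` before relying on it.

```python
import numpy as np
import pandas as pd

# Invented rows; "unknown" is an assumed placeholder label, not confirmed
# by the notebook output.
toy = pd.DataFrame({
    "age": [49, np.nan, 78, 36],
    "default": ["no", "unknown", "no", "unknown"],
})

# Share of true nulls per column, as a percentage.
null_pct = toy.isnull().mean() * 100

# Share of rows in one column that hold the assumed placeholder label.
placeholder_pct = toy["default"].eq("unknown").mean() * 100

print(null_pct)
print(placeholder_pct)
```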
Frequency table
df['poutcome'].value_counts()
nonexistent 28416
failure 3429
success 1105
df['contact'].value_counts()
cellular 20908
telephone 12042
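Absolute counts are often easier to compare as shares. A short sketch, using the two `contact` counts printed above, of how relative frequencies can be derived; on the raw column the same result should come from `df['contact'].value_counts(normalize=True)`.

```python
import pandas as pd

# Absolute counts copied from the value_counts() output above.
contact_counts = pd.Series({"cellular": 20908, "telephone": 12042})

# Relative frequencies: each count divided by the total number of rows.
contact_share = contact_counts / contact_counts.sum()
print(contact_share.round(3))
```

The counts add up to 32950, which matches the number of rows reported by `df.shape`.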
df['age'].value_counts().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31f76490>
prev_y = pd.crosstab(index=df["previous"], columns=df["y"],margins=True)
prev_y
y no yes All
previous
3 74 101 175
4 29 31 60
5 4 10 14
6 2 3 5
7 1 0 1
job_y = pd.crosstab(index=df["job"], columns=df["y"],margins=True)
job_y
y no yes All
job
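The cross-tabulations above show raw counts, which makes groups of very different sizes hard to compare. As a sketch on invented data, `normalize="index"` rescales each row to sum to 1, so the `yes` column reads as a response rate per group.

```python
import pandas as pd

# Invented rows that reuse the column names "previous" and "y".
toy = pd.DataFrame({
    "previous": [0, 0, 0, 1, 1, 2],
    "y": ["no", "no", "yes", "no", "yes", "yes"],
})

# Same call shape as above, but each row is scaled to sum to 1.
rates = pd.crosstab(index=toy["previous"], columns=toy["y"], normalize="index")
print(rates)
```

Rows with very few observations (such as `previous` = 7 in the real table, with a single client) still produce a rate, so the raw counts remain worth keeping alongside.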
Histogram
df.dtypes
age int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
sns.histplot(data=df, x="pdays")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31ea48d0>
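The sample rows printed elsewhere in this notebook show `pdays` equal to 999. Assuming 999 is a placeholder for "not previously contacted" rather than a real number of days (an assumption to verify against the dataset documentation), a histogram of the raw column will be dominated by that single value. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Toy values; 999 is assumed to be a placeholder, not a real day count.
pdays = pd.Series([999, 999, 3, 6, 999, 12])

# Fraction of rows carrying the assumed placeholder.
never_contacted_share = (pdays == 999).mean()

# Replace the placeholder with NaN so plots and statistics skip it.
pdays_clean = pdays.replace(999, np.nan)
print(never_contacted_share, pdays_clean.mean())
```

With the placeholder masked, `sns.histplot` would show only the genuine day counts.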
sns.histplot(data=df, x="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31e146d0>
df['duration'].describe()
count 32950.000000
mean 258.127466
std 258.975917
min 0.000000
25% 103.000000
50% 180.000000
75% 319.000000
max 4918.000000
df['duration'].median()
180.0
df['duration'].mode()
0 90
dtype: int64
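In the summary above the mean of `duration` (about 258) is well above the median (180) and the maximum is 4918, which is the usual signature of a right-skewed distribution. The sketch below uses invented values to show one way of quantifying that skew and how a `log1p` transform can reduce it. It illustrates the idea and is not a step taken in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy call durations in seconds: mostly short, with one very long call.
duration = pd.Series([60, 90, 90, 120, 180, 240, 320, 2400])

# A positive gap between mean and median hints at a long right tail.
mean_minus_median = duration.mean() - duration.median()
skew_raw = duration.skew()

# log1p compresses the long tail and is defined at 0.
skew_log = np.log1p(duration).skew()
print(mean_minus_median, skew_raw, skew_log)
```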
sns.histplot(data=df, x="campaign")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a33900790>
Boxplot
sns.boxplot(x=df["campaign"])
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a338915d0>
df['campaign'].value_counts()
1 14121
2 8469
3 4300
4 2116
5 1255
6 773
7 493
8 329
9 220
10 187
11 142
12 92
13 74
14 52
17 51
15 45
16 42
18 27
20 22
21 20
19 16
22 13
24 12
23 12
27 9
25 8
26 7
31 7
29 7
28 6
30 6
35 4
33 3
43 2
32 2
42 2
34 1
37 1
40 1
56 1
display.Image("IQR.png")
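Assuming the image illustrates the usual 1.5 × IQR rule (the same rule seaborn's boxplot uses by default for its whiskers), the fences can be computed directly. The values below are invented and only mimic the shape of `campaign`, with many small counts and one extreme.

```python
import pandas as pd

# Toy contact counts per client, with one extreme value.
campaign = pd.Series([1, 1, 2, 2, 3, 3, 4, 56])

q1 = campaign.quantile(0.25)
q3 = campaign.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = campaign[(campaign < lower_fence) | (campaign > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```

Whether flagged points should be removed, capped, or kept is a modelling decision; the rule only identifies them.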
Scatter plot
df.dtypes
age int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
sns.scatterplot(data=df, x="campaign", y="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2ddfa950>
sns.scatterplot(data=df, x="pdays", y="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2dd7c110>
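Correlations
# Restrict to numeric columns: recent pandas versions raise an error when corr() meets object columns
df_num = df.select_dtypes(include='number')
df_num.corr()
sns.heatmap(df_num.corr(), annot=True, fmt="f")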
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2dd624d0>
sns.catplot(x="duration", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2dd82750>
sns.catplot(x="campaign", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2dc0b650>
sns.catplot(x="age", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2db7ec50>
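The categorical plots above compare each numeric feature with the target visually. As a complementary sketch, the target can be encoded as 0/1 and correlated with the numeric columns. The rows below are invented; the `no`/`yes` labels follow the cross-tabulation output, and the column name `y_num` is hypothetical.

```python
import pandas as pd

# Invented rows that reuse the dataset's column names.
toy = pd.DataFrame({
    "duration": [60, 120, 300, 600, 900, 90],
    "campaign": [5, 3, 1, 1, 2, 6],
    "y": ["no", "no", "yes", "yes", "yes", "no"],
})

# Encode the binary target numerically so it can enter the correlation.
toy["y_num"] = toy["y"].map({"no": 0, "yes": 1})

# Pearson correlation of each numeric feature with the encoded target.
corr_with_y = toy[["duration", "campaign"]].corrwith(toy["y_num"])
print(corr_with_y)
```

With a 0/1 target, the Pearson coefficient is the point-biserial correlation, so its sign shows which class tends to have larger values of the feature.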
Multivariate analysis
sns.relplot(x="age", y="duration", hue="y", data=df);
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
metadata
Feature Feature_Type
0 age numeric
5 housing categorical,nominal h
6 loan categorical,nominal h
df_pca = df[['age', 'duration','campaign','pdays','previous']]
df_pca.head()
  age duration campaign pdays previous
0 49 227 4 999 0
1 37 202 2 999 1
2 78 1148 1 999 0
3 36 120 2 999 0
4 59 368 2 999 0
pca = PCA(n_components=2, random_state=42)
df_expl_pca = StandardScaler().fit_transform(df_pca)
df_expl_pca
array([[ 0.86373877, -0.12019627, 0.52298128, 0.19658384, -0.35012691],
...,
result_pca = pca.fit_transform(df_expl_pca)
result_pca_df = pd.DataFrame(result_pca,
columns=['component1','component2'])
result_pca_df
component1 component2
0 -0.425175 -0.509855
1 1.005371 -0.146158
2 0.265589 2.274575
3 -0.421084 -0.115342
4 -0.197363 0.194940
pca.explained_variance_ratio_
array([0.32246681, 0.2116934 ])
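The two ratios above add up to about 0.534, so the two components retain roughly 53% of the variance of the five standardized columns. To see how much each further component would add, the sketch below computes the full set of ratios. It swaps in a plain NumPy eigen-decomposition of the correlation matrix instead of `sklearn`'s `PCA`; for standardized data the two give the same ratios. The matrix `X` is synthetic and stands in for the real columns.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the five numeric columns (not the real data).
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.5 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.5 * rng.normal(size=500),
    rng.normal(size=500),
])

# Eigenvalues of the correlation matrix are the variances captured by each
# principal component of the standardized data.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted descending

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(explained_ratio.round(3), cumulative.round(3))
```

Reading `cumulative` from left to right shows how many components are needed to reach a chosen threshold, for example 80% or 90% of the variance.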
df_resp_pca = pd.concat([df['y'], result_pca_df], axis=1)
df_resp_pca
y component1 component2
0 no -0.425175 -0.509855
1 no 1.005371 -0.146158
3 no -0.421084 -0.115342
4 no -0.197363 0.194940
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Component_1', fontsize=15)
ax.set_ylabel('Component_2', fontsize=15)
ax.set_title('PCA with 2 components', fontsize=20)
targets = ['yes', 'no']
colors = ['r', 'b']
# Plot each target class in its own color
for target, color in zip(targets, colors):
    indicesToKeep = df_resp_pca['y'] == target
    ax.scatter(df_resp_pca.loc[indicesToKeep, 'component1'],
               df_resp_pca.loc[indicesToKeep, 'component2'],
               c=color,
               s=50)
ax.legend(targets)
ax.grid()