Comprehensive EDA Python Guide
1. Introduction to EDA
Exploratory Data Analysis (EDA) is a crucial step in data analysis. It helps you understand the data,
uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of
summary statistics and visualizations.
import pandas as pd
import numpy as np
df = pd.read_csv('your_dataset.csv')
Comprehensive Guide for Exploratory Data Analysis in Python
3. Data Overview
print(df.head())
print(df.describe())
print(df.info())
4. Data Cleaning
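# Count missing values per column
print(df.isnull().sum())
# Fill missing values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
# Alternatives: fill one column with its median or mode, or drop incomplete rows
# df['column_name'] = df['column_name'].fillna(df['column_name'].median())
# df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
# df.dropna(inplace=True)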
# Handling Duplicates
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
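The cleaning steps above can be tried on a small frame first. The sketch below is illustrative only: the `toy` DataFrame and its `age` and `city` columns are made-up names, not part of any real dataset. It fills a numeric column with its median and a categorical column with its mode, then counts and drops the duplicate row that imputation produces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 35.0],
    "city": ["Paris", "Rome", None, None],
})

# Missing values per column before cleaning.
missing_before = toy.isnull().sum()

# Numeric column: fill with the median. Categorical column: fill with the mode.
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["city"] = toy["city"].fillna(toy["city"].mode()[0])

# After imputation the last two rows are identical, so one is a duplicate.
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

print(missing_before)
print(n_duplicates)
print(toy)
```

Note that imputation can itself create duplicate rows, which is one reason to check for duplicates after filling missing values rather than before.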
5. Data Preprocessing
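from sklearn.preprocessing import LabelEncoder
# One-hot encode a nominal (unordered) categorical column
df = pd.get_dummies(df, columns=['categorical_column'])
# Integer-encode a column; LabelEncoder assigns codes in sorted (alphabetical) label order
le = LabelEncoder()
df['ordinal_column'] = le.fit_transform(df['ordinal_column'])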
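LabelEncoder assigns integer codes in sorted label order, which may not match the real ranking of an ordinal column. The sketch below shows one possible alternative that uses only pandas: one-hot encoding for a nominal column, and an ordered categorical with an explicit order for an ordinal one. The `toy` frame, its `color` and `size` columns, and the `size_order` list are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: "color" is nominal, "size" is ordinal.
toy = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["small", "large", "medium"],
})

# One-hot encoding: one indicator column per distinct color.
encoded = pd.get_dummies(toy, columns=["color"])

# Ordinal encoding with an explicit order (an assumption about the data).
size_order = ["small", "medium", "large"]
size_codes = pd.Categorical(
    toy["size"], categories=size_order, ordered=True
).codes

print(encoded)
print(list(size_codes))
```

With this explicit order, "small" maps to 0, "medium" to 1, and "large" to 2, whereas alphabetical ordering would place "large" first.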
# Feature Engineering
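from scipy import stats
# Outlier removal with z-scores: keep rows within 3 standard deviations (a common convention)
z_scores = stats.zscore(df['column_name'])
abs_z_scores = np.abs(z_scores)
filtered_entries = abs_z_scores < 3
df = df[filtered_entries]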
# Outlier removal with the interquartile range (IQR)
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
filtered_entries = ((df['column_name'] >= (Q1 - 1.5 * IQR)) &
                    (df['column_name'] <= (Q3 + 1.5 * IQR)))
df = df[filtered_entries]
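To see what the IQR rule does with concrete numbers, here is a small worked sketch. The six values are made up for illustration, and `lower` and `upper` are hypothetical helper names for the two fences.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100).
values = pd.Series([10, 12, 13, 14, 15, 100], dtype=float)

# Quartiles use pandas' default linear interpolation.
q1 = values.quantile(0.25)   # 12.25
q3 = values.quantile(0.75)   # 14.75
iqr = q3 - q1                # 2.5

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr       # 8.5
upper = q3 + 1.5 * iqr       # 18.5

# Keep only the values inside the fences.
kept = values[(values >= lower) & (values <= upper)]
print(kept.tolist())
```

Only the value 100 falls outside the interval from 8.5 to 18.5, so the other five values are kept.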
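from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Min-Max Scaling (rescales values to the [0, 1] range)
scaler = MinMaxScaler()
df['column_name_minmax'] = scaler.fit_transform(df[['column_name']]).ravel()
# Standardization (zero mean, unit variance)
scaler = StandardScaler()
df['column_name_std'] = scaler.fit_transform(df[['column_name']]).ravel()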
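The arithmetic behind min-max scaling and standardization can be written out directly in pandas. The sketch below is not scikit-learn's implementation, only the same formulas applied to a hypothetical five-value series. Min-max scaling computes (x - min) / (max - min). Standardization computes (x - mean) / std, using the population standard deviation (`ddof=0`) as StandardScaler does.

```python
import pandas as pd

# Hypothetical values to scale.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: maps the minimum to 0 and the maximum to 1.
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit (population) standard deviation.
standardized = (x - x.mean()) / x.std(ddof=0)

print(minmax.tolist())
print(standardized.round(3).tolist())
```

Min-max scaling is sensitive to extreme values because a single outlier sets the range, so it is usually applied after outliers have been handled.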
8. Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Univariate Analysis
# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['column_name'], kde=True)
plt.title('Histogram of column_name')
plt.show()
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['column_name'])
plt.title('Boxplot of column_name')
plt.show()
# Bivariate Analysis
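# Scatter plot ('column_x' and 'column_y' are placeholder column names)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='column_x', y='column_y', data=df)
plt.title('Scatter plot of column_x vs. column_y')
plt.show()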
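# Correlation heatmap of the numeric columns
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()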
# Multivariate Analysis
# Pairplot
sns.pairplot(df)
plt.show()
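# Violin plot (distribution of a numeric column per category; placeholder column names)
plt.figure(figsize=(10, 6))
sns.violinplot(x='categorical_column', y='column_name', data=df)
plt.title('Violin plot')
plt.show()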
9. Summarizing Findings
print("Key Findings:")
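# Imbalanced Data (SMOTE comes from the imbalanced-learn package)
from imblearn.over_sampling import SMOTE
print(df['target'].value_counts())
X, y = df.drop(columns='target'), df['target']  # SMOTE expects numeric features
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)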
# Large Datasets
import dask.dataframe as dd
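ddf = dd.read_csv('large_dataset.csv')  # lazy; call ddf.compute() to get a pandas DataFrame
# Time Series Data (df is a pandas DataFrame here)
df['date_column'] = pd.to_datetime(df['date_column'])
df.set_index('date_column', inplace=True)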
# Resampling
# 'ME' is the month-end frequency; use 'M' on pandas versions before 2.2
df_resampled = df.resample('ME').mean(numeric_only=True)
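Resampling groups rows by time period and then aggregates each group. The sketch below uses three made-up daily observations in a hypothetical `sales` column and averages them per month. It uses the month-start alias `'MS'`, which labels each group by the first day of the month, whereas the month-end aliases label it by the last day.

```python
import pandas as pd

# Hypothetical observations: two in January, one in February.
ts = pd.DataFrame({
    "date": ["2024-01-10", "2024-01-20", "2024-02-05"],
    "sales": [10.0, 20.0, 30.0],
})

# Resampling needs a DatetimeIndex.
ts["date"] = pd.to_datetime(ts["date"])
ts = ts.set_index("date")

# Monthly mean, labelled by the first day of each month.
monthly = ts.resample("MS").mean()
print(monthly)
```

The January mean is 15.0 (the average of 10 and 20) and the February mean is 30.0.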
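# Text Data
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Using CountVectorizer (raw token counts per document)
cv = CountVectorizer()
X = cv.fit_transform(df['text_column'])
# Using TfidfVectorizer (counts reweighted by inverse document frequency)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['text_column'])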