Comprehensive EDA Python Guide

Cheat Sheet

Uploaded by

Muhammad Faizan
Comprehensive Guide for Exploratory Data Analysis in Python

1. Introduction to EDA

Exploratory Data Analysis (EDA) is a crucial step in data analysis. It helps you understand the data, uncover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations.


2. Loading Libraries and Dataset

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example: Loading a CSV file

df = pd.read_csv('your_dataset.csv')
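
As a self-contained sketch of the loading step, the snippet below reads CSV text from an in-memory buffer instead of a file on disk. The column names (`id`, `signup_date`, `plan`, `spend`) and all values are made up for illustration; `parse_dates` and `dtype` are standard `pd.read_csv` parameters that are often worth setting at load time.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """id,signup_date,plan,spend
1,2024-01-05,basic,10.5
2,2024-01-07,pro,
3,2024-02-11,basic,7.0
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["signup_date"],  # parse this column as datetime
    dtype={"plan": "category"},   # store repeated labels compactly
)

print(sample_df.dtypes)
print(sample_df.shape)
```

The empty `spend` field in the second row is read as a missing value, which the cleaning steps later in the guide would then handle.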

3. Data Overview

# Display the first few rows of the dataset

print(df.head())

# Display summary statistics

print(df.describe())

# Display information about the dataset (df.info() prints directly and returns None)

df.info()
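
The overview calls above can be complemented by a compact per-column table. The following is a minimal sketch on a made-up DataFrame (`toy` and its columns are assumptions for illustration) that collects the dtype, the number of distinct values, and the percentage of missing values for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data; replace with your own DataFrame
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Lahore", "Karachi", "Lahore", None],
    "income": [50000.0, 64000.0, 58000.0, 72000.0],
})

# One row per column of the original data
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),               # distinct non-missing values
    "pct_missing": toy.isna().mean() * 100,  # share of missing values in percent
})

print(overview)
```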

4. Data Cleaning

# Handling Missing Values

print(df.isnull().sum())

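# Fill missing numeric values with each column's mean (numeric_only=True skips text columns)

df.fillna(df.mean(numeric_only=True), inplace=True)

# Alternatively, you can fill missing values with median or mode

# df['column_name'] = df['column_name'].fillna(df['column_name'].median())

# df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])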

# Dropping rows with missing values

# df.dropna(inplace=True)

# Handling Duplicates

print(df.duplicated().sum())

df.drop_duplicates(inplace=True)
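
To make the cleaning steps concrete, here is a small sketch that imputes by column type and then removes duplicates. The DataFrame `raw` and its columns are hypothetical; choosing the median for numeric columns and the mode for the rest is one reasonable default, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
raw = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 10.0],
    "color": ["red", "blue", None, "red"],
})

cleaned = raw.copy()
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Numeric column: the median is robust to extreme values
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Non-numeric column: use the most frequent value
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

# Imputation can create new duplicates, so drop them afterwards
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

In this toy example the last row becomes an exact copy of the first, so three rows remain after deduplication.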

5. Data Preprocessing

# Encoding Categorical Variables

df = pd.get_dummies(df, columns=['categorical_column'])

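# Ordinal encoding for ordered categories
# (LabelEncoder is intended for target labels and assigns codes alphabetically)

from sklearn.preprocessing import OrdinalEncoder

# The level names below are placeholders; list your own levels from lowest to highest

oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])

df[['ordinal_column']] = oe.fit_transform(df[['ordinal_column']])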

# Feature Engineering

df['new_feature'] = df['existing_feature1'] * df['existing_feature2']
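
The difference between the two encodings can be seen on a toy example. This sketch uses pandas only: an ordered `pd.Categorical` gives integer codes that respect a stated order, while `pd.get_dummies` one-hot encodes an unordered column. The DataFrame `demo`, its columns, and the level order are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 'size' has a natural order, 'color' does not
demo = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# State the order explicitly so codes follow small < medium < large
size_order = ["small", "medium", "large"]
demo["size_code"] = pd.Categorical(
    demo["size"], categories=size_order, ordered=True
).codes

# One-hot encode the unordered column
encoded = pd.get_dummies(demo, columns=["color"])

print(encoded)
```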


6. Outlier Detection and Treatment

# Using Z-score to identify outliers

z_scores = stats.zscore(df['column_name'])

abs_z_scores = np.abs(z_scores)

filtered_entries = (abs_z_scores < 3)

df = df[filtered_entries]

# Using IQR (Interquartile Range) to identify outliers

Q1 = df['column_name'].quantile(0.25)

Q3 = df['column_name'].quantile(0.75)

IQR = Q3 - Q1

filtered_entries = (
    (df['column_name'] >= (Q1 - 1.5 * IQR))
    & (df['column_name'] <= (Q3 + 1.5 * IQR))
)

df = df[filtered_entries]
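
The two rules above do not always agree. The following sketch applies both to a small made-up series; the values are an assumption chosen for illustration. The z-score here uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="column_name")

# IQR rule
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_iqr_outlier = (values < lower) | (values > upper)

# Z-score rule with the usual threshold of 3
z = (values - values.mean()) / values.std(ddof=0)
is_z_outlier = z.abs() >= 3

print("IQR rule flags:", values[is_iqr_outlier].tolist())
print("Z-score rule flags:", values[is_z_outlier].tolist())
```

The IQR rule flags 95, while the z-score rule flags nothing: the extreme value inflates the mean and standard deviation it is measured against, and in a sample of 7 points no value can reach a z-score of 3 at all. On small samples the IQR rule is therefore often the more useful check.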

7. Scaling and Normalization

# Min-Max Scaling

scaler = MinMaxScaler()

df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])

# Standardization (an alternative to Min-Max scaling; apply one or the other, not both)

scaler = StandardScaler()

df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
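
When the scaled data feeds a model, a common practice is to fit the scaler on the training split only and reuse it on new data. The sketch below shows this with tiny made-up arrays (`train` and `test` are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Min-Max scaling: learn min and max from the training data only
minmax = MinMaxScaler().fit(train)
test_minmax = minmax.transform(test)
print(test_minmax)  # can fall outside [0, 1] for values beyond the training range

# Standardization: learn mean and standard deviation from the training data only
standard = StandardScaler().fit(train)
train_std = standard.transform(train)
print(train_std.mean(), train_std.std())
```

Fitting on the full dataset before splitting would let information from the test rows leak into the scaling parameters.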


8. Data Visualization

# Univariate Analysis

# Histogram

plt.figure(figsize=(10, 6))

sns.histplot(df['column_name'], kde=True)

plt.title('Histogram of column_name')

plt.show()

# Boxplot

plt.figure(figsize=(10, 6))

sns.boxplot(x=df['column_name'])

plt.title('Boxplot of column_name')

plt.show()

# Bivariate Analysis

# Scatter plot

plt.figure(figsize=(10, 6))

sns.scatterplot(x='column1', y='column2', data=df)

plt.title('Scatter plot between column1 and column2')

plt.show()

# Heatmap for correlation

plt.figure(figsize=(12, 8))

# numeric_only=True skips non-numeric columns, which raise an error in pandas >= 2.0

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

plt.title('Correlation Heatmap')

plt.show()
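
A heatmap becomes hard to read with many columns, so a ranked list of the strongest pairs can be a useful numeric complement. This is a minimal sketch on synthetic data; the DataFrame `num_df` and its columns are assumptions, with `b` constructed to be nearly linear in `a`.

```python
import numpy as np
import pandas as pd

# Synthetic data: b depends on a, c is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
num_df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "label": ["x"] * 200,  # non-numeric column, excluded from the correlation
})

corr = num_df.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)  # rank by absolute correlation
)

print(pairs.head())
```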

# Multivariate Analysis

# Pairplot

sns.pairplot(df)

plt.show()

# Violin plot

plt.figure(figsize=(10, 6))

sns.violinplot(x='categorical_column', y='numeric_column', data=df)

plt.title('Violin plot')

plt.show()

9. Summarizing Findings

print("Key Findings:")

print("1. Description of key patterns or anomalies.")

print("2. Potential relationships between features.")

print("3. Insights on missing values and outliers.")
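
Instead of typing the findings by hand, part of the summary can be generated from the data. The helper below, `summarize_findings`, is a hypothetical function written for this sketch (it is not part of pandas); it reports missing values, duplicated rows, and IQR outliers for a made-up DataFrame.

```python
import numpy as np
import pandas as pd


def summarize_findings(frame: pd.DataFrame) -> list[str]:
    """Build short, data-driven findings (hypothetical helper for illustration)."""
    findings = []

    # Columns with missing values
    missing = frame.isna().sum()
    for col, n in missing[missing > 0].items():
        findings.append(f"{col}: {n} missing value(s)")

    # Fully duplicated rows
    n_dup = int(frame.duplicated().sum())
    findings.append(f"{n_dup} duplicated row(s)")

    # Numeric columns with values outside the 1.5 * IQR fences
    for col in frame.select_dtypes(include="number"):
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (frame[col] < q1 - 1.5 * iqr) | (frame[col] > q3 + 1.5 * iqr)
        n_out = int(outside.sum())
        if n_out:
            findings.append(f"{col}: {n_out} IQR outlier(s)")

    return findings


report_df = pd.DataFrame({"x": [1, 2, 2, 3, 100], "y": [1.0, np.nan, 2.0, 3.0, 4.0]})

print("Key Findings:")
for i, line in enumerate(summarize_findings(report_df), start=1):
    print(f"{i}. {line}")
```

Relationships between features and their interpretation still need to be written up by the analyst; the helper only covers the mechanical checks.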


10. Adjusting for Different Problems and Constraints

# Imbalanced Data

# Check class distribution

print(df['target'].value_counts())

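# Oversampling using SMOTE (requires the imbalanced-learn package)

from imblearn.over_sampling import SMOTE

# Separate features and target; SMOTE expects numeric features with no missing values

X = df.drop(columns=['target'])

y = df['target']

smote = SMOTE(random_state=42)

# In a modeling workflow, resample only the training split

X_res, y_res = smote.fit_resample(X, y)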
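
If imbalanced-learn is not available, the idea of rebalancing can be illustrated with pandas alone. Note that this sketch uses plain random oversampling, which duplicates existing minority rows, and is not SMOTE, which synthesizes new points by interpolating between neighbours. The DataFrame `data` is made up for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class 0, 2 rows of class 1
data = pd.DataFrame({"feature": range(10), "target": [0] * 8 + [1] * 2})

counts = data["target"].value_counts()
majority_n = counts.max()
print(counts / counts.sum())  # class proportions before resampling

parts = []
for label, group in data.groupby("target"):
    if len(group) < majority_n:
        # Sample minority rows with replacement up to the majority size
        group = group.sample(n=majority_n, replace=True, random_state=0)
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["target"].value_counts())
```

As with SMOTE, this kind of resampling is normally applied to the training split only, so that evaluation data keeps the original class distribution.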

# Large Datasets

# Using Dask for larger-than-memory computations

import dask.dataframe as dd

# Use a separate name so the pandas DataFrame df used elsewhere is not overwritten

ddf = dd.read_csv('large_dataset.csv')

# Time Series Data

# Converting a column to datetime

df['date_column'] = pd.to_datetime(df['date_column'])

# Setting the date column as index

df.set_index('date_column', inplace=True)
# Resampling to monthly means ('ME' = month end; use 'M' on pandas older than 2.2)

df_resampled = df.resample('ME').mean(numeric_only=True)
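
The three time series steps can be traced end to end on a tiny example. The DataFrame `ts` and its values are made up for illustration, and the sketch uses the month-start alias `'MS'`, which labels each bin by the first day of the month and is accepted by both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical daily readings that span a month boundary
ts = pd.DataFrame({
    "date_column": ["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# Convert the strings to datetimes and use them as the index
ts["date_column"] = pd.to_datetime(ts["date_column"])
ts = ts.set_index("date_column")

# Average the values within each calendar month
monthly = ts.resample("MS").mean()

print(monthly)
```

The two January readings average to 2.0 and the two February readings to 6.0.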

# Text Data

# Using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

# Distinct names keep the two matrices apart and avoid overwriting an earlier X

X_counts = cv.fit_transform(df['text_column'])

# Using TF-IDF Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_tfidf = tfidf.fit_transform(df['text_column'])