Exploratory Data Analysis in Python
1. Loading Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
2. Loading the Dataset
# Example: Loading a CSV file
df = pd.read_csv('your_dataset.csv')
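A minimal sketch of a slightly more explicit load, assuming a small in-memory CSV in place of 'your_dataset.csv'. The column names (id, signup_date, plan, monthly_spend) are hypothetical and only illustrate the read_csv options for dates, missing-value markers, and dtypes.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'your_dataset.csv'
csv_text = """id,signup_date,plan,monthly_spend
1,2023-01-05,basic,19.99
2,2023-02-11,premium,49.50
3,2023-02-28,basic,NA
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["signup_date"],   # parse this column as datetime
    na_values=["NA"],              # treat the string 'NA' as missing
    dtype={"plan": "category"},    # store low-cardinality text compactly
)

print(df_demo.shape)
print(df_demo.dtypes)
```

Declaring dates and categories at load time tends to make the later overview and cleaning steps easier to read, since the dtypes already reflect what each column means.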
3. Data Overview
# Display the first few rows of the dataset
print(df.head())
# Display summary statistics
print(df.describe())
# Display column dtypes and non-null counts
# (info() prints its report directly and returns None, so it is not wrapped in print)
df.info()
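The overview calls above can be complemented by a one-table summary per column. The sketch below is one possible helper, not a pandas built-in: the function name column_overview and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 40],
    "city": ["Oslo", "Rome", None, "Rome"],
})
overview = column_overview(toy)
print(overview)
```

Seeing missing share and cardinality side by side helps decide which columns need imputation and which are candidates for categorical encoding.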
4. Cleaning Data
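# Handling missing values
print(df.isnull().sum())
# Fill numeric columns with their column means; non-numeric columns are left unchanged
# (numeric_only=True avoids a TypeError on text columns in recent pandas versions)
df.fillna(df.mean(numeric_only=True), inplace=True)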
# Handling duplicates
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
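Mean imputation is only one option and is sensitive to extreme values. As a hedged alternative, the sketch below fills numeric columns with the median and non-numeric columns with the most frequent value. The helper name impute_simple and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs set to the median and other NaNs to the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out

# Hypothetical toy data: one extreme income value, one missing segment
toy = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, 1000.0],
    "segment": ["a", "b", None, "b"],
})
cleaned = impute_simple(toy)
print(cleaned)
```

Here the missing income becomes 50.0 (the median of 30, 50, and 1000), whereas the mean would have been pulled to 360 by the single large value.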
5. Preprocessing Data
# Encoding categorical variables
df = pd.get_dummies(df, columns=['categorical_column'])
# Feature Engineering
df['new_feature'] = df['existing_feature1'] * df['existing_feature2']
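To make the effect of encoding and feature engineering concrete, here is a small sketch on made-up data. The column names (color, width, height, area) are hypothetical; drop_first=True is an optional choice that drops one dummy per category to avoid a redundant column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "width": [2.0, 3.0, 4.0],
    "height": [1.0, 2.0, 0.5],
})

# One-hot encode 'color'; with drop_first=True the first category ('blue') is dropped
encoded = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

# Derived feature built from two existing numeric columns
encoded["area"] = encoded["width"] * encoded["height"]

print(encoded)
```

The remaining color_red column is 1 for red rows and 0 for blue rows, so no information is lost by dropping the blue dummy.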
6. Outlier Detection and Treatment
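# Using Z-score to identify outliers
z_scores = stats.zscore(df['column_name'])
abs_z_scores = np.abs(z_scores)
# Keep only rows within 3 standard deviations of the mean
filtered_entries = (abs_z_scores < 3)
# .copy() avoids SettingWithCopyWarning when columns are modified later
df = df[filtered_entries].copy()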
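The z-score rule assumes roughly symmetric data and can miss outliers in small samples. A commonly used alternative is the interquartile range (IQR) rule, sketched below. The helper name iqr_mask, the multiplier k=1.5, and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical toy data with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_mask(values)
print(values[mask].tolist())

# For comparison: population z-score of each value
z = (values - values.mean()) / values.std(ddof=0)
print(z.abs().max())
```

On this seven-point series the IQR rule flags 95, while its z-score stays below 3, so the |z| < 3 filter would keep it. That is one reason to check both rules on small datasets.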
7. Scaling and Normalization
# Min-Max Scaling
scaler = MinMaxScaler()
df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
# Alternatively, for Standardization
# scaler = StandardScaler()
# df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
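To show what the two scalers compute, the sketch below reproduces the underlying formulas with plain pandas on a toy column: min-max scaling maps values to [0, 1], and standardization subtracts the mean and divides by the population standard deviation (ddof=0, which matches what StandardScaler uses). The data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"column1": [1.0, 2.0, 3.0, 4.0, 5.0]})
x = toy["column1"]

# Min-max scaling: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population standard deviation
standardized = (x - x.mean()) / x.std(ddof=0)

print(min_max.tolist())
print(standardized.round(3).tolist())
```

Min-max scaling preserves the shape of the distribution but is sensitive to extreme values, so it is usually applied after outlier treatment, as in the order used here.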
8. Data Visualization (Examples)
# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['column_name'], kde=True)
plt.title('Histogram of column_name')
plt.show()
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['column_name'])
plt.title('Boxplot of column_name')
plt.show()
# Scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='column1', y='column2', data=df)
plt.title('Scatter plot between column1 and column2')
plt.show()
# Heatmap for correlation (numeric columns only)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
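When a dataset has many numeric columns, plotting them one by one gets repetitive. The sketch below draws one histogram per numeric column in a single figure, using only matplotlib. The toy columns and the output file name numeric_histograms.png are hypothetical, and the Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 15, 200),
    "label": ["a", "b"] * 100,
})

# Select numeric columns only; text columns cannot be binned
numeric_cols = toy.select_dtypes(include="number").columns

# One subplot per numeric column, laid out in a single row
fig, axes = plt.subplots(
    1, len(numeric_cols), figsize=(5 * len(numeric_cols), 4), squeeze=False
)
for ax, col in zip(axes[0], numeric_cols):
    ax.hist(toy[col].dropna(), bins=20)
    ax.set_title(f"Histogram of {col}")

fig.tight_layout()
fig.savefig("numeric_histograms.png")
```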
9. Summarizing Findings
print("Key Findings:")
print("1. Description of key patterns or anomalies.")
print("2. Potential relationships between features.")
print("3. Insights on missing values and outliers.")
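The placeholder findings above can be partly filled in from the data itself. As one possible approach, the sketch below ranks feature pairs by absolute correlation so the strongest relationships can be quoted in the summary. The helper name strongest_correlations and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def strongest_correlations(frame: pd.DataFrame, top_n: int = 3) -> pd.Series:
    """Return the top_n feature pairs ranked by absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and self-pairs are skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)

# Hypothetical toy data: y is exactly 2 * x, z moves against x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
top_pairs = strongest_correlations(toy)

print("Potential relationships between features:")
for (a, b), r in top_pairs.items():
    print(f"  {a} vs {b}: r = {r:.2f}")
```

Correlation only captures linear association, so pairs surfaced this way are starting points for the scatter plots in section 8 rather than conclusions.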