05 AIHC Exp01
Aim: Collect, Clean, Integrate and Transform Healthcare Data based on specific disease
Objective: The objective of this experiment is to perform basic preprocessing on a healthcare
dataset using Python libraries.
Theory:
Data Collection: Data collection is the process of gathering and measuring information from
many different sources. To develop practical artificial intelligence (AI) and machine learning
solutions from the collected data, it must be gathered and stored in a way that suits the
business problem at hand.
Data Cleaning: Data cleaning is the process of removing incorrect, corrupted, wrongly
formatted, duplicate, or incomplete records from a dataset. The likelihood of duplicated or
mislabelled data increases when two or more data sources are combined.
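As a small illustration of these cleaning steps, the sketch below works on a hypothetical
patient table; the file name 'patients.csv' and the 'patient_id' and 'sex' columns are
assumptions made for illustration, not part of the main experiment.
import pandas as pd
# A minimal cleaning sketch; file and column names are hypothetical.
patients = pd.read_csv('patients.csv')
patients['sex'] = patients['sex'].str.strip().str.lower()  # fix inconsistent formatting
patients = patients.drop_duplicates()                      # remove repeated records
patients = patients.dropna(subset=['patient_id'])          # drop rows missing the key identifier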
Data Integration: Data integration is the practice of consolidating data from disparate sources
into a single dataset, with the goal of providing users consistent access to data across
subjects and structure types and of meeting the information needs of all applications and
business processes.
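The main program below works on a single file, so a minimal integration sketch is given here
instead, assuming two hypothetical files 'patients.csv' (demographics) and 'labs.csv' (lab
results) that share a 'patient_id' key column.
import pandas as pd
# A minimal integration sketch; file and column names are hypothetical.
patients = pd.read_csv('patients.csv')
labs = pd.read_csv('labs.csv')
merged = pd.merge(patients, labs, on='patient_id', how='inner')  # keep patients present in both sources
print(merged.head())
Data Transformation: Data transformation converts the cleaned, integrated data into the forms
required for analysis, for example through normalization, standardization, encoding of
categorical variables, and dimensionality reduction, all of which the code below applies.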
Code: -
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from scipy import stats
# Load the dataset
df = pd.read_csv('heart.csv')
# Display basic information
print("Initial Data:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Description:")
print(df.describe())
# Data Cleaning
# Handle missing values
df_filled = df.fillna(df.median(numeric_only=True))  # Fill missing values in numeric columns with the median
# Remove duplicates
df_no_duplicates = df_filled.drop_duplicates()
# Outlier Detection and Treatment
z_scores = np.abs(stats.zscore(df_no_duplicates.select_dtypes(include=[np.number])))
df_no_outliers = df_no_duplicates[(z_scores < 3).all(axis=1)].copy()  # keep rows with all |z| < 3; copy to avoid SettingWithCopyWarning
# Normalization and Standardization
numeric_cols = df_no_outliers.select_dtypes(include=[np.number]).columns
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_no_outliers[numeric_cols]),
                         columns=numeric_cols)  # standardize to zero mean, unit variance
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df_no_outliers[numeric_cols]),
                             columns=numeric_cols)  # rescale each column to [0, 1]
# Feature Engineering: bin age into groups
df_no_outliers['age_group'] = pd.cut(df_no_outliers['age'], bins=[20, 40, 60, 80],
                                     labels=['20-39', '40-59', '60-79'])
# Encoding Categorical Variables
df_no_outliers['sex'] = df_no_outliers['sex'].map({0: 'female', 1: 'male'})
df_encoded = pd.get_dummies(df_no_outliers, columns=['sex'])
# PCA for Dimensionality Reduction
df_std = StandardScaler().fit_transform(df_no_outliers.select_dtypes(include=[np.number]))
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_std)
df_pca_df = pd.DataFrame(data=df_pca, columns=['PC1', 'PC2'])
# Print processed data and PCA result
print("\nProcessed Data:")
print(df_encoded.head())
print("\nPCA Result:")
print(df_pca_df.head())
Output:
Conclusion: -
We collected, cleaned, integrated, and transformed healthcare data focused on a specific
disease. The process involved handling missing values, removing duplicates, and treating
outliers, followed by normalization and standardization. We enriched the dataset through
feature engineering and categorical encoding, and applied PCA for dimensionality reduction.
These steps prepared the data for accurate analysis and meaningful insights.