Vidyavardhini’s College of Engineering & Technology

Name: Durvesh Kajrekar


Class: BE/CSE-DS
Experiment No.1
Aim: Collect, Clean, Integrate and Transform Healthcare
Data based on specific disease

Date of Performance: 26/7/24

Date of Submission: 2/8/

HAIMLSBL701 AI&ML in Healthcare Lab


Aim: Collect, Clean, Integrate and Transform Healthcare Data based on specific disease

Objective: The objective of this experiment is to perform basic preprocessing on a healthcare
dataset using Python libraries.

Theory:

Data Collection: Data collection is the process of gathering and measuring information from
a variety of sources. To be useful for developing practical artificial intelligence (AI) and
machine learning solutions, the data must be collected and stored in a way that suits the
business problem at hand.

Data Cleaning: Data cleaning is the process of correcting or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete records from a dataset. The risk of
duplicated or mislabelled data increases when two or more data sources are combined.
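
As a minimal sketch of these cleaning steps, the snippet below works on a small made-up table rather than the lab dataset. The column names (patient_id, sex, chol) and all values are hypothetical and chosen only to show inconsistent formatting, a duplicate row, and a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the column names and values are illustrative only.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "sex": ["M", "f", "f", " F", "m"],
    "chol": [210.0, 185.0, 185.0, np.nan, 240.0],
})

cleaned = raw.copy()
# Fix inconsistent formatting: strip whitespace and normalise case.
cleaned["sex"] = cleaned["sex"].str.strip().str.upper()
# Remove exact duplicate rows (patient 2 appears twice).
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
# Fill the missing cholesterol value with the column median.
cleaned["chol"] = cleaned["chol"].fillna(cleaned["chol"].median())

print(cleaned)
```

Normalising the text column before calling drop_duplicates matters in general, because rows that differ only in case or stray whitespace would otherwise not be recognised as duplicates.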

Data Integration: Data integration is the practice of consolidating data from disparate sources
into a single dataset with the ultimate goal of providing users with consistent access and
delivery of data across the spectrum of subjects and structure types, and to meet the
information needs of all applications and business processes.
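
One common way to consolidate two sources in pandas is a key-based merge. The sketch below assumes two hypothetical tables (demographics and lab_results) that share a patient_id column; neither table is part of the experiment's data.

```python
import pandas as pd

# Two hypothetical sources describing overlapping sets of patients.
demographics = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [54, 61, 47],
})
lab_results = pd.DataFrame({
    "patient_id": [2, 3, 4],
    "chol": [185, 230, 199],
})

# An outer join keeps every patient from both sources;
# indicator=True adds a _merge column recording where each row came from.
integrated = demographics.merge(
    lab_results, on="patient_id", how="outer", indicator=True
)
print(integrated)
```

The _merge column makes it easy to spot patients present in only one source, which is usually the first thing to check after integration.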

Data Transformation: Data transformation is the process of converting, cleansing, and
structuring data into a usable format that can be analyzed to support decision-making
processes, and to propel the growth of an organization. Data transformation is used when data
needs to be converted to match that of the destination system.
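
To illustrate converting data to match a destination format, the sketch below assumes a hypothetical source table whose dates, categories, and numbers all arrive as text. The column names and the 0/1 coding for sex are assumptions made for the example.

```python
import pandas as pd

# Hypothetical source records: everything arrives as text.
source = pd.DataFrame({
    "visit_date": ["26/07/2024", "02/08/2024"],
    "sex": ["female", "male"],
    "weight_kg": ["61.5", "78"],
})

target = pd.DataFrame({
    # Parse day/month/year strings into proper datetime values.
    "visit_date": pd.to_datetime(source["visit_date"], format="%d/%m/%Y"),
    # Encode sex as 0/1 (an assumed convention for the destination system).
    "sex": source["sex"].map({"female": 0, "male": 1}),
    # Convert numeric text to floats.
    "weight_kg": source["weight_kg"].astype(float),
})
print(target.dtypes)
```

Giving pd.to_datetime an explicit format avoids ambiguity between day-first and month-first dates.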

Code: -
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from scipy import stats
# Load the dataset
df = pd.read_csv('heart.csv')
# Display basic information
print("Initial Data:")

print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Description:")
print(df.describe())
# Data Cleaning
# Handle missing values
df_filled = df.fillna(df.median(numeric_only=True)) # Fill missing numeric values with the column median
# Remove duplicates
df_no_duplicates = df_filled.drop_duplicates()
# Outlier Detection and Treatment
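z_scores = np.abs(stats.zscore(df_no_duplicates.select_dtypes(include=[np.number])))
# Keep rows whose numeric values all have |z| < 3; copy so later column assignments do not raise SettingWithCopyWarning
df_no_outliers = df_no_duplicates[(z_scores < 3).all(axis=1)].copy()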
# Normalization and Standardization
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_no_outliers.select_dtypes(include=[np.number])),
columns=df_no_outliers.select_dtypes(include=[np.number]).columns)
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(
    min_max_scaler.fit_transform(df_no_outliers.select_dtypes(include=[np.number])),
    columns=df_no_outliers.select_dtypes(include=[np.number]).columns)
# Feature Engineering
# right=False makes the bins [20, 40), [40, 60), [60, 80) so they match the labels
df_no_outliers['age_group'] = pd.cut(df_no_outliers['age'], bins=[20, 40, 60, 80],
                                     labels=['20-39', '40-59', '60-79'], right=False)
# Encoding Categorical Variables
df_no_outliers['sex'] = df_no_outliers['sex'].map({0: 'female', 1: 'male'})
df_encoded = pd.get_dummies(df_no_outliers, columns=['sex'])
# PCA for Dimensionality Reduction
df_std = StandardScaler().fit_transform(df_no_outliers.select_dtypes(include=[np.number]))
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_std)
df_pca_df = pd.DataFrame(data=df_pca, columns=['PC1', 'PC2'])
# Print processed data and PCA result

print("\nProcessed Data (first few rows):")


print(df_no_outliers.head())
print("\nEncoded Data (first few rows):")
print(df_encoded.head())
print("\nPCA Data (first few rows):")
print(df_pca_df.head())
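
A useful follow-up check after PCA is how much of the total variance the two retained components explain. The sketch below is not part of the submitted code; it uses synthetic data as a stand-in for the numeric columns of heart.csv, so the printed values say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (heart.csv is not used here).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Five features: two independent signals plus three noisy combinations of them.
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
    base[:, 0] - base[:, 1] + 0.1 * rng.normal(size=200),
])

# Standardize first so that every feature contributes on the same scale.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Fraction of total variance captured by each component, and by both together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

On the lab data, the same explained_variance_ratio_ attribute of the fitted pca object would show whether two components are enough or whether more should be kept.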

Google Colaboratory Link: - https://colab.research.google.com/drive/1Xr-WnJa-OARr_EZvGOyNUGZctVaWm1xS?usp=sharing

Output:

Conclusion: -
We collected, cleaned, integrated, and transformed healthcare data focused on a specific
disease. The process involved filling missing values with the column median, removing
duplicate rows, dropping rows with a z-score of 3 or more, and applying standardization and
min-max normalization. We then added an age-group feature, one-hot encoded the sex column,
and used PCA to reduce the numeric features to two principal components. These steps prepare
the data for downstream analysis and modelling.
