1. Write a Python program for Data pre-processing and include Data cleaning steps
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy.stats import zscore
# Sample dataset creation
data = {
'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve', 'Frank', 'Alice', 'Grace'],
'Age': [25, np.nan, 30, 35, 29, 120, 25, 33],
'Gender': ['Female', 'Male', 'Male', 'Female', None, 'Male', 'Female', 'Female'],
'Income': [50000, 60000, np.nan, 80000, 75000, 300000, 50000, 72000],
'Location': ['New York', 'San Francisco', 'Chicago', None, 'New York', 'Chicago', 'New York', 'Chicago']
}
# Load data into a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Step 1: Handling Missing Values
# Fill missing numerical values with the column median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].median())
# Fill missing categorical values with the mode or a placeholder value
df['Name'] = df['Name'].fillna('Unknown')
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Location'] = df['Location'].fillna('Unknown')
# Step 2: Removing Duplicates
# Check for duplicates and drop them
df = df.drop_duplicates()
# Step 3: Handling Outliers
# Remove rows where numerical features have absolute Z-scores > 3 (extreme outliers); see the alternative sketch below
z_scores = zscore(df[['Age', 'Income']])
df = df[(np.abs(z_scores) < 3).all(axis=1)]
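# Alternative sketch (not part of the required steps): on a sample this small, |Z| rarely
# exceeds 3, so an IQR rule can be a more effective way to catch the extreme Age/Income rows:
# Q1, Q3 = df['Income'].quantile(0.25), df['Income'].quantile(0.75)
# IQR = Q3 - Q1
# df = df[(df['Income'] >= Q1 - 1.5 * IQR) & (df['Income'] <= Q3 + 1.5 * IQR)]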
# Step 4: Encoding Categorical Variables
# Convert categorical columns to numerical using Label Encoding
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender']) # Female = 0, Male = 1
df['Location'] = label_encoder.fit_transform(df['Location']) # Encoding Location
# Step 5: Removing Irrelevant Features
# Drop the 'Name' column as it's non-informative for analysis
df = df.drop(columns=['Name'])
# Step 6: Scaling Features
# Normalize numerical features using StandardScaler
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
# Final Cleaned DataFrame
print("\nCleaned DataFrame:")
print(df)
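A quick verification step (a minimal sketch, reusing the df produced above) is to confirm that no missing values or duplicate rows remain after cleaning:
# Verify the cleaning: both counts should be zero
print("\nRemaining missing values per column:")
print(df.isnull().sum())
print("Remaining duplicate rows:", df.duplicated().sum())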
2. Write a Python program and include Data Integration steps and Data cleaning steps
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Sample Datasets
data1 = {
'CustomerID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, np.nan, 30, 35],
'Gender': ['Female', 'Male', 'Male', 'Female']
}
data2 = {
'CustomerID': [3, 4, 5, 6],
'Income': [60000, 80000, np.nan, 75000],
'Location': ['New York', 'Chicago', None, 'San Francisco']
}
# Step 1: Load Data
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print("Dataset 1:")
print(df1)
print("\nDataset 2:")
print(df2)
# Step 2: Data Integration (Merging Datasets)
# Merge datasets on CustomerID
df = pd.merge(df1, df2, on='CustomerID', how='outer')
print("\nIntegrated Dataset:")
print(df)
df10 = pd.merge(df1, df2, on='CustomerID', how='inner')
print("\nIntegrated Dataset:")
print(df10)
# Step 3: Handling Missing Values
# Fill missing numerical values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].median())
# Fill missing categorical values with the mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Location'] = df['Location'].fillna('Unknown')
# Step 4: Removing Duplicates
df = df.drop_duplicates()
# Step 5: Encoding Categorical Variables
# Encode Gender and Location columns
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df['Location'] = label_encoder.fit_transform(df['Location'])
# Step 6: Scaling Numerical Features
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
# Final Processed Dataset
print("\nProcessed Dataset:")
print(df)
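When integrating tables, it often helps to record which source each row came from; the sketch below repeats the outer merge with pandas' indicator=True option so the integration can be audited:
# Optional audit: the _merge column shows whether a row came from df1, df2, or both
df_audit = pd.merge(df1, df2, on='CustomerID', how='outer', indicator=True)
print("\nMerge audit:")
print(df_audit[['CustomerID', '_merge']])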
3. Write a Python Program and include data reduction functions
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
# Sample Dataset
data = {
'Feature1': [2.5, 3.0, 2.8, 3.2, 2.7, 3.5, 2.9],
'Feature2': [1.2, 1.5, 1.3, 1.7, 1.4, 1.8, 1.6],
'Feature3': [0.8, 0.9, 1.0, 0.7, 1.1, 0.6, 1.2],
'Feature4': [10, 20, 15, 10, 20, 15, 10],
'Target': [0, 1, 1, 0, 1, 0, 1]
}
# Load data into a DataFrame
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
# Splitting features and target
X = df.drop(columns=['Target'])
y = df['Target']
# Step 1: Feature Scaling
# Scale features to normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Dimensionality Reduction using PCA
# Reduce dimensions to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("\nData after PCA (Dimensionality Reduction):")
print(pd.DataFrame(X_pca, columns=['PC1', 'PC2']))
# Step 3: Feature Selection
# Select top 2 features based on statistical significance
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("\nData after Feature Selection (Top 2 Features):")
print(pd.DataFrame(X_selected, columns=['Feature1', 'Feature2']))
# Step 4: Sampling (Data Size Reduction)
# Split data into a smaller sample for testing
X_train, X_sample, y_train, y_sample = train_test_split(X, y, test_size=0.4,
random_state=42)
print("\nSampled Data (40% of Original Dataset):")
print(pd.DataFrame(X_sample, columns=X.columns))
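To judge how much information the reduction preserves, a short sketch (reusing the pca, selector, and X objects fitted above) prints the variance retained by the principal components and the F-scores behind the feature selection:
# Variance retained by each principal component
print("\nExplained variance ratio (PC1, PC2):", pca.explained_variance_ratio_)
# ANOVA F-scores used by SelectKBest (higher = more relevant to Target)
print("Feature scores:", dict(zip(X.columns, selector.scores_)))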
4. Write a Python Program and include data transformation functions
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from scipy.stats import boxcox
# Sample Dataset
data = {
'Feature1': [10, 20, 30, 40, 50],
'Feature2': [1.2, 3.5, 5.1, 7.3, 9.0],
'Feature3': [1000, 1500, 2000, 2500, 3000],
'Category': ['A', 'B', 'A', 'C', 'B']
}
# Load data into a DataFrame
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
# Step 1: Normalization
# Normalize Feature2 to a range of [0, 1]
minmax_scaler = MinMaxScaler()
df['Feature2_Normalized'] = minmax_scaler.fit_transform(df[['Feature2']])
# Step 2: Standardization
# Standardize Feature1 to have a mean of 0 and standard deviation of 1
std_scaler = StandardScaler()
df['Feature1_Standardized'] = std_scaler.fit_transform(df[['Feature1']])
# Step 3: Log Transformation
# Apply log transformation to Feature3 to reduce skewness
df['Feature3_Log'] = np.log(df['Feature3'])
# Step 4: Power Transformation (Box-Cox Transformation)
# Apply Box-Cox transformation to Feature2 (requires positive values)
df['Feature2_BoxCox'], _ = boxcox(df['Feature2'])
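# Note (sketch): boxcox also returns the fitted lambda, discarded above with '_'.
# Keeping it lets you invert the transform later, e.g.:
# df['Feature2_BoxCox'], bc_lambda = boxcox(df['Feature2'])
# from scipy.special import inv_boxcox
# original = inv_boxcox(df['Feature2_BoxCox'], bc_lambda)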
# Step 5: Encoding Categorical Variables
# Label Encoding for Category
label_encoder = LabelEncoder()
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])
# One-Hot Encoding for Category (sparse_output=False for scikit-learn >= 1.2; older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(df[['Category']])
onehot_encoded_df = pd.DataFrame(onehot_encoded,
columns=onehot_encoder.get_feature_names_out(['Category']))
df = pd.concat([df, onehot_encoded_df], axis=1)
# Display Transformed Dataset
print("\nTransformed Dataset:")
print(df)
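A simple sanity check (a sketch, assuming the df built above) is that the normalized column lies in [0, 1] and the standardized column has roughly zero mean and unit variance:
# Sanity-check the scaled columns
print("\nFeature2_Normalized range:", df['Feature2_Normalized'].min(), "-", df['Feature2_Normalized'].max())
print("Feature1_Standardized mean/std:",
      round(df['Feature1_Standardized'].mean(), 3),
      round(df['Feature1_Standardized'].std(ddof=0), 3))  # ddof=0 matches StandardScaler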
5. Write a Python program for Feature Engineering concepts (include Titanic dataset)
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
# Load Titanic dataset (replace 'train.csv' with the path to your dataset)
data = pd.read_csv('train.csv')
# Display the first few rows of the dataset
print("Initial Data Sample:")
print(data.head())
# Function for feature engineering
def feature_engineering(data):
    # Make a copy of the dataset
    df = data.copy()
    # Handle missing values
    imputer_age = SimpleImputer(strategy='median')
    df['Age'] = imputer_age.fit_transform(df[['Age']])
    imputer_embarked = SimpleImputer(strategy='most_frequent')
    df['Embarked'] = imputer_embarked.fit_transform(df[['Embarked']])
    # Drop columns that are not useful
    df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)
    # Encode categorical variables
    label_encoder = LabelEncoder()
    df['Sex'] = label_encoder.fit_transform(df['Sex'])
    # sparse_output=False for scikit-learn >= 1.2 (older versions use sparse=False)
    one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')
    embarked_encoded = one_hot_encoder.fit_transform(df[['Embarked']])
    embarked_encoded_df = pd.DataFrame(embarked_encoded,
                                       columns=one_hot_encoder.get_feature_names_out(['Embarked']))
    df = pd.concat([df, embarked_encoded_df], axis=1)
    df.drop(['Embarked'], axis=1, inplace=True)
    # Create new features
    df['FamilySize'] = df['SibSp'] + df['Parch']
    df['IsAlone'] = (df['FamilySize'] == 0).astype(int)
    # Scale numerical features
    scaler = StandardScaler()
    numerical_features = ['Age', 'Fare']
    df[numerical_features] = scaler.fit_transform(df[numerical_features])
    return df
# Perform feature engineering on the Titanic dataset
processed_data = feature_engineering(data)
# Display the processed data
print("\nProcessed Data Sample:")
print(processed_data.head())
# Save the processed data to a new CSV file
processed_data.to_csv('processed_titanic_data.csv', index=False)
print("\nProcessed data saved to 'processed_titanic_data.csv'.")