Final

ML

Uploaded by

ralajody
Copyright
© All Rights Reserved
final

November 21, 2024

0.1 Data Load & Evaluation


[207]: # Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

[208]: # Load the dataset


df = pd.read_csv('./airlines_delay.csv')  # Replace with your actual dataset path

[209]: # Display the first few rows


print(df.head())

   Flight  Time  Length Airline AirportFrom AirportTo  DayOfWeek  Delay
0    2313  1296     141      DL         ATL       HOU          1      0
1    6948   360     146      OO         COS       ORD          4      0
2    1247  1170     143      B6         BOS       CLT          3      0
3      31  1410     344      US         OGG       PHX          6      0
4     563   692      98      FL         BMI       ATL          4      0

[210]: # Check the shape of the dataset


print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

The dataset has 539382 rows and 8 columns.

[211]: # Display basic information about the dataset


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539382 entries, 0 to 539381
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   Flight       539382 non-null  int64
 1   Time         539382 non-null  int64
 2   Length       539382 non-null  int64
 3   Airline      539382 non-null  object
 4   AirportFrom  539382 non-null  object
 5   AirportTo    539382 non-null  object
 6   DayOfWeek    539382 non-null  int64
 7   Delay        539382 non-null  int64
dtypes: int64(5), object(3)
memory usage: 32.9+ MB

[212]: # Check for missing values


print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:


Flight 0
Time 0
Length 0
Airline 0
AirportFrom 0
AirportTo 0
DayOfWeek 0
Delay 0
dtype: int64

[213]: sns.countplot(data=df,x="Airline",hue="Delay")

[213]: <Axes: xlabel='Airline', ylabel='count'>

[214]: import matplotlib.pyplot as plt
df['Delay'].value_counts().plot(kind='bar', color=['blue', 'orange'])
plt.title("Target Distribution: Delay")
plt.xlabel("Delay (0: No, 1: Yes)")
plt.ylabel("Count")
plt.show()

[215]: # Plotting density plots for each feature
# df.plot(kind='density', subplots=True, layout=(4, 2), figsize=(12, 12), sharex=False)

# plt.tight_layout()
# plt.show()

[216]: # Plot boxplots for numerical features


# numerical_features = ['Time', 'Length']
# for feature in numerical_features:
#     sns.boxplot(data=df, x=feature)
#     plt.title(f"Boxplot for {feature}")
#     plt.show()

[217]: print(df[df['Length'] == 0])

        Flight  Time  Length Airline AirportFrom AirportTo  DayOfWeek  Delay
190180     106   635       0      F9         DEN       MSP          6      0
330963     103   375       0      F9         MSP       DEN          7      0
339408     107   851       0      F9         MSP       DEN          6      0
373680     493  1060       0      B6         BOS       SEA          7      1

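[218]: # Drop the four zero-length flights; .copy() makes later column assignments act on an independent frame
df = df[df['Length'] > 0].copy()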
print(df.shape)

(539378, 8)

[219]: # Cap outliers in 'Length' using the 95th percentile


upper_bound = df['Length'].quantile(0.95) # 95th percentile

# Use .loc to explicitly modify the column


df.loc[df['Length'] > upper_bound, 'Length'] = upper_bound

# Check the updated column


print(f"Updated statistics for 'Length':\n{df['Length'].describe()}")

# Boxplot after capping


plt.figure(figsize=(8, 4))
sns.boxplot(x=df['Length'])
plt.title('Boxplot for Length (After Capping Outliers)')
plt.show()

Updated statistics for 'Length':


count 539378.000000
mean 129.435084
std 61.947185
min 23.000000
25% 81.000000
50% 115.000000
75% 162.000000
max 280.000000
Name: Length, dtype: float64
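
The capping step above can also be expressed with `Series.clip`, which returns a capped copy instead of assigning through a boolean mask. The following is a minimal sketch on made-up durations; the values in `lengths` are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Illustrative flight lengths in minutes (hypothetical values)
lengths = pd.Series([30, 60, 90, 120, 150, 180, 210, 240, 270, 600], name="Length")

# The 95th percentile acts as the cap, as in the cell above
upper_bound = lengths.quantile(0.95)

# clip() returns a new Series; values above the cap are set to the cap
capped = lengths.clip(upper=upper_bound)

print(upper_bound)
print(capped.max())
```

Values at or below the cap are left unchanged, so only the upper tail of the distribution is affected.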

0.2 Feature Engineering
[220]: from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
le = LabelEncoder()

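# Apply label encoding. Assigning with df[col] replaces the column, so it takes
# the encoder's integer dtype; df.loc[:, col] = ... writes in place and can
# leave the column as object dtype.
for col in ['Airline', 'AirportFrom', 'AirportTo']:
    df[col] = le.fit_transform(df[col])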

[221]: df.head()

[221]:    Flight  Time  Length  Airline  AirportFrom  AirportTo  DayOfWeek  Delay
       0    2313  1296     141        5           16        129          1      0
       1    6948   360     146       12           65        208          4      0
       2    1247  1170     143        3           35         60          3      0
       3      31  1410     280       14          203        217          6      0
       4     563   692      98        8           32         16          4      0
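
Because a single `LabelEncoder` is refitted for each column above, only the last column's mapping remains available afterwards. A small sketch of one alternative, keeping one fitted encoder per column so the integer codes can be mapped back to labels; the frame `demo` and the dictionary `encoders` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame (hypothetical rows using the dataset's column names)
demo = pd.DataFrame({
    "Airline": ["DL", "OO", "B6", "DL"],
    "AirportFrom": ["ATL", "COS", "BOS", "ATL"],
})

encoders = {}
for col in ["Airline", "AirportFrom"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])  # codes follow sorted label order
    encoders[col] = enc                       # keep the fitted encoder per column

# Map the integer codes back to the original labels
airlines = encoders["Airline"].inverse_transform(demo["Airline"])
print(list(airlines))
```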

[222]: from sklearn.preprocessing import StandardScaler

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['Time', 'Length']

# Apply StandardScaler. Assigning with df[...] replaces the int64 columns with
# float64 ones; writing the floats in place via df.loc[:, ...] triggers a pandas
# FutureWarning about setting values of an incompatible dtype.
df[numerical_features] = scaler.fit_transform(df[numerical_features])

print(df[numerical_features].dtypes)

Time float64
Length float64
dtype: object
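
The scaler above is fitted on the full dataset before the train/test split, so test rows contribute to the mean and standard deviation. A minimal sketch of the alternative ordering, assuming synthetic stand-ins for `Time` and `Length` (the array `X_demo` and its value ranges are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the Time and Length columns
X_demo = rng.uniform(low=[0, 20], high=[1440, 300], size=(1000, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with zero mean by construction, while the test columns are only approximately centred, which is the expected behaviour.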
Create derived features, including one that combines information from multiple columns.
[223]: # Add a feature for long flights
df.loc[:, 'IsLongFlight'] = (df['Length'] > df['Length'].median()).astype(int)

[224]: # Convert DayOfWeek to binary (e.g., is_weekend)


df.loc[:, 'IsWeekend'] = df['DayOfWeek'].isin([6, 7]).astype(int)

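[225]: # Create a ratio feature named 'Speed'. Note: Length and Time are already
# standardized at this point, so the ratio is not a physical speed.
df.loc[:, 'Speed'] = df['Length'] / df['Time']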

[226]: df.head()

[226]:    Flight      Time    Length  Airline  AirportFrom  AirportTo  DayOfWeek  Delay  \
       0    2313  1.774068  0.186690        5           16        129          1      0
       1    6948 -1.592289  0.267404       12           65        208          4      0
       2    1247  1.320904  0.218976        3           35         60          3      0
       3      31  2.184073  2.430539       14          203        217          6      0
       4     563 -0.398240 -0.507450        8           32         16          4      0

          IsLongFlight  IsWeekend     Speed
       0             1          0  0.105233
       1             1          0 -0.167937
       2             1          0  0.165777
       3             1          1  1.112847
       4             0          0  1.274233
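
Features that combine columns are usually easier to interpret when built from the raw values, before encoding and scaling. The sketch below uses the first three rows shown at the top of the notebook; the names `Route` and `DepartureHour` are hypothetical, and reading `Time` as minutes after midnight is an assumption rather than something the notebook states.

```python
import pandas as pd

# First three rows of the raw data, before encoding and scaling
raw = pd.DataFrame({
    "Time": [1296, 360, 1170],
    "AirportFrom": ["ATL", "COS", "BOS"],
    "AirportTo": ["HOU", "ORD", "CLT"],
})

# Combine origin and destination into a single categorical feature
raw["Route"] = raw["AirportFrom"] + "-" + raw["AirportTo"]

# Assumption: Time is minutes after midnight, so integer division gives the hour
raw["DepartureHour"] = raw["Time"] // 60

print(raw)
```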

[227]: import seaborn as sns


import matplotlib.pyplot as plt

# Calculate correlation matrix


correlation_matrix = df.corr()

# Plot heatmap of correlation matrix


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

# Correlation with target variable (Delay)


print("Correlation with Delay:")
print(correlation_matrix['Delay'].sort_values(ascending=False))

Correlation with Delay:
Delay 1.000000
Time 0.150453
Airline 0.066936
AirportTo 0.047984
Length 0.044238
IsLongFlight 0.026485
AirportFrom 0.018461
Speed -0.000088
IsWeekend -0.018198
DayOfWeek -0.026196
Flight -0.046178
Name: Delay, dtype: float64
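
For label-encoded columns such as `Airline`, the Pearson correlation depends on the arbitrary integer codes, so a per-category delay rate may be easier to read. A minimal sketch on made-up rows (the frame `demo` is illustrative only):

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be the raw Airline and Delay columns
demo = pd.DataFrame({
    "Airline": ["DL", "DL", "OO", "OO", "OO", "B6"],
    "Delay":   [0,    1,    1,    1,    0,    0],
})

# Mean of a 0/1 target per group is the share of delayed flights in that group
delay_rate = demo.groupby("Airline")["Delay"].mean().sort_values(ascending=False)
print(delay_rate)
```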

[228]: # Drop low-impact and redundant features
df = df.drop(columns=['Speed', 'IsWeekend', 'IsLongFlight'])  # Retain Length and DayOfWeek

0.3 Model Training


[229]: # Define features and target
X = df.drop('Delay', axis=1) # Drop the target column
y = df['Delay'] # Target variable

[230]: from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Verify the shapes of the splits


print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(431502, 7) (107876, 7) (431502,) (107876,)
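
The split above is random without stratification. If identical class proportions in both partitions are wanted, `train_test_split` accepts a `stratify` argument. A small sketch on synthetic labels with roughly the same class ratio as this dataset (`X_demo` and `y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the 55/45 split seen in this dataset
y_demo = np.array([0] * 550 + [1] * 450)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both partitions keep the original share of the positive class
print(y_tr.mean(), y_te.mean())
```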

[231]: from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import classification_report, confusion_matrix

# Train a Random Forest Classifier


model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model


print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.67      0.66     59754
           1       0.58      0.55      0.56     48122

    accuracy                           0.62    107876
   macro avg       0.61      0.61      0.61    107876
weighted avg       0.62      0.62      0.62    107876

Confusion Matrix:
[[40111 19643]
 [21530 26592]]
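
The figures in the classification report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A short sketch recomputing the class-1 precision, class-1 recall, and overall accuracy from the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[40111, 19643],
               [21530, 26592]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # of flights predicted delayed, share truly delayed
recall_1 = tp / (tp + fn)         # of truly delayed flights, share that were caught
accuracy = (tn + tp) / cm.sum()   # share of all test flights classified correctly

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```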

[232]: # from sklearn.ensemble import RandomForestClassifier


# from sklearn.metrics import classification_report, confusion_matrix

# # Train a Random Forest Classifier


# model = RandomForestClassifier(random_state=42, class_weight='balanced')
# model.fit(X_train, y_train)

# # Make predictions
# y_pred = model.predict(X_test)

# # Evaluate the model


# print("Classification Report:")
# print(classification_report(y_test, y_pred))

# print("Confusion Matrix:")
# print(confusion_matrix(y_test, y_pred))

[233]: from imblearn.over_sampling import SMOTE


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to oversample the minority class


X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Check the distribution after SMOTE


print("Class distribution before SMOTE:", y_train.value_counts())
print("Class distribution after SMOTE:", y_train_sm.value_counts())

# Train a Random Forest Classifier


model = RandomForestClassifier(random_state=42)
model.fit(X_train_sm, y_train_sm)

# Make predictions on the test set


y_pred = model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Class distribution before SMOTE: Delay


0 239361
1 192141
Name: count, dtype: int64
Class distribution after SMOTE: Delay
1 239361
0 239361
Name: count, dtype: int64
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.64      0.65     59754
           1       0.57      0.58      0.57     48122

    accuracy                           0.61    107876
   macro avg       0.61      0.61      0.61    107876
weighted avg       0.62      0.61      0.62    107876

Confusion Matrix:
[[38513 21241]
 [20298 27824]]
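
To make the oversampling step less of a black box, here is a simplified NumPy sketch of the core SMOTE idea: each synthetic row is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` is a hypothetical illustration, not imblearn's implementation, and it assumes purely numeric features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority samples with two numeric features
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=10)
print(synthetic.shape)
```

Since the interpolation treats every column as continuous, applying it to label-encoded columns such as `Airline` produces in-between codes; imblearn's `SMOTENC` variant is intended for data that mixes categorical and continuous features.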

[245]: from sklearn.preprocessing import LabelEncoder

# Apply Label Encoding


label_encoders = {}
for col in ['Airline', 'AirportFrom', 'AirportTo']:
    le = LabelEncoder()
    X_train_sm[col] = le.fit_transform(X_train_sm[col])
    X_test[col] = le.transform(X_test[col])  # Use the same encoding for test data
    label_encoders[col] = le  # Save the encoder for later use
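
`LabelEncoder.transform` raises an error when the test data contains a category that was not present during fitting. One way to guard against that, sketched here under the assumption that scikit-learn 0.24 or newer is available, is `OrdinalEncoder` with an explicit code for unknown values. The frames `train` and `test` and the airline code "ZZ" are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Airline": ["DL", "OO", "B6"]})
test = pd.DataFrame({"Airline": ["DL", "ZZ"]})  # "ZZ" is a made-up unseen code

# Unseen categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train[["Airline"]])
test_codes = enc.transform(test[["Airline"]])

print(test_codes.ravel())
```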

[246]: from lightgbm import LGBMClassifier


from sklearn.metrics import classification_report, confusion_matrix

# Initialize the LightGBM Classifier


lgbm_model = LGBMClassifier(random_state=42)

# Train the model

lgbm_model.fit(X_train_sm, y_train_sm)

# Make predictions
y_pred = lgbm_model.predict(X_test)

# Evaluate the model


print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

[LightGBM] [Info] Number of positive: 239361, number of negative: 239361
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003909 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1412
[LightGBM] [Info] Number of data points in the train set: 478722, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.70      0.69     59754
           1       0.62      0.60      0.61     48122

    accuracy                           0.65    107876
   macro avg       0.65      0.65      0.65    107876
weighted avg       0.65      0.65      0.65    107876

Confusion Matrix:
[[41829 17925]
 [19423 28699]]

[247]: from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the XGBoost Classifier. The use_label_encoder argument is omitted:
# the XGBoost version used for this run reports it as unused and emits a
# UserWarning ('Parameters: { "use_label_encoder" } are not used').
xgb_model = XGBClassifier(random_state=42, eval_metric='logloss')

# Train the model
xgb_model.fit(X_train_sm, y_train_sm)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.69      0.69     59754
           1       0.62      0.61      0.61     48122

    accuracy                           0.66    107876
   macro avg       0.65      0.65      0.65    107876
weighted avg       0.66      0.66      0.66    107876

Confusion Matrix:
[[41498 18256]
 [18698 29424]]

[248]: from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300]
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1
)

grid_search.fit(X_train_sm, y_train_sm)

print(grid_search.best_params_)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


{'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 300}
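
All three selected values are the largest candidates in their lists, which often suggests the optimum may lie outside the searched range. A small helper for spotting this situation is sketched below; `params_on_grid_edge` is a hypothetical function name, and the dictionaries simply repeat the grid and result shown above.

```python
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [100, 200, 300],
}
best_params = {"learning_rate": 0.2, "max_depth": 7, "n_estimators": 300}

def params_on_grid_edge(grid, best):
    """Return the parameters whose best value is the smallest or largest candidate."""
    return sorted(
        name for name, value in best.items()
        if value in (min(grid[name]), max(grid[name]))
    )

print(params_on_grid_edge(param_grid, best_params))
```

When a parameter shows up in this list, extending its candidate values in that direction is a reasonable next experiment.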

[249]: # Train the model with best parameters


best_xgb_model = XGBClassifier(
    random_state=42,
    learning_rate=0.2,
    max_depth=7,
    n_estimators=300
)
best_xgb_model.fit(X_train_sm, y_train_sm)

# Make predictions
y_pred = best_xgb_model.predict(X_test)

# Evaluate the model


print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.70      0.70     59754
           1       0.62      0.62      0.62     48122

    accuracy                           0.66    107876
   macro avg       0.66      0.66      0.66    107876
weighted avg       0.66      0.66      0.66    107876

Confusion Matrix:
[[41686 18068]
 [18080 30042]]
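
To compare the runs side by side, the accuracies can be recomputed from the confusion matrices printed in this notebook and set against a majority-class baseline. The sketch below only restates those reported numbers; the model labels in `results` are informal names chosen here.

```python
import numpy as np

# Confusion matrices as reported in the outputs above
results = {
    "RandomForest": [[40111, 19643], [21530, 26592]],
    "RandomForest + SMOTE": [[38513, 21241], [20298, 27824]],
    "LightGBM + SMOTE": [[41829, 17925], [19423, 28699]],
    "XGBoost + SMOTE": [[41498, 18256], [18698, 29424]],
    "XGBoost + SMOTE (tuned)": [[41686, 18068], [18080, 30042]],
}

accuracy = {}
for name, cm in results.items():
    cm = np.array(cm)
    accuracy[name] = np.trace(cm) / cm.sum()  # correct predictions over all test rows

# Accuracy of always predicting the majority class (no delay) on the same test set
baseline = 59754 / 107876

for name, acc in accuracy.items():
    print(f"{name}: {acc:.4f}")
print(f"Majority-class baseline: {baseline:.4f}")
```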
