Final
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539382 entries, 0 to 539381
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Flight 539382 non-null int64
1 Time 539382 non-null int64
2 Length 539382 non-null int64
3 Airline 539382 non-null object
4 AirportFrom 539382 non-null object
5 AirportTo 539382 non-null object
6 DayOfWeek 539382 non-null int64
7 Delay 539382 non-null int64
dtypes: int64(5), object(3)
memory usage: 32.9+ MB
[213]: sns.countplot(data=df, x="Airline", hue="Delay")
[214]: import matplotlib.pyplot as plt
df['Delay'].value_counts().plot(kind='bar', color=['blue', 'orange'])
plt.title("Target Distribution: Delay")
plt.xlabel("Delay (0: No, 1: Yes)")
plt.ylabel("Count")
plt.show()
[215]: # Plotting density plots for each feature
# df.plot(kind='density', subplots=True, layout=(4, 2), figsize=(12, 12), sharex=False)
# plt.tight_layout()
# plt.show()
[218]: df = df[df['Length'] > 0]
print(df.shape)
(539378, 8)
0.2 Feature engineering
[220]: from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
le = LabelEncoder()
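The cell that applies the encoder is not shown here. A minimal sketch of one way to do it, assuming the three object columns from the `df.info()` output (`Airline`, `AirportFrom`, `AirportTo`) are the ones encoded; the toy frame and the `encoders` dictionary are illustrative, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the flight data (hypothetical values).
toy = pd.DataFrame({
    "Airline": ["CO", "US", "AA", "CO"],
    "AirportFrom": ["SFO", "PHX", "LAX", "SFO"],
    "AirportTo": ["IAH", "CLT", "DFW", "IAH"],
})

# Assumption: these are the columns that need encoding.
categorical_features = ["Airline", "AirportFrom", "AirportTo"]

# Fit one encoder per column and keep it, so the integer codes
# can be mapped back to the original labels later.
encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le

print(toy)
```

Keeping a separate fitted encoder per column avoids overwriting the mapping each time `fit_transform` is called on the next column.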
[221]: df.head()
# Apply StandardScaler
df.loc[:, numerical_features] = scaler.fit_transform(df[numerical_features]).astype('float64')
print(df[numerical_features].dtypes)
Time float64
Length float64
dtype: object
C:\Users\4217m\AppData\Local\Temp\ipykernel_36332\1039195114.py:8:
FutureWarning: Setting an item of incompatible dtype is deprecated and will
raise in a future error of pandas. Value '[ 1.77406756 -1.59228932 1.32090414
… 0.08729259 -0.31551935
-0.11770991]' has dtype incompatible with int64, please explicitly cast to a
compatible dtype first.
df.loc[:, numerical_features] =
scaler.fit_transform(df[numerical_features]).astype('float64')
C:\Users\4217m\AppData\Local\Temp\ipykernel_36332\1039195114.py:8:
FutureWarning: Setting an item of incompatible dtype is deprecated and will
raise in a future error of pandas. Value '[ 0.1866901 0.2674041 0.2189757
… -0.89487741 -1.04016259
-1.20159058]' has dtype incompatible with int64, please explicitly cast to a
compatible dtype first.
df.loc[:, numerical_features] =
scaler.fit_transform(df[numerical_features]).astype('float64')
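The FutureWarning above appears because float values are written in place into `int64` columns. A sketch of the fix the warning itself suggests, casting the columns to `float64` before the assignment; the toy frame is hypothetical and `numerical_features` is assumed to be `Time` and `Length`, as the printed dtypes indicate:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with integer columns, standing in for the flight data (hypothetical values).
toy = pd.DataFrame({"Time": [15, 640, 1210, 900], "Length": [205, 222, 165, 195]})
numerical_features = ["Time", "Length"]

# Cast to float64 first so the scaled values have a compatible dtype.
toy = toy.astype({col: "float64" for col in numerical_features})

scaler = StandardScaler()
toy.loc[:, numerical_features] = scaler.fit_transform(toy[numerical_features])

print(toy[numerical_features].dtypes)
```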
Create features that combine information from multiple columns
[223]: # Add a feature for long flights
df.loc[:, 'IsLongFlight'] = (df['Length'] > df['Length'].median()).astype(int)
[226]: df.head()
0 1 0 0.105233
1 1 0 -0.167937
2 1 0 0.165777
3 1 1 1.112847
4 0 0 1.274233
Correlation with Delay:
Delay 1.000000
Time 0.150453
Airline 0.066936
AirportTo 0.047984
Length 0.044238
IsLongFlight 0.026485
AirportFrom 0.018461
Speed -0.000088
IsWeekend -0.018198
DayOfWeek -0.026196
Flight -0.046178
Name: Delay, dtype: float64
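The cell that produced this ranking is not shown. One common way to get such a listing, sketched on a toy frame with hypothetical values, is to take the `Delay` column of the correlation matrix and sort it:

```python
import pandas as pd

# Toy numeric frame standing in for the encoded flight data (hypothetical values).
toy = pd.DataFrame({
    "Time": [1, 2, 3, 4, 5, 6],
    "Length": [6, 5, 4, 3, 2, 1],
    "Delay": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every column with the target, highest first.
correlations = toy.corr()["Delay"].sort_values(ascending=False)

print("Correlation with Delay:")
print(correlations)
```

Pearson correlation only measures linear association, so a near-zero value for a label-encoded column such as `AirportFrom` does not by itself show the column is uninformative for a tree-based model.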
df = df.drop(columns=['Speed', 'IsWeekend', 'IsLongFlight'])  # Retain Length and DayOfWeek
# Make predictions
y_pred = model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Classification Report:
precision recall f1-score support
Confusion Matrix:
[[40111 19643]
[21530 26592]]
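Since the classification report values did not survive in this export, the headline metrics can be recomputed from the confusion matrix alone. A short sketch, relying on scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[40111, 19643],
               [21530, 26592]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of flights predicted delayed, share actually delayed
recall = tp / (tp + fn)      # of delayed flights, share the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```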
# # Make predictions
# y_pred = model.predict(X_test)
# print("Confusion Matrix:")
# print(confusion_matrix(y_test, y_pred))
# Initialize SMOTE
smote = SMOTE(random_state=42)
# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Confusion Matrix:
[[38513 21241]
[20298 27824]]
lgbm_model.fit(X_train_sm, y_train_sm)
# Make predictions
y_pred = lgbm_model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Confusion Matrix:
[[41829 17925]
[19423 28699]]
# Make predictions
y_pred = xgb_model.predict(X_test)
# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
C:\Users\4217m\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n
2kfra8p0\LocalCache\local-packages\Python311\site-packages\xgboost\core.py:158:
UserWarning: [23:38:08] WARNING: C:\buildkite-agent\builds\buildkite-windows-
cpu-autoscaling-group-i-0ed59c031377d09b8-1\xgboost\xgboost-ci-
windows\src\learner.cc:740:
Parameters: { "use_label_encoder" } are not used.
warnings.warn(smsg, UserWarning)
Classification Report:
precision recall f1-score support
Confusion Matrix:
[[41498 18256]
[18698 29424]]
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300]
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1
)
grid_search.fit(X_train_sm, y_train_sm)
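Before launching a search like this on roughly half a million rows, it helps to know how many models it will train. A small sketch that counts the candidates with scikit-learn's `ParameterGrid`, using the same grid and fold count as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
}
cv = 3

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv

print(f"{n_candidates} candidates x {cv} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full training set using the best parameters.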
print(grid_search.best_params_)
# Make predictions
y_pred = best_xgb_model.predict(X_test)
Classification Report:
precision recall f1-score support
Confusion Matrix:
[[41686 18068]
[18080 30042]]
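To compare the runs side by side, accuracy and recall for the delayed class can be derived from the five confusion matrices printed in this notebook. In the sketch below the dictionary keys reuse the variable names from the prediction cells; the second key is a placeholder, since the model behind that matrix is not shown:

```python
# Confusion matrices copied from the outputs above, in order of appearance.
# Rows are true classes, columns are predicted classes.
results = {
    "model": [[40111, 19643], [21530, 26592]],
    "second run (model not shown)": [[38513, 21241], [20298, 27824]],
    "lgbm_model": [[41829, 17925], [19423, 28699]],
    "xgb_model": [[41498, 18256], [18698, 29424]],
    "best_xgb_model": [[41686, 18068], [18080, 30042]],
}


def accuracy(cm):
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)


def recall(cm):
    (_, _), (fn, tp) = cm
    return tp / (tp + fn)


for name, cm in results.items():
    print(f"{name:<30} accuracy={accuracy(cm):.4f} recall={recall(cm):.4f}")

best = max(results, key=lambda name: accuracy(results[name]))
print("Highest accuracy:", best)
```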