S-2
Ticket:N3
Group:E27-24
Eldar
Questions:
1. Feature engineering: You are developing a model to predict car prices based on
features such as mileage, age, and condition. Explain how you would create new
features from this data to improve model performance.
2. Model selection: You are tasked with predicting heart disease based on patient
health records. How would you decide between using logistic regression and
decision trees for this classification problem?
3. Cross-validation: You have developed a machine learning model for housing price
prediction and want to test its robustness. How would you implement k-fold
cross-validation to ensure your model generalizes well to unseen data?
4. Hyperparameter tuning: You are using XGBoost for a classification task, but your
model is overfitting. How would you adjust the learning rate and other
hyperparameters to address this issue?
5. Evaluation metrics: In a binary classification problem with a highly imbalanced
dataset, what evaluation metric would you prioritize to avoid the problem of
misleading accuracy?
Answers:
1. Feature Engineering for Car Price Prediction:
1. Mileage Binning:
Group mileage into categories like "low," "medium," and "high" to
capture nonlinear relationships.
2. Age Buckets:
Create age ranges, e.g., "0-3 years," "4-7 years," etc., to better
represent depreciation patterns.
3. Condition Encoding:
Convert categorical "condition" values (e.g., "excellent," "good,"
"fair") into numeric labels or use one-hot encoding.
4. Mileage per Year:
Create a new feature: mileage divided by age, representing how
intensively the car was used annually.
5. Interaction Features:
Combine features such as age × mileage or age × condition to
capture interactions affecting price.
6. Log Transformation:
Apply log transformation to skewed features like mileage to
normalize their distribution.
7. Age-Condition Interaction:
Combine age and condition into a single feature to show how wear
and tear varies by age.
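These transformations can be sketched with pandas; the column names, bin edges, and sample values below are illustrative assumptions, not part of any real dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data (values chosen only for illustration).
cars = pd.DataFrame({
    "mileage": [15000, 80000, 160000, 40000],
    "age": [1, 5, 9, 3],
    "condition": ["excellent", "good", "fair", "good"],
})

# 1. Mileage binning: capture nonlinear price effects.
cars["mileage_bin"] = pd.cut(cars["mileage"],
                             bins=[0, 50000, 120000, np.inf],
                             labels=["low", "medium", "high"])

# 2. Age buckets: represent depreciation ranges.
cars["age_bucket"] = pd.cut(cars["age"], bins=[0, 3, 7, np.inf],
                            labels=["0-3", "4-7", "8+"])

# 3. Condition encoding: ordinal map from category to number.
cars["condition_num"] = cars["condition"].map(
    {"fair": 0, "good": 1, "excellent": 2})

# 4. Mileage per year: intensity of use (guard against age 0).
cars["mileage_per_year"] = cars["mileage"] / cars["age"].clip(lower=1)

# 5. Interaction feature: age x mileage.
cars["age_x_mileage"] = cars["age"] * cars["mileage"]

# 6. Log transform of skewed mileage.
cars["log_mileage"] = np.log1p(cars["mileage"])

print(cars[["mileage_bin", "condition_num", "mileage_per_year"]])
```

The same bins and maps would then be fitted on training data only and reapplied to test data to avoid leakage.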
2. Model Selection for Heart Disease Prediction:
1. Logistic Regression:
o Suitable if the relationship between features and the target is
approximately linear.
o Works well with fewer, independent, and continuous
features.
o Provides interpretable results (e.g., feature coefficients
showing impact on heart disease).
o Less prone to overfitting if data is limited.
2. Decision Trees:
o Handles nonlinear relationships and interactions between
features effectively.
o Works well with mixed data types (numerical and
categorical).
o Provides intuitive decision rules but is prone to overfitting
without pruning or regularization.
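In practice, the decision between the two candidates is usually made empirically. A minimal sketch using scikit-learn, with a synthetic dataset standing in for real patient records (an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for patient health records.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Score each candidate with 5-fold cross-validated ROC-AUC.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5,
                                   scoring="roc_auc").mean()

for name, auc in scores.items():
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```

If the two scores are close, the more interpretable logistic regression is often preferred in a medical setting.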
3. K-Fold Cross-Validation for Housing Price Prediction:
1. Choose the number of folds, k. A common choice is k = 5 or k = 10,
balancing computational efficiency and robust evaluation.
2. Divide the dataset into k equally sized (or nearly equal) subsets (folds). Ensure the splitting is
random, but stratified if the target variable has an imbalanced distribution, to maintain
proportionality across folds.
3. Train the model on the remaining k − 1 folds and evaluate it on the held-out fold.
4. Rotate the validation fold so that every fold is used exactly once as a validation set. This ensures
the model is tested on all data points without overlap between training and validation in any
iteration.
5. Aggregate Results:
Calculate performance metrics (e.g., mean absolute error (MAE), root mean squared error
(RMSE), or R²) for each fold.
Compute the average and standard deviation of these metrics to get an overall measure of
the model's performance and robustness.
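The steps above can be sketched in a few lines with scikit-learn's KFold; the synthetic regression data is a stand-in for a real housing dataset (an assumption):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for a housing-price dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0,
                       random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
maes, rmses = [], []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    maes.append(mean_absolute_error(y[val_idx], pred))
    rmses.append(np.sqrt(mean_squared_error(y[val_idx], pred)))

# Aggregate: mean and standard deviation across folds.
print(f"MAE:  {np.mean(maes):.2f} +/- {np.std(maes):.2f}")
print(f"RMSE: {np.mean(rmses):.2f} +/- {np.std(rmses):.2f}")
```

A small standard deviation across folds is the signal that performance is stable on unseen data.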
6. Additional Considerations:
By implementing k-fold cross-validation, you can confidently assess your housing price
prediction model's ability to generalize, avoiding overfitting and ensuring stable performance.
4. Hyperparameter Tuning in XGBoost to Address Overfitting
XGBoost (Extreme Gradient Boosting) is a powerful algorithm, but it can overfit, especially if
the model is too complex or improperly tuned. Below are strategies for tuning key
hyperparameters to mitigate overfitting.
1. Lower the Learning Rate:
The learning rate (η) controls the step size at each iteration when optimizing weights.
Lower the learning rate: reduce it to a smaller value (e.g., from 0.3 to 0.01 or
0.001).
Compensate with more trees: when lowering η, increase the number of boosting
rounds (n_estimators) to allow the model to learn gradually and reduce the risk of
overfitting.
2. Tune Tree-Specific Parameters:
XGBoost uses decision trees as base learners, so tree-specific parameters (e.g., max_depth,
min_child_weight) significantly impact model complexity.
3. Increase Regularization:
lambda (L2 regularization): increase this parameter to reduce the impact of large weights.
alpha (L1 regularization): use this to drive certain weights to zero, especially for sparse
data.
4. Use Subsampling:
subsample: use a fraction of the training data for each tree (e.g., 0.5-0.8).
colsample_bytree: use a fraction of features for each tree (e.g., 0.5-0.8).
These parameters introduce randomness and reduce the risk of overfitting.
5. Early Stopping:
Use early stopping to monitor performance on a validation set. Stop training when the
validation loss or error does not improve after a certain number of rounds (e.g.,
early_stopping_rounds=50).
6. Cross-Validation:
Use k-fold cross-validation while tuning to confirm that improvements hold across
folds rather than on a single validation split.
5. Evaluation Metrics for Imbalanced Binary Classification:
Precision, recall, and the F1-score are crucial for imbalanced datasets, as they focus on
individual class performance:
o F1-score is ideal when both false positives and false negatives are
important.
Analyze the confusion matrix to understand the true positives, true negatives,
false positives, and false negatives. This helps tailor metrics to your use case.
ROC-AUC:
When to Use: If the negative class is also important, ROC-AUC can be helpful.
Caveat: In highly imbalanced datasets, ROC-AUC may give overly optimistic
results due to the large number of true negatives.
Example:
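A small sketch of why accuracy misleads on imbalanced data, using made-up counts (90 negatives, 10 positives) and a degenerate majority-class predictor:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Imbalanced toy labels: 90 negatives, 10 positives (illustrative numbers).
y_true = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

# Accuracy looks great even though no positive case is ever found.
print("accuracy: ", accuracy_score(y_true, y_pred))                    # 0.9
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall:   ", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("f1:       ", f1_score(y_true, y_pred, zero_division=0))         # 0.0
print(confusion_matrix(y_true, y_pred))
```

Here accuracy is 0.9 while recall and F1 are 0.0, which is why F1 (or recall, when false negatives dominate the cost) should be prioritized over raw accuracy.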