
EDS Mini Project

Uploaded by

Noor Hasan

Understanding Titanic Survival: A Statistical and Machine Learning Approach
Environmental Data Science (13300)
Mini-project

Noor Hasan
Mat no: 5008008
Date: 18.03.2025
Background
This study analyzes survival factors of passengers aboard the Titanic. The sinking of
the Titanic in 1912 led to the loss of over 1,500 lives, but survival rates varied
significantly based on demographic and socioeconomic factors.

What’s the dataset about?

● The dataset contains records for 891 passengers, including age, gender, class, fare,
and family relations.
● The dataset needs cleaning and transformation before analysis.
● The goal is to find which factors significantly influenced survival.

Research Questions
1. Did gender affect survival chances?
2. Did social class influence survival?
3. Did younger passengers have a better survival rate?
4. Did having family members onboard improve survival chances?
5. Can machine learning models predict survival accurately?

Data Analysis Methods
This analysis was conducted in R with RStudio, following these steps:

● Data Cleaning & Preprocessing:


○ Missing values in Age were filled with the median, and missing values in Embarked were replaced
with the most frequent value.
○ Unnecessary columns (PassengerId, Ticket, Cabin) were removed.
○ Categorical variables (Sex, Pclass, Embarked) were converted into factors.
● Exploratory Data Analysis (EDA):
○ Examined survival rates across gender, class, age groups, and family size.
○ Created bar charts to visualize differences.
● Statistical Testing & Feature Analysis:
○ Chi-Square Tests: Checked the association between categorical variables (e.g., gender and
survival).
○ Logistic Regression: Measured the impact of different factors on survival.
○ Random Forest Model: Evaluated feature importance and improved prediction accuracy.
● Libraries used: dplyr, ggplot2, caret, randomForest
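The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration in base R, not the project's actual script: the data frame `titanic` and all of its values are invented, the `Survived` column name is an assumption, and only the column names mentioned in the slides are taken from the source.

```r
# Toy stand-in for the passenger table; every value here is invented.
titanic <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 0, 1, 0),          # assumed column name
  Pclass      = c(3, 1, 3, 1, 2, 3),
  Sex         = c("male", "female", "female", "male", "female", "male"),
  Age         = c(22, 38, NA, 54, NA, 30),
  Ticket      = c("A1", "B2", "C3", "D4", "E5", "F6"),
  Cabin       = c(NA, "C85", NA, "E46", NA, NA),
  Embarked    = c("S", "C", "", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Fill missing Age values with the median of the observed ages.
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)

# Treat empty strings as missing, then fill Embarked with its most frequent value.
titanic$Embarked[titanic$Embarked == ""] <- NA
most_common_port <- names(which.max(table(titanic$Embarked)))
titanic$Embarked[is.na(titanic$Embarked)] <- most_common_port

# Drop the columns that are not used in the analysis.
titanic <- titanic[, !(names(titanic) %in% c("PassengerId", "Ticket", "Cabin"))]

# Convert the categorical variables into factors.
for (col in c("Sex", "Pclass", "Embarked")) {
  titanic[[col]] <- factor(titanic[[col]])
}

str(titanic)
```

The same steps could equally be written with dplyr's `mutate()` and `select()`; base R is used here only to keep the sketch free of package dependencies.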
Results and Discussion

Did Gender Affect Survival?
● Women had a significantly higher survival rate than men.
● Chi-Square Test: p-value < 2.2e-16 → Strong evidence that gender influenced survival.

Did Social Class Influence Survival?
● First-class passengers had a much higher survival rate than third-class passengers.
● Chi-Square Test: p-value < 2.2e-16 → Strong evidence that class influenced survival.
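As an illustration of how such a chi-square test of association is run in R, the sketch below uses the `Titanic` contingency table that ships with base R. Note the assumption: this built-in table covers 2201 people including crew, so it is not the 891-passenger dataset analyzed here, and its p-values will not match the ones reported above exactly.

```r
# Built-in 4-way table: Class x Sex x Age x Survived (2201 people, crew included).
# Collapse it to the two-way tables needed for each test.
sex_by_survival   <- margin.table(Titanic, c(2, 4))  # Sex x Survived
class_by_survival <- margin.table(Titanic, c(1, 4))  # Class x Survived

# Pearson chi-square tests of independence.
# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
sex_test   <- chisq.test(sex_by_survival)
class_test <- chisq.test(class_by_survival)

# Row-wise proportions show the survival rate within each group.
print(prop.table(sex_by_survival, margin = 1))
print(prop.table(class_by_survival, margin = 1))

print(sex_test)
print(class_test)
```

A very small p-value only indicates that survival and the grouping variable are not independent; the row proportions are what show the direction of the difference.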

Results and Discussion

Family Influence on Survival
● Having more family members onboard slightly improved survival chances.
● Chi-Square Test: p-value = 0.003 → Significant association between family size and survival.

Age and Survival
● Older passengers had a lower chance of survival.
● Chi-Square Test: p-value < 0.001 → Significant association between age group and survival.
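The slides do not say how age groups and family size were defined, so the following is only one plausible way to derive them before running a chi-square test. Everything in it is an assumption: the data are randomly simulated (so the test result is meaningless), the cut points and labels are arbitrary, and the `SibSp`, `Parch`, and `Survived` column names follow the commonly distributed version of the dataset rather than anything stated in the source.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data; not the real passenger records.
n <- 300
passengers <- data.frame(
  Age      = round(runif(n, 1, 80)),
  SibSp    = sample(0:3, n, replace = TRUE),  # siblings/spouses onboard
  Parch    = sample(0:2, n, replace = TRUE),  # parents/children onboard
  Survived = rbinom(n, 1, 0.4)
)

# Bin age into groups; intervals are (0,12], (12,18], (18,60], (60,Inf].
passengers$AgeGroup <- cut(
  passengers$Age,
  breaks = c(0, 12, 18, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Senior")
)

# Family size = the passenger plus relatives onboard, then binned.
passengers$FamilySize  <- passengers$SibSp + passengers$Parch + 1
passengers$FamilyGroup <- cut(
  passengers$FamilySize,
  breaks = c(0, 1, 4, Inf),
  labels = c("Alone", "Small", "Large")
)

# Contingency tables and chi-square tests on the derived groups.
age_table    <- table(passengers$AgeGroup, passengers$Survived)
family_table <- table(passengers$FamilyGroup, passengers$Survived)
print(chisq.test(age_table))
print(chisq.test(family_table))
```

The choice of cut points affects the test result, so any reported p-value should be read together with the grouping that produced it.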

Results and Discussion
Logistic Regression Findings
● A logistic regression model was trained using all passenger details.
● Key Predictors of Survival:
○ Gender: Males had a much lower survival probability (p < 2e-16).
○ Passenger Class: Lower-class passengers were less likely to survive (p < 2e-16).
○ Age: Older passengers had a lower survival rate (p = 4.35e-07).
● Accuracy: The logistic regression model predicted survival correctly in approximately
78% of cases.
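A logistic regression of this kind can be fitted with `glm()`. The sketch below is illustrative only and rests on several assumptions: it uses R's built-in `Titanic` table (2201 people including crew, with age recorded only as Child/Adult) rather than the 891-passenger dataset, and it measures accuracy on the same data used for fitting, since the slides do not describe how the model was evaluated.

```r
# Convert the built-in contingency table to one row per combination, with a count.
passengers <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Logistic regression for P(Survived == "Yes"); Freq weights each combination
# by the number of people it represents.
fit <- glm(Survived ~ Class + Sex + Age,
           family = binomial, weights = Freq, data = passengers)

# Coefficients are log-odds; a positive value raises the odds of survival.
print(summary(fit)$coefficients)

# Classify with a 0.5 probability cutoff and compute the weighted accuracy.
prob      <- predict(fit, type = "response")
predicted <- ifelse(prob > 0.5, "Yes", "No")
correct   <- predicted == passengers$Survived
accuracy  <- sum(passengers$Freq[correct]) / sum(passengers$Freq)
print(accuracy)
```

In-sample accuracy tends to be optimistic; a held-out test set or cross-validation (for example via caret) gives a fairer estimate.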

Results and Discussion

Random Forest Findings

● A Random Forest model was trained using 100 decision trees.
● Predictions were tested for accuracy.
● Most Important Features:
1. Gender
2. Passenger Class
3. Age Group
4. Siblings/Spouses Onboard
● Accuracy: The Random Forest model predicted survival correctly in approximately 83% of cases.
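A 100-tree random forest with feature importance can be sketched with the randomForest package named in the methods. As before, this is a hedged illustration rather than the project's code: it uses R's built-in `Titanic` table instead of the 891-passenger dataset, and the 80/20 train/test split and the seed are assumptions, since the slides do not state how predictions were tested.

```r
library(randomForest)

set.seed(42)  # arbitrary seed; the forest and the split are both random

# Expand the built-in contingency table to one row per person.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# Hold out 20% of the rows for testing (assumed split).
test_rows <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
train <- passengers[-test_rows, ]
test  <- passengers[test_rows, ]

# Survived is a factor, so randomForest() fits a classification forest.
rf <- randomForest(Survived ~ ., data = train, ntree = 100, importance = TRUE)

# Higher MeanDecreaseAccuracy / MeanDecreaseGini means a more important feature.
print(importance(rf))

# Accuracy on the held-out rows.
predicted <- predict(rf, newdata = test)
accuracy  <- mean(predicted == test$Survived)
print(accuracy)
```

Because both the split and the forest are random, the accuracy and the importance ranking can shift slightly between seeds.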
Conclusion

Key Takeaways:
1. Women were significantly more likely to survive than men.
2. First-class passengers had the highest survival rates.
3. Younger passengers had a better survival rate.
4. Having family members onboard slightly improved survival chances.
5. Random Forest was more accurate than Logistic Regression for survival prediction.
Implications:
● This analysis confirms that survival on the Titanic was influenced by socioeconomic and
demographic factors.
● The findings are consistent with the historical "women and children first" evacuation policy.
Use of LLM
● ChatGPT was used to debug code and improve the appearance of the plots.

Thank you
