Advanced Statistics Assignment: Business Report (PGP - DSBA)
Contents:
1 – Salary Data
1.1 Problem 1.1
1.2 Problem 1.2
1.3 Problem 1.3
Problem 1
Salary Data
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
The null and alternate hypotheses for the one-way ANOVA on Education are:
H0: The mean Salary is equal across all Education levels.
Ha1: The mean Salary differs for at least one Education level.
The null and alternate hypotheses for the one-way ANOVA on Occupation are:
H0: The mean Salary is equal across all Occupation types.
Ha2: The mean Salary differs for at least one Occupation type.
The significance level is Alpha = 0.05.
1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
Since the p-value is less than Alpha (0.05), we reject the null hypothesis (H0) for Education: the mean Salary differs for at least one Education level.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.
Since the p-value is greater than Alpha (0.05), we fail to reject the null hypothesis (H0) for Occupation: there is no evidence that the mean Salary differs across Occupation types.
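The decision rule used in 1.2 and 1.3 can be sketched as follows. This is a minimal illustration only: the salary figures are made up (they are not the report's data), and it assumes SciPy's `f_oneway` as the test routine, which may differ from the tool used for the original analysis.

```python
from scipy.stats import f_oneway

ALPHA = 0.05  # significance level stated in the report

# Hypothetical salaries grouped by Education level (NOT the report's data).
salary_by_education = {
    "HS-grad": [52000, 61000, 58000, 49500, 63000],
    "Bachelors": [148000, 162000, 155000, 171000, 159000],
    "Doctorate": [205000, 221000, 198000, 230000, 212000],
}

# One-way ANOVA: H0 says all group means are equal.
result = f_oneway(*salary_by_education.values())

# Reject H0 only when the p-value falls below Alpha.
reject_h0 = result.pvalue < ALPHA
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.4g}, reject H0: {reject_h0}")
```

The same call with the salaries grouped by Occupation type instead would give the test for 1.3.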
1.4 What is the interaction between the two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot.
Observation:
➢ From the plot above, we can see how Salary and Occupation vary by Education level:
• Doctorates: fall in the higher salary brackets and work mostly in Prof-specialty, Exec-managerial or Sales roles; very few hold Adm-clerical jobs.
• Bachelors: fall in the mid-income range and work mostly in Exec-managerial, Adm-clerical or Sales roles; very few are in Prof-specialty roles.
• HS-grads: fall in the low-income brackets and mostly do Prof-specialty or Adm-clerical work; a few are in Sales and hardly any are in Exec-managerial roles.
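An interaction plot draws the mean Salary of every Education and Occupation combination, one line per level of one factor; lines that are far from parallel suggest an interaction. The sketch below computes only those cell means, using hypothetical records (not the report's data) and the standard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (education, occupation, salary) records, NOT the report's data.
records = [
    ("Doctorate", "Prof-specialty", 210000),
    ("Doctorate", "Sales", 195000),
    ("Bachelors", "Sales", 170000),
    ("Bachelors", "Prof-specialty", 120000),
    ("HS-grad", "Prof-specialty", 60000),
    ("HS-grad", "Sales", 55000),
    ("HS-grad", "Sales", 65000),
]

# Collect the salaries that fall into each (education, occupation) cell.
cells = defaultdict(list)
for education, occupation, salary in records:
    cells[(education, occupation)].append(salary)

# The cell means are the points an interaction plot would draw.
cell_means = {key: mean(values) for key, values in cells.items()}
for (education, occupation), value in sorted(cell_means.items()):
    print(f"{education:<10} {occupation:<15} {value:>10.0f}")
```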
1.5 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction
Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state
your results. How will you interpret this result?
The null and alternate hypotheses for the two-way ANOVA on Occupation type and Education level are:
H0: The mean Salary is equal across all combinations of Occupation type and Education level.
Ha: The mean Salary differs for at least one combination of Occupation type and Education level.
The significance level is Alpha = 0.05.
Including the interaction term changes the p-values of the two main effects (Education and Occupation). The p-value of the interaction term is below Alpha, so the null hypothesis is rejected for the interaction: the effect of Education on Salary depends on the Occupation type.
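How the interaction term is tested can be sketched with a hand-computed two-way ANOVA. This assumes a balanced design (the same number of observations in every cell) and uses made-up salaries in thousands; the report's data may be unbalanced, in which case a regression-based ANOVA table would be needed instead.

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical balanced design: 2 Education levels x 2 Occupations x 3 salaries per cell.
y = np.array([
    [[50, 52, 54], [60, 62, 64]],
    [[70, 72, 74], [120, 122, 124]],
], dtype=float)
a, b, n = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # Education-level means
mean_b = y.mean(axis=(0, 2))   # Occupation means
mean_ab = y.mean(axis=2)       # cell means

# Sums of squares for the two main effects, the interaction and the error.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_err = np.sum((y - mean_ab[:, :, None]) ** 2)

# F test for the interaction term.
df_ab, df_err = (a - 1) * (b - 1), a * b * (n - 1)
f_ab = (ss_ab / df_ab) / (ss_err / df_err)
p_ab = f_dist.sf(f_ab, df_ab, df_err)
print(f"Interaction: F = {f_ab:.1f}, p = {p_ab:.3g}")
```

In this toy data the gap between the two occupations is much larger at the second Education level than at the first, which is what drives the large interaction F value.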
1.6 Explain the business implications of performing ANOVA for this particular case study.
• ANOVA can support forecasting of Salary trends by revealing patterns in the data that help anticipate future salary increases.
• The ANOVA results indicate that Education level combined with Occupation has a significant influence on Salary, whereas Occupation type on its own does not.
• The results play a role in setting salary bands: similar job titles in different industries command different salary packages depending on the job profile, and years of experience also matter when deciding a person's pay scale.
• We must also note that, for a few occupations, Bachelor’s degree holders are offered higher salaries than Doctorates. This points to shortcomings in the dataset that reduce the accuracy of the test and the analysis: other important variables that can affect salary, such as years of experience, specialization and industry/domain, are missing.
Problem 2
Education – Post 12th Standard
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following
file: Data Dictionary.xlsx.
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?
Univariate Analysis:
Observations:
• The data consists of 777 universities and 18 variables, none of which is categorical.
• Columns: 'Name', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend', 'Grad.Rate'.
• perc.alumni has a minimum value of 0 and needs to be cleaned.
• There are no missing values in the data.
• Very few students fall in the Top 10% and Top 25% categories.
• F.Undergrad needs cleanup: its minimum value is 139 and its maximum value is 31643.
• The quick summary table shows a maximum Grad.Rate of 118%, which must be rectified because a graduation rate cannot exceed 100%. The error is on row 95 (Cazenovia College) and is to be corrected using the median value.
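The median correction for Grad.Rate can be sketched as below. The values are hypothetical, and the sketch assumes the median is taken over the plausible entries only (those at or below 100); the report does not say which median was used.

```python
import numpy as np

# Hypothetical Grad.Rate values; 118 stands in for the impossible entry.
grad_rate = np.array([60.0, 72.0, 118.0, 85.0, 54.0, 91.0])

# A graduation rate above 100% is treated as a data error.
valid = grad_rate <= 100

# Assumption: the replacement is the median of the plausible values only.
median_valid = np.median(grad_rate[valid])
cleaned = np.where(valid, grad_rate, median_valid)
print(cleaned)
```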
Histograms and boxplots to check the distribution, density and outliers of each variable:
Observations:
Multivariate Analysis:
Using a pair plot to see the relationships among all the variables.
Observation:
A few pairs of variables are very highly correlated:
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Answer:
Scaled data-set:
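A minimal sketch of z-score scaling, the usual choice before PCA when variables are measured on very different scales (for example Apps in the thousands against S.F.Ratio in the tens). The rows are hypothetical, and dividing by the population standard deviation (`ddof=0`) is an assumption chosen to match what scikit-learn's `StandardScaler` does.

```python
import numpy as np

# Hypothetical rows: [Apps, S.F.Ratio]; the two columns differ greatly in scale.
X = np.array([
    [1660.0, 18.1],
    [2186.0, 12.2],
    [1428.0, 12.9],
    [417.0, 7.7],
    [193.0, 11.9],
])

# z-score: subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
print(X_scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the components because of its units.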
Observations:
• Both covariance and correlation matrices measure the relationship and the dependency between two variables.
• “Covariance” indicates the direction of the linear relationship between variables.
• “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables.
• Correlation is the scaled form of covariance. Covariance is affected by a change in scale, whereas correlation is not.
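The statement that correlation is the scaled form of covariance can be checked numerically: the covariance matrix of z-scored data equals the correlation matrix of the raw data. The sketch uses randomly generated stand-in data, not the college dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Two related columns on very different scales (synthetic stand-in data).
data = np.column_stack([1000 * x + rng.normal(size=200), x + rng.normal(size=200)])

# z-score with the sample standard deviation so it matches np.cov's default (ddof=1).
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

cov_of_raw = np.cov(data, rowvar=False)      # depends on the units of each column
cov_of_scaled = np.cov(z, rowvar=False)      # unit-free after scaling
corr_of_raw = np.corrcoef(data, rowvar=False)

print(np.allclose(cov_of_scaled, corr_of_raw))
```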
Observations:
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Insight:
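One way to check outliers before and after scaling is the usual boxplot rule (points more than 1.5 * IQR beyond the quartiles). Because z-score scaling is a linear transformation, this rule is expected to flag the same points in both cases. The sketch below uses a hypothetical column, and the helper `iqr_outlier_mask` is introduced here for illustration only.

```python
import numpy as np

def iqr_outlier_mask(values):
    """Flag points beyond 1.5 * IQR from the quartiles (the usual boxplot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = 1.5 * (q3 - q1)
    return (values < q1 - spread) | (values > q3 + spread)

# Hypothetical column with one extreme value.
raw = np.array([12.0, 14.0, 15.0, 15.5, 16.0, 17.0, 18.0, 60.0])
scaled = (raw - raw.mean()) / raw.std()

before = iqr_outlier_mask(raw)
after = iqr_outlier_mask(scaled)
print(int(before.sum()), int(after.sum()))
```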
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
Eigen Values:
Eigen Vectors:
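The eigenvalues and eigenvectors come from the covariance matrix of the scaled data. The sketch below is a plain NumPy stand-in on synthetic data, not the Sklearn call the question asks for; in scikit-learn the corresponding quantities are held in the fitted PCA object's `explained_variance_` and `components_` attributes (the latter with eigenvectors as rows, up to sign).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled data: 100 rows, 4 correlated columns.
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=100),
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending order

order = np.argsort(eigenvalues)[::-1]             # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # column i pairs with eigenvalues[i]

print("Eigenvalues:", eigenvalues.round(3))
print("First eigenvector:", eigenvectors[:, 0].round(2))
```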
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features.
The table below shows that the first principal component (PC) explains 33.12% of the variance in our dataset, while the first seven components together capture 70.12% of the variance.
Heatmap for the same:
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).
In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC_1 is the linear combination of the original variables X_1, X_2, …, X_p:
PC_1 = w_{11}X_1 + w_{12}X_2 + … + w_{1p}X_p,
where w_{11}, w_{12}, …, w_{1p} are the elements of the first eigenvector.
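Writing this linear combination out with weights rounded to two decimals can be sketched as follows. The loadings passed in are hypothetical (they are not the fitted values), and the helper `explicit_pc_form` is introduced here for illustration only.

```python
def explicit_pc_form(loadings, names, label="PC1"):
    """Build a string such as 'PC1 = 0.25 * Apps - ...' with two-decimal weights."""
    terms = []
    for i, (w, name) in enumerate(zip(loadings, names)):
        if i == 0:
            prefix = "-" if w < 0 else ""          # leading term carries its own sign
        else:
            prefix = " - " if w < 0 else " + "     # later terms use an operator
        terms.append(f"{prefix}{abs(w):.2f} * {name}")
    return f"{label} = " + "".join(terms)

# Hypothetical loadings for three of the dataset's columns (NOT the fitted values).
example = explicit_pc_form([0.2488, -0.1764, 0.3546], ["Apps", "S.F.Ratio", "Top10perc"])
print(example)
```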
The explicit form of the PC1 is as below:
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
The cumulative explained variance shows that 9 principal components are needed to reach 90% of the total variance, so 9 is taken as the optimum number of components.
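The selection rule can be sketched as: sort the eigenvalues, take their cumulative share of the total, and pick the first count that reaches the threshold. The eigenvalues below are hypothetical, and the helper `components_needed` is introduced here for illustration only.

```python
import numpy as np

def components_needed(eigenvalues, threshold=0.90):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    # Index of the first cumulative share >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, cumulative

# Hypothetical eigenvalues (NOT the report's values).
k, cumulative = components_needed([5.0, 2.5, 1.0, 0.8, 0.4, 0.3], threshold=0.90)
print(k, cumulative.round(2))
```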
2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis?
• PCA is used in exploratory data analysis and in building predictive models; it can be applied only to continuous variables.
• PCA is used for dimensionality reduction: each data point is projected onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. In this case we can reduce the dimensions from 17 to 9, which together explain over 90% of the variance.
• The first principal component can equivalently be defined as the direction that maximizes the variance of the projected data.
• The i-th principal component can be taken as the direction, orthogonal (i.e., at 90 degrees) to the first i-1 principal components, that maximizes the variance of the projected data.
• Because the components are uncorrelated with one another, using them in place of the original variables reduces multicollinearity in the dataset.
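The point about multicollinearity can be illustrated with a small check: two nearly collinear columns become uncorrelated once projected onto the principal directions. The data is synthetic, and NumPy's eigendecomposition is used here as a stand-in for a PCA routine.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: two nearly collinear columns.
x = rng.normal(size=300)
X = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=300)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal directions are the eigenvectors of the covariance matrix.
_, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
scores = X @ eigenvectors   # project the data onto the principal directions

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"before: {corr_before:.3f}, after: {corr_after:.1e}")
```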