Advanced Statistics Assignment: Business Report (PGP - DSBA)


Advanced Statistics Assignment

Business Report (PGP – DSBA)

Contents:
1 – Salary Data
1.1 Problem 1.1
1.2 Problem 1.2
1.3 Problem 1.3
1.4 Problem 1.4
1.5 Problem 1.5
1.6 Problem 1.6
2 – Education – Post 12th Standard
2.1 Problem 2.1
2.2 Problem 2.2
2.3 Problem 2.3
2.4 Problem 2.4
2.5 Problem 2.5
2.6 Problem 2.6
2.7 Problem 2.7
2.8 Problem 2.8
2.9 Problem 2.9

Problem 1
Salary Data

Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s
educational qualification and occupation are noted. Educational qualification is at three levels, High
school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or specialty, and Executive or managerial. A different number of observations
are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may not
always hold if the sample size is small.]

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
The null and alternate hypotheses for the one-way ANOVA on Education are:
H0: The mean Salary is equal across all three Education levels.

Ha1: The mean Salary differs for at least one Education level.

The null and alternate hypotheses for the one-way ANOVA on Occupation are:
H0: The mean Salary is equal across all four Occupation types.

Ha2: The mean Salary differs for at least one Occupation type.
Significance level: Alpha = 0.05

• If the p-value is < 0.05, we reject the null hypothesis.
• If the p-value is >= 0.05, we fail to reject the null hypothesis.

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

Since the p-value is less than Alpha (0.05), we reject the null hypothesis (H0) for Education: the mean Salary differs for at least one Education level.
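The ANOVA output table is not reproduced here. As a minimal sketch of how such a test and decision rule can be run, the snippet below applies `scipy.stats.f_oneway` to made-up salaries; the numbers and the dictionary name `salary_by_education` are illustrative assumptions, not values from SalaryData.csv.

```python
from scipy import stats

# Synthetic salaries for illustration only (NOT taken from SalaryData.csv)
salary_by_education = {
    "HS-grad": [52000, 61000, 58000, 49000, 55000],
    "Bachelors": [98000, 105000, 120000, 110000, 101000],
    "Doctorate": [150000, 162000, 171000, 158000, 166000],
}

alpha = 0.05
# One-way ANOVA: H0 is that all group means are equal
f_stat, p_value = stats.f_oneway(*salary_by_education.values())
reject_h0 = p_value < alpha
print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}, reject H0: {reject_h0}")
```

The same call could be used for Occupation by grouping the salaries by occupation type instead.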

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

Since the p-value is greater than Alpha (0.05), we fail to reject the null hypothesis (H0) for Occupation: the data do not show a significant difference in mean Salary across Occupation types.

1.4 What is the interaction between the two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot.

Observation:
➢ From the interaction plot, we can see the following for each Education level:
• Doctorate: falls in the higher salary brackets, mostly in Prof-specialty, Exec-managerial or
Sales roles; very few are in Adm-clerical jobs.
• Bachelors: falls in the mid-income range, mostly working in Exec-managerial, Adm-clerical
or Sales roles; very few are in Prof-specialty roles.
• HS-grad: falls in the low-income brackets, mostly in Prof-specialty or Adm-clerical work; a
few are in Sales and hardly any are in Exec-managerial roles.
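An interaction plot draws the mean Salary of each Education and Occupation cell. As a rough sketch of the numbers behind such a plot, the snippet below builds the cell-mean table with pandas on made-up data (salaries in hypothetical thousands, not from SalaryData.csv). If the gap between two education levels changes from one occupation to another, the lines in the plot are not parallel, which suggests an interaction.

```python
import pandas as pd

# Synthetic observations for illustration only (salary in hypothetical thousands)
rows = [
    ("HS-grad", "Sales", 50), ("HS-grad", "Exec-managerial", 55),
    ("Bachelors", "Sales", 100), ("Bachelors", "Exec-managerial", 150),
    ("Doctorate", "Sales", 200), ("Doctorate", "Exec-managerial", 120),
]
df = pd.DataFrame(rows, columns=["Education", "Occupation", "Salary"])

# Mean salary per Occupation (rows) and Education (columns): the points of an interaction plot
cell_means = df.pivot_table(index="Occupation", columns="Education",
                            values="Salary", aggfunc="mean")

# Gap between two education levels within each occupation;
# a gap that changes across occupations means non-parallel lines
gap = cell_means["Doctorate"] - cell_means["Bachelors"]
print(cell_means)
print(gap)
```

In this toy table the gap even changes sign between the two occupations, so the two lines would cross.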

1.5 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction
Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state
your results. How will you interpret this result?

The null and alternate hypotheses for the two-way ANOVA on Occupation type and Education level
are:
H0: The mean Salary is equal across all Occupation types and Education levels, and there is no interaction between Education and Occupation.
Ha: The mean Salary differs for at least one Occupation type or Education level, or the effect of one factor depends on the level of the other (interaction).
Significance level: Alpha = 0.05

• If the p-value is < 0.05, we reject the null hypothesis.
• If the p-value is >= 0.05, we fail to reject the null hypothesis.

With the interaction term included, the p-values of the two main effects (Education and Occupation) change
compared with the one-way results. The p-value of the interaction term leads us to reject the null hypothesis
of no interaction: the effect of Education on Salary depends on Occupation.
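The two-way ANOVA table itself is not reproduced here. As a hedged sketch of how a model with an interaction term can be fitted, the snippet below uses the statsmodels formula interface on made-up data; the cell means, the noise values and the variable names are assumptions for illustration, not values from SalaryData.csv.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical cell means (salary in thousands); three replicates per cell
cell_means = {
    ("HS-grad", "Sales"): 50, ("HS-grad", "Exec-managerial"): 55,
    ("Bachelors", "Sales"): 100, ("Bachelors", "Exec-managerial"): 150,
    ("Doctorate", "Sales"): 200, ("Doctorate", "Exec-managerial"): 120,
}
rows = [
    {"Education": edu, "Occupation": occ, "Salary": mean + noise}
    for (edu, occ), mean in cell_means.items()
    for noise in (-2, 0, 2)
]
df = pd.DataFrame(rows)

# Main effects plus the Education:Occupation interaction
formula = "Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)"
model = ols(formula, data=df).fit()
anova_table = anova_lm(model, typ=2)

interaction_p = anova_table.loc["C(Education):C(Occupation)", "PR(>F)"]
print(anova_table)
print("Reject H0 of no interaction:", interaction_p < 0.05)
```

Each row of the resulting table carries its own p-value, so the main effects and the interaction are tested separately.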

1.6 Explain the business implications of performing ANOVA for this particular case study.

Business implications of performing ANOVA:

• ANOVA helps explain Salary patterns in the data by testing which factors are associated with
differences in mean Salary, which can inform expectations about future salary hikes.
• The ANOVA results indicate that Education level combined with Occupation has a significant influence
on Salary, whereas Occupation type alone does not; Education on its own is also significant.
• The results can play a role when setting up salary bands. Similar job titles in different industries
command different salary packages depending on the job profile, and years of experience also matter
when deciding a person's pay scale.
• We must also note that, for a few occupations, Bachelor’s degree holders are offered higher salaries than
Doctorates. This points to shortcomings in the dataset that reduce the accuracy of the tests and analysis:
other important variables that can affect salary, such as years of experience, specialization and
industry/domain, are not included.

Problem 2
Education – Post 12th Standard

The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following
file: Data Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?

Step 1: Import a) all the necessary libraries and b) the data.

Step 2: Describe the data after loading it: check the data types, the number of rows and columns and the
number of missing values, and review the min, max and mean of each variable. Depending on the requirement,
drop the missing values or replace them.
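The checks in Step 2 can be sketched with pandas as follows. The tiny DataFrame is a made-up stand-in for the real file (which would normally be loaded with `pd.read_csv`); its values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical miniature stand-in for 'Education - Post 12th Standard.csv'
df = pd.DataFrame({
    "Name": ["College A", "College B", "College C", "College D"],
    "Apps": [1660, 2186, 1428, 417],
    "Grad.Rate": [60, 56, 54, 118],
})

n_rows, n_cols = df.shape             # number of rows and columns
column_types = df.dtypes              # data type of each column
missing_per_column = df.isnull().sum()
summary = df.describe()               # count, mean, std, min, quartiles, max (numeric columns)

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(summary)
```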

Univariate Analysis:

Observations:

• The data consists of 777 universities and 18 variables; apart from the identifier 'Name', no variable is categorical.
• Index of columns: 'Name', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F.Undergrad',
'P.Undergrad', 'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio',
'perc.alumni', 'Expend', 'Grad.Rate'.
• perc.alumni has a minimum value of 0, which needs to be cleaned.
• There are no missing values in the data.
• Very few students fall in the Top 10% and Top 25% categories (Top10perc, Top25perc).
• The F.Undergrad field needs cleanup: it has a minimum value of 139 and a maximum value of 31643.
• From the quick summary table, the maximum Grad.Rate is 118%, which needs to be rectified because a
percentage cannot exceed 100. The error is on row 95 (Cazenovia College) and is corrected using the
median value.
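The Grad.Rate correction can be sketched as below. The series is made up, and the choice to compute the median over all values (including the invalid one) is an assumption; the report does not say how the median was obtained.

```python
import pandas as pd

# Made-up graduation rates; 118 plays the role of the invalid entry
grad_rate = pd.Series([60, 56, 54, 118, 72], name="Grad.Rate")

invalid = grad_rate > 100                  # a percentage cannot exceed 100
median_value = grad_rate.median()          # assumption: median over all rows
cleaned = grad_rate.mask(invalid, median_value)

print(cleaned.tolist())
```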

Histograms and boxplots to check the distribution, density and outliers of each variable:

Observations:

• Left-skewed (long left tail) variables: 'PhD', 'Terminal'.
• Right-skewed (long right tail) variables: 'Apps', 'Accept', 'Enroll', 'Top10perc', 'F.Undergrad',
'P.Undergrad', 'Room.Board', 'Books', 'Personal', 'S.F.Ratio', 'perc.alumni', 'Expend'.
• Approximately normally distributed variables: 'Top25perc', 'Outstate', 'Grad.Rate'.
• Outliers can be seen in almost every variable; 'Top25perc' is the exception.
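The direction of skew can also be checked numerically rather than read off a histogram. The sketch below computes a simple moment-based skewness with NumPy on two made-up arrays; the helper name `sample_skewness` is illustrative. A positive value indicates a long right tail, a negative value a long left tail.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Made-up data: most values are small, one is very large (long right tail)
right_tailed = np.array([1, 2, 2, 3, 3, 3, 4, 50])
# Mirror image: most values are large, one is very small (long left tail)
left_tailed = 100 - right_tailed

print(sample_skewness(right_tailed))  # positive
print(sample_skewness(left_tailed))   # negative
```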

Multivariate Analysis:
A pair plot is used to see the relationship of each variable with every other variable.

Observation:
A few pairs have very high correlation:

• Applications and acceptances
• Students from top 10% schools and from top 25% schools
• Students from top 10% schools and graduation rate
• Enrollment and full-time undergraduate students
• PhD faculty and Terminal.
The heatmap below shows a multicollinearity issue, as a significant number of variable pairs are highly
correlated. Multicollinearity undermines the statistical significance of the individual independent
variables.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

Answer:

• Our dataset has 17 numeric attributes (18 columns minus the 'Name' identifier), hence we get 17 principal components.
• Once we get the amount of variance explained by each principal component, we can decide how
many components we need for our model based on the amount of information we want to retain.
• PCA calculates a new projection of our data set based on the variance of each variable.
• Without scaling, PCA is skewed towards high-magnitude features: a variable such as F.Undergrad
(139 to 31643) would dominate percentage variables that cannot exceed 100.
• If we normalize our data, all variables have the same standard deviation, thus all variables have the
same weight and PCA calculates relevant axes. Scaling can also speed up gradient descent and other
calculations in algorithms.
• Hence, yes, it is necessary to normalize the data before performing PCA.
• Scaling can be done using the z-score method or StandardScaler in scikit-learn.
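The z-score method can be sketched directly in NumPy: subtract each column's mean and divide by its standard deviation. The small matrix below is made up for illustration; the population standard deviation (ddof=0) is used here, which is also what scikit-learn's StandardScaler applies.

```python
import numpy as np

# Made-up data: one large-magnitude column and one percentage-like column
X = np.array([
    [139.0, 10.0],
    [2500.0, 25.0],
    [31643.0, 96.0],
    [1200.0, 40.0],
])

# z-score: (value - column mean) / column standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # 1 for every column
```

After this step both columns contribute on the same scale, regardless of their original units.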

Scaled data-set:

Observations:

• After scaling, the standard deviation is 1.0 for all variables.
• After scaling, the difference between the Q1 (25%) value and the minimum value is smaller than in the
original dataset for most of the variables.
2.3 Comment on the comparison between the covariance and the correlation matrices from this
data. [on scaled data]

• Both covariance and correlation matrices measure the relationship and the dependency
between two variables.
• Covariance indicates the direction of the linear relationship between variables.
• Correlation, on the other hand, measures both the strength and direction of the linear relationship
between two variables.
• Correlation is the scaled form of covariance, and covariance is affected by a change in scale.
Because the data here is already standardized, the covariance matrix of the scaled data is essentially
identical to the correlation matrix.
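The relationship between the two matrices on scaled data can be checked numerically. The sketch below uses randomly generated data (an assumption, not the college dataset) and shows that the covariance matrix of z-scored data matches the correlation matrix of the raw data, while the covariance matrix of the raw data does not.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data with three columns on very different scales
X = rng.normal(loc=[1000.0, 50.0, 10.0], scale=[400.0, 15.0, 3.0], size=(200, 3))

Z = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score with ddof=0

cov_raw = np.cov(X, rowvar=False)            # depends on the units of each column
cov_scaled = np.cov(Z, rowvar=False, ddof=0) # same ddof as the scaling step
corr = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_scaled, corr))
print(np.allclose(cov_raw, corr))
```

If the covariance is computed with the default ddof=1 instead, the two matrices differ only by the constant factor n/(n-1).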

Observations:

• The highest correlations are seen between: Enroll and F.Undergrad; Enroll and Accept; Apps and Accept.
• The lowest correlations are observed between S.F.Ratio and: Expend, Outstate, Grad.Rate, perc.alumni,
Room.Board and Top10perc.

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

Boxplot before scaling:

Boxplot after scaling:

Insight:

• Outliers are still present as scaling does not remove outliers.
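The reason is that z-scoring only shifts and rescales each variable, so a point's position relative to the quartiles does not change. A minimal sketch with made-up values and the usual 1.5 × IQR boxplot rule (the helper name `iqr_outlier_mask` is illustrative):

```python
import numpy as np

def iqr_outlier_mask(values):
    """Flag points outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

raw = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 50.0])  # made-up data
scaled = (raw - raw.mean()) / raw.std()                     # z-score

flags_raw = iqr_outlier_mask(raw)
flags_scaled = iqr_outlier_mask(scaled)

print(raw[flags_raw])                           # the same point is flagged...
print(np.array_equal(flags_raw, flags_scaled))  # ...before and after scaling
```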

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]

Eigen Values:

Eigen Vectors:
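The printed values are not reproduced here. As a sketch of how eigenvalues and eigenvectors can be obtained, the snippet below decomposes the covariance matrix of standardized synthetic data with NumPy (the data is an assumption). With scikit-learn's PCA, the corresponding quantities are exposed as `explained_variance_` and `components_`, where each row of `components_` is one eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # synthetic data, not the college dataset
X[:, 1] += 2 * X[:, 0]             # make two columns correlated

Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (first principal component) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column j is the j-th eigenvector

print(eigenvalues)
print(eigenvectors)
```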

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features.

In the table below, the first PC explains 33.12% of the variance in our dataset, while the first seven
PCs capture 70.12% of the variance.

After z-score scaling, PCA reduces the dimensions from 17 to 9 principal components.

Heatmap for the same:

Hence, after PCA, the multicollinearity is greatly reduced, because the principal components are uncorrelated with each other.

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).
In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC_1 is
given by the linear combination of the original variables X_1, X_2, …, X_p:
PC_1 = w_{11}X_1 + w_{12}X_2 + … + w_{1p}X_p
where w_{11}, w_{12}, …, w_{1p} are the elements of the first eigenvector.
The explicit form of PC1 is as below:
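The actual loadings are not reproduced here. As a sketch of how such an expression can be assembled from the first eigenvector, rounded to two decimals, the snippet below uses three feature names and deliberately made-up loadings (they are not the values from this analysis).

```python
# Hypothetical loadings for illustration only; not the eigenvector from this dataset
features = ["Apps", "Accept", "Enroll"]
loadings = [0.50, 0.30, -0.10]

# One "weight * variable" term per feature, rounded to two decimals
terms = [f"{weight:.2f} * {name}" for weight, name in zip(loadings, features)]
pc1_expression = "PC1 = " + " + ".join(terms).replace("+ -", "- ")

print(pc1_expression)
```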

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

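The cumulative values show how much of the total variance is retained as components are added. Here, 9
principal components are required to get the cumulative variance up to 90%, so 9 is taken as the optimum
number. The eigenvectors indicate the direction of each principal component, that is, the weights of the
original variables in it.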
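The selection rule can be sketched as follows: convert the eigenvalues to explained-variance ratios, accumulate them, and take the first count that reaches the target. The eigenvalues below are made up for illustration, and the 90% threshold is the one used in this report.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in descending order (not the values from this analysis)
eigenvalues = np.array([4.0, 2.0, 1.5, 1.0, 0.8, 0.4, 0.3])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative variance reaches 90%
threshold = 0.90
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(np.round(cumulative, 2))
print(n_components)
```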

2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis?

The business implications of using Principal Component Analysis:

• PCA is used in exploratory data analysis and for building predictive models; it can be applied only to
continuous variables.
• PCA is used for dimensionality reduction by projecting each data point onto only the first few principal
components to obtain lower-dimensional data while preserving as much of the data's variation as
possible. In this case we can reduce the dimensions from 17 to 9, which explain over 90% of the variance.
• The first principal component can equivalently be defined as a direction that maximizes the variance
of the projected data.
• The i-th principal component can be taken as a direction orthogonal (i.e., at 90 degrees) to the
first i-1 principal components that maximizes the variance of the projected data.
• Because the components are uncorrelated with each other, using them in place of the original
variables reduces the multicollinearity in the dataset for further analysis.
