Business Report: Statistical Methods For Decision Making Project PGP-DSBA Online Athisya Nadar 9 May 2021
GREAT LEARNING 2
WHOLESALE CUSTOMERS ANALYSIS
A wholesale distributor operating in different regions of Portugal has
information on the annual spending on several items in its stores across
different regions and channels. The data consist of 440 large retailers’ annual
spending on 6 varieties of products in 3 regions (Lisbon, Oporto, Other) and
across 2 sales channels (Hotel, Retail).
1.1 USE METHODS OF DESCRIPTIVE STATISTICS TO SUMMARIZE DATA.
Descriptive statistics is concerned with summarizing data through graphs, charts and tables. The
methods of descriptive statistics include the distribution, which deals with the frequency of each
value, measures of central tendency and measures of variability. The most widely used measures of
central tendency are the arithmetic mean, the median and the mode.
Mean is defined as the arithmetic average of all observations in the data set.
Median is defined as the middle value in the data set arranged in ascending or descending order.
Mode is defined as the most frequently occurring value in the distribution; it has the largest frequency.
Range is the simplest of all measures of dispersion. It is calculated as the difference between
maximum and minimum value in the data set.
Inter-Quartile Range (IQR) is computed on middle 50% of the observations after eliminating the
highest and lowest 25% of observations in a data set that is arranged in ascending order. IQR is less
affected by outliers.
Standard deviation is the square root of the variance; it measures how far observations typically lie from the mean.
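The measures defined above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the spending figures are invented for the example and are not taken from the wholesale dataset, and the quartile method chosen here is one of several conventions.

```python
import statistics

# Hypothetical annual spending figures, invented purely for illustration.
spending = [120, 150, 150, 180, 210, 260, 900]

mean = statistics.mean(spending)              # arithmetic average
median = statistics.median(spending)          # middle value of the sorted data
mode = statistics.mode(spending)              # most frequent value
value_range = max(spending) - min(spending)   # maximum minus minimum

# Quartiles with the "inclusive" method; other conventions shift the cut points slightly.
q1, _, q3 = statistics.quantiles(spending, n=4, method="inclusive")
iqr = q3 - q1                                 # spread of the middle 50% of observations

variance = statistics.variance(spending)      # sample variance (n - 1 denominator)
std_dev = statistics.stdev(spending)          # square root of the sample variance

print(mean, median, mode, value_range, iqr, std_dev)
```

Note how the single large value (900) pulls the mean well above the median, while the IQR is unaffected by it.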
The table below describes the Wholesale customer dataset. The sample records contain 2
categorical variables (channel and region) and 6 numerical variables, one for the annual
spending on each product variety.
The region that has spent the most is Other (10,677,599) and the region that has
spent the least is Oporto (1,555,088).
The channel that has spent the most is Hotel (7,999,569) and the channel that has
spent the least is Retail (6,619,931).
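Totals of this kind can be obtained with a group-by. The sketch below uses pandas on a tiny made-up table; the column names (`Channel`, `Region`, `Fresh`, `Milk`) and all values are assumptions for illustration, not the report's data.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names are assumed.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Oporto", "Other", "Other"],
    "Fresh": [100, 200, 300, 400],
    "Milk": [10, 20, 30, 40],
})

# Total spending per customer across the item columns.
item_cols = ["Fresh", "Milk"]
df["Total"] = df[item_cols].sum(axis=1)

# Sum the per-customer totals within each region and each channel.
region_totals = df.groupby("Region")["Total"].sum()
channel_totals = df.groupby("Channel")["Total"].sum()

top_region, low_region = region_totals.idxmax(), region_totals.idxmin()
top_channel, low_channel = channel_totals.idxmax(), channel_totals.idxmin()
print(region_totals, channel_totals, sep="\n")
```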
Figure 1
Summing the spending within each channel and region gives the totals shown in the
following table. Spending on the 6 varieties of items (Fresh, Milk, Grocery, Frozen,
Detergents Paper and Delicatessen) is further summarized in the bar graph.
From the above graph, we can see that in Lisbon the most is spent on Fresh products and
the least on Delicatessen. The same holds in Oporto and in the Other region: spending is
highest on Fresh products and lowest on Delicatessen.
The above graph shows that in the Retail channel the most is spent on Grocery products
and the least on Frozen food products. In the Hotel channel the most is spent on Fresh
products and the least on Detergents Paper.
The common descriptive measures of variability are the range, IQR, variance, and standard
deviation. To check how inconsistently an item behaves, we can calculate the coefficient of
variation of each variable, that is, its standard deviation divided by its mean. The following
pie chart shows how each item has performed across the 3 locations (Lisbon, Oporto and
Other) for both the Retail and Hotel channels.
This table shows that the coefficient of variation of Fresh products is 105.25% while that of
Delicatessen is 184.42%. A higher coefficient of variation means less consistent spending.
Therefore, Delicatessen shows the most inconsistent behavior and Fresh products show the
least inconsistent behavior.
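As a rough sketch of how the coefficient of variation separates steady from erratic items, the example below compares two invented spending series. The helper name `coefficient_of_variation` and the data are hypothetical; only the formula (standard deviation divided by mean) is standard.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical spending series, invented for illustration.
steady_item = [100, 105, 95, 102, 98]
erratic_item = [10, 300, 40, 5, 145]

cv_steady = coefficient_of_variation(steady_item)
cv_erratic = coefficient_of_variation(erratic_item)

# The item with the larger CV is the less consistent one.
most_inconsistent = "erratic_item" if cv_erratic > cv_steady else "steady_item"
print(round(cv_steady, 2), round(cv_erratic, 2), most_inconsistent)
```

Both series have the same mean (100), so the difference in the coefficient of variation comes entirely from the spread.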
This pair plot helps us to understand the relationships between the 6 product categories.
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with
the help of detailed comments.
From this boxplot, we can see that all 6 items have outliers.
Outliers are observations in a dataset that don’t fit in some way. Perhaps the most common or
familiar type of outlier is the observations that are far from the rest of the observations or the
center of mass of observations. Outliers can skew statistical measures and data distributions,
providing a misleading representation of the underlying data and relationships. Removing
outliers from data prior to modeling can result in a better fit of the data and, in turn, more
skillful predictions.
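A boxplot conventionally flags points lying more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule directly; the function name `iqr_outliers`, the multiplier default and the data are illustrative assumptions rather than the report's actual procedure.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical annual spending figures, invented for illustration.
spending = [120, 150, 150, 180, 210, 260, 900]
outliers = iqr_outliers(spending)
print(outliers)
```

Whether flagged points should be removed, capped or kept is a modeling decision; the rule only identifies candidates.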
1.5 On the basis of your analysis, what are your recommendations for the business? How can
your analysis help the business to solve its problem? Answer from the business
perspective
PROBLEM 2
THE STUDENT NEWS SERVICE AT CLEAR MOUNTAIN STATE UNIVERSITY
(CMSU) HAS DECIDED TO GATHER DATA ABOUT THE UNDERGRADUATE
STUDENTS THAT ATTEND CMSU. CMSU CREATES AND DISTRIBUTES A
SURVEY OF 14 QUESTIONS AND RECEIVES RESPONSES FROM 62
UNDERGRADUATES (STORED IN THE SURVEY DATA SET).
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
Gender  Major                   Count
Female  Accounting              3
        CIS                     3
        Economics/Finance       7
        International Business  4
        Management              4
        Other                   3
        Retailing/Marketing     9
Male    Accounting              4
        CIS                     1
        Economics/Finance       4
        International Business  2
        Management              6
        Other                   4
        Retailing/Marketing     5
        Undecided               3
2.1.2. Gender and Intent to Graduate
The following table shows the number of male and female students by whether they intend to
graduate, do not intend to graduate, or are undecided.
Gender  Grad Intention  Count
Female  No              9
        Undecided       13
        Yes             11
Male    No              3
        Undecided       9
        Yes             17
2.1.3. Gender and Employment
The following table displays the number of male and female students for each type of
employment.
Gender  Employment  Count
Female  Full-Time   3
        Part-Time   24
        Unemployed  6
Male    Full-Time   7
        Part-Time   19
        Unemployed  3
2.1.4. Gender and Computer
The following table shows the number of male and female students who use a tablet, laptop or
desktop.
Gender  Computer  Count
Female  Desktop   2
        Laptop    29
        Tablet    2
Male    Desktop   3
        Laptop    26
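Contingency tables like the four above can be built with `pandas.crosstab`, keeping Gender as the row variable. The sketch below uses six made-up survey rows; the column names `Gender` and `Computer` are assumed to match the survey file, which is not shown in the report.

```python
import pandas as pd

# Hypothetical survey rows for illustration; the real survey has 62 responses.
survey = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Computer": ["Laptop", "Desktop", "Laptop", "Laptop", "Laptop", "Desktop"],
})

# Rows are Gender, columns are Computer, cells are counts.
table = pd.crosstab(survey["Gender"], survey["Computer"])
print(table)
```

Passing `margins=True` to `crosstab` would add row and column totals, which are the denominators used in the probability questions that follow.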
2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
The probability that a randomly selected CMSU student will be male is 29/62 ≈ 0.4677.
2.2.2. What is the probability that a randomly selected CMSU student will be female?
The probability that a randomly selected CMSU student will be female is 33/62 ≈ 0.5323.
2.3. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
The following table shows the conditional probability of each major among the male students in
CMSU. Each probability is the number of male students in that major (Accounting, CIS,
Economics/Finance, International Business, Management, Other, Retailing/Marketing, Undecided)
divided by the total number of male students (29).
Major                   Male  male_prob
Accounting              4     0.137931
CIS                     1     0.034483
Economics/Finance       4     0.137931
International Business  2     0.068966
Management              6     0.206897
Other                   4     0.137931
Retailing/Marketing     5     0.172414
Undecided               3     0.103448
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
The following table shows the conditional probability of each major among the female students of
CMSU. Each probability is the number of female students in that major (Accounting, CIS,
Economics/Finance, International Business, Management, Other, Retailing/Marketing) divided by
the total number of female students (33).
Major                   Female  female_prob
Accounting              3       0.090909
CIS                     3       0.090909
Economics/Finance       7       0.212121
International Business  4       0.121212
Management              4       0.121212
Other                   3       0.090909
Retailing/Marketing     9       0.272727
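The conditional probabilities are simply each count divided by the group total. A minimal sketch, using the male counts from the Gender and Major table in 2.1.1 (the helper name `conditional_distribution` is hypothetical):

```python
# Male counts per major, taken from the Gender and Major contingency table.
male_counts = {
    "Accounting": 4,
    "CIS": 1,
    "Economics/Finance": 4,
    "International Business": 2,
    "Management": 6,
    "Other": 4,
    "Retailing/Marketing": 5,
    "Undecided": 3,
}

def conditional_distribution(counts):
    """Divide each count by the group total to get P(major | group)."""
    total = sum(counts.values())
    return {major: n / total for major, n in counts.items()}

male_prob = conditional_distribution(male_counts)
for major, p in male_prob.items():
    print(f"{major:<24}{p:.6f}")
```

Because the probabilities condition on the same group, they sum to 1, which is a quick check that no major has been left out.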
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data,
answer the following question:
2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.
Probability that a randomly chosen student is a male and intends to graduate = number of male
students who intend to graduate/total number of students = 17/62 ≈ 0.2742
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.
Probability that a randomly selected student is a female and does NOT have a laptop = number of
female students without a laptop/total number of students = 4/62 ≈ 0.0645
2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?
Probability that a randomly chosen student is a male or has full-time employment = number of
students who are male or have full-time employment/total number of students
= (29 + 10 - 7)/62 = 32/62 ≈ 0.5161
Here 29 students are male, 10 have full-time employment, and 7 are both, so the 7 are subtracted
once to avoid counting them twice.
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.
Conditional probability that, given a female student is randomly chosen, she is majoring in
International Business or Management = number of female students majoring in International
Business or Management/total number of female students = 8/33 ≈ 0.2424
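The union and conditional probabilities in 2.5 can be checked with exact fractions. The counts below are read from the contingency tables in 2.1; the variable names are illustrative.

```python
from fractions import Fraction

total = 62
male, female = 29, 33
full_time_male, full_time_female = 7, 3

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_male = Fraction(male, total)
p_full_time = Fraction(full_time_male + full_time_female, total)
p_male_and_full_time = Fraction(full_time_male, total)
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time

# Conditional probability: restrict the denominator to female students.
female_intl_business, female_management = 4, 4
p_ib_or_mgmt_given_female = Fraction(female_intl_business + female_management, female)

print(float(p_male_or_full_time), float(p_ib_or_mgmt_given_female))
```

Using `Fraction` avoids long floating-point tails and makes it easy to see that the results equal 32/62 and 8/33.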
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?
        Yes  No
Male    17   3
Female  11   9
Two events A and B are said to be independent if the fact that one event has occurred does not
affect the probability that the other event will occur, that is, if P(A∩B) = P(A) · P(B).
Of the 29 male students, 17 intend to graduate, whereas of the 33 female students only 11 do.
In the 2x2 table there are 40 students, so P(Female) = 20/40 = 0.5 and P(Yes) = 28/40 = 0.7,
giving P(Female) · P(Yes) = 0.35, while P(Female ∩ Yes) = 11/40 = 0.275.
Since 0.275 ≠ 0.35, the graduate intention and being female are dependent events.
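The product rule for independence can be checked directly on the 2x2 counts. This is a descriptive check on the sample only; a formal test such as chi-square would be needed to judge whether the difference could be due to sampling variation, and that is not attempted here.

```python
from fractions import Fraction

# 2x2 counts from the Gender by Intent to Graduate table (Undecided excluded).
counts = {
    ("Male", "Yes"): 17, ("Male", "No"): 3,
    ("Female", "Yes"): 11, ("Female", "No"): 9,
}
n = sum(counts.values())

p_female = Fraction(counts[("Female", "Yes")] + counts[("Female", "No")], n)
p_yes = Fraction(counts[("Male", "Yes")] + counts[("Female", "Yes")], n)
p_female_and_yes = Fraction(counts[("Female", "Yes")], n)

# Independence requires P(Female and Yes) == P(Female) * P(Yes).
independent = p_female_and_yes == p_female * p_yes
print(p_female_and_yes, p_female * p_yes, independent)
```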
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
Probability that a randomly chosen student's GPA is less than 3 = number of students with GPA less
than 3/total number of students = 17/62 ≈ 0.2742
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find
the conditional probability that a randomly selected female earns 50 or more.
Probability that a randomly selected male earns 50 or more = number of male students who earn 50
or more/total number of males = 14/29 ≈ 0.4828
Probability that a randomly selected female earns 50 or more = number of female students who earn
50 or more/total number of females = 18/33 ≈ 0.5455
2.8.1 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending and Text Messages. For each of them comment whether they follow a normal
distribution.
To check whether these four variables follow a normal distribution, we ran the Shapiro–Wilk test
in Python. The null hypothesis of the test is that the data are normally distributed. The output was:
ShapiroResult for GPA (statistic=0.994, p=0.987)
ShapiroResult for Salary (statistic=0.971, p=0.147)
ShapiroResult for Spending (statistic=0.984, p=0.589)
ShapiroResult for Text Messages (statistic=0.980, p=0.408)
In each case the p-value is greater than 0.05, so we fail to reject the null hypothesis. We
therefore conclude that all four variables (GPA, Salary, Spending and Text Messages) are
consistent with a normal distribution.
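A minimal sketch of running the Shapiro–Wilk test with `scipy.stats.shapiro` follows. The sample is synthetic (drawn from a normal distribution with made-up parameters), so the printed numbers will not match the report's output; only the decision rule is the point.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a survey variable; parameters are invented.
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=0.4, size=62)

# H0: the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(sample)

alpha = 0.05
consistent_with_normal = p_value > alpha  # fail to reject H0 when p > alpha
print(f"W={statistic:.3f}, p={p_value:.3f}, consistent={consistent_with_normal}")
```

A p-value above 0.05 does not prove normality; it only means the test found no strong evidence against it, so a histogram or Q-Q plot is a useful companion check.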
2.8.2 Write a note summarizing your conclusions for this whole Problem 2.
The survey conducted by the Student News Service at Clear Mountain State University (CMSU)
collected information on the major each undergraduate student is pursuing, whether they intend to
graduate, their GPA, the nature of their employment and their salary, social networking, spending,
satisfaction, computer and text messages. From this analysis, the sample suggests that multiple
factors are related to a student's intention to graduate. Using these variables, we constructed
contingency tables and calculated marginal, joint and conditional probabilities. To help
students graduate and find suitable employment, the university can work on improving its
infrastructure by providing easy access to computers and by conducting social networking events.
The probability of intending to graduate is higher for male students (17/29) than for female
students (11/33), so female students may need more support, including in their choice of major.
Problem 3
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100
square feet) for A shingles and 31 for B shingles.
3.1 Do you think there is evidence that the mean moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.
For sample A, the hypotheses are H0 : μ(A) ≥ 0.35 and Ha : μ(A) < 0.35, tested at α = 0.05.
p value: 0.0748
Since the p-value is greater than 0.05, we fail to reject H0. There is not enough evidence to
conclude that the mean moisture content for Sample A shingles is less than 0.35 pounds per 100
square feet. If the population mean moisture content is in fact no less than 0.35 pounds per 100
square feet, the probability of observing a sample of 36 shingles with a sample mean moisture
content of 0.3167 pounds per 100 square feet or less is 0.0748.
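from scipy.stats import ttest_1samp
t_statistic, p_value = ttest_1samp(df.B, 0.35, nan_policy='omit')
# ttest_1samp returns a two-sided p-value; halve it for the one-tailed test (valid here because t < 0)
print('One sample t test \nt statistic: {0} p value: {1} '.format(t_statistic, p_value/2))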
Since the p-value (0.0021) is less than 0.05, we reject H0. There is enough evidence to conclude
that the mean moisture content for Sample B shingles is less than 0.35 pounds per 100 square feet.
If the population mean moisture content is in fact no less than 0.35 pounds per 100 square feet, the
probability of observing a sample of 31 shingles with a sample mean moisture content of
0.2735 pounds per 100 square feet or less is 0.0021.
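The lower-tailed one-sample t-test used for both samples can also be computed step by step, which makes the link between the t statistic and the one-tailed p-value explicit. The measurements below are invented for illustration and are not the shingles data.

```python
import math
import statistics
from scipy import stats

# Hypothetical moisture measurements (pounds per 100 square feet).
moisture = [0.31, 0.28, 0.36, 0.25, 0.33, 0.30, 0.27, 0.29]
mu0 = 0.35  # permissible limit under H0: mu >= 0.35, Ha: mu < 0.35

n = len(moisture)
standard_error = statistics.stdev(moisture) / math.sqrt(n)
t_stat = (statistics.mean(moisture) - mu0) / standard_error

# Lower-tailed p-value: probability of a t value this small or smaller under H0.
p_lower = stats.t.cdf(t_stat, df=n - 1)

alpha = 0.05
reject_h0 = p_lower < alpha
print(f"t={t_stat:.3f}, one-tailed p={p_lower:.4f}, reject H0={reject_h0}")
```

When the t statistic is negative, this lower-tailed p-value equals half of the two-sided p-value, which is why dividing the two-sided result by 2 gives the same answer in that case.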
3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to check
before the test for equality of means is performed?
H0 : μ(A)= μ(B)
Ha : μ(A)!= μ(B)
α = 0.05
t_statistic=1.29
pvalue=0.202
Since the p-value (0.202) is greater than α = 0.05, we fail to reject H0. There is not enough
evidence to conclude that the population means for shingles A and B differ.
Test assumptions: when running a two-sample t-test, the basic assumptions are that the
distributions of the two populations are normal and that the variances of the two distributions
are the same. If those assumptions are not likely to be met, another testing procedure could be
used.
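One way to act on the equal-variance assumption is to test it first and choose the t-test variant accordingly. The sketch below uses Levene's test for that purpose, which is an assumption of this example and not necessarily what the report did; the two samples are invented.

```python
from scipy import stats

# Hypothetical moisture measurements for two shingle types.
shingles_a = [0.31, 0.28, 0.36, 0.25, 0.33, 0.30, 0.27, 0.29]
shingles_b = [0.26, 0.30, 0.24, 0.29, 0.27, 0.31, 0.25]

alpha = 0.05

# Levene's test: H0 is that the two populations have equal variances.
levene_stat, levene_p = stats.levene(shingles_a, shingles_b)
equal_var = levene_p > alpha

# Two-sided test of H0: mu(A) = mu(B); Welch's version is used if variances differ.
t_stat, p_value = stats.ttest_ind(shingles_a, shingles_b, equal_var=equal_var)

reject_h0 = p_value < alpha
print(f"equal_var={equal_var}, t={t_stat:.3f}, p={p_value:.3f}, reject H0={reject_h0}")
```

Normality of each sample could be checked separately, for example with the Shapiro–Wilk test; if it is doubtful, a rank-based alternative such as `scipy.stats.mannwhitneyu` is one option.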
Thank You !