Assignment#8614 2
Semester: 3rd
Assignment 2
Q.1 Explain three major measures of central tendency. Also explain the
procedure to calculate them.
Central Tendency
In statistics, central tendency is a descriptive summary of a data set: a single
value that reflects the centre of the data distribution. It does not describe
individual observations; rather, it summarizes the dataset as a whole. Central
tendency is therefore defined as the statistical measure that represents an entire
distribution or dataset by a single value, aiming to give an accurate description
of the data as a whole.
The central tendency of the dataset can be found out using the three important
measures namely mean, median and mode.
Mean
The mean represents the average value of the dataset. It can be calculated as the
sum of all the values in the dataset divided by the number of values. In general, it
is considered as the arithmetic mean. Some other measures of mean used to find
the central tendency are as follows:
● Geometric Mean
● Harmonic Mean
● Weighted Mean
It is observed that if all the values in the dataset are the same, then the
arithmetic, geometric and harmonic means are all equal. If there is variability in
the data, the mean values differ. Calculating the arithmetic mean is
straightforward. The formula is:

Mean (x̄) = (x₁ + x₂ + … + xₙ) / n = (Σxᵢ) / n

In a symmetric data distribution, the mean value is located exactly at the centre.
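The three kinds of mean mentioned above can be computed directly with Python's standard-library statistics module; a minimal sketch (the sample values are chosen only for illustration):

```python
from statistics import mean, geometric_mean, harmonic_mean

data = [2, 4, 8]

# Arithmetic mean: sum of the values divided by the number of values
print(mean(data))            # (2 + 4 + 8) / 3 ≈ 4.67

# Geometric mean: n-th root of the product of the values
print(geometric_mean(data))  # (2 * 4 * 8) ** (1/3) ≈ 4.0

# Harmonic mean: n divided by the sum of reciprocals
print(harmonic_mean(data))   # 3 / (1/2 + 1/4 + 1/8) ≈ 3.43
```

Note that if every value in `data` were the same, all three calls would return that same value, matching the observation above.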
Median
The median is the middle value of a dataset whose values are arranged in
ascending or descending order. When the dataset contains an even number of
values, the median is found by taking the mean of the middle two values.
Consider a dataset with an odd number of observations arranged in
descending order – 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2
Here 12 is the middle or median number that has 6 values above it and 6 values
below it.
Now, consider another example with an even number of observations that are
arranged in descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and
17
When you look at the given dataset, the two middle values obtained are 27 and 29.
Now, find out the mean value for these two numbers.
i.e., (27 + 29)/2 = 28
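Both examples above can be checked with Python's statistics.median, which sorts the data internally and averages the two middle values when the count is even:

```python
from statistics import median

# Odd number of observations (13 values): the single middle value
odd = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
print(median(odd))   # 12

# Even number of observations (14 values): mean of the two middle values
even = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]
print(median(even))  # (27 + 29) / 2 = 28.0
```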
Mode
The mode represents the most frequently occurring value in the dataset.
Sometimes the dataset may contain multiple modes, and in some cases it does not
contain any mode at all.
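The single-mode and multiple-mode cases can be sketched with statistics.mode and statistics.multimode (the datasets here are illustrative):

```python
from statistics import mode, multimode

# A dataset with a single most frequent value
print(mode([2, 2, 3, 5, 2, 7]))        # 2

# A bimodal dataset: multimode returns every most frequent value
print(multimode([1, 1, 2, 2, 3]))      # [1, 2]

# No repeated value: every value occurs equally often
print(multimode([4, 8, 15, 16]))       # [4, 8, 15, 16]
```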
Based on the properties of the data, an appropriate measure of central tendency
is selected.
● If you have a skewed distribution, the best measure of central tendency is
the median.
● If the distribution is symmetric, the mean is usually preferred, since it
uses every observation.
The central tendency measure is defined as the number used to represent the
center
or middle of a set of data values. The three commonly used measures of central
tendency are the mean, median, and mode.
Mode
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data:

Age   Frequency
54    3
55    1
56    1
57    2
58    2
60    2
The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.
There are some limitations to using the mode. In some distributions, the mode may
not reflect the centre of the distribution very well. When the distribution of
retirement age is ordered from lowest to highest value, it is easy to see that the
centre of the distribution is 57 years, but the mode is lower, at 54 years.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
It is also possible for there to be more than one mode for the same distribution
of data (bimodal, or multimodal).
Median
The median is the middle value in distribution when the values are arranged in
ascending or descending order.
The median divides the distribution in half (there are 50% of observations on either
side of the median value). In a distribution with an odd number of observations,
the median value is the middle value.
The median is less affected by outliers and skewed data than the mean and is
usually the preferred measure of central tendency when the distribution is not
symmetrical.
Mean
The mean is the sum of the value of each observation in a dataset divided by the
number of observations. This is also known as the arithmetic average.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean can be used for both continuous and discrete numeric data.
The mean cannot be calculated for categorical data, as the values cannot be
summed.
As the mean includes every value in the distribution the mean is influenced by
outliers and skewed distributions.
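The influence of outliers on the mean, and the median's resistance to them, can be seen in a small sketch (the salary figures are hypothetical):

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]          # hypothetical values, in $1000s
with_outlier = salaries + [400]          # one extreme value added

# The mean shifts dramatically; the median barely moves
print(mean(salaries), median(salaries))          # 44.8 45
print(mean(with_outlier), median(with_outlier))  # 104.0 46.0
```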
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When
the mean is calculated on a distribution from a sample it is indicated by the symbol
x̅ (pronounced X-bar).
Inferential statistics are used for:
● making estimates about populations (for example, the mean SAT score of all
11th graders in the US).
● testing hypotheses to draw conclusions about populations (for example, the
relationship between SAT scores and family income).
Descriptive statistics
Example: Descriptive statistics. You collect data on the SAT scores of all 11th
graders in a school for three years.
Inferential statistics
Most of the time, you can only acquire data from samples, because it is too difficult
or expensive to collect data from the whole population that you’re interested in.
You can use inferential statistics to make estimates and test hypotheses about the
whole population of 11th graders in the state based on your sample data.
There are two important types of estimates you can make about the population: point
estimates and interval estimates.
Both types of estimates are important for gathering a clear idea of where a parameter
is likely to lie.
Confidence intervals
While a point estimate gives you a precise value for the parameter you are interested
in, a confidence interval tells you the uncertainty of the point estimate. They are best
used in combination with each other.
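A point estimate and its confidence interval can be sketched together; this is a minimal illustration using the normal approximation with a 1.96 critical value for 95% coverage, on a hypothetical sample:

```python
from statistics import mean, stdev
from math import sqrt

sample = [52, 55, 53, 58, 54, 56, 51, 57, 55, 54]  # hypothetical sample values

n = len(sample)
point_estimate = mean(sample)        # point estimate of the population mean
se = stdev(sample) / sqrt(n)         # standard error of the mean
z = 1.96                             # critical value for a 95% normal interval

lower, upper = point_estimate - z * se, point_estimate + z * se
print(f"point estimate: {point_estimate}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval quantifies the uncertainty around the single-number estimate, which is why the two are best reported together.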
Hypothesis testing
Hypotheses, or predictions, are tested using statistical tests. Statistical tests
also estimate sampling errors so that valid inferences can be made. Parametric
tests make assumptions about the data, including that:
● the population that the sample comes from follows a normal distribution of
scores
● the sample size is large enough to represent the population
● the variances, a measure of variability, of each group being compared are
similar
Comparison tests
Comparison tests look for differences among group means. Means can only be found
for interval or ratio data, while medians and rankings are more appropriate
measures for ordinal data.
Correlation tests
Correlation tests determine the extent to which two variables are associated.
Although Pearson’s r is the most statistically powerful test, Spearman’s r is
appropriate for interval and ratio variables when the data doesn’t follow a normal
distribution.
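Pearson's r can be computed directly from its definition (covariance divided by the product of the standard deviations); a self-contained sketch with made-up data:

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of std devs."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# A perfectly linear relationship gives r ≈ 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0

# A perfectly inverse relationship gives r ≈ -1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0
```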
The chi-square test of independence is the only test that can be used
with nominal variables.
Regression tests
Regression tests demonstrate whether changes in predictor variables cause changes
in an outcome variable. You can decide which regression test to use based on the
number and types of variables you have as predictors and outcomes.
Data transformations help you make your data normally distributed using
mathematical operations, like taking the square root of each value.
Inferential Statistics
Inferential statistics is a branch of statistics that makes the use of various analytical
tools to draw inferences about the population data from sample data. Apart from
inferential statistics, descriptive statistics forms another branch of statistics.
It's important to remember that correlation does not always indicate causation.
Two variables can be correlated without either variable causing the other. For
instance, ice cream sales and drownings might be correlated, but that doesn't mean
that ice cream causes drownings—instead, both ice cream sales and drownings
increase when the weather is hot. Relationships like this are called spurious
correlations.
Detailed Question -
I would like to test a number of hypotheses but I am not sure which analysis method
is best. I am therefore looking for information indicating instances when to use
each method, including interpretation and reporting of results. Can you suggest
suitable resources? I am new to research work.
Answer:
The usage of correlation analysis or regression analysis depends on your data set
and the objective of the study. Correlation analysis is used to quantify the degree
to which two variables are related.
A correlation reflects the strength and/or direction of the relationship between two
(or more) variables. The direction of a correlation can be either positive or negative.
Positive correlation: both variables change in the same direction (for example,
as height increases, weight also increases).
Correlational research is ideal for gathering data quickly from natural settings. That
helps you generalize your findings to real-life situations in an externally valid way.
You want to find out if there is an association between two variables, but you don’t
expect to find a causal relationship between them.
You think there is a causal relationship between two variables, but it is impractical,
unethical, or too costly to conduct experimental research that manipulates one of
the variables.
You have developed a new instrument for measuring your variable, and you need
to test its reliability or validity.
There are many different methods you can use in correlational research. In the
social and behavioral sciences, the most common data collection methods for this
type of research include surveys, observations, and secondary data.
Surveys
Surveys are a quick, flexible way to collect standardized data from many
participants, but it’s important to ensure that your questions are worded in an
unbiased way and capture relevant insights.
Naturalistic observation
Naturalistic observation is a type of field research where you gather data about a
behavior or phenomenon in its natural environment.
Secondary data
Instead of collecting original data, you can also use data that has already been
collected for a different purpose, such as official records, polls, or previous studies.
Using secondary data is inexpensive and fast, because data collection is complete.
However, the data may be unreliable, incomplete or not entirely relevant, and you
have no control over the reliability or validity of the data collection procedures.
After collecting data, you can statistically analyze the relationship between
variables using correlation or regression analyses, or both. You can also visualize
the relationships between variables with a scatterplot.
Correlation analysis
Using a correlation analysis, you can summarize the relationship between variables
into a correlation coefficient: a single number that describes the strength and
direction of the relationship between variables. With this number, you’ll quantify
the degree of the relationship between variables.
Regression analysis
With a regression analysis, you can predict how much a change in one variable will
be associated with a change in the other variable. The result is a regression
equation that describes the line on a graph of your variables.
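The regression line described here can be computed with the standard least-squares formulas; a minimal sketch with data generated from a known equation:

```python
from statistics import mean

def least_squares(x, y):
    """Least-squares fit of the line y ≈ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Data generated from y = 2x + 1, so the fit recovers slope 2 and intercept 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
slope, intercept = least_squares(x, y)
print(slope, intercept)   # 2.0 1.0
```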
It’s important to remember that correlation does not imply causation. Just because
you find a correlation between two things doesn’t mean you can conclude one of
them causes the other for a few reasons.
Directionality problem
If two variables are correlated, it could be because one of them is a cause and the
other is an effect. But the correlational research design doesn’t allow you
to infer which is which. To err on the side of caution, researchers don’t conclude
causality from correlational studies.
The F-distribution contains all of the possible values for the test statistic. It
is determined by the degrees of freedom, is always skewed right, and takes only
values greater than zero.
The two sample F-test is used when comparing the variances of two populations.
This allows the researcher to determine whether there are statistically significant
differences between the two populations.
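The two-sample F statistic is simply the ratio of the two sample variances (conventionally, larger over smaller); a sketch with hypothetical samples:

```python
from statistics import variance

group_a = [12, 15, 11, 18, 14, 16]   # hypothetical sample from population A
group_b = [13, 14, 13, 15, 14, 14]   # hypothetical sample from population B

var_a, var_b = variance(group_a), variance(group_b)

# F statistic: ratio of the larger sample variance to the smaller one
f_stat = max(var_a, var_b) / min(var_a, var_b)
print(f_stat)
```

The computed F value would then be compared against a critical value from the F-distribution with the appropriate degrees of freedom.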
Whether you’re guessing if it’s going to rain tomorrow, betting on a sports team
to win an away match, framing a policy for an insurance company, or simply trying
your luck on blackjack at the casino, probability and distributions come into action
in all aspects of life to determine the likelihood of events.
Having a sound statistical background can be incredibly beneficial in the daily life
of a data scientist. Probability is one of the main building blocks of data science and
machine learning. While the concept of probability gives us mathematical
calculations, statistical distributions help us visualize what’s happening
underneath.
Having a good grip on statistical distribution makes exploring a new dataset and
finding patterns within a lot easier. It helps us choose the appropriate machine
learning model to fit our data on and speeds up the overall process.
In this blog, we will be going over diverse types of data, the common distributions
for each of them, and compelling examples of where they are applied in real life.
When you roll a die or pick a card from a deck, you have a limited number of
outcomes possible. This type of data is called Discrete Data, which can only take a
specified number of values. For example, in rolling a die, the specified values are 1,
2, 3, 4, 5, and 6.
Depending on the type of data we use, we have grouped distributions into two
categories, discrete distributions for discrete data (finite outcomes) and continuous
distributions for continuous data (infinite outcomes).
Discrete distributions
The binomial distribution models the number of successes in a fixed number of
trials, under these conditions:
● Given multiple trials, each of them is independent of the other. That is, the
outcome of one trial doesn’t affect another one.
● Each trial can lead to just two possible results (e.g., winning or losing), with
probabilities p and (1 – p).
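Under those two conditions, the probability of exactly k successes in n trials follows the binomial formula, which can be sketched directly:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 heads in 3 fair coin flips: 3 * (1/2)^3 = 3/8
print(binomial_pmf(2, 3, 0.5))   # 0.375
```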
Poisson distribution deals with the frequency with which an event occurs within a
specific interval. Instead of the probability of an event, Poisson distribution
requires knowing how often it happens in a particular period or distance. For
example, a cricket chirps two times in 7 seconds on average.
● An event can occur any number of times (within the defined period).
The graph of Poisson distribution plots the number of instances an event occurs in
the standard interval of time and the probability of each one.
Continuous distributions
Conclusion
Published on May 23, 2022 by Shaun Turney. Revised on June 22, 2023.
● The chi-square goodness of fit test is used to test whether the frequency
distribution of a categorical variable is different from your expectations.
● The chi-square test of independence is used to test whether two categorical
variables are related to each other.
Pearson’s chi-square (Χ2) tests, often referred to simply as chi-square tests, are
among the most common nonparametric tests.
Frequency of visits by bird species at a bird feeder during a 24-hour period:

Bird species             Frequency
House sparrow            15
House finch              12
Black-capped chickadee   9
Common grackle           8

Pearson’s chi-square tests are used to test hypotheses about frequency
distributions. There are two types of Pearson’s chi-square tests, but they both
compare an observed frequency distribution with an expected one. A chi-square
test (here, a chi-square goodness of fit test) can test whether these observed
frequencies are significantly different from what was expected, such as equal
frequencies.
            Right-handed   Left-handed
American    236            19
Canadian    157            16
Both of Pearson’s chi-square tests use the same formula to calculate the test
statistic, chi-square (Χ2):

Χ2 = Σ (O − E)2 / E

Where:
● O is the observed frequency
● E is the expected frequency
The larger the difference between the observations and the expectations (O − E in
the equation), the bigger the chi-square will be.
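Applying the chi-square formula Χ² = Σ(O − E)²/E to the bird feeder counts, with equal frequencies expected, gives a small worked sketch:

```python
observed = [15, 12, 9, 8]                 # bird feeder visit counts
n_groups = len(observed)
expected = [sum(observed) / n_groups] * n_groups   # equal frequencies: 11 each

# Chi-square: sum of squared differences, each scaled by the expected count
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 3))   # (16 + 1 + 4 + 9) / 11 ≈ 2.727
```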
A Pearson’s chi-square test may be an appropriate option for your data if all of the
following are true:
1. You want to test a hypothesis about one or more categorical variables. If
one or more of your variables is quantitative, you should use a
different statistical test. Alternatively, you could convert the quantitative
variable into a categorical variable by separating the observations into
intervals.
Mathematically, these are actually the same test. However, we often think of them
as different tests because they’re used for different purposes.
You can use a chi-square goodness of fit test when you have one categorical
variable. It allows you to test whether the frequency distribution of the categorical
variable is significantly different from your expectations.
● Null hypothesis (H0): The bird species visit the bird feeder in
equal proportions.
● Alternative hypothesis (HA): The bird species visit the bird feeder
in different proportions.
● Null hypothesis (H0): The bird species visit the bird feeder in
the same proportions as the average over the past five years.
● Alternative hypothesis (HA): The bird species visit the bird feeder
in different proportions from the average over the past five years.
You can use a chi-square test of independence when you have two categorical
variables.
● Null hypothesis (H0): The proportion of people who are left-handed is the
same for Americans and Canadians.
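For the test of independence, each cell's expected count comes from its row and column totals: E = (row total × column total) / grand total. A sketch on the handedness table, using the 3.841 critical value for df = 1 at α = 0.05:

```python
table = [[236, 19],    # American: right-handed, left-handed
         [157, 16]]    # Canadian: right-handed, left-handed

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_square += (observed - expected) ** 2 / expected

print(round(chi_square, 3))   # well below the 3.841 critical value for df = 1
```

Here the chi-square value is far below the critical value, so the null hypothesis of equal proportions would not be rejected.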
McNemar’s test is a test that uses the chi-square test statistic. It isn’t a variety
of Pearson’s chi-square test, but it’s closely related. You can conduct this test when
you have a related pair of categorical variables that each have two groups.
                 Like chocolate   Dislike chocolate
Like vanilla     47               32
Dislike vanilla  8                13
● Null hypothesis (H0): The proportion of people who like chocolate is the
same as the proportion of people who like vanilla.
There are several other types of chi-square tests that are not Pearson’s chi-square
tests, including the test of a single variance and the likelihood ratio chi-square
test.
The exact procedure for performing a Pearson’s chi-square test depends on which
test you’re using, but it generally follows these steps:
1. Create a table of the observed and expected frequencies. This can
sometimes be the most difficult step because you will need to carefully
consider which expected values are most appropriate for your null
hypothesis.
2. Calculate the chi-square value from your observed and expected
frequencies.
3. Find the critical chi-square value in a chi-square critical value table or
using statistical software.
4. Compare the chi-square value to the critical value.
5. Decide whether to reject the null hypothesis. You should reject the
null hypothesis if the chi-square value is greater than the critical value. If you
reject the null hypothesis, you can conclude that your data are significantly
different from what you expected.
When reporting chi-square results:
● You don’t need to provide a reference or formula, since the chi-square test is
a commonly used statistic.
● Refer to chi-square using its Greek symbol, Χ2. Although the symbol looks
very similar to an “X” from the Latin alphabet, it’s actually a different symbol.
Greek symbols should not be italicized.
Chi-square goodness of fit test
● You can use the test when you have counts of values for a categorical
variable.
● The Chi-square goodness of fit test checks whether your sample data is likely
to be from a specific theoretical distribution.
What do we need?
For the goodness of fit test, we need one variable. We also need an idea, or
hypothesis, about how that variable is distributed. Here is an example:
We have bags of candy with five flavors in each bag. The bags should contain
an equal number of pieces of each flavor. The idea we'd like to test is that
the proportions of the five flavors in each bag are the same.
Understanding results
Let’s use a few graphs to understand the test and the results. A simple bar
chart of the data shows the observed counts for the flavors of candy.
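The candy example can be sketched as a goodness of fit calculation against equal expected counts; the observed counts below are hypothetical, invented only for illustration:

```python
# Hypothetical observed counts for five candy flavors in one bag
observed = [22, 18, 21, 19, 20]
total = sum(observed)                # 100 pieces in the bag
expected = total / len(observed)     # 20 per flavor if proportions are equal

chi_square = sum((o - expected) ** 2 / expected for o in observed)
print(chi_square)   # (4 + 4 + 1 + 1 + 0) / 20 ≈ 0.5
```

Such a small chi-square value would suggest the observed counts are consistent with the equal-proportions hypothesis.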