Educational Statistics


ASIAN DEVELOPMENT FOUNDATION COLLEGE

Graduate School
Tacloban City
COMPREHENSIVE EXAMINATION
in
EDUCATIONAL STATISTICS

ALFREDA O. LLANTADA- MAED

1. Discuss the meaning/ concepts and nature of statistics


2. Explain the different kinds of variables according to:
2.1 Functional Relationship
2.2 Continuity of values
2.3 Scales of measurement

3. Differentiate parametric test and non-parametric test as tools in statistics.

4. Discuss the types of data presentation and cite examples to substantiate your
answer.

5. Compare and contrast the four scales of measurement.

6. Give at least five (5) sampling techniques and give examples to concretize your
answer.

7. Discuss the following basic measurements in statistics and give the formula for
each.

8. Mention/describe at least five (5) statistical tests/tools commonly used in
research and give their uses. You may include the formula.
1. Discuss the meaning/ concepts and nature of statistics

Answers:
Statistics is defined as numerical data, and it is the field of mathematics that deals
with the collection, tabulation, and interpretation of numerical data. An example of
statistics is a report of numbers saying how many followers of each religion
there are in a particular country.
The science of statistics deals with the collection, analysis, interpretation, and
presentation of data. We see and use data in our everyday lives. In statistics, we
generally want to study a population. You can think of a population as a collection
of persons, things, or objects under study. To study the population, we select a
sample. The idea of sampling is to select a portion (or subset) of the larger
population and study that portion (the sample) to gain information about the
population. Data are the result of sampling from a population.
Example:
If you wished to compute the overall grade point average at your school, it would
make sense to select a sample of students who attend the school. The data collected
from the sample would be the students’ grade point averages.
Show the population and sample.
In presidential elections, opinion poll samples of 1,000–2,000 people are taken.
The opinion poll is supposed to represent the views of the people in the entire
country.
Show the population and sample.
The City of Houston wants to know if the annual household income in the city is
higher than the national average. The statisticians collect data from 1,500 families.
Show the population and sample.
An automobile manufacturer wanted to know if more than 50% of US drivers
own at least one domestic car. The company surveyed 10,000 drivers across the US.
Show the population and sample.
From the sample data, we can calculate a statistic. A statistic is a number that
represents a property of the sample. For example, if we consider one math class to
be a sample of the population of all math classes, then the average number of
points earned by students in that one math class at the end of the term is an
example of a statistic. The statistic is an estimate of a population parameter. A
parameter is a number that is a property of the population. Since we considered all
math classes to be the population, then the average number of points earned per
student over all the math classes is an example of a parameter.

Population: all math classes


Sample: One of the math classes
Parameter: Average number of points earned per student over all math classes
One of the main concerns in the field of statistics is how accurately a statistic
estimates a parameter. The accuracy really depends on how well the sample
represents the population. The sample must contain the characteristics of the
population in order to be a representative sample. We are interested in both the
sample statistic and the population parameter in inferential statistics. In a later
chapter, we will use the sample statistic to test the validity of the established
population parameter.
A variable, notated by capital letters such as X and Y, is a characteristic of interest
for each person or thing in a population. Variables may be numerical or
categorical. Numerical variables take on values with equal units such as weight in
pounds and time in hours. Categorical variables place the person or thing into a
category.
Example:
If we assume that X is equal to the number of points earned by one math student at
the end of a term, then X is a numerical variable.
If we let Y be a person’s party affiliation, then some examples of Y include
Republican, Democrat, and Independent. Y is a categorical variable.
We could do some math with values of X (calculate the average number of points
earned, for example), but it makes no sense to do math with values of Y
(calculating an average party affiliation makes no sense).
Data are the actual values of the variable. They may be numbers or they may be
words. A datum is a single value.
Two words that come up often in statistics are mean and proportion.
If you were to take three exams in your math classes and obtain scores of 86, 75,
and 92, you would calculate your mean score by adding the three exam scores and
dividing by three (your mean score would be 84.3 to one decimal place). If, in your
math class, there are 40 students and 22 are men and 18 are women, then the
proportion of men students is 22/40 and the proportion of women students is 18/40.
Mean and proportion are discussed in more detail in later chapters.
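To make the arithmetic concrete, here is a minimal Python sketch of the mean and proportion calculations above (the exam scores and class counts are the figures from the text):

exam_scores = [86, 75, 92]
mean_score = sum(exam_scores) / len(exam_scores)
print(round(mean_score, 1))        # 84.3

men, women = 22, 18
total = men + women                # 40 students in the class
print(men / total, women / total)  # 0.55 and 0.45, i.e. 22/40 and 18/40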
NOTE (We will learn the meaning of the mean in the next chapter!)
The words “mean” and “average” are often used interchangeably. The substitution
of one word for the other is common practice. The technical term is “arithmetic
mean,” and “average” is technically a center location. However, in practice among
non-statisticians, “average” is commonly accepted for “arithmetic mean.”

2. Explain the different kinds of variables according to:

2.1 Functional Relationship
2.2 Continuity of values
2.3 Scales of measurement

Answers:

2.1 Functional Relationship

Common Types of Variables


 Categorical variable: variables that can be put into categories. For example,
the category “Toothpaste Brands” might contain the
variables Colgate and Aquafresh.
 Confounding variable: extra variables that have a hidden effect on your
experimental results.
 Continuous variable: a variable with an infinite number of values, like “time”
or “weight”.
 Control variable: a factor in an experiment which must be held constant. For
example, in an experiment to determine whether light makes plants grow
faster, you would have to control for soil quality and water.
 Dependent variable: the outcome of an experiment. As you change the
independent variable, you watch what happens to the dependent variable.
 Discrete variable: a variable that can only take on a certain number of
values. For example, “number of cars in a parking lot” is discrete because a car
park can only hold so many cars.
 Independent variable: a variable that is not affected by anything that you, the
researcher, do. It is usually plotted on the x-axis.
 Lurking variable: a “hidden” variable that affects the relationship between
the independent and dependent variables.
 A measurement variable has a number associated with it. It’s an “amount” of
something, or a “number” of something.
 Nominal variable: another name for categorical variable.
 Ordinal variable: similar to a categorical variable, but there is a clear order.
For example, income levels of low, middle, and high could be considered
ordinal.
 Qualitative variable: a broad category for any variable that can’t be
counted (i.e. has no numerical value). Nominal and ordinal variables fall under
this umbrella term.
 Quantitative variable: A broad category that includes any variable that can
be counted, or has a numerical value associated with it. Examples of variables
that fall into this category include discrete variables and ratio variables.
 Random variables are associated with random processes and give numbers
to outcomes of random events.
 A ranked variable is an ordinal variable; a variable where every data point
can be put in order (1st, 2nd, 3rd, etc.).
 Ratio variables: similar to interval variables, but has a meaningful zero.
Less Common Types of Variables
 Active Variable: a variable that is manipulated by the researcher.
 Antecedent Variable: a variable that comes before the independent variable.
 Attribute variable: another name for a categorical variable (in statistical
software) or a variable that isn’t manipulated (in design of experiments).
 Binary variable: a variable that can only take on two values, usually 0/1.
Could also be yes/no, tall/short or some other two-variable combination.
 Collider Variable: a variable represented by a node on a causal graph that
has paths pointing in as well as out.
 Covariate variable: similar to an independent variable, it has an effect on the
dependent variable but is usually not the variable of interest. See also:
Concomitant variable.
 Criterion variable: another name for a dependent variable, when the variable
is used in non-experimental situations.
 Dichotomous variable: Another name for a binary variable.
 Dummy Variables: used in regression analysis when you want to assign
relationships to unconnected categorical variables. For example, if you had the
categories “has dogs” and “owns a car” you might assign a 1 to mean “has
dogs” and 0 to mean “owns a car.”
 Endogenous variable: similar to dependent variables, they are affected by
other variables in the system. Used almost exclusively in econometrics.
 Exogenous variable: variables that affect others in the system.
 Explanatory Variable: a type of independent variable. When a variable is
independent, it is not affected at all by any other variables. When a variable
isn’t independent for certain, it’s an explanatory variable.
 Extraneous variables are any variables that you are not intentionally
studying in your experiment or test.
 A grouping variable (also called a coding variable, group variable or by
variable) sorts data within data files into categories or groups.
 Identifier Variables: variables used to uniquely identify situations.
 Indicator variable: another name for a dummy variable.
 Interval variable: a meaningful measurement between two variables. Also
sometimes used as another name for a continuous variable.
 Intervening variable: a variable that is used to explain the relationship
between variables.
 Latent Variable: a hidden variable that can’t be measured or observed
directly.
 Manifest variable: a variable that can be directly observed or measured.
 Manipulated variable: another name for independent variable.
 Mediating variable or intervening variable: variables that explain how the
relationship between variables happens. For example, it could explain the
difference between the predictor and criterion.
 Moderating variable: changes the strength of an effect between independent
and dependent variables. For example, psychotherapy may reduce stress levels
for women more than men, so sex moderates the effect between psychotherapy
and stress levels.
 Nuisance Variable: an extraneous variable that increases variability overall.
 Observed Variable: a measured variable (usually used in SEM).
 Outcome variable: similar in meaning to a dependent variable, but used in a
non-experimental study.
 Polychotomous variables: variables that can have more than two values.
 Predictor variable: similar in meaning to the independent variable, but used
in regression and in non-experimental studies.
 Responding variable: an informal term for dependent variable, usually used
in science fairs.
 Scale Variable: basically, another name for a measurement variable.
 Study Variable (Research Variable): can mean any variable used in a study,
but does have a more formal definition when used in a clinical trial.
 Test Variable: another name for the Dependent Variable.
 Treatment variable: another name for independent variable.
CITE THIS AS:
Stephanie Glen. "Types of Variables in Statistics and Research"
From StatisticsHowTo.com: Elementary Statistics for the rest of
us! https://www.statisticshowto.com/probability-and-statistics/types-of-variables/

2.2 Continuity of Values

Answer:
Continuity, in mathematics, is the rigorous formulation of the intuitive concept of
a function that varies with no abrupt breaks or jumps. A function is a relationship
in which every value of an independent variable—say x—is associated with a
value of a dependent variable—say y. Continuity of a function is sometimes
expressed by saying that if the x-values are close together, then the y-values of the
function will also be close. But if the question “How close?” is asked, difficulties
arise.

For close x-values, the distance between the y-values can be large even if the
function has no sudden jumps. For example, if y = 1,000x, then two values of x that
differ by 0.01 will have corresponding y-values differing by 10. On the other hand,
for any point x, points can be selected close enough to it so that the y-values of this
function will be as close as desired, simply by choosing the x-values to be closer
than 0.001 times the desired closeness of the y-values.
Thus, continuity is defined precisely by saying that a function f(x) is continuous at
a point x0 of its domain if and only if, for any degree of closeness ε desired for
the y-values, there is a distance δ for the x-values (in the above example equal to
0.001ε) such that for any x of the domain within the distance δ from x0, f(x) will be
within the distance ε from f(x0). In contrast, the function that equals 0 for x less
than or equal to 1 and that equals 2 for x larger than 1 is not continuous at the
point x = 1, because the difference between the value of the function at 1 and at
any point ever so slightly greater than 1 is never less than 2.
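Stated symbolically (a standard LaTeX rendering of the ε-δ definition given in the paragraph above; x is understood to lie in the domain of f):

f \text{ is continuous at } x_0 \iff \forall \varepsilon > 0\ \exists \delta > 0 : \ |x - x_0| < \delta \implies |f(x) - f(x_0)| < \varepsilon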
A function is said to be continuous if and only if it is continuous at every point of
its domain. A function is said to be continuous on an interval, or subset of its
domain, if and only if it is continuous at each point of the interval. The sum,
difference, and product of continuous functions with the same domain are also
continuous, as is the quotient, except at points at which the denominator is zero.
Continuity can also be defined in terms of limits by saying that f(x) is continuous
at x0 of its domain if and only if, for values of x in its domain, lim(x→x0) f(x) = f(x0).
A more abstract definition of continuity can be given in terms of sets, as is done
in topology, by saying that for any open set of y-values, the corresponding set of x-
values is also open. (A set is “open” if each of its elements has a “neighbourhood,”
or region enclosing it, that lies entirely within the set.) Continuous functions are
the most basic and widely studied class of functions in mathematical analysis, as
well as the most commonly occurring ones in physical situations.

2.3 Scales of measurement

Answers:
What is Measurement? Normally, when one hears the term measurement, they may
think in terms of measuring the length of something (i.e., the length of a piece of
wood) or measuring a quantity of something (i.e., a cup of flour). This represents a
limited use of the term measurement. In statistics, the term measurement is used
more broadly and is more appropriately termed scales of measurement. Scales of
measurement refer to ways in which variables/numbers are defined and
categorized. Each scale of measurement has certain properties which in turn
determine the appropriateness of certain statistical analyses. The four scales of
measurement are nominal, ordinal, interval, and ratio.

Nominal: Categorical data and numbers that are simply used as identifiers or names
represent a nominal scale of measurement. Numbers on the back of a baseball
jersey (St. Louis Cardinals 1 = Ozzie Smith) and your social security number are
examples of nominal data. If I conduct a study and I'm including gender as a
variable, I will code Female as 1 and Male as 2 or vice versa when I enter my data
into the computer. Thus, I am using the numbers 1 and 2 to represent categories of
data.

Ordinal: An ordinal scale of measurement represents an ordered series of
relationships or rank order. Individuals competing in a contest may be fortunate to
achieve first, second, or third place. First, second, and third place represent ordinal
data. If Roscoe takes first and Wilbur takes second, we do not know if the
competition was close; we only know that Roscoe outperformed Wilbur. Likert-
type scales (such as "On a scale of 1 to 10 with one being no pain and ten being
high pain, how much pain are you in today?") also represent ordinal data.
Fundamentally, these scales do not represent a measurable quantity. An individual
may respond 8 to this question and be in less pain than someone else who
responded 5. A person may not be in half as much pain if they responded 4 than if
they responded 8. All we know from this data is that an individual who responds 6
is in less pain than if they responded 8 and in more pain than if they responded 4.
Therefore, Likert-type scales only represent a rank ordering.

Interval: A scale which represents quantity and has equal units but for which zero
represents simply an additional point of measurement is an interval scale. The
Fahrenheit scale is a clear example of the interval scale of measurement. Thus, 60
degrees Fahrenheit or -10 degrees Fahrenheit are interval data. Measurement of sea
level is another example of an interval scale. With each of these scales there is a
direct, measurable quantity with equality of units. In addition, zero does not
represent the absolute lowest value. Rather, it is a point on the scale with numbers
both above and below it (for example, -10 degrees Fahrenheit).

Ratio: The ratio scale of measurement is similar to the interval scale in that it also
represents quantity and has equality of units. However, this scale also has an
absolute zero (no numbers exist below the zero). Very often, physical measures will
represent ratio data (for example, height and weight). If one is measuring the length
of a piece of wood in centimeters, there is quantity, equal units, and that measure
cannot go below zero centimeters. A negative length is not possible. These
properties capture the fundamental differences between the four scales of
measurement.

3. Differentiate parametric test and non-parametric test as tools in statistics.

Answers:
Nonparametric Tests vs. Parametric Tests
By Jim Frost
Nonparametric tests don’t require that your data follow the normal distribution.
They’re also known as distribution-free tests and can provide benefits in certain
situations. Typically, people who perform statistical hypothesis tests are more
comfortable with parametric tests than nonparametric tests.
You’ve probably heard it’s best to use nonparametric tests if your data are not
normally distributed—or something along these lines. That seems like an easy way
to choose, but there’s more to the decision than that.
In this post, I’ll compare the advantages and disadvantages to help you decide
between using the following types of statistical hypothesis tests:

o Parametric analyses to assess group means


o Nonparametric analyses to assess group medians
In particular, I’d like you to focus on one key reason to perform a nonparametric
test that doesn’t get the attention it deserves! If you need a primer on the basics,
read my hypothesis testing overview.

Related Pairs of Parametric and Nonparametric Tests


Nonparametric tests are a shadow world of parametric tests. In the table below, I
show linked pairs of statistical hypothesis tests.
Parametric tests of means                              Nonparametric tests of medians
1-sample t-test                                        1-sample Sign, 1-sample Wilcoxon
2-sample t-test                                        Mann-Whitney test
One-Way ANOVA                                          Kruskal-Wallis, Mood's median test
Factorial DOE with a factor and a blocking variable    Friedman test
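As an illustration of one of these pairs, here is a minimal Python sketch (using SciPy, with made-up sample data) that runs a 2-sample t-test and its nonparametric counterpart, the Mann-Whitney test, on the same two groups:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=30)   # hypothetical scores, group A
group_b = rng.normal(loc=55, scale=10, size=30)   # hypothetical scores, group B

# Parametric: 2-sample t-test on the group means (Welch's version, unequal variances allowed)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric counterpart: Mann-Whitney test, which compares the two distributions/medians
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney: U = {u_stat:.2f}, p = {u_p:.4f}")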
Advantages of Parametric Tests
Advantage 1: Parametric tests can provide trustworthy results with distributions
that are skewed and non-normal
Many people aren’t aware of this fact, but parametric analyses can produce reliable
results even when your continuous data are non-normally distributed. You just
have to be sure that your sample size meets the requirements for each analysis in
the table below. Simulation studies have identified these requirements. Read here
for more information about these studies.
Parametric analyses    Sample size requirements for non-normal data
1-sample t-test        Greater than 20
2-sample t-test        Each group should have more than 15 observations
One-Way ANOVA          For 2-9 groups, each group should have more than 15
                       observations; for 10-12 groups, each group should have
                       more than 20 observations


You can use these parametric tests with non-normally distributed data thanks to the
central limit theorem. For more information about it, read my post: Central Limit
Theorem Explained.
Related posts: The Normal Distribution and How to Identify the Distribution of
Your Data.
Advantage 2: Parametric tests can provide trustworthy results when the groups
have different amounts of variability
It’s true that nonparametric tests don’t require data that are normally
distributed. However, nonparametric tests have the disadvantage of an additional
requirement that can be very hard to satisfy. The groups in a
nonparametric analysis typically must all have the same variability (dispersion).
Nonparametric analyses might not provide accurate results when variability differs
between groups.
Conversely, parametric analyses, like the 2-sample t-test or one-way ANOVA,
allow you to analyze groups with unequal variances. In most statistical software,
it’s as easy as checking the correct box! You don’t have to worry about groups
having different amounts of variability when you use a parametric analysis.

4. Discuss the types of data presentation and cite examples to substantiate your
answer.

Statistical data presentation

Introduction
Data are a set of facts, and provide a partial picture of reality. Whether data are
being collected with a certain purpose or collected data are being utilized,
questions regarding what information the data are conveying, how the data can be
used, and what must be done to include more useful information must constantly
be kept in mind.
Since most data are available to researchers in a raw format, they must be
summarized, organized, and analyzed to usefully derive information from them.
Furthermore, each data set needs to be presented in a certain way depending on
what it is used for. Planning how the data will be presented is essential before
appropriately processing raw data.
First, a question for which an answer is desired must be clearly defined. The more
detailed the question is, the more detailed and clearer the results are. A broad
question results in vague answers and results that are hard to interpret. In other
words, a well-defined question is crucial for the data to be well-understood later.
Once a detailed question is ready, the raw data must be prepared before processing.
These days, data are often summarized, organized, and analyzed with statistical
packages or graphics software. Data must be prepared in such a way they are
properly recognized by the program being used. The present study does not discuss
this data preparation process, which involves creating a data frame,
creating/changing rows and columns, changing the level of a factor, categorical
variable, coding, dummy variables, variable transformation, data transformation,
missing value, outlier treatment, and noise removal.
We describe the roles and appropriate use of text, tables, and graphs (graphs, plots,
or charts), all of which are commonly used in reports, articles, posters, and
presentations. Furthermore, we discuss the issues that must be addressed when
presenting various kinds of information, and effective methods of presenting data,
which are the end products of research, and of emphasizing specific information.

Data Presentation
Data can be presented in one of three ways:
–as text;
–in tabular form; or
–in graphical form.
Methods of presentation must be determined according to the data format, the
method of analysis to be used, and the information to be emphasized.
Inappropriately presented data fail to clearly convey information to readers and
reviewers. Even when the same information is being conveyed, different methods
of presentation must be employed depending on what specific information is going
to be emphasized. A method of presentation must be chosen after carefully
weighing the advantages and disadvantages of different methods of presentation.
For easy comparison of different methods of presentation, let us look at a table
(Table 1) and a line graph (Fig. 1) that present the same information [1]. If one
wishes to compare or introduce two values at a certain time point, it is appropriate
to use text or the written language. However, a table is the most appropriate when
all information requires equal attention, and it allows readers to selectively look at
information of their own interest. Graphs allow readers to understand the overall
trend in data, and intuitively understand the comparison results between two
groups. One thing to always bear in mind regardless of what method is used,
however, is the simplicity of presentation.
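As a simple illustration of the tabular versus graphical choice (a hypothetical sketch using pandas and matplotlib, with made-up group means rather than the actual data of Table 1), the same values can be shown as a table or as a line graph:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical group means measured at three time points
data = pd.DataFrame({"Time": [0, 1, 2], "Group A": [72, 68, 65], "Group B": [71, 74, 77]})

# Tabular presentation: every value gets equal attention
print(data.to_string(index=False))

# Graphical presentation: the overall trend and the group comparison are seen at a glance
plt.plot(data["Time"], data["Group A"], marker="o", label="Group A")
plt.plot(data["Time"], data["Group B"], marker="o", label="Group B")
plt.xlabel("Time point")
plt.ylabel("Mean value")
plt.legend()
plt.show()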
5. Compare and contrast the four scales of measurement
Answers:

The Four Scales of Measurement


Data can be classified as being on one of four scales: nominal, ordinal, interval
or ratio. Each level of measurement has some important properties that are useful
to know. For example, only the ratio scale has meaningful zeros.

A pie chart displays groups of nominal variables (i.e. categories).


1. Nominal Scale. Nominal variables (also called categorical variables) can be
placed into categories. They don’t have a numeric value and so cannot be added,
subtracted, divided or multiplied. They also have no order; if they appear to have
an order then you probably have ordinal variables instead.
2. Ordinal Scale. The ordinal scale contains things that you can place in order. For example,
hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank
data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal
scale.

3. Interval Scale. An interval scale has ordered numbers with meaningful


divisions. Temperature is on the interval scale: a difference of 10 degrees between
90 and 100 means the same as 10 degrees between 150 and 160. Compare that to
high school ranking (which is ordinal), where the difference between 1st and 2nd
might be .01 and between 10th and 11th .5. If you have meaningful divisions, you
have something on the interval scale.

4. Ratio Scale. The ratio scale is exactly the same as the interval scale with one
major difference: zero is meaningful. For example, a height of zero is meaningful
(it means you don’t exist). Compare that to a temperature of zero, which, while it
exists, doesn’t mean anything in particular (although admittedly, in the Celsius
scale it’s the freezing point for water).
Weight is measured on the ratio scale.
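To make the nominal/ordinal distinction concrete in software, here is a small Python sketch (using pandas; the category labels are invented for illustration):

import pandas as pd

# Nominal: unordered categories -- only counting and grouping make sense
blood_type = pd.Series(["A", "O", "B", "O", "AB"], dtype="category")
print(blood_type.value_counts())

# Ordinal: ordered categories -- ranking and comparison make sense
pain = pd.Series(pd.Categorical(["low", "high", "medium", "low"],
                                categories=["low", "medium", "high"], ordered=True))
print(pain.min(), pain.max())   # low high  (the order is meaningful)
print((pain > "low").sum())     # 2  (comparisons work only because the scale is ordered)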
6. Give at least (5) sampling techniques and give examples to concretize your
answer.
Answers:
Probability Sampling Methods
1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of
obtaining a random sample is to give each individual in a population a number, and
then use a table of random numbers to decide which individuals to include.1 
For example, if you have a sampling frame of 1000 individuals, labelled 0 to 999,
use groups of three digits from the random number table to pick your sample. So, if
the first three numbers from the random number table were 094, select the
individual labelled “94”, and so on.
As with all probability sampling methods, simple random sampling allows the
sampling error to be calculated and reduces selection bias. A specific advantage is
that it is the most straightforward method of probability sampling. A disadvantage
of simple random sampling is that you may not select enough individuals with your
characteristic of interest, especially if that characteristic is uncommon. It may also
be difficult to define a complete sampling frame and inconvenient to contact them,
especially if different forms of contact are required (email, phone, post) and your
sample units are scattered over a wide geographical area.
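A minimal Python sketch of the approach described above (a sampling frame of 1,000 individuals labelled 0 to 999, as in the example; random.sample stands in for the table of random numbers):

import random

random.seed(1)                      # fixed seed only to make the illustration reproducible
sampling_frame = list(range(1000))  # individuals labelled 0 to 999

# Simple random sampling: every individual has an equal chance of selection
simple_random_sample = random.sample(sampling_frame, k=20)
print(sorted(simple_random_sample))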
 
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals
are chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample.
 For example, if you wanted a sample size of 100 from a population of 1000,
select every 1000/100 = 10th member of the sampling frame.
Systematic sampling is often more convenient than simple random sampling, and it
is easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that
the sampling technique coincides with the periodicity of the underlying pattern. As
a hypothetical example, if a group of students were being sampled to gain their
opinions on college facilities, but the Student Record Department’s central list of
all students was arranged such that the sex of students alternated between male and
female, choosing an even interval (e.g. every 20th student) would result in a sample
of all males or all females. Whilst in this example the bias is obvious and should be
easily corrected, this may not always be the case.
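A sketch of systematic sampling for the example above (a sample of 100 from a population of 1,000, i.e. every 10th member of the frame; the random starting point is an added assumption, used to avoid always beginning at the first individual):

import random

random.seed(2)
population_size, sample_size = 1000, 100
interval = population_size // sample_size  # select every 10th individual

start = random.randrange(interval)         # random start within the first interval
systematic_sample = list(range(start, population_size, interval))
print(len(systematic_sample), systematic_sample[:5])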
 
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all
share a similar characteristic. It is used when we might reasonably expect the
measurement of interest to vary between the different subgroups, and we want to
ensure representation from all the subgroups.
For example, in a study of stroke outcomes, we may stratify the population
by sex, to ensure equal representation of men and women. The study sample is
then obtained by taking equal sample sizes from each stratum. In stratified
sampling, it may also be appropriate to choose non-equal sample sizes from each
stratum. For example, in a study of the health outcomes of nursing staff in a
county, if there are three hospitals each with different numbers of nursing staff
(hospital A has 500 nurses, hospital B has 1000 and hospital C has 2000), then it
would be appropriate to choose the sample numbers from each
hospital proportionally (e.g. 10 from hospital A, 20 from hospital B and 40 from
hospital C). This ensures a more realistic and accurate estimation of the health
outcomes of nurses across the county, whereas simple random sampling would
over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always
available), and it can be difficult to decide which characteristic(s) to stratify by.
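A sketch of the proportional allocation in the nursing example above (hospital sizes are taken from the text; the overall sample of 70 nurses, i.e. a 1-in-50 sampling fraction, is implied by the 10/20/40 split):

import random

random.seed(3)
hospitals = {"A": 500, "B": 1000, "C": 2000}      # nurses per hospital (the strata)
sampling_fraction = 70 / sum(hospitals.values())  # 70 out of 3,500 = 1 in 50

stratified_sample = {}
for hospital, n_nurses in hospitals.items():
    stratum = [f"{hospital}-{i}" for i in range(n_nurses)]  # hypothetical nurse IDs
    k = round(n_nurses * sampling_fraction)                 # proportional allocation
    stratified_sample[hospital] = random.sample(stratum, k=k)

print({h: len(s) for h, s in stratified_sample.items()})    # {'A': 10, 'B': 20, 'C': 40}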

4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit,
rather than individuals. The population is divided into subgroups, known as
clusters, which are randomly selected to be included in the study. Clusters are
usually already defined, for example individual GP practices or towns could be
identified as clusters. In single-stage cluster sampling, all members of the chosen
clusters are then included in the study. In two-stage cluster sampling, a selection of
individuals from each cluster is then randomly selected for inclusion. Clustering
should be taken into account in the analysis.
The General Household survey, which is undertaken annually in England, is a
good example of a (one-stage) cluster sample. All members of the selected
households (clusters) are included in the survey.1
Cluster sampling can be more efficient than simple random sampling, especially
where a study takes place over a wide geographical region. For instance, it is easier
to contact lots of individuals in a few GP practices than a few individuals in many
different GP practices. Disadvantages include an increased risk of bias, if the
chosen clusters are not representative of the population, resulting in an increased
sampling error.
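A sketch of one-stage cluster sampling (the GP practices, their sizes and the number of selected clusters are all invented for illustration):

import random

random.seed(4)
# Hypothetical clusters: 20 GP practices, each with a list of 50 patients
clusters = {f"practice_{i}": [f"patient_{i}_{j}" for j in range(50)] for i in range(20)}

# One-stage cluster sampling: randomly select whole clusters,
# then include every member of the selected clusters
selected_practices = random.sample(list(clusters), k=4)
cluster_sample = [person for practice in selected_practices
                  for person in clusters[practice]]
print(selected_practices, len(cluster_sample))  # 4 practices, 200 individuals in total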
 
Non-Probability Sampling Methods
1. Convenience sampling
Convenience sampling is perhaps the easiest method of sampling, because
participants are selected based on availability and willingness to take part. Useful
results can be obtained, but the results are prone to significant bias, because those
who volunteer to take part may be different from those who choose not to
(volunteer bias), and the sample may not be representative of other characteristics,
such as age or sex. Note: volunteer bias is a risk of all non-probability sampling
methods.
 
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are
given a quota of subjects of a specified type to attempt to recruit.
For example, an interviewer might be told to go out and select 20 adult men, 20
adult women, 10 teenage girls and 10 teenage boys so that they could interview
them about their television viewing. Ideally the quotas chosen would
proportionally represent the characteristics of the underlying population.
Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren’t considered (a consequence of the non-random nature of sampling). 2
3. Judgement (or Purposive) Sampling
Also known as selective, or subjective, sampling, this technique relies on the
judgement of the researcher when choosing who to ask to participate.
Researchers may implicitly thus choose a “representative” sample to suit their
needs, or specifically approach individuals with certain characteristics. This
approach is often used by the media when canvassing the public for opinions and
in qualitative research.
Judgement sampling has the advantage of being time-and cost-effective to perform
whilst resulting in a range of responses (particularly useful in qualitative research).
However, in addition to volunteer bias, it is also prone to errors of judgement by
the researcher and the findings, whilst being potentially broad, will not necessarily
be representative.
4. Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach
groups. Existing subjects are asked to nominate further subjects known to them, so
the sample increases in size like a rolling snowball.
For example, when carrying out a survey of risk behaviors amongst intravenous
drug users, participants may be asked to nominate other users to be interviewed.
Snowball sampling can be effective when a sampling frame is difficult to identify.
However, by selecting friends and acquaintances of subjects already investigated,
there is a significant risk of selection bias (choosing a large number of people with
similar characteristics or views to the initial individual identified).
Bias in sampling
There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be
introduced when:1
1. Any pre-agreed sampling rules are deviated from
2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are
difficult to contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes
people who have recently moved to the area).

7. Discuss the following basic measurements in statistics and give the formula for
each.
Answer:
Statistics Basics: Overview
The most common basic statistics terms you’ll come across are the mean, mode
and median. These are all what are known as “Measures of Central Tendency.”
Also important in this early chapter of statistics is the shape of a distribution. This
tells us something about how data is spread out around the mean or median.
Perhaps the most common distribution you’ll see is the normal distribution,
sometimes called a bell curve. Heights, weights, and many other things found in
nature tend to be shaped like this:
Overview
Stuck on how to find the mean, median, & mode in statistics?
1. The mean is the average of a data set.
2. The mode is the most common number in a data set.
3. The median is the middle of the set of numbers.
Of the three, the mean is the only one that requires a formula. I like to think of it in
the other dictionary sense of the word (as in, it’s mean as opposed to nice!). That’s
because, compared to the other two, it’s not as easy to work with.

Hints to remember the difference


Having trouble remembering the difference between the mean, median and mode?
Here are a couple of hints that can help.
 “A la mode” is a French phrase that means fashionable; it also refers to a
popular way of serving ice cream. So “Mode” is the most popular or
fashionable member of a set of numbers. The word MOde is also like MOst.
 The “Mean” requires you do arithmetic (adding all the numbers and
dividing) so that’s the “mean” one.
 “Median” has the same number of letters as “Middle”.
Mean vs Median
Both are measures of where the center of a data set lies (called “Central Tendency”
in stats), but they are usually different numbers. For example, take this list of
numbers: 10, 10, 20, 40, 70.
 The mean (informally, the “average“) is found by adding all of the numbers
together and dividing by the number of items in the set: (10 + 10 + 20 + 40 + 70)
/ 5 = 30.
 The median is found by ordering the set from lowest to highest and finding
the exact middle. The median is just the middle number: 20.
Sometimes the two will be the same number. For example, the data set 1, 2, 4, 6, 7
has a mean of (1 + 2 + 4 + 6 + 7) / 5 = 4 and a median (a middle) of 4.

Mean vs Average: What’s the Difference?


When you first started out in mathematics, you were probably taught that an
average was a “middling” amount for a set of numbers. You added up the numbers,
divided by the number of items, and voila! you get the average. For
example, the average of 10, 6 and 20 is:
(10 + 6 + 20) / 3 = 36 / 3 = 12.
Then you started studying statistics and all of a sudden the “average” is now called
the mean. What happened? The answer is that they have the same meaning (they
are synonyms).
That said, technically, the word mean is short for the arithmetic mean. We use
different words in stats, because there are multiple different types of means, and
they all do different things.

Specific “Means” commonly used in Stats

You’ll probably come across these in your stats class. They have very narrow meanings:
 Mean of the sampling distribution: used with probability distributions, especially
with the Central Limit Theorem. It’s an average of a set of distributions.
 Sample mean: the average value in a sample.
 Population mean: the average value in a population.

Other Types
There are other types of means, and you’ll use them in various branches of math.
Most have very narrow applications to fields like finance or physics; if you’re in
elementary statistics you probably won’t work with them.
These are some of the most common types you’ll come across.
1. Weighted mean.
2. Harmonic mean.
3. Geometric mean.
4. Arithmetic-Geometric mean.
5. Root-Mean Square mean.
6. Heronian mean.
1. Weighted Mean
These are fairly common in statistics, especially when studying populations.
Instead of each data point contributing equally to the final average, some data
points contribute more than others. If all the weights are equal, then this will
equal the arithmetic mean. There are certain circumstances when this can give
incorrect information, as shown by Simpson’s Paradox.
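A tiny Python illustration (hypothetical scores and weights) of how a weighted mean differs from the ordinary arithmetic mean:

scores = [80, 90]        # hypothetical exam scores
weights = [0.25, 0.75]   # the second exam counts three times as much as the first

weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
arithmetic_mean = sum(scores) / len(scores)

print(weighted_mean)     # 87.5
print(arithmetic_mean)   # 85.0  (what you get when all the weights are equal)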
2. Harmonic Mean

The harmonic mean formula: H = n / (1/x1 + 1/x2 + … + 1/xn).


To find it:

A. Add the reciprocals of the numbers in the set. To find a reciprocal, flip
the fraction so that the numerator becomes the denominator and the
denominator becomes the numerator. For example, the reciprocal of 6/1 is
1/6.
B. Divide the answer by the number of items in the set.
C. Take the reciprocal of the result.
The harmonic mean is used quite a lot in physics. In some cases involving rates
and ratios it gives a better average than the arithmetic mean. You’ll also find
uses in geometry, finance and computer science.
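A quick Python check of steps A-C above (using the standard library; the numbers 2, 4 and 8 are an arbitrary example):

import statistics

data = [2, 4, 8]

# Steps A-C by hand: sum the reciprocals, divide by n, then take the reciprocal
by_hand = 1 / (sum(1 / x for x in data) / len(data))

print(by_hand)                         # 3.4285714...
print(statistics.harmonic_mean(data))  # the same value, via the standard library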
3. Geometric Mean

This type has very narrow and specific uses in finance, social sciences and
technology. For example, let’s say you own stocks that earn 5% the first year,
20% the second year, and 10% the third year. If you want to know the average
rate of return, you can’t use the arithmetic average. Why? Because when you
are finding rates of return you are multiplying, not adding. For example, the
first year you are multiplying by 1.05.
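A short Python sketch of the stock example above (5%, 20% and 10% annual returns), comparing the geometric mean growth factor with the (misleading) arithmetic average of the factors:

growth_factors = [1.05, 1.20, 1.10]  # 5%, 20% and 10% annual returns

# Geometric mean: the constant yearly factor that gives the same overall growth
geometric = (growth_factors[0] * growth_factors[1] * growth_factors[2]) ** (1 / 3)

# Arithmetic mean of the factors, for comparison (slightly overstates the average return)
arithmetic = sum(growth_factors) / len(growth_factors)

print(f"geometric mean factor:  {geometric:.4f}")   # about 1.1149, i.e. roughly 11.5% per year
print(f"arithmetic mean factor: {arithmetic:.4f}")  # about 1.1167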

4. Arithmetic-Geometric Mean
This is used mostly in calculus and in machine computation (i.e. as the basic
for many computer calculations). It’s related to the perimeter of an ellipse.
When it was first developed by Gauss, it was used to calculate planetary orbits.
The arithmetic-geometric mean is (not surprisingly!) a blend of the arithmetic and
geometric averages. The math is quite complicated but you can find a
relatively simple explanation of the math here.
5. Root-Mean Square
It is very useful in fields that study sine waves, like electrical engineering. This
particular type is also called the quadratic average. See: Quadratic Mean / Root
Mean Square.
6. Heronian Mean
Used in geometry to find the volume of a pyramidal frustum. A pyramidal
frustum is basically a pyramid with the tip sliced off.
What is the Mode?
The mode is the most common number in a set. For example, the mode in this set
of numbers is 21:
21, 21, 21, 23, 24, 26, 26, 28, 29, 30, 31, 33
What is the Median?
The median is the middle number in a data set. To find the median, list your data
points in ascending order and then find the middle number. The middle number in
this set is 28 as there are 4 numbers below it and 4 numbers above:
23, 24, 26, 26, 28, 29, 30, 31, 33
Note: If you have an even set of numbers, average the middle two to find the
median. For example, the median of this set of numbers is 28.5 ((28 + 29) / 2).
23, 24, 26, 26, 28, 29, 30, 31, 33, 34
How to find the mean, median and mode by hand: Steps
How to find the mean, median and mode: MODE
 Step 1: Put the numbers in order so that you can clearly see patterns.
For example, let’s say we have 2, 19, 44, 44, 44, 51, 56, 78, 86, 99, 99.
The mode is the number that appears the most often. In this case: 44, which
appears three times.
How to find the mean, median and mode: MEAN
 Step 2: Add the numbers up to get a total.
Example: 2 + 19 + 44 + 44 + 44 + 51 + 56 + 78 + 86 + 99 + 99 = 622. Set this
number aside for a moment.
 Step 3: Count the amount of numbers in the series.
In our example (2, 19, 44, 44, 44, 51, 56, 78, 86, 99, 99), we have 11 numbers.
 Step 4: Divide the number you found in Step 2 by the number you found in
Step 3. In our example: 622 / 11 = 56.5454545. This is the mean, sometimes called
the average.
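The same steps, checked with Python's standard library (using the example data above):

import statistics

data = [2, 19, 44, 44, 44, 51, 56, 78, 86, 99, 99]

print(statistics.mode(data))    # 44  (appears three times)
print(statistics.mean(data))    # 56.5454545...  (622 / 11)
print(statistics.median(data))  # 51  (the middle value of the 11 sorted numbers)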

8. Mention/describe at least five (5) statistical tests/tools commonly used in
research and give their uses.

Answers:

Basic statistical tools in research and data analysis


Zulfiqar Ali and S Bala Bhaskar

INTRODUCTION
Statistics is a branch of science that deals with the collection, organisation, analysis
of data and drawing of inferences from the samples to the whole population.[1]
This requires a proper design of the study, an appropriate selection of the study
sample and choice of a suitable statistical test. An adequate knowledge of statistics
is necessary for proper designing of an epidemiological study or a clinical trial.
Improper statistical methods may result in erroneous conclusions which may lead
to unethical practice.

VARIABLES
A variable is a characteristic that varies from one individual member of a population
to another individual.[3] Variables such as height and weight are measured by some
type of scale, convey quantitative information and are called quantitative
variables. Sex and eye color give qualitative information and are called
qualitative variables.
Quantitative variables
Quantitative or numerical data are subdivided into discrete and continuous
measurements. Discrete numerical data are recorded as a whole number such as 0,
1, 2, 3,… (integer), whereas continuous data can assume any value. Observations
that can be counted constitute the discrete data and observations that can be
measured constitute the continuous data. Examples of discrete data are number of
episodes of respiratory arrests or the number of re-intubations in an intensive care
unit. Similarly, examples of continuous data are the serial serum glucose levels,
partial pressure of oxygen in arterial blood and the oesophageal temperature.
A hierarchical scale of increasing precision can be used for observing and
recording the data which is based on categorical, ordinal, interval and ratio scales
Categorical or nominal variables are unordered. The data are merely classified into
categories and cannot be arranged in any particular order. If only two categories
exist (as in gender: male and female), it is called dichotomous (or binary) data.
The various causes of re-intubation in an intensive care unit due to upper airway
obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary
oedema and neurological impairment are examples of categorical variables.
Ordinal variables have a clear ordering between the variables. However, the
ordered data may not have equal intervals. Examples are the American Society of
Anesthesiologists status or Richmond agitation-sedation scale.
Interval variables are similar to an ordinal variable, except that the intervals
between the values of the interval variable are equally spaced. A good example of
an interval scale is the Fahrenheit degree scale used to measure temperature. With
the Fahrenheit scale, the difference between 70° and 75° is equal to the difference
between 80° and 85°: The units of measurement are equal throughout the full range
of the scale.
Ratio scales are similar to interval scales, in that equal differences between scale
values have equal quantitative meaning. However, ratio scales also have a true zero
point, which gives them an additional property. For example, the system of
centimetres is an example of a ratio scale. There is a true zero point and the value
of 0 cm means a complete absence of length. The thyromental distance of 6 cm in
an adult may be twice that of a child in whom it may be 3 cm.

Descriptive statistics
The extent to which the observations cluster around a central location is described
by the central tendency and the spread towards the extremes is described by the
degree of dispersion.
Measures of central tendency
The measures of central tendency are mean, median and mode.[6] Mean (or the
arithmetic average) is the sum of all the scores divided by the number of scores.
Mean may be influenced profoundly by extreme values. For example, the
average stay of organophosphorus poisoning patients in ICU may be influenced by
a single patient who stays in ICU for around 5 months because of septicaemia. The
extreme values are called outliers. The formula for the mean is

Mean = Σx / n,
where x = each observation and n = number of observations. Median[6] is defined
as the middle of a distribution in a ranked data (with half of the variables in the
sample above and half below the median value) while mode is the most frequently
occurring variable in a distribution. Range defines the spread, or variability, of a
sample.[7] It is described by the minimum and maximum values of the variables. If
we rank the data and after ranking, group the observations into percentiles, we can
get better information of the pattern of spread of the variables. In percentiles, we
rank the observations into 100 equal parts. We can then describe 25%, 50%, 75%
or any other percentile amount. The median is the 50th percentile. The interquartile
range will be the observations in the middle 50% of the observations about the
median (25th-75th percentile). Variance[7] is a measure of how spread out the
distribution is. It gives an indication of how closely an individual observation clusters
about the mean value. The variance of a population is defined by the following
formula:

σ² = Σ(Xᵢ - X)² / N

where σ² is the population variance, X is the population mean, Xᵢ is the ith element
from the population and N is the number of elements in the population. The
variance of a sample is defined by a slightly different formula:

s² = Σ(xᵢ - x̄)² / (n - 1)

where s² is the sample variance, x̄ is the sample mean, xᵢ is the ith element from the
sample and n is the number of elements in the sample.
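A short Python sketch of these descriptive measures (standard library only; the small data set is arbitrary and used purely for illustration):

import statistics

data = [4, 8, 6, 5, 3, 7, 8]

print(statistics.mean(data))       # arithmetic average
print(statistics.median(data))     # middle value of the ranked data
print(statistics.mode(data))       # most frequently occurring value (8)
print(max(data) - min(data))       # range: spread between the minimum and maximum
print(statistics.pvariance(data))  # population variance, divides by N
print(statistics.variance(data))   # sample variance, divides by (n - 1)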
