Statistical Tests
Table of Contents
1 Statistical Hypothesis Testing
2 Relationship Between Variables
2.1 Linear Relationship
2.2 Non-Linear Relationship
3 Statistical Correlation
3.1 Pearson Product-Moment Correlation
3.2 Spearman Rank Correlation Coefficient
3.3 Partial Correlation Analysis
3.4 Correlation and Causation
4 Regression
4.1 Linear Regression Analysis
4.2 Multiple Regression Analysis
4.3 Correlation and Regression
5 Student’s T-Test
5.1 Independent One-Sample T-Test
5.2 Independent Two-Sample T-Test
5.3 Dependent T-Test for Paired Samples
5.4 Student’s T-Test (II)
6.1 One-Way ANOVA
6.2 Two-Way ANOVA
6.3 Factorial ANOVA
6.4 Repeated Measures ANOVA
7 Nonparametric Statistics
7.1 Cohen’s Kappa
7.2 Mann-Whitney U-Test
7.3 Wilcoxon Signed Rank Test
8.2 Z-Test
8.3 F-Test
8.4 Factor Analysis
8.5 ROC Curve Analysis
8.6 Meta Analysis
Copyright Notice
Copyright © 2014. All rights reserved, including the right of reproduction in
whole or in part in any form. No parts of this book may be reproduced in any form without
written permission of the copyright owner.
Notice of Liability
The author(s) and publisher both used their best efforts in preparing this book and the
instructions contained herein. However, the author(s) and the publisher make no warranties of
any kind, either expressed or implied, with regards to the information contained in this book,
and especially disclaim, without limitation, any implied warranties of merchantability and
fitness for any particular purpose.
In no event shall the author(s) or the publisher be responsible or liable for any loss of profits or
other commercial or personal damages, including but not limited to special, incidental,
consequential, or any other damages, in connection with or arising out of furnishing,
performance or use of this book.
Throughout this book, trademarks may be used. Rather than put a trademark symbol in every
occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner with no intention of infringement of the
trademarks. Thus, copyrights on individual photographic, trademarks and clip art images
reproduced in this book are retained by the respective owner.
Published by
1 Statistical Hypothesis Testing
Statistical hypothesis testing is used to rule out chance as the explanation for an
experimental result, and so to establish the validity of the result and its relationship with
the event under consideration.
For example, suppose you want to study the effect of smoking on the
occurrence of lung cancer cases. If you take a small group, it may
happen that there appears no correlation at all, and you find that there
are many smokers with healthy lungs and many non-smokers with lung cancer.
However, it can just happen that this is by chance, and in the overall population this isn't true.
In order to remove this element of chance and increase the reliability of our hypothesis, we
use statistical hypothesis testing.
In this, you will first assume a hypothesis that smoking and lung cancer are unrelated. This is
called the 'null hypothesis', which is central to any statistical hypothesis testing.
You should therefore first choose a distribution for the experimental group. The normal
distribution is one of the most common distributions encountered in nature, but a different
distribution may be appropriate in special cases.
You then fix a critical value (the significance level, conventionally 5% or 1%) before
looking at the data. If the experiment suggests that the probability of obtaining the observed
result by chance alone is less than this critical value, then the null hypothesis can be rejected.
If the null hypothesis is rejected, then we need to look for an alternative hypothesis that is in
line with the experimental observations.
There is also the gray area in between, like at the 15-20% level, in which it is hard to say
whether the null hypothesis can be rejected. In such cases, we can say that there is reason
enough to doubt the validity of the null hypothesis but there isn't enough evidence to suggest
that we reject the null hypothesis altogether.
A result in the gray area often leads to more exploration before concluding anything.
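The decision rule described above can be sketched with an exact one-sided binomial test. The coin-toss setting, the counts (60 heads in 100 tosses) and the 5% critical value are assumptions chosen purely for illustration; they are not taken from the smoking example.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 60 heads in 100 tosses; H0 says the coin is fair.
p_value = binomial_p_value(60, 100)
critical_value = 0.05  # assumed significance level
reject_null = p_value < critical_value
print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

With these made-up counts the p-value falls below the assumed 5% level, so the null hypothesis of a fair coin would be rejected; a result of, say, 55 heads would not allow that conclusion.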
Accepting a Hypothesis
Statistical hypothesis testing can only cast doubt on the null hypothesis: an experiment can
provide evidence against it, but no experiment can demonstrate that the null hypothesis is
actually valid. This is because of the falsifiability principle in the scientific method.
Therefore it is a tricky situation for someone who wants to show the independence of the two
events, like smoking and lung cancer in our previous example.
This problem can be overcome by using a confidence interval and then arguing that, according
to the experimental data, the first event has at most a negligible effect (no larger than the
confidence interval) on the second event.
In the figure below, we can see that one can argue the independence is within 0.05 times the
standard deviation.
How to cite this article:
Siddharth Kalla (Nov 15, 2009). Statistical Hypothesis Testing. Retrieved from
2 Relationship Between Variables
There are several different kinds of relationships between variables. Before drawing a
conclusion, you should first understand how one variable changes with the other. This means
you need to establish how the variables are related - is the relationship linear or quadratic or
inverse or logarithmic or something else?
Suppose you measure the volume of a gas in a cylinder and its pressure. Now you start
compressing the gas by pushing a piston, all the while keeping the gas at room
temperature. The volume of the gas decreases while the pressure increases. You plot the
different values on graph paper.
If you take enough measurements, you will see the shape of a hyperbola defined by
xy = constant. This is because gases follow Boyle's law, which says that when the temperature
is constant, PV = constant. Here, by taking data, you are relating the pressure of the gas to its
volume. Similarly, many relationships are linear in nature.
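A small numerical sketch of the gas example may help. The constant k and the list of volumes are invented for illustration; only the relation PV = constant comes from the text.

```python
# Hypothetical isothermal measurements: the constant k = P * V is an assumed value.
k = 2400.0  # kPa * litre, illustrative only
volumes = [24.0, 12.0, 8.0, 6.0, 4.0]      # litres
pressures = [k / v for v in volumes]       # kPa, from P = k / V

for v, p in zip(volumes, pressures):
    print(f"V = {v:5.1f} L   P = {p:6.1f} kPa   P*V = {p * v:7.1f}")

# Halving the volume doubles the pressure, so the plotted curve is a hyperbola,
# not a straight line.
ratio = pressures[1] / pressures[0]
```

The product P*V stays the same in every row, which is the signature of an inverse relationship.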
However, in social sciences, things get much more complicated because parameters may or
may not be directly related. There could be a number of indirect consequences and deducing
cause and effect can be challenging.
Only when the change in one variable actually causes the change in another parameter is
there a causal relationship. Otherwise, it is simply a correlation. Correlation doesn't imply
causation. There are ample examples and various types of fallacies in use.
A famous example to prove the point: increased ice-cream sales show a strong correlation with
deaths by drowning. It would obviously be wrong to conclude that consuming ice-cream
causes drowning. The explanation is that more ice-cream is sold in the summer, when
more people go to the beach and other bodies of water, and therefore there are more deaths by
drowning.
It is important to understand the relationship between variables to draw the right conclusions.
Even the best scientists can get this wrong and there are several instances of how studies get
correlation and causation mixed up.
Siddharth Kalla (Jul 26, 2011). Relationship Between Variables. Retrieved from
2.1 Linear Relationship
A linear relationship is one where increasing or decreasing one variable n times will
cause a corresponding increase or decrease of n times in the other variable too. In
simpler words, if you double one variable, the other will double as well.
For example:
For a given material, if the volume of the material is doubled, its weight will also double.
This is a linear relationship. If the volume is increased 10 times, the weight will also
increase by the same factor.
If you take the perimeter of a square and its side, they are linearly related. If you take a
square that has sides twice as long, the perimeter will also be twice as large.
The cost of objects is usually linear. If a notebook costs $1, then ten notebooks will cost
$10.
The force of gravity between the earth and an object is linear in the mass of the object. If
the mass of the object doubles, the force of gravity acting on it also doubles.
As can be seen from the above examples, a number of very important physical phenomena
can be described by a linear relationship.
Apart from these physical processes, there are many correlations between variables that can
be approximated by a linear relationship. This greatly simplifies a problem at hand because a
linear relationship is much simpler to study and analyze than a non-linear one.
Constant of Proportionality
The constant of proportionality is an important concept that emerges from a linear
relationship. By using this constant, we can formulate the actual formula that describes one
variable in terms of the other.
For example, in our first example, the constant of proportionality between mass and volume is
called density. Thus we can mathematically write:
Mass = Density × Volume
The constant of proportionality, the density, is defined from the above equation - it is the mass
per unit volume of the material.
If you plot these variables on graph paper, the slope of the straight line is the constant of
proportionality.
In this example, if you plot mass on the y-axis and volume on the x-axis, you will find that the
slope of the line thus formed gives the density.
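As a rough sketch of reading the density off the slope, the snippet below fits a line through the origin to a handful of mass and volume readings. The readings are invented for illustration, and the helper name `slope_through_origin` is hypothetical.

```python
def slope_through_origin(x, y):
    """Least-squares slope of a line y = k * x forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical measurements of one material (values invented for illustration).
volume_cm3 = [10.0, 20.0, 30.0, 40.0, 50.0]   # x-axis
mass_g = [27.1, 53.8, 81.2, 107.9, 135.0]     # y-axis

density = slope_through_origin(volume_cm3, mass_g)  # grams per cubic centimetre
print(f"estimated density = {density:.3f} g/cm^3")
```

Forcing the line through the origin reflects the proportionality assumption: zero volume must correspond to zero mass.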
Linear relationships are not limited to physical phenomena but are frequently encountered in
all kinds of scientific research and methodologies. An understanding of linear relationships is
essential to understand these relationships between variables.
Siddharth Kalla (Jan 10, 2011). Linear Relationship. Retrieved from
2.2 Non-Linear Relationship
Linear relationships are the easiest to understand and study, and a number of very important
physical phenomena are linear. However, they do not cover the whole ambit of our
mathematical techniques, and non-linear relationships are fundamental to a number of the most
important and intriguing physical and social phenomena.
There is an endless variety of non-linear relationships that one can encounter. However,
most of them can still fit into other categories, like polynomial, logarithmic, etc.
The side of a square and its area are not linear. In fact, this is a quadratic relationship. If
you double the side of a square, its area will increase 4 times.
While charging a capacitor, the amount of charge and time are non-linearly dependent.
Thus the capacitor is not twice as charged after 2 seconds as it was after 1 second. This
is an exponential relationship.
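The capacitor example can be checked numerically. The sketch below assumes the standard charging law for a capacitor charged through a resistor, q(t) = Q(1 - exp(-t/τ)), with a time constant τ of 1 second chosen arbitrarily for illustration.

```python
from math import exp

def charge_fraction(t: float, tau: float = 1.0) -> float:
    """Fraction of full charge on a capacitor after t seconds (time constant tau)."""
    return 1.0 - exp(-t / tau)

q1 = charge_fraction(1.0)  # charge after 1 second
q2 = charge_fraction(2.0)  # charge after 2 seconds
print(f"after 1 s: {q1:.3f}, after 2 s: {q2:.3f}, ratio: {q2 / q1:.3f}")
```

Under these assumptions the ratio is well below 2, which is what "not twice as charged" means in practice.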
For example, the pressure and volume of nitrogen during an isentropic expansion are related
as PV^1.4 = constant, which is highly non-linear but fits neatly into this equation.
Next, a number of non-linear relationships are monotonic in nature. This means they do not
oscillate and steadily increase or decrease. This is good to study because they behave
qualitatively like linear relationships for a number of cases.
A linear relationship is the simplest to understand and therefore can serve as the first
approximation of a non-linear relationship. The limits of validity need to be well noted. In fact,
a number of phenomena were thought to be linear but later scientists realized that this was
only true as an approximation.
Consider the special theory of relativity, which redefined our perceptions of space and time. It
gives the full non-linear relationship between variables. At lower speeds, these relationships
can very well be approximated as linear, as they are in Newtonian mechanics. If you consider
momentum, in Newtonian mechanics it is linearly dependent on velocity. If you double the
velocity, the momentum will double. However, at speeds approaching those of light, this
becomes a highly non-linear relationship.
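To make the comparison concrete, the sketch below evaluates the Newtonian momentum mv next to the relativistic momentum γmv, with γ = 1/sqrt(1 - v²/c²). The mass and the chosen speeds are arbitrary illustrative values.

```python
from math import sqrt

C = 299_792_458.0  # speed of light in metres per second

def newtonian_momentum(mass: float, v: float) -> float:
    return mass * v

def relativistic_momentum(mass: float, v: float) -> float:
    gamma = 1.0 / sqrt(1.0 - (v / C) ** 2)
    return gamma * mass * v

mass = 1.0  # kilograms, arbitrary
for fraction in (0.001, 0.1, 0.45, 0.9):
    v = fraction * C
    ratio = relativistic_momentum(mass, v) / newtonian_momentum(mass, v)
    print(f"v = {fraction:5.3f} c   relativistic / Newtonian = {ratio:.4f}")

# Doubling the speed from 0.45c to 0.9c more than doubles the momentum.
doubling = relativistic_momentum(mass, 0.9 * C) / relativistic_momentum(mass, 0.45 * C)
```

At a thousandth of the speed of light the two formulas agree almost exactly, while near the speed of light the linear approximation fails badly.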
Some of the greatest scientific challenges need the study of non-linear relationships. The
study of turbulence, which is one of the greatest unsolved problems in science and
engineering, needs the study of a non-linear differential equation.
Siddharth Kalla (Feb 17, 2011). Non-Linear Relationship. Retrieved from
3 Statistical Correlation
For example, consider the variables family income and family expenditure. It is well known
that income and expenditure increase or decrease together. Thus they are related in the
sense that change in any one variable is accompanied by change in the other variable.
Again, the price of and demand for a commodity are related variables; when price increases,
demand will tend to decrease, and vice versa.
If the change in one variable is accompanied by a change in the other, then the variables are
said to be correlated. We can therefore say that family income and family expenditure, price
and demand are correlated.
In the case of family income and family expenditure, it is easy to see that they both rise or fall
together in the same direction. This is called positive correlation.
In case of price and demand, change occurs in the opposite direction so that increase in one
is accompanied by decrease in the other. This is called negative correlation.
Coefficient of Correlation
Statistical correlation is measured by what is called the coefficient of correlation (r). Its
numerical value ranges from +1.0 to -1.0. It gives us an indication of the strength of the
relationship.
In general, r > 0 indicates a positive relationship, r < 0 indicates a negative relationship, while
r = 0 indicates no linear relationship (the variables may still be related in a non-linear way).
Here r = +1.0 describes a perfect positive correlation and r = -1.0 describes a perfect negative
correlation.
The closer the coefficient is to +1.0 or -1.0, the greater the strength of the relationship
between the variables.
As a rule of thumb, the following guidelines on strength of relationship are often useful (though
many experts would somewhat disagree on the choice of boundaries).
Value of r                      Strength of relationship
-1.0 to -0.5 or 0.5 to 1.0      Strong
-0.5 to -0.3 or 0.3 to 0.5      Moderate
-0.3 to -0.1 or 0.1 to 0.3      Weak
-0.1 to 0.1                     None or very weak
Correlation is only appropriate for examining the relationship between meaningful quantifiable
data (e.g. air pressure, temperature) rather than categorical data such as gender, favorite
color etc.
While 'r' (correlation coefficient) is a powerful tool, it has to be handled with care.
1. The most commonly used correlation coefficients only measure linear relationships. It is
therefore perfectly possible that r is close to 0, or even exactly 0, while there is a strong
non-linear relationship between the variables. In such a case, a scatter diagram can roughly
indicate the existence or otherwise of a non-linear relationship.
2. One has to be careful in interpreting the value of 'r'. For example, one could compute 'r'
between shoe size and intelligence, or between height and income. Irrespective of the
value of 'r', such a correlation makes no sense and is hence termed a chance or nonsense
correlation.
3. 'r' should not be used to say anything about a cause-and-effect relationship. Put
differently, by examining the value of 'r', we could conclude that variables X and Y are
related. However, the same value of 'r' does not tell us whether X influences Y or the other
way round. Statistical correlation should not be the primary tool used to study causation,
because of the problem with third variables.
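The first caution can be illustrated with a short sketch. It assumes the usual product-moment definition of r and uses made-up data in which Y is exactly the square of X, so the relationship is perfect but not linear.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

Because the data are symmetric about zero, the positive and negative contributions cancel and r carries no trace of the relationship; a scatter diagram would show the curve immediately.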
3.1 Pearson Product-Moment Correlation
In the study of relationships, two variables are said to be correlated if change in one variable
is accompanied by change in the other - either in the same or reverse direction.
This coefficient is used if two conditions are satisfied
First, it tells us the direction of the relationship. Once the coefficient is computed, ρ > 0
indicates a positive relationship, ρ < 0 indicates a negative relationship, while ρ = 0 indicates
that no linear relationship exists.
Second, it ensures (mathematically) that the numerical value of ρ ranges from -1.0 to +1.0. This
enables us to get an idea of the strength of the relationship, or rather the strength of the linear
relationship between the variables. The closer the coefficient is to +1.0 or -1.0, the greater
the strength of the linear relationship.
As a rule of thumb, the following guidelines are often useful (though many experts could
somewhat disagree on the choice of boundaries).
Range of ρ
Value of ρ                      Strength of relationship
-1.0 to -0.5 or 0.5 to 1.0      Strong
-0.5 to -0.3 or 0.3 to 0.5      Moderate
-0.3 to -0.1 or 0.1 to 0.3      Weak
-0.1 to 0.1                     None or very weak
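The rule of thumb in the table can be written as a small lookup. This is only a sketch: the function name is hypothetical, and because the table's ranges share their end points, the code assumes that a boundary value belongs to the stronger category.

```python
def strength_of_relationship(rho: float) -> str:
    """Map a correlation coefficient to the rule-of-thumb label from the table."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1.0 and +1.0")
    size = abs(rho)  # the sign gives the direction, not the strength
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "none or very weak"

for value in (0.72, -0.40, 0.20, 0.05):
    print(f"{value:+.2f} -> {strength_of_relationship(value)}")
```

The labels describe the strength of the linear relationship only, and the cut-off values are conventions rather than fixed rules.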
Properties of ρ
This measure of correlation has interesting properties, some of which are enunciated below:
1. People often tend to forget or gloss over the fact that ρ is a measure of linear relationship.
Consequently, a small value of ρ is often interpreted to mean that no relationship exists,
when it actually indicates only that there is no linear relationship, or at best a very
weak one.
A scatter diagram can reveal this, and one is well advised to examine it
before firmly concluding that no relationship exists. If the scatter diagram points to
a non-linear relationship, an appropriate transformation can often attain linearity, in which
case ρ can be recomputed.
2. One has to be careful in interpreting the value of ρ. For example, one could compute ρ
between shoe size and intelligence, or between height and income. Irrespective of the
value of ρ, such a correlation makes no sense and is hence termed a chance or nonsense
correlation.
3. ρ should not be used to say anything about a cause-and-effect relationship. Put differently,
by examining the value of ρ, we could conclude that variables X and Y are related.
However, the same value of ρ does not tell us whether X influences Y or the other way round,
a fact that is of grave import in regression analysis.
3.2 Spearman Rank Correlation Coefficient
Whenever we are interested in knowing whether two variables are related to each other, we use
a statistical technique known as correlation. If a change in one variable is accompanied by a
change in the other variable, they are said to be correlated.
A well-known measure of correlation is the Pearson product-moment correlation coefficient,
which can be calculated if the data are on an interval or ratio scale.
The Spearman Rank Correlation Coefficient is its analogue when the data are in terms of ranks.
One can therefore also call it the correlation coefficient between the ranks. The correlation
coefficient is sometimes denoted by rs.
As an example, let us consider a musical (solo vocal) talent contest where 10 competitors are
evaluated by two judges, A and B. Usually judges award numerical scores for each contestant
after his/her performance.
A product moment correlation coefficient of scores by the two judges hardly makes sense
here as we are not interested in examining the existence or otherwise of a linear relationship
between the scores.
What makes more sense is the correlation between the ranks of the contestants as judged by
the two judges. The Spearman Rank Correlation Coefficient can indicate whether the judges
agree with each other's views as far as the talent of the contestants is concerned (though they
might award different numerical scores), in other words whether the judges are unanimous.
rs = correlation coefficient
In general,
Assigning Ranks
In order to compute Spearman Rank Correlation Coefficient, it is necessary that the data be
ranked. There are a few issues here.
Contestant No.       1    2    3    4    5    6    7    8    9    10
Score by Judge A     5    9    3    8    6    7    4    8    4    6
Score by Judge B     7    8    6    7    8    5    10   6    5    8
Ranks are assigned separately for the two judges either starting from the highest or from the
lowest score. Here, the highest score given by Judge A is 9.
If we begin from the highest score, we assign rank 1 to contestant 2 corresponding to the
score of 9.
The second highest score is 8 but two competitors have been awarded the score of 8. In this
case both the competitors are assigned a common rank which is the arithmetic mean of ranks
2 and 3. In this way, scores of Judge A can be converted into ranks.
Similarly, ranks are assigned to the scores awarded by Judge B, and then the differences
between the ranks for each contestant are used to evaluate rs. For the above example, the
ranks are as follows:
Contestant No.                 1     2     3     4     5     6     7     8     9     10
Ranks of scores by Judge A     7     1     10    2.5   5.5   4     8.5   2.5   8.5   5.5
Ranks of scores by Judge B     5.5   3     7.5   5.5   3     9.5   1     7.5   9.5   3
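The ranking procedure can be reproduced in a few lines. The sketch below uses the judges' scores from the example, assigns tied scores the mean of the ranks they occupy, and then computes rs as the product-moment correlation of the two sets of ranks. That last step is one common way of handling ties and is an assumption here; the shortcut formula based on squared rank differences is exact only when there are no ties. The function names are hypothetical.

```python
from math import sqrt

def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

judge_a = [5, 9, 3, 8, 6, 7, 4, 8, 4, 6]
judge_b = [7, 8, 6, 7, 8, 5, 10, 6, 5, 8]

ranks_a = average_ranks(judge_a)
ranks_b = average_ranks(judge_b)
r_s = pearson_r(ranks_a, ranks_b)  # correlation between the ranks
print(ranks_a)
print(ranks_b)
print(f"r_s = {r_s:.3f}")
```

The computed ranks match the table above, and for these scores the coefficient comes out close to zero, which suggests that the two judges agree very little.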
Spearman Rank Correlation Coefficient tries to assess the relationship between ranks without
making any assumptions about the nature of their relationship.
Hence it is a non-parametric measure, a feature which has contributed to its popularity and
widespread use.
Another advantage with this measure is that it is much easier to use since it does not matter
which way we rank the data, ascending or descending. We may assign rank 1 to the smallest
value or the largest value, provided we do the same thing for both sets of data.
The only requirement is that data should be ranked or at least converted into ranks.
(Mar 20, 2009). Spearman Rank Correlation Coefficient. Retrieved from
3.3 Partial Correlation Analysis
Partial correlation analysis involves studying the linear relationship between two
variables after excluding the effect of one or more independent factors.
Simple correlation does not prove to be an all-encompassing technique especially under the
above circumstances. In order to get a correct picture of the relationship between two
variables, we should first eliminate the influence of other variables.
For example, study of partial correlation between price and demand would involve studying
the relationship between price and demand excluding the effect of money supply, exports, etc.
In simple correlation, we measure the strength of the linear relationship between two
variables, without taking into consideration the fact that both these variables may be
influenced by a third variable.
For example, when we study the correlation between price (dependent variable) and demand
(independent variable), we completely ignore the effect of other factors like money supply,
imports and exports, etc., which definitely have a bearing on the price.
The correlation co-efficient between two variables X1 and X2, studied partially after eliminating
the influence of the third variable X3 from both of them, is the partial correlation co-efficient
r12.3.
Simple correlation between two variables is called the zero order co-efficient since in simple
correlation, no factor is held constant. The partial correlation studied between two variables by
keeping the third variable constant is called a first order co-efficient, as one variable is kept
constant. Similarly, we can define a second order co-efficient and so on. The partial
correlation co-efficient varies between -1 and +1. Its calculation is based on the simple
correlation co-efficient.
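Since the first order co-efficient is built from the simple correlation co-efficients, it can be sketched directly. The snippet assumes the standard formula r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)); the input correlations are invented values used only for illustration.

```python
from math import sqrt

def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """First order partial correlation of X1 and X2 with X3 held constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2))

# Hypothetical simple (zero order) correlations.
strong_link = partial_correlation(r12=0.8, r13=0.6, r23=0.5)
mostly_third_variable = partial_correlation(r12=0.5, r13=0.7, r23=0.7)
print(f"{strong_link:.3f}  {mostly_third_variable:.3f}")
```

In the second case the simple correlation of 0.5 almost vanishes once X3 is held constant, which is the situation partial correlation is designed to expose.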
The partial correlation analysis assumes great significance in cases where the phenomena
under consideration have multiple factors influencing them, especially in physical and
experimental sciences, where it is possible to control the variables and the effect of each
variable can be studied separately. This technique is of great use in various experimental
designs where various interrelated phenomena are to be studied.
However, this technique suffers from some limitations, some of which are stated below.
The calculation of the partial correlation co-efficient is based on the simple correlation
co-efficient. However, the simple correlation co-efficient assumes a linear relationship.
Generally this assumption is not valid, especially in the social sciences, as a linear
relationship rarely exists in such phenomena.
As the order of the partial correlation co-efficient goes up, its reliability goes down.
Its calculation is somewhat cumbersome, and often difficult for the mathematically uninitiated
(though software has made life a lot easier).
Multiple Correlation
Another technique used to overcome the drawbacks of simple correlation is multiple
regression analysis.
Here, we study the effects of all the independent variables simultaneously on a dependent
variable. For example, the correlation co-efficient between the yield of paddy (X1) and the
other variables, viz. type of seedlings (X2), manure (X3), rainfall (X4), humidity (X5) is the
multiple correlation co-efficient R1.2345 . This co-efficient takes value between 0 and +1.
The limitations of multiple correlation are similar to those of partial correlation. If multiple and
partial correlation are studied together, a very useful analysis of the relationship between the
different variables is possible.
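For the simplest case of one dependent variable and two independent variables, the multiple correlation co-efficient can be sketched from the three simple correlations. The snippet assumes the standard two-predictor formula R1.23² = (r12² + r13² - 2 r12 r13 r23) / (1 - r23²); the paddy example with four predictors would need the general matrix form, which is not shown. The input values are invented.

```python
from math import sqrt

def multiple_correlation(r12: float, r13: float, r23: float) -> float:
    """Multiple correlation R1.23 of X1 with X2 and X3 taken together."""
    r_squared = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return sqrt(r_squared)

# Hypothetical simple correlations between X1, X2 and X3.
R = multiple_correlation(r12=0.8, r13=0.6, r23=0.5)
print(f"R1.23 = {R:.3f}")
```

The result is at least as large as either simple correlation with X1, because adding a second independent variable cannot reduce the explained variation.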
3.4 Correlation and Causation
Causality is the area of statistics that is most commonly misused, and misinterpreted, by non-
specialists. Media sources, politicians and lobby groups often leap upon a perceived
correlation, and use it to 'prove' their own beliefs. They fail to understand that, just because
results show a correlation, there is no proof of an underlying causality.
Many people assume that because a poll, or a statistic, contains many numbers, it must be
scientific, and therefore correct.
Overcoming this tendency is part of the academic training of students and academics in most
fields, from physics to the arts. The ability to evaluate data objectively is absolutely crucial to
academic success.
The results seemed to show a correlation between the two variables, so the paper printed the
headline: "Parental smoking causes children to misbehave." The Professor leading the
investigation stated that cigarette packets should carry warnings about social issues alongside
the prominent health warnings.
However, there are a number of problems with this assumption. The first is that correlations
can often work in reverse. For example, it is perfectly possible that the parents smoked
because of the stress of looking after delinquent children.
Another cause may be that social class causes the correlation; the lower classes are usually
more likely to smoke and are more likely to have delinquent children. Therefore, parental
smoking and delinquency are both symptoms of the problem of poverty and may well have no
direct link between them.
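The third-variable explanation can be illustrated with a small simulation. Everything in it is synthetic: the variable names echo the example, but the numbers are random draws, not real data, and the model (both quantities equal to a hidden factor plus independent noise) is an assumption made for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so that the run is repeatable
n = 5000
poverty = [random.gauss(0, 1) for _ in range(n)]          # hidden third variable
smoking = [z + random.gauss(0, 1) for z in poverty]       # driven by poverty only
delinquency = [z + random.gauss(0, 1) for z in poverty]   # driven by poverty only

r_sd = pearson_r(smoking, delinquency)
r_sp = pearson_r(smoking, poverty)
r_dp = pearson_r(delinquency, poverty)

# Correlation between smoking and delinquency once poverty is held constant.
partial = (r_sd - r_sp * r_dp) / sqrt((1 - r_sp**2) * (1 - r_dp**2))
print(f"simple r = {r_sd:.2f}, partial r = {partial:.2f}")
```

In this synthetic setting the two variables are clearly correlated even though neither influences the other, and the correlation largely disappears once the hidden factor is controlled for.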
The principle of correlation and causation is very important for anybody working as a scientist
or researcher. It is also a useful principle for non-scientists, especially those studying politics,
media and marketing. Understanding causality promotes a greater understanding, and honest
evaluation of the alleged facts given by pollsters.
Imagine an expensive advertising campaign, based around intense market research, where
misunderstanding a correlation could cost a lot of money in advertising, production costs, and
damage to the company's reputation.
Martyn Shuttleworth (Feb 26, 2008). Correlation and Causation. Retrieved from
4 Regression
Linear regression analysis is a powerful technique used for predicting the unknown
value of a variable from the known value of another variable.
More precisely, if X and Y are two related variables, then linear regression analysis helps us
to predict the value of Y for a given value of X, or vice versa.
For example, the age of a human being and maturity are related variables.
Linear regression analysis can then predict the level of maturity given the age
of a human being.
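Y = a + bX
Two such lines can be constructed: the regression line of Y on X and the regression line of
X on Y. Usually only one of these lines makes sense.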
Exactly which of these will be appropriate for the analysis in hand will depend on labeling of
dependent and independent variable in the problem to be analyzed.
For example, consider two variables crop yield (Y) and rainfall (X). Here
construction of regression line of Y on X would make sense and would
be able to demonstrate the dependence of crop yield on rainfall. We
would then be able to estimate crop yield given rainfall.
Careless use of linear regression analysis could mean construction of regression line of X on
Y which would demonstrate the laughable scenario that rainfall is dependent on crop yield;
this would suggest that if you grow really big crops you will be guaranteed a heavy rainfall.
Regression Coefficient
The coefficient of X in the line of regression of Y on X is called the regression coefficient of Y
on X. It represents change in the value of dependent variable (Y) corresponding to unit
change in the value of independent variable (X).
For instance if the regression coefficient of Y on X is 0.53 units, it would indicate that Y will
increase by 0.53 if X increased by 1 unit. A similar interpretation can be given for the
regression coefficient of X on Y.
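A minimal least-squares sketch of the regression line of Y on X follows. The rainfall and crop yield figures are invented for illustration, and the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b of the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # regression coefficient of Y on X
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical observations (arbitrary units).
rainfall = [1.0, 2.0, 3.0, 4.0, 5.0]
crop_yield = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(rainfall, crop_yield)
predicted = a + b * 6.0  # estimated yield for a rainfall of 6 units
print(f"Y = {a:.2f} + {b:.2f} X, predicted yield at X = 6: {predicted:.2f}")
```

The slope b is read exactly as described above: each additional unit of rainfall is associated with an increase of b units in the estimated yield.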
Once a line of regression has been constructed, one can check how good it is (in terms of
predictive ability) by examining the coefficient of determination (R²). R² always lies between 0
and 1. Most statistical software reports it whenever a regression procedure is run.
R² - coefficient of determination
The closer R² is to 1, the better the model and its predictions. A related question is whether
the independent variable significantly influences the dependent variable. Statistically, it is
equivalent to testing the null hypothesis that the regression coefficient is zero. This can be
done using a t-test.
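Both quantities can be computed by hand for a small data set. The sketch below assumes ordinary least squares with a single independent variable, computes the coefficient of determination as one minus the ratio of unexplained to total variation, and forms the t statistic for the null hypothesis that the slope is zero. The data are invented, and turning the t statistic into a p-value would still require a t table with n - 2 degrees of freedom.

```python
from math import sqrt

def regression_summary(x, y):
    """Fit Y = a + bX; return a, b, the coefficient of determination and the t statistic of b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                    # total variation
    r2 = 1.0 - ss_res / ss_tot
    std_err_b = sqrt(ss_res / (n - 2) / sxx)  # standard error of the slope
    t_stat = b / std_err_b                    # tests H0: slope = 0
    return a, b, r2, t_stat

# Hypothetical data (arbitrary units).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b, r2, t_stat = regression_summary(x, y)
print(f"R^2 = {r2:.4f}, t = {t_stat:.1f}")
```

For these nearly collinear points the coefficient of determination is very close to 1 and the t statistic is large, so the slope would be judged significantly different from zero.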
Assumption of Linearity
Linear regression does not test whether data is linear. It finds the slope and the intercept
assuming that the relationship between the independent and dependent variable can be best
explained by a straight line.
One can construct a scatter plot to check this assumption. If the scatter plot reveals a
non-linear relationship, a suitable transformation can often be used to attain linearity.
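As an illustration of such a transformation, the sketch below takes data that grow exponentially and fits a straight line to the logarithm of Y. The data are generated from an assumed curve Y = 2 exp(0.5 X) purely for demonstration; the logarithm is only one of many possible transformations.

```python
from math import exp, log

def fit_line(x, y):
    """Least-squares intercept a and slope b of the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical exponential data: Y = 2 * exp(0.5 * X).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * exp(0.5 * xi) for xi in x]

# log(Y) = log(2) + 0.5 * X is linear in X, so a straight line now fits.
log_y = [log(yi) for yi in y]
a, b = fit_line(x, log_y)
print(f"growth rate = {b:.3f}, scale = {exp(a):.3f}")
```

The fitted slope recovers the growth rate and the exponential of the intercept recovers the scale factor of the assumed curve.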
4.2 Multiple Regression Analysis
Multiple regression analysis is a powerful technique used for predicting the unknown
value of a variable from the known values of two or more variables, also called the
predictors.
More precisely, multiple regression analysis helps us to predict the value of Y for given values
of X1, X2, …, Xk.
For example, the yield of rice per acre depends upon the quality of seed,
fertility of soil, fertilizer used, temperature and rainfall. If one is interested in
studying the joint effect of all these variables on rice yield, one can use this
technique.
An additional advantage of this technique is that it also enables us to study
the individual influence of these variables on yield.
Y = b0 + b1 X1 + b2 X2 + … + bk Xk
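The coefficients of this equation are usually estimated by least squares. The sketch below does so for two predictors by solving the normal equations with a small Gaussian elimination routine. The data are generated from an assumed model Y = 1 + 2 X1 + 3 X2 without noise, so that the recovered coefficients can be checked; the function names are hypothetical, and statistical software would normally be used instead.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

def fit_multiple(predictors, y):
    """Return [b0, b1, ..., bk] minimising the squared error of Y = b0 + sum(bi * Xi)."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)  # normal equations: (X'X) b = X'y

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

b0, b1, b2 = fit_multiple(predictors, y)
print(f"Y = {b0:.2f} + {b1:.2f} X1 + {b2:.2f} X2")
```

Each coefficient is interpreted as the change in Y for a unit change in that predictor while the other predictor is held fixed.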
The appropriateness of the multiple regression model as a whole can be tested by the F-test
in the ANOVA table. A significant F indicates a linear relationship between Y and at least one
of the X's.
R² - coefficient of determination
Most statistical software reports it whenever a regression procedure is run. The closer R² is
to 1, the better the model and its predictions.
A related question is whether the independent variables individually influence the dependent
variable significantly. Statistically, it is equivalent to testing the null hypothesis that the
relevant regression coefficient is zero.
This can be done using a t-test. If the t-test of a regression coefficient is significant, it
indicates that the variable in question influences Y significantly while controlling for the other
independent explanatory variables.
The multiple regression technique does not test whether the data are linear. On the contrary, it
proceeds by assuming that the relationship between Y and each of the Xi's is linear. Hence, as
a rule, it is prudent to always look at the scatter plots of (Y, Xi), i = 1, 2, …, k. If any plot
suggests non-linearity, one may use a suitable transformation to attain linearity.
Multiple regression analysis is used when one is interested in predicting a continuous
dependent variable from a number of independent variables. If dependent variable is
dichotomous, then logistic regression should be used.
4.3 Correlation and Regression
Correlation and linear regression are the most commonly used techniques for
investigating the relationship between two quantitative variables.
The goal of a correlation analysis is to see whether two measurement variables co-vary, and
to quantify the strength of the relationship between the variables, whereas regression
expresses the relationship in the form of an equation.
For example, for students taking a Maths and an English test, we could use correlation to
determine whether students who are good at Maths tend to be good at English as well, and
regression to determine whether the marks in English can be predicted for given marks in
Maths.
A Caveat
It must, however, be considered that there may be a third variable related to both of the
variables being investigated, which is responsible for the apparent correlation. Correlation
does not imply causation. Also, a nonlinear relationship may exist between two variables that
would be inadequately described, or possibly even undetected, by the correlation coefficient.
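The distinction between the two techniques, and the caveat about nonlinear relationships, can be illustrated with a small sketch. The marks below are invented for illustration, and the helper names pearson_r and least_squares_line are hypothetical, not taken from any particular package.

```python
from math import sqrt


def _sums_of_squares(x, y):
    """Return the mean of x, the mean of y, and Sxy, Sxx, Syy."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(x, y):
    """Strength of the linear relationship between x and y."""
    _, _, sxy, sxx, syy = _sums_of_squares(x, y)
    return sxy / sqrt(sxx * syy)


def least_squares_line(x, y):
    """Intercept and slope of the least squares line y = a + b * x."""
    mean_x, mean_y, sxy, sxx, _ = _sums_of_squares(x, y)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical marks for eight students.
maths = [35, 48, 52, 60, 66, 71, 80, 88]
english = [40, 45, 58, 55, 70, 68, 79, 85]

r_marks = pearson_r(maths, english)                     # correlation: how strongly do they covary?
intercept, slope = least_squares_line(maths, english)   # regression: the equation
predicted_english = intercept + slope * 75              # predicted English mark for a Maths mark of 75

# A perfect but nonlinear relationship (y = x squared) that correlation misses.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

print(f"marks: r = {r_marks:.3f}, English = {intercept:.2f} + {slope:.3f} * Maths")
print(f"curve: r = {r_curve:.3f}")
```

In the second data set y is completely determined by x, yet the correlation coefficient is zero because the relationship is not linear. This is why a scatter plot should be inspected before relying on r.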
Why Use Regression
In regression analysis, the problem of interest is the nature of the relationship itself between
the dependent variable (response) and the (explanatory) independent variable.
The analysis consists of choosing and fitting an appropriate model, done by the method of
least squares, with a view to exploiting the relationship between the variables to help estimate
the expected response for a given value of the independent variable. For example, if we are
interested in the effect of age on height, then by fitting a regression line, we can predict the
height for a given age.
Some underlying assumptions governing the uses of correlation and regression are as follows.
The observations are assumed to be independent. For correlation, both variables should be
random variables, but for regression only the dependent variable Y must be random. In
carrying out hypothesis tests, the response variable should follow a normal distribution and the
variability of Y should be the same for each value of the predictor variable. A scatter diagram
of the data provides an initial check of the assumptions for regression.
The second main use for correlation and regression is to see whether two variables are
associated, without necessarily inferring a cause-and-effect relationship. In this case,
neither variable is determined by the experimenter; both are naturally variable. If an
association is found, the inference is that variation in X may cause variation in Y, or
variation in Y may cause variation in X, or variation in some other factor may affect both
X and Y.
The third common use of linear regression is estimating the value of one variable
corresponding to a particular value of the other variable.
(Jan 18, 2010). Correlation and Regression. Retrieved from
5 Student’s T-Test
The student's t-test is a statistical method that is used to see if two sets of data differ
significantly.
The method assumes that the test statistic follows Student's t-distribution if the null
hypothesis is true. This null hypothesis will usually stipulate that there is
no significant difference between the means of the two data sets.
It is best used to try and determine whether there is a difference between two independent
sample groups. For the test to be applicable, the sample groups must be completely
independent, and it is best used when the sample size is too small to use more advanced
methods.
Before using this type of test it is essential to plot the sample data from the two samples and
make sure that it has a reasonably normal distribution, or the student's t test will not be
suitable. It is also desirable to randomly assign samples to the groups, wherever possible.
You might be trying to determine if there is a significant difference in test scores between two
groups of children taught by different methods.
The null hypothesis might state that there is no significant difference in the mean test scores
of the two sample groups and that any difference is down to chance.
The student's t test can then be used to try and reject the null hypothesis.
The two sample groups being tested must have a reasonably normal distribution. If the
distribution is skewed, then the student's t test is likely to throw up misleading results. The
distribution should have only one main peak (= mode) near the mean of the group.
If the data does not adhere to the above parameters, then either a large data sample is
needed or, preferably, a more complex form of data analysis should be used.
The student's t test can let you know if there is a significant difference in the means of the two
sample groups and disprove the null hypothesis. Like all statistical tests, it cannot prove
anything, as there is always a chance of experimental error occurring. But the test can support
a hypothesis.
However, it is still useful for measuring small sample populations and determining if there is a
significant difference between the groups.
Martyn Shuttleworth (Feb 19, 2008). Student’s T-Test. Retrieved from
5.1 Independent One-Sample T-Test
An independent one-sample t-test is used to test whether the average of a sample differs
significantly from a population mean, a specified value μ0.
When you compare each sample to a "known truth", you would use the (independent) one-
sample t-test. If you are comparing two samples not strictly related to each other, the
independent two-sample t-test is used.
Any single sample statistical test that uses t-distribution can be called a 'one-sample t-test'.
This test is used when we have a random sample and we want to test if it is significantly
different from a population mean.
Hypothesis to Be Tested
Generally speaking, this test involves testing the null hypothesis H0: μ = μ0 against the
alternative hypothesis, H1: μ ≠ μ0 where μ is the population mean and μ0 is a specific value of the
population mean that we would like to test for acceptance.
An example may clarify the calculation and hypothesis testing of the independent one-sample
t-test better.
An Example
Suppose that the teacher of a school claims that an average student of his school studies 8
hours per day during weekends and we desire to test the truth of this claim.
The statistical methodology for this purpose requires that we begin by first specifying the
hypothesis to be tested.
In this case, the null hypothesis would be H0: μ = 8, which essentially states that the mean hours of
study per day is no different from 8 hours. The alternative hypothesis is H1: μ ≠ 8, which
is the negation of the teacher's claim.
Collecting Samples
In the next step, we take a sample of say 10 students of the school and collect data on how
long they study during weekends. From these data we compute the sample mean.
We cannot infer anything directly from this mean as to whether the claim is to be accepted or
rejected, as it could very well have happened by sheer luck (even though the sample was
drawn randomly) that the students included in the sample were those who studied fewer
than 8 hours.
On the other hand, it could also be the case that the claim was indeed inappropriate.
If the null hypothesis is rejected, it means that the sample came from a population with mean
study hours significantly different from 8 hours.
On the other hand, if the null hypothesis is not rejected, it means that there is no evidence to
suggest that average study hours were significantly different from 8 hours, so the data are
consistent with the claim.
This test is one of the most popular small sample tests, widely used in all disciplines - medicine,
behavioral science, physical science etc. However, this test can be used only if the
background assumptions are satisfied.
The population from which the sample has been drawn should be normal - appropriate
statistical methods exist for testing this assumption (For example the Kolmogorov
Smirnov non parametric test). It has however been shown that minor departures from
normality do not affect this test - this is indeed an advantage.
The population standard deviation is not known.
Sample observations should be random.
The tests used for problems involving large samples are different from the ones
used for small samples. We often use the z-test for large samples.
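The study-hours example can be sketched in a few lines. The ten observations below are invented for illustration (the text does not give the data), and one_sample_t is a hypothetical helper name. The critical value 2.262 is the two-sided 5% point of the t-distribution with 9 degrees of freedom, as found in standard t-tables.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mu = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # stdev uses the n - 1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1


# Hypothetical weekend study hours for a random sample of 10 students.
hours = [7.5, 6.0, 8.0, 7.0, 9.0, 6.5, 7.0, 8.5, 6.0, 7.5]

t_stat, df = one_sample_t(hours, mu0=8.0)

T_CRITICAL = 2.262                      # two-sided, alpha = 0.05, df = 9 (from a t-table)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f} on {df} degrees of freedom, reject H0: {reject_h0}")
```

With these particular numbers the sample mean is below 8 hours, but the t statistic does not exceed the critical value, so the null hypothesis is not rejected at the 5% level. A different sample could of course lead to the opposite decision.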
5.2 Independent Two-Sample T-Test
The independent two-sample t-test is used to test whether population means are
significantly different from each other, using the means from randomly drawn samples.
Any statistical test that uses two samples drawn independently of each other and using t-
distribution, can be called a 'two-sample t-test'.
Hypothesis Testing
Generally speaking, this test involves testing the null hypothesis H0: μ(x) = μ(y) against the
alternative research hypothesis, H1: μ(x) ≠ μ(y) where μ(x) and μ(y) are respectively the
population mean of the two populations from which the two samples have been drawn.
An Example
Suppose that a school has two buildings - one for girls and the other for boys. Suppose that
the principal wants to know if the pupils of the two buildings are working equally hard, in the
sense that they put in an equal number of hours in studies on the average.
Statistically speaking, the principal is interested in testing whether the average number of
hours studied by boys is significantly different from the average for girls.
1. To calculate, we begin by specifying the hypothesis to be tested.
In this case, the null hypothesis would be H0: μ(boys) = μ(girls), which essentially states
that mean study hours for boys and girls are no different.
2. In the second step, we take a sample of say 10 students from the boys' building and 15
from the girls' building and collect data on how long they study daily. These 10 and 15
different study hours are our two samples.
It is not difficult to see that the two samples have been drawn independent of each
other - an essential requirement of the independent two-sample t-test.
Suppose that the sample mean turns out to be 7.25 hours for boys and 8.5 for girls. We
cannot infer anything directly from these sample means - specifically as to whether boys
and girls were equally hard working, as it could very well have happened by sheer luck
(even though the samples were drawn randomly) that boys included in the boys' sample
were those who studied fewer hours.
On the other hand, it could also be the case that girls were indeed working harder than
boys.
3. The third step would involve performing the independent two-sample t-test which helps
us to either accept or reject the null hypothesis.
If the null hypothesis is rejected, it means that two buildings were significantly different in
terms of number of hours of hard work.
On the other hand if the null hypothesis is accepted, one can conclude that there is no
evidence to suggest that the two buildings differed significantly and that boys and girls
can be said to be at par.
Along with the independent one-sample t-test, this test is one of the most widely used tests.
However, this test can be used only if the background assumptions are satisfied.
The populations from which the samples have been drawn should be normal -
appropriate statistical methods exist for testing this assumption (For example, the
Kolmogorov Smirnov non-parametric test). One needs to note that the normality
assumption has to be tested individually and separately for the two samples. It has
however been shown that minor departures from normality do not affect this test - this is
indeed an advantage.
The standard deviations of the populations should be equal, i.e. σX² = σY² = σ², where σ² is
unknown. This assumption can be tested by the F-test.
Samples have to be randomly drawn independent of each other. There is however no
requirement that the two samples should be of equal size - often times they would be
unequal though the odd case of equal size cannot be ruled out.
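The three steps of the example can be sketched as follows. The individual study hours are invented so that the sample means match the 7.25 and 8.5 hours quoted above, and pooled_two_sample_t is a hypothetical helper name. The sketch uses the equal-variance (pooled) form of the test, in line with the assumptions just listed. The critical value 2.069 is the two-sided 5% point of the t-distribution with 23 degrees of freedom, taken from standard t-tables.

```python
from math import sqrt
from statistics import mean, variance


def pooled_two_sample_t(x, y):
    """Return the t statistic and degrees of freedom for H0: mu(x) = mu(y),
    assuming both populations share the same unknown variance."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_variance = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    standard_error = sqrt(pooled_variance * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / standard_error, df


# Hypothetical daily study hours: 10 boys (mean 7.25) and 15 girls (mean 8.5).
boys = [6.5, 7.0, 8.0, 7.5, 6.0, 7.0, 8.5, 7.5, 7.0, 7.5]
girls = [8.0, 9.0, 8.5, 7.5, 9.5, 8.0, 8.5, 9.0, 7.0, 8.5,
         9.0, 8.0, 10.0, 8.5, 8.5]

t_stat, df = pooled_two_sample_t(boys, girls)

T_CRITICAL = 2.069                      # two-sided, alpha = 0.05, df = 23 (from a t-table)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f} on {df} degrees of freedom, reject H0: {reject_h0}")
```

Note that the two samples have different sizes, which the test allows. For these invented data the null hypothesis of equal mean study hours is rejected at the 5% level.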
(Oct 12, 2009). Independent Two-Sample T-Test. Retrieved from
5.3 Dependent T-Test for Paired Samples
The dependent t-test for paired samples is used when the samples are paired. This
implies that each individual observation of one sample has a unique corresponding
member in the other sample.
The two samples have been "matched" or "paired" in some way (a matched subjects design).
The emphasis being on pairing of observations, it is obvious that the samples are dependent -
hence the name.
Any statistical test involving paired samples and using t-distribution can be called 't-test for
paired samples'.
An Example
Let us illustrate the meaning of a paired sample. Suppose that we are required to examine if a
newly developed intervention program for disadvantaged students has an impact. For this
purpose, we need to obtain scores from a sample of n such students in a standardized test
before administering the program.
After the program is over, the same test needs to be administered to the same group of
students and scores obtained again.
There are two samples: 1) the sample of prior intervention scores (pretest) and, 2) the post
intervention scores (posttest). The samples are related in the sense that each pretest has a
corresponding posttest as both were obtained from the same student.
If the score of each student (ith) before and after the program is xi and yi respectively, then
the pair (xi, yi) corresponds to the same subject (student in this case).
This is what is meant by paired sample. It is very important that the two scores for each individual
student be correctly identified and labeled, as the signed differences di = yi - xi are used to determine
the test statistic and consequently the p-value.
1. With the above framework, the null hypothesis would be H0: there is no significant
difference between pre and post intervention scores, which essentially states that the
intervention program was not effective. The alternative hypothesis is H1: there is
significant difference between pre and post intervention scores.
2. Once the hypotheses have been framed, the second step involves taking the sample of
pre and post intervention scores and determining the mean of the signed differences
di = yi - xi. Logically speaking, a mean difference close to zero could indicate truth of
the null hypothesis.
On the other hand, a large mean difference could indicate that the program was indeed useful.
3. The third step involves performing the dependent t-test for paired samples which helps
us to either accept or reject the null hypothesis. If the null hypothesis is rejected, one
can infer that the program was useful.
On the other hand if the null hypothesis is accepted, one can conclude that there is no
evidence to suggest the program did have an impact.
This test has a few background assumptions which need to be satisfied.
1. The sample of differences (di's) should be normal - an assumption that can be tested -
for instance by the Kolmogorov Smirnov non-parametric test.
It has however been shown that minor departures from normality do not affect this test -
this is indeed an advantage.
2. The samples should be dependent and it should be possible to identify specific pairs.
3. An obvious requirement is that the two samples should be of equal size.
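The procedure can be sketched as a one-sample t-test on the signed differences (posttest minus pretest). The scores below are invented for illustration and paired_t is a hypothetical helper name. The critical value 2.365 is the two-sided 5% point of the t-distribution with 7 degrees of freedom, taken from standard t-tables.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return the t statistic and degrees of freedom for H0: mean difference = 0."""
    if len(before) != len(after):
        raise ValueError("paired samples must be of equal size")
    # Signed differences: each student's posttest score minus the same student's pretest score.
    differences = [post - pre for pre, post in zip(before, after)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical scores for 8 students; position i in both lists is the same student.
pretest = [52, 60, 45, 70, 58, 63, 49, 55]
posttest = [58, 64, 47, 75, 57, 70, 55, 59]

t_stat, df = paired_t(pretest, posttest)

T_CRITICAL = 2.365                      # two-sided, alpha = 0.05, df = 7 (from a t-table)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f} on {df} degrees of freedom, reject H0: {reject_h0}")
```

Keeping the sign of each difference matters: one student in this invented sample scored lower after the program, and that negative difference correctly pulls the mean difference down.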
For Small Samples
This test is a small sample test. It is difficult to draw a clear line of demarcation between
large and small samples.
Statisticians have generally agreed that a sample may be considered small if its size is < 30
(below 30).
(Feb 26, 2009). Dependent T-Test for Paired Samples. Retrieved from
5.4 Student’s T-Test (II)
Any statistical test that uses t distribution can be called a t-test, or the "student's t-
test". It is basically used when the sample size is small i.e. n<30.
For example, if a person wants to test the hypothesis that the mean height
of students of a college is not different from 150 cm, he can take a
sample of size say 20 from the college. From the mean height of these
students, he can test the hypothesis. The test to be used for this
purpose is the t-test.
1. Student's t-test for single mean is used to test a hypothesis on a specific value of the
population mean. Statistically speaking, we test the null hypothesis H0: μ = μ0 against the
alternative hypothesis H1: μ ≠ μ0, where μ is the population mean and μ0 is a specific value
of the population mean that we would like to test for acceptance.
The example on heights of students explained above requires this test. In that example,
μ0 = 150.
2. The t-test for difference of means is used to test the hypothesis that two populations
have the same mean.
For example suppose one is interested to test if there is any significant difference
between the mean height of male and female students in a particular college. In such a
situation, t-test for difference of means can be applied. One would have to take two
independent samples from the college- one from males and the other from females in
order to perform this test.
An additional assumption of this test is that the variance of the two populations is equal.
3. A paired t-test is usually used when the two samples are dependent- this happens
when each individual observation of one sample has a unique relationship with a
particular member of the other sample.
For example we may wish to test if a newly developed intervention program for
disadvantaged students is useful. For this, we need to obtain scores from say 22
students in a standardized test before administering the program. After the program is
over, the same test needs to be administered again on the same group of 22 students
and scores obtained.
The two samples- the sample of prior intervention scores and the sample of post
intervention scores are related as each student has two scores. The samples are
therefore dependent. The paired t-test is applicable in such scenarios.
4. A t-test for correlation coefficient is used for testing an observed sample correlation
coefficient (r).
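For the fourth case, the usual form of the test statistic can be written out as below. This is the standard textbook formula rather than something stated in the text above. The symbol ρ for the population correlation coefficient is introduced here for illustration, and n is the number of paired observations.

```latex
H_0: \rho = 0, \qquad
t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}, \qquad
\text{degrees of freedom} = n - 2
```

A large absolute value of t, relative to the t-distribution on n - 2 degrees of freedom, indicates that the observed sample correlation is unlikely to have arisen from uncorrelated populations.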
Irrespective of the type of t-test used, two assumptions have to be met.
1. the populations from which the samples are drawn are normal.
2. the population standard deviation is not known.
Student's t-test is a small sample test. It is difficult to draw a clear line of demarcation
between large and small samples. Statisticians have generally agreed that a sample may be
considered small if its size is < 30.
The tests used for problems involving large samples are different from the ones
used for small samples. We often use the z-test for large samples.
(Jul 20, 2009). Student’s T-Test (II). Retrieved from
Analysis Of Variance
The Analysis Of Variance, popularly known as the ANOVA, can be used in cases where
there are more than two groups.
When we have only two samples we can use the t-test to compare the means of the samples
but it might become unreliable in case of more than two samples. If we only compare two
means, then the t-test (independent samples) will give the same results as the ANOVA.
It is used to compare the means of more than two samples. This can be understood better
with the help of an example.
It has been termed as one-way as there is only one category whose effect has been studied
and balanced as the same number of men has been assigned on each exercise. Thus the
basic idea is to test whether the samples are all alike or not.
But conducting such multiple t-tests can lead to severe complications and in such
circumstances we use ANOVA. Thus, this technique is used whenever an alternative
procedure is needed for testing hypotheses concerning means when there are several
groups.
This is a case of one-way or one-factor ANOVA since there is only one factor, fertilizer. We
may also be interested to study the effect of fertility of the plots of land. In such a case we
would have two factors, fertilizer and fertility. This would be a case of two-way or two-factor
ANOVA. Similarly, a third factor may be incorporated to have a case of three-way or three-
factor ANOVA.
But this difference may also be the result of certain other factors which are attributed to
chance and which are beyond human control. This factor is termed as “error”. Thus, the
differences or variations that exist within a plot of land may be attributed to error.
Thus, estimates of the amount of variation due to assignable causes (or variance between the
samples) as well as due to chance causes (or variance within the samples) are obtained
separately and compared using an F-test and conclusions are drawn using the value of F.
There are four basic assumptions used in ANOVA.
the errors are independent
they are normally distributed
6.1 One-Way ANOVA
We can say we have a framework for one-way ANOVA when we have a single factor with
three or more levels and multiple observations at each level.
In this kind of layout, we can calculate the mean of the observations within each level of our
factor.
The concepts of factor, levels and multiple observations at each level can be best understood
by an example.
The factor being studied is age. There is just one factor (age) and hence a situation
appropriate for one-way ANOVA.
Further suppose that the employees have been classified into three groups (levels):
less than 40
40 to 55
above 55
These three groups are the levels of factor age - there are three levels here. With this design,
we shall have multiple observations in the form of scores on Occupational Stress from a
number of employees belonging to the three levels of factor age. We are interested to know
whether all the levels i.e. age groups have equal stress on the average.
Non-significance of the test statistic (F-statistic) associated with this technique would imply
that age has no effect on stress experienced by employees in their respective occupations.
On the other hand, significance would imply that stress afflicts different age groups differently.
Hypothesis Testing
Formally, the null hypothesis to be tested is of the form:
H0: All the age groups have equal stress on the average, or μ1 = μ2 = μ3
where μ1, μ2, μ3 are the mean stress scores for the three age groups.
H1: The mean stress of at least one age group is significantly different.
In the above example, if we considered only two age groups, say below 40 and above 40,
then the independent samples t-test would have been enough although application of ANOVA
would have also produced the same result.
In the example considered above, there were three age groups and hence it was necessary to
use one-way ANOVA.
Often the interest is on acceptance or rejection of the null hypothesis. If it is rejected, this
technique will not identify the level which is significantly different. One has to perform t-tests
for this purpose.
This implies that if there exists difference between the means, we would have to carry out 3C2
independent t-tests in order to locate the level which is significantly different. It would be kC2
number of t-tests in the general one-way ANOVA design with k levels.
One of the principal advantages of this technique is that the number of observations need not
be the same in each group.
For the validity of the results, some assumptions must be checked to hold before the
technique is applied. These are:
Each level of the factor is applied to a sample. The population from which the sample
was obtained must be normally distributed.
The samples must be independent.
The variances of the population must be equal.
Of the three principles of experimental design (replication, randomization and local control),
only replication and randomization have to be satisfied while designing
and implementing any one-way ANOVA experiment.
Replication refers to the application of each individual level of the factor to multiple subjects.
In the above example, in order to apply the principle of replication, we had obtained
occupational stress scores from more than one employee in each level (age group).
Randomization refers to the random allocation of the experimental units. In our example,
employees were selected randomly for each of the age groups.
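The F-statistic for the occupational stress example can be sketched as below. The stress scores are invented for illustration and one_way_anova_f is a hypothetical helper name. The critical value 3.89 is the upper 5% point of the F-distribution on (2, 12) degrees of freedom, taken from standard F-tables.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic with its between-group and within-group degrees of freedom."""
    k = len(groups)                                   # number of levels
    n = sum(len(g) for g in groups)                   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation between the samples (assignable cause: the factor).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within the samples (chance cause: error).
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical occupational stress scores; group sizes need not be equal.
under_40 = [22, 25, 19, 24, 20]
from_40_to_55 = [28, 30, 27, 31, 29, 26]
above_55 = [24, 21, 26, 23]

f_stat, df_between, df_within = one_way_anova_f([under_40, from_40_to_55, above_55])

F_CRITICAL = 3.89                       # alpha = 0.05, df = (2, 12) (from an F-table)
reject_h0 = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom, "
      f"reject H0: {reject_h0}")
```

For these invented scores the null hypothesis of equal mean stress is rejected. As noted above, the F-test alone does not say which age group differs from the others.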
6.2 Two-Way ANOVA
A Two-Way ANOVA is useful when we desire to compare the effect of multiple levels of
two factors and we have multiple observations at each level.
One-Way ANOVA compares three or more levels of one factor. But some experiments
involve two factors each with multiple levels, in which case it is appropriate to use Two-Way
ANOVA.
Let us discuss the concepts of factors, levels and observation through an example.
A Two-Way ANOVA is a
design with two factors.
Let us suppose that the Human Resources Department of a company desires to know if
occupational stress varies according to age and gender.
Further suppose that the employees have been classified into three groups or levels:
In this design, factor age has three levels and gender two. In all, there are 3 x 2 = 6 groups or
cells. With this layout, we obtain scores on occupational stress from employee(s) belonging to
the six cells.
The basic version has one observation in each cell - one occupational stress score from one
employee in each of the six cells.
The second version has more than one observation per cell but the number of observations in
each cell must be equal. The advantage of the second version is it also helps us to test if
there is any interaction between the two factors.
For instance, in the example above, we may be interested to know if there is any interaction
between age and gender.
This helps us to know if age and gender are independent of each other - they are independent
if the effect of age on stress remains the same irrespective of whether we take gender into
consideration.
Hypothesis Testing
In the basic version there are two null hypotheses to be tested.
H01: All the age groups have equal stress on the average.
H02: Both the gender groups have equal stress on the average.
In the second version, with more than one observation per cell, a third null hypothesis is also tested.
H03: The two factors are independent or that interaction effect is not present.
The assumptions in both versions remain the same - normality, independence and equality of
variance.
An important advantage of this design is it is more efficient than its one-way counterpart.
There are two assignable sources of variation - age and gender in our example - and
this helps to reduce error variation thereby making this design more efficient.
Unlike One-Way ANOVA, it enables us to test the effect of two factors at the same time.
One can also test for independence of the factors provided there is more than one
observation in each cell. The only restriction is that the number of observations in each
cell has to be equal (there is no such restriction in case of one-way ANOVA).
The principle of local control means to make the observations as homogeneous as possible
so that error due to one or more assignable causes may be removed from the experimental
error.
In our example if we divided the employees only according to their age, then we would have
ignored the effect of gender on stress which would then accumulate with the experimental
error.
But we divided them not only according to age but also according to gender which would help
in reducing the error - this is application of the principle of local control for reducing error
variation and making the design more efficient.
6.3 Factorial Anova
Experiments where the effects of more than one factor are considered together are
called 'factorial experiments' and may sometimes be analyzed with the use of factorial
ANOVA.
For instance, the academic achievement of a student depends on study habits of the student
as well as home environment. We may have two simple experiments, one to study the effect
of study habits and another for home environment.
Independence of Factors
But these experiments will not give us any information about the dependence or
independence of the two factors, namely study habit and home environment.
In such cases, we resort to Factorial ANOVA which not only helps us to study the effect of two
or more factors but also gives information about their dependence or independence in the
same experiment. There are many types of factorial designs like 2², 2³, 3² etc. The simplest
of them all is the 2² or 2 x 2 experiment.
An Example
In these experiments, the factors are applied at different levels. In a 2 x 2 factorial design,
there are 2 factors each being applied in two levels.
Let us illustrate this with the help of an example. Suppose that a new drug has been
developed to control hypertension.
We want to test the effect of quantity of the drug taken and the effect of gender. Here, the
quantity of the drug is the first factor and gender is the second factor (or vice versa).
Suppose that we consider two quantities, say 100 mg and 250 mg of the drug (1 / 2). These
two quantities are the two levels of the first factor.
Similarly, the two levels of the second factor are male and female (A / B).
Thus we have two factors each being applied at two levels. In other words, we have a 2 x 2
factorial design.
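Here we have 4 different treatment groups, one for each combination of levels of factors - by
convention, the groups are denoted by A1, A2, B1, B2. These groups mean the following.
A1: males given 100 mg of the drug
A2: males given 250 mg of the drug
B1: females given 100 mg of the drug
B2: females given 250 mg of the drug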
Here, the quantity of the drug and gender are the independent variables whereas reduction of
hypertension after one month is the dependent variable.
In our example, there are two main effects - quantity and gender.
Factorial ANOVA also enables us to examine the interaction effect between the factors. An
interaction effect is said to exist when differences on one factor depend on the level of the other
factor.
However, it is important to remember that interaction is between factors and not levels. We
know that there is no interaction between the factors when we can talk about the effect of one
factor without mentioning the other factor.
Hypothesis Testing
In the above example, there are three hypotheses to be tested. These are:
For main effect gender, the null hypothesis means that there is no significant difference in
reduction of hypertension in males and females.
The null hypothesis for the main effect quantity means that there is no significant difference in
reduction of hypertension whether the patients are given 100 mg or 250 mg of the drug.
For the interaction effect, the null hypothesis means that the two main effects gender and
quantity are independent. The computational aspect involves computing an F-statistic for each
hypothesis.
Factorial design has several important features.
Factorial designs are the ultimate designs of choice whenever we are interested in
examining treatment variations.
Factorial designs are efficient. Instead of conducting a series of independent studies, we
are effectively able to combine these studies into one.
Factorial designs are the only effective way to examine interaction effects.
The assumptions remain the same as with other designs - normality, independence and
equality of variance.
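The difference between main effects and an interaction effect in the 2 x 2 drug example can be seen from the cell means alone. The sketch below uses invented reductions in blood pressure and only computes descriptive contrasts. It does not compute the F-statistics or test significance.

```python
from statistics import mean

# Hypothetical reductions in blood pressure after one month, four patients per cell.
# Keys are (gender, quantity of drug in mg); the values are invented for illustration.
cells = {
    ("male", 100): [8, 10, 9, 9],
    ("male", 250): [14, 15, 13, 14],
    ("female", 100): [7, 9, 8, 8],
    ("female", 250): [18, 20, 19, 19],
}
cell_mean = {key: mean(values) for key, values in cells.items()}

# Effect of the higher quantity within each gender.
quantity_effect_male = cell_mean[("male", 250)] - cell_mean[("male", 100)]
quantity_effect_female = cell_mean[("female", 250)] - cell_mean[("female", 100)]

# Main effects: average over the levels of the other factor.
main_effect_quantity = (quantity_effect_male + quantity_effect_female) / 2
main_effect_gender = (
    mean([cell_mean[("male", 100)], cell_mean[("male", 250)]])
    - mean([cell_mean[("female", 100)], cell_mean[("female", 250)]])
)

# Interaction contrast: zero when the quantity effect is the same for both genders.
interaction_contrast = quantity_effect_male - quantity_effect_female

print("main effect of quantity:", main_effect_quantity)
print("main effect of gender:", main_effect_gender)
print("interaction contrast:", interaction_contrast)
```

In these invented data the higher quantity helps females much more than males, so the interaction contrast is far from zero and the effect of quantity cannot be described without mentioning gender. Whether such a contrast is statistically significant is what the F-test for the interaction hypothesis decides.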
6.4 Repeated Measures ANOVA
It is used when all the members of a random sample are tested under a number of conditions.
Here, we have different measurements for each member of the sample as each member is exposed to
different conditions.
In other words, the measurement of the dependent variable is repeated. It is not possible to
use the standard ANOVA in such a case as such data violates the assumption of
independence of data and as such it will not be able to model the correlation between the
repeated measures.
For both, samples are measured on several occasions, or trials, but in the repeated measures
design, each trial represents the measurement of the same characteristic under a different
condition.
For example, repeated measures ANOVA can be used to compare the number of oranges
produced by an orange grove in years one, two and three. The measurement is the number of
oranges and the condition that changes is the year.
Thus, to compare the number, weight and price of oranges, repeated measures ANOVA
cannot be used. The three measurements are number, weight, and price, and these do not
represent different conditions, but different qualities.
Why Use Repeated Measures Design?
Repeated measures design is used for several reasons:
By collecting data from the same participants under repeated conditions the individual
differences can be eliminated or reduced as a source of between group differences.
Also, the sample size is not divided between conditions or groups and thus inferential
testing becomes more powerful.
This design also proves to be economical when sample members are difficult to recruit
because each member is measured under all conditions.
This design is based on the assumption of Sphericity, which means that the variance of the
population difference scores for any two conditions should be the same as the variance of the
population difference scores for any other two conditions.
But this condition is only relevant to the one-way repeated measures ANOVA and in other
cases this assumption is commonly violated.
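The null hypothesis to be tested here is that the population means under all the conditions
are equal, i.e. H0: μ1 = μ2 = ... = μk, where k is the number of conditions.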
Some differences will occur in the sample. It is desired to draw conclusions about the
population from which it was taken, not about the sample. The F-ratios are used for the
analysis of variance and conclusions are drawn accordingly.
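The F-ratio for a one-way repeated measures design can be sketched as follows. This is a minimal illustration, not a full analysis: the orange yields are hypothetical, the function name `repeated_measures_f` is an assumption made for this example, and the sphericity assumption is not checked.

```python
# Hypothetical data: oranges (in hundreds) from four trees in years one, two and three.
# Rows are subjects (trees); columns are conditions (years).
yields = [
    [10, 12, 14],
    [8, 11, 11],
    [12, 13, 17],
    [10, 12, 14],
]

def repeated_measures_f(data):
    """Return (F, df_conditions, df_error) for a one-way repeated measures ANOVA."""
    n = len(data)        # number of subjects
    k = len(data[0])     # number of conditions
    grand_mean = sum(sum(row) for row in data) / (n * k)
    condition_means = [sum(row[j] for row in data) / n for j in range(k)]
    subject_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    # Removing the between-subject variation from the error term is what
    # distinguishes this design from a standard one-way ANOVA.
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_conditions / df_conditions) / (ss_error / df_error)
    return f_ratio, df_conditions, df_error

f_ratio, df1, df2 = repeated_measures_f(yields)
print(f"F({df1}, {df2}) = {f_ratio:.2f}")  # F(2, 6) = 24.00
```

The calculated F is then compared with the critical value of the F-distribution for the same degrees of freedom.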
Within-Subject Design
The repeated measures design is also known as a within-subject design.
The data presented in this design includes a measure repeated over time, a measure
repeated across more than one condition or several related and comparable measures.
Possible Designs for Repeated Measures
One-way repeated measures
Two-way repeated measures
Two-way mixed split-plot design (SPANOVA)
7 Nonparametric Statistics
Nonparametric statistics are methods that do not assume that the data follow a given
probability distribution. When an experiment is performed or data are collected for some
purpose, it is usually assumed that the data fit some given probability distribution,
typically the normal distribution. This is the basis on which the data are interpreted.
When no such assumption is made, the methods used are nonparametric.
There are several advantages of using nonparametric statistics. As can be expected, since
there are fewer assumptions that are made about the sample being studied, nonparametric
statistics are usually wider in scope as compared to parametric statistics that actually assume
a distribution. This is mainly the case when we do not know a lot about the sample we are
studying and making a priori assumptions about data distributions might not give us accurate
results and interpretations. This directly translates into an increase in robustness.
However, there are also some disadvantages of nonparametric statistics. The main
disadvantage is that the degree of confidence is usually lower for these types of studies. This
means for the same sample under consideration, the results obtained from nonparametric
statistics have a lower degree of confidence than if the results were obtained using parametric
statistics. Of course, this is assuming that the study is such that it is valid to assume a
distribution for the sample.
There are many experimental scenarios in which we can assume a normal distribution. For
example, if an experiment looks at the correlation between a healthy morning breakfast and
IQ, the experimenter can assume beforehand that the IQs follow a normal distribution within
the sample, assuming the sample is chosen randomly from the population.
On the other hand, if this assumption is not made, then the experimenter is following
nonparametric statistics methods.
However, there could be another experiment that measures the resistance of the human body
to a strain of bacteria. In such a case, it is not possible to determine if the data will be normally
distributed. It might happen that all people are resistant to the strain of bacteria under study or
perhaps no one is. Again, there could be other considerations as well. It could be that people
of a particular ethnicity are born with that resistance while none of the others are. In such
cases, it is not right to assume a normal distribution of data. These are the situations in which
nonparametric statistics should be used. There are many tests that tell us whether the data
can be assumed to be normally distributed or not.
7.1 Cohen’s Kappa
Cohen's Kappa measures the extent to which two raters who are examining the same set of
categorical data agree when assigning the data to categories, for example, classifying a
tumor as 'malignant' or 'benign'.
The level of agreement between two sets of dichotomous scores or ratings (an alternative
between two choices, e.g. accept or reject) assigned by two raters to certain qualitative
variables can easily be compared with the help of simple percentages, i.e. by taking the
ratio of the number of ratings for which both the raters agree to the total number of
ratings. But despite the simplicity of this calculation, percentages can be misleading and
do not reflect the true picture, since they do not take into account the agreement that
arises purely by chance.
Using percentages can result in two raters appearing to be highly reliable and completely in
agreement, even if they have assigned their scores completely randomly and they actually do
not agree at all. Cohen's Kappa overcomes this issue as it takes into account agreement
occurring by chance.
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
where
Pr(a) = Observed percentage of agreement,
Pr(e) = Expected percentage of agreement.
The observed percentage of agreement implies the proportion of ratings where the raters
agree, and the expected percentage is the proportion of agreements that are expected to
occur by chance as a result of the raters scoring in a random manner. Hence Kappa is the
proportion of agreements that is actually observed between raters, after adjusting for the
proportion of agreements that take place by chance.
Let us consider the following 2×2 contingency table, which depicts the probabilities of two
raters classifying objects into two categories.
                        Rater 1
Rater 2        Category 1   Category 2   Total
Category 1     P11          P12          P10
Category 2     P21          P22          P20
Total          P01          P02          1
Then
Pr(a) = P11 + P22
Pr(e) = P10 × P01 + P20 × P02
The value of kappa ranges between -1 and +1, similar to Karl Pearson's coefficient of
correlation 'r'. In fact, kappa and r assume similar values if they are calculated for the
same set of dichotomous ratings for two raters.
A value of kappa equal to +1 implies perfect agreement between the two raters, while that of
-1 implies perfect disagreement. If kappa assumes the value 0, then this implies that there is
no relationship between the ratings of the two raters, and any agreement or disagreement is
due to chance alone. A kappa value of 0.70 is generally considered to be satisfactory.
However, the desired reliability level varies depending on the purpose for which kappa is
being calculated.
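As a minimal sketch, kappa can be computed from a table of counts by taking the observed agreement from the diagonal cells and the chance agreement from the products of the two raters' marginal proportions. The counts below are hypothetical and the function name `cohens_kappa` is an assumption made for this example.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts.

    table[i][j] = number of items placed in category i by rater 2
    and in category j by rater 1.
    """
    total = sum(sum(row) for row in table)
    size = len(table)
    # Observed agreement: proportion of items on the main diagonal.
    pr_a = sum(table[i][i] for i in range(size)) / total
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over the categories.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    pr_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (pr_a - pr_e) / (1 - pr_e)

# Hypothetical ratings of 50 tumors ('malignant', 'benign') by two raters.
ratings = [
    [20, 5],
    [10, 15],
]
kappa = cohens_kappa(ratings)
print(f"kappa = {kappa:.2f}")  # kappa = 0.40
```

In this illustration the raters agree on 70% of the tumors, yet kappa is only 0.40, because half of the agreement would be expected by chance alone.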
Kappa is very easy to calculate given the software available for the purpose and is
appropriate for testing whether agreement exceeds chance levels. However, some questions
arise regarding the proportion of chance, or expected agreement, which is the proportion of
times the raters would agree by chance alone. This term is relevant only if the raters are
independent, and the clear absence of independence calls its relevance into question.
Also, kappa requires the two raters to use the same rating categories. It cannot be used
to test the consistency of ratings for raters that use different categories, e.g. if one
uses the scale 1 to 5, and the other 1 to 10.
7.2 Mann-Whitney U-Test
The test is also known as the Mann-Whitney-Wilcoxon (MWW) test or the Wilcoxon Rank-Sum Test.
The Method
The Mann-Whitney U-test is used to test whether two independent samples of observations
are drawn from the same or identical distributions. An advantage with this test is that the two
samples under consideration may not necessarily have the same number of observations.
This test is based on the idea that the particular pattern exhibited when 'm' number of X
random variables and 'n' number of Y random variables are arranged together in increasing
order of magnitude provides information about the relationship between their parent
populations.
The Mann-Whitney test criterion is based on the magnitude of the Y's in relation to the X's, i.e.
the position of Y's in the combined ordered sequence. A sample pattern of arrangement
where most of the Y's are greater than most of the X's or vice versa would be evidence
against random mixing. This would tend to discredit the null hypothesis of identical distribution.
The test has two important assumptions. First the two samples under consideration are
random, and are independent of each other, as are the observations within each sample.
Second the observations are numeric or ordinal (arranged in ranks).
How to Calculate the Mann-Whitney U
In order to calculate the U statistic, the combined set of data is first arranged in ascending
order, with tied scores receiving a rank equal to the average position of those scores in the
ordered sequence.
Let T denote the sum of ranks for the first sample. The Mann-Whitney test statistic is then
calculated using U = n1 n2 + {n1 (n1 + 1)/2} - T , where n1 and n2 are the sizes of the first
and second samples respectively.
An Example
An example can clarify better. Consider the following samples.
Sample A
Observation 25 25 19 21 22 19 15
Rank 15.5 15.5 9.5 13 14 9.5 3.5
Sample B
Observation 18 14 13 15 17 19 18 20 19
Rank 6.5 2 1 3.5 5 9.5 6.5 12 9.5
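The ranking and the U statistic for these two samples can be reproduced with a short sketch. The helper name `average_ranks` is an assumption made for this example; the formula for U is the one given in the text.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # 1-based position of the first occurrence
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2  # mean of the tied positions
    return [rank_of[v] for v in values]

sample_a = [25, 25, 19, 21, 22, 19, 15]
sample_b = [18, 14, 13, 15, 17, 19, 18, 20, 19]

ranks = average_ranks(sample_a + sample_b)
n1, n2 = len(sample_a), len(sample_b)
t = sum(ranks[:n1])                       # rank sum of the first sample
u = n1 * n2 + n1 * (n1 + 1) / 2 - t       # formula given in the text

print(f"T = {t}, U = {u}")  # T = 80.5, U = 10.5
```

Applying the same formula to Sample B gives U = 52.5, and the two values always add up to n1 n2 = 63.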
We next compare the value of calculated U with the value given in the Tables of Critical
Values for the Mann-Whitney U-test, where the critical values are provided for given n1 and
n2 , and accordingly accept or reject the null hypothesis. Even though the distribution of U is
known, the normal distribution provides a good approximation in case of large samples.
As a Counterpart of T-Test
The Mann-Whitney U test is truly the non parametric counterpart of the two sample t-test. To
see this, one needs to recall that the t-test tests for equality of means when the underlying
assumptions of normality and equality of variance are satisfied. Thus the t-test tests if the two
samples have been drawn from identical normal populations. The Mann-Whitney U test asks the
same question, whether the two samples come from identical distributions, without assuming
normality.
7.3 Wilcoxon Signed Rank Test
The Wilcoxon Signed Rank Test is a non-parametric statistical test for testing
hypotheses about the median.
The test has two versions: "single sample" and "paired samples / two samples".
Single Sample
The first version is the analogue of independent one sample t-test in the non parametric
context. It uses a single sample and is recommended for use whenever we desire to test a
hypothesis about population median.
The null hypothesis here is of the form H0 : m = m0 , where m0 is the specific value of
population median that we wish to test against the alternative hypothesis H1 : m ≠ m0 .
For example, let us suppose that the manager of a boutique claims that the median income of
his clients is $24,000 per annum. To test if this is tenable, the analyst will obtain the
yearly income of a sample of his clients and test the null hypothesis H0 : m = 24,000.
Paired Samples
The second version of the test uses paired samples and is the non parametric analogue of
dependent t-test for paired samples.
This test uses two samples but it is necessary that they should be paired. Paired samples
imply that each individual observation of one sample has a unique corresponding member in
the other sample.
For example, suppose that we have a sample of weights of n obese adults before they are
subjected to a change of diet.
After a lapse of six months, we would like to test whether there has been any significant loss
in weight as a result of change in diet. One could be tempted to straightaway use the
dependent t-test for paired samples here.
However, that test has certain assumptions, notable among them being normality. If this
normality assumption is not satisfied, one would have to go for the non parametric Wilcoxon
Signed Rank Test.
The null hypothesis then would be that there has been no significant reduction in median
weight after six months, against the alternative that the medians before and after differ
significantly.
Parametric techniques often cannot be used if the normality assumption is not satisfied.
Among others, the t-test requires this assumption and it is not advisable to use it if this
assumption is violated.
The advantage with Wilcoxon Signed Rank Test is that it neither depends on the form of the
parent distribution nor on its parameters. It does not require any assumptions about the shape
of the distribution.
For this reason, this test is often used as an alternative to t-tests whenever the population
cannot be assumed to be normally distributed. Even if the normality assumption holds, it has
been shown that the efficiency of this test compared to the t-test is almost 95%.
Let us illustrate how signed ranks are created in one-sample case by considering the example
explained above. Assume that a sample of yearly incomes of 10 customers was collected.
The null hypothesis to be tested is H0 : m = 24,000.
We first calculate the deviations of the given observations from 24,000 and then rank them in
order of magnitude. This has been done in the following table:
Income    Deviation   Signed Rank
23,928        -72        -1
24,500        500         5.5
23,880       -120        -2
24,675        675         7
21,965      -2035       -10
22,900      -1100        -9
23,500       -500        -5.5
24,450        450         4
22,998      -1002        -8
23,689       -311        -3
The deviations are ranked in increasing order of absolute magnitude and then the ranks are
given the signs of the corresponding deviations.
In the above table the difference 500 occurs twice. In such a case, we assign a common rank
which is the arithmetic mean of their respective ranks. Hence 500 was assigned the rank
which is the arithmetic mean of 5 and 6.
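The signed ranks in the table can be reproduced with the following sketch. The helper name `signed_rank` is an assumption made for this example, and the sums of the positive and negative ranks at the end are the quantities from which the test statistic is usually built; they are not shown in the table above. Deviations of exactly zero are not handled, since none occur here.

```python
incomes = [23928, 24500, 23880, 24675, 21965, 22900, 23500, 24450, 22998, 23689]
m0 = 24000  # hypothesized median

deviations = [x - m0 for x in incomes]
abs_sorted = sorted(abs(d) for d in deviations)

def signed_rank(d):
    """Average rank of |d| among all absolute deviations, carrying the sign of d."""
    first = abs_sorted.index(abs(d)) + 1      # 1-based position of the first occurrence
    count = abs_sorted.count(abs(d))          # number of tied absolute deviations
    rank = first + (count - 1) / 2
    return rank if d > 0 else -rank

signed_ranks = [signed_rank(d) for d in deviations]
w_plus = sum(r for r in signed_ranks if r > 0)
w_minus = -sum(r for r in signed_ranks if r < 0)

print(signed_ranks)
print(f"W+ = {w_plus}, W- = {w_minus}")  # W+ = 16.5, W- = 38.5
```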
In a two sample case, the ranks are assigned in a similar way. The only difference is that in a
two sample case we first find out the differences between the corresponding observations of
the samples and then rank them in increasing order of magnitude.
The ranks are then given the sign of the corresponding differences.
(Mar 15, 2009). Wilcoxon Signed Rank Test. Retrieved from
8 Other Ways to Analyse Data
8.1 Chi Square Test
Any statistical test that uses the chi square distribution can be called a chi square test. It
is applicable both for large and small samples, depending on the context.
For example, suppose we want to test the hypothesis that the success rate in an English test
is the same for indigenous and immigrant students. If we take a random sample of, say, 80
students and record both the indigenous/immigrant status and the success/failure status of
each student, the chi square test can be applied to test the hypothesis.
There are different types of chi square test, each for a different purpose. Some of the
popular types are outlined below.
1. Chi square test for goodness of fit is used to test whether a sample has been drawn from
a population with a specified distribution.
For example, given a sample, we may like to test if it has been drawn from a normal
population. This can be tested using the chi square goodness of fit procedure.
2. Chi square test for independence of two attributes. Suppose N observations are
considered and classified according to two characteristics, say A and B. We may be
interested to test whether the two characteristics are independent. In such a case, we
can use the Chi square test for independence of two attributes.
The example considered above, testing for independence of success in the English test
vis-a-vis immigrant status, is a case fit for analysis using this test.
3. Chi square test for single variance is used to test a hypothesis on a specific value of the
population variance. Statistically speaking, we test the null hypothesis H0: σ² = σ0² against
the research hypothesis H1: σ² ≠ σ0², where σ² is the population variance and σ0² is a
specific value of the population variance that we would like to test for acceptance.
In other words, this test enables us to test if the given sample has been drawn from a
population with specific variance σ0². This is a small sample test to be used only if sample
size is less than 30 in general.
The Chi square test for single variance has an assumption that the population from which the
sample has been drawn is normal. This normality assumption need not hold for the chi square
goodness of fit test and the test for independence of attributes.
However while implementing these two tests, one has to ensure that expected frequency in
any cell is not less than 5. If it is so, then it has to be pooled with the preceding or succeeding
cell so that expected frequency of the pooled cell is at least 5.
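The test for independence of two attributes can be sketched for a 2×2 table as follows. The counts for the 80 students are hypothetical, chosen only for illustration. The p-value uses the fact that, with 1 degree of freedom, the upper tail of the chi square distribution can be written with the complementary error function.

```python
import math

# Hypothetical counts for 80 students (rows: indigenous, immigrant;
# columns: success, failure). These numbers are illustrative only.
observed = [
    [30, 10],
    [20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell if the two attributes were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]
assert min(min(row) for row in expected) >= 5, "pool cells: expected frequency below 5"

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# A 2x2 table has 1 degree of freedom, and the upper-tail probability of
# chi square with 1 df equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```

With these counts the p-value falls below 0.05, so the hypothesis of independence would be rejected at the 5% level.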
Since these tests do not involve any population parameters or characteristics, they are also
termed non parametric or distribution free tests. An additional important fact about these
two tests is that they are sample size independent and can be used for any sample size as
long as the assumption on minimum expected cell frequency is met.
(Sep 24, 2009). Chi Square Test. Retrieved from
8.2 Z-Test
The Z-test is a statistical test in which the normal distribution is applied. It is basically
used for dealing with problems relating to large samples, i.e. when n ≥ 30, where n is the
sample size.
For example suppose a person wants to test if both tea & coffee are equally popular in a
particular town. Then he can take a sample of size say 500 from the town out of which
suppose 280 are tea drinkers. To test the hypothesis, he can use Z-test.
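For the tea and coffee example, the test statistic can be sketched as follows, taking the hypothesized proportion of tea drinkers to be 0.5. The two-sided alternative and the variable names are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

n = 500      # sample size from the example
tea = 280    # tea drinkers in the sample
p0 = 0.5     # hypothesized proportion (tea and coffee equally popular)

p_hat = tea / n
standard_error = math.sqrt(p0 * (1 - p0) / n)   # computed under the null hypothesis
z = (p_hat - p0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Here z is about 2.68, which exceeds the two-sided 5% critical value of 1.96, so the hypothesis that tea and coffee are equally popular would be rejected at that level.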
1. z test for single proportion is used to test a hypothesis on a specific value of the
population proportion.
Statistically speaking, we test the null hypothesis H0: p = p0 against the alternative
hypothesis H1: p ≠ p0, where p is the population proportion and p0 is a specific value of
the population proportion we would like to test for acceptance.
The example on tea drinkers explained above requires this test. In that example, p0 =
0.5. Notice that in this particular example, proportion refers to the proportion of tea
drinkers.
2. z test for difference of proportions is used to test the hypothesis that two populations
have the same proportion.
For example suppose one is interested to test if there is any significant difference in the
habit of tea drinking between male and female citizens of a town. In such a situation, Z-
test for difference of proportions can be applied.
One would have to obtain two independent samples from the town- one from males and
the other from females and determine the proportion of tea drinkers in each sample in
order to perform this test.
3. z test for single mean is used to test a hypothesis on a specific value of the population
mean.
Statistically speaking, we test the null hypothesis H0: μ = μ0 against the alternative
hypothesis H1: μ ≠ μ0, where μ is the population mean and μ0 is a specific value of the
population mean that we would like to test for acceptance.
Unlike the t-test for single mean, this test is used if n ≥ 30 and population standard
deviation is known.
4. z test for single variance is used to test a hypothesis on a specific value of the
population variance.
Statistically speaking, we test the null hypothesis H0: σ² = σ0² against H1: σ² ≠ σ0², where
σ² is the population variance and σ0² is a specific value of the population variance that we
would like to test for acceptance.
In other words, this test enables us to test if the given sample has been drawn from a
population with specific variance σ0². Unlike the chi square test for single variance, this
test is used if n ≥ 30.
5. Z-test for testing equality of variance is used to test the hypothesis of equality of two
population variances when the sample size of each sample is 30 or larger.
Irrespective of the type of Z-test used it is assumed that the populations from which the
samples are drawn are normal.
8.3 F-Test
Any statistical test that uses the F-distribution can be called an F-test. It is used when
the sample size is small, i.e. n < 30.
The F-test can be used to test the hypothesis that the population variances are equal.
1. F-test for testing equality of variance is used to test the hypothesis of equality of two
population variances. The example considered above requires the application of this test.
2. F-test for testing equality of several means. Test for equality of several means is carried
out by the technique named ANOVA.
For example, suppose that the efficacy of a drug is sought to be tested at three levels,
say 100mg, 250mg and 500mg. A test is conducted among fifteen human subjects taken
at random, with five subjects being administered each level of the drug.
To test if there are significant differences among the three levels of the drug in terms of
efficacy, the ANOVA technique has to be applied. The test used for this purpose is the
F-test.
3. F-test for testing significance of regression is used to test the significance of the
regression model. The appropriateness of the multiple regression model as a whole can
be tested by this test. A significant F indicates a linear relationship between Y and at
least one of the X's.
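The F-test for equality of several means in the drug example can be sketched as follows. The efficacy scores are hypothetical and the function name `one_way_anova_f` is an assumption made for this illustration.

```python
# Hypothetical efficacy scores for five subjects at each dose level.
groups = {
    "100mg": [4, 5, 6, 5, 5],
    "250mg": [6, 7, 8, 7, 7],
    "500mg": [9, 8, 10, 9, 9],
}

def one_way_anova_f(samples):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for s in samples for x in s]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(s) / len(s) for s in samples]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

f_ratio, df1, df2 = one_way_anova_f(list(groups.values()))
print(f"F({df1}, {df2}) = {f_ratio:.1f}")  # F(2, 12) = 40.0
```

The 5% critical value of F with (2, 12) degrees of freedom is about 3.89, so with these illustrative scores the hypothesis of equal means would be rejected.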
Irrespective of the type of F-test used, one assumption has to be met: the populations from
which the samples are drawn have to be normal. In the case of the F-test for equality of
variance, there is also a convention to follow: the larger of the sample variances has to be
placed in the numerator of the test statistic.
Like the t-test, the F-test is also a small sample test and may be considered for use if the
sample size is < 30.
In attempting to reach decisions, we always begin by specifying the null hypothesis against a
complementary hypothesis called the alternative hypothesis. The calculated value of the F-test
with its associated p-value is used to infer whether one has to accept or reject a null
hypothesis.
All statistical software packages provide these p-values. If the associated p-value is small,
i.e. < 0.05, we say that the test is significant at 5% and one may reject the null hypothesis
and accept the alternative.
On the other hand, if the associated p-value of the test is > 0.05, one may accept the null
hypothesis and reject the alternative. Evidence against the null hypothesis will be considered
very strong if the p-value is less than 0.01. In that case, we say that the test is
significant at 1%.
8.4 Factor Analysis
Factor analysis is a statistical approach that can be used to analyze large number of
interrelated variables and to categorize these variables using their common aspects.
The approach involves finding a way of representing correlated variables together to form a
new smaller set of derived variables with minimum loss of information. So, it is a type of a
data reduction tool and it removes redundancy or duplication from a set of correlated variables.
Also, factors are formed that are relatively independent of one another. But since it requires
the data to be correlated, all assumptions that apply to correlation are relevant here.
Main Types
There are two main types of factor analysis:
Principal component analysis - this method provides a unique solution so that the
original data can be reconstructed from the results. Thus, this method not only provides
a solution but also works the other way round, i.e., provides data from the solution. The
solution generated includes as many factors as there are variables.
Common factor analysis - this technique uses an estimate of common difference or
variance among the original variables to generate the solution. Due to this, the number
of factors will always be less than the number of original variables. So, factor analysis
actually refers to common factor analysis.
Main Uses
The main uses of factor analysis can be summarized as given below. It helps us in:
Let us consider an example to understand the use of factor analysis.
In this way, factors can be found to represent variables with similar aspects.
8.5 ROC Curve Analysis
In almost all fields of human activity, there is often a need to discriminate between
good and bad, presence and absence. Various tests have been designed to meet this
objective. The ROC curve technique has been designed to attain two objectives in this
regard.
First, it can be used to calibrate (in some sense) a test so that it is able to perform the
discrimination activity well. Second, it can be used to choose between tests and specify best
among them.
What is ROC?
If one applies to a bank for credit, it is most likely that the bank will calculate a credit score out
of the applicant’s background. A higher score could indicate a good customer with minimal
chance of default. The banker could refuse credit if the score is low. Often a credit score cut-
off is used below which the application is rejected.
It is not difficult to see that there is always an element of risk here: the risk of
committing two types of errors. A good prospective customer (one who would not default) could
be refused credit by the bank and a bad one could be approved credit. Clearly the banker would
like the cut-off to be fixed in such a manner that the chances of both errors are minimized,
if not entirely eliminated.
While complete elimination is impossible, the ROC curve analysis is a technique which
contributes to this endeavour. A related problem is the question of choosing between methods
of identifying good/bad customers should there be a choice. The ROC curve analysis
technique can be of use even here.
The Plot
In order to draw the ROC curve, the concepts of ‘Sensitivity’ and ‘Specificity’ are used. The
curve is the plot of sensitivity (on the y-axis) against 1 - specificity (on the x-axis) for
different values of the cut-off.
To understand these concepts, assume that we select a sample of z customers of the bank by
retrospective sampling method. Further suppose that m and n of these are good and bad
(defaulting) customers respectively (m+n=z). Next, we use the credit scale on these
customers and calculate their credit scores. Then we use the cut-off and label customers
good or bad according to whether the credit score is above or below the cut-off. Out of m
good customers, the test classified x of them as good while the remaining m-x were classified
as bad.
In the parlance of the ROC curve, x is termed TP (for true positive, meaning that the credit
scale was able to identify these customers as good correctly) while m-x is termed FN (for
false negative). Further suppose that out of n bad customers, the test classified y of them as
bad while the remaining n-y were classified as good. In ROC parlance, y is termed TN (for
true negative) while n-y is termed FP (for false positive).
                                 Actual status of customers
As predicted by credit scale     Good        Bad
Good                             x           n-y
Bad                              m-x         y
Total                            m           n
So far so good but how does one determine the optimal cut-off? The banker would like to
determine that cut-off for which sensitivity is high and 1-specificity is low – ideally 100%
sensitivity with 100% specificity. That is easier said than done as the best of the curves is not
a vertical line but one which rises steeply initially and then slowly. The highest point on the
curve has 100% sensitivity and 0% specificity. In other words as one of sensitivity or
specificity increases, the other decreases and vice versa. The problem of determining the
ideal cut-off is to choose one depending upon the extent of sensitivity and specificity that the
decision maker is comfortable with. Having thus fixed a cut-off, the banker can then use it for
evaluating fresh credit applications.
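The trade-off between sensitivity and specificity can be sketched by moving the cut-off over a small retrospective sample. The credit scores below are hypothetical, and the sketch takes the usual definitions sensitivity = TP / (TP + FN) = x/m and specificity = TN / (TN + FP) = y/n. Treating a score equal to the cut-off as 'good' is also an assumption.

```python
# Hypothetical credit scores with known outcomes (retrospective sample).
good_scores = [720, 680, 650, 700, 610]   # customers who did not default (m = 5)
bad_scores = [580, 620, 640, 550, 600]    # customers who defaulted (n = 5)

def roc_point(cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are labelled good."""
    tp = sum(s >= cutoff for s in good_scores)   # good predicted good (x)
    tn = sum(s < cutoff for s in bad_scores)     # bad predicted bad (y)
    sensitivity = tp / len(good_scores)          # x / m
    specificity = tn / len(bad_scores)           # y / n
    return sensitivity, specificity

# Each cut-off gives one point (1 - specificity, sensitivity) on the ROC curve.
for cutoff in (600, 630, 660):
    sens, spec = roc_point(cutoff)
    print(f"cut-off {cutoff}: sensitivity = {sens:.1f}, 1 - specificity = {1 - spec:.1f}")
```

Raising the cut-off from 600 to 660 lowers the sensitivity from 1.0 to 0.6 while raising the specificity from 0.4 to 1.0, which is the trade-off described above.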
The fact that the best of the tests has a curve which rises steeply initially is used to choose
between tests. A test can be called the best if its corresponding ROC curve is higher than
others.
(Jun 25, 2010). ROC Curve Analysis. Retrieved from
8.6 Meta Analysis
Meta analysis is a statistical technique developed by social scientists, who are very
limited in the type of experiments they can perform.
Social scientists have great difficulty in designing and implementing true experiments, so
meta-analysis gives them a quantitative tool to analyze statistically data drawn from a number
of studies, performed over a period of time.
Medicine and psychology increasingly use this method, as a way of avoiding time-consuming
and intricate studies, largely repeating the work of previous research.
What is Meta-Analysis?
Social studies often use very small sample sizes, so any statistics used generally give results
containing large margins of error.
This can be a major problem when interpreting and drawing conclusions, because it can mask
any underlying trends or correlations. Such conclusions are only tenuous, at best, and leave
the research open for criticism.
Meta-analysis is the process of drawing from a larger body of research, and using powerful
statistical analyses on the conglomerated data.
This gives a much larger sample population and is more likely to generate meaningful and
usable data.
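The text does not specify how the studies are combined. One common choice, shown here only as an illustrative assumption, is fixed-effect inverse-variance weighting, in which each study's estimate is weighted by the reciprocal of its squared standard error. The effect estimates and standard errors below are hypothetical.

```python
import math

# Hypothetical effect estimates and standard errors from three small studies.
effects = [0.30, 0.10, 0.25]
std_errors = [0.20, 0.10, 0.25]

# Fixed-effect inverse-variance weighting: more precise studies count for more.
weights = [1 / se ** 2 for se in std_errors]
pooled_effect = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled effect = {pooled_effect:.3f}, standard error = {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single study, which is the sense in which combining studies reduces the margin of error.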
As the method becomes more common, database programs have made the process much
easier, with professionals working in parallel able to enter their results and access the data.
This allows constant quality assessments and also reduces the chances of unnecessary
repeat research, as papers can often take many months to be published, and the computer
records ensure that any researcher is aware of the latest directions and results.
The field of meta study is also a lot more rigorous than the traditional literature review, which
often relies heavily upon the individual interpretation of the researcher.
When used with the databases, a meta study allows a much wider net to be cast than by the
traditional literature review, and is excellent for highlighting correlations and links between
studies that may not be readily apparent as well as ensuring that the compiler does not
subconsciously infer correlations that do not exist.
The main problem is that there is the potential for publication bias and skewed data.
Research whose results fail to refute the null hypothesis may tend to remain unpublished, or
risks not being entered into the database. If the meta study is restricted to the research with
positive results, then the validity is compromised.
The researcher compiling the data must make sure that all research is quantitative, rather
than qualitative, and that the data is comparable across the various research programs,
allowing a genuine statistical analysis.
It is important to pre-select the studies, ensuring that all of the research used is of a sufficient
quality to be used.
One erroneous or poorly conducted study can place the results of the entire meta-analysis at
risk. On the other hand, setting almost unattainable criteria for inclusion can leave
the meta study with too small a sample size to be statistically relevant.
Striking a balance can be a little tricky, but the whole field is in a state of constant
development, incorporating protocols similar to the scientific method used for normal
quantitative research.
Finding the data is rapidly becoming the real key, with skilled meta-analysts developing a
set of library-based skills, finding information buried in government reports and conference
data, and developing the knack of assessing the quality of sources quickly and effectively.
The conveniences, as long as the disadvantages are taken into account, are too apparent to
ignore, and a meta study can reduce the need for long, expensive and potentially intrusive
repeated research studies.
Martyn Shuttleworth (Jun 21, 2009). Meta Analysis. Retrieved from