5 Predicting Data


PREDICTING DATA

Arsenio P. Gardoce, Jr.

A study that aims to predict a variable based on other variables requires the use of regression
analysis. Regression analysis is commonly used to address a research question of the following form:

1. Do the variables X1, X2, X3, … Xn affect the Y variable?
2. Do the variables X1, X2, X3, … Xn influence the Y variable?
3. Do the variables X1, X2, X3, … Xn predict the Y variable?

The variables X1, X2, X3, … Xn are called independent variables, while variable Y is called the
dependent variable. An independent variable (also called a predictor) is a variable thought to be the
cause of some effect or to predict an outcome variable, while a dependent variable (also called an
outcome variable) is a variable thought to be affected by changes in an independent variable, i.e.,
to change as the predictor variable changes. Suppose a researcher wants to determine
whether academic performance affects performance in the Board Licensure Examination for Professional
Teachers (BLEPT). Academic performance is the independent variable and BLEPT performance is the
dependent variable.

There are several types of regression analyses. The most commonly used are simple linear
regression and multiple linear regression.

SIMPLE LINEAR REGRESSION


Simple linear regression can be used to model the relationship between a single dependent
variable and one predictor. Before analyzing data using linear regression, the following assumptions
must be checked:

1. The two variables should be measured on a continuous scale – interval or ratio level.
2. There is a linear relationship between the two variables.
3. There should be no significant outliers.
4. There should be independence of observations.
5. The data needs to show homoscedasticity (i.e. the variances along the line of best fit remain
similar as you move along the line).
6. The residuals (errors) of the regression line variables should be approximately normally
distributed.
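As an illustration of assumption 3, one rough screening check is to standardize the residuals of a fitted line and flag cases that fall far from zero. A minimal sketch in Python on made-up data (not this chapter's examples); the 2.5-SD cutoff is only a convention:

```python
import numpy as np

# Made-up illustration: ten points on a line, with one deliberate
# outlier injected at index 5.
x = np.arange(1, 11, dtype=float)
y = 2 * x
y[5] += 50

# Fit the least-squares line y = b0 + b1*x.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Standardize the residuals; |z| > 2.5 is a common rough cutoff.
residuals = y - A @ coef
z = residuals / residuals.std()
outliers = np.flatnonzero(np.abs(z) > 2.5)
print(outliers)  # the injected point at index 5 should be flagged
```

In practice, SPSS offers casewise diagnostics and residual plots for the same purpose; this sketch only shows the idea behind them.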

Interpreting SPSS Statistics Output of Simple Linear Regression Analysis


An output of simple linear regression analysis presents three tables – the Model Summary table,
the ANOVA table, and the Coefficients table. The Model Summary table provides the correlation
coefficient and the R square value (also called the coefficient of determination). The R2 value
indicates how much of the total variation in the dependent variable can be explained by the
independent variable.

RESEARCH STATISTICS 1
The ANOVA table reports how well the regression equation fits the data (i.e. how well the
regression equation predicts the dependent variable). The regression model statistically
significantly predicts the dependent variable (i.e. it is a good fit for the data) if the p-value
associated with the F-value is less than 0.05.

The Coefficients table provides information to predict the dependent variable as well as to
determine whether the independent variable contributes statistically significantly to the model;
that is, the p-value associated with the t-value should be less than 0.05. The table also provides
the constant value and the unstandardized coefficient for the independent variable, which are
needed to build the prediction equation.
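Readers without SPSS can reproduce the same three pieces of information (the coefficients, R2, and the significance test) in Python. A minimal sketch using scipy on made-up data; the values below are illustrative, not the chapter's dataset:

```python
from scipy.stats import linregress

# Made-up sample: a predictor and an outcome for eight cases.
x = [62, 55, 70, 48, 66, 59, 73, 51]
y = [3.1, 2.8, 3.4, 2.6, 3.3, 2.9, 3.6, 2.7]

result = linregress(x, y)
r_square = result.rvalue ** 2  # coefficient of determination

print(f"Constant (B0) = {result.intercept:.4f}")
print(f"B coefficient = {result.slope:.4f}")
print(f"R square      = {r_square:.4f}")
print(f"p-value       = {result.pvalue:.4f}")  # tests H0: slope = 0
```

For simple regression the F test in the ANOVA table and the t test on the slope are equivalent (F = t-squared), so the single p-value here matches both tests in the SPSS output.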

Example 5.1. Suppose a researcher is interested in determining whether mother’s weight
affects baby’s birthweight. The weights of 378 mothers and their babies’ birthweights were
analyzed using SPSS and the following output was generated. Does mother’s weight predict
baby’s birthweight?

SPSS Statistics Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .167a   .028       .025                718.414
a. Predictors: (Constant), mweight

ANOVAa
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   5540272.452      1     5540272.452   10.734   .001b
   Residual     194060771.251    376   516119.072
   Total        199601043.704    377
a. Dependent Variable: bweight
b. Predictors: (Constant), mweight

Coefficientsa
                 Unstandardized Coefficients    Standardized Coefficients
Model            B            Std. Error        Beta                        t        Sig.
1  (Constant)    2426.719     162.194                                       14.962   .000
   mweight       3.977        1.214             .167                        3.276    .001
a. Dependent Variable: bweight

Figure 5.1.1. SPSS Statistics Output on Simple Linear Regression Analysis

Solution:
The Model Summary table shows that the correlation coefficient is 0.167 and the R2
value is 0.028. This indicates that 2.8% of the total variation in baby’s birthweight can be explained
by mother’s weight.

The ANOVA table shows that the F-value is 10.734 and its associated p-value is
0.001. Because the associated p-value is less than 0.05, it can be said that overall, the regression
model statistically significantly predicts baby’s birthweight (i.e. it is a good fit for the data).
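The quantities in the ANOVA table are related by simple arithmetic, which can be verified directly (the sums of squares and degrees of freedom below are copied from Figure 5.1.1):

```python
# Sums of squares and degrees of freedom from the ANOVA table in Figure 5.1.1.
ss_regression = 5540272.452
ss_residual = 194060771.251
ss_total = ss_regression + ss_residual      # ~199601043.704
df_regression, df_residual = 1, 376         # k and n - k - 1, with n = 378

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual     # matches 516119.072
f_value = ms_regression / ms_residual       # matches 10.734
r_square = ss_regression / ss_total         # matches .028 in the Model Summary

print(round(f_value, 3), round(r_square, 3))
```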

The Coefficients table shows that the t-value for mother’s weight is 3.276, with an
associated p-value of 0.001. Since the associated p-value is less than 0.05, it can be said that
mother’s weight contributes statistically significantly to baby’s birthweight. The unstandardized
coefficient (B coefficient) for mother’s weight is 3.977 and the constant is 2426.719. Hence, the
regression equation is Baby’s Birthweight = 2426.72 + 3.977 (Mother’s Weight). The equation can be
used to predict a baby’s birthweight in grams given the mother’s weight.
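The fitted equation can be turned directly into a prediction function. A small sketch, with the coefficients taken from Figure 5.1.1 and mother's weight assumed to be in pounds (the unit used in Example 5.2's interpretation):

```python
def predict_birthweight(mothers_weight):
    """Predicted baby's birthweight in grams (equation from Example 5.1)."""
    return 2426.72 + 3.977 * mothers_weight

# e.g., a mother weighing 150 lb:
print(predict_birthweight(150))
```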

In a research manuscript, the results can be presented and interpreted as follows:

Table 5.1.1. Summary of Simple Regression Analysis for Mother’s
Weight Affecting Baby’s Birthweight (N=378)

                      B             SE B       β
Constant              2426.719**    162.194
Mother’s Weight       3.977**       1.214      0.167
R2                    0.028
F                     10.734**
**p < 0.01

Table 5.1.1 shows the summary of the simple regression analysis which was carried out
to investigate whether mothers’ weight could significantly affect babies’ birthweight.
The results of the regression analysis indicated that the model explained 2.8% of the
variance and that the model was a significant predictor of babies’ birthweight, F(1, 376)
= 10.734, p = .001. The results also show that mothers’ weight contributed significantly
to the model (B = 3.977, p = .001). The final predictive model was: Babies’ Birthweight
= 2426.72 + 3.977 (Mothers’ Weight).

MULTIPLE REGRESSION ANALYSIS
Multiple linear regression is used when we want to predict the value of a variable based on
the value of two or more other variables. Before analyzing data using multiple linear regression, the
following assumptions must be checked:

1. The dependent variable should be measured on a continuous scale – interval or ratio level.
2. There are two or more independent variables which are either continuous or categorical
(i.e. nominal or ordinal).
3. There should be independence of observations.
4. There is a linear relationship between (a) the dependent variable and each of the
independent variables and (b) the dependent variable and the independent variables
collectively.
5. The data needs to show homoscedasticity.
6. The data must not show multicollinearity (i.e. two or more independent variables are highly
correlated with each other).
7. There should be no significant outliers, high leverage points or highly influential points.
8. The residuals (errors) of the regression line variables should be approximately normally
distributed.
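Assumption 6 (no multicollinearity) is commonly screened with variance inflation factors (VIFs): each predictor is regressed on the remaining predictors, and VIF = 1 / (1 - R2); values above about 10 are a common warning sign. A minimal sketch on made-up data:

```python
import numpy as np

def vifs(X):
    """Variance inflation factor for each column of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]                              # predictor j as the outcome
        others = np.delete(X, j, axis=1)         # remaining predictors
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        result.append(float("inf") if r2 >= 1 else 1 / (1 - r2))
    return result

# Made-up data: the first two columns are nearly collinear (column 2 is
# roughly twice column 1), while the third column is not.
X = [[1, 2.1, 5], [2, 3.9, 1], [3, 6.2, 4],
     [4, 7.8, 2], [5, 10.1, 6], [6, 11.9, 3]]
print([round(v, 1) for v in vifs(X)])
```

SPSS reports the same statistic (and its reciprocal, tolerance) under Collinearity Diagnostics in the regression output.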

Interpreting SPSS Statistics Output of Multiple Linear Regression Analysis


Similar to simple linear regression analysis, the output of multiple linear regression analysis
presents three tables – the Model Summary table, the ANOVA table, and the Coefficients table. The
Model Summary table provides the multiple correlation coefficient and the R square value. The R2
value indicates how much of the total variation in the dependent variable can be explained by the
independent variables.

The ANOVA table reports how well the regression equation fits the data (i.e. how well the
regression equation predicts the dependent variable). The regression model statistically
significantly predicts the dependent variable (i.e. it is a good fit for the data) if the p-value
associated with the F-value is less than 0.05.

The Coefficients table provides information to predict the dependent variable as well as to
determine whether the independent variables contribute statistically significantly to the model;
that is, the p-value associated with the t-value should be less than 0.05. The table also provides
the unstandardized coefficients for the independent variables, which indicate how much the
dependent variable changes with a one-unit change in an independent variable when all other
independent variables are held constant.

Example 5.2.1. Suppose a researcher is interested in determining whether mother’s age,
weight, history of premature labor and presence of uterine irritability predict baby’s
birthweight. The data were analyzed using SPSS and the following output was generated. Do
mother’s age, weight, history of premature labor and presence of uterine irritability predict
baby’s birthweight?

SPSS Statistics Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .313a   .098       .089                694.648
a. Predictors: (Constant), Presence of Uterine Irritability, Age of the
Mother in Years, mweight, History of Premature Labor

ANOVAa
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   19615016.330     4     4903754.082   10.162   .000b
   Residual     179986027.374    373   482536.266
   Total        199601043.704    377
a. Dependent Variable: bweight
b. Predictors: (Constant), Presence of Uterine Irritability, Age of the Mother in Years, mweight,
History of Premature Labor

Coefficientsa
                                       Unstandardized Coefficients   Standardized Coefficients
Model                                  B           Std. Error        Beta        t        Sig.
1  (Constant)                          2528.187    209.715                       12.055   .000
   Age of the Mother in Years          7.344       6.920             .053        1.061    .289
   mweight                             2.634       1.214             .110        2.169    .031
   History of Premature Labor          -178.828    79.523            -.115       -2.249   .025
   Presence of Uterine Irritability    -444.454    105.729           -.214       -4.204   .000
a. Dependent Variable: bweight

Figure 5.2.1. SPSS Statistics Output on Multiple Linear Regression Analysis

Solution:
The Model Summary table shows that the multiple correlation coefficient is
0.313 and the R2 value is 0.098. This indicates that 9.8% of the total variation in baby’s birthweight can
be explained by mother’s age, weight, history of premature labor and presence of uterine irritability.

The ANOVA table shows that the F-value is 10.162 and its associated p-value is
reported as .000 (i.e., p < .001). Because the associated p-value is less than 0.05, it can be said
that overall, the regression model statistically significantly predicts baby’s birthweight (i.e. it
is a good fit for the data).

The Coefficients table provides the unstandardized coefficients (B coefficients). The
unstandardized coefficient for mother’s weight is 2.634. This means that for every 1 lb increase in
mother’s weight, there is an increase in baby’s birthweight of 2.634 grams, holding the other
predictors constant. The coefficients for mother’s weight, history of premature labor and presence
of uterine irritability are significant since the p-values associated with their t-values are less
than 0.05. This indicates that mother’s weight, history of premature labor and presence of uterine
irritability are significant predictors of baby’s birthweight. On the other hand, the coefficient of
mother’s age is not significant since the p-value associated with its t-value is greater than 0.05.
Hence, mother’s age is not a significant predictor of baby’s birthweight.
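Assuming the two clinical predictors are coded 0/1 (absent/present), which should be verified against the actual dataset, the B coefficients in Figure 5.2.1 translate into the following prediction function:

```python
def predict_birthweight(age, weight, premature_labor, uterine_irritability):
    """Predicted baby's birthweight in grams, using the B coefficients
    from Figure 5.2.1. The last two arguments are assumed here to be 0/1
    indicators (absent/present); check the actual coding before use."""
    return (2528.187
            + 7.344 * age
            + 2.634 * weight
            - 178.828 * premature_labor
            - 444.454 * uterine_irritability)

# e.g., a 25-year-old, 120-lb mother with neither risk factor:
print(predict_birthweight(25, 120, 0, 0))
```

Since mother's age is not a significant predictor, a parsimonious model might drop it; the full equation is kept here to mirror the SPSS output.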

In a research manuscript, the results can be presented and interpreted as follows:

Table 5.2.1. Summary of Multiple Regression Analysis for Mother’s
Profile Predicting Baby’s Birthweight (N=378)

                                    B           SE B       p-value
Constant                            2528.187    209.715    <0.001
Mother’s Age                        7.344       6.920      0.289
Mother’s Weight                     2.634       1.214      0.031
History of Premature Labor          -178.828    79.523     0.025
Presence of Uterine Irritability    -444.454    105.729    <0.001
R-square = 0.098, F = 10.162, p < 0.001

Table 5.2.1 presents the summary of the results of the multiple regression analysis
testing the effect of mother’s age, weight, history of premature labor and presence of
uterine irritability on the baby’s birthweight. Analysis of the data reveals that mother’s
weight significantly affects baby’s birthweight (B=2.634, SE=1.214, p=0.031). The
regression coefficient is positive, indicating that heavier mothers tend to have heavier
babies. Further analysis of the data reveals that mother’s history of premature labor
significantly affects the baby’s birthweight (B=-178.828, SE=79.523, p=0.025). The
negative coefficient indicates that mothers with a history of premature labor tend to
have babies with lower birthweights. It is also revealed that presence of uterine
irritability negatively and significantly affects baby’s birthweight (B=-444.454,
SE=105.729, p<.001); mothers without uterine irritability tend to have heavier babies
than those with uterine irritability. Age of mother is not a significant predictor of
baby’s birthweight (B=7.344, SE=6.920, p=0.289). Taking all the predictors together,
the R-square is .098. This finding indicates that 9.8% of the variability in baby’s
birthweight is explained by the predictors of the study. The other 90.2% is explained
by other factors not included in the present study.

References:

Al-Majed and Preston, M. R. (2000). Factors influencing the total mercury and methyl mercury in the
hair of fishermen in Kuwait. Environmental Pollution, 109, 239-250.

Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: SAGE Publications Ltd.

IBM Corp. (2012). IBM SPSS Statistics for Windows, Version 21.0. Armonk, NY: IBM Corp.

Senthilnathan, S. (2019). Usefulness of correlation analysis. Retrieved July 30, 2020 from
https://www.researchgate.net/publication/334308527_Usefulness_of_Correlation_Analysis

Wagner, Agahajanian, and Bing (1968). Correlation of performance test scores with tissue
concentration of lysergic acid diethylamide in human subjects. Clinical Pharmacology and
Therapeutics, 9, 635-638.
