Prepared by
Yaregal Birhanu GSE/9996/15
Section 2
Submitted to: Temesgen (PhD)
AAU, CoBE
Dec 29, 2023
Part I:
1. Cross-Sectional Data: a type of observational data collected at a single point in time or
over a very short period. Each observation represents a distinct individual or unit, and the
data are collected at a specific moment or within a specific time frame. Key characteristics
of cross-sectional data: a snapshot in time, no time dimension, and heterogeneity across units.
2. Time Series Data: a type of data that is collected or recorded over time, typically at regular
intervals. It involves the observation of a single variable or several variables over successive
periods.
3. Panel Data: combines elements of both cross-sectional and time series data; it consists of
observations on multiple subjects or entities over multiple time periods.
4. Pooled Cross-Sectional Data: a type of data that combines cross-sectional observations from
multiple time periods.
5. Correlation: measures the strength and direction of a linear relationship between two variables.
6. Ordinary Least Squares (OLS): an estimation method that chooses the coefficient estimates so
as to minimize the sum of the squared residuals.
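For the simple regression y_i = β0 + β1·x_i + u_i, the OLS criterion can be written explicitly as
    min over (b0, b1) of Σ (y_i - b0 - b1·x_i)²,
which yields the familiar estimates b1 = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and b0 = ȳ - b1·x̄.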
7. The role of the error term: it represents the unobservable factors that affect the dependent
variable but are not explicitly included in the model.
8. The difference between the error term and residual: error term represents unobservable factors,
while residuals are the observed differences between actual and predicted values.
9. Confidence Interval: a statistical tool used to quantify the uncertainty or variability associated
with a point estimate of a population parameter. It provides a range of values within which we
can be reasonably confident that the true parameter lies.
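For example, a 95% confidence interval for a slope coefficient is built around its point estimate bj:
    bj ± t(0.025, n-k-1) × se(bj),
where t(0.025, n-k-1) is the critical value of the t distribution with n-k-1 degrees of freedom
(n observations, k slope coefficients).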
10. Assumptions of OLS:
Linearity: the relationship between the dependent variable and the independent variables
is linear. This means that changes in the independent variables have a constant effect on
the dependent variable.
Homoskedasticity: the variance of the residuals is constant across all levels of the
independent variables. This means that the spread of residuals should be roughly the
same for all values of the independent variables.
No Perfect Multicollinearity: no perfect linear relationship among the independent
variables. Multicollinearity occurs when two or more independent variables are highly
correlated, making it difficult to separate their individual effects on the dependent
variable.
Normality: the errors (and hence the residuals) are normally distributed.
11. Multicollinearity:
Nature of Multicollinearity:
High Correlation: high correlation coefficients between independent variables,
indicating that changes in one variable are associated with systematic changes in
another.
Interpretation Challenge: In the presence of multicollinearity, it becomes
challenging to isolate the individual impact of each independent variable on the
dependent variable.
Causes of Multicollinearity:
Data Collection Methods: two variables may be highly correlated because they are
derived from similar sources or measured in a similar way.
Overlapping Concepts: Variables that measure similar or overlapping concepts are
likely to be correlated.
Functional Relationships: functional relationships among the independent
variables, such as one variable being constructed from another, induce correlation between them.
Sample Size: In small samples, the estimation of correlation coefficients may be
imprecise, leading to difficulties in identifying multicollinearity.
Consequences of Multicollinearity:
Increased Standard Errors: Multicollinearity inflates the standard errors of the
regression coefficients, making them less precise.
Unstable Coefficients: Small changes in the data can lead to large changes in the
estimated coefficients, making the results unstable and sensitive to variations in the
sample.
Ambiguous Variable Importance: Multicollinearity makes it challenging to
determine which variables are truly important in explaining the variation in the
dependent variable.
Inefficient Estimation: The efficiency of the parameter estimates is compromised, as
the model struggles to separate the effects of highly correlated variables.
Misleading Interpretations: multicollinearity can lead to misleading interpretations of the
relationships between the independent variables and the dependent variable. It may
suggest that certain variables are not important when, in fact, they are.
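In practice, multicollinearity is often screened for with variance inflation factors and pairwise
correlations. A minimal Stata sketch, assuming a model with dependent variable y and regressors
x1, x2, and x3 (placeholder names):
    regress y x1 x2 x3    // fit the model by OLS
    estat vif             // variance inflation factors; values above about 10 are a common warning sign
    correlate x1 x2 x3    // pairwise correlations among the regressors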
12. The importance of the normality assumption in OLS:
t-Tests and p-Values: inference about individual coefficients relies on t-tests, which
assume that the sampling distribution of the estimated coefficients is approximately
normal. The p-values associated with these tests are accurate when the normality
assumption holds.
Large Sample Size Mitigation: for large sample sizes, the central limit theorem implies
that the sampling distribution of the OLS estimators approaches normality, even if the
underlying distribution of the error term is not exactly normal. This means that the
normality assumption becomes less critical with larger samples.
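The normality of the errors can be examined through the residuals. A minimal Stata sketch, again
with placeholder names y, x1, and x2:
    regress y x1 x2
    predict uhat, residuals    // store the OLS residuals
    histogram uhat, normal     // compare the residual distribution with a normal curve
    sktest uhat                // skewness/kurtosis test of normality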
13. Heteroscedasticity: the variability of the errors (residuals) in a regression model is not
constant across all levels of the independent variable(s).
Nature of Heteroscedasticity:
o Unequal Spread: Heteroscedasticity manifests as an unequal spread of residuals.
This means that the variability of the errors systematically changes across
different values of the independent variable(s).
Causes of Heteroscedasticity:
o Omitted Variables: Failure to include relevant variables in the model that affect
the variability of the dependent variable.
o Transformation Issues: The use of transformations (e.g., logarithmic
transformations) on variables may introduce heteroscedasticity if the
transformation affects variability differently across levels of the independent
variable.
o Measurement Error: If there is measurement error in the dependent variable that
varies systematically with the independent variable.
Consequences of Heteroscedasticity:
o Inefficient Estimates: under heteroscedasticity the OLS coefficient estimates remain
unbiased but are no longer efficient, and the usual standard error formulas are biased,
so the precision of the estimates is compromised and hypothesis tests can be misleading.
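A common workflow in Stata is to test for heteroscedasticity and, if needed, refit with robust
standard errors. A sketch with placeholder names y, x1, and x2:
    regress y x1 x2
    estat hettest                   // Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
    regress y x1 x2, vce(robust)    // heteroscedasticity-robust standard errors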
Part II:
1. a) OLS regression of salary on sales, profits, and mktval (Stata output):
------------------------------------------------------------------------------------------
salary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------------------
sales | .0159837 .011093 1.44 0.151 -.0059112 .0378787
profits | .0317025 .2764857 0.11 0.909 -.514017 .5774221
mktval | .023831 .0159338 1.50 0.137 -.0076187 .0552807
_cons | 717.0624 47.51152 15.09 0.000 623.2855 810.8393
b) Sales:
Coefficient: 0.0159837
Interpretation: Holding other variables constant, a one-unit increase in sales is associated
with an increase of approximately 0.016 units in salary. However, the p-value (P>|t|) is
0.151, suggesting that the relationship is not statistically significant at the conventional
0.05 significance level.
Profits:
Coefficient: 0.0317025
Interpretation: Holding other variables constant, a one-unit increase in profits is
associated with an increase of approximately 0.032 units in salary. However, the p-value
is 0.909, indicating that the relationship is not statistically significant.
Market Value:
Coefficient: 0.023831
Interpretation: Holding other variables constant, a one-unit increase in market value is
associated with an increase of approximately 0.024 units in salary. The p-value is 0.137,
suggesting that the relationship is not statistically significant at the conventional 0.05
significance level.
Intercept (_cons):
Coefficient: 717.0624
Interpretation: When all independent variables are zero (sales, profits, and market value
are all zero), the estimated average salary is 717.06. This is the intercept or the baseline
salary.
c) Only the intercept is statistically significant at the 0.05 significance level (p = 0.000);
sales, profits, and market value are all statistically insignificant.
d) The coefficient of determination (R2) provides the proportion of the variation in the
dependent variable (CEO salaries) explained by the independent variables (sales, profits, and
market value) in the regression model. R2 value is 0.1777, which means that approximately
17.77% of the total variation in CEO salaries is explained by the variation in sales, profits, and
market value.
e) CEO Tenure (ceoten): Coefficient is 12.73086.
Holding the other variables constant, each additional year of CEO tenure is associated with an
estimated increase of approximately 12.731 units in CEO salary.
f) Linear Term
The coefficient β2 represents the linear effect of age on salary.
If β2 is positive, it indicates a positive linear relationship. As age increases, salary
increases.
If β2 is negative, it indicates a negative linear relationship. As age increases,
salary decreases.
Quadratic Term
The coefficient β3 represents the quadratic effect of age on salary.
The quadratic term introduces curvature to the relationship. If β3 is positive, it
indicates an upward-curving (convex) relationship: the marginal effect of age on
salary grows as age increases.
If β3 is negative, it indicates a downward-curving (concave) relationship: salary
initially increases with age, but the rate of increase slows, and eventually salary
may start decreasing at very high ages.
The turning point is the age at which the effect of age on salary changes direction, from
increasing to decreasing (or vice versa); it can be found by setting the marginal effect of age
to zero, as shown below.
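Setting the derivative of salary with respect to age equal to zero gives the turning point:
    ∂salary/∂age = β2 + 2·β3·age = 0  =>  age* = -β2 / (2·β3)
With β2 > 0 and β3 < 0, salary rises until age* and declines thereafter.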
2. a) Lowest years of education = 0; highest = 18
b) Average education = 12.56274 years
c) Average wage = 5.896103
d) Number of females = 252
e) Number of males = 274
f) Number of people earning above the mean wage = 217
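These summaries could be reproduced in Stata roughly as follows (assuming the dataset contains
the variables wage, educ, and a female dummy):
    summarize educ            // minimum, maximum, and mean years of education
    summarize wage            // mean wage
    tabulate female           // counts of females (female = 1) and males (female = 0)
    quietly summarize wage
    count if wage > r(mean)   // people earning above the mean wage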
g) wage = β0 + β1·educ + β2·exper + β3·tenure + u
i) Interpretation of Parameters in the Model:
reg wage educ exper tenure
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .5989651 .0512835 11.68 0.000 .4982176 .6997126
exper | .0223395 .0120568 1.85 0.064 -.0013464 .0460254
tenure | .1692687 .0216446 7.82 0.000 .1267474 .2117899
_cons | -2.872735 .7289643 -3.94 0.000 -4.304799 -1.440671
β0: The intercept. It represents the estimated wage when all independent variables (educ,
exper, and tenure) are zero.
β1: The coefficient on educ. It represents the estimated change in wage for a one-unit
change in education, holding exper and tenure constant.
β2: The coefficient on exper. It represents the estimated change in wage for a one-unit
change in years of experience, holding educ and tenure constant.
β3: The coefficient on tenure. It represents the estimated change in wage for a one-unit
change in tenure, holding educ and exper constant.
u: The error term, capturing unobserved factors that affect wage but are not included in the model.
ii) log(wage) = β0 + β1·educ + β2·exper + β3·tenure + u
The interpretation is similar, but now each slope coefficient, multiplied by 100, is approximately
the percentage change in wage associated with a one-unit change in the corresponding independent
variable.
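A minimal Stata sketch for estimating this log-wage specification (assuming wage is strictly
positive):
    gen lwage = log(wage)              // natural log of wage
    regress lwage educ exper tenure    // slope coefficients are approximate proportional effects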
iii) Interpretation of R-Squared:
R2 (coefficient of determination) measures the proportion of variability in the dependent
variable (wage or log(wage)) explained by the independent variables.
For example, if R2=0.80, it means that 80% of the variability in wage (or log(wage)) is
explained by the variables in the model
iv) Interpretation of Adjusted R-Squared:
Adjusted R2 takes into account the number of variables in the model. It penalizes the
addition of variables that do not improve the model significantly.
It is useful when comparing models with different numbers of variables.
A higher adjusted R2 suggests a better balance between model fit and complexity.
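Formally,
    adjusted R² = 1 - (1 - R²) × (n - 1) / (n - k - 1),
where n is the number of observations and k is the number of slope coefficients; the penalty
grows as more regressors are added.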
v) exper is the only regressor that is not statistically significant at the 5% level (p = 0.064),
although it would be significant at the 10% level.
3. a) The coefficient of dtown is 45.8.
Interpretation: Holding the size (sqrmeter) and the number of bedrooms (bdrms) constant,
being in the town (dtown = 1) is associated with a house price that is higher by 45.8 units
compared to being outside the town (dtown = 0).
b) The intercept is 14.6.
Interpretation: When the size is zero (which might not be practically meaningful), the
number of bedrooms is zero, and the house is not in the town (dtown = 0), the estimated
house price is 14.6.
c) R2=0.58 indicates that approximately 58% of the variability in house prices is explained by
the independent variables (sqrmeter, bdrms, dtown) in the model.
d) The coefficient of bdrms is 75.25.
Interpretation: Holding the size (sqrmeter) and the town indicator (dtown) constant,
one additional bedroom is associated with an estimated increase in house price of 75.25 units.
e) price = 14.6 + 90.52×140 + 75.25×4 + 45.8×1 = 13,034.2
f) i) price = 14.6 + 90.52×2400 + 75.25×3 + 45.8×1 = 217,534.15
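These fitted values can be verified quickly with Stata's display command:
    display 14.6 + 90.52*140 + 75.25*4 + 45.8*1     // e): 13034.2
    display 14.6 + 90.52*2400 + 75.25*3 + 45.8*1    // f) i): 217534.15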