Models Assignment


1. INTRODUCTION
An econometric model specifies the statistical relationship that is believed to hold between the
various economic quantities pertaining to a particular economic phenomenon.

Econometrics integrates economic theory with mathematics and statistics in order to give empirical content to economic relationships. In plain terms, it expresses economic theories in mathematical form and confronts them with empirical data. With the help of econometric methods it becomes straightforward to obtain parameter values, that is, the coefficients that quantify the relationships postulated by economics and mathematics.

An econometric model consists of:

- a set of equations describing the behaviour. These equations are derived from the economic model and have two parts: observed variables and disturbances;

- a statement about the errors in the observed values of the variables;

- a specification of the probability distribution of the disturbances.

An econometric model then is a set of joint probability distributions to which the true joint
probability distribution of the variables under study is supposed to belong. In the case in which
the elements of this set can be indexed by a finite number of real-valued parameters, the model is
called a parametric model; otherwise it is a nonparametric or semiparametric model. A large part
of econometrics is the study of methods for selecting models, estimating them, and carrying out
inference on them.
2. ECONOMETRIC MODELS
An econometric model is based on equations that can describe, for example, the economy's impact on business management. Each equation combines observed variables with a disturbance term, and the model is completed by explicit statements about the errors in the observed values and about the probability distribution of the disturbances, rather than resting on unexamined assumptions.

3. LINEAR REGRESSION MODEL


In statistics, linear regression is a linear approach for modeling the relationship between a scalar
response and one or more explanatory variables (also known as dependent and independent
variables). The case of one explanatory variable is called simple linear regression; for more than
one, the process is called multiple linear regression. This term is distinct from multivariate linear
regression, where multiple correlated dependent variables are predicted, rather than a single
scalar variable.

In linear regression, the relationships are modeled using linear predictor functions whose
unknown model parameters are estimated from the data. Such models are called linear models.

Most commonly, the conditional mean of the response given the values of the explanatory
variables (or predictors) is assumed to be an affine function of those values; less commonly, the
conditional median or some other quantile is used. Like all forms of regression analysis, linear
regression focuses on the conditional probability distribution of the response given the values of
the predictors, rather than on the joint probability distribution of all of these variables, which is
the domain of multivariate analysis.

Linear regression has many practical uses. Most applications fall into one of the following two
broad categories:

1. If the goal is error reduction in prediction or forecasting, linear regression can be used to fit a
predictive model to an observed data set of values of the response and explanatory variables.
2. If the goal is to explain variation in the response variable that can be attributed to variation in
the explanatory variables, linear regression analysis can be applied to quantify the strength of
the relationship between the response and the explanatory variables, and in particular to
determine whether some explanatory variables may have no linear relationship with the
response at all, or to identify which subsets of explanatory variables may contain redundant
information about the response.
The linear regression model has five key assumptions:
• Linear relationship between the dependent and independent variables.
• Multivariate normality.
• No or little multicollinearity.
• No autocorrelation.
• Homoscedasticity, i.e. the variability in the response doesn’t increase as the value of the predictor increases.

Additionally, the basic assumption of the linear regression model is that of a linear relationship
between the dependent and independent variables. We also assume that the errors follow a
normal distribution, and that the observations are independent of each other.

The Assumptions of Linear Regression


Linear regression is a useful statistical method we can use to understand the relationship between
two variables, x and y. However, before we conduct linear regression, we must first make sure
that four assumptions are met:
1. Linear relationship: There exists a linear relationship between the independent variable,
x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is no correlation
between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.

If one or more of these assumptions are violated, then the results of our linear regression may be
unreliable or even misleading.
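
To make these assumption checks concrete, here is a minimal sketch, assuming Python with statsmodels and synthetic data (all variable names and numbers are illustrative), that fits a model and reports diagnostics related to the assumptions above:

```python
# Minimal sketch: fit a linear regression on synthetic data and look at
# diagnostics tied to the assumptions above. Names and numbers are illustrative.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)   # linear truth with normal errors

X = sm.add_constant(x)                         # add the intercept column
fit = sm.OLS(y, X).fit()

print(fit.summary())                               # coefficients, R-squared, tests
print("Durbin-Watson:", durbin_watson(fit.resid))  # ~2 suggests no autocorrelation
# Residuals vs fitted values (homoscedasticity) and a Q-Q plot of the residuals
# (normality) would normally be inspected as well.
```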

Advantages of Linear Regression


Simple implementation

Linear Regression is a very simple algorithm that can be implemented very easily to give
satisfactory results. Furthermore, these models can be trained easily and efficiently even on
systems with relatively low computational power when compared to other complex algorithms.
Linear regression has a considerably lower time complexity when compared to some of the other
machine learning algorithms. The mathematical equations of Linear regression are also fairly
easy to understand and interpret. Hence Linear regression is very easy to master.

Performance on linearly separable datasets


Linear regression fits linearly separable datasets almost perfectly and is often used to find the
nature of the relationship between variables.

Overfitting can be reduced by regularization

Overfitting is a situation that arises when a machine learning model fits a dataset very closely and hence captures the noisy data as well. This negatively impacts the performance of the model and reduces its accuracy on the test set.

Regularization is a technique that can be easily implemented and is capable of effectively reducing the complexity of a function so as to reduce the risk of overfitting.
Disadvantages of Linear Regression
Prone to underfitting
Underfitting is a situation that arises when a machine learning model fails to capture the data properly. This typically occurs when the hypothesis function cannot fit the data well.

Sensitive to outliers
Outliers of a data set are anomalies or extreme values that deviate from the other data points of
the distribution. Data outliers can damage the performance of a machine learning model
drastically and can often lead to models with low accuracy.

4. PROBIT MODEL
In probability theory and statistics, the Probit function is the quantile function associated with the
standard normal distribution. It has applications in data analysis and machine learning, in
particular exploratory statistical graphics and specialized regression modeling of binary response
variables.
Largely because of the central limit theorem, the standard normal distribution plays a
fundamental role in probability theory and statistics. If we consider the familiar fact that the
standard normal distribution places 95% of probability between −1.96 and 1.96, and is
symmetric around zero, it follows that probit(0.025) = −1.96 and probit(0.975) = 1.96; more generally, by symmetry, probit(p) = −probit(1 − p).
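
Because the probit function is simply the standard normal quantile function, the 95% fact above can be verified directly; a minimal sketch, assuming SciPy is available (an illustrative choice, not something the text prescribes):

```python
# The probit is the standard normal quantile function (inverse CDF).
from scipy.stats import norm

print(norm.ppf(0.975))                    # probit(0.975) ≈ 1.96
print(norm.ppf(0.025))                    # probit(0.025) ≈ -1.96
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ≈ 0.95, the familiar fact above
```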

An ordinary differential equation for the probit function

Another means of computation is based on forming a non-linear ordinary differential equation (ODE) for the probit function, following the method of Steinbrecher and Shaw. Writing w(p) = probit(p), the defining relation Φ(w(p)) = p gives

dw/dp = 1/φ(w(p)),

where φ and Φ are the standard normal density and distribution functions, and differentiating once more yields the second-order ODE

d²w/dp² = w (dw/dp)²,

with centre conditions w(1/2) = 0 and w′(1/2) = √(2π). This equation may be solved by several methods, including the classical power series approach, and from such a series solutions of arbitrarily high accuracy may be developed based on Steinbrecher's approach to the series for the inverse error function.

The main assumptions of a probit model are:

• The output variable is a Bernoulli random variable.
• The input variables are continuous.
• The errors are normally distributed.
• The errors are independent and identically distributed.
• The errors are homoscedastic.
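
To show how a model with these ingredients can be estimated, here is a minimal probit-regression sketch, assuming Python with statsmodels and synthetic data (the variable names and coefficients are invented for the example):

```python
# Minimal probit regression sketch on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=500)
latent = 0.3 + 1.2 * x + rng.normal(size=500)   # normal, i.i.d. errors, as assumed
y = (latent > 0).astype(int)                    # Bernoulli output variable

X = sm.add_constant(x)                          # intercept + continuous input
print(sm.Probit(y, X).fit(disp=0).summary())
```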

5. Tobit model
In statistics, a tobit model is any of a class of regression models in which the observed range of
the dependent variable is censored in some way.
Tobin's idea was to modify the likelihood function so that it reflects the unequal sampling
probability for each observation depending on whether the latent dependent variable fell above
or below the determined threshold. For a sample that, as in Tobin's original case, was censored
from below at zero, the sampling probability for each non-limit observation is simply the height
of the appropriate density function. For any limit observation, it is the cumulative distribution,
i.e. the integral below zero of the appropriate density function. The tobit likelihood function is
thus a mixture of densities and cumulative distribution functions.

The likelihood function
Below are the likelihood and log-likelihood functions for a type I tobit. For a data set with N observations, the likelihood function for a type I tobit censored from below at zero is

L(β, σ) = ∏_{y_j > 0} (1/σ) φ((y_j − X_j β)/σ) × ∏_{y_j = 0} Φ(−X_j β/σ),

where φ and Φ denote the standard normal density and cumulative distribution functions. The log-likelihood is obtained by summing the logarithms of these terms over the N observations.
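
As an illustration only, the type I likelihood above can be coded and maximized numerically; the sketch below uses synthetic data and SciPy's optimizer (all names and values are assumptions for the example, and a real analysis would use a dedicated tobit routine):

```python
# Sketch: type I tobit log-likelihood (censoring from below at zero),
# maximized numerically. Illustrative only.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)   # latent variable
y = np.maximum(y_star, 0.0)                               # observed, censored at 0

def neg_loglik(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                  # keep sigma positive
    xb = b0 + b1 * x
    pos = y > 0
    ll = np.empty(n)
    ll[pos] = norm.logpdf(y[pos], xb[pos], sigma)   # density term (non-limit obs)
    ll[~pos] = norm.logcdf(-xb[~pos] / sigma)       # cumulative term (limit obs)
    return -ll.sum()

res = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
print("beta:", res.x[:2], "sigma:", np.exp(res.x[2]))
```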

Reparameterization
The log-likelihood as stated above is not globally concave, which complicates maximum likelihood estimation. Olsen suggested the simple reparameterization δ = β/σ and γ = 1/σ (equivalently, β = δ/γ and σ² = γ⁻²), resulting in the transformed log-likelihood

log L(δ, γ) = ∑_{y_j > 0} [log γ + log φ(γ y_j − X_j δ)] + ∑_{y_j = 0} log Φ(−X_j δ),

which is globally concave in terms of the transformed parameters.


For the truncated (tobit II) model, Orme showed that while the log-likelihood is not globally
concave, it is concave at any stationary point under the above transformation.

Consistency

If the relationship parameter β is estimated by regressing the observed y_i on x_i while ignoring the censoring, the resulting ordinary least squares regression estimator is inconsistent. It will yield a downwards-biased estimate of the slope coefficient and an upward-biased estimate of the intercept. Takeshi Amemiya (1973) has proven that the maximum likelihood estimator suggested by Tobin for this model is consistent.

Interpretation

The β coefficient should not be interpreted as the effect of x_i on y_i, as one would with a linear regression model; that is a common error. Instead, it should be interpreted as the combination of (1) the change in y_i of those above the limit, weighted by the probability of being above the limit, and (2) the change in the probability of being above the limit, weighted by the expected value of y_i if above.

Variations of the tobit model


Variations of the tobit model can be produced by changing where and when censoring occurs.
Amemiya (1985, p. 384) classifies these variations into five categories (tobit type I – tobit type
V), where tobit type I stands for the first model described above. Schnedler (2005) provides a
general formula to obtain consistent likelihood estimators for these and other variations of the
tobit model.
Type I
The tobit model described above is the type I tobit; it is a special case of a censored regression model in which the latent variable y_i* = β x_i + u_i is observed only when it is positive:

y_i = y_i* if y_i* > 0, and y_i = 0 if y_i* ≤ 0.

A common variation of the tobit model is censoring at a value y_L different from zero:

y_i = y_i* if y_i* > y_L, and y_i = y_L if y_i* ≤ y_L.

Type II
Type II tobit models introduce a second latent variable
In Type I tobit, the latent variable absorbs both the process of participation and the outcome of
interest. Type II tobit allows the process of participation (selection) and the outcome of interest
to be independent, conditional on observable data.
The Heckman selection model falls into the Type II tobit, which is sometimes called Heckit after
James Heckman.

Type III

Type III introduces a second observed dependent variable. The Heckman model falls into this type.

Type IV
Type IV introduces a third observed dependent variable and a third latent variable.

Type V

Applications
Tobit models have, for example, been applied to estimate factors that impact grant receipt,
including financial transfers distributed to sub-national governments who may apply for these
grants. In these cases, grant recipients cannot receive negative amounts, and the data is thus left-
censored. For instance, Dahlberg and Johansson (2002) analyse a sample of 115 municipalities
(42 of which received a grant). Dubois and Fattore (2011) use a tobit model to investigate the
role of various factors in European Union fund receipt by Polish sub-national governments.

The Tobit model assumes that:

• There is a latent variable underlying the observed dependent variable.
• The error term is normally distributed.
• The quantity of interest in a Tobit model may be the censored outcome or the uncensored outcome.

6. Binary Logistic regression model


The most adequate situation for applying the binary logistic regression model is when the phenomenon under study presents itself in a dichotomous way and the researcher intends to estimate an expression for the probability of occurrence of an event, defined between two possibilities, as a function of determined explanatory variables. The binary logistic regression model can be considered a special case of the multinomial logistic regression model, whose dependent variable also presents itself in a qualitative form, but now with more than two event categories, and an occurrence probability expression is estimated for each category (Fávero and Belfiore, 2019).

In statistics, the logistic model (or logit model) is a statistical model that models the probability
of an event taking place by having the log-odds for the event be a linear combination of one or
more independent variables. In regression analysis, logistic regression (or logit regression) is
estimating the parameters of a logistic model (the coefficients in the linear combination).
Formally, in binary logistic regression there is a single binary dependent variable, coded by an
indicator variable, where the two values are labeled "0" and "1", while the independent variables
can each be a binary variable (two classes, coded by an indicator variable) or a continuous
variable (any real value).
Binary variables are widely used in statistics to model the probability of a certain class or event
taking place, such as the probability of a team winning, of a patient being healthy, etc. (see the
applications below), and the logistic model has been the most commonly used model for binary
regression since about 1970.

Applications
Logistic regression is used in various fields, including machine learning, most medical fields,
and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is
widely used to predict mortality in injured patients, was originally developed by Boyd et al.
using logistic regression.[6] Many other medical scales used to assess severity of a patient have
been developed using logistic regression. Logistic regression may be used to predict the risk of
developing a given disease (e.g. diabetes; coronary heart disease), based on observed
characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).
Another example might be to predict whether a Nepalese voter will vote for the Nepali Congress,
the Communist Party of Nepal, or any other party, based on age, income, sex, race, state of
residence, votes in previous elections, etc.

Model
Logistic regression is a method that we can use to fit a regression model when the response
variable is binary.
Before fitting a model to a dataset, logistic regression makes the following assumptions:
Assumptions of binary logistic regression
Assumption #1: The Response Variable is Binary
Logistic regression assumes that the response variable only takes on two possible outcomes.
Some examples include:
• Yes or No
• Male or Female
• Pass or Fail
• Drafted or Not Drafted
• Malignant or Benign

Assumption #2: The Observations are Independent


Logistic regression assumes that the observations in the dataset are independent of each other.
That is, the observations should not come from repeated measurements of the same individual or
be related to each other in any way.
Assumption #3: There is No Multicollinearity Among Explanatory Variables

Logistic regression assumes that there is no severe multicollinearity among the explanatory
variables.
Assumption #4: There are No Extreme Outliers
Logistic regression assumes that there are no extreme outliers or influential observations in the
dataset.
Assumption #5: There is a Linear Relationship Between Explanatory Variables and the
Logit of the Response Variable
Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as

logit(p) = log(p / (1 − p)),

where p is the probability of the event occurring.

A binary logit model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables.
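
A minimal fitting sketch, assuming Python with statsmodels and synthetic data (the age and income predictors and all coefficients are invented for this illustration):

```python
# Minimal binary logistic regression sketch on synthetic data (statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
age = rng.uniform(20, 70, n)
income = rng.normal(50, 10, n)
log_odds = -8 + 0.1 * age + 0.05 * income        # linear in the log-odds
p = 1 / (1 + np.exp(-log_odds))
y = rng.binomial(1, p)                            # binary dependent variable

X = sm.add_constant(np.column_stack([age, income]))
res = sm.Logit(y, X).fit(disp=0)
print(res.summary())
print("odds ratios:", np.exp(res.params))         # easier to interpret than log-odds
```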
Here are some advantages and disadvantages of binary logit models:

Advantages:

• Logistic regression is easier to implement and interpret, and very efficient to train.
• It makes no assumptions about the distributions of classes in feature space.
• It extends easily to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
• Its model coefficients can be interpreted as indicators of feature importance.
• It is very fast at classifying unknown records.
• It requires moderate or no multicollinearity between independent variables.

Disadvantages:
• If the number of observations is smaller than the number of features, logistic regression should not be used; otherwise, it may lead to overfitting.
• The major limitation of logistic regression is the assumption of linearity between the dependent variable and the independent variables. Non-linear problems cannot be solved with logistic regression because it has a linear decision surface, and linearly separable data is rarely found in real-world scenarios.
• It can only be used to predict discrete functions, so the dependent variable of logistic regression is bound to the discrete number set.
• It is tough to capture complex relationships using logistic regression; more powerful and compact algorithms such as neural networks can easily outperform it.

Multinomial logistic regression


In statistics, multinomial logistic regression is a classification method that generalizes logistic
regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it
is a model that is used to predict the probabilities of the different possible outcomes of a
categorically distributed dependent variable, given a set of independent variables (which may be
real-valued, binary-valued, categorical-valued, etc.).
Multinomial logistic regression is known by a variety of other names, including polytomous LR,
multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt)
classifier, and the conditional maximum entropy model.
Multinomial logistic regression is used when the dependent variable in question is nominal
(equivalently categorical, meaning that it falls into any one of a set of categories that cannot be
ordered in any meaningful way) and for which there are more than two categories. Some
examples would be:

• Which major will a college student choose, given their grades, stated likes and dislikes, etc.?
• Which blood type does a person have, given the results of various diagnostic tests?
• In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal?
• Which candidate will a person vote for, given particular demographic characteristics?
• Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries?

Introduction
There are multiple equivalent ways to describe the mathematical model underlying multinomial
logistic regression. This can make it difficult to compare different treatments of the subject in
different texts. The article on logistic regression presents a number of equivalent formulations of
simple logistic regression, and many of these have analogues in the multinomial logit model.
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that builds a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:

score(X_i, k) = β_k · X_i,

where X_i is the vector of explanatory variables describing observation i, β_k is the vector of weights (regression coefficients) corresponding to outcome k, and score(X_i, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k.
The predicted outcome is the one with the highest score.
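
As a small illustration of this scoring rule, here is a sketch using only NumPy, with weights and observations invented for the example; it computes the dot-product scores, converts them to class probabilities with a softmax (the usual multinomial-logit link), and picks the highest-scoring category:

```python
# Sketch of the multinomial scoring rule: one weight vector per category,
# dot-product scores, softmax probabilities, predicted class = highest score.
import numpy as np

def scores(X, B):
    """X: (n, p) features; B: (K, p) weights, one row per category k."""
    return X @ B.T                                # score(x_i, k) = beta_k . x_i

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)          # for numerical stability
    e = np.exp(S)
    return e / e.sum(axis=1, keepdims=True)       # class probabilities per row

X = np.array([[1.0, 2.0], [0.5, -1.0]])              # two observations, two features
B = np.array([[0.1, 0.2], [0.0, 0.0], [-0.3, 0.5]])  # three outcome categories
P = softmax(scores(X, B))
print(P)                                          # each row sums to 1
print(P.argmax(axis=1))                           # predicted outcome per observation
```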
Assumptions
The multinomial logistic model assumes that data are case-specific; that is, each independent
variable has a single value for each case. As with other types of regression, there is no need for
the independent variables to be statistically independent from each other (unlike, for example, in
a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes
difficult to differentiate between the impact of several variables if this is not the case.[5]
If the multinomial logit is used to model choices, it relies on the assumption of independence of
irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds
of preferring one class over another do not depend on the presence or absence of other
"irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do
not change if a bicycle is added as an additional possibility. This allows the choice of K
alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is
chosen as a "pivot" and the other K − 1 are compared against it, one at a time.

Strengths

Logistic Regression is one of the simplest machine learning algorithms and is easy to implement
yet provides great training efficiency in some cases. Also due to these reasons, training a model
with this algorithm doesn't require high computation power.
The predicted parameters (trained weights) give inference about the importance of each feature.
The direction of association i.e. positive or negative is also given. So we can use logistic
regression to find out the relationship between the features.
This algorithm allows models to be updated easily to reflect new data, unlike decision trees or
support vector machines. The update can be done using stochastic gradient descent.

Weaknesses
• It is difficult to capture complex relationships using logistic regression. More powerful and complex algorithms such as neural networks can easily outperform this algorithm.
• The training features are known as independent variables. Logistic regression requires moderate or no multicollinearity between independent variables. This means that if two independent variables have a high correlation, only one of them should be used, as repetition of information could lead to wrong training of the parameters (weights) while minimizing the cost function. Multicollinearity can be removed using dimensionality reduction techniques.
• In linear regression the independent and dependent variables should be related linearly, but logistic regression requires that the independent variables be linearly related to the log odds, log(p/(1 − p)).
• Only important and relevant features should be used to build a model; otherwise the probabilistic predictions made by the model may be incorrect and the model's predictive value may degrade.
7. Propensity score matching model
In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not.
The "propensity" describes how likely a unit is to have been treated, given its covariate values. The stronger the confounding of treatment and covariates, and hence the stronger the bias in the analysis of the naive treatment effect, the better the covariates predict whether a unit is treated or not. By having units with similar propensity scores in both treatment and control, such confounding is reduced.
PSM is intended for causal inference under confounding bias in non-experimental settings in which (i) few units in the non-treatment comparison group are comparable to the treatment units, and (ii) selecting a subset of comparison units similar to the treatment units is difficult because units must be compared across a high-dimensional set of pretreatment characteristics.

Implementations in statistics packages


• R: propensity score matching is available as part of the MatchIt, optmatch, and other packages.
• SAS: the PSMatch procedure and the macro OneToManyMTCH match observations based on a propensity score.
• Stata: several commands implement propensity score matching, including the user-written psmatch2. Stata version 13 and later also offers the built-in command teffects psmatch.
• SPSS: a dialog box for Propensity Score Matching is available from the IBM SPSS Statistics menu (Data/Propensity Score Matching), and allows the user to set the match tolerance, randomize case order when drawing samples, prioritize exact matches, sample with or without replacement, set a random seed, and maximize performance by increasing processing speed and minimizing memory usage.
• Python: PsmPy, a library for propensity score matching in Python.
Formal definitions
The basic case is of two treatments (numbered 1 and 0), with N independent and identically distributed subjects. Each subject i has a potential outcome under treatment, Y_i(1), and a potential outcome under control, Y_i(0). The quantity to be estimated is the average treatment effect, E[Y(1)] − E[Y(0)].
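
To make the estimation target concrete, here is a hedged sketch of a basic PSM workflow on synthetic data, assuming scikit-learn for the propensity model and for nearest-neighbour matching; it is a simplification of what the dedicated packages listed above do, and it estimates the average treatment effect on the treated:

```python
# Rough PSM sketch: logistic propensity model, 1:1 nearest-neighbour matching
# on the score, then a simple matched-difference estimate. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))                              # observed covariates
p_treat = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))   # confounded assignment
treat = rng.binomial(1, p_treat)
y = 2.0 * treat + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# Step 1: estimate propensity scores from the covariates
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the closest score
treated, control = np.where(treat == 1)[0], np.where(treat == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched = control[idx.ravel()]

# Step 3: matched-pair difference in outcomes (effect on the treated)
print("ATT estimate:", (y[treated] - y[matched]).mean())
```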

Weaknesses of PSM
PSM has been shown to increase model "imbalance, inefficiency, model dependence, and bias," which is not the case with most other matching methods. The insights behind the use of matching still hold but should be applied with other matching methods; propensity scores also have other productive uses in weighting and doubly robust estimation.
Another disadvantage of PSM is that it only accounts for observed (and observable) covariates and not latent characteristics. Factors that affect assignment to treatment and outcome but that cannot be observed cannot be accounted for in the matching procedure. As the procedure only controls for observed variables, any hidden bias due to latent variables may remain after matching. A further issue is that PSM requires large samples, with substantial overlap between treatment and control groups.

Strengths of PSM
The key advantages of PSM were, at the time of its introduction, that by using a linear combination of covariates for a single score, it balances treatment and control groups on a large number of covariates without losing a large number of observations. If units in the treatment and control groups were balanced on a large number of covariates one at a time, large numbers of observations would be needed to overcome the "dimensionality problem" whereby the introduction of a new balancing covariate increases the minimum necessary number of observations in the sample geometrically.
Main theorems

• Any score that is 'finer' than the propensity score is a balancing score; in particular, the propensity score itself is a balancing score, so treated and control units with the same propensity score have the same distribution of the observed covariates X.
• If treatment assignment is strongly ignorable given X, then it is also strongly ignorable given any balancing score, such as the propensity score.
• Using sample estimates of balancing scores can produce sample balance on X.

8. Ordered logit model


In statistics, the ordered logit model (also ordered logistic regression or proportional odds model)
is an ordinal regression model—that is, a regression model for ordinal dependent variables—first
considered by Peter McCullagh. For example, if one question on a survey is to be answered by a
choice among "poor", "fair", "good", "very good" and "excellent", and the purpose of the
analysis is to see how well that response can be predicted by the responses to other questions,
some of which may be quantitative, then ordered logistic regression may be used. It can be
thought of as an extension of the logistic regression model that applies to dichotomous dependent
variables, allowing for more than two (ordered) response categories.

The model and the proportional odds assumption


The model only applies to data that meet the proportional odds assumption, the meaning of
which can be exemplified as follows. Suppose there are five outcomes: "poor", "fair", "good",
"very good", and "excellent". We assume that the probabilities of these outcomes are given by
p1(x), p2(x), p3(x), p4(x), p5(x), all of which are functions of some independent variable(s) x.
Then, for a fixed value of x, the logarithms of the odds (not the logarithms of the probabilities) of answering in certain ways are:

log[p1(x) / (p2(x) + p3(x) + p4(x) + p5(x))]   (poor),
log[(p1(x) + p2(x)) / (p3(x) + p4(x) + p5(x))]   (poor or fair),
log[(p1(x) + p2(x) + p3(x)) / (p4(x) + p5(x))]   (poor, fair, or good),
log[(p1(x) + p2(x) + p3(x) + p4(x)) / p5(x)]   (poor, fair, good, or very good).

The proportional odds assumption states that the numbers added to each of these logarithms to get the next are the same regardless of x. In other words, the difference between the logarithm of the odds of having poor or fair health and the logarithm of the odds of having poor health is the same regardless of x; similarly, the difference between the logarithm of the odds of having poor, fair, or good health and the logarithm of the odds of having poor or fair health is the same regardless of x; and so on.
Ordered logit can be derived from a latent-variable model, similar to the one from which binary logistic regression can be derived. Suppose the underlying process to be characterized is

y* = x′β + ε,

where y* is the exact but unobserved dependent variable (perhaps the exact level of agreement with the statement proposed by the pollster), x is the vector of independent variables, β is the vector of regression coefficients we wish to estimate, and ε is the error term, assumed to follow a standard logistic distribution. Further suppose that while we cannot observe y*, we instead can only observe the categories of response

y = 0 if y* ≤ μ1, y = 1 if μ1 < y* ≤ μ2, …, y = N if y* > μN,

where the μ parameters are the thresholds between categories.
Then the ordered logit technique will use the observations on y, which are a form of censored data on y*, to fit the parameter vector β.
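
A minimal fitting sketch, assuming a recent statsmodels release that provides OrderedModel and using synthetic survey-style data (the categories, thresholds, and coefficient are invented for the example):

```python
# Minimal ordered logit sketch with statsmodels' OrderedModel; synthetic data.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(5)
n = 600
x = rng.normal(size=n)
y_star = 1.0 * x + rng.logistic(size=n)            # latent variable, logistic error
cuts = [-2.0, -0.5, 0.5, 2.0]                      # thresholds mu_1 .. mu_4
codes = np.digitize(y_star, cuts)                  # observed category 0..4

labels = ["poor", "fair", "good", "very good", "excellent"]
y = pd.Series(pd.Categorical.from_codes(codes, categories=labels, ordered=True))

res = OrderedModel(y, x.reshape(-1, 1), distr="logit").fit(method="bfgs", disp=False)
print(res.summary())                               # slope plus estimated thresholds
```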


Advantages and disadvantages of ordinal logistic regression
So what are the main advantages and disadvantages of ordinal logistic regression? Here are some
of the main advantages and disadvantages you should keep in mind when deciding whether to
use ordinal logistic regression.
Advantages of ordinal logistic regression
• Handles ordered outcomes. Ordinal logistic regression is one of the few common machine learning models that was specifically developed to handle multiclass outcomes that have a natural order to them. That means that it is in a league of its own when it comes to handling ordinal outcomes.
• Fewer parameters than other multiclass regression models. The ordinal logistic regression model is a simple model that has fewer parameters to estimate than other regression models that can handle multiclass data. Given that two models have relatively similar performance, it is almost always better to go with the simpler model.
• Interpretable coefficients. As with many other regression models, ordinal logistic regression models provide highly interpretable coefficients that explain the relationship between your features and your outcome variable. These coefficients often come along with confidence intervals and statistical tests for even better interpretability.

Disadvantages of ordinal logistic regression
• Proportional odds assumption. One of the main disadvantages of ordinal logistic regression is that it makes a fairly strong assumption that is not necessarily valid in all cases. This assumption, called the proportional odds assumption, essentially implies that the differences associated with moving from one category of the outcome variable to the next higher category are the same across all categories. There are many examples of situations where this is not true, so you should consider the domain of the problem and assess your data to determine whether this assumption holds.
• Not available in common libraries. Another downside of ordinal logistic regression is that it is a relatively niche model that is not available in all common machine learning libraries. Ordinal logistic regression, and regression models in general, tend to be more commonly used in fields where inference and classical statistics are king. That means that ordinal logistic regression models are more likely to be implemented in languages and programs that favor classical statistics, such as SAS and Stata.
• General regression downsides. Ordinal logistic regression is subject to many of the same pitfalls as other regression models like linear regression and logistic regression. This means that ordinal logistic regression models are also easily thrown off by things like outliers, correlated features, unspecified interactions, and missing data.

When to use ordinal logistic regression


So when should you use ordinal logistic regression over other machine learning models? Here are some examples of situations where you should reach for ordinal logistic regression.
• You have an ordinal outcome and inference is your primary goal. In general, you should reach for regression models that have highly interpretable coefficients when inference is your primary goal. That means that you should reach for regression models that can handle multiclass outcomes, such as ordinal logistic regression models or multinomial regression models, any time inference is your primary goal.
• The proportional odds assumption holds. You should specifically use ordinal logistic regression over a similar model like multinomial regression when your data has a natural ordering to it and you believe the proportional odds assumption holds. In these scenarios, the ordinal logistic regression model is the simpler model with fewer parameters to estimate.

9. Heckman two-stage model

The Heckman selection model is basically a method for estimating regression models with problems of sample selection bias (selectivity bias). If the data used for a regression analysis are collected through random sampling, then classic regression methods, such as least squares, work well. However, if the data are obtained by a sampling procedure that is not random, then standard procedures do not work well. Under the Heckit framework, the dependent variable of the equation of interest is observed only for the part of the sample selected by a separate participation equation; the latent variable driving that selection is not observed, only an indicator of whether the observation is selected. To obtain consistent estimates, however, we rely on the conditional regression equation for the selected sample,

E[y_i | x_i, selected] = x_i′β + ρ σ_e λ(w_i′γ),

where λ(·) = φ(·)/Φ(·) is the inverse Mills ratio evaluated at the selection-equation index w_i′γ, ρ is the correlation between the errors of the two equations, and σ_e is the standard deviation of the outcome-equation error. Augmenting the outcome regression with an estimate of λ therefore corrects for the selection.

APPLICATION OF THE HECKMAN SELECTIVITY MODEL


For an illustration of the Heckman selection model, we will conduct an analysis of the wages earned by married women, using the Mroz (1987) sample dataset and the Stata (version 13) heckman two-step command. From a sample of 753 married women, 428 participate in the labour market and have nonzero earnings. We first estimate a wage equation, explaining ln(WAGE) as a function of the woman's education (EDUC) and years of work experience (EXPER), using the 428 women who have positive wages. Note that the selection bias in this case comes from the fact that the estimation only takes into account women with positive (nonzero) earnings, i.e. those participating in the labour market, thus ignoring the remaining 325 observations for which we also collected data.
Our wage (response) equation takes the following form:

ln(WAGE_i) = β1 + β2 EDUC_i + β3 EXPER_i + e_i
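
To make the two-step logic concrete, here is a hedged sketch on synthetic data that mimics this wage-equation setup; the EDUC, EXPER, and KIDS variables, all coefficients, and the exclusion restriction are invented for the illustration and are not taken from the Mroz data:

```python
# Hedged sketch of the Heckman two-step estimator on synthetic data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 753
educ = rng.integers(8, 18, n).astype(float)
exper = rng.integers(0, 30, n).astype(float)
kids = rng.integers(0, 3, n).astype(float)                 # exclusion restriction
u = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 0.25]], size=n)

participate = (0.3 * educ - 0.5 * kids - 3.5 + u[:, 0] > 0).astype(int)
lnwage = 0.5 + 0.08 * educ + 0.02 * exper + u[:, 1]        # observed only if participating

# Step 1: probit participation equation, then the inverse Mills ratio (IMR)
Z = sm.add_constant(np.column_stack([educ, kids]))
probit = sm.Probit(participate, Z).fit(disp=0)
index = Z @ probit.params
imr = norm.pdf(index) / norm.cdf(index)

# Step 2: OLS on the selected sample, augmented with the IMR
sel = participate == 1
X = sm.add_constant(np.column_stack([educ[sel], exper[sel], imr[sel]]))
print(sm.OLS(lnwage[sel], X).fit().params)   # last coefficient estimates rho*sigma_e
# Note: the step-2 OLS standard errors are inconsistent (see the next section).
```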

DISADVANTAGES OF THE HECKMAN SELECTION MODEL


1. The two-step estimator is a limited information maximum likelihood (LIML) estimator. Both in asymptotic theory and in finite samples, the full information maximum likelihood (FIML) estimator exhibits better statistical properties, but it is more computationally difficult to implement.
2. The covariance matrix generated by the OLS estimation of the second stage is inconsistent. Correct standard errors and other statistics can be generated from an asymptotic approximation or by resampling, such as through the bootstrap.
3. The canonical model assumes the errors are jointly normal. If that assumption fails, the estimator is generally inconsistent and can provide misleading inference in small samples. Semiparametric and other robust alternatives can be used in such cases.

The model obtains formal identification from the normality assumption when the same covariates appear in the selection equation and the equation of interest, but identification will be tenuous unless there are many observations in the tails, where there is substantial nonlinearity in the inverse Mills ratio (IMR). Generally, an exclusion restriction is required to obtain credible estimates.
The Heckman selection model assumes that:
1. the errors of both the selection and the main equation are correlated and normally distributed,
2. the explanatory variables in the selection equation are independent of the error term, and
3. the explanatory variables in the main equation are independent of the error term.

Double Hurdle Model


A double hurdle model is a statistical model used to analyze data with a dependent variable that takes on the endpoints of an interval with positive probability and is continuously distributed over the interior of the interval. It is used when the decision to participate in the market is decoupled from the decision about the amount to consume. The double hurdle model allows for the estimation of two separate equations: one for participation and another for consumption conditional on participation.
The double hurdle model was first introduced by Cragg in 1971. It has been used in various fields, such as economics, health, and the social sciences, to model phenomena that give rise to corner-solution responses.
The double hurdle is an extension of the standard censored regression (tobit) model for limited dependent variables that allows the process generating zero (censored) observations to differ from the process generating positive (uncensored) observations. Extensions of the hurdle model to count data and to panel data have also been developed, and the model has been used widely in applied microeconomics and empirical studies, with relevant software available.

Double-hurdle models are used with dependent variables that take on the endpoints of an interval with positive probability and that are continuously distributed over the interior of the interval. For example, suppose you observe the amount of alcohol individuals consume over a fixed period of time. The distribution of the amounts will be roughly continuous over positive values, but there will be a "pile up" at zero, which is the corner solution to the consumption problem the individuals face; no individual can consume a negative amount of alcohol.
Suppose individuals make their consumption decisions in two steps. First, the individual determines whether he or she wants to participate in the market; this is called the participation decision. Then the individual determines an optimal consumption amount (which may be 0) given his or her circumstances; this is called the quantity decision.
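
To make the two-hurdle structure concrete, here is a hedged sketch of the log-likelihood of Cragg's original specification with independent errors (a probit participation hurdle plus a normal amount equation); the function and argument names are invented for the illustration, and a real application would rely on an established package or carefully verified code:

```python
# Sketch: log-likelihood of the independent double hurdle (Cragg, 1971).
# A positive outcome requires clearing both hurdles; a zero can come from either.
import numpy as np
from scipy.stats import norm

def double_hurdle_loglik(params, y, z, x):
    """y: outcomes (zeros and positives); z, x: design matrices incl. constants."""
    kz = z.shape[1]
    gamma = params[:kz]                        # participation (first hurdle)
    beta = params[kz:-1]                       # amount equation (second hurdle)
    sigma = np.exp(params[-1])                 # keep sigma positive
    zg, xb = z @ gamma, x @ beta
    ll = np.where(
        y > 0,
        norm.logcdf(zg) + norm.logpdf(y, xb, sigma),        # cleared both hurdles
        np.log(1.0 - norm.cdf(zg) * norm.cdf(xb / sigma)),  # failed at least one
    )
    return ll.sum()

# This function would be maximized (e.g. with scipy.optimize.minimize on its
# negative) over (gamma, beta, log sigma) for a given dataset.
```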

10. Multiple linear regression

What is multiple linear regression?
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the values of two or more other variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict its value are known as independent or explanatory variables.

[Figure 1: Multiple linear regression model predictions for individual observations.]

Multiple linear regression formula

y_i = β0 + β1 x_i1 + β2 x_i2 + … + βp x_ip + ε_i

Where:
y_i is the dependent variable for observation i;
x_i1, …, x_ip are the explanatory (independent) variables;
β0 is the intercept and β1, …, βp are the slope coefficients for each explanatory variable;
ε_i is the model's error term (residual).
Assumptions of Multiple Linear Regression
1. A linear relationship between the dependent and independent variables
The first assumption of multiple linear regression is that there is a linear relationship between the
dependent variable and each of the independent variables. The best way to check the linear
relationships is to create scatterplots and then visually inspect the scatterplots for linearity. If the
relationship displayed in the scatterplot is not linear, then the analyst will need to run a non-
linear regression or transform the data using statistical software, such as SPSS.
2. The independent variables are not highly correlated with each other
The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show
multicollinearity, there will be problems figuring out the specific variable that contributes to the
variance in the dependent variable. The best method to test for the assumption is the Variance
Inflation Factor method.

3. The variance of the residuals is constant
Multiple linear regression assumes that the amount of error in the residuals is similar at each point of the linear model. This scenario is known as homoscedasticity. When analyzing the data, the analyst should plot the standardized residuals against the predicted values to determine whether the points are distributed fairly evenly across all values of the independent variables. To test the assumption, the data can be plotted on a scatterplot, or statistical software can be used to produce a scatterplot that includes the entire model.
4. Independence of observations
The model assumes that the observations are independent of one another; simply put, the values of the residuals are independent. To test this assumption, we use the Durbin-Watson statistic.
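
As a rough illustration of fitting a multiple regression and checking the multicollinearity assumption, here is a hedged sketch using statsmodels on synthetic data; the housing-style variable names and all coefficients are invented for the example:

```python
# Sketch: multiple linear regression plus a Variance Inflation Factor (VIF) check.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 300
size = rng.normal(150, 30, n)                      # illustrative predictors
bedrooms = size / 40 + rng.normal(0, 0.5, n)       # deliberately correlated with size
dist_school = rng.uniform(0.1, 5.0, n)
price = 50 + 0.8 * size + 5 * bedrooms - 2 * dist_school + rng.normal(0, 10, n)

X = sm.add_constant(np.column_stack([size, bedrooms, dist_school]))
print(sm.OLS(price, X).fit().summary())

# VIF per predictor (column 0 is the constant); values above ~5-10 flag trouble
for i, name in enumerate(["size", "bedrooms", "dist_school"], start=1):
    print(name, variance_inflation_factor(X, i))
```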

Advantages of Multiple Regression


There are two main advantages to analyzing data using a multiple regression model. The first is the ability to determine the relative influence of one or more predictor variables on the criterion value. For example, a real estate agent could find that the size of a home and the number of bedrooms have a strong correlation with the price of the home, while the proximity to schools has no correlation at all, or even a negative correlation if it is primarily a retirement community.
The second advantage is the ability to identify outliers, or anomalies. For example, while
reviewing the data related to management salaries, the human resources manager could find that
the number of hours worked, the department size and its budget all had a strong correlation to
salaries, while seniority did not. Alternatively, it could be that all of the listed predictor values
were correlated to each of the salaries being examined, except for one manager who was being
overpaid compared to the others.

Disadvantages of Multiple Regression


Any disadvantage of using a multiple regression model usually comes down to the data being
used. Two examples of this are using incomplete data and falsely concluding that a correlation is
a causation.
When reviewing the price of homes, for example, suppose the real estate agent looked at only 10 homes, seven of which were purchased by young parents. In this case, the apparent relationship between proximity to schools and sale price may lead her to believe that school proximity had an effect on the sale price of all homes being sold in the community. This illustrates the pitfalls of incomplete data. Had she used a larger sample, she could have found that, out of 100 homes sold, only ten percent of the home values were related to a school's proximity. If she had used the buyers' ages as a predictor value, she could have found that younger buyers were willing to pay more for homes in the community than older buyers.

11. Simple linear regression


Simple linear regression is used to estimate the relationship between two quantitative
variables. You can use simple linear regression when you want to know:
1. How strong the relationship is between two variables (e.g., the relationship between rainfall
and soil erosion).
2. The value of the dependent variable at a certain value of the independent variable (e.g., the
amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain assumptions about the
data. These assumptions are:
1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t
change significantly across the values of the independent variable.
2. Independence of observations: the observations in the dataset were collected using
statistically valid sampling methods, and there are no hidden relationships among observations.
3. Normality: The data follows a normal distribution.

Linear regression makes one additional assumption:


4. The relationship between the independent and dependent variable is linear: the line of best fit
through the data points is a straight line (rather than a curve or some sort of grouping factor)

How to perform a simple linear regression


Simple linear regression formula
The formula for a simple linear regression is

y = β0 + β1 x + ε,

where y is the predicted value of the dependent variable for a given value of the independent variable x, β0 is the intercept (the predicted value of y when x is 0), β1 is the regression coefficient (how much we expect y to change as x increases), and ε is the error of the estimate.

For a simple linear regression, you can simply plot the observations on the x and y axes and then include the regression line and the regression function.
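
A minimal sketch of estimating such a line, assuming SciPy is available (the rainfall and erosion names echo the example above, but the numbers are synthetic):

```python
# Fit the simple regression line y = b0 + b1*x with scipy's linregress.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(8)
rainfall = rng.uniform(0, 100, 50)
erosion = 5 + 0.3 * rainfall + rng.normal(0, 3, 50)

fit = linregress(rainfall, erosion)
print(f"erosion ≈ {fit.intercept:.2f} + {fit.slope:.2f} * rainfall,"
      f" r^2 = {fit.rvalue**2:.3f}")
```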
Advantages of linear regression
1. Linear regression performs well when the dataset is linearly separable. We can use it to find the nature of the relationship among the variables.
2. Linear regression is easy to implement and interpret, and very efficient to train.
3. Linear regression is prone to over-fitting, but this can be easily avoided using dimensionality reduction techniques, regularization (L1 and L2), and cross-validation.

Disadvantages of linear regression
1. The main limitation of linear regression is the assumption of linearity between the dependent variable and the independent variables. In the real world, the data is rarely linearly separable; the model assumes a straight-line relationship between the dependent and independent variables, which is often incorrect.
2. Prone to noise and overfitting: if the number of observations is smaller than the number of features, linear regression should not be used, otherwise it may overfit, because it starts considering noise while building the model.
3. Prone to outliers: linear regression is very sensitive to outliers (anomalies), so outliers should be analyzed and removed before applying linear regression to the dataset.
4. Prone to multicollinearity: before applying linear regression, multicollinearity should be removed (using dimensionality reduction techniques) because the model assumes that there is no relationship among the independent variables.

References
Oxford English Dictionary, 3rd ed. s.v. probit (article dated June 2007): Bliss, C. I. (1934). "The
Method of Probits". Science. 79 (2037): 38–39. Bibcode:1934Sci....79...38B.
doi:10.1126/science.79.2037.38. PMID 17813446. These arbitrary probability units have been
called 'probits'.

Agresti, Alan (2015). Foundations of Linear and Generalized Linear Models. New York: Wiley.
pp. 183–186. ISBN 978-1-118-73003-4.

Aldrich, John H.; Nelson, Forrest D.; Adler, E. Scott (1984). Linear Probability, Logit, and
Probit Models. Sage. pp. 48–65. ISBN 0-8039-2133-

Tolles, Juliana; Meurer, William J (2016). "Logistic Regression Relating Patient Characteristics
to Outcomes". JAMA. 316 (5): 533–4. doi:10.1001/jama.2016.7653. ISSN 0098-7484. OCLC
6823603312. PMID 27483067.

Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley. ISBN 978-0-471-35632-5.

Cramer, J. S. (2002), pp. 10–11.
Hayashi, Fumio (2000). Econometrics. Princeton: Princeton University Press. pp. 518–521.
ISBN 0-691-01018-8.

Goldberger, Arthur S. (1964). Econometric Theory. New York: J. Wiley. pp. 253–55. ISBN
9780471311010.

Tobin, James (1958). "Estimation of Relationships for Limited Dependent Variables" (PDF).
Econometrica. 26(1): 24–36. doi:10.2307/1907382. JSTOR 1907382
McCullagh, Peter (1980). "Regression Models for Ordinal Data". Journal of the Royal Statistical
Society. Series B (Methodological). 42 (2): 109–142. JSTOR 2984952.
"rologit.pdf" (PDF). Stata.

Greene, William H. (2012). Econometric Analysis (Seventh ed.). Boston: Pearson Education. pp.
824–827. ISBN 978-0-273-75356-8.
Freedman, David A. (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26. "A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient."

Rencher, Alvin C.; Christensen, William F. (2012), "Chapter 10, Multivariate regression –
Section 10.1, Introduction", Methods of Multivariate Analysis, Wiley Series in Probability and
Statistics, vol. 709 (3rd ed.), John Wiley & Sons, p. 19, ISBN 9781118391679.
