Applied Econometrics: An Introduction for Research & Business using R
Serge Pajak Nicolas Soulié
serge.pajak@u-psud.fr nicolas.soulie@u-psud.fr
Université Paris-Sud
Université Paris-Sud Econometrics
Course Calendar
Week 1 Introduction to simple linear regression and the R software
Week 2 Data management basics, specification of the functional form
Week 3 Getting started with the assignment: data, topic and literature. Q & A.
Week 4 Regression with non-i.i.d. errors (not identically distributed and/or not independent)
Week 5 Instrumental variables
Week 6 Discrete choice models
Week 7 Introduction to panel data
References
- Jeffrey Wooldridge, Introductory Econometrics: A Modern Approach, 3rd Edition, 2006
- William Greene, Econometric Analysis, Prentice Hall, 6th Edition, 2008
- Florian Heiss, Using R for Introductory Econometrics, 2016
Purpose of this course
Economists try to understand and explain many (economic) questions or phenomena:
– Which factors affect people’s wages?
– Are startup companies more successful when the founder is older?
– What are the factors explaining house pricing? Size of house, # of rooms, local
crime, pollution, etc.
– Which factors affect people’s adoption and use of smartphone? Etc.
Dealing with such questions involves:
A theory: economic mechanism, potential explanatory factors, insights about the
phenomenon, relevance of the issue, etc.
Data: a dataset including the relevant variables needed to answer your question
(according to theory/literature)
Econometric analysis: data description, modification, analysis and interpretation
This course focuses on the technical (3rd) aspect.
In real life you need all three!
Econometrics
Definition
– The main variable under study, or the explained variable (wage, location choice,
migration choice, etc.)
– which is affected by other factors, or explanatory variables (income, age, gender,
consumption, education, etc.)
Theory: links between explained and explanatory variables
Consequence or result of economic, sociological, psychological, etc. mechanisms
which have been theoretically documented and/or empirically validated in existing
studies
Econometrics
Econometrics seeks to identify the nature of the relationship between an explained
variable and the explanatory variables
Nature: positive, negative, perhaps non-monotonic (positive then negative after a
threshold), etc.
Some models also make it possible to measure the strength of the relationship
Econometrics
Aim of econometrics: creating information on the relationship between variables in order
to:
Help decision-making (firm, people, public policy, etc.)
Test hypotheses
Document a question or phenomenon
For instance, does education affect wage? And if so, what is the return to an additional
year of education in terms of wage?
Evaluation: Term Paper
Write a mini-article in applied econometrics
- Short and basic, but structured with motivation, literature review, model, results and
interpretation
- Choose your topic by October 31. The proposal includes your name, topic, dataset and references
- Detailed instructions are included on myCourse and links to known public datasets
will be provided
- Must be turned in at the end of January
Installing R
1. Download and install R software: www.r-project.org
2. Download and install RStudio (desktop, free version):
www.rstudio.com/products/rstudio/download/
3. Install the car and wooldridge packages for applied regression. Other packages will
be introduced during the course!
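Step 3 can be sketched as follows. This is a minimal sketch, not an official setup script: the helper name find_missing_packages is hypothetical (it is not part of R), and the installation assumes an internet connection and a CRAN mirror, so it only runs in an interactive session such as RStudio.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named on this slide; extend the vector when new ones are introduced.
to_install <- find_missing_packages(c("car", "wooldridge"))

# Install only what is missing, and only when working interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Once installed, a package is loaded in each new session with library(), e.g. library(car).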
Data management: opening files
For importing files, use "Import dataset" in the ’Environment’ window and choose the
dataset format among csv (including text), Excel, SAS, SPSS or Stata
For example, the HousePrices3.xlsx dataset:
- Click on Import Dataset and then on "From Excel ..."
- Select the file path or URL
- Change the dataset name if needed
- Indicate whether the first line contains the variable names
library(readxl)
HousePrices3 <- read_excel("C:/...HousePrices3.xlsx")
read_excel opens the Excel file "HousePrices3.xlsx" located at the given path
HousePrices3 is the name of the dataset in R
Data management and analysis: golden rules
- Keep an original version of the dataset, and work on a copy
- Keep record of your work using a script
- Set your working directory
Script to keep record of your work on the dataset (new var., graphics, models, etc.):
- In RStudio: File => New File => R Script
- Write the command(s), select them and click on Run
- Add comments using # at the beginning of a line
Set a working directory:
- setwd("C:/Documents and Settings/Data/")
- Display working directory: getwd()
- Display files in working directory: list.files() (ls() lists the objects in the R session, not the files)
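The working-directory commands above can be tried safely with the sketch below. It uses tempdir() as a stand-in for your own project folder (an assumption for this example) and a made-up file name, example.csv.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() stands in for your own project folder.
setwd(tempdir())

# Create a small file so that the directory has something to list.
writeLines(c("a,b", "1,2"), "example.csv")

files_here   <- list.files()  # files on disk, in the working directory
objects_here <- ls()          # objects in the R session, not files

print(files_here)
print(objects_here)

# Go back to where we started.
setwd(old_dir)
```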
Data management: opening files
HousePrices3 variables:
- year: year of information collection (1978 or 1981)
- age: age of the house
- nbh: neighborhood, 0 to 6
- cbd: distance (feet) to Central Business District
- inst: distance (feet) to interstate
- price: selling price
- rooms: # rooms in house
- area: square footage of house
- land: square footage of lot
- baths: # of bathrooms
- dist: distance (feet) to incinerator
Data management: opening files
For example, the banks.txt dataset:
- Click on Import Dataset and then on "From CSV ..."
- Select the file path or URL
- Change the dataset name if needed
- Indicate whether the first line contains the variable names
- Set the delimiter/separator: comma, semicolon, tab or whitespace
valbanks <- read.table("banks.txt", sep = " ", header = TRUE)
read.table opens the file "banks.txt" located in the working directory
valbanks is the name of the dataset in R
sep= sets the separator, with whitespace: " ", comma: ",", semicolon: ";" or tab: "\t"
header = T or TRUE reads the first line as variable names
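To see what sep and header do without needing banks.txt, the sketch below writes a tiny made-up file to a temporary location and reads it back. The column names and values are invented for illustration and are not the real bank data.

```r
# Write a small whitespace-separated file with a header line (made-up content).
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("val2007 val2009 name",
             "100 40 BankA",
             "80 20 BankB"),
           tmp_file)

# header = TRUE: the first line gives the variable names.
# sep = " ": values are separated by a single space.
banks_demo <- read.table(tmp_file, sep = " ", header = TRUE)

str(banks_demo)  # two numeric columns and one text column
```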
Data management: descriptive statistics
Using the HousePrices3 database, for descriptive statistics on quantitative variables:
install.packages("skimr")
library(skimr)
skim(HousePrices3) provides basic statistics for every variable
> skim(HousePrices3)
Skim summary statistics
n obs: 321
n variables: 27
Variable type: character
variable missing complete n min max empty n_unique
roomsd 0 2247 2247 1 1 0 2
Variable type: numeric
variable missing complete n mean sd p0 p25 median
age 0 321 321 18.01 32.57 0 0 4
area 0 321 321 2106.73 694.96 735 1560 2056
baths 0 321 321 2.34 0.77 1 2 2
cbd 0 321 321 15822.43 8967.11 1000 9000 14000
counter 0 321 321 161 92.81 1 81 161
dist 0 321 321 20715.58 8508.18 5000 13400 19900
...
Data management: descriptive statistics
Using the HousePrices3 database, for descriptive statistics on a quantitative variable:
install.packages("psych")
library(psych)
describe(HousePrices3$price) provides basic statistics on the variable listed after $
> vars n mean sd median trimmed mad min max
X1 1 321 96100.66 43223.73 85900 91630.93 37658.04 26000 3e+05
range skew kurtosis se
X1 274000 1.13 1.76 2412.51
The output reports the variable’s mean, median, minimum, maximum and range, and also:
- vars: index number of the variable
- sd: standard deviation
- trimmed: variable’s mean without its 10% highest and 10% lowest values
- mad: median absolute deviation
- skew: skewness index (normal distribution: skewness = 0)
- kurtosis: excess kurtosis index (normal distribution: excess kurtosis = 0)
- se: standard error of the mean (sd divided by the square root of n)
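Some of these statistics can be reproduced with base R, which helps to see what they measure. The sketch below uses a small made-up vector with one extreme value, not the HousePrices3 data.

```r
# Small made-up sample with one extreme value.
x_demo <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean   <- mean(x_demo)                       # 14.5, pulled up by the extreme value
trimmed_mean <- mean(x_demo, trim = 0.1)           # drops the lowest and highest 10%: 5.5
mad_value    <- mad(x_demo)                        # median absolute deviation (scaled by 1.4826)
std_error    <- sd(x_demo) / sqrt(length(x_demo))  # standard error of the mean

print(c(mean = plain_mean, trimmed = trimmed_mean, mad = mad_value, se = std_error))
```

The gap between the plain mean and the trimmed mean is a quick sign that a few extreme values drive the average.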
Data management: descriptive statistics
Descriptive statistics on a quantitative variable by subcategory of another variable:
describeBy(HousePrices3$price, group=HousePrices3$nbh) provides statistics on
price for each value of nbh
Descriptive statistics by group
group: 0
vars n mean sd median trimmed mad min max
X1 1 121 108737.4 50748.08 98000 104952.4 48925.8 26000 3e+05
range skew kurtosis se
X1 274000 0.9 1.06 4613.46
---------------------------------------------------------------
group: 1
vars n mean sd median trimmed mad min max
X1 1 27 109800 46247.02 89500 105595.6 40178.46 58000 216000
range skew kurtosis se
X1 158000 0.81 -0.65 8900.24
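The same idea (one statistic per group) can be sketched in base R with tapply() and aggregate(). The data frame below is made up for illustration; only the variable names price and nbh follow the slides.

```r
# Made-up data: five houses in two neighborhoods.
demo_houses <- data.frame(
  price = c(100000, 120000, 90000, 200000, 220000),
  nbh   = c(0, 0, 0, 1, 1)
)

# Mean price for each value of nbh (result is named by group).
mean_by_nbh <- tapply(demo_houses$price, demo_houses$nbh, mean)

# Median price for each value of nbh, returned as a data frame.
median_by_nbh <- aggregate(price ~ nbh, data = demo_houses, FUN = median)

print(mean_by_nbh)
print(median_by_nbh)
```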
Data management: descriptive statistics
Descriptive statistics on a quantitative variable for a subset of observations:
describe(HousePrices3$price[HousePrices3$rooms>5]) provides statistics on price for
houses with more than 5 rooms (use >= to include houses with exactly 5 rooms)
vars n mean sd median trimmed mad min max
X1 1 287 100685.8 43191.59 90000 96725.41 40771.5 26000 3e+05
range skew kurtosis se
X1 274000 1.07 1.68 2549.52
describe(HousePrices3$price[HousePrices3$rooms>5 & HousePrices3$baths>3])
provides statistics on price for houses with more than 5 rooms and more than 3
baths
vars n mean sd median trimmed mad min max
X1 1 4 149339.2 45336.09 154750 149339.2 48449.14 98000 189857
range skew kurtosis se
X1 91857 -0.1 -2.31 22668.05
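The difference between > and >= when selecting observations is easy to check on a small made-up data frame (the values below are invented; only the variable names follow the slides).

```r
# Made-up data: five houses.
demo_houses <- data.frame(
  rooms = c(4, 5, 5, 6, 7),
  baths = c(1, 2, 4, 2, 4),
  price = c(80, 95, 150, 120, 180)
)

# Strictly more than 5 rooms: keeps the houses with 6 and 7 rooms.
price_more_than_5 <- demo_houses$price[demo_houses$rooms > 5]

# 5 rooms or more: also keeps the two houses with exactly 5 rooms.
price_5_or_more <- demo_houses$price[demo_houses$rooms >= 5]

# Two conditions at once: more than 5 rooms AND more than 3 baths.
price_both <- demo_houses$price[demo_houses$rooms > 5 & demo_houses$baths > 3]

print(price_more_than_5)
print(price_5_or_more)
print(price_both)
```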
Data management: descriptive statistics
For a qualitative variable: simple frequency table of age
as.data.frame(table(HousePrices3$age))
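A short sketch of what this command returns, using a made-up vector of ages. The column names age and count and the added share column are choices made for this example, not part of the original command.

```r
# Made-up ages for six houses.
age_demo <- c(0, 0, 4, 4, 4, 10)

# table() counts how often each value occurs; as.data.frame() turns it into a data frame.
freq <- as.data.frame(table(age_demo))
names(freq) <- c("age", "count")

# Relative frequencies can be added as a new column.
freq$share <- freq$count / sum(freq$count)

print(freq)
```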
Data management: descriptive statistics
A 2-way cross-table
library(gmodels)
CrossTable(HousePrices3$rooms,HousePrices3$nbh, digits=2,
prop.r=FALSE, prop.c=TRUE,prop.t = FALSE, prop.chisq = FALSE)
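If gmodels is not available, a similar table of counts with column proportions (the prop.c=TRUE part) can be sketched in base R with table() and prop.table(). The two vectors below are made up for illustration.

```r
# Made-up data: number of rooms and neighborhood for six houses.
rooms_demo <- c(4, 4, 5, 5, 5, 6)
nbh_demo   <- c(0, 1, 0, 0, 1, 1)

# Two-way table of counts: rooms in rows, nbh in columns.
counts <- table(rooms = rooms_demo, nbh = nbh_demo)

# margin = 2 divides each cell by its column total (column proportions).
col_shares <- prop.table(counts, margin = 2)

print(counts)
print(round(col_shares, 2))
```

With margin = 1 the proportions are computed by row instead.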
Data management: operations with variables
How to create a new variable
newvariable <- newvariable_formula
The formula can be any operation on existing variables
For instance, converting distances from feet to meters:
- Creating conversion constant (1 foot = 0.3048 meter): tometers <- 0.3048
- Then, converting from feet to meters:
HousePrices3$cbdm <- HousePrices3$cbd*tometers
HousePrices3$instm <- HousePrices3$inst*tometers
HousePrices3$distm <- HousePrices3$dist*tometers
HousePrices3$areasm <- HousePrices3$area*0.0929 # square feet to square meters
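The conversions above can be checked on a small made-up data frame. The constant names feet_to_meters and sqfeet_to_sqmeters are chosen for this sketch; note that the area factor 0.0929 is simply 0.3048 squared.

```r
# Made-up data: distance in feet and area in square feet.
demo_houses <- data.frame(cbd = c(1000, 9000), area = c(1560, 2056))

feet_to_meters     <- 0.3048
sqfeet_to_sqmeters <- feet_to_meters^2   # about 0.0929

# New variables are added to the data frame with $.
demo_houses$cbdm   <- demo_houses$cbd * feet_to_meters
demo_houses$areasm <- demo_houses$area * sqfeet_to_sqmeters

print(demo_houses)
```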
Data management: logical and math operators
Math operators in R:
- Usual math operators: - , + , / and *
- Less than/less than or equal/greater than/greater than or equal: <, <=, >, >=
- Is equal to (comparison operator in a condition/constraint): ==
Usual Math functions:
- Power: x^2, e.g. income_square <- income^2
- Square root: sqrt(x), e.g. size_sqrt <- sqrt(size + 1)
- Log: log(x) or log10(x), e.g. log_income <- log(income + 1)
Logical operators in R:
- AND: &, e.g. [HousePrices3$rooms>=5 & HousePrices3$baths>3] computes the
command only for obs. in HousePrices3 which have rooms>=5 AND baths>3
- OR: |, called the ’pipe’ (on a French keyboard, Mac: Shift + Alt + L, or PC: Alt Gr + 6), e.g.
[HousePrices3$rooms>=5 | HousePrices3$area>2000] computes the command only
for observations in HousePrices3 which have rooms>=5 OR area>2000
- NOT EQUAL: !=, e.g. [HousePrices3$rooms!=5] computes the command only for
observations in HousePrices3 which have variable rooms NOT EQUAL to 5
- NOT: !, placed in front of a condition, e.g. [!(HousePrices3$rooms==5)] selects the same observations
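Each condition returns one TRUE/FALSE value per observation, which is what the square brackets use to select rows. A small sketch with made-up vectors:

```r
# Made-up values for four houses.
rooms <- c(4, 5, 6, 5)
baths <- c(2, 4, 4, 1)
area  <- c(1500, 1800, 2500, 2100)

and_rows       <- rooms >= 5 & baths > 3     # both conditions must hold
or_rows        <- rooms >= 5 | area > 2000   # at least one condition must hold
not_equal_rows <- rooms != 5                 # rooms different from 5
negated_rows   <- !(rooms == 5)              # same result, written with the NOT operator

print(and_rows)
print(or_rows)
print(not_equal_rows)

# The logical vector can then be used to select observations.
print(area[and_rows])
```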
Data management: operations with variables
Creating dichotomous variable (1/0):
Build a variable isTenYearsOld = 1 if the house is 10 years old, and isTenYearsOld =
0 otherwise
HousePrices3$isTenYearsOld <- as.numeric(HousePrices3$age == 10)
Build a variable isRecent indicating that the house is less than 10 years old
HousePrices3$isRecent <- as.numeric(HousePrices3$age < 10)
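How as.numeric() turns a condition into a 0/1 variable can be seen on a made-up vector of ages. The ifelse() line is an equivalent alternative added for illustration.

```r
# Made-up ages for four houses.
age <- c(0, 4, 10, 32)

is_ten_years_old <- as.numeric(age == 10)   # TRUE/FALSE converted to 1/0
is_recent        <- as.numeric(age < 10)
is_recent_alt    <- ifelse(age < 10, 1, 0)  # same result, written with ifelse()

print(is_ten_years_old)
print(is_recent)

# The mean of a 0/1 variable is the share of observations equal to 1.
print(mean(is_recent))
```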
Data management: operations with variables
How to rename variables
names(HousePrices3)[names(HousePrices3) == "areasm"] <- "areasmeters"
Or, duplicate the variable then delete the old variable (here cbdmeters):
HousePrices3$cbdm <- HousePrices3$cbdmeters
HousePrices3 <- HousePrices3[, !(names(HousePrices3) %in% c("cbdmeters"))]
!(names(...) %in% c(...)): keeps every variable except those listed in c()
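Renaming and deleting can be tried on a small made-up data frame before touching the real dataset. This sketch selects columns by name, which is one common way to drop a variable; the values are invented.

```r
# Made-up data frame with the variable names used on this slide.
demo_houses <- data.frame(
  price     = c(100, 200),
  areasm    = c(145, 191),
  cbdmeters = c(305, 2743)
)

# Rename areasm to areasmeters.
names(demo_houses)[names(demo_houses) == "areasm"] <- "areasmeters"

# Duplicate cbdmeters under a new name, then drop the old column by name.
demo_houses$cbdm <- demo_houses$cbdmeters
demo_houses <- demo_houses[, !(names(demo_houses) %in% c("cbdmeters"))]

print(names(demo_houses))
```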
Graphical analysis
With a simulated (fictitious) dataset
x <- rnorm(n=100, mean=0.5, sd=0.1)
y <- 2+4*rnorm(n=100, mean=0.5, sd=0.1)
Simple histogram: hist(x)
Customized histogram
hist(x,
main="Histogram of x",
xlab="Values on the x-axis",
border="black",
col="blue",
xlim=c(-1,1),
breaks=5)
Boxplot: boxplot(x)
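A note on breaks: when a single number is given, hist() treats it as a suggestion and picks "pretty" cut points, so the number of bars may differ from the value requested. The sketch below inspects the bins without drawing anything; set.seed() is added here only to make the example reproducible.

```r
set.seed(123)  # for reproducibility of this example only
x <- rnorm(n = 100, mean = 0.5, sd = 0.1)

# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(x, breaks = 5, plot = FALSE)

print(h$breaks)  # cut points actually used
print(h$counts)  # number of observations in each bin
```

To force exact cut points, breaks can also be given as a vector, e.g. breaks = seq(0, 1, by = 0.1), provided it covers the range of x.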
Graphical analysis
Plotting one variable against the other, with a regression line
plot(x,y,
xlab="1st variable", ylab="2nd variable")
abline(lm(y ~ x), col = "blue", lty="dashed")
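A variant for experimenting: the sketch below simulates a y that depends on x (intercept 2, slope 4, plus noise), so the fitted line has a visible slope. These values and the set.seed() call are assumptions made for this example, not part of the dataset created above.

```r
set.seed(123)  # for reproducibility of this example only
x <- rnorm(n = 100, mean = 0.5, sd = 0.1)
y <- 2 + 4 * x + rnorm(n = 100, mean = 0, sd = 0.1)  # y depends on x, plus noise

# Fit the regression once, then reuse it for the plot.
fit <- lm(y ~ x)
print(coef(fit))  # estimated intercept and slope

plot(x, y, xlab = "1st variable", ylab = "2nd variable")
abline(fit, col = "blue", lty = "dashed")
```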
Or using ggplot2 package:
install.packages("ggplot2")
library(ggplot2)
ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point()
aes(): declares the variables, and geom_point(): draws the scatter plot
Graphical analysis
Plotting one variable against the other, with a regression line
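ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x)
stat_smooth: fitted line, here using a linear model (lm).
Be careful: when the command spans several lines, put the + at the end of a line, not at the
beginning of the next one; otherwise R runs the first line on its own and the second line fails.
Be careful: inside ggplot’s layers (e.g. the formula in stat_smooth()), refer to y and x
(as declared in aes()) and not to the variables’ names (e.g. price, area, etc.)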
Graphical analysis
Or with a quadratic regression line
ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ poly(x, 2))
poly(x, 2): fits a polynomial of degree 2 in x (i.e. y = a + b*x + c*x^2).
Try with a polynomial of degree 3.
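What poly(x, 2) does can be checked with lm() on simulated data. The sketch below (made-up coefficients, base R only) shows that it gives the same fitted curve as writing the squared term by hand, and how to move to degree 3.

```r
set.seed(123)  # for reproducibility of this example only
x <- runif(50, min = 0, max = 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(50)  # made-up quadratic relationship plus noise

fit_poly <- lm(y ~ poly(x, 2))    # degree-2 polynomial, as in stat_smooth()
fit_raw  <- lm(y ~ x + I(x^2))    # same model with the squared term written out

# Both formulations produce the same fitted values.
same_fit <- all.equal(fitted(fit_poly), fitted(fit_raw))
print(same_fit)

# Degree 3 only requires changing the second argument of poly().
fit_cubic <- lm(y ~ poly(x, 3))
print(length(coef(fit_cubic)))  # intercept plus three polynomial terms
```

The coefficients of the two degree-2 fits differ because poly() uses orthogonal polynomials by default; only the fitted curve is identical.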
Graphical analysis
Now using the crime database
library(haven)
crime <- read_dta(".../crime.dta")
Simple histogram + scatter plot to detect correlation between variables
hist(crime$murder)
plot(crime$pctmetro, crime$murder,
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")
text(crime$pctmetro, crime$murder, labels=crime$state, pos=2)
Graphical analysis
Plot excluding "dc":
plot(crime$pctmetro[crime$state!="dc"], crime$murder[crime$state!="dc"],
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")
Graphical analysis
Or using ggplot (excluding "dc"), and adding a fitted curve:
ggplot(data=crime[(crime$state!="dc"),], aes(y=murder, x=pctmetro)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ poly(x, 2))
Graphical analysis
Now using the Banks losses data
valbanks<-scan("banks.txt",
what=list(0,0,""), sep="", skip=1, comment.char="#")
valbanks
valj2007<-valbanks[[1]]
valj2009<-valbanks[[2]]
namebank<-valbanks[[3]]
percent_losses<-100*(valj2007-valj2009)/valj2007
percent_losses
abs_losses<-(valj2007-valj2009)
abs_losses
plot(abs_losses, percent_losses,
main="Absolute Losses vs. Relative Losses (in %)",
xlab="Losses (absolute, in billions)",
ylab="Relative losses (in % of January 2007 value)",
col="blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
text(abs_losses, percent_losses, labels=namebank, cex= 0.7, offset = 10)
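The two loss measures can be checked by hand on made-up numbers. The sketch below assumes losses are counted as positive numbers and expressed in percent of the 2007 value; the bank names and values are invented and do not come from banks.txt.

```r
# Made-up market values for two fictitious banks.
val2007 <- c(100, 80)
val2009 <- c(40, 20)
bank    <- c("Bank A", "Bank B")

abs_losses     <- val2007 - val2009            # 60 and 60
percent_losses <- 100 * abs_losses / val2007   # 60% and 75%

print(data.frame(bank, abs_losses, percent_losses))
```

Both banks lose the same absolute amount, but the smaller bank loses a larger share of its value, which is what the scatter plot is meant to reveal.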
Graphical analysis
Complex representation for categorical variables
library(vcd)
mosaic(...)
Graphical analysis
library(vcd)
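# isLarge is stored in the data frame, like isRecent, so that data = HousePrices3 contains it
HousePrices3$isLarge <- as.numeric(HousePrices3$areasm >= 110)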
mosaic(~ isRecent + isLarge + nbh4,
data = HousePrices3, shade=TRUE, legend=TRUE )
Extra resources for graphical analysis
Commented examples:
https://www.harding.edu/fmccown/r/
https://www.statmethods.net/graphs/line.html
https://en.wikibooks.org/wiki/R_Programming/Graphics
More ’impressive’ results:
https://www.r-graph-gallery.com