18BCE10291 - Outliers Assignment
Name: Vikram BM
Slot: G11+G12+G13
Assignment Questions:
https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
https://www.transtats.bts.gov/Fields.asp?Table_ID=236
http://stat-computing.org/dataexpo/2009/the-data.html
Exercise 1
Print the structure and the summary statistics of the dataset to see the variable types and
their extreme values, and check whether they make sense.
str(flights)
## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ DepTime : Factor w/ 1440 levels "00:01:00","00:02:00",..: 1204 474 388 566 1110 1181 1178 639 377 981 ...
## $ CRSDepTime : Factor w/ 1217 levels "00:00:00","00:01:00",..: 973 233 158 348 853 933 888 418 153 758 ...
## $ ArrTime : Factor w/ 1440 levels "00:01:00","00:02:00",..: 1332 602 484 654 1200 1282 1238 692 412 1000 ...
## $ CRSArrTime : Factor w/ 1377 levels "00:00:00","00:01:00",..: 1283 538 408 598 1103 1208 1118 648 348 953 ...
## $ FlightNum : int 335 3231 448 1746 3920 378 509 535 11 810 ...
## $ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...
## $ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...
## $ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...
## $ Distance : int 810 810 515 515 515 688 1591 1591 451 451 ...
summary(flights)
## 1st Qu.:2008 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.:2.00
##
## 1st Qu.: 75.00 1st Qu.: 77.0 1st Qu.: 54.00 1st Qu.: -10.000
## 3rd Qu.:142.00 3rd Qu.: 145.0 3rd Qu.:119.00 3rd Qu.: 9.000
## NA's :11
## 1st Qu.: -4.0 ORD : 311768 ORD : 327183 1st Qu.: 308.0
## 3rd Qu.: 9.0 PHX : 180767 IAH : 175499 3rd Qu.: 861.0
## (Other):4820065 (Other):4818451
## SecurityDelay LateAircraftDelay
## Min. : 0 Min. : 0
## Median : 0 Median : 0
## Mean : 0 Mean : 22
Exercise 2
For categorical variables, outliers are taken to be the levels whose relative frequency is
below 1%. Make a barplot of flights$UniqueCarrier and of flights$CancellationCode.
What do you think? There are more categorical variables, so I encourage you to try them out
as well.
Ans.
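counts <- table(flights$UniqueCarrier)    # number of flights per carrier
barplot(counts, main = "Carrier Distribution",
        xlab = "Carrier", ylab = "Number of flights")
count <- table(flights$CancellationCode)  # number of flights per cancellation code
barplot(count, main = "Cancellation Code",
        xlab = "Cancellation code", ylab = "Number of cancellations")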
Exercise 3
Remove the outliers you noticed in the barplots of the previous exercise; consider using
the subset function.
Ans.
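A minimal sketch of one way to do this, run on a small made-up data frame instead of flights. The 1% threshold is the one stated in Exercise 2; the names toy, freq, keep and toy_clean are illustrative assumptions, not part of the original answer.

```r
# Toy stand-in for flights: carrier "ZZ" accounts for less than 1% of the rows.
toy <- data.frame(
  UniqueCarrier = factor(c(rep("AA", 120), rep("WN", 79), "ZZ"))
)

# Relative frequency of each level.
freq <- prop.table(table(toy$UniqueCarrier))

# Levels that reach the 1% threshold are kept.
keep <- names(freq)[freq >= 0.01]

# subset() keeps the rows whose carrier is one of the retained levels;
# droplevels() then discards the factor levels that are now empty.
toy_clean <- droplevels(subset(toy, UniqueCarrier %in% keep))

table(toy_clean$UniqueCarrier)
```

The same pattern should carry over to flights$CancellationCode by swapping the column name.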
Exercise 4
A boxplot is a good way of detecting outliers in numerical variables; make one with
flights$ActualElapsedTime.
Ans.
boxplot(flights$ActualElapsedTime,horizontal = TRUE)
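The points that a boxplot draws beyond the whiskers can also be listed directly with boxplot.stats, which by default uses the same 1.5 times the interquartile range rule. The sketch below uses a made-up vector (elapsed) as a stand-in for flights$ActualElapsedTime.

```r
# Toy stand-in for flights$ActualElapsedTime (minutes); 600 is an extreme value.
elapsed <- c(60, 75, 80, 90, 95, 100, 110, 120, 130, 600)

# boxplot.stats() returns the five-number summary in $stats and the
# values lying beyond the whiskers in $out.
stats <- boxplot.stats(elapsed)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```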
Exercise 5
Ans.
Exercise 6
Remove outliers from flights using the subset function, keeping the rows where TaxiIn is
greater than 0 and less than 120.
Ans.
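A minimal sketch of the subset call, shown on a made-up data frame (toy) so that it runs without the full dataset; the values are illustrative only.

```r
# Toy stand-in for flights with a TaxiIn column (minutes).
toy <- data.frame(TaxiIn = c(0, 5, 12, 119, 120, 300, NA))

# Keep only rows with 0 < TaxiIn < 120. subset() also drops rows
# where the condition evaluates to NA.
toy_clean <- subset(toy, TaxiIn > 0 & TaxiIn < 120)

toy_clean$TaxiIn
```

On the real data the equivalent call would be flights <- subset(flights, TaxiIn > 0 & TaxiIn < 120).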
Exercise 7
Remove outliers from flights using the subset function, keeping the rows where TaxiOut is
greater than 0 and less than 50.
Ans.
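Before filtering, it can help to check how many rows the condition would drop. The sketch below does that on a made-up data frame (toy); the name in_range is an illustrative assumption.

```r
# Toy stand-in for flights with a TaxiOut column (minutes).
toy <- data.frame(TaxiOut = c(0, 8, 15, 49, 50, 75, NA))

# Logical flag: TRUE when the value lies strictly between 0 and 50.
in_range <- toy$TaxiOut > 0 & toy$TaxiOut < 50

# Count rows that would be kept (TRUE), dropped (FALSE), or are missing (NA).
table(in_range, useNA = "ifany")

# Apply the same condition with subset(); NA rows are dropped as well.
toy_clean <- subset(toy, TaxiOut > 0 & TaxiOut < 50)
```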
Exercise 8
Using the ifelse function, assign NA to every value of flights$ArrDelay that is an outlier.
Ans.
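The exercise does not say how an outlier is defined, so the sketch below assumes the boxplot rule (values that boxplot.stats places beyond the whiskers). It runs on a made-up vector (arr_delay) standing in for flights$ArrDelay.

```r
# Toy stand-in for flights$ArrDelay (minutes).
arr_delay <- c(-10, -5, -2, 0, 3, 5, 8, 12, 15, 400)

# Assumed outlier rule: values lying beyond the boxplot whiskers.
out_values <- boxplot.stats(arr_delay)$out

# ifelse() replaces the flagged values with NA and leaves the rest unchanged.
arr_delay_clean <- ifelse(arr_delay %in% out_values, NA, arr_delay)

arr_delay_clean
```

On the real data the result would be assigned back to flights$ArrDelay.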
Extra:
library(rapportools)  # provides rp.outlier
rp.outlier(flights$Distance)
## [1] 3303 3414 3329 2917 2860 2846 4502 2860 2845 2845 2860 2845 4243 4243
## [15] 2846 2846 4502 4213 3303 2917 4502 4243 4213 4184 3303 2845 2860 2979
## [29] 2860 2845 2936 3972 3711 3711 3711 3784 3365 2917 4502 2986 3043 3266
## [43] 3266 3904 4243 2846 2917 2979 3784 3110 2936 2936 4184 2845 2917 3266
## [57] 2936 3303 3303 3303 3414 3329 2917 4502 3784 3266 3266 3266 3266 2994
## [71] 2994 3784 3043 3904 3904 4962 3904 3904 2845 3784 2994 2994 3711 4243
## [85] 3386 2979 4502 2846 2797
table(outliers)
## outliers
## FALSE TRUE
## 6205225 165732
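The source does not show how the logical vector outliers was built. One plausible construction, assuming the 1.5 times interquartile range rule, is sketched below on a made-up vector (distance) standing in for flights$Distance; the rule and the names q and fence are assumptions.

```r
# Toy stand-in for flights$Distance (miles).
distance <- c(300, 450, 515, 688, 810, 861, 900, 1100, 1591, 4962)

# First and third quartiles.
q <- quantile(distance, c(0.25, 0.75), names = FALSE)

# Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
fence <- 1.5 * (q[2] - q[1])
outliers <- distance < q[1] - fence | distance > q[2] + fence

# Count flagged (TRUE) versus retained (FALSE) observations.
table(outliers)
```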
THANK YOU