18BCE10291 - Outliers Assignment


ASSOCIATE ANALYTICS (NASSCOM)

Name: Vikram BM

Reg. No.: 18BCE10291

Slot: G11+G12+G13

Assignment Questions:

Dataset is taken from here:

https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp

https://www.transtats.bts.gov/Fields.asp?Table_ID=236

http://stat-computing.org/dataexpo/2009/the-data.html

Exercise 1

Print the structure and the summary statistics of the dataset to see the type of each
variable and its extreme values, and judge whether those values make sense.

str(flights)

## 'data.frame': 6370961 obs. of 29 variables:

## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...

## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...

## $ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...

## $ DayOfWeek : int 4 4 4 4 4 4 4 4 4 4 ...

## $ DepTime : Factor w/ 1440 levels "00:01:00","00:02:00",..: 1204 474 388 566 1110 1181 1178 639 377 981 ...

## $ CRSDepTime : Factor w/ 1217 levels "00:00:00","00:01:00",..: 973 233 158 348 853 933 888 418 153 758 ...

## $ ArrTime : Factor w/ 1440 levels "00:01:00","00:02:00",..: 1332 602 484 654 1200 1282 1238 692 412 1000 ...

## $ CRSArrTime : Factor w/ 1377 levels "00:00:00","00:01:00",..: 1283 538 408 598 1103 1208 1118 648 348 953 ...

## $ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...

## $ FlightNum : int 335 3231 448 1746 3920 378 509 535 11 810 ...

## $ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...

## $ ActualElapsedTime: num 128 128 96 88 90 101 240 233 95 79 ...

## $ CRSElapsedTime : int 150 145 90 90 90 115 250 250 95 95 ...

## $ AirTime : num 116 113 76 78 77 87 230 219 70 70 ...

## $ ArrDelay : num -14 2 14 -6 34 11 57 -18 2 -16 ...

## $ DepDelay : num 8 19 8 -4 34 25 67 -1 2 0 ...

## $ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...

## $ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...

## $ Distance : int 810 810 515 515 515 688 1591 1591 451 451 ...

## $ TaxiIn : num 4 5 3 3 3 4 3 7 6 3 ...

## $ TaxiOut : num 8 10 17 7 10 10 7 7 19 6 ...

## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...

## $ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1 ...

## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...

## $ CarrierDelay : int NA NA NA NA 2 NA 10 NA NA NA ...

## $ WeatherDelay : int NA NA NA NA 0 NA 0 NA NA NA ...

## $ NASDelay : int NA NA NA NA 0 NA 0 NA NA NA ...

## $ SecurityDelay : int NA NA NA NA 0 NA 0 NA NA NA ...

## $ LateAircraftDelay: int NA NA NA NA 32 NA 47 NA NA NA ...


summary(flights)

## Year Month DayofMonth DayOfWeek

## Min. :2008 Min. : 1.000 Min. : 1.00 Min. :1.00

## 1st Qu.:2008 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.:2.00

## Median :2008 Median : 6.000 Median :16.00 Median :4.00

## Mean :2008 Mean : 6.386 Mean :15.73 Mean :3.92

## 3rd Qu.:2008 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.00

## Max. :2008 Max. :12.000 Max. :31.00 Max. :7.00

##

## DepTime CRSDepTime ArrTime

## 13:32:13: 136534 06:00:00: 126941 15:00:19: 152657

## 05:55:00: 16772 07:00:00: 81633 14:00:00: 7854

## 06:00:00: 15808 08:00:00: 53579 10:20:00: 7761

## 05:57:00: 14128 06:30:00: 52469 14:10:00: 7750

## 05:56:00: 14075 07:30:00: 45438 10:15:00: 7683

## 06:55:00: 13819 08:30:00: 42197 10:10:00: 7662

## (Other) :6159825 (Other) :5968704 (Other) :6179594

## CRSArrTime UniqueCarrier FlightNum TailNum

## 16:30:00: 22666 WN :1154405 Min. : 1 : 83287

## 18:00:00: 21986 OO : 561525 1st Qu.: 691 N476HA : 4699

## 12:55:00: 21868 AA : 523848 Median :1689 N477HA : 4548

## 12:30:00: 21645 MQ : 480105 Mean :2344 N484HA : 4505

## 16:15:00: 21545 US : 375020 3rd Qu.:3735 N475HA : 4499

## 13:00:00: 21238 XE : 362623 Max. :9743 N480HA : 4416


## (Other) :6240013 (Other):2913435 (Other):6265007

## ActualElapsedTime CRSElapsedTime AirTime ArrDelay

## Min. :-25.21 Min. :-141.0 Min. : 0.00 Min. :-129.000

## 1st Qu.: 75.00 1st Qu.: 77.0 1st Qu.: 54.00 1st Qu.: -10.000

## Median :105.00 Median : 105.0 Median : 82.00 Median : -2.000

## Mean :112.12 Mean : 114.6 Mean : 90.19 Mean : 6.743

## 3rd Qu.:142.00 3rd Qu.: 145.0 3rd Qu.:119.00 3rd Qu.: 9.000

## Max. :242.00 Max. :1435.0 Max. :235.00 Max. :1357.000

## NA's :11

## DepDelay Origin Dest Distance

## Min. : -79.0 ATL : 378285 ATL : 389032 Min. : 11.0

## 1st Qu.: -4.0 ORD : 311768 ORD : 327183 1st Qu.: 308.0

## Median : -1.0 DFW : 271397 DFW : 271091 Median : 533.0

## Mean : 9.5 DEN : 236358 DEN : 223264 Mean : 611.9

## 3rd Qu.: 9.0 PHX : 180767 IAH : 175499 3rd Qu.: 861.0

## Max. :2467.0 IAH : 172321 PHX : 166441 Max. :4962.0

## (Other):4820065 (Other):4818451

## TaxiIn TaxiOut Cancelled CancellationCode

## Min. :-18.245 Min. : 0.07894 Min. :0.00000 :6233950

## 1st Qu.: 4.000 1st Qu.:10.00000 1st Qu.:0.00000 A: 54124

## Median : 6.000 Median :13.00000 Median :0.00000 B: 54753

## Mean : 6.753 Mean :15.19601 Mean :0.02151 C: 28134

## 3rd Qu.: 8.000 3rd Qu.:18.00000 3rd Qu.:0.00000 D: 0

## Max. :175.000 Max. :49.00000 Max. :1.00000


##

## Diverted CarrierDelay WeatherDelay NASDelay

## Min. :0.00000 Min. : 0 Min. : 0 Min. : 0

## 1st Qu.:0.00000 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0

## Median :0.00000 Median : 0 Median : 0 Median : 3

## Mean :0.00255 Mean : 17 Mean : 3 Mean : 13

## 3rd Qu.:0.00000 3rd Qu.: 17 3rd Qu.: 0 3rd Qu.: 17

## Max. :1.00000 Max. :2436 Max. :1352 Max. :1357

## NA's :5103021 NA's :5103021 NA's :5103021

## SecurityDelay LateAircraftDelay

## Min. : 0 Min. : 0

## 1st Qu.: 0 1st Qu.: 0

## Median : 0 Median : 0

## Mean : 0 Mean : 22

## 3rd Qu.: 0 3rd Qu.: 29

## Max. :392 Max. :1236

## NA's :5103021 NA's :5103021
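
Several minima in the summary look implausible for durations: ActualElapsedTime (-25.21), CRSElapsedTime (-141.0) and TaxiIn (-18.245) are negative. A minimal sketch of how such values could be counted per column, using a small made-up data frame (the name `toy` and its values are illustrative assumptions, not the real dataset):

```r
# Toy stand-in for a few numeric columns of `flights` (values are illustrative)
toy <- data.frame(
  ActualElapsedTime = c(128, -25.21, 96),
  TaxiIn            = c(4, -18.245, 3),
  Distance          = c(810, 515, 11)
)

# Count negative entries per column; NA values are ignored
neg_counts <- sapply(toy, function(col) sum(col < 0, na.rm = TRUE))
print(neg_counts)
```

Applied to the real data, the same call over the numeric columns would show which variables need cleaning before the later exercises.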


Exercise 2

For categorical variables, values whose relative frequency is below 1% are treated as
outliers. Make a barplot of flights$UniqueCarrier and of flights$CancellationCode.
What do you think? There are more categorical variables, so I encourage you to try them out
as well.

Ans.

counts <- table(flights$UniqueCarrier)

# One bar per carrier; bar height is the number of flights
barplot(counts, main = "Carrier Distribution",
        xlab = "Carrier", ylab = "Number of flights")

[Plot: barplot of the number of flights per carrier]


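count <- table(flights$CancellationCode)

# One bar per cancellation code; the empty code "" covers flights that were not cancelled
barplot(count, main = "Cancellation Code",
        xlab = "Cancellation code", ylab = "Number of flights")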

[Plot: barplot of the number of flights per cancellation code]
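
The 1% rule can also be checked numerically instead of by eye. A minimal sketch, assuming a made-up factor `carrier` in place of flights$UniqueCarrier (the counts are illustrative, not taken from the dataset):

```r
# Toy carrier column: 200 flights, one of them operated by "AQ"
carrier <- factor(c(rep("WN", 150), rep("AA", 49), "AQ"))

# Relative frequency of each category
freq <- prop.table(table(carrier))

# Categories below the 1% threshold are the candidate outliers
rare <- names(freq)[freq < 0.01]
print(rare)
```

On the real data the same two lines, with flights$UniqueCarrier or flights$CancellationCode in place of `carrier`, list the categories to consider removing.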

Exercise 3

Remove the outliers you noticed in the barplots of the previous exercise; consider using
the subset function.

Ans.

# Drop the categories picked out from the barplots; != reads more clearly than !x == y
flights <- subset(flights, UniqueCarrier != 'AQ')

flights <- subset(flights, CancellationCode != 'D')
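
Note that the summary above shows zero rows with code D, so the second subset removes no rows; what a barplot still shows is the empty factor level. A small sketch of how the unused level could be dropped, using a made-up data frame `toy` (an assumption for illustration):

```r
# Toy cancellation codes: level "D" is declared but never used
toy <- data.frame(
  code = factor(c("", "A", "B", "C", "A"),
                levels = c("", "A", "B", "C", "D"))
)

# subset() filters rows but keeps every factor level
kept <- subset(toy, code != "D")
levels_before <- levels(kept$code)

# droplevels() removes levels that no longer occur in the data
kept$code <- droplevels(kept$code)
levels_after <- levels(kept$code)
print(levels_after)
```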


Exercise 4

A boxplot is a good way to detect outliers in numerical variables; make one for
flights$ActualElapsedTime.

Ans.

boxplot(flights$ActualElapsedTime,horizontal = TRUE)

[Plot: horizontal boxplot of flights$ActualElapsedTime]
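
The points a boxplot draws beyond its whiskers are those lying more than 1.5 times the hinge spread away from the hinges. A minimal sketch of that rule on made-up elapsed times (the vector `x` is an illustrative assumption), compared with what boxplot.stats reports:

```r
# Toy elapsed times in minutes, with one extreme value
x <- c(10, 12, 13, 14, 15, 16, 18, 95)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
h <- fivenum(x)
spread <- h[4] - h[2]

# Fences at 1.5 times the hinge spread beyond each hinge
lower <- h[2] - 1.5 * spread
upper <- h[4] + 1.5 * spread
manual_out <- x[x < lower | x > upper]

# boxplot.stats applies the same rule (default coef = 1.5)
stats_out <- boxplot.stats(x)$out
print(stats_out)
```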

Exercise 5

Remove the outliers of flights$ActualElapsedTime using boxplot.stats.

Ans.

# Assign back to `flights` (R is case-sensitive) so the later exercises use the filtered data
flights <- flights[!(flights$ActualElapsedTime %in%
                     boxplot.stats(flights$ActualElapsedTime)$out), ]

Exercise 6

Remove outliers from flights using the subset function, keeping only rows where TaxiIn is
greater than 0 and less than 120.

Ans.

flights <- subset(flights, TaxiIn < 120 & TaxiIn > 0)

Exercise 7

Remove outliers from flights using the subset function, keeping only rows where TaxiOut is
greater than 0 and less than 50.

Ans.

flights <- subset(flights, TaxiOut < 50 & TaxiOut >0)
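
One side effect of the range filters in Exercises 6 and 7 is worth knowing: subset drops rows where the condition evaluates to NA, so rows with a missing taxi time are removed too. A small sketch on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
# Toy taxi-in times: one negative, one too large, one missing
toy <- data.frame(TaxiIn = c(4, -18.2, 175, NA, 6))

# Keep rows strictly between 0 and 120; the NA row is dropped as well
kept <- subset(toy, TaxiIn > 0 & TaxiIn < 120)
print(kept$TaxiIn)
```

If rows with missing values should survive the filter, the condition could be extended with `is.na(TaxiIn) |`.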

Exercise 8

Use the ifelse function to assign NA to any value of flights$ArrDelay that is an outlier.

Ans.

library(outliers)  # provides outlier(), which returns the single value farthest from the mean
flights$ArrDelay <- ifelse(flights$ArrDelay == outlier(flights$ArrDelay), NA, flights$ArrDelay)
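
Because outlier() from the outliers package returns one value, the answer above blanks only the single most extreme delay. If the intent is to blank every point outside the boxplot whiskers, one possible sketch in base R, on made-up delays (the vector `delay` is an illustrative assumption), is:

```r
# Toy arrival delays in minutes, with one extreme value
delay <- c(-14, 2, 14, -6, 34, 1357)

# Values beyond the boxplot whiskers (1.5 times the hinge spread)
out <- boxplot.stats(delay)$out

# Replace every flagged value with NA, keep the rest unchanged
delay_clean <- ifelse(delay %in% out, NA, delay)
print(delay_clean)
```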

Extra:

rp.outlier(flights$Distance)

## [1] 3303 3414 3329 2917 2860 2846 4502 2860 2845 2845 2860 2845 4243 4243

## [15] 2846 2846 4502 4213 3303 2917 4502 4243 4213 4184 3303 2845 2860 2979

## [29] 2860 2845 2936 3972 3711 3711 3711 3784 3365 2917 4502 2986 3043 3266

## [43] 3266 3904 4243 2846 2917 2979 3784 3110 2936 2936 4184 2845 2917 3266

## [57] 2936 3303 3303 3303 3414 3329 2917 4502 3784 3266 3266 3266 3266 2994

## [71] 2994 3784 3043 3904 3904 4962 3904 3904 2845 3784 2994 2994 3711 4243
## [85] 3386 2979 4502 2846 2797

# %in% tests membership in the outlier set; != would recycle the outlier vector position by position
flights <- subset(flights, !(Distance %in% rp.outlier(flights$Distance)))
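
When several outlier values are to be removed at once, `!=` and `%in%` behave differently: `!=` recycles the shorter vector and compares position by position, whereas `%in%` tests membership in the whole set. A small sketch on made-up distances (both vectors are illustrative assumptions):

```r
# Toy distances and a toy set of outlier values
Distance <- c(810, 515, 4502, 688, 4962, 451)
out <- c(4502, 4962)

# Recycled comparison: 4962 is compared with 4502 and wrongly kept
keep_recycled <- Distance != out

# Membership test: both outlier values are flagged for removal
keep_in <- !(Distance %in% out)

print(Distance[keep_recycled])
print(Distance[keep_in])
```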

outliers <- scores(flights$CRSElapsedTime, type="chisq", prob=0.98)

table(outliers)

## outliers

## FALSE TRUE

## 6205225 165732
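
As far as I understand the outliers package, scores with type="chisq" and a prob value flags observations whose squared deviation from the mean, divided by the variance, exceeds the matching quantile of a chi-squared distribution with one degree of freedom. A base R sketch of that rule on made-up elapsed times, assuming this reading is right (the vector `x` is illustrative):

```r
# Toy scheduled elapsed times in minutes, with one extreme value
x <- c(90, 95, 100, 105, 110, 115, 120, 125, 130, 1435)

# Chi-squared score: squared deviation from the mean over the variance
score <- (x - mean(x))^2 / var(x)

# Flag scores above the 98% quantile of chi-squared with 1 degree of freedom
flag <- score > qchisq(0.98, df = 1)
print(table(flag))
```

Since the mean and variance are themselves pulled by extreme values, this rule tends to be conservative on small samples.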

THANK YOU
