Data Mining Project
Data Mining Project
Data Mining Project
Introduction:
Data mining, or knowledge discovery, is the computer-assisted process of digging
through and analyzing enormous sets of data and then extracting the meaning of the data. Data
Mining is the extraction of hidden, predictive information patterns from large databases. Data
Mining is especially useful now-a-days when there is massive amount of data and identifying the
useful portions of it can be a tedious job in itself.
With data mining we can now try and predict the future trends rather than identifying
them after they have already taken place. Data mining tools predict behaviors and future trends,
allowing businesses to make proactive, knowledge-driven decisions. XLMiner is an add-in
available for MS Excel that allows us to perform Data Mining on the data sets.
Problem Solving:
In this project we would like to solve a problem of Southwest in Calculating the Air Fares for
the new airports introduced, and providing discounts for its customers.
Using the Dataset we wanted explore the FARE by creating a correlation tables.
We wanted to explore the categorical predictors by computing the percentage of flights in
regression.
Predict the average fare on a route using exhaustive search.
Compare the predictive accuracy of the model.
Data Description:
It contains real data that were collected for the third quarter of 1996, in which several
new airports have opened in major cities, opening the market for new routes.
The dataset has a total of 639 records; each record consists of 18 attributes or variables.
In order to price flights on these routes, a major airline collected information on 638 air
Data Code:
S_CODE
S_CITY
E_CODE
E_CITY
COUPON
NEW
VACATION
SW
HI
S_INCOME
E_INCOME
S_POP
E_POP
SLOT
GATE
DISTANCE
PAX
FARE (the response)
Excel Sheet-Airfares.xls: (Click on the below excel sheet to view complete data)
and the other isnt. For example, if we prove that weight gain in the first trimester of pregnancy
correlates well with fetus development, we can use gain as a predictor. The alternative would be
expensive tests to monitor the actual development of the fetus.
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
0
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
.00
5000.00
10000.00
15000.00
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
$15,000$20,000$25,000$30,000$35,000$40,000$45,000
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
$10,000 $15,000 $20,000 $25,000 $30,000 $35,000
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
0
5000000
10000000
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
0
5000000
10000000
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
0
500
1000
1500
2000
2500
3000
FARE
$400.00
$350.00
$300.00
$250.00
FARE
$200.00
$150.00
$100.00
$50.00
$0.00
0
By plotting scatter plots of all numerical predictor with response (Fare), we can say that
the numerical variable Distance has a best correlation with the response (Fare). Therefore, we
consider Distance as a Numerical Predictor.
Pivot Table: (Click on the below excel sheet to view the complete data)
Pivot Table of the Categorical Variables (Vacation, SW, Slot, Gate) and Response (Fare)
After plotting Pivot table by dragging and dropping, we can say that the one with the
highest total between the categories is the best Categorical variable. Therefore, we can say that
Slot is the best Categorical variable in response to Fare.
MLR Excel Sheet: (Click on the below excel sheet to view the complete MLR data)
From the above result set we can say that using predictors that are uncorrelated with the
dependent variable increases the variance of predictions. Therefore, we try to drop the actually
correlated predictor with the dependent variable to increase the average error of predictors.
cp
R2
LR
40000
30000
20000
Cumulative
10000
0
200
0 400
Cumulative
FARE using
average
# cases
Cumulative
SW when
sorted using
predicted
values
50
40
30
20
Cumulative
10
0
100
0 200
Cumulative
SW using
average
# cases
1
0.5
3
2.5
2
1.5
Decile mean / Global mean
1
0.5
0
0
Deciles
Deciles
Therefore by comparing RMSE, Average Errors, Lift Charts and Decile Wise Charts of
both the model training dataset, we conclude that the model with minimum number of predictors
performs perfectly.
The RMSE value, Average Error Value is small for Logistics Regression model in
comparison with MLR output. We also see from the lift charts that LR output maintains a linear
structure.
Finally we state that the Logistics Regression Model is the best model for calculating
airfares in Southwest for newly introduced airports.
Conclusion:
We conclude that the data mining techniques are one of the best techniques which are
used to find the perfectly suited model for an particular requirement. It is an easy software tool,
which is available in the market for obtaining the best result set. By comparing the obtained
result set of the two models we conclude that Logistics Regression (Exhaustive Search) is a best
suited model for this particular problem.