DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
CHAPTER 1. INTRODUCTION
1.1. Overview
1.2. Importance
1.3. Objectives
2.3. Challenges
CHAPTER 3. METHODOLOGY
3.2. Preprocessing
3.3. EDA
3.7. Monitoring
CHAPTER 4. IMPLEMENTATION
4.4. Coding
4.5. Testing
5.2. Comparison
CHAPTER 6. CONCLUSION
6.1. Summary
6.2. Achievement
6.3. Future Work
REFERENCES
BIOGRAPHY
1. INTRODUCTION
Machine learning methods suit big data because processing vast volumes of data manually
would be impossible without the support of machines. Machine learning falls into two
general groups: supervised and unsupervised. In supervised learning, the program is trained
on a pre-determined data set so that it can make predictions when new data is given. In
unsupervised learning, the program tries to find relationships and hidden patterns in the
data.
1.1. Overview
The Flight Price Prediction project is designed to harness machine learning to forecast
flight ticket prices accurately. Using large datasets of historical flight information,
the project aims to construct a predictive model that can serve as a decision-making tool for
both travellers planning their trips and airlines managing their pricing strategies.
1.2. Importance
The ability to predict flight prices has significant implications for the travel industry. For consumers, it
means the potential to secure cost-effective travel by identifying the best times to purchase tickets. For
airlines, it represents an opportunity to fine-tune pricing models, enhance revenue management, and stay
competitive in a fluctuating market.
1.3. Objectives
• To provide a tool that can help consumers and airlines make data-driven decisions
2. LITERATURE SURVEY
This chapter focuses on summarizing the existing research and projects related to flight price prediction. It
involves a comprehensive review of academic papers, industry reports, and existing systems that have
attempted to predict flight prices. Previous work may include studies that employed machine learning
algorithms, statistical models, or hybrid approaches to forecast airfare. The subsection should highlight the
methodologies, datasets used, and key findings of these studies. Additionally, it may discuss the limitations
or gaps identified in the literature, which the current project aims to address.
2.2. Techniques and Algorithms
Here, the section delves into the various techniques and algorithms commonly used in flight
price prediction. It provides an overview of the different machine learning algorithms such as
linear regression, decision trees, random forests, support vector machines, and neural
networks that have been applied in predicting flight prices. Additionally, it discusses the
feature engineering methods and data preprocessing techniques utilized to prepare the input
data for these algorithms. The subsection may also explore any specific approaches or
modifications tailored to the unique characteristics of flight price data, such as seasonality,
route networks, and pricing dynamics.
2.3. Challenges
It may include technical challenges such as data quality issues, missing values, and the high
dimensionality of feature space.
Additionally, the subsection examines the inherent uncertainty in predicting human behavior,
as travelers' purchasing decisions are influenced by a multitude of factors beyond historical
price trends. Strategies for mitigating these challenges, as well as potential research
directions to overcome them, are also discussed.
One of the primary challenges identified is the dynamic nature of flight pricing, influenced by
factors like Total Stops. Additionally, the need for real-time data processing and the handling
of missing data present significant hurdles.
3. METHODOLOGY
This section describes the methodology for gathering the data needed for flight price
prediction. It involves identifying relevant sources of data, such as flight booking
databases, airline websites, online travel agencies, or publicly available datasets, and the
process of collecting flight-related information, including departure and arrival airports,
dates, times, airlines, ticket prices, and any other relevant features. It may also address
considerations such as data privacy, data licensing agreements, and the frequency of data
updates to ensure the timeliness and legality of the collected data.
In this project, we downloaded the datasets from Kaggle. These datasets are uncleaned and
not yet ready for training our model.
Fig 3.1.1
3.2. Preprocessing
This subsection describes the preprocessing steps applied to the raw data before it is fed
into the predictive models. It includes data cleaning processes to handle missing values and
inconsistencies in the dataset. Additionally, data normalization or standardization
techniques may be employed to ensure that features are on a similar scale. The preprocessing
stage may also involve encoding categorical variables, handling datetime variables, and
performing any necessary transformations to make the data suitable for analysis. The
subsection should provide clarity on the rationale behind each preprocessing step and its
impact on the quality of the data.
Fig 3.2.1: Data Preprocessing
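The cleaning, scaling, and encoding steps described above can be sketched in plain Python.
This is a minimal illustration under stated assumptions, not the project's actual pipeline:
the helper names (fill_missing, standardize, label_encode) and the sample values are
hypothetical, mean imputation is only one possible strategy, and the standard library is
used solely to keep the example dependency-free.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    codes = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


# Hypothetical sample values, chosen only for illustration.
durations = [2.0, None, 4.0, 6.0]
filled = fill_missing(durations)
scaled = standardize(filled)
encoded, codes = label_encode(['Vistara', 'AirAsia', 'Vistara'])
```

Assigning codes in sorted label order keeps the mapping reproducible between training and
prediction, which matters when the same codes must later be entered by hand in an
application.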
3.3. Exploratory Data Analysis (EDA)
• Pair plot
Here, a pair plot is used to detect outliers in the data, with price on the y-axis and
duration and total_stops on the x-axis.
Fig 3.3.2
• Categories Distribution
Here, the methodology for feature engineering is described, which involves selecting,
creating, or transforming the input variables to improve the predictive performance of the
models. This may include extracting relevant features from the raw data, such as day of the
week, time of day, or holiday indicators, which may influence flight prices. Feature selection
techniques, such as correlation analysis or recursive feature elimination, may be employed to
identify the most informative variables. Additionally, domain knowledge and insights from
the literature.
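As an illustration of extracting calendar features such as the day of the week, the
following sketch derives a few candidate inputs from a journey date. The function name
date_features, the chosen feature set, and the example dates are assumptions made for this
example only; holiday indicators are omitted because they need an external calendar.

```python
from datetime import date


def date_features(journey_date, booking_date):
    """Derive simple calendar features from a journey date and a booking date."""
    weekday = journey_date.weekday()  # Monday = 0 ... Sunday = 6
    return {
        'day_of_week': weekday,
        'is_weekend': weekday >= 5,
        'month': journey_date.month,
        'days_left': (journey_date - booking_date).days,
    }


# Hypothetical dates: a Saturday flight booked 15 days ahead.
features = date_features(date(2024, 3, 16), date(2024, 3, 1))
print(features)
```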
Selecting suitable predictive models for flight price prediction involves evaluating
various machine learning algorithms, such as linear regression, random forests, support
vector machines, and neural networks, to determine their performance on the preprocessed
dataset. Model verification criteria may include predictive accuracy, computational
efficiency, interpretability, and scalability to handle large volumes of data. Techniques such
as cross-validation and grid search may be employed to tune hyperparameters and optimize
model performance. The subsection discusses the rationale behind the choice of models and
the criteria used for model evaluation.
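Cross-validation combined with a grid search can be sketched without any library. The
example below is a simplified stand-in, not the project's tuning code: it uses a tiny
one-feature k-nearest-neighbours regressor, contiguous unshuffled folds, mean absolute error
as the score, and made-up data. All function names and values are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_mae(xs, ys, k, n_folds=3):
    """Mean absolute error of the k-NN regressor, averaged over the folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), n_folds):
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        errors = [abs(knn_predict(tx, ty, xs[i], k) - ys[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Made-up toy data: price falls as the number of days left grows.
days_left = [1, 2, 3, 5, 8, 13, 21, 34, 40]
prices = [9000, 8800, 8500, 8000, 7000, 6000, 5000, 4500, 4400]

# Grid search: score every candidate value and keep the one with the lowest error.
grid = [1, 2, 3]
scores = {k: cv_mae(days_left, prices, k) for k in grid}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Each candidate is scored only on folds it was not trained on, so the selected value reflects
performance on held-out data rather than on the training set.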
3.5. Deploy the Machine Learning Model
In this stage of the machine learning lifecycle, we integrate the machine learning model
into processes and applications. The ultimate aim of this stage is the proper functioning of
the model after deployment.
3.6. Monitoring
Monitoring involves putting safety measures in place to assure proper operation of the model.
1 – Hardware Requirements
3. Software Requirements
Additionally, other tools and libraries used for data preprocessing, feature engineering, and
evaluation should be listed. For instance, Python libraries such as Pandas, NumPy,
scikit-learn, and TensorFlow may be mentioned for data manipulation, machine learning, and
deep learning tasks. The section should also specify the version of Streamlit and other
dependencies used to ensure reproducibility.
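One way to record the versions of the dependencies is to query the installed package
metadata. The sketch below assumes Python 3.8 or later; the helper name installed_versions
is hypothetical, and the package list simply repeats the libraries named in this section,
whether or not they are installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ['streamlit', 'pandas', 'numpy', 'scikit-learn', 'tensorflow'])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```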
4.2. System Architecture
A system architecture is the conceptual model that defines the structure, behaviour, and
other views of a system. An architecture description is a formal description and
representation of a system.
Here, the architecture of the system developed for flight price prediction is described.
Streamlit, as the chosen deployment platform, plays a central role in hosting the predictive
model and providing a user-friendly interface for interacting with it. The subsection may
discuss how the predictive model is integrated into the Streamlit application, including
loading the trained model, processing user input, generating predictions, and displaying
results. It may also detail any backend services or databases used to support the application,
such as APIs for fetching real-time flight data or caching mechanisms for improving
performance.
Fig: System architecture (recoverable labels: Data preprocessing, Applied Algorithm)
Streamlit is employed as the deployment platform due to its ability to create interactive
and user-friendly web applications with minimal effort. The main components of the
Streamlit application include:
Streamlit's UI elements (such as sliders, date pickers, and dropdown menus) enable users
to input their flight search criteria, such as airline, date_of_journey, source,
destination, dep_time, and total_stops.
The trained predictive model is loaded into the Streamlit application, allowing it to be utilized
for generating flight price predictions based on user input.
The integration of the predictive model into the Streamlit application involves several steps:
1. Loading the Trained Model
The predictive model, trained using historical flight data, is saved and loaded into the
Streamlit application upon startup. This ensures that the model is readily available for
generating predictions.
2. Processing User Input
User inputs from the UI are collected and preprocessed to match the format expected
by the predictive model. This may involve converting categorical variables into
numerical representations, normalizing continuous variables, and ensuring all
required features are present.
3. Generating Predictions
The pre-processed user input is fed into the predictive model, which generates a price
prediction for the specified flight criteria.
4. Displaying Results
The predicted flight prices are presented to the user in an easily interpretable format,
such as tables, charts, or summary statistics.
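The four steps can be sketched end to end as follows. This is an illustrative outline, not
the application's code: StubModel is a hypothetical stand-in for the trained regressor, only
two input features are used, and functools.lru_cache plays the part of "load once at
startup". In a Streamlit app, a caching decorator such as st.cache_resource could serve the
same purpose.

```python
from functools import lru_cache


class StubModel:
    """Hypothetical stand-in for the trained regressor; not the project's model."""

    def predict(self, rows):
        # Toy rule: a base fare plus a per-hour charge on the duration feature.
        return [3000.0 + 500.0 * row[-1] for row in rows]


@lru_cache(maxsize=1)
def load_model():
    """Step 1: create the model once and reuse it on later calls."""
    return StubModel()


STOPS_CODES = {'one': 0, 'two or more': 1, 'zero': 2}


def process_input(stops, duration_hours):
    """Step 2: validate raw input and convert it to a numeric feature row."""
    if stops not in STOPS_CODES:
        raise ValueError(f"unknown stops value: {stops!r}")
    if duration_hours <= 0:
        raise ValueError("duration must be positive")
    return [STOPS_CODES[stops], float(duration_hours)]


def predict_price(stops, duration_hours):
    """Step 3: run the model on the processed row."""
    row = process_input(stops, duration_hours)
    return load_model().predict([row])[0]


def format_result(price):
    """Step 4: render the prediction for display."""
    return f"The predicted price is Rs {price:,.2f}"


message = format_result(predict_price('zero', 2.5))
print(message)
```

Validating input before prediction means an unfilled field produces a clear error instead of
a failure inside the model call.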
This subsection covers the design and functionality of the user interface developed using Streamlit. It describes
the layout, features, and interactive elements provided to users for inputting query parameters
(e.g., departure airport, destination airport, date of travel) and viewing predicted flight prices.
The user interface should be intuitive, visually appealing, and responsive, with clear
4.4 CODING
import bz2
import pickle

import streamlit as st

st.set_page_config(
    page_title="Flight Price Predictor",
    page_icon="https://hips.hearstapps.com/hmg-prod/images/gettyimages-1677184597.jpg?crop=0.668xw:1.00xh;0.167xw,0&resize=1200:*",
)

# Integer codes assigned to each category when the training data was label-encoded.
AIRLINES = {'Vistara': 5, 'Air India': 1, 'Indigo': 3, 'GO FIRST': 2, 'AirAsia': 0, 'SpiceJet': 4}
CITIES = {'Delhi': 2, 'Mumbai': 5, 'Bangalore': 0, 'Kolkata': 4, 'Hyderabad': 3, 'Chennai': 1}
TIMES = {'Morning': 4, 'Early Morning': 1, 'Evening': 2, 'Night': 5, 'Afternoon': 0, 'Late Night': 3}
STOPS = {'one': 0, 'zero': 2, 'two or more': 1}
CLASSES = {'Economy': 1, 'Business': 0}


def decompress_pickle(file):
    # Load an object from a bz2-compressed pickle file.
    with bz2.BZ2File(file, 'rb') as data:
        return pickle.load(data)


def select(label, options):
    # Dropdown whose first entry is the 'Select' placeholder.
    return st.selectbox(label, ('Select', *options))


st.sidebar.title('MENU BAR')
choice = st.sidebar.selectbox(' ', ('Home', 'Predict'))
st.sidebar.image('https://e0.pxfuel.com/wallpapers/209/716/desktop-wallpaperuntitled-airplane-sky-aesthetic-travel.jpg')
st.sidebar.image('https://i.pinimg.com/736x/0d/1e/96/0d1e967cde176af6f8f0568af424d07b.jpg')

if choice == 'Home':
    st.title('Welcome to Flight Price Predictor')
    st.text('Hi. Want to predict your flight ticket price❓❓')
    st.text('Click the Menu bar for further details')
    st.image('https://wallpapers.com/images/featured/airportw6v47yjhxcohsjgf.jpg')
elif choice == 'Predict':
    st.text('Kindly fill your flight details to view the predicted price')
    st.image('https://feeds.abplive.com/onecms/images/uploadedimages/2021/09/08/634259599cd6f60c24f9e67a5680c064_original.jpg')

    ch = select('Airline', AIRLINES)
    cg = select('From', CITIES)
    # The destination list leaves out the chosen source city.
    cx = select('Destination', [city for city in CITIES if city != cg])
    cf = select('Departure time', TIMES)
    ci = select('Stops', STOPS)
    cs = select('Arrival time', TIMES)
    cb = select('Class', CLASSES)
    h = st.number_input('Duration')
    i = st.number_input('Days left')

    if st.button('Check'):
        if 'Select' in (ch, cg, cx, cf, ci, cs, cb):
            # Without this check an unfilled dropdown leaves a feature undefined.
            st.warning('Please fill in every field before checking the price.')
        else:
            model = decompress_pickle('Flight.pbz2')
            # Feature order: airline, source, departure time, stops,
            # arrival time, destination, class, duration, days left.
            features = [[AIRLINES[ch], CITIES[cg], TIMES[cf], STOPS[ci],
                         TIMES[cs], CITIES[cx], CLASSES[cb], h, i]]
            pred = model.predict(features)
            st.write("The predicted price is:-", pred[0], 'Rs')
            st.header('Time to fly ✈🧳')
            st.image('https://image.cnbcfm.com/api/v1/image/106537227-1589463911434gettyimages-890234318.jpeg?v=1589463982&w=1600&h=900')
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder

flight_data=pd.read_csv('/content/drive/MyDrive/Clean_Dataset.csv')
flight_data=flight_data.drop(columns=['Unnamed: 0'])
flight_data.shape
"""The dataset does not have any null, missing, or duplicate values."""
flight_data.nunique()
sns.set(font_scale=0.7)
cl={'Economy':'green','Business':'blue'}
c=sns.countplot(data=flight_data,x='class',palette=cl)
for label in c.containers:
    c.bar_label(label)
"""1. Among the six airlines, only Vistara and Air India have both classes,
Economy and Business.
2. The airline Vistara has the highest number of flights in both classes.
3. SpiceJet is the airline with the lowest number of flights.
"""
flight_data.describe()
flight_data[['airline','price','class']].sort_values(by='price',ascending=False)
"""Among the various airlines, Vistara charges the highest price under the
business class.
"""
sns.set(font_scale=0.7)
plt.figure(figsize=(9,9))
x=sns.barplot(data=flight_data,x='class',y='price',hue='airline',errorbar=None)
for i in x.containers:
    x.bar_label(i)
plt.xlabel('Class')
plt.ylabel('Ticket Price')
plt.title('Flight ticket price vs class based on each airline')
"""The ticket price charged by Vistara is the highest under both classes, and
AirAsia offers the lowest under Economy class.
h. Plotting the number of flights per class under different departure and
arrival times.
"""
sns.set(font_scale=0.7)
plt.figure(figsize=(8,6))
plt.subplot(2,1,1)
cl=sns.countplot(data=flight_data,x='departure_time',hue='class')
for l in cl.containers:
    cl.bar_label(l)
plt.subplot(2,1,2)
cl=sns.countplot(data=flight_data,x='arrival_time',hue='class')
for l in cl.containers:
    cl.bar_label(l)
"""This graph shows that more flights depart in the morning and more flights
arrive at night.
i. Analysing ticket price vs destination and source cities based on each class
"""
sns.set(font_scale=0.7)
plt.figure(figsize=(8,6))
plt.subplot(2,1,1)
cl=sns.barplot(data=flight_data,x='destination_city',y='price',hue='class')
for l in cl.containers:
    cl.bar_label(l)
plt.subplot(2,1,2)
cl=sns.barplot(data=flight_data,x='source_city',y='price',hue='class')
for l in cl.containers:
    cl.bar_label(l)
flight_data['duration'].describe()
"""From the boxplot, we can infer that the flight ticket price falls in the
range of 0 to 100000, whereas a few outliers lie beyond the value of 120000.
Since the dataset is large enough, the outliers are removed from the data in
order to develop a proper model for the prediction.
"""
# Drop every row priced at 100000 or above.
f_out=flight_data[flight_data['price']>=100000].index
flight_data=flight_data.drop(index=f_out)
sns.boxplot(x=flight_data['price'])
flight_data.shape
flight_data[['destination_city','price']].groupby('destination_city').max()
flight_data[flight_data['price']==99680]
flight_data.head(2)
"""Vistara offers the highest Business Class ticket price, Rs 99680, on a
flight from Bangalore to Mumbai with a duration of 14.42.
"""
flight_data=flight_data.drop(columns='flight')
# Encoding:
# Note: `df` is not defined in the code shown here; it is assumed to hold the
# columns that are to be label-encoded.
enc_all_cols=df.apply(LabelEncoder().fit_transform)
# Candidate models and the hyperparameter values to try for each.
model_params={
    'LR':{
        'model':LinearRegression(),
        'params':{
        }
    },
    'KNR':{
        'model':KNeighborsRegressor(),
        'params':{
            'n_neighbors':[2,5,10]
        }
    },
    'RFR':{
        'model':RandomForestRegressor(),
        'params':{
            'n_estimators':[5,10,20]
        }
    }
}
"""Among the 3 models used, Random Forest Regressor gives the highest score.
Hence, a model with the Random Forest Regression is built and evaluated. """
# X_train, X_test, y_train and y_test come from a train/test split that is not
# part of the code shown here.
rf=RandomForestRegressor(n_estimators=20)
rf.fit(X_train,y_train)
r_pred=rf.predict(X_test)
"""Looking for the labels of the categorical columns, for reference (since the
columns are encoded)"""
print('AIRLINE')
print(flight_data['airline'].value_counts())
print(X['airline'].value_counts())
print('\n')
print('SOURCE CITY')
print(flight_data['source_city'].value_counts())
print(X['source_city'].value_counts())
print('\n')
print('DEPARTURE TIME')
print(flight_data['departure_time'].value_counts())
print(X['departure_time'].value_counts())
print('STOPS')
print(flight_data['stops'].value_counts())
print(X['stops'].value_counts())
print('\n')
print('ARRIVAL TIME')
print(flight_data['arrival_time'].value_counts())
print(X['arrival_time'].value_counts())
print('\n')
print('DESTINATION CITY')
print(flight_data['destination_city'].value_counts())
print(X['destination_city'].value_counts())
print('\n')
print('CLASS')
print(flight_data['class'].value_counts())
print(X['class'].value_counts())
flight_data.sample(1)
"""As per the model evaluation, the prediction is around 99% accurate.
Therefore, the model 'rf' is chosen for flight price prediction.
"""
import pickle
# Save the trained model, then reload it and run a sample prediction.
filename='trained_model.sav'
with open(filename,'wb') as f:
    pickle.dump(rf,f)
with open(filename,'rb') as f:
    load=pickle.load(f)
load.predict([[5,5,4,0,4,3,1,24.0,48]])
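The application code loads Flight.pbz2 through decompress_pickle, whereas the notebook code
saves trained_model.sav with plain pickle; the step that produced the compressed file is not
shown. A minimal sketch of how such a file could be written and read back is given below,
assuming the helper name compress_pickle (hypothetical) and using a plain dict in place of
the trained model.

```python
import bz2
import os
import pickle
import tempfile


def compress_pickle(path, obj):
    """Write obj to path as a bz2-compressed pickle."""
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(obj, f)


def decompress_pickle(path):
    """Read an object back from a bz2-compressed pickle."""
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)


# A plain dict stands in for the trained model in this sketch.
stand_in_model = {'name': 'rf', 'n_estimators': 20}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Flight.pbz2')
    compress_pickle(path, stand_in_model)
    restored = decompress_pickle(path)
print(restored)
```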
4.5 Testing
Testing in a machine learning project is a crucial step to ensure that the model performs as
expected and generalizes well to new, unseen data. It involves several practices:
1 - Unit Testing
Unit testing involves checking the correctness of individual components within the ML
pipeline. This could include testing data preprocessing functions, individual algorithms, or
other discrete parts of the ML system.
2 - Integration Testing
Integration testing checks the combined functionality of these individual components. It
ensures that when these components work together, they produce the expected results.
3 - System Testing
System testing evaluates the complete and integrated ML system to verify that it meets the
specified requirements. This includes testing the model’s performance on unseen data and
ensuring that it integrates well with other systems.
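As an example of the unit-testing level, the sketch below tests a single preprocessing
helper with Python's unittest module. The helper encode_stops is hypothetical and written
for this example; its codes follow the stops mapping used in the application code.

```python
import unittest

STOPS_CODES = {'one': 0, 'two or more': 1, 'zero': 2}


def encode_stops(label):
    """Hypothetical preprocessing helper: map a stops label to its integer code."""
    try:
        return STOPS_CODES[label]
    except KeyError:
        raise ValueError(f"unknown stops label: {label!r}") from None


class EncodeStopsTest(unittest.TestCase):
    def test_known_labels(self):
        self.assertEqual(encode_stops('zero'), 2)
        self.assertEqual(encode_stops('one'), 0)

    def test_unknown_label_is_rejected(self):
        # The dropdown placeholder must never reach the model as a feature.
        with self.assertRaises(ValueError):
            encode_stops('Select')


suite = unittest.defaultTestLoader.loadTestsFromTestCase(EncodeStopsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

An integration test would go one step further and pass the encoded row through the loaded
model, checking that a numeric price comes back.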
4.7 Manual Testing
Fig 4.7.1
The system also has an elegant interface that takes all the necessary inputs for the
evaluation and is very easy to use. The final result of our proposed system can be viewed
through the GUI.
Fig 5.1 Front end Page
5.1. Comparison
Here, the performance of the developed flight price prediction model(s) is compared with
existing methods or benchmarks. This could involve comparing the predictive accuracy,
computational efficiency, or other relevant metrics against baseline models or state-of-the-art
approaches reported in the literature. The subsection may also discuss how the proposed
model(s) fare against commercial flight booking websites or other publicly available prediction
services.
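A comparison of predictive accuracy against a baseline could rest on standard regression
metrics. The sketch below computes MAE, RMSE, and R² in plain Python; the function name and
all numbers are made up for illustration and are not results of this project. The baseline
always predicts the mean price, so its R² is zero by construction.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        'mae': sum(abs(e) for e in errors) / n,
        'rmse': sqrt(ss_res / n),
        'r2': 1 - ss_res / ss_tot,
    }


# Made-up prices in Rs, for illustration only.
actual = [4000, 6000, 8000, 10000]
baseline = [7000] * 4                  # always predicts the mean price
candidate = [4500, 5500, 8500, 9500]   # hypothetical model output

baseline_scores = regression_metrics(actual, baseline)
candidate_scores = regression_metrics(actual, candidate)
print(baseline_scores)
print(candidate_scores)
```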
6. CONCLUSION
6.1. SUMMARY
This subsection provides a concise summary of the key findings and contributions of the
flight price prediction project. It recaps the objectives outlined in the introduction and
summarizes how they were addressed throughout the project. The summary may include a
brief overview of the methodology employed, the predictive models developed, and the main
results obtained. Additionally, it highlights any novel insights or advancements made in the
field of flight price prediction as a result of the project.
6.2. ACHIEVEMENTS
Here, the subsection discusses the achievements and contributions of the project. It outlines
the specific outcomes or milestones reached during the course of the project, such as the
development of accurate predictive models, the implementation of a user-friendly interface
using Streamlit, or the generation of actionable insights for stakeholders in the aviation and
travel industries. Achievements may be evaluated in terms of technical innovation, practical
utility, or societal impact, depending on the project's goals and objectives.
6.3. FUTURE WORK
This subsection explores potential avenues for future research and development based on
the findings and limitations of the flight price prediction project. It identifies areas where
further improvements or enhancements could be made to advance the state-of-the-art in flight
prediction. Future work may include refining predictive models by incorporating additional
data sources or features, exploring advanced machine learning techniques such as deep
learning or ensemble methods, or conducting longitudinal studies to evaluate model
performance over time. Additionally, opportunities for collaboration with industry partners or
academic researchers may be discussed to validate and extend the project's findings in
real-world settings.
Overall, the Conclusion section serves as a culmination of the flight price prediction project,
summarizing its main outcomes, highlighting achievements, and outlining directions for
future research and development. It provides closure to the project while laying the
groundwork for continued exploration and innovation in the field of airfare prediction.
BIOGRAPHY