FLIGHT FARE PREDICTION - Copy100
Submitted by:
Saleem Yousuf(2000970130103)
Sankalp Rajpoot(2000970130105)
Vivek kumar gupta(2000970130129)
Greater Noida
S No. Content
Abstract
2 Introduction
4 Methodology
5 Experimental setup
6 Tools used
8 Source code
10 References
List of Figures
List of Tables
I want to give special thanks to our Mini Project coordinator, Ms. Raunak Sulekh, for the timely
advice and valuable guidance during the design and implementation of this project work.
I also want to express my sincere thanks and gratitude to Dr. Sanjeev Kumar Singh, Head of
Department (HOD), Information Technology Department, for providing me with the facilities
and for all the encouragement and support.
Finally, I express my sincere thanks to all staff members of the Department of Information
Technology for all their support and cooperation.
Sankalp Rajpoot(2000970130105)
Anyone who has booked a flight ticket knows how dynamically prices change. Airlines use
advanced strategies, known as Revenue Management, to implement differentiated pricing. The
cheapest available ticket changes over time, so the price of a ticket may be high or low. This
pricing method automatically adjusts the fare according to the time of day, such as morning,
afternoon or night. Prices may also change with the seasons, such as winter, summer and festival
seasons. The ultimate goal of the airline is to increase its revenue, whereas the buyer is looking
for the cheapest price. Buyers generally try to purchase the ticket well in advance of the
departure day, because they believe that the airfare will most likely be high when the purchase
date is close to the departure date. This is not always true, and a buyer may end up paying more
than they should for the same seat.
A report says that India's civil aviation industry is on a high-growth trajectory. India was
expected to be the third-largest aviation market in 2020 and the largest by 2030. Indian air traffic
was expected to cross 100 million passengers by 2017, whereas there were just 81 million
passengers in 2015. According to Google, the phrase "Cheap Air Tickets" is the most searched in
India. As the middle class of India is exposed to air travel, buyers look for low prices, and the
demand for flight tickets at the lowest price is continuously increasing.
Nowadays, airline corporations use complex strategies to calculate flight ticket fares. These
highly complicated methods make the fare difficult for customers to guess, since it changes
dynamically. Our project, the Improved Flight Price Prediction System, addresses this problem by
providing a facility where people can predict the flight ticket price before purchasing the ticket.
The prime objective of our project "Improved Flight Prediction System" is to predict the flight
ticket fare for future flights. The proposed approach uses a supervised machine learning
algorithm. The regression model selected for the prediction is “Extreme Gradient Boosting”. The
backend of the project is developed in Python. The GUI is built with Bootstrap, HTML and CSS
using the Django framework, and with Python using Tkinter.
2. Literature Survey
It is very difficult for a customer to purchase a flight ticket at the minimum price. Several
techniques are therefore used to find the day on which the price of an air ticket will be lowest.
Most of these techniques rely on a branch of artificial intelligence (AI) research known as
Machine Learning.
Using machine learning models, a PLSR (Partial Least Squares Regression) model has been applied
to obtain the best performance in purchasing airline tickets at the lowest price, with 75.3%
accuracy. Janssen presented a linear quantile mixed regression model to predict air ticket prices
for cheap tickets many days before departure. Ren, Yuan, and Yang studied the performance of
Linear Regression (77.06% accuracy), Naive Bayes (73.06% accuracy), Softmax Regression (76.84%
accuracy) and SVM (80.6% accuracy) models in predicting air ticket prices. Papadakis [5]
predicted whether the price of a ticket will drop in the future, treating the problem as a
classification problem, with the help of the Ripple Down Rule Learner (74.5% accuracy), Logistic
Regression (69.9% accuracy) and Linear SVM (69.4% accuracy) machine learning models. Gini and
Groves used Partial Least Squares Regression (PLSR) to develop a model for predicting the best
purchase time for flight tickets. The data was collected from major travel booking websites
from 22 February 2011 to 23 June 2011. Additional data was also collected and used to compare
the performance of the final model. Janssen built a prediction model using the Linear Quantile
Mixed Regression technique for the San Francisco to New York route, with existing daily airfares
provided by www.infare.com. The model used two features: the number of days left until the
departure date, and whether the flight date falls on a weekend or a weekday. The model predicts
airfare well for days that are far from the departure date; however, for days close to the
departure date, the prediction is not effective.
Wohlfarth proposed a ticket purchase time optimization model based on a special pre-processing
step known as marked point processes, together with data mining techniques (classification and
clustering) and statistical analysis. This framework is proposed to convert heterogeneous price
series data into interpolated price series trajectories that can be fed to an unsupervised
clustering algorithm. The price trajectories are clustered into groups based on similar pricing
behaviour. An optimization model estimates the price change patterns. A tree-based
classification algorithm is used to select the best matching cluster and then the corresponding
optimization model.
A study by Dominguez-Menchero recommends the optimal purchase time based on a nonparametric
isotonic regression technique for a specific route, airline, and time period. The model gives the
maximum number of days before buying a flight ticket. Two types of variable are considered for
the prediction: the fare and the date of purchase.
Vargas and Silva (2008) contend that house price adjustments play an important part in
determining the stage of the business cycle. When the economy booms, construction and
employment in the housing sector expand rapidly in response to excess demand, quickly pushing
nominal house prices upwards. Recently, a few authors have reported empirical findings that
house prices can be instrumental in forecasting output (Forni et al., 2003; Stock and Watson,
2003; Gupta and Das, 2010; Das et al., 2009; 2010; 2011; Gupta and Hartley, 2013). The housing
construction sector represents a large part of the total economic activity expressed in GDP.
Consequently, as it reflects a large portion of the overall wealth of the economy, house price
fluctuations can be an indicator of the evolution of GDP.
There is a large literature on U.S. house prices. Rapach and Strauss use an autoregressive
distributed lag (ARDL) model framework, containing 25 determinants, to forecast real housing
price growth for the individual states of the Federal Reserve's Eighth District. They find that
ARDL models tend to beat a benchmark AR model. Rapach and Strauss extend the same analysis to
the 20 largest U.S. states, based on ARDL models that look at state, regional and national level
variables.
Gupta and Das also forecast the recent downturn in real house price growth rates for the twenty
largest U.S. states. The authors use Spatial Bayesian VARs, based only on monthly real house
price growth rates, to forecast their downturn over the period 2007:01 to 2008:01. They find that
BVAR models are well equipped to forecast the future direction of real house prices, though
they significantly underestimate the decline.
Gogas and Pragidis use the risk premium, calculated as the difference between various long-term
interest rates and the agents' expectations of future short-term rates, as an input variable for
forecasting the future direction of house prices. They infer that experts and analysts may make
adequate use of the information provided by the interest rate risk premium today.
4. Methodology
Data collection
The collection of data is the most important aspect of this project. Data from various sources on
different websites is used to train the models. These websites give information about multiple
routes, times, airlines and fares. Various sources, from APIs to consumer travel websites, are
available for data scraping. This section discusses the sources and the parameters that are
collected. For this implementation, data is collected from the website “Makemytrip.com”, and
Python is used both to collect the data and to implement the models.
I. Collection of data
The script extracts the information from the website and creates a CSV file as output. This file
contains the information for each flight along with its features. An important step is then to
select the features needed by the flight price prediction algorithm. The output collected from the
website contains numerous variables for each flight, but not all are required, so only the
following features are considered.
• Origin
• Destination
• Departure Date
• Departure Time
• Arrival Time
• Total Fare
• Airways
• Taken Date
In this study, the focus is only on minimizing the airfare, so a single one-way route is
considered. The data is collected for one of the busiest routes in India (BOM to DEL) over a
period of three months, from February to April. For each flight, data with all the features was
collected manually.
The collected data needed a lot of work, so after collection it has to be cleaned and prepared
according to the model requirements. All unnecessary data, such as duplicates and null values, is
removed. In any machine learning project, this is the most important and time-consuming step.
Various statistical techniques and logic built in Python are used to clean and prepare the data.
For example, the price was stored as a character type rather than an integer.
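The cleaning step described above can be sketched as follows. This is only a minimal illustration,
assuming that fares arrive as strings such as "Rs. 5,432" with whole-rupee values. The field names
and sample rows are hypothetical and are not taken from the project's dataset.

```python
import re

def parse_fare(raw):
    """Convert a scraped fare string such as 'Rs. 5,432' to an int, or None if empty.

    Assumption: fares are whole rupees, so every non-digit character can be dropped.
    """
    if raw is None:
        return None
    digits = re.sub(r"[^0-9]", "", raw)  # drop currency text, commas and spaces
    return int(digits) if digits else None

def clean_rows(rows):
    """Remove rows with a missing fare and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        fare = parse_fare(row.get("total_fare"))
        if fare is None:
            continue  # null value: skip the record
        key = (row["airways"], row["departure_date"], row["departure_time"], fare)
        if key in seen:
            continue  # duplicate record: skip it
        seen.add(key)
        cleaned.append({**row, "total_fare": fare})
    return cleaned

# Hypothetical sample rows (not real scraped data)
rows = [
    {"airways": "A", "departure_date": "2022-03-01", "departure_time": "06:30", "total_fare": "Rs. 5,432"},
    {"airways": "A", "departure_date": "2022-03-01", "departure_time": "06:30", "total_fare": "Rs. 5,432"},
    {"airways": "B", "departure_date": "2022-03-01", "departure_time": "18:00", "total_fare": ""},
]
print(clean_rows(rows))
```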
Data preparation is followed by analyzing the data, uncovering hidden trends and then applying
various machine learning models. Some features can also be calculated from existing features.
Days to departure is obtained as the difference between the departure date and the date on which
the data was taken. This parameter is restricted to within 45 days. The day of departure also
plays an important role, depending on whether it is a holiday or a weekday. Intuitively, flights
scheduled during weekends have a higher price than flights on Wednesday or Thursday.
Similarly, the time of day also seems to be an important factor, so the time is divided into four
categories: Morning, Afternoon, Evening, Night.
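The derived features described above can be sketched with the standard library alone. The hour
boundaries of the four time categories are assumptions made for illustration, since the report
does not state them, and the function names are hypothetical.

```python
from datetime import date, time

def days_to_departure(departure, taken):
    """Number of days between the date the fare was recorded and the departure date."""
    return (departure - taken).days

def is_weekend(departure):
    """True for Saturday or Sunday (date.weekday(): Monday=0 ... Sunday=6)."""
    return departure.weekday() >= 5

def time_of_day(t):
    """Map a departure time to one of four categories (boundaries are assumed)."""
    if 5 <= t.hour < 12:
        return "Morning"
    if 12 <= t.hour < 17:
        return "Afternoon"
    if 17 <= t.hour < 21:
        return "Evening"
    return "Night"

# Hypothetical example record
departure = date(2022, 3, 12)
taken = date(2022, 2, 20)
print(days_to_departure(departure, taken), is_weekend(departure), time_of_day(time(6, 30)))
```

Records whose days to departure fall outside the 45-day window could then be filtered out before
training.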
To develop the model for flight price prediction, many conventional machine learning
algorithms are evaluated: Linear Regression, Decision Tree, Random Forest, K-Nearest
Neighbours, Multilayer Perceptron, Support Vector Machine (SVM) and Gradient Boosting. All
these models are implemented in scikit-learn. To evaluate the performance of these models, three
metrics are considered: R-squared value, Mean Absolute Error (MAE) and Mean Squared Error
(MSE). The formulas for these three metrics are as follows:
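\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{1} \]
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{2} \]
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3} \]
Here y_i is the actual fare, \hat{y}_i the predicted fare, \bar{y} the mean of the actual fares and
n the number of samples.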
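The three evaluation metrics named above (R-squared, MAE and MSE) can be computed directly from
their standard definitions. The sketch below uses plain Python and hypothetical fares; in
practice the equivalent scikit-learn metric functions would normally be used.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted fares (in thousands of rupees)
y_true = [3, 5, 7]
y_pred = [2, 5, 9]
print(r_squared(y_true, y_pred), mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred))
```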
A. Linear Regression
Regression is a method of modelling a target value based on independent predictors.
Regression techniques mostly differ in the number of independent variables and the type of
relationship between the independent and dependent variables. Simple linear regression is a
type of regression analysis where the number of independent variables is one and the
relationship between the dependent and independent variables is linear. The important
concepts for understanding linear regression are the cost function and gradient descent.
y(pred) = b0 + b1*x
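As a minimal sketch of how the cost function and gradient descent fit together, the example below
fits b0 and b1 by repeatedly stepping against the gradient of the mean squared error cost. The
learning rate, number of epochs and data are assumptions chosen so that this toy problem
converges; they are not values used in the project.

```python
def fit_linear(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y = b0 + b1*x by gradient descent on the mean squared error cost."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to b0 and b1
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Hypothetical data generated from y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = fit_linear(xs, ys)
print(round(b0, 4), round(b1, 4))
```

For features on a larger scale, such as fares in rupees, the inputs would need to be scaled or the
learning rate reduced for the updates to remain stable.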
B. Decision tree
The decision tree algorithm breaks the data set down into smaller subsets while, at the
same time, the tree is developed incrementally. The final result is a tree with decision
nodes and leaf nodes. A decision node may have two or more branches. At the beginning, the
entire data set is considered as the root. Feature values are preferred to be categorical.
If the values are continuous, they are discretized before building the model. Records are
distributed recursively based on attribute values. There are two main attribute selection
measures in the decision tree algorithm: Information Gain and the Gini index. Information
Gain is the measure of the change in entropy. The higher the entropy, the greater the
information content, where entropy is a measure of the uncertainty of a random variable.
The Gini index is a metric that measures how often a randomly chosen element would be
incorrectly identified. This means that an attribute with a lower Gini index should be
preferred.
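The two split measures named above can be illustrated on a small set of class labels. The labels
below are hypothetical fare bands used only to show the arithmetic. For a regression target such
as the fare itself, tree implementations typically use a variance-based criterion instead.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: probability of mislabelling a randomly chosen element."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical fare bands before and after a perfect split
parent = ["high", "high", "low", "low"]
children = [["high", "high"], ["low", "low"]]
print(entropy(parent), gini(parent), information_gain(parent, children))
```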
C. Random Forest
It is a supervised learning algorithm. The benefit of the random forest is that it can be
used for both classification and regression problems, which form the majority of current
machine learning systems. Random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.
Random Forest has nearly the same parameters as a decision tree or a bagging classifier
model. In this algorithm it is very easy to measure the relative importance of each
feature for the prediction. The common element in these procedures is that, for the kth
tree, a random vector theta k is generated, independent of the past random vectors
theta 1, ... , theta k-1 but with the same distribution, and a tree is grown using the
training set and theta k, resulting in a classifier whose input is a vector x. For
instance, in bagging the random vector is generated as the counts in N boxes, where N is
the number of examples in the training set. In random split selection, theta consists of
a number of independent random integers between 1 and K. The dimensionality and nature
of theta depend on its use in the construction of a tree. After a large number of trees
are generated, they vote for the most popular class. These procedures are called random
forests.
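The bagging idea behind random forests can be sketched in a simplified form. The example below
is not a full random forest: it uses one-split regression trees (stumps) on a single feature,
draws a bootstrap sample for each tree, and averages the tree outputs, which is the regression
counterpart of voting. Random feature selection at each split is omitted, and the data and
function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on one numeric feature.

    Returns (threshold, left_mean, right_mean), minimising the squared error.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # every x is identical: always predict the overall mean
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean(predict_stump(stump, x) for stump in forest)

# Hypothetical data: days to departure versus fare
days = [1, 2, 3, 4, 30, 31, 32, 33]
fares = [9000, 9200, 8800, 9100, 4000, 4200, 3900, 4100]
forest = fit_forest(days, fares)
print(predict_forest(forest, 2), predict_forest(forest, 32))
```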
D. K-Nearest Neighbours
In regression, the output is the average of the values of the k nearest neighbours. It is
a nonparametric method like SVM. Several values of k are tried, the results are evaluated
and the best-performing value is selected.
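A minimal sketch of k-nearest-neighbour regression is given below, assuming Euclidean distance
between feature vectors. The training points are hypothetical.

```python
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict the target as the mean of the k training points closest to the query."""
    # Pair each training target with its Euclidean distance to the query point
    by_distance = sorted((dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = [y for _, y in by_distance[:k]]
    return sum(nearest) / len(nearest)

# Hypothetical one-feature training set
train_x = [(1,), (2,), (3,), (10,)]
train_y = [100, 200, 300, 1000]
print(knn_predict(train_x, train_y, (2,), k=3))
```

Different values of k can be compared on a held-out set, keeping the value with the lowest error.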
E. Multilayer Perceptron
It is a class of feedforward artificial neural network. It includes an input layer, an
output layer and a number of hidden layers. The hidden layers give the depth of the neural
network. The setup includes 1 hidden layer, with the number of neurons ranging from 100 to
2000 at different intervals depending on the required condition. Each neuron needs an
activation function in order to fire. The logistic sigmoid function is used as the
activation function.
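The forward pass of a network with one hidden layer and logistic sigmoid activation can be
sketched as follows. The linear output unit is an assumption suited to regression, and the
weights are hypothetical rather than trained values.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with sigmoid activation, followed by a linear output unit."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output
w_hidden = [[0.0, 0.0], [0.0, 0.0]]
b_hidden = [0.0, 0.0]
w_out = [1.0, 1.0]
print(forward([3.0, 4.0], w_hidden, b_hidden, w_out, 0.0))
```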
F. Support Vector Machine and Gradient boosting
In the proposed work, the Support Vector Machine is used for regression analysis. It relies
on kernel functions and is considered a nonparametric technique. The following kernels are
used: Linear, Polynomial and Radial Basis Function. As per previous studies, Random Forest
and gradient boosting give the maximum accuracy. The values of R-squared, MAE and MSE for
the models are reported in the results table.
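The three kernels named above can be written as plain functions of two feature vectors. The
parameter names (gamma, coef0, degree) follow common convention, and the default values are
assumptions made for illustration.

```python
import math

def linear_kernel(u, v):
    """Dot product of the two vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <u, v> + coef0) raised to the given degree."""
    return (gamma * linear_kernel(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """exp(-gamma * squared Euclidean distance between u and v)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v, degree=2), rbf_kernel(u, v))
```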
V. Predictors
After evaluating the performance of all the machine learning models, further
improvements are made by building a combined predictor model for the best result. Two
separately trained models are developed using the training datasets, and appropriate
weights are assigned to them to obtain a better predictor model.
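One way to combine two trained models with weights is a weighted average of their predictions.
The sketch below is an assumption about how such weights might be chosen, using a small grid
search over a validation set. The report does not state the actual procedure, and the numbers
are hypothetical.

```python
def weighted_prediction(preds_a, preds_b, weight_a=0.5):
    """Blend two prediction lists: weight_a for model A, (1 - weight_a) for model B."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie between 0 and 1")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]

def best_weight(preds_a, preds_b, y_true, steps=10):
    """Pick the weight on an evenly spaced grid that minimises validation MSE."""
    def mse(weight):
        blended = weighted_prediction(preds_a, preds_b, weight)
        return sum((y - p) ** 2 for y, p in zip(y_true, blended)) / len(y_true)
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=mse)

# Hypothetical validation predictions from two models, and the actual fares
preds_a = [100, 200]
preds_b = [200, 400]
y_true = [150, 300]
w = best_weight(preds_a, preds_b, y_true)
print(w, weighted_prediction(preds_a, preds_b, w))
```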
EXPERIMENTAL SET UP
Steps to Create Model
1. Import Libraries
2. Load Dataset
3. Exploratory Data Analysis
4. Data Cleaning
5. Feature Engineering
6. Dimensionality Reductions
7. Outlier Removal using Business Logic
8. Outlier Removal using Standard Deviation & Mean
9. Data Visualization
10. Building a Model
11. Test the Model for few properties
12. Export the tested model to a pickle file
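Steps 8 and 12 of the list above can be sketched with the standard library. The cut-off of two
standard deviations, the sample fares and the dictionary standing in for a trained model are
all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from statistics import mean, pstdev

def remove_outliers(values, n_std=2.0):
    """Keep only values within n_std standard deviations of the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= n_std * s]

# Hypothetical fares containing one extreme value
fares = [4000, 4100, 4200, 3900, 4050, 20000]
filtered = remove_outliers(fares)
print(filtered)

# Stand-in for a trained model object (any picklable Python object works)
model = {"b0": 1.0, "b1": 2.0}

# Export the model to a pickle file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fare_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary
code.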
Tools used
Pandas- pandas is a software library written for the Python programming language for data
manipulation and analysis. It is most widely used for data science, data analysis and machine
learning tasks.
Python Language
Google Colaboratory
Jupyter Notebook- The Jupyter Notebook is an open-source web application that allows data
scientists to create and share documents that integrate live code, equations, computational output,
visualizations, and other multimedia resources, along with explanatory text in a single document.
Hardware Requirements
8 GB RAM
1.2 GHz Processor
16 GB ROM
There are two main use cases of flight price prediction in the travel industry. OTAs and other
travel platforms integrate this feature to attract more visitors looking for the best rates. Airlines
employ the technology to forecast the rates of competitors and adjust their pricing strategies
accordingly.
1- The existing system requires a great amount of manual work, and the amount of manual
work increases exponentially with the increase in services.
3- It needs a lot of working staff and extra attention to all the records.
The predictor also makes straightforward recommendations on the best date to purchase
the flight. If it’s not today, visitors can subscribe to price drop alerts and receive the latest updates
via email or directly to their phone.
Conclusion and future work
saves an average of about Rs. 200 per transaction when predicting to wait.
● Routes with data collected over a longer duration tend to yield much more accurate
predictions in the model and thus lead to higher average savings. We were able to analyze
each route and generalize the project in terms of the sector to which each route
belonged, classifying routes into three major subsections: Business Routes, Tourist
Routes and Tier-2 Routes. We have also busted some of the typical myths and
misconceptions related to the airline industry, backed by data and analysis.
Currently, there are many fields where prediction-based services are used, such as stock price
predictor tools used by stock brokers and services like Zestimate, which gives the estimated
value of house prices. There is therefore a need for a similar service in the aviation industry
that can help customers in booking tickets. Many research works have addressed this problem
using various techniques, and more research is needed to improve the accuracy of the prediction
by using different algorithms. More accurate data with better features can also be used to get
more accurate results.
List of References
FIGURE - 2