Flight Delay Report
CHAPTER 1. INTRODUCTION
The major components of a modern air transportation system include passenger
airlines, cargo airlines, and the air traffic control system. Over time, nations
around the world have developed numerous techniques for improving airline
transportation, which has brought drastic change to airline operations. Even so,
flight delays continue to inconvenience passengers [1]. Every year approximately
20% of airline flights are canceled or delayed, costing passengers more than
20 billion dollars in money and time.
Our case study covered LaGuardia Airport in New York, Logan International
Airport in Boston, San Francisco International Airport in San Francisco, and
O’Hare International Airport in Chicago, four major airports in the United
States of America, but we focused the research on LaGuardia Airport. Compared
with the data produced by all airports in the USA, the data we gathered was very
limited, but it gave a clear indication of how weather plays a part in flight
delays. The goal of this project is to use exploratory analysis and to build
machine learning models to predict airline departure and arrival delays.
This master’s project report is organized into ten chapters. The preface of the
project, research motivation, and problem statement form Chapter 1. Chapter 2
describes the basic concepts of flight and weather data. Chapter 3 focuses on the
structure of the project. Chapters 4 and 5 explain the data collection and data
exploration of the flight data, while Chapter 6 focuses on predictive modeling
implemented on the flight data. Chapter 7 focuses on predictive modeling
implemented on the weather data. Chapter 8 introduces the Twitter data and some
tweet exploration that helped us in the course of building the project, and then
focuses on predictive modeling of Twitter data using Random Forest and Support
Vector Machine. Chapter 9 concludes the report, and Chapter 10 discusses the
future scope of the project.
The main concern of researchers and analysts is to identify the reasons for flight
delays, and to that end they have put effort into collecting flight and weather
data. Mohamed et al. [2] studied the pattern of arrival delay for non-stop
domestic flights at Orlando International Airport. They focused primarily on the
cyclic variations in air travel demand and weather at that airport.
Shervin et al. [3] propose an approach that improves operational performance
without affecting the planned cost.
Adrian et al. [4] created a data mining model that forecasts flight delays from
observed weather conditions. They used WEKA and R to build their models, trying
different classifiers and choosing the one with the best results. The machine
learning techniques they used include Naïve Bayes and Linear Discriminant
Analysis classifiers.
Choi et al. [5] focused on overcoming the effects of data imbalance during
training. They used techniques such as Decision Trees, AdaBoost, and K-Nearest
Neighbors to predict individual flight delays. Their model performed a binary
classification to predict whether a scheduled flight would be delayed.
Schaefer et al. [6] built the Detailed Policy Assessment Tool (DPAT), which is
used to simulate the minor changes in flight delay caused by weather changes.
Bing Liu [7] carried out work on sentiment analysis and opinion mining, which
analyzes people’s opinions and sentiments and studies their behavior. The output
of the research is a feature-based opinion summary, which is also known as
sentiment classification.
Using techniques such as Natural Language Processing, Naïve Bayes, and Support
Vector Machine, researchers built algorithms for analysis that helped them extract
features for their models. Most of them focused on predicting overall flight
delays. Our research concentrated mainly on predicting flight delays for a
particular airport over a specific period of time. First, we used a regression
model to examine the significance of each feature, and then a feature selection
approach to examine the impact of feature combinations. These two techniques
determined the features to retain in the model. Instead of using the whole data
set, we sampled 5,000 records at a time to run through different machine learning
models. The models implemented here were a Random Forest classifier and a Support
Vector Machine (SVM) classifier. Further, we applied one-hot encoding
(One-Hot-Encoder) to create a variant of the model for evaluating potential
prediction performance.
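One-hot encoding replaces a categorical column with one binary column per
category. The following is a minimal sketch in plain Python, not the project's
actual code: the helper `one_hot_encode`, the `carrier` column, and the sample
records are hypothetical and serve only to illustrate the idea.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with one 0/1 column per category."""
    # Sort the distinct categories so the output column order is deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical flight records; "carrier" is an illustrative categorical feature.
flights = [
    {"carrier": "AA", "dep_hour": 8},
    {"carrier": "DL", "dep_hour": 17},
    {"carrier": "AA", "dep_hour": 21},
]
encoded_flights = one_hot_encode(flights, "carrier")
print(encoded_flights[0])
```

Each record gains a `carrier_AA` and a `carrier_DL` column, exactly one of which
is 1, so a classifier no longer sees the carrier as a single opaque label.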
CHAPTER 2. LITERATURE SURVEY
CHAPTER 3. SYSTEM ANALYSIS
EXISTING SYSTEM:
Supervised machine learning classifies data inputs according to labeled outputs,
whereas unsupervised learning groups the inputs without any labeled data. Several
researchers have used machine learning algorithms to solve classification problems
in the educational domain.
Recent literature on identifying flight, demographic, climate, social, personal,
and other features indicates that machine learning plays a significant role in
predictive modeling.
IMPLEMENTATION
MODULES:
Data Collection
Dataset
Data Preparation
Model Selection
Analyze and Prediction
Accuracy on test set
Saving the Trained Model
MODULES DESCRIPTION:
Data Collection:
Collecting data is the first real step in developing a machine learning model.
It is a critical step that cascades into how good the model will be: the more
and better data we collect, the better the model will perform.
There are several techniques for collecting data, such as web scraping and manual
intervention.
Data Preparation:
We transform the data by getting rid of missing values and removing some
columns. First, we create a list of the column names that we want to retain.
Next, we drop all columns except those in the list.
Finally, we drop the rows that have missing values from the data set.
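The three preparation steps can be sketched in plain Python over a list of row
dictionaries. The column names and records below are hypothetical; with pandas
the same steps would usually be written as `df[columns_to_keep].dropna()`.

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"carrier": "AA", "dep_hour": 8, "tail_num": "N123", "delayed": 0},
    {"carrier": "DL", "dep_hour": None, "tail_num": "N456", "delayed": 1},
    {"carrier": "UA", "dep_hour": 17, "tail_num": None, "delayed": 1},
]

# Step 1: list the columns to retain.
columns_to_keep = ["carrier", "dep_hour", "delayed"]

# Step 2: drop every other column.
trimmed = [{col: row[col] for col in columns_to_keep} for row in raw_rows]

# Step 3: drop rows that still contain a missing value.
clean_rows = [row for row in trimmed if all(value is not None for value in row.values())]

print(len(raw_rows), "->", len(clean_rows))
```

Dropping columns before rows matters here: the third record is kept because its
only missing value sits in a column that was removed in step 2.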
Model Selection:
To create a machine learning model we need two data sets, one for training and
one for testing, but we have only one. We therefore split it in two with a ratio
of 80:20. We also divide the dataframe into feature columns and a label column.
The split function returns four data sets, which we label train_x, train_y,
test_x, and test_y. Inspecting the shape of these data sets confirms the split.
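An 80:20 split can be sketched as follows. The helper `split_train_test` is a
hypothetical stand-in for a library splitter, and the ten sample rows are
invented; the return order follows the naming used in this report.

```python
import random


def split_train_test(features, labels, test_ratio=0.2, seed=42):
    """Shuffle the rows and return train_x, train_y, test_x, test_y."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [features[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [features[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, train_y, test_x, test_y


# Hypothetical data: ten rows with two features each and a 0/1 delay label.
features = [[hour, hour % 3] for hour in range(10)]
labels = [1 if hour >= 6 else 0 for hour in range(10)]

train_x, train_y, test_x, test_y = split_train_test(features, labels)
print(len(train_x), len(test_x))  # 8 rows for training, 2 for testing
```

Shuffling before slicing avoids a split that simply follows the order in which
the records were collected.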
We use a Random Forest classifier, which fits multiple decision trees to the
data. We train the model by passing train_x and train_y to the fit method.
Once the model is trained, we need to test it. For that we pass test_x to the
predict method.
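Once predictions are available, they are compared with test_y. The sketch below
computes overall accuracy and the recall of each class; the helper `evaluate` and
the label lists are hypothetical. Per-class recall is worth reporting because
overall accuracy alone can hide weak performance on the delayed class.

```python
def evaluate(true_labels, predicted_labels):
    """Return overall accuracy and the recall of each class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    recall = {}
    for cls in sorted(set(true_labels)):
        # Rows whose true label is `cls`, and how many of them were predicted as `cls`.
        total = sum(t == cls for t in true_labels)
        hits = sum(t == cls and p == cls for t, p in zip(true_labels, predicted_labels))
        recall[cls] = hits / total
    return accuracy, recall


# Hypothetical labels: 0 = on time, 1 = delayed.
test_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy, recall = evaluate(test_y, predicted)
print("accuracy:", accuracy)
print("recall per class:", recall)
```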
Random Forest is one of the most powerful supervised learning methods and can be
used for both classification and regression problems. The algorithm is carried
out in two stages: the first creates the forest of decision trees from the given
data set, and the second makes the prediction by combining the outputs of the
trees.
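The two stages can be illustrated with a deliberately small toy ensemble. This is
not the library implementation used in the project: each "tree" here is a
single-split decision stump, and all names (`fit_stump`, `fit_forest`,
`predict_forest`) and the training rows are hypothetical. It only shows forest
creation from bootstrap samples followed by prediction through majority vote.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            # Each side predicts its majority label; an empty side borrows the other's.
            left_label = Counter(left or right).most_common(1)[0][0]
            right_label = Counter(right or left).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Stage 1: build the forest from bootstrap samples and random feature subsets."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # draw rows with replacement
        subset = rng.sample(range(n_features), max(1, n_features // 2))
        forest.append(fit_stump([rows[i] for i in sample], [labels[i] for i in sample], subset))
    return forest


def predict_forest(forest, row):
    """Stage 2: every tree votes and the majority label wins."""
    votes = [left if row[f] <= threshold else right for f, threshold, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical training rows: [departure_hour, wind_speed]; label 1 = delayed.
train_x = [[6, 5], [7, 8], [8, 4], [9, 7], [17, 25], [18, 30], [19, 28], [20, 35]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(train_x, train_y)
print(predict_forest(forest, [5, 3]), predict_forest(forest, [22, 40]))
```

Because every tree sees a different bootstrap sample and feature subset, a single
unlucky tree has little influence on the final vote.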
Once you are confident enough to take your trained and tested model into a
production-ready environment, the first step is to save it into a .h5 or .pkl
file; a .pkl file is written with the pickle library.
pickle is part of the Python standard library, so no separate installation is
needed.
Next, import the module and dump the model into a .pkl file.
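A minimal sketch of saving and restoring a model with pickle follows. The
dictionary stands in for a trained model object, and the file name
`flight_delay_model.pkl` is an assumption chosen for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works the same way.
trained_model = {"feature": 0, "threshold": 9, "left_label": 0, "right_label": 1}

model_path = os.path.join(tempfile.gettempdir(), "flight_delay_model.pkl")

# Dump the model to disk; pickle requires the file to be opened in binary mode.
with open(model_path, "wb") as model_file:
    pickle.dump(trained_model, model_file)

# Later, for example inside the web application, load it back.
with open(model_path, "rb") as model_file:
    restored_model = pickle.load(model_file)

print(restored_model == trained_model)
os.remove(model_path)  # clean up the temporary file
```

Only unpickle files from a trusted source, since loading a pickle can execute
arbitrary code.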
CHAPTER 5.
SYSTEM DESIGN
SYSTEM ARCHITECTURE:
DATA FLOW DIAGRAM:
[Figure: data flow diagram with nodes Preprocessing, Training dataset, Feature
Extraction, and Delay Predict; two branches are labeled Yes]
UML DIAGRAMS
GOALS:
[Figure: diagram with elements Input, Output, and Delay Predict]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of
interaction diagram that shows how processes operate with one another and in
what order. It is a construct of a Message Sequence Chart. Sequence diagrams are
sometimes called event diagrams, event scenarios, and timing diagrams.
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency. In the
Unified Modeling Language, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a system. An
activity diagram shows the overall flow of control.
[Figure: activity diagram with steps Input data, Preprocessing, Training, and
Predicted Delay]
CHAPTER 6
INPUT DESIGN AND OUTPUT DESIGN
INPUT DESIGN
The input design is the link between the information system and the user. It
comprises developing the specifications and procedures for data preparation, that
is, the steps necessary to put transaction data into a usable form for
processing. This can be achieved by instructing the computer to read data from a
written or printed document, or by having people key the data directly into the
system. The design of input focuses on controlling the amount of input required,
controlling errors, avoiding delay, avoiding extra steps, and keeping the
process simple. The input is designed so that it provides security and ease of
use while retaining privacy. Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations and the steps to follow when errors
occur.
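Input validation of this kind can be sketched as a function that collects error
messages instead of stopping at the first problem. The field names
(`flight_number`, `origin`, `dep_hour`) and their formats are assumptions made
for illustration; the report does not specify the actual input fields.

```python
import re


def validate_flight_input(form):
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    # Flight number: two capital letters followed by 1-4 digits (assumed format).
    if not re.fullmatch(r"[A-Z]{2}\d{1,4}", form.get("flight_number", "")):
        errors.append("flight_number must look like 'AA123'")
    # Origin airport: three-letter code such as LGA.
    if not re.fullmatch(r"[A-Z]{3}", form.get("origin", "")):
        errors.append("origin must be a three-letter airport code")
    # Scheduled departure hour: whole number from 0 to 23.
    try:
        hour = int(form.get("dep_hour", ""))
        if not 0 <= hour <= 23:
            errors.append("dep_hour must be between 0 and 23")
    except ValueError:
        errors.append("dep_hour must be a whole number")
    return errors


good = validate_flight_input({"flight_number": "AA123", "origin": "LGA", "dep_hour": "8"})
bad = validate_flight_input({"flight_number": "123", "origin": "LaGuardia", "dep_hour": "25"})
print(good)
print(bad)
```

Returning every error at once lets the input dialog show the operator all the
fields that need correcting in a single pass.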
OBJECTIVES
OUTPUT DESIGN
A quality output is one that meets the requirements of the end user and
presents the information clearly. In any system, the results of processing are
communicated to the users and to other systems through outputs. Output design
determines how the information is to be displayed for immediate need and also
as hard copy output. It is the most important and direct source of information
for the user. Efficient and intelligent output design improves the system’s
relationship with the user and helps decision-making.
Designing computer output should proceed in an organized, well thought out
manner; the right output must be developed while ensuring that each output
element is designed so that people find the system easy and effective to use.
When analysts design computer output, they should:
1. Identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced
by the system.
The output form of an information system should accomplish one or more of the
following objectives.
Convey information about past activities, current status, or projections of
the future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.
CHAPTER 7
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
SOFTWARE REQUIREMENTS:
SOFTWARE ENVIRONMENT
Python:
Python is a high-level, interpreted, interactive, and object-oriented scripting
language. Python is designed to be highly readable. It frequently uses English
keywords where other languages use punctuation, and it has fewer syntactical
constructions than other languages.
Python is Interpreted − Python is processed at runtime by the interpreter.
You do not need to compile your program before executing it. This is
similar to Perl and PHP.
History of Python
Python was developed by Guido van Rossum in the late eighties and early
nineties at the National Research Institute for Mathematics and Computer Science
in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++,
Algol-68, SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted. Its source code is available under the Python Software
Foundation License, an open-source license that is compatible with the GNU
General Public License (GPL).
Python is now maintained by a core team of developers, and Guido van Rossum
long held a vital role in directing its progress.
Python Features
Python's features include −
Easy-to-learn − Python has few keywords, simple structure, and a clearly
defined syntax. This allows newcomers to pick up the language quickly.
Apart from the above-mentioned features, Python has a long list of good
features, a few of which are listed below −
It supports functional and structured programming methods as well as
OOP.
It provides very high-level dynamic data types and supports dynamic type
checking.
Run the downloaded file. This brings up the Python install wizard, which is
really easy to use. Just accept the default settings, wait until the install is
finished, and you are done.
The Python language has many similarities to Perl, C, and Java. However, there
are some definite differences between the languages.
$ python
Python 2.4.3 (#1, Nov 11 2010, 13:34:43)
>>>
Type the following text at the Python prompt and press Enter −
>>> print "Hello, Python!"
If you are running a newer version of Python (Python 3), you need to call print
as a function with parentheses, as in print("Hello, Python!"). In Python
version 2.4.3, the statement above produces the following result −
Hello, Python!
Saved in a file named test.py, the same statement reads −
print "Hello, Python!"
We assume that you have the Python interpreter set in the PATH variable. Now, try
to run this program as follows −
$ python test.py
Hello, Python!
Flask Framework:
Flask is a web application framework written in Python. It was developed by
Armin Ronacher, who leads an international group of Python enthusiasts named
Pocoo. Flask is based on the Werkzeug WSGI toolkit and the Jinja2 template
engine. Both are Pocoo projects.
The HTTP protocol is the foundation of data communication on the World Wide Web.
Different methods of data retrieval from a specified URL are defined in this
protocol.
1 GET
2 HEAD
3 POST
4 PUT
5 DELETE
The following HTML form sends the entered name to the '/login' URL using the
POST method:
<html>
<body>
<form action="http://localhost:5000/login" method="post">
<p>Enter Name:</p>
<p><input type="text" name="nm" /></p>
<p><input type="submit" value="submit" /></p>
</form>
</body>
</html>
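from flask import Flask, redirect, request, url_for

app = Flask(__name__)

# Greets the user whose name arrives as the variable part of the URL.
@app.route('/success/<name>')
def success(name):
    return 'welcome %s' % name

# Reads the 'nm' field from the form data (POST) or the query string (GET).
@app.route('/login', methods=['POST', 'GET'])
def login():
    if request.method == 'POST':
        user = request.form['nm']
        return redirect(url_for('success', name=user))
    else:
        user = request.args.get('nm')
        return redirect(url_for('success', name=user))

if __name__ == '__main__':
    app.run(debug=True)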
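When the form is submitted with the POST method, the value of the 'nm' field is
obtained from the form data by:
user = request.form['nm']
It is passed to the '/success' URL as the variable part. The browser displays
a welcome message in the window.
If the method is changed to GET, the value is obtained from the query string
instead:
user = request.args.get('nm')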
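The difference between the two submission methods can be seen without running a
server. The sketch below uses only the standard library to show where the 'nm'
field travels in each case; the value "Alice" is a made-up example, and the code
illustrates the encoding only, not Flask's internals.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

form_data = {"nm": "Alice"}

# GET: the form fields travel in the URL's query string.
get_url = "http://localhost:5000/login?" + urlencode(form_data)
get_value = parse_qs(urlsplit(get_url).query)["nm"][0]

# POST: the same fields travel URL-encoded in the request body instead.
post_body = urlencode(form_data)
post_value = parse_qs(post_body)["nm"][0]

print(get_url)
print(get_value, post_value)
```

This is why the server reads GET values from the query string and POST values
from the submitted form data.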
Python Install
Many PCs and Macs will have Python already installed.
To check if you have Python installed on a Windows PC, search in the start bar
for Python or run the following on the Command Line (cmd.exe):
C:\Users\Your Name>python --version
To check if you have Python installed on Linux or a Mac, open the command line
(Linux) or the Terminal (Mac) and type:
python --version
If you find that you do not have Python installed on your computer, you can
download it for free from the following website: https://www.python.org/
Python Quickstart
Python is an interpreted programming language. This means that as a developer
you write Python (.py) files in a text editor and then pass those files to the
Python interpreter to be executed.
The way to run a Python file on the command line is:
C:\Users\Your Name>python helloworld.py
Let's write our first Python file, called helloworld.py, which can be done in any
text editor.
helloworld.py
print("Hello, World!")
Simple as that. Save your file. Open your command line, navigate to the
directory where you saved your file, and run:
C:\Users\Your Name>python helloworld.py
The output should read:
Hello, World!
Congratulations, you have written and executed your first Python program.
To test a short amount of code, it is sometimes quickest not to write it in a
file. Type python on the command line to start the interactive interpreter:
C:\Users\Your Name>python
From there you can write any Python, including our hello world example from
earlier in the tutorial:
C:\Users\Your Name>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, World!")
Hello, World!
Whenever you are done in the python command line, you can simply type the
following to quit the python command line interface:
exit()
Python code can also be run by creating a file with the .py file extension and
running it on the command line, as shown earlier with helloworld.py.
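Python Indentations
Indentation refers to the spaces at the beginning of a code line. Python uses
indentation to indicate a block of code.
Example
if 5 > 2:
    print("Five is greater than two!")
Python raises an error if the indentation is skipped:
Example
if 5 > 2:
print("Five is greater than two!")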
Comments
Comments start with a #, and Python treats the rest of the line as a
comment:
Example
Comments in Python:
#This is a comment.
print("Hello, World!")
Docstrings
Python uses triple quotes at the beginning and end of the docstring:
Example
"""This is a
multiline docstring."""
print("Hello, World!")
CHAPTER 9
SYSTEM STUDY
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business
proposal is put forth with a very general plan for the project and some cost
estimates. During system analysis, the feasibility study of the proposed system
is carried out. This is to ensure that the proposed system is not a burden
to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations are involved in the feasibility analysis:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system
will have on the organization. The amount of funds that the company can pour
into the research and development of the system is limited, so the expenditures
must be justified. The developed system is well within the budget, which was
achieved because most of the technologies used are freely available.
Only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is,
the technical requirements of the system. Any system developed must not
place a high demand on the available technical resources, as this would lead to
high demands being placed on the client. The developed system must have
modest requirements, as only minimal or no changes are required for
implementing this system.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system
efficiently. The user must not feel threatened by the system, but must instead
accept it as a necessity. The level of acceptance by the users depends solely on
the methods that are employed to educate users about the system and to make them
familiar with it. Their level of confidence must be raised so that they are also
able to offer constructive criticism, which is welcomed, as they are the final
users of the system.
CHAPTER 10
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of
trying to discover every conceivable fault or weakness in a work product. It
provides a way to check the functionality of components, subassemblies,
assemblies, and/or a finished product. It is the process of exercising software
with the intent of ensuring that the software system meets its requirements and
user expectations and does not fail in an unacceptable manner. There are various
types of tests. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is
the testing of individual software units of the application. It is done after
the completion of an individual unit and before integration. This is structural
testing that relies on knowledge of the unit’s construction and is invasive. Unit
tests perform basic tests at component level and test a specific business
process, application, and/or system configuration. Unit tests ensure that each
unique path of a business process performs accurately to the documented
specifications and contains clearly defined inputs and expected results.
Integration testing
Functional test
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be
written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
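Two of the features listed above, entry format and duplicate prevention, could be
covered by automated unit tests along the following lines. The helpers
`is_valid_airport_code` and `add_entry` are hypothetical stand-ins for the
application's own functions, which the report does not show.

```python
import re
import unittest


def is_valid_airport_code(value):
    """Hypothetical format rule: exactly three capital letters, such as LGA."""
    return re.fullmatch(r"[A-Z]{3}", value) is not None


def add_entry(entries, value):
    """Append `value` unless it is already present; report whether it was added."""
    if value in entries:
        return False
    entries.append(value)
    return True


class FieldEntryTests(unittest.TestCase):
    def test_entry_format(self):
        self.assertTrue(is_valid_airport_code("LGA"))
        self.assertFalse(is_valid_airport_code("LaGuardia"))

    def test_no_duplicate_entries(self):
        entries = []
        self.assertTrue(add_entry(entries, "LGA"))
        self.assertFalse(add_entry(entries, "LGA"))
        self.assertEqual(entries, ["LGA"])


# Run the suite programmatically so the script continues after the tests finish.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FieldEntryTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```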
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
6.3 Acceptance Testing
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
CHAPTER 11
CONCLUSIONS
In this project, we use flight, weather, and demand data to predict flight
departure delay. Our results show that the Random Forest method yields better
performance than the SVM model. The SVM model is also very time consuming and
does not necessarily produce better results. In the end, our model correctly
predicts 91% of the non-delayed flights. However, the delayed flights are
correctly predicted only 41% of the time. As a result, there can be additional
features related to the causes of flight delay that are not yet discovered using our
In the second part of the project, we can see that it is possible to predict flight delay
patterns from just the volume of concurrently published tweets, and their sentiment
and objectivity. This is not unreasonable; people tend to post about airport delays on
Twitter; it stands to reason that these posts would become more frequent, and more
profoundly emotional, as the delays get worse. Without more data, we cannot make a
robust model and find out the role of related factors and chance on these results.
However, as a proof of concept, there is potential for these results. It may be possible
REFERENCES
[1] A. B. Guy, "Flight delays cost $32.9 billion, passengers foot half the bill". [Online].
Available: https://news.berkeley.edu/2010/10/18/flight_delays/3/. [Accessed on
June 2017].
of arrival delay", Journal of Air Transport Management, Volume 13(6), pp. 355–
[4] A. A. Simmons, "Flight Delay Forecast due to Weather Using Data Mining", M.S.
2015.
[6] L. Schaefer and D. Millner, "Flight Delay Propagation Analysis With The Detailed
Policy Assessment Tool", Man and Cybernetics Conference, Tucson, AZ, 2001.
[7] B. Liu, "Sentiment Analysis and Opinion Mining Synthesis", Morgan & Claypool
https://towardsdatascience.com/data-cleaning-101-948d22a92e4. [Accessed on
March 2018].
[13] How to Predict Yes/No Outcomes Using Logistic Regression. [Online]. Available:
https://blog.cleaarbrain.com/posts/how-to-predict-yesno-outcomes-using-logistic-
[14] S. Polamuri, "How The Random Forest Algorithm Works In Machine Learning".
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-
learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Available: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-