BCA 8th Project report(Linear regression)
A project report on
Submitted to
Supervisor Recommendation
I hereby recommend that this project report under my supervision by Sameer Ansari and Sonam Dorje
Lama entitled “IPL Prediction System” in partial fulfillment of the requirement for a Bachelor's Degree
in Computer Application of Tribhuvan University be processed for evaluation
………………….
Mr. Amit Chaudhary
Project Supervisor
Bouddha, Kathmandu
TRIBHUVAN UNIVERSITY
Faculty of Humanities & Social Science
LETTER OF APPROVAL
This is to certify that this project prepared by Sameer Ansari and Sonam Dorje Lama
entitled “IPL Prediction System” in partial fulfillment of the requirements for the degree of
Bachelor in Computer Application has been evaluated. In our opinion, it is satisfactory in the
scope and quality of a project for the required degree.
We would like to express our deepest appreciation to all those who made it possible for us to
complete this report. Special gratitude goes to our final project supervisor,
Mr. Amit Chaudhary, whose stimulating suggestions and encouragement
helped us throughout the project, especially in writing this report.
Furthermore, we would like to acknowledge with much appreciation the crucial role of
the coordinator, who gave us permission to use all the required equipment and the necessary
materials to complete our project. Special thanks go to our Academic Manager, Mr. Tyson
Lama, who gave us valuable suggestions regarding the project. Last but not least, many
thanks go to our teachers, friends and guardians, who directly or indirectly helped us in
achieving our goal. We are also grateful for all the guidance, comments and advice that
improved our presentation skills.
ABSTRACT
Today, data analysis is essential for every data analyst who examines data sets to extract
useful information and draw conclusions from it. Data analytics techniques and algorithms are
widely used by commercial industries, enabling them to make precise business decisions. They are
also used by analysts and experts to verify or refute experimental layouts, assumptions and
conclusions. In recent years, analytics has been used in the field of sports to make predictions
and draw various insights. Because of the involvement of money, team spirit, city loyalty and a
massive fan following, the outcome of matches is very important to all stakeholders. In this
report, the past seven years of IPL data, containing player details, match venue details, teams
and ball-by-ball details, is analyzed to draw various conclusions that help in improving a
player's performance. Other factors, such as how the venue or the toss decision has influenced
match outcomes over the last seven years, are also examined. The machine learning and data
extraction models considered for prediction include Linear Regression, Decision Tree, K-means and
Logistic Regression. The cross-validation score and the accuracy are also calculated for the
various machine learning algorithms. Before prediction, the data must be explored and visualized,
because data exploration and visualization are an important stage of predictive modeling.
TABLE OF CONTENTS
Abstract
Acknowledgement
Chapter 1
INTRODUCTION
1.1 Introduction
1.2 Plan of Implementation
1.3 Problem Statement
1.4 Objective of the Program
Chapter 2
BACKGROUND STUDY & LITERATURE REVIEW
Chapter 3
SYSTEM ANALYSIS AND DESIGN
3.1 Data Collection
3.2 Data Processing
3.3 Data Visualization
3.4 Model Development and Evaluation
Chapter 4
SYSTEM REQUIREMENTS SPECIFICATION
4.1 Functional Requirements
4.2 Non-Functional Requirements
4.3 System Configuration
4.4 Hardware Requirements
4.5 Software Requirements
Chapter 5
SYSTEM DESIGN
5.1 System Development Methodology
Chapter 6
IMPLEMENTATION & TESTING
Chapter 7
RESULTS
Chapter 8
FUTURE SCOPE AND CONCLUSION
REFERENCES
LIST OF FIGURES
Introduction
Machine Learning is a branch of Artificial Intelligence that aims at solving real-life
engineering problems. Rather than relying on explicitly programmed rules, it depends on learning
from data: the machine learns from pre-existing data and predicts results accordingly.
Machine Learning methods benefit from the use of decision trees, heuristic learning, knowledge
acquisition and mathematical models. It thus provides controllability, observability, stability and
effectiveness.
Cricket is played in many countries around the world, and many domestic and international
cricket tournaments are held. The game has various formats, such as Test matches, Twenty20
Internationals and One Day Internationals. The IPL is one such tournament and is among the most
popular. It is a Twenty20 cricket league played to inspire young and talented players in India.
The league is conducted annually in March, April or May and has a huge fan base in India. There
are eight teams representing eight cities, chosen through an auction. These teams compete against
each other for the trophy. The outcome of a match depends on the team's luck, the players'
performance and many other parameters that must be taken into consideration. The match played the
day before can also affect the prediction. Stakeholders benefit greatly from the league's
popularity and the large crowds at the venues. The accuracy of a prediction depends on the size of
the data set taken for analysis and on the records used for predicting the outcome.
Cricket is a game played between two teams of 11 players each. The result is a win, a loss or a
tie. However, a game is sometimes washed out because of bad weather, as cricket cannot be played
in rain. Moreover, the game is extremely unpredictable, because the momentum can shift to either
team at any stage. Often the result is decided on the last ball of a close match. Given these
unpredictable scenarios, there is huge interest among spectators in predicting the outcome, either
at the start of the game or during it. Many spectators also bet on matches to win money.
Plan of Implementation
The project can be broken down into 7 main steps which are as follows:
Problem Statement
To predict the results of an IPL match using machine learning techniques or algorithms such
as Logistic Regression, Gaussian Naive Bayes, K Nearest Neighbours, SVM, Gradient boost
algorithm, Decision tree and Random forest.
We have used 17 features which are as follows: season, city, date, team1, team2, toss_winner,
toss_decision, result, dl_applied, winner, win_by_runs, win_by_wickets, player_of_match, venue,
umpire1, umpire2 and umpire3.
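As a minimal sketch of how the listed columns might be organised for prediction, the snippet below stores the 17 names and separates the label (`winner`) from the inputs. The helper `split_row` and the `POST_MATCH` set are hypothetical and not part of the report; excluding columns that are only known after a match ends is an assumption made here for illustration.

```python
# Column names exactly as listed in the problem statement.
FEATURES = [
    "season", "city", "date", "team1", "team2", "toss_winner",
    "toss_decision", "result", "dl_applied", "winner", "win_by_runs",
    "win_by_wickets", "player_of_match", "venue", "umpire1", "umpire2",
    "umpire3",
]

# The quantity to be predicted.
TARGET = "winner"

# Assumption (not stated in the report): these columns are only known once
# the match is over, so this sketch leaves them out of the model inputs.
POST_MATCH = {"result", "win_by_runs", "win_by_wickets", "player_of_match"}


def split_row(row):
    """Split one match record (a dict) into model inputs and the label."""
    inputs = {
        name: row[name]
        for name in FEATURES
        if name != TARGET and name not in POST_MATCH
    }
    return inputs, row[TARGET]


# Hypothetical record with placeholder values, for illustration only.
example = {name: "placeholder" for name in FEATURES}
example["winner"] = "Team A"
example_inputs, example_label = split_row(example)
```

Whether the post-match columns should be dropped depends on when the prediction is made; a pre-match predictor cannot use them.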
Chapter 2
BACKGROUND STUDY &
LITERATURE SURVEY
To gain the required knowledge of the concepts related to the present application, the
existing literature was studied. Some of the important conclusions drawn from it are
listed below.
1. Kalpdrum Passi and Niravkumar Pandey discussed prediction accuracy in
terms of the runs scored by a batsman and the number of wickets taken by a bowler in each
team [1].
3. R. P. Schumaker et al. discussed different statistical simulations used in predictive
modeling for different sports [3].
4. John McCullagh implemented neural networks and data mining techniques to identify
talent and to select players based on that talent in the Australian Football
League [4].
5. Bunker et al. proposed a novel sport prediction framework to solve specific challenges and
predict sports results [5].
6. Ramon Diaz-Uriarte et al. investigated the use of random forest for the classification of
microarray data and proposed a new method of gene selection in classification problems
based on random forest [6].
7. Rabindra Lamsal and Ayesha Choudhary proposed a solution to calculate the weightage
of a team based on its players' past IPL performance using linear regression [7].
8. Akhil Nimmagadda et al. proposed a model using Multiple Variable Linear Regression
and Logistic Regression to predict the score in different innings, and also the winner of the
match using the Random Forest algorithm [8].
9. Ujwal U J et al. predicted the outcome of a given cricket match by analyzing previous
cricket matches using the Google Prediction API [9].
10. Rameshwari Lokhande and P. M. Chawan came up with live cricket score prediction using
linear regression and the Naïve Bayes classifier [10].
11. Abhishek Naik et al. proposed a new model that uses a matrix factorization technique to
analyze and predict the winner of an ODI cricket match [11].
12. Esha Goel and Er. Abhilasha discussed improvements to the Random Forest
algorithm and described its usage in various fields such as agriculture, astronomy and
medicine [12].
13. Amit Dhurandhar and Alin Dobra proposed a new methodology for analysing the error
of classifiers and model selection measures, and used it to analyse the decision tree algorithm [13].
14. H. Yusuff et al. performed logistic regression using mammograms to find the accuracy
with valid samples [14].
Chapter 3
APPROACH AND DESIGN
The figure below explains the approach we have taken to build the predictive model using
machine learning algorithms.
Data Collection
Data collection is the process of gathering and measuring information from many
different sources. We collect data in order to develop practical machine learning
solutions.
Collecting data allows us to capture a record of past events so that we can use data analysis to
find recurring patterns. From those patterns, we build predictive models using machine learning
algorithms that look for trends and predict future changes.
The Indian Premier League's official website is the principal source of data for this project. The
data was scraped from the website using a Python library called Beautiful Soup and stored in an
appropriate format. The dataset has columns for the match number, the IPL season year, the
place where the match was held and the stadium name, the match winner, the participating
teams, the winning margin, the umpire details, the player of the match, etc. The Indian Premier
League was only 11 years old, which is why only 577 matches were available after
pre-processing. Some of the columns may contain null values, and some of the attributes may not
be required for match winner prediction; both are discussed under data preprocessing.
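The scraping step can be sketched as follows. The report used Beautiful Soup; to keep this example free of third-party dependencies it swaps in the standard library's `html.parser` instead, so it illustrates the idea rather than the code actually used. The HTML fragment, the team names and the `TableParser` class are hypothetical, since the layout of the real page is not described in the report.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows, each a list of cell strings
        self._row = None      # row currently being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
  <tr><th>season</th><th>team1</th><th>team2</th><th>winner</th></tr>
  <tr><td>2017</td><td>Team A</td><td>Team B</td><td>Team A</td></tr>
  <tr><td>2017</td><td>Team C</td><td>Team A</td><td>Team C</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row holds the column names; the rest are match records.
header, *records = parser.rows
matches = [dict(zip(header, record)) for record in records]
```

Each entry of `matches` is a dictionary keyed by column name, which is a convenient shape for writing to CSV or for the cleaning step that follows.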
Data Preprocessing
Data cleaning
The dataset contains some null values in columns such as winner, city and venue.
Because of these null values, classification cannot be done
accurately. So, we replaced the null values in the different columns with dummy
values.
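A minimal sketch of this replacement, assuming each match is held as a dictionary and that a missing entry appears as `None` or an empty string. The placeholder strings in `DUMMY_VALUES` and the function `fill_nulls` are assumptions for illustration; the report does not say which dummy values were used.

```python
# Assumption: illustrative placeholders, not the values used in the report.
DUMMY_VALUES = {"winner": "No Result", "city": "Unknown", "venue": "Unknown"}


def fill_nulls(rows, dummy_values=DUMMY_VALUES):
    """Return copies of the rows with missing entries replaced by dummies.

    An entry counts as missing if the key is absent, None, or "".
    The input rows are left unchanged.
    """
    cleaned = []
    for row in rows:
        fixed = dict(row)
        for column, dummy in dummy_values.items():
            if fixed.get(column) in (None, ""):
                fixed[column] = dummy
        cleaned.append(fixed)
    return cleaned


# Hypothetical records, for illustration only.
raw = [
    {"city": "City X", "venue": "Stadium 1", "winner": "Team A"},
    {"city": None, "venue": "Stadium 2", "winner": ""},
]
clean = fill_nulls(raw)
```

Using a distinct placeholder, rather than an existing team or city name, keeps the filled-in rows distinguishable from genuine ones later on.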
This step is the main part, where we can eliminate columns of the dataset that
are not useful for estimating the match-winning team. Usefulness is estimated using
feature importance. The considered attributes have the following feature importance.
Data Visualization
The collected data is visualized for a better understanding
of the information.
The Matplotlib library is used here for plotting the graphs.
Data visualization is necessary to understand the solution better. The graphs
below were drawn from the previous seasons of IPL matches.
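The kind of summary that such graphs display can be sketched as below. This example only computes the numbers (wins per team, and how often the toss winner also won the match) using the standard library; it does not draw anything. The function names and the four sample records are hypothetical.

```python
from collections import Counter


def wins_per_team(matches):
    """Count how many matches each team has won."""
    return Counter(match["winner"] for match in matches)


def toss_win_rate(matches):
    """Fraction of matches in which the toss winner also won the match."""
    if not matches:
        return 0.0
    same = sum(1 for match in matches if match["toss_winner"] == match["winner"])
    return same / len(matches)


# Hypothetical records, for illustration only.
sample = [
    {"toss_winner": "Team A", "winner": "Team A"},
    {"toss_winner": "Team B", "winner": "Team A"},
    {"toss_winner": "Team C", "winner": "Team C"},
    {"toss_winner": "Team B", "winner": "Team C"},
]

wins = wins_per_team(sample)
rate = toss_win_rate(sample)
```

The keys and values of `wins` could then be handed to a Matplotlib bar chart, with one bar per team.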
Here, we have developed a generic model and applied all the classification methods. The data
is split into training data and test data; we train the model using certain features, use it to
predict the test data, and then calculate the performance of the system. The classification
models used are: Logistic Regression, Gaussian Naïve Bayes Classifier, KNN (K Nearest
Neighbor) algorithm, Support Vector Machines, Gradient Boost Algorithm, Decision Trees and
Random Forest Classifier. Among these methods, Random Forest and Decision Tree have given
good results.
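The split, train, predict and score loop described above can be sketched with one of the listed models, K Nearest Neighbor. This is a plain-Python illustration under stated assumptions, not the implementation used in the project: it measures distance as the number of differing categorical fields (Hamming distance), and the rows are synthetic toy data in which the toss winner always wins, which is not a property of real IPL matches.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def hamming(a, b, features):
    """Number of listed features on which two rows disagree."""
    return sum(a[f] != b[f] for f in features)


def knn_predict(train, query, features, target, k=3):
    """Majority label among the k training rows closest to the query."""
    nearest = sorted(train, key=lambda row: hamming(row, query, features))[:k]
    votes = Counter(row[target] for row in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, features, target, k=3):
    """Share of test rows whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train, row, features, target, k) == row[target]
        for row in test
    )
    return correct / len(test)


# Synthetic toy rows (assumption: the toss winner always wins).
rows = [
    {"team1": t1, "team2": t2, "toss_winner": toss, "winner": toss}
    for t1, t2 in [("Team A", "Team B"), ("Team B", "Team A")]
    for toss in (t1, t2)
    for _ in range(5)
]

FEATURE_COLUMNS = ["team1", "team2", "toss_winner"]
train, test = train_test_split(rows)
score = accuracy(train, test, FEATURE_COLUMNS, "winner")
```

The same `train_test_split` and `accuracy` helpers could wrap any of the other classifiers, which is what makes a side-by-side comparison of models possible.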
Chapter 4
The SRS document states in precise and explicit language the functions and capabilities a
software system (i.e., a software application, an e-commerce website and so on) must provide, as
well as any constraints by which the system must abide. The SRS also functions as a
blueprint for completing a project with as little cost growth as possible. The SRS is often
referred to as the “parent” document because all subsequent project management documents, such as
design specifications, statements of work, software architecture specifications, testing and
validation plans, and documentation plans, are related to it.
Functional Requirements
Functional Requirement defines a function of a software system and how the system must behave
when presented with specific inputs or conditions. These may include calculations, data
manipulation and processing and other specific functionality.
Following are the functional requirements on the system:
1. The whole process can be handled with minimal human interaction on both Android and the web.
2. The application automatically receives the captured data from the server.
3. The user can call emergency, map location and ECG graph on demand.
4. The system gives a warning message.
Non-functional requirements are the requirements which are not directly concerned with
the specific function delivered by the system. They specify the criteria that can be used to judge
the operation of a system rather than specific behaviours. They may relate to emergent system
properties such as reliability, response time and store occupancy. Non-functional requirements
arise through the user needs, because of budget constraints, organizational policies, the need for
interoperability with other software and hardware systems, or because of external factors such
as:
Performance Requirements
Design Requirements
Security Constraints
Basic Operational Requirements
Product Requirements
System Configuration
Hardware Requirements
Processors - Pentium IV Processor
Speed - 3.00 GHz
RAM - 2 GB
Storage - 20 GB
Software Requirements
Operating system - Windows 10 Professional
IDE used - Visual Studio Code
Chapter 5
SYSTEM DESIGN
Design is a meaningful engineering representation of something that is to be built. It is the most
crucial phase in the development of a system. Software design is a process through which the
requirements are translated into a representation of the software. Design is the place where
quality is fostered in software engineering. Based on the user requirements and the detailed
analysis of the existing system, the new system must be designed. This is the phase of system
design. Design is the way to accurately translate a customer's requirements into the finished
software product. Design creates a representation or model and provides details about the software
data structures, architecture, interfaces and components that are necessary to implement the
system. The logical system design arrived at as a result of systems analysis is converted into a
physical system design.
Model phases
Requirement Analysis: This stage is concerned with gathering the requirements of the system.
The process involves producing requirement documents and reviewing them.
System Design: Keeping the requirements in mind, the system
specifications are translated into a software representation. In this stage the designer
emphasizes algorithms, data structures, software architecture and so on.
Coding: In this stage the developer starts coding in order to give a full sketch
of the product. In other words, the system specifications are converted into
machine-readable code.
Implementation: The implementation stage involves the actual coding or programming of the
software. The output of this stage is typically the library, executables, user manuals and
additional software documentation.
Testing: In this stage all programs (modules) are integrated and tested to ensure that the
complete system meets the software requirements. Testing is concerned with verification and
validation.
Maintenance: The maintenance stage is the longest stage, in which the software is updated to
satisfy changing user needs, adapt to changes in the external environment, correct errors and
oversights previously undetected in the testing stage, and improve the efficiency of the software.
System Architecture
Chapter 6
IMPLEMENTATION
Chapter 7
RESULTS
MODEL ACCURACY
Chapter 8
REFERENCES
T. A. Severini, Analytic methods in sports: Using mathematics and statistics to understand
data from baseball, football, basketball, and other sports. Chapman and Hall/CRC, 2014.
H. Ghasemzadeh and R. Jafari, “Coordination analysis of human movements with body
sensor networks: A signal processing model to evaluate baseball swings,” IEEE Sensors
Journal, vol. 11, no. 3, pp. 603–610, 2010.
R. Rein and D. Memmert, “Big data and tactical analysis in elite soccer: future challenges
and opportunities for sports science,” SpringerPlus, vol. 5, no. 1, p. 1410, 2016.
Veppur Sankaranarayanan, Vignesh and Sattar, Junaed and
Lakshmanan, “Auto-play: A Data Mining Approach to ODI Cricket
Simulation and Prediction,” SIAM Conference on Data Mining, 2014.
K. A. A. D. Raj and P. Padma, “Application of Association Rule
Mining: A case study on team India,” 2013 International Conference
on Computer Communication and Informatics, 2013.
T. B. Swartz, P. S. Gill and S. Muthukumarana, “Modelling
and simulation for one-day cricket,” Canadian Journal of Statistics, vol. 37, no. 2,
pp. 143–160, 2009.