Cricket Player Data Analysis Using Clustering Technique
[5] Data Analytics based Deep Mayo Predictor for IPL-9
Authors: Deep Prakash, Patvardhan, Vasantha Lakshmi:
Cricket analytics in the Indian Premier League (IPL) covers
various studies related to player valuation, team performance
measurement, all-rounder classification, player wage
determination, and player pricing. The importance of predictive
modeling in cricket, especially in T20 matches, was highlighted
due to the significant investments made by franchises. The
literature also discussed the challenges of incorporating current
form alongside career statistics for accurate predictions. Machine
learning techniques were emphasized for performance ranking and
outcome prediction in IPL matches. Overall, the literature review
underscores the significance of data analytics and machine
learning in enhancing decision-making processes in cricket.

[6] Cricket Players Performance Prediction and Evaluation Using
Machine Learning Algorithms
Authors: Sumathi, Prabu and Rajkamal:
This paper presents a system that leverages machine learning
algorithms, including K-means clustering and hierarchical models,
to predict and evaluate cricket players' performance, ultimately
aiming to enhance player ranking and increase match-winning
probabilities. By analyzing individual player performance,
clustering similar players based on attributes, and selecting the
top performers for team formation, the proposed system
demonstrates effectiveness in predicting player performance and
optimizing team selection. The experimental findings highlight the
system's capability in accurately predicting player performance
and selecting the best players for a cricket team, showcasing the
potential of machine learning in revolutionizing player evaluation
and team formation in cricket.

3. PROPOSED WORK
• To analyze and understand the performance of cricket players
using statistical tools and graphics.
• To generate insights into the strengths and weaknesses of
players, identify trends, and recognize patterns of player
performance.
• Gather cricket score data from various sources, including
official databases, team statistics, and other relevant
repositories.
• Address missing values, outliers, and inconsistencies in the
data. Transform data if needed, such as converting categorical
variables into numerical representations.

The data used in this work consists of a large collection of
cricket score data collected from the Kaggle repository.
Historical records from various cricket archives were used to
gather information on scores and statistics from past matches.
Surveys were conducted with cricket fans and experts to gather
their opinions and perspectives on the game.

Pre-Processing:
Data pre-processing is one of the most essential parts of a data
science project and consumes a major share of the time dedicated
to it. Pre-processing includes getting rid of erroneous and
inconsistent data, formatting the data, and filling in missing
values. Unwanted data, including duplicate observations, is
removed. Pre-processing mainly deals with the correction,
standardization, and transformation of data, and is done to make
sure the outcomes are reliable.

Feature Extraction:
Feature selection is an essential phase in which the parameters
used to analyze a cricketer's performance are decided. For a
batsman, the parameters considered are player, span, matches,
innings, not outs, runs, highest score, average, strike rate, and
the number of 100s, 50s, and 0s scored.

Classifiers:

K-means Algorithm:
In the cricket score data analysis project, the K-means algorithm
is utilized as a clustering technique to group players based on
their performance metrics. The algorithm is applied in the
following manner:
1. Initialization: The algorithm begins by randomly initializing K
centroids, where K represents the number of clusters to be
formed. These centroids serve as the initial cluster centers.
2. Assignment: Each data point, in this case each player's
performance metrics such as runs scored, average, strike rate, and
other relevant statistics, is assigned to the closest centroid
based on a distance metric, typically the Euclidean distance. This
step involves calculating the distance between each data point and
the centroids and assigning the data point to the cluster with the
nearest centroid.
3. Update: After assigning all data points to clusters, the
algorithm recalculates the centroid of each cluster by taking the
mean of all the data points assigned to that cluster. This step
involves updating the cluster centers based on the data points
assigned to each cluster.
4. Iteration: Steps 2 and 3 are repeated iteratively until
convergence is reached. Convergence occurs when the centroids
no longer change significantly between iterations or when a
maximum number of iterations is reached.
5. Final Clusters: The final clusters and centroids obtained after
convergence represent the solution provided by the K-means
algorithm. Each cluster contains players with similar performance
characteristics based on the selected metrics.
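The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: the function name `kmeans`, its parameters (`max_iters`, `tol`, `seed`), the choice of initial centroids (K randomly chosen data points), and the sample player values are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster `points` (equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1 (Initialization): use k randomly chosen data points as centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 2 (Assignment): nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3 (Update): move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4 (Iteration): stop once no centroid moves more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    # Step 5 (Final clusters): labels[i] is the cluster index of points[i].
    return labels, centroids


# Hypothetical (batting average, strike rate) pairs, for illustration only.
players = [(52.0, 90.0), (50.0, 92.0), (48.0, 88.0),
           (20.0, 140.0), (22.0, 145.0), (18.0, 138.0)]
labels, centroids = kmeans(players, k=2)
```

The cluster indices themselves are arbitrary; only the grouping is meaningful. In practice the result can depend on the random initialization, so the algorithm is often run several times with different seeds.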
By applying the K-means algorithm to the cricket score data,
players can be grouped into clusters based on their performance,
allowing for the identification of patterns and insights into player
performance. This clustering approach enables the analysis of
player similarities and differences, aiding in strategic decision-
making for coaches, managers, and analysts in the cricket domain.
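Because K-means relies on Euclidean distance, metrics on very different scales (career runs in the thousands, strike rates near 100) would otherwise dominate the grouping, which is one reason the pre-processing stage matters. The sketch below illustrates the pre-processing operations named in this paper (removing duplicates, filling missing values, standardization) in plain Python. The function name `preprocess`, the record layout, mean imputation, and z-score scaling are assumptions chosen for illustration; the paper does not state which specific methods were used.

```python
import statistics


def preprocess(records, features):
    """Deduplicate rows, fill missing values, and z-score the given features."""
    # Remove duplicate observations while preserving the original order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is not modified
    for f in features:
        observed = [r[f] for r in unique if r.get(f) is not None]
        fill = statistics.fmean(observed)
        # Assumption: a missing value is replaced by the column mean.
        column = [r[f] if r.get(f) is not None else fill for r in unique]
        mean = statistics.fmean(column)
        # Population standard deviation; use 1.0 for a constant column.
        std = statistics.pstdev(column) or 1.0
        for r, value in zip(unique, column):
            r[f] = (value - mean) / std
    return unique


# Hypothetical records, for illustration only.
raw = [
    {"player": "A", "runs": 5000, "strike_rate": 90.0},
    {"player": "A", "runs": 5000, "strike_rate": 90.0},   # duplicate row
    {"player": "B", "runs": 1000, "strike_rate": None},   # missing value
    {"player": "C", "runs": 3000, "strike_rate": 130.0},
]
clean = preprocess(raw, ["runs", "strike_rate"])
```

After this step every selected feature has zero mean and unit variance, so each contributes comparably to the distance computation.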
Hierarchical Clustering:
Hierarchical clustering is utilized in the cricket score data
analysis project to group players based on their performance
metrics. The algorithm is applied as follows:
1. Initialization: The algorithm starts with each player as a
separate cluster and then iteratively merges clusters based on their
similarities until a final set of clusters is obtained.
2. Distance Calculation: A distance metric, such as Euclidean
distance or Manhattan distance, is used to measure the similarity
between players' performance metrics, such as runs scored,
average, and strike rate.
3. Cluster Merging: The algorithm merges the clusters that are
closest to each other based on the calculated distances, creating
a hierarchy of clusters.
4. Dendrogram Creation: A dendrogram is constructed to visualize
the hierarchical relationships between players and clusters,
showing how they are grouped at different levels of similarity.
5. Cluster Interpretation: The final clusters obtained from
hierarchical clustering represent groups of players with similar
performance characteristics. These clusters can provide insights
into player similarities, strengths, and weaknesses, aiding in
strategic decision-making for coaches and analysts in the cricket
domain.
By applying hierarchical clustering in the cricket score data
analysis project, players can be grouped hierarchically based on
their performance metrics, allowing for a deeper understanding of
player relationships and performance patterns.

Fig-4.2: Data Flow Diagram

5. DATA DEFINITION
Cricket score data analysis refers to the process of collecting,
cleaning, transforming, and visualizing cricket score data in
order to draw insights, identify patterns, and make meaningful
conclusions about the game of cricket. The aim is to understand
various aspects of the game, such as player performance, team
strategies, and match outcomes, by using statistical and machine
learning techniques. This helps to make better decisions, improve
player performance, and enhance the overall spectator experience.

6. MODELING AND ANALYSIS
Cricket score analysis can be used to determine player and team
performance over a period of time, identify strengths and
weaknesses, and track progress. Analysis can be used to understand
match patterns and tendencies, and to create effective playing
strategies. By analyzing data from past matches and player
statistics, cricket score analysis can also identify emerging
trends in the game.
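The agglomerative procedure described under Hierarchical Clustering (start with one cluster per player, compute distances, repeatedly merge the closest pair, and record the merges for a dendrogram) can be sketched as follows. This is a naive illustration under stated assumptions: the function name `agglomerative`, the use of average linkage (the paper does not specify a linkage rule), and the sample values are all hypothetical.

```python
import math


def agglomerative(points, n_clusters, metric="euclidean"):
    """Merge the two closest clusters until `n_clusters` (>= 1) remain."""

    def distance(a, b):
        if metric == "manhattan":
            return sum(abs(x - y) for x, y in zip(a, b))
        return math.dist(a, b)

    def linkage(c1, c2):
        # Assumption: average linkage (mean pairwise distance between members).
        total = sum(distance(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Initialization: every player starts as a separate cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, distance): the data behind a dendrogram
    while len(clusters) > n_clusters:
        # Distance calculation and merging: find and join the closest pair.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical (batting average, strike rate) pairs, for illustration only.
players = [(52.0, 90.0), (50.0, 92.0), (48.0, 88.0),
           (20.0, 140.0), (22.0, 145.0), (18.0, 138.0)]
clusters, merges = agglomerative(players, n_clusters=2)
```

Each entry in `merges` records which groups were joined and at what distance, which is the information a dendrogram draws. The pairwise search here is cubic in the number of players, so it is only suitable for small data sets.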
Fig-4.1: Working Architecture

7. CHALLENGES
1. Data Collection: The first challenge in cricket score analysis is
collecting accurate and comprehensive data. This includes scoring
information, player statistics, team statistics, match conditions,
and more.
2. Data Quality: The quality of the data collected is critical in
determining the accuracy of the analysis. Factors such as errors in
data entry, missing information, and inconsistent data formatting
can all impact the quality of the data.
3. Data Integration: Integrating data from different sources, such
as from different seasons or from different cricket leagues, can be
a challenging task. This is due to the differences in data structure,
formatting, and naming conventions used by different sources.
4. Data Visualization: Visualizing cricket score data can be
challenging, as it is often complex and multidimensional. Data
analysts need to be able to present the data in a way that is easy to
understand and interpret.
5. Statistical Analysis: Score analysis involves complex statistical
analysis, which can be challenging for analysts without a strong
statistical background. This includes understanding the
underlying distribution of the data and selecting appropriate
statistical models.
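As a small illustration of the data integration challenge, the sketch below maps differing column names from two sources onto one schema before the records are combined. The alias table, the canonical names, and the function names `normalize_record` and `integrate` are hypothetical; real sources would need their own mapping.

```python
# Hypothetical aliases: the column names used by each source are assumptions.
COLUMN_ALIASES = {
    "player": "player", "player_name": "player", "batsman": "player",
    "runs": "runs", "total_runs": "runs",
    "sr": "strike_rate", "strike rate": "strike_rate",
    "strike_rate": "strike_rate",
}


def normalize_record(record):
    """Rename known columns to canonical names; keep unknown columns as-is."""
    out = {}
    for key, value in record.items():
        cleaned = key.strip().lower()
        out[COLUMN_ALIASES.get(cleaned, cleaned)] = value
    return out


def integrate(*sources):
    """Concatenate records from several sources under one schema."""
    return [normalize_record(rec) for source in sources for rec in source]


# Two hypothetical sources with different naming conventions.
season_a = [{"Player_Name": "A", "SR": 90.0}]
season_b = [{"batsman": "B", "Strike Rate": 120.0}]
combined = integrate(season_a, season_b)
```

Keeping the mapping in one table makes it easy to extend when a new source introduces yet another naming convention.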
8. RESULTS
The graph shows the analyzed data on matches played and identifies
the players with the longest playing careers. This result helps us
understand how many matches each player has played.
Final list of players using the K-means clustering algorithm
11. REFERENCES
[1]. Tim Swartz, Paramjit Gill, and Saman Muthukumarana.
“Modelling and simulation for one-day cricket”. In: Canadian
Journal of Statistics 37 (June 2009), pp. 143–160. doi:
10.1002/cjs.10017.
[5]. Rahul Chakwate and Madhan R. Analysing Long Short Term
Memory Models for Cricket Match Outcome Prediction. Nov.
2020.