Foundation of Data Science 2
GROUP NO – 24
T.JATAVEDA 22BDS0391
M.SAAKETH 22BCE2683
G.ABHINAV 22BCE3750
1. Project Overview
Objective:
The goal is to develop an Athletic Performance Analysis System that collects, processes,
and analyzes real-time performance data of athletes. This system should:
Handle structured data (SQL) such as player profiles, performance statistics, and
match records.
Handle unstructured data (NoSQL/JSON) such as sensor outputs, wearable device
logs, and GPS data.
Provide real-time analytics and feedback to enhance performance.
Generate personalized training recommendations based on historical and current
metrics.
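As a minimal sketch of how the structured and unstructured sides could meet, the example below keeps player profiles in an in-memory SQLite table and parses wearable readings from JSON lines, then joins the two by player ID. The table layout, the field names (heart_rate, speed_kmh), and the sample values are hypothetical and only illustrate the idea; they are not the project's actual schema.

```python
import json
import sqlite3

# Structured data: a hypothetical players table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT, position TEXT)"
)
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [(1, "Player A", "Forward"), (2, "Player B", "Midfielder")],
)

# Unstructured data: hypothetical wearable readings, one JSON object per line.
sensor_log = """
{"player_id": 1, "heart_rate": 152, "speed_kmh": 24.5}
{"player_id": 1, "heart_rate": 160, "speed_kmh": 27.1}
{"player_id": 2, "heart_rate": 140, "speed_kmh": 21.0}
"""
readings = [json.loads(line) for line in sensor_log.strip().splitlines()]


def peak_speed_by_player(conn, readings):
    """Combine each player's profile with the highest speed in the sensor log."""
    peaks = {}
    for reading in readings:
        pid = reading["player_id"]
        peaks[pid] = max(peaks.get(pid, 0.0), reading["speed_kmh"])
    rows = conn.execute(
        "SELECT player_id, name, position FROM players ORDER BY player_id"
    )
    return [
        {"name": name, "position": position, "peak_speed_kmh": peaks.get(pid)}
        for pid, name, position in rows
    ]


summary = peak_speed_by_player(conn, readings)
for row in summary:
    print(row)
```

In a real deployment the JSON side would more likely live in a document store and arrive as a stream; the join-by-ID step stays the same.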
Key Capabilities:
Text mining plays a crucial role in handling unstructured textual data such as coach feedback,
athlete interviews, medical reports, and match commentaries. The main areas include:
Text Cleaning: Involves removing punctuation, stop words (like “the”, “is”), and
lowercasing text.
Parsing: Breaks text into smaller parts like tokens (words), sentences, etc., for easier
processing.
Searching & Retrieval
Uses keyword or context-based searches to find relevant textual data in logs or
match commentary archives.
Example: Searching for "injury" mentions across all player records to identify
recurring issues.
Text Mining
Involves pattern recognition, trend analysis, and discovering hidden insights from
textual data.
Example: Identifying that players tend to perform better in matches following positive
coach reviews.
Part-of-Speech (POS) Tagging
Assigns grammatical categories (noun, verb, adjective) to each word in the text.
Useful in syntactic analysis of performance descriptions or feedback for deeper
linguistic analysis.
Stemming
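Reduces words to a common stem by stripping suffixes (e.g., “running”, “runs” → “run”).
Irregular forms such as “ran” are not handled by stemming and need lemmatization instead.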
Helps unify different forms of the same word for consistent analysis.
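The cleaning, parsing, searching, and stemming steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the stop-word list is deliberately tiny, the feedback notes and player IDs are made up, and naive_stem is a toy suffix stripper rather than the Porter algorithm that NLTK provides.

```python
import string

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "and", "in", "of", "to", "was", "after"}


def clean(text):
    """Lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Split cleaned text into word tokens and drop stop words."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]


def naive_stem(token):
    """Toy suffix stripper (not Porter): handles -ing, -ed and plural -s."""
    for suffix in ("ing", "ed", "s"):
        if suffix == "s" and token.endswith("ss"):
            continue  # keep words like "miss" intact
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), except for l, s, z.
            if (
                suffix != "s"
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeioulsz"
            ):
                stem = stem[:-1]
            return stem
    return token


def search(notes, keyword):
    """Return the IDs of players whose notes mention the keyword."""
    kw = keyword.lower()
    return sorted(pid for pid, text in notes.items() if kw in tokenize(text))


# Hypothetical coach notes keyed by player ID.
notes = {
    "P01": "Missed two sessions after a hamstring injury.",
    "P02": "Running well; sprinted hard in the final drill.",
    "P03": "Minor ankle injury, but runs without pain.",
}

print(search(notes, "injury"))
print([naive_stem(tok) for tok in tokenize(notes["P02"])])
print([naive_stem(word) for word in ("running", "runs", "ran")])
```

The last line shows the limit of stemming: "running" and "runs" both reduce to "run", while the irregular form "ran" is left unchanged.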
Task                       Description
Cleaning & Parsing         Remove noise, format text for analysis
Searching & Retrieval      Find player-specific notes using keywords
Text Mining                Extract insights from match-day blogs or feedback notes
Part-of-Speech Tagging     Identify action-related verbs like "sprint", "missed", "scored"
Stemming                   Normalize "running", "runs" → "run"
Text Analytics Pipeline    End-to-end processing from raw feedback → structured insights
Stages of NLP
Predictive modeling involves creating statistical or machine learning models that forecast
outcomes based on historical data. In the case of athletic performance, we aim to:
Algorithms used:
Once a model is trained, it must be evaluated to check if it's reliable for making real-world
predictions.
Evaluation Metrics:
Metric              Description                                                   Relevance to Project
Accuracy            Percentage of correct predictions                             Overall performance prediction accuracy
Precision           Proportion of positive predictions that are true positives    Predicting high-performing players
Recall              True positives detected among all actual positives            Injury detection sensitivity
F1 Score            Harmonic mean of precision and recall                         Balance between precision and recall
Confusion Matrix    Detailed count of TP, TN, FP, FN                              Performance classification breakdown
ROC-AUC             Area under the ROC curve, which plots TPR against FPR         Evaluating model quality
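To make the definitions concrete, here is a small sketch that derives accuracy, precision, recall, and F1 from the confusion-matrix counts. The labels are invented for illustration (1 could stand for "injury risk", 0 for "no risk"); in practice scikit-learn's metrics module offers equivalent functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and model predictions for eight athletes.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts)
print(metrics)
```

For injury detection, recall is usually the number to watch, since a missed injury (a false negative) tends to cost more than a false alarm.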
4. Model Deployment
Deployment means making the trained model available for real-time use, either in a
dashboard or in an application.
Techniques:
REST APIs: Deploy Python model via Flask for mobile/web usage.
Model in Production: Embedded in sports analytics platforms for coaches.
Real-Time Monitoring: Evaluate if the model's prediction remains accurate over
time.
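One simple way to approach the real-time monitoring point is to track accuracy over a sliding window of recent predictions and flag the model when it drops below a threshold. The sketch below assumes that the true outcome becomes available some time after each prediction; the class name, window size, and threshold are hypothetical choices, not part of the project.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions (sliding window)."""

    def __init__(self, window=50, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # True where prediction was right
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


# Hypothetical stream of (predicted, actual) labels arriving over time.
monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_review())
```

A flagged model would then be a candidate for retraining on more recent data.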
5. Takeaways
NumPy – For numerical computations, efficient arrays, used for storing metrics like
speed or agility.
Pandas – For data wrangling, data frames for structured athletic data like goals,
assists, match stats.
Matplotlib / Seaborn – For creating performance graphs, trend lines.
Scikit-learn – For building ML models (clustering, regression, classification).
NLTK / spaCy – For text processing, sentiment analysis on comments or coach
feedback.
Library         Purpose
pandas          Data handling, reading performance logs
numpy           Numerical operations (e.g., velocity calculations)
matplotlib      Visualizing performance trends
seaborn         Advanced performance heatmaps or comparisons
scikit-learn    ML model training for prediction
nltk, spaCy     NLP operations on feedback data
df.groupby('position')['distance_covered'].mean()
Monitoring metrics over time (e.g., stamina per match, injury frequency).
Applying time series plots and trend forecasting to visualize performance
improvements or declines.
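As a rough sketch of trend monitoring, the example below smooths a per-match stamina series with a moving average and fits a least-squares line to project the next match. The stamina values are invented, and a straight-line fit is only one possible (and very simple) forecasting choice.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def linear_trend(values):
    """Least-squares slope and intercept against match index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical stamina scores for one athlete over six matches.
stamina = [70, 72, 71, 75, 77, 76]

smoothed = moving_average(stamina, window=3)
slope, intercept = linear_trend(stamina)
forecast_next = slope * len(stamina) + intercept  # projection for match 7

print(smoothed)
print(round(slope, 2), round(forecast_next, 2))
```

A positive slope indicates improving stamina, a negative one a decline; the smoothed series is what would typically be drawn as the trend line.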
IDE                 Features
Jupyter Notebook    Interactive coding, visual outputs
PyCharm             Full-featured, great for long scripts
VS Code             Lightweight, supports extensions and notebooks
Google Colab        Cloud-based, free GPU for performance modeling
Tableau is a powerful data visualization tool used to convert raw data into interactive
dashboards and visual reports. It helps identify trends, patterns, and outliers in data, especially
when dealing with complex sports performance metrics.
2. Dimensions vs Measures
3. Descriptive Statistics
X-axis: Position
Y-axis: Stamina
Insight: Which positions maintain the highest average stamina?
X-axis: Age
Y-axis: Average Overall Rating
Insight: How does performance vary with age?
Location: Country
Size/Color: Number of athletes
Insight: Countries producing the highest number of top performers
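import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style="whitegrid")

# Assumption: df is a pandas DataFrame of athlete records with the columns
# Name, Age, Nationality, Overall, and SprintSpeed. The source does not show
# how the plotted tables are built, so the derivations below are illustrative.

# Top 10 athletes by sprint speed among those rated above 85 overall
top_sprinters = df[df['Overall'] > 85].nlargest(10, 'SprintSpeed')
plt.figure(figsize=(12, 6))
sns.barplot(data=top_sprinters, x='Name', y='SprintSpeed', palette='viridis')
plt.title("Top 10 Athletes by Sprint Speed (Overall > 85)")
plt.xticks(rotation=45)
plt.ylabel("Sprint Speed")
plt.xlabel("Player Name")
plt.tight_layout()
plt.show()

# Average overall rating for each age
avg_rating_by_age = df.groupby('Age', as_index=False)['Overall'].mean()
plt.figure(figsize=(10, 5))
sns.lineplot(data=avg_rating_by_age, x='Age', y='Overall', marker='o', color='orange')
plt.title("Average Overall Rating by Age")
plt.xlabel("Age")
plt.ylabel("Average Rating")
plt.grid(True)
plt.tight_layout()
plt.show()

# Pairwise correlation between the numeric attribute columns
corr_matrix = df.select_dtypes(include='number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Between Physical and Skill Attributes")
plt.tight_layout()
plt.show()

# Ten countries with the most athletes (index = country, values = counts)
top_nations = df['Nationality'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_nations.values, y=top_nations.index, palette='magma')
plt.title("Top 10 Countries with Most Athletes")
plt.xlabel("Number of Players")
plt.ylabel("Country")
plt.tight_layout()
plt.show()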
Summary
These visualizations using NumPy, Pandas, Seaborn, and Matplotlib simulate the Tableau-style
insights: