Assignment Question
1. To develop a predictive model that forecasts the total score of a team in an IPL match.
2. To identify the key factors (such as batting and bowling performance) that impact the
final score.
3. To create an interactive interface or API for real-time score prediction based on team
selection and match conditions.
What are the key features (columns) in your dataset, and why are they important?
Key features include:
• Team composition (batting and bowling lineup): The performance of key players like
openers and wicket-takers significantly influences the match score.
• Venue: The type of pitch and location can impact batting or bowling conditions, affecting
the final score.
• Batting stats (average runs, strike rate): These are key in predicting how much a team
might score.
• Bowling stats (economy rate, wickets taken): They impact the number of overs a batting
team can face and how quickly they score.
• Weather conditions: Rain or dew can significantly affect the match outcome and score.
These features matter because each one directly influences how many runs a team scores in an
IPL match.
Did you face any challenges during data collection? How did you resolve them?
A challenge was dealing with missing data for player stats and match-specific information. I
resolved this by either imputing missing values or removing incomplete rows where data was
crucial (e.g., missing team composition). Additionally, for certain matches, weather data was
sparse, so I used general weather patterns based on the season and location.
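A minimal sketch of this cleaning step, assuming pandas and a small made-up table. The column names (batting_team, venue, avg_strike_rate, humidity) and the choice of median and venue-level mean as fill values are illustrative assumptions, not the exact rules used in the project.

```python
import pandas as pd

# Hypothetical match records; columns and values are illustrative only.
matches = pd.DataFrame({
    "batting_team": ["MI", "CSK", None, "RCB", "MI"],
    "venue": ["Mumbai", "Chennai", "Mumbai", "Bengaluru", "Mumbai"],
    "avg_strike_rate": [135.0, None, 128.0, 142.0, 130.0],
    "humidity": [70.0, 80.0, None, 55.0, None],
})

# Remove rows where a crucial field (team composition) is missing.
cleaned = matches.dropna(subset=["batting_team"]).copy()

# Impute missing player stats with the column median (one possible choice).
median_sr = cleaned["avg_strike_rate"].median()
cleaned["avg_strike_rate"] = cleaned["avg_strike_rate"].fillna(median_sr)

# Fill sparse weather values with the typical value for the same venue,
# standing in for "general weather patterns based on season and location".
venue_mean = cleaned.groupby("venue")["humidity"].transform("mean")
cleaned["humidity"] = cleaned["humidity"].fillna(venue_mean)
```

Dropping rows is reserved for fields the model cannot do without, while less critical fields are filled so that the rest of the row stays usable.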
3. Data Preprocessing
How did you split your dataset for training and testing?
I used an 80/20 split, where 80% of the data was used for training and 20% for testing the model.
I also performed cross-validation to ensure the model generalizes well to unseen data.
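The split and cross-validation can be sketched as follows, assuming scikit-learn. The data here is synthetic and the number of folds (5) is an assumption, since the original does not state it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the match features and total scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 160 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=200)

# 80/20 split: 80% of rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training portion only, so the test set
# stays untouched until the final evaluation.
model = RandomForestRegressor(n_estimators=50, random_state=42)
cv_mae = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
```

Here cv_mae holds one mean absolute error per fold; a large spread between folds would suggest the model is sensitive to which matches it was trained on.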
4. Feature Engineering
Can you explain the impact of any new features you created?
I created a feature called "Batting Performance Index" which combines batting average, strike
rate, and number of boundaries. This feature provided a composite measure of batting strength,
which improved the model’s accuracy in predicting total scores.
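One way such an index could be built is sketched below. The exact formula is not given above, so the min-max scaling and the equal weights are assumptions made purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def min_max(values):
    """Scale values to the range [0, 1]; a constant column maps to zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def batting_performance_index(average, strike_rate, boundaries,
                              weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of scaled batting average, strike rate, and boundaries.

    The equal weights are an assumption, not the project's actual values.
    """
    parts = [min_max(average), min_max(strike_rate), min_max(boundaries)]
    return sum(w * p for w, p in zip(weights, parts))

# Three hypothetical teams, from weakest to strongest batting lineup.
bpi = batting_performance_index([25, 35, 45], [120, 140, 160], [10, 20, 30])
```

Scaling each component first keeps strike rate (values above 100) from dominating batting average and boundary counts, which sit on much smaller scales.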
Did you use any dimensionality reduction techniques? Why or why not?
I did not use dimensionality reduction techniques like PCA because the dataset was not
high-dimensional. Random Forest also ranks features by importance on its own, so reducing
dimensions wasn't necessary.
Which machine learning algorithms did you try, and why did you select the final one?
I tried several models, including:
• Linear Regression: It provided a baseline, but it wasn’t effective due to the non-linear
nature of the relationship between features and scores.
• Decision Trees: These were useful, but they tended to overfit.
• Random Forest: After tuning, Random Forest provided the best performance in terms of
accuracy and generalization, which is why I chose it as the final model.
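A comparison of this kind can be sketched as below, assuming scikit-learn and synthetic data with a deliberately non-linear target. The numbers it produces say nothing about the real IPL dataset; the sketch only shows how the three candidates can be scored on equal terms.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a non-linear relationship between features and score.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 3))
y = (160 + 15 * np.sin(2 * X[:, 0]) + 8 * X[:, 1] ** 2
     + rng.normal(scale=4, size=300))

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean cross-validated MAE per model (lower is better).
cv_mae = {
    name: -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    ).mean()
    for name, model in candidates.items()
}
```

Using the same folds and the same metric for every candidate keeps the comparison fair before any tuning is applied.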
What were the key parameters you tuned during model training?
For Random Forest, I tuned n_estimators, max_depth, and min_samples_split.
6. Model Evaluation
• Mean Absolute Error (MAE): To measure the average error between predicted and
actual scores.
• Root Mean Squared Error (RMSE): To understand the magnitude of error in the
prediction.
• R-squared: To assess how well the model explains the variance in the target variable
(final score).
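The three metrics can be computed directly from their definitions, as in this small sketch. The actual and predicted scores are made-up numbers, and regression_metrics is a hypothetical helper name.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R-squared) for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variance
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical final scores for four matches.
actual = [150, 160, 170, 180]
predicted = [155, 158, 172, 175]
mae, rmse, r2 = regression_metrics(actual, predicted)
# mae = 3.5, rmse = sqrt(14.5) ≈ 3.81, r2 = 0.884
```

RMSE is larger than MAE here because squaring gives the two 5-run misses more weight than the two 2-run misses.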
7. Hyperparameter Tuning
Which hyperparameter tuning technique did you use (Grid Search, Random Search)?
Why?
I used Grid Search for hyperparameter tuning, as it allowed me to systematically test all
combinations of parameters and identify the optimal ones for the model.
• n_estimators = 200
• max_depth = 15
• min_samples_split = 10
These settings resulted in better performance and less overfitting.
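A minimal Grid Search sketch, assuming scikit-learn's GridSearchCV and synthetic data. The grid includes the values reported above (200, 15, 10); the other candidate values, the 3-fold setting, and the MAE scoring are assumptions, since the full search space is not given.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 160 + 10 * X[:, 0] - 5 * X[:, 1] ** 2 + rng.normal(scale=3, size=120)

# Every combination in this grid is trained and scored (2 x 2 x 2 = 8).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 15],
    "min_samples_split": [5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best_params = search.best_params_
```

The cost grows with the product of the list lengths, so Grid Search is practical mainly when only a few values per parameter are under consideration.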
How did you ensure that the API handles requests efficiently?
The model is loaded into memory once at startup, so it is not reloaded on every API request.
Additionally, I used multi-threading to handle concurrent requests efficiently.
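The pattern can be sketched with Python's standard library alone. The framework actually used is not stated above, so ThreadingHTTPServer, the /predict path, the JSON field names, and the toy load_model function are all assumptions for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def load_model():
    """Hypothetical stand-in for loading the trained model from disk."""
    # Toy rule so the sketch runs without a saved model file.
    return lambda features: 120.0 + 0.3 * sum(features)

# Loaded once at import time and shared by every request thread.
MODEL = load_model()

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        score = MODEL(payload.get("features", []))
        body = json.dumps({"predicted_score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; real services would log requests

def demo_request(features):
    """Start the server on a free local port, send one request, shut down."""
    # ThreadingHTTPServer handles each incoming request in its own thread.
    server = ThreadingHTTPServer(("127.0.0.1", 0), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        conn.request("POST", "/predict",
                     body=json.dumps({"features": features}),
                     headers={"Content-Type": "application/json"})
        result = json.loads(conn.getresponse().read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo_request([10, 20]) sends one prediction request to a temporary local server. Because MODEL is created at module level, no request pays the cost of loading it again.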
What were the main challenges you faced during this project?
One challenge was ensuring the data quality, as certain match-specific details (like player
injuries) were missing or incomplete. Handling data imbalance and ensuring the model
generalized well were also challenges.