DS Assignment

The assignment focuses on using machine learning techniques to predict diseases based on patient data, requiring students to preprocess data, conduct exploratory analysis, and develop a classification model. Students will work with real-world datasets, such as those for heart disease or diabetes, and will be evaluated on various tasks including feature engineering, model training, evaluation, and deployment of a web application. The final deliverables include a Jupyter Notebook, a report summarizing the project, and a link to a GitHub repository.

Uploaded by

prachinpatil19

Assignment Title: Disease Prediction Using Machine Learning

Course: Data Science

Level: Undergraduate (T.Y.BSc in Information Technology)

Assignment Type: Individual

Total Marks: 150

Submission Mode: Online via Classroom

Deadline: 15-03-2025

Objective

This assignment will help students apply machine learning techniques to predict
diseases using patient data. The focus will be on data preprocessing, exploratory data
analysis (EDA), feature engineering, training simple ML models with hyperparameter
tuning, and deploying a prediction web app using Flask/Streamlit.

Problem Statement

You are a Data Scientist working for a healthcare analytics company. Your task is to
build a machine learning model that predicts whether a patient is at risk of a
particular disease based on their health parameters. The dataset contains medical
records such as age, BMI, glucose levels, and other diagnostic features. Your goal is to
develop a classification model that can accurately predict the presence or absence of
a disease.

Dataset

You will use a real-world but simple dataset such as:

​ •​ Heart Disease Prediction Dataset

​ •​ Diabetes Prediction Dataset

​ •​ Chronic Kidney Disease Dataset


These datasets are available on Kaggle or the UCI Machine Learning Repository. A dataset
of this kind typically includes:

​ •​ Age

​ •​ Gender

​ •​ Blood Pressure

​ •​ Cholesterol Levels

​ •​ Glucose Levels

​ •​ BMI (Body Mass Index)

​ •​ Smoking/Alcohol Consumption Status

​ •​ Family History of Disease

​ •​ Target Variable (0 = No Disease, 1 = Disease Present)

Students can choose one of these datasets or any similar real-world dataset.

Assignment Tasks & Marking Scheme

Part 1: Data Preprocessing & Exploration (30 Marks)

​ 1.​ Load the dataset and display the first few rows.

​ 2.​ Handle missing values appropriately.

​ 3.​ Perform Exploratory Data Analysis (EDA):

​ •​ Summary statistics

​ •​ Data distributions (histograms, box plots)

​ •​ Correlation matrix

​ •​ Outlier detection & handling

​ 4.​ Normalize/standardize numerical features if needed.

Deliverables:
​ •​ Python code with EDA

​ •​ Summary of insights from the data

Marking Criteria: Data Preprocessing & EDA

 • Full Marks (30): Thorough EDA, missing values handled well, insightful analysis

 • Good (22-29): Good EDA with minor issues

 • Needs Improvement (15-21): Basic EDA with minimal insights

 • Poor (7-14): Poor handling of missing values, weak EDA

 • No Submission (0): No submission
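
As a rough illustration of the Part 1 tasks, the sketch below walks through loading, missing-value handling, summary statistics, a correlation matrix, IQR-based outlier handling, and standardization. The tiny DataFrame and its column names are hypothetical stand-ins for whichever dataset you choose. Median imputation and the 1.5 * IQR rule are only one reasonable set of choices, not the required method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; in the assignment this would come from pd.read_csv(...).
df = pd.DataFrame({
    "age": [34, 51, 47, np.nan, 62, 29, 58, 45],
    "bmi": [22.1, 31.4, np.nan, 27.8, 35.0, 19.5, 29.9, 61.0],  # 61.0 is a deliberate outlier
    "glucose": [88, 140, 126, 99, 170, 82, 155, 110],
    "target": [0, 1, 1, 0, 1, 0, 1, 0],
})
features = ["age", "bmi", "glucose"]

print(df.head())        # first few rows
print(df.isna().sum())  # number of missing values per column

# Fill numeric gaps with the column median (an assumption; justify your own choice).
df[features] = df[features].fillna(df[features].median())

print(df.describe())    # summary statistics
corr = df.corr()        # correlation matrix, including the target
print(corr)

# IQR rule: values more than 1.5 * IQR outside the quartiles count as outliers.
q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["bmi"] < lower) | (df["bmi"] > upper)]
print(outliers)

# Handle the outliers by clipping them to the IQR fences.
df["bmi"] = df["bmi"].clip(lower, upper)

# Standardize each feature to zero mean and unit variance.
scaled = (df[features] - df[features].mean()) / df[features].std(ddof=0)
print(scaled.round(2))
```

For the distribution plots, pandas offers `df.hist()` and `df.boxplot()`, both of which need Matplotlib installed.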

Part 2: Feature Engineering & Selection (25 Marks)

​ 1.​ Handle categorical variables (one-hot encoding, label encoding).

​ 2.​ Identify and remove highly correlated features.

 3. Apply feature selection techniques (e.g., SelectKBest, Mutual Information).

Deliverables:

​ •​ Python code with feature selection

​ •​ Justification for chosen features

Marking Criteria: Feature Engineering & Selection

 • Full Marks (25): Excellent feature selection with justification

 • Good (18-24): Good selection with minor issues

 • Needs Improvement (12-17): Basic selection, limited justification

 • Poor (6-11): Poor or incorrect selection

 • No Submission (0): No submission
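
The three Part 2 tasks might look roughly like the sketch below. All data here is synthetic and the column names are made up: `glucose_mmol` is just a rescaled copy of `glucose`, added so that there is a highly correlated pair to remove. The 0.9 correlation cut-off and `k=2` are arbitrary illustrative values.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic, made-up patient data (not taken from any real dataset).
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(110, 25, n)
df = pd.DataFrame({
    "glucose": glucose,
    "glucose_mmol": glucose / 18.0,   # rescaled copy, so perfectly correlated
    "age": rng.integers(20, 80, n),
    "gender": rng.choice(["F", "M"], n),
})
y = (glucose > 120).astype(int)       # toy target driven by glucose only

# 1. One-hot encode the categorical column.
X = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=float)

# 2. Drop one feature from every pair whose absolute correlation exceeds 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = X.drop(columns=to_drop)

# 3. Keep the k features with the highest mutual information with the target.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

print("Dropped as redundant:", to_drop)
print("Selected features:", selected)
```

Whatever thresholds you pick, the written justification should explain why they suit your dataset.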

Part 3: Model Development & Training (35 Marks)

​ 1.​ Split the dataset into training and testing sets (80-20 or 70-30 split).

​ 2.​ Train at least two models from the following:

​ •​ Logistic Regression

​ •​ Decision Tree

​ •​ Random Forest

​ •​ k-Nearest Neighbors (KNN)

​ •​ Support Vector Machine (SVM)

​ 3.​ Tune hyperparameters using GridSearchCV or RandomizedSearchCV.

​ 4.​ Train and evaluate models using metrics such as accuracy, precision,
recall, and F1-score.

Deliverables:

​ •​ Python code for model training

​ •​ Performance comparison table

​ •​ Explanation of chosen models and hyperparameters

Marking Criteria: Model Development & Training

 • Full Marks (35): Two models implemented with hyperparameter tuning

 • Good (26-34): Two models implemented, limited tuning

 • Needs Improvement (18-25): One model implemented, no tuning

 • Poor (9-17): Poor model selection, weak implementation

 • No Submission (0): No models implemented
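
A minimal sketch of the Part 3 workflow follows, assuming synthetic data from `make_classification` in place of a real dataset. It uses an 80-20 split, tunes two of the listed models with `GridSearchCV`, and collects the four metrics into a comparison table. The parameter grids are small placeholders and not recommended values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed patient features and target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)

# 80-20 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each entry pairs an estimator with a small, illustrative parameter grid.
candidates = {
    "Logistic Regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search on the training set only.
    search = GridSearchCV(estimator, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)  # predictions from the refitted best model
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

Putting the scaler inside the pipeline means it is refitted on each training fold, which avoids leaking information from the validation folds during tuning.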

Part 4: Model Evaluation & Optimization (30 Marks)

 1. Evaluate models using a confusion matrix and a precision-recall curve.

​ 2.​ Interpret results and suggest improvements.

​ 3.​ Apply feature selection techniques and retrain the model if necessary.

Deliverables:

​ •​ Evaluation metrics & visualizations

​ •​ Comparison and interpretation of results

Marking Criteria: Model Evaluation & Optimization

 • Full Marks (30): Thorough evaluation with clear improvements

 • Good (22-29): Good evaluation with some insights

 • Needs Improvement (15-21): Basic evaluation with limited explanation

 • Poor (7-14): Poor evaluation, missing key metrics

 • No Submission (0): No evaluation
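
The evaluation step in Part 4 can be sketched as below, again on synthetic data and with a plain logistic regression standing in for your tuned model. The code computes the confusion matrix and the points of the precision-recall curve; drawing the plots is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_recall_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a placeholder model.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"False negatives (missed disease cases): {fn}")

# The precision-recall curve needs scores, not hard labels.
scores = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)
print(f"Average precision: {ap:.3f}")
```

In a disease-prediction setting, false negatives are usually the costlier error, so the interpretation should discuss recall and not only accuracy. scikit-learn also provides `ConfusionMatrixDisplay` and `PrecisionRecallDisplay` for the corresponding plots.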

Part 5: Model Deployment & Report (30 Marks)

​ 1.​ Save the best model using Pickle/Joblib.

​ 2.​ Develop a Flask or Streamlit web application where users can input
patient details and receive a disease prediction.
​ 3.​ Write a report summarizing:

​ •​ Problem statement

​ •​ Data preprocessing & insights

​ •​ Model training & evaluation

​ •​ Challenges faced and possible improvements

Deliverables:

​ •​ Flask/Streamlit app source code

​ •​ Screenshots of working app

​ •​ Final report summarizing the project

Marking Criteria: Deployment & Report

 • Full Marks (30): Fully functional app with a well-structured report

 • Good (22-29): Working app with minor issues in report

 • Needs Improvement (15-21): Basic app with weak report

 • Poor (7-14): Poor execution of deployment & report

 • No Submission (0): No deployment or report
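
Saving the model and serving predictions (Part 5, tasks 1 and 2) could be organized along the lines of this sketch. It uses the standard-library `pickle` module. The feature names, the file name `disease_model.pkl`, and the helper `predict_disease` are hypothetical, and the model is trained on synthetic data only so that the example runs end to end.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "bmi", "glucose"]  # hypothetical input fields of the web form

# Placeholder model trained on synthetic data with three features.
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# 1. Save the trained model to disk.
model_path = Path(tempfile.gettempdir()) / "disease_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)


def predict_disease(patient: dict, path: Path = model_path) -> dict:
    """Load the saved model and score one patient given as {feature: value}."""
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    # Order the inputs exactly as the model saw them during training.
    row = np.array([[patient[name] for name in FEATURES]], dtype=float)
    probability = float(loaded.predict_proba(row)[0, 1])
    return {"prediction": int(probability >= 0.5), "probability": probability}


# 2. A web handler would collect these values from the user and call the helper.
result = predict_disease({"age": 0.3, "bmi": -1.2, "glucose": 0.8})
print(result)
```

In a Flask route or a Streamlit page, the form values would be gathered into the same dictionary and passed to the helper. Any scaling or encoding applied during training has to be saved and reapplied as well, for example by pickling a whole scikit-learn pipeline instead of the bare model.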

Submission Instructions

1.​ Upload a Jupyter Notebook (.ipynb) with:

​ •​ Well-commented Python code

​ •​ Explanations and visualizations

2. Upload a report (.pdf) covering:

 • Problem statement, methodology, snapshots of output and results

 • Discussion of model performance based on the confusion matrix

3. Provide a link to the GitHub repository containing the project files.

4. Naming convention: StudentID_LastName_FirstName_ML_Assignment.pdf

Additional Notes

 • Plagiarism or use of ChatGPT will result in zero marks.

 • Use Python (Pandas, NumPy, Scikit-Learn, Matplotlib/Seaborn, Flask/Streamlit).

 • Bonus Marks: For additional feature selection or an Explainable AI technique (SHAP, LIME).
