Mini Project Template Both
Submitted by
USN Name
1BI20AI005 Aryan Khera
1BI20AI014 Harsh Singh
Certificate
USN Name
1BI20AI005 Aryan Khera
1BI20AI014 Harsh Singh
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the completion of a task would be incomplete
without crediting the people who made it possible, whose constant guidance and
encouragement crowned the efforts with success.
We would like to express our thanks to the Principal, Dr. Aswath M.U., for his
encouragement, which motivated us to complete the mini project work successfully.
It gives us immense pleasure to thank Dr. Jyothi D.G., Professor & Head, Department of
Artificial Intelligence & Machine Learning, for her constant support and encouragement.
We would like to express our deepest gratitude to our mini project guide, Mr. Manjunatha
P.B, for his constant support and guidance throughout the mini project work.
We are pleased to express our sincere gratitude for the friendly co-operation
shown by all the staff members of the Artificial Intelligence and Machine Learning
Department, BIT.
Last but not least, we hereby acknowledge and thank our friends and family, who
have always been our source of inspiration and were instrumental in the successful
completion of the project work.
Date: 1-1-2024
Place: Bengaluru
Aryan Khera
Harsh Singh
ABSTRACT
INDEX
LIST OF FIGURES
INTRODUCTION
Chapter 2
LITERATURE REVIEW
Paper 1:
[1] Title: “Classification of Diabetes Disease Using Support Vector Machine” by Jegan,
Chitra. International Journal of Engineering Research and Applications (2013)
• Diabetes mellitus, a global health concern, affects 285 million people worldwide,
projected to reach 380 million in the next 20 years. To address this, a classifier is
essential for efficient and cost-effective diabetes detection. The Pima Indian diabetic
database at the UCI machine learning laboratory is a standard for testing such
algorithms. This study advocates the use of Support Vector Machine (SVM) as a
classifier, showcasing its success in accurately diagnosing diabetes from high-
dimensional medical datasets.
Paper 2:
[2] Title: “Prediction of Diabetes Using Data Mining Techniques” by Mareeswari, V.,
Saranya, R., Mahalakshmi, R., & Preethi, E. Research Journal of Pharmacy and
Technology (2017)
Paper 3:
[3] Title: “Prognosis of Diabetes Using Data Mining Approach-Fuzzy C Means
Clustering and Support Vector Machine”. International Journal of Computer Trends
• This study employs two data mining methods, Fuzzy C-Means (FCM) clustering and
Support Vector Machine (SVM), to analyze a diabetes dataset for diagnosis. The
dataset, sourced from the UCI repository, includes 9 clinical attributes and an output
indicating the diabetes diagnosis across 768 cases. By leveraging these techniques,
the research aims to support clinical decision-making through efficient analysis of
large medical datasets.
1. Gathering Data:
• Data Collection: Obtain relevant data from sources such as healthcare databases,
electronic health records (EHRs), medical literature, or patient surveys. The dataset
might include variables like age, gender, BMI, family history of diabetes, blood
pressure, cholesterol levels, and glucose levels.
• Data Sources: Collect data from clinical settings, research studies, public health
databases, or wearable devices that monitor health metrics.
• Data Quality: Ensure that the data is accurate, reliable, and comprehensive. Validate
the data sources and address any issues related to missing values, outliers, or
inconsistencies that could affect the predictive model's performance.
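The data-quality checks described above can be sketched as follows. This is a minimal illustration on a small made-up table, not the project's dataset: the column names, the values, and the helper name quality_report are all assumptions. It counts missing values and duplicate rows, and flags outliers with the common 1.5 x IQR rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sample records; every value here is invented for illustration.
records = pd.DataFrame({
    "Age": [25, 41, 33, 41, 52, 29],
    "BMI": [22.1, 30.5, np.nan, 30.5, 27.8, 24.0],
    "Glucose": [85, 140, 110, 140, 600, 95],
})

def quality_report(df):
    """Return missing-value counts, the duplicate-row count, and IQR outlier counts."""
    missing = df.isna().sum()
    duplicates = int(df.duplicated().sum())
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR outside the quartiles.
    outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
    return missing, duplicates, outliers

missing, duplicates, outliers = quality_report(records)
print(missing, duplicates, outliers, sep="\n")
```

Whether a flagged value is a data-entry error or a genuine extreme reading is a clinical judgement; the rule only points at rows worth inspecting.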
2. Data Pre-processing:
• Data Cleaning: Clean the dataset by handling missing values (e.g., imputation using
mean, median, or regression-based methods), removing duplicates, and correcting
errors in the data.
• Data Transformation: Transform variables as needed, such as normalizing numerical
variables (e.g., glucose levels, BMI) to a common scale or encoding categorical
variables (e.g., gender, family history) using appropriate techniques like one-hot
encoding.
• Feature Engineering: Create new features or variables that may be predictive of
diabetes risk, such as BMI categories, age groups, or interaction terms between
relevant variables.
• Data Splitting: Divide the dataset into a training set (to train the predictive model)
and a test set (to evaluate the model's performance).
• Model Selection: Research various models suitable for binary classification tasks.
Logistic regression is a natural choice, given its interpretability and simplicity. Other
models like decision trees, random forests, or support vector machines can also be
considered.
• Model Training: Train the selected model on the training dataset using features
(predictors) to predict the binary outcome variable (diabetes or non-diabetes).
• Model Testing: Evaluate the model's performance on the test dataset. Assess metrics
such as accuracy, precision, recall, F1-score, and ROC-AUC to gauge how well the
model generalizes to new, unseen data.
• Cross-Validation: Implement cross-validation techniques to ensure robustness in
model evaluation and mitigate overfitting or underfitting issues.
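The training, testing, and cross-validation steps above can be sketched with scikit-learn. This is an illustrative outline under stated assumptions, not the project's exact code: the data is a synthetic stand-in generated with make_classification, and the 80/20 split, 5 folds, and max_iter value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a diabetes table: 8 numeric features, binary target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=0)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
print(metrics)

# 5-fold cross-validation gives a less split-dependent accuracy estimate.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

Note that ROC-AUC is computed from the predicted probabilities, while the other metrics use the thresholded labels.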
5. Evaluation:
4.2 Dataflow
DATA ACQUISITION -> DATA PRE-PROCESSING -> FEATURE SELECTION -> LOGISTIC REGRESSION -> PREDICTION RESULT
1. Data Acquisition:
Data acquisition for diabetes prediction using Logistic Regression involves
gathering patient data from sources such as electronic health records or clinical
databases, selecting relevant features like age, gender, blood pressure, and
cholesterol levels, and ensuring data quality and privacy compliance. The acquired
dataset is then split into training and testing sets, with the training set used to train
the Logistic Regression model to recognize patterns associated with diabetes.
This step is crucial because the quality of the acquired data directly influences the
model's ability to make accurate predictions and generalize to new, unseen patient
data.
2. Data Preprocessing:
Data preprocessing for diabetes prediction using Logistic Regression involves a
series of steps that prepare the raw data for model training:
• handling missing values through imputation or removal;
• addressing outliers to limit their impact on model performance;
• normalizing or standardizing numerical features to ensure uniform scaling;
• encoding categorical variables into a numerical format;
• performing feature engineering to derive new informative features;
• addressing class imbalance in the dataset through oversampling or undersampling.
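Several of these preprocessing steps can be combined in one scikit-learn transformer. The sketch below is an assumption-laden illustration: the three columns and their values are hypothetical, median imputation and standard scaling are just one reasonable choice, and class rebalancing is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw records with missing numeric values and one categorical column.
raw = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BMI": [33.6, 26.6, 23.3, np.nan],
    "Gender": ["F", "M", "F", "M"],
})

numeric_cols = ["Glucose", "BMI"]
categorical_cols = ["Gender"]

# Numeric columns: fill gaps with the median, then scale to zero mean, unit variance.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocess.fit_transform(raw)
print(features.shape)
```

Fitting the transformer on the training split only, and then applying it to the test split, avoids leaking test-set statistics into the model.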
3. Feature Selection:
Feature selection is a critical step in the machine learning pipeline, aimed at
identifying and retaining the most relevant features from the dataset while discarding
less informative or redundant ones. In the context of diabetes prediction using
Logistic Regression, feature selection involves choosing the subset of features that
contribute most significantly to the model's predictive performance.
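One simple way to do this is univariate selection, sketched below. The report does not state which selection method was used, so the ANOVA F-test scorer, the choice of k=4, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which only 3 carry signal about the target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)

# Score each feature against the target and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print("Kept feature indices:", kept)
print("F-scores:", selector.scores_.round(2))
```

Univariate scores ignore interactions between features, so model-based alternatives (for example, inspecting regularized logistic regression coefficients) may rank features differently.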
4. Splitting the Dataset:
In diabetes prediction using Logistic Regression, the dataset is split into a training
set for model training and a testing set for evaluation. Evaluating on held-out data
shows whether the model generalizes to new data. Typically, a random or stratified
split is employed, with a portion reserved for training (e.g., 80%) and the remainder
for testing (e.g., 20%). The aim is to detect overfitting, so that the Logistic
Regression model can be trusted to predict diabetes on unseen data.
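A stratified 80/20 split can be sketched as follows. The feature matrix and the 65/35 class mix are made up for illustration; the point is that stratify=y keeps the same class proportions in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 65 negatives and 35 positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 65 + [1] * 35)

# stratify=y preserves the 65/35 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print("train positives:", y_train.sum(), "of", len(y_train))
print("test positives:", y_test.sum(), "of", len(y_test))
```

Without stratification, a small test set can end up with noticeably fewer positive cases than the full dataset, which makes recall estimates unstable.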
5. Classification:
Logistic Regression (LR) is one of the simplest and most widely used ML
classification algorithms. It is a supervised binary classification algorithm that
works on a categorical dependent variable, so the result is a discrete binary value,
0 or 1. The sigmoid function is used as the activation function: it maps a predicted
real value to a probability between 0 and 1. The model is typically trained by
minimizing the log loss (binary cross-entropy), not the sigmoid itself.
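As a small numeric illustration, the sketch below implements the sigmoid mapping together with the log loss (binary cross-entropy) that logistic regression is usually trained to minimize. The input scores and labels are arbitrary example values, and the function names are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

z = np.array([-2.0, 0.0, 2.0])   # example linear scores w.x + b
p = sigmoid(z)                   # a score of 0 maps to probability 0.5
y_true = np.array([0, 1, 1])

print("probabilities:", p.round(3))
print("log loss:", round(log_loss(y_true, p), 4))
```

Large positive scores approach probability 1 and large negative scores approach 0, which is what lets a threshold turn the output into a class label.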
6. Predict Result:
To predict diabetes using Logistic Regression, preprocess the new patient data,
extract features, and input them into the trained model. The model assigns a
probability score, and a threshold (e.g., 0.5) is applied for binary classification.
Predictions above the threshold indicate the presence of diabetes; below the
threshold, absence. Interpretation should consider the chosen threshold's impact on
sensitivity and specificity in the context of diabetes prediction.
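The effect of the threshold on sensitivity and specificity can be seen in a short sketch. The labels and probability scores below are invented for illustration, the helper name is hypothetical, and scores equal to the threshold are counted as positive here.

```python
import numpy as np

def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) after thresholding the probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Invented example: 4 negative and 4 positive cases with model scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.35, 0.55, 0.7, 0.9])

for t in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true, y_prob, t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold catches more true cases at the cost of more false alarms; in a screening setting that trade-off may be preferable to missing patients.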
data['Outcome'].value_counts()
Splitting the Features and Target
x = data.drop(columns='Outcome')
y = data['Outcome']
print(x)
print(y)
data['Pregnancies'].unique()
sns.barplot(x=data['Pregnancies'], y=y)
data['Glucose'].unique()
sns.barplot(x=data['Glucose'], y=y)
Model Training
Logistic Regression
model = LogisticRegression()
Model Evaluation
Accuracy Score
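# input_data holds one patient's feature values, in the same order as the columns of x
input_data_as_numpy_array = np.asarray(input_data)
# reshape to a single-row 2-D array, as model.predict expects
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)
print(prediction)
# Outcome 0 means no diabetes, 1 means diabetes (branch messages are illustrative)
if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')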
Fig 6.13 Getting information and checking for missing values in the dataset.
Fig 6.14 Statistical measures of the data, which are essential for the regression.
Model evaluation in machine learning is the process of using various metrics and approaches
to determine a trained model's effectiveness and quality. It involves evaluating whether the
model achieves the required goals and how well it generalizes to fresh, untested data.
Diabetes prediction using machine learning has demonstrated the potential to significantly
contribute to the field of healthcare and disease management. Through the utilization of
advanced machine learning algorithms, we have developed a predictive model that
analyzes relevant medical data to accurately identify individuals at risk of diabetes.
The main objective of the project, discussed throughout this report, is to classify and
identify diabetes patients using ML algorithms.
We built the model using supervised machine learning algorithms such as logistic
regression, decision tree, Random Forest, and Gradient Boosting.
As part of the future scope, we hope to try out different algorithms to optimize the feature
output process and to increase the feature similarity of the data, in order to improve the
model’s representation capability.