A report submitted in partial fulfilment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
in
CSE(AI&ML)
by
RANGAMREDDY VANAJA
Regd.No.: 21781A33C8
CERTIFICATE
This is to certify that the MACHINE LEARNING internship report submitted by
RANGAMREDDY VANAJA (Regd. No.: 21781A33C8) is work done by her and
submitted during the 2023-2024 academic year, in partial fulfillment of the requirements for
the award of the degree of BACHELOR OF TECHNOLOGY in CSE(AI&ML).
Mrs. D. Gayathri                        Dr. M. Lavanya
Co-ordinator of Internship              Head of Department (CSM & CSD)
Department of CSE(AI&ML)
CERTIFICATE OF INTERNSHIP
ACKNOWLEDGEMENT
I wish to record my deep sense of gratitude and profound thanks to our beloved Vice
Chairman, Sri R. V. Srinivas, for his valuable support throughout the course.
I express my sincere thanks to Dr. M. Mohan Babu, our beloved Principal, for his
encouragement and suggestions during the course of study.
I express my sincere thanks to Mrs. D. Gayathri, internship coordinator, for her keen
interest, stimulating guidance, and constant encouragement with my work during all stages, to bring
this report to fruition.
I wish to convey my gratitude and sincere thanks to all members for their support. Finally,
I would like to express my sincere thanks to all teaching and non-teaching faculty
members, our parents, and friends, and to all those who have supported us in completing this internship.
(NAME: R. Vanaja)
(Regd. No.: 21781A33C8)
ABSTRACT
ABOUT TRAINING:
The Machine Learning with Python training by Internshala is a 6-week
online training program in which Internshala aims to provide you with
a comprehensive introduction to machine learning. In this program,
you will learn the basics of Python, data exploration and
preprocessing, linear regression, logistic regression, decision
trees, and ensemble models. The training program has video tutorials and
is packed with assignments, assessment tests, quizzes, and practice
exercises for you to get a hands-on learning experience. At the end of
this training program, you will have a solid understanding of machine
learning with Python and will be able to build an end-to-end
predictive model. For doubt clearing, you can post your queries on the
forum and get answers within 24 hours.
CONTENTS
1. Introduction to Machine Learning
2. Data
3. Introduction to Python
3.4 Functions
4.3 Correlation
5. Linear Regression
7. Logistic Regression
8. Decision Tree
9. Ensemble Models
10. Clustering (Unsupervised Learning)
10.2 K-means
1. Introduction to Machine Learning
1.1. What is Machine Learning
⮚ Arthur Samuel, an American pioneer in the field of computer gaming and artificial
intelligence, coined the term "Machine Learning" in 1959.
⮚ Over the past two decades Machine Learning has become one of the mainstays of
information technology.
⮚ With the ever-increasing amounts of data becoming available there is good reason to
believe that smart data analysis will become even more pervasive as a necessary ingredient
for technological progress.
⮚ Machine learning is a subset of artificial intelligence (AI). It is focused on teaching
computers to learn from data and to improve with experience – instead of being explicitly
programmed to do so. In machine learning, algorithms are trained to find patterns and
correlations in large data sets and to make the best decisions and predictions based on that
analysis.
Supervised Learning:
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
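A minimal sketch of this idea using scikit-learn; the tiny dataset and the choice of a k-nearest-neighbours classifier are purely illustrative assumptions:
from sklearn.neighbors import KNeighborsClassifier

# Input variables (x): hours studied, hours slept; output variable (y): pass=1 / fail=0
X = [[1, 4], [2, 8], [6, 7], [8, 5], [9, 8]]
y = [0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                  # learn the mapping from x to y
print(model.predict([[7, 6]]))   # predict the label for an unseen input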
Unsupervised Machine Learning:
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.
2. DATA
Data can be classified into two main types:
1. Quantitative data:
These are the data types that can be expressed in numbers. E.g.: number of cars you have, number of marbles in a container, students in a classroom, etc.
2. Qualitative data:
These are the data types that cannot be expressed in numbers. This describes categories or groups and is hence known as the categorical data type.
A. Structured data:
This type of data is either numbers or words. It can take numerical values, but mathematical operations cannot be performed on it. This type of data is expressed in tabular format.
E.g.: Sunny = 1, Cloudy = 2, Windy = 3, or binary-form data like 0 or 1, good or bad, etc.
B. Unstructured data:
This type of data does not have a proper format and is therefore known as unstructured data. It comprises textual data, sounds, images, videos, etc.
Besides this, there are also other types, referred to as data type preliminaries or data measures:
1. Nominal data
2. Ordinal data
These can also be referred to as different scales of measurement.
DATA ANALYTICS:
⮚ Data analytics is the process of studying the available data and drawing valuable insights or information from it with the help of software.
⮚ It is being used every day and everywhere to enable businesses to take smart and accurate decisions.
GRAPHICAL REPRESENTATION OF DATA:
⮚ It is one of the simplest techniques for drawing insights from the data.
⮚ It helps us to study the relationships between variables.
⮚ It helps us to identify trends and patterns across variables.
MAJOR TYPES:
1. Line graph
2. Bar graph
3. Histogram
4. Pie chart
5. Scatter plot
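As a quick illustration, the following Matplotlib sketch draws two of these plot types; the sample numbers are made up:
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [10, 14, 9, 17, 20]              # made-up sample data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales, marker="o")      # line graph: trend over time
ax1.set_title("Line graph")
ax2.bar(months, sales)                   # bar graph: compare categories
ax2.set_title("Bar graph")
plt.show()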
3. INTRODUCTION TO PYTHON
Python is a popular programming language. It was created by Guido van Rossum and released in 1991.
It is used for web development, software development, data analysis, and machine learning, among other things.
3.1 DATA TYPES
1. Int -> An integer value can be of any length, such as 10, 2, 29, -20, -150, etc. Python has no restriction on the length of an integer. Its value belongs to the int type.
2. Float -> Float is used to store floating-point numbers like 1.9, 9.902, 15.2, etc. It is accurate up to 15 decimal points.
3. String -> A string can be defined as a sequence of characters represented in quotation marks. In Python, we can use single, double, or triple quotes to define a string.
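A minimal sketch of these three data types (the variable names and values are illustrative), with the expected output in comments:
count = 10                 # int: arbitrary-length integer
price = 9.902              # float: floating-point number
name = "Python"            # str: sequence of characters

print(type(count))         # <class 'int'>
print(type(price))         # <class 'float'>
print(type(name))          # <class 'str'>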
3.2 CONDITIONAL STATEMENTS
IF:
These conditions can be used in several ways, most commonly in "if statements" and loops.
ELIF:
The elif keyword is Python's way of saying "if the previous conditions were not true, then try this condition".
ELSE:
The else keyword catches anything which isn't caught by the preceding conditions.
NESTED IF:
You can have if statements inside if statements; this is called a nested if statement.
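A short sketch of all four forms (the values are illustrative):
x = 15
if x > 20:
    print("greater than 20")
elif x > 10:                   # runs because the first condition was false
    print("between 11 and 20")
else:
    print("10 or less")

if x > 0:                      # nested if: an if inside another if
    if x % 2 == 1:
        print("positive and odd")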
3.3 ITERATIVE STATEMENTS
Iteration statements or loop statements allow us to execute a block of statements as long as a condition is true.
1. While Loop:
A while loop in Python is used to execute a block of statements as long as a given condition is true. When the condition becomes false, control comes out of the loop.
2. For Loop:
A for loop in Python is used to iterate over the items of any sequence, such as a list or a string.
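A minimal sketch of both loops (the data is made up):
n = 3
while n > 0:               # while loop: repeats until the condition becomes false
    print("countdown:", n)
    n -= 1

for fruit in ["apple", "banana", "cherry"]:   # for loop: iterate over a list
    print(fruit)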
3.4 FUNCTIONS
A function is a block of code that runs only when it is called. In Python, a function is defined using the def keyword.
Calling a Function:
A function is called using its name followed by parentheses.
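A minimal sketch of defining and calling a function (the function name and argument are illustrative):
def greet(name):                  # define a function with the def keyword
    return "Hello, " + name

print(greet("Vanaja"))            # call the function; prints: Hello, Vanaja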
LIBRARY:
The Python community has created several open-source libraries. A library is a pre-assembled collection of code scripts that can be used repeatedly to save time. It's similar to a physical library in that it holds reusable resources, as the name implies.
Matplotlib:
This library is responsible for plotting numerical data, which is why it is used in data analysis. It's an open-source library that plots high-definition figures such as pie charts, scatterplots, boxplots, and graphs, among other things.
NumPy:
NumPy is one of the most widely used open-source Python packages, focusing on mathematical and
scientific computation. It has built-in mathematical functions for convenient computation and facilitates
large matrices and multidimensional data. It can be used for various things, including linear algebra, and as an N-dimensional container for all types of data. The NumPy array object defines an N-dimensional array with rows and columns. Along with this, it can be used as a random number generator.
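A short sketch of these capabilities (the array values are arbitrary):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array: 2 rows, 3 columns
print(a.shape)                          # (2, 3)
print(a.mean())                         # built-in mathematical function
print(np.random.rand(2))                # NumPy as a random number generator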
Pandas:
Pandas is an open-source library licensed under the Berkeley Software Distribution (BSD). This well-known library is widely used in the domain of data science. It is mostly used for the analysis, manipulation, and cleaning of data, among other things. Pandas allows us to perform simple data modelling and analysis without having to switch to another language like R.
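A short sketch of data cleaning and analysis with pandas; the tiny table and its column names are made up:
import pandas as pd

df = pd.DataFrame({"age": [25, 32, None, 41],
                   "city": ["Tirupati", "Chennai", "Tirupati", None]})
print(df.head())                                  # inspect the data
df["age"] = df["age"].fillna(df["age"].mean())    # simple cleaning: fill a missing value
print(df.describe())                              # quick summary statistics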
Scikit-learn:
Scikit-learn is also an open-source machine learning library based on Python. Both supervised and unsupervised learning processes can be used in this library. Popular algorithms and the SciPy, NumPy, and Matplotlib packages are all pre-included in this library. One well-known Scikit-learn application is Spotify music recommendations.
The target variable is the variable whose values are modeled and predicted by other variables.
A predictor variable is a variable whose values will be used to predict the value of the target variable.
Overview
⮚ A complete tutorial on data exploration (EDA)
⮚ We cover several data exploration aspects, including missing value imputation, outlier removal and the art of feature
engineering
2. Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable.
3. Categorical Variables: For categorical variables, we'll use a frequency table to understand the distribution of each category.
4. Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables.
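A pandas sketch touching each of these steps; the tiny dataset and its column names are made up for illustration:
import pandas as pd

df = pd.DataFrame({"Age": [22, 35, 47, 29, 51, 35],
                   "Income": [18, 42, 60, 30, 75, 44],    # in thousands, illustrative
                   "Gender": ["M", "F", "F", "M", "F", "M"]})
print(df["Income"].describe())        # continuous: central tendency and spread
print(df["Gender"].value_counts())    # categorical: frequency table
df.plot.scatter(x="Age", y="Income")  # continuous vs continuous: scatter plot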
4.3 Correlation:
Correlation explains how one or more variables are related to each other. These variables can be input data features which have been used to forecast our target variable.
Positive Correlation: Two features (variables) can be positively correlated with each other. It means that when the value of one variable increases, the value of the other variable(s) also increases.
Negative Correlation: Two features (variables) can be negatively correlated with each other. It means that when the value of one variable increases, the value of the other variable(s) decreases.
No Correlation: Two features (variables) are not correlated with each other. It means that when the value of one variable increases or decreases, the value of the other variable(s) doesn't increase or decrease.
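These three cases can be sketched with a pandas correlation matrix; the numbers are made up so the relationships are easy to see:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 4, 6, 8, 10],     # rises with x: positive correlation
                   "z": [10, 8, 6, 4, 2]})    # falls as x rises: negative correlation
print(df.corr())    # pairwise correlation matrix; +1 and -1 appear off the diagonal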
Feature Scaling:
Feature scaling is a method used to normalize the range of independent variables or features of data.
Min-Max Normalization:
Also known as min-max scaling or min-max normalization, rescaling is the simplest method and consists of rescaling the range of features to [0, 1] or [-1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of [0, 1] is given as:
X' = (X - Xmin) / (Xmax - Xmin)
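A minimal sketch of this formula using scikit-learn's MinMaxScaler (the single feature and its values are illustrative):
from sklearn.preprocessing import MinMaxScaler

data = [[10.0], [20.0], [15.0], [40.0]]      # one illustrative feature
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(data))            # [[0.], [0.333], [0.167], [1.]]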
5. Linear Regression
5.2 Model evaluation metrics
Predictive models have become a trusted advisor to many businesses, and for a good reason. These models can "foresee the future", and there are many different methods available, meaning any industry can find one that fits its particular challenges. When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms, depending on the kind of output they create: algorithms that directly output a class label, and algorithms that output a probability.
⮚ Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a linear approach for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
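A minimal sketch of fitting and evaluating a linear regression; the data is made up to roughly follow y = 2x + 1, and R² and RMSE are our illustrative choice of regression evaluation metrics:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
pred = model.predict(X)
print(model.coef_, model.intercept_)        # fitted slope and intercept
print(r2_score(y, pred))                    # R-squared: goodness of fit
print(mean_squared_error(y, pred) ** 0.5)   # RMSE: typical prediction error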
2. Gradient Descent
Gradient Descent is known as one of the most commonly used optimization algorithms to train
machine learning models by means of minimizing errors between actual and expected results. Further,
gradient descent is also used to train Neural Networks.
⮚ In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in machine
learning, optimization is the task of minimizing the cost function parameterized by the model's
parameters.
⮚ The main objective of gradient descent is to minimize the convex function using iteration of
parameter updates. Once these machine learning models are optimized, these models can be used
as powerful tools for Artificial Intelligence and various computer science applications.
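A bare-bones sketch of gradient descent minimizing the mean squared error of a one-parameter model y = w * x; the data and learning rate are illustrative assumptions:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # generated from y = 2x, so w should approach 2

w, lr = 0.0, 0.01                # initial parameter and learning rate
for _ in range(200):
    # gradient of MSE with respect to w: mean of 2 * (w*x - y) * x
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad               # step against the gradient to reduce the error
print(round(w, 3))               # ~2.0 after the iterative updates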
3. Training model
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on it.
Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
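As one concrete sketch of dimensionality reduction by feature extraction, the following uses PCA from scikit-learn; PCA and the classic iris data are our illustrative choices, not named by the text above:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 4 partially correlated features per flower
pca = PCA(n_components=2)                # keep 2 principal variables
X2 = pca.fit_transform(X)
print(X.shape, "->", X2.shape)           # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component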
7. Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
⮚ Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or false, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
⮚ Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
⮚ In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
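A minimal sketch with scikit-learn; the single feature (hours studied) and pass/fail labels are made up:
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))          # predicted class label (0 or 1)
print(clf.predict_proba([[3.5]]))    # probabilistic values between 0 and 1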
⮚ EVALUATION METRICS:
⮚ Confusion Matrix
⮚ A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For the problem at hand, we have N = 2, and hence we get a 2 x 2 matrix. Here are a few definitions you need to remember for a confusion matrix:
⮚ Accuracy: the proportion of the total number of predictions that were correct.
⮚ Positive Predictive Value or Precision: the proportion of positive cases that were correctly
identified.
⮚ Negative Predictive Value: the proportion of negative cases that were correctly identified.
⮚ Sensitivity or Recall : the proportion of actual positive cases which are correctly identified.
⮚ Specificity: the proportion of actual negative cases which are correctly identified.
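These quantities can be sketched with scikit-learn's metrics module; the labels and predictions are made up:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # 2 x 2 matrix of TN/FP/FN/TP counts
print(accuracy_score(y_true, y_pred))     # proportion of correct predictions
print(precision_score(y_true, y_pred))    # positive predictive value
print(recall_score(y_true, y_pred))       # sensitivity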
F1 Score:
⮚ We have seen precision and recall for classification problems, and also highlighted the importance of choosing precision or recall based on our use case. What if, for a use case, we are trying to get the best precision and recall at the same time? The F1-Score is the harmonic mean of the precision and recall values for a classification problem. The formula for the F1-Score is as follows:
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
⮚ Hence, for each sensitivity, we get a different specificity.
⮚ The ROC curve is the plot between sensitivity and (1 - specificity). (1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate.
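A sketch of computing the ROC curve points and the area under it; the labels and predicted probabilities are made up:
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]       # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)   # (1 - specificity), sensitivity
print(fpr, tpr)
print(roc_auc_score(y_true, scores))               # area under the ROC curve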
8. DECISION TREE
8.1 HOW DECISION TREE WORKS
The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are different
for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation
of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity
of the node increases with respect to the target variable. The decision tree splits the nodes on all available
variables and then selects the split which results in most homogeneous sub-nodes.
Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it
is to draw any conclusions from that information. Flipping a coin is an example of an action that provides
information that is random.
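As a sketch, entropy can be computed directly from the class proportions; a fair coin flip gives the maximum value of 1 bit:
from math import log2

def entropy(probabilities):
    # H = -sum(p * log2(p)) over the class proportions
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin flip: 1.0 (maximum randomness)
print(entropy([0.9, 0.1]))   # mostly one class: ~0.469 (easier to conclude)
print(entropy([1.0]))        # pure node: 0.0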
8.2 IMPLEMENTING DECISION TREE
●Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
●In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
●The decisions or the test are performed on the basis of features of the given dataset.
●It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
●It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
●In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
●A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
(Figure: the general structure of a decision tree, from the root node down through decision nodes to leaf nodes.)
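A minimal sketch of implementing a decision tree with scikit-learn (whose tree learner is based on CART); the iris data is our illustrative choice:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))   # prints the decision rules (internal nodes) and leaf outcomes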
9. ENSEMBLE MODELS
9.1 BASIC ENSEMBLE TECHNIQUES
Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or by using different training data sets. The ensemble model then aggregates the predictions of each base model and results in one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The approach seeks the wisdom of crowds in making a prediction. Even though the ensemble model has multiple base models within it, it acts and performs as a single model. Most practical data mining solutions utilize ensemble modeling techniques. The following sections cover different ensemble modeling techniques and their implementation.
BAGGING
The idea of bagging is based on making the training data available to an iterative learning process. Each model is trained on a slightly different subset of the training data set, and the individual predictions are then aggregated. Bagging reduces variance and minimizes overfitting. One example of such a technique is the random forest algorithm.
Ensemble Algorithm
A single algorithm may not make the perfect prediction for a given data set. Machine learning algorithms have their limitations, and producing a model with high accuracy is challenging. If we build and combine multiple models, we have the chance to boost the overall accuracy. We then implement the combination of models by aggregating the output from each model, with two objectives: reducing the model error and maintaining the model's generalization.
RANDOM FOREST
This technique uses a subset of training samples as well as a subset of features to build multiple split trees. Multiple decision trees are built to fit each training set. The selection of samples and features is typically randomized.
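A minimal random forest sketch with scikit-learn; the iris data is our illustrative choice:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# Bagging in action: each of the 100 trees sees a random subset of samples
# and, at each split, a random subset of features.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))   # aggregated (majority-vote) prediction of the trees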
10. Clustering
10.1 Clustering:
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
10.2 K-means:
K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering,
unlike in supervised learning. K-Means performs the division of objects into clusters that share
similarities and are dissimilar to the objects belonging to another cluster.
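A minimal K-Means sketch; the six 2-D points are made up to form two obvious groups, and no labels are provided:
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # centroid of each cluster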
FINAL PROJECT
Problem Statement
The dataset contains information belonging to a bank, such as customers' personal information, transaction information, and bank account information. It is often necessary to predict which customers are going to withdraw their money from their bank accounts and stay dormant. By being able to predict this, the bank can take the necessary action to prevent customers from withdrawing huge sums, and keep them as active/loyal customers. Our task is to predict the customers who are going to churn based on the information given.
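One way such an end-to-end churn model could be sketched; the file name churn.csv, the Churn column (assumed 0/1), and the choice of a random forest are all assumptions for illustration, not the project's actual specification:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("churn.csv")                     # hypothetical dataset file
X = pd.get_dummies(df.drop(columns=["Churn"]))    # encode categorical features
y = df["Churn"]                                   # assumed binary 0/1 target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))   # evaluate on held-out data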
Conclusion:
Finally, when it comes to developing machine learning models of our own, we looked at the choices of various development languages, IDEs, and platforms. The next thing we need to do is start learning and practicing each machine learning technique. The subject is vast, which means there is breadth; but if we consider the depth, each topic can be learned in a few hours. Each topic is independent of the others. We need to take one topic at a time, learn it, practice it, and implement the algorithm(s) in it using a language of our choice.