Gyan Singh Machine Learning Project For A Level
PROJECT REPORT
ON
Submitted By:
Mr. Gyan Singh
Regd. No – 1179726
Guided by:
Mr. Sushil Kumar
PrimeItZen Software Solution Pvt Ltd
Prayagraj
ACKNOWLEDGEMENT
Lastly, words run short to express my gratitude to all the lecturers and
friends for their co-operation, constructive criticism and their
valuable suggestions during the preparation of this project report.
Mr Gyan Singh
CONTENTS
1. Abstract.
2. Introduction.
3. Artificial Intelligence.
4. Machine learning.
14. Conclusion.
15. References.
Abstract
It is better for someone seeking a bank loan to know in advance
whether it will be approved. Knowing this, the applicant does not
need to go from bank to bank; entering some personal data is enough
to learn whether the loan will be approved or not.
Our model uses machine learning to predict the approval of a bank
loan for a particular customer. It has certain data fields such as loan
amount, applicant's annual salary, expenditure, etc. Any customer can
enter the required data in the data fields and get a prediction of
whether the loan being applied for will be approved or not, in no
time.
In this way our project can be an effective application for anyone
who is seeking a loan.
Introduction
India's real estate sector is projected to reach $180 billion by 2020
from $126 billion in 2015, according to a joint report by CREDAI and
JLL.
Investment inflows in the housing sector since 2014 have been Rs.
590 billion, about 47 per cent of the total money invested in real
estate, it said.
The report also said that the contribution of the residential segment
to the GDP would almost double to 11 per cent by 2020, up from an
estimated 5-6 per cent. Released on Wednesday at CREDAI Conclave
2018, the report traces 7 trends that will change the way real estate
business is done in India in the future.
Regulatory reforms, steady demand generated through rapid
urbanisation, rising household income and the emergence of
affordable and nuclear housing are some of the key drivers of growth
for the sector, the report said.
Sales figures are projected to improve, with RERA bound to bridge
the trust deficit between buyers and developers, it said.
On GST, the report said it would lead to cost savings of 3-4 per cent.
Prices would continue to remain dependent on demand and supply
dynamics within micro-markets.
Apart from eight major cities, JLL said that cities like Nagpur, Kochi,
Chandigarh and Patna could be growth centres.
The recent relaxation in FDI has provided a huge boost to
investment in the industry, the report said, adding that the
affordable housing and warehousing segments would attract huge
investment going forward.
The affordable housing segment, which has been granted
infrastructure status, would create avenues for developers.
So this project has a major role in the real estate business,
particularly in the housing sector.
ARTIFICIAL INTELLIGENCE
The field was founded on the claim that human intelligence "can be
so precisely described that a machine can be made to simulate
it". This raises philosophical arguments about the nature of
the mind and the ethics of creating artificial beings endowed with
human-like intelligence, which are issues that have been explored
by myth, fiction and philosophy since antiquity. Some people also
consider AI to be a danger to humanity if it progresses
unabated. Others believe that AI, unlike previous technological
revolutions, will create a risk of mass unemployment.
Basics
A typical AI perceives its environment and takes actions that
maximize its chance of successfully achieving its goals. An AI's
intended goal function can be simple ("1 if the AI wins a game of Go,
0 otherwise") or complex ("Do actions mathematically similar to the
actions that got you rewards in the past"). Goals can be explicitly
defined, or can be induced. If the AI is programmed for
"reinforcement learning", goals can be implicitly induced by
rewarding some types of behavior and punishing
others. Alternatively, an evolutionary system can induce goals by
using a "fitness function" to mutate and preferentially replicate
high-scoring AI systems; this is similar to how animals evolved to innately
desire certain goals such as finding food, or how dogs can be bred
via artificial selection to possess desired traits. Some AI systems are
not generally given goals, except to the degree that goals are
somehow implicit in their training data. Such systems can still be
benchmarked if the non-goal system is framed as a system whose
"goal" is to successfully accomplish its narrow classification task.
Concepts of Learning
Learning is the process of converting experience into expertise or
knowledge.
Learning can be broadly classified into three categories, as
mentioned below, based on the nature of the learning data and
interaction between the learner and the environment.
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
Supervised Learning
Supervised learning is commonly used in real world applications,
such as face and speech recognition, products or movie
recommendations, and sales forecasting. Supervised learning can be
further classified into two types - Regression and Classification.
Regression trains on and predicts a continuous-valued response, for
example predicting real estate prices.
Classification attempts to find the appropriate class label, such as
analyzing positive/negative sentiment, male and female persons,
benign and malignant tumors, secure and unsecure loans etc.
In supervised learning, learning data comes with description, labels,
targets or desired outputs and the objective is to find a general rule
that maps inputs to outputs. This kind of learning data is
called labeled data. The learned rule is then used to label new data
with unknown outputs.
Supervised learning involves building a machine learning model that
is based on labeled samples. For example, if we build a system to
estimate the price of a plot of land or a house based on various
features, such as size, location, and so on, we first need to create a
database and label it. We need to teach the algorithm what features
correspond to what prices. Based on this data, the algorithm will
learn how to calculate the price of real estate using the values of the
input features.
Supervised learning deals with learning a function from available
training data. Here, a learning algorithm analyzes the training data
and produces a derived function that can be used for mapping new
examples. There are many supervised learning algorithms such as
Logistic Regression, Neural networks, Support Vector Machines
(SVMs), and Naive Bayes classifiers.
Common examples of supervised learning include classifying e-mails
into spam and not-spam categories, labeling webpages based on
their content, and voice recognition.
Unsupervised Learning
Unsupervised learning is used to detect anomalies and outliers, such as
fraud or defective equipment, or to group customers with similar
behaviors for a sales campaign. It is the opposite of supervised
learning. There is no labeled data here.
When learning data contains only some indications without any
description or labels, it is up to the coder or to the algorithm to find
the structure of the underlying data, to discover hidden patterns, or
to determine how to describe the data. This kind of learning data is
called unlabeled data.
Suppose that we have a number of data points, and we want to
classify them into several groups. We may not exactly know what
the criteria of classification would be. So, an unsupervised learning
algorithm tries to classify the given dataset into a certain number of
groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for
analyzing data and for identifying patterns and trends. They are
most commonly used for clustering similar input into logical groups.
Unsupervised learning algorithms include K-means, hierarchical
clustering and so on.
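To make the grouping idea concrete, the following is a minimal sketch of K-means clustering written with the Python standard library only. The function name kmeans, the 2-D sample points and the fixed number of iterations are assumptions made for illustration; they are not taken from this project.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters by alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Two visibly separate groups of points (illustrative data).
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
```

No labels are supplied anywhere in this sketch; the two groups emerge only from the distances between the points.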
Semi-supervised Learning
If some learning samples are labeled, but some others are not
labeled, then it is semi-supervised learning. It makes use of a small
amount of labeled data together with a large amount of unlabeled
data for training. Semi-supervised learning is applied in
cases where it is expensive to acquire a fully labeled dataset while
more practical to label a small subset. For example, it often requires
skilled experts to label certain remote sensing images, and lots of
field experiments to locate oil at a particular location, while
acquiring unlabeled data is relatively easy.
Reinforcement Learning
Here learning data gives feedback so that the system adjusts to
dynamic conditions in order to achieve a certain objective. The
system evaluates its performance based on the feedback responses
and reacts accordingly. The best known instances include self-driving
cars and the Go-playing program AlphaGo.
Training data and Test data are two important concepts in machine
learning. This chapter discusses them in detail.
Training Data
The observations in the training set form the experience that the
algorithm uses to learn. In supervised learning problems, each
observation consists of an observed output variable and one or
more observed input variables.
Test Data
The test set is a set of observations used to evaluate the
performance of the model using some performance metric. It is
important that no observations from the training set are included in
the test set. If the test set does contain examples from the training
set, it will be difficult to assess whether the algorithm has learned to
generalize from the training set or has simply memorized it.
Cross-validation extends this idea by rotating which part of the data is
held out for testing. Consider, for example, that the original dataset is
partitioned into five subsets of equal size, labeled A through E. Initially,
the model is trained on partitions B through E, and tested on partition A.
In the next iteration, the model is trained on partitions A, C, D, and E,
and tested on partition B. The partitions are rotated until models have
been trained and tested on all of the partitions. Cross-validation provides
a more accurate estimate of the model's performance than testing a single
partition of the data.
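The rotation of partitions described above can be sketched in a few lines of plain Python. The helper name k_fold_splits is hypothetical, and the sketch assumes the data divides evenly into k partitions (any leftover items are ignored).

```python
def k_fold_splits(data, k=5):
    """Yield (train, test) pairs, using each of the k partitions once as the test set."""
    fold_size = len(data) // k
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Every partition except the i-th one is used for training.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test

# Ten observations split into five partitions of two items each.
for train, test in k_fold_splits(list(range(10)), k=5):
    print("train:", train, "test:", test)
```

Each observation appears in exactly one test partition, so no model is ever tested on data it was trained on.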
Classification
Classification is a machine learning technique that uses known data
to determine how the new data should be classified into a set of
existing categories.
Consider the following example to understand the classification
technique.
In a hospital, the emergency room has more than 15 features (age,
blood pressure, heart condition, severity of ailment etc.) to analyze
before deciding whether a given patient has to be put in an intensive
care unit, as it is a costly proposition and only those patients who can
survive and afford the cost are given top priority. The problem here
is to classify the patients into high risk and low risk patients based
on the available features or parameters.
While classifying a given set of data, the classifier system performs
the following actions −
• Initially a new data model is prepared using any of the learning
algorithms.
• Then the prepared data model is tested.
• Later, this data model is used to examine the new data and to
determine its class.
Regression
In regression, the program predicts the value of a continuous output
or response variable from the input or explanatory variables.
Examples of regression problems include predicting the sales for a
new product, or the salary for a job based on its description. Similar
to classification, regression problems require supervised learning.
Recommendation
Recommendation is a popular method that provides close
recommendations based on user information such as history of
purchases, clicks, and ratings. Google and Amazon use this method
to display a list of recommended items for their users, based on the
information from their past actions. There are recommender
engines that work in the background to capture user behavior and
recommend selected items based on earlier user actions. Facebook
also uses the recommender method to identify and recommend
people and send friend suggestions to its users.
Clustering
Groups of related observations are called clusters. A common
unsupervised learning task is to find clusters within the training
data.
We can also define clustering as a procedure to organize items of a
given collection into groups based on some similar features. For
example, online news publishers group their news articles using
clustering.
Linear Regression
Linear regression is used to estimate real world values like cost of
houses, number of calls, total sales etc. based on continuous
variable(s). Here, we establish a relationship between dependent and
independent variables by fitting a best line. This line of best fit is
known as the regression line and is represented by the linear
equation Y = a * X + b.
In this equation −
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum
of the squared distances between the data points and the
regression line.
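As a rough illustration of how a and b can be obtained, the sketch below applies the standard closed-form least-squares solution for a single independent variable. The function name fit_line and the sample numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Points that lie exactly on Y = 2 * X + 1 (illustrative data).
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

For data that does not lie exactly on a line, the same formulas give the line with the smallest total squared error.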
The best way to understand linear regression is by considering an
example.
Example
Suppose we are asked to arrange students in a class in the increasing
order of their weights. By looking at the students and visually
analyzing their heights and builds we can arrange them as required
using a combination of these parameters, namely height and build.
This is a real world linear regression example. We have figured out
that height and build are correlated to weight by a
relationship which looks similar to the equation above.
[Figure: graph of a linear regression line fitted to data points]
Logistic Regression
Logistic regression is another technique borrowed by machine
learning from statistics. It is the preferred method for binary
classification problems, that is, problems with two class values.
It is a classification algorithm and not a regression algorithm, despite
what the name says. It is used to estimate discrete values like 0/1,
Y/N, T/F based on the given set of independent variable(s). It predicts
the probability of occurrence of an event by fitting data to a logit
function. Hence, it is also called logit regression. Since it predicts
the probability, its output values lie between 0 and 1.
Example
Let us understand this algorithm through a simple example.
Assume that there is a puzzle to solve that has only 2 outcome
scenarios – either there is a solution or there is none. Now suppose
we have a wide range of puzzles to test which subjects a person is
good at. The outcomes may be something like this – if a
trigonometry puzzle is given, a person may be 80% likely to solve it.
On the other hand, if a geography puzzle is given, the person may be
only 20% likely to solve it. This is where Logistic Regression helps.
Mathematically, the log odds of the outcome is expressed as a
linear combination of the predictor variables.
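The link between a probability and its log odds can be checked with a short sketch. The helper names log_odds and sigmoid are chosen here for illustration; the 80% and 20% figures come from the puzzle example above.

```python
import math

def log_odds(p):
    """Log of the odds p / (1 - p) for a probability 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Logistic function: maps a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

trig = log_odds(0.8)       # positive: solving is more likely than not
geography = log_odds(0.2)  # negative: solving is less likely than not
recovered = sigmoid(trig)  # converts the log odds back to a probability
```

Logistic regression models the log-odds value as a linear combination of the inputs, and the logistic function turns that value into a probability between 0 and 1.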
Example
Consider an example of using the Titanic data set for predicting whether
a passenger will survive or not. The model below uses 3 features/
attributes/columns from the data set, namely sex, age and sibsp (number
of siblings/spouses aboard). In this case, whether the passenger died or
survived is represented as red and green text respectively.
Random Forest
Random Forest is a popular supervised ensemble learning algorithm.
‘Ensemble’ means that it takes a bunch of ‘weak learners’ and has
them work together to form one strong predictor. In this case, the
weak learners are all randomly implemented decision trees that are
brought together to form the strong predictor — a random forest.
Random Forest is a trademark term for an ensemble of decision
trees. In Random Forest, we have a collection of decision trees,
known as “Forest”. To classify a new object based on attributes, each
tree gives a classification and we say the tree “votes” for that class.
The forest chooses the classification having the most votes (over all
the trees in the forest).
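The voting step can be sketched as follows. This only illustrates how votes are combined; the three "trees" are hand-written stand-ins invented for this example (reusing the high risk / low risk patient setting from earlier), not trained decision trees.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the class that gets the most votes across all trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for three trained decision trees.
trees = [
    lambda s: "high" if s["age"] > 60 else "low",
    lambda s: "high" if s["blood_pressure"] > 140 else "low",
    lambda s: "high" if s["age"] > 50 and s["blood_pressure"] > 130 else "low",
]

patient = {"age": 65, "blood_pressure": 120}
prediction = forest_predict(trees, patient)  # two of the three trees vote "low"
```

In a real random forest each tree is trained on a random sample of the data and features, which is what makes their votes differ.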
Python's machine learning ecosystem is a collection of various
open source repositories, developed by the community (and still
evolving) to continuously improve upon the existing methods.
The best thing about using these packages is that they have a very
gentle learning curve. Once you have a basic understanding of
Python, you can just implement it. They are free to use under
open-source licenses. Just import the package and use it.
If you do not want to use any of them, you can implement the
functionality from scratch (as many developers do).
The main reason why Python is not used everywhere is the
overhead it brings in. To be clear, it was built for usability
rather than for system-level programming. Small processors or
low memory hardware won’t accommodate a Python codebase
today, but for such cases we have C and C++ as our development
tools.
Another great tool is R. It’s open source, free and made for
statistical analysis. In my view, Python is a great tool for the
development of programs which perform data manipulation,
whereas R is statistical software which works on a particular
format of dataset. Python provides various development
tools which can be used to work with other systems.
Neural Networks
Motivation: As part of my personal journey to gain a better understanding
of Deep Learning, I’ve decided to build a Neural Network from scratch
without a deep learning library like TensorFlow. I believe that
understanding the inner workings of a Neural Network is important to any
aspiring Data Scientist.
This article contains what I’ve learned, and hopefully it’ll be useful for
you as well!
Neural Networks consist of the following components:
• An input layer, x
• An arbitrary number of hidden layers
• An output layer, ŷ
For a given input, the weights W and the biases b are the only
variables that affect the output ŷ.
Naturally, the right values for the weights and biases determine the
strength of the predictions. The process of fine-tuning the weights and
biases from the input data is known as training the Neural Network.
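To show how the weights and biases produce the output ŷ, here is a minimal feedforward sketch for a network with one hidden layer. The sigmoid activation, the function names and the list-based layout of the weights are assumptions for illustration; they are not the code used in this report.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid of (weights times inputs, plus bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def feedforward(x, w1, b1, w2, b2):
    """Compute the output of a network with one hidden layer."""
    hidden = layer(x, w1, b1)
    return layer(hidden, w2, b2)

# Two inputs, two hidden units, one output (illustrative values).
y_hat = feedforward([1.0, 2.0],
                    [[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1],
                    [[1.0, -1.0]], [0.0])
```

Changing any entry of the weight or bias lists changes y_hat, which is exactly what training exploits.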
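Each iteration of the training process consists of two steps: calculating
the predicted output ŷ, known as feedforward, and updating the weights
and biases, known as backpropagation.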
Loss Function
There are many available loss functions, and the nature of our problem
should dictate our choice of loss function. In this tutorial, we’ll use a
simple sum-of-squares error as our loss function.
That is, the sum-of-squares error is simply the sum of the squared
differences between each predicted value and the actual value. The
difference is squared so that positive and negative differences do not
cancel each other out.
Our goal in training is to find the best set of weights and biases that
minimizes the loss function.
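The sum-of-squares error described above takes only a few lines to write down. The function name and the sample values below are illustrative assumptions.

```python
def sum_of_squares_error(actual, predicted):
    """Sum of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

# Two predictions are off by 0.5 each, one is exact.
loss = sum_of_squares_error([1, 0, 1], [0.5, 0.5, 1])
```

A loss of zero would mean every prediction matches its actual value exactly.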
Backpropagation
Now that we’ve measured the error of our prediction (loss), we need to
find a way to propagate the error back, and to update our weights and
biases.
In order to know the appropriate amount to adjust the weights and biases
by, we need to know the derivative of the loss function with respect to
the weights and biases.
Recall from calculus that the derivative of a function is simply the slope of
the function.
Chain rule for calculating derivative of the loss function with respect to the
weights. Note that for simplicity, we have only displayed the partial
derivative assuming a 1-layer Neural Network.
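Since the figure for this caption is not reproduced here, the chain rule it refers to can be written out for a single training sample as below. The notation is an assumption made for this sketch: σ is the activation function, z = Wx + b is the weighted input, and the loss is the squared error from the previous section.

```latex
\[
\frac{\partial\,\mathrm{Loss}}{\partial W}
= \frac{\partial\,\mathrm{Loss}}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z}
  \cdot \frac{\partial z}{\partial W}
= 2(\hat{y} - y)\,\sigma'(z)\,x,
\qquad \text{where } \mathrm{Loss} = (y - \hat{y})^2,\quad
\hat{y} = \sigma(z),\quad z = Wx + b .
\]
```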
Phew! That was ugly but it allows us to get what we needed — the
derivative (slope) of the loss function with respect to the weights, so that
we can adjust the weights accordingly.
Now that we have that, let’s add the backpropagation function into our
Python code.
Now that we have our complete Python code for doing feedforward and
backpropagation, let’s apply our Neural Network on an example and see
how well it does.
Our Neural Network should learn the ideal set of weights to represent this
function. Note that it isn’t exactly trivial for us to work out the weights just
by inspection alone.
Let’s train the Neural Network for 1500 iterations and see what happens.
Looking at the loss per iteration graph below, we can clearly see the loss
monotonically decreasing towards a minimum. This is consistent with
the gradient descent algorithm that we’ve discussed earlier.
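The behaviour described here can be reproduced with a small sketch that trains a single sigmoid neuron by gradient descent for 1500 iterations. This is not the network or the data used in this report: the single-neuron model, the learning rate of 0.1 and the toy data set (the target equals the first input) are all assumptions chosen to keep the example short.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train(samples, targets, iterations=1500, learning_rate=0.1):
    """Train one sigmoid neuron; return weights, bias and the loss per iteration."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    history = []
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, targets):
            # Feedforward: compute the prediction for this sample.
            y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            loss += (y - y_hat) ** 2
            # Backpropagation: chain rule for the sum-of-squares loss.
            delta = 2 * (y_hat - y) * y_hat * (1 - y_hat)
            for i, xi in enumerate(x):
                grad_w[i] += delta * xi
            grad_b += delta
        history.append(loss)
        # Gradient descent: step against the gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, history

# Toy data: the target is simply the first input (illustrative).
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 0, 1, 1]
weights, bias, history = train(samples, targets)
```

Printing history[0] and history[-1] shows how far the loss has fallen between the first and the last iteration.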
Let’s look at the final prediction (output) from the Neural Network after
1500 iterations.
Predictions after 1500 training iterations
We did it! Our feedforward and backpropagation algorithm trained the
Neural Network successfully and the predictions converged on the true
values.
Note that there’s a slight difference between the predictions and the actual
values. This is desirable, as it prevents overfitting and allows the Neural
Network to generalize better to unseen data.
What’s Next?
Fortunately for us, our journey isn’t over. There’s still much to learn about
Neural Networks and Deep Learning. For example:
Final Thoughts
I’ve certainly learnt a lot writing my own Neural Network from scratch.
This exercise has been a great investment of my time, and I hope that it’ll
be useful for you as well!
PREDICTING HOUSE PRICES FOR REGIONS IN
COUNTRY
Introduction:
Real estate sector in India is expected to reach US$ 650 billion and its
share in India’s Gross Domestic Product (GDP) is projected to increase
to 17 per cent by 2040. Emergence of nuclear families, rapid
urbanisation and rising household income are likely to remain the key
drivers for growth in all spheres of real estate, including residential,
commercial and retail. Rapid urbanisation in the country is pushing
the growth of real estate. More than 70 per cent of India’s GDP will
be contributed by the urban areas by 2020.
2. Overview of data
Avg. Area Number of Rooms: Average Number of Rooms for Houses in same city
The data is divided into two sets:
1. Training dataset
2. Testing dataset
Training dataset:
The training dataset is used to train the machine. We use almost
80% of our data to train the machine. In training we provide the
machine with the inputs as well as the outputs. For example, when
we teach a student addition we tell him that
1+2=3, 2+3=5, 6+7=13
Examples of this kind are provided to the student, and the student
learns from them. When we then ask a question and the student
replies correctly, we become sure that the student has learned
addition. Similarly, machines are trained with a huge amount of
data.
Testing dataset:
Testing datasets are used to test whether the machine is predicting
up to an acceptable level.
In the testing phase we give input to the machine, and if the machine
predicts the output correctly then we become sure that our machine
has learned.
This is how you create the training set and testing set.
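# Import pandas and the train/test splitting helper
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
IndianHousing = pd.read_csv('Indian_Housing.csv')

# Split the independent variables (X) from the dependent variable (y)
X = IndianHousing[['Avg. Area Income', 'Avg. Area House Age',
                   'Avg. Area Number of Rooms',
                   'Avg. Area Number of Bedrooms', 'Area Population']]
y = IndianHousing['Price']

# Assumed split: hold out 20% of the rows for testing, so that about
# 80% of the data trains the model, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)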
Linear regression:
Linear regression is a linear approach to
modelling the relationship between a scalar response (or
dependent variable) and one or more explanatory variables (or
independent variables). The case of one explanatory variable is
called simple linear regression. For more than one explanatory
variable, the process is called multiple linear regression.[1] This
term is distinct from multivariate linear regression, where
multiple correlated dependent variables are predicted, rather
than a single scalar variable.[2]
Creating Model
from sklearn.linear_model import LinearRegression

# Create the model and fit it to the training data
lm = LinearRegression()
lm.fit(X_train, y_train)
# Output:
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Model Evaluation
# Print the intercept
print(lm.intercept_)
# -2640159.79685

# Pair each coefficient with the name of its feature
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])

# Predict prices for the test set
predictions = lm.predict(X_test)
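One common way to judge such predictions is to compare them with the actual prices using error metrics. The sketch below computes three of them in plain Python. The function name and the sample numbers are illustrative assumptions; they are not results from this project.

```python
import math

def regression_metrics(actual, predicted):
    """Return mean absolute error, mean squared error and root mean squared error."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    return mae, mse, math.sqrt(mse)

# Made-up actual and predicted values, for illustration only.
mae, mse, rmse = regression_metrics([3, 5, 7], [2, 5, 9])
```

With the model above, the same function could be called on the test prices and the predictions, for example regression_metrics(list(y_test), list(predictions)).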
Conclusion: