
A

PROJECT REPORT
ON

PREDICTING HOUSING PRICES FOR REGIONS IN A COUNTRY

A project submitted to the National Institute of Electronics and
Information Technology.

Submitted By:
Mr. Gyan Singh

Regd. No – 1179726
Guided by:
Mr. Sushil Kumar
PrimeItZen Software Solution Pvt Ltd
Prayagraj
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any
task would be incomplete without the people whose constant
guidance and encouragement crown all efforts with success.

I express my deep sense of gratitude to Er. Anup Kumar Mishra,
Head of the Department of Computer Science and Engineering, for his
initiative and constant inspiration, and to Mr. Sushil Kumar for his
guidance.

Lastly, words fall short of expressing my gratitude to all the lecturers
and friends for their co-operation, constructive criticism and valuable
suggestions during the preparation of this project report.

Mr. Gyan Singh


Declaration

I, Mr. Gyan Singh, do hereby declare that the project report entitled
"Housing Prices for Regions in a Country" submitted by me is an
original piece of work done by me under the guidance of Er. Sushil
Kumar, Assistant Professor, Department of Computer Science and
Engineering.

This report is submitted in partial fulfillment of my degree and has
not been submitted to any other institute in any form or published at
any time before.

Mr Gyan Singh
CONTENTS
1. Abstract
2. Introduction
3. Artificial Intelligence
4. Machine Learning
5. Machine Learning Using Python
6. Implementation of Machine Learning in the Banking Sector
7. Prediction of Bank Loan Approval Using Python
8. General Steps Involved in Machine Learning
9. Dataset Preparation
10. Loading the Dataset
11. Splitting the Dataset
12. Training Our Model
13. Testing Our Model
14. Conclusion
15. References
Abstract
It is better to know in advance whether the bank loan someone is
looking for will be approved or not. If he knows this, he does not
need to go from bank to bank; he just needs to enter some data about
himself and he will know whether the bank loan will be approved or not.

Our model uses machine learning to predict the approval of a bank
loan for a particular customer. It has certain data fields like loan
amount, the applicant's annual salary, expenditure, etc. Any customer
can enter the required data in the data fields and can get the
prediction of whether the loan he is applying for will be approved
or not in no time.

So, in this way, our project can be an effective application for anyone
who is seeking a loan.
Introduction
India's real estate sector is projected to reach $180 billion by 2020,
up from $126 billion in 2015, according to a joint report by CREDAI and
JLL.

Investment inflows in the housing sector since 2014 have been Rs.
590 billion, about 47 per cent of the total money invested in real
estate, it said.

The report also said that the contribution of the residential segment
to the GDP would almost double to 11 per cent by 2020. Released on
Wednesday at CREDAI Conclave 2018, the report traces 7 trends that
will change the way real estate business happens in India in the
future. JLL also projected that the housing sector's contribution to the
Indian GDP is expected to almost double to more than 11 per cent by
2020, up from an estimated 5-6 per cent.

Regulatory reforms, steady demand generated through rapid
urbanisation, rising household income and the emergence of
affordable and nuclear housing are some of the key drivers of growth
for the sector, the report said.

Sales figures are projected to improve, with RERA bound to rebuild
the trust deficit between buyers and developers, it said.

On GST, the report said it would lead to cost savings of 3-4 per cent.
Prices would continue to remain dependent on demand and supply
dynamics within micro-markets.

Apart from the eight major cities, JLL said that cities like Nagpur, Kochi,
Chandigarh and Patna could be growth centres.
The recent relaxation in FDI norms has provided a huge boost to
investment in the industry, the report said, adding that the affordable
housing and warehousing segments would attract huge investment
going forward.

The affordable housing segment, which has been granted
infrastructure status, would create avenues for developers.

So this project has a major role to play in the real estate business,
particularly in the housing segment.
ARTIFICIAL INTELLIGENCE

What is artificial intelligence?

Artificial intelligence (AI), sometimes called machine intelligence,
is intelligence demonstrated by machines, in contrast to the natural
intelligence displayed by humans and other animals. In computer
science, AI research is defined as the study of "intelligent agents": any
device that perceives its environment and takes actions that
maximize its chance of successfully achieving its goals.[1] Colloquially,
the term "artificial intelligence" is applied when a machine mimics
"cognitive" functions that humans associate with other human
minds, such as "learning" and "problem solving".

The scope of AI is disputed: as machines become increasingly
capable, tasks considered as requiring "intelligence" are often
removed from the definition, a phenomenon known as the AI effect,
leading to the quip, "AI is whatever hasn't been done yet." For
instance, optical character recognition is frequently excluded from
"artificial intelligence", having become a routine technology. Modern
machine capabilities generally classified as AI include
successfully understanding human speech, competing at the highest
level in strategic game systems (such as chess and Go), autonomously
operating cars, and intelligent routing in content delivery
networks and military simulations.
Artificial intelligence was founded as an academic discipline in 1956,
and in the years since has experienced several waves of optimism,
followed by disappointment and the loss of funding (known as an "AI
winter"), followed by new approaches, success and renewed
funding. For most of its history, AI research has been divided into
subfields that often fail to communicate with each other. These
subfields are based on technical considerations, such as particular goals
(e.g. "robotics" or "machine learning"), the use of particular tools
("logic" or artificial neural networks), or deep philosophical
differences. Subfields have also been based on social factors
(particular institutions or the work of particular researchers).

The traditional problems (or goals) of AI research
include reasoning, knowledge representation, planning, learning,
natural language processing, perception and the ability to move and
manipulate objects. General intelligence is among the field's long-term
goals. Approaches include statistical methods, computational intelligence,
and traditional symbolic AI. Many tools are used in AI, including
versions of search and mathematical optimization, artificial neural
networks, and methods based on statistics, probability and
economics. The AI field draws upon computer science, mathematics,
psychology, linguistics, philosophy and many others.

The field was founded on the claim that human intelligence "can be
so precisely described that a machine can be made to simulate
it". This raises philosophical arguments about the nature of
the mind and the ethics of creating artificial beings endowed with
human-like intelligence, issues that have been explored
by myth, fiction and philosophy since antiquity. Some people also
consider AI to be a danger to humanity if it progresses
unabated. Others believe that AI, unlike previous technological
revolutions, will create a risk of mass unemployment.

In the twenty-first century, AI techniques have experienced a
resurgence following concurrent advances in computer power, large
amounts of data, and theoretical understanding; and AI techniques
have become an essential part of the technology industry, helping to
solve many challenging problems in computer science, software
engineering and operations research.

History of artificial intelligence:

The history of Artificial Intelligence (AI) began in antiquity, with
myths, stories and rumors of artificial beings endowed with
intelligence or consciousness by master craftsmen; as Pamela
McCorduck writes, AI began with "an ancient wish to forge the gods".

The seeds of modern AI were planted by classical philosophers who
attempted to describe the process of human thinking as the
mechanical manipulation of symbols. This work culminated in the
invention of the programmable digital computer in the 1940s, a
machine based on the abstract essence of mathematical reasoning.
This device and the ideas behind it inspired a handful of scientists to
begin seriously discussing the possibility of building an electronic
brain.

The field of AI research was founded at a workshop held on the
campus of Dartmouth College during the summer of 1956. Those
who attended would become the leaders of AI research for decades.
Many of them predicted that a machine as intelligent as a human
being would exist in no more than a generation, and they were given
millions of dollars to make this vision come true.
Eventually it became obvious that they had grossly underestimated
the difficulty of the project. In 1973, in response to the criticism
of James Lighthill and ongoing pressure from Congress,
the U.S. and British governments stopped funding undirected
research into artificial intelligence, and the difficult years that
followed would later be known as an "AI winter". Seven years later, a
visionary initiative by the Japanese government inspired
governments and industry to provide AI with billions of dollars, but
by the late 80s the investors became disillusioned by the absence of
the needed computer power (hardware) and withdrew funding
again.

Investment and interest in AI boomed in the first decades of the 21st
century, when machine learning was successfully applied to many
problems in academia and industry due to the presence of powerful
computer hardware. As in previous "AI summers", some observers
(such as Ray Kurzweil) predicted the imminent arrival of artificial
general intelligence.

Basics

A typical AI perceives its environment and takes actions that
maximize its chance of successfully achieving its goals. An AI's
intended goal function can be simple ("1 if the AI wins a game of Go,
0 otherwise") or complex ("Do actions mathematically similar to the
actions that got you rewards in the past"). Goals can be explicitly
defined, or can be induced. If the AI is programmed for
"reinforcement learning", goals can be implicitly induced by
rewarding some types of behavior and punishing
others. Alternatively, an evolutionary system can induce goals by
using a "fitness function" to mutate and preferentially replicate high-
scoring AI systems; this is similar to how animals evolved to innately
desire certain goals such as finding food, or how dogs can be bred
via artificial selection to possess desired traits. Some AI systems are
not generally given goals, except to the degree that goals are
somehow implicit in their training data. Such systems can still be
benchmarked if the non-goal system is framed as a system whose
"goal" is to successfully accomplish its narrow classification task.

AI often revolves around the use of algorithms. An algorithm is a set
of unambiguous instructions that a mechanical computer can
execute. A complex algorithm is often built on top of other, simpler,
algorithms. A simple example of an algorithm is the following recipe
for optimal play at tic-tac-toe (a small illustrative sketch in Python
follows the list):

1. If someone has a "threat" (that is, two in a row), take the
remaining square. Otherwise,
2. if a move "forks" to create two threats at once, play that move.
Otherwise,
3. take the center square if it is free. Otherwise,
4. if your opponent has played in a corner, take the opposite
corner. Otherwise,
5. take an empty corner if one exists. Otherwise,
6. take any empty square.
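
To make the idea concrete, here is a small illustrative Python sketch of
the positional rules (3-6) above; the board encoding and helper names are
our own assumptions, and the threat/fork rules (1-2) are omitted for
brevity:

# Illustrative sketch: choosing a tic-tac-toe move with the positional
# rules (3-6) from the recipe above. The board is a list of 9 cells
# holding "X", "O", or None; all names here are hypothetical.

CORNERS = [0, 2, 6, 8]
OPPOSITE = {0: 8, 2: 6, 6: 2, 8: 0}

def choose_move(board, opponent):
    # Rule 3: take the center square if it is free.
    if board[4] is None:
        return 4
    # Rule 4: if the opponent played in a corner, take the opposite corner.
    for corner in CORNERS:
        if board[corner] == opponent and board[OPPOSITE[corner]] is None:
            return OPPOSITE[corner]
    # Rule 5: take an empty corner if one exists.
    for corner in CORNERS:
        if board[corner] is None:
            return corner
    # Rule 6: take any empty square.
    return next(i for i, cell in enumerate(board) if cell is None)

# Example: opponent "O" opened in the top-left corner.
board = ["O", None, None, None, None, None, None, None, None]
print(choose_move(board, "O"))  # prints 4 -- the free center square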
Many AI algorithms are capable of learning from data; they can
enhance themselves by learning new heuristics (strategies, or "rules
of thumb", that have worked well in the past), or can themselves
write other algorithms. Some of the "learners" described below,
including Bayesian networks, decision trees, and nearest-neighbor,
could theoretically, if given infinite data, time, and memory, learn to
approximate any function, including whatever combination of
mathematical functions would best describe the entire world. These
learners could therefore, in theory, derive all possible knowledge, by
considering every possible hypothesis and matching it against the
data. In practice, it is almost never possible to consider every
possibility, because of the phenomenon of "combinatorial
explosion", where the amount of time needed to solve a problem
grows exponentially. Much of AI research involves figuring out how to
identify and avoid considering broad swaths of possibilities that are
unlikely to be fruitful. For example, when viewing a map and looking
for the shortest driving route from Denver to New York in the East,
one can in most cases skip looking at any path through San
Francisco or other areas far to the West; thus, an AI wielding a
pathfinding algorithm like A* can avoid the combinatorial explosion
that would ensue if every possible route had to be ponderously
considered in turn.

The earliest (and easiest to understand) approach to AI was
symbolism (such as formal logic): "If an otherwise healthy adult has a
fever, then they may have influenza". A second, more general,
approach is Bayesian inference: "If the current patient has a fever,
adjust the probability they have influenza in such-and-such way". The
third major approach, extremely popular in routine business AI
applications, is analogizers such as SVM and nearest-neighbor:
"After examining the records of known past patients whose
temperature, symptoms, age, and other factors mostly match the
current patient, X% of those patients turned out to have influenza". A
fourth approach is harder to intuitively understand, but is inspired by
how the brain's machinery works: the artificial neural
network approach uses artificial "neurons" that can learn by
comparing themselves to the desired output and altering the strengths of
the connections between their internal neurons to "reinforce"
connections that seemed to be useful. These four main approaches
can overlap with each other and with evolutionary systems; for
example, neural nets can learn to make inferences, to generalize, and
to make analogies. Some systems implicitly or explicitly use multiple
of these approaches, alongside many other AI and non-AI
algorithms; the best approach is often different depending on the
problem.
The blue line could be an example of overfitting a linear function due
to random noise.

Learning algorithms work on the basis that strategies, algorithms,
and inferences that worked well in the past are likely to continue
working well in the future. These inferences can be obvious, such as
"since the sun rose every morning for the last 10,000 days, it will
probably rise tomorrow morning as well". They can be nuanced, such
as "X% of families have geographically separate species with color
variants, so there is a Y% chance that undiscovered black
swans exist". Learners also work on the basis of "Occam's razor": the
simplest theory that explains the data is the likeliest. Therefore, to be
successful, a learner must be designed such that it prefers simpler
theories to complex theories, except in cases where the complex
theory is proven substantially better. Settling on a bad, overly
complex theory gerrymandered to fit all the past training data is
known as overfitting. Many systems attempt to reduce overfitting by
rewarding a theory in accordance with how well it fits the data, but
penalizing the theory in accordance with how complex the theory is.
[61] Besides classic overfitting, learners can also disappoint by
"learning the wrong lesson". A toy example is that an image classifier
trained only on pictures of brown horses and black cats might
conclude that all brown patches are likely to be horses.[62] A real-
world example is that, unlike humans, current image classifiers don't
determine the spatial relationship between components of the
picture; instead, they learn abstract patterns of pixels that humans
are oblivious to, but that linearly correlate with images of certain
types of real objects. Faintly superimposing such a pattern on a
legitimate image results in an "adversarial" image that the system
misclassifies.

A self-driving car system may use a neural network to determine
which parts of the picture seem to match previous training images of
pedestrians, and then model those areas as slow-moving but
somewhat unpredictable rectangular prisms that must be avoided.
Compared with humans, existing AI lacks several features of human
"commonsense reasoning"; most notably, humans have powerful
mechanisms for reasoning about "naïve physics" such as space, time,
and physical interactions. This enables even young children to easily
make inferences like "If I roll this pen off a table, it will fall on the
floor". Humans also have a powerful mechanism of "folk psychology"
that helps them to interpret natural-language sentences such as "The
city councilmen refused the demonstrators a permit because they
advocated violence". (A generic AI has difficulty inferring whether the
councilmen or the demonstrators are the ones alleged to be
advocating violence.) This lack of "common knowledge" means that
AI often makes different mistakes than humans make, in ways that
can seem incomprehensible. For example, existing self-driving cars
cannot reason about the location nor the intentions of pedestrians in
the exact way that humans do, and instead must use non-human
modes of reasoning to avoid accidents.
What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that
provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning
focuses on the development of computer programs that can access
data and use it to learn for themselves.

The process of learning begins with observations or data, such as
examples, direct experience, or instruction, in order to look for
patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow the computers
to learn automatically without human intervention or assistance and
adjust actions accordingly.

The name machine learning was coined in 1959 by Arthur
Samuel. Machine learning explores the study and construction
of algorithms that can learn from and make predictions on data. Such
algorithms overcome following strictly static program instructions by
making data-driven predictions or decisions, through building
a model from sample inputs. Machine learning is employed in a
range of computing tasks where designing and programming explicit
algorithms with good performance is difficult or infeasible; example
applications include email filtering, detection of network intruders,
and computer vision.

Machine learning is closely related to (and often overlaps
with) computational statistics, which also focuses on prediction-
making through the use of computers. It has strong ties
to mathematical optimization, which delivers methods, theory and
application domains to the field. Machine learning is sometimes
conflated with data mining, where the latter subfield focuses more
on exploratory data analysis and is known as unsupervised learning.

History of Machine Learning

Arthur Samuel, an American pioneer in the field of computer
gaming and artificial intelligence, coined the term "machine
learning" in 1959 while at IBM. As a scientific endeavour, machine
learning grew out of the quest for artificial intelligence.

Already in the early days of AI as an academic discipline, some
researchers were interested in having machines learn from data.
They attempted to approach the problem with various symbolic
methods, as well as what were then termed "neural networks"; these
were mostly perceptrons and other models that were later found to
be reinventions of the generalized linear models of
statistics. Probabilistic reasoning was also employed, especially in
automated medical diagnosis.

However, an increasing emphasis on the logical, knowledge-based
approach caused a rift between AI and machine learning.
Probabilistic systems were plagued by theoretical and practical
problems of data acquisition and representation. By 1980, expert
systems had come to dominate AI, and statistics was out of
favor. Work on symbolic/knowledge-based learning did continue
within AI, leading to inductive logic programming, but the more
statistical line of research was now outside the field of AI proper,
in pattern recognition and information retrieval. Neural networks
research had been abandoned by AI and computer science around
the same time. This line, too, was continued outside the AI/CS field,
as "connectionism", by researchers from other disciplines
including Hopfield, Rumelhart and Hinton. Their main success came
in the mid-1980s with the reinvention of backpropagation.

Machine learning, reorganized as a separate field, started to flourish
in the 1990s. The field changed its goal from achieving artificial
intelligence to tackling solvable problems of a practical nature. It
shifted focus away from the symbolic approaches it had inherited
from AI, and toward methods and models borrowed from statistics
and probability theory.

Concepts of Learning

Learning is the process of converting experience into expertise or
knowledge.

Learning can be broadly classified into three categories, as
mentioned below, based on the nature of the learning data and
interaction between the learner and the environment.

• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning

Similarly, there are four categories of machine learning algorithms, as
shown below:

• Supervised learning algorithm
• Unsupervised learning algorithm
• Semi-supervised learning algorithm
• Reinforcement learning algorithm
However, the most commonly used ones
are supervised and unsupervised learning.

Supervised Learning

Supervised learning is commonly used in real world applications,
such as face and speech recognition, product or movie
recommendations, and sales forecasting. Supervised learning can be
further classified into two types: Regression and Classification.

Regression trains on and predicts a continuous-valued response, for
example predicting real estate prices.

Classification attempts to find the appropriate class label, such as
analyzing positive/negative sentiment, male and female persons,
benign and malignant tumors, secure and unsecure loans, etc.

In supervised learning, learning data comes with descriptions, labels,
targets or desired outputs, and the objective is to find a general rule
that maps inputs to outputs. This kind of learning data is
called labeled data. The learned rule is then used to label new data
with unknown outputs.

Supervised learning involves building a machine learning model that
is based on labeled samples. For example, if we build a system to
estimate the price of a plot of land or a house based on various
features, such as size, location, and so on, we first need to create a
database and label it. We need to teach the algorithm what features
correspond to what prices. Based on this data, the algorithm will
learn how to calculate the price of real estate using the values of the
input features.

Supervised learning deals with learning a function from available
training data. Here, a learning algorithm analyzes the training data
and produces a derived function that can be used for mapping new
examples. There are many supervised learning algorithms, such as
Logistic Regression, Neural Networks, Support Vector Machines
(SVMs), and Naive Bayes classifiers.
Common examples of supervised learning include classifying e-mails
into spam and not-spam categories, labeling webpages based on
their content, and voice recognition.

Unsupervised Learning

Unsupervised learning is used to detect anomalies and outliers, such as
fraud or defective equipment, or to group customers with similar
behaviors for a sales campaign. It is the opposite of supervised
learning: there is no labeled data here.

When learning data contains only some indications without any
descriptions or labels, it is up to the coder or to the algorithm to find
the structure of the underlying data, to discover hidden patterns, or
to determine how to describe the data. This kind of learning data is
called unlabeled data.

Suppose that we have a number of data points, and we want to
classify them into several groups. We may not exactly know what
the criteria of classification would be. So, an unsupervised learning
algorithm tries to classify the given dataset into a certain number of
groups in an optimum way.

Unsupervised learning algorithms are extremely powerful tools for
analyzing data and for identifying patterns and trends. They are
most commonly used for clustering similar input into logical groups.
Unsupervised learning algorithms include K-means, Random Forests,
hierarchical clustering and so on.

Semi-supervised Learning

If some learning samples are labeled but some others are not,
then it is semi-supervised learning. It makes use of a large
amount of unlabeled data for training and a small amount
of labeled data for testing. Semi-supervised learning is applied in
cases where it is expensive to acquire a fully labeled dataset while it
is more practical to label a small subset. For example, it often requires
skilled experts to label certain remote sensing images, and lots of
field experiments to locate oil at a particular location, while
acquiring unlabeled data is relatively easy.

Reinforcement Learning

Here learning data gives feedback so that the system adjusts to
dynamic conditions in order to achieve a certain objective. The
system evaluates its performance based on the feedback responses
and reacts accordingly. The best known instances include self-driving
cars and the Go-playing algorithm AlphaGo.

Training data and test data are two important concepts in machine
learning. This chapter discusses them in detail.

Training Data

The observations in the training set form the experience that the
algorithm uses to learn. In supervised learning problems, each
observation consists of an observed output variable and one or
more observed input variables.

Test Data

The test set is a set of observations used to evaluate the
performance of the model using some performance metric. It is
important that no observations from the training set are included in
the test set. If the test set does contain examples from the training
set, it will be difficult to assess whether the algorithm has learned to
generalize from the training set or has simply memorized it.

A program that generalizes well will be able to effectively perform a task


with new data. In contrast, a program that memorizes the training data
by learning an overly complex model could predict the values of the
response variable for the training set accurately, but will fail to predict
the value of the response variable for new examples. Memorizing the
training set is called over-fitting. A program that memorizes its
observations may not perform its task well, as it could memorize
relations and structures that are noise or coincidence.
Regularization may be applied to many models to reduce over-fitting.

In addition to the training and test data, a third set of observations,


called a validation or hold-out set, is sometimes required. The
validation set is used to tune variables called hyper parameters, which
control how the model is learned. It is common to partition a single set
of supervised observations into training, validation, and test sets. There
are no requirements for the sizes of the partitions, and they may vary
according to the amount of data available. It is common to allocate 50
percent or more of the data to the training set, 25 percent to the test
set, and the remainder to the validation set.
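
As a concrete illustration, the 50/25/25 partition described above can be
produced by applying scikit-learn's train_test_split twice; the arrays
below are placeholders standing in for any labeled dataset:

# Sketch: a 50/25/25 train/test/validation split using scikit-learn.
# X and y are placeholder arrays standing in for any labeled dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 samples, 2 features
y = np.arange(100)                  # 100 labels

# First split off 50% of the data for training.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0)

# Split the remainder in half: 25% test, 25% validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_val))  # 50 25 25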

During development, and particularly when training data is scarce, a


practice called cross-validation can be used to train and validate an
algorithm on the same data. In cross-validation, the training data is
partitioned. The algorithm is trained using all but one of the partitions,
and tested on the remaining partition. The partitions are then rotated
several times so that the algorithm is trained and evaluated on all of the
data.

Consider for example that the original dataset is partitioned into five
subsets of equal size, labeled A through E. Initially, the model is trained
on partitions B through E, and tested on partition A. In the next iteration,
the model is trained on partitions A, C, D, and E, and tested on partition
B. The partitions are rotated until models have been trained and tested
on all of the partitions. Cross-validation provides a more accurate
estimate of the model's performance than testing a single partition of the
data.
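
The rotation described above is exactly what scikit-learn's
cross_val_score performs; a minimal sketch, using a stock dataset and a
generic estimator as stand-ins:

# Sketch: 5-fold cross-validation, rotating the held-out partition
# (A through E in the text) so every sample is used for testing once.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score trains on 4 of the 5 partitions and tests on the
# remaining one, rotating until every partition has been the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across the five rotations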

Purpose of Machine Learning

Machine learning can be seen as a branch of AI or Artificial
Intelligence, since the ability to change experience into expertise or
to detect patterns in complex data is a mark of human or animal
intelligence.

As a field of science, machine learning shares common concepts
with other disciplines such as statistics, information theory, game
theory, and optimization.

As a subfield of information technology, its objective is to program
machines so that they will learn.

However, it is to be seen that the purpose of machine learning is
not building an automated duplication of intelligent behavior, but
using the power of computers to complement and supplement
human intelligence. For example, machine learning programs can
scan and process huge databases, detecting patterns that are beyond
the scope of human perception.

In the real world, we usually come across lots of raw data which is
not fit to be readily processed by machine learning algorithms. We
need to preprocess the raw data before it is fed into various
machine learning algorithms. This chapter discusses various
techniques for preprocessing data in Python machine learning.

Techniques Used in Machine Learning

• Classification
• Regression
• Recommendation
• Clustering

Classification:-

Classification is a machine learning technique that uses known data
to determine how the new data should be classified into a set of
existing categories.
Consider the following examples to understand the classification
technique.

In a hospital, the emergency room has more than 15 features (age,
blood pressure, heart condition, severity of ailment, etc.) to analyze
before deciding whether a given patient has to be put in an intensive
care unit, since it is a costly proposition and only those patients who can
survive and afford the cost are given top priority. The problem here
is to classify the patients into high-risk and low-risk patients based
on the available features or parameters.

While classifying a given set of data, the classifier system performs
the following actions (a minimal sketch of these actions in code follows
the applications list below):

• Initially, a new data model is prepared using any of the learning
algorithms.
• Then the prepared data model is tested.
• Later, this data model is used to examine the new data and to
determine its class.

Applications of Classification

• Detection of credit card fraud - The classification method is
used to predict credit card frauds. Employing historical records
of previous frauds, the classifier can predict which future
transactions may turn into frauds.
• E-mail spam - Depending on the features of previous spam
mails, the classifier determines whether a newly received e-
mail should be sent to the spam folder.
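
Here is the promised sketch of the three classifier actions (prepare a
model, test it, use it on new data), using a stock scikit-learn dataset as a
stand-in for any labeled data:

# Sketch of the three classifier actions listed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)            # 1. prepare the data model
print(clf.score(X_test, y_test))     # 2. test the prepared model
print(clf.predict(X_test[:5]))       # 3. classify new, unseen data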

Regression
In regression, the program predicts the value of a continuous output
or response variable. Examples of regression problems include
predicting the sales for a new product, or the salary for a job based
on its description. Similar to classification, regression problems
require supervised learning. In regression tasks, the program
predicts the value of a continuous output or response variable from
the input or explanatory variables.

Recommendation

Recommendation is a popular method that provides close
recommendations based on user information such as history of
purchases, clicks, and ratings. Google and Amazon use this method
to display a list of recommended items for their users, based on the
information from their past actions. There are recommender
engines that work in the background to capture user behavior and
recommend selected items based on earlier user actions. Facebook
also uses the recommender method to identify and recommend
people and send friend suggestions to its users.

Clustering

Groups of related observations are called clusters. A common
unsupervised learning task is to find clusters within the training
data.

We can also define clustering as a procedure to organize items of a
given collection into groups based on some similar features. For
example, online news publishers group their news articles using
clustering.
Clustering finds applications in many fields, such as market research,
pattern recognition, data analysis, and image processing, as
discussed here:

• Helps marketers to discover distinct groups in their customer
base and characterize their customer groups based on
purchasing patterns.
• In biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionality and
gain insight into structures inherent in populations.
• Helps in identification of areas of similar land use in an earth
observation database.
• Helps in classifying documents on the web for information
discovery.
• Used in outlier detection applications such as detection of
credit card fraud.
• Cluster analysis serves as a data mining tool to gain
insight into the distribution of data and to observe the
characteristics of each cluster.

List of Common Machine Learning Algorithms

Here is a list of commonly used machine learning algorithms that
can be applied to almost any data problem:

• Linear Regression
• Logistic Regression
• Decision Tree
• Random Forest
• Gradient Boosting algorithms like GBM, XGBoost, LightGBM and
CatBoost
This section discusses each of them in detail.

Linear Regression

Linear regression is used to estimate real world values like cost of
houses, number of calls, total sales, etc. based on continuous
variable(s). Here, we establish a relationship between dependent and
independent variables by fitting a best line. This line of best fit is
known as the regression line and is represented by the linear
equation Y = a*X + b.

In this equation:
Y – Dependent variable
a – Slope
X – Independent variable
b – Intercept

These coefficients a and b are derived by minimizing the sum of
squared differences of distance between the data points and the
regression line.
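
A small sketch of fitting Y = a*X + b with scikit-learn; the toy data
below is invented purely for illustration:

# Sketch: fitting the regression line Y = a*X + b by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable

model = LinearRegression().fit(X, Y)
print(model.coef_[0])        # a, the slope
print(model.intercept_)      # b, the intercept
print(model.predict([[6]]))  # predicted Y for a new X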
The best way to understand linear regression is by considering an
example.

Example:-

Suppose we are asked to arrange the students in a class in the increasing
order of their weights. By looking at the students and visually
analyzing their heights and builds, we can arrange them as required
using a combination of these parameters, namely height and build.
This is a real world linear regression example. We have figured out
that height and build have a correlation to weight through a
relationship, which looks similar to the equation above.
When we consider linear regression, the data points and the fitted
regression line can be observed on a graph.

Logistic Regression

Logistic regression is another technique borrowed by machine
learning from statistics. It is the preferred method for binary
classification problems, that is, problems with two class values.

It is a classification algorithm and not a regression algorithm, despite
the name. It is used to estimate discrete values (like 0/1, Y/N, T/F)
based on a given set of independent variable(s). It predicts
the probability of occurrence of an event by fitting data to a logit
function. Hence, it is also called logit regression. Since it predicts
a probability, its output values lie between 0 and 1.

Example

Let us understand this algorithm through a simple example.

Assume that there is a puzzle to solve that has only 2 outcome
scenarios: either there is a solution or there is none. Now suppose
we have a wide range of puzzles with which to test which subjects a
person is good at. The outcomes may be something like this: if a
trigonometry puzzle is given, a person may be 80% likely to solve it.
On the other hand, if a geography puzzle is given, the person may be
only 20% likely to solve it. This is where logistic regression helps.
As per the math, the log odds of the outcome are expressed as a
linear combination of the predictor variables.
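
A minimal sketch of this idea in scikit-learn, with an invented
puzzle-difficulty dataset; the numbers are illustrative only:

# Sketch: logistic regression on invented puzzle-solving data.
# Feature: difficulty score of a puzzle; label: 1 if solved, 0 if not.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1, 1, 1, 1, 0, 1, 0, 0])  # easier puzzles solved more often

model = LogisticRegression().fit(X, y)
# predict_proba returns probabilities between 0 and 1, as described above.
print(model.predict_proba([[2]])[0, 1])  # high chance for an easy puzzle
print(model.predict_proba([[8]])[0, 1])  # low chance for a hard one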

Decision Tree Algorithm

It is a supervised learning algorithm that is mostly used for
classification problems. It works for both discrete and continuous
dependent variables. In this algorithm, we split the population into
two or more homogeneous sets. This is done based on the most
significant attributes to make the groups as distinct as possible.

Decision trees are used widely in machine learning, covering both
classification and regression. In decision analysis, a decision tree is
used to visually and explicitly represent decisions and decision
making. It uses a tree-like model of decisions.
A decision tree is drawn with its root at the top and branches at the
bottom. In the image, the bold text represents a condition/internal
node, based on which the tree splits into branches/edges. The
branch end that doesn't split anymore is the decision/leaf.

Example

Consider an example of using the Titanic data set to predict whether
a passenger will survive or not. The model below uses 3 features/
attributes/columns from the data set, namely sex, age and sibsp
(number of spouses/children). In this case, whether the passenger died
or survived is represented as red and green text respectively.
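
A hedged sketch of such a model in scikit-learn; "titanic.csv" and its
column names are assumptions about how the data might be stored locally,
not a file provided with this report:

# Sketch: a decision tree on the three Titanic features named above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("titanic.csv").dropna(subset=["Sex", "Age", "SibSp"])
df["Sex"] = (df["Sex"] == "male").astype(int)  # encode sex numerically

X = df[["Sex", "Age", "SibSp"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(tree.score(X_test, y_test))  # survival-prediction accuracy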

Random Forest

Random Forest is a popular supervised ensemble learning algorithm.
'Ensemble' means that it takes a bunch of 'weak learners' and has
them work together to form one strong predictor. In this case, the
weak learners are all randomly implemented decision trees that are
brought together to form the strong predictor: a random forest.

Random Forest is a trademark term for an ensemble of decision
trees. In Random Forest, we have a collection of decision trees,
known as the "forest". To classify a new object based on its attributes,
each tree gives a classification and we say the tree "votes" for that class.
The forest chooses the classification having the most votes (over all
the trees in the forest).
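
A minimal sketch of this voting ensemble in scikit-learn, using a stock
dataset as a stand-in for any classification problem:

# Sketch: a random forest as a voting ensemble of decision trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 randomly grown trees; the forest returns the majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))  # class chosen by the most tree "votes"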

Artificial Intelligence (AI) and Machine Learning are
everywhere. Chances are that you are using them without even being
aware of it. In Machine Learning (ML), computers, software,
and devices perform via cognition similar to the human brain.
Typical successful applications of machine learning include programs
that decode handwritten text, face recognition, voice recognition,
speech recognition, pattern recognition, spam detection programs,
weather forecasting, stock market analysis and predictions, and so
on. This chapter discusses these applications in detail.
MACHINE LEARNING USING PYTHON

Why Python is the most popular language used for Machine Learning
The Beginning

Back in 1991 when Guido van Rossum released Python as his


side project, he didn’t expected that it would be the world’s
fastest growing computer language in the near future. If we
follow the trends, Python turns out as a goto language for fast
prototyping.

Picture Credits: Stack Overflow

A statement from the StackOverflow developer survey 2017

“Python shot to the most wanted language this year”.

Why this trend?

If we look at the philosophy of the Python language, you can say


that this language was built for its readability and low
complexity. You can easily understand it and make someone
else understand it very fast. You can read it for yourself; just
use the following command in Python: import this

Also, Python wins the hearts of users. According to the
HackerRank 2018 developer survey (https://
research.hackerrank.com/developer-skills/2018/): "JavaScript
may be the most in-demand language by employers, but Python
wins the heart of developers across all ages, according to our
Love-Hate index."

Why in Machine Learning?

Now let’s understand why would anyone want to use only


Python in designing any Machine Learning project. Machine
learning, in layman's terms, is to use data to make a machine
take intelligent decisions. For example, you can build a spam
detection algorithm where the rules can be learned from the
data, or an anomaly detection of rare events by looking at
previous data, or arranging your email based on tags you had
assigned by learning from email history, and so on.

Machine learning is nothing but recognising patterns in your
data.

An important task of a Machine learning engineer in his/her


work life is to extract, process, define, clean, arrange and then
understand the data to develop intelligent algorithms.

So why would a Machine Learning engineer/Computer Vision
engineer like me, or a budding Data Scientist/Machine Learning/
Algorithm Engineer/Deep Learning engineer, recommend
Python? Because it's easy to understand.

Sometimes the concepts of linear algebra and calculus are so
complex that they take the maximum amount of effort. A quick
implementation in Python helps an ML engineer validate an
idea.

Data is the key

So it totally depends on the type of the task where you want to


apply Machine learning. I work in computer vision projects, so
the input data for me is an image or video. For someone else it
would be a series of points over time, a collection of language
documents spread across various domains, audio files, or just
some numbers.

Imagine everything that exists around you is data. And it's
raw, unstructured, bad, incomplete, large. How can Python
tackle all of that? Let's see.

Packages, Packages everywhere!

Yes, you guessed it right. It's the collection and code stack of
various open source repositories, developed by people
(and still in progress) to continuously improve upon existing
methods.

Want to work with images — numpy, opencv, scikit

Want to work in text — nltk, numpy, scikit

Want to work in audio — librosa

Want to solve machine learning problem — pandas, scikit

Want to see the data clearly — matplotlib, seaborn, scikit

Want to use deep learning — tensorflow, pytorch

Want to do scientific computing — scipy

Want to integrate web applications — Django


Want to take a shower …. Well

The best thing about using these packages is that they have
almost no learning curve. Once you have a basic understanding of
Python, you can just use them. They are free to use under
open-source licenses. Just import the package and use it.

If you do not want to use any of them, you can easily implement
the functionality from scratch (which most developers do).
Picture Credits: XKCD

Yes it’s not fast and takes more space but …

The main (or only) reason why Python will never be
used universally is the overhead it brings. But to be
clear, it was never built for the system but for
usability. Small processors or low-memory hardware won't
accommodate a Python codebase today, but for such cases we
have C and C++ as our development tools.

In my case, when we implement an algorithm (a neural network)
for a particular task, we use Python (TensorFlow). But for
deployment in real systems where speed matters, we switch to
C.

Easter egg: Cython has been in development for many years [https://
en.wikipedia.org/wiki/Cython]. You get the readability of
Python, but the efficiency of C.

Okay, enough of the talk; now show me the way.

Now we know the why. Let's see the how.


• Understand the basic concepts of data structures.

Before jumping into any field of computer science, it's very
important to understand how the machine perceives data.
The atomic unit of value in C is 1 byte. Using the same byte we
can encode each input from the universe. If I were to make a list
of things to do first, it would be: go through each and every
implementation of the basic data structures. This tutorial
(https://www.geeksforgeeks.org/data-structures/) would be a
good starting point.
• Learn Python the hard way.

Once you get an understanding of the basics, jump into the tutorial
series Learn Python the Hard Way by Zed Shaw. One of the
statements from the book tells you that the hard way is easier.
The foundation should always be strong.

(Learn Python Hard Way)


• Machine Learning — Implementation matters.

Implementing a clustering algorithm will give you more
insight into the problem than just reading about the
algorithm. When a user implements things in Python,
it is much faster to prototype the code and test it.
One simple case of K-means clustering is explained in the following
blog: K means in Python
• Simplicity is the best

Whenever you implement a piece of code, always keep in mind
that an equivalent optimised version is always possible. Keep asking
your peers whether they can understand the underlying
functionality just by seeing the code. Use of meaningful
variables, modularity of code, comments, and no hard coding are
key areas which make a piece of code complete.
What about others?

The world's most popular frameworks for data scientists are Excel
and SAS. The problem with using them is that they can't handle large
datasets and have less community support for a wide variety of usage;
i.e., you can't use Excel to handle a company's raw data.

MATLAB also provides great libraries and packages for specific
tasks of image analysis. You can find a great number of toolboxes
for a given task. The main con of using MATLAB is that it is
very slow (execution time is slow). It can't be used in
deployment, but only for prototyping. Also, it's not free to use,
unlike Python, which is open.

Another great tool is R. It’s open source, free and made for
statistical analysis. In my view, Python is a great tool for the
development of programs which perform data manipulation
whereas R is a statistical software which works on a particular
format of dataset. Python provides the various development
tools which can be used to work with other systems.

R has a learning curve to it. The predefined functions need
predefined input. In Python you can play around with the data.

Well, if we focus on the overall task needed to train,
validate and test the models: as long as it satisfies the aim of
the problem, any language/tool/framework can be used, be it
extracting raw data from an API, analyzing it, doing an in-depth
visualization, or making a classifier for the given task.

There are a number of reasons why the Python programming


language is popular with professionals who work on machine
learning systems.
One of the most commonly cited reasons is the syntax of
Python, which has been described as both “elegant” and also
“math-like.” Experts point out that the semantics of Python
have a particular correspondence to many common
mathematical ideas, so that it doesn't take as much of a learning
curve to apply those mathematical ideas in the Python
language.
Python is also often described as simple and easy to learn,
which is a big part of its appeal for any applied use, including
machine learning systems. Some programmers describe Python
as having a favorable “complexity/performance trade-off” and
describe how using Python is more intuitive than some other
languages, because of its accessible syntax.
Other users point out that Python also has particular tools that
are extremely helpful in working with machine learning
systems. Some cite an array of frameworks and libraries, along
with extensions like NumPy, where these accessories make
Python
tasks easier to implement. So the context of the programming
language itself is also important in its popularity for these
applied uses. Another resource is a scikit module called
“machine learning in Python,” which can guide professionals
toward using Python in this capacity.
Python is described favorably for machine learning in
comparison to languages like Java, Ruby on Rails, C or Perl.
Where some might use other languages for “hard-coding” and
describe Python as a “toy language” that’s accessible to basic
users, many see Python as a fully functional alternative to
dealing with the cryptic syntax of some other languages.
Some point out that ease of use makes for better collaborative
coding and implementation, and that as a general-purpose
language, Python can do a lot of things easily, which helps with
a complex set of machine learning tasks. All of this makes
Python a frequently sought-after language skill in the tech
world. Another benefit is broad support: because so many
people view Python as a standard, the support community is
large, which builds Python's popularity even more.

Neural Networks
Motivation: As part of my personal journey to gain a better understanding
of Deep Learning, I’ve decided to build a Neural Network from scratch
without a deep learning library like TensorFlow. I believe that
understanding the inner workings of a Neural Network is important to any
aspiring Data Scientist.

This article contains what I’ve learned, and hopefully it’ll be useful for
you as well!

What’s a Neural Network?


Most introductory texts to Neural Networks brings up brain analogies
when describing them. Without delving into brain analogies, I find it easier
to simply describe Neural Networks as a mathematical function that maps
a given input to a desired output.

Neural Networks consist of the following components

• An input layer, x
• An arbitrary amount of hidden layers

• An output layer, ŷ

• A set of weights and biases between each layer, W and b

• A choice of activation function for each hidden layer, σ. In this


tutorial, we’ll use a Sigmoid activation function.
The diagram below shows the architecture of a 2-layer Neural Network
(note that the input layer is typically excluded when counting the number
of layers in a Neural Network)
Architecture of a 2-layer Neural Network
Creating a Neural Network class in Python is easy.
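
The code listing that originally accompanied this passage was not preserved
in this copy. The following is a reconstruction of a plausible class skeleton
based on the description above (one hidden layer of 4 neurons, random
initial weights, biases assumed to be 0); it is a sketch, not the author's
exact code:

# Reconstruction (not the original listing) of a 2-layer network class
# matching the description above; biases are assumed to be 0.
import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)  # W1
        self.weights2 = np.random.rand(4, 1)                    # W2
        self.y = y
        self.output = np.zeros(self.y.shape)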

Training the Neural Network

The output ŷ of a simple 2-layer Neural Network is:

ŷ = σ(W₂ · σ(W₁ · x + b₁) + b₂)

You might notice that in the equation above, the weights W and the biases
b are the only variables that affect the output ŷ.

Naturally, the right values for the weights and biases determines the
strength of the predictions. The process of fine-tuning the weights and
biases from the input data is known as training the Neural Network.
Each iteration of the training process consists of the following steps:

• Calculating the predicted output ŷ, known as feedforward


• Updating the weights and biases, known as backpropagation

The sequential graph below illustrates the process.

As we’ve seen in the sequential graph above, feedforward is just simple


calculus and for a basic 2-layer neural network, the output of the Neural
Network is:

Let’s add a feedforward function in our python code to do exactly that.


Note that for simplicity, we have assumed the biases to be 0.
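
The original listing is likewise missing here; a minimal sketch of such a
feedforward method, continuing the class reconstruction above (so np and
NeuralNetwork are assumed to be defined there), might look like this:

def sigmoid(z):
    # The Sigmoid activation function chosen above.
    return 1.0 / (1.0 + np.exp(-z))

# Feedforward pass, with biases taken as 0 as noted above.
def feedforward(self):
    self.layer1 = sigmoid(np.dot(self.input, self.weights1))
    self.output = sigmoid(np.dot(self.layer1, self.weights2))

NeuralNetwork.feedforward = feedforward  # attach to the reconstructed class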

However, we still need a way to evaluate the “goodness” of our


predictions (i.e. how far off are our predictions)? The Loss Function
allows us to do exactly that.

Loss Function
There are many available loss functions, and the nature of our problem
should dictate our choice of loss function. In this tutorial, we'll use a
simple sum-of-squares error as our loss function:

Sum-of-Squares Error = Σ (y − ŷ)²

That is, the sum-of-squares error is simply the sum of the squared
differences between each predicted value and the actual value. The
difference is squared so that we measure the absolute value of the
difference.
Our goal in training is to find the best set of weights and biases that
minimizes the loss function.
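
In code, this loss is a one-liner; a small sketch with NumPy:

import numpy as np

def loss(y, y_hat):
    # Sum-of-squares error: sum of squared differences between the
    # actual values y and the predictions y_hat.
    return np.sum((y - y_hat) ** 2)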

Backpropagation
Now that we’ve measured the error of our prediction (loss), we need to
find a way to propagate the error back, and to update our weights and
biases.

In order to know the appropriate amount to adjust the weights and biases
by, we need to know the derivative of the loss function with respect to
the weights and biases.

Recall from calculus that the derivative of a function is simply the slope of
the function.

Gradient descent algorithm


If we have the derivative, we can simply update the weights and biases by
increasing/reducing them with it (refer to the diagram above). This is
known as gradient descent.
However, we can’t directly calculate the derivative of the loss function
with respect to the weights and biases because the equation of the loss
function does not contain the weights and biases. Therefore, we need the
chain rule to help us calculate it.

Chain rule for calculating derivative of the loss function with respect to the
weights. Note that for simplicity, we have only displayed the partial
derivative assuming a 1-layer Neural Network.
Phew! That was ugly but it allows us to get what we needed — the
derivative (slope) of the loss function with respect to the weights, so that
we can adjust the weights accordingly.

Now that we have that, let's add the backpropagation function into our
Python code.
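
The original backpropagation listing is also missing from this copy. The
sketch below continues the reconstruction above (np, sigmoid and
NeuralNetwork come from the earlier sketches): it applies the chain rule
described in the caption that follows, then takes the gradient-descent
step. The small training set at the end is invented here for illustration,
since the article's own example table was not preserved:

def sigmoid_derivative(z):
    # Derivative of the sigmoid, taking an already-activated value.
    return z * (1.0 - z)

def backprop(self):
    # Chain rule: d(Loss)/dW involves 2*(y - y_hat) * sigmoid'(output),
    # propagated back through each layer.
    d_weights2 = np.dot(self.layer1.T,
                        2 * (self.y - self.output) * sigmoid_derivative(self.output))
    d_weights1 = np.dot(self.input.T,
                        np.dot(2 * (self.y - self.output) * sigmoid_derivative(self.output),
                               self.weights2.T) * sigmoid_derivative(self.layer1))
    # Update the weights with the slope of the loss function.
    self.weights1 += d_weights1
    self.weights2 += d_weights2

NeuralNetwork.backprop = backprop  # attach to the reconstructed class

# Illustrative training data (invented here): three binary inputs,
# one binary output.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(X, y)
for _ in range(1500):   # 1500 iterations, as in the text below
    nn.feedforward()
    nn.backprop()
print(nn.output)        # predictions converge toward y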

For a deeper understanding of the application of calculus and the chain


rule in backpropagation, I strongly recommend this tutorial by
3Blue1Brown.

Putting it all together

Now that we have our complete python code for doing feedforward and
backpropagation, let’s apply our Neural Network on an example and see
how well it does.
Our Neural Network should learn the ideal set of weights to represent this
function. Note that it isn’t exactly trivial for us to work out the weights just
by inspection alone.

Let’s train the Neural Network for 1500 iterations and see what happens.
Looking at the loss per iteration graph below, we can clearly see the loss
monotonically decreasing towards a minimum. This is consistent with
the gradient descent algorithm that we’ve discussed earlier.

Let’s look at the final prediction (output) from the Neural Network after

1500 iterations.
Predictions after 1500 training iterations
We did it! Our feedforward and backpropagation algorithm trained the
Neural Network successfully and the predictions converged on the true
values.

Note that there’s a slight difference between the predictions and the actual
values. This is desirable, as it prevents overfitting and allows the Neural
Network to generalize better to unseen data.

What’s Next?
Fortunately for us, our journey isn’t over. There’s still much to learn about
Neural Networks and Deep Learning. For example:

• What other activation function can we use besides the Sigmoid


function?
• Using a learning rate when training the Neural Network

• Using convolutions for image classification tasks

I’ll be writing more on these topics soon, so do follow me on Medium and


keep an eye out for them!

Final Thoughts
I’ve certainly learnt a lot writing my own Neural Network from scratch.

Although Deep Learning libraries such as TensorFlow and Keras make it

easy to build deep nets without fully understanding the inner workings of a
Neural Network, I find that it's beneficial for aspiring data scientists to gain
a deeper understanding of Neural Networks.

This exercise has been a great investment of my time, and I hope that it’ll
be useful for you as well!
PREDICTING HOUSE PRICES FOR REGIONS IN
A COUNTRY
Introduction:
The real estate sector in India is expected to reach US$ 650 billion, and its
share in India's Gross Domestic Product (GDP) is projected to increase
to 17 per cent by 2040. The emergence of nuclear families, rapid
urbanisation and rising household income are likely to remain the key
drivers for growth in all spheres of real estate, including residential,
commercial and retail. Rapid urbanisation in the country is pushing
the growth of real estate. More than 70 per cent of India's GDP will
be contributed by the urban areas by 2020.

Cross-border capital inflows to India's real estate sector have
increased 600 per cent during 2012-17 to reach US$ 2.6 billion. In
2017, India ranked 19th out of 73 countries in attracting cross-border
capital to its property market. Private Equity and Venture Capital
investments in the sector reached US$ 4.47 billion in 2018.
Between 2015 and March 2018, the retail segment in Indian realty
attracted private equity investments of around Rs 5,500 crore (US$
853.4 million).

Office space has been driven mostly by growth in ITeS/IT, BFSI,
consulting and manufacturing. Gross office absorption in top Indian
cities has increased 26 per cent year-on-year to 36.4 million square
feet between Jan-Sep 2018. Warehousing space is expected to reach
247 million square feet in 2020 and see investments of Rs 50,000
crore (US$ 7.76 billion) during 2018-20. Grade-A office space
absorption is expected to cross 700 million square feet by 2022, with
Delhi-NCR contributing the most to this demand.
Steps involved in building our model:

Every machine learning project follows the steps given below:

1. Understanding the problem

2. Overview of the data

3. Feeding the data

4. Splitting the dataset

5. Training the model

6. Testing the model

7. Reducing the amount of errors

8. Predicting the result

Understanding the problem:

When we have a problem, in order to solve it we first need to understand
what the problem is, and then decide what steps should be taken to solve
it. Our model also needs a clearly defined aim: here, it is to predict the
price of a house in a given region.

Overview of the data:

The next step is to look at the data we're working with. Realistically,
most of the data we will get, even from the government, can have errors,
and it's important to identify these errors before spending time analyzing
the data. Normally, we must answer the following questions:

❖ Do we find something wrong in the data?


❖ Are there ambiguous variables in the dataset?

❖ Are there variables that should be fixed or removed?


Let's start by reading the data using the pandas function read_csv() and
summarizing the dataset:
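A minimal sketch of this first look, assuming the dataset is stored in a file
named 'Indian_Housing.csv' (the name used in the code later in this report):

import pandas as pd

IndianHousing = pd.read_csv('Indian_Housing.csv')
print(IndianHousing.head())      # first few rows, to eyeball the values
IndianHousing.info()             # column types and any missing entries
print(IndianHousing.describe())  # summary statistics for each numeric column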

Data fields and their description:

Variable - Description

Avg. Area Income - Average income of residents of the city the house is located in
Avg. Area House Age - Average age of houses in the same city
Avg. Area Number of Rooms - Average number of rooms for houses in the same city
Avg. Area Number of Bedrooms - Average number of bedrooms for houses in the same city
Area Population - Population of the city the house is located in
Price - Price that the house sold at
Address - Address of the house
House Price Prediction Methodology:
After reading the data into our variable, we now need to split it.
Generally we have two types of data:

1. Training dataset

2. Testing dataset

Training dataset:
The training dataset is used to train the machine. We use almost 80% of
our data to train the machine. In training we provide the machine the
inputs as well as the outputs. For example, when we teach a student
addition we tell him that

1 + 2 = 3, 2 + 3 = 5, 6 + 7 = 13

Examples of this kind are given to the student, and the student learns
from them; when we then ask him and he replies correctly, we become sure
that the student has learned addition. Similarly, machines are trained
with huge amounts of data.

Testing dataset:
The testing dataset is used to test whether the machine is predicting up
to an acceptable level.

In the testing phase we give inputs to the machine, and if the machine
predicts them correctly then we become sure that our machine has learned.

Example: splitting the dataset into a training set and a testing set.

In our housing data, the five area-level feature columns (Avg. Area Income
through Area Population) are the independent variables (X), and the Price
column is the dependent variable (y).

This is how you create the training set and testing set.

# import the libraries we need
import pandas as pd

# check out the data
IndianHousing = pd.read_csv('Indian_Housing.csv')

# Training a Linear Regression Model
# split into the independent variables (X) and the dependent variable (y)
X = IndianHousing[['Avg. Area Income', 'Avg. Area House Age',
                   'Avg. Area Number of Rooms',
                   'Avg. Area Number of Bedrooms', 'Area Population']]
y = IndianHousing['Price']

# split into a training set and a testing set (here 60% train, 40% test)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=101)

Training our model:

A human being learns to find the solution to a problem by thinking of a
logic that will work out the answer; similarly, in the case of machines we
train them using some particular logic specific to the problem. We tried a
few logics and then chose the best one depending upon the predictions each
logic gives.

The logic which we have used is:

Linear regression:
Linear regression is a linear approach to modelling the relationship
between a scalar response (or dependent variable) and one or more
explanatory variables (or independent variables). The case of one
explanatory variable is called simple linear regression. For more than one
explanatory variable, the process is called multiple linear regression.[1]
This term is distinct from multivariate linear regression, where multiple
correlated dependent variables are predicted, rather than a single scalar
variable.[2]
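For our housing data, with five explanatory variables, the fitted model
therefore takes the form:

Price ≈ b0 + b1·(Avg. Area Income) + b2·(Avg. Area House Age)
          + b3·(Avg. Area Number of Rooms)
          + b4·(Avg. Area Number of Bedrooms) + b5·(Area Population)

where b0 is the intercept and b1 to b5 are the coefficients learned during
training.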

Creating the model:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

lm.fit(X_train, y_train)
# output: LinearRegression(copy_X=True, fit_intercept=True,
#                          n_jobs=1, normalize=False)
Model evaluation:

print(lm.intercept_)  # printing the intercept
# -2640159.79685

# one learned coefficient per feature
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])

Predictions from our model:

predictions = lm.predict(X_test)

Reducing the errors:

Now that we have trained the machine, the next step is to reduce as much
error as we can. We measure the error on the testing set with three common
regression metrics:

import numpy as np
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
# MAE: 82288.2225191
# MSE: 10460958907.2
# RMSE: 102278.829223
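For reference, with n test examples, actual prices y and predicted prices ŷ:

MAE  = (1/n) · Σ |y_i - ŷ_i|      (mean absolute error)
MSE  = (1/n) · Σ (y_i - ŷ_i)²     (mean squared error)
RMSE = √MSE                       (root mean squared error)

RMSE is expressed in the same units as the price itself, which makes it the
easiest of the three to interpret directly.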
Predicting the output:
With all the hurdles passed, the machine can finally predict the price of
a house.
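As a usage sketch, a single new house can then be priced as below; the
feature values here are invented purely for illustration:

# hypothetical feature values for one new house, in the same column order as X
new_house = pd.DataFrame(
    [[60000.0, 6.0, 7.0, 4.0, 35000.0]],
    columns=['Avg. Area Income', 'Avg. Area House Age',
             'Avg. Area Number of Rooms',
             'Avg. Area Number of Bedrooms', 'Area Population'])

predicted_price = lm.predict(new_house)
print('Predicted price:', predicted_price[0])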

Conclusion:

From a proper analysis of the positive points of and constraints on the
component, it can be safely concluded that the product is a highly
efficient component. This application is working properly and meets all
the stakeholders' requirements, and the component can easily be plugged
into many other systems. There have been a number of cases of computer
glitches and errors in content, and, most importantly, the weights of the
features are fixed in the automated prediction system; in the near future
the software could be made more secure and reliable and could support
dynamic weight adjustment. This prediction model could also be integrated
with an automated processing module. The system is currently trained on an
old training dataset; in future the software can be made such that new
testing data also becomes part of the training data after some fixed time.
