Digital Data Part 2
So, this is the first lecture on introduction to data science. Before we jump into
the techniques, I would like to introduce some interesting ways of looking at data
science and, in a broader context, understand what these techniques are doing
and how one should think about data science problems. One could teach these
techniques as a disparate set of methods for solving data science problems; however,
the critical thing is learning how to use these techniques for real problems,
which is what we call problem formulation. What we will do in this course is
introduce the participants to a data science problem-solving framework, in a very
short lecture, to give you a view of how one should think about general
data science problems and how you convert a problem that is not
well defined into something that is manageable using the techniques you
learn in this course.
So, let me start with this laundry list of techniques that people usually see when
they look at any curriculum for data science, or any website or book that talks
about data science. I have done some colour coding: the techniques that you will
see in this course are in green. The other techniques, which we will not be
teaching in this course, would be part of a more advanced course. So, there are
techniques such as regression analysis, K-nearest neighbours, K-means clustering,
logistic regression and principal component analysis, all of which you will see in
this course. Then people talk about predictive modelling, under which there are
techniques such as Lasso and Elastic net that you can learn.
Then there are techniques such as linear discriminant analysis, support vector
machines, decision trees and random forests, quadratic discriminant analysis,
the Naive Bayes classifier, hierarchical clustering, and many more such as deep
networks and so on. So, to get a general idea of data science, one might be
tempted to ask: if all of these techniques solve data science problems, then what
types of problems are really being solved? And once one understands the types of
problems that are being solved, the next logical question would be: do you need
so many techniques for the types of problems that you are trying to solve?
DATA SCIENCE PART 2
So, these would be typical questions that one might be interested in
answering. What I am going to do is give you my view of the types of problems
that are being solved and why there are so many techniques to solve them. Since
this is a first course on data science for engineers, we are going to cover the
major categories of problems that are of most interest to engineers. This is not
to say that other categories of problems do not exist or that they are not
interesting. We will keep this viewpoint in the background as we go through the
lecture materials of this course. Other than this, one could also think of
statistics as useful by itself for data science problems. Statistics is also
intricately embedded in data science techniques, both in the formulations
themselves and in characterizing the properties of the machine learning
techniques.
So, in my mind, fundamentally there are mainly two classes of problems that we
solve in data science. I am going to call these classification problems and
function approximation problems. Let us look at what classification problems
relate to: these are types of problems where you have data which are, in
general, labeled (I will explain what a label means), and whenever you get new
data you want to assign a label to that data.
So, the data science problem is the following: if I give you a new data point,
let us say (x1*, x2*, ..., xn*), the algorithm should be able to classify it and
say whether this point is likely to have come from class 1 or from class 2. So,
assigning a label to this new data point, in terms of the likelihood of the data
having come from either class 1 or class 2, is the classification problem. For
example, if you assign the likelihood of the point coming from class 1 as 0.9
and from class 2 as 0.1, then one would make the judgment that this data point
is likely to belong to class 1.
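The decision rule just described can be sketched in a few lines of code. This is only an illustration of the final labelling step: the likelihood values here are hypothetical placeholders, whereas in practice they would come from a trained classifier.

```python
# A minimal sketch of the decision rule above: given the estimated likelihoods
# of a new point belonging to each class, assign the label with the higher
# likelihood. The probabilities are hypothetical, not from a real model.

def assign_label(p_class1, p_class2):
    """Return the more likely class for a new data point."""
    return "class 1" if p_class1 >= p_class2 else "class 2"

print(assign_label(0.9, 0.1))  # class 1, matching the example in the text
```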
Now, let us see how this is useful in a real problem. I will give you two
examples. One example is something that people talk about all the time nowadays,
which is called fraud detection. Let us take one particular case of fraud
detection: whenever we use our credit card to buy something, the credit card
gets charged. Let us say there are certain characteristics of every transaction
that you record, such as the amount, the time of day the transaction is made,
the place from which the transaction is made, the type of product that is
bought, and so on. You can think of many such attributes. Let us say those are
the attributes that characterize every single transaction.
Let us assume that there are many people making transactions and you have the
transactions listed like this. You find that, of these, some were actually
fraudulent transactions, that is, transactions that were not legitimate or were
not made by the person who owns the credit card, and the rest were legitimate
transactions. This is something that you label by exploring each transaction
which you think might not be right; when you find out that a transaction was
indeed not legitimate, you put it into the basket of fraudulent transactions.
In cases where a transaction has a very high likelihood of being fraudulent, the
company could call the card holder and say: we saw that your credit card was
used in such and such a place, at such and such a time, for buying such and such
a thing; did you actually make this transaction? If you have gone on vacation to
a remote place and made this transaction, you tell them that you are at this
place on vacation and the transaction is genuine. If not, the transaction is
found to be fraudulent and the payment is stopped. So, this is one example of
how a binary classification problem is useful in real life.
Now, when we talked about this data and the binary classification problem, we
talked about just two classes, but in reality there could be problems with
multiple classes. One very good engineering example would be fault diagnosis, or
prediction of failures. You might have certain equipment, a pump or a compressor
or a distillation column, whatever the equipment might be, and the working of
that equipment is characterized by several attributes: how much power it draws,
what performance it gives, whether there is vibration or noise, what the
temperature is, and so on.
So, you could have engineering data x which, let us say, describes the
characteristics of a pump; the operation of the pump is characterized by several
attributes x1 to xn. If you have legacy or historical data, where you have been
operating pumps for years and years, then you know that if these variables take
values in this block, everything is fine with the pump.
So, I write n for normal. Then you could have a block of data that was recorded
whenever there is a particular type of fault in the pump; let me call this fault
f1. You could have another block of data which was seen when there is fault f2,
and so on. We will just stick to two faults, f1 and f2, and assume these are the
only two failure modes possible. Now you start operating the pump, at some point
you get new data, and then you ask the following question: based on this data,
would it be possible for me to say whether the pump is operating normally, or
whether failure mode 1 is the current situation of the pump, or failure mode 2?
So, in this case you see that there are three classes: n, f1 and f2. This is
what is called a multi-class problem. Again, when new data comes in, we want to
label it as normal, f1 or f2. If it is normal, you do not do anything. If it is
f1 and very severe, you stop the pump and fix it. If it is not very severe, you
let the maintenance team know that this pump is going to come up for maintenance,
and in the next shutdown of the plant this pump needs to be maintained. So, that
is how classification problems are very important in an engineering context.
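One simple way to sketch this multi-class labelling in code is a nearest-centroid rule: summarise each historical block (normal, f1, f2) by the mean of its attribute vectors and label a new reading by the closest mean. This is only one of many possible classifiers, and the attribute values below are made up for illustration; real pump data would replace them.

```python
import math

# Hypothetical centroids of the three historical data blocks, each summarised
# by two attributes: (power drawn, temperature).
centroids = {
    "n":  [5.0, 70.0],   # normal operation
    "f1": [8.0, 90.0],   # failure mode 1
    "f2": [3.0, 60.0],   # failure mode 2
}

def classify(reading):
    """Assign the label of the closest historical centroid."""
    return min(centroids, key=lambda label: math.dist(reading, centroids[label]))

print(classify([7.5, 88.0]))  # f1 -- this reading is closest to failure mode 1
```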
We will look at examples of both binary classification and multi-class
classification as we go through the series of lectures. In summary, one type of
problem that we are interested in in data science is classification, and these
two pictures here show the different types of challenges that we are going to
face when we look at classification problems. Problems where a linear equation
can be the decision function for classification are called linear classification
problems, or we say these problems are linearly separable. Here we show, in two
dimensions, a binary classification problem: all of these points could be class
1 and all of these could be class 2, and a line, a plane or a hyperplane could
be used to classify these data points.
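A linear decision function of this kind can be sketched as follows. The weights defining the line are illustrative assumptions, not fitted to any data; the point is only that the sign of a linear expression splits the plane into two half-spaces, one per class.

```python
# A minimal sketch of a linear decision function in two dimensions, assuming a
# separating line w1*x1 + w2*x2 + b = 0 has already been found. Points on one
# side of the line are assigned class 1, points on the other side class 2.

w1, w2, b = 1.0, -1.0, 0.0   # hypothetical line: x1 - x2 = 0

def linear_classify(x1, x2):
    score = w1 * x1 + w2 * x2 + b
    return "class 1" if score >= 0 else "class 2"

print(linear_classify(3.0, 1.0))  # class 1: the point lies in the x1 > x2 half-space
```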
More complicated problems are those where a hyperplane or a line might not be
enough for us to classify. Here is an example of a classification problem which
is non-linear. Let us assume that this data and this data belong to class 1, and
this data belongs to class 2. However you try to draw a line, it would be very
difficult, almost impossible in this case, to separate these two classes with
just a line in this 2-D picture. However, if your decision boundary, or the
function that you are going to use to classify, is of this form (you see the
difference between this and this: this is non-linear, this is linear), then we
could easily extend the concepts that we have learnt in terms of half-spaces and
so on to do classification for these types of problems using non-linear decision
boundaries.
So, you would say that if a point is on this side it is one class, and if it is
on the other side it is another class. One has to be very careful in defining
the equivalent ideas for non-linear decision boundaries, equivalent to the
linear case, and the minute you move from linear to non-linear there is a host
of other questions that come about. These questions are really related to what
type of non-linear function one should use.
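As a concrete sketch of a non-linear decision boundary, consider a circular one: the two classes are separated not by a line but by the curve x1² + x2² = r². This particular boundary is an assumed example, just one of the infinitely many non-linear forms mentioned above.

```python
# A sketch of a non-linear decision function: the boundary is a circle of
# radius R centred at the origin. Points inside the circle get one label,
# points outside get the other. The radius is an assumed value.

R = 2.0

def circular_classify(x1, x2):
    return "class 1" if x1**2 + x2**2 <= R**2 else "class 2"

print(circular_classify(0.5, 0.5))   # inside the circle  -> class 1
print(circular_classify(3.0, 0.0))   # outside the circle -> class 2
```

No single straight line could separate points inside the circle from points outside it, which is exactly the situation where a non-linear decision function becomes necessary.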
When we talk about linear classifiers, the linear functional form is fixed; it
is very simple. There is only one functional form, and you have to estimate its
parameters, of course, but you do not have to really think about what functional
form you are going to use. However, if it is a non-linear problem, then we
really need to choose a particular type of decision function. How do you choose
this decision function? The minute you go to the non-linear domain, there is an
infinite number of functional forms that you can choose from; how to choose one
that works for you is an interesting and important question that one needs to
answer.
So, that is as far as classification problems are concerned. Now let us move on
to the other type of problem that one solves in data science, which is what I
would call the function approximation problem. Again, I am showing function
approximation problems in a two-dimensional space here. So, I might have an
output and an input; in a general case we will have many inputs and many
outputs. This is the case of a single input and a single output. However, you
could have many attributes, with the output being a function of many attributes;
this is also a function approximation problem. Or you could have many outputs
which are functions of many attributes; this is also possible. Function
approximation is the task of finding these functions, and whenever we write a
function, it is typically parameterized by parameters. For example, if you take
one output, then we write y1 = f1(x1, x2, ..., xn), where x1 to xn are the
attribute values, and there will also usually be a set of parameters that you
have to use for that function, say p1, p2, ..., pr.
So, when I talk about a function approximation problem, the problem that we are
trying to solve is the following. We are given several samples of these outputs
and the corresponding attributes that resulted in them; this is the data that we
are going to talk about. Once I have a large amount of this data, how do I find
the functional form, and once I choose a functional form, how do I also identify
the parameters in that functional form? A simple example: if it is a linear
functional form, then I say y = a₀x + b₀, let us say.
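Fitting the parameters a₀ and b₀ of this linear form from data samples can be sketched with ordinary least squares, here written out by hand. The data below are synthetic, generated exactly from y = 2x + 1, so the recovered parameters can be checked; real attribute-output samples would be noisy.

```python
# A sketch of the simplest function approximation problem: fitting the linear
# form y = a0*x + b0 to samples of (x, y) by ordinary least squares.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # synthetic data: exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates of slope a0 and intercept b0.
a0 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - a0 * mean_x

print(a0, b0)  # 2.0 1.0 -- the parameters the data were generated from
```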
Now, the non-linear version of the problem, similar to the picture on the top,
is shown here. Here you want a non-linear surface, or a curve, that goes through
these points, with the points clustered around that curve. So, in summary, there
are really only two types of problems that we predominantly solve from an
engineering viewpoint using data science: classification problems and function
approximation problems.
So, if there are only two types of problems that we are really solving, one
might ask why there are so many techniques for solving them. One standard
question that comes up whenever someone does data science is whether a
particular technique is better than another: the proponent of one technique will
say theirs is the greatest technique, the proponent of another will say that
theirs is, and this debate keeps going on. So, I am going to give a slightly
different view of why we have so many techniques, which in some sense resolves
this question of which technique is better. To do this, let us do a thought
experiment.