Digital Data Part 2
So, this is the first lecture on introduction to data science. Before we jump into
the techniques, I would like to introduce some interesting ways of looking at data
science and, in a broader context, understand what these techniques are doing
and how one should think about data science problems. One could teach these
techniques as a disparate set of methods for solving data science problems; however,
the critical thing is learning how to use these techniques for real problems,
which is what we call problem formulation. What we will do in this course is
introduce the participants to a data science problem-solving framework, in a very
short lecture, to give you a view of how one should think about general
data science problems and how you convert a problem that is not
well defined into something that is manageable using the techniques you
learn in this course.
So, let me start with this laundry list of techniques that people usually see when
they look at any curriculum for data science, or any website or book that talks
about data science. I have done some colour coding: the techniques that you will
see in this course are in green. The other techniques, which we will not be
teaching in this course, would be part of a more advanced course. So, there are
techniques such as regression analysis, K-nearest neighbours, K-means clustering,
logistic regression and principal component analysis, all of which you will see in
this course. Then people talk about predictive modelling, under which there are
techniques such as Lasso and Elastic net that you can learn.
Then there are techniques such as linear discriminant analysis, support vector
machines, decision trees and random forests, quadratic discriminant analysis,
the Naive Bayes classifier, hierarchical clustering, and many more such as deep
networks and so on. So, to get a general idea of data science, one might be
tempted to ask: if all of these techniques solve data science problems, then what
types of problems are really being solved? And once one understands the types of
problems that are being solved, the next logical question would be: do you need
so many techniques for the types of problems that you are trying to solve?
DATA SCIENCE PART 2
So, these would be typical questions that one might be interested in
answering. What I am going to do is give you my view of the types of problems
that are being solved and why there are so many techniques to solve them. Since
this is a first course on data science for engineers, we are going to cover the
major categories of problems that are of most interest to engineers. This is not
to say that other categories of problems do not exist or that they are not
interesting. We will keep this viewpoint in the background as we go through the
lecture materials of this course. Other than this, one could also think of
statistics as useful by itself for data science problems. Statistics is also
intricately embedded in data science techniques, both in the formulations
themselves and in characterizing the properties of the machine learning
techniques.
So, in my mind, fundamentally there are mainly two classes of problems that we
solve in data science. I am going to call these classification problems and
function approximation problems. Let us look at what classification problems
relate to: these are types of problems where you have data which are, in
general, labeled (I will explain what a label means), and whenever you get new
data you want to assign a label to that data.
So, the data science problem is the following: if I give you a new data point,
let us say (x1*, x2*, ..., xn*), the algorithm should be able to classify it and
say whether this point is likely to have come from class 1 or from class 2. So,
assigning a label to this new data point, in terms of the likelihood of the data
having come from either class 1 or class 2, is the classification problem. For
example, if you assign the likelihood of the point coming from class 1 as 0.9
and from class 2 as 0.1, then one would make the judgment that this data point
is likely to belong to class 1.
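The decision rule just described can be sketched in a few lines of code. This is only an illustration of the final labelling step: the likelihood values here are hypothetical placeholders, whereas in practice they would come from a trained classifier.

```python
# A minimal sketch of the decision rule above: given the estimated likelihoods
# of a new point belonging to each class, assign the label with the higher
# likelihood. The probabilities are hypothetical, not from a real model.

def assign_label(p_class1, p_class2):
    """Return the more likely class for a new data point."""
    return "class 1" if p_class1 >= p_class2 else "class 2"

print(assign_label(0.9, 0.1))  # class 1, matching the example in the text
```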
Now, let us see how this is useful in a real problem. I will give you two
examples. One example is something that people talk about all the time nowadays,
which is called fraud detection. Let us take one particular case of fraud
detection: whenever we use our credit card to buy something, the credit card
gets charged. Let us say there are certain characteristics of every transaction
that you record, such as the amount, the time of day the transaction is made,
the place from which the transaction is made, the type of product that is
bought, and so on. You can think of many such attributes. Let us say those are
the attributes that characterize every single transaction.
Let us assume that there are many people making transactions and you have the
transactions listed like this. You find that, of these, some were actually
fraudulent transactions, that is, transactions that were not legitimate or were
not made by the person who owns the credit card, and the rest were legitimate
transactions. This is something that you label by exploring each transaction
which you think might not be right; when you find out that a transaction was
indeed not legitimate, you put it into the basket of fraudulent transactions.
In cases where a transaction has a very high likelihood of being fraudulent, the
company could call the card holder and say: we saw that your credit card was
used in such and such a place, at such and such a time, for buying such and such
a thing; did you actually make this transaction? If you have gone on vacation to
a remote place and made this transaction, you tell them that you are at this
place on vacation and the transaction is genuine. If not, the transaction is
found to be fraudulent and the payment is stopped. So, this is one example of
how a binary classification problem is useful in real life.
Now, when we talked about this data and the binary classification problem, we
talked about just two classes, but in reality there could be problems with
multiple classes. One very good engineering example would be fault diagnosis, or
prediction of failures. You might have certain equipment, a pump or a compressor
or a distillation column, whatever the equipment might be, and the working of
that equipment is characterized by several attributes: how much power it draws,
what performance it gives, whether there is vibration or noise, what the
temperature is, and so on.
So, you could have engineering data x which, let us say, describes the
characteristics of a pump; the operation of the pump is characterized by several
attributes x1 to xn. If you have legacy or historical data, where you have been
operating pumps for years and years, then you know that if these variables take
values in this block, everything is fine with the pump.
So, I write n for normal. Then you could have a block of data that was recorded
whenever there is a particular type of fault in the pump; let me call this fault
f1. You could have another block of data which was seen when there is fault f2,
and so on. We will just stick to two faults, f1 and f2, and assume these are the
only two failure modes possible. Now you start operating the pump, at some point
you get new data, and then you ask the following question: based on this data,
would it be possible for me to say whether the pump is operating normally, or
whether failure mode 1 is the current situation of the pump, or failure mode 2?
So, in this case you see that there are three classes: n, f1 and f2. This is
what is called a multi-class problem. Again, when new data comes in, we want to
label it as normal, f1 or f2. If it is normal, you do not do anything. If it is
f1 and very severe, you stop the pump and fix it. If it is not very severe, you
let the maintenance team know that this pump is going to come up for maintenance,
and in the next shutdown of the plant this pump needs to be maintained. So, that
is how classification problems are very important in an engineering context.
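One simple way to sketch this multi-class labelling in code is a nearest-centroid rule: summarise each historical block (normal, f1, f2) by the mean of its attribute vectors and label a new reading by the closest mean. This is only one of many possible classifiers, and the attribute values below are made up for illustration; real pump data would replace them.

```python
import math

# Hypothetical centroids of the three historical data blocks, each summarised
# by two attributes: (power drawn, temperature).
centroids = {
    "n":  [5.0, 70.0],   # normal operation
    "f1": [8.0, 90.0],   # failure mode 1
    "f2": [3.0, 60.0],   # failure mode 2
}

def classify(reading):
    """Assign the label of the closest historical centroid."""
    return min(centroids, key=lambda label: math.dist(reading, centroids[label]))

print(classify([7.5, 88.0]))  # f1 -- this reading is closest to failure mode 1
```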
We will look at examples of both binary classification and multi-class
classification as we go through the series of lectures. In summary, one type of
problem that we are interested in in data science is classification, and these
two pictures here show the different types of challenges that we are going to
face when we look at classification problems. Problems where a linear equation
can be the decision function for classification are called linear classification
problems, or we say these problems are linearly separable. Here we show, in two
dimensions, a binary classification problem: all of these points could be class
1 and all of these could be class 2, and a line, a plane or a hyperplane could
be used to classify these data points.
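A linear decision function of this kind can be sketched as follows. The weights defining the line are illustrative assumptions, not fitted to any data; the point is only that the sign of a linear expression splits the plane into two half-spaces, one per class.

```python
# A minimal sketch of a linear decision function in two dimensions, assuming a
# separating line w1*x1 + w2*x2 + b = 0 has already been found. Points on one
# side of the line are assigned class 1, points on the other side class 2.

w1, w2, b = 1.0, -1.0, 0.0   # hypothetical line: x1 - x2 = 0

def linear_classify(x1, x2):
    score = w1 * x1 + w2 * x2 + b
    return "class 1" if score >= 0 else "class 2"

print(linear_classify(3.0, 1.0))  # class 1: the point lies in the x1 > x2 half-space
```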
More complicated problems are those where a hyperplane or a line might not be
enough for us to classify. Here is an example of a classification problem which
is non-linear. Let us assume that this data and this data belong to class 1, and
this data belongs to class 2. However you try to draw a line, it would be very
difficult, almost impossible in this case, to separate these two classes with
just a line in this 2-D picture. However, if your decision boundary, or the
function that you are going to use to classify, is of this form (you see the
difference between this and this: this is non-linear, this is linear), then we
could easily extend the concepts that we have learnt in terms of half-spaces and
so on to do classification for these types of problems using non-linear decision
boundaries.
So, you would say that if a point is on this side it is one class, and if it is
on the other side it is another class. One has to be very careful in defining
the equivalent ideas for non-linear decision boundaries, equivalent to the
linear case, and the minute you move from linear to non-linear there is a host
of other questions that come about. These questions are really related to what
type of non-linear function one should use.
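As a concrete sketch of a non-linear decision boundary, consider a circular one: the two classes are separated not by a line but by the curve x1² + x2² = r². This particular boundary is an assumed example, just one of the infinitely many non-linear forms mentioned above.

```python
# A sketch of a non-linear decision function: the boundary is a circle of
# radius R centred at the origin. Points inside the circle get one label,
# points outside get the other. The radius is an assumed value.

R = 2.0

def circular_classify(x1, x2):
    return "class 1" if x1**2 + x2**2 <= R**2 else "class 2"

print(circular_classify(0.5, 0.5))   # inside the circle  -> class 1
print(circular_classify(3.0, 0.0))   # outside the circle -> class 2
```

No single straight line could separate points inside the circle from points outside it, which is exactly the situation where a non-linear decision function becomes necessary.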
When we talk about linear classifiers, the linear functional form is fixed; it
is very simple. There is only one functional form, and you have to estimate its
parameters, of course, but you do not have to really think about what functional
form you are going to use. However, if it is a non-linear problem, then we
really need to choose a particular type of decision function. How do you choose
this decision function? The minute you go to the non-linear domain, there is an
infinite number of functional forms that you can choose from; how to choose one
that works for you is an interesting and important question that one needs to
answer.
So, that is as far as classification problems are concerned. Now let us move on
to the other type of problem that one solves in data science, which is what I
would call the function approximation problem. Again, I am showing function
approximation problems in a two-dimensional space here. So, I might have an
output and an input; in a general case we will have many inputs and many
outputs. This is the case of a single input and a single output. However, you
could have many attributes, with the output being a function of many attributes;
this is also a function approximation problem. Or you could have many outputs
which are functions of many attributes; this is also possible. Function
approximation is the task of finding these functions, and whenever we write a
function, it is typically parameterized by parameters. For example, if you take
one output, then we write y1 = f1(x1, x2, ..., xn), where x1 to xn are the
attribute values, and there will also usually be a set of parameters that you
have to use for that function, say p1, p2, ..., pr.
So, when I talk about a function approximation problem, the problem that we are
trying to solve is the following. We are given several samples of these outputs
and the corresponding attributes that resulted in them; this is the data that we
are going to talk about. Once I have a large amount of this data, how do I find
the functional form, and once I choose a functional form, how do I also identify
the parameters in that functional form? A simple example: if it is a linear
functional form, then I say y = a₀x + b₀, let us say.
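Fitting the parameters a₀ and b₀ of this linear form from data samples can be sketched with ordinary least squares, here written out by hand. The data below are synthetic, generated exactly from y = 2x + 1, so the recovered parameters can be checked; real attribute-output samples would be noisy.

```python
# A sketch of the simplest function approximation problem: fitting the linear
# form y = a0*x + b0 to samples of (x, y) by ordinary least squares.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # synthetic data: exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates of slope a0 and intercept b0.
a0 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - a0 * mean_x

print(a0, b0)  # 2.0 1.0 -- the parameters the data were generated from
```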
Now, the non-linear version of the problem, similar to the picture on the top,
is shown here. Here you want a non-linear surface, or a curve, that goes through
these points, with the points clustered around that curve. So, in summary, there
are really only two types of problems that we predominantly solve from an
engineering viewpoint using data science: classification problems and function
approximation problems.
So, if there are only two types of problems that we are really solving, one
might ask why there are so many techniques for solving them. One standard
question that comes up whenever someone does data science is whether a
particular technique is better than another: the proponent of one technique will
say theirs is the greatest technique, the proponent of another will say that
theirs is, and this debate keeps going on. So, I am going to give a slightly
different view of why we have so many techniques, which in some sense resolves
this question of which technique is better. To do this, let us do a thought
experiment.