Module 1 21CS644 DSV
Table of Contents
Module - 1: Introduction to Data Science
1.1.1 Big Data and Data Science: hype and getting past the hype
1.1.2 Why now? Datafication
Q1. Define datafication and explain data science hype in big data. (6M)
Datafication is defined as "a process of taking all aspects of life and turning them into data." As an example, Google's augmented-reality glasses datafy the gaze.
Q2. Explain Drew Conway's Venn diagram of data science and discuss the role of the social scientist in data science. (10M)
Data science is the science of analyzing raw data using statistics and machine learning techniques with the purpose of drawing conclusions from that information. Data science involves the skills shown in Figure 1.1.1.
Drew Conway’s Venn diagram of data science in which data science is the intersection of
three sectors – Substantive expertise, hacking skills, and math & statistics knowledge.
In Drew Conway's Venn diagram of data science, the primary colours of data are:
• Hacking skills
• Math and stats knowledge
• Substantive expertise
It is well known that data is the key part of data science, and data is a commodity traded electronically; so, in order to be in this market, one needs to speak hacker.
What does this mean? Being able to manipulate text files at the command line, understanding vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker.
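To make "vectorized operations" concrete, here is a minimal sketch in Python with NumPy; the library choice and the data are assumptions for illustration, not something prescribed above:

# A plain Python loop versus a vectorized NumPy operation on the same data.
import numpy as np

prices = [12.5, 8.0, 15.25, 9.75]

# Loop version: apply a 10% discount element by element.
discounted_loop = []
for p in prices:
    discounted_loop.append(p * 0.9)

# Vectorized version: one expression over the whole array at once.
prices_arr = np.array(prices)
discounted_vec = prices_arr * 0.9

print(discounted_loop)
print(discounted_vec)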
Once you have collected and cleaned the data, the next step is to actually obtain
insight from it. In order to do this, you need to use appropriate mathematical and
statistical methods that demand at least a baseline familiarity with these tools.
The third component is substantive expertise. According to Drew Conway, "data plus math and statistics only gets you machine learning", which is excellent if that is what you are interested in, but it is not data science on its own.
Science is about experimentation and building knowledge, which demands some
motivating questions about the world and hypotheses that can be brought to data and
tested with statistical methods.
On the other hand, substantive expertise plus knowledge of mathematics and statistics is where most traditional researchers fall.
Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time acquiring technology skills. Part of this is the culture of academia, which does not reward researchers for knowing technology.
Role of the Social Scientist in Data Science
Both LinkedIn and Facebook are social-network companies. Oftentimes a description or definition of a data scientist includes hybrid statistician, software engineer, and social scientist. This made sense in the context of companies where the product was a social product, and it still makes sense when we are dealing with human or user behaviour.
Social scientists also do tend to be good question askers and have other good
investigative qualities, so a social scientist who also has the quantitative and
programming chops makes a great data scientist.
But it is almost a historical artifact to limit your conception of a data scientist to someone who works only with online user-behaviour data. There is another emerging field out there called computational social science, which could be thought of as a subset of data science.
Q3. Explain Harlan Harris's clustering and visualization of subfields, and discuss the role of the data scientist in academia and industry. (8M)
• Data Businesspeople are the product and profit-focused data scientists. They’re
leaders, managers, and entrepreneurs, but with a technical bent. A common
educational path is an engineering degree paired with an MBA.
• Data Creatives are jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open-source technologies.
• Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
• In academia, a data scientist is trained in some discipline, works with large amounts
of data, grapples with computational problems posed by the structure, size, messiness,
and the complexity and nature of the data, and solves real-world problems.
• In industry, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both the tools and methods of statistics and machine learning, as well as being human.
Data represents the traces of real-world processes, and exactly which traces we gather is decided by our data collection or sampling method. We then need a way to simplify those captured traces into something more comprehensible. This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
Statistical inference is the process of drawing conclusions about populations or scientific truths from data.
There are different types of statistical inference that are extensively used for drawing conclusions: one-sample hypothesis testing, confidence intervals, Pearson correlation, bivariate regression, multivariate regression, chi-square statistics and contingency tables, and ANOVA or t-tests.
The two primary goals of statistical inference are:
• Parameter Estimation
• Hypothesis Testing
Parameter Estimation
Parameter estimation is one primary goal of statistical inference. Parameters are quantified traits or properties of the population you are studying, and they can be deduced (estimated) from sample data. Examples include the population mean, the population variance, and so on. Imagine measuring each person in a town to compute the mean: this is a daunting, if not impossible, task, so instead we estimate the parameter from a sample. Parameter estimation comes in two forms, as sketched below:
• Point Estimation
• Interval Estimation
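A small sketch of both kinds of estimate, assuming Python with NumPy and SciPy and a made-up sample of heights:

import numpy as np
from scipy import stats

# Hypothetical sample of heights (cm) drawn from the town's population.
sample = np.array([162.0, 170.5, 168.2, 175.1, 159.8, 171.3, 166.4, 173.0])

# Point estimate of the population mean: the sample mean.
point_estimate = sample.mean()

# Interval estimate: a 95% confidence interval for the mean,
# using the t-distribution and the standard error of the mean.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1,
    loc=point_estimate, scale=stats.sem(sample))

print(point_estimate, (ci_low, ci_high))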
Hypothesis Testing
Hypothesis testing is used to make decisions or draw conclusions about a population based on
sample data. It involves formulating a hypothesis about the population parameter, collecting
sample data, and then using statistical methods to determine whether the data provide enough
evidence to reject or fail to reject the hypothesis.
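As an illustrative sketch (not the only way to do it), a one-sample t-test in Python with SciPy on hypothetical delivery-time data:

import numpy as np
from scipy import stats

# Null hypothesis: the population mean delivery time is 30 minutes.
# Alternative: it is not 30 minutes.
sample = np.array([31.2, 29.5, 33.1, 30.8, 34.0, 28.9, 32.4, 31.7])

t_stat, p_value = stats.ttest_1samp(sample, popmean=30.0)

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: t =", round(t_stat, 3), "p =", round(p_value, 3))
else:
    print("Fail to reject the null hypothesis: p =", round(p_value, 3))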
Q5. Define population and sample and explain how they are related to big data. (6M)
In the age of Big Data, where we can record all users' actions all the time, don't we observe everything? Is there really still a notion of population and sample? If we had all the emails in the first place, why would we need to take a sample?
A population is the entire set of units or observations we want to draw conclusions about, and a sample is a subset of that population that we actually observe. Statistics is the branch of mathematics that deals with the collection, organization, analysis, and interpretation of numerical data; it is especially useful for drawing general conclusions about a population from a sample of the data.
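A tiny sketch of the idea, assuming Python with NumPy and a simulated population, showing how a sample is used to say something about the whole population:

import numpy as np

rng = np.random.default_rng(42)

# Simulated population: ages of every resident of a town.
population = rng.normal(loc=38.0, scale=12.0, size=100_000)

# A sample: a small random subset of that population.
sample = rng.choice(population, size=500, replace=False)

# The sample mean is used as an estimate of the population mean.
print("population mean:", population.mean())
print("sample mean:    ", sample.mean())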
Traditional statistics definitely forms a critical element of data science, but data science encompasses more. Data science is about "dealing" with data, not just "analyzing" it, which is the bread and butter of classical statistics. "Dealing" includes finding and gathering data, cleaning and pre-processing, storing, EDA, statistics, machine learning, natural language processing, and data visualization.
After the invention of computers, we have modern ways to manipulate data and very high computing power. So we are no longer restricted to dealing with just samples of the data; data science is, in part, simply the response to this new technology.
Q6. Explain the three claims made by the Big Data revolution and discuss why data is not objective. (6M)
The claim is that Big Data doesn't need to understand cause, given that the data is so enormous, and it doesn't need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of Big Data is letting "N = ALL." But can N = ALL?
For example, as an InfoWorld post explains, Internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch: they are always a step ahead.
An example from that very article, election-night polls, is in itself a great counterexample: even if we poll absolutely everyone who leaves the polling stations, we still don't count the people who decided not to vote in the first place.
And those might be the very people we'd need to talk to in order to understand our country's voting problems. Indeed, we'd argue that the assumption that N = ALL is one of the biggest problems we face in the age of Big Data.
It is, above all, a way of excluding the voices of people who don’t have the time,
energy, or access to cast their vote in all sorts of informal, possibly unannounced,
elections.
Those people, busy working two jobs and spending time waiting for buses, become
invisible when we tally up the votes without them.
To you this might just mean that the recommendations you receive on Netflix don’t
seem very good because most of the people who bother to rate things on Netflix are
young and might have different tastes than you, which skews the recommendation
engine toward them.
But there are plenty of much more insidious consequences stemming from this basic
idea. Another way in which the assumption that N=ALL can matter is that it often
gets translated into the idea that data is objective.
It is wrong to believe either that data is objective or that "data speaks." In other words, ignoring causation can be a flaw rather than a feature. Models that ignore causation can add to historical problems rather than fix them, and data does not speak for itself. Data is just a quantitative, pale echo of the events of our society; hence data is not objective.
Q7. What is a model? Explain briefly the statistical model and steps involved in building
a model. (10M)
A model is a simplified view of a complex reality. A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical. A model is an artificial construction in which all extraneous detail has been removed or abstracted away. Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself, e.g., statistical and mathematical models.
For example, if you have two columns of data, x and y, and you think there’s a linear
relationship, you’d write down y = β0 +β1x. You don’t know what β0 and β1 are in terms of
actual numbers yet, so they’re the parameters.
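A minimal sketch, assuming Python with NumPy and simulated data, of estimating β0 and β1 for y = β0 + β1x by least squares:

import numpy as np

rng = np.random.default_rng(0)

# Simulated data that roughly follows y = 2 + 3x plus noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=100)

# Least-squares fit of a degree-1 polynomial: returns [beta1, beta0].
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

print("estimated beta0:", beta0_hat, "estimated beta1:", beta1_hat)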
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them. A model, then, is a specification of a mathematical (or probabilistic) relationship between different variables. For example, a business model can be based on a simple mathematical relationship, i.e., profit = revenue - expenses, and can be calculated based on the number of users and the revenue per user.
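A worked instance of that business model; all of the figures below are made up purely for illustration:

# Hypothetical figures for a subscription product.
num_users = 10_000
revenue_per_user = 5.0      # monthly revenue per user
expenses = 35_000.0         # monthly fixed + variable costs

revenue = num_users * revenue_per_user
profit = revenue - expenses  # profit = revenue - expenses

print("revenue:", revenue, "profit:", profit)   # revenue: 50000.0 profit: 15000.0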
One place to start when building a model is exploratory data analysis (EDA). This entails making plots and building intuition for your particular dataset. EDA helps a lot, as do trial and error and iteration: you can (and should) plot histograms and look at scatter plots to start getting a feel for the data.
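A minimal EDA sketch, assuming Python with pandas and matplotlib; the file name and column names (users.csv, age, amount_spent) are hypothetical placeholders:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; replace the path and column names with your own.
df = pd.read_csv("users.csv")

print(df.describe())        # summary statistics for numeric columns
print(df.isna().sum())      # missing values per column

# Histogram to see the shape of one variable.
df["age"].hist(bins=30)
plt.xlabel("age")
plt.show()

# Scatter plot to get a feel for the relationship between two variables.
df.plot.scatter(x="age", y="amount_spent")
plt.show()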
1. Define Your Objective: First, define very clearly what problem you are going to solve.
Whether that is a customer churn prediction, better product recommendations, or patterns in
data, you first need to know your direction. This should bring clarity to the choice of data,
algorithms, and evaluation metrics.
2. Collect Data: Gather data relevant to your objective. This can include internal data from
your company, publicly available datasets, or data purchased from external sources. Ensure
you have enough data to train your model effectively.
3. Clean Data: Data cleaning is a critical step to prepare your dataset for modeling. It
involves handling missing values, removing duplicates, and correcting errors. Clean data
ensures the reliability of your model’s predictions.
5. Split the Data: Divide your dataset into training and testing sets. The training set is used to train your model, while the testing set evaluates its performance. A common split ratio is 80% for training and 20% for testing (see the sketch after this list).
6. Choose a Model: Select a model that suits your problem type (e.g., regression,
classification) and data. Beginners can start with simpler models like linear
regression or decision trees before moving on to more complex models like neural networks.
7. Train the Model: Feed your training data into the model. This process involves the model
learning from the data, adjusting its parameters to minimize errors. Training a model can take
time, especially with large datasets or complex models.
8. Evaluate the Model: After training, assess your model’s performance using the testing set.
Common evaluation metrics include accuracy, precision, recall, and F1 score. Evaluation
helps you understand how well your model will perform on unseen data.
9. Improve the Model: Based on the evaluation, you may need to refine your model. This can involve tuning hyperparameters, choosing a different model, or going back to data cleaning and preparation for further improvements.
10. Deploy the Model: Once satisfied with your model’s performance, deploy it for real-
world use. This could mean integrating it into an application or using it for decision-making
within your organization.
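A compact sketch of steps 5 to 8 above, assuming Python with scikit-learn and synthetic data standing in for a cleaned dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for your cleaned dataset.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Step 5: split into 80% training and 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 6 and 7: choose a simple model and train it.
model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)

# Step 8: evaluate on the held-out test set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

The specific model and metrics would change with the problem; the point here is only the ordering of the steps.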
Q8. Explain the probability distribution and its types in statistical models with an
example. (10M)
The classical example is the height of humans, which follows a normal distribution, a bell-shaped curve. Natural processes tend to generate measurements whose empirical shape can be approximated by mathematical functions with a few parameters that can be estimated from the data. Not all processes generate data that looks like a named distribution, but many do. We can use these functions as building blocks of our models. There are an infinite number of possible distributions.
For the normal distribution, the functional form is N(x | μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)). The parameter μ is the mean and median and controls where the distribution is centred (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form, but for a specific real-world phenomenon these parameters take actual numerical values, which we can estimate from the data.
How do we know which distribution to use? Suppose, for example, we want to model the time spent waiting for a bus. There are two possible ways: we can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again.
Then we look at the measurements, plot them, and approximate the function as discussed. Or, because "waiting time" is a common enough real-world phenomenon, we can use the distribution that was invented to describe it, the exponential distribution p(x) = λe^(−λx). Figure 2.1.1 shows common types of distributions.
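A small sketch of the experimental route, assuming Python with NumPy and matplotlib; the waiting times are simulated here, whereas in practice they would be measured:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated experiment: waiting times (minutes) recorded at the bus stop.
waits = rng.exponential(scale=5.0, size=1_000)   # true lambda = 1/5

# Plot the empirical shape of the measurements.
plt.hist(waits, bins=40, density=True, alpha=0.5, label="measured waits")

# Overlay the exponential density p(x) = lambda * exp(-lambda * x),
# with lambda estimated as 1 / (sample mean).
lam = 1.0 / waits.mean()
x = np.linspace(0, waits.max(), 200)
plt.plot(x, lam * np.exp(-lam * x), label="exponential fit")
plt.xlabel("waiting time (minutes)")
plt.legend()
plt.show()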
If we consider X to be the random variable that represents the amount of money spent, then we can look at the distribution of money spent across all users, and represent it as p(X).
We can then take the subset of users who looked at more than five items before buying anything, and look at the distribution of money spent among these users.
Let Y be the random variable that represents the number of items looked at; then p(X | Y > 5) would be the corresponding conditional distribution.
When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we are observing realizations of a pair of random variables.
When we have an entire dataset with n rows and k columns, we are observing n realizations of the joint distribution of those k random variables.
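A small sketch of a conditional distribution in practice, assuming Python with pandas and a hypothetical user table:

import pandas as pd

# Hypothetical user-level data: items looked at (Y) and money spent (X).
df = pd.DataFrame({
    "items_looked_at": [2, 7, 1, 9, 6, 3, 8, 10, 4, 12],
    "money_spent":     [0.0, 35.5, 0.0, 60.0, 18.0, 5.0, 42.0, 75.0, 9.5, 90.0],
})

# Distribution of X over all users (summarised by a few statistics).
print(df["money_spent"].describe())

# Conditional distribution p(X | Y > 5): restrict to users who looked
# at more than five items, then look at money spent among them.
subset = df.loc[df["items_looked_at"] > 5, "money_spent"]
print(subset.describe())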
Q9. Explain briefly the process of fitting a model. What are underfitting and overfitting of a model? (10M)
Fitting a model means that you estimate the parameters of the model using the observed data.
You are using your data as evidence to help approximate the real-world mathematical process
that generated the data.
Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to obtain the parameters. Once you fit the model, you can actually write it down with numbers, e.g., p(x) = 2e^(−2x).
Fitting the model is when you start actually coding: your code reads in the data, and you specify the functional form that you wrote down on paper. Then R or Python uses built-in optimization methods to give you the most likely values of the parameters given the data, as shown in Figure 2.1.2 below.
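A minimal sketch of fitting by maximum likelihood, assuming Python with NumPy and SciPy; the data are simulated from an exponential with λ = 2, so the fitted model comes out close to p(x) = 2e^(−2x):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Observed data assumed to come from an exponential distribution
# p(x) = lambda * exp(-lambda * x); here simulated with lambda = 2.
data = rng.exponential(scale=0.5, size=2_000)

# Negative log-likelihood of the exponential model for a given lambda.
def neg_log_likelihood(params):
    lam = params[0]
    if lam <= 0:
        return np.inf
    return -np.sum(np.log(lam) - lam * data)

# Built-in optimization finds the most likely lambda given the data.
result = minimize(neg_log_likelihood, x0=[1.0], method="Nelder-Mead")
print("estimated lambda:", result.x[0])   # close to 2, so p(x) is roughly 2e^(-2x)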
Underfitting: a statistical model is said to be underfitted when it is too simple to capture the underlying pattern of the data, so it performs poorly even on the training data. Common reasons for underfitting are:
• The model is too simple, so it is not capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not sufficient.
• Excessive regularization is used to prevent overfitting, which constrains the model too much to capture the data well.
• Features are not scaled.
Overfitting: a statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained too closely on its training data, it starts learning from the noise and inaccurate entries in the dataset, and testing with test data then results in high variance. The model fails to categorize the data correctly because of too many details and noise.
Techniques to reduce overfitting include:
• Improving the quality of the training data, so that the model focuses on meaningful patterns rather than fitting noise or irrelevant features.
• Increasing the amount of training data, which can improve the model's ability to generalize to unseen data and reduce the likelihood of overfitting.
• Reducing model complexity.
• Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training).
• Ridge regularization and Lasso regularization (see the sketch below).
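A small sketch of Ridge and Lasso regularization, assuming Python with scikit-learn and synthetic data; it simply compares the test scores of a plain linear model against the two regularized ones:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data with many features, only a few of them informative,
# which makes a plain linear model prone to overfitting.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),                    # L2 penalty
                    ("lasso", Lasso(alpha=1.0, max_iter=50_000))]:  # L1 penalty
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))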
A model is said to have a good fit when it makes predictions with minimal error on unseen data. This situation is achievable at a spot between overfitting and underfitting. In order to get a good fit, we stop training at a point just before the test error starts increasing; at this point the model performs well on the training dataset as well as on our unseen testing dataset.