Module 1 21CS644 DSV
Table of Contents
Module - 1: Introduction to Data Science
1.1.1 Big Data and Data Science: hype and getting past the hype
1.1.2 Why now? Datafication
Q1. Define datafication and explain data science hype in big data. (6M)
Datafication is defined as "a process of taking all aspects of life and turning them into data." As an example, Google's augmented-reality glasses datafy the gaze.
Q2. Explain Drew Conway's Venn diagram of data science and discuss the role of the social scientist in data science. (10M)
Data science is the science of analyzing raw data using statistics and machine learning techniques with the purpose of drawing conclusions from that information. Data science involves the skills shown in Figure 1.1.1.
Drew Conway’s Venn diagram of data science in which data science is the intersection of
three sectors – Substantive expertise, hacking skills, and math & statistics knowledge.
In Drew Conway's Venn diagram of data science, the primary colours of data are:
• Hacking skills
• Math and stats knowledge
• Substantive expertise
It is well known that data is the key part of data science, and data is a commodity traded electronically; so, in order to be in this market, one needs to speak hacker.
What does this mean? Being able to manipulate text files at the command line, understanding vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker.
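To make "vectorized operations" concrete, here is a minimal sketch in Python with NumPy; the library choice and the data are assumptions for illustration, not something prescribed above:

# A plain Python loop versus a vectorized NumPy operation on the same data.
import numpy as np

prices = [12.5, 8.0, 15.25, 9.75]

# Loop version: apply a 10% discount element by element.
discounted_loop = []
for p in prices:
    discounted_loop.append(p * 0.9)

# Vectorized version: one expression over the whole array at once.
prices_arr = np.array(prices)
discounted_vec = prices_arr * 0.9

print(discounted_loop)
print(discounted_vec)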
Once you have collected and cleaned the data, the next step is to actually obtain
insight from it. In order to do this, you need to use appropriate mathematical and
statistical methods that demand at least a baseline familiarity with these tools.
The third component is substantive expertise. According to Drew Conway, "data plus math and statistics only gets you machine learning", which is excellent if that is what you are interested in, but it is not data science on its own.
Science is about experimentation and building knowledge, which demands some
motivating questions about the world and hypotheses that can be brought to data and
tested with statistical methods.
On the other hand, substantive expertise plus knowledge of mathematics and statistics is where most traditional researchers fall.
Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time acquiring technology skills. Part of this is the culture of academia, which does not reward researchers for knowing technology.
Role of the Social Scientist in Data Science
Both LinkedIn and Facebook are social-network companies. Oftentimes a description or definition of a data scientist includes hybrid statistician, software engineer, and social scientist. This made sense in the context of companies where the product was a social product, and it still makes sense when we are dealing with human or user behaviour.
Social scientists also do tend to be good question askers and have other good
investigative qualities, so a social scientist who also has the quantitative and
programming chops makes a great data scientist.
But it is almost a historical artifact to limit your conception of a data scientist to someone who works only with online user-behaviour data. There is another emerging field out there called computational social science, which could be thought of as a subset of data science.
Q3. Explain Harlan Harris's clustering and visualization of subfields, and discuss the role of the data scientist in academia and industry. (8M)
• Data Businesspeople are the product and profit-focused data scientists. They’re
leaders, managers, and entrepreneurs, but with a technical bent. A common
educational path is an engineering degree paired with an MBA.
• Data Creatives are jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open-source technologies.
• Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
• In academia, a data scientist is trained in some discipline, works with large amounts
of data, grapples with computational problems posed by the structure, size, messiness,
and the complexity and nature of the data, and solves real-world problems.
• In industry, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both the tools and methods of statistics and machine learning, as well as being human.
Data represents the traces of real-world processes, and exactly which traces we gather is decided by our data collection or sampling method. We then need a way to simplify those captured traces into something more comprehensible. This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
Statistical inference is the process of drawing conclusions about populations or scientific truths from data.
There are different types of statistical inference that are extensively used for drawing conclusions: one-sample hypothesis testing, confidence intervals, Pearson correlation, bivariate regression, multivariate regression, chi-square statistics and contingency tables, and ANOVA or t-tests.
The two primary goals of statistical inference are:
• Parameter Estimation
• Hypothesis Testing
Parameter Estimation
Parameter estimation is one primary goal of statistical inference. Parameters are quantified traits or properties of the population you are studying, and they can be deduced (estimated) from sample data. Examples include the population mean, the population variance, and so on. Imagine measuring each person in a town to compute the mean: this is a daunting, if not impossible, task, so instead we estimate the parameter from a sample. Parameter estimation comes in two forms, as sketched below:
• Point Estimation
• Interval Estimation
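A small sketch of both kinds of estimate, assuming Python with NumPy and SciPy and a made-up sample of heights:

import numpy as np
from scipy import stats

# Hypothetical sample of heights (cm) drawn from the town's population.
sample = np.array([162.0, 170.5, 168.2, 175.1, 159.8, 171.3, 166.4, 173.0])

# Point estimate of the population mean: the sample mean.
point_estimate = sample.mean()

# Interval estimate: a 95% confidence interval for the mean,
# using the t-distribution and the standard error of the mean.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1,
    loc=point_estimate, scale=stats.sem(sample))

print(point_estimate, (ci_low, ci_high))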
Hypothesis Testing
Hypothesis testing is used to make decisions or draw conclusions about a population based on
sample data. It involves formulating a hypothesis about the population parameter, collecting
sample data, and then using statistical methods to determine whether the data provide enough
evidence to reject or fail to reject the hypothesis.
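As an illustrative sketch (not the only way to do it), a one-sample t-test in Python with SciPy on hypothetical delivery-time data:

import numpy as np
from scipy import stats

# Null hypothesis: the population mean delivery time is 30 minutes.
# Alternative: it is not 30 minutes.
sample = np.array([31.2, 29.5, 33.1, 30.8, 34.0, 28.9, 32.4, 31.7])

t_stat, p_value = stats.ttest_1samp(sample, popmean=30.0)

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: t =", round(t_stat, 3), "p =", round(p_value, 3))
else:
    print("Fail to reject the null hypothesis: p =", round(p_value, 3))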
Q5. Define population and sample and explain how they are related to big data. (6M)
In the age of Big Data, where we can record all users' actions all the time, don't we observe everything? Is there really still a notion of population and sample? If we had all the emails in the first place, why would we need to take a sample?
A population is the entire set of units or observations we want to draw conclusions about, and a sample is a subset of that population that we actually observe. Statistics is the branch of mathematics that deals with the collection, organization, analysis, and interpretation of numerical data; it is especially useful for drawing general conclusions about a population from a sample of the data.
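A tiny sketch of the idea, assuming Python with NumPy and a simulated population, showing how a sample is used to say something about the whole population:

import numpy as np

rng = np.random.default_rng(42)

# Simulated population: ages of every resident of a town.
population = rng.normal(loc=38.0, scale=12.0, size=100_000)

# A sample: a small random subset of that population.
sample = rng.choice(population, size=500, replace=False)

# The sample mean is used as an estimate of the population mean.
print("population mean:", population.mean())
print("sample mean:    ", sample.mean())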
Traditional statistics definitely forms a critical element of data science, but data science encompasses more. Data science is about "dealing" with data, not just "analyzing" it, which is the bread and butter of classical statistics. "Dealing" includes finding and gathering data, cleaning and pre-processing, storing, EDA, statistics, machine learning, natural language processing, and data visualization.
After the invention of computers, we have modern ways to manipulate data and very high computing power. So we are no longer restricted to dealing with just samples of the data; data science is, in part, simply the response to this new technology.
Q6. Explain the three claims made by the Big Data revolution and discuss why data is not objective. (6M)
The claim is that Big Data doesn't need to understand cause, given that the data is so enormous, and it doesn't need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of Big Data is letting "N = ALL." But can N = ALL?
For example, as an InfoWorld post explains, Internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch: they are always a step ahead.
An example from that very article, election-night polls, is in itself a great counterexample: even if we poll absolutely everyone who leaves the polling stations, we still don't count the people who decided not to vote in the first place.
And those might be the very people we'd need to talk to in order to understand our country's voting problems. Indeed, we'd argue that the assumption that N = ALL is one of the biggest problems we face in the age of Big Data.
It is, above all, a way of excluding the voices of people who don’t have the time,
energy, or access to cast their vote in all sorts of informal, possibly unannounced,
elections.
Those people, busy working two jobs and spending time waiting for buses, become
invisible when we tally up the votes without them.
To you this might just mean that the recommendations you receive on Netflix don’t
seem very good because most of the people who bother to rate things on Netflix are
young and might have different tastes than you, which skews the recommendation
engine toward them.
But there are plenty of much more insidious consequences stemming from this basic
idea. Another way in which the assumption that N=ALL can matter is that it often
gets translated into the idea that data is objective.
It is wrong to believe either that data is objective or that "data speaks." In other words, ignoring causation can be a flaw rather than a feature. Models that ignore causation can add to historical problems rather than fix them, and data does not speak for itself. Data is just a quantitative, pale echo of the events of our society; hence data is not objective.
Q7. What is a model? Explain briefly the statistical model and steps involved in building
a model. (10M)
A model is a simplified view of a complex reality. A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical. A model is an artificial construction in which all extraneous detail has been removed or abstracted away. Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself, e.g., statistical and mathematical models.
For example, if you have two columns of data, x and y, and you think there’s a linear
relationship, you’d write down y = β0 +β1x. You don’t know what β0 and β1 are in terms of
actual numbers yet, so they’re the parameters.
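A minimal sketch, assuming Python with NumPy and simulated data, of estimating β0 and β1 for y = β0 + β1x by least squares:

import numpy as np

rng = np.random.default_rng(0)

# Simulated data that roughly follows y = 2 + 3x plus noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=100)

# Least-squares fit of a degree-1 polynomial: returns [beta1, beta0].
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

print("estimated beta0:", beta0_hat, "estimated beta1:", beta1_hat)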
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them. A model, then, is a specification of a mathematical (or probabilistic) relationship between different variables. For example, a business model can be based on a simple mathematical relationship, i.e., profit = revenue - expenses, and can be calculated based on the number of users and the revenue per user.
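A worked instance of that business model; all of the figures below are made up purely for illustration:

# Hypothetical figures for a subscription product.
num_users = 10_000
revenue_per_user = 5.0      # monthly revenue per user
expenses = 35_000.0         # monthly fixed + variable costs

revenue = num_users * revenue_per_user
profit = revenue - expenses  # profit = revenue - expenses

print("revenue:", revenue, "profit:", profit)   # revenue: 50000.0 profit: 15000.0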
One place to start when building a model is exploratory data analysis (EDA). This entails making plots and building intuition for your particular dataset. EDA helps a lot, as do trial and error and iteration: you can (and should) plot histograms and look at scatter plots to start getting a feel for the data.
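A minimal EDA sketch, assuming Python with pandas and matplotlib; the file name and column names (users.csv, age, amount_spent) are hypothetical placeholders:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; replace the path and column names with your own.
df = pd.read_csv("users.csv")

print(df.describe())        # summary statistics for numeric columns
print(df.isna().sum())      # missing values per column

# Histogram to see the shape of one variable.
df["age"].hist(bins=30)
plt.xlabel("age")
plt.show()

# Scatter plot to get a feel for the relationship between two variables.
df.plot.scatter(x="age", y="amount_spent")
plt.show()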
1. Define Your Objective: First, define very clearly what problem you are going to solve.
Whether that is a customer churn prediction, better product recommendations, or patterns in
data, you first need to know your direction. This should bring clarity to the choice of data,
algorithms, and evaluation metrics.
2. Collect Data: Gather data relevant to your objective. This can include internal data from
your company, publicly available datasets, or data purchased from external sources. Ensure
you have enough data to train your model effectively.
3. Clean Data: Data cleaning is a critical step to prepare your dataset for modeling. It
involves handling missing values, removing duplicates, and correcting errors. Clean data
ensures the reliability of your model’s predictions.
5. Split the Data: Divide your dataset into training and testing sets. The training set is used to train your model, while the testing set evaluates its performance. A common split ratio is 80% for training and 20% for testing (see the sketch after this list).
6. Choose a Model: Select a model that suits your problem type (e.g., regression,
classification) and data. Beginners can start with simpler models like linear
regression or decision trees before moving on to more complex models like neural networks.
7. Train the Model: Feed your training data into the model. This process involves the model
learning from the data, adjusting its parameters to minimize errors. Training a model can take
time, especially with large datasets or complex models.
8. Evaluate the Model: After training, assess your model’s performance using the testing set.
Common evaluation metrics include accuracy, precision, recall, and F1 score. Evaluation
helps you understand how well your model will perform on unseen data.
9. Improve the Model: Based on the evaluation, you may need to refine your model. This can involve tuning hyperparameters, choosing a different model, or going back to data cleaning and preparation for further improvements.
10. Deploy the Model: Once satisfied with your model’s performance, deploy it for real-
world use. This could mean integrating it into an application or using it for decision-making
within your organization.
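A compact sketch of steps 5 to 8 above, assuming Python with scikit-learn and synthetic data standing in for a cleaned dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for your cleaned dataset.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Step 5: split into 80% training and 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 6 and 7: choose a simple model and train it.
model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)

# Step 8: evaluate on the held-out test set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

The specific model and metrics would change with the problem; the point here is only the ordering of the steps.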
Q8. Explain the probability distribution and its types in statistical models with an
example. (10M)
The classical example is the height of humans, which follows a normal distribution, a bell-shaped curve. Natural processes tend to generate measurements whose empirical shape can be approximated by mathematical functions with a few parameters that can be estimated from the data. Not all processes generate data that looks like a named distribution, but many do. We can use these functions as building blocks of our models. There are an infinite number of possible distributions.
For the normal distribution, the functional form is N(x | μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)). The parameter μ is the mean and median and controls where the distribution is centred (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form, but for a specific real-world phenomenon these parameters take actual numerical values, which we can estimate from the data.
How do we know which distribution to use? Suppose, for example, we want to model the time spent waiting for a bus. There are two possible ways: we can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again.
Then we look at the measurements, plot them, and approximate the function as discussed. Or, because "waiting time" is a common enough real-world phenomenon, we can use the distribution that was invented to describe it, the exponential distribution p(x) = λe^(−λx). Figure 2.1.1 shows common types of distributions.
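A small sketch of the experimental route, assuming Python with NumPy and matplotlib; the waiting times are simulated here, whereas in practice they would be measured:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated experiment: waiting times (minutes) recorded at the bus stop.
waits = rng.exponential(scale=5.0, size=1_000)   # true lambda = 1/5

# Plot the empirical shape of the measurements.
plt.hist(waits, bins=40, density=True, alpha=0.5, label="measured waits")

# Overlay the exponential density p(x) = lambda * exp(-lambda * x),
# with lambda estimated as 1 / (sample mean).
lam = 1.0 / waits.mean()
x = np.linspace(0, waits.max(), 200)
plt.plot(x, lam * np.exp(-lam * x), label="exponential fit")
plt.xlabel("waiting time (minutes)")
plt.legend()
plt.show()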
If we consider X to be the random variable that represents the amount of money spent, then we can look at the distribution of money spent across all users, and represent it as p(X).
We can then take the subset of users who looked at more than five items before buying anything, and look at the distribution of money spent among these users.
Let Y be the random variable that represents the number of items looked at; then p(X | Y > 5) would be the corresponding conditional distribution.
When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we are observing realizations of a pair of random variables.
When we have an entire dataset with n rows and k columns, we are observing n realizations of the joint distribution of those k random variables.
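A small sketch of a conditional distribution in practice, assuming Python with pandas and a hypothetical user table:

import pandas as pd

# Hypothetical user-level data: items looked at (Y) and money spent (X).
df = pd.DataFrame({
    "items_looked_at": [2, 7, 1, 9, 6, 3, 8, 10, 4, 12],
    "money_spent":     [0.0, 35.5, 0.0, 60.0, 18.0, 5.0, 42.0, 75.0, 9.5, 90.0],
})

# Distribution of X over all users (summarised by a few statistics).
print(df["money_spent"].describe())

# Conditional distribution p(X | Y > 5): restrict to users who looked
# at more than five items, then look at money spent among them.
subset = df.loc[df["items_looked_at"] > 5, "money_spent"]
print(subset.describe())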
Q9. Explain briefly the process of fitting a model. What are underfitting and overfitting of a model? (10M)
Fitting a model means that you estimate the parameters of the model using the observed data.
You are using your data as evidence to help approximate the real-world mathematical process
that generated the data.
Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to obtain the parameters. Once you fit the model, you can actually write it down with numbers, e.g., p(x) = 2e^(−2x).
Fitting the model is when you start actually coding: your code reads in the data, and you specify the functional form that you wrote down on paper. Then R or Python uses built-in optimization methods to give you the most likely values of the parameters given the data, as shown in Figure 2.1.2 below.
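A minimal sketch of fitting by maximum likelihood, assuming Python with NumPy and SciPy; the data are simulated from an exponential with λ = 2, so the fitted model comes out close to p(x) = 2e^(−2x):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Observed data assumed to come from an exponential distribution
# p(x) = lambda * exp(-lambda * x); here simulated with lambda = 2.
data = rng.exponential(scale=0.5, size=2_000)

# Negative log-likelihood of the exponential model for a given lambda.
def neg_log_likelihood(params):
    lam = params[0]
    if lam <= 0:
        return np.inf
    return -np.sum(np.log(lam) - lam * data)

# Built-in optimization finds the most likely lambda given the data.
result = minimize(neg_log_likelihood, x0=[1.0], method="Nelder-Mead")
print("estimated lambda:", result.x[0])   # close to 2, so p(x) is roughly 2e^(-2x)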
Underfitting: a statistical model is said to be underfitted when it is too simple to capture the underlying pattern of the data, so it performs poorly even on the training data. Common reasons for underfitting are:
• The model is too simple, so it is not capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not sufficient.
• Excessive regularization is used to prevent overfitting, which constrains the model too much to capture the data well.
• Features are not scaled.
Overfitting: a statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained too closely on its training data, it starts learning from the noise and inaccurate entries in the dataset, and testing with test data then results in high variance. The model fails to categorize the data correctly because of too many details and noise.
Techniques to reduce overfitting include:
• Improving the quality of the training data, so that the model focuses on meaningful patterns rather than fitting noise or irrelevant features.
• Increasing the amount of training data, which can improve the model's ability to generalize to unseen data and reduce the likelihood of overfitting.
• Reducing model complexity.
• Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training).
• Ridge regularization and Lasso regularization (see the sketch below).
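A small sketch of Ridge and Lasso regularization, assuming Python with scikit-learn and synthetic data; it simply compares the test scores of a plain linear model against the two regularized ones:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data with many features, only a few of them informative,
# which makes a plain linear model prone to overfitting.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),                    # L2 penalty
                    ("lasso", Lasso(alpha=1.0, max_iter=50_000))]:  # L1 penalty
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))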
A model is said to have a good fit when it makes predictions with minimal error on unseen data. This situation is achievable at a spot between overfitting and underfitting. In order to get a good fit, we stop training at a point just before the test error starts increasing; at this point the model performs well on the training dataset as well as on our unseen testing dataset.