
Vidya Pratishthan’s Arts, Science and Commerce College, Vidyanagri, Baramati

• Department: B.B.A.(C.A.)
• Class: S.Y.B.B.A.(C.A.)
• Subject: Big Data
• Chapter No: 2
• Chapter Name: Introduction to Data Science
• Presented By: Asst. Prof. Shinde Akshay M.
Big Data Analytics
• Big Data analytics is the process of collecting,
organizing and analyzing large sets of data (called Big
Data) to discover patterns and other useful information.

• Data analytics is the science of analyzing raw data in order to make conclusions about that information.

• Big Data analytics can help organizations to better understand the information contained within the data, and will also help identify the data that is most important to the business and future business decisions.
• Data Analytics involves applying an algorithmic or mechanical process to derive insights, for example, running through several data sets to look for meaningful correlations between them.
Steps Involved in Data Analytics
Data analytics involves several different steps:
1. The first step is to determine the data requirements or how
the data is grouped. Data may be separated by age,
demographic, income, or gender. Data values may be
numerical or be divided by category.
2. The second step in data analytics is the process of collecting the data. This can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or through personnel.
3. Once the data is collected, it must be organized so it can be
analyzed. Organization may take place on a spreadsheet or
other form of software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure there is no duplication or error, and that it is not incomplete. This step helps correct any errors before the data goes on to a data analyst to be analyzed.
Steps Involved in Data Analytics
Data Requirement → Process → Organize → Cleaning
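The sketch below is not from the slides; it illustrates where each of the four steps sits in code, using pandas on a hypothetical file "customers.csv" with made-up columns.

```python
# A minimal sketch of the four steps, assuming a hypothetical CSV file
# "customers.csv" with made-up columns (age, income, gender, spend).
import pandas as pd

# 1. Determine requirements: group customers by gender and age band.
# 2. Collect: read the raw data from a file (one possible source).
df = pd.read_csv("customers.csv")

# 3. Organize: keep only the columns needed for the analysis.
df = df[["age", "income", "gender", "spend"]]

# 4. Clean: drop duplicates and incomplete rows before analysis.
df = df.drop_duplicates().dropna()

# Hand over to analysis, e.g. average spend per gender and age band.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100])
print(df.groupby(["gender", "age_band"])["spend"].mean())
```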
Types Of Data Analytics
1. Descriptive: What is happening?
1. Descriptive analytics answers the question
of what happened.
2. This type of analysis describes or summarizes raw data (past data) into something explainable and meaningful.

3. With the help of descriptive analysis, we analyze and describe the features of the data.
4. In the descriptive analysis, we deal with
the past data to draw conclusions and present
our data in the form of dashboards.
5. It looks at the past performance and understands it by mining historical data to understand the cause of success or failure in the past.
6. In businesses, descriptive analysis is used for determining the Key Performance Indicator or KPI to evaluate the performance of the business.
7. Descriptive analytics looks at data and analyzes past events for insight into how to approach future events.
8. This is mostly used to summarize different aspects of a particular business, describe what’s going on in a particular organization, and when it’s required to understand activities at an aggregate level.
9. This relates to describing the past and is useful as it allows us to analyze past behaviors and how they could have an impact in the near future.
10. Almost all management reporting such as
sales, marketing, operations, and finance uses
this type of analysis.
 Common examples of descriptive analytics are company reports that provide a historic review, like:
•Data queries
•Reports
•Descriptive statistics
•Data dashboards
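As a small illustration (not part of the original slides), a few lines of pandas can produce the descriptive statistics and report-style summaries listed above; the table and numbers are made up.

```python
# A minimal descriptive-analytics sketch on a small made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [120, 95, 130, 110],
})

# Descriptive statistics summarise past data: counts, means, spread, extremes.
print(sales["revenue"].describe())

# A typical management report / dashboard query: revenue per region.
print(sales.groupby("region")["revenue"].sum())
```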
2. Diagnostic: Why is it happening?
1. Diagnostic Analytics examines data to answer
the question “Why did it happen?”.
2. This is characterized by various techniques such as Drill-Down, Data Discovery, Data Mining and Correlations.
3. These techniques allow the users to go towards deeper analysis, which will result in justifying why certain activities or situations have occurred in an organization.
4. On assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and, in so doing, isolate the root cause of a problem.
5. At this stage, historical data can be measured against other data to answer the question of why something happened.
6. Diagnostic analytics gives in-depth insights
into a particular problem.
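A minimal drill-down sketch, using hypothetical monthly sales figures, shows how diagnostic analytics isolates where a change came from; all names and numbers below are illustrative.

```python
# Descriptive analytics shows total revenue fell; diagnostic analytics
# drills down by region to isolate where the drop came from.
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "region":  ["North", "South", "North", "South"],
    "revenue": [100, 80, 100, 40],
})

# Drill-down: break the monthly totals out by region.
pivot = sales.pivot_table(index="region", columns="month",
                          values="revenue", aggfunc="sum")
print(pivot)                       # South dropped from 80 to 40; North is unchanged.
print(pivot["Feb"] - pivot["Jan"]) # change per region points at the root cause
```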
3. Predictive: What is likely to happen?
1. Predictive Analysis answers the question “What
is likely to happen?”

2. It uses the findings of descriptive and diagnostic analytics to detect clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting.
3. Predictive models typically utilize a variety of variable data to make the prediction.
4. The variability of the component data will have a relationship with what it is likely to predict (e.g. the older a person, the more susceptible they are to a heart attack – we would say that age has a linear correlation with heart-attack risk).
5. Predictive models are some of the most important models utilised across a number of fields.
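As an illustrative sketch only, the heart-attack example above can be mimicked with made-up (age, risk) numbers: measure the linear correlation, fit a line, and predict a value for a new age.

```python
# A minimal predictive sketch using made-up (age, risk) data.
import numpy as np

age  = np.array([30, 40, 50, 60, 70])
risk = np.array([0.05, 0.08, 0.12, 0.18, 0.25])   # hypothetical risk scores

# Strength of the linear relationship between age and risk.
print(np.corrcoef(age, risk)[0, 1])

# Fit a straight line and predict the risk for a new age.
slope, intercept = np.polyfit(age, risk, deg=1)
print(slope * 80 + intercept)   # predicted risk at age 80
```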
Deeper Insights
Gain deep insights regarding your business, processes,
functioning, and customer needs. With us, you can make
everything better.
Unknown Patterns
Uncover the unknown industry patterns, adopt trends quickly,
and grab new opportunities for enhanced, data-driven
working.
Customer Understanding
With data, understand your customer, their needs, and the
methods to fulfil these needs for enhanced customer
engagement.
High Business Performance
Data-driven insights allow organizations to drive high
business performance and know what is right or wrong for
your business.
Strategic Decisions
Data gives you the power to take strategic business
decisions. You can understand the strategic needs of your
organization.
Predictive Behaviour
Predict the behaviour of new processes and procedures even
Prescriptive: What do I need to do?
1. The next step up in terms of value and
complexity is the prescriptive model.

2. Prescriptive analytics is used to literally prescribe what action to take when a problem occurs.
3. The prescriptive model utilizes an
understanding of what has happened, why it
has happened and a variety of “what-might-
happen” analysis to help the user determine the
best course of action to take.
4. Prescriptive analysis is typically concerned not just with one individual action, but in fact with a host of other actions.
5. A good example of this is a traffic application helping you choose the best route home, taking into account the distance of each route, the speed at which one can travel on each road and, crucially, the current traffic constraints.
It uses vast data sets and intelligence to analyze the outcomes of the possible actions and then select the best option.

Prescriptive analytics uses sophisticated tools and technologies, like machine learning, business rules and algorithms, which makes it sophisticated to implement and manage.
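A toy sketch in the spirit of the traffic-application example: given a few candidate routes with assumed distances, speeds, and delays, recommend the one with the lowest estimated travel time. Every value below is invented for illustration.

```python
# Prescriptive sketch: evaluate possible actions (routes) and pick the best.
routes = [
    {"name": "Highway",    "distance_km": 20, "speed_kmh": 80, "traffic_delay_min": 15},
    {"name": "City roads", "distance_km": 12, "speed_kmh": 40, "traffic_delay_min": 5},
    {"name": "Back roads", "distance_km": 16, "speed_kmh": 50, "traffic_delay_min": 0},
]

def travel_time_min(route):
    # base driving time plus the current traffic delay
    return route["distance_km"] / route["speed_kmh"] * 60 + route["traffic_delay_min"]

best = min(routes, key=travel_time_min)
print(best["name"], round(travel_time_min(best), 1), "minutes")
```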
Statistical inference
• Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.
• It is also known as Statistical Induction.
• Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
• It is assumed that the observed data set
is sampled from a larger population.
• Inferential statistics can be contrasted with descriptive statistics.
• Descriptive statistics is solely concerned with properties
of the observed data, and it does not rest on the
assumption that the data come from a larger
population.
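For instance, a one-sample t-test is one common way to test a hypothesis about a population mean from a sample. The sketch below uses synthetic data and scipy.stats purely for illustration.

```python
# A minimal inference sketch: test a hypothesis about a population mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=100)   # observed sample (synthetic)

# Point estimate of the population mean from the sample.
print(sample.mean())

# One-sample t-test of the hypothesis "the population mean is 50".
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)   # a small p-value is evidence against the hypothesis
```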
Population

• A population is the entire group that you want to draw conclusions about.

Sample
• A sample is the specific group that you will collect
data from.
• The size of the sample is always less than the total
size of the population.
Population
It is the collection of a specified group of similar objects, individuals, or entities that have some common observable characteristics. Each such object is termed an “elementary unit”.

Example: Consider a list consisting of the names of all the employees in a company; it is a population, and each employee is considered an elementary unit.
Types of Population

Finite population
This is a type of population in which the number of
elementary units is exactly quantifiable.
Example- Books in a university library.

Infinite population
In this type of population, the count of elementary units cannot be quantified with certainty.
Example: Population of a country. The population of a country cannot be counted exactly at any moment (it can only be approximated), because the numbers of births and deaths are changing every second.
Real population
This is such a type of population that is mostly
based on real-time data and the information is
concrete and reliable. This population does not
require approximation or hypothetical data.
Example- Employees working in a company.

Hypothetical population
This can be a finite or infinite imaginary population
designed by a researcher. Here mostly, the
researcher will take a real-time scenario and apply
his/her common hypothesis or assumptions to
draw the structure and information of a population.
Example- Possible outcomes of a die if rolled ’n’
times.
Sample

A part of the population, drawn according to a rule or plan in order to draw conclusions about its characteristics, is called a sample.

Sample size
The number of items in a sample is called the sample size.
For example, out of 50k employees, 5k were selected for analysis, which makes the sample size 5k.
Characteristics of the sample

A sample should follow certain characteristics to make it fit for data analysis.

1. Representativeness

A sample should represent the overall behavior of a population. In the above example, the 5k employees selected out of 50k should reflect the behavior of the whole workforce.
2. Homogeneity
Homogeneity is nothing but the matching of behavior across multiple samples.
Imagine we want to calculate the mean salary of the 50k employees and we have 3 samples, each of sample size 5k (see the sketch after this list):
· Sample 1 has a mean salary of $40k
· Sample 2 has a mean salary of $38k
· Sample 3 has a mean salary of $41k
We can say that these samples are homogeneous, since all samples give approximately equal information regarding the salary of the employees.

3. Adequacy

The number of sampling units in a sample should be adequate for doing the research.
In the above example, out of 50k employees, it would not be effective to draw a sample of size 5 or 6 for doing research.
4. Similar regulating conditions

There should be a similar way of selecting samples if there is a need for multiple samples.

In the above example, out of 50k employees, a sample of 5k employees was chosen at random; if we are selecting another sample, it should also be chosen randomly.
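The sketch below ties these four characteristics together on a synthetic “population” of 50k salaries: adequately sized random samples give homogeneous, representative means, while a tiny sample does not. All numbers are made up.

```python
# Sampling sketch on a synthetic population of 50k salaries.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=40_000, scale=5_000, size=50_000)   # 50k salaries

# Three samples of adequate size (5k), each selected at random under the
# same regulating conditions.
samples = [rng.choice(population, size=5_000, replace=False) for _ in range(3)]

# Homogeneity: the sample means are approximately equal and close to the
# population mean (representativeness).
print([round(s.mean()) for s in samples], round(population.mean()))

# Inadequate sample: a sample of size 5 gives a much less stable estimate.
print(round(rng.choice(population, size=5).mean()))
```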
Some important terminologies
Sampling unit
Similar to the elementary unit, each element in the sample is
called a sampling unit. Here out of 5k employees, each of the
employees will be a sampling unit.
Sampling frame
A complete list of sampling units, maps, or other acceptable
material, which represents the population to be sampled is called
the sampling frame.
Statistical Modelling

• Statistical modeling is the process of applying statistical analysis to a dataset.

• A statistical model is a mathematical representation (or mathematical model) of observed data.
When data analysts apply various statistical models to the data they are investigating, they are able to understand and interpret the information more strategically. Rather than sifting through the raw data, this practice allows them to identify relationships between variables, make predictions about future sets of data, and visualize that data.
 Statistical Modelling:
In simple terms, statistical modelling is a simplified, mathematically formalized way to approximate reality (i.e. what generates your data) and optionally to make predictions from the observations.

• All commonly used statistical procedures can be put into a general modelling framework.
This is of the form: Data = Pattern + Residual

• Variation in the observed data can be split into two components:
1] Pattern: systematic or explained variation
2] Residual: leftover or unexplained variation
Basic Steps of Statistical Model Building are:-
A] Model Selection
B] Model Fitting
C] Model Validation

These three basic steps are used iteratively until an appropriate model for the
data has been developed

A] Model Selection:
In the model selection step, plots of the data, process knowledge and assumptions about the process are used to determine the form of the model to be fit to the data.

B] Model Fitting:
The model fitting step estimates the unknown parameters in the model.

C] Model Validation:
Model validation checks whether the fitted model is actually useful, i.e. whether it adequately describes the data.
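A minimal sketch of the Data = Pattern + Residual idea and the three model-building steps, on synthetic data with an assumed straight-line model.

```python
# Data = Pattern + Residual on synthetic data with a straight-line model.
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(20)
data = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)   # observed data

# Model selection: plots / process knowledge suggest a straight line.
# Model fitting: estimate the unknown slope and intercept.
slope, intercept = np.polyfit(x, data, deg=1)
pattern = slope * x + intercept          # systematic (explained) variation
residual = data - pattern                # leftover (unexplained) variation

# Model validation: residuals should look like unstructured noise around 0.
print(round(slope, 2), round(intercept, 2), round(residual.mean(), 3))
```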
Reasons to Learn Statistical
Modeling
A)You will be better equipped to choose
the right model for your needs.
• There are many different types of statistical models, and an effective
data analyst needs to have a comprehensive understanding of them
all.
• In each scenario, you should be able to identify not only which
model will help best answer the question at hand, but also which
model is most appropriate for the data you’re working with
B)You will be better able to prepare
your data for analysis.
• Data is rarely ready for analysis in its raw form. To ensure your
analysis is accurate and viable, the data must first be cleaned up.
This cleanup often includes organizing the gathered information
and removing “bad or incomplete data” from the sample.
• “Before any statistical model can be completed, you need to explore [and] understand the data. If there is no quality [in the data], then you can’t really derive any insights from it.”
• Once you know how various statistical models work and how
they leverage data, it will become easier for you to
determine what data is most relevant to the question you
are trying to answer, as well.
C) You will become a better communicator.
• In most organizations, data analysts are required
to communicate their findings with two different
audiences.

• The first audience consists of those on the business


team who don’t need to understand the details of
your analysis, but simply want to know the key
takeaways.

• The second audience consists of those who are


interested in the more granular details; this group
will want both the list of broad conclusions and an
explanation of how you reached them.
• Having a thorough understanding of statistical modeling can help you
better communicate with both of these audiences, as you will be better
equipped to reach conclusions and therefore generate better data
visualizations, which are helpful in communicating complex ideas to
non-analysts. Simultaneously, a complex understanding of how these
models work on the backend will allow you to generate and explain
those more granular details when necessary.
Probability:
Probability theory developed from the study of games of chance like dice and cards.

Probability theory is the foundation of statistical inference.

Probability = (the number of ways of achieving success) / (the total number of possible outcomes)
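A tiny worked example of this formula: the probability of rolling an even number with a fair six-sided die is 3/6 = 1/2.

```python
# Probability = successes / total possible outcomes, for a fair die.
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
successes = [o for o in outcomes if o % 2 == 0]

probability = Fraction(len(successes), len(outcomes))
print(probability)   # 1/2
```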
Probability
• Probability is an intuitive concept. We use it on a daily basis without necessarily realizing that we are speaking and applying probability in our work.
• Life is full of uncertainties. We don’t know the
outcomes of a particular situation until it
happens. Will it rain today? Will I pass the next
math test? Will my favorite team win the toss?
Will I get a promotion in next 6 months? All
these questions are examples of uncertain
situations we live in.
•Experiment – an uncertain situation, which could have multiple outcomes. Whether it rains on a given day is an experiment.
•Outcome – the result of a single trial. So, if it rains today, the outcome of today’s trial of the experiment is “It rained”.
•Event – one or more outcomes from an experiment. “It rained” is one of the possible events for this experiment.
•Probability – a measure of how likely an event is. So, if there is a 60% chance that it will rain tomorrow, the probability of the outcome “it rained” for tomorrow is 0.6.
Why do we need probability?

In an uncertain world, it can be of immense help to know and understand the chances of various events. You can plan things accordingly. If it’s likely to rain, I would carry my umbrella. If I am likely to have diabetes on the basis of my food habits, I would get myself tested. If my customer is unlikely to pay me a renewal premium without a reminder, I would remind him about it.
Probability Distribution:

It is simply a statistical function that describes the possible values a random variable can take within a given range and the likelihood of each.
A probability distribution is a summary of probabilities for the values of a random variable.

As a distribution, the mapping of the values of a random variable to a probability has a shape when all values of the random variable are lined up.

The distribution also has general properties that can be measured.

Two important properties of a probability distribution are the expected value and the variance.
Expected Value:

The expected value is the average or mean value of a random variable X.

It is the probability-weighted average of all possible values (not necessarily the single outcome with the highest probability).

It is typically denoted as a function of the uppercase letter E with square brackets: for example, E[X] for the expected value of X, or E[f(x)] where the function f() is used to sample a value from the domain of X.
The expectation value (or the mean) of a random variable X is denoted by E(X).

•Expected Value. The average value of a random variable.
Variance:

The variance is the spread of the values of a random variable from the mean.

This is typically denoted as a function Var; for example, Var(X) is the variance of the random variable X, or Var(f(x)) for the variance of values drawn from the domain of X using the function f().

•Variance. The average spread of values around the expected value.
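As a small worked illustration, E[X] and Var(X) can be computed directly from the distribution of a fair six-sided die.

```python
# E[X] and Var(X) for a fair six-sided die, from the distribution itself.
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
probs  = np.full(6, 1 / 6)                             # uniform probabilities

expected = np.sum(values * probs)                      # E[X] = 3.5
variance = np.sum(probs * (values - expected) ** 2)    # Var(X) ≈ 2.92
print(expected, variance)
```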
The structure of the probability distribution will
differ depending on whether the random variable
is discrete or continuous.

Discrete Probability Distributions

A discrete probability distribution summarizes the probabilities for a discrete random variable.

The probability mass function, or PMF, defines the probability distribution for a discrete random variable.
Discrete probability functions are also known as
probability mass functions and can assume a
discrete number of values.
A discrete probability distribution has a cumulative
distribution function, or CDF.
This is a function that assigns a probability that a
discrete random variable will have a value of less
than or equal to a specific discrete value.
•Probability Mass Function. Probability for a
value for a discrete random variable.
•Cumulative Distribution Function. Probability
less than or equal to a value for a random variable.

For example, coin tosses and counts of events are discrete functions. These are discrete distributions because there are no in-between values.

For example, the likelihood of rolling a specific number on a die is 1/6. The total probability for all six values equals one. When you roll a die, you inevitably obtain one of the possible values.
Types of Discrete Distribution

There are a variety of discrete probability distributions that you can use to model different types of data. The correct discrete distribution depends on the properties of your data.

•Binomial distribution to model binary data, such as coin tosses.
•Poisson distribution to model count data, such as the count of library book checkouts per hour.
•Uniform distribution to model multiple events with the same probability, such as rolling a die.
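An illustrative sketch of these three distributions with scipy.stats; the parameter values (number of tosses, mean checkouts per hour, die faces) are assumptions chosen for the example.

```python
# PMFs of the three discrete distributions named above.
from scipy import stats

# Binomial: probability of exactly 6 heads in 10 fair coin tosses.
print(stats.binom.pmf(6, n=10, p=0.5))

# Poisson: probability of exactly 3 checkouts in an hour, if the mean is 5.
print(stats.poisson.pmf(3, mu=5))

# Discrete uniform: probability of rolling any given face of a fair die.
print(stats.randint.pmf(4, low=1, high=7))   # 1/6 for each face 1..6
```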
Continuous Probability Distributions

Continuous probability functions are also known as probability density functions (PDF).

Unlike discrete probability distributions, where each particular value has a non-zero likelihood, specific values in continuous distributions have a zero probability.
For example, the likelihood of measuring a
temperature that is exactly 32 degrees is zero.
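A short sketch of this point with scipy.stats, assuming (purely for illustration) that temperatures are normally distributed with mean 30 and standard deviation 5: the density at 32 is positive, an interval around 32 has positive probability, but the exact value 32 has probability zero.

```python
# For a continuous variable, exact values have zero probability;
# intervals have probability given by the CDF.
from scipy import stats

temp = stats.norm(loc=30, scale=5)

print(temp.pdf(32))                     # density at 32 degrees (not a probability)
print(temp.cdf(32.5) - temp.cdf(31.5))  # probability of 31.5–32.5 degrees
# P(X == 32) itself is zero for a continuous distribution.
```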
What Is Correlation?

• Correlation is a statistical measure.

• Correlation explains how one or more variables are related to each other. These variables can be input data features which have been used to forecast our target variable.
• Two features (variables) can be positively
correlated with each other. It means that when
the value of one variable increases then the
value of the other variable(s) also increases.
Correlation is really one of the very basics of data
analysis and is an important tool for a data
analyst, as it can help define trends, make
predictions and uncover root causes for certain
phenomena.
There could be essentially two types of data you
can work with when determining correlation:

Univariate Data:

• In a simple set-up we work with a single variable.
We measure central tendency to enquire about the
representative data, dispersion to measure the
deviations around the central tendency, skewness
to measure the shape and size of the distribution
and kurtosis to measure the concentration of the
data at the central position. This data, relating to
a single variable is called univariate data.
Bivariate data:

It often becomes essential in our analysis to study two variables simultaneously.

For example, a) height and weight of a person, b) age and blood pressure, etc.
Statistical data on two characteristics of an individual, measured simultaneously, are termed bivariate data.
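As a small bivariate illustration (with made-up measurements), the Pearson correlation coefficient between height and weight can be computed directly with NumPy.

```python
# Correlation between two variables measured on the same individuals.
import numpy as np

height_cm = np.array([150, 160, 165, 172, 180])
weight_kg = np.array([52, 58, 63, 70, 78])

# Pearson correlation coefficient: close to +1 means a strong positive
# relationship (taller people tend to be heavier in this toy data).
print(np.corrcoef(height_cm, weight_kg)[0, 1])
```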
Types of correlation:
1. Positive correlation
2. Negative correlation
3. Zero correlation
4. Spurious correlation
5. Perfect positive
6. Perfect negative

Positive correlation:
If, due to an increase in one of the two data, the other data also increases, we say that those two data are positively correlated.

For example, height and weight of a male or female are positively correlated.
Negative correlation:
If, due to an increase in one of the two, the other decreases, we say that those two data are negatively correlated.
For example, the price and demand of a
commodity are negatively correlated. When the
price increases, the demand generally goes down.
Zero correlation:

If, between the two data, there is no clear-cut trend, i.e. the change in one does not guarantee a co-directional change in the other, the two data are said to be non-correlated, or may be said to possess zero correlation.
For example, qualities like affection and kindness are in most cases non-correlated with academic achievement; or, better to say, the intellect of a person is purely non-correlated with complexion.
Spurious correlation:

• If the correlation is due to the influence of some other ‘third’ variable, the data are said to be spuriously correlated.

For example, children with “body control problems” and clumsiness have been reported as being associated with adult obesity. One can probably say that uncontrolled and clumsy kids participate less in sports and outdoor activities, and that is the ‘third’ variable here. Most times, it is difficult to figure out the ‘third’ variable, and even if that is achieved, it is even more difficult to gauge the extent of its influence on the two primary variables.
Regression

Regression is a statistical technique that is used to model the relationship of a dependent variable with respect to one or more independent variables.

Regression is widely used in several statistical analysis problems and it is also one of the most important tools in Machine Learning.

Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
The statistical technique that expresses a functional relationship between two or more variables in the form of an equation, to estimate the value of a variable based on the given value of another variable, is regression analysis.

The variable whose value is to be estimated is called the Dependent Variable.
The variable whose value is used to estimate this value is called the Independent Variable.
Regression Analysis:

Regression analysis is used in stats to find trends in data.
For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
Regression analysis will provide you with an equation for a graph so that you can make predictions about your data.
For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in ten years’ time if you continue to put on weight at the same rate.
In statistics, it’s hard to stare at a set of random
numbers in a table and try to make any sense of it.
For example, global warming may be
reducing average snowfall in your town and you
are asked to predict how much snow you think will
fall this year. Looking at the following table you
might guess somewhere around 10-20 inches.
That’s a good guess, but you could make
a better guess, by using regression.
Linear Regression

A linear regression refers to a regression model that is completely made up of linear variables.

Beginning with the simple case, Single Variable Linear Regression is a technique used to model the relationship between a single input independent variable (feature variable) and an output dependent variable using a linear model, i.e. a line.

Multi-Variable Linear Regression creates a model for the relationship between multiple independent input variables (feature variables) and an output dependent variable. The model remains linear in that the output is a linear combination of the input variables.
A few key points about Linear Regression:
•Fast and easy to model and is particularly useful
when the relationship to be modeled is not
extremely complex and if you don’t have a lot of
data.
•Very intuitive to understand and interpret.
•Linear Regression is very sensitive to outliers.
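A minimal single-variable linear regression sketch with scikit-learn on made-up data; the feature values and targets are illustrative only.

```python
# Single-variable linear regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([52.0, 55.0, 59.0, 61.0, 66.0])        # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # fitted line: y = coef * x + intercept
print(model.predict([[6.0]]))             # prediction for a new input
```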
Polynomial Regression

When we want to create a model that is suitable for handling non-linearly separable data, we will need to use polynomial regression.

In this regression technique, the best-fit line is not a straight line.

A few key points about Polynomial Regression:
•Able to model non-linearly separable data; linear
regression can’t do this. It is much more flexible in
general and can model some fairly complex
relationships.
•Full control over the modelling of feature variables
(which exponent to set).
•Requires careful design. Need some knowledge of
the data in order to select the best exponents.
•Prone to overfitting if exponents are poorly selected.
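A minimal polynomial regression sketch using NumPy's polyfit on synthetic, roughly quadratic data; the degree-2 choice is the "which exponent to set" design decision mentioned above.

```python
# Degree-2 polynomial regression on data a straight line cannot capture.
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 2.0, 5.2, 9.8, 17.1, 26.0])   # roughly quadratic

# Choosing the exponent (degree) is the careful-design step; too high a
# degree would overfit.
coeffs = np.polyfit(x, y, deg=2)
fitted = np.polyval(coeffs, x)

print(coeffs)                       # [a, b, c] for a*x**2 + b*x + c
print(np.abs(y - fitted).max())     # worst-case fitting error
```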
Ridge Regression

A standard linear or polynomial regression will fail in the case where there is high collinearity among the feature variables.
Collinearity is the existence of near-linear relationships among the independent variables.
The presence of high collinearity can be determined in a few different ways:
•A regression coefficient is not significant even
though, theoretically, that variable should be
highly correlated with Y.
•When you add or delete an X feature variable, the
regression coefficients change dramatically.
•Your X feature variables have high pairwise
correlations (check the correlation matrix).
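A minimal ridge regression sketch with scikit-learn on two deliberately near-collinear synthetic features; alpha is the regularization strength and all data are generated for illustration.

```python
# Ridge regression keeps coefficients stable despite collinear features.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)           # coefficients stay moderate in size
```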
