Basic Statistics Concepts for Data Science

1. Descriptive Statistics

Descriptive statistics describe the basic features of a data set by summarizing it. The data set can represent either an entire population or a sample of that population.

Common summary measures include:

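• Mean: The central value, commonly known as the arithmetic average. It is the sum of the values divided by the number of values.

• Mode: The value that appears most often in a data set.

• Median: The middle value of the ordered data set, which divides it exactly in half. When the number of values is even, it is the average of the two middle values.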
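
The three measures above can be sketched with Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(data)      # sum of the values / number of values
median_value = statistics.median(data)  # even count: average of the two middle values
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)
```

Here the median falls between 3 and 5 because the data set has an even number of values.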

2. Variability

Variability, also called dispersion, is described by the following measures:

• Standard Deviation: A statistic that measures the dispersion of a data set relative to its mean. It is the square root of the variance.

• Variance: A statistical measure of the spread between the numbers in a data set. It is the average of the squared differences from the mean. A large variance indicates that the numbers are far from the average value. A small variance indicates that the numbers are close to the average value. Zero variance indicates that all values in the set are identical.

• Range: The difference between the largest and smallest values in a data set.

• Percentile: The value below which a given percentage of the observations in a data set falls. For example, the 90th percentile is the value below which 90% of the observations fall.

• Quartile: One of the three values (Q1, Q2, Q3) that divide the ordered data points into four equal parts.

• Interquartile Range: The spread of the middle 50% of the data set, calculated as the third quartile minus the first quartile (Q3 - Q1).
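
As a minimal sketch, the measures in this section can be computed with Python's `statistics` module. The data values are hypothetical. The population formulas (`pvariance`, `pstdev`) and the `"inclusive"` quantile method are assumptions made for this example; other conventions give slightly different numbers.

```python
import statistics

# Hypothetical data set, treated here as a complete population.
data = [4, 8, 15, 16, 23, 42]

variance = statistics.pvariance(data)  # mean of squared differences from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
value_range = max(data) - min(data)    # largest value minus smallest value

# Quartiles: three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                          # spread of the middle 50% of the data

# Percentiles: 99 cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(data, n=100, method="inclusive")[89]

print(variance, std_dev, value_range, q1, q2, q3, iqr, p90)
```

The second quartile `q2` equals the median, which ties this section back to the descriptive measures in section 1.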

3. Correlation

Correlation is one of the major statistical techniques for measuring the relationship between two variables. The correlation coefficient ranges from -1 to +1 and indicates the strength and direction of the linear relationship between the two variables.

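• A correlation coefficient greater than zero indicates a positive relationship: as one variable increases, the other tends to increase.

• A correlation coefficient less than zero indicates a negative relationship: as one variable increases, the other tends to decrease.

• A correlation coefficient of zero indicates that there is no linear relationship between the two variables. A nonlinear relationship may still exist.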
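
The sketch below computes the Pearson correlation coefficient by hand for three hypothetical data sets, one for each case above. The helper name `pearson_r` is introduced here for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x along a straight line
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises along a straight line

# y = x**2: clearly related to x, but not linearly.
r_nonlinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_nonlinear)
```

The last case shows why a coefficient of zero rules out only a linear relationship: the values follow y = x² exactly, yet the coefficient is zero.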


4. Probability Distribution

A probability distribution specifies the probabilities of all possible events. In simple terms, an event is the result of an experiment. Events are of two types: independent and dependent.

• Independent event: An event is independent when it is not affected by earlier events.

• Dependent event: An event is dependent when its occurrence depends on earlier events.

The probability that several independent events all occur is calculated by multiplying the probabilities of the individual events: P(A and B) = P(A) × P(B). For dependent events, it is calculated using conditional probability: P(A and B) = P(A) × P(B | A).
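
A short worked example of both rules, using exact fractions. The coin and card scenarios are illustrative assumptions and are not taken from the text above.

```python
from fractions import Fraction

# Independent events: two flips of a fair coin.
# The first flip does not affect the second, so P(A and B) = P(A) * P(B).
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Dependent events: drawing two aces from a 52-card deck without replacement.
# The first draw changes the deck, so P(A and B) = P(A) * P(B | A).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards
p_two_aces = p_first_ace * p_second_ace_given_first

print(p_two_heads, p_two_aces)
```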

5. Regression

Regression is a method used to determine the relationship between one or more independent variables and a dependent variable. The two main types are:

• Linear regression: It is used to fit a regression model that explains the relationship between a numeric response variable and one or more predictor variables.

• Logistic regression: It is used to fit a regression model that explains the relationship between a binary response variable and one or more predictor variables.
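
A minimal sketch of both ideas, assuming a single predictor variable. `fit_line` is a hypothetical helper that applies ordinary least squares, and the data and logistic coefficients are made up for illustration. Fitting a logistic model is not shown; the sketch only shows how such a model turns a linear score into a probability.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Linear regression: numeric response (hypothetical scores vs. hours studied).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression: binary response. The coefficients below are assumed,
# not fitted; the model outputs the probability of the positive class.
w, b = 1.5, -4.5
p_pass_after_3_hours = sigmoid(w * 3 + b)

print(slope, intercept, p_pass_after_3_hours)
```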
6. Normal Distribution

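The normal distribution defines the probability density function for a continuous random variable. It has two parameters: the mean and the standard deviation. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1. When the distribution of a random variable is unknown, the normal distribution is often assumed. The central limit theorem justifies this in many cases: the mean of a large number of independent, identically distributed random variables with finite variance tends toward a normal distribution, whatever their original distribution.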
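
As a sketch, Python's `statistics.NormalDist` can evaluate the density and cumulative probabilities of a normal distribution. The height parameters (mean 170, standard deviation 10) are hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

density_at_mean = standard_normal.pdf(0)  # height of the density curve at the mean

# Probability of falling within 1 and 2 standard deviations of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

# A normal distribution with other parameters (hypothetical heights in cm).
heights = NormalDist(mu=170, sigma=10)
p_below_180 = heights.cdf(180)  # 180 is one standard deviation above the mean

print(density_at_mean, within_one_sd, within_two_sd, p_below_180)
```

Because 180 lies exactly one standard deviation above the mean, `p_below_180` matches `standard_normal.cdf(1)`: any normal distribution can be rescaled to the standard one.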

7. Bias

In statistical terms, bias means that a model or sample is not representative of the complete population. It needs to be minimized to get reliable results.

The three most common types of bias are:

• Selection bias: It occurs when the group of data selected for statistical analysis is not chosen at random, so the data is unrepresentative of the whole population.

• Confirmation bias: It occurs when the person performing the statistical analysis has a predefined assumption and favors results that support it.

• Time interval bias: It is caused by deliberately choosing a specific time range to favor a particular outcome.
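
Selection bias can be illustrated with a small sketch. The population (the integers 1 to 100) and the sampling choices are assumptions made for this example only.

```python
import random
import statistics

# Hypothetical population: the integers 1 to 100.
population = list(range(1, 101))
population_mean = statistics.mean(population)

# Biased selection: take only the 10 smallest values instead of sampling at random.
biased_sample = population[:10]
biased_mean = statistics.mean(biased_sample)

# Random selection: every value has the same chance of being chosen.
rng = random.Random(0)  # fixed seed so repeated runs give the same sample
random_sample = rng.sample(population, 10)
random_mean = statistics.mean(random_sample)

print(population_mean, biased_mean, random_mean)
```

The biased sample's mean is far from the population mean because the selection was not random. A random sample is not guaranteed to match the population mean either, but it has no systematic tendency to miss in one direction.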
