PDS_Unit4
Introduction to Statistics
Raw Data that has been collected and organized in numerical or tabular
form is known as Statistics. Statistics is also the mathematical study of
the probability of events occurring, based on known quantitative Data or a
Collection of Data.
Statistics attempts to infer the properties of a large Collection of Data
from inspection of a sample of the Collection thereby allowing educated
guesses to be made with a minimum of expense. There are generally 3
kinds of averages commonly used in Statistics. They are: (i) Mean, (ii)
Median, and (iii) Mode.
Statistics is the study of the Collection, Analysis, Interpretation,
Presentation, and organization of Data. Mathematical methods used in
statistical analysis include mathematical Analysis, linear algebra,
stochastic Analysis, measure-theoretic probability theory, and
differential equations. Statistics is concerned with collecting,
classifying, organizing, and displaying numerical Data. This helps one to
understand different outcomes and anticipate the likelihood of various
events. Statistics deals with information and observations expressed as
numerical Data. With its help, we can compute measures of central
tendency and of the dispersion of values around the center.
The ability to analyze and interpret statistical Data is a vital skill for
researchers and professionals from a wide variety of disciplines. You may
need to make decisions on the basis of statistical Data, interpret
statistical Data in research papers, do your own research, and interpret
the Data.
Types of Statistics
There are two kinds of Statistics: descriptive Statistics and inferential
Statistics. After the Data has been collected, analyzed, and summarised,
we use inferential Statistics to describe the meaning of the collected
Data.
Example
In a class, the Data is the set of marks obtained by 50 students. When
we compute the average of this Data, the result is the average of the 50
students’ marks. If the average mark obtained by the 50 students is 88
out of 100, we draw a conclusion on the basis of that outcome.
Median: The middle number in the Data set when it is listed in either
ascending or descending order is the Median.
Mode: The number that occurs most often in a Data set is the Mode. It
always lies between the lowest and highest values.
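The three averages can be illustrated with a minimal sketch using Python's standard `statistics` module. The marks below are made-up illustrative values, not Data from the example above.

```python
import statistics

# Hypothetical marks (out of 100) for a small class; illustrative values only.
marks = [88, 92, 75, 88, 60, 95, 88, 70, 92]

mean_mark = statistics.mean(marks)      # sum of the values divided by their count
median_mark = statistics.median(marks)  # middle value of the sorted list
mode_mark = statistics.mode(marks)      # value that occurs most often

print("Mean:", mean_mark)
print("Median:", median_mark)
print("Mode:", mode_mark)
```

With an even number of values, `statistics.median` returns the average of the two middle values.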
Quantitative Data
Discrete data - This type of data takes only integer values and cannot
be divided into smaller or finer parts.
Continuous data - This type of data can be further divided into two
types, namely, interval and ratio data. Interval data can be negative
and does not have a meaningful zero. On the other hand, ratio data
can never be negative and has a meaningful zero. Calculations for
continuous data are performed using descriptive statistics.
Just as there are two types of data, there are two types of variables in
statistics. These variables are used to represent the corresponding
data. In both types of statistics, it is necessary to choose the right
kind of variable in order to apply the appropriate statistical test.
Given below are the different types of variables in statistics:
Qualitative Variables
Binary or dichotomous variables - Such a type of variable can
represent nominal data with only two levels of outcomes and does not
Quantitative Variables
These plot types used for data visualization are sometimes called graphs
or charts.
Our eyes are drawn to colours and patterns. We can quickly distinguish red
from blue, and a square from a circle. Our culture is visual, including
everything from art and advertisements to TV and movies.
Data visualization is another form of visual art that grabs our interest and
keeps our eyes on the message. When we see a chart, we quickly see
trends and outliers. If we can see something, we internalize it quickly.
It’s storytelling with a purpose. If you’ve ever stared at a massive
spreadsheet of data and couldn’t see a trend, you know how much more
effective a visualization can be. The uses of Data Visualization are as
follows.
process.
Box plots
Histograms
Heat maps
Charts
Tree maps
Box Plots
Area Chart: It combines the line chart and bar chart to show how
the numeric values of one or more groups change over the progression
of a second variable, typically time.
Dual Axis Chart: It combines a column chart and a line chart and
then compares the two variables.
Line Graph: The data points are connected by straight line segments,
creating a representation of the changing trend.
understood.
Bubble Chart: It is a multi-variable graph that is a hybrid of Scatter
Plot and a Proportional Area Chart.
Funnel Chart: The chart shows the flow of users through the stages
of a business or sales process.
Histograms
In a histogram, the height of the bar does not necessarily indicate how
many occurrences of scores there were within each bin. It is the product
of the height and the width of the bin, that is, the area of the bar,
that indicates the frequency of occurrences within that bin. One reason
the height of a bar, rather than its area, is often mistaken for the
frequency is that many histograms use equally spaced bins, and under
these circumstances the height of each bar is proportional to the
frequency.
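The height-versus-area point can be checked numerically. The following is a minimal sketch using `numpy.histogram` with deliberately unequal bin widths; the scores and bin edges are made-up illustrative values. With `density=True`, NumPy returns bar heights scaled so that height times width equals the fraction of observations in each bin.

```python
import numpy as np

# Hypothetical scores and unequal bin edges; illustrative values only.
scores = np.array([1, 2, 2, 3, 4, 5, 5, 6, 7, 12, 15, 18])
edges = np.array([0, 2, 4, 8, 20])  # bin widths: 2, 2, 4, 12

counts, _ = np.histogram(scores, bins=edges)                 # frequency per bin
heights, _ = np.histogram(scores, bins=edges, density=True)  # bar heights
widths = np.diff(edges)

# Area of each bar (height * width) is the relative frequency of the bin,
# so multiplying by the sample size recovers the counts.
recovered = heights * widths * scores.size

print("counts:   ", counts)
print("heights:  ", heights)
print("recovered:", recovered)
```

In this sketch the last bin holds more scores than the first, yet its bar is shorter because the bin is six times wider.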
Heat Maps
A heat map is a data visualization technique that uses colour the way a
bar graph uses height and width.
If you’re looking at a web page and you want to know which areas get the
most attention, a heat map shows you in a visual way that’s easy to
assimilate and make decisions from. It is a graphical representation of
data where the individual values contained in a matrix are represented as
colours. Heat maps are useful for two purposes: visualizing correlation
tables and visualizing missing values in the data. In both cases, the
information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of
values, but they are not a replacement for more precise graphical
displays, such as bar charts, because colour differences cannot be
perceived accurately.
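As a minimal sketch of the correlation-table use case, the following draws a correlation matrix as a heat map with Matplotlib's `imshow`. The three columns `a`, `b`, and `c` are randomly generated for illustration and are assumptions, as is the output file name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: column b depends on column a, column c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200)
c = rng.normal(size=200)
data = np.column_stack([a, b, c])
labels = ["a", "b", "c"]

# 3 x 3 matrix of pairwise correlations between the columns.
corr = np.corrcoef(data, rowvar=False)

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(im, ax=ax, label="correlation")
fig.savefig("correlation_heatmap.png")
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the colour scale comparable across different correlation tables.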
Charts
Bar Charts
Bar charts are used for comparing the quantities of different categories
or groups. Values of a category are represented with the help of bars and
they can be configured with vertical or horizontal bars, with the length or
height of each bar representing the value.
A bar graph plots data with the help of bars, which represent value on
the y-axis and category on the x-axis. Bar graphs use bars with varying
heights to show the data which belongs to a specific category.
We can also stack bars on top of each other. Let's plot the data for apples
and oranges.
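A minimal sketch of a stacked bar chart with Matplotlib follows. The yearly figures for apples and oranges are made-up illustrative values, and the axis labels and file name are assumptions. The `bottom` argument of `Axes.bar` is what places the second series on top of the first.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical quantities per year; illustrative values only.
years = ["2018", "2019", "2020", "2021"]
apples = [30, 35, 40, 38]
oranges = [20, 25, 22, 30]

fig, ax = plt.subplots()
ax.bar(years, apples, label="Apples")
# Start each orange bar where the corresponding apple bar ends.
ax.bar(years, oranges, bottom=apples, label="Oranges")
ax.set_xlabel("Year")
ax.set_ylabel("Quantity")
ax.legend()
fig.savefig("stacked_bars.png")
```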
We can change the number and size of bins using numpy too.
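One way to do this, sketched under the assumption that the values lie between 0 and 100, is to build the bin edges with `numpy.arange` and pass them to `numpy.histogram`. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical values between 0 and 100; illustrative only.
values = np.array([5, 12, 18, 25, 33, 41, 47, 52, 58, 64, 71, 77, 85, 93, 100])

# Five bins of width 20: edges at 0, 20, ..., 100.
coarse_edges = np.arange(0, 101, 20)
coarse_counts, _ = np.histogram(values, bins=coarse_edges)

# Ten bins of width 10: edges at 0, 10, ..., 100.
fine_edges = np.arange(0, 101, 10)
fine_counts, _ = np.histogram(values, bins=fine_edges)

print("width 20:", coarse_counts)
print("width 10:", fine_counts)
```

The same edge arrays can be passed as the `bins` argument of Matplotlib's `hist` to draw the histogram.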
Pie Chart
Scatter Charts
Scatter plots are used when we have to plot two or more variables
against each other as points at their coordinates. The data points can
fall anywhere on the graph and are not confined to a range. When several
variables are plotted in one Scatter Plot, each is represented by a
different color.
Let's use the ‘Iris’ dataset to plot a Scatter Plot.
Let’s try plotting the data with the help of a line chart.
This is not very informative. We cannot figure out the relationship
between different data points.
This is much better. But we still cannot differentiate different data points
belonging to different categories. We can color the dots using the flower
species as a hue.
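A minimal sketch of a species-coloured scatter plot follows. It assumes the copy of the Iris dataset bundled with scikit-learn and plain Matplotlib, which may differ from the library used for the original figure. The choice of sepal length and sepal width as axes is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]
sepal_width = iris.data[:, 1]

fig, ax = plt.subplots()
# One scatter call per species, so each species gets its own colour
# and its own legend entry.
for index, name in enumerate(iris.target_names):
    mask = iris.target == index
    ax.scatter(sepal_length[mask], sepal_width[mask], label=name)

ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.legend(title="species")
fig.savefig("iris_scatter.png")
```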
Figure 29: Scatter plot with multiple colors
The brighter the color, the higher the footfall at the airport. By looking at
the graph, we can infer that:
1. The annual footfall for any given year is highest around July and
August.
2. The footfall grows annually. Any month in a year will have a higher
footfall when compared to the previous years.
Let's display the actual values in our heatmap and change the hue to
blue.
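A minimal sketch of an annotated blue heat map follows. It uses Matplotlib with a small made-up footfall table rather than the dataset behind the original figure. The month and year labels, the numbers, and the file name are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical footfall table (rows: months, columns: years); illustrative only.
months = ["Jun", "Jul", "Aug"]
years = ["Year 1", "Year 2", "Year 3"]
footfall = np.array([
    [120, 135, 150],
    [140, 160, 185],
    [138, 158, 180],
])

fig, ax = plt.subplots()
im = ax.imshow(footfall, cmap="Blues")  # blue colour scale
ax.set_xticks(range(len(years)))
ax.set_xticklabels(years)
ax.set_yticks(range(len(months)))
ax.set_yticklabels(months)

# Write the actual value in the centre of each cell.
for row in range(footfall.shape[0]):
    for col in range(footfall.shape[1]):
        ax.text(col, row, str(footfall[row, col]), ha="center", va="center")

fig.colorbar(im, ax=ax, label="footfall")
fig.savefig("footfall_heatmap.png")
```

With the `Blues` colormap, larger values are drawn in darker shades of blue.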
Bubble Charts
Timeline Charts