
PDS_Unit4

The document provides an overview of statistics, including its definition, types (descriptive and inferential), and key concepts such as measures of central tendency and dispersion. It also discusses data visualization techniques, emphasizing the importance of effectively presenting data through various graphical methods like histograms, bar charts, and scatter plots. Additionally, it highlights the significance of data interpretation in research and decision-making processes.

Uploaded by

Ankitha T C
Copyright
© All Rights Reserved

UNIT 4

Statistics: Introduction, Types of Statistics.
Data Visualization and Interpretation: Histogram, Bar Charts, Scatter
Plots, Good vs. Bad Visualization.
Sampling distributions; Point estimation - estimators, minimum variance
unbiased estimation, maximum likelihood estimation, method of
moments, consistency; Interval estimation.

Introduction to Statistics
Raw data, once collected and organized in numerical or tabular form, is
known as statistics. Statistics is also the mathematical study of the
probability of events occurring, based on known quantitative data or a
collection of data.
Statistics attempts to infer the properties of a large collection of data
from inspection of a sample of that collection, thereby allowing educated
guesses to be made at minimal expense. Three kinds of averages are
commonly used in statistics: (i) mean, (ii) median, and (iii) mode.
Statistics is the study of the collection, analysis, interpretation,
presentation, and organization of data. The mathematical methods it
draws on include mathematical analysis, linear algebra, stochastic
analysis, measure-theoretic probability theory, and differential
equations. Collecting, classifying, organizing, and displaying numerical
data helps one understand the outcomes the data describe and anticipate
the likelihood of various events. Statistics deals with information and
observations expressed as numerical data. With its help we can find
indicators of central tendency and measure how far values diverge from
the center.

The ability to analyze and interpret statistical data is a vital skill for
researchers and professionals from a wide variety of disciplines. You may
need to make decisions on the basis of statistical data, interpret
statistical data in research papers, or carry out your own research and
interpret its data.

Types of Statistics

There are two kinds of statistics: descriptive statistics and inferential
statistics. In descriptive statistics, the data are described in a
summarized way, whereas in inferential statistics we use those summaries
to draw conclusions about the wider population. Both are used on a large
scale, and in practice an analysis often moves from description into
inference.
Statistics is mainly divided into the following two categories:
Descriptive Statistics
Inferential Statistics

Descriptive Statistics

In descriptive statistics, the data is described in a summarized way.
The summarization is done from a sample of the population using
different measures such as the mean or standard deviation. Descriptive
statistics are a way of using charts, graphs, and summary measures to
organize, represent, and explain a set of data.
It is used to quantitatively describe the attributes of the known data and
provides summaries of either the sample or the population. The measures
of descriptive statistics are given as follows:
Measures of central tendency: These measures are used to describe data
with respect to a single central point. Mean, median, and mode are three
types of central tendency.
Measures of dispersion: These measures are used to describe the
variability of data. In other words, it is used to quantify the spread of a
distribution about a central value. Range, variance, standard
deviation, mean deviation, quartile deviation, and coefficients of
dispersion are the types that fall under this category.

Inferential Statistics

In inferential statistics, we try to interpret the meaning of
descriptive statistics. After the data has been collected, analyzed, and
summarized, we use inferential statistics to describe the meaning of the
collected data.

Inferential statistics use the principles of probability to assess
whether trends found in the research sample can be generalized to the
larger population from which the sample comes.

Inferential statistics are intended to test hypotheses and
investigate relationships between variables, and can be used to
make predictions about the population.

Inferential statistics are used to draw conclusions and inferences,
i.e., to make valid generalizations from samples.

The measures of inferential statistics are given below:

Hypothesis testing - It is used to test assumptions and make inferences
about population parameters by using an estimate computed from the
sample. Many statistical tests are used for this purpose, including the
z test, t test, F test, and ANOVA.

Regression analysis - This type of analysis is used when the effect of a
change in one variable on another variable needs to be evaluated and
quantified. Simple linear, multiple linear, nominal, logistic, and
ordinal regression are types of regression analysis.
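As a rough illustration of simple linear regression, the sketch below fits a straight line by ordinary least squares. The helper name `fit_line` and the hours/scores data are hypothetical and are not taken from the notes.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score is roughly {slope:.2f} * hours + {intercept:.2f}")
```

The slope quantifies how much the second variable changes, on average, for a one-unit change in the first.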

Example
In a class, the data is the set of marks obtained by 50 students. Taking
the average of this data gives the average mark of the 50 students, which
is a descriptive statistic. If that average is 88 out of 100, we can then
draw a conclusion on the basis of this outcome.

Mean, Median and Mode in Statistics

Mean: The mean is the arithmetic average of a data set, found by adding
the numbers in the set and dividing by the number of observations in the
data set.

Median: The middle number in the data set when it is listed in ascending
or descending order is the median. If the number of observations is
even, the median is the average of the two middle numbers.

Mode: The value that occurs most often in a data set is the mode. A data
set may have more than one mode.
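The three averages can be computed with Python's standard `statistics` module. This is a minimal sketch, and the marks below are made up for illustration.

```python
import statistics

# Hypothetical marks (out of 100) for nine students.
marks = [70, 88, 95, 88, 60, 79, 88, 91, 79]

mean_mark = statistics.mean(marks)      # sum of the values / number of values
median_mark = statistics.median(marks)  # middle value of the sorted data
mode_mark = statistics.mode(marks)      # most frequently occurring value

print("mean:", mean_mark)
print("median:", median_mark)
print("mode:", mode_mark)
```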

Measures of Dispersion in Statistics

The measures of central tendency do not suffice to describe a data set
completely. The variability of the data is therefore described by a
value called a measure of dispersion.

The different measures of dispersion include:

1. The range is calculated as the difference between the maximum value
and the minimum value of the data points.
2. The quartile deviation is an absolute measure of dispersion. The
sorted data points are divided into four quarters. Find the
median of the data points. The median of the data points to the left
of this median is the lower quartile, and the median of the data
points to the right of this median is the upper quartile.
Upper quartile - lower quartile is the interquartile range.
Half of this is the quartile deviation.
3. The mean deviation is the average of the absolute differences
between the items in a distribution and the mean or median of that
series.
4. The standard deviation measures the amount of variation of a set of
values about their mean; it is the square root of the variance.
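The four measures listed above can be sketched with the standard library alone. The data are hypothetical. The quartiles are taken as the medians of the lower and upper halves, as in the description above; other quartile conventions give slightly different values.

```python
import statistics

# Hypothetical observations, sorted so the data can be split into halves.
data = sorted([2, 4, 4, 5, 7, 9, 10, 12])

# 1. Range: maximum minus minimum.
data_range = max(data) - min(data)

# 2. Quartile deviation: half the interquartile range.
half = len(data) // 2            # for odd n this leaves out the middle value
q1 = statistics.median(data[:half])    # lower quartile
q3 = statistics.median(data[-half:])   # upper quartile
iqr = q3 - q1
quartile_deviation = iqr / 2

# 3. Mean deviation about the mean.
mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

# 4. Standard deviation (population form: divides by n).
std_dev = statistics.pstdev(data)

print(data_range, quartile_deviation, mean_deviation, round(std_dev, 3))
```

`statistics.stdev` would give the sample standard deviation (dividing by n - 1) instead.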

Types of Data in Statistics


There are two types of data in statistics, namely qualitative data and
quantitative data. In data analysis, the correct testing technique can be
applied only once the data has been sorted into its type. The different
types of data in statistics are given below.
1. Qualitative Data or Categorical Data
Nominal data - This type of data can be divided into two or more mutually
exclusive groups that do not overlap. Labels and tags are used to
categorize such data. Nominal data does not have any intrinsic
ordering and does not possess any numeric property. Examples of
nominal data are gender, eye color, etc.
Ordinal data - As with nominal data, arithmetic operations cannot be
performed on ordinal data, as it does not possess any numerical
property. However, such data can be intrinsically ordered. For example,
rating a restaurant experience on a scale of 1 - 5.

2. Quantitative Data
Discrete data - This type of data takes only integer values and cannot
be divided into smaller or finer parts. Mathematical operations can be
performed on discrete data. For example, the number of days in a month,
the number of teachers in a school, etc.
Continuous data - Such data can be divided into finer levels and can
take on any numeric value. There are two types of continuous data,
namely interval and ratio data. Interval data can be negative and does
not have a meaningful zero (for example, temperature in degrees
Celsius). Ratio data, on the other hand, can never be negative and has a
meaningful zero (for example, height or weight). Calculations for
continuous data are performed using descriptive statistics.

Types of Variables in Statistics

Just as there are two types of data, there are two types of variables in
statistics. These variables are used to represent the corresponding
data. In both types of statistics, it is necessary to choose the right
kind of variable in order to apply the appropriate statistical test. The
different types of variables in statistics are given below:

Qualitative Variables
Binary or dichotomous variables - Such a variable represents nominal
data with only two possible outcomes and does not possess any intrinsic
ordering. For example, passing or failing an examination.
Nominal variables - Nominal variables are used to represent data that
does not have any rank and cannot be ordered intrinsically. In other
words, they represent nominal data. For example, breeds of dogs.
Ordinal variables - This type of variable is used to represent ordinal
data, wherein the groups can be ranked in a specific order. An example
is the finishing rank of people in a race.

Quantitative Variables
Discrete variables - Discrete variables represent counts of items or
values. Such variables are also known as integer variables. The number
of flowers of each type in a garden can be represented using discrete
variables.
Continuous variables - These variables, sometimes called ratio
variables, are used to denote continuous data. They can take any of
infinitely many values within a range. For example, the volume of a
sphere.

Data Visualization and Interpretation


Data visualization is the representation of data through the use of
common graphics, such as charts, plots, infographics, and even
animations. These visual displays of information communicate complex
data relationships and data-driven insights in a way that is easy to
understand.
Data visualization can be used for a variety of purposes, and it is
important to note that it is not reserved for data teams. Management
also uses it to convey organizational structure and hierarchy, while
data analysts and data scientists use it to discover and explain
patterns and trends.
Data visualization is commonly used to spur idea generation across
teams. Visualizations are frequently used during brainstorming or Design
Thinking sessions at the start of a project, where they support the
collection of different perspectives and highlight the common concerns
of the group. While these visualizations are usually unpolished, they
help lay the foundation of the project by ensuring that the team is
aligned on the problem it is trying to address for key stakeholders.
Data visualization is a critical step in the data science process,
helping teams and individuals convey data more effectively to colleagues
and decision makers. Teams that manage reporting systems typically use
defined template views to monitor performance. However, data
visualization isn't limited to performance dashboards. For example,
while text mining, an analyst may use a word cloud to capture key
concepts, trends, and hidden relationships within unstructured data.
Alternatively, they may use a graph structure to illustrate
relationships between entities in a knowledge graph. There are a number
of ways to represent different types of data, and it is important to
remember that visualization is a skill set that should extend beyond the
core analytics team.
Data interpretation is the process of reviewing data through
well-defined methods that help assign meaning to the data and arrive at
a relevant conclusion. Analysis is the process of ordering,
categorizing, and summarizing data to answer research questions. It
should be done quickly and effectively, and the results should stand out
clearly. Choosing the type of data plot is an important part of
achieving this. As the volume of data grows, so does this need, which is
why data plots have become so important. There are many types of plots
used in data visualization, and it is often tricky to choose which type
is best for your business or data. Each type has strengths and
weaknesses that make it better than the others in some situations.

These data plot types for visualization are sometimes called graphs or
charts.

Benefits of good data visualization

Our eyes are drawn to colours and patterns. We can quickly distinguish
red from blue, and a square from a circle. Our culture is visual,
including everything from art and advertisements to TV and movies.

Data visualization is another form of visual art that grabs our interest
and keeps our eyes on the message. When we see a chart, we quickly see
trends and outliers. If we can see something, we internalize it quickly.
It's storytelling with a purpose. If you've ever stared at a massive
spreadsheet of data and couldn't see a trend, you know how much more
effective a visualization can be. The uses of data visualization are as
follows.

It is a powerful way to explore data with presentable results.

Its primary use is in the pre-processing portion of the data mining
process.

It supports the data cleaning process by finding incorrect and missing
values.

It helps with variable derivation and selection, that is, determining
which variables to include in the analysis and which to discard.

It also plays a role in combining categories as part of the data
reduction process.

Data Visualization Techniques

Box plots
Histograms
Heat maps
Charts
Tree maps
Word Cloud/Network diagram

Box Plots

A box plot (boxplot) is a standardized way of displaying the
distribution of data based on a five-number summary ("minimum", first
quartile (Q1), median, third quartile (Q3), and "maximum"). It can tell
you about your outliers and what their values are. It can also tell you
whether your data is symmetrical, how tightly your data is grouped, and
whether and how your data is skewed.
A box plot is a graph that gives you a good indication of how the
values in the data are spread out. Although box plots may seem
primitive in comparison to a histogram or density plot, they have
the advantage of taking up less space, which is useful when
comparing distributions between many groups or datasets. For
some distributions/datasets, you will find that you need more
information than the measures of central tendency (median, mean,
and mode). You need to have information on the variability or
dispersion of the data.
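The five-number summary behind a box plot can be sketched with NumPy. The data are hypothetical. The 1.5 x IQR rule for flagging outliers is a common convention and is assumed here; the notes do not specify it. The quoted "minimum" and "maximum" are then the most extreme values that are not outliers.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([7, 8, 9, 10, 11, 12, 13, 14, 15, 40])

# Quartiles (NumPy's default linear interpolation between data points).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]

# Whisker ends: the "minimum" and "maximum" among the non-outliers.
whisker_low = values[~is_outlier].min()
whisker_high = values[~is_outlier].max()

print("summary:", whisker_low, q1, median, q3, whisker_high)
print("outliers:", outliers)
```

A median that sits off-center in the box, or whiskers of unequal length, indicate skew.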

List of Methods to Visualize Data

Column Chart: It is also called a vertical bar chart, where each
category is represented by a rectangle. The height of the rectangle
is proportional to the values that are plotted.

Bar Graph: It has rectangular bars whose lengths are
proportional to the values they represent.

Stacked Bar Graph: It is a bar-style graph that has various
components stacked together so that, apart from the bars as a whole,
the components can also be compared to each other.

Stacked Column Chart: It is similar to a stacked bar; however, the
data is stacked vertically.

Area Chart: It combines the line chart and bar chart to show how
the numeric values of one or more groups change over the progression
of a second variable, typically time.

Dual Axis Chart: It combines a column chart and a line chart and
then compares the two variables.

Line Graph: The data points are connected by straight line segments,
creating a representation of the changing trend.

Mekko Chart: It can be called a two-dimensional stacked chart with
varying column widths.

Pie Chart: It is a chart where various components of a data set are
presented as slices of a pie, each representing its proportion of
the entire data set.

Waterfall Chart: With the help of this chart, the cumulative effect of
sequentially introduced positive or negative values can be
understood.

Bubble Chart: It is a multi-variable graph that is a hybrid of a
Scatter Plot and a Proportional Area Chart.

Scatter Plot Chart: It is also called a scatter chart or scatter graph.
Dots are used to denote values for two different numeric variables.

Bullet Graph: It is a variation of a bar graph. A bullet graph is used
to replace dashboard gauges and meters.

Funnel Chart: The chart shows the flow of users through a business or
sales process.

Heat Map: It is a data visualization technique that shows the
magnitude of values as color in two dimensions.

Histograms

A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges. Taller
bars show that more data falls in that range. A histogram displays the
shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency
distribution (shape) of a set of continuous data. This allows the
inspection of the data for its underlying distribution (e.g., normal
distribution), outliers, skewness, etc. It is an accurate representation
of the distribution of numerical data, and it relates to only one
variable. To build one, the entire range of values is divided into a
series of intervals, called bins or buckets, and the number of values
falling into each interval is counted.

Histograms are based on area, not height of bars

In a histogram, the height of the bar does not necessarily indicate how
many occurrences of scores there were within each bin. It is the product
of the height and the width of the bin that indicates the frequency
of occurrences within that bin. The height of the bars is often
incorrectly read as the frequency, rather than the area, because many
histograms have equally spaced bars (bins), and under these
circumstances the height of the bar does reflect the frequency.
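The point about area versus height can be checked numerically with `numpy.histogram`. The scores and bin edges below are made up. With `density=True`, NumPy returns bar heights scaled so that height times width gives each bin's share of the data.

```python
import numpy as np

# Hypothetical scores and deliberately unequal bin edges (widths 2, 2, 6).
scores = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 9])
edges = np.array([0, 2, 4, 10])

counts, _ = np.histogram(scores, bins=edges)
heights, _ = np.histogram(scores, bins=edges, density=True)
widths = np.diff(edges)

# Area of each bar = height * width = fraction of observations in the bin.
areas = heights * widths

print("counts :", counts)
print("heights:", heights)
print("areas  :", areas)
```

Here the widest bin holds the most observations, yet its bar is shorter than its neighbour's; only the areas reflect the frequencies.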

Heat Maps

A heat map is a data visualization technique that uses colour the way a
bar graph uses height and width.
If you're looking at a web page and you want to know which areas get the
most attention, a heat map shows you in a visual way that's easy to
assimilate and make decisions from. It is a graphical representation of
data where the individual values contained in a matrix are represented
as colours. It is useful for two purposes: visualizing correlation
tables and visualizing missing values in the data. In both cases, the
information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of
values, but they are not a replacement for more precise graphical
displays, such as bar charts, because colour differences cannot be
perceived accurately.

Charts

Line Chart

The simplest technique, a line plot is used to plot the relationship or
dependence of one variable on another. To plot the relationship between
the two variables, we can simply call the plot function.
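A minimal sketch of such a call, assuming the plot function meant here is Matplotlib's. The yearly values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data: one value per year.
years = [2018, 2019, 2020, 2021, 2022]
yield_tons = [0.90, 0.93, 0.89, 0.96, 1.01]

fig, ax = plt.subplots()
lines = ax.plot(years, yield_tons, marker="o")  # x values first, then y values
ax.set_xticks(years)
ax.set_xlabel("Year")
ax.set_ylabel("Yield (tons per hectare)")
ax.set_title("Yield by year")
# In an interactive session, plt.show() would display the figure.
```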

Bar Charts

Bar charts are used for comparing the quantities of different categories
or groups. Values of a category are represented with the help of bars,
which can be drawn vertically or horizontally, with the length or
height of each bar representing the value.
A bar graph plots data with the help of bars, which represent value on
the y-axis and category on the x-axis. Bar graphs use bars of varying
heights to show the data belonging to each category.
We can also stack bars on top of each other. Let's plot the data for apples
and oranges.
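A sketch of stacked bars in Matplotlib, where the `bottom` argument places the second series on top of the first. The apple and orange figures are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical yearly quantities for two categories.
years = [2019, 2020, 2021]
apples = [0.35, 0.60, 0.90]
oranges = [0.40, 0.80, 0.90]

fig, ax = plt.subplots()
ax.bar(years, apples, label="Apples")
# Each orange bar starts where the apple bar for that year ends.
ax.bar(years, oranges, bottom=apples, label="Oranges")
ax.set_xticks(years)
ax.set_ylabel("Quantity")
ax.legend()
```

The total bar height shows the combined quantity for the year, while the coloured segments show each category's share.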
Histograms can be customized in a similar way. We can change the number
and size of the bins using numpy.

We can create bins of unequal size too.

Similar to line charts, we can draw multiple histograms in a single
chart. We can reduce each histogram's opacity so that one histogram's
bars don't hide the others'. Let's draw separate histograms for each
species of flowers.
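The three histogram variations can be sketched as follows. The two groups are synthetic stand-ins generated with NumPy and are not the flower measurements the notes refer to. `density=True` is used for the unequal bins so that bar areas, not heights, carry the frequencies.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for one measurement taken on two groups.
rng = np.random.default_rng(0)
group_a = rng.normal(3.4, 0.4, 50)
group_b = rng.normal(2.8, 0.3, 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Equal-width bins: edges every 0.25 from 2.0 up to 4.75.
equal_edges = np.arange(2, 5, 0.25)
axes[0].hist(group_a, bins=equal_edges)
axes[0].set_title("Equal bins")

# Unequal bins, drawn as densities.
axes[1].hist(group_a, bins=[2, 3, 3.25, 3.5, 5], density=True)
axes[1].set_title("Unequal bins")

# Two histograms on one chart; alpha < 1 keeps both visible.
axes[2].hist(group_a, bins=equal_edges, alpha=0.4, label="Group A")
axes[2].hist(group_b, bins=equal_edges, alpha=0.4, label="Group B")
axes[2].set_title("Overlaid")
axes[2].legend()
```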

Pie Chart

It is a circular statistical graph that is divided into slices to
illustrate numerical proportion. The arc length of each slice is
proportional to the quantity it represents. As a rule, pie charts are
used to compare the parts of a whole and are most effective when there
are few components and when text and percentages are included to
describe the content. However, they can be difficult to interpret
because the human eye has a hard time estimating areas and comparing
visual angles.

Scatter Charts

Another common visualization technique is the scatter plot, a
two-dimensional plot representing the joint variation of two data items.
Each marker (a symbol such as a dot, square, or plus sign) represents an
observation. The marker position indicates the values for that
observation. When you assign more than two measures, a scatter plot
matrix is produced: a series of scatter plots displaying every possible
pairing of the measures assigned to the visualization. Scatter
plots are used for examining the relationship, or correlation, between X
and Y variables.

Scatter plots are used to plot two or more variables against each other
as points at their coordinates. The points are scattered across the
graph rather than joined into a line. When more than one group or series
is plotted, each is shown in a different color.
Let's use the 'Iris' dataset to draw a scatter plot.

First, let’s see how many different species of flowers we have.

Figure 26: Unique flower species

Let's try plotting the data with the help of a line chart.
This is not very informative. We cannot figure out the relationship
between different data points.

Plotting the same data as a scatter plot is much better. But we still
cannot tell apart data points belonging to different categories. We can
color the dots using the flower species as a hue.
Figure 29: Scatter plot with multiple colors

Since Seaborn uses Matplotlib's plotting functions internally, we can use
functions like plt.figure and plt.title to modify the figure.

Figure 30: Changing dimensions of scatter plot
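The original code and figures for this walkthrough are not reproduced in these notes. The sketch below shows the same idea in plain Matplotlib: one `scatter` call per species, so each gets its own color, plus a figure size and title. The handful of rows are illustrative values in the style of the Iris data, not the full dataset. Seaborn's `scatterplot` with its `hue` parameter produces this kind of coloring in a single call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Illustrative (sepal length, sepal width) pairs per species.
samples = {
    "setosa": [(5.1, 3.5), (4.9, 3.0), (4.7, 3.2)],
    "versicolor": [(7.0, 3.2), (6.4, 3.2), (6.9, 3.1)],
    "virginica": [(6.3, 3.3), (5.8, 2.7), (7.1, 3.0)],
}

fig, ax = plt.subplots(figsize=(8, 5))  # change the figure dimensions
for species, points in samples.items():
    lengths, widths = zip(*points)
    ax.scatter(lengths, widths, label=species)  # new color for each call

ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")
ax.set_title("Sepal dimensions")
ax.legend(title="Species")
```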

Heat Maps

Heatmaps are used to see changes in behavior or gradual changes in
data. A heatmap uses different colors to represent different values. How
these colors vary in hue, intensity, etc. tells us how the phenomenon
varies. Let's use heatmaps to visualize monthly passenger footfall at an
airport over 12 years from the flights dataset in Seaborn.
Figure 31: Flights dataset
The above dataset, flights_df shows us the monthly footfall in an airport
for each year, from 1949 to 1960. The values represent the number of
passengers (in thousands) that passed through the airport. Let’s use a
heatmap to visualize the above data.

Figure 32: Plotting heatmap

The brighter the color, the higher the footfall at the airport. By
looking at the graph, we can infer that:
1. The annual footfall for any given year is highest around July and
August.

2. The footfall grows annually. Any month in a year will have a higher
footfall than the same month in previous years.

Let's display the actual values in our heatmap and change the hue to
blue.

Figure 33: Plotting heatmap with values
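The original figures are not reproduced in these notes. The sketch below draws an annotated heatmap in a blue palette with plain Matplotlib. The passenger numbers are made up and are not values from the flights dataset. In Seaborn, `heatmap` with `annot=True` and a `cmap` argument gives a similar result. With the "Blues" colormap, darker cells mean higher values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up passenger counts (thousands): rows are months, columns are years.
months = ["Jun", "Jul", "Aug", "Sep"]
years = [1958, 1959, 1960]
passengers = np.array([
    [430, 470, 530],
    [490, 550, 620],
    [500, 560, 600],
    [400, 460, 510],
])

fig, ax = plt.subplots()
image = ax.imshow(passengers, cmap="Blues")  # one colored cell per value

ax.set_xticks(list(range(len(years))))
ax.set_xticklabels([str(year) for year in years])
ax.set_yticks(list(range(len(months))))
ax.set_yticklabels(months)

# Write the actual value inside each cell.
for row in range(passengers.shape[0]):
    for col in range(passengers.shape[1]):
        ax.text(col, row, str(passengers[row, col]), ha="center", va="center")

colorbar = fig.colorbar(image, ax=ax)
colorbar.set_label("Passengers (thousands)")
```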

Bubble Charts

It is a variation of the scatter chart in which the data points are
replaced with bubbles, and an additional dimension of data is
represented in the size of the bubbles.

Timeline Charts

Timeline charts illustrate events in chronological order (for example,
the progress of a project, an advertising campaign, or an acquisition
process) in whatever unit of time the data was recorded (for example,
week, month, quarter, or year). A timeline chart shows the chronological
sequence of past or future events on a timescale.

Tree Maps

A treemap is a visualization that displays hierarchically organized data
as a set of nested rectangles, parent elements being tiled with their
child elements. The sizes and colours of the rectangles encode the
values of the data points they represent. A leaf node rectangle has an
area proportional to the specified dimension of the data. Depending on
the choice, the leaf node is coloured, sized, or both according to
chosen attributes. Treemaps make efficient use of space and can thus
display thousands of items on the screen simultaneously.
