BA_Unit 4 (P2)

This lesson on Descriptive Statistics using R covers essential skills for data analysis, including importing data, data visualization techniques, and understanding measures of central tendency and dispersion. It emphasizes the importance of visualizing data through various charts and analyzing relationships between variables using statistical measures. By the end of the lesson, students will have practical skills in data analysis and interpretation using R.

LESSON 5
Descriptive Statistics Using R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Importing Data File
5.4 Data Visualisation Using Charts
5.5 Measure of Central Tendency
5.6 Measure of Dispersion
5.7 Relationship between Variables
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings

5.1 Learning Objectives


After reading this lesson, students will be able to:
Write code to import data files in R.
Create and interpret various types of charts such as histograms, bar charts, box plots, line graphs, and scatter plots for data visualization.
Describe data using measures of central tendency (mean, median, mode).
Describe data using measures of dispersion (range, variance, standard deviation, IQR).
Analyse relationships between variables using statistical measures such as covariance, correlation, and the coefficient of determination.
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

5.2 Introduction
Data analysis is an important skill in today’s data-driven world, allowing
people and organizations to extract meaningful insights from raw data.
This lesson equips you with essential tools and techniques in R, a
powerful language for statistical computing and visualization, to handle
data effectively. We begin by learning how to import data files, a
fundamental step in data analysis. Whether you are working with CSV
files, Excel sheets, or other formats, importing data correctly ensures a
smooth workflow for subsequent analysis.
Next, we move on to data visualization, an essential part of data
exploration and communication. You will learn how to reveal patterns,
distributions, and relationships in your data using visual
representations such as histograms, bar charts, box plots, line graphs,
and scatter plots. Visualization not only helps in understanding complex
datasets but also communicates results to others effectively. Descriptive
statistics are the basis of data analysis. You will look into measures of
central tendency (mean, median, mode), which summarize the central value
of a dataset, and measures of dispersion (range, variance, standard
deviation, interquartile range), which describe the variability or spread
of data. Together, these measures give an overall picture of the
characteristics of the data.
Lastly, we discuss the relationships between variables using concepts such
as covariance, correlation, and the coefficient of determination (R²). With
these tools, we can express and interpret the nature of the relationship
between variables, which is the foundation on which predictive modelling
and decision-making are based. By the end of this lesson, you will have
gained theoretical knowledge and practical skills in the analysis and
interpretation of data using R. Whether you are a beginner or are
enhancing your existing data skills, this lesson gives you a good
foundation for working with real-world datasets.

5.3 Importing Data File


You should now be familiar with the power of R for data analysis. To
analyse any dataset effectively, we first need to bring in data from
various sources, so importing data into R is one of the most essential
skills. Like other programming tools, R supports a variety of sources and
file formats for data import, such as spreadsheets, text files (.csv or
.txt), databases (MySQL, SQLite, and PostgreSQL), other statistical
software (SPSS, SAS, or STATA), and other formats (JSON, XML, HTML, etc.).
To import data from a CSV file we use the function read.csv(). The
syntax is read.csv(filepath, header, sep), where filepath specifies the
location of the file (this first argument is named file in R), header
specifies whether the first row contains column names (TRUE/FALSE), and
sep specifies the delimiter (for example “,” for CSV).

Code Window 1
This loads the file into the variable data, and head() displays the first
six rows by default.
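A minimal sketch of such an import is given here. It assumes a small hypothetical CSV file (the column names and values are invented for illustration) written to a temporary location so that the example runs on its own; in practice csv_path would point to your own file.

```r
# Write a small CSV to a temporary file so the example is self-contained.
# The contents are hypothetical; replace csv_path with your own file path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student,score",
             "S1,82",
             "S2,75",
             "S3,91"), csv_path)

# header = TRUE: the first row holds column names
# sep = ",":     values are separated by commas
data <- read.csv(csv_path, header = TRUE, sep = ",")

# head() prints up to the first six rows (this file has only three)
head(data)
```

Since header = TRUE and sep = "," are already the defaults of read.csv(), the call read.csv(csv_path) would give the same result here.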
To import other formats, we first need to load the relevant package
with library(). Code to read Excel and JSON files is shown below in
code window 2.

Code Window 2
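One possible way to read these formats is sketched below. The choice of the readxl and jsonlite packages is an assumption (the text does not name the packages), and both must be installed separately, so each part only runs when its package is available. The JSON content is hypothetical; the Excel file is a sample workbook bundled with readxl.

```r
# JSON: jsonlite::fromJSON() accepts a file path, a URL, or a JSON string.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_path <- tempfile(fileext = ".json")
  writeLines('[{"student": "S1", "score": 82}, {"student": "S2", "score": 75}]',
             json_path)
  # An array of records is converted to a data frame
  json_data <- jsonlite::fromJSON(json_path)
  print(json_data)
}

# Excel: readxl::read_excel() reads .xls and .xlsx files.
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl_example() returns the path of a sample workbook shipped with readxl
  xlsx_path <- readxl::readxl_example("datasets.xlsx")
  excel_data <- readxl::read_excel(xlsx_path, sheet = 1)  # first worksheet
  print(head(excel_data))
}
```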
R can also interact with databases using packages such as DBI and
RMySQL, as shown in code window 3.

Code Window 3
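The general DBI workflow (connect, query, disconnect) can be sketched as follows. Because a MySQL connection needs a running server and credentials, this sketch swaps in an in-memory SQLite database through the RSQLite package instead; the table name cars and the query are illustrative assumptions. With RMySQL, the driver passed to dbConnect() would be RMySQL::MySQL() together with the host, user, password, and database name.

```r
# Runs only if DBI and RSQLite are installed (both are add-on packages).
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # 1. Open a connection (an in-memory SQLite database, for illustration)
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # 2. Create a small table from the first five rows of the built-in mtcars data
  DBI::dbWriteTable(con, "cars", mtcars[1:5, c("mpg", "cyl")])

  # 3. Run an SQL query; the result comes back as a data frame
  db_data <- DBI::dbGetQuery(con, "SELECT mpg, cyl FROM cars WHERE cyl = 6")
  print(db_data)

  # 4. Always close the connection when done
  DBI::dbDisconnect(con)
}
```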
Once you have imported data into R, you can start analysing it using
built-in functions or libraries.

5.4 Data Visualisation Using Charts


In this section we will learn about data visualization. R provides a rich
ecosystem of libraries (such as ggplot2, plotly, lattice, and cowplot),
each offering unique capabilities for creating a variety of charts, plots,
and interactive visualizations. We will plot various charts using the
widely used ggplot2 library, and for this purpose we will use the mpg
dataset. This dataset is provided by the ggplot2 package and contains
information about the fuel efficiency and various characteristics of
cars. The columns of the mpg dataset are shown below in Table 5.1.

Table 5.1: Columns of mpg Dataset

The first five rows of mpg dataset are shown below in code window 4.

Code Window 4
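A short sketch for inspecting the dataset, assuming the ggplot2 package is installed:

```r
library(ggplot2)   # the mpg dataset ships with ggplot2

head(mpg, 5)       # first five rows
names(mpg)         # column names
dim(mpg)           # number of rows and columns
```

Calling head() without a second argument would show six rows; passing 5 limits the output to the first five.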
(i) Histograms: We can use a histogram to visualize the distribution of
a single continuous variable by binning, i.e. dividing its values into
intervals or bins. It is useful for identifying patterns such as
skewness, spread, or unusual gaps. The code is given in code window 5.

Code Window 5

In this code we have created a histogram of highway mileage (hwy).
The mileage is grouped into bins of width 2, and the fill color of the
bars is set to steel blue with a black outline. The x and y axes are
labelled using labs().
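A sketch consistent with this description is given below. The bin width, colors, and use of labs() follow the text; the title and axis label wording are assumptions.

```r
library(ggplot2)

# Histogram of highway mileage with bins of width 2
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Highway Mileage",   # label text is illustrative
       x = "Highway mileage (miles per gallon)",
       y = "Number of cars")

print(p_hist)   # draws the chart on the active graphics device
```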
(ii) Bar Chart: We use bar charts to represent categorical data. They
are ideal for comparing discrete groups, as they can show the counts
or proportions of each category. Code window 6 shows how to create a
bar chart.

Code Window 6
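As an illustration, the sketch below counts the cars in each class of the mpg dataset. The choice of the class column, the fill color, and the label text are assumptions made for this example.

```r
library(ggplot2)

# geom_bar() counts the rows in each category of the x variable
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Cars per Class",   # label text is illustrative
       x = "Car class",
       y = "Count")

print(p_bar)

# The bar heights correspond to these counts
table(mpg$class)
```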
(iii) Box Plot: A box plot summarizes the distribution of a continuous
variable by displaying the median, quartiles, and potential outliers.
It is useful when we need to compare distributions across multiple
groups. The code is shown in code window 7.

Code Window 7
This code gives a box plot of highway mileage (hwy) for different
car classes (class). The function geom_boxplot() generates a box plot
for each class with statistical summaries. The fill color is set to
light green.
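A sketch matching this description follows. The variables and fill color come from the text, while the title and axis labels are assumptions.

```r
library(ggplot2)

# One box per car class, summarizing highway mileage
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Highway Mileage by Car Class",   # label text is illustrative
       x = "Car class",
       y = "Highway mileage (miles per gallon)")

print(p_box)
```

In each box, the thick middle line is the median, the box edges are the first and third quartiles, and points drawn beyond the whiskers are potential outliers.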
(iv) Line Graphs: Line graphs are recommended when we want to analyse
trends over a continuous variable or to observe relationships. The
code to generate a line graph is shown in code window 8.

Code Window 8
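One way to draw a line graph with the mpg dataset is sketched below. It plots the average highway mileage for each engine displacement; the choice of variables, the averaging step, and the labels are assumptions for illustration.

```r
library(ggplot2)

# Average highway mileage for each engine displacement value
avg_hwy <- aggregate(hwy ~ displ, data = mpg, FUN = mean)

# Line connecting the averages, with points marking each value
p_line <- ggplot(avg_hwy, aes(x = displ, y = hwy)) +
  geom_line(color = "steelblue") +
  geom_point() +
  labs(title = "Average Highway Mileage by Engine Displacement",
       x = "Engine displacement (litres)",
       y = "Average highway mileage (miles per gallon)")

print(p_line)
```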

(v) Scatter Plots: When we need to visualize the relationship between
two continuous variables, we can use scatter plots. They are an ideal
choice for identifying trends, clusters, or correlations. Code
window 9 shows how to generate a scatter plot.

Code Window 9
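A sketch of a scatter plot on the mpg dataset is given below, plotting engine displacement against highway mileage. The variable pair, point color, and labels are assumptions for illustration.

```r
library(ggplot2)

# Each point is one car: engine size on x, highway mileage on y
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue") +
  labs(title = "Engine Displacement vs Highway Mileage",
       x = "Engine displacement (litres)",
       y = "Highway mileage (miles per gallon)")

print(p_scatter)
```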

IN-TEXT QUESTIONS
1. Which function in R is commonly used to load a CSV file?
2. What type of chart is used to visualize the frequency distribution
of data?

5.5 Measure of Central Tendency


Measures of central tendency are among the most common statistical tools
for analysing and summarising a dataset and understanding its
characteristics. They summarize a set of data by identifying a single
value that represents the entire distribution. The three most common
measures of central tendency are mean, median, and mode. Each of these
measures provides unique insights, and the decision about which one to
use depends on the nature of the data and the analysis that you need to
do.

(i) Mean: The arithmetic average or mean of n observations is
calculated by the formula below:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
It is a widely used measure for quantitative data, although it is
sensitive to outliers, which can skew the result. Code window 10
shows how to compute the mean of a dataset.
(ii) Median: The median is the middle value in a sorted dataset. It is
the central value if the dataset has an odd number of observations,
and the average of the two central values if the number of
observations is even. Compared to the mean, the median is less affected
by outliers. For example, the median of 3, 5, 7 is 5, while the median
of 3, 5, 7, 9 is (5 + 7)/2 = 6. R code is shown in code window 10.

(iii) Mode: The mode is the value that appears most frequently in
a dataset. A dataset can be unimodal (one mode), multimodal (more
than one mode), or have no mode at all if no value repeats. For
example, 1, 2, 2, 3 is unimodal (mode 2); 1, 1, 2, 2, 3 is multimodal
(modes 1 and 2); and 1, 2, 3 has no mode. The corresponding code to
compute the mode of a dataset is given in code window 10.

Code Window 10
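The three measures can be computed as sketched below on a small made-up vector. Base R provides mean() and median(), but its mode() function reports the storage type of an object rather than the most frequent value, so the helper get_mode() is a hypothetical function written for this example.

```r
x <- c(2, 4, 4, 5, 7, 9, 4, 7)   # illustrative data

mean(x)     # arithmetic average: sum(x) / length(x)
median(x)   # middle value of the sorted data

# Helper (hypothetical name): returns every value with the highest frequency,
# or NULL when no value repeats (no mode).
get_mode <- function(v) {
  counts <- table(v)                    # frequency of each distinct value
  if (all(counts == 1)) return(NULL)    # nothing repeats
  as.numeric(names(counts)[counts == max(counts)])
}

get_mode(x)                   # unimodal
get_mode(c(1, 1, 2, 2, 3))    # multimodal: two values tie
get_mode(c(1, 2, 3))          # no mode
```

Because 9 is far from most of the other values, it pulls the mean above the median in this example, which illustrates the mean's sensitivity to extreme values.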

5.6 Measure of Dispersion


As you have seen in the previous section, measures of central tendency
provide information about the central values, but this information alone
cannot be used to deduce the characteristics of the data. We also need
measures of dispersion to understand how spread out the data is. They
describe the variability or spread of data points around the central value.
Common measures of dispersion are discussed below, and their code is
shown in code window 11.
(i) Range: The simplest measure of dispersion is the range, which is the
difference between the maximum and minimum values in a dataset.
Although the range is easy to calculate, it is sensitive to outliers.
Formula:
Range = Maximum Value – Minimum Value

(ii) Variance: Variance measures deviation from the mean, i.e. how far
each data point lies from the mean on average, using squared
deviations. A higher variance indicates greater variability in the
data. Variance is expressed in squared units, which makes it harder to
interpret directly. The formula of variance is given below:
Formula (sample form, as computed by R's var()):
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
(iii) Standard Deviation: Since variance is difficult to interpret, we
use the standard deviation, which is the square root of the variance
and therefore provides a measure of dispersion in the same units as the
original data. A smaller standard deviation means the data points
are closer to the mean.
Formula (sample form, as computed by R's sd()):
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
(iv) Interquartile Range (IQR): It measures the spread of the middle
50% of the data values, i.e. how spread out the middle half of a
dataset is. For better understanding, imagine you line up all
your data from smallest to largest (sorted). The IQR focuses on the
middle 50% of those numbers, ignoring the smallest 25% and the largest
25% of values. It is the difference between the third quartile (Q3)
and the first quartile (Q1):
Formula:
IQR = Q3 – Q1
Hence, measures of dispersion help us understand how consistent or varied
the data is. For example, two datasets with the same mean can have very
different standard deviations, indicating different levels of spread. We
can use these measures of dispersion to add depth to our understanding
of data, complementing the central tendency measures. They are essential
for making informed decisions based on data variability.

Code Window 11
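The four measures can be computed with base R functions, as sketched below on a small made-up vector. Note that var() and sd() use the sample (n - 1) divisor, and that quantile() and IQR() use R's default quartile method, which can differ slightly from hand calculations based on other conventions.

```r
x <- c(2, 4, 4, 5, 7, 9, 4, 7)   # illustrative data

# Range: range() returns the minimum and maximum; their difference is the spread
range(x)
rng <- max(x) - min(x)
rng

# Variance and standard deviation (sample versions, divisor n - 1)
var(x)
sd(x)          # identical to sqrt(var(x))

# Quartiles and interquartile range
quantile(x, probs = c(0.25, 0.75))   # Q1 and Q3
IQR(x)                               # Q3 - Q1
```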

5.7 Relationship between Variables


Having covered measures of central tendency and dispersion, we now turn
to the relationship between variables, which is also a key aspect of data
analysis. It helps in identifying patterns, trends, and associations in
data that can support decision-making and prediction. Variables can be
related in several ways. A positive relationship means that two variables
increase together; for example, study time and exam scores often show a
positive relationship. A negative relationship is observed when an
increase in one variable is accompanied by a decrease in the other, like
distance from a Wi-Fi router and internet speed. A nonlinear relationship
occurs when variables follow a curved or complex pattern instead of a
straight line (we will study these in lesson 6). There is no relationship
between two variables when changes in one are not associated with changes
in the other; for example, shoe size and IQ. In this section, we’ll
discuss three key measures of relationship: covariance, correlation, and
the coefficient of determination (R²). Both covariance and correlation
measure the linear dependency between a pair of random variables, also
called bivariate data. You have already studied correlation and
covariance in lesson 2; here we explore them again with R programming
code.
Covariance is a statistical measure that indicates how two variables change
together. It shows whether an increase in one variable is accompanied by
an increase in the other variable, or by a decrease. The formula for
covariance (sample form, as computed by R's cov()) is given below:
$$\mathrm{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Thus, covariance measures how the two variables vary together. A positive
covariance means the relationship is direct: both variables increase or
decrease together. A negative covariance means that one variable
increases as the other decreases. If the covariance is close to zero,
there is no apparent linear relationship. However, the magnitude of
covariance is hard to interpret because it is measured in units that are
the product of the variables’ units, so it is hard to compare or
standardize across datasets. An example is shown in code window 12 below:

Code Window 12
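A minimal sketch using the study time and exam score example mentioned earlier in this section is given below. The data values are made up for illustration; cov() computes the sample covariance, and the manual calculation applies the same definition step by step.

```r
hours  <- c(1, 2, 3, 4, 5)        # hypothetical study hours
scores <- c(52, 58, 65, 70, 78)   # hypothetical exam scores

# Built-in sample covariance
cov_builtin <- cov(hours, scores)
cov_builtin

# The same value from the definition:
# sum of products of deviations from the means, divided by n - 1
cov_manual <- sum((hours - mean(hours)) * (scores - mean(scores))) /
  (length(hours) - 1)
cov_manual
```

The result is positive, indicating that scores tend to rise with study hours, but its size depends on the units of both variables.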

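While covariance indicates the direction of a relationship, interpreting
its magnitude is challenging because it depends on the units of the
variables. This can be resolved by normalizing covariance, i.e. computing
correlation, which results in a dimensionless measure called the
correlation coefficient (r). Its formula is shown below, where s_X and
s_Y are the standard deviations of X and Y:
$$r = \frac{\mathrm{Cov}(X, Y)}{s_X \, s_Y}$$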

When r = 1 there is a perfect positive linear relationship, r = -1
indicates a perfect negative linear relationship, and r = 0 means no
linear relationship. Correlation indicates not only the strength of the
relationship but also its direction. The R code for computing correlation
is given in code window 13:

Code Window 13
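The sketch below computes the correlation coefficient for two made-up pairs of variables, echoing the positive (study time and scores) and negative (router distance and speed) examples from this section. By default cor() returns the Pearson correlation coefficient; all data values are hypothetical.

```r
hours  <- c(1, 2, 3, 4, 5)        # hypothetical study hours
scores <- c(52, 58, 65, 70, 78)   # hypothetical exam scores

# Pearson correlation coefficient (the default method of cor())
r_pos <- cor(hours, scores)
r_pos

# The same value obtained by normalizing the covariance
r_manual <- cov(hours, scores) / (sd(hours) * sd(scores))
r_manual

# A negative relationship: speed falls as distance from the router grows
distance <- c(1, 2, 3, 4, 5)      # hypothetical distance in metres
speed    <- c(95, 80, 60, 45, 20) # hypothetical speed in Mbps
r_neg <- cor(distance, speed)
r_neg
```

Unlike covariance, these values always lie between -1 and 1, so they can be compared across datasets measured in different units.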
Correlation provides valuable insights into the strength and direction of
the relationship between two variables. A strong positive correlation (r
of 0.7 or higher) means that as one variable increases, the other also
increases significantly, demonstrating a robust linear relationship. On the
other hand, when the correlation is close to zero, changes in one variable
do not reliably predict changes in the other, which indicates there isn’t
any practically important linear relation between the variables.
Further, the coefficient of determination, denoted by R², is used to
explain variance: it quantifies how well one variable predicts another. It
is especially useful in regression analysis for evaluating goodness of fit.
For a simple linear relationship between two variables, it is computed by
squaring the correlation coefficient, as shown in the formula:
R² = r²

This value represents the proportion of variance in one variable (Y)
explained by the other (X). For instance, R² = 0.81 implies that 81% of
the variation in Y can be explained by X.
The value of R² indicates the strength of the predictive relationship:
R² = 0 indicates no linear predictive relationship between the variables,
i.e. the independent variable (X) does not explain any of the variation in
the dependent variable (Y), while R² = 1 signifies perfect prediction,
indicating that all the variation in Y is explained by X.
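The relationship between r and R² can be checked with a short sketch. It uses made-up study-hour and score data, and fits a simple linear regression with lm() only to show that its reported R² matches the squared correlation; regression itself is outside the scope of this lesson.

```r
hours  <- c(1, 2, 3, 4, 5)        # hypothetical study hours (X)
scores <- c(52, 58, 65, 70, 78)   # hypothetical exam scores (Y)

# Coefficient of determination as the squared correlation coefficient
r <- cor(hours, scores)
r_squared <- r^2
r_squared

# The same quantity reported by a simple linear regression of Y on X
fit <- lm(scores ~ hours)
r_squared_lm <- summary(fit)$r.squared
r_squared_lm
```

Multiplying r_squared by 100 gives the percentage of the variation in scores that is explained by study hours in this made-up dataset.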
IN-TEXT QUESTIONS
3. Which measure of central tendency is the middle value of a
sorted dataset?
4. What is the statistical term for the difference between the
maximum and minimum values in a dataset?
5. What term describes the strength and direction of the linear
relationship between two variables?
6. Which metric indicates how well one variable explains another
in regression analysis?

5.8 Summary
This lesson provides a good foundation in data analysis with R, covering
how to import data, visualize information, describe data, and measure
dependence among variables. You learnt how to read data files and import
data from various kinds of sources, such as CSV or Excel files, into R.
The next area of key importance is visualization, where we discussed
several types of charts. Histograms help to reveal distributions. Bar
charts are useful for comparing categorical data. Box plots summarize
distributions through medians, quartiles, and outliers. Line graphs
capture trends over time or another continuous variable, and scatter
plots show the relationship between two variables. These tools not only
help in analyzing the data but also make it easy to communicate findings.
Further, descriptive statistics were explained, which help to describe the
characteristics of a dataset. Measures of central tendency (mean, median,
and mode) summarize the center of the data, while measures of dispersion
(range, variance, standard deviation, and IQR) explain the variation or
spread in the data. Together, these measures allow for a full description
of the dataset.
Lastly, we covered relationships between variables. You have learnt to
analyze how variables are connected through covariance and correlation,
which describe how two variables change together and the strength of the
relationship. The coefficient of determination, or R², quantifies the
degree to which one variable predicts another, giving a more nuanced
understanding of how they interact. Mastering these concepts and tools
will enable you to import, visually present, describe, and analyze data in
preparation for more advanced data analysis and decision-making.

5.9 Answers to In-Text Questions


1. read.csv
2. Histogram
3. Median
4. Range
5. Correlation
6. Coefficient of Determination

5.10 Self-Assessment Questions


1. Explain the process of importing a CSV file in R.
2. Create a histogram and a scatter plot using a dataset of your choice.
3. What is the difference between variance and standard deviation?
Provide examples.
4. How do covariance and correlation differ? Explain with an example.

5.11 References
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using
R. SAGE Publications.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An
introduction to statistical learning: With applications in R. Springer.
Kabacoff, R. I. (2015). R in action: Data analysis and graphics
with R (2nd ed.). Manning Publications.

5.12 Suggested Readings


Matloff, N. (2011). The art of R programming: A tour of statistical
software design. No Starch Press.
R Core Team. (n.d.). The R project for statistical computing.
Retrieved from https://cran.r-project.org/
Wickham, H., & Grolemund, G. (2017). R for data science: Import,
tidy, transform, visualize, and model data. O’Reilly Media.
