BA_Unit 4 (P2)
5
Descriptive Statistics
Using R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Importing Data File
5.4 Data Visualisation Using Charts
5.5 Measure of Central Tendency
5.6 Measure of Dispersion
5.7 Relationship between Variables
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings
5.2 Introduction
Data analysis is an important skill in today’s data-driven world, allowing
people and organizations to extract meaningful insights from raw data.
This lesson focuses on equipping you with the essential tools and
techniques in R, a powerful statistical computing and visualization language,
to handle data effectively. We begin by learning how to import data
files, a fundamental step in data analysis. Whether working with CSV
files, Excel sheets, or other formats, importing data correctly ensures a
seamless workflow for subsequent analysis.
Next, we turn to data visualization, an essential part of data
exploration and communication. You will learn how to reveal patterns,
distributions, and relationships in your data using visual
representations such as histograms, bar charts, box plots, line graphs, and
scatter plots. Visualization not only helps you understand complex datasets
but also helps you communicate results to others effectively. Descriptive
statistics are the basis of data analysis. You will study measures of central
tendency (mean, median, and mode), which summarize the central value of a
dataset, and measures of dispersion (range, variance, standard deviation, and
interquartile range), which describe the variability or spread of the data.
Together, these measures give an overall picture of the characteristics of the data.
Lastly, we discuss the relationships between variables using concepts such
as covariance, correlation, and the coefficient of determination (R²). With
these tools, we can express and interpret the nature of the relationship
between variables, which is the foundation of predictive modelling
and decision-making. By the end of this lesson, you will
have gained theoretical knowledge and practical skills in the analysis and
interpretation of data using R. Whether you are a beginner or are extending
your existing data skills, this lesson gives you a good foundation for
working with real-world datasets.
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
R also supports a variety of file formats for data import, such as
spreadsheets, text files (.csv or .txt), databases (MySQL, SQLite, and
PostgreSQL), statistical software files (SPSS, SAS, or Stata), and other
formats (JSON, XML, HTML, etc.).
To import data from a CSV file we use the function read.csv(). The
syntax is read.csv(filepath, header, sep), where filepath specifies the
location of the file, header specifies whether the first row contains
column names (TRUE/FALSE), and sep specifies the delimiter
(such as "," for CSV).
Code Window 1
This loads the file into the variable data, and head() displays its first
six rows by default.
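A minimal sketch of this step is given below. The file and its contents are hypothetical: the sketch first writes a small temporary CSV so that it runs anywhere. With your own data you would pass its path to read.csv() instead.

```r
# Create a small example CSV in a temporary location (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score",
             "Ravi,82",
             "Meena,75",
             "Kiran,91"), csv_path)

# header = TRUE: the first row holds the column names.
# sep = ",": the values are separated by commas.
data <- read.csv(csv_path, header = TRUE, sep = ",")

# head() shows the first six rows by default (this file has only three).
print(head(data))
```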
To import other formats, we first need to load the required package with
library(). The code to read Excel and JSON files is shown below in
code window 2.
Code Window 2
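As a hedged illustration of reading other formats, the sketch below assumes that the readxl and jsonlite packages are installed. They are one common choice, not necessarily the packages used in the original code window. The Excel file is a sample workbook bundled with readxl, and the JSON text is made up for illustration.

```r
# Excel: read_excel() from the readxl package (assumed to be installed).
if (requireNamespace("readxl", quietly = TRUE)) {
  xlsx_path  <- readxl::readxl_example("datasets.xlsx")  # sample file shipped with readxl
  excel_data <- readxl::read_excel(xlsx_path, sheet = 1) # read the first sheet
  print(head(excel_data))
}

# JSON: fromJSON() from the jsonlite package (assumed to be installed).
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # Hypothetical JSON text: an array of records.
  json_text <- '[{"name": "Ravi", "score": 75}, {"name": "Meena", "score": 91}]'
  json_data <- jsonlite::fromJSON(json_text)  # an array of records becomes a data frame
  print(json_data)
}
```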
R can also interact with databases using packages such as DBI and
RMySQL, as shown in code window 3.
Code Window 3
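A minimal sketch of database access through DBI follows. To keep it runnable without a database server, it swaps in an in-memory SQLite database through the RSQLite package; this is an assumption, since the text above names RMySQL. The table name and data are hypothetical. With MySQL, the connection line would instead use RMySQL::MySQL() together with your own host, user name, password and database name.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Connect to a temporary in-memory SQLite database.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a small data frame into the database as a table (hypothetical data).
  DBI::dbWriteTable(con, "students",
                    data.frame(name = c("Ravi", "Meena"), score = c(75, 91)))

  # Run an SQL query; the result comes back as an ordinary data frame.
  db_data <- DBI::dbGetQuery(con,
                             "SELECT name, score FROM students WHERE score > 80")
  print(db_data)

  # Close the connection when finished.
  DBI::dbDisconnect(con)
}
```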
Once you have imported data into R, you can start analysing it using
built-in functions or libraries.
The first five rows of the mpg dataset are shown below in code window 4.
Code Window 4
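For reference, the mpg dataset ships with the ggplot2 package. Assuming ggplot2 is installed, its first five rows can be displayed as follows:

```r
library(ggplot2)   # provides the mpg dataset

# head() returns six rows by default, so ask for five explicitly.
first_rows <- head(mpg, 5)
print(first_rows)
```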
(i) Histograms: A histogram visualizes the distribution of a single
continuous variable by binning, i.e. dividing its values into
intervals (bins) and counting the observations in each. It is useful
for identifying patterns such as skewness, spread, or unusual gaps.
The code is given in code window 5.
Code Window 5
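A sketch of a histogram, assuming ggplot2 and its mpg dataset. The choice of variable (hwy), the bin width and the colours are illustrative assumptions rather than the values used in the original code window.

```r
library(ggplot2)

# Histogram of highway mileage: each bar counts the cars whose hwy value
# falls inside a bin that is 2 miles per gallon wide.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of highway mileage",
       x = "Highway miles per gallon",
       y = "Number of cars")

print(p_hist)
```

A smaller binwidth shows finer detail but produces a noisier picture; a larger one smooths the shape but can hide gaps.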
Code Window 6
(iii) Box Plot: It summarizes the distribution of a continuous variable
by displaying the median, quartiles, and potential outliers. It is
useful when we need to compare across multiple groups. The code
is shown in code window 7.
Code Window 7
This code gives a box plot of highway mileage (hwy) for different
car classes (class). The function geom_boxplot() generates a box plot
for each class with statistical summaries. The fill color is set to
light green.
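A sketch consistent with this description, assuming ggplot2 and its mpg dataset; the axis labels and title are illustrative additions.

```r
library(ggplot2)

# One box per car class, summarising the highway mileage of that class.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Highway mileage by car class",
       x = "Car class",
       y = "Highway miles per gallon")

print(p_box)
```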
(iv) Line Graphs: They are recommended when we want to analyse
trends over a continuous variable or to observe relationships. The
code to generate line graph is shown in code window 8.
Code Window 8
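A sketch of a line graph. The dataset is an assumption: it uses economics, a monthly time series that ships with ggplot2, and plots its unemploy column (number of unemployed, in thousands) against date.

```r
library(ggplot2)

# Line graph of a variable observed over time.
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "darkblue") +
  labs(title = "Unemployment over time",
       x = "Date",
       y = "Unemployed (thousands)")

print(p_line)
```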
(v) Scatter Plots: When we need to visualize the relationship between
two continuous variables, we can use a scatter plot. Scatter plots are
an ideal choice for identifying trends, clusters, or correlations.
Code window 9 shows how to generate a scatter plot.
Code Window 9
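A sketch of a scatter plot, assuming ggplot2 and its mpg dataset. The pair of variables (engine displacement displ and highway mileage hwy) is an illustrative choice.

```r
library(ggplot2)

# Each point is one car: engine size on the x-axis, highway mileage on the y-axis.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "darkred") +
  labs(title = "Engine displacement vs highway mileage",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

print(p_scatter)
```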
IN-TEXT QUESTIONS
1. Which function in R is commonly used to load a CSV file?
2. What type of chart is used to visualize the frequency distribution
of data?
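(i) Mean: The arithmetic average, or mean, is the sum of all values
divided by the number of values. It is calculated by the formula
below:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$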
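In R, the mean is the sum of the values divided by their count, which is also what the built-in mean() returns. A short sketch with made-up data values:

```r
x <- c(12, 15, 11, 18, 14)   # hypothetical data

# Mean from the definition: sum of the values divided by how many there are.
manual_mean <- sum(x) / length(x)

# The built-in function gives the same result.
builtin_mean <- mean(x)

print(manual_mean)    # 14
print(builtin_mean)   # 14
```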
(iii) Mode: The mode is the value that appears most frequently in
a dataset. A dataset can be unimodal (one mode), multimodal
(more than one mode), or have no mode at all if no value repeats.
The example below shows all three cases; the corresponding code to
compute the mode of a dataset is given in code window 10.
Code Window 10
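Base R has no built-in function for the statistical mode, so one has to be written. The helper below, get_mode(), is a hypothetical name introduced here for illustration, and the data values are made up. It is a sketch for numeric data that covers the unimodal, multimodal and no-mode cases.

```r
# Return every value tied for the highest frequency,
# or numeric(0) if no value repeats.
get_mode <- function(x) {
  counts <- table(x)                  # frequency of each distinct value
  if (max(counts) == 1) {
    return(numeric(0))                # nothing repeats, so there is no mode
  }
  as.numeric(names(counts)[counts == max(counts)])
}

print(get_mode(c(2, 3, 3, 5, 7)))     # unimodal: 3
print(get_mode(c(1, 1, 2, 2, 4)))     # multimodal: 1 and 2
print(get_mode(c(4, 6, 8, 10)))       # no mode: numeric(0)
```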
(ii) Variance: Variance measures how far the data points lie from the
mean: it is the average of the squared deviations from the mean. A
higher variance indicates greater variability in the data. Variance is
expressed in squared units, which makes it harder to interpret
directly. The formula of variance is given below:
Formula:
Code Window 11
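A minimal sketch of the measures of dispersion in R, using made-up data. It assumes the sample form of variance (divisor n - 1), which is what R's built-in var() and sd() compute; a population variance would divide by n instead.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data

# Range: difference between the largest and smallest value.
# (range(x) itself returns the pair c(min, max), not the difference.)
data_range <- max(x) - min(x)

# Sample variance from the definition:
# squared deviations from the mean, summed and divided by n - 1.
manual_var  <- sum((x - mean(x))^2) / (length(x) - 1)
builtin_var <- var(x)

# Standard deviation: square root of the variance, in the original units.
std_dev <- sd(x)

# Interquartile range: third quartile minus first quartile.
iqr <- IQR(x)

print(data_range)                            # 7
print(c(manual = manual_var, builtin = builtin_var))
print(std_dev)
print(iqr)                                   # 1.5
```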
... distance from a Wi-Fi router and internet speed. Nonlinearity is shown when
variables have a curved or complex relationship instead of a straight-line
pattern (we will study this in lesson 6). There is no relationship between
two variables when changes in one variable are not associated with changes
in the other; for example, shoe size and IQ have no relationship. In this
section, we discuss three key concepts for measuring relationships:
covariance, correlation, and the coefficient of determination (R²). Both
covariance and correlation are used to measure the linear dependency
between a pair of random variables, also called bivariate data. You have
already studied correlation and covariance in lesson 2; here we explore
them again with R programming code.
Covariance is a statistical measure that indicates how two variables change
together. It shows whether an increase in one variable is accompanied by
an increase in the other variable, or by a decrease. The formula for
covariance is given below:
Thus, covariance measures how the two variables vary together. A positive
covariance means the relationship is direct: both variables increase or
decrease together. A negative covariance means that one variable increases
as the other decreases. If the covariance is close to zero, there is no
apparent linear relationship. However, covariance is hard to interpret
because it is measured in units that are the product of the two variables'
units, so its magnitude cannot easily be compared or standardized
across datasets. An example is shown in code window 12 below:
Code Window 12
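A minimal sketch of covariance in R, using made-up data. It assumes the sample form (divisor n - 1), which is what the built-in cov() computes.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data
y <- c(2, 4, 5, 4, 6)

# Sample covariance from the definition: products of paired deviations
# from the two means, summed and divided by n - 1.
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# The built-in function gives the same value.
builtin_cov <- cov(x, y)

print(manual_cov)    # 2
print(builtin_cov)   # 2
```

The positive value indicates that x and y tend to increase together. Its size depends on the units of x and y, so it says little about how strong the relationship is.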
When r = 1, there is a perfect positive linear relationship; r = -1
indicates a perfect negative linear relationship; and r = 0 means no
linear relationship. Correlation indicates not only the strength of the
relationship but also its direction. The R code for computing correlation
is given in code window 13:
Code Window 13
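A minimal sketch of correlation in R, using made-up data. It assumes the Pearson correlation coefficient, which is the default method of cor(), and shows that it equals the covariance divided by the product of the two standard deviations.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data
y <- c(2, 4, 5, 4, 6)

# Pearson correlation with the built-in function.
r <- cor(x, y)

# The same value from its definition: covariance scaled by both standard deviations.
manual_r <- cov(x, y) / (sd(x) * sd(y))

print(r)
print(manual_r)

# Limiting cases: an exact straight-line relationship gives r = 1 or r = -1.
r_pos <- cor(x, 2 * x + 1)   # perfectly increasing line
r_neg <- cor(x, -x)          # perfectly decreasing line
print(c(r_pos, r_neg))
```

Because the standard deviations cancel the units, r always lies between -1 and 1 and can be compared across datasets, unlike covariance.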
Correlation provides valuable insights into the strength and direction of
the relationship between two variables. A strong positive correlation (0.7
... do not reliably predict changes in the other, which indicates that
there is no practically important linear relationship between the variables.
Further, the coefficient of determination, denoted by R², is used for
explaining variance: it quantifies how well one variable predicts another.
It is especially useful in regression analysis to evaluate the
goodness-of-fit. In simple linear regression with a single predictor, it
is computed by squaring the correlation coefficient, as shown in the formula:
$$R^2 = r^2$$
This value represents the proportion of variance in one variable (Y)
explained by the other (X). For instance, R² = 0.81 implies that 81% of the
variation in Y can be explained by X.
Values of R² can be used to characterise the relationship: R² = 0 indicates
no linear predictive relationship between the given variables, i.e. the
independent variable (X) does not explain any of the variation in the
dependent variable (Y), while R² = 1 signifies perfect prediction, indicating
that all the variation in Y is explained by X.
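A minimal sketch of the coefficient of determination in R, using made-up data. It assumes a simple linear regression of Y on a single X, the case in which R² equals the squared correlation coefficient.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data
y <- c(2, 4, 5, 4, 6)

# R-squared as the square of the correlation coefficient.
r2_from_cor <- cor(x, y)^2

# R-squared reported by a simple linear regression of y on x.
fit        <- lm(y ~ x)
r2_from_lm <- summary(fit)$r.squared

print(r2_from_cor)   # about 0.727
print(r2_from_lm)    # same value
```

For this data, roughly 73% of the variation in y is explained by x. With more than one predictor the two calculations no longer coincide, and the value reported by lm() is the one to use.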
IN-TEXT QUESTIONS
3. Which measure of central tendency is the middle value of a
sorted dataset?
4. What is the statistical term for the difference between the
maximum and minimum values in a dataset?
5. What term describes the strength and direction of the linear
relationship between two variables?
6. Which metric indicates how well one variable explains another
in regression analysis?
5.8 Summary
This lesson provides a good foundation in data analysis with R by
covering how to import data, visualize information, describe data, and
measure relationships among variables. You learnt to read data files and
to import data from various sources, such as CSV or Excel files, into R.
The next area of key importance is visualization, where we discussed
several types of charts that allow us to visualize our data. Histograms
reveal distributions. Bar charts are useful for comparing categorical data.
Box plots summarize data distributions through medians, quartiles,
and outliers. Line graphs capture trends over time, and scatter plots
show the relationship between two variables. These tools not only help
in analyzing the data but also make it easier to communicate findings.
Further, descriptive statistics were explained, which help to describe the
characteristics of a dataset. Measures of central tendency (mean, median,
and mode) summarize the center of the data, while measures of
dispersion (range, variance, standard deviation, and IQR) explain the
variation or spread in the data. Together, these measures allow for a full
description of the dataset.
Lastly, we covered relationships between variables. You have learnt to
analyze how variables are connected through covariance and correlation,
which describe how two variables change together and the strength of the
relationship. The coefficient of determination, R², quantifies
the degree to which one variable predicts another, giving a more nuanced
understanding of how they interact. Mastering these concepts and tools
will enable you to import, visually present, describe, and analyze data
in preparation for more advanced data analysis and decision-making.