Unit III - R Programming

Uploaded by

Harshitha B

Statistical Analysis and R Programming 2024-25

UNIT III: STATISTICS AND PROBABILITY

BASIC DATA VISUALIZATION


Data visualization is an efficient technique for gaining insight into data through a visual medium.
With data visualization, we can work with large datasets and efficiently obtain key
insights from them. Graphics play an important role in bringing out the important features of the data.

R BAR CHARTS
A bar chart is a pictorial representation in which the numerical values of variables are represented by
the length or height of lines or rectangles of equal width. A bar chart is used for summarizing a set of
categorical data. In a bar chart, the data is shown through rectangular bars whose length is
proportional to the value of the variable.

Syntax: barplot(h, xlab, ylab, main, names.arg, col)


Where h is a vector or matrix which contains numeric values used in the bar chart.
xlab is a label for the x-axis.
ylab is a label for the y-axis.
main is a title of the bar chart.
names.arg is a vector of names that appear under each bar.
col is used to give colors to the bars in the graph.

Example
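H <- c(12, 35, 54, 3, 41)                   # bar heights (revenue per month)
M <- c("Feb", "Mar", "Apr", "May", "Jun")   # labels shown under the bars
png(file = "bar_properties.png")            # open a PNG graphics device
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "green",
        main = "Revenue Barchart", border = "red")
dev.off()                                   # close the device and write the file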

Shruthi S, Asst. Professor, GSSS SSFGC, Mysuru Page 1



Output:

GROUP BAR CHART OR STACKED BAR CHART


Bar charts can be drawn with groups of bars or with stacked bars by passing a matrix as the input.
Each row of the matrix holds one variable (group), and each column becomes one stacked bar or one
cluster of bars. By default, barplot() stacks the rows; beside = TRUE draws them side by side.

Example:
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
# One row per region, one column per month
Values <- matrix(c(21,32,33,14,95,46,67,78,39,11,22,23,94,15,16), nrow = 3, ncol = 5,
                 byrow = TRUE)
png(file = "stacked_chart.png")
# A matrix input gives one stacked bar per column
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = c("blue", "pink", "goldenrod"))
legend("topleft", regions, cex = 1.3, fill = c("blue", "pink", "goldenrod"))
dev.off()


Output:
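
The section title also mentions group bar charts, so the following is a minimal sketch of the grouped variant. It reuses the revenue matrix from the stacked example and assumes only the beside argument of barplot(); the variable names cols and bar_pos are illustrative, and the plot is drawn on the default graphics device rather than saved to a file.

```r
# Same revenue matrix as in the stacked example: rows = regions, columns = months
months  <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
Values  <- matrix(c(21,32,33,14,95,46,67,78,39,11,22,23,94,15,16),
                  nrow = 3, ncol = 5, byrow = TRUE)
cols <- c("blue", "pink", "goldenrod")

# beside = TRUE places the three regions next to each other within each month
bar_pos <- barplot(Values, beside = TRUE, names.arg = months,
                   main = "Total Revenue", xlab = "Month", ylab = "Revenue",
                   col = cols)
legend("topleft", regions, cex = 1.0, fill = cols)

# barplot() returns the x positions of the bars: one row per region, one column per month
print(dim(bar_pos))
```

A grouped chart makes it easier to compare regions within a month, while the stacked chart emphasizes the monthly total.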

R PIE CHARTS
A pie chart represents values as slices of a circle, each drawn in a different color. Slices
are labeled with a description, and the numbers corresponding to each slice can also be shown in the
chart. Pie charts are created with the pie() function, which takes a vector of non-negative numbers
as input.
Syntax: pie(x, labels, radius, main, col, clockwise)
Where x is a vector that contains the numeric values used in the pie chart.
labels gives the descriptions of the slices.
radius describes the radius of the pie chart.
main describes the title of the chart.
col defines the color palette.
clockwise is a logical value that indicates whether the slices are drawn clockwise or
anticlockwise.

Example:
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
png(file = "title_color.png")   # file extension matches the png() device
pie(x, labels, main = "Country Pie chart", col = rainbow(length(x)))
dev.off()

Output:

A pie chart can also show slice percentages and a chart legend. The percentages are computed from
the data and used as slice labels, and a legend is added to the plot with the legend() function.

Example:
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
pie_percent <- round(100 * x / sum(x), 1)   # share of each slice in percent
png(file = "per_pie.png")                   # file extension matches the png() device
pie(x, labels = pie_percent, main = "Country Pie Chart", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
dev.off()


Output:

R HISTOGRAM
A histogram is a type of bar chart that shows how many values fall into each of a set of
consecutive value ranges (bins). A histogram is used to show the distribution of one numeric
variable, whereas a bar chart is used for comparing different entities. In a histogram, the height
of each bar represents the number of values present in the given range.
For creating a histogram, R provides the hist() function, which takes a numeric vector as input.

Syntax: hist(V, main, xlab, ylab, xlim, ylim, breaks, col, border)
Where V is a vector that contains numeric values.
main indicates the title of the chart.
col is used to set the color of the bars.
border is used to set the border color of each bar.
xlab is used to describe the x-axis.
ylab is used to describe the y-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to specify the number of bins or the breakpoints between them, which determines
the width of each bar.


Example:
v <- c(12,24,16,38,21,13,55,17,39,10,60)
png(file = "histogram_chart.png")
hist(v, xlab = "Weight", ylab="Frequency", col = "green", border = "red")
dev.off()

Output:
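
The effect of the breaks argument can be sketched with the same data. This example assumes bins of width 10 from 0 to 60, which is an illustrative choice and not part of the original example; the object name h is also illustrative. hist() returns the bin boundaries and counts, so the result can be inspected as well as plotted.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Explicit breakpoints: six bins of width 10 covering 0 to 60
h <- hist(v, breaks = seq(0, 60, by = 10), xlab = "Weight", ylab = "Frequency",
          col = "green", border = "red", main = "Histogram with explicit breaks")

print(h$breaks)   # bin boundaries
print(h$counts)   # number of values that fall into each bin
```

By default each bin includes its right endpoint, so the value 10 is counted in the first bin and 60 in the last.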

R SCATTERPLOTS
Scatter plots are used to examine the relationship between two variables, for example to see how
strongly one variable changes with another. In a scatterplot, the data is represented as a
collection of points. Each point gives the values of the two variables for one observation. One
variable is plotted on the vertical axis and the other on the horizontal axis.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Where x is the dataset whose values are the horizontal coordinates.
y is the dataset whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label on the horizontal axis.
ylab is the label on the vertical axis.
xlim is the range of x values used for plotting.
ylim is the range of y values used for plotting.
axes indicates whether both axes should be drawn on the plot.

Example: This example uses "mtcars", a predefined dataset
available in the R environment.

data <- mtcars[, c('wt', 'mpg')]   # fetch two columns from mtcars
png(file = "scatterplot.png")
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage", xlim = c(2.5, 5),
     ylim = c(15, 30), main = "Weight vs Mileage")
dev.off()

Output:

SCATTERPLOT USING GGPLOT2


Another way to create a scatterplot in R is with the ggplot2 package. The ggplot2
package provides the ggplot() and geom_point() functions for creating a scatterplot. The ggplot()
function takes a series of inputs. The first parameter is the input data frame, and the second is
the aes() function, which maps columns of the data frame to the x-axis and y-axis.


Example:
library(ggplot2)   # load the ggplot2 package
png(file = "scatterplot_ggplot.png")
# print() makes sure the plot is drawn even when the code is run via source()
print(ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point())
dev.off()

Output

R BOXPLOT
A boxplot summarizes how the values in a data set are distributed. The three quartiles divide the
data into four equal parts, and the plot shows the minimum, first quartile, median, third quartile,
and maximum.
Boxplots are also useful for comparing the distributions of several groups by drawing one boxplot
for each of them.
R provides the boxplot() function to create a boxplot.
Syntax: boxplot(x, data, notch, varwidth, names, main)
Where x is a vector or a formula.
data is the data frame.
notch is a logical value; set it to TRUE to draw a notch.
varwidth is also a logical value; set it to TRUE to draw box widths proportional to the square
roots of the sample sizes.
names is the group of labels that will be printed under each boxplot.
main is used to give a title to the graph.

Example
X <- c(2, 4, 6, 7, 8)
png(file = "boxplot.png")
# The data vector is passed as the first argument (x)
boxplot(X, xlab = "Quantity of Cylinders", ylab = "Miles Per Gallon",
        main = "R Boxplot Example")
dev.off()
Output:

Example: With notch = TRUE, the box is drawn with a notch around the median.

X <- c(1, 2, 3, 4, 5, 6, 7, 8)
png(file = "boxplot.png")
# The data vector is passed as the first argument (x)
boxplot(X, xlab = "Quantity of Cylinders", ylab = "Miles Per Gallon",
        main = "R Boxplot Example", notch = TRUE, col = "pink")
dev.off()
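
The numbers behind a boxplot can also be inspected directly. The following sketch uses the base R functions fivenum() and boxplot.stats() on the vector from the notch example; the object name s is illustrative.

```r
X <- c(1, 2, 3, 4, 5, 6, 7, 8)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(X)
print(five)

# boxplot.stats() returns the values boxplot() draws:
# whisker ends, box edges (hinges) and the median, plus any outliers
s <- boxplot.stats(X)
print(s$stats)
print(s$out)
```

For this vector there are no outliers, so the whiskers extend to the minimum and maximum of the data.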


Output:

PROBABILITY
A probability is a number that describes the “magnitude of chance” associated with making a
particular observation or statement. It is always a number between 0 and 1. Calculation of a
probability depends on the definition of an event.
In statistics, an event typically refers to a specific outcome that can occur. The chance
of event A actually occurring is described by a probability, denoted Pr(A). At the extremes,
Pr(A) = 0 suggests A cannot occur, and Pr(A) = 1 suggests that A occurs with complete certainty.
Suppose you roll a fair six-sided die, and let A be the event “you roll a 5 or a 6.” You can assume
that each outcome on a standard die has a probability of 1/6 of occurring in any given roll, so
Pr(A) = 1/6 + 1/6 = 1/3.

CONDITIONAL PROBABILITY
A conditional probability is the probability of one event occurring after taking into account the
occurrence of another event. The quantity Pr(A|B) represents “the probability that A occurs, given
that B has already occurred,” and vice versa if you write Pr(B|A).
If Pr(A|B) = Pr(A), then the two events are independent;
if Pr(A|B) ≠ Pr(A), then the two events are dependent.


INTERSECTION
The intersection of two events is written as Pr(A ∩ B) and is read as “the probability that both A
and B occur simultaneously.” It is common to represent this as a Venn diagram, as shown here:

UNION
The union of two events is written as Pr(A ∪ B) and is read as “the probability that A or B occurs.”
Here is the representation of a union as a Venn diagram:

COMPLEMENT
The probability of the complement of an event is written as Pr(Ā) and is read as “the probability
that A does not occur.” Since A either occurs or does not occur, Pr(Ā) = 1 − Pr(A).
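
These quantities can be sketched in R with the die example. The helper function pr() and the second event B are hypothetical additions for illustration, and the last line uses the standard definition Pr(A|B) = Pr(A ∩ B) / Pr(B), which is not stated in the text above.

```r
# Sample space of one roll of a fair six-sided die; every outcome has probability 1/6
S <- 1:6
pr <- function(event) length(event) / length(S)   # hypothetical helper

A <- c(5, 6)      # "you roll a 5 or a 6" (the event used in the text)
B <- c(2, 4, 6)   # hypothetical second event: "you roll an even number"

p_A         <- pr(A)                  # Pr(A)
p_both      <- pr(intersect(A, B))    # Pr(A and B): only the outcome 6
p_either    <- pr(union(A, B))        # Pr(A or B): outcomes 2, 4, 5, 6
p_not_A     <- pr(setdiff(S, A))      # Pr(not A) = 1 - Pr(A)
p_A_given_B <- p_both / pr(B)         # Pr(A | B) = Pr(A and B) / Pr(B)

print(c(p_A, p_both, p_either, p_not_A, p_A_given_B))
```

Here Pr(A|B) equals Pr(A), so these two particular events happen to be independent.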


RANDOM VARIABLE
A random variable is a variable whose possible values are numerical outcomes of a random
phenomenon. It is a function that assigns a real number to each outcome in the sample space of a
random experiment.
Example
If two unbiased coins are tossed, find a random variable associated with that experiment.
Suppose two (unbiased) coins are tossed and let
X = number of heads. [X is a random variable or function]
Here, the sample space is S = {HH, HT, TH, TT}, so X(HH) = 2, X(HT) = X(TH) = 1 and X(TT) = 0.

The two types of random variables are:

• Discrete Random Variable
In probability theory, a discrete random variable is a random variable that can take on
a finite or countably infinite number of distinct values. These values are often integers
or whole numbers, but they can also be other discrete values.
For example, the number of heads obtained after flipping a coin three times is a discrete
random variable. The possible values of this variable are 0, 1, 2, or 3.

• Continuous Random Variable
A continuous random variable is one that can take any value in an interval, so it has an
uncountably infinite number of possible values.
Continuous random variables are usually measurements. Examples include height, weight, the
amount of sugar in an orange, and the time required to run a mile.
For a continuous random variable, the probability of any single specific value is zero. Instead,
probabilities are defined over intervals of values.

COMMON PROBABILITY DISTRIBUTION


The common distributions are broadly categorized as either discrete or continuous. Each
distribution has four core R functions tied to it: a d-function, providing specific mass or density
function values; a p-function, providing cumulative distribution probabilities; a q-function,
providing quantiles; and an r-function, providing random variate generation.
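
As a small sketch of this naming pattern, the four binomial functions from base R can be applied to the earlier example of the number of heads in three flips of a fair coin. The variable names and the seed are illustrative choices.

```r
# X = number of heads in three flips of a fair coin (binomial, size = 3, prob = 0.5)
d_values <- dbinom(0:3, size = 3, prob = 0.5)   # d: Pr(X = x) for x = 0, 1, 2, 3
p_value  <- pbinom(1, size = 3, prob = 0.5)     # p: Pr(X <= 1)
q_value  <- qbinom(0.9, size = 3, prob = 0.5)   # q: smallest x with Pr(X <= x) >= 0.9
set.seed(1)                                     # fixed seed so the draws are repeatable
r_values <- rbinom(5, size = 3, prob = 0.5)     # r: five random realizations of X

print(d_values)
print(p_value)
print(q_value)
print(r_values)
```

The same d, p, q, r prefixes apply to the other distributions in this unit, for example dpois(), punif(), qnorm() and rt().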


COMMON PROBABILITY MASS FUNCTIONS


Some common probability mass functions for discrete random variables are as follows.

1. BERNOULLI DISTRIBUTION
The Bernoulli distribution is a special case of the binomial distribution in which only a single
trial is performed. It is the discrete probability distribution of a Bernoulli trial, that is, a
trial that has only two outcomes: either success or failure.
For example, a single toss of a fair coin is a Bernoulli trial in which the probability of getting
a head is 0.5 and the probability of getting a tail is 0.5.
The probabilities associated with all possible outcomes must sum to 1. Therefore, if the probability
of success is p, the only other outcome, failure, must occur with probability 1 − p.

In mathematical terms, for a discrete random variable X = x, the probability mass function f is
f(x) = p^x (1 − p)^(1 − x); x = {0, 1}
where p is a parameter of the distribution.

The Rlab package provides four functions for the Bernoulli distribution. They are:
• dbern(): This function gives the probability mass (density) function of the Bernoulli distribution.
Syntax: dbern(x, prob, log = FALSE)
Where x is a vector of quantiles
prob: probability of success on each trial
log: logical; if TRUE, probabilities p are given as log(p)

Example:
library(Rlab)
x <- c(0, 1, 3, 5, 7, 10)
y <- dbern(x, prob = 0.5)
plot(x, y, type = "o")


Output:

• pbern(): This function gives the cumulative distribution function (CDF), which describes the
probability that a variate X takes on a value less than or equal to a number x.
Syntax: pbern(q, prob, log.p = FALSE)
Where q is a vector of quantiles
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).

• qbern(): It gives the quantile function for the Bernoulli distribution.
Syntax: qbern(p, prob, log.p = FALSE)
Where p is a vector of probabilities
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).

• rbern(): This function is used to generate a vector of random numbers that follow a Bernoulli
distribution.
Syntax: rbern(n, prob)
Where n: number of random numbers to generate
prob: probability of success on each trial
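
The earlier dbern() example loads the Rlab package. As a sketch that runs without extra packages, a Bernoulli trial can be treated as a binomial distribution with size = 1, so the base R binomial functions can stand in for the Rlab ones. This substitution is an assumption made for illustration, and the variable names are hypothetical.

```r
p_success <- 0.5   # illustrative probability of success

# Pr(X = 0) and Pr(X = 1): with size = 1 the binomial mass function is p^x (1 - p)^(1 - x)
mass <- dbinom(c(0, 1), size = 1, prob = p_success)

# Pr(X <= 0) and Pr(X <= 1)
cdf <- pbinom(c(0, 1), size = 1, prob = p_success)

# Ten simulated Bernoulli trials (0 = failure, 1 = success)
set.seed(42)   # fixed seed so the draws are repeatable
trials <- rbinom(10, size = 1, prob = p_success)

print(mass)
print(cdf)
print(trials)
```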


2. BINOMIAL DISTRIBUTION
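The binomial distribution is a discrete distribution that describes the number of successes in a
fixed number of trials, where each trial has only two outcomes, i.e. success or failure. All its
trials are independent: the probability of success remains the same and the previous
outcome does not affect the next outcome.
The binomial distribution helps to find individual probabilities as well as cumulative
probabilities over a certain range.
In mathematical terms, for a discrete random variable X = x, the binomial mass function is
f(x) = C(n, x) p^x (1 − p)^(n − x); x = {0, 1, ..., n}
where n is the number of trials, p is the probability of success on each trial, and
C(n, x) = n! / (x! (n − x)!).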
In R there are four built-in functions for the binomial distribution. They are:
• dbinom(): This function gives the probability mass function, i.e. the probability of obtaining
exactly x successes.
Syntax: dbinom(x, size, prob)
Where x is a vector of numbers (numbers of successes).
size is the number of trials.
prob is the probability of success of each trial.

• pbinom(): It is used to find the cumulative probability of data following a binomial
distribution.
Syntax: pbinom(q, size, prob)
Where q is a vector of quantiles.
size is the number of trials.
prob is the probability of success of each trial.

• qbinom(): This function is used to find the quantile value. It is the inverse of pbinom().
Syntax: qbinom(p, size, prob)
Where p is a vector of probabilities.
size is the number of trials.
prob is the probability of success of each trial.


• rbinom(): This function generates n random values from a binomial distribution with the given
parameters.
Syntax: rbinom(n, size, prob)
Where n is the number of observations.
size is the number of trials.
prob is the probability of success of each trial.
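
The relationship between these functions can be checked numerically. The sketch below assumes 10 trials with success probability 0.3, both chosen only for illustration, and compares dbinom() with the mass function C(n, x) p^x (1 − p)^(n − x) written out by hand.

```r
n_trials  <- 10    # illustrative number of trials
p_success <- 0.3   # illustrative probability of success per trial

# Pr(X = 3) from dbinom() and from the mass function written out by hand
by_function <- dbinom(3, size = n_trials, prob = p_success)
by_formula  <- choose(n_trials, 3) * p_success^3 * (1 - p_success)^(n_trials - 3)

# Pr(X <= 3) is the sum of the individual probabilities for x = 0, 1, 2, 3
cumulative <- pbinom(3, size = n_trials, prob = p_success)
summed     <- sum(dbinom(0:3, size = n_trials, prob = p_success))

# Median number of successes, and five simulated experiments of 10 trials each
median_successes <- qbinom(0.5, size = n_trials, prob = p_success)
set.seed(2)   # fixed seed so the draws are repeatable
simulated <- rbinom(5, size = n_trials, prob = p_success)

print(c(by_function, by_formula))
print(c(cumulative, summed))
print(median_successes)
print(simulated)
```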

3. POISSON DISTRIBUTION
Poisson distribution is a probability distribution that expresses the number of events occurring
in a fixed interval of time or space, given a constant average rate. This distribution is particularly
useful when dealing with rare events or incidents that happen independently.
In mathematical terms, for a discrete random variable and a realization X = x, the Poisson mass
function f is given as follows, where λ > 0 is a parameter of the distribution (the mean number
of events per interval):
f(x) = (λ^x exp(−λ)) / x!; x = {0, 1, 2, ...}

There are four Poisson functions available in R. In R code, the parameter λ is passed as the
argument lambda:

• dpois(): This function gives the Poisson probability mass (density) function.
Syntax: dpois(x, lambda, log = FALSE)
where, x: number of events that happened in an interval
lambda: mean number of events per interval (λ)
log: if TRUE, the function returns the probability as a logarithm

• ppois(): This function gives the cumulative probability function.
Syntax: ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
where, q: number of events that happened in an interval
lambda: mean number of events per interval (λ)
lower.tail: if TRUE, the left tail Pr(X ≤ q) is returned; if FALSE, the right tail Pr(X > q)
is returned
log.p: if TRUE, the function returns the probability as a logarithm


• qpois(): It is used for generating quantiles of a given Poisson distribution.
Syntax: qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
where, p: vector of probabilities
lambda: mean number of events per interval (λ)
lower.tail: if TRUE, p is treated as a left-tail probability; if FALSE, as a right-tail
probability
log.p: if TRUE, the probabilities p are given as logarithms

• rpois(): It is used for generating random numbers from a given Poisson distribution.
Syntax: rpois(n, lambda)
Where, n: number of random numbers needed
lambda: mean number of events per interval (λ)
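
A short sketch of the four Poisson functions follows. The rate of 2 events per interval is an illustrative assumption, as are the variable names; the hand-written expression is the standard Poisson mass function λ^x exp(−λ) / x!.

```r
rate <- 2   # illustrative mean number of events per interval

# Pr(X = 3) from dpois() and from the mass function written out by hand
by_function <- dpois(3, lambda = rate)
by_formula  <- rate^3 * exp(-rate) / factorial(3)

# Left tail Pr(X <= 3) and right tail Pr(X > 3) add up to 1
left_tail  <- ppois(3, lambda = rate)
right_tail <- ppois(3, lambda = rate, lower.tail = FALSE)

# Median number of events per interval, and five simulated intervals
median_events <- qpois(0.5, lambda = rate)
set.seed(7)   # fixed seed so the draws are repeatable
simulated <- rpois(5, lambda = rate)

print(c(by_function, by_formula))
print(c(left_tail, right_tail))
print(median_events)
print(simulated)
```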

COMMON PROBABILITY DENSITY FUNCTIONS


When considering continuous random variables, we deal with probability density functions. There
are a number of common continuous probability distributions frequently used over many different
types of problems. They are as follows.

1. UNIFORM DISTRIBUTION
The uniform distribution is the probability distribution that represents equal likelihood of all
outcomes within a specific range, i.e. the density is the same at every point of the interval.
Thus, its plot is a rectangle, and therefore it is often referred to as the rectangular
distribution.
For a continuous random variable a ≤ X ≤ b, the uniform density function f is
f(x) = 1 / (b − a) if a ≤ x ≤ b; 0 otherwise
where a and b are parameters of the distribution defining the limits of the possible
values X can take. They represent the lower and upper limits, respectively.
The R functions associated with the uniform distribution are as follows.


• dunif(): It calculates the uniform density function over the specified interval
(a, b).
Syntax: dunif(x, min, max)
Where x: input sequence
min is the lower limit of the range of values
max is the upper limit of the range of values

Ex: dunif(x=c(-2,-0.33,0,0.5,1.05,1.2),min=-0.4,max=1.1)
Output:
0.0000000 0.6666667 0.6666667 0.6666667 0.6666667 0.0000000

• punif(): It is used to calculate the uniform cumulative distribution function.
Syntax: punif(q, min, max)
Where q: vector of quantiles
min is the lower limit of the range of values
max is the upper limit of the range of values

• qunif(): This function calculates the quantile function of the uniform distribution. The
quantile function is the inverse of punif().
Syntax: qunif(p, min, max)
Where p: vector of probabilities
min is the lower limit of the range of values
max is the upper limit of the range of values

• runif(): It is used to generate a sequence of random numbers from the uniform distribution.
Syntax: runif(n, min, max)
Where n: number of random values to generate
min is the lower limit of the range of values
max is the upper limit of the range of values
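
The remaining uniform functions can be sketched with the same limits as the dunif() example. The variable names and the seed are illustrative; the comparison value is the area of the rectangle to the left of 0.5, i.e. (0.5 − a) / (b − a).

```r
lower <- -0.4   # limits taken from the dunif() example in the text
upper <- 1.1

# Pr(X <= 0.5) from punif() and from the area of the rectangle: (0.5 - a) / (b - a)
by_function <- punif(0.5, min = lower, max = upper)
by_area     <- (0.5 - lower) / (upper - lower)

# qunif() inverts punif(): the resulting probability maps back to 0.5
back <- qunif(by_function, min = lower, max = upper)

# 1000 random values; all of them fall inside [lower, upper]
set.seed(3)   # fixed seed so the draws are repeatable
draws <- runif(1000, min = lower, max = upper)

print(c(by_function, by_area))
print(back)
print(range(draws))
```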

2. NORMAL DISTRIBUTION
The normal distribution is a probability distribution used in statistics that describes how data
values are distributed. Data are often observed to be approximately normally distributed when they
arise as a random collection from many independent sources. Plotting the value of the variable on
the x-axis and the count of each value on the y-axis gives a bell-shaped curve.
It is also referred to as the Gaussian distribution.
For a continuous random variable −∞ < X < ∞, the normal density function f is
f(x) = (1 / (σ √(2π))) exp(−(x − μ)^2 / (2σ^2))
where μ is the mean and σ > 0 is the standard deviation of the distribution.
The graph shows that the peak of the curve is at the mean and that the curve is symmetric: half of
the values lie to the left of the mean and the other half to the right.

In R, there are four built-in functions for the normal distribution:

• dnorm(): This function gives the density function of the distribution.
Syntax: dnorm(x, mean, sd)
Where x is a vector of numbers.
mean is the mean of the distribution. Its default value is zero.
sd is the standard deviation. Its default value is 1.

• pnorm(): This function is the cumulative distribution function, which gives the
probability that a random number X takes a value less than or equal to x.
Syntax: pnorm(q, mean, sd)
Where q is a vector of quantiles.
mean is the mean of the distribution. Its default value is zero.
sd is the standard deviation. Its default value is 1.


• qnorm(): This function is the inverse of the pnorm() function. It takes a probability
value and returns the quantile that corresponds to that probability.
Syntax: qnorm(p, mean, sd)
Where p is a vector of probabilities.
mean is the mean of the distribution. Its default value is zero.
sd is the standard deviation. Its default value is 1.

• rnorm(): This function is used to generate a vector of random numbers which are normally
distributed.
Syntax: rnorm(n, mean, sd)
Where n is the number of random values to generate.
mean is the mean of the distribution. Its default value is zero.
sd is the standard deviation. Its default value is 1.
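
The four normal functions can be sketched as follows. The first part uses the default standard normal distribution; the parameters mean = 50 and sd = 10 in the simulation are illustrative assumptions, as are the variable names.

```r
# Standard normal distribution: mean = 0, sd = 1 (the defaults)
peak <- dnorm(0)                        # density at the mean, equal to 1 / sqrt(2 * pi)
half <- pnorm(0)                        # Pr(X <= mean): half of the area lies left of the mean
within_one_sd <- pnorm(1) - pnorm(-1)   # Pr(-1 <= X <= 1), roughly 0.68

# qnorm() inverts pnorm(): the 97.5% quantile is roughly 1.96
upper_975 <- qnorm(0.975)

# 1000 simulated values with illustrative parameters mean = 50, sd = 10
set.seed(10)   # fixed seed so the draws are repeatable
sample_values <- rnorm(1000, mean = 50, sd = 10)

print(c(peak, half, within_one_sd, upper_975))
print(mean(sample_values))   # close to 50
print(sd(sample_values))     # close to 10
```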

3. STUDENT’S T-DISTRIBUTION
The Student’s t-distribution is a continuous probability distribution generally used when
dealing with statistics estimated from a sample of data.
The t-distribution, also known as the Student’s t-distribution, was published by William Gosset
in 1908.
The t-distribution is similar to the normal distribution, with its bell shape, but it has heavier
tails. It is used for estimating population parameters for small sample sizes or unknown
variances.
A t-distribution has only one parameter: the degrees of freedom (d.f.). For a single sample of n
observations, the number of degrees of freedom is the number of independent observations minus one:
df = n − 1
The R functions associated with the Student’s t-distribution are as follows.
• dt(): This function is used to find the value of the probability density function.
Syntax: dt(x, df)
Where x is a vector of quantiles
df is the degrees of freedom


• pt(): This function is used to get the cumulative distribution function of a t-distribution.
Syntax: pt(q, df)
Where q is the vector of quantiles
df is the degrees of freedom

• qt(): This function is used to get the quantile function, or inverse cumulative distribution
function, of a t-distribution.
Syntax: qt(p, df)
Where p is the vector of probabilities
df is the degrees of freedom

• rt(): This function generates random numbers from the t-distribution.
Syntax: rt(n, df)
Where n is the number of random numbers
df is the degrees of freedom.
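
The heavier tails of the t-distribution can be sketched by comparing it with the standard normal distribution. The choice of 4 degrees of freedom (a sample of n = 5 observations) is an illustrative assumption, and the variable names are hypothetical.

```r
df_small <- 4   # illustrative: a sample of n = 5 observations gives df = n - 1 = 4

# Heavier tails: Pr(X > 2) is larger under the t-distribution than under the normal
t_tail      <- pt(2, df = df_small, lower.tail = FALSE)
normal_tail <- pnorm(2, lower.tail = FALSE)

# The 97.5% quantile of the t-distribution lies further out than the normal one
t_quantile      <- qt(0.975, df = df_small)
normal_quantile <- qnorm(0.975)

# Density at 0 and five random draws
t_peak <- dt(0, df = df_small)
set.seed(5)   # fixed seed so the draws are repeatable
t_draws <- rt(5, df = df_small)

print(c(t_tail, normal_tail))
print(c(t_quantile, normal_quantile))
print(t_peak)
print(t_draws)
```

As the degrees of freedom grow, these t values move toward the corresponding normal values.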
