Debarghya Das (Ba-1), 18021141033

This document provides an analysis of the Old Faithful geyser dataset using RStudio. It begins with an overview of the dataset and its variables. Descriptive statistics are then calculated, including the mean, standard deviation, range, and percentiles of the eruption and waiting time variables. The mean eruption time is 3.49 minutes and mean waiting time is 70.90 minutes. The minimum and maximum values are also reported to indicate the range for each variable. Percentiles at various probabilities are computed to gain insight into the distributions. This document demonstrates how to effectively summarize and explore the key attributes of a dataset using R.

DATA ANALYSIS ON RSTUDIO

NAME: DEBARGHYA DAS

PRN NO.: 18021141033

Asus
Fri Dec 07 20:21:46 2018
FAITHFUL DATA SET:
Description
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in
Yellowstone National Park, Wyoming, USA.

Usage
faithful

Format

A data frame with 272 observations on 2 variables.

[,1]  eruptions  numeric  Eruption time in mins
[,2]  waiting    numeric  Waiting time to next eruption (in mins)

Details

A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds,
where multiples of 5 are more frequent than expected under non-human measurement. For a better
version of the eruption times, see the example below.

There are many versions of this dataset around: Azzalini and Bowman (1990) use a more complete
version.

Source

W. Härdle.

plot(faithful, col = "darkblue", cex = 2)


A. SUMMARY and DESCRIPTIVE STATISTICS

Summary (or descriptive) statistics are the first figures used to represent nearly every dataset.
They also form the foundations for more complicated computations and analyses. Therefore,
they are essential to the analysis process. In this paper, we will explore the ways in which R can
be used to calculate summary statistics, including mean, standard deviation, range, and
percentile. Also included here is the summary function, which is one of the most useful tools in
the R set of commands.
First, let us inspect the FAITHFUL dataset.

# faithful ships with R's built-in datasets package, so no extra package is
# needed for the data; ggplot2 is loaded for the plots later on
library(ggplot2)
# Display or print the first 100 observations of our dataset
print(head(faithful, n = 100))
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
## 7 4.700 88
## 8 3.600 85
## 9 1.950 51
## 10 4.350 85
## 11 1.833 54
## 12 3.917 84
## 13 4.200 78
## 14 1.750 47
## 15 4.700 83
## 16 2.167 52
## 17 1.750 62
## 18 4.800 84
## 19 1.600 52
## 20 4.250 79
## 21 1.800 51
## 22 1.750 47
## 23 3.450 78
## 24 3.067 69
## 25 4.533 74
## 26 3.600 83
## 27 1.967 55
## 28 4.083 76
## 29 3.850 78
## 30 4.433 79
## 31 4.300 73
## 32 4.467 77
## 33 3.367 66
## 34 4.033 80
## 35 3.833 74
## 36 2.017 52
## 37 1.867 48
## 38 4.833 80
## 39 1.833 59
## 40 4.783 90
## 41 4.350 80
## 42 1.883 58
## 43 4.567 84
## 44 1.750 58
## 45 4.533 73
## 46 3.317 83
## 47 3.833 64
## 48 2.100 53
## 49 4.633 82
## 50 2.000 59
## 51 4.800 75
## 52 4.716 90
## 53 1.833 54
## 54 4.833 80
## 55 1.733 54
## 56 4.883 83
## 57 3.717 71
## 58 1.667 64
## 59 4.567 77
## 60 4.317 81
## 61 2.233 59
## 62 4.500 84
## 63 1.750 48
## 64 4.800 82
## 65 1.817 60
## 66 4.400 92
## 67 4.167 78
## 68 4.700 78
## 69 2.067 65
## 70 4.700 73
## 71 4.033 82
## 72 1.967 56
## 73 4.500 79
## 74 4.000 71
## 75 1.983 62
## 76 5.067 76
## 77 2.017 60
## 78 4.567 78
## 79 3.883 76
## 80 3.600 83
## 81 4.133 75
## 82 4.333 82
## 83 4.100 70
## 84 2.633 65
## 85 4.067 73
## 86 4.933 88
## 87 3.950 76
## 88 4.517 80
## 89 2.167 48
## 90 4.000 86
## 91 2.200 60
## 92 4.333 90
## 93 1.867 50
## 94 4.817 78
## 95 1.833 63
## 96 4.300 72
## 97 4.667 84
## 98 3.750 75
## 99 1.867 51
## 100 4.900 82
# To see the variables in the dataset, use names(dataset) or ls(dataset)
ls(faithful)
## [1] "eruptions" "waiting"
# To see the number of columns and rows in the FAITHFUL dataset,
# use ncol(dataset) and nrow(dataset)
ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272

From the result output, the FAITHFUL dataset contains 2 variables and 272 rows. A more
advanced technique to see the structure of the dataset is to use the str(DATAVAR) function.
# A more advanced and complete way to see the structure of our dataset
str(faithful)
## 'data.frame': 272 obs. of 2 variables:
## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...

FAITHFUL is a data.frame, R's tabular data type in which each column is a variable and each row
is an observation. It contains 2 numeric variables. From here, we can now compute summary and
descriptive statistics for the dataset.
MEAN of EACH VARIABLE
In R, the mean of a single variable is calculated with the mean(VAR) command, where VAR is the
name of the variable whose mean we want to compute. Alternatively, the means of all the numeric
variables in a dataset can be calculated at once with colMeans(DATAVAR), where DATAVAR is the
name of the dataset containing the variables (mean(DATAVAR) itself does not work on a data
frame in current versions of R). FAITHFUL has only two variables, and both are kept for the
descriptive statistics. To select a subset of a dataset, use the
subset(DATAVAR, select = c("VAR1", "VAR2", "VAR3", ..., "VARi")) command. You can also type
?subset in your R console followed by ENTER to learn how to subset vectors, matrices and
data frames. The code below demonstrates how to select a subset of the FAITHFUL dataset.

# Subsetting eruptions and waiting from the dataset FAITHFUL


subset.data <- subset(faithful, select = c("eruptions","waiting"))
# Checking subset.data to make sure we have the needed subset
str(subset.data)
## 'data.frame': 272 obs. of 2 variables:
## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...

From the above output, we see that we got the subset we need. Let’s find the mean of each
variable in the selected subset.

# Calculate the mean of a variable with mean(DATAVAR$VAR); mean of variable eruptions
mean(subset.data$eruptions)
## [1] 3.487783
THE AVERAGE ERUPTION TIME IS 3.49 MINS
# Calculate the mean of a variable with mean(DATAVAR$VAR); mean of variable waiting
mean(subset.data$waiting)
## [1] 70.89706
THE AVERAGE WAITING TIME IS 70.90 MINS
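
The per-variable means above can also be obtained in a single call. The following is a minimal sketch, assuming the built-in faithful data frame; because mean() expects a vector, colMeans() or sapply() is used for the whole data frame. The object names col_means and col_means_sapply are illustrative only.

```r
# Means of every numeric column at once (faithful ships with base R)
col_means <- colMeans(faithful)
print(round(col_means, 2))

# Equivalent approach that generalises to any summary function
col_means_sapply <- sapply(faithful, mean)
print(round(col_means_sapply, 2))
```

The sapply() form is handy because mean can be swapped for sd, median, or any other function that takes a numeric vector.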

STANDARD DEVIATION OF EACH VARIABLE


Standard deviations are calculated in the same way as means within R. The standard deviation
of a single variable can be computed using sd(VAR), where VAR is the name of the
variable whose standard deviation you want to retrieve. Standard deviation measures how
spread out your data are around the mean. The code below demonstrates the use of the standard
deviation function.

# What is the standard deviation of eruptions?


sd(subset.data$eruptions)
## [1] 1.141371
# What is the standard deviation of waiting?
sd(subset.data$waiting)
## [1] 13.59497
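
To make clear what sd() computes, the sketch below reproduces the sample standard deviation by hand for the eruptions variable and compares it with the built-in result. It assumes the built-in faithful data frame; the names x, n and manual_sd are illustrative only.

```r
x <- faithful$eruptions
n <- length(x)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

print(manual_sd)
print(sd(x))   # built-in result for comparison
```

Note the n - 1 divisor: sd() returns the sample standard deviation, not the population version.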

RANGE of EACH VARIABLE : MINIMUM & MAXIMUM


Continuing in the same trajectory, the minimum can be computed on a single variable using
min(VAR). By the same token, max(VAR) returns the maximum. Together they give the smallest and
largest values of an individual variable in the dataset. The code below shows how to calculate
minimums and maximums.
# Minimum and maximum of eruption time
min(subset.data$eruptions);max(subset.data$eruptions)
## [1] 1.6
## [1] 5.1

From the output, the minimum eruption time is 1.6 mins and the maximum eruption time is
5.1 mins.
# Minimum and maximum of waiting time
min(subset.data$waiting);max(subset.data$waiting)
## [1] 43
## [1] 96

From the output, the minimum waiting time is 43 mins and the maximum waiting time is 96 mins.

RANGE
The range of a particular variable, that is, its minimum and maximum, can be retrieved using the
range(VAR) command. As with the min and max functions, using range(dataset) is not very useful
since it considers the entire dataset, rather than each individual variable. Consequently, it is
recommended that ranges be computed on individual variables. These computations are
demonstrated in the following code:

# Calculate the range of a variable with range(VAR)


# Range of variable eruptions
range(subset.data$eruptions)
## [1] 1.6 5.1
# Range of variable waiting
range(subset.data$waiting)
## [1] 43 96
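
Since range() returns the minimum and maximum as a pair, the width of the range can be obtained by taking their difference. The sketch below, which assumes the built-in faithful data frame and uses illustrative object names, also shows one way to get per-variable ranges for a whole data frame without pooling the columns.

```r
# range() returns c(min, max); diff() turns that into the spread
eruption_spread <- diff(range(faithful$eruptions))
waiting_spread  <- diff(range(faithful$waiting))
print(eruption_spread)
print(waiting_spread)

# Per-variable ranges for the whole data frame in one call
ranges <- sapply(faithful, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```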

PERCENTILES : VALUES from PERCENTILES (QUANTILE)


You can get more insight into the distribution of a set of observations by examining quantiles. A
quantile is a value computed from a collection of numeric measurements that indicates an
observation’s rank when compared to all other present observations. Alternatively, a quantile can
be expressed as a percentile; this is identical but on a percent scale of 0 to 100.
Obtaining quantiles and percentiles in R is done using the quantile() function. The command is
written quantile(VAR, c(PROB1, PROB2, PROB3, ..., PROBi)) or, naming the argument,
quantile(VAR, probs = c(PROB1, PROB2, PROB3, ..., PROBi)). The argument's full name is probs;
the shorter prob also works because R matches argument names partially.

# Calculate the 0th (minimum), 25th, 50th, 75th, and 95th percentiles for
# eruptions in the subset dataset
quantile(subset.data$eruptions, probs = c(0, .25, .50, .75, .95))
## 0% 25% 50% 75% 95%
## 1.60000 2.16275 4.00000 4.45425 4.81700
# Calculate the 0th (minimum), 25th, 50th, 75th, and 95th percentiles for
# waiting in the subset dataset
quantile(subset.data$waiting, probs = c(0, .25, .50, .75, .95))
## 0% 25% 50% 75% 95%
## 43 58 76 82 89
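
The quartiles returned by quantile() tie in with two other base R functions: median() gives the 50th percentile and IQR() gives the distance between the 75th and 25th percentiles. A minimal sketch, assuming the built-in faithful data frame and illustrative object names:

```r
x <- faithful$waiting

# Quartiles via quantile(); the full argument name is probs
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# The 50th percentile is the median
print(median(x))

# Interquartile range: 75th minus 25th percentile
print(IQR(x))
```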

PERCENTILES FROM VALUES (PERCENTILE RANK)


In the opposite situation, where a percentile rank corresponding to a given value is needed, one
has to devise a custom method. Here are the steps involved in computing a percentile rank.

1. Count the number of data points that are at or below the given value
2. Divide by the total number of data points
3. Multiply by 100

The formula for calculating a percentile rank can be written as this command: percentile rank
= length(VAR[VAR <= VALUE]) / length(VAR) * 100. length(VAR[VAR <= VALUE]) counts the
number of data points in a variable that are at or below the given value. The ‘<=’ can be replaced
with other operators, such as ‘<’, ‘>’, and ‘==’. length(VAR) counts the number of data points in
the variable. The final step is to multiply the result by 100 to transform the decimal value into a
percentage. Let’s apply these steps in the following examples:

# In the sample, 3mins of eruptions time is at what percentile rank?


length(subset.data$eruptions[subset.data$eruptions <= 3]) /
length(subset.data$eruptions) * 100
## [1] 35.66176
# In the sample, 70mins of waiting time is at what percentile rank?
length(subset.data$waiting[subset.data$waiting <= 70]) /
length(subset.data$waiting) * 100
## [1] 39.33824
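
The three steps can be wrapped in a small reusable function. The sketch below assumes the built-in faithful data frame; percentile_rank is a hypothetical helper name, not a base R function. It relies on the fact that the mean of a logical vector is the proportion of TRUE values, and it cross-checks the result against ecdf(), which is part of base R's stats package.

```r
# Hypothetical helper (not part of base R): percentage of values
# in x that are at or below value
percentile_rank <- function(x, value) {
  mean(x <= value) * 100
}

print(percentile_rank(faithful$eruptions, 3))
print(percentile_rank(faithful$waiting, 70))

# ecdf() builds the empirical cumulative distribution function,
# which returns the same proportion (on a 0-1 scale)
waiting_ecdf <- ecdf(faithful$waiting)
print(waiting_ecdf(70) * 100)
```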

SUMMARY
A very useful function in R is summary(x), where x can be one of any number of objects,
including datasets, variables, and linear models. When used, summary(x) provides summary
data related to the individual object that is passed to it. Thus, the summary function has a
different output depending on what kind of object it takes as an argument. This method is
valuable because it often sums up what we previously did and provides exactly what is needed in
summary statistics. Let’s apply this command to the sample dataset.

# Summarize a variable using summary(VAR). Summary statistics of eruptions


print(summary(subset.data$eruptions))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 2.163 4.000 3.488 4.454 5.100
# Summarize a variable using summary(VAR). Summary statistics of waiting
print(summary(subset.data$waiting))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.0 58.0 76.0 70.9 82.0 96.0

We get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of the variable.
Applying the same command to the whole dataset gives these statistics for every variable at
once.
# Summarize the subset.data sample using the command summary(subset.data)
print(summary(subset.data))
## eruptions waiting
## Min. :1.600 Min. :43.0
## 1st Qu.:2.163 1st Qu.:58.0
## Median :4.000 Median :76.0
## Mean :3.488 Mean :70.9
## 3rd Qu.:4.454 3rd Qu.:82.0
## Max. :5.100 Max. :96.0
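
summary() does not report the number of observations or the standard deviation. One way to collect those alongside the other figures is sketched below, assuming the built-in faithful data frame; describe is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: a few statistics summary() leaves out,
# alongside some it already reports
describe <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    min = min(x), median = median(x), max = max(x))
}

# One column per variable, one row per statistic
stats_table <- sapply(faithful, describe)
print(round(stats_table, 3))
```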

The output of the preceding summary provides the descriptive statistics of all variables in the
sample data set. Under each variable, we have its summary statistics. Now that we know how to
do summary statistics, we can delve into the visual part of the analysis to see the relation
between the variables.

DATA VISUALIZATION

1. MATRIX of PLOTS

A single planar scatterplot is really useful only when comparing two numeric-continuous
variables. When there are more continuous variables of interest, it is not possible to
display this information satisfactorily on a single plot. A simple and common solution is to
generate a two-variable scatterplot for each pair of variables and show them together in a
structured way; this is referred to as a Scatterplot Matrix. We have two continuous/numeric
variables in the subset dataset we have selected, so the matrix is a small 2 x 2 grid. Working
with base R graphics, use the pairs function.

# Drawing a scatterplot matrix of eruptions and waiting using the pairs function
pairs(subset.data, pch = 16, col = "blue",
      main = "Matrix Scatterplot of eruptions, waiting")
With only two variables the matrix is small, but if it does not fit on your screen, running the
above code in your console will display it in RStudio’s graphics area. The interpretation of the
plots depends upon the labeling of the diagonal panels, running from the top left to the
bottom right. They appear in the same order as the columns given as the first argument.
These “label panels” allow you to determine which individual plot in the matrix corresponds to
each pair of variables. Here, the first column of the scatterplot matrix corresponds to an x-axis
variable of eruptions and the second row corresponds to a y-axis variable of waiting; each row
shares a y-axis scale and each column shares an x-axis scale. The plot of waiting(y) on
eruptions(x) at position row 2 and column 1 displays the same data as the scatterplot at position
row 1 and column 2, but flipped on its axes. Scatterplot matrices therefore allow for
easier comparison of pairwise relationships formed by observations made on multiple continuous
variables. Since our subset has only one pair of variables, let’s use a simple scatterplot to look
at the relationship between them more closely.
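
Before moving on, the pairwise relationship shown in the scatterplot matrix can also be summarised as a number. The sketch below is an optional addition that assumes the built-in faithful data frame and uses cor() from base R's stats package to compute Pearson correlations; the object names are illustrative only.

```r
# Pearson correlation matrix: the numeric counterpart of pairs()
cor_matrix <- cor(faithful)
print(round(cor_matrix, 3))

# The single off-diagonal value is the eruptions-waiting correlation
r <- cor(faithful$eruptions, faithful$waiting)
print(r)
```

A value close to 1 indicates a strong positive linear relationship between eruption time and waiting time.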

2. SCATTERPLOT

A scatterplot is a useful way to visualize the relationship between two variables displayed as x-y
coordinate plots. Similar to correlations, scatterplots are often used to make initial diagnoses
before any statistical analyses are performed.
The simplest way to create a scatterplot is to directly graph two variables using the default
settings. In R, this can be achieved using the plot(VARX, VARY) function, where VARX
is the variable to plot along the x-axis and VARY is the variable to plot along the y-axis. The
code here uses the ggplot2 version of the scatterplot. Let’s look at the relationship between
eruptions and waiting.

# Plot of the relationship between eruptions and waiting


ggplot(subset.data) +
geom_point(aes(x = eruptions, y = waiting), col = 'blue', size = 3) +
ggtitle("Eruptions vs. waiting Scatterplot") +
theme(plot.title = element_text(hjust = 0.5))
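
For comparison, the same scatterplot can be drawn with the base R plot(VARX, VARY) form described earlier. This is a minimal sketch assuming the built-in faithful data frame; the axis labels and title are illustrative choices.

```r
# Base R scatterplot: x variable first, y variable second
plot(faithful$eruptions, faithful$waiting,
     pch = 16, col = "blue",
     xlab = "Eruption time (mins)",
     ylab = "Waiting time (mins)",
     main = "Eruptions vs. Waiting Scatterplot")
```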

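The scatterplot shows a clear positive relationship, with the points falling into two clusters
(short eruptions followed by short waits and long eruptions followed by long waits), and it makes
any potential outliers easy to spot. We can further expand our visual analysis by calculating the
slope and intercept of the line of best fit.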

# Calculate slope and intercept of line of best fit


coef(lm(waiting ~ eruptions, data = subset.data))
## (Intercept) eruptions
## 33.47440 10.72964

We will use the geom_abline() function to add the line defined by the intercept value and the
estimated coefficient of eruptions to our plot.

# Adding a line of best fit; intercept and slope taken from the lm() output above
ggplot(subset.data) +
  geom_point(aes(x = eruptions, y = waiting), col = "blue", size = 3) +
  geom_abline(intercept = 33.47440, slope = 10.72964, col = "darkred") +
  ggtitle("Eruptions vs. Waiting Scatterplot With The Best Fit Line") +
  theme(plot.title = element_text(hjust = 0.5))
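
The intercept and slope of the line of best fit can also be used to predict a waiting time. The sketch below assumes the built-in faithful data frame and a hypothetical 3-minute eruption chosen purely for illustration; it computes the prediction by hand from the coefficients and checks it against predict().

```r
# Fit the same simple linear regression used for the line of best fit
fit <- lm(waiting ~ eruptions, data = faithful)
b <- coef(fit)

# Predicted waiting time after a hypothetical 3-minute eruption:
# intercept + slope * eruption time
manual_pred <- b[["(Intercept)"]] + b[["eruptions"]] * 3
print(manual_pred)

# predict() gives the same value from the fitted model
model_pred <- predict(fit, newdata = data.frame(eruptions = 3))
print(unname(model_pred))
```

With the coefficients reported earlier (33.47440 and 10.72964), the prediction comes to roughly 65.7 minutes. Reading the coefficients from coef(fit) rather than typing them in also avoids hard-coding numbers when drawing the line.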
