Debarghya Das (Ba-1), 18021141033
Fri Dec 07 20:21:46 2018
FAITHFUL DATA SET:
Description
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in
Yellowstone National Park, Wyoming, USA.
Usage
faithful
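Format
A data frame with 272 observations on 2 variables: eruptions (numeric, eruption time in minutes)
and waiting (numeric, waiting time to the next eruption in minutes).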
Details
A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds,
where multiples of 5 are more frequent than expected under non-human measurement. For a better
version of the eruption times, see the example below.
There are many versions of this dataset around: Azzalini and Bowman (1990) use a more complete
version.
Source
W. Härdle.
Summary (or descriptive) statistics are the first figures used to represent nearly every dataset.
They also form the foundations for more complicated computations and analyses. Therefore,
they are essential to the analysis process. In this paper, we will explore the ways in which R can
be used to calculate summary statistics, including mean, standard deviation, range, and
percentile. Also included here is the summary function, which is one of the most useful tools in
the R set of commands.
First, let us inspect the FAITHFUL dataset.
# Load the data viz package (ggplot2). The faithful dataset ships with R's built-in
# datasets package, so MASS is not actually required to access it.
library(ggplot2)
library(MASS)
# Display or print the first 100 observations of our dataset
print(head(faithful, n = 100))
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
## 7 4.700 88
## 8 3.600 85
## 9 1.950 51
## 10 4.350 85
## 11 1.833 54
## 12 3.917 84
## 13 4.200 78
## 14 1.750 47
## 15 4.700 83
## 16 2.167 52
## 17 1.750 62
## 18 4.800 84
## 19 1.600 52
## 20 4.250 79
## 21 1.800 51
## 22 1.750 47
## 23 3.450 78
## 24 3.067 69
## 25 4.533 74
## 26 3.600 83
## 27 1.967 55
## 28 4.083 76
## 29 3.850 78
## 30 4.433 79
## 31 4.300 73
## 32 4.467 77
## 33 3.367 66
## 34 4.033 80
## 35 3.833 74
## 36 2.017 52
## 37 1.867 48
## 38 4.833 80
## 39 1.833 59
## 40 4.783 90
## 41 4.350 80
## 42 1.883 58
## 43 4.567 84
## 44 1.750 58
## 45 4.533 73
## 46 3.317 83
## 47 3.833 64
## 48 2.100 53
## 49 4.633 82
## 50 2.000 59
## 51 4.800 75
## 52 4.716 90
## 53 1.833 54
## 54 4.833 80
## 55 1.733 54
## 56 4.883 83
## 57 3.717 71
## 58 1.667 64
## 59 4.567 77
## 60 4.317 81
## 61 2.233 59
## 62 4.500 84
## 63 1.750 48
## 64 4.800 82
## 65 1.817 60
## 66 4.400 92
## 67 4.167 78
## 68 4.700 78
## 69 2.067 65
## 70 4.700 73
## 71 4.033 82
## 72 1.967 56
## 73 4.500 79
## 74 4.000 71
## 75 1.983 62
## 76 5.067 76
## 77 2.017 60
## 78 4.567 78
## 79 3.883 76
## 80 3.600 83
## 81 4.133 75
## 82 4.333 82
## 83 4.100 70
## 84 2.633 65
## 85 4.067 73
## 86 4.933 88
## 87 3.950 76
## 88 4.517 80
## 89 2.167 48
## 90 4.000 86
## 91 2.200 60
## 92 4.333 90
## 93 1.867 50
## 94 4.817 78
## 95 1.833 63
## 96 4.300 72
## 97 4.667 84
## 98 3.750 75
## 99 1.867 51
## 100 4.900 82
# To see the variables in the dataset, use names(dataset) or ls(dataset);
# ls() returns the names sorted alphabetically
ls(faithful)
## [1] "eruptions" "waiting"
# To see the number of columns and number of rows in the FAITHFUL dataset,
# use ncol(dataset) and nrow(dataset)
ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272
From the result output, the FAITHFUL dataset contains 2 variables and 272 rows. A more
advanced technique to see the structure of the dataset is to use the str(DATAVAR) function.
# A more advanced and complete way to see the structure of our dataset
str(faithful)
## 'data.frame': 272 obs. of 2 variables:
## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
FAITHFUL is a data.frame, a tabular datatype whose columns are variables of equal length and
whose rows are observations. FAITHFUL includes 2 numeric variables. From here, we can now
compute summary and descriptive statistics for our dataset.
MEAN of EACH VARIABLE
In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is
the name of the variable whose mean we want to compute. Alternatively, the mean can be calculated
for all the variables in the dataset using the colMeans(DATAVAR) or sapply(DATAVAR, mean)
commands, where DATAVAR is the name of the dataset containing the variables; calling
mean(DATAVAR) directly on a data frame returns NA with a warning in current versions of R. To
select a subset of a dataset, use the
subset(DATAVAR, select = c("VAR1", "VAR2", "VAR3", ..., "VARi")) command. You can also type
?subset in your R console followed by ENTER to learn how to subset vectors, matrices and
data frames. Because FAITHFUL contains only eruptions and waiting, both variables are retained;
the selected subset is stored as subset.data, the name used in the rest of this analysis.
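The selection step can be sketched as follows. This is a minimal illustration rather than the original code of this report; it only assumes the name subset.data, which the later commands use.

```r
# Sketch: keep both faithful columns in an object called subset.data
# (the name used by the later commands in this report)
subset.data <- subset(faithful, select = c("eruptions", "waiting"))

# Inspect the first rows and confirm the dimensions of the subset
print(head(subset.data))
print(dim(subset.data))
```

Since faithful has only these two columns, subset.data has the same rows and columns as faithful itself.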
With the subset selected, let's find the mean of each variable in it.
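One way to compute the means is sketched below, assuming subset.data holds the two FAITHFUL columns. The names column.means and column.sds are illustrative, and the standard deviation line is added only because it is a closely related statistic.

```r
# Assumption: subset.data holds the eruptions and waiting columns of faithful
subset.data <- subset(faithful, select = c("eruptions", "waiting"))

# Mean of a single variable
print(mean(subset.data$eruptions))

# Mean of every column at once (mean() on a whole data frame returns NA)
column.means <- colMeans(subset.data)
print(round(column.means, 3))

# Standard deviation of every column, computed column by column
column.sds <- sapply(subset.data, sd)
print(round(column.sds, 3))
```

The means printed here should agree with the Mean row reported by summary() for the same data.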
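MINIMUM and MAXIMUM
The smallest and largest values of a variable can be retrieved using the min(VAR) and max(VAR)
commands. For eruptions, min(subset.data$eruptions) and max(subset.data$eruptions) give a
minimum eruption time of 1.6 minutes and a maximum eruption time of 5.1 minutes.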
# Minimum and maximum of the waiting time between eruptions
min(subset.data$waiting);max(subset.data$waiting)
## [1] 43
## [1] 96
From the output, the minimum waiting time is 43 minutes and the maximum waiting time is 96 minutes.
RANGE
The range of a particular variable, that is, its minimum and maximum, can be retrieved using the
range(VAR) command. As with the min and max functions, using range(dataset) is not very useful,
since it considers the entire dataset rather than each individual variable. Consequently, it is
recommended that ranges be computed on individual variables, for example
range(subset.data$eruptions).
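A minimal sketch of the range computation on each variable, assuming subset.data holds the two FAITHFUL columns; the names eruptions.range and waiting.range are illustrative.

```r
# Assumption: subset.data holds the eruptions and waiting columns of faithful
subset.data <- subset(faithful, select = c("eruptions", "waiting"))

# range() returns a vector of length two: the minimum, then the maximum
eruptions.range <- range(subset.data$eruptions)
waiting.range <- range(subset.data$waiting)
print(eruptions.range)
print(waiting.range)

# diff() of a range gives the spread between the two extremes
print(diff(waiting.range))
```

The two values returned for each variable match the minimum and maximum reported for it elsewhere in this report.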
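PERCENTILE
Values of a variable at chosen percentiles can be retrieved using the quantile(VAR, probs)
command, where probs is a vector of probabilities between 0 and 1.
# Calculate the minimum (0th percentile) and the 25th, 50th, 75th, and 95th
# percentiles for eruptions in the subset dataset
quantile(subset.data$eruptions, probs = c(0, .25, .50, .75, .95))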
## 0% 25% 50% 75% 95%
## 1.60000 2.16275 4.00000 4.45425 4.81700
# Calculate the minimum (0th percentile) and the 25th, 50th, 75th, and 95th
# percentiles for waiting in the subset dataset
quantile(subset.data$waiting, probs = c(0, .25, .50, .75, .95))
## 0% 25% 50% 75% 95%
## 43 58 76 82 89
A percentile rank works in the opposite direction: it gives the percentage of data points that
are at or below a given value. To calculate it:
1. Count the number of data points that are at or below the given value
2. Divide by the total number of data points
3. Multiply by 100
In R, these steps can be written as: percentile rank
= length(VAR[VAR <= VALUE]) / length(VAR) * 100. length(VAR[VAR <= VALUE]) counts the
number of data points in a variable that are at or below the given value. The ‘<=’ can be
replaced with other comparison operators, such as ‘<’, ‘>’, and ‘==’. length(VAR) counts the
number of data points in the variable. The final step multiplies the result by 100 to transform
the proportion into a percentage.
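The three steps can be sketched as a small function. percentile.rank is a hypothetical helper defined here for illustration, not part of base R, and the values 76 and 4 are arbitrary examples.

```r
# Hypothetical helper: percentage of data points at or below a given value
percentile.rank <- function(var, value) {
  length(var[var <= value]) / length(var) * 100
}

# Percentile rank of a 76-minute wait and of a 4-minute eruption
waiting.rank <- percentile.rank(faithful$waiting, 76)
eruptions.rank <- percentile.rank(faithful$eruptions, 4)
print(waiting.rank)
print(eruptions.rank)

# Cross-check with the empirical cumulative distribution function from stats
print(ecdf(faithful$waiting)(76) * 100)
```

The ecdf() cross-check should print the same value as the helper, since both count the share of observations at or below the given value.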
SUMMARY
A very useful function in R is summary(x), where x can be one of many kinds of objects,
including datasets, variables, and linear models. The summary(x) function returns summary
information about the object that is passed to it, so its output differs depending on what
kind of object it takes as an argument. This function is valuable because it often condenses
what we previously did step by step and provides exactly what is needed in summary statistics.
Let’s apply this command first to a single variable.
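A minimal sketch of summary() on a single variable, assuming subset.data holds the two FAITHFUL columns; eruptions is used here as the example variable.

```r
# Assumption: subset.data holds the eruptions and waiting columns of faithful
subset.data <- subset(faithful, select = c("eruptions", "waiting"))

# Summarize one numeric variable: min, quartiles, median, mean, and max
eruptions.summary <- summary(subset.data$eruptions)
print(eruptions.summary)
```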
Applied to a single numeric variable such as subset.data$eruptions, summary() returns the
minimum, 1st quartile, median, mean, 3rd quartile, and maximum of that variable. Next, we
apply the same command to the whole dataset, which reports these statistics for every variable.
# Summarize the subset.data sample using the command summary(subset.data)
print(summary(subset.data))
## eruptions waiting
## Min. :1.600 Min. :43.0
## 1st Qu.:2.163 1st Qu.:58.0
## Median :4.000 Median :76.0
## Mean :3.488 Mean :70.9
## 3rd Qu.:4.454 3rd Qu.:82.0
## Max. :5.100 Max. :96.0
The preceding output provides the descriptive statistics of every variable in the sample data
set, with each variable's statistics listed under its name. Now that we know how to compute
summary statistics, we can turn to the visual part of the analysis to see the relation
between the variables.
DATA VISUALIZATION
1. MATRIX of PLOTS
A single planar scatterplot is really useful only when comparing two numeric-continuous
variables. When there are more continuous variables of interest, it is not possible to
display this information satisfactorily on a single plot. A simple and common solution is to
generate a two-variable scatterplot for each pair of variables and show them together in a
structured way; this is referred to as a Scatterplot Matrix. The subset dataset we have selected
has two continuous/numeric variables, so its matrix contains just the two mirrored panels of
eruptions against waiting. Working with base R graphics, use the pairs function.
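A minimal sketch of the scatterplot matrix with base R graphics. The name plot.data and the title are illustrative choices, not taken from the original analysis.

```r
# Keep the two numeric variables of faithful for plotting
plot.data <- faithful[, c("eruptions", "waiting")]

# One scatterplot per pair of variables; with two variables the matrix is 2 x 2,
# with the variable names on the diagonal
pairs(plot.data, main = "Scatterplot matrix of the faithful dataset")
```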
2. SCATTERPLOT
A scatterplot is a useful way to visualize the relationship between two variables displayed as x-y
coordinate plots. Similar to correlations, scatterplots are often used to make initial diagnoses
before any statistical analyses are performed.
The simplest way to create a scatterplot is to directly graph two variables using the default
settings. In R, this can be achieved using the plot(VARX, VARY) function, where VARX
is the variable to plot along the x-axis and VARY is the variable to plot along the y-axis. I will
also add the ggplot2 version of the scatterplot. Let’s look at the relationship between
eruptions and waiting.
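Both versions of the scatterplot can be sketched as follows. The axis labels and the name scatter.plot are illustrative, and the ggplot2 part runs only if that package is installed.

```r
# Base R version: eruptions on the x-axis, waiting on the y-axis
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (min)",
     ylab = "Waiting time to next eruption (min)",
     main = "Old Faithful: waiting vs. eruptions")

# ggplot2 version of the same plot (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  scatter.plot <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
    geom_point() +
    labs(x = "Eruption duration (min)",
         y = "Waiting time to next eruption (min)")
  print(scatter.plot)
}
```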
The scatterplot also displays the presence of potential outliers. We can further expand our visual
analysis by calculating the slope and intercept of the line of best fit.
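One way to obtain the slope and intercept is an ordinary least-squares fit with lm(). This is a sketch that assumes waiting is modelled as a linear function of eruptions; the names fit, fit.intercept, and fit.slope are illustrative.

```r
# Least-squares line of best fit: waiting as a linear function of eruptions
fit <- lm(waiting ~ eruptions, data = faithful)

# Extract the intercept and the slope (the coefficient of eruptions)
fit.intercept <- coef(fit)[["(Intercept)"]]
fit.slope <- coef(fit)[["eruptions"]]
print(c(intercept = fit.intercept, slope = fit.slope))
```

The slope is the estimated change in waiting time, in minutes, for each additional minute of eruption.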
We will use the geom_abline() function to add the intercept value and the estimated coefficient
of eruptions (the slope) to our plot.
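A sketch of the fitted line drawn over the scatterplot, assuming the line comes from a least-squares fit of waiting on eruptions. The colour and the name fit.plot are illustrative, and the plot is drawn only if ggplot2 is installed.

```r
# Fit the line of best fit and pull out its two coefficients
fit <- lm(waiting ~ eruptions, data = faithful)
line.intercept <- coef(fit)[["(Intercept)"]]
line.slope <- coef(fit)[["eruptions"]]

# Scatterplot with the fitted line added through geom_abline()
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  fit.plot <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
    geom_point() +
    geom_abline(intercept = line.intercept, slope = line.slope, colour = "red")
  print(fit.plot)
}
```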