R WorkSamples
https://www.educative.io/courses/learn-r-from-scratch/JPkVlp8wWVJ
Variables
Rules for variable names in R
1 A variable name must start with a letter and can contain letters, numbers, underscores ('_') and periods ('.').
Example: variableName1, new.variable
2 An underscore ('_') at the beginning of a variable name is not allowed.
Example: '_my_var' is not a valid variable name.
A period ('.') at the beginning of a variable name is allowed, but must not be followed by a number. In R it is common to use '.' to separate the words in an identifier.
Example: '.myvar' is a valid variable name. However, '.1myvar' is not valid, because a period followed by a number is not allowed.
3 Reserved words (keywords) cannot be used as variable names.
4 Special characters such as '#', '&', etc., and whitespace (tabs, spaces) are not allowed in a variable name.
String
> name <- "ojus"
> name
[1] "ojus"
Numbers
> A <- 10
> A
[1] 10
> B <- 10.2
> B
[1] 10.2
R Script File
Writing an R program in a file and running it
numb1 <- 10
numb2 <- 20
sum <- numb1 + numb2
sum
Concatenate Elements
You can also concatenate, or join, two or more elements, by using the paste() function.
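A minimal sketch of paste() (the variable names are illustrative):

```r
first <- "R"
second <- "programming"
# paste() joins its arguments with a space by default
msg <- paste(first, second)
# sep= controls the separator; paste0() joins with no separator at all
msg2 <- paste(first, second, sep = "-")
msg
msg2
```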
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
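A short sketch of how these conversion functions behave (values chosen for illustration):

```r
x <- "10.5"            # a character string
n <- as.numeric(x)     # converts to the numeric 10.5
i <- as.integer(n)     # truncates to the integer 10
z <- as.complex(n)     # 10.5+0i
class(n)
class(i)
class(z)
```

Note that as.integer() truncates the decimal part rather than rounding.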
String Length
There are many useful string functions in R.
For example, to find the number of characters in a string, use the nchar() function:
Check a String
Use the grepl() function to check if a character or a sequence of characters are present in
a string:
str <- "Hello World!"
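Putting nchar() and grepl() together on the string above:

```r
str <- "Hello World!"
nchar(str)             # 12: number of characters, including the space and '!'
grepl("Hello", str)    # TRUE: the sequence "Hello" is present
grepl("xyz", str)      # FALSE: the sequence "xyz" is not present
```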
For Loops
A for loop is used for iterating over a sequence:
for (x in 1:10) {
  print(x)
}
R Functions
Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.
my_function <- function(x) {
  return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
Output: 15, 25, 45
Default Arguments
my_function <- function(x = 10) {
  return (5 * x)
}
print(my_function(3))
print(my_function())
print(my_function(9))
Output: 15, 50, 45
Vectors.
Lists.
Matrices.
Dataframes.
Factors.
Vectors vectors.r
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c() function and separate the items by a
comma.
In the example below, we create a vector variable called fruits that combines strings:
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Lists
A list in R can contain many different data types inside it.
# List containing strings and numbers
thislist <- list("apple", "banana", "cherry", 1, 2)
# Print the list
thislist
Matrices
A matrix is a two dimensional data set with columns and rows.
A matrix can be created with the matrix() function. Specify the nrow and ncol
parameters to set the number of rows and columns:
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
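Note that matrix() fills values column-wise by default, which matters when you index into the result:

```r
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Column 1 is 1,2,3 and column 2 is 4,5,6 (column-wise fill)
thismatrix[1, 2]   # element in row 1, column 2: 4
dim(thismatrix)    # 3 2
```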
R Data Frames
Data Frames
Data Frames are data displayed in a table format.
A Data Frame can hold different types of data: the first column can be
character, while the second and third are numeric or logical. However, each column must
contain the same type of data.
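A minimal sketch of building a data frame with data.frame() and appending a row with rbind() (the column names and values here are made up for illustration):

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
# rbind() appends a new row (the added values are illustrative)
New_row_DF <- rbind(Data_Frame, c("Training", 110, 110))
New_row_DF
```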
Number of Rows and Columns
Use the dim() function to find the number of rows and columns in a Data Frame:
dim(Data_Frame)
Factors
Factors are data objects used to categorize data and store it as levels. They are
useful for storing categorical data and can hold both strings and integers. They help
categorize columns with a limited set of unique values, such as "TRUE"/"FALSE" or "MALE"/"FEMALE".
They are widely used in data analysis for statistical modeling.
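A minimal sketch of creating a factor (the genre values are made up):

```r
# Categorical data stored as a factor; levels() lists the distinct categories
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Jazz"))
levels(music_genre)   # "Classic" "Jazz" "Rock" (sorted alphabetically)
```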
R Plotting
Plot
The plot() function is used to draw points (markers) in a diagram.
At its simplest, you can use the plot() function to plot two numbers against each other:
Example
Draw one point in the diagram, at position (1, 3):
plot(1, 3)
Multiple Points
You can plot as many points as you like; just make sure you have the same number of
points on both axes:
Example
plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Sequences of Points
If you want to draw dots in a sequence, on both the x-axis and the y-axis, use the :
operator:
Example
plot(1:10)
Draw a Line
The plot() function also takes a type parameter with the value l to draw a line to
connect all the points in the diagram:
Example
plot(1:10, type="l")
Plot Labels
The plot() function also accepts other parameters, such as main, xlab and ylab, if you
want to customize the graph with a main title and different labels for the x- and y-axis:
Example
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
There are many other parameters you can use to change the appearance of the points.
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means 50%
smaller, and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)
R Scatter Plot
Scatter Plots
A "scatter plot" is a type of plot used to display the relationship between two numerical
variables, and plots one dot for each observation.
It needs two vectors of the same length, one for the x-axis (horizontal) and one for the
y-axis (vertical):
Example scatter.r
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
scatter1.r
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y, main="Observation of Cars", xlab="Car age", ylab="Car speed")
Box Plot
#draw in R terminal
A box plot (aka box and whisker plot) uses boxes and lines to depict the distributions of
one or more groups of numeric data. Box limits indicate the range of the central 50% of
the data, with a central line marking the median value.
The ends of the box mark the quartiles, and the vertical line through the box is located
at the median.
The whiskers of a boxplot extend to values known as adjacent values. These are the
values in the data that are furthest away from the median on either side of the box, but
are still within a distance of 1.5 times the interquartile range from the nearest end of the
box (that is, the nearer quartile). In many cases the whiskers actually extend right out
to the most extreme values in the data set. However, in other cases they do not.
Any values in the data set that are more extreme than the adjacent values are plotted as
separate points on the boxplot. This identifies them as potential outliers that may need
further investigation.
When a data distribution is symmetric, you can expect the median to be in the exact
center of the box: the distance between Q1 and Q2 should be the same as between Q2
and Q3. Outliers should be evenly present on either side of the box. If a distribution is
skewed, then the median will not be in the middle of the box, and instead off to the
side. You may also find an imbalance in the whisker lengths, where one side is short
with no outliers, and the other has a long tail with many more outliers.
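The description above can be sketched with boxplot(); the two groups below are made-up data for illustration:

```r
# Illustrative data: two randomly generated groups of measurements
set.seed(1)
groupA <- rnorm(50, mean = 10, sd = 2)
groupB <- rnorm(50, mean = 12, sd = 3)
# One box per group; each box spans the quartiles, the line marks the median,
# and whiskers extend to the adjacent values
boxplot(groupA, groupB, names = c("A", "B"), main = "Box plot of two groups")
```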
Histogram
A histogram shows the frequency distribution of a numeric variable; use the hist() function:
survey <- c(apple=40, kiwi=15, grape=30, banana=50, pear=20, orange=35)
hist(survey)
hist(survey, xlab = "Fruit frequency", main = "Histogram for Fruit Frequency", col = "yellow")
http://www.sthda.com/english/wiki/r-base-graphs
Line graphs
plot(x, y, type = "l", lty = 1)
lines(x, y, type = "l", lty = 1)
This means that first you have to use the function plot() to create an empty
graph, and then use the function lines() to add lines.
x <- c(1:10)
y <- x*x
y2 <- y*2
plot(x, y, type = "l", lty = 6)
lines(x, y2, type = "l", lty = 1)
Introduction to ggplot2
(http://www.sthda.com/english/wiki/qplot-quick-plot-with-ggplot2-r-software-and-data-visualization)
Install and load ggplot2 package
# Installation
install.packages('ggplot2')
# Loading
library(ggplot2)
The function qplot() [in ggplot2] is very similar to the basic plot()
function from the R base package.
x : x values
y : y values (optional)
data : data frame to use (optional).
geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified,
and “histogram” if only x is specified.
xlim, ylim: x and y axis limits
Other arguments including main, xlab, ylab and log can be used also:
main: Plot title
xlab, ylab: x and y axis labels
log: which variables to log transform. Allowed values are “x”, “y” or “xy”
Scatter plots
# Use data from numeric vectors
x <- 1:10; y <- x*x
# Basic plot
qplot(x,y)
# Add line
qplot(x, y, geom=c("point", "line"))
# Use data from a data frame
qplot(mpg, wt, data=mtcars)
Data is placed into hexbins, and the third dimension uses shading to represent the count of observations in each bin.
Most of the plots presented earlier try to detail the data as clearly as possible for
data scientists to identify structures and relationships.
Note that, depending on the format of your file, several variants of read.table() are
available to make your life easier, including read.csv(), read.csv2(), read.delim() and
read.delim2().
file: the path to the file containing the data to be imported into R.
header: logical value. If TRUE, read.table() assumes that your file has a header
row, so row 1 is the name of each column. If that’s not the case, you can add the
argument header = FALSE.
dec: the character used in the file for decimal points.
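A sketch of these parameters in use; the file is created inside the example so it is self-contained (in practice you would point read.csv() at your own file):

```r
# Write a small CSV so the example can run anywhere
write.csv(data.frame(month = c("Jan", "Feb"), amount = c(100, 150)),
          "sales.csv", row.names = FALSE)
# header = TRUE (the default for read.csv) takes row 1 as the column names
sales <- read.csv("sales.csv", header = TRUE)
head(sales)
```

read.csv2() and read.delim2() are the variants for files that use ';' as a separator and ',' as the decimal character, which is common in Europe.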
head(sales)
The head() and tail() functions in R are often used to read the first and last n rows
of a dataset. (https://www.journaldev.com/43863/head-and-tail-function-r)
Examine a Data Frame in R with 7 Basic Functions (DataFrameStructure.r)
1 dim(): shows the dimensions of the data frame by row and column
2 str(): shows the structure of the data frame
3 summary(): provides summary statistics on the columns of the data frame
4 colnames(): shows the name of each column in the data frame
5 head(): shows the first 6 rows of the data frame
6 tail(): shows the last 6 rows of the data frame
7 View(): shows a spreadsheet-like display of the entire data frame
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
The JSON file is read by R using the fromJSON() function. It is stored as a list in R.
library("rjson")
# Read the JSON file (the path is illustrative) and convert the list to a data frame
json_data <- fromJSON(file = "input.json")
json_data_frame <- as.data.frame(json_data)
print(json_data_frame)
Syntax
The basic syntax for calculating mean in R is:
mean(x, trim = 0, na.rm = FALSE, ...)
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
# Create a vector.
x <- c(1,2,3,4,5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means
remove the NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean: returns NA because of the missing value.
result.mean <- mean(x)
print(result.mean)
# Find mean, dropping the NA value.
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)
Median
The middlemost value in a data series is called the median. The median() function is
used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is:
median(x, na.rm = FALSE)
na.rm is used to remove the missing values from the input vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
median(x)
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike
mean and median, mode can apply to both numeric and character data.
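Base R has no built-in mode function; a common idiom is a small helper like this (the name getmode is a convention, not a standard function):

```r
getmode <- function(v) {
  uniqv <- unique(v)
  # tabulate() counts occurrences of each unique value; pick the most frequent
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(c(2, 1, 2, 3, 2))    # 2
getmode(c("a", "b", "b"))    # "b"
```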
Descriptive Statistics
summary(x) - summary data
cov(x,y) - returns covariance
What is covariance?
Covariance signifies the direction of the linear relationship between two variables, i.e.
whether the variables are directly proportional or inversely proportional to each other.
cor(x,y) - returns correlation
What is correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It's a common tool for describing simple
relationships without making a statement about cause and effect.
IQR(x) - returns interquartile range
mean(x) - returns mean
median(x) - returns median
range(x) - returns min and max
sd(x) - returns standard deviation
var(x) - returns variance
dat <- iris # load the iris dataset and renamed it dat
head(dat) # first 6 observations
min(dat$Sepal.Length)
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9
quantile(dat$Sepal.Length, 0.25) # first quartile
## 25%
## 5.1
quantile(dat$Sepal.Length, 0.75) # third quartile
## 75%
## 6.4
Interquartile range
The interquartile range (i.e., the difference between the first and third quartile) can be
computed with the IQR() function:
IQR(dat$Sepal.Length)
## [1] 1.3
sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935
To compute the standard deviation (or variance) of multiple variables at the same time,
use lapply() with the appropriate statistic as second argument:
lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377
Summary
You can compute the minimum, 1st quartile and the maximum for all numeric variables of a
dataset at once using summary():
summary(dat)
If you need these descriptive statistics by group, use the by() function:
by(dat, dat$Species, summary)
## dat$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
## ------------------------------------------------------------
## dat$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## ------------------------------------------------------------
## dat$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the
data to get a better understanding of the data and glean insight from it. There are
various steps involved when doing EDA but the following are the common steps that a
data analyst can take when performing EDA:
1 Import data
2 Clean the data
3 Process the data
4 Visualize the data
x <- c(1, 2, 3, 4, NA)
is.na(x)
[1] FALSE FALSE FALSE FALSE  TRUE
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5
DF
  x  y
1 1 10
2 2 20
3 3 NA
DF1 <- na.omit(DF)
DF1
  x  y
1 1 10
2 2 20
Visualizing a Single Variable
data(mtcars)
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
  xlab="Miles Per Gallon")
summary(income)
# density plot (income is assumed to be a numeric vector of incomes)
plot(density(log10(income), adjust=0.5))
R can carry out a wide range of statistical analyses. Some of the simpler ones include:
Basic statistics
Really simple summary stats
R has a range of functions for carrying out summary statistics. The following table
shows a few of the functions that operate on single variables.
Statistic            Function
Sum                  sum(x)
Mean                 mean(x)
Median               median(x)
Largest              max(x)
Smallest             min(x)
Standard deviation   sd(x)
Variance             var(x)
Number of elements   length(x)
T-test (https://www.javatpoint.com/t-test-in-r)
In statistics, the t-test is one of the most common tests; it is used to
determine whether the means of two groups are equal to each other.
The assumption for the test is that both groups are sampled from normal
distributions with equal variance.
The null hypothesis is that the two means are the same, and the alternative is
that they are not identical.
Such samples are described as being parametric and the t-test is a parametric
test.
For evaluating the statistical significance of the t-test, we need to compute the
p-value. The p-value ranges from 0 to 1, and is interpreted as follows:
If the p-value is lower than 0.05, we are confident in rejecting the
null hypothesis (H0), so the alternative hypothesis (H1) is accepted.
If the p-value is higher than 0.05, then it indicates that we don't have enough
evidence to reject the null hypothesis.
In R the t.test() command will carry out several versions of the t-test.
x – a numeric sample.
y – a second numeric sample (if this is missing the command carries out a 1-
sample test).
alternative – how to compare means, the default is “two.sided”. You can also
specify “less” or “greater”.
mu – the true value of the mean (or mean difference). The default is 0.
paired – the default is paired = FALSE. This assumes independent samples. The
alternative paired = TRUE is used for matched pair tests.
var.equal – the default is var.equal = FALSE. This treats the variances of the two
samples separately. If you set var.equal = TRUE you conduct a classic t-test
using pooled variance.
… – there are additional parameters that we aren’t concerned with here.
> X
[1] 12 15 17 11 15
> Y
[1] 8 9 7 9
> t.test(X, Y)
data: X and Y
t = 4.8098, df = 5.4106, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.745758 8.754242
sample estimates:
mean of x mean of y
14.00 8.25
If you specify a single variable you can carry out a 1-sample test by specifying the mean
to compare against:
> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> t.test(data1, mu = 5)
data: data1
t = 0.55902, df = 10, p-value = 0.5884
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.914249 6.813024
sample estimates:
mean of x
5.363636
If you have matched pair data you can specify paired = TRUE as a parameter.
U-test (http://www.sthda.com/english/wiki/unpaired-two-samples-wilcoxon-test-in-r)
The U-test is used for comparing the median values of two samples.
You use it when the data are not normally distributed, so it is described as a
non-parametric test.
The U-test is often called the Mann-Whitney U-test but is generally attributed
to Wilcoxon (Wilcoxon Rank Sum test).
If the evaluated p-value is less than the significance level alpha, the null hypothesis
is rejected.
x – a numeric sample.
y – a second numeric sample (if this is missing the command carries out a 1-
sample test).
alternative – how to compare means, the default is “two.sided”. You can also
specify “less” or “greater”.
mu – the true value of the median (or median difference). The default is 0.
paired – the default is paired = FALSE. This assumes independent samples. The
alternative paired = TRUE is used for matched pair tests.
… – there are additional parameters that we aren’t concerned with here.
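A sketch of a two-sample test; the weights below are adapted from the sthda tutorial linked above:

```r
# Weights (kg) of two independent groups
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Two-sample Mann-Whitney/Wilcoxon test; suppressWarnings because the data
# contain ties, which forces a normal approximation of the p-value
res <- suppressWarnings(wilcox.test(women_weight, men_weight))
res$p.value   # about 0.027, below alpha = 0.05
```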
The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude
that men’s median weight is significantly different from women’s median weight with a p-value =
0.02712.
Sometimes this kind of test is also called a repeated measures test (depending on
circumstance). You can run the test by adding paired = TRUE to the appropriate
command.
The p-value of the test is less than the significance level alpha = 0.05. We can
then reject the null hypothesis.
Here is an example where the data show the effectiveness of some treatment on
mice.
Chi-Square Test
We will take the survey data in the MASS library, which represents data from a
survey conducted on students.
Our aim is to test the hypothesis that the students' smoking habit is
independent of their exercise level, at the .05 significance level.
# R program to illustrate
# Chi-Square Test in R
library(MASS)
print(str(survey))
stu_data = table(survey$Smoke, survey$Exer)
print(stu_data)
print(chisq.test(stu_data))
'data.frame': 237 obs. of 12 variables:
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
NULL
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7
data: stu_data
For example,
we collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.
If these colors were equally distributed, the expected proportion would be 1/3 for each
color.
Syntax
chisq.test(x, p)
x: a numeric vector
p: a vector of probabilities of the same length of x.
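The tulip example above can be sketched as a goodness-of-fit test:

```r
# Tulip counts from the example above: 81 red, 50 yellow and 27 white
tulip <- c(81, 50, 27)
# Test against equal expected proportions of 1/3 each
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res$p.value   # far below alpha = 0.05
```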
The p-value of the test is 8.803e-07, which is less than the significance level alpha = 0.05.
We can conclude that the colors are significantly not equally distributed, with a p-value of
8.803e-07.
Types of Dirty Data
(https://www.ringlead.com/blog/the-7-most-common-types-of-
dirty-data-and-how-to-clean-them/)
Duplicate Data
Outdated Data
Insecure Data
Incomplete Data
Incorrect/Inaccurate Data
Inconsistent Data
Too Much Data
1. Duplicate Data
Duplicates are among the worst offenders of data pollution. Duplicates can form in a number of
ways including data migration, via data exchanges through integrations, 3rd party connectors,
manual entry, and from batch imports. The most common duplicate objects are Contacts, and
Accounts.
Polluting your datastore with duplicate data can cause:
Inflated storage count
Inefficient workflows and data recovery
Skewed metrics and analytics
2. Outdated Data
Common causes of outdated data:
Individuals change roles or companies
Organizations rebrand or get acquired
Software and systems evolve past their previous iterations
Such is the nature of the modern digital ecosystem, change is too rapid for the average database to
keep up. Organizations need to be able to trust that their data is fresh and up-to-date before using it
for insights, decision-making, and analytics.
How to keep data fresh and up to date:
Purging your database of records created before a certain date can help expedite the process of
cleaning outdated records.
Data enrichment can also solve the pitfalls of outdated customer records by appending fields with
newer information.
3. Insecure Data
Data security & privacy laws are being put into place left and right, giving businesses extra financial
incentive to follow these newly placed laws to a tee.
With steep fines for non-compliance, insecure data is quickly becoming one of the most dangerous
types of dirty data.
Digital consent, opt-ins, and privacy notifications are becoming the new norm in an increasingly
consumer-centric business landscape.
Major data privacy laws include:
GDPR in the EU
California’s Consumer Privacy Act (CCPA)
Maine’s Act to Protect the Privacy of Online Consumer Information
Without proper database hygiene, remaining within these stringent regulations becomes nearly
impossible. For example, let's say an individual does not consent to your data sharing policy but
their customer profile is fragmented throughout your organization's various databases. Adhering to
this opt-out will require manual intervention to reconcile all instances of this personal information
(and let's be real, no one is doing that).
4. Incomplete Data
A record can be defined as incomplete if it lacks the key fields you need to process the incoming
information before sales and marketing take action.
For example, let’s say your organization is running a campaign to target non-profit institutions. If a
new or existing record is missing the ‘industry’ or ‘sector’ fields, it will not be included in smart
lists for the campaign and a valuable revenue opportunity may be missed.
5. Inaccurate/Incorrect Data
Incorrect data is data that is stored in the improper location, e.g. a text field containing a numerical
value.
Inaccurate data, on the other hand, occurs when a field is filled but the information is not correct,
e.g. a fake email address.
These each can cause a slew of issues, including poor targeting and segmentation, irrelevant
non-personalized messaging, mistimed or nonexistent email delivery, and a lack of competitive insights.
6. Inconsistent Data
Just like how duplicate records exist in various places within your database, multiple versions of the
same data elements can exist across different records.
Inconsistent (non-standardized) data is data that looks different but represents the same thing.
For example, let's say you were targeting decision makers for an upcoming email blast and wanted
to segment all "Vice President" roles into one persona. 'V.P', 'v.p', 'VP' & 'Vice Pres' all mean the
same thing, yet will only be included if you are certain all variations exist in your smart list or
campaign. Inconsistent data hurts analytics and makes segmentation that much more difficult when
you have to account for all variants of the same title, industry, or other criteria.