RM practical(2)
TABLE OF CONTENTS
1. INTRODUCTION TO R AND TYPES OF DATA
2. R OPERATORS
3. VECTORS IN R
4. LIST IN R
5. MATRICES IN R
6. ARRAY IN R
Introduction To R
▶ The "R" name is derived from the first letter of the names of its two developers,
Ross Ihaka and Robert Gentleman, who were associated with the University of
Auckland at the time.
▶ The initial version of R was released in 1995 to allow academic statisticians and
others with sophisticated programming skills to perform complex statistical data
analysis and display the results in a multitude of visual graphics.
▶ Compiled languages such as C++ require that an entire section of code be written,
compiled, and run before results can be seen, whereas in R the result can be seen
after each command.
R Operators:
▶ Arithmetic operators: The R arithmetic operators allow us to do math operations,
like sums, divisions, or multiplications, among others.
Variable
Variables can be thought of as labelled containers used to store information.
Variables allow us to recall saved information for later use in calculations. Variables
can store many different things in R, from single values to tables of
information, images and graphs.
Defining and Assigning values to variables
Storing a value in a variable, or "assigning" it, is done with one of the assignment
operators <-, = or ->. The name given to a variable should describe the information or
data being stored. This helps when revisiting old code or when sharing it with others.
>num1=2
>name="Aastha"
>Feepaid=TRUE
Assignment Operator: =, <-, ->
>num1=2
>num1
[1] 2
>num2<-4
>num2
[1] 4
>7->num3
>num3
[1] 7
Arithmetic Operators
Addition
>2+1
[1] 3
>2+6
[1] 8
Subtraction
>2-4
[1] -2
Multiplication
>2*9
[1] 18
Division
>2/7
[1] 0.2857143
Exponent/Power
>2^5
[1] 32
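The operators above cover the basic cases; the short sketch below adds two further base R arithmetic operators, integer division (%/%) and remainder (%%), and shows how parentheses change the order of evaluation. The variable names are only illustrative.

```r
# Integer division and remainder complement the operators shown above
quotient  <- 7 %/% 2       # whole-number part of 7 / 2
remainder <- 7 %% 2        # what is left over after integer division

# Multiplication is evaluated before addition unless parentheses say otherwise
without_brackets <- 2 + 3 * 4
with_brackets    <- (2 + 3) * 4

print(c(quotient, remainder, without_brackets, with_brackets))
```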
CODE-1(ARITHMETIC OPERATORS IN R PROGRAMMING)
Use of Logical Operator: & (and)
>num1=9
>num2=4
>num1>5 & num2<6
[1] TRUE
> num1>5 & num2<3
[1] FALSE
> num1>15 & num2<15
[1] FALSE
> num1>20 & num2<2
[1] FALSE
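Besides & (and), base R also provides | (or) and ! (not). The sketch below reuses num1 and num2 from the examples above; the result names are only illustrative.

```r
num1 <- 9
num2 <- 4

both_true   <- num1 > 5 & num2 < 3   # & needs both conditions to hold
either_true <- num1 > 5 | num2 < 3   # | needs at least one condition to hold
negated     <- !(num1 > 5)           # ! flips TRUE to FALSE and FALSE to TRUE

print(c(both_true, either_true, negated))
```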
Vectors in R:
In R, a sequence of elements that share the same data type is known as a vector. A single
item such as 2 is itself a vector of length one; when several items are combined with c(),
they form a longer vector. If the items have different types, R converts them all to one
common type.
>vec1=c(6,5,4,3,2,1)
>vec1
[1] 6 5 4 3 2 1
>class(vec1)
[1] "numeric"
>vec2=c("a","b","c","d","e")
>vec2
[1] "a" "b" "c" "d" "e"
>class(vec2)
[1] "character"
>vec3=c(F,T,F,T)
>vec3
[1] FALSE TRUE FALSE TRUE
>class(vec3)
[1] "logical"
>vec4=c(1,"a",4,"b",3)
>vec4
[1] "1" "a" "4" "b" "3"
>class(vec4)
[1] "character"
>vec5=c(1,T,2,F,3,F)
>vec5
[1] 1 1 2 0 3 0
>class(vec5)
[1] "numeric"
>vec6=c("d",T,"b",F,"c")
>vec6
[1] "d" "TRUE" "b" "FALSE" "c"
>class(vec6)
[1] "character"
>vec7=c(1,"a",T,2,"e",F)
>vec7
[1] "1" "a" "TRUE" "2" "e" "FALSE"
>class(vec7)
[1] "character"
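The vec4 to vec7 results follow one rule: when types are mixed, R converts every element to the most general type present (logical, then numeric, then character). A minimal sketch of that rule, with illustrative variable names:

```r
flags_and_numbers <- c(TRUE, 2, FALSE)   # logical meets numeric: TRUE becomes 1
numbers_and_text  <- c(1, "a")           # numeric meets character: 1 becomes "1"

print(class(flags_and_numbers))
print(class(numbers_and_text))

# Conversion can also be requested explicitly
converted <- as.numeric("42") + 1
print(converted)
```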
Vector Arithmetic:
>vec1=c(1,2,3,4,5,6)
>vec2=c(1,1,1,1,1,1)
>vec1+vec2
[1] 2 3 4 5 6 7
>vec1-vec2
[1] 0 1 2 3 4 5
>vec1*vec2
[1] 1 2 3 4 5 6
>vec1/vec2
[1] 1 2 3 4 5 6
CODE-3(VECTORS IN R PROGRAMMING)
Vector indexing:
Now try vec1[4] or vec2[3] to find the value at position 4 or 3. Also, try finding the
length and class of vec1 and vec2.
1. Write vec1[4] to extract the 4th element from the vector vec1.
2. To know the number of items in a vector, write length(vec1); R will show the total
number of items in the vector.
3. To know the class of a vector, use the command class(vec1); it will show whether the
vector is numeric, character, logical, etc.
Function names in R are case-sensitive, so length() and class() must be written in lower case.
>vec1=c(5,16,33,23,24,40,35,17)
>vec1[2]
[1] 16
>length(vec1)
[1] 8
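Indexing is not limited to a single position. The sketch below, using the same vec1, shows a range, a set of positions, a negative index, and a logical condition; the result names are illustrative.

```r
vec1 <- c(5, 16, 33, 23, 24, 40, 35, 17)

second_to_fourth <- vec1[2:4]          # a range of positions
first_and_last   <- vec1[c(1, 8)]      # several chosen positions
without_first    <- vec1[-1]           # a negative index drops that position
above_thirty     <- vec1[vec1 > 30]    # keep only elements meeting a condition

print(second_to_fourth)
print(first_and_last)
print(above_thirty)
```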
Lists in R:
Lists are R objects that can contain elements of different types, such as numbers, strings,
vectors, and other lists. A list is created using the list() function.
>l1=list(1,"a",TRUE)
>l1
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
>class(l1[[1]])
[1] "numeric"
>class(l1[[2]])
[1] "character"
>class(l1[[3]])
[1] "logical"
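List elements can also be given names, which makes them easier to retrieve than by position. A small sketch follows; the list `student` and its field names are hypothetical, chosen only for illustration.

```r
# A named list mixing a string, a numeric vector and a logical value
student <- list(name = "Aastha", marks = c(8, 10, 12), fee_paid = TRUE)

print(student$name)             # access by name with $
print(student[["marks"]][2])    # [[ ]] returns the element, then [2] indexes into it
print(length(student))          # number of top-level elements
str(student)                    # compact overview of the structure
```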
CODE-5(LISTS IN R PROGRAMMING)
Matrices in R:
In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created
by passing a vector to the matrix() function. On R matrices, we can perform
addition, subtraction, multiplication, and division.
In an R matrix, elements are arranged in a fixed number of rows and columns. All
elements of a matrix share the same data type; the examples here use numeric values.
By default the matrix is filled column by column; byrow=T fills it row by row.
>m1=matrix(c(6,5,4,3,2,1))
>m1
     [,1]
[1,]    6
[2,]    5
[3,]    4
[4,]    3
[5,]    2
[6,]    1
>m1=matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
>m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
>m1=matrix(c(1,2,3,4,5,6),nrow=2,ncol=3,byrow=T)
>m1
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
>m1[1,2]
[1] 2
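The arithmetic on matrices mentioned above works element by element when the two matrices have the same shape, while %*% performs true matrix multiplication. A minimal sketch, where m2 is a hypothetical matrix of ones introduced only for the example:

```r
m1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
m2 <- matrix(1, nrow = 2, ncol = 3)    # 2 x 3 matrix filled with ones

added   <- m1 + m2          # element-wise addition
scaled  <- m1 * 2           # every element multiplied by 2
product <- m1 %*% t(m1)     # matrix product of a 2 x 3 and a 3 x 2 matrix

print(dim(m1))
print(added)
print(product)
```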
CODE-6(MATRICES IN R PROGRAMMING)
Array in R:
An array is a data structure that can hold multi-dimensional data. In R, an array can
stack two or more two-dimensional layers (matrices). A vector is an array of values with
a single index, while a matrix is an array of values with two indices.
▶ One-dimensional arrays behave like vectors, with the length being their only
dimension.
▶ Two-dimensional arrays are called matrices, consisting of fixed numbers of rows
and columns.
▶ All elements of an array have the same data type.
▶ An array in R can be created with the array() function, whose dim argument gives
the number of rows, columns, and layers.
>vec1=c(1,2,3,4,5,6)
>vec2=c(7,8,9,10,11,12)
>a1=array(c(vec1,vec2),dim=c(2,3,2))
>a1
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
>a1[1,2,1]
[1] 3
>a1[1,1,2]
[1] 7
>a1[2,3,2]
[1] 12
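Leaving an index empty selects everything along that dimension, so whole layers or rows can be extracted from the array. A short sketch using the same a1; the result names are illustrative.

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
a1 <- array(c(vec1, vec2), dim = c(2, 3, 2))

dims         <- dim(a1)        # rows, columns, layers
second_layer <- a1[, , 2]      # the whole second 2 x 3 matrix
first_row    <- a1[1, , 1]     # first row of the first layer

print(dims)
print(second_layer)
print(first_row)
```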
CODE-7(ARRAY IN R PROGRAMMING)
Simple programming constructs such as if… else, for, while, break.
When we’re programming in R (or any other language, for that matter), we often
want to control when and how particular parts of our code are executed. We can do
that using control structures like if-else statements, for loops, and while loops.
▶The IF Conditional Statement: Let’s say we’re watching a sports match that decides
which team makes the playoffs. We could visualize the possible outcomes using
this tree chart:
IF STATEMENT:
▶ As we can see in the tree chart, there are only two possible outcomes. If Team A
wins, they go to the playoffs. If Team B wins, then they go.
>teamA=5
>teamB=3
>if(teamA>teamB)
{print("Team A Wins")}
[1] "Team A Wins"
R prints "Team A Wins" because the condition is TRUE: 5 is greater than 3.
ELSE STATEMENT:
What if Team A had 1 goal and Team B had 3 goals? Our teamA>teamB
conditional would evaluate to FALSE. As a result, nothing would be printed if we
ran our code. Because the if statement evaluates to false, the code block inside
the if statement is not executed. In this case, an else statement supplies the code
to run instead.
>teamA=1
>teamB=3
>if(teamA>teamB)
{print("Team A Wins")}else{print("Team B wins")}
[1] "Team B wins"
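The two branches above do not handle a tie. As a sketch, an else if branch can cover that third outcome; the function name match_result is hypothetical and is used only to make the logic reusable.

```r
# Returns a message for each of the three possible outcomes
match_result <- function(teamA, teamB) {
  if (teamA > teamB) {
    "Team A wins"
  } else if (teamA < teamB) {
    "Team B wins"
  } else {
    "Draw"
  }
}

print(match_result(5, 3))
print(match_result(1, 3))
print(match_result(2, 2))
```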
FOR LOOP:
A for loop is a control statement that runs a statement, or a set of statements,
multiple times. It is commonly used to iterate over the items of a sequence.
Print Days of Week
>week=c("sunday","monday","tuesday","wednesday","thursday","friday","saturday")
>for (days in week) {print(days)}
[1] "sunday"
[1] "monday"
[1] "tuesday"
[1] "wednesday"
[1] "thursday"
[1] "friday"
[1] "saturday"
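When the position of each item is needed as well as its value, a common pattern is to loop over the indices with seq_along(). A minimal sketch with the same week vector; the vector `labels` is introduced only for illustration.

```r
week <- c("sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday")

labels <- character(length(week))        # pre-allocate the result vector
for (i in seq_along(week)) {             # i takes the values 1, 2, ..., 7
  labels[i] <- paste0(i, ": ", week[i])  # combine position and day name
}
print(labels)
```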
WHILE LOOP:
A while loop is a control statement that runs a statement, or a set of statements,
repeatedly until the given condition becomes false. A while loop in R is a close
cousin of the for loop. However, a while loop checks a logical condition and keeps
running as long as the condition is true.
while(condition){statement}
▶ If the condition in the while loop in R is always true, the while loop will be an
infinite loop, and our program will never stop running.
▶ Example: Let’s take a team that’s starting the season with zero wins. They’ll need
to win 10 matches to make the playoffs. We can write a while loop to tell us
whether the team wins:
>wins=0
>while(wins<10)
{print("will not win")
wins=wins+1}
The loop keeps running until the condition becomes false, that is, until wins
reaches 10 in this case.
Break statement in R:
Break statement is used to terminate the loop
>i=1
>while(i<=10)
{print(i)
if(i==4)
break
i=i+1
}
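break works the same way inside a for loop. The sketch below stops at the first value that reaches a threshold; the vector and the threshold of 15 are illustrative assumptions.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6)

first_high <- NA                 # stays NA if no value reaches the threshold
checked    <- 0                  # how many values were inspected before stopping
for (m in marks) {
  checked <- checked + 1
  if (m >= 15) {
    first_high <- m
    break                        # leave the loop as soon as a match is found
  }
}
print(first_high)
print(checked)
```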
CODE-12(BREAK STATEMENT IN R PROGRAMMING)
Introduction to Data Frame
>fruits=data.frame(fruit_name=c("apple","banana","mango"),fruit_cost=c(100,200,300))
>fruits
  fruit_name fruit_cost
1      apple        100
2     banana        200
3      mango        300
>fruits$fruit_cost
[1] 100 200 300
>fruits$fruit_name
[1] "apple" "banana" "mango"
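Beyond extracting columns with $, a data frame can be inspected, filtered by a condition, and extended with new columns. A short sketch using the same fruits data; the 10% discount and the column name `discounted` are hypothetical.

```r
fruits <- data.frame(fruit_name = c("apple", "banana", "mango"),
                     fruit_cost = c(100, 200, 300))

str(fruits)                                      # column names and types
expensive <- fruits[fruits$fruit_cost > 150, ]   # keep rows meeting a condition
fruits$discounted <- fruits$fruit_cost * 0.9     # add a derived column
average_cost <- mean(fruits$fruit_cost)

print(expensive)
print(fruits)
print(average_cost)
```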
CODE-13(DATAFRAME IN R PROGRAMMING)
Summary statistics
• R provides a wide range of functions for obtaining summary statistics. One method of
obtaining descriptive statistics is the summary(x) function, where x is a vector or
data set.
• summary() is part of base R, so no package needs to be installed for it. The fBasics
package can be installed in addition for more detailed descriptive statistics.
• summary() calculates the descriptive statistics of the whole data series. The values
calculated are:
o Mean
o Median
o Minimum
o Maximum
o 1st and 3rd quartile
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>summary(marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   7.000   8.000   9.118  12.000  20.000
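Each value reported by summary() can also be obtained on its own with a dedicated base R function, which is handy when only one statistic is needed. A sketch using the same marks vector; the result names are illustrative.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)

avg       <- mean(marks)                     # arithmetic mean
middle    <- median(marks)                   # median
extremes  <- range(marks)                    # minimum and maximum
quartiles <- quantile(marks, c(0.25, 0.75))  # 1st and 3rd quartile
spread    <- sd(marks)                       # standard deviation (not shown by summary)

print(round(avg, 3))
print(middle)
print(extremes)
print(quartiles)
```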
CODE-15(>summary(iris$Sepal.Length) IN R PROGRAMMING)
Quick Plots
• To present the data in a simple plot, we just need to write the command plot(x),
where x is the vector or data set.
• This draws a basic plot of the selected data, which looks like
this:
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>plot(marks)
CODE-16(PLOTS IN R PROGRAMMING)
Colour codes for the col= argument:
1- Black
2- Red
3- Green
4- Blue
5- Cyan (aqua)
6- Magenta (pink)
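As a sketch of how these colour codes are used, the col argument of plot() accepts either a number from the list or a colour name. The title and axis labels below are illustrative choices, not part of the original exercise.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)

# Points drawn with colour code 2, plus a title and axis labels
plot(marks, col = 2, pch = 19,
     main = "Marks of students", xlab = "Student index", ylab = "Marks")

# The same data as a line plot, with the colour given by name
plot(marks, type = "l", col = "blue")
```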
CODE-17(COLORED PLOTS IN R PROGRAMMING)
Histogram
• A histogram is a graph that shows the frequency of numerical data using
rectangles.
• The height of a rectangle (the vertical axis) represents the distribution frequency of
a variable (the amount, or how often that variable appears).
• The width of the rectangle (horizontal axis) represents the value of the variable (for
instance, minutes, years, or ages).
• The histogram displays the distribution frequency as a two-dimensional figure,
meaning the height and width of columns or rectangles have particular meanings
and can vary. A bar chart is a one-dimensional figure. The height of its bars
represents something specific.
• To draw a histogram, use the command hist(x), where x is a numeric vector.
• To draw a coloured histogram, use the command hist(x, col="red").
• To add a label to the horizontal axis, add the xlab= argument to the above
command.
• To add a heading to the histogram, add the main= argument to the above
command.
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>hist(marks)
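The options described above can be combined in one call. The sketch below also stores the value returned by hist(), whose counts component holds the number of observations in each bar; the label and title text are illustrative.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)

# Coloured histogram with an axis label and a heading
h <- hist(marks, col = "red", xlab = "Marks", main = "Distribution of marks")

print(h$breaks)   # boundaries of the bars chosen by R
print(h$counts)   # number of observations falling in each bar
```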
CODE-18(HISTOGRAM IN R PROGRAMMING)
CODE-19(>hist(SepalLength) IN R PROGRAMMING)
PIE CHARTS
A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to
illustrate numerical proportions.
In a pie chart, the arc length of each slice (and consequently its central angle and area) is
proportional to the quantity it represents.
Pie charts are created with the function pie(x, labels=), where x is a non-negative numeric
vector indicating the area of each slice and labels= is a character vector of names for
the slices.
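As a minimal sketch of pie(), the fruit costs from the data frame example can be drawn as slices, with each label showing the slice's percentage share. The variable names and the chart title are illustrative.

```r
fruit_name <- c("apple", "banana", "mango")
fruit_cost <- c(100, 200, 300)

# Percentage share of each slice, rounded to one decimal place
share <- round(100 * fruit_cost / sum(fruit_cost), 1)
slice_labels <- paste0(fruit_name, " (", share, "%)")

pie(fruit_cost, labels = slice_labels, main = "Share of total fruit cost")
print(slice_labels)
```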
CODE-20(PIE CHART IN R PROGRAMMING)
Z test in R
Hypothesis testing, also known as significance testing, is a statistical procedure
used to draw conclusions about a population from sample data. Here two hypotheses are
proposed. One is the null hypothesis, and the other is the alternate hypothesis.
Different tests are used for hypothesis testing. The tests have been categorized in two
ways:
What is Z Test?
Z test is a popular parametric test used for hypothesis testing. It is a statistical
method used to determine whether there is a significant difference between a sample
mean and a population mean, or between the means of two samples. It is used when the
sample size is large and the population standard deviation is known. Note that the Z
statistic follows a standard normal distribution under the null hypothesis. The Z value
is compared with a critical threshold, and based on this comparison it is decided
whether to reject the null hypothesis or fail to reject it. This test is applicable
where the sample size is greater than 30.
There are two types of Z tests based on samples:
One Sample Z-test
Two Sample Z-test
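For the one-sample Z-test, the test statistic is
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
Here,
• $Z$ denotes the Z value
• $\bar{X}$ is the sample mean
• $\mu$ denotes the mean of the population
• $\sigma$ denotes the population standard deviation
• $n$ denotes the sample size.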
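For the two-sample Z-test, the test statistic is
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Here,
• $\bar{X}_1$ and $\bar{X}_2$ are the sample means.
• $s_1$ and $s_2$ are the standard deviations of the two samples.
• $n_1$ and $n_2$ are the sample sizes of the two samples.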
Application of Z-test
Z-test is applied when:
1. Population Standard Deviation is Known:
• We use the z-test when we know the standard deviation of the population and
are comparing a sample mean to a population mean, or comparing the means of
two independent samples.
• For example, if you know the average height and standard deviation of a population,
you can test whether a sample of individuals has a significantly different average
height.
2. Sample Size is Large:
• As the sample size increases, the sampling distribution of the sample mean
becomes approximately normal, according to the Central Limit Theorem.
Therefore, the Z-test becomes more appropriate as the sample size increases.
Z test in R
R is a popular high-level programming language used for statistical analysis. It is
an open-source language with a huge community, and users can contribute to its
development. It has a vast number of packages which allow data miners to perform
statistical analysis and data visualization interactively.
The z.test() function is not part of base R; it is provided by add-on packages such
as BSDA, which must be installed and loaded first. Its syntax is:
z.test(x, y, alternative='two.sided', mu=0, sigma.x=NULL, sigma.y=NULL, conf.level=.95)
Now we can conduct one-sample and two-sample tests in R.
Here we provide the data vector(s), the known standard deviation(s), and the
population mean to be tested against. Then we use z.test to calculate the z value.
This method provides a complete summary of the output.
The one sample test is as follows:
The z.test function returns a test result object that includes the test statistic, p-value,
and other relevant information.
The output of the z test is:
• Test Statistics (z): 0.53759
• P-value: 0.5909
• Alternative Hypothesis: The true mean is not equal to 24.
• 95% Confidence Interval: The confidence interval for the true mean is given
as (19.50205, 31.89795).
• Sample Estimate (mean of x): 25.7
The p-value is 0.5909, which is greater than the chosen significance level (for
example 0.05), so we fail to reject the null hypothesis. There is not enough evidence
to suggest that the true mean is different from 24 based on the sample data. The
95% confidence interval provides a range of plausible values for the true mean, and
it contains 24. Note that failing to reject the null hypothesis does not prove it;
it only means the data are consistent with it.
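The reported figures can be checked by hand in base R, without any add-on package. The sketch below is only an illustration: the function name z_test_one_sample is hypothetical, and the values sigma = 10 and n = 10 are assumptions inferred from the width of the reported confidence interval, since the original data are not shown.

```r
# One-sample z-test from summary values (two-sided alternative)
z_test_one_sample <- function(xbar, mu0, sigma, n, conf.level = 0.95) {
  se <- sigma / sqrt(n)                       # standard error of the mean
  z <- (xbar - mu0) / se                      # test statistic
  p_value <- 2 * pnorm(-abs(z))               # two-sided p-value
  margin <- qnorm(1 - (1 - conf.level) / 2) * se
  list(z = z, p_value = p_value, conf_int = c(xbar - margin, xbar + margin))
}

# xbar and mu0 come from the output above; sigma and n are assumed values
result <- z_test_one_sample(xbar = 25.7, mu0 = 24, sigma = 10, n = 10)
print(result)
```

Under these assumptions the function should reproduce the statistic, p-value, and interval listed above.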
Now we will perform two sample Z-Test
The output of the two-sample z-test comparing two independent samples:
From the above output we can see that the z-value is negative, and the p-value is
very small. So based on the above calculations we can say that there is sufficient
evidence to reject the null hypothesis. In this case we accept the alternate
hypothesis.
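A two-sample z statistic can also be computed directly in base R. The sketch below is illustrative only: the function name z_test_two_sample is hypothetical, the two groups are simulated rather than taken from the original example, and the standard deviations are assumed to be known and equal to 5.

```r
# Two-sample z-test for independent samples with known standard deviations
z_test_two_sample <- function(x, y, sigma_x, sigma_y) {
  se <- sqrt(sigma_x^2 / length(x) + sigma_y^2 / length(y))
  z <- (mean(x) - mean(y)) / se
  list(z = z, p_value = 2 * pnorm(-abs(z)))   # two-sided p-value
}

set.seed(1)                                   # make the simulated data repeatable
group_a <- rnorm(40, mean = 50, sd = 5)       # hypothetical sample 1
group_b <- rnorm(40, mean = 55, sd = 5)       # hypothetical sample 2

res <- z_test_two_sample(group_a, group_b, sigma_x = 5, sigma_y = 5)
print(res)
```

A negative z value here would simply mean that the first sample mean is smaller than the second.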
Statistical tests are essential tools in data analysis, helping researchers make inferences about
populations based on sample data. Two common tests used to compare the means of different
groups are the two-sample t-test and the paired t-test. Both tests are based on the
R - Calculate Test MSE given a trained model from a training set and a test set
Mean Squared Error (MSE) is a widely used metric for evaluating the performance
of regression models. It measures the average of the squares of the errors, that is,
the average squared difference between the actual and predicted values. The Test
MSE, specifically, helps in assessing how well the model generalizes to new,
unseen data.
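As a minimal sketch of the idea, the code below fits a linear model on a training subset and computes the MSE on the held-out rows. The built-in mtcars data set, the predictors wt and hp, and the 22/10 split are all assumptions chosen for illustration.

```r
set.seed(42)                                      # make the random split repeatable
train_idx <- sample(nrow(mtcars), size = 22)      # about 70% of the 32 rows
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

model <- lm(mpg ~ wt + hp, data = train)          # train on the training set only
predicted <- predict(model, newdata = test)       # predict for unseen rows

# Test MSE: average squared difference between actual and predicted values
test_mse <- mean((test$mpg - predicted)^2)
print(test_mse)
```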
Upper Tail Test of Population Mean with Unknown Variance in R
A statistical hypothesis test is a method of statistical inference used to decide
whether the data at hand sufficiently support a particular hypothesis. The
conventional steps that are followed while formulating the hypothesis test are
listed as follows: State the null hypothesis (H0) and the alternate hypothesis (Ha)
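When the population variance is unknown, the upper tail test uses the t distribution, which base R provides through t.test() with alternative = "greater". The sketch below reuses the marks vector from the summary statistics section; the hypothesised mean mu0 = 8 and the 5% significance level are assumptions made for illustration.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)
mu0 <- 8                                   # hypothetical value under H0: mean <= 8

# Built-in upper tail t-test (Ha: mean > mu0)
res <- t.test(marks, mu = mu0, alternative = "greater")
print(res)

# The same statistic by hand, compared with the critical value at the 5% level
n <- length(marks)
t_manual <- (mean(marks) - mu0) / (sd(marks) / sqrt(n))
critical <- qt(0.95, df = n - 1)
reject_h0 <- t_manual > critical
print(c(t = t_manual, critical = critical))
print(reject_h0)
```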