
Course: RSCH8079 – IT Research Methodology

Data Science with R


Session 09

D3502 - Bens Pardamean, B.Sc., M.Sc., PhD


Outline

• Introduction to R
• Descriptive Statistics
• Correlation and Regression
• t-Test and ANOVA
• Chi-Square Test
Introduction to R
Why R?

- Open source and cross-platform
- Supports the principle of reproducibility
- Produces high-quality visualizations
- Backed by a large community (> 2 million users)

Source: https://www.r-bloggers.com/new-surveys-show-continued-popularity-of-r/
R Components

• R Base
• R IDE - RStudio
• R Packages - CRAN
R Installation
- Windows and Mac OS X

Download the installer (.exe for Windows, .pkg for Mac):
http://www.r-project.org/

- Linux
Ubuntu or Debian: r-base
Red Hat or Fedora: R.i386
Suse: R-base

Example:
$ sudo apt-get install r-base

- RStudio installation:
Follow this link: https://www.rstudio.com/
Go to Products > RStudio > Download RStudio Desktop
Getting Started
To start working in R, first specify a working directory. All files related to
the analysis should be placed in this directory.

- In R Base / R GUI:
File > Change dir... > Choose a directory

- In RStudio:
File > New Project > New Directory > Choose project type > Specify the project
name and path
Basic Operators and Data Types

Arithmetic operators:
• Add ( + )
• Subtract ( - )
• Multiply ( * )
• Divide ( / )
• Power ( ^ )

> 5 + 6
[1] 11

Assignment operators ( <- , -> , = ):

> age <- 20
> age
[1] 20

> age = 20
> age
[1] 20

> 20 -> age
> age
[1] 20

Data types:
- Numeric    x = 10.25
- Integer    x = 10
- Complex    x = 10 + 3i
- Logical    x = TRUE
- Character  x = "ten"
- Factor     x = "agree"; y = "disagree"; z = "neutral"
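The type labels above can be checked interactively with class(); a minimal sketch (the variable names are illustrative, not from the slides):

```r
# Minimal sketch of R's basic data types; variable names are illustrative
x.num <- 10.25                       # numeric (double)
x.int <- 10L                         # integer (note the L suffix)
x.cpx <- 10 + 3i                     # complex
x.chr <- "ten"                       # character
x.lgl <- TRUE                        # logical
x.fct <- factor(c("agree", "disagree", "neutral"))  # factor

class(x.num)   # "numeric"
class(x.fct)   # "factor"
levels(x.fct)  # "agree" "disagree" "neutral"
```

Note that without the L suffix, x = 10 is stored as numeric, which is why the slide's "Integer x = 10" prints as numeric unless written 10L.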
Package Installation
R provides a comprehensive repository of packages for data analysis: CRAN.
> install.packages('stringr')
Installing package into ‘C:/Users/Arif/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/stringr_1.1.0.zip'
Content type 'application/zip' length 119734 bytes (116 KB)
downloaded 116 KB

package ‘stringr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\Arif\AppData\Local\Temp\RtmpCKsMwX\downloaded_packages

R Help System

> help(cat)
> ?cat


Data Import
- CSV
data.csv <- read.csv("namafile.csv", header = TRUE)
- EXCEL
library(xlsx)
data.xlsx <- read.xlsx("namafile.xlsx",sheetName = "Sheet1")
- SPSS
library(memisc)
data.spss <- as.data.set(spss.system.file ('namafile.sav'))
- TXT
data.txt = read.table("namafile.txt")
Data Exploration
Case Study – Health data analytic
Please follow this link to download survey.csv file:
http://bdsrc.binus.ac.id/RM/survey.csv
#import the data
> survey = read.csv("survey.csv")

#print the structure of the data


> str(survey)
'data.frame': 237 obs. of 6 variables:
$ sex : Factor w/ 6 levels "F","female","Female",..: 3 6 6 6 6 3 6 3 6 6 ...
$ height : int 68 70 NA 63 65 68 72 62 69 66 ...
$ weight : int 158 256 204 187 168 172 160 116 262 189 ...
$ handedness: Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ exercise : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
Data Exploration

Data entry error

- Fix incorrect data


#print all possible values for one variable
> unique(survey$sex)
[1] Female Male <NA> F M male female
Levels: F female Female M male
#If we want only "Female" and "Male" for this variable, then we need to change
#all other values

#find all rows that contain unexpected values (for instance "F")
> which(survey$sex == 'F')
[1] 210 211 212

#change them to the correct value
> survey$sex[which(survey$sex == 'F')] = 'Female'
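The same recoding can be sketched on a toy factor without survey.csv; the levels() replacement idiom merges a renamed level into an existing one (the values below are made up):

```r
# Toy version of the sex-recoding step; data is made up, not from survey.csv
sex <- factor(c("Female", "Male", "F", "M", "male", "Female"))
levels(sex)                      # "F" "Female" "M" "Male" "male"

# renaming a level to an already-existing name merges the two levels
levels(sex)[levels(sex) == "F"]    <- "Female"
levels(sex)[levels(sex) == "M"]    <- "Male"
levels(sex)[levels(sex) == "male"] <- "Male"

levels(sex)                      # "Female" "Male"
table(sex)                       # Female: 3, Male: 3
```

Recoding the levels (rather than individual rows) fixes every affected row at once.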
Data Exploration

Data entry error

- Missing values (NA)

#check how many NA values there are
> sum(is.na(survey$height))
[1] 28

#find rows containing NA
> which(is.na(survey$height))
 [1]   3  12  15  25  26  29  31  35  58  68  70  81  83  84  90  92  96 108 121
[20] 133 157 173 179 203 213 217 225 226

#exclude NA in mean calculation
> mean(survey$height, na.rm = T)
[1] 67.89474

- Missing values (NA) – Data Imputation

Replace NA with an appropriate value:

#replace NA in "height" with 160
> survey$height[is.na(survey$height)] = 160

#or: replace NA in "height" with the average height for each sex
> female.height = mean(survey$height[which(survey$sex == 'Female')], na.rm = T)
> survey$height[which(survey$sex == 'Female' & is.na(survey$height))] = female.height
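Group-wise mean imputation, as in the second option above, can be sketched end-to-end on a small made-up data frame:

```r
# Hedged sketch of group-wise mean imputation; df is made up, not survey.csv
df <- data.frame(sex    = c("Female", "Female", "Male", "Male", "Female"),
                 height = c(160, NA, 175, NA, 164))

for (s in unique(df$sex)) {
  grp.mean <- mean(df$height[df$sex == s], na.rm = TRUE)  # group average
  df$height[df$sex == s & is.na(df$height)] <- grp.mean   # fill NAs only
}

df$height  # 160 162 175 175 164
```

Each NA is replaced by the mean of the non-missing heights in its own sex group (162 for Female, 175 for Male here).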
Descriptive Statistics
Case Study
Data: Major League Baseball (MLB)
http://bdsrc.binus.ac.id/RM/teams.csv

> data = read.csv('teams.csv')


> attach(data)

> str(data)
'data.frame': 30 obs. of 9 variables:
$ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 1 2 3 4 5 6 7 8 9 10 ...
$ code : Factor w/ 30 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
$ league : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
$ division: Factor w/ 3 levels "Central","East",..: 3 2 2 2 1 1 1 1 3 1 ...
$ games : int 162 162 162 162 162 162 162 162 162 162 ...
$ wins : int 81 94 93 69 61 85 97 68 64 88 ...
$ losses : int 81 68 69 93 101 77 65 94 98 74 ...
$ pct : num 0.5 0.58 0.574 0.426 0.377 0.525 0.599 0.42 0.395 0.543 ...
$ payroll : int 67069833 86208000 76704000 110386000 80422700 118208000 80309500
78911300 75485000 131394000 ...
Scatterplots
Show a relation between two variables

> plot (payroll,wins)

Labeling in scatterplot

> plot (payroll,wins)


> id = identify(payroll, wins,labels = code, n = 5)

> plot (payroll,wins)


> with(data, text(payroll, wins, labels = code, pos = 1,
cex=0.5))
Scatterplots

Data grouping (categorical)

> s1 = which(league == 'NL')
> s2 = which(league == 'AL')
> plot(payroll[s1], wins[s1], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s2], wins[s2], pch = 2)

Data grouping (numeric)

> s3 = which(pct > 0.5)
> s4 = which(pct <= 0.5)
> plot(payroll[s3], wins[s3], pch = 3, xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 4)
Scatterplots

Line to separate two groups

> s1 = which(league == 'NL')
> s2 = which(league == 'AL')
> plot(payroll[s1], wins[s1], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s2], wins[s2], pch = 2)

Legend

> plot(payroll[s3], wins[s3], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 2)
> lines(range(payroll), c(81, 81), lty = 3)
> legend('bottomright', c('pct > 0.5', 'pct <= 0.5'),
  pch = c(1, 2), title = 'Legend')
Data Aggregation
Comparing sum of “payroll” between two leagues

> sum(payroll[which(league == 'NL')])


[1] 1512099665
> sum(payroll[which(league == 'AL')])
[1] 1424254675

> by(payroll,league,sum)
league: AL
[1] 1424254675
------------------------------------------------------------------
league: NL
[1] 1512099665
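by() and tapply() are interchangeable for this kind of grouped sum; a sketch on the built-in mtcars data (teams.csv is not assumed to be available here):

```r
# Grouped sums with by() and tapply() on built-in mtcars:
# total mpg per cylinder count (4, 6, 8)
by.out     <- by(mtcars$mpg, mtcars$cyl, sum)
tapply.out <- tapply(mtcars$mpg, mtcars$cyl, sum)

as.numeric(by.out)      # same three group totals either way
as.numeric(tapply.out)
```

by() prints each group with a labelled separator (as on the slide), while tapply() returns a plain named vector; the numbers are identical.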
Bar Plot

> barplot(by(payroll, league, sum))

> par(xpd = T, mar = par()$mar + c(0, 0, 0, 4))
> barplot(by(payroll, list(division, league), sum), col = 2:4)
> legend(2.5, 8e8, c('Central', 'East', 'West'), fill = 2:4)
Pie Diagram

> pie(by(as.numeric(payroll), league, sum))

> labels = c('AL Central', 'AL East', 'AL West',
  'NL Central', 'NL East', 'NL West')
> pie(as.numeric(by(payroll, list(division, league), sum)), labels)
Descriptive Statistics
Case study: metropolitan.csv
http://bdsrc.binus.ac.id/RM/metropolitan.csv

> data = read.csv('metropolitan.csv')
> attach(data)
> dim(data)
[1] 2759 11
> nrow(data)
[1] 2759
> ncol(data)
[1] 11

> head(data)
> tail(data)
> summary(data)

> str(data)
'data.frame': 2759 obs. of 11 variables:
$ NAME : Factor w/ 2757 levels "Abbeville, LA",..: 4 347 1263 2444 17 2033 2408 26 124 715 ...
$ LSAD : Factor w/ 4 levels "County or equivalent",..: 3 1 1 1 3 1 1 3 1 1 ...
$ CENSUS2010POP : int 165252 13544 20202 131506 703200 161419 541781 157308 3451 94565 ...
$ NPOPCHG_2010 : int 417 -12 27 402 -332 -38 -294 277 -60 156 ...
$ NATURALINC2010 : int 228 -14 10 232 310 65 245 220 4 147 ...
$ BIRTHS2010 : int 609 36 41 532 1945 385 1560 542 6 363 ...
$ DEATHS2010 : int 381 50 31 300 1635 320 1315 322 2 216 ...
$ NETMIG2010 : int 190 2 17 171 -631 -101 -530 57 -61 11 ...
$ INTERNATIONALMIG2010: int 77 1 2 74 127 26 101 36 0 32 ...
$ DOMESTICMIG2010 : int 113 1 15 97 -758 -127 -631 21 -61 -21 ...
$ RESIDUAL2010 : int -1 0 0 -1 -11 -2 -9 0 -3 -2 ...
Descriptive Statistics
> sort(data$CENSUS2010POP)

> output = sort(data$CENSUS2010POP, decreasing = T, index.return = T)

> data[output$ix[1:10], 1:2]

> data[order(-data$CENSUS2010POP)[1:10], 1:2]
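The two "top-n" idioms above (sort with index.return, and order) give the same rows; sketched on built-in mtcars since metropolitan.csv may not be at hand:

```r
# Two equivalent "top 5 rows" idioms, on built-in mtcars
out    <- sort(mtcars$mpg, decreasing = TRUE, index.return = TRUE)
top5.a <- rownames(mtcars)[out$ix[1:5]]              # via sorted indices ($ix)

top5.b <- rownames(mtcars)[order(-mtcars$mpg)[1:5]]  # via order(), one step

top5.a  # "Toyota Corolla" first (mpg = 33.9)
```

index.return = TRUE makes sort() return a list with $x (the sorted values) and $ix (their original positions), which is what indexes the data frame.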

Data Grouping

> by(data$CENSUS2010POP,data$LSAD,mean)
data$LSAD: County or equivalent
[1] 161779.3
--------------------------------------------------
data$LSAD: Metropolitan Division
[1] 2803270
--------------------------------------------------
data$LSAD: Metropolitan Statistical Area
[1] 705786.2
--------------------------------------------------
data$LSAD: Micropolitan Statistical Area
[1] 53721.44
Descriptive Statistics
Data Distribution – Box Plot & Histogram

> boxplot(data$BIRTHS2010 ~ data$LSAD)

#data.micro: the Micropolitan Statistical Area subset (not defined on the slide)
> data.micro = data[data$LSAD == 'Micropolitan Statistical Area', ]
> hist(data.micro$BIRTHS2010)
Descriptive Statistics
Skewness
> library(moments)
> skewness(data.micro[,3:11])
       CENSUS2010POP         NPOPCHG_2010       NATURALINC2010
           1.7384473            2.6371220            1.0143676
          BIRTHS2010           DEATHS2010           NETMIG2010
           1.6833753            1.5502585            2.6078737
INTERNATIONALMIG2010      DOMESTICMIG2010         RESIDUAL2010
           4.4857400            2.3719011            0.9202234

Kurtosis
> kurtosis(data.micro[,3:11])
CENSUS2010POP NPOPCHG_2010 NATURALINC2010
6.757994 17.459700 9.590567
BIRTHS2010 DEATHS2010 NETMIG2010
6.504231 6.819837 21.681844
INTERNATIONALMIG2010 DOMESTICMIG2010 RESIDUAL2010
34.850185 22.369340 17.521871
Correlation and Regression
Parametric Correlation
Case Study: "parenthood.Rdata"
http://bdsrc.binus.ac.id/RM/parenthood.Rdata

> load( "parenthood.Rdata" )
> attach(parenthood)

> str(parenthood)
'data.frame': 100 obs. of 4 variables:
$ dan.sleep : num 7.59 7.91 5.14 7.71 6.68 5.99 8.19 7.19 7.4 6.58 ...
$ baby.sleep: num 10.18 11.66 7.92 9.61 9.75 ...
$ dan.grump : num 56 60 82 55 67 72 53 60 60 71 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
Parametric Correlation
Histogram

> hist(dan.grump)
> hist(dan.sleep)
> hist(baby.sleep)
Parametric Correlation
qqnorm()
> qqnorm(dan.grump); qqline(dan.grump, col = 'red')
> qqnorm(dan.sleep); qqline(dan.sleep, col = 'red')
> qqnorm(baby.sleep); qqline(baby.sleep, col = 'red')
Parametric Correlation
Scatterplot

> plot(dan.grump, dan.sleep)


> plot(dan.grump, baby.sleep)
> plot(dan.sleep, baby.sleep)
Parametric Correlation
Pearson's Correlation Coefficient
The cor() function

> cor(dan.sleep, dan.grump)
[1] -0.903384
> cor(baby.sleep, dan.grump)
[1] -0.5659637
> cor(baby.sleep, dan.sleep)
[1] 0.6279493
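cor() works the same way on any numeric pair; a quick self-contained check on built-in mtcars (parenthood.Rdata is not assumed to be available):

```r
# Pearson correlation on built-in mtcars: car weight vs. fuel economy
r <- cor(mtcars$wt, mtcars$mpg)
r  # about -0.87: heavier cars tend to get fewer miles per gallon
```

As with dan.sleep and dan.grump above, a value near -1 indicates a strong negative linear association.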
Non-Parametric
Correlation
Case Study: “effort.Rdata”
http://bdsrc.binus.ac.id/RM/effort.Rdata

> load( "effort.Rdata" )


> attach(effort)
> effort
hours grade
1 2 13
2 76 91
3 40 79
4 6 14
5 16 21
6 28 74
7 27 47
8 59 85

> hist(hours)
> hist(grade)
Non-Parametric Correlation
Spearman's Rank Correlation

> hours.rank = rank(hours)
> hours.rank
 [1]  1 10  6  2  3  5  4  8  7  9
> grade.rank = rank(grade)
> grade.rank
 [1]  1 10  6  2  3  5  4  8  7  9
> cor(hours.rank, grade.rank)
[1] 1

> cor(hours, grade, method = "spearman")
[1] 1
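The identity demonstrated above (Spearman's rho is Pearson's r computed on the ranks) can be checked directly using the eight (hours, grade) pairs printed from effort earlier:

```r
# Spearman's rho equals Pearson's r on the ranks; data are the eight rows
# of effort shown above
hours <- c(2, 76, 40, 6, 16, 28, 27, 59)
grade <- c(13, 91, 79, 14, 21, 74, 47, 85)

rho.direct <- cor(hours, grade, method = "spearman")
rho.ranked <- cor(rank(hours), rank(grade))

rho.direct  # 1: the rank orderings of hours and grade coincide exactly
```

A rho of 1 means the relationship is perfectly monotonic even though it is not linear (Pearson's r on the raw values would be below 1).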
Significance of Correlation
The cor.test() function

> cor.test(dan.sleep, dan.grump)

        Pearson's product-moment correlation

data: dan.sleep and dan.grump
t = -20.854, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9340614 -0.8594714
sample estimates:
      cor
-0.903384
Parametric Regression – Linear Regression
Case study: "auto.csv" (chapter 6)
http://bdsrc.binus.ac.id/RM/auto.csv

> auto = read.csv('auto.csv')
> attach(auto)
> str(auto)
> plot(horsepower, price)
Parametric Regression – Linear Regression
The lm() function
y = f(x; w) = w0 + w1*x + ε

> model = lm(price ~ horsepower)
> model

Call:
lm(formula = price ~ horsepower)

Coefficients:
(Intercept)  horsepower
    -4630.7       173.1

price = -4630.7022 + 173.1292 * horsepower
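The same call pattern on built-in mtcars (auto.csv is not assumed to be available): lm() returns an object whose coefficients define the fitted line.

```r
# Simple linear regression on built-in mtcars: mpg as a function of weight
model <- lm(mpg ~ wt, data = mtcars)
coef(model)
# (Intercept)          wt
#      37.285      -5.344   -> fitted line: mpg = 37.29 - 5.34 * wt
```

Passing data = mtcars avoids the attach() step used on the slide, which is generally considered safer style.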


Parametric Regression – Linear Regression
Regression model visualization

> plot(horsepower,price)
> abline(model)
Parametric Regression – Linear Regression
Model evaluation

> summary(model)

Call:
lm(formula = price ~ horsepower)

Residuals:
     Min       1Q   Median       3Q      Max
-10296.1  -2243.5   -450.1   1794.7  18174.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4630.70     990.58  -4.675 5.55e-06 ***
horsepower    173.13       8.99  19.259  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4728 on 191 degrees of freedom
Multiple R-squared: 0.6601, Adjusted R-squared: 0.6583
F-statistic: 370.9 on 1 and 191 DF, p-value: < 2.2e-16
Parametric Regression – Linear Regression
Prediction – the predict() function

> new.data = data.frame(horsepower = c(100,125,150,175,200))
> predict(model, new.data)
       1        2        3        4        5
12682.21 17010.44 21338.67 25666.90 29995.13

> predict(model, new.data, interval = 'confidence', level = 0.95)
       fit      lwr      upr
1 12682.21 12008.03 13356.40
2 17010.44 16238.24 17782.65
3 21338.67 20275.14 22402.20
4 25666.90 24232.01 27101.79
5 29995.13 28156.72 31833.53
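The predict() pattern, self-contained on built-in mtcars; interval = 'confidence' adds lwr/upr columns around each fitted value:

```r
# predict() with 95% confidence intervals, on a built-in mtcars model
model    <- lm(mpg ~ wt, data = mtcars)
new.data <- data.frame(wt = c(2.5, 3.0, 3.5))  # weights in 1000 lbs

ci <- predict(model, new.data, interval = "confidence", level = 0.95)
ci  # columns: fit (point estimate), lwr and upr (interval bounds)
```

The column name in new.data must match the predictor name in the model formula, otherwise predict() falls back to the training data.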
Parametric Regression – Multivariate Linear Regression

> lm(price ~ length + engine.size + horsepower + city.mpg)

Call:
lm(formula = price ~ length + engine.size + horsepower + city.mpg)

Coefficients:
(Intercept) length engine.size horsepower city.mpg
-28480.00 114.58 115.32 52.74 61.51

price = 114.58 x length + 115.32 x engine.size + 52.74 x horsepower + 61.51 x city.mpg – 28480.00
Parametric Regression – Log-Linear Regression
> lm(city.mpg ~ log(horsepower))

Call:
lm(formula = city.mpg ~ log(horsepower))

Coefficients:
    (Intercept)  log(horsepower)
         101.44           -16.62

city.mpg = 101.44 – 16.62 x log(horsepower)
Parametric Regression – Log-Linear Regression
Linear regression vs. log-linear regression

> summary(lm(city.mpg ~ horsepower))

Call:
lm(formula = city.mpg ~ horsepower)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5162 -1.9232 -0.1454  0.8365 17.2934

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.842721   0.741080   53.76   <2e-16 ***
horsepower  -0.140279   0.006725  -20.86   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.538 on 191 degrees of freedom
Multiple R-squared: 0.6949, Adjusted R-squared: 0.6933
F-statistic: 435.1 on 1 and 191 DF, p-value: < 2.2e-16

> summary(lm(city.mpg ~ log(horsepower)))

Call:
lm(formula = city.mpg ~ log(horsepower))

Residuals:
    Min      1Q  Median      3Q     Max
-6.7491 -1.7312 -0.1621  1.2798 15.0499

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     101.4362     2.8703   35.34   <2e-16 ***
log(horsepower) -16.6204     0.6251  -26.59   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.954 on 191 degrees of freedom
Multiple R-squared: 0.7873, Adjusted R-squared: 0.7862
F-statistic: 707 on 1 and 191 DF, p-value: < 2.2e-16
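The same comparison sketched on built-in mtcars, with hp standing in for the slides' horsepower variable; as above, the log transform straightens the curved relationship and raises R-squared:

```r
# Linear vs. log-linear fit on built-in mtcars, compared by R-squared
fit.lin <- lm(mpg ~ hp,      data = mtcars)
fit.log <- lm(mpg ~ log(hp), data = mtcars)

summary(fit.lin)$r.squared  # about 0.60
summary(fit.log)$r.squared  # higher: the log model fits better here
```

Comparing R-squared is only meaningful here because both models have the same response (mpg) and the same number of predictors.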
Non-Parametric Regression – Regression Tree
> library(rpart)
> fit = rpart(price ~ length + engine.size + horsepower + city.mpg)
> fit
n= 193

node), split, n, deviance, yval
      * denotes terminal node

 1) root 193 12563190000 13285.030
   2) engine.size< 182 176  3805169000 11241.450
     4) horsepower< 94.5 94   382629400  7997.319
       8) length< 172.5 72   108629400  7275.847 *
       9) length>=172.5 22   113868600 10358.500 *
     5) horsepower>=94.5 82  1299182000 14960.330
      10) length< 176.4 33   444818200 12290.670
        20) city.mpg>=22 21    94343020 10199.330 *
        21) city.mpg< 22 12    97895460 15950.500 *
      11) length>=176.4 49   460773500 16758.270 *
   3) engine.size>=182 17   413464300 34442.060 *
Non-Parametric Regression –
Regression Tree
> plot(fit, uniform = T)
> text(fit, digits = 6, cex=0.6)
t-Test and ANOVA
One sample t-test -> mean comparison
datasets::sleep

> attach(sleep)
> sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
......

Given that the hypothesized mean increase in sleep duration is 0:

> mean(extra)
[1] 1.54

H0: There is no significant difference in mean sleep duration between the
observed sample and the whole population.
H1: There is a significant difference in mean sleep duration between the
observed sample and the whole population.
One sample t-test -> mean comparison
datasets::sleep

> t.test(extra, mu = 0)

        One Sample t-test

data: extra
t = 3.413, df = 19, p-value = 0.002918
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5955845 2.4844155
sample estimates:
mean of x
     1.54

Reject H0 and accept H1 if p-value < 0.05; accept H0 if p-value > 0.05.

Conclusion: There is a significant difference in mean sleep duration between
the observed sample and the whole population.
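Since sleep is a built-in dataset, the test above is directly reproducible, and the relevant pieces can be pulled from the returned object:

```r
# One-sample t-test on the built-in sleep data, as on the slide
tt <- t.test(sleep$extra, mu = 0)

tt$statistic  # t = 3.413
tt$p.value    # 0.002918 < 0.05 -> reject H0
```

Accessing tt$p.value directly is handy when the decision rule needs to be scripted rather than read off the printed output.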
Dependent samples t-test
-> Mean comparison between two paired sample groups

> mean(extra[group==1])
[1] 0.75
> mean(extra[group==2])
[1] 2.33

H0: There is no significant difference in mean sleep duration between the two
groups.
H1: There is a significant difference in mean sleep duration between the two
groups.

> t.test(extra ~ group, sleep, paired = T)

        Paired t-test

data: extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences
                  -1.58
Independent samples t-test
-> Mean comparison between two independent sample groups
Case study: "harpo.Rdata" ->
http://bdsrc.binus.ac.id/RM/harpo.Rdata

> load("harpo.Rdata")
> str(harpo)
'data.frame': 33 obs. of 2 variables:
 $ grade: num 65 72 66 74 73 71 66 76 69 79 ...
 $ tutor: Factor w/ 2 levels "Anastasia","Bernadette": 1 2 2 1 1 2 2 2 2 2 ...

H0: There is no significant difference in mean grade between the two groups
(grouped by tutor).

H1: There is a significant difference in mean grade between the two groups
(grouped by tutor).
Independent samples t-test
-> Mean comparison between two independent sample groups

> tapply(grade, tutor, mean)
 Anastasia Bernadette
  74.53333   69.05556
> tapply(grade, tutor, sd)
 Anastasia Bernadette
  8.998942   5.774918
> tapply(grade, tutor, length)
 Anastasia Bernadette
        15         18

> t.test(grade ~ tutor, harpo, var.equal = T)

        Two Sample t-test

data: grade by tutor
t = 2.1154, df = 31, p-value = 0.04253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.1965873 10.7589683
sample estimates:
 mean in group Anastasia mean in group Bernadette
                74.53333                 69.05556

Conclusion: There is a significant difference in mean grade between the two
groups (grouped by tutor).
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

> data(InsectSprays)
> attach(InsectSprays)
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
 $ count: num 10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

> tapply(count, spray, mean)
        A         B         C         D         E         F
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667

> tapply(count, spray, var)
        A         B         C         D         E         F
22.272727 18.242424  3.901515  6.265152  3.000000 38.606061

> tapply(count, spray, length)
 A  B  C  D  E  F
12 12 12 12 12 12
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

> boxplot(count ~ spray, InsectSprays)
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

H0: There is no significant difference among the groups
H1: There is a significant difference among the groups

> oneway.test(count ~ spray)

        One-way analysis of means (not assuming equal variances)

data: count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12

> qf(.95, 5, 30.043)
[1] 2.533065

Decision: Reject H0 because F (36.0654) is greater than the F table value
(2.533065), and the p-value (7.999e-12) is very small.
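The decision rule above (compare the F statistic to the critical value from qf()) can be scripted end-to-end:

```r
# Welch's one-way ANOVA on built-in InsectSprays, plus the critical F value
res  <- oneway.test(count ~ spray, data = InsectSprays)
crit <- qf(0.95, res$parameter[["num df"]], res$parameter[["denom df"]])

res$statistic         # F = 36.065
crit                  # about 2.53
res$statistic > crit  # TRUE -> reject H0 (agrees with the tiny p-value)
```

Reading the degrees of freedom from res$parameter avoids hard-coding 5 and 30.043 as the slide does.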
ANOVA
Bartlett test (parametric) or Levene test (non-parametric)

H0: The population variances are homogeneous
H1: At least two population variances differ

> bartlett.test(count ~ spray, InsectSprays)

        Bartlett test of homogeneity of variances

data: count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

Decision: Reject H0 because the p-value (9.085e-05) is less than 0.05.
ANOVA
Tukey Honest Significant Differences

> aov.out = aov(count ~ spray, data = InsectSprays)
> TukeyHSD(aov.out)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = count ~ spray, data = InsectSprays)

$spray
         diff      lwr      upr     p adj
B-A   0.83333  -3.8661  5.53274 0.9951810
C-A -12.41700 -17.1160 -7.71730 0.0000000
D-A  -9.58330 -14.2830 -4.88390 0.0000014
E-A -11.00000 -15.6990 -6.30060 0.0000000
F-A   2.16667  -2.5327  6.86608 0.7542147
C-B -13.25000 -17.9490 -8.55060 0.0000000
D-B -10.41700 -15.1160 -5.71730 0.0000002
E-B -11.83300 -16.5330 -7.13390 0.0000000
F-B   1.33333  -3.3661  6.03274 0.9603075
D-C   2.83333  -1.8661  7.53274 0.4920707
E-C   1.41667  -3.2827  6.11608 0.9488669
F-C  14.58330  9.88393 19.28270 0.0000000
E-D  -1.41670  -6.1161  3.28274 0.9488669
F-D  11.75000  7.05059 16.44940 0.0000000
F-E  13.16670  8.46726 17.86610 0.0000000

Based on this result, there are significant differences between the pairs
C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D and F-E; the p-values for these
comparisons are less than 0.05.
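The significant pairs listed above can be extracted programmatically from the TukeyHSD result:

```r
# aov() + TukeyHSD() on built-in InsectSprays; keep pairs with p adj < 0.05
aov.out <- aov(count ~ spray, data = InsectSprays)
tk      <- TukeyHSD(aov.out)

sig <- rownames(tk$spray)[tk$spray[, "p adj"] < 0.05]
sig  # the nine pairs called out above: C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D, F-E
```

tk$spray is a plain numeric matrix, so standard column indexing by the "p adj" name works.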
Chi-square Test
chi-square – Goodness of fit (randomness)
Data -> randomness.Rdata
http://bdsrc.binus.ac.id/RM/randomness.Rdata
> load("randomness.Rdata")
> attach(cards)
> str(cards)
'data.frame': 200 obs. of 3 variables:
$ id : Factor w/ 200 levels "subj1","subj10",..: 1 112 124 135 146 157 168
179 190 2 ...
$ choice_1: Factor w/ 4 levels "clubs","diamonds",..: 4 2 3 4 3 1 3 2 4 2 ...
$ choice_2: Factor w/ 4 levels "clubs","diamonds",..: 1 1 1 1 4 3 2 1 1 4 ...

> observed = table(choice_1)


> observed
choice_1
clubs diamonds hearts spades
35 51 64 50
chi-square

H0: All cards have the same probability of being picked
    Clubs: 25% | Diamonds: 25% | Hearts: 25% | Spades: 25%

H1: The probabilities of the cards being picked differ significantly.

> prob = c(clubs = .25, diamonds = .25, hearts = .25, spades = .25)
> prob
   clubs diamonds   hearts   spades
    0.25     0.25     0.25     0.25

> N = 200 # sample size


> expected = N * prob # expected frequencies
> expected
clubs diamonds hearts spades
50 50 50 50
chi-square

> observed - expected
choice_1
   clubs diamonds   hearts   spades
     -15        1       14        0

> (observed - expected)^2
choice_1
   clubs diamonds   hearts   spades
     225        1      196        0

> (observed - expected)^2/expected
choice_1
   clubs diamonds   hearts   spades
    4.50     0.02     3.92     0.00

> sum((observed - expected)^2/expected)
[1] 8.44
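The hand computation above agrees with chisq.test(); a self-contained check using the same observed counts:

```r
# Manual goodness-of-fit statistic vs. chisq.test(), same counts as above
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
expected <- rep(sum(observed) / 4, 4)   # 50 each under H0

x2 <- sum((observed - expected)^2 / expected)
x2                                      # 8.44
unname(chisq.test(observed)$statistic)  # same value
```

Building the statistic by hand once makes clear what chisq.test() computes behind the scenes on the next slide.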
chi-square

Critical value:
> qchisq( p = .95, df = 3 )
[1] 7.814728

p-value:
> pchisq( q = 8.44, df = 3, lower.tail = FALSE )
[1] 0.03774185

Reject H0 if p-value < 0.05 (α = 0.05).

Decision: Reject H0 and accept H1.
The probabilities of the cards being picked differ significantly (the choices
are not random).
chi-square
Another approach, using the chisq.test() function:

> chisq.test(observed)

        Chi-squared test for given probabilities

data: observed
X-squared = 8.44, df = 3, p-value = 0.03774

Decision: Reject H0 and accept H1.
The probabilities of the cards being picked differ significantly (the choices
are not random).
chi-square test of independence
Data -> chapek9.Rdata
http://bdsrc.binus.ac.id/RM/chapek9.Rdata
> load( "chapek9.Rdata" )
> attach(chapek9)
> str(chapek9)
'data.frame': 180 obs. of 2 variables:
 $ species: Factor w/ 2 levels "robot","human": 1 2 2 2 1 2 2 1 2 1 ...
 $ choice : Factor w/ 3 levels "puppy","flower",..: 2 3 3 3 3 2 3 3 1 2 ...

> summary(chapek9)
  species      choice
 robot:87   puppy : 28
 human:93   flower: 43
            data  :109

> tbl = table(species, choice)
> tbl
       choice
species puppy flower data
  robot    13     30   44
  human    15     13   65
chi-square test of independence

H0: There is no association between species and choice

H1: There is an association between species and choice

> chisq.test(species, choice)

        Pearson's Chi-squared test

data: species and choice
X-squared = 10.722, df = 2, p-value = 0.004697

Critical value:

> qchisq(0.95,2)
[1] 5.991465
Fisher test for small N

> mhs <- matrix(c(1, 2, 1, 3), nrow = 2,
  dimnames = list(c("Passed", "Not passed"), c("Depressed", "Not Depressed")))
> mhs
           Depressed Not Depressed
Passed             1             1
Not passed         2             3

> chisq.test(mhs)

        Pearson's Chi-squared test with Yates' continuity correction

data: mhs
X-squared = 1.438e-32, df = 1, p-value = 1

Warning message:
In chisq.test(mhs) : Chi-squared approximation may be incorrect
Fisher test for small N
> fisher.test(mhs)

        Fisher's Exact Test for Count Data

data: mhs
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.01279034 156.23100767
sample estimates:
odds ratio
  1.414185

Odds Ratio
Based on this result, the estimated odds of being depressed are about 1.41
times higher for students who passed than for those who did not, but the
difference is not statistically significant (p-value = 1).
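The full Fisher test above is reproducible from the 2x2 table alone:

```r
# Fisher's exact test on the same 2x2 table as above
mhs <- matrix(c(1, 2, 1, 3), nrow = 2,
              dimnames = list(c("Passed", "Not passed"),
                              c("Depressed", "Not Depressed")))

ft <- fisher.test(mhs)
ft$p.value   # 1: no evidence of association in this tiny sample
ft$estimate  # conditional MLE of the odds ratio, about 1.41
```

Note that fisher.test() reports the conditional maximum-likelihood estimate of the odds ratio (1.414185), not the simple cross-product ratio (1 x 3) / (1 x 2) = 1.5.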
MOOC:
-https://www.r-bloggers.com/how-to-learn-r-2/
-https://www.datacamp.com/
-http://tryr.codeschool.com/
-https://www.coursera.org/learn/r-programming
-https://www.rstudio.com/online-learning/
References
Adler, J. (2012). R in a Nutshell. Sebastopol, California: O'Reilly Media.

Matloff, N. (2011). The Art of R Programming. San Francisco: No Starch Press, Inc.

Pardamean, B., Baurley, J.W., Muljo, H.M., Perbangsa, A.S., & Suparyanto, T. (2014).
Data Management and Analysis System for Genome-Wide Association Study.
Bioinformatics Research Group, Bina Nusantara University.

Pathak, M.A. (2014). Beginning Data Science with R. California, USA: Springer.

Teetor, P. (2011). R Cookbook. Sebastopol, California: O’Reilly Media, Inc.

Venables, W.N., & Smith, D.M. (2008). An Introduction to R. Network Theory.

Verzani, J. (2014). Using R for Introductory Statistics. Chapman and Hall/CRC.
