
Course: RSCH8079 – IT Research Methodology

Data Science with R


Session 09

D3502 - Bens Pardamean, B.Sc., M.Sc., PhD


Outline

• Introduction to R
• Descriptive Statistics
• Correlation and Regression
• t-Test and ANOVA
• Chi-Square Test
Introduction to R
Why R?

- Open source and cross-platform
- Supports the principle of reproducibility
- Produces high-quality visualizations
- Backed by a large community (> 2 million users)

Source: https://www.r-bloggers.com/new-surveys-show-continued-popularity-of-r/
R Components

• R Base
• R IDE - RStudio
• R Packages - CRAN
R Installation
- Windows and Mac OS X

Download the installer (.exe for Windows, .pkg for Mac):
http://www.r-project.org/

- Linux
Ubuntu or Debian: r-base
Red Hat or Fedora: R.i386
Suse: R-base

Example:
$ sudo apt-get install r-base

- RStudio installation:
Follow this link: https://www.rstudio.com/
Go to Products > RStudio > Download RStudio Desktop
Getting Started
To start working in R, first specify a working directory. All files related to
the analysis should be placed in this directory.

- In R Base / R GUI:
File > Change dir... > Choose a directory

- In RStudio:
File > New Project > New Directory > Choose project type > Specify the project
name and path
Basic Operators and Data Types

Arithmetic operators:
• Add ( + )
• Subtract ( - )
• Multiply ( * )
• Divide ( / )
• Power ( ^ )

> 5 + 6
[1] 11

Assignment operators ( <- , -> , = ):

> age <- 20
> age
[1] 20

> age = 20
> age
[1] 20

> 20 -> age
> age
[1] 20

Data types:
- Numeric    x = 10.25
- Integer    x = 10
- Complex    x = 10 + 3i
- Logical    x = TRUE
- Character  x = "ten"
- Factor     x = "agree"; y = "disagree"; z = "neutral"
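The type labels above can be checked interactively with class(); a minimal sketch (the variable names are illustrative, not from the slides):

```r
# Minimal sketch of R's basic data types; variable names are illustrative
x.num <- 10.25                       # numeric (double)
x.int <- 10L                         # integer (note the L suffix)
x.cpx <- 10 + 3i                     # complex
x.chr <- "ten"                       # character
x.lgl <- TRUE                        # logical
x.fct <- factor(c("agree", "disagree", "neutral"))  # factor

class(x.num)   # "numeric"
class(x.fct)   # "factor"
levels(x.fct)  # "agree" "disagree" "neutral"
```

Note that without the L suffix, x = 10 is stored as numeric, which is why the slide's "Integer x = 10" prints as numeric unless written 10L.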
Package Installation
R provides a comprehensive repository of packages for data analysis: CRAN.
> install.packages('stringr')
Installing package into ‘C:/Users/Arif/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/stringr_1.1.0.zip'
Content type 'application/zip' length 119734 bytes (116 KB)
downloaded 116 KB

package ‘stringr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\Arif\AppData\Local\Temp\RtmpCKsMwX\downloaded_packages

R Help System

> help(cat)
> ?cat


Data Import
- CSV
data.csv <- read.csv("namafile.csv", header = TRUE)
- EXCEL
library(xlsx)
data.xlsx <- read.xlsx("namafile.xlsx",sheetName = "Sheet1")
- SPSS
library(memisc)
data.spss <- as.data.set(spss.system.file ('namafile.sav'))
- TXT
data.txt = read.table("namafile.txt")
Data Exploration
Case Study – Health data analytic
Please follow this link to download survey.csv file:
http://bdsrc.binus.ac.id/RM/survey.csv
#import the data
> survey = read.csv("survey.csv")

#print the structure of the data


> str(survey)
'data.frame': 237 obs. of 6 variables:
$ sex : Factor w/ 6 levels "F","female","Female",..: 3 6 6 6 6 3 6 3 6 6 ...
$ height : int 68 70 NA 63 65 68 72 62 69 66 ...
$ weight : int 158 256 204 187 168 172 160 116 262 189 ...
$ handedness: Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ exercise : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
Data Exploration

Data entry error

- Fix incorrect data


#print all possible values for one variable
> unique(survey$sex)
[1] Female Male <NA> F M male female
Levels: F female Female M male
#If we want only "Female" and "Male" for this variable, then we need to change
#all other values

#find all rows that contain unexpected values (for instance "F")
> which(survey$sex == 'F')
[1] 210 211 212

#change them to the correct value
> survey$sex[which(survey$sex == 'F')] = 'Female'
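The same recoding can be sketched on a toy factor without survey.csv; the levels() replacement idiom merges a renamed level into an existing one (the values below are made up):

```r
# Toy version of the sex-recoding step; data is made up, not from survey.csv
sex <- factor(c("Female", "Male", "F", "M", "male", "Female"))
levels(sex)                      # "F" "Female" "M" "Male" "male"

# renaming a level to an already-existing name merges the two levels
levels(sex)[levels(sex) == "F"]    <- "Female"
levels(sex)[levels(sex) == "M"]    <- "Male"
levels(sex)[levels(sex) == "male"] <- "Male"

levels(sex)                      # "Female" "Male"
table(sex)                       # Female: 3, Male: 3
```

Recoding the levels (rather than individual rows) fixes every affected row at once.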
Data Exploration

Data entry error

- Missing values (NA)

#check how many NA values there are
> sum(is.na(survey$height))
[1] 28

#find rows containing NA
> which(is.na(survey$height))
 [1]   3  12  15  25  26  29  31  35  58  68  70  81  83  84  90  92  96 108 121
[20] 133 157 173 179 203 213 217 225 226

#exclude NA in mean calculation
> mean(survey$height, na.rm = T)
[1] 67.89474

- Missing values (NA) – Data Imputation

Replace NA with an appropriate value:

#replace NA in "height" with 160
> survey$height[is.na(survey$height)] = 160

#or: replace NA in "height" with the average height for each sex
> female.height = mean(survey$height[which(survey$sex == 'Female')], na.rm = T)
> survey$height[which(survey$sex == 'Female' & is.na(survey$height))] = female.height
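Group-wise mean imputation, as in the second option above, can be sketched end-to-end on a small made-up data frame:

```r
# Hedged sketch of group-wise mean imputation; df is made up, not survey.csv
df <- data.frame(sex    = c("Female", "Female", "Male", "Male", "Female"),
                 height = c(160, NA, 175, NA, 164))

for (s in unique(df$sex)) {
  grp.mean <- mean(df$height[df$sex == s], na.rm = TRUE)  # group average
  df$height[df$sex == s & is.na(df$height)] <- grp.mean   # fill NAs only
}

df$height  # 160 162 175 175 164
```

Each NA is replaced by the mean of the non-missing heights in its own sex group (162 for Female, 175 for Male here).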
Descriptive Statistics
Case Study
Data: Major League Baseball (MLB)
http://bdsrc.binus.ac.id/RM/teams.csv

> data = read.csv('teams.csv')


> attach(data)

> str(data)
'data.frame': 30 obs. of 9 variables:
$ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 1 2 3 4 5 6 7 8 9 10 ...
$ code : Factor w/ 30 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
$ league : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
$ division: Factor w/ 3 levels "Central","East",..: 3 2 2 2 1 1 1 1 3 1 ...
$ games : int 162 162 162 162 162 162 162 162 162 162 ...
$ wins : int 81 94 93 69 61 85 97 68 64 88 ...
$ losses : int 81 68 69 93 101 77 65 94 98 74 ...
$ pct : num 0.5 0.58 0.574 0.426 0.377 0.525 0.599 0.42 0.395 0.543 ...
$ payroll : int 67069833 86208000 76704000 110386000 80422700 118208000 80309500
78911300 75485000 131394000 ...
Scatterplots
Show a relation between two variables

> plot (payroll,wins)

Labeling in scatterplot

> plot (payroll,wins)


> id = identify(payroll, wins,labels = code, n = 5)

> plot (payroll,wins)


> with(data, text(payroll, wins, labels = code, pos = 1,
cex=0.5))
Scatterplots

Data grouping (categorical)

> s1 = which(league == 'NL')
> s2 = which(league == 'AL')
> plot(payroll[s1], wins[s1], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s2], wins[s2], pch = 2)

Data grouping (numeric)

> s3 = which(pct > 0.5)
> s4 = which(pct <= 0.5)
> plot(payroll[s3], wins[s3], pch = 3, xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 4)
Scatterplots

Line to separate two groups

> s1 = which(league == 'NL')
> s2 = which(league == 'AL')
> plot(payroll[s1], wins[s1], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s2], wins[s2], pch = 2)

Legend

> plot(payroll[s3], wins[s3], xlim = range(payroll),
  ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 2)
> lines(range(payroll), c(81, 81), lty = 3)
> legend('bottomright', c('pct > 0.5', 'pct <= 0.5'),
  pch = c(1, 2), title = 'Legend')
Data Aggregation
Comparing sum of “payroll” between two leagues

> sum(payroll[which(league == 'NL')])


[1] 1512099665
> sum(payroll[which(league == 'AL')])
[1] 1424254675

> by(payroll,league,sum)
league: AL
[1] 1424254675
------------------------------------------------------------------
league: NL
[1] 1512099665
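by() and tapply() are interchangeable for this kind of grouped sum; a sketch on the built-in mtcars data (teams.csv is not assumed to be available here):

```r
# Grouped sums with by() and tapply() on built-in mtcars:
# total mpg per cylinder count (4, 6, 8)
by.out     <- by(mtcars$mpg, mtcars$cyl, sum)
tapply.out <- tapply(mtcars$mpg, mtcars$cyl, sum)

as.numeric(by.out)      # same three group totals either way
as.numeric(tapply.out)
```

by() prints each group with a labelled separator (as on the slide), while tapply() returns a plain named vector; the numbers are identical.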
Bar Plot

> barplot(by(payroll, league, sum))

> par(xpd = T, mar = par()$mar + c(0, 0, 0, 4))
> barplot(by(payroll, list(division, league), sum), col = 2:4)
> legend(2.5, 8e8, c('Central', 'East', 'West'), fill = 2:4)
Pie Diagram

> pie(by(as.numeric(payroll), league, sum))

> labels = c('AL Central', 'AL East', 'AL West',
  'NL Central', 'NL East', 'NL West')
> pie(as.numeric(by(payroll, list(division, league), sum)), labels)
Descriptive Statistics
Case study: metropolitan.csv
http://bdsrc.binus.ac.id/RM/metropolitan.csv

> data = read.csv('metropolitan.csv')
> attach(data)
> dim(data)
[1] 2759 11
> nrow(data)
[1] 2759
> ncol(data)
[1] 11

> head(data)
> tail(data)
> summary(data)

> str(data)
'data.frame': 2759 obs. of 11 variables:
$ NAME : Factor w/ 2757 levels "Abbeville, LA",..: 4 347 1263 2444 17 2033 2408 26 124 715 ...
$ LSAD : Factor w/ 4 levels "County or equivalent",..: 3 1 1 1 3 1 1 3 1 1 ...
$ CENSUS2010POP : int 165252 13544 20202 131506 703200 161419 541781 157308 3451 94565 ...
$ NPOPCHG_2010 : int 417 -12 27 402 -332 -38 -294 277 -60 156 ...
$ NATURALINC2010 : int 228 -14 10 232 310 65 245 220 4 147 ...
$ BIRTHS2010 : int 609 36 41 532 1945 385 1560 542 6 363 ...
$ DEATHS2010 : int 381 50 31 300 1635 320 1315 322 2 216 ...
$ NETMIG2010 : int 190 2 17 171 -631 -101 -530 57 -61 11 ...
$ INTERNATIONALMIG2010: int 77 1 2 74 127 26 101 36 0 32 ...
$ DOMESTICMIG2010 : int 113 1 15 97 -758 -127 -631 21 -61 -21 ...
$ RESIDUAL2010 : int -1 0 0 -1 -11 -2 -9 0 -3 -2 ...
Descriptive Statistics
> sort(data$CENSUS2010POP)

> output = sort(data$CENSUS2010POP, decreasing = T, index.return = T)

> data[output$ix[1:10], 1:2]

> data[order(-data$CENSUS2010POP)[1:10], 1:2]
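The two "top-n" idioms above (sort with index.return, and order) give the same rows; sketched on built-in mtcars since metropolitan.csv may not be at hand:

```r
# Two equivalent "top 5 rows" idioms, on built-in mtcars
out    <- sort(mtcars$mpg, decreasing = TRUE, index.return = TRUE)
top5.a <- rownames(mtcars)[out$ix[1:5]]              # via sorted indices ($ix)

top5.b <- rownames(mtcars)[order(-mtcars$mpg)[1:5]]  # via order(), one step

top5.a  # "Toyota Corolla" first (mpg = 33.9)
```

index.return = TRUE makes sort() return a list with $x (the sorted values) and $ix (their original positions), which is what indexes the data frame.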

Data Grouping

> by(data$CENSUS2010POP,data$LSAD,mean)
data$LSAD: County or equivalent
[1] 161779.3
--------------------------------------------------
data$LSAD: Metropolitan Division
[1] 2803270
--------------------------------------------------
data$LSAD: Metropolitan Statistical Area
[1] 705786.2
--------------------------------------------------
data$LSAD: Micropolitan Statistical Area
[1] 53721.44
Descriptive Statistics
Data Distribution – Box Plot & Histogram

> boxplot(data$BIRTHS2010 ~ data$LSAD)

#data.micro: the Micropolitan Statistical Area subset (not defined on the slide)
> data.micro = data[data$LSAD == 'Micropolitan Statistical Area', ]
> hist(data.micro$BIRTHS2010)
Descriptive Statistics
Skewness
> library(moments)
> skewness(data.micro[,3:11])
       CENSUS2010POP         NPOPCHG_2010       NATURALINC2010
           1.7384473            2.6371220            1.0143676
          BIRTHS2010           DEATHS2010           NETMIG2010
           1.6833753            1.5502585            2.6078737
INTERNATIONALMIG2010      DOMESTICMIG2010         RESIDUAL2010
           4.4857400            2.3719011            0.9202234

Kurtosis
> kurtosis(data.micro[,3:11])
CENSUS2010POP NPOPCHG_2010 NATURALINC2010
6.757994 17.459700 9.590567
BIRTHS2010 DEATHS2010 NETMIG2010
6.504231 6.819837 21.681844
INTERNATIONALMIG2010 DOMESTICMIG2010 RESIDUAL2010
34.850185 22.369340 17.521871
Correlation and Regression
Parametric Correlation
Case Study: "parenthood.Rdata"
http://bdsrc.binus.ac.id/RM/parenthood.Rdata

> load( "parenthood.Rdata" )
> attach(parenthood)

> str(parenthood)
'data.frame': 100 obs. of 4 variables:
$ dan.sleep : num 7.59 7.91 5.14 7.71 6.68 5.99 8.19 7.19 7.4 6.58 ...
$ baby.sleep: num 10.18 11.66 7.92 9.61 9.75 ...
$ dan.grump : num 56 60 82 55 67 72 53 60 60 71 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
Parametric Correlation
Histogram

> hist(dan.grump)
> hist(dan.sleep)
> hist(baby.sleep)
Parametric Correlation
qqnorm()
> qqnorm(dan.grump); qqline(dan.grump, col = 'red')
> qqnorm(dan.sleep); qqline(dan.sleep, col = 'red')
> qqnorm(baby.sleep); qqline(baby.sleep, col = 'red')
Parametric Correlation
Scatterplot

> plot(dan.grump, dan.sleep)


> plot(dan.grump, baby.sleep)
> plot(dan.sleep, baby.sleep)
Parametric Correlation
Pearson's Correlation Coefficient
The cor() function

> cor(dan.sleep, dan.grump)
[1] -0.903384
> cor(baby.sleep, dan.grump)
[1] -0.5659637
> cor(baby.sleep, dan.sleep)
[1] 0.6279493
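cor() works the same way on any numeric pair; a quick self-contained check on built-in mtcars (parenthood.Rdata is not assumed to be available):

```r
# Pearson correlation on built-in mtcars: car weight vs. fuel economy
r <- cor(mtcars$wt, mtcars$mpg)
r  # about -0.87: heavier cars tend to get fewer miles per gallon
```

As with dan.sleep and dan.grump above, a value near -1 indicates a strong negative linear association.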
Non-Parametric
Correlation
Case Study: “effort.Rdata”
http://bdsrc.binus.ac.id/RM/effort.Rdata

> load( "effort.Rdata" )


> attach(effort)
> effort
hours grade
1 2 13
2 76 91
3 40 79
4 6 14
5 16 21
6 28 74
7 27 47
8 59 85

> hist(hours)
> hist(grade)
Non-Parametric Correlation
Spearman's Rank Correlation

> hours.rank = rank(hours)
> hours.rank
 [1]  1 10  6  2  3  5  4  8  7  9
> grade.rank = rank(grade)
> grade.rank
 [1]  1 10  6  2  3  5  4  8  7  9
> cor(hours.rank, grade.rank)
[1] 1

> cor(hours, grade, method = "spearman")
[1] 1
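The identity demonstrated above (Spearman's rho is Pearson's r computed on the ranks) can be checked directly using the eight (hours, grade) pairs printed from effort earlier:

```r
# Spearman's rho equals Pearson's r on the ranks; data are the eight rows
# of effort shown above
hours <- c(2, 76, 40, 6, 16, 28, 27, 59)
grade <- c(13, 91, 79, 14, 21, 74, 47, 85)

rho.direct <- cor(hours, grade, method = "spearman")
rho.ranked <- cor(rank(hours), rank(grade))

rho.direct  # 1: the rank orderings of hours and grade coincide exactly
```

A rho of 1 means the relationship is perfectly monotonic even though it is not linear (Pearson's r on the raw values would be below 1).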
Significance of Correlation
The cor.test() function

> cor.test(dan.sleep, dan.grump)

        Pearson's product-moment correlation

data: dan.sleep and dan.grump
t = -20.854, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9340614 -0.8594714
sample estimates:
      cor
-0.903384
Parametric Regression – Linear Regression
Case study: "auto.csv" (chapter 6)
http://bdsrc.binus.ac.id/RM/auto.csv

> auto = read.csv('auto.csv')
> attach(auto)
> str(auto)
> plot(horsepower, price)
Parametric Regression – Linear Regression
The lm() function
y = f(x; w) = w0 + w1*x + ε

> model = lm(price ~ horsepower)
> model

Call:
lm(formula = price ~ horsepower)

Coefficients:
(Intercept)  horsepower
    -4630.7       173.1

price = -4630.7022 + 173.1292 * horsepower
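The same call pattern on built-in mtcars (auto.csv is not assumed to be available): lm() returns an object whose coefficients define the fitted line.

```r
# Simple linear regression on built-in mtcars: mpg as a function of weight
model <- lm(mpg ~ wt, data = mtcars)
coef(model)
# (Intercept)          wt
#      37.285      -5.344   -> fitted line: mpg = 37.29 - 5.34 * wt
```

Passing data = mtcars avoids the attach() step used on the slide, which is generally considered safer style.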


Parametric Regression – Linear Regression
Regression model visualization

> plot(horsepower,price)
> abline(model)
Parametric Regression – Linear Regression
Model evaluation

> summary(model)

Call:
lm(formula = price ~ horsepower)

Residuals:
     Min       1Q   Median       3Q      Max
-10296.1  -2243.5   -450.1   1794.7  18174.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4630.70     990.58  -4.675 5.55e-06 ***
horsepower    173.13       8.99  19.259  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4728 on 191 degrees of freedom
Multiple R-squared: 0.6601, Adjusted R-squared: 0.6583
F-statistic: 370.9 on 1 and 191 DF, p-value: < 2.2e-16
Parametric Regression – Linear Regression
Prediction – the predict() function

> new.data = data.frame(horsepower = c(100,125,150,175,200))
> predict(model, new.data)
       1        2        3        4        5
12682.21 17010.44 21338.67 25666.90 29995.13

> predict(model, new.data, interval = 'confidence', level = 0.95)
       fit      lwr      upr
1 12682.21 12008.03 13356.40
2 17010.44 16238.24 17782.65
3 21338.67 20275.14 22402.20
4 25666.90 24232.01 27101.79
5 29995.13 28156.72 31833.53
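The predict() pattern, self-contained on built-in mtcars; interval = 'confidence' adds lwr/upr columns around each fitted value:

```r
# predict() with 95% confidence intervals, on a built-in mtcars model
model    <- lm(mpg ~ wt, data = mtcars)
new.data <- data.frame(wt = c(2.5, 3.0, 3.5))  # weights in 1000 lbs

ci <- predict(model, new.data, interval = "confidence", level = 0.95)
ci  # columns: fit (point estimate), lwr and upr (interval bounds)
```

The column name in new.data must match the predictor name in the model formula, otherwise predict() falls back to the training data.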
Parametric Regression – Multivariate Linear Regression

> lm(price ~ length + engine.size + horsepower + city.mpg)

Call:
lm(formula = price ~ length + engine.size + horsepower + city.mpg)

Coefficients:
(Intercept) length engine.size horsepower city.mpg
-28480.00 114.58 115.32 52.74 61.51

price = 114.58 x length + 115.32 x engine.size + 52.74 x horsepower + 61.51 x city.mpg – 28480.00
Parametric Regression – Log-Linear Regression
> lm(city.mpg ~ log(horsepower))

Call:
lm(formula = city.mpg ~ log(horsepower))

Coefficients:
    (Intercept)  log(horsepower)
         101.44           -16.62

city.mpg = 101.44 – 16.62 x log(horsepower)
Parametric Regression – Log-Linear Regression
Linear regression vs. log-linear regression

> summary(lm(city.mpg ~ horsepower))

Call:
lm(formula = city.mpg ~ horsepower)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5162 -1.9232 -0.1454  0.8365 17.2934

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.842721   0.741080   53.76   <2e-16 ***
horsepower  -0.140279   0.006725  -20.86   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.538 on 191 degrees of freedom
Multiple R-squared: 0.6949, Adjusted R-squared: 0.6933
F-statistic: 435.1 on 1 and 191 DF, p-value: < 2.2e-16

> summary(lm(city.mpg ~ log(horsepower)))

Call:
lm(formula = city.mpg ~ log(horsepower))

Residuals:
    Min      1Q  Median      3Q     Max
-6.7491 -1.7312 -0.1621  1.2798 15.0499

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     101.4362     2.8703   35.34   <2e-16 ***
log(horsepower) -16.6204     0.6251  -26.59   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.954 on 191 degrees of freedom
Multiple R-squared: 0.7873, Adjusted R-squared: 0.7862
F-statistic: 707 on 1 and 191 DF, p-value: < 2.2e-16
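The same comparison sketched on built-in mtcars, with hp standing in for the slides' horsepower variable; as above, the log transform straightens the curved relationship and raises R-squared:

```r
# Linear vs. log-linear fit on built-in mtcars, compared by R-squared
fit.lin <- lm(mpg ~ hp,      data = mtcars)
fit.log <- lm(mpg ~ log(hp), data = mtcars)

summary(fit.lin)$r.squared  # about 0.60
summary(fit.log)$r.squared  # higher: the log model fits better here
```

Comparing R-squared is only meaningful here because both models have the same response (mpg) and the same number of predictors.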
Non-Parametric Regression – Regression Tree
> library(rpart)
> fit = rpart(price ~ length + engine.size + horsepower + city.mpg)
> fit
n= 193

node), split, n, deviance, yval
      * denotes terminal node

 1) root 193 12563190000 13285.030
   2) engine.size< 182 176  3805169000 11241.450
     4) horsepower< 94.5 94   382629400  7997.319
       8) length< 172.5 72   108629400  7275.847 *
       9) length>=172.5 22   113868600 10358.500 *
     5) horsepower>=94.5 82  1299182000 14960.330
      10) length< 176.4 33   444818200 12290.670
        20) city.mpg>=22 21    94343020 10199.330 *
        21) city.mpg< 22 12    97895460 15950.500 *
      11) length>=176.4 49   460773500 16758.270 *
   3) engine.size>=182 17   413464300 34442.060 *
Non-Parametric Regression –
Regression Tree
> plot(fit, uniform = T)
> text(fit, digits = 6, cex=0.6)
t-Test and ANOVA
One sample t-test -> mean comparison
datasets::sleep

> attach(sleep)
> sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
......

Given that the hypothesized mean increase in sleep duration is 0:

> mean(extra)
[1] 1.54

H0: There is no significant difference in mean sleep duration between the
observed sample and the whole population.
H1: There is a significant difference in mean sleep duration between the
observed sample and the whole population.
One sample t-test -> mean comparison
datasets::sleep

> t.test(extra, mu = 0)

        One Sample t-test

data: extra
t = 3.413, df = 19, p-value = 0.002918
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5955845 2.4844155
sample estimates:
mean of x
     1.54

Reject H0 and accept H1 if p-value < 0.05; accept H0 if p-value > 0.05.

Conclusion: There is a significant difference in mean sleep duration between
the observed sample and the whole population.
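Since sleep is a built-in dataset, the test above is directly reproducible, and the relevant pieces can be pulled from the returned object:

```r
# One-sample t-test on the built-in sleep data, as on the slide
tt <- t.test(sleep$extra, mu = 0)

tt$statistic  # t = 3.413
tt$p.value    # 0.002918 < 0.05 -> reject H0
```

Accessing tt$p.value directly is handy when the decision rule needs to be scripted rather than read off the printed output.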
Dependent samples t-test
-> Mean comparison between two paired sample groups

> mean(extra[group==1])
[1] 0.75
> mean(extra[group==2])
[1] 2.33

H0: There is no significant difference in mean sleep duration between the two
groups.
H1: There is a significant difference in mean sleep duration between the two
groups.

> t.test(extra ~ group, sleep, paired = T)

        Paired t-test

data: extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences
                  -1.58
Independent samples t-test
-> Mean comparison between two independent sample groups
Case study: "harpo.Rdata" ->
http://bdsrc.binus.ac.id/RM/harpo.Rdata

> load("harpo.Rdata")
> str(harpo)
'data.frame': 33 obs. of 2 variables:
 $ grade: num 65 72 66 74 73 71 66 76 69 79 ...
 $ tutor: Factor w/ 2 levels "Anastasia","Bernadette": 1 2 2 1 1 2 2 2 2 2 ...

H0: There is no significant difference in mean grade between the two groups
(grouped by tutor).

H1: There is a significant difference in mean grade between the two groups
(grouped by tutor).
Independent samples t-test
-> Mean comparison between two independent sample groups

> tapply(grade, tutor, mean)
 Anastasia Bernadette
  74.53333   69.05556
> tapply(grade, tutor, sd)
 Anastasia Bernadette
  8.998942   5.774918
> tapply(grade, tutor, length)
 Anastasia Bernadette
        15         18

> t.test(grade ~ tutor, harpo, var.equal = T)

        Two Sample t-test

data: grade by tutor
t = 2.1154, df = 31, p-value = 0.04253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.1965873 10.7589683
sample estimates:
 mean in group Anastasia mean in group Bernadette
                74.53333                 69.05556

Conclusion: There is a significant difference in mean grade between the two
groups (grouped by tutor).
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

> data(InsectSprays)
> attach(InsectSprays)
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
 $ count: num 10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

> tapply(count, spray, mean)
        A         B         C         D         E         F
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667

> tapply(count, spray, var)
        A         B         C         D         E         F
22.272727 18.242424  3.901515  6.265152  3.000000 38.606061

> tapply(count, spray, length)
 A  B  C  D  E  F
12 12 12 12 12 12
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

> boxplot(count ~ spray, InsectSprays)
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

H0: There is no significant difference among the groups
H1: There is a significant difference among the groups

> oneway.test(count ~ spray)

        One-way analysis of means (not assuming equal variances)

data: count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12

> qf(.95, 5, 30.043)
[1] 2.533065

Decision: Reject H0 because F (36.0654) is greater than the F table value
(2.533065), and the p-value (7.999e-12) is very small.
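The decision rule above (compare the F statistic to the critical value from qf()) can be scripted end-to-end:

```r
# Welch's one-way ANOVA on built-in InsectSprays, plus the critical F value
res  <- oneway.test(count ~ spray, data = InsectSprays)
crit <- qf(0.95, res$parameter[["num df"]], res$parameter[["denom df"]])

res$statistic         # F = 36.065
crit                  # about 2.53
res$statistic > crit  # TRUE -> reject H0 (agrees with the tiny p-value)
```

Reading the degrees of freedom from res$parameter avoids hard-coding 5 and 30.043 as the slide does.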
ANOVA
Bartlett test (parametric) or Levene test (non-parametric)

H0: The population variances are homogeneous
H1: At least two population variances differ

> bartlett.test(count ~ spray, InsectSprays)

        Bartlett test of homogeneity of variances

data: count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

Decision: Reject H0 because the p-value (9.085e-05) is less than 0.05.
ANOVA
Tukey Honest Significant Differences

> aov.out = aov(count ~ spray, data = InsectSprays)
> TukeyHSD(aov.out)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = count ~ spray, data = InsectSprays)

$spray
         diff      lwr      upr     p adj
B-A   0.83333  -3.8661  5.53274 0.9951810
C-A -12.41700 -17.1160 -7.71730 0.0000000
D-A  -9.58330 -14.2830 -4.88390 0.0000014
E-A -11.00000 -15.6990 -6.30060 0.0000000
F-A   2.16667  -2.5327  6.86608 0.7542147
C-B -13.25000 -17.9490 -8.55060 0.0000000
D-B -10.41700 -15.1160 -5.71730 0.0000002
E-B -11.83300 -16.5330 -7.13390 0.0000000
F-B   1.33333  -3.3661  6.03274 0.9603075
D-C   2.83333  -1.8661  7.53274 0.4920707
E-C   1.41667  -3.2827  6.11608 0.9488669
F-C  14.58330  9.88393 19.28270 0.0000000
E-D  -1.41670  -6.1161  3.28274 0.9488669
F-D  11.75000  7.05059 16.44940 0.0000000
F-E  13.16670  8.46726 17.86610 0.0000000

Based on this result, there are significant differences between the pairs
C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D and F-E; the p-values for these
comparisons are less than 0.05.
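The significant pairs listed above can be extracted programmatically from the TukeyHSD result:

```r
# aov() + TukeyHSD() on built-in InsectSprays; keep pairs with p adj < 0.05
aov.out <- aov(count ~ spray, data = InsectSprays)
tk      <- TukeyHSD(aov.out)

sig <- rownames(tk$spray)[tk$spray[, "p adj"] < 0.05]
sig  # the nine pairs called out above: C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D, F-E
```

tk$spray is a plain numeric matrix, so standard column indexing by the "p adj" name works.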
Chi-square Test
chi-square – Goodness of fit (randomness)
Data -> randomness.Rdata
http://bdsrc.binus.ac.id/RM/randomness.Rdata
> load("randomness.Rdata")
> attach(cards)
> str(cards)
'data.frame': 200 obs. of 3 variables:
$ id : Factor w/ 200 levels "subj1","subj10",..: 1 112 124 135 146 157 168
179 190 2 ...
$ choice_1: Factor w/ 4 levels "clubs","diamonds",..: 4 2 3 4 3 1 3 2 4 2 ...
$ choice_2: Factor w/ 4 levels "clubs","diamonds",..: 1 1 1 1 4 3 2 1 1 4 ...

> observed = table(choice_1)


> observed
choice_1
clubs diamonds hearts spades
35 51 64 50
chi-square

H0: All cards have the same probability of being picked
    Clubs: 25% | Diamonds: 25% | Hearts: 25% | Spades: 25%

H1: The probabilities of the cards being picked differ significantly.

> prob = c(clubs = .25, diamonds = .25, hearts = .25, spades = .25)
> prob
   clubs diamonds   hearts   spades
    0.25     0.25     0.25     0.25

> N = 200 # sample size


> expected = N * prob # expected frequencies
> expected
clubs diamonds hearts spades
50 50 50 50
chi-square

> observed - expected
choice_1
   clubs diamonds   hearts   spades
     -15        1       14        0

> (observed - expected)^2
choice_1
   clubs diamonds   hearts   spades
     225        1      196        0

> (observed - expected)^2/expected
choice_1
   clubs diamonds   hearts   spades
    4.50     0.02     3.92     0.00

> sum((observed - expected)^2/expected)
[1] 8.44
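The hand computation above agrees with chisq.test(); a self-contained check using the same observed counts:

```r
# Manual goodness-of-fit statistic vs. chisq.test(), same counts as above
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
expected <- rep(sum(observed) / 4, 4)   # 50 each under H0

x2 <- sum((observed - expected)^2 / expected)
x2                                      # 8.44
unname(chisq.test(observed)$statistic)  # same value
```

Building the statistic by hand once makes clear what chisq.test() computes behind the scenes on the next slide.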
chi-square

Critical value:
> qchisq( p = .95, df = 3 )
[1] 7.814728

p-value:
> pchisq( q = 8.44, df = 3, lower.tail = FALSE )
[1] 0.03774185

Reject H0 if p-value < 0.05 (α = 0.05).

Decision: Reject H0 and accept H1.
The probabilities of the cards being picked differ significantly (the choices
are not random).
chi-square
Another approach, using the chisq.test() function:

> chisq.test(observed)

        Chi-squared test for given probabilities

data: observed
X-squared = 8.44, df = 3, p-value = 0.03774

Decision: Reject H0 and accept H1.
The probabilities of the cards being picked differ significantly (the choices
are not random).
chi-square test of independence
Data -> chapek9.Rdata
http://bdsrc.binus.ac.id/RM/chapek9.Rdata
> load( "chapek9.Rdata" )
> attach(chapek9)
> str(chapek9)
'data.frame': 180 obs. of 2 variables:
 $ species: Factor w/ 2 levels "robot","human": 1 2 2 2 1 2 2 1 2 1 ...
 $ choice : Factor w/ 3 levels "puppy","flower",..: 2 3 3 3 3 2 3 3 1 2 ...

> summary(chapek9)
  species      choice
 robot:87   puppy : 28
 human:93   flower: 43
            data  :109

> tbl = table(species, choice)
> tbl
       choice
species puppy flower data
  robot    13     30   44
  human    15     13   65
chi-square test of independence

H0: There is no association between species and choice

H1: There is an association between species and choice

> chisq.test(species, choice)

        Pearson's Chi-squared test

data: species and choice
X-squared = 10.722, df = 2, p-value = 0.004697

Critical value:

> qchisq(0.95,2)
[1] 5.991465
Fisher test for small N

> mhs <- matrix(c(1, 2, 1, 3), nrow = 2,
  dimnames = list(c("Passed", "Not passed"), c("Depressed", "Not Depressed")))
> mhs
           Depressed Not Depressed
Passed             1             1
Not passed         2             3

> chisq.test(mhs)

        Pearson's Chi-squared test with Yates' continuity correction

data: mhs
X-squared = 1.438e-32, df = 1, p-value = 1

Warning message:
In chisq.test(mhs) : Chi-squared approximation may be incorrect
Fisher test for small N
> fisher.test(mhs)

        Fisher's Exact Test for Count Data

data: mhs
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.01279034 156.23100767
sample estimates:
odds ratio
  1.414185

Odds Ratio
Based on this result, the estimated odds of being depressed are about 1.41
times higher for students who passed than for those who did not, but the
difference is not statistically significant (p-value = 1).
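The full Fisher test above is reproducible from the 2x2 table alone:

```r
# Fisher's exact test on the same 2x2 table as above
mhs <- matrix(c(1, 2, 1, 3), nrow = 2,
              dimnames = list(c("Passed", "Not passed"),
                              c("Depressed", "Not Depressed")))

ft <- fisher.test(mhs)
ft$p.value   # 1: no evidence of association in this tiny sample
ft$estimate  # conditional MLE of the odds ratio, about 1.41
```

Note that fisher.test() reports the conditional maximum-likelihood estimate of the odds ratio (1.414185), not the simple cross-product ratio (1 x 3) / (1 x 2) = 1.5.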
MOOC:
-https://www.r-bloggers.com/how-to-learn-r-2/
-https://www.datacamp.com/
-http://tryr.codeschool.com/
-https://www.coursera.org/learn/r-programming
-https://www.rstudio.com/online-learning/
References
Adler, J. (2012). R in a Nutshell. Sebastopol, California: O'Reilly Media.

Matloff, N. (2011). The Art of R Programming. San Francisco: No Starch Press, Inc.

Pardamean, B., Baurley, J.W., Muljo, H.M., Perbangsa, A.S., & Suparyanto, T. (2014).
Data Management and Analysis System for Genome-Wide Association Study.
Bioinformatics Research Group, Bina Nusantara University.

Pathak, M.A. (2014). Beginning Data Science with R. California, USA: Springer.

Teetor, P. (2011). R Cookbook. Sebastopol, California: O’Reilly Media, Inc.

Venables, W.N., & Smith, D.M. (2008). An Introduction to R. Network Theory.

Verzani, J. (2014). Using R for Introductory Statistics. Chapman and Hall/CRC.
