
Analysis Using Statistical: Introduction & Data Exploration

1) The document provides an outline for a statistical analysis course covering topics like importing different types of datasets, exploring the data using basic functions, plotting the data, and exercises to measure understanding. 2) The course will teach how to import datasets in csv, text, Stata, and Excel format and explore the data structure using vectors, matrices, and built-in datasets like mtcars and iris. 3) Various data exploration techniques will be covered like checking for missing values, duplicates, sorting, plotting scatter plots, histograms, box plots, and time series to gain insights from the data.

Uploaded by

Izzue Kashfi

Statistical Analysis using R
PART I: Introduction & Data Exploration
5th December 2020, 10am-1.00pm
Nur Saadah Binti Abd Majid
10am-11.30am, via Microsoft Teams
Course Outline
Importing Dataset: Learn how to import data of different types, such as csv, text, Stata and Excel.

Understanding the dataset: Use the basic built-in functions to learn about the data.

Data Exploration: Explore the data using the plots that are typically required for the EDA process.

Exporting Dataset: Save and export the updated data.

Exercise: Measure the understanding.
Setting Working Directory

setwd("<working path>")

OR
File > Change dir… > <Find your folder>
Importing the dataset

Dataset in text file
data<-read.table("mydata.txt",sep=";")

Dataset in csv file
data<-read.csv("mydata.csv", header=T, sep=",")

Dataset in Stata file
library(foreign)
data <- read.dta("mydata.dta")
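As a rough sketch of how the separator and header arguments behave, the snippet below reads inline text instead of a file. The column names and values are made up for illustration and are not part of the course data.

```r
# Inline text stands in for a file on disk (illustrative data only)
csv_text <- "id,score\n1,10\n2,20\n3,30"
txt_text <- "id;score\n1;10\n2;20\n3;30"

from_csv <- read.csv(text = csv_text, header = TRUE, sep = ",")

# read.table() defaults to header = FALSE, so the header row must be requested
from_txt <- read.table(text = txt_text, header = TRUE, sep = ";")

str(from_csv)
all.equal(from_csv, from_txt)
```

If the text file has a header row, leaving out header = TRUE in read.table() would treat that row as data.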
Built-in dataset

Example of datasets

data(mtcars)
help(mtcars)
data(iris)
help(iris)

List of all built-in datasets

data()
Generating normally distributed data
#1. density function
dnorm(0.5, mean = 0, sd = 1, log = FALSE)

#2. distribution function
pnorm(0.5, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

#3. quantile function
p <- 0.5  #p must be a probability between 0 and 1; 0.5 is only an example value
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

#4. Generate a vector of normally distributed random numbers
set.seed(1005)
dat<-rnorm(100, mean = 0, sd = 1)
dat<-round(dat,2)
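The four functions above are related to each other. The following is a minimal sketch, using the standard normal and the example point 0.5, of how they fit together; the variable names are chosen here for illustration.

```r
x <- 0.5
dens <- dnorm(x)     # height of the N(0,1) density at x
prob <- pnorm(x)     # P(X <= x)
back <- qnorm(prob)  # the quantile function inverts pnorm, so this returns x

# dnorm agrees with the closed-form normal density
manual <- exp(-x^2 / 2) / sqrt(2 * pi)
all.equal(dens, manual)

# Setting the same seed reproduces the same random draws
set.seed(1005); first  <- rnorm(5)
set.seed(1005); second <- rnorm(5)
identical(first, second)
```

This is why set.seed() is called before rnorm(): anyone rerunning the code gets the same vector.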
Understanding the structure of the data

USING VECTORS
a<-rep(2,15)
b<-rep(1:3,5)
c<-rep(c(3,5,10:15),2)
d<-c(1:15)
try1<-cbind(a,b)
try2<-rbind(a,b)
try3<-c(a,c)

USING DATASETS
head(data)
tail(data)
nrow(data)
ncol(data)
dim(data)
colnames(data)
rownames(data)
str(data)
summary(data)
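To see what the vector commands produce, a short sketch such as the following can be run; it repeats the definitions from the slide and inspects the shapes of the results.

```r
a <- rep(2, 15)               # fifteen 2s
b <- rep(1:3, 5)              # 1 2 3 repeated five times
c <- rep(c(3, 5, 10:15), 2)   # an 8-element vector repeated twice

try1 <- cbind(a, b)           # vectors become columns
try2 <- rbind(a, b)           # vectors become rows
try3 <- c(a, c)               # one long vector

dim(try1)      # 15 rows, 2 columns
dim(try2)      # 2 rows, 15 columns
length(try3)   # 15 + 16 = 31
class(try1)    # a matrix, not yet a data frame
```

Naming a vector c works because R still finds the function c() when it is called, although a different name would be less confusing.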
Checking the missing values/duplicates

CHECKING THE DUPLICATES
class(c)
class(data)
class(try1)
#to make try1 become data.frame
try1<-as.data.frame(try1)
class(try1)

#Find the duplicated rows
library(dplyr)
duplicated(try1)   #Check the duplicated rows
distinct(try1,b)   #Gives the distinct values for a variable
unique(try1)       #Removes the duplicated rows automatically

#Identifying rows that duplicate
dup.rownum<-which(duplicated(try1))
try1[dup.rownum,]
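A minimal base R sketch of the same idea, rebuilding try1 from the earlier slide so that it runs on its own, counts the duplicated rows and shows how duplicated() and unique() relate:

```r
a <- rep(2, 15)
b <- rep(1:3, 5)
try1 <- as.data.frame(cbind(a, b))

dup <- duplicated(try1)      # TRUE for every row that repeats an earlier row
n_dup <- sum(dup)            # number of duplicated rows
first_seen <- try1[!dup, ]   # the same rows that unique(try1) keeps

n_dup
first_seen
```

Here only three distinct rows exist, so the remaining twelve rows are flagged as duplicates.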
Checking the missing values/duplicates

CHECKING THE MISSING VALUES
#Create data with missing values
try1[c(3,6),1]<-NA
try1[c(10,15),2]<-NA
try1

#Find the MISSING VALUES
complete.cases(try1)   #Check the rows without the missing values
mis.rownum<-which(!complete.cases(try1))
try1[mis.rownum,]
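As a complementary sketch, base R can also count the missing values per column and drop the incomplete rows. The snippet rebuilds try1 so that it is self-contained; is.na(), colSums() and na.omit() are not used on the slide and are shown here only as one possible approach.

```r
a <- rep(2, 15)
b <- rep(1:3, 5)
try1 <- as.data.frame(cbind(a, b))
try1[c(3, 6), 1] <- NA
try1[c(10, 15), 2] <- NA

colSums(is.na(try1))         # missing values per column
sum(!complete.cases(try1))   # rows with at least one NA
clean <- na.omit(try1)       # keep only the complete rows
nrow(clean)
```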
Sorting and Ordering the data
sort(b)                 #Sorts the vector automatically
sort(b,decreasing=T)
order(b)                #Gives the order position of each element
order(b,decreasing=T)
try1<-cbind(a,b)
try1<-as.data.frame(try1)
try1[order(b),]         #Extracts the data by order
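The difference between sort() and order() is easier to see on a tiny vector. The sketch below uses made-up values and a hypothetical data frame df:

```r
x <- c(30, 10, 20)
sort(x)       # the values in increasing order
order(x)      # the positions in x of those sorted values
x[order(x)]   # indexing by order() reproduces sort(x)

# Hypothetical data frame: reorder whole rows by one column
df <- data.frame(name = c("p", "q", "r"), value = x)
df[order(df$value, decreasing = TRUE), ]
```

sort() returns values, while order() returns row positions, which is why order() is the one used inside the square brackets of a data frame.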
If else
avr<-mean(data[,2])   #data[,2]: dataset column in a vector type
if (data[5,2]>avr) {
print("Larger than mean")
} else {
print("Smaller than mean")
}

Flow: Is the value assigned larger than the mean of the vector? Yes: "Larger than mean". No: "Smaller than mean".
Looping
for(i in 1:nrow(data)){
if (data[i,2]>avr) {
print("Larger than mean")
} else {
print("Smaller than mean")
}
}

Flow: Take the i-th observation of the dataset column (a vector). Is the i-th value larger than the mean of the vector? Yes: "Larger than mean". No: "Smaller than mean". The value of i then increases by 1. Is the new value of i larger than the max value? No: repeat. Yes: terminate.
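The same comparison can also be written without an explicit loop. The sketch below is one possible alternative, not part of the slides; it assumes column 2 of the built-in mtcars dataset as a stand-in for data[,2].

```r
# Column 2 of mtcars stands in for data[,2]; this choice is an assumption
values <- mtcars[, 2]
avr <- mean(values)

# Loop version, collecting the labels instead of printing them
looped <- character(length(values))
for (i in seq_along(values)) {
  if (values[i] > avr) {
    looped[i] <- "Larger than mean"
  } else {
    looped[i] <- "Smaller than mean"
  }
}

# Vectorised version: one call, no explicit loop
vectorised <- ifelse(values > avr, "Larger than mean", "Smaller than mean")

identical(looped, vectorised)
table(vectorised)
```

Note that in both versions a value exactly equal to the mean falls into the "Smaller than mean" branch.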
Data Exploration & Plotting

SCATTER PLOT

help(mtcars)
summary(mtcars)
dim(mtcars)

plot(mtcars[,1],mtcars[,3],
ylim=c(min(mtcars[,3]),max(mtcars[,3])),
xlim=c(min(mtcars[,1]),max(mtcars[,1])),
ylab="Displacement (cu.in.)",xlab="Miles/US gallon",
main="The Displacement against Miles/US gallon")

abline(v=mean(mtcars[,1]), lty=2, col="Red")
abline(h=mean(mtcars[,3]), lty=5, col="Blue")
points(mtcars[1:5,1],mtcars[1:5,3],col="Purple")

[Figure: scatter plot "The Displacement against Miles/US gallon"; x-axis: Miles/US gallon, y-axis: Displacement (cu.in.)]

Save options: save as metafile, save as bitmap.
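Because the plots refer to mtcars columns by position, it can help to confirm which variables those positions hold. A small sketch, using the column names documented in help(mtcars):

```r
names(mtcars)[c(1, 3)]   # the variables stored in columns 1 and 3

same_x <- identical(mtcars[, 1], mtcars$mpg)    # column 1 is miles per gallon
same_y <- identical(mtcars[, 3], mtcars$disp)   # column 3 is displacement

# Equivalent scatter plot written with column names
plot(mtcars$mpg, mtcars$disp,
     xlab = "Miles/US gallon", ylab = "Displacement (cu.in.)",
     main = "The Displacement against Miles/US gallon")
```

Referring to columns by name makes it harder to mislabel an axis by accident.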
Data Exploration & Plotting

LINE PLOT
plot(sunspots[1:200],main="Monthly averaged sunspots from 1749–1983",
xaxt="n",xlab="Month order",ylab="Sunspots number")

lines(sunspots[1:200],lty=1,col="Purple")

lines(sunspots[1:200]+20,lty=1,col="Blue")

axis(1,1:200,1:200,las=2)

legend("topright", legend=c("Sunspot", "Sunspot+20"),
col=c("purple", "blue"),lty=c(1,1),cex=0.8)

TIME SERIES PLOT

ts.plot(sunspots[1:200],main="Monthly averaged sunspots from 1749–1983",
ylab="Sunspots number",col="blue")
Data Exploration & Plotting

HISTOGRAM

#mtcars[,1] is miles per gallon, so the x-axis is labelled accordingly
hist(mtcars[,1],xlab = "Miles/US gallon",col = "pink",border = "red",
main="Miles/US gallon", breaks = 10)
Data Exploration & Plotting

BOX PLOT

boxplot(mtcars[,1],main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3],main="Displacement (cu.in.)")

par(mfrow=c(2,1))
boxplot(mtcars[,1],main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3],main="Displacement (cu.in.)", horizontal=T)

par(mfrow=c(1,2))
boxplot(mtcars[,1],main="Miles/US gallon")
boxplot(mtcars[,3],main="Displacement (cu.in.)")
Data Exploration & Plotting

QQPLOT

qqnorm(mtcars[,1], pch = 1, frame = FALSE)
qqline(mtcars[,1], col = "blue", lwd = 2)
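To make clear what the QQ plot draws, the sketch below extracts the plotted coordinates without drawing them. It relies on qqnorm() with plot.it = FALSE and on ppoints(), which are base R functions not shown on the slide.

```r
y <- mtcars[, 1]
qq <- qqnorm(y, plot.it = FALSE)   # compute the coordinates without drawing

# qq$x holds theoretical N(0,1) quantiles, qq$y holds the observed values
theoretical <- qnorm(ppoints(length(y)))
all.equal(sort(qq$x), theoretical)
identical(qq$y, y)
```

If the data were close to normal, the points (qq$x, qq$y) would lie near the straight line added by qqline().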
Data Exploration & Plotting

CORRPLOT

library(corrplot)
corrplot(cor(mtcars[,1:5]),method="circle")

corrplot(cor(mtcars[,1:5]),method="numbers") #errors make you learn: "numbers" is not a valid method
corrplot(cor(mtcars[,1:5]),method="number") #the valid method name is "number"

corrplot(cor(mtcars[,1:5]),method="square")
Save figure as Pdf

pdf(file = "My Plot.pdf", width = 4, height = 4)
boxplot(mtcars[,1], main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3], main="Displacement (cu.in.)")
dev.off()

#or using loop

pdf(file = "My Plot loop.pdf", width = 4, height = 4)
for(i in 1:5){
boxplot(mtcars[,i], main=paste("Column",i), horizontal=T)
}
dev.off()
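A quick way to confirm that a figure was actually written is to check the file afterwards. The sketch below writes to a temporary folder; the file name is hypothetical.

```r
out <- file.path(tempdir(), "mpg_boxplot.pdf")   # hypothetical file name

pdf(file = out, width = 4, height = 4)
boxplot(mtcars[, 1], main = "Miles/US gallon", horizontal = TRUE)
dev.off()   # closes the device; without this the file stays open and incomplete

file.exists(out)
file.size(out)
```

Each plotting call made while the pdf device is open goes onto a new page of the same file, which is why the loop version produces a multi-page document.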
Exporting the dataset

Dataset in text file
write.table(mydata, "mydata.txt", sep=";")

Dataset in csv file
write.csv(mydata, "mydata.csv")

Dataset in Stata file
library(foreign)
write.dta(mydata, "mydata.dta")
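One way to check an export is to read the file back in. The following sketch uses a few rows of mtcars as example data and a temporary file; mydata and path here are illustrative, not the course files.

```r
mydata <- head(mtcars[, 1:3])                # small example data
path <- file.path(tempdir(), "mydata.csv")   # hypothetical location

# row.names = FALSE keeps the car names out of the file;
# the default would write them as an extra first column
write.csv(mydata, path, row.names = FALSE)
back <- read.csv(path, header = TRUE)

dim(back)
names(back)
all.equal(back$mpg, mydata$mpg)
```

Whether to keep the row names depends on the data: for mtcars they carry the car models, so dropping them loses information unless they are first copied into a column.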
EXERCISE
1. Please import the data named “yourdata” using the read.csv command.

2. Give the mean of the hot variable in yourdata.

3. What is the dimension of the yourdata dataset?

4. Which variables are in the class of integer?

5. For the whole dataset, are there any duplicated observations?

6. Please create a command for the variable wind that gives the sentence “wind has duplicated values for more than half of the dataset” if the variable contains more than 50 duplicated values. Otherwise, give “wind is okay”.

7. Produce the output shown in the left figure in your R software.
Thank You!
