Analysis Using Statistical: Introduction & Data Exploration

1) The document provides an outline for a statistical analysis course covering topics like importing different types of datasets, exploring the data using basic functions, plotting the data, and exercises to measure understanding. 2) The course will teach how to import datasets in csv, text, Stata, and Excel format and explore the data structure using vectors, matrices, and built-in datasets like mtcars and iris. 3) Various data exploration techniques will be covered like checking for missing values, duplicates, sorting, plotting scatter plots, histograms, box plots, and time series to gain insights from the data.

Uploaded by

Izzue Kashfi

Statistical Analysis using R
PART I: Introduction & Data Exploration
5th December 2020, 10am-1.00pm
Nur Saadah Binti Abd Majid
10am-11.30am, via Microsoft Teams
Course Outline

Importing Dataset: Learn how to import data of different types, such as csv, text, Stata and Excel.

Understanding the dataset: Use the basic built-in functions to get to know the data.

Data Exploration: Explore the data using the plots that are basically required for the EDA process.

Exporting Dataset: Save and export the updated data.

Exercise: Measure the understanding.
Setting Working Directory

setwd("<working path>")

OR
File > Change dir… > <Find your folder>
Importing the dataset

Dataset in text file
data<-read.table("mydata.txt",sep=";")

Dataset in csv file
data<-read.csv("mydata.csv", header=T, sep=",")

Dataset in Stata file
library(foreign)
data<-read.dta("mydata.dta")
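A minimal sketch of the import step for readers without a data file at hand: it writes a small made-up table to temporary files and reads it back with the same two functions. The object and column names (`toy`, `id`, `score`) are assumptions for illustration only.

```r
# Hypothetical example data; the column names are not from the course files
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write to temporary files so the working directory is left untouched
tmp_csv <- tempfile(fileext = ".csv")
tmp_txt <- tempfile(fileext = ".txt")
write.csv(toy, tmp_csv, row.names = FALSE)
write.table(toy, tmp_txt, sep = ";", row.names = FALSE)

# header = TRUE tells the reader that the first line holds the column names
from_csv <- read.csv(tmp_csv, header = TRUE, sep = ",")
from_txt <- read.table(tmp_txt, header = TRUE, sep = ";")

str(from_csv)
```

The separator passed to the reader has to match the one used in the file, which is why the text file is read with sep=";".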
Built-in dataset

Example of datasets

data(mtcars)
help(mtcars)
data(iris)
help(iris)

List of all built-in datasets

data()
Generating normally distributed data
#1. density function
dnorm(0.5, mean = 0, sd = 1, log = FALSE)

#2. distribution function
pnorm(0.5, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

#3. quantile function (p is a probability between 0 and 1; 0.5 is an example value)
p<-0.5
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

#4. Generate a vector of normally distributed random numbers
set.seed(1005)
dat<-rnorm(100, mean = 0, sd = 1)
dat<-round(dat,2)
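As a short sketch of how the four functions relate: pnorm() and qnorm() undo each other, and set.seed() makes rnorm() reproducible. The value 1.96 and the object names are chosen here for illustration.

```r
# pnorm() and qnorm() are inverses of each other
p <- pnorm(1.96, mean = 0, sd = 1)   # P(X <= 1.96), about 0.975
q <- qnorm(p, mean = 0, sd = 1)      # recovers 1.96

# For the standard normal, the density at the mean is 1 / sqrt(2 * pi)
peak <- dnorm(0)

# The same seed always reproduces the same random sample
set.seed(1005)
x1 <- rnorm(100)
set.seed(1005)
x2 <- rnorm(100)
identical(x1, x2)
```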
Understanding the structure of the data

USING VECTORS
a<-rep(2,15)
b<-rep(1:3,5)
c<-rep(c(3,5,10:15),2)
d<-c(1:15)
try1<-cbind(a,b)
try2<-rbind(a,b)
try3<-c(a,c)

USING DATASETS
head(data)
tail(data)
nrow(data)
ncol(data)
dim(data)
colnames(data)
rownames(data)
str(data)
summary(data)
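A small sketch of what cbind() and rbind() do to the shape of the result, followed by the same inspection functions on the built-in mtcars data. The names `by_col` and `by_row` are made up for this example.

```r
a <- rep(2, 15)
b <- rep(1:3, 5)

by_col <- cbind(a, b)   # vectors become columns: 15 rows, 2 columns
by_row <- rbind(a, b)   # vectors become rows: 2 rows, 15 columns
dim(by_col)
dim(by_row)

# The same inspection functions work on a built-in data frame
dim(mtcars)             # 32 rows, 11 columns
head(mtcars, 3)
str(mtcars)
```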
Checking the missing values/duplicates

CHECKING THE DUPLICATES
class(c)
class(data)
class(try1)
#to make try1 become data.frame
try1<-as.data.frame(try1)
class(try1)

#Find the duplicated rows
library(dplyr)
duplicated(try1)    #check the duplicated rows
distinct(try1,b)    #gives the distinct values for a variable
unique(try1)        #removes the duplicated rows automatically

#Identifying rows that duplicate
dup.rownum<-which(duplicated(try1))
try1[dup.rownum,]
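A minimal base-R sketch of the duplicate check (no dplyr needed), rebuilding the same try1 data frame so the block runs on its own. The names `dup` and `dedup` are illustrative.

```r
try1 <- as.data.frame(cbind(a = rep(2, 15), b = rep(1:3, 5)))

dup <- duplicated(try1)   # FALSE for the first occurrence of a row, TRUE for later repeats
sum(dup)                  # number of repeated rows
which(dup)                # their row numbers

dedup <- unique(try1)     # keeps one copy of each distinct row
nrow(dedup)
```

Because column a is constant and column b cycles through 1, 2, 3, only the first three rows are distinct.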
Checking the missing values/duplicates

CHECKING THE MISSING VALUES

#Create data with missing values
try1[c(3,6),1]<-NA
try1[c(10,15),2]<-NA
try1

#Find the MISSING VALUES
complete.cases(try1)    #check the rows without the missing values
mis.rownum<-which(!complete.cases(try1))
try1[mis.rownum,]
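As a self-contained sketch of the missing-value check, the block below counts the NAs per variable and drops the incomplete rows. na.omit() and colSums() are base R functions not used on the slide; the object names are illustrative.

```r
try1 <- data.frame(a = rep(2, 15), b = rep(1:3, 5))
try1[c(3, 6), 1] <- NA
try1[c(10, 15), 2] <- NA

na_per_col <- colSums(is.na(try1))           # missing values per variable
incomplete <- which(!complete.cases(try1))   # row numbers with at least one NA
cleaned <- na.omit(try1)                     # data without those rows

na_per_col
incomplete
nrow(cleaned)
```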
Sorting and Ordering the data

sort(b)                  #sorting the vector automatically
sort(b,decreasing=T)
order(b)                 #give the number of order for each row
order(b,decreasing=T)
try1<-cbind(a,b)
try1<-as.data.frame(try1)
try1[order(b),]          #extract the data by order
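A short sketch of the difference between sort() and order(): the first returns the sorted values, the second returns the positions that would sort them. The vector `x` and the name `by_mpg` are made up for this example.

```r
x <- c(30, 10, 20)
sort(x)        # the values in increasing order: 10 20 30
order(x)       # the positions that would sort x: 2 3 1
x[order(x)]    # indexing by order(x) gives the same result as sort(x)

# Reorder a whole data frame by one column, largest mpg first
by_mpg <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
head(by_mpg[, c("mpg", "cyl")], 3)
```

This is why order() is the one used inside the square brackets of a data frame: it keeps each row together while rearranging.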
If else
Dataset in a vector type

avr<-mean(data[,2])
if (data[5,2]>avr) {
  print("Larger than mean")
} else {
  print("Smaller than mean")
}

Flowchart: Is the value assigned larger than the mean of the vector? Yes: "Larger than mean". No: "Smaller than mean".
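A runnable sketch of the same test, assuming the built-in mtcars stands in for `data` and its first column (mpg) is the variable of interest. The result is stored in a hypothetical variable `msg` so it can be reused.

```r
# mtcars stands in for `data`; column 1 is mpg (an assumption for illustration)
avr <- mean(mtcars[, 1])

# Keeping `else` on the same line as the closing brace lets the code run at the top level
if (mtcars[5, 1] > avr) {
  msg <- "Larger than mean"
} else {
  msg <- "Smaller than mean"
}
print(msg)
```

Note that a value exactly equal to the mean also falls into the else branch.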
Looping
i-th observation of dataset in a vector type

for(i in 1:nrow(data)){
  if (data[i,2]>avr) {
    print("Larger than mean")
  } else {
    print("Smaller than mean")
  }
}

Flowchart: Is the ith value larger than the mean of the vector? Yes: "Larger than mean". No: "Smaller than mean". The i value then increases by 1, creating a new value of i. Is the i value larger than the max value? No: repeat the check. Yes: terminate.
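The loop can also be written without an explicit counter. The sketch below, again assuming mtcars in place of `data`, stores the loop's answers in a vector and compares them with the vectorised ifelse() function; `res_loop` and `res_vec` are illustrative names.

```r
avr <- mean(mtcars[, 1])

# Loop version, storing each result instead of printing it
res_loop <- character(nrow(mtcars))
for (i in seq_len(nrow(mtcars))) {
  if (mtcars[i, 1] > avr) {
    res_loop[i] <- "Larger than mean"
  } else {
    res_loop[i] <- "Smaller than mean"
  }
}

# Vectorised version: ifelse() applies the test to every element at once
res_vec <- ifelse(mtcars[, 1] > avr, "Larger than mean", "Smaller than mean")

identical(res_loop, res_vec)
table(res_vec)
```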
Data Exploration & Plotting

SCATTER PLOT

help(mtcars)
summary(mtcars)
dim(mtcars)

plot(mtcars[,1],mtcars[,3],
ylim=c(min(mtcars[,3]),max(mtcars[,3])),
xlim=c(min(mtcars[,1]),max(mtcars[,1])),
ylab="Displacement (cu.in.)",xlab="Miles/US gallon",
main="The Displacement against Miles/US gallon")

abline(v=mean(mtcars[,1]), lty=2, col="Red")
abline(h=mean(mtcars[,3]), lty=5, col="Blue")
points(mtcars[1:5,1],mtcars[1:5,3],col="Purple")

[Figure: scatter plot "The Displacement against Miles/US gallon"; x-axis: Miles/US gallon, y-axis: Displacement (cu.in.)]

The plot can be saved as a metafile or as a bitmap.
Data Exploration & Plotting

LINE PLOT
plot(sunspots[1:200],main="Monthly averaged sunspots from 1749–1983",
xaxt="n",xlab="Month order",ylab="Sunspots number")
lines(sunspots[1:200],lty=1,col="Purple")
lines(sunspots[1:200]+20,lty=1,col="Blue")
axis(1,1:200,1:200,las=2)
legend("topright", legend=c("Sunspot", "Sunspot+20"),
col=c("purple", "blue"),lty=c(1,1),cex=0.8)

TIME SERIES PLOT
ts.plot(sunspots[1:200],main="Monthly averaged sunspots from 1749–1983",
ylab="Sunspots number",col="blue")
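Indexing sunspots with [1:200] drops its time information, so the x-axis shows only the month order. A possible sketch for keeping the calendar axis rebuilds a ts object from the first 200 values; the name `first200` and the plot title are assumptions for illustration.

```r
# sunspots is a monthly time series; rebuild a ts from its first 200 values
first200 <- ts(sunspots[1:200],
               start = start(sunspots),
               frequency = frequency(sunspots))

start(first200)   # first observation: January 1749
end(first200)     # 200th observation: August 1765

# plot() on a ts object draws a line plot with years on the x-axis
plot(first200, ylab = "Sunspots number",
     main = "First 200 monthly sunspot means")
```

The first 200 months therefore cover 1749 to 1765 only; the range 1749–1983 in the slide titles describes the full series.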
Data Exploration & Plotting

HISTOGRAM

hist(mtcars[,1],xlab = "Miles/US gallon",col = "pink",border = "red",
main="Miles/US gallon", breaks = 10)
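To see the numbers behind the bars, hist() can be called with plot = FALSE, which returns the bins without drawing anything. A brief sketch, with `h` as an illustrative name:

```r
h <- hist(mtcars[, 1], breaks = 10, plot = FALSE)

h$breaks   # bin edges chosen by R; 'breaks = 10' is only a suggestion
h$counts   # number of cars falling in each bin

# Every car is counted exactly once
sum(h$counts) == nrow(mtcars)
```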
Data Exploration & Plotting

BOX PLOT

boxplot(mtcars[,1],main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3],main="Displacement (cu.in.)")

par(mfrow=c(2,1))
boxplot(mtcars[,1],main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3],main="Displacement (cu.in.)", horizontal=T)

par(mfrow=c(1,2))
boxplot(mtcars[,1],main="Miles/US gallon")
boxplot(mtcars[,3],main="Displacement (cu.in.)")
Data Exploration & Plotting

QQPLOT

qqnorm(mtcars[,1], pch = 1, frame = FALSE)
qqline(mtcars[,1], col = "blue", lwd = 2)
Data Exploration & Plotting

CORRPLOT

library(corrplot)
corrplot(cor(mtcars[,1:5]),method="circle")

corrplot(cor(mtcars[,1:5]),method="numbers") #errors make you learn: the valid method name is "number"

corrplot(cor(mtcars[,1:5]),method="square")
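The figures drawn by corrplot come from an ordinary correlation matrix, which can be inspected directly in base R. A minimal sketch, with `m` as an illustrative name:

```r
m <- cor(mtcars[, 1:5])   # pairwise correlations of the first five variables
round(m, 2)

# A negative value means the two variables move in opposite directions
m["mpg", "disp"]

# The matrix is symmetric, with 1 on the diagonal
isSymmetric(m)
```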
Save figure as PDF

pdf(file = "My Plot.pdf", width = 4, height = 4)
boxplot(mtcars[,1], main="Miles/US gallon", horizontal=T)
boxplot(mtcars[,3], main="Displacement (cu.in.)")
dev.off()

#or using loop
pdf(file = "My Plot loop.pdf", width = 4, height = 4)
for(i in 1:5){
  boxplot(mtcars[,i], main=paste("Column",i), horizontal=T)
}
dev.off()
Exporting the dataset

Dataset in text file
write.table(mydata, "mydata.txt", sep=";")

Dataset in csv file
write.csv(mydata, "mydata.csv")

Dataset in Stata file
library(foreign)
write.dta(mydata, "mydata.dta")
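One way to check an export is to read the file back and compare it with the original. The sketch below does this for mtcars using a temporary file; `out` and `back` are names made up for this example.

```r
out <- tempfile(fileext = ".csv")

# By default write.csv() stores the row names as the first column
write.csv(mtcars, out)

# row.names = 1 turns that first column back into row names
back <- read.csv(out, row.names = 1)

identical(dim(back), dim(mtcars))
identical(rownames(back), rownames(mtcars))
```

Without row.names = 1 the car names would come back as an extra, unnamed first variable.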
EXERCISE
1. Import the data named "yourdata" using the read.csv command.

2. Give the mean of the hot variable in yourdata.

3. What is the dimension of the yourdata dataset?

4. Which variables are of class integer?

5. For the whole dataset, are there any duplicated observations?

6. Create a command for the variable wind that prints the sentence "wind has duplicated values for more than half of the dataset" if the variable contains more than 50 duplicated values. Otherwise, print "wind is okay".

7. Reproduce the output shown in the figure on the slide in your R software.
Thank You!
