Introduction to Data Analysis and Graphics2
Hellen Gakuruh
2017-03-07
Session Two
Vector and Assignment, Data Objects and Data Importation
Outline
By the end of this session we will have knowledge on:
• Vectors and Assignment
• Data types
• Data structures and
• Importing data into R
Vector and Assignment
• The simplest data structure in R is a vector. From a data point of view, a
vector is a collection of elements. These elements can be numeric values,
alphabetical characters, logical values, or date and time values.
• Vectors are created with the function “c”, which means “concatenate”, e.g. a
numerical vector c(1, 5, 6, 8)
• These vectors can be named using the assignment operator “<-” or the
function “assign()”, e.g. to assign vector c(1, 5, 6, 8) to the name “num”:
num <- c(1, 5, 6, 8) or assign("num", c(1, 5, 6, 8)). We usually use “<-” for
assignment; “assign()” is mostly used when writing functions
• A vector can have any length from 0 up to 2^31 - 1 (about 2.1474836 × 10^9)
elements; 64-bit builds of R 3.0.0 and later also support longer vectors
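The two ways of naming a vector described above can be sketched as follows. This is a minimal illustration in base R; the names num2 and longer are assumptions introduced only for this example.

```r
# Create a numeric vector with c() and bind it to a name with "<-"
num <- c(1, 5, 6, 8)

# assign() does the same job, taking the name as a character string
assign("num2", c(1, 5, 6, 8))   # "num2" is a hypothetical name

# Both names now refer to equal vectors of length 4
identical(num, num2)   # TRUE
length(num)            # 4

# c() also concatenates existing vectors with new values
longer <- c(num, 10, 12)
length(longer)         # 6
```

In interactive work “<-” is usually all that is needed; assign() becomes useful when the name itself is only known as a string.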
Data types
This session covers seven data types:
• Logical
• Integer
• Real/Double
• String/Character
• Factor
• Complex
• Raw
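• The R manuals specify six basic (atomic) types: logical, integer, double,
character, complex and raw. A factor is not a seventh basic type: it is stored
as an integer vector with a “levels” attribute and class “factor”. It is listed
separately here because it behaves differently from the other six in analysis.
• In this sub-section we introduce these data types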
Data types: Logical
• These are vectors with only TRUE and FALSE values, like c(TRUE, TRUE,
FALSE, TRUE, FALSE)
• Can be treated as binary vectors in analysis
• Besides categorical variables coded with these values, logical vectors are
often created by comparison operators (“<”, “>”, “<=”, “>=”, “==”, “!=”) and
combined with logical operators (“|”, “||”, “&” and “&&”)
• During analysis, these vectors can be coerced to numeric values, in which
case TRUE becomes 1 and FALSE becomes 0
• Logical vectors can also contain the value “NA”, which in R means “Not
Available”, a placeholder for missing values.
• Most operations on a vector containing NA return NA, since the missing
value is unknown
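The behaviour of logical vectors and NA described above can be tried with a small made-up example. This is a sketch in base R; the vector ages and its values are hypothetical.

```r
ages <- c(12, 35, NA, 51)     # hypothetical data with one missing value

adult <- ages >= 18           # a comparison creates a logical vector
adult                         # FALSE TRUE NA TRUE

as.numeric(adult)             # coercion: TRUE -> 1, FALSE -> 0, NA stays NA

sum(adult)                    # NA, because one value is unknown
sum(adult, na.rm = TRUE)      # 2, after dropping the missing value
is.na(ages)                   # FALSE FALSE TRUE FALSE
```

Many summary functions accept na.rm = TRUE, which is one common way to work around missing values.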
Data types: Integer
• These are whole numbers (positive, negative or zero) without fractions: {. . . ,
-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, . . . }
• In R, integers are written with the suffix L, e.g. c(-3L, 0L, 2L, 5L, 6L). You can
confirm a vector is an integer vector with is.integer(c(-3L, 0L, 2L,
5L, 6L))
• An example of a variable that naturally takes integer values
is “number of people” (you can’t have a fraction of a person)
• Mathematically denoted by \( \mathbb{Z} \)
Real/Double
• A real number is any number along an infinite number line
• They include fractions
• Denoted mathematically by \( \mathbb{R} \)
• Any numeric vector whose values are not followed by the letter “L” is
stored as double, e.g. c(-3, 0, 2, 5, 6). You can confirm a vector is a
double vector with the function “is.double”, e.g. is.double(c(-3, 0, 2, 5,
6))
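The difference between integer and double storage can be checked directly. This is a small base R sketch; x_int and x_dbl are names assumed for illustration.

```r
x_int <- c(-3L, 0L, 2L)   # L suffix: stored as integer
x_dbl <- c(-3, 0, 2)      # no suffix: stored as double

typeof(x_int)             # "integer"
typeof(x_dbl)             # "double"

# The values compare as equal even though the storage types differ
all(x_int == x_dbl)       # TRUE
identical(x_int, x_dbl)   # FALSE

# Largest value an integer can hold: 2^31 - 1
.Machine$integer.max      # 2147483647

# Mixing integer and double in c() gives a double vector
typeof(c(1L, 2.5))        # "double"
```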
String/Character
• Composed of alphabetical letters, words or other text
• Written inside single or double quotation marks
• R has a built-in vector of the lower-case letters of the alphabet, called letters
• Examples: c("a", "b", "c"), letters, c('cats', 'and', 'dogs')
• You can check whether a vector is a character vector with function
is.character, e.g. is.character(letters)
Data type: Factors
• In R a factor vector is a categorical variable with discrete classification
(grouping)
• Example
cat <- factor(c(rep("Y", 28), rep("N", 10)))
is.factor(cat)
[1] TRUE
levels(cat)
[1] "N" "Y"
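How a factor is stored can be inspected with a shorter vector. This is a sketch in base R; answers is a hypothetical set of responses.

```r
answers <- factor(c("Y", "N", "Y", "Y"))   # hypothetical responses

typeof(answers)       # "integer": the underlying storage
class(answers)        # "factor"
levels(answers)       # "N" "Y"
as.integer(answers)   # 2 1 2 2: positions in levels()
table(answers)        # counts per level: N 1, Y 3
```

The integer codes are what make a factor different from a plain character vector with the same labels.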
Data type: Complex
• These are vectors of numbers with a real and an imaginary part. The imaginary
part is denoted by the letter “i”
• Used mathematically to make it possible to take the square root of negative
values
# Example, complex vector
3+2i
[1] 3+2i
# Confirm it's complex
is.complex(3+2i)
[1] TRUE
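The link to square roots of negative values mentioned above can be seen in a short base R sketch; z is a name chosen only for this example.

```r
sqrt(-1)               # NaN, with a warning: no real square root
sqrt(as.complex(-1))   # 0+1i

z <- 3+2i
Re(z)                  # 3, the real part
Im(z)                  # 2, the imaginary part
Mod(z)                 # sqrt(13), the distance from the origin
```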
Data type: Raw
• These are vectors of raw bytes, the units in which a computer stores data
• Closer to computer language (0’s and 1’s) than to human readable language
• Integers and doubles are jointly referred to as numeric
• The most commonly used data types are logical, numeric and character.
Complex and raw data types are rarely used
int <- c(-3L, -2L, -1L, 0L, 1L, 2L, 3L)
is.integer(int)
[1] TRUE
is.numeric(int)
[1] TRUE
doub <- c(-3, -2, -1, 0, 1, 2, 3)
is.double(doub)
[1] TRUE
is.numeric(doub)
[1] TRUE
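The raw type described above is easiest to see by converting text to bytes. This is a minimal base R sketch; the name bytes is an assumption for illustration.

```r
bytes <- charToRaw("R")   # the byte behind a one-character string
bytes                     # 52 (hexadecimal)
typeof(bytes)             # "raw"

as.integer(bytes)         # 82, the same byte as a decimal number
rawToChar(bytes)          # back to "R"
as.raw(c(1, 15, 255))     # 01 0f ff
```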
Data structures
• There are two broad types of data structures in R
– Atomic vectors
– Generic (list) vectors
• These structures have three properties
– Type
– Length and
– Attributes
• Function "typeof()" is used to establish a vector’s type, function "length()"
is used to determine its length and function "attributes()" is used to get
additional information (metadata) about a vector
• Atomic vectors and lists differ in their type: an atomic vector can only
contain one data type, while a list can contain any number of data types.
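The three properties listed above can be queried on any vector. This is a sketch in base R; v and mixed_list are hypothetical objects made up for the example.

```r
v <- c(a = 1.5, b = 2, c = 3.5)   # hypothetical named numeric vector

typeof(v)       # "double"
length(v)       # 3
attributes(v)   # $names: "a" "b" "c"

# A list can mix types; an atomic vector coerces everything to one type
mixed_list <- list(1L, "two", TRUE)
sapply(mixed_list, typeof)        # "integer" "character" "logical"
typeof(c(1L, "two", TRUE))        # "character"
```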
Atomic Vectors
• Contain only one data type. They include one-dimensional atomic vectors, two-
dimensional atomic vectors called “matrices” and multi-dimensional atomic
vectors called “arrays”.
• Dimensionality can be considered as the number of indices required to address
any element in a vector, e.g. the vector “cat” created earlier requires one index to
address any value; for example index 4 means the fourth value, which is Y
• Single variables are all atomic vectors of one dimension
• To check whether an object is atomic or a list, use is.atomic() or is.list().
Note there is also an is.vector(), but it returns TRUE only when the vector has no
attributes other than names, so it is not a general test of whether an object is a vector
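The three checking functions can be compared side by side. This is a base R sketch; x, y and the "unit" attribute are assumptions made for illustration.

```r
x <- c(1, 2, 3)
is.atomic(x)    # TRUE
is.list(x)      # FALSE
is.vector(x)    # TRUE

names(x) <- c("a", "b", "c")
is.vector(x)    # still TRUE: names are allowed

y <- x
attr(y, "unit") <- "cm"   # hypothetical extra attribute
is.vector(y)    # FALSE: an attribute other than names is present
is.atomic(y)    # TRUE: still an atomic structure

is.list(list(1, "a"))     # TRUE
is.vector(list(1, "a"))   # TRUE: lists are vectors too
```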
Atomic vectors: Matrices
• Two dimensional atomic vectors, they contain data of the same type
• Any atomic vector can be converted to a matrix by adding a dim attribute
cat <- c(rep("Y", 28), rep("N", 10))
typeof(cat)
[1] "character"
dim(cat)
NULL
is.matrix(cat)
[1] FALSE
dim(cat) <- c(19, 2)
typeof(cat)
[1] "character"
dim(cat)
[1] 19 2
is.matrix(cat)
[1] TRUE
• Besides using "dim()" to convert a one-dimensional atomic vector to a
multi-dimensional one, matrices can be created with "matrix()", or by coercing
another data object with "as.matrix()"
typeof(airmiles)
[1] "double"
airmiles2 <- matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles2)
[1] TRUE
# as.matrix() does not use nrow or ncol; a one-dimensional object becomes a single-column matrix
airmiles3 <- as.matrix(airmiles)
is.matrix(airmiles3)
[1] TRUE
rm(airmiles2, airmiles3)
Special 1 & 2 dimension atomic vectors
Time series objects
• These are vectors used to store observations collected at given time points
(intervals) over a period of time, e.g. observations collected every three
months for five years.
• The distinguishing feature of this data is time. The interval is usually constant, like
three months (regular), but in other cases it might not be (irregular)
• In R, time series data are numeric vectors with the class attribute “ts”,
meaning time series
• Time series vectors can either be a one-dimensional atomic vector like the “AirPassengers”
data set in R or a two-dimensional matrix like "EuStockMarkets"
typeof(AirPassengers)
[1] "double"
attr(AirPassengers, "class")
[1] "ts"
typeof(EuStockMarkets)
[1] "double"
attr(EuStockMarkets, "class")
[1] "mts" "ts" "matrix"
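A regular time series can also be built from an ordinary numeric vector with ts(). This is a sketch in base R; the series sales and its values are made up for the example.

```r
# Hypothetical quarterly observations starting in the first quarter of 2015
sales <- ts(c(12, 15, 11, 18, 14, 17, 13, 20), start = c(2015, 1), frequency = 4)

class(sales)       # "ts"
typeof(sales)      # "double"
frequency(sales)   # 4 observations per year
start(sales)       # 2015 1
end(sales)         # 2016 4
```

The start and frequency arguments are what encode the regular interval; the data values themselves stay a plain double vector.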
Atomic vectors: Arrays
• Arrays are multi-dimensional atomic vectors.
• A matrix is a two-dimensional array.
• They are rarely used, but it’s good to know they exist
• Created like matrices: with "dim()", e.g. dim(a) <- c(6, 2, 2), or with array()
or as.array()
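Both ways of creating an array mentioned above can be sketched in base R. The objects a and b are assumptions for illustration.

```r
a <- 1:24
dim(a) <- c(6, 2, 2)   # 6 rows, 2 columns, 2 layers

is.array(a)            # TRUE
is.matrix(a)           # FALSE: a matrix has exactly two dimensions
a[4, 2, 1]             # 10: three indices address one element

b <- array(1:24, dim = c(6, 2, 2))
dim(b)                 # 6 2 2
all(a == b)            # TRUE: same values in the same positions
```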
Data structures: Generic vectors
• Lists are data structures which can contain more than one data type.
• There are two types of lists: two-dimensional lists called "data frames",
and general "lists"
Data frames
• The most recognizable data structure
• A core data structure in R
• Presents data in rows and columns like a matrix, but in this case columns
can have different data types
# Example
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
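A data frame with columns of different types can be built by hand. This is a sketch in base R; the data frame people and its contents are hypothetical.

```r
# Hypothetical data: each column is a vector of the same length
people <- data.frame(
  name   = c("Ann", "Ben", "Cat"),
  age    = c(34L, 28L, 45L),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

dim(people)              # 3 3: rows, columns
sapply(people, typeof)   # "character" "integer" "logical"
is.list(people)          # TRUE: a data frame is a list of columns
people$age               # 34 28 45: one column as an atomic vector
```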
Generic vectors: Lists
• These are a unique data structure
• Can contain any number and type of object, not just data. Can contain
sub-lists, hence they are also called recursive
• Created with function “list()”. Other structures can be coerced to a list
with function “as.list()”
• We will create this structure in our next session
Importing and Exporting Data in R
• Data importation is also referred to as “reading in” data
• How data are read depends on the type and location of the file
• This sub-session covers reading in local R, text, Excel, database and other
statistical program files
• We also discuss web scraping
Reading in .RData
• Data created in R can be stored in an .RData file
• This could be any data structure or a collection of objects saved from the
active workspace (global environment)
• Function “save.image()” is used to store the workspace, function “load()” is used
to read in any “.RData” file (a “.Rhistory” file is read with “loadhistory()” instead)
# See current objects
ls()
[1] "cat" "doub" "int"
# Store in an external .RData file
save.image()
# Remove all object from workspace/global environment
rm(list = ls())
ls()
character(0)
# Read in .RData
load(".RData")
# Check we have them back
ls()
[1] "cat" "doub" "int"
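When only some objects need to be stored, save() can be used instead of save.image(). This is a base R sketch; the object scores and the temporary file are assumptions for illustration, chosen so that the example does not overwrite an existing .RData file.

```r
scores <- c(10, 20, 30)                      # hypothetical object to store
rdata_file <- tempfile(fileext = ".RData")   # temporary file, not the default .RData

save(scores, file = rdata_file)   # write only this object
rm(scores)                        # remove it from the workspace

restored <- load(rdata_file)      # load() returns the names it restored
restored                          # "scores"
scores                            # 10 20 30
```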
R’s core importing function “read.table()”
• read.table is R’s core importing function
• Many other import functions, including its own wrappers, build on this
function
• Reads a file and creates a data frame from it
• It has a number of wrapper functions (functions which provide a
convenience interface to another function, for example by supplying pre-defined/default
values; this makes function calls shorter to write)
• Wrapper functions include read.csv(), read.csv2(), read.delim() and
read.delim2()
• CSV files are comma-separated files
• Delim files are text files; “delim” means delimited, which indicates how the data
are separated, for example with tabs
• Both csv and delim files are relatively easy to read into R as long as the
separator/delimiter is known
• If the separator or delimiter is not known and the file cannot be opened in
another program, it is best to read in a few lines with the readLines() function
Live demo (reading in CSV file)
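The workflow above (peek at the file, then read it with the right separator) can be sketched as follows. This is a minimal base R illustration; the file contents and the names csv_file, d1 and d2 are made up for the example.

```r
# Write a small semicolon-separated file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id;height;group", "1;1.62;a", "2;1.75;b", "3;1.80;a"), csv_file)

# Peek at the first lines to discover the separator
readLines(csv_file, n = 2)

# read.table() with header and separator spelled out
d1 <- read.table(csv_file, header = TRUE, sep = ";")

# read.delim() is a wrapper with header = TRUE already set; only sep is overridden
d2 <- read.delim(csv_file, sep = ";")

dim(d1)     # 3 3
names(d1)   # "id" "height" "group"
str(d2)
```

For a file like this one, both calls should give the same data frame; the wrapper only saves typing.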
Reading in Excel files
• Base R does not have a function to read in Excel files
• But many contributed packages have functions to read them in
• The core reference for importing this type of file is one of the R Project manuals,
R Data Import/Export, specifically chapter 9.
• The recommendation made there is to try to convert the Excel file to a “.csv”
(comma-separated) or “delim” (tab-separated) file.
Live demo (reading excel file)
Reading in Database data
• A word of caution: database data tend to be large, and R holds data in memory, so it
is not well suited to very large data. Read in only part of the data, or look for ways
to increase the memory available to R processes, such as using a cloud machine.
• Most Relational Database Management Systems (RDBMS) store data in tables similar
to R’s data frames, where columns are called “fields” and rows are called
“records”.
• Extracting part of a relational database requires database query
syntax, the core of which is the SELECT statement.
• In general, a SELECT query uses:
– FROM to select the table
– WHERE to specify a condition for inclusion and
– ORDER BY to sort results (this is important as an RDBMS does not keep
its rows in a fixed order like R’s data frames)
• There are a number of contributed packages on CRAN for reading RDBMS
data; these include RMySQL, DBI, ROracle, RPostgreSQL and RSQLite.
Live demo (reading in RDBMS and web data)
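A SELECT query of the kind outlined above can be tried without a database server by using an in-memory SQLite database. This is only a sketch: it assumes the contributed packages DBI and RSQLite are installed (the code does nothing otherwise), and the table name and the object long_waits are chosen for illustration.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # Temporary in-memory SQLite database holding a copy of a built-in data set
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "faithful", faithful)

  # Read only the records that satisfy a condition, in a defined order
  long_waits <- DBI::dbGetQuery(
    con,
    "SELECT eruptions, waiting
       FROM faithful
      WHERE waiting > 90
      ORDER BY waiting DESC"
  )

  DBI::dbDisconnect(con)
  head(long_waits)
}
```

Only the rows matching the WHERE condition are brought into R, which is the point of querying rather than reading a whole table.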
From other statistical software
• Data often need to be read in from other statistical software such as SPSS, SAS, Stata
and EpiInfo
• As with Excel and database data, a package must be used to read in these files
• The recommended package is "foreign"; other packages include
"readstata13" and "haven".
Live demo (reading SPSS and Stata data files)
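Reading a Stata file with package "foreign" can be sketched with a round trip, since no Stata file accompanies these notes. The example assumes "foreign" is installed (it ships with most R installations); dta_file and from_stata are hypothetical names, and read.dta() is documented to handle Stata versions 5 to 12 only.

```r
if (requireNamespace("foreign", quietly = TRUE)) {
  dta_file <- tempfile(fileext = ".dta")

  # Write a built-in data set in Stata format, then read it back
  foreign::write.dta(faithful, dta_file)
  from_stata <- foreign::read.dta(dta_file)

  str(from_stata)

  # SPSS files are read in a similar way, e.g.
  # foreign::read.spss("survey.sav", to.data.frame = TRUE)   # hypothetical file name
}
```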
