Data Wrangling (Data Preprocessing) : Practical Assessment 1
Install library
install.packages("tidyverse")
## 1 Bridgerule, Devon, England Southampton
## 2 New York, New York, US Cherbourg
## 3 Jyväskylä, Finland Southampton
## 4 Scituate, Massachusetts, US Southampton
## 5 Birmingham, West Midlands, England Southampton
## 6 Cork, Ireland Queenstown
## Destination Lifeboat Body Class
## 1 Qu'Appelle Valley, Saskatchewan, Canada 3
## 2 New York, New York, US 4 1
## 3 New York City 14? 3
## 4 Scituate, Massachusetts, US D 1
## 5 New York City 3
## 6 New York City 3
The titanic data set describes the survival status of individual passengers on the Titanic. The titanic data frame
does not contain information about the crew, but it does contain actual ages for half of the passengers. The
principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were
begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and
Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by
Michael A. Findlay.
VARIABLE DESCRIPTIONS
Column        Description
PassengerId   Passenger Identification
survival      Survival (0 = No; 1 = Yes)
name          Name
sex           Sex
age           Age
sibsp         Number of Siblings/Spouses Aboard
parch         Number of Parents/Children Aboard
ticket        Ticket Number
fare          Passenger Fare (British pound)
cabin         Cabin
embarked      Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat          Lifeboat
body          Body Identification Number
home.dest     Home/Destination
Check the dimensions of the titanic data set with dim(), which is a base R function and does not require tidyverse
dim(ds_titanic)
## [1] 1309 21
This means the titanic data set has 1309 rows and 21 columns in total.
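As a small illustration of how to read the result of dim(), the sketch below uses a toy data frame (hypothetical values, not the real ds_titanic) and checks that the first element is the row count and the second is the column count.

```r
# Toy data frame standing in for ds_titanic (values are illustrative only)
toy <- data.frame(PassengerId = 1:3, Age = c(22, 38, NA))

d <- dim(toy)        # integer vector: c(number of rows, number of columns)
n_rows <- nrow(toy)  # same value as d[1]
n_cols <- ncol(toy)  # same value as d[2]

print(d)
print(c(n_rows, n_cols))
```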
Show the column names of the data set
colnames(ds_titanic)
Rename column SibSp to SpousesAboard and column Parch to ChildrenAboard, using their column indexes
colnames(ds_titanic)[7] = "SpousesAboard"
colnames(ds_titanic)[8] = "ChildrenAboard"
colnames(ds_titanic)
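Renaming by index depends on the columns staying in positions 7 and 8. As an alternative sketch, the same rename can be done by matching on the old name, so it still works if the column order changes. The toy data frame below is hypothetical and only stands in for ds_titanic.

```r
# Toy data frame with the two columns to rename (illustrative values)
toy <- data.frame(PassengerId = 1:2, SibSp = c(1, 0), Parch = c(0, 2))

# Match on the current name instead of a fixed position
names(toy)[names(toy) == "SibSp"] <- "SpousesAboard"
names(toy)[names(toy) == "Parch"] <- "ChildrenAboard"

print(colnames(toy))
```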
## 3 8 7.9250 1 2 865 6 26 3 2
## 4 1 53.1000 2 2 127 5 35 6 2
## 5 4 8.0500 1 2 627 1 35 1 2
## NA NA NA NA NA NA NA NA NA NA
## 9 3 11.1333 1 2 902 7 26 8 2
## 13 6 8.0500 1 2 1196 8 19 7 2
## 14 2 31.2750 1 2 632 2 39 4 2
## NA.1 NA NA NA NA NA NA NA NA NA
## Destination Lifeboat Body Class
## 1 3 1 1 3
## 2 2 4 1 1
## 3 1 2 1 3
## 4 4 5 1 1
## 5 1 1 1 3
## NA NA NA NA NA
## 9 5 3 1 3
## 13 1 1 1 3
## 14 6 1 1 3
## NA.1 NA NA NA NA
Extract the rows where age >= 20 and age <= 40, and use head() with n = 10 to show only the first 10 rows.
The aim is to observe the survival rate among adult passengers.
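The NA rows in the output are what logical indexing produces when Age is missing: the comparison returns NA, and R returns an all-NA row for it. The sketch below shows this on a toy data frame, along with one way to drop those rows using which(). The data values and the column name Survived are assumptions made for illustration, not taken from ds_titanic.

```r
# Toy data standing in for ds_titanic (illustrative values, hypothetical Survived column)
toy <- data.frame(
  PassengerId = 1:6,
  Age         = c(22, 38, NA, 54, 2, 27),
  Survived    = c(0, 1, 1, 0, 0, 1)
)

# Plain logical indexing: the row with a missing Age comes back as an all-NA row
with_na <- toy[toy$Age >= 20 & toy$Age <= 40, ]

# which() keeps only positions where the condition is TRUE, so NA rows are dropped
adults <- toy[which(toy$Age >= 20 & toy$Age <= 40), ]

print(head(adults, 10))

# Share of survivors among the selected adults
adult_survival_rate <- mean(adults$Survived)
print(adult_survival_rate)
```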
Create a new data frame with 2 variables and 10 observations, taken from the columns PassengerId and Age
new_ds = head(ds_titanic[,c("PassengerId","Age")],10)
new_ds
## PassengerId Age
## 1 1 22
## 2 2 38
## 3 3 26
## 4 4 35
## 5 5 35
## 6 6 NA
## 7 7 54
## 8 8 2
## 9 9 27
## 10 10 14
Create a new numeric vector
vec = c(10,5,3,6,8,9,4,1,2,7)
## 9 9 27 2
## 10 10 14 7
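The two rows above are consistent with vec having been attached to new_ds as a third column (rows 9 and 10 carry the values 2 and 7, the ninth and tenth elements of vec). The code that produced them is not shown, so the following is only a sketch of one way to get that result, assuming cbind() was used. The data frame is rebuilt here from the printed values so the example runs on its own.

```r
# Rebuild new_ds from the values printed earlier, so the example is self-contained
new_ds <- data.frame(
  PassengerId = 1:10,
  Age         = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)
)
vec <- c(10, 5, 3, 6, 8, 9, 4, 1, 2, 7)

# cbind() appends vec as a new column; it must have one value per row
combined <- cbind(new_ds, vec)

print(tail(combined, 2))
```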