
Data Wrangling (Data Preprocessing) : Practical Assessment 1

The document discusses data wrangling and preprocessing of the Titanic dataset. It: 1) Loads and reads in the Titanic dataset from an online source. 2) Performs some initial exploration of the data including viewing the first few rows and checking dimensions. 3) Renames and changes some column names and checks data types of columns. 4) Filters the dataset to only include passengers between the ages of 20-40 and takes the first 10 rows of the filtered data to create a matrix.

Data Wrangling (Data Preprocessing)

Practical assessment 1

The name of the student submitting the assessment report goes here

Install library
install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
Load the data from GitHub: https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
After extracting the file, rename it to titanic.csv and copy it into the project, then read the titanic data set with read.csv.
ds_titanic = read.csv(file = 'titanic.csv', header = TRUE, sep = ',')
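
The same read.csv call can be tried without the downloaded file. The following is a minimal sketch that parses a small inline CSV; the three sample rows and the names csv_text and sample_ds are illustrative assumptions, and the na.strings argument is an optional addition that is not used in the report above.

```r
# Illustrative sample only: three rows with a subset of the Titanic columns.
csv_text <- "PassengerId,Survived,Age,Cabin
1,0,22,
2,1,38,C85
3,1,,"

# text = lets read.csv parse a string instead of a file on disk.
# na.strings = c("", "NA") reads empty fields as NA instead of "".
sample_ds <- read.csv(text = csv_text, header = TRUE, sep = ",",
                      na.strings = c("", "NA"))

print(sample_ds)
print(is.na(sample_ds$Cabin))
```

Reading empty fields as NA may make missing values in columns such as Cabin easier to count later with is.na().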

Show the head of the titanic data set (head() returns the first 6 rows by default)


head(ds_titanic)

## PassengerId Survived Pclass


## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked WikiId
## 1 A/5 21171 7.2500 S 691
## 2 PC 17599 71.2833 C85 C 90
## 3 STON/O2. 3101282 7.9250 S 865
## 4 113803 53.1000 C123 S 127
## 5 373450 8.0500 S 627
## 6 330877 8.4583 Q 785
## Name_wiki Age_wiki
## 1 Braund, Mr. Owen Harris 22
## 2 Cumings, Mrs. Florence Briggs (née Thayer) 35
## 3 Heikkinen, Miss Laina 26
## 4 Futrelle, Mrs. Lily May (née Peel) 35
## 5 Allen, Mr. William Henry 35
## 6 Doherty, Mr. William John (aka "James Moran") 22
## Hometown Boarded

## 1 Bridgerule, Devon, England Southampton
## 2 New York, New York, US Cherbourg
## 3 Jyväskylä, Finland Southampton
## 4 Scituate, Massachusetts, US Southampton
## 5 Birmingham, West Midlands, England Southampton
## 6 Cork, Ireland Queenstown
## Destination Lifeboat Body Class
## 1 Qu'Appelle Valley, Saskatchewan, Canada 3
## 2 New York, New York, US 4 1
## 3 New York City 14? 3
## 4 Scituate, Massachusetts, US D 1
## 5 New York City 3
## 6 New York City 3
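
The number of rows returned by head() can be changed with the n argument, and tail() gives the matching view of the end of the data. A minimal sketch on a toy data frame (toy and its values are illustrative stand-ins for ds_titanic, not the real data):

```r
# Toy data frame standing in for ds_titanic (illustrative values).
toy <- data.frame(PassengerId = 1:8,
                  Age = c(22, 38, 26, 35, 35, NA, 54, 2))

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two <- tail(toy, n = 2)     # last 2 rows

print(first_three)
print(last_two)
```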
The titanic data set describes the survival status of individual passengers on the Titanic. The titanic data frame
does not contain information about the crew, but it does contain the actual ages of half of the passengers. The
principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were
begun by a variety of researchers. One of the original sources is Eaton & Haas (1994), Titanic: Triumph and
Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by
Michael A. Findlay.
VARIABLE DESCRIPTIONS

Column       Description
PassengerId  Passenger Identification
survival     Survival (0 = No; 1 = Yes)
name         Name
sex          Sex
age          Age
sibsp        Number of Siblings/Spouses Aboard
parch        Number of Parents/Children Aboard
ticket       Ticket Number
fare         Passenger Fare (British pound)
cabin        Cabin
embarked     Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat         Lifeboat
body         Body Identification Number
home.dest    Home/Destination

Check the dimensions of the titanic data set using dim(), which is a base R function and does not require tidyverse
dim(ds_titanic)

## [1] 1309 21
This means the titanic data set has 1309 rows and 21 columns in total.
Show the column names of the data set
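
dim() returns the row count first and the column count second; nrow() and ncol() return each value on its own. A minimal sketch on a toy data frame (toy and its values are illustrative assumptions):

```r
# Toy data frame with 4 rows and 3 columns (illustrative values).
toy <- data.frame(id = 1:4,
                  age = c(22, 38, 26, 35),
                  sex = c("male", "female", "female", "female"))

dims <- dim(toy)     # integer vector: rows, then columns
n_rows <- nrow(toy)  # same as dims[1]
n_cols <- ncol(toy)  # same as dims[2]

cat("rows:", n_rows, "columns:", n_cols, "\n")
```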
colnames(ds_titanic)

## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"


## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "WikiId" "Name_wiki" "Age_wiki"
## [16] "Hometown" "Boarded" "Destination" "Lifeboat" "Body"
## [21] "Class"

Rename column SibSp to SpousesAboard and column Parch to ChildrenAboard, selecting each column by its index
colnames(ds_titanic)[7] = "SpousesAboard"
colnames(ds_titanic)[8] = "ChildrenAboard"
colnames(ds_titanic)

## [1] "PassengerId" "Survived" "Pclass" "Name"


## [5] "Sex" "Age" "SpousesAboard" "ChildrenAboard"
## [9] "Ticket" "Fare" "Cabin" "Embarked"
## [13] "WikiId" "Name_wiki" "Age_wiki" "Hometown"
## [17] "Boarded" "Destination" "Lifeboat" "Body"
## [21] "Class"
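
Renaming by index depends on the column order staying the same. As a hedged alternative, the columns can be matched by their current names instead; the sketch below uses a toy data frame (toy and its values are illustrative) rather than ds_titanic.

```r
# Toy data frame with the two columns to rename (illustrative values).
toy <- data.frame(PassengerId = 1:3,
                  SibSp = c(1, 1, 0),
                  Parch = c(0, 0, 2))

# Select the column by its current name, so the position does not matter.
names(toy)[names(toy) == "SibSp"] <- "SpousesAboard"
names(toy)[names(toy) == "Parch"] <- "ChildrenAboard"

print(colnames(toy))
```

If a name is not present, the comparison selects nothing and the data frame is left unchanged, so a misspelled name does not overwrite a different column.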
Check the data type of each column using str()
str(ds_titanic)

## 'data.frame': 1309 obs. of 21 variables:


## $ PassengerId : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SpousesAboard : int 1 1 0 1 0 0 0 3 0 1 ...
## $ ChildrenAboard: int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## $ WikiId : num 691 90 865 127 627 ...
## $ Name_wiki : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. Florence Briggs (née Thayer)" "Heikk
## $ Age_wiki : num 22 35 26 35 35 22 54 2 26 14 ...
## $ Hometown : chr "Bridgerule, Devon, England" "New York, New York, US" "Jyväskylä, Finland" "S
## $ Boarded : chr "Southampton" "Cherbourg" "Southampton" "Southampton" ...
## $ Destination : chr "Qu'Appelle Valley, Saskatchewan, Canada" "New York, New York, US" "New York
## $ Lifeboat : chr "" "4" "14?" "D" ...
## $ Body : chr "" "" "" "" ...
## $ Class : int 3 1 3 1 3 3 1 3 3 2 ...
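
When only the column types are needed, sapply() with class() gives them as a named vector. A minimal sketch on a toy data frame (toy, col_types, and numeric_cols are illustrative names, not part of the report):

```r
# Toy data frame with integer, numeric, and character columns.
toy <- data.frame(PassengerId = 1:3,
                  Age = c(22, 38, NA),
                  Sex = c("male", "female", "female"),
                  stringsAsFactors = FALSE)

# One class name per column, returned as a named character vector.
col_types <- sapply(toy, class)
print(col_types)

# Names of the columns that hold numbers (integer or double).
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)
```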
matrix_ds_titanic = data.matrix(head(ds_titanic[ds_titanic$Age >= 20 & ds_titanic$Age <= 40,],10))
matrix_ds_titanic

## PassengerId Survived Pclass Name Sex Age SpousesAboard ChildrenAboard


## 1 1 0 3 3 2 22 1 0
## 2 2 1 1 4 1 38 1 0
## 3 3 1 3 6 1 26 0 0
## 4 4 1 1 5 1 35 1 0
## 5 5 0 3 1 2 35 0 0
## NA NA NA NA NA NA NA NA NA
## 9 9 1 3 7 1 27 0 2
## 13 13 0 3 8 2 20 0 0
## 14 14 0 3 2 2 39 1 5
## NA.1 NA NA NA NA NA NA NA NA
## Ticket Fare Cabin Embarked WikiId Name_wiki Age_wiki Hometown Boarded
## 1 5 7.2500 1 2 691 3 22 2 2
## 2 7 71.2833 3 1 90 4 35 5 1

## 3 8 7.9250 1 2 865 6 26 3 2
## 4 1 53.1000 2 2 127 5 35 6 2
## 5 4 8.0500 1 2 627 1 35 1 2
## NA NA NA NA NA NA NA NA NA NA
## 9 3 11.1333 1 2 902 7 26 8 2
## 13 6 8.0500 1 2 1196 8 19 7 2
## 14 2 31.2750 1 2 632 2 39 4 2
## NA.1 NA NA NA NA NA NA NA NA NA
## Destination Lifeboat Body Class
## 1 3 1 1 3
## 2 2 4 1 1
## 3 1 2 1 3
## 4 4 5 1 1
## 5 1 1 1 3
## NA NA NA NA NA
## 9 5 3 1 3
## 13 1 1 1 3
## 14 6 1 1 3
## NA.1 NA NA NA NA
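Extract the rows where Age is between 20 and 40 (inclusive) and use head() with n = 10 to keep only the first 10 rows;
data.matrix() then converts them to a numeric matrix, replacing each character column with integer codes. The rows
labelled NA and NA.1 appear because Age is missing for those passengers, so the condition evaluates to NA. The aim is
to observe the survival rate of passengers in adulthood.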
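
If the all-NA rows are not wanted, one option is to wrap the condition in which(), which keeps only the positions where the condition is TRUE. This is a sketch on a toy data frame (toy, with_na, adults, and adult_matrix are illustrative names), not the code used in the report.

```r
# Toy data frame with one missing age (illustrative values).
toy <- data.frame(PassengerId = 1:6,
                  Age = c(22, 38, NA, 54, 35, 20))

# Logical indexing returns an all-NA row wherever the condition is NA.
with_na <- toy[toy$Age >= 20 & toy$Age <= 40, ]

# which() returns only the TRUE positions, so rows with a missing Age are dropped.
adults <- toy[which(toy$Age >= 20 & toy$Age <= 40), ]

# Convert the first rows (at most 10) to a numeric matrix, as in the report.
adult_matrix <- data.matrix(head(adults, 10))

print(nrow(with_na))
print(adult_matrix)
```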
Create a new data frame with 2 variables and 10 observations by taking the first 10 rows of the PassengerId and Age columns
new_ds = head(ds_titanic[,c("PassengerId","Age")],10)
new_ds

## PassengerId Age
## 1 1 22
## 2 2 38
## 3 3 26
## 4 4 35
## 5 5 35
## 6 6 NA
## 7 7 54
## 8 8 2
## 9 9 27
## 10 10 14
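
The data frame above is taken from the existing data set. If "from scratch" is meant literally, the same 10 observations could instead be typed directly into data.frame(); the sketch below copies the values from the output above, and the name scratch_df is an illustrative assumption.

```r
# Build the data frame directly from values instead of subsetting ds_titanic.
scratch_df <- data.frame(
  PassengerId = 1:10,
  Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)
)

str(scratch_df)
```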
Create a new numeric vector
vec = c(10,5,3,6,8,9,4,1,2,7)

Add the vector to the data set as a new column using cbind()


new_ds_titanic <- cbind(new_ds,vec)
new_ds_titanic

## PassengerId Age vec


## 1 1 22 10
## 2 2 38 5
## 3 3 26 3
## 4 4 35 6
## 5 5 35 8
## 6 6 NA 9
## 7 7 54 4
## 8 8 2 1

## 9 9 27 2
## 10 10 14 7
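
cbind() names the new column after the variable (vec). A sketch of two alternatives that give the column an explicit name follows; the column name Rank is hypothetical, and the length check is an added safeguard that is not part of the report.

```r
# Same values as in the report above.
new_ds <- data.frame(PassengerId = 1:10,
                     Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14))
vec <- c(10, 5, 3, 6, 8, 9, 4, 1, 2, 7)

# Guard against silent recycling: the vector needs one value per row.
stopifnot(length(vec) == nrow(new_ds))

# Option 1: name the column inside cbind() ("Rank" is a hypothetical name).
named_ds <- cbind(new_ds, Rank = vec)

# Option 2: assign the vector as a new column with $.
new_ds$Rank <- vec

print(named_ds)
```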
