Big Data - Lab 3
Third Section
Agenda
• Introduction to RStudio
• Working with Scripts
• Import and Export Data with R
• Data Cleaning
Introduction to RStudio
• RStudio is a free and open-source integrated development
environment (IDE) for R.
Example:
> setwd("C:/Users/Joach/Desktop/my_folder")   # set the working directory
> getwd()                                     # return the current working directory
[1] "C:/Users/Joach/Desktop/my_folder"
Import & Export Data from CSV file
> var1 <- 1:5
> var2 <- (1:5)/10
> var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
> df1 <- data.frame(var1, var2, var3)
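# Write to a csv file
# General form: write.csv(R_object, "FilePath/FileName.csv")
# row.names = FALSE keeps the row numbers out of the file
write.csv(df1, "dummyData.csv", row.names = FALSE)
# Read from a csv file
# General form: DataFrame <- read.csv("FilePath/FileName.csv")
df2 <- read.csv("dummyData.csv")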
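The round trip above can be sketched as follows. This is a minimal example that assumes a temporary file (created with tempfile(), which is not part of the original lab) so that nothing is written to the working directory. It also illustrates that write.csv stores the row names as an extra first column unless row.names = FALSE is set.

```r
# Rebuild the example data frame from the slide
var1 <- 1:5
var2 <- (1:5) / 10
var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
df1 <- data.frame(var1, var2, var3)

# Assumption: a temporary file stands in for a real path
csv_path <- tempfile(fileext = ".csv")

# Default export keeps row names, so the file gains an extra column
write.csv(df1, csv_path)
with_rownames <- read.csv(csv_path)
names(with_rownames)   # the first column is named "X"

# row.names = FALSE writes only the data columns
write.csv(df1, csv_path, row.names = FALSE)
df2 <- read.csv(csv_path)
names(df2)
```

Comparing names(with_rownames) with names(df2) shows the effect of the row.names argument.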
Import & Export Data from Excel file
install.packages("xlsx")
library(xlsx)
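Once the package is loaded, an export and import could look like the sketch below. It assumes the xlsx package and the Java runtime it depends on are installed; the temporary file path and the small data frame are illustrative and not part of the original lab.

```r
# Assumption: the xlsx package (and the Java runtime it needs) is available
if (requireNamespace("xlsx", quietly = TRUE)) {
  df1 <- data.frame(var1 = 1:5, var2 = (1:5) / 10)
  xlsx_path <- tempfile(fileext = ".xlsx")   # hypothetical output location

  # Export the data frame to a sheet, without row names
  xlsx::write.xlsx(df1, xlsx_path, sheetName = "Sheet1", row.names = FALSE)

  # Import the first sheet back into a data frame
  df2 <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)
  print(df2)
}
```

The requireNamespace() guard simply skips the example when the package cannot be loaded.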
Data quality problems (diagram labels):
• Inconsistent values
• Incorrect values
• Incomplete data
• Inconsistent data
What we will cover
• Understanding what your data is
• Understanding the structure of your data
• Tidying data
• Preparing data for analysis and visualization
What is your data?
• Every dataset has a STORY, so a data scientist must know what the data
means
• For example, datasets built into R:
• Titanic: Survival of passengers on the Titanic
• cars: Speed and Stopping Distances of Cars
• ToothGrowth: The Effect of Vitamin C on Tooth Growth
And so on.
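The datasets listed above ship with base R in the datasets package, so they can be inspected directly. A short sketch:

```r
# Built-in datasets need no import
str(cars)           # 50 observations of speed and stopping distance
head(ToothGrowth)   # columns: len, supp, dose
dim(Titanic)        # Titanic is a 4-dimensional table, not a data frame
dimnames(Titanic)   # Class, Sex, Age, Survived

# The help page describes the story behind a dataset
# ?ToothGrowth
```

Note that Titanic is stored as a contingency table, so as.data.frame(Titanic) may be a useful first step before treating it like the other two.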
Understanding the structure of your data
# Read data from a csv file and insert NA in empty cells
> bank_data <- read.csv("D:\\bank_load.csv", na.strings = "", header = TRUE)
# View its dimensions (rows, columns)
> dim(bank_data)
[1] 199 9
# Provide a useful summary of the dataset structure
> str(bank_data)
# Look at the column names
> names(bank_data)
[1] "id" "age" "gender"
[4] "region" "income"
[6] "married" "children" "car"
[9] "Loan_Approval"
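Since bank_load.csv is not included here, the same inspection steps can be tried on a small stand-in. The data frame below is hypothetical (its values are made up for illustration), and colSums(is.na(...)) is an extra check, not shown in the lab, that counts missing values per column.

```r
# Hypothetical sample standing in for bank_load.csv (values are made up)
sample_data <- data.frame(
  id      = 1:4,
  age     = c(34, -28, 45, NA),
  income  = c(17500, NA, -23000, 30100),
  married = c("YES", "N", "Y", "NO")
)

dim(sample_data)       # rows, then columns
str(sample_data)       # type and first values of each column
summary(sample_data)   # per-column summary; numeric columns report NA counts

# Count missing values in each column
missing_per_column <- colSums(is.na(sample_data))
missing_per_column
```

Negative values, NA entries, and mixed spellings such as "N" and "NO" are the kinds of problems the cleaning steps address.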
# To omit rows that contain missing (NA) values
> data <- na.omit(bank_data)
# Set a default value of 1000 where income is missing (NA)
> bank_data$income[is.na(bank_data$income)] <- 1000
# Set a default value of "INNER_CITY" where region is missing (NA)
> bank_data$region[is.na(bank_data$region)] <- "INNER_CITY"
# To remove negative signs from the age and income columns
# abs() flips negative values and, unlike indexed assignment, does not fail when NA values are present
> bank_data$age <- abs(bank_data$age)
> bank_data$income <- abs(bank_data$income)
# To remove inconsistency in the married column
> bank_data$married[bank_data$married=="N"] <- "NO"
> bank_data$married[bank_data$married=="Y"] <- "YES"
# To convert data from factor to string
> bank_data$gender <- as.character(bank_data$gender)
> str(bank_data)
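The cleaning steps above can be bundled into one function, sketched below. The helper name clean_bank_data and the sample rows (including the region values "TOWN" and "RURAL") are hypothetical, and abs() is used as an equivalent of multiplying negative values by -1. Columns are converted to character before new values are assigned, which avoids problems when a factor lacks the replacement level.

```r
# Hypothetical helper that bundles the cleaning steps from the lab
clean_bank_data <- function(df) {
  # Fill missing values with the defaults used in the lab
  df$income[is.na(df$income)] <- 1000
  df$region <- as.character(df$region)
  df$region[is.na(df$region)] <- "INNER_CITY"

  # Negative ages and incomes are treated as sign errors
  df$age <- abs(df$age)
  df$income <- abs(df$income)

  # Use one spelling for each marital status
  df$married <- as.character(df$married)
  df$married[df$married == "N"] <- "NO"
  df$married[df$married == "Y"] <- "YES"
  df
}

# Made-up rows that exercise every rule
raw <- data.frame(
  age     = c(34, -28, 45),
  region  = c("TOWN", NA, "RURAL"),
  income  = c(17500, NA, -23000),
  married = c("YES", "N", "Y"),
  stringsAsFactors = TRUE
)
cleaned <- clean_bank_data(raw)
cleaned
```

Keeping the rules in one function makes it easier to apply the same cleaning to a fresh copy of the data.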
At this stage the data is ready for analysis and visualization.
Thank You