Big Data - Lab 3

The document discusses RStudio, an integrated development environment for R. It covers working with scripts in RStudio and importing and exporting data from CSV and Excel files. It also discusses data cleaning techniques for inconsistent, incomplete, incorrect, and duplicate data.
Big Data

Third Section
Agenda
• Introduction to RStudio
• Working with Scripts
• Import and Export Data with R
• Data Cleaning
Introduction to RStudio
RStudio
• RStudio is a free and open-source integrated development
environment (IDE) for R.

• Download Link: RStudio


Working With Scripts
Scripts
• A script is a series of commands that you can execute all at once, which saves a lot of time.
• It is just a plain text file with R commands in it.
• It is a good way to keep track of what you're doing.
• If you have a long analysis that you want to be able to recreate later, it is a good
idea to type it into a script.
Example: Calculating Avg, Max & Min Values
for a Given Object

1. Create a script file (File -> New File -> R Script)


2. Write your code in the Script window
3. Select the commands you want to run

4. Click Run
5. The output is printed in the console
6. Save your work in the desired location (e.g., Numbers.R)
7. You can run the script later using the source command:
source('Path')
> source('~/Desktop/Numbers.R')
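The script written in step 2 is not shown in the text, so the following is only a minimal sketch of what Numbers.R might contain. The vector name and its values are made up for illustration.

```r
# Numbers.R: hypothetical contents for the example above
numbers <- c(12, 7, 3, 25, 8)   # made-up sample values

avg_value <- mean(numbers)   # average
max_value <- max(numbers)    # largest value
min_value <- min(numbers)    # smallest value

# print() is used explicitly because source() does not auto-print
# the value of each expression the way the console does
print(avg_value)
print(max_value)
print(min_value)
```

If the explicit print() calls are left out, running the file with source('~/Desktop/Numbers.R', echo = TRUE) also shows each command together with its result.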
Import & Export Data
Prepare Working Directory
> getwd() # Returns the file path of the current working directory

> setwd("filepath") # Sets a new working directory

Example:
> setwd("C:/Users/Joach/Desktop/my_folder")
> getwd() # Returns the working directory currently in use
[1] "C:/Users/Joach/Desktop/my_folder"
Import & Export Data from CSV file
> var1 <- 1:5
> var2 <- (1:5)/10
> var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
> df1 <- data.frame(var1, var2, var3)
# Write to a CSV file
# Syntax: write.csv(R_object, "FilePath/FileName.csv")
write.csv(df1, "dummmyData.csv")
# Read from a CSV file
# Syntax: DataFrame <- read.csv("FilePath/FileName.csv")
df2 <- read.csv("dummmyData.csv")
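One detail worth checking in a CSV round trip is what happens to row names. The sketch below rebuilds the same small data frame and writes it to a temporary file (the csv_path name is an assumption made for this example, not part of the lab), first with the defaults and then with row.names = FALSE.

```r
# Small example data frame, mirroring df1 above
df1 <- data.frame(
  var1 = 1:5,
  var2 = (1:5) / 10,
  var3 = c("R", "and", "Data Mining", "Examples", "Case Studies")
)

# Hypothetical temporary file, so nothing is written to the working directory
csv_path <- tempfile(fileext = ".csv")

# By default, write.csv() also writes the row names as an unnamed first column
write.csv(df1, csv_path)
with_row_names <- read.csv(csv_path)
names(with_row_names)   # the unnamed column comes back with the name "X"

# row.names = FALSE writes only the original columns
write.csv(df1, csv_path, row.names = FALSE)
df2 <- read.csv(csv_path)
names(df2)
```

Whether to keep the row names depends on the data. For a data frame with default numeric row names, dropping them usually gives a cleaner file.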
Import & Export Data from Excel file
install.packages("xlsx")
library(xlsx)

# Write to an Excel file
# Syntax: write.xlsx(R_object, "FilePath/FileName.xlsx", sheetName = "SheetName")
write.xlsx(df2, "dummmyData.xlsx", sheetName = "sheet1")
# Read from an Excel file
# Syntax: DataFrame <- read.xlsx("FilePath/FileName.xlsx", sheetName = "SheetName")
df3 <- read.xlsx("dummmyData.xlsx", sheetName = "sheet1")
Data Cleaning
Data Cleaning
• It is the process of correcting or removing data in a dataset that is incorrect, incomplete,
inconsistent, duplicated, or improperly formatted
• It deals with detecting and removing errors and inconsistencies in data in
order to improve data quality
• It is sometimes called data scrubbing or data cleansing
• It is often the most time-consuming part of the big data cycle
Problems with Data
• Inconsistent: two data items in the dataset contradict each other
e.g. Age = 20, year_of_birth = 1960
• Incomplete: data contains missing values
• Incorrect: data contains wrong or unreasonable values
e.g. Age = 0 OR Income = -3400
• Duplicated: data is redundant
e.g. some rows appear twice
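Before fixing anything, it can help to flag which rows show each of the four problems. The sketch below uses a made-up data frame. The column names, the values, and the reference_year are all assumptions chosen for illustration.

```r
# Hypothetical toy data; column names and values are made up for illustration
people <- data.frame(
  id            = c(1, 2, 3, 4, 4),
  age           = c(20, 35, NA, 41, 41),
  year_of_birth = c(1960, 1989, 1990, 1983, 1983),
  income        = c(2500, -3400, 4100, 3900, 3900)
)

reference_year <- 2024   # assumed year in which the ages were recorded

# Inconsistent: age disagrees with year_of_birth by more than one year
inconsistent <- !is.na(people$age) &
  abs((reference_year - people$year_of_birth) - people$age) > 1

# Incomplete: the row has at least one missing value
incomplete <- !complete.cases(people)

# Incorrect: unreasonable values such as a negative income
incorrect <- !is.na(people$income) & people$income < 0

# Duplicated: the row repeats an earlier row exactly
duplicate <- duplicated(people)

# One flag column per problem type
data.frame(id = people$id, inconsistent, incomplete, incorrect, duplicate)
```

Each flag is a logical vector, so sum(incomplete) counts the affected rows and people[incorrect, ] shows them.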

HOW TO SOLVE THESE PROBLEMS ???


Data Set Example
[Figure: sample dataset with callouts marking inconsistent values, incorrect values, incomplete data, and inconsistent data]
What we will cover
• Understanding what your data is
• Understanding the structure of your data
• Tidying data
• Preparing data for analysis and visualization
What is your data?
• Every dataset has a STORY, so a data scientist must know what
the data means
• For example, R ships with built-in datasets such as:
• Titanic: Survival of passengers on the Titanic
• cars: Speed and Stopping Distances of Cars
• ToothGrowth: The Effect of Vitamin C on Tooth Growth
And so on...
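The built-in datasets named above can be inspected directly from the console. The sketch below is one possible way to take a first look; it uses only base R and the datasets package that is loaded by default.

```r
# cars: a data frame with speed and stopping distance
str(cars)
head(cars)

# ToothGrowth: a data frame with tooth length, supplement type, and dose
names(ToothGrowth)

# Titanic is a 4-dimensional table rather than a data frame
class(Titanic)
names(dimnames(Titanic))

# The help page describes the story behind a dataset
# ?cars
```

Running data() with no arguments lists all datasets available in the loaded packages.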
Understanding the structure of your data
# Read data from a CSV file and insert NA in empty cells
> bank_data <- read.csv("D:\\bank_load.csv", na.strings = "", header = TRUE)
# View its dimensions
> dim(bank_data)
[1] 199 9
# Show a useful summary of the dataset structure
> str(bank_data)
# Look at the column names
> names(bank_data)
[1] "id" "age" "gender"
[4] "region" "income" "married"
[7] "children" "car" "Loan_Approval"
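The bank file itself is not included with the text, so the sketch below uses a small made-up stand-in (toy_bank, with invented values) to show a few more base R calls that help when inspecting structure, in particular for counting missing values.

```r
# Hypothetical stand-in for bank_data; values are made up
toy_bank <- data.frame(
  id      = 1:4,
  age     = c(48, -40, NA, 23),
  income  = c(17546, NA, 16575, -20375),
  married = c("NO", "Y", "YES", NA)
)

str(toy_bank)       # column types and first values
summary(toy_bank)   # the minimum values reveal negative ages and incomes

# Number of missing values per column
na_per_column <- colSums(is.na(toy_bank))
na_per_column

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy_bank))
rows_with_na
```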
# To omit rows that contain missing (NA) values
> data <- na.omit(bank_data)
# Set a default value of 1000 where income is NA
> bank_data$income[is.na(bank_data$income)] <- 1000
# Set a default value of "INNER_CITY" where region is NA
> bank_data$region[is.na(bank_data$region)] <- "INNER_CITY"
# To remove negative signs from the age and income columns (abs() also works when NA values remain)
> bank_data$age <- abs(bank_data$age)
> bank_data$income <- abs(bank_data$income)
# To remove inconsistency in the married column
> bank_data$married[bank_data$married == "N"] <- "NO"
> bank_data$married[bank_data$married == "Y"] <- "YES"
# To convert data from factor to string
> bank_data$gender <- as.character(bank_data$gender)
> str(bank_data)
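The cleaning steps above can be tried end to end on a small made-up data frame. The sketch below is not the lab's dataset: toy_bank and its values are assumptions. It also adds a step for duplicated rows, which the steps above do not cover.

```r
# Hypothetical toy data standing in for bank_data; values are made up
toy_bank <- data.frame(
  id      = c(1, 2, 3, 3, 4),
  age     = c(48, -40, 51, 51, NA),
  region  = c("TOWN", NA, "RURAL", "RURAL", "INNER_CITY"),
  income  = c(17546, 30085, NA, NA, -20375),
  married = c("NO", "Y", "YES", "YES", "N"),
  stringsAsFactors = FALSE   # keep text columns as character, not factor
)

# Duplicated: drop rows that repeat an earlier row exactly
toy_bank <- toy_bank[!duplicated(toy_bank), ]

# Incomplete: fill missing values with the defaults used above
toy_bank$income[is.na(toy_bank$income)] <- 1000
toy_bank$region[is.na(toy_bank$region)] <- "INNER_CITY"

# Incorrect: flip negative signs; abs() leaves NA values as NA
toy_bank$age    <- abs(toy_bank$age)
toy_bank$income <- abs(toy_bank$income)

# Inconsistent: map the short codes onto the long ones
toy_bank$married[toy_bank$married == "N"] <- "NO"
toy_bank$married[toy_bank$married == "Y"] <- "YES"

toy_bank
```

Duplicates are removed first here so that the later steps work on each record only once. The remaining NA in age is left in place, since no default for age is given above.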
At this stage, the data is ready for
analysis and visualization
Thank You
