Working with Directories
• Before writing a program in R, it is important to know the working directory, because R reads and writes files relative to it.
• getwd() returns the current working directory; it is called without any arguments.
• To change the directory, use setwd(path).
• This resets the current working directory to another location.
• list.files() lists the files in the working directory.
• dir() is equivalent to list.files().
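The directory functions above can be tried safely with a short sketch like the following. The folder name wd_demo and the file example.txt are hypothetical and are created under tempdir(), so nothing outside R's temporary area is touched.

```r
# Remember the starting directory so it can be restored afterwards.
old_dir <- getwd()

# Hypothetical scratch folder under R's temporary directory.
scratch <- file.path(tempdir(), "wd_demo")
dir.create(scratch, showWarnings = FALSE)

# Switch the working directory to the scratch folder.
setwd(scratch)

# Create a small file so there is something to list.
writeLines("hello", "example.txt")

files_here <- list.files()  # file names in the working directory
same_files <- dir()         # dir() returns the same listing
print(files_here)

# Restore the original working directory.
setwd(old_dir)
```

Restoring the original directory at the end keeps later relative paths in the same session from silently pointing at the scratch folder.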
Data Exploration in R
• Data exploration, also called exploratory data analysis (EDA), is a statistical approach for analyzing data sets in order to summarize their main characteristics, generally with the help of visual aids. EDA can be used to gather knowledge about the following aspects of the data:
• The main characteristics or features of the data.
• The variables and their relationships.
• The important variables that can be used in our problem.
• EDA is an iterative approach that includes:
• Generating questions about our data
• Searching for the answers by using visualization,
transformation, and modeling of our data.
• Using the lessons that we learn in order to refine our set of
questions or to generate a new set of questions.
• Exploratory Data Analysis in R
• In R, we are going to perform EDA under two broad classifications:
• Descriptive statistics, which include the mean, median, mode, inter-quartile range, and so on.
• Graphical methods, which include histograms, density estimation, box plots, and so on.
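The two classifications can be illustrated with a minimal sketch on the built-in mtcars data set. Base R has no function for the statistical mode, so the helper stat_mode below is a hypothetical illustration, not a standard function; it returns the smallest value when several values are tied.

```r
x <- mtcars$mpg  # miles per gallon from the built-in mtcars data

# Hypothetical helper: most frequent value of a numeric vector.
stat_mode <- function(v) {
  tab <- table(v)                       # counts per distinct value
  as.numeric(names(tab)[which.max(tab)])
}

# Descriptive statistics
desc <- c(mean = mean(x), median = median(x), iqr = IQR(x))
print(desc)
print(stat_mode(x))

# Graphical methods (base R graphics)
hist(x, main = "Histogram of mpg")
plot(density(x), main = "Density estimate of mpg")
boxplot(x, main = "Box plot of mpg")
```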
• summary()
• Reports the minimum, quartiles, median, mean, and maximum of each column
• str()
• Displays the internal structure of the data set
• View()
• Displays the data set in a separate spreadsheet-style viewer
• head()
• Displays the first 6 rows of the data
• tail()
• Displays the last 6 rows of the data
• ncol()
• Returns the number of columns in the data set
• nrow()
• Returns the number of rows in the data set
• edit()
• Opens the data set in an editor and returns the edited copy; the original changes only if the result is assigned back
• fix()
• Opens the same editor and saves the changes in the data set itself
• data()
• Lists the available data sets
• save.image()
• Writes an external representation of the R objects in the workspace to the specified file
• dim(iris) # dimensions
• names(iris) # the column names
• str(iris) # the structure
• attributes(iris) # the names, class, and row names
• iris[1:5, ] # the first 5 rows
• head(iris) # the first six rows
• tail(iris) # the last six rows
• idx <- sample(1:nrow(iris), 5) # 5 random row indices from the data set
• iris[1:10, "Sepal.Length"] # the first 10 values of Sepal.Length
• iris[idx, ] # the 5 randomly chosen rows
• summary(iris)
• quantile(iris$Sepal.Length) # quartiles of the distribution
• quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
• var(iris$Sepal.Length)
• plot(iris)
Commands for Data Exploration
1) Loading Example Data
2) Example 1: Print First Six Rows of Data Frame Using head() Function
3) Example 2: Return Column Names of Data Frame Using names()
Function
4) Example 3: Get Number of Rows & Columns of Data Frame Using
dim() Function
5) Example 4: Explore Structure of Data Frame Columns Using str()
Function
6) Example 5: Calculate Descriptive Statistics Using summary() Function
7) Example 6: Count NA Values by Column Using colSums() & is.na()
Functions
8) Example 7: Draw Pairs Plot of Data Frame Columns Using ggpairs()
Function of GGally Package
9) Example 8: Draw Boxplots of Multiple Columns Using ggplot2 Package
10) Example 9: Draw facet_wrap Histograms of Multiple Columns Using
ggplot2 Package
Loading Example Data
• First, we need to load some example data. In this
tutorial, we'll use the mtcars data set, which
contains information about Motor Trend car
road tests.
• We can import the mtcars data set to the
current R session using the data() function as
shown below:
• data(mtcars) # Import example data frame
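Examples 1 to 5 from the command list (head(), names(), dim(), str(), and summary()) can be sketched on mtcars as follows. The variable names first_rows, col_names, and dims are chosen here for illustration only.

```r
data(mtcars)  # load the example data frame

first_rows <- head(mtcars)   # Example 1: first six rows
print(first_rows)

col_names <- names(mtcars)   # Example 2: column names
print(col_names)

dims <- dim(mtcars)          # Example 3: number of rows and columns
print(dims)

str(mtcars)                  # Example 4: structure of each column

print(summary(mtcars))       # Example 5: descriptive statistics per column
```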
Count NA Values by Column Using
colSums() & is.na() Functions
• The following R programming syntax
demonstrates how to count the number of NA
values in each column of a data frame.
• To do this, we can apply
the colSums and is.na functions:
• colSums(is.na(mtcars)) # Count missing values
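Because mtcars itself contains no missing values, the count is easier to see on a copy with a few NA values inserted on purpose. The sketch below assumes a hypothetical copy named cars_na; the positions of the NA values are arbitrary.

```r
# Work on a copy so the original mtcars stays unchanged.
cars_na <- mtcars
cars_na$mpg[c(1, 5)] <- NA   # two missing values in mpg
cars_na$hp[3] <- NA          # one missing value in hp

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(cars_na))
print(na_counts)

# Total number of missing values in the whole data frame.
total_na <- sum(na_counts)
print(total_na)
```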
Draw Pairs Plot of Data Frame Columns Using ggpairs()
Function of GGally Package
• Until now, we have performed an analytical exploratory data analysis
based on numbers and RStudio console output.
• However, when it comes to data exploration, it is also important to
have a visual look at your data.
• The following R code demonstrates how to create a pairs plot.
• For this, we need the functions of the ggplot2 and GGally packages.
• By installing and loading GGally, the ggplot2 package is also loaded.
So it's enough to install and load GGally:
• install.packages("GGally") # Install GGally package
• library("GGally") # Load GGally package
• Next, we can apply the ggpairs function of the GGally package to our
data frame:
• ggpairs(mtcars) # Draw pairs plot
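If GGally is not available, a rough base R alternative is pairs() combined with cor(). This is a swapped-in technique, not the ggpairs() approach from the slide, and the choice of four columns is an assumption made to keep the plot readable.

```r
# A small subset of numeric columns (arbitrary choice for illustration).
cols <- c("mpg", "disp", "hp", "wt")
cars_small <- mtcars[, cols]

# Pairwise correlations, rounded for display.
cor_matrix <- round(cor(cars_small), 2)
print(cor_matrix)

# Base R scatterplot matrix of the same columns.
pairs(cars_small, main = "Scatterplot matrix of selected mtcars columns")
```

Unlike ggpairs(), pairs() does not print the correlations inside the panels, which is why they are computed separately here.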
Draw Boxplots of Multiple Columns
Using ggplot2 Package
• Boxplots are another popular way to visualize the columns of data
sets.
• To draw such a graph, we first have to manipulate our data using
the tidyr package. In order to use the functions of the tidyr package,
we first need to install and load tidyr:
• install.packages("tidyr") # Install tidyr
• library("tidyr") # Load tidyr
• Next, we can apply the pivot_longer function to reshape some of the
columns of our data from wide to long format:
• mtcars_long <- pivot_longer(mtcars, c("mpg", "disp", "hp", "qsec")) # Reshape data frame
• Finally, we can apply the ggplot and geom_boxplot functions to our
data to visualize each of the selected columns in a side-by-side boxplot
graphic:
• ggplot(mtcars_long, aes(x = value, fill = name)) + geom_boxplot() # Draw boxplots
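To see what "long format" means without installing anything, the same four columns can be reshaped with base R's stack(). This is an alternative sketch, not the tidyr call from the slide: the result has columns named values and ind rather than value and name, and the rows come out in a different order.

```r
cols <- c("mpg", "disp", "hp", "qsec")

# stack() puts all values in one column and the source column name in another.
long <- stack(mtcars[, cols])
print(head(long))
print(dim(long))

# One box per original column, drawn with base R graphics.
boxplot(values ~ ind, data = long,
        main = "Selected mtcars columns", xlab = "Column", ylab = "Value")
```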
Draw facet_wrap Histograms of
Multiple Columns Using ggplot2
Package
• Typically, we would also have a look at our
numerical columns in a histogram plot.
• In the following R syntax, I'm creating a histogram
for each of the reshaped columns. Furthermore, I'm using
the facet_wrap function to separate each column
into its own plotting panel:
• ggplot(mtcars_long, aes(x = value)) + geom_histogram() + facet_wrap(name ~ ., scales = "free") # Draw histograms
Importing Data in R Script
• Importing Data in R
• First, let's consider a data set that we can use
for the demonstration. We will use two versions
of a single data set, one in .csv form and another
in .txt form.
• Reading a Comma-Separated Values (CSV) File
• Method 1: Using the read.csv() Function to Read CSV
Files into R
• The call uses two arguments:
• file.choose(): opens a file-selection dialog so you can pick the CSV
file interactively; its result is passed as the file argument.
• header: indicates whether the first row of the file contains the
variable names. Use TRUE if it does, otherwise FALSE. (T and F also work,
but they are ordinary variables that can be overwritten, so TRUE and FALSE are safer.)
• # import and store the dataset in data1
• data1 <- read.csv(file.choose(), header = TRUE)
• # display the data
• data1
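file.choose() needs an interactive session, so the sketch below uses a temporary CSV file instead. The path csv_path and the three-row sample are assumptions made only so the example runs on its own; it also shows what happens when header is set to FALSE.

```r
# Write a small CSV file to a temporary location (hypothetical sample data).
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), csv_path, row.names = FALSE)

# header = TRUE: the first row supplies the variable names.
data1 <- read.csv(csv_path, header = TRUE)
print(data1)

# header = FALSE: the first row is treated as data, columns are named V1, V2, ...
raw <- read.csv(csv_path, header = FALSE)
print(dim(raw))
```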
• Using the read.table() Function
• This function lets you specify how the fields are
separated; in this case we pass sep = "," as an
argument. read.table() expects a single-character separator, so a comma followed by a space does not work.
• Example:
• # import and store the dataset in data2
• data2 <- read.table(file.choose(), header = TRUE, sep = ",")
• # display data
• data2
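The sep argument is what lets read.table() handle other delimiters, for example a tab-separated .txt file. The following sketch writes a hypothetical four-row sample of iris to a temporary file and reads it back; the file path and sample size are assumptions for illustration.

```r
# Write a tab-separated text file to a temporary location.
txt_path <- tempfile(fileext = ".txt")
write.table(head(iris, 4), txt_path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back, telling read.table() that fields are separated by tabs.
data2 <- read.table(txt_path, header = TRUE, sep = "\t")
print(data2)
str(data2)
```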
• Understanding Data Sets
• A data set is usually a rectangular array of data with
rows representing observations and columns
representing variables. The table below provides an example of a
hypothetical patient data set.
• A patient data set
PatientID  AdmDate     Age  Diabetes  Status
1          10/15/2009  25   type1     poor
2          15/12/2007  32   type2     improved
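The patient table can be entered in R as a data frame, with one vector per column. This is a minimal sketch: the admission dates are kept as plain character strings because the two rows in the slide use different date layouts, so no date parsing is attempted.

```r
# One vector per variable; each position is one patient (observation).
patients <- data.frame(
  PatientID = c(1, 2),
  AdmDate   = c("10/15/2009", "15/12/2007"),  # kept as text, not parsed
  Age       = c(25, 32),
  Diabetes  = c("type1", "type2"),
  Status    = c("poor", "improved"),
  stringsAsFactors = FALSE
)

print(patients)
str(patients)   # rows are observations, columns are variables
```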
