UNIT II - DA USING R
In R, "aggregating" and "grouping" refer to processes that summarize and transform data
based on certain criteria. The aggregate function and the dplyr package are commonly used
for these tasks.
aggregate Function
The aggregate function in R is used to compute summary statistics of data, such as sums,
means, and more, for subsets of the data grouped by one or more variables.
Syntax
aggregate(x, by, FUN)
The iris dataset is a built-in dataset in R that contains measurements of different species of
iris flowers.
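A minimal example, using the built-in iris data, that produces the output below (the object name agg_mean is illustrative):

```r
# Load the built-in iris dataset
data(iris)

# Mean Sepal.Length for each Species using base R's aggregate
agg_mean <- aggregate(iris$Sepal.Length,
                      by = list(Species = iris$Species),
                      FUN = mean)
print(agg_mean)
```

Because the target column is passed without a name, the summary column in the result is simply labelled x.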
Output
Species x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The dplyr package provides more readable and powerful functions for grouping and
aggregation.
Syntax
library(dplyr)
data %>%
  group_by(grouping_variable) %>%
  summarise(
    summary_variable1 = FUN1(target_variable),
    summary_variable2 = FUN2(target_variable)
  )
NOTE:
The %>% operator is the pipe operator from the magrittr package (re-exported by dplyr). It passes the object on its left as the first argument to the function on its right, so the dataset flows through each step of the chain.
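The output below can be produced with a chain like the following (a sketch; the summary column names are chosen to match the output):

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length),
    Mean_Sepal_Width  = mean(Sepal.Width)
  )
```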
Output
# A tibble: 3 × 3
Species Mean_Sepal_Length Mean_Sepal_Width
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
Explanation
These examples demonstrate how to perform aggregation and grouping in R using both base
R functions and the dplyr package. The dplyr package is often preferred for its readability
and ease of use.
FUN:
In the context of the aggregate function in R, FUN stands for "function." It specifies the function to be applied to the grouped subsets of data to calculate the summary statistic.
When you use the aggregate function, you're typically interested in summarizing the data in some way. The FUN argument allows you to specify what kind of summary you want, such as the mean, sum, maximum, minimum, etc. Without specifying FUN, the aggregate function wouldn't know how to summarize the data.
Syntax of aggregate
aggregate(x, by, FUN)
Here’s an example using the iris dataset, where we calculate the mean of Sepal.Length for
each species:
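The call (a sketch; agg_mean is an illustrative name):

```r
# Mean Sepal.Length grouped by Species
agg_mean <- aggregate(iris$Sepal.Length,
                      by = list(Species = iris$Species),
                      FUN = mean)
print(agg_mean)
```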
Output
Species x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
Explanation
You can use various functions with FUN to get different types of summaries. Here are some
examples:
Sum: FUN = sum
Maximum: FUN = max
Minimum: FUN = min
Standard Deviation: FUN = sd
By changing the function specified in FUN, you can easily compute different summary
statistics for your grouped data.
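For instance (the object names agg_sum, agg_max, and so on are illustrative):

```r
# Sum of Sepal.Length per species
agg_sum <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = sum)

# Maximum per species
agg_max <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = max)
print(agg_max)

# Minimum per species
agg_min <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = min)

# Standard deviation per species
agg_sd <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = sd)
```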
Step-by-Step Analysis
The iris dataset is a built-in dataset in R. It contains 150 observations of iris flowers, with
measurements for sepal length, sepal width, petal length, and petal width, along with the
species of the iris flower.
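Loading the data and inspecting the first six rows produces the output below:

```r
# Load the built-in dataset and preview it
data(iris)
head(iris)
```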
Output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Before performing any aggregation, it’s important to understand the data by exploring its
basic properties.
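A quick way to do this is with summary, which gives the output below:

```r
# Five-number summary plus mean for numeric columns,
# and counts for the factor column Species
summary(iris)
```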
Output
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
This output gives a quick overview of the dataset, including the minimum, first quartile,
median, mean, third quartile, and maximum values for each numerical variable, as well as the
counts for each species.
We can use the aggregate function to calculate summary statistics for different groups
within the data. Here, we'll calculate the mean sepal length and sepal width for each species.
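Using the formula interface of aggregate, one call summarizes every numeric column at once (agg_all is an illustrative name):

```r
# Mean of every numeric column, grouped by Species
agg_all <- aggregate(. ~ Species, data = iris, FUN = mean, na.rm = TRUE)
print(agg_all)
```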
Output
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Explanation
aggregate(. ~ Species, data = iris, FUN = mean, na.rm = TRUE): This line
uses the aggregate function to calculate the mean of all numerical variables
(Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) grouped by
Species. The na.rm = TRUE argument ensures that any missing values are ignored in
the calculation.
The dplyr package provides more readable and flexible functions for data manipulation and
aggregation. We will achieve the same aggregation using dplyr.
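The equivalent dplyr chain (a sketch; the Mean_* column names are chosen to match the output):

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length),
    Mean_Sepal_Width  = mean(Sepal.Width),
    Mean_Petal_Length = mean(Petal.Length),
    Mean_Petal_Width  = mean(Petal.Width)
  )
```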
Output
# A tibble: 3 × 5
  Species    Mean_Sepal_Length Mean_Sepal_Width Mean_Petal_Length Mean_Petal_Width
  <fct>                  <dbl>            <dbl>             <dbl>            <dbl>
1 setosa                  5.01             3.43              1.46            0.246
2 versicolor              5.94             2.77              4.26            1.33
3 virginica               6.59             2.97              5.55            2.03
Explanation
group_by(Species) splits the rows into one group per species, and summarise computes the mean of each measurement within every group, returning one row per species.
Conclusion
We have performed a simple analysis of the iris dataset in R, including loading the data,
conducting basic EDA, and summarizing the data using both the aggregate function and the
dplyr package. This analysis helps us understand the average measurements of different iris
species, providing insights into the dataset's structure.
Let's walk through a simple data analysis example using the iris dataset in R. The iris
dataset contains measurements of sepal length, sepal width, petal length, and petal width for
three species of iris flowers.
1. Data Collection
2. Data Cleaning
If there are missing values, you could handle them like this:
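One possible approach (a sketch; iris contains no missing values, so this is purely illustrative):

```r
# Count missing values in each column
colSums(is.na(iris))

# Option 1: drop rows that contain any NA
iris_clean <- na.omit(iris)

# Option 2: replace NAs in a numeric column with the column mean
iris$Sepal.Length[is.na(iris$Sepal.Length)] <-
  mean(iris$Sepal.Length, na.rm = TRUE)
```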
3. Data Exploration
# Summary statistics
summary(iris)
4. Data Transformation
Group the data by species and calculate the mean of each measurement.
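For example, with dplyr (a sketch; across and where require dplyr 1.0 or later):

```r
library(dplyr)

# Mean of every numeric column, one row per species
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))
```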
5. Data Modeling
For this simple example, we won't apply a complex model, but we could proceed with
various statistical or machine learning models to analyze relationships and patterns further.
6. Data Interpretation
7. Reporting
Conclusion
Analysis involves breaking down data into meaningful patterns and insights through
systematic steps. By following these steps, you can gain a deeper understanding of your data
and make informed decisions based on your findings.
NOTE:
The attributes such as minimum, first quartile, median, mean, third quartile, and maximum
values, along with counts for each category, are fundamental descriptive statistics. They
provide essential insights into the distribution and central tendency of numerical data. Here
are the practical uses of each attribute:
1. Minimum Value
o Definition: The smallest value in the dataset.
o Use: Identifies the lower bound of the data range. It's useful for understanding
the lowest extreme and for detecting outliers or unusual values.
2. First Quartile (Q1)
o Definition: The value below which 25% of the data falls.
o Use: Helps to understand the lower 25% of the data distribution. It's useful in
identifying the spread of the lower portion of the dataset and is also a
component of the interquartile range (IQR), which measures statistical
dispersion.
3. Median (Q2)
o Definition: The middle value of the dataset when sorted in ascending order.
o Use: Represents the central tendency of the data, less affected by outliers and
skewed data compared to the mean. It's used to understand the typical value in
the dataset.
4. Mean
o Definition: The average of all data points.
o Use: Represents the central tendency but can be influenced by outliers. It's
useful for calculating the expected value and in various statistical analyses.
5. Third Quartile (Q3)
o Definition: The value below which 75% of the data falls.
o Use: Helps to understand the upper 25% of the data distribution. Like Q1, it's
a component of the interquartile range (IQR).
6. Maximum Value
o Definition: The largest value in the dataset.
o Use: Identifies the upper bound of the data range. It's useful for understanding
the highest extreme and for detecting outliers or unusual values.
7. Counts for Each Category
o Definition: The number of occurrences of each category (e.g., species in the
iris dataset).
o Use: Useful for understanding the distribution of categorical data, comparing
the frequency of different categories, and ensuring that each category is
adequately represented in the analysis.
Practical Examples
Let's consider practical scenarios where these descriptive statistics are useful:
1. Business Analytics
Sales Analysis: Understanding the minimum, maximum, and quartiles of daily sales
can help a business manage inventory, identify sales trends, and detect anomalies.
Customer Feedback: Median customer satisfaction scores provide a robust measure
of central tendency, helping businesses understand typical customer sentiment without
being skewed by outliers.
2. Healthcare
Patient Data: Analyzing the mean, median, and quartiles of patient wait times can
help in resource allocation and improving service efficiency.
Blood Pressure Levels: Understanding the distribution of blood pressure readings
(minimum, Q1, median, Q3, maximum) can aid in identifying at-risk patients and
tailoring medical interventions.
3. Education
Student Scores: Teachers can use the quartiles and median scores to understand the
distribution of student performance and identify students who may need additional
support.
Class Participation: Counts of class participation by different groups (e.g., gender,
grade level) help in assessing engagement and inclusivity.
4. Real Estate
Property Prices: Real estate agents can use the descriptive statistics of property
prices in a neighborhood to advise clients on buying and selling decisions.
Rental Rates: Understanding the distribution of rental rates helps in setting
competitive prices and identifying market trends.
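Returning to the iris dataset, the output below can be generated with a dplyr chain such as the following (a sketch; the column names are chosen to match the output):

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(
    Min_Sepal_Length    = min(Sepal.Length),
    Q1_Sepal_Length     = quantile(Sepal.Length, 0.25),
    Median_Sepal_Length = median(Sepal.Length),
    Mean_Sepal_Length   = mean(Sepal.Length),
    Q3_Sepal_Length     = quantile(Sepal.Length, 0.75),
    Max_Sepal_Length    = max(Sepal.Length),
    Count               = n()
  )
```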
Output
# A tibble: 3 × 8
  Species    Min_Sepal_Length Q1_Sepal_Length Median_Sepal_Length Mean_Sepal_Length Q3_Sepal_Length Max_Sepal_Length Count
  <fct>                 <dbl>           <dbl>               <dbl>             <dbl>           <dbl>            <dbl> <int>
1 setosa                  4.3             4.8                 5.0              5.01             5.2              5.8    50
2 versicolor              4.9             5.6                 5.9              5.94             6.3              7.0    50
3 virginica               4.9             6.2                 6.5              6.59             6.9              7.9    50
Methods for reading Data
Reading data refers to the process of importing or loading data from various sources
into a programming environment or software for analysis. In R, there are multiple methods
for reading data depending on the source and format of the data. Below are the methods,
along with their meanings and definitions:
readr::read_csv()
library(readr)
data <- read_csv("path/to/your/file.csv")
readxl::read_excel()
library(readxl)
data <- read_excel("path/to/your/file.xlsx", sheet = "Sheet1")
readr::read_table()
library(readr)
data <- read_table("path/to/your/file.txt")
jsonlite::fromJSON()
library(jsonlite)
data <- fromJSON("path/to/your/file.json")
DBI::dbGetQuery()
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "path/to/your/database.sqlite")
data <- dbGetQuery(con, "SELECT * FROM your_table")
dbDisconnect(con)
httr::GET()
library(httr)
response <- GET("https://api.example.com/data")
data <- content(response, "parsed")
haven::read_sas()
library(haven)
data_sas <- read_sas("path/to/your/file.sas7bdat")
Summary
Each of these methods allows you to import data into R from different file formats and
sources. Understanding these methods and their specific functions helps you efficiently load
and manage data for analysis.
R can connect to several types of databases, such as SQLite, MySQL, PostgreSQL, SQL
Server, and Oracle, using the DBI package along with database-specific packages like
RSQLite, RMySQL, RPostgres, odbc, etc.
Once the data is loaded into R, you can perform various data manipulation, analysis, and
visualization tasks using packages such as dplyr, ggplot2, and others.
R can be integrated with various BI systems for enhanced reporting and dashboarding
capabilities. This integration allows for complex analytics within BI platforms.
Power BI supports R scripts for data transformation and visualization. Here's an example of
using R in Power BI:
# 'dataset' is the data frame Power BI passes to the R script
library(ggplot2)

# Create a plot
ggplot(dataset, aes(x = category_column, y = numeric_column)) +
  geom_point() +
  labs(title = "Scatter Plot", x = "Category", y = "Value")
Tableau supports R integration through the Rserve package. Here’s how to set it up:
1. Start Rserve in R:
install.packages("Rserve")
library(Rserve)
Rserve()
2. Connect Tableau to R:
o In Tableau, go to "Help" > "Settings and Performance" > "Manage External Service
Connection".
o Choose "Rserve" and enter the server details.
3. Use R Scripts in Tableau:
o Create calculated fields using R scripts.
o Example: SCRIPT_REAL("mean(.arg1)", SUM([numeric_column]))
Example Workflow
Combining these steps into a workflow, you can connect to a database, perform data
manipulation and visualization in R, and then integrate the results into a BI system for
reporting.
1. Connect to Database:
library(DBI)
library(RMySQL)
2. Data Manipulation:
library(dplyr)
3. Data Visualization:
library(ggplot2)
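The three steps above can be sketched as a single script (a sketch only; the connection details, table name your_table, and the column names category_column and numeric_column are placeholders):

```r
library(DBI)
library(RMySQL)
library(dplyr)
library(ggplot2)

# 1. Connect to the database and pull a table
con  <- dbConnect(RMySQL::MySQL(), dbname = "your_db", host = "localhost",
                  user = "user", password = "password")
data <- dbGetQuery(con, "SELECT * FROM your_table")
dbDisconnect(con)

# 2. Summarise with dplyr
summary_data <- data %>%
  group_by(category_column) %>%
  summarise(mean_value = mean(numeric_column, na.rm = TRUE))

# 3. Visualise with ggplot2; the chart or the summary table
#    can then be surfaced in the BI tool
ggplot(summary_data, aes(x = category_column, y = mean_value)) +
  geom_col() +
  labs(title = "Mean Value by Category")
```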
Conclusion
Using R with databases and business intelligence systems enhances the ability to
perform advanced analytics and create insightful visualizations.
This integration allows for efficient data management, complex statistical analysis,
and seamless reporting, making it a powerful combination for data-driven decision-making.