
Semester: V    Course Code: 21UCC53CC10    Title of the Course: DATA ANALYSIS USING R

Unit – II (15 Hours)


Aggregating and group processing of variables - Simple analysis using R - Methods for reading
data - Using R with databases and Business Intelligence systems.

In this unit, we'll explore:

1. Aggregating and Group Processing: Techniques for summarizing data by groups
   using functions like aggregate() and dplyr.
2. Simple Analysis: Basic data analysis methods in R, including descriptive statistics
and visualization with ggplot2.
3. Reading Data: Various methods to import data from different sources, including
CSV, Excel, and databases.
4. Using R with Databases: Connecting to and querying databases with R using
packages like DBI.
5. Business Intelligence Integration: Incorporating R with BI tools such as Power BI
and Tableau for advanced reporting and visualization.

In R, "aggregating" and "grouping" refer to processes that summarize and transform data
based on certain criteria. The aggregate function and the dplyr package are commonly used
for these tasks.

Feature             aggregate                          dplyr
Type                Base R function                    Part of the tidyverse
Syntax              aggregate(x, by, FUN, ...)         data %>% group_by(var) %>% summarise(...)
Readability         Less readable, more verbose        More readable, uses the pipe operator %>%
Functionality       Primarily for aggregation          Comprehensive data manipulation
Performance         Can be slower for large datasets   Generally faster and more efficient
Integration         Standalone, less integrated        Integrates well with other tidyverse packages
Grouping            Specified in the by argument       Specified in group_by()
Summarization       Uses FUN to summarize              Uses summarise() with various functions
Pipeline support    No native support                  Native support with %>%
Complex operations  Limited to simple aggregation      Supports complex operations and transformations
Default output      Data frame                         Data frame (tibble)
Ease of use         Requires more setup                Intuitive and user-friendly

aggregate Function

The aggregate function in R is used to compute summary statistics of data, such as sums,
means, and more, for subsets of the data grouped by one or more variables.

Syntax

aggregate(x, by, FUN)

- x: A data frame or a numeric matrix.
- by: A list of grouping elements.
- FUN: The function to apply to the grouped data.

Example with iris Dataset

The iris dataset is a built-in dataset in R that contains measurements of different species of
iris flowers.

# Load the iris dataset
data(iris)

# Aggregate the data: calculate the mean of Sepal.Length for each Species
agg_result <- aggregate(iris$Sepal.Length,
                        by = list(Species = iris$Species),
                        FUN = mean)

# Print the result
print(agg_result)

Output
Species x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588

Using dplyr for Grouping and Aggregation

The dplyr package provides more readable and powerful functions for grouping and
aggregation.

Syntax

library(dplyr)

data %>%
group_by(grouping_variable) %>%
summarise(
summary_variable1 = FUN1(target_variable),
summary_variable2 = FUN2(target_variable)
)

Example with iris Dataset



# Load the dplyr package
library(dplyr)

# Group by Species and calculate the mean of Sepal.Length and Sepal.Width
agg_df <- iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length),
    Mean_Sepal_Width = mean(Sepal.Width)
  )

# Print the result
print(agg_df)

NOTE:

The %>% operator is the pipe operator from the magrittr package (re-exported by dplyr). It
passes the result of its left-hand side as the first argument to the function on its right;
here it passes the iris dataset into group_by().

Output
# A tibble: 3 × 3
Species Mean_Sepal_Length Mean_Sepal_Width
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97

Explanation

1. Loading Data: The iris dataset is loaded.


2. Using aggregate:
o aggregate(iris$Sepal.Length, by = list(Species = iris$Species),
FUN = mean): This computes the mean of Sepal.Length for each Species in
the iris dataset.
3. Using dplyr:
o group_by(Species): Groups the data by Species.
o summarise(Mean_Sepal_Length = mean(Sepal.Length),
Mean_Sepal_Width = mean(Sepal.Width)): Calculates the mean of
Sepal.Length and Sepal.Width for each group of Species.

These examples demonstrate how to perform aggregation and grouping in R using both base
R functions and the dplyr package. The dplyr package is often preferred for its readability
and ease of use.
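As a side note on the pipe itself: a piped expression is equivalent to the corresponding nested call. A minimal sketch using the built-in iris data:

```r
library(dplyr)

# The pipe passes its left-hand side as the first argument of the
# function on its right, so these two expressions are equivalent:
piped  <- iris %>% head(3)
nested <- head(iris, 3)

identical(piped, nested)  # TRUE
```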

In the context of the aggregate function in R, FUN stands for "function." It specifies the
function to be applied to the grouped subsets of data to calculate the summary statistic.

FUN:

When you use the aggregate function, you're typically interested in summarizing the data in
some way. The FUN argument allows you to specify what kind of summary you want, such
as the mean, sum, maximum, minimum, etc. Without specifying FUN, the aggregate
function wouldn't know how to summarize the data.

Syntax of aggregate
aggregate(x, by, FUN)

- x: The data frame or numeric matrix to be summarized.
- by: A list of grouping elements.
- FUN: The function to apply to each subset of the data.

Example with iris Dataset

Here’s an example using the iris dataset, where we calculate the mean of Sepal.Length for
each species:

# Load the iris dataset
data(iris)

# Aggregate the data: calculate the mean of Sepal.Length for each Species
agg_result <- aggregate(iris$Sepal.Length,
                        by = list(Species = iris$Species),
                        FUN = mean)

# Print the result
print(agg_result)

Output
Species x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588

Explanation

1. Data: iris$Sepal.Length is the numeric data we want to summarize.


2. Grouping: by = list(Species = iris$Species) groups the data by the Species
column.
3. Function: FUN = mean specifies that we want to calculate the mean of Sepal.Length
for each species.

Different Functions with FUN

You can use various functions with FUN to get different types of summaries. Here are some
examples:

- Sum:

agg_sum <- aggregate(iris$Sepal.Length,
                     by = list(Species = iris$Species),
                     FUN = sum)
print(agg_sum)

- Maximum:

agg_max <- aggregate(iris$Sepal.Length,
                     by = list(Species = iris$Species),
                     FUN = max)
print(agg_max)

- Minimum:

agg_min <- aggregate(iris$Sepal.Length,
                     by = list(Species = iris$Species),
                     FUN = min)
print(agg_min)

- Standard Deviation:

agg_sd <- aggregate(iris$Sepal.Length,
                    by = list(Species = iris$Species),
                    FUN = sd)
print(agg_sd)

By changing the function specified in FUN, you can easily compute different summary
statistics for your grouped data.
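FUN is not limited to built-in functions; you can also supply your own. A small sketch (the range_span helper below is an illustrative name, not part of base R):

```r
data(iris)

# Custom summary function: the spread (max - min) of Sepal.Length per species
range_span <- function(x) max(x) - min(x)

agg_range <- aggregate(iris$Sepal.Length,
                       by = list(Species = iris$Species),
                       FUN = range_span)
print(agg_range)
```

From the minimum and maximum sepal lengths reported elsewhere in this unit (4.3 to 5.8 for setosa, 4.9 to 7.0 for versicolor, 4.9 to 7.9 for virginica), the spans come out to 1.5, 2.1, and 3.0 respectively.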

SIMPLE ANALYSIS USING R

"Analysis" refers to the detailed examination and evaluation of the elements or structure of
something, typically as a basis for discussion or interpretation. In the context of data
analysis, it involves systematically applying statistical and logical techniques to describe,
illustrate, condense, and evaluate data. The goal of analysis is to extract useful
information, derive conclusions, and support decision-making.

Key Steps in Data Analysis

1. Data Collection: Gathering the necessary data from various sources.


2. Data Cleaning: Removing or correcting any errors, inconsistencies, or missing values
in the data.
3. Data Exploration: Examining the data to understand its structure, patterns, and
relationships.
4. Data Transformation: Manipulating the data into a suitable format for analysis,
including aggregation and normalization.
5. Data Modeling: Applying statistical models or algorithms to analyze the data.
6. Data Interpretation: Drawing conclusions and insights from the analysis results.
7. Reporting: Communicating the findings through reports, visualizations, or
presentations.

Step-by-Step Analysis

1. Load the Data

The iris dataset is a built-in dataset in R. It contains 150 observations of iris flowers, with
measurements for sepal length, sepal width, petal length, and petal width, along with the
species of the iris flower.

# Load the iris dataset
data(iris)

# Display the first few rows of the dataset
head(iris)

Output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

2. Exploratory Data Analysis (EDA)

Before performing any aggregation, it’s important to understand the data by exploring its
basic properties.

# Summary statistics for the dataset


summary(iris)

Output
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

This output gives a quick overview of the dataset, including the minimum, first quartile,
median, mean, third quartile, and maximum values for each numerical variable, as well as the
counts for each species.

3. Aggregation Using aggregate

We can use the aggregate function to calculate summary statistics for different groups
within the data. Here, we'll calculate the mean sepal length and sepal width for each species.

# Aggregate the data: mean of all numeric columns for each Species
agg_result <- aggregate(. ~ Species, data = iris, FUN = mean, na.rm = TRUE)

# Print the aggregated result
print(agg_result)

Output
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Explanation

- aggregate(. ~ Species, data = iris, FUN = mean, na.rm = TRUE): This line
  uses the aggregate function to calculate the mean of all numerical variables
  (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) grouped by
  Species. The na.rm = TRUE argument ensures that any missing values are ignored in
  the calculation.

4. Aggregation Using dplyr

The dplyr package provides more readable and flexible functions for data manipulation and
aggregation. We will achieve the same aggregation using dplyr.

# Load the dplyr package
library(dplyr)

# Group by Species and calculate the mean of each measurement
agg_df <- iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length, na.rm = TRUE),
    Mean_Sepal_Width = mean(Sepal.Width, na.rm = TRUE),
    Mean_Petal_Length = mean(Petal.Length, na.rm = TRUE),
    Mean_Petal_Width = mean(Petal.Width, na.rm = TRUE)
  )

# Print the aggregated result
print(agg_df)

Output
# A tibble: 3 × 5
  Species    Mean_Sepal_Length Mean_Sepal_Width Mean_Petal_Length Mean_Petal_Width
  <fct>                  <dbl>            <dbl>             <dbl>            <dbl>
1 setosa                  5.01             3.43              1.46            0.246
2 versicolor              5.94             2.77              4.26            1.33
3 virginica               6.59             2.97              5.55            2.03

Explanation

- group_by(Species): This function groups the data by the Species column.
- summarise(...): This function summarizes each group by calculating the mean of
  Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, ignoring any
  missing values.

Conclusion

We have performed a simple analysis of the iris dataset in R, including loading the data,
conducting basic EDA, and summarizing the data using both the aggregate function and the
dplyr package. This analysis helps us understand the average measurements of different iris
species, providing insights into the dataset's structure.

Example of Data Analysis: The iris Dataset

Let's walk through a simple data analysis example using the iris dataset in R. The iris
dataset contains measurements of sepal length, sepal width, petal length, and petal width for
three species of iris flowers.

1. Data Collection

In this case, the iris dataset is preloaded in R.

# Load iris dataset


data(iris)

2. Data Cleaning

Check for missing values and handle them if necessary.

# Check for missing values


sum(is.na(iris))

If there are missing values, you could handle them like this:

# Remove rows with missing values


iris_clean <- na.omit(iris)

3. Data Exploration

Explore the dataset to understand its structure and summary statistics.

# Display the structure of the dataset


str(iris)

# Summary statistics
summary(iris)

4. Data Transformation

Group the data by species and calculate the mean of each measurement.

# Load the dplyr package
library(dplyr)

# Group by Species and calculate the mean of each measurement
agg_df <- iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal_Length = mean(Sepal.Length, na.rm = TRUE),
    Mean_Sepal_Width = mean(Sepal.Width, na.rm = TRUE),
    Mean_Petal_Length = mean(Petal.Length, na.rm = TRUE),
    Mean_Petal_Width = mean(Petal.Width, na.rm = TRUE)
  )

# Print the aggregated result
print(agg_df)

5. Data Modeling

For this simple example, we won't apply a complex model, but we could proceed with
various statistical or machine learning models to analyze relationships and patterns further.
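As a minimal sketch of what a modeling step could look like, here is a simple linear regression with base R's lm(), predicting petal length from sepal length (the choice of variables is purely illustrative):

```r
data(iris)

# Fit a simple linear regression: Petal.Length modeled by Sepal.Length
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Inspect the coefficients and overall fit
summary(model)

# Predict petal length for a hypothetical flower with sepal length 5.0
predict(model, newdata = data.frame(Sepal.Length = 5.0))
```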

6. Data Interpretation

Interpret the results from the aggregation step.

# Interpretation of the results:
# The means of the measurements for each species give us an idea of the
# typical size and shape of the iris flowers in each species.

7. Reporting

Create visualizations to communicate the findings.

# Load the ggplot2 package
library(ggplot2)

# Create a boxplot of Sepal.Length by Species
# (the plot is printed automatically when the expression is run at top level)
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal Length by Species", x = "Species", y = "Sepal Length")

Conclusion

Analysis involves breaking down data into meaningful patterns and insights through
systematic steps. By following these steps, you can gain a deeper understanding of your data
and make informed decisions based on your findings.

NOTE:

The attributes such as minimum, first quartile, median, mean, third quartile, and maximum
values, along with counts for each category, are fundamental descriptive statistics. They
provide essential insights into the distribution and central tendency of numerical data. Here
are the practical uses of each attribute:

Descriptive Statistics and Their Practical Uses

1. Minimum Value
o Definition: The smallest value in the dataset.

o Use: Identifies the lower bound of the data range. It's useful for understanding
the lowest extreme and for detecting outliers or unusual values.
2. First Quartile (Q1)
o Definition: The value below which 25% of the data falls.
o Use: Helps to understand the lower 25% of the data distribution. It's useful in
identifying the spread of the lower portion of the dataset and is also a
component of the interquartile range (IQR), which measures statistical
dispersion.
3. Median (Q2)
o Definition: The middle value of the dataset when sorted in ascending order.
o Use: Represents the central tendency of the data, less affected by outliers and
skewed data compared to the mean. It's used to understand the typical value in
the dataset.
4. Mean
o Definition: The average of all data points.
o Use: Represents the central tendency but can be influenced by outliers. It's
useful for calculating the expected value and in various statistical analyses.
5. Third Quartile (Q3)
o Definition: The value below which 75% of the data falls.
o Use: Helps to understand the upper 25% of the data distribution. Like Q1, it's
a component of the interquartile range (IQR).
6. Maximum Value
o Definition: The largest value in the dataset.
o Use: Identifies the upper bound of the data range. It's useful for understanding
the highest extreme and for detecting outliers or unusual values.
7. Counts for Each Category
o Definition: The number of occurrences of each category (e.g., species in the
iris dataset).
o Use: Useful for understanding the distribution of categorical data, comparing
the frequency of different categories, and ensuring that each category is
adequately represented in the analysis.

Practical Examples

Let's consider practical scenarios where these descriptive statistics are useful:

1. Business Analytics

 Sales Analysis: Understanding the minimum, maximum, and quartiles of daily sales
can help a business manage inventory, identify sales trends, and detect anomalies.
 Customer Feedback: Median customer satisfaction scores provide a robust measure
of central tendency, helping businesses understand typical customer sentiment without
being skewed by outliers.

2. Healthcare

 Patient Data: Analyzing the mean, median, and quartiles of patient wait times can
help in resource allocation and improving service efficiency.
 Blood Pressure Levels: Understanding the distribution of blood pressure readings
(minimum, Q1, median, Q3, maximum) can aid in identifying at-risk patients and
tailoring medical interventions.

3. Education

 Student Scores: Teachers can use the quartiles and median scores to understand the
distribution of student performance and identify students who may need additional
support.
 Class Participation: Counts of class participation by different groups (e.g., gender,
grade level) help in assessing engagement and inclusivity.

4. Real Estate

 Property Prices: Real estate agents can use the descriptive statistics of property
prices in a neighborhood to advise clients on buying and selling decisions.
 Rental Rates: Understanding the distribution of rental rates helps in setting
competitive prices and identifying market trends.

Example Using the iris Dataset in R

Let’s compute these statistics for the iris dataset in R.

# Load necessary library


library(dplyr)

# Summary statistics for the iris dataset


summary_stats <- iris %>%
group_by(Species) %>%
summarise(
Min_Sepal_Length = min(Sepal.Length),
Q1_Sepal_Length = quantile(Sepal.Length, 0.25),
Median_Sepal_Length = median(Sepal.Length),
Mean_Sepal_Length = mean(Sepal.Length),
Q3_Sepal_Length = quantile(Sepal.Length, 0.75),
Max_Sepal_Length = max(Sepal.Length),
Count = n()
)

# Print the summary statistics


print(summary_stats)

Output
# A tibble: 3 × 8
  Species    Min_Sepal_Length Q1_Sepal_Length Median_Sepal_Length Mean_Sepal_Length Q3_Sepal_Length Max_Sepal_Length Count
  <fct>                 <dbl>           <dbl>               <dbl>             <dbl>           <dbl>            <dbl> <int>
1 setosa                  4.3             4.8                 5.0              5.01             5.2              5.8    50
2 versicolor              4.9             5.6                 5.9              5.94             6.3              7.0    50
3 virginica               4.9             6.2                 6.5              6.59             6.9              7.9    50
Methods for reading Data

Reading data refers to the process of importing or loading data from various sources
into a programming environment or software for analysis. In R, there are multiple methods
for reading data depending on the source and format of the data. Below are the methods,
along with their meanings and definitions:

1. Reading Data from CSV Files


read.csv()

- Meaning: Reads comma-separated values (CSV) files.
- Definition: A function in base R to import data from CSV files.

data <- read.csv("path/to/your/file.csv")

readr::read_csv()

- Meaning: Reads CSV files more efficiently than read.csv().
- Definition: A function from the readr package designed for fast and easy reading of
  CSV files.

library(readr)
data <- read_csv("path/to/your/file.csv")

2. Reading Data from Excel Files


readxl::read_excel()

- Meaning: Reads data from Excel (.xls or .xlsx) files.
- Definition: A function from the readxl package to import data from Excel
  spreadsheets.

library(readxl)
data <- read_excel("path/to/your/file.xlsx", sheet = "Sheet1")

3. Reading Data from Text Files


read.table()

- Meaning: Reads data from text files with specified delimiters.
- Definition: A function in base R for reading tabular data from text files.

data <- read.table("path/to/your/file.txt", header = TRUE, sep = "\t")

readr::read_table()

- Meaning: Reads whitespace-separated text files.
- Definition: A function from the readr package for efficiently reading files whose
  columns are separated by whitespace (for other delimiters, use read_delim()).

library(readr)
data <- read_table("path/to/your/file.txt")

4. Reading Data from JSON Files

jsonlite::fromJSON()

- Meaning: Reads data from JSON (JavaScript Object Notation) files.
- Definition: A function from the jsonlite package to parse and import JSON data.

library(jsonlite)
data <- fromJSON("path/to/your/file.json")

5. Reading Data from Databases

DBI::dbConnect() and DBI::dbGetQuery()

- Meaning: Connects to databases and retrieves data.
- Definition: Functions from the DBI package used to connect to a database and run
  SQL queries to import data.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "path/to/your/database.sqlite")
data <- dbGetQuery(con, "SELECT * FROM your_table")
dbDisconnect(con)
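The snippet above assumes an existing database file. Because RSQLite also supports in-memory databases, the same pattern can be tried end-to-end without any file (the table name iris_tbl is just an illustrative choice):

```r
library(DBI)
library(RSQLite)

# Create a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a data frame into the database as a table, then query it back
dbWriteTable(con, "iris_tbl", iris)
result <- dbGetQuery(con,
                     "SELECT Species, COUNT(*) AS n FROM iris_tbl GROUP BY Species")
print(result)  # one row per species, 50 observations each

dbDisconnect(con)
```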

6. Reading Data from Web APIs


httr::GET()

- Meaning: Makes HTTP requests to web APIs to retrieve data.
- Definition: A function from the httr package to send GET requests to web APIs and
  fetch data.

library(httr)
response <- GET("https://api.example.com/data")
data <- content(response, "parsed")

7. Reading Data from RDS Files


readRDS()

- Meaning: Reads R serialized data files.
- Definition: A base R function to read data from RDS files, which store R objects in a
  binary format.

data <- readRDS("path/to/your/file.rds")
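readRDS() has a counterpart, saveRDS(), for writing objects. A quick round-trip sketch using a temporary file:

```r
# Save an R object to a temporary RDS file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(iris, tmp)

restored <- readRDS(tmp)
identical(restored, iris)  # TRUE: the object survives the round trip

unlink(tmp)  # clean up the temporary file
```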

8. Reading Data from SAS, SPSS, and Stata Files

haven::read_sas(), haven::read_sav(), haven::read_dta()

- Meaning: Reads data from SAS, SPSS, and Stata files.
- Definition: Functions from the haven package to import data from proprietary
  statistical software file formats.

library(haven)
data_sas   <- read_sas("path/to/your/file.sas7bdat")
data_spss  <- read_sav("path/to/your/file.sav")
data_stata <- read_dta("path/to/your/file.dta")

Summary

Each of these methods allows you to import data into R from different file formats and
sources. Understanding these methods and their specific functions helps you efficiently load
and manage data for analysis.

Using R with Databases and Business Intelligence systems


Using R with databases and business intelligence (BI) systems enables powerful data
manipulation, analysis, and visualization capabilities. R can connect to various databases,
perform complex queries, and integrate with BI systems for reporting and dashboarding.
Here's a comprehensive guide on how to leverage R in these contexts.

Connecting to Databases with R

1. Setting Up Database Connections

R can connect to several types of databases, such as SQLite, MySQL, PostgreSQL, SQL
Server, and Oracle, using the DBI package along with database-specific packages like
RSQLite, RMySQL, RPostgres, odbc, etc.

Example: Connecting to an SQLite Database

# Load necessary packages
library(DBI)
library(RSQLite)

# Establish a connection to the SQLite database
con <- dbConnect(RSQLite::SQLite(), "path/to/your/database.sqlite")

# List tables in the database
tables <- dbListTables(con)
print(tables)

# Query data from a table
data <- dbGetQuery(con, "SELECT * FROM your_table")

# Disconnect from the database
dbDisconnect(con)

Example: Connecting to a MySQL Database

# Load necessary packages
library(DBI)
library(RMySQL)

# Establish a connection to the MySQL database
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "your_database_name",
                 host = "your_host",
                 port = 3306,
                 user = "your_username",
                 password = "your_password")

# List tables in the database
tables <- dbListTables(con)
print(tables)

# Query data from a table
data <- dbGetQuery(con, "SELECT * FROM your_table")

# Disconnect from the database
dbDisconnect(con)

Example: Connecting to a PostgreSQL Database

# Load necessary packages
library(DBI)
library(RPostgres)

# Establish a connection to the PostgreSQL database
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "your_database_name",
                 host = "your_host",
                 port = 5432,
                 user = "your_username",
                 password = "your_password")

# List tables in the database
tables <- dbListTables(con)
print(tables)

# Query data from a table
data <- dbGetQuery(con, "SELECT * FROM your_table")

# Disconnect from the database
dbDisconnect(con)

Performing Data Analysis and Visualization

Once the data is loaded into R, you can perform various data manipulation, analysis, and
visualization tasks using packages such as dplyr, ggplot2, and others.

Data Manipulation with dplyr

# Load necessary package


library(dplyr)

# Example data manipulation


data_summary <- data %>%
group_by(category_column) %>%
summarise(
mean_value = mean(numeric_column, na.rm = TRUE),
max_value = max(numeric_column, na.rm = TRUE)
)

# View the summary


print(data_summary)

Data Visualization with ggplot2

# Load necessary package


library(ggplot2)

# Example data visualization
ggplot(data, aes(x = category_column, y = numeric_column, fill = category_column)) +
  geom_bar(stat = "identity") +
  labs(title = "Example Bar Plot", x = "Category", y = "Value")

Integrating R with Business Intelligence Systems

R can be integrated with various BI systems for enhanced reporting and dashboarding
capabilities. This integration allows for complex analytics within BI platforms.

1. Using R with Microsoft Power BI

Power BI supports R scripts for data transformation and visualization. Here's an example of
using R in Power BI:

1. Load Data: Import data into Power BI.
2. Add R Script: In the Power Query Editor, select "Transform" > "Run R Script".
3. Write R Script: Enter your R script to manipulate or visualize the data.

Example R Script in Power BI:

# Load necessary package


library(ggplot2)

# Create a plot
ggplot(dataset, aes(x = category_column, y = numeric_column)) +
geom_point() +
labs(title = "Scatter Plot", x = "Category", y = "Value")

2. Using R with Tableau

Tableau supports R integration through the Rserve package. Here’s how to set it up:

1. Install Rserve: Run the following commands in R:

install.packages("Rserve")
library(Rserve)
Rserve()

2. Connect Tableau to R:
o In Tableau, go to "Help" > "Settings and Performance" > "Manage External Service
Connection".
o Choose "Rserve" and enter the server details.
3. Use R Scripts in Tableau:
o Create calculated fields using R scripts.
o Example: SCRIPT_REAL("mean(.arg1)", SUM([numeric_column]))

Example Workflow

Combining these steps into a workflow, you can connect to a database, perform data
manipulation and visualization in R, and then integrate the results into a BI system for
reporting.

1. Connect to Database:

library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(),
                 dbname = "your_database_name",
                 host = "your_host",
                 port = 3306,
                 user = "your_username",
                 password = "your_password")

data <- dbGetQuery(con, "SELECT * FROM your_table")
dbDisconnect(con)

2. Data Manipulation:

library(dplyr)

data_summary <- data %>%
  group_by(category_column) %>%
  summarise(
    mean_value = mean(numeric_column, na.rm = TRUE),
    max_value = max(numeric_column, na.rm = TRUE)
  )

3. Data Visualization:

library(ggplot2)

ggplot(data_summary, aes(x = category_column, y = mean_value)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Value by Category", x = "Category", y = "Mean Value")

4. Integrate with BI System:
   o Export the visualization as an image or data file.
   o Import the image or data into Power BI or Tableau.
   o Use R scripts within the BI tool for additional analysis or visualization.

Conclusion

Using R with databases and business intelligence systems enhances the ability to
perform advanced analytics and create insightful visualizations.

This integration allows for efficient data management, complex statistical analysis,
and seamless reporting, making it a powerful combination for data-driven decision-making.