
Data Manipulation using DPLYR package

2024-08-22

Installing and loading the packages


#install.packages("ISLR", dependencies = TRUE)
#install.packages("dplyr", dependencies = TRUE)

library(ISLR)
library(dplyr)

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

Loading the Datasets


data("Carseats")

Exploring the Carseats Dataset


head(Carseats, n=2)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes

The Carseats dataset has 400 observations and 11 attributes: Sales, CompPrice, Income,
Advertising, Population, Price, ShelveLoc, Age, Education, Urban, and US.
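A quick way to check the size and column names of the dataset before manipulating it is sketched below. It assumes the ISLR and dplyr packages are installed; glimpse() is used here only as one convenient overview, and base R's str() would work as well.

```r
library(ISLR)   # provides the Carseats data frame
library(dplyr)

data("Carseats")

# Number of rows and columns
dim(Carseats)

# Exact column names, useful for checking spellings such as CompPrice
names(Carseats)

# Compact overview of each column's type and first few values
glimpse(Carseats)
```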
Select
1. Select the columns Sales, CompPrice, and Income from the Carseats dataset.
Selected_Car <- Carseats %>%
  select(Sales, CompPrice, Income) %>%
  slice_head(n=10)

Selected_Car

##    Sales CompPrice Income
## 1   9.50       138     73
## 2  11.22       111     48
## 3  10.06       113     35
## 4   7.40       117    100
## 5   4.15       141     64
## 6  10.81       124    113
## 7   6.63       115    105
## 8  11.85       136     81
## 9   6.54       132    110
## 10  4.69       132    113
In this code, I used the select() function to keep the columns Sales, CompPrice, and
Income, and slice_head(n=10) to show the first 10 rows. Sales records unit sales (in
thousands) at each store, so it is the main measure of performance. CompPrice is the
price charged by the competitor at each location, which makes it useful for pricing and
competitive analysis. Income is the community income level (in thousands of dollars),
which describes the purchasing power of the local market rather than the store's own
earnings.
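select() also accepts ranges, negative selections, and helper functions. The following is a minimal sketch of those variants, assuming dplyr 1.0.0 or later (needed for where()); the object names are illustrative only.

```r
library(ISLR)
library(dplyr)

data("Carseats")

# Sales, CompPrice, and Income are adjacent, so a range selects the same columns
range_selected <- Carseats %>%
  select(Sales:Income)

# A minus sign drops columns instead of keeping them
without_location <- Carseats %>%
  select(-Urban, -US)

# where() keeps every column that satisfies a predicate, here the numeric ones
numeric_only <- Carseats %>%
  select(where(is.numeric))

names(numeric_only)
```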

Filter
2. Filter the Carseats dataset to include only observations where Sales is greater than
8 (that is, 8,000 units, since Sales is recorded in thousands).
Filtered_Car <- Carseats %>%
  filter(Sales > 8) %>%
  slice_head(n=10)

Filtered_Car

##    Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1   9.50       138     73          11        276   120       Bad  42        17
## 2  11.22       111     48          16        260    83      Good  65        10
## 3  10.06       113     35          10        269    80    Medium  59        12
## 6  10.81       124    113          13        501    72       Bad  78        16
## 8  11.85       136     81          15        425   120      Good  67        10
## 11  9.01       121     78           9        150   100       Bad  26        10
## 12 11.96       117     94           4        503    94      Good  50        13
## 14 10.96       115     28          11         29    86      Good  53        18
## 15 11.17       107    117          11        148   118      Good  52        18
## 16  8.71       149     95           5        400   144    Medium  76        18
##    Urban  US
## 1    Yes Yes
## 2    Yes Yes
## 3    Yes Yes
## 6     No Yes
## 8    Yes Yes
## 11    No Yes
## 12   Yes Yes
## 14   Yes Yes
## 15   Yes Yes
## 16    No  No

The filter() function keeps only the rows where Sales is greater than 8, and slice_head() then selects
the first 10 rows of the result. The filtered data shows several features of the stores, including the
competitor's price (CompPrice), the community income level (Income), and the advertising budget
(Advertising). In these 10 rows, Sales ranges from 8.71 to 11.96, competitor prices from 107 to 149,
advertising budgets from 4 to 16, and the average age of the local population (Age) from 26 to 78 years.
The ShelveLoc variable rates the shelving location as "Bad," "Good," or "Medium," with "Good" being the
most common (5 of the 10 rows). Seven of the 10 stores are in urban areas (Urban) and nine are in the US
(US); only one store (row 16) is neither urban nor US-based. This sample gives a snapshot of stores with
relatively high sales and shows how varied their market and demographic characteristics are.

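filter() can combine several conditions, and count() can tally the Urban and US values described above instead of reading them off the printout. A minimal sketch, with illustrative object names:

```r
library(ISLR)
library(dplyr)

data("Carseats")

# Comma-separated conditions are combined with AND
high_sales_good_shelf <- Carseats %>%
  filter(Sales > 8, ShelveLoc == "Good")

# Use | when either condition is enough
urban_or_us <- Carseats %>%
  filter(Urban == "Yes" | US == "Yes")

# Tally the Urban/US combinations among the first 10 high-sales rows
Carseats %>%
  filter(Sales > 8) %>%
  slice_head(n = 10) %>%
  count(Urban, US)
```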
Arrange
3. Order the Carseats dataset by Sales in descending order.
Arranged_Car <- Carseats %>%
  arrange(desc(Sales)) %>%
  slice_head(n=10)

Arranged_Car

##     Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 377 16.27       141     60          19        319    92      Good  44        11
## 317 15.63       122     36           5        369    72      Good  35        10
## 26  14.90       139     32           0        176    82      Good  54        11
## 368 14.37        95    106           0        256    53      Good  52        17
## 19  13.91       110    110           0        408    68      Good  46        17
## 31  13.55       125     94           0        447    89      Good  30        12
## 353 13.44       133    103          14        288   122      Good  61        17
## 69  13.39       149     69          20        366   134      Good  60        13
## 358 13.36       103     73           3        276    72    Medium  34        15
## 194 13.28       139     70           7         71    96      Good  61        10
##     Urban  US
## 377   Yes Yes
## 317   Yes Yes
## 26     No  No
## 368   Yes  No
## 19     No Yes
## 31    Yes  No
## 353   Yes Yes
## 69    Yes Yes
## 358   Yes Yes
## 194   Yes Yes

First, the data is sorted in descending order of the Sales column using arrange(desc(Sales)). Then
slice_head(n=10) extracts the first 10 rows of the sorted data. The result is the 10 Carseats records with
the highest sales, together with their competitor prices, income levels, advertising budgets, and other
attributes. Nine of these 10 stores have a "Good" shelf location, eight are in urban areas, and seven are
in the United States. Shelf location in particular may therefore be associated with higher sales, although
10 rows alone are not enough to establish this.
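The same top 10 can be obtained in one step with slice_max(), and count() can confirm how the shelf locations are distributed among them. This is a sketch of an alternative, not the approach used in the assignment; with_ties = FALSE is an added assumption so that exactly 10 rows come back even if several stores share the tenth-highest Sales value.

```r
library(ISLR)
library(dplyr)

data("Carseats")

# The 10 rows with the highest Sales, without a separate arrange() step
top10 <- Carseats %>%
  slice_max(Sales, n = 10, with_ties = FALSE)

# How many of the top sellers fall in each shelf location
top10 %>% count(ShelveLoc)

# arrange() also accepts several keys: shelf location first, then Sales descending
by_shelf_then_sales <- Carseats %>%
  arrange(ShelveLoc, desc(Sales))
```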

Mutate
4. Create a new variable in the Carseats dataset called Profit calculated as Sales minus
Price.
Mutated_Car <- Carseats %>%
  mutate(Profit = Sales - Price) %>%
  slice_head(n=10)

Mutated_Car

##    Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1   9.50       138     73          11        276   120       Bad  42        17
## 2  11.22       111     48          16        260    83      Good  65        10
## 3  10.06       113     35          10        269    80    Medium  59        12
## 4   7.40       117    100           4        466    97    Medium  55        14
## 5   4.15       141     64           3        340   128       Bad  38        13
## 6  10.81       124    113          13        501    72       Bad  78        16
## 7   6.63       115    105           0         45   108    Medium  71        15
## 8  11.85       136     81          15        425   120      Good  67        10
## 9   6.54       132    110           0        108   124    Medium  76        10
## 10  4.69       132    113           0        131   124    Medium  76        17
##    Urban  US  Profit
## 1    Yes Yes -110.50
## 2    Yes Yes  -71.78
## 3    Yes Yes  -69.94
## 4    Yes Yes  -89.60
## 5    Yes  No -123.85
## 6     No Yes  -61.19
## 7    Yes  No -101.37
## 8    Yes Yes -108.15
## 9     No  No -117.46
## 10    No Yes -119.31

The code adds a new variable called Profit to the Carseats dataset, calculated as Sales minus Price,
and then extracts the first 10 rows of the modified dataset with slice_head(). For each of these 10 rows
the Profit value is negative. This is an artifact of the units rather than evidence that the car seats are
sold at a loss: Sales is the number of units sold (in thousands), while Price is the price charged per car
seat in dollars, so their difference is not a monetary profit. The variable is still a useful exercise in
mutate(), but it should not be read as a financial result. The dataset also includes other variables, such
as CompPrice, Income, Advertising, Population, and ShelveLoc, and demographic information like Age,
Education, Urban, and US, which could be analyzed further to understand what drives sales.
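Because Sales is measured in thousands of units and Price in dollars per unit, multiplying them gives a quantity with consistent units. The sketch below adds such a column next to Profit for comparison; Revenue is a hypothetical variable name introduced here for illustration and is not part of the original assignment.

```r
library(ISLR)
library(dplyr)

data("Carseats")

with_revenue <- Carseats %>%
  mutate(
    Profit  = Sales - Price,   # as defined in the task (mixes units)
    Revenue = Sales * Price    # thousands of units x dollars = thousands of dollars
  ) %>%
  select(Sales, Price, Profit, Revenue)

head(with_revenue, n = 10)
```

mutate() can create several columns in one call, and later columns may refer to earlier ones defined in the same call.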

Group_by and Summarize


5. Calculate the average Sales for each ShelveLoc in the Carseats dataset.
Summary_Car <- Carseats %>%
  count(ShelveLoc)

Summary_Car

##   ShelveLoc   n
## 1       Bad  96
## 2      Good  85
## 3    Medium 219

The code summarizes the Carseats dataset by the ShelveLoc variable, which represents the quality of the
shelf location for the car seats. The count() function from the dplyr package counts the number of
observations in each ShelveLoc category. Of the 400 observations, 96 have a "Bad" shelf location, 85 have
a "Good" shelf location, and 219 have a "Medium" shelf location, so "Medium" is the most common. Note
that count() returns frequencies, not the average Sales the task asks for; the averages require
group_by(ShelveLoc) followed by summarise() with mean(Sales).
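The task asks for the average Sales per shelf location, which can be computed with group_by() and summarise(). A minimal sketch follows; the object name avg_sales_by_shelf and the column name AvgSales are illustrative, and the resulting averages are not shown here because they were not part of the original output.

```r
library(ISLR)
library(dplyr)

data("Carseats")

avg_sales_by_shelf <- Carseats %>%
  group_by(ShelveLoc) %>%
  summarise(
    n        = n(),          # number of stores in each shelf location
    AvgSales = mean(Sales)   # average unit sales (in thousands)
  )

avg_sales_by_shelf
```

Keeping n() alongside the mean makes it easy to see how many observations each average is based on.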

Additional Challenges
6. Create a new variable in the Carseats dataset indicating whether sales are high,
medium, or low based on certain thresholds.
Carseats_with_new_column <- Carseats %>%
  mutate(SalesCategory = case_when(
    Sales > 8 ~ "High",
    Sales > 4 ~ "Medium",
    TRUE ~ "Low")) %>%
  slice_head(n=10)

Carseats_with_new_column

##    Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1   9.50       138     73          11        276   120       Bad  42        17
## 2  11.22       111     48          16        260    83      Good  65        10
## 3  10.06       113     35          10        269    80    Medium  59        12
## 4   7.40       117    100           4        466    97    Medium  55        14
## 5   4.15       141     64           3        340   128       Bad  38        13
## 6  10.81       124    113          13        501    72       Bad  78        16
## 7   6.63       115    105           0         45   108    Medium  71        15
## 8  11.85       136     81          15        425   120      Good  67        10
## 9   6.54       132    110           0        108   124    Medium  76        10
## 10  4.69       132    113           0        131   124    Medium  76        17
##    Urban  US SalesCategory
## 1    Yes Yes          High
## 2    Yes Yes          High
## 3    Yes Yes          High
## 4    Yes Yes        Medium
## 5    Yes  No        Medium
## 6     No Yes          High
## 7    Yes  No        Medium
## 8    Yes Yes          High
## 9     No  No        Medium
## 10    No Yes        Medium

The code creates a new column called SalesCategory in the Carseats dataset using the mutate() and
case_when() functions from the dplyr package. case_when() checks its conditions in order and uses the
first one that is true: if Sales is greater than 8, the category is "High"; if Sales is greater than 4 and
at most 8, it is "Medium"; and if Sales is 4 or below, it is "Low." After adding this column,
slice_head() selects the first 10 rows of the modified dataset. For example, the first row has a Sales
value of 9.50 and is classified as "High," while the fifth row, with a Sales value of 4.15, falls into the
"Medium" category. None of the first 10 rows is "Low." Categorizing sales in this way makes it easier to
compare and summarize sales performance across stores.
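To see how the whole dataset splits across the three categories, the new column can be tallied with count(). The sketch below also shows base R's cut() as an alternative that yields the same grouping for the thresholds 4 and 8; the column name SalesCategoryCut is illustrative only.

```r
library(ISLR)
library(dplyr)

data("Carseats")

categorised <- Carseats %>%
  mutate(
    # Conditions are evaluated in order; the first match wins
    SalesCategory = case_when(
      Sales > 8 ~ "High",
      Sales > 4 ~ "Medium",
      TRUE      ~ "Low"
    ),
    # Same thresholds with cut(): intervals (-Inf, 4], (4, 8], (8, Inf)
    SalesCategoryCut = cut(
      Sales,
      breaks = c(-Inf, 4, 8, Inf),
      labels = c("Low", "Medium", "High")
    )
  )

# Number of stores in each category across all rows
categorised %>% count(SalesCategory)
```

cut() returns a factor whose levels are ordered Low, Medium, High, which keeps tables and plots in a sensible order, whereas case_when() returns plain character strings here.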
