
HW 4

Uploaded by

guoj310

10/25/24, 4:57 PM HW 4

HW 4
Enter your name and EID here: Jiaqi Guo (jg76446)

For all questions, include the R commands/functions that you used to find your answer (show R chunk). Answers
without supporting code will not receive credit. Write full sentences to describe your findings.

Part 1
This part uses the world_bank_pop dataset, which ships with the tidyverse.

Question 1: (2 pts)
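Pivot world_bank_pop so that the year columns become a single numeric year variable, with the corresponding values stored in indicator_value.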

# Load the tidyverse (provides world_bank_pop, the pipe, and the pivot functions)
library(tidyverse)

# Pivot years 2000 to 2017 into a 'year' variable, and population values into 'indicator_value'
world_bank_pop_tidy <- world_bank_pop %>%
  pivot_longer(
    cols = `2000`:`2017`,          # Columns for each year from 2000 to 2017
    names_to = "year",             # New column to hold year values
    values_to = "indicator_value"  # New column to hold values for each year
  ) %>%
  mutate(year = as.numeric(year))  # Ensure 'year' is a numeric variable

Pivot the result again so that each indicator becomes its own column, and save the tidy dataset as myworld.

# Further tidying to make each indicator category its own column
myworld <- world_bank_pop_tidy %>%
  pivot_wider(
    names_from = indicator,
    values_from = indicator_value
  )
myworld

file:///Users/jiaqiguo/Downloads/R studio/SDS322E/HW4.html 1/12



## # A tibble: 4,788 × 6
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW 2000 41625 1.66 89101 2.54
## 2 ABW 2001 42025 0.956 90691 1.77
## 3 ABW 2002 42194 0.401 91781 1.19
## 4 ABW 2003 42277 0.197 92701 0.997
## 5 ABW 2004 42317 0.0946 93540 0.901
## 6 ABW 2005 42399 0.194 94483 1.00
## 7 ABW 2006 42555 0.367 95606 1.18
## 8 ABW 2007 42729 0.408 96787 1.23
## 9 ABW 2008 42906 0.413 97996 1.24
## 10 ABW 2009 43079 0.402 99212 1.23
## # ℹ 4,778 more rows
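The two pivots above can be checked on a small table before running them on the full data. The sketch below uses a made-up two-indicator table (the country code "AAA" and all values are hypothetical) and only the pivot_longer() and pivot_wider() calls already used in this question. The same logic explains the row count above: 4,788 rows is 266 country codes times 18 years.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table shaped like world_bank_pop: one row per
# country-indicator pair, one column per year
wide <- tibble(
  country   = c("AAA", "AAA"),
  indicator = c("SP.POP.TOTL", "SP.POP.GROW"),
  `2000`    = c(100, 1.5),
  `2001`    = c(110, 2.0)
)

# Step 1: gather the year columns into 'year' / 'indicator_value'
long <- wide %>%
  pivot_longer(
    cols = `2000`:`2001`,
    names_to = "year",
    values_to = "indicator_value"
  ) %>%
  mutate(year = as.numeric(year))

# Step 2: spread each indicator into its own column
tidy <- long %>%
  pivot_wider(names_from = indicator, values_from = indicator_value)

tidy
```

After step 1 there is one row per country, indicator, and year; after step 2 there is one row per country and year, which is the shape of myworld.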

Question 2: (2 pts)
Use ggplot to plot the world's urban population growth over the years. Note: the country code WLD represents the entire world.

# Filter the data to include only the world population growth data (country code "WLD")
world_growth <- myworld %>%
  filter(country == "WLD")

# Plot urban population growth over the years
ggplot(world_growth, aes(x = year, y = SP.URB.GROW)) +
  geom_line(color = "blue") +
  labs(
    title = "World Urban Population Growth Over the Years",
    x = "Year",
    y = "Urban Population Growth (%)"
  )


Using myworld, find which country code had the highest population growth in 2017.

# Filter data for the year 2017 and find the country with the highest population growth
highest_growth_2017 <- myworld %>%
  filter(year == 2017) %>%
  filter(SP.POP.GROW == max(SP.POP.GROW, na.rm = TRUE))

highest_growth_2017

## # A tibble: 1 × 6
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 QAT 2017 2686753 4.46 2711755 4.39


Question 3: (2 pts)

Install and load the countrycode package.

# Install the package (only needed once)
install.packages("countrycode")

The package includes the codelist dataset.

# Call the countrycode package
library(countrycode)

# Take a look at the dataset
head(codelist)

Select continent, wb, and country.name.en from codelist, keep only the rows where wb and continent are not missing, and save the result as mycodes.

# Create `mycodes` by selecting necessary columns and removing rows with missing data
mycodes <- codelist %>%
  select(continent, wb, country.name.en) %>%
  filter(!is.na(wb) & !is.na(continent))
mycodes

## # A tibble: 216 × 3
## continent wb country.name.en
## <chr> <chr> <chr>
## 1 Asia AFG Afghanistan
## 2 Europe ALB Albania
## 3 Africa DZA Algeria
## 4 Oceania ASM American Samoa
## 5 Europe AND Andorra
## 6 Africa AGO Angola
## 7 Americas ATG Antigua & Barbuda
## 8 Americas ARG Argentina
## 9 Asia ARM Armenia
## 10 Americas ABW Aruba
## # ℹ 206 more rows

Count the distinct country codes in mycodes.

# Count distinct country codes
num_country_codes <- mycodes %>%
  summarise(distinct_codes = n_distinct(wb))

num_country_codes


## # A tibble: 1 × 1
## distinct_codes
## <int>
## 1 216

Question 4: (2 pts)
Count the distinct country codes in myworld to compare with mycodes.

# Count distinct country codes in myworld
num_country_codes <- myworld %>%
  summarise(distinct_codes = n_distinct(country))
num_country_codes

## # A tibble: 1 × 1
## distinct_codes
## <int>
## 1 266

Use inner_join() to combine myworld with mycodes and save the result as mycountries.

# Combine the datasets with an inner join on the World Bank country code
mycountries <- inner_join(myworld, mycodes, by = c("country" = "wb"))
mycountries

## # A tibble: 3,870 × 8
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW continent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ABW 2000 41625 1.66 89101 2.54 Americas
## 2 ABW 2001 42025 0.956 90691 1.77 Americas
## 3 ABW 2002 42194 0.401 91781 1.19 Americas
## 4 ABW 2003 42277 0.197 92701 0.997 Americas
## 5 ABW 2004 42317 0.0946 93540 0.901 Americas
## 6 ABW 2005 42399 0.194 94483 1.00 Americas
## 7 ABW 2006 42555 0.367 95606 1.18 Americas
## 8 ABW 2007 42729 0.408 96787 1.23 Americas
## 9 ABW 2008 42906 0.413 97996 1.24 Americas
## 10 ABW 2009 43079 0.402 99212 1.23 Americas
## # ℹ 3,860 more rows
## # ℹ 1 more variable: country.name.en <chr>
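The join output has 3,870 rows, which is 215 country codes times 18 years, so the inner join dropped codes on both sides: myworld has 266 distinct codes and mycodes has 216. A minimal sketch of how anti_join() can list the unmatched codes in each direction is shown below. It uses small stand-in tables, and the specific codes in them are illustrative assumptions, not the actual unmatched codes in the homework data.

```r
library(dplyr)

# Hypothetical stand-ins for myworld and mycodes (toy values only)
world_toy <- tibble(
  country = c("ABW", "AFG", "WLD", "ARB"),
  year    = 2017
)
codes_toy <- tibble(
  wb        = c("ABW", "AFG", "TWN"),
  continent = c("Americas", "Asia", "Asia")
)

# Codes present in the population data but absent from the code list
# (in this toy data, aggregate codes such as WLD)
unmatched_world <- world_toy %>%
  distinct(country) %>%
  anti_join(codes_toy, by = c("country" = "wb"))

# Codes present in the code list but absent from the population data
unmatched_codes <- codes_toy %>%
  anti_join(world_toy, by = c("wb" = "country"))

unmatched_world
unmatched_codes
```

Running the same two anti_join() calls on myworld and mycodes would show which codes the inner join discarded.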

Using mycountries, find the country with the highest population growth in 2017.


# Find the country code with the highest population growth in 2017
highest_growth_country <- mycountries %>%
  filter(year == 2017) %>%
  slice_max(order_by = SP.POP.GROW)

highest_growth_country

## # A tibble: 1 × 8
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW continent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 QAT 2017 2686753 4.46 2711755 4.39 Asia
## # ℹ 1 more variable: country.name.en <chr>

Question 5: (2 pts)
Group mycountries by continent and find which continents have the highest and lowest average urban population growth.

# Group by continent and calculate the average urban population growth
average_growth_by_continent <- mycountries %>%
  group_by(continent) %>%
  summarize(average_growth = mean(SP.URB.GROW, na.rm = TRUE)) %>%
  ungroup()

# Identify the continent with the highest and lowest average growth
highest_growth_continent <- average_growth_by_continent %>%
  filter(average_growth == max(average_growth, na.rm = TRUE))

lowest_growth_continent <- average_growth_by_continent %>%
  filter(average_growth == min(average_growth, na.rm = TRUE))

# Output the results
highest_growth_continent

## # A tibble: 1 × 2
## continent average_growth
## <chr> <dbl>
## 1 Africa 3.59

lowest_growth_continent

## # A tibble: 1 × 2
## continent average_growth
## <chr> <dbl>
## 1 Europe 0.499


Filter mycountries to keep only African countries in 2017 and save the result as myafrica2017.

# Create a new dataset of African countries for the year 2017
myafrica2017 <- mycountries %>%
  filter(year == 2017, continent == "Africa")
myafrica2017

## # A tibble: 54 × 8
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW continent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AGO 2017 19586972 4.62 30208628 3.55 Africa
## 2 BDI 2017 1417430 4.82 11155593 2.29 Africa
## 3 BEN 2017 5423582 4.11 11596779 2.95 Africa
## 4 BFA 2017 5701421 5.01 19835858 2.87 Africa
## 5 BWA 2017 1650064 3.20 2401840 2.08 Africa
## 6 CAF 2017 2047664 2.76 4996741 1.87 Africa
## 7 CIV 2017 12505013 3.47 24848016 2.59 Africa
## 8 CMR 2017 13605785 3.91 24393181 2.83 Africa
## 9 COD 2017 36983500 4.76 84283273 3.44 Africa
## 10 COG 2017 3530528 3.08 5312340 2.39 Africa
## # ℹ 44 more rows
## # ℹ 1 more variable: country.name.en <chr>

Question 6: (2 pts)

The map_data() function needs the maps package to be installed.

# Install package (only needed once)
install.packages("maps")

Save the world map coordinates as mapWorld.

# Geographic coordinates about countries in the world
mapWorld <- map_data("world")

# Take a quick look
head(mapWorld)


## long lat group order region subregion


## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>

Join mapWorld with myafrica2017 and save the result as mymap. Note: the key variables do not have the same name in each dataset, but they contain the same information.

# Check the column names of both datasets
mapWorld

## long lat group order region subregion


## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
## 7 -70.03511 12.61411 1 7 Aruba <NA>
## 8 -69.97314 12.56763 1 8 Aruba <NA>
## 9 -69.91181 12.48047 1 9 Aruba <NA>
## 10 -69.89912 12.45200 1 10 Aruba <NA>
## 12 74.89131 37.23164 2 12 Afghanistan <NA>
## 13 74.84023 37.22505 2 13 Afghanistan <NA>
## 14 74.76738 37.24917 2 14 Afghanistan <NA>
## 15 74.73896 37.28564 2 15 Afghanistan <NA>
## 16 74.72666 37.29072 2 16 Afghanistan <NA>
## 17 74.66895 37.26670 2 17 Afghanistan <NA>
## [ reached 'max' / getOption("max.print") -- omitted 99322 rows ]

myafrica2017


## # A tibble: 54 × 8
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW continent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AGO 2017 19586972 4.62 30208628 3.55 Africa
## 2 BDI 2017 1417430 4.82 11155593 2.29 Africa
## 3 BEN 2017 5423582 4.11 11596779 2.95 Africa
## 4 BFA 2017 5701421 5.01 19835858 2.87 Africa
## 5 BWA 2017 1650064 3.20 2401840 2.08 Africa
## 6 CAF 2017 2047664 2.76 4996741 1.87 Africa
## 7 CIV 2017 12505013 3.47 24848016 2.59 Africa
## 8 CMR 2017 13605785 3.91 24393181 2.83 Africa
## 9 COD 2017 36983500 4.76 84283273 3.44 Africa
## 10 COG 2017 3530528 3.08 5312340 2.39 Africa
## # ℹ 44 more rows
## # ℹ 1 more variable: country.name.en <chr>

# Join mapWorld and myafrica2017 on the country name
mymap <- inner_join(mapWorld, myafrica2017, by = c("region" = "country.name.en"))
mymap

## long lat group order region subregion country year SP.URB.TOTL


## 1 23.96650 -10.87178 3 423 Angola <NA> AGO 2017 19586972
## 2 23.98828 -11.00283 3 424 Angola <NA> AGO 2017 19586972
## 3 24.01006 -11.18477 3 425 Angola <NA> AGO 2017 19586972
## 4 24.02559 -11.31563 3 426 Angola <NA> AGO 2017 19586972
## 5 24.04141 -11.37412 3 427 Angola <NA> AGO 2017 19586972
## 6 24.04668 -11.40537 3 428 Angola <NA> AGO 2017 19586972
## 7 24.02930 -11.43916 3 429 Angola <NA> AGO 2017 19586972
## SP.URB.GROW SP.POP.TOTL SP.POP.GROW continent
## 1 4.620862 30208628 3.550987 Africa
## 2 4.620862 30208628 3.550987 Africa
## 3 4.620862 30208628 3.550987 Africa
## 4 4.620862 30208628 3.550987 Africa
## 5 4.620862 30208628 3.550987 Africa
## 6 4.620862 30208628 3.550987 Africa
## 7 4.620862 30208628 3.550987 Africa
## [ reached 'max' / getOption("max.print") -- omitted 11472 rows ]

Question 7: (2 pts)
Install the ggmap package.

# Install package (only needed once)
install.packages("ggmap")

Add a comment after each # in the code below to explain what that layer does. Note: it would be a good idea to run the code piece by piece to see what each layer adds to the plot. eval=FALSE


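# Load the ggmap package
library(ggmap)

# Build a map!
mymap |>
  # Map longitude and latitude to the axes, draw one shape per group, and fill by urban growth
  ggplot(aes(x = long, y = lat, group = group, fill = SP.URB.GROW)) +
  # Draw each country as a polygon with a black border
  geom_polygon(colour = "black") +
  # Shade low urban growth red and high urban growth blue
  scale_fill_gradient(low = "red", high = "blue") +
  # Label the legend, the title, and both axes
  labs(fill = "Urban Growth",
       title = "Urban Growth in Africa in 2017",
       x = "Longitude",
       y = "Latitude")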

Question 8: (1 pt)

Use anti_join() to find the countries in myafrica2017 that have no match in mapWorld.

myafrica2017 |>
  anti_join(mapWorld, by = c("country.name.en" = "region")) |>
  select(country.name.en)

## # A tibble: 5 × 1
## country.name.en
## <chr>
## 1 Côte d’Ivoire
## 2 Congo - Kinshasa
## 3 Congo - Brazzaville
## 4 São Tomé & Príncipe
## 5 Eswatini

Note: This question can be challenging! You will have to do some research about each of these countries: this is pretty typical for a data scientist though! We need to get more knowledge about the context to make sense of the data.


Use str_detect() to look for names in mapWorld that could correspond to the unmatched countries in myafrica2017.

# Distinct country names in mapWorld that could match
mapWorld |>
  distinct(region) |>
  filter(str_detect(region, "Ivor") |
           str_detect(region, "Congo") |
           str_detect(region, "Sao") |
           str_detect(region, "Swaziland"))

## region
## 1 Ivory Coast
## 2 Democratic Republic of the Congo
## 3 Republic of Congo
## 4 Sao Tome and Principe
## 5 Swaziland

Recode the country names in myafrica2017 so that they match the names used in mapWorld.
Hint: use recode() inside mutate() as described in our WS10 or in this article: https://www.statology.org/recode-dplyr/

# your code goes below (replace this comment with something meaningful)
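The code chunk for this step is empty in the source, so what follows is only one possible sketch of the recode() hint, not the author's answer. It assumes the five unmatched names from the anti_join() output pair with the five mapWorld names found by str_detect() (Congo - Kinshasa is the Democratic Republic of the Congo, and Congo - Brazzaville is the Republic of Congo). The small africa_names table is a hypothetical stand-in for myafrica2017, with "Angola" included as a name that should stay unchanged.

```r
library(dplyr)

# Hypothetical stand-in for myafrica2017: only the name column matters here
africa_names <- tibble(
  country.name.en = c("Angola", "Côte d’Ivoire", "Congo - Kinshasa",
                      "Congo - Brazzaville", "São Tomé & Príncipe", "Eswatini")
)

# Replace the codelist spellings with the spellings used in mapWorld;
# names that are not listed (such as "Angola") are left as they are
africa_recoded <- africa_names |>
  mutate(country.name.en = recode(
    country.name.en,
    "Côte d’Ivoire"       = "Ivory Coast",
    "Congo - Kinshasa"    = "Democratic Republic of the Congo",
    "Congo - Brazzaville" = "Republic of Congo",
    "São Tomé & Príncipe" = "Sao Tome and Principe",
    "Eswatini"            = "Swaziland"
  ))

africa_recoded
```

Applying the same mutate() to myafrica2017 and repeating the inner_join() with mapWorld should bring the five missing countries onto the map. The apostrophe in "Côte d’Ivoire" must be the typographic one shown in the anti_join() output, otherwise that name will not be recoded.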

Part 2

Question 9: (2 pts)


Question 10: (2 pts)

Formatting: (1 pt)