tidyverse Study Notes: Factor Operators


Factor Operators

fct

Use forcats::fct() rather than base R's factor().

x <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y <- fct(x, levels = month_levels)

After converting with forcats::fct(), sort() orders the values by level rather than alphabetically. The factor keeps the full set of levels, which makes ordered operations and comparisons straightforward.

sort(y)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

When converting with fct(), every value must appear in the specified levels. If a value is not among the levels (for example, because of a typo), fct() throws an error instead of silently producing NA, which helps protect data integrity.

x <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y <- fct(x, levels = month_levels)
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
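
For contrast, base R's factor() handles the same typo differently. The sketch below assumes the forcats package is installed and uses the built-in month.abb constant in place of the hand-typed level vector.

```r
library(forcats)

x <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- month.abb  # built-in constant: "Jan", "Feb", ..., "Dec"

# Base R silently turns the unknown value "Jam" into NA.
base_result <- factor(x, levels = month_levels)
sum(is.na(base_result))
#> [1] 1

# fct() refuses instead; capture the error so the script keeps running.
fct_failed <- tryCatch(
  {
    fct(x, levels = month_levels)
    FALSE
  },
  error = function(e) TRUE
)
fct_failed
#> [1] TRUE
```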

levels

levels() returns the set of valid levels of a factor.

levels(x)

Here x must be a factor.
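
A short illustration of levels(), assuming forcats is installed; the size labels are made up for the example.

```r
library(forcats)

# Without explicit levels, fct() orders levels by first appearance.
sizes <- fct(c("medium", "small", "large", "small"))
levels(sizes)
#> [1] "medium" "small"  "large"

# A plain character vector has no levels attribute.
levels(c("medium", "small"))
#> NULL
```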

ordered

ordered() creates ordered factors. Ordered factors imply a strict ordering between levels, but don’t specify anything about the magnitude of the differences between the levels.

An ordered factor can be recognized in its printed output, as it uses < symbols between the factor levels:

ordered(c("a", "b", "c"))
#> [1] a b c
#> Levels: a < b < c

In both base R and the tidyverse, ordered factors behave similarly to regular factors, with two notable differences:

When an ordered factor is mapped to color or fill in ggplot2, it defaults to scale_color_viridis() or scale_fill_viridis(), color scales that imply a ranking.
When used as a predictor in a linear model, an ordered factor applies “polynomial contrasts.” While these can be useful, they are rarely interpreted unless specialized statistical knowledge is applied. For further details, refer to vignette("contrasts", package = "faux") by Lisa DeBruine.
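
The strict ordering between levels also shows up in comparisons. The following is a minimal base R sketch with made-up size labels, not an example from the original notes.

```r
sizes <- ordered(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large")
)

# Comparison operators respect the level order.
sizes < "large"
#> [1]  TRUE FALSE  TRUE

# So do min() and max().
as.character(max(sizes))
#> [1] "large"
```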

Modifying factor order

The functions below do not modify a factor in place. Each returns a new factor with the same values but a different level order, and the input stays unchanged. Assign the result (for example inside mutate()) if later operations such as plotting or sorting should use the new order.
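
In practice this means the input factor is left untouched and the reordered factor is the return value. A minimal sketch, assuming forcats is installed and using fct_rev() (covered later in this section) as the reordering function:

```r
library(forcats)

f <- factor(c("a", "b", "c"))

# Calling a forcats function without assigning the result leaves f as it was.
invisible(fct_rev(f))
levels(f)
#> [1] "a" "b" "c"

# Assign the result to keep the new level order.
g <- fct_rev(f)
levels(g)
#> [1] "c" "b" "a"

# The data values are the same either way.
identical(as.character(f), as.character(g))
#> [1] TRUE
```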

fct_inorder

fct_inorder() reorders levels by the order in which they first appear in the data.

fct_inorder(f)
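
A small sketch of the effect, assuming forcats is installed; the input vector is invented for illustration.

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c"))

# factor() sorts the levels alphabetically.
levels(f)
#> [1] "a" "b" "c"

# fct_inorder() uses the order of first appearance instead.
levels(fct_inorder(f))
#> [1] "b" "a" "c"
```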

fct_reorder

fct_reorder() reorders the levels of a factor by a summary of a numeric vector: it applies a function (median by default) to the values of .x within each level and sorts the levels by the result, in ascending order by default.

fct_reorder(.f = , .x = , .fun = median, .desc = FALSE)

.f represents the factor whose levels need to be modified.
.x represents the numeric vector used to reorder the levels. It must have the same length as the factor.
.fun represents the summary function applied to the .x values of each level of .f.
.desc is a logical value; set it to TRUE to sort the levels in descending order.
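
The parameters above can be tried on a toy example. This sketch assumes forcats is installed; the values are chosen so that the per-level medians are easy to check by hand.

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 9, 11)  # medians per level: a = 6, b = 2, c = 10

# Ascending by median (the default).
levels(fct_reorder(f, x))
#> [1] "b" "a" "c"

# Descending by median.
levels(fct_reorder(f, x, .desc = TRUE))
#> [1] "c" "a" "b"
```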

fct_reorder2

fct_reorder2() reorders the levels of a factor based on two numeric vectors.

fct_reorder2(.f = , .x = , .y = , .fun = last2, .desc = TRUE)

.f represents the factor whose levels are to be reordered.
.x and .y are two numeric vectors of the same length as .f. For each level of .f, the function calls .fun on the matching .x and .y values, then reorders the levels by the resulting summary values.
.fun is the summary function, last2 by default:
last2: sorts by .x in ascending order and returns the .y value paired with the last (largest) .x.
first2: sorts by .x in ascending order and returns the .y value paired with the first (smallest) .x.
.desc is a logical value indicating whether to order the levels by descending summary value (the default is TRUE).
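
A toy example of last2 and first2, assuming forcats is installed. The group, x, and y vectors are hypothetical and small enough to verify by hand.

```r
library(forcats)

group <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(10, 1, 5, 8, 2, 4)

# last2 picks the y value at the largest x: a = 1, b = 8, c = 4.
# With .desc = TRUE (the default) the levels are sorted from high to low.
levels(fct_reorder2(group, x, y))
#> [1] "b" "c" "a"

# first2 picks the y value at the smallest x: a = 10, b = 5, c = 2.
levels(fct_reorder2(group, x, y, .fun = first2))
#> [1] "a" "b" "c"
```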

fct_relevel

fct_relevel() reorders the levels of a factor by moving specified levels to a specified place.

fct_relevel(.f = , ..., after = )

.f represents the factor whose levels are to be reordered.
... specifies the levels to move, typically given as character strings.
after indicates the position after which the specified levels should be placed (default is 0, meaning the levels will be moved to the front).
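
The effect of after is easiest to see on a small factor. A sketch assuming forcats is installed:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Default: move "c" to the front.
levels(fct_relevel(f, "c"))
#> [1] "c" "a" "b" "d"

# Place "a" after the second of the remaining levels.
levels(fct_relevel(f, "a", after = 2))
#> [1] "b" "c" "a" "d"

# after = Inf moves the level to the end.
levels(fct_relevel(f, "a", after = Inf))
#> [1] "b" "c" "d" "a"
```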

fct_infreq

fct_infreq() reorders factor levels based on their frequencies, in descending order.

fct_infreq(f, w = NULL)

f represents the factor whose levels are to be reordered.
w represents the optional weights used to calculate the frequencies.
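
A brief sketch of both parameters, assuming a forcats version whose fct_infreq() accepts w, as in the signature above. The weights are invented for the example.

```r
library(forcats)

f <- factor(c("a", "b", "b", "c", "c", "c"))

# Unweighted counts: a = 1, b = 2, c = 3.
levels(fct_infreq(f))
#> [1] "c" "b" "a"

# With weights, the per-level sums decide: a = 10, b = 2, c = 3.
w <- c(10, 1, 1, 1, 1, 1)
levels(fct_infreq(f, w = w))
#> [1] "a" "c" "b"
```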

fct_rev

fct_rev() reverses the order of factor levels.

fct_rev(f)
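
fct_rev() is often combined with another reordering function. The sketch below, assuming forcats is installed, reverses the result of fct_infreq() to get levels in increasing order of frequency.

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "c", "c"))  # counts: a = 3, b = 1, c = 2

levels(fct_infreq(f))
#> [1] "a" "c" "b"

levels(fct_rev(fct_infreq(f)))
#> [1] "b" "c" "a"
```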

Modifying factor levels

The functions described below do not change a factor in place. Each returns a new factor whose levels have been renamed, merged, or lumped, while the input factor stays unchanged. Assign the result (for example inside mutate()) to keep it.

fct_recode

fct_recode() renames levels by hand. Each argument has the form "new name" = "old name", and levels that are not mentioned are left as they are.

library(tidyverse)  # provides gss_cat (forcats) plus count() and mutate() (dplyr)

gss_cat |> count(partyid)
#> # A tibble: 10 × 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # ℹ 4 more rows
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # ℹ 4 more rows

fct_collapse

fct_collapse() collapses levels of a factor.

gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180

fct_lump_*

All fct_lump_*() functions share the following parameters:

f: The factor whose levels are to be lumped.
w: An optional numeric vector of weights for frequency of each value (not level) in f. If NULL, all values are weighted equally.
other_level: A string specifying the name of the new level that combines lumped levels. The default value is "Other".

fct_lump_lowfreq

fct_lump_lowfreq() lumps together the least frequent levels, ensuring that the resulting “other” level is still the smallest level.

fct_lump_lowfreq(f, w = NULL, other_level = "Other")
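
A toy example, assuming forcats is installed. The counts are chosen so that the three rare levels together are still smaller than the next level.

```r
library(forcats)

# Counts: a = 10, b = 8, c = 1, d = 1, e = 1
f <- factor(c(rep("a", 10), rep("b", 8), "c", "d", "e"))

lumped <- fct_lump_lowfreq(f)
levels(lumped)
#> [1] "a"     "b"     "Other"

# "Other" (3 values) is still the smallest level.
as.vector(table(lumped))
#> [1] 10  8  3
```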

fct_lump_n

fct_lump_n() lumps all levels except for the n most frequent (or least frequent if n < 0).

fct_lump_n(
  f,
  n,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)

n: The number of most frequent levels to retain. A negative n retains the least frequent levels instead.
ties.method: Determines how levels with equal frequency are ranked, which matters when tied levels straddle the cutoff n. Options include:
"min": Assigns the smallest rank to tied values.
"average": Assigns the average rank to tied values.
"first": Ranks tied values based on their first occurrence.
"last": Ranks tied values based on their last occurrence.
"random": Breaks ties randomly.
"max": Assigns the largest rank to tied values.

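The interaction between n and ties.method can be sketched on a factor with three tied rare levels. This assumes forcats is installed; the data are invented for illustration.

```r
library(forcats)

# Counts: a = 10, b = 8, c = 1, d = 1, e = 1
f <- factor(c(rep("a", 10), rep("b", 8), "c", "d", "e"))

# Keep the two most frequent levels.
levels(fct_lump_n(f, n = 2))
#> [1] "a"     "b"     "Other"

# With the default "min", the tied levels c, d, e all share rank 3,
# so n = 3 keeps every level.
levels(fct_lump_n(f, n = 3))
#> [1] "a" "b" "c" "d" "e"

# With "first", ties are broken by position, so only c is kept.
levels(fct_lump_n(f, n = 3, ties.method = "first"))
#> [1] "a"     "b"     "c"     "Other"
```
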
fct_lump_min

fct_lump_min() lumps levels that appear fewer than min times.

fct_lump_min(f, min, w = NULL, other_level = "Other")

min: A numeric value specifying the minimum number of times a level must appear to remain as a separate level. Levels with fewer occurrences than min are lumped together into the “other” level.
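
A minimal sketch of the min threshold, assuming forcats is installed:

```r
library(forcats)

# Counts: a = 10, b = 8, c = 1, d = 1, e = 1
f <- factor(c(rep("a", 10), rep("b", 8), "c", "d", "e"))

# Levels seen fewer than 2 times are lumped.
levels(fct_lump_min(f, min = 2))
#> [1] "a"     "b"     "Other"

# other_level renames the combined level.
levels(fct_lump_min(f, min = 2, other_level = "rare"))
#> [1] "a"    "b"    "rare"
```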

fct_lump_prop

fct_lump_prop() lumps levels that appear in fewer than or equal to a specified proportion (prop) of the total weight.

fct_lump_prop(f, prop, w = NULL, other_level = "Other")

prop: A numeric value between 0 and 1 that specifies the minimum proportion of the total weight (or total occurrences if w is NULL) a level must have to remain as a separate level. Levels with proportions less than or equal to prop are lumped together.
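
The same toy factor illustrates prop. This sketch assumes forcats is installed; each rare level accounts for 1/21 of the values, which is below the 10% threshold used here.

```r
library(forcats)

# Counts: a = 10, b = 8, c = 1, d = 1, e = 1 (21 values in total)
f <- factor(c(rep("a", 10), rep("b", 8), "c", "d", "e"))

lumped <- fct_lump_prop(f, prop = 0.1)
levels(lumped)
#> [1] "a"     "b"     "Other"

as.vector(table(lumped))
#> [1] 10  8  3
```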

References

R4DS