Factor Operators
fct
Use forcats::fct() rather than factor() from base R.
x <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- fct(x, levels = month_levels)
With forcats::fct():
- After converting to a factor, sort() sorts the data by level order (the order given in levels) rather than by the raw string values. The factor type preserves the category information, which facilitates ordered operations and comparisons.
sort(x)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
- When converting to a factor, every value must appear in the specified levels. If a value is not among the levels, fct() throws an error instead of silently converting it to NA as base R's factor() does, which helps ensure data integrity.
x <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y <- fct(x, levels = month_levels)
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
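For contrast, the following sketch shows what base R's factor() does with the same misspelled input. It uses only base R; month.abb is the built-in vector of English month abbreviations, used here as a stand-in for month_levels.

```r
# Same input as above, with the typo "Jam"
x <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- month.abb  # "Jan", "Feb", ..., "Dec"

# Base R does not raise an error: the unknown value silently becomes NA
y_base <- factor(x, levels = month_levels)
is.na(y_base)
#> [1] FALSE FALSE  TRUE FALSE
```

Because no error or warning is raised, a typo like this can go unnoticed until the NA shows up downstream.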
levels
levels() gives direct access to the set of valid levels.
levels(x)
x should be a factor.
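A minimal sketch of levels() in use, assuming forcats 1.0.0 or later is installed (for fct()). The short level vector is made up for illustration.

```r
library(forcats)

x <- fct(c("Dec", "Apr", "Jan", "Mar"), levels = c("Jan", "Mar", "Apr", "Dec"))

# levels() returns the levels as a character vector, in their stored order
levels(x)
#> [1] "Jan" "Mar" "Apr" "Dec"

# A plain character vector has no levels attribute, so the result is NULL
levels(c("Dec", "Apr"))
#> NULL
```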
ordered
ordered() creates ordered factors. Ordered factors imply a strict ordering between levels, but don’t specify anything about the magnitude of the differences between the levels.
An ordered factor can be recognized in its printed output, as it uses < symbols between the factor levels:
ordered(c("a", "b", "c"))
#> [1] a b c
#> Levels: a < b < c
In both base R and the tidyverse, ordered factors behave similarly to regular factors, with two notable differences:
- When an ordered factor is mapped to color or fill in ggplot2, it defaults to scale_color_viridis_d() or scale_fill_viridis_d(), color scales that imply a ranking.
- When used as a predictor in a linear model, an ordered factor uses “polynomial contrasts.” These can be useful, but interpreting them requires some statistical background. For further details, refer to vignette("contrasts", package = "faux") by Lisa DeBruine.
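The strict ordering also enables comparison operators, which are not meaningful for regular factors. The sketch below uses base R only; the sizes vector and its levels are invented for illustration.

```r
# Supply levels explicitly so the order is small < medium < large
sizes <- ordered(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large")
)

# Comparisons follow the level order, not alphabetical order
sizes < "large"
#> [1]  TRUE FALSE  TRUE

# max() and min() are defined for ordered factors
as.character(max(sizes))
#> [1] "large"
```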
Modifying factor order
The functions below do not modify a factor in place, and they do not change its values. Each returns a new factor whose levels are stored in a different order, and that order is what subsequent operations (sorting, plotting, tabulating) use. To keep the result, assign it to a variable or use it inside mutate().
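A brief sketch of what this means in practice, using fct_rev() (described later in this section) on a made-up factor: the input is left untouched, and only the returned factor carries the new level order.

```r
library(forcats)

f <- factor(c("low", "high", "mid"))
g <- fct_rev(f)

levels(f)        # the original keeps its (alphabetical) level order
#> [1] "high" "low"  "mid"
levels(g)        # the returned factor has the reversed order
#> [1] "mid"  "low"  "high"
as.character(g)  # the values themselves are unchanged
#> [1] "low"  "high" "mid"
```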
fct_inorder
fct_inorder() reorders levels by the order in which they first appear in the data.
fct_inorder(f)
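As a small illustration (the factor f here is invented), compare the default alphabetical levels with the order of first appearance:

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c"))

levels(f)               # factor() sorts levels alphabetically
#> [1] "a" "b" "c"
levels(fct_inorder(f))  # "b" appears first in the data, then "a", then "c"
#> [1] "b" "a" "c"
```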
fct_reorder
fct_reorder() reorders the levels of a factor by a summary of a numeric vector: it applies a function (median by default; mean and others also work) to the values of that vector within each level and sorts the levels by the result, in ascending order by default.
fct_reorder(.f = , .x = , .fun = median, .desc = FALSE)
- .f represents the factor whose levels need to be modified.
- .x represents the numeric vector used to reorder the levels. It must have the same length as the factor.
- .fun represents the function applied to .x for each level of .f.
- .desc indicates whether to sort the levels in descending rather than ascending order.
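The sketch below walks through the arguments on a small, made-up data frame. The per-level medians are a = 11, b = 2 and c = 6, so the ascending order is b, c, a.

```r
library(forcats)

df <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 12, 1, 3, 5, 7)
)

# Default: order levels by the median of value within each group, ascending
levels(fct_reorder(df$group, df$value))
#> [1] "b" "c" "a"

# .desc = TRUE flips the order
levels(fct_reorder(df$group, df$value, .desc = TRUE))
#> [1] "a" "c" "b"
```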
fct_reorder2
fct_reorder2() reorders the levels of a factor based on two numeric vectors.
fct_reorder2(.f = , .x = , .y = , .fun = last2, .desc = TRUE)
- .f represents the factor whose levels are to be reordered.
- .x and .y are the two numeric vectors used to determine the reordering. For each level of .f, the function passes the .x and .y values belonging to that level to .fun, which returns a single number; the levels are then ordered by those numbers.
- .fun is the summary function. The default is last2(); first2() is also provided:
  - last2: after sorting .x in ascending order, takes the .y value associated with the last value of .x (i.e., the largest value in .x).
  - first2: after sorting .x in ascending order, takes the .y value associated with the first value of .x (i.e., the smallest value in .x).
- .desc is a Boolean value indicating whether to reorder the levels in descending order of the selected .y value (the default is TRUE).
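To make the role of last2() and first2() concrete, here is a sketch on invented data with three series observed at x = 1 and x = 2. At the largest x the y values are a = 5, b = 9, c = 4; at the smallest x they are a = 1, b = 2, c = 3.

```r
library(forcats)

df <- data.frame(
  series = factor(rep(c("a", "b", "c"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(1, 5, 2, 9, 3, 4)
)

# Default .fun = last2, .desc = TRUE: order by y at the largest x, descending
levels(fct_reorder2(df$series, df$x, df$y))
#> [1] "b" "a" "c"

# first2 uses y at the smallest x instead
levels(fct_reorder2(df$series, df$x, df$y, .fun = first2))
#> [1] "c" "b" "a"
```

Ordering by the last y value is typically used so that a line plot's legend matches the order of the lines at the right edge of the plot.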
fct_relevel
fct_relevel() reorders the levels of a factor by moving specified levels to a specified place.
fct_relevel(.f = , ..., after = )
- .f represents the factor whose levels are to be reordered.
- ... specifies the levels to be moved.
- after indicates the position after which the specified levels should be placed (default is 0, meaning the levels will be moved to the front).
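A short sketch of how after changes the result, using a made-up four-level factor:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Default after = 0: move "c" to the front
levels(fct_relevel(f, "c"))
#> [1] "c" "a" "b" "d"

# after = 2: place "a" after the second of the remaining levels
levels(fct_relevel(f, "a", after = 2))
#> [1] "b" "c" "a" "d"

# after = Inf: move "a" to the end
levels(fct_relevel(f, "a", after = Inf))
#> [1] "b" "c" "d" "a"
```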
fct_infreq
fct_infreq() reorders factor levels based on their frequencies, in descending order (most frequent first).
fct_infreq(f, w = NULL)
- f represents the factor whose levels are to be reordered.
- w represents the optional weights used to calculate the frequencies.
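The following sketch shows both the unweighted and the weighted case. The factor and weights are invented, and the w argument assumes forcats 1.0.0 or later.

```r
library(forcats)

f <- factor(c("a", "b", "b", "c", "c", "c"))

# Unweighted: counts are a = 1, b = 2, c = 3
levels(fct_infreq(f))
#> [1] "c" "b" "a"

# Weighted: one weight per value, summed within each level (a = 10, b = 2, c = 3)
w <- c(10, 1, 1, 1, 1, 1)
levels(fct_infreq(f, w = w))
#> [1] "a" "c" "b"
```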
fct_rev
fct_rev() reverses the order of factor levels.
fct_rev(f)
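fct_rev() is often combined with another reordering function. As a sketch on a made-up factor, reversing the result of fct_infreq() gives levels in increasing order of frequency:

```r
library(forcats)

f <- factor(c("b", "c", "c", "a", "a", "a"))

levels(fct_rev(f))              # reverse of the alphabetical default
#> [1] "c" "b" "a"
levels(fct_rev(fct_infreq(f)))  # least frequent first: b (1), c (2), a (3)
#> [1] "b" "c" "a"
```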
Modifying factor levels
The functions described below do not modify a factor in place. Each returns a new factor in which levels have been renamed or merged; the original object is unchanged unless you assign the result back to it (for example, inside mutate()).
fct_recode
fct_recode() changes the label of each level you name, using "new" = "old" pairs. Levels that are not mentioned are left as they are.
gss_cat |> count(partyid)
#> # A tibble: 10 × 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # ℹ 4 more rows
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # ℹ 4 more rows
fct_collapse
fct_collapse() collapses several levels of a factor into one. Each new level is given as a vector of the old levels it replaces.
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
fct_lump_
All fct_lump_*() functions share the following parameters:
- f: The factor whose levels are to be lumped.
- w: An optional numeric vector of weights for the frequency of each value (not level) in f. If NULL, all values are weighted equally.
- other_level: A string specifying the name of the new level that combines lumped levels. The default value is "Other".
fct_lump_lowfreq
fct_lump_lowfreq() lumps together the least frequent levels, lumping as many as it can while ensuring that the resulting “Other” level is still the smallest level.
fct_lump_lowfreq(f, w = NULL, other_level = "Other")
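A sketch on invented data with level counts A = 40, B = 10, C = 5, D = 27 and E to I = 1 each. Lumping everything except A and D gives an "Other" of 20, which is still smaller than both remaining levels; lumping D as well would make "Other" larger than A, so the function stops there.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

lumped <- fct_lump_lowfreq(x)
table(lumped)
#> lumped
#>     A     D Other 
#>    40    27    20 
```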
fct_lump_n
fct_lump_n() lumps all levels except for the n most frequent (or least frequent if n < 0).
fct_lump_n(
f,
n,
w = NULL,
other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max")
)
- n: The number of the most frequent levels to retain (or, if negative, the number of least frequent levels).
- ties.method: Determines how levels with equal frequency are ranked, which matters when a tie falls across the cutoff at n. The options are the same as for base R's rank():
  - "min": Assigns the smallest rank to tied values.
  - "average": Assigns the average rank to tied values.
  - "first": Ranks tied values based on their first occurrence.
  - "last": Ranks tied values based on their last occurrence.
  - "random": Breaks ties randomly.
  - "max": Assigns the largest rank to tied values.
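As an illustration, the sketch below keeps the top three and then the top one level of an invented factor with counts A = 40, B = 10, C = 5, D = 27 and E to I = 1 each.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Keep the three most frequent levels (A, D, B); everything else becomes "Other"
top3 <- fct_lump_n(x, n = 3)
table(top3)
#> top3
#>     A     B     D Other 
#>    40    10    27    10 

# Keep only the single most frequent level
top1 <- fct_lump_n(x, n = 1)
levels(top1)
#> [1] "A"     "Other"
```

Note that the retained levels keep their original relative order, and "Other" is placed last.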
fct_lump_min
fct_lump_min() lumps levels that appear fewer than min times.
fct_lump_min(f, min, w = NULL, other_level = "Other")
- min: A numeric value specifying the minimum number of times a level must appear to remain as a separate level. Levels with fewer occurrences than min are lumped together into the “Other” level.
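A sketch of the threshold behaviour on invented data (counts A = 40, B = 10, C = 5, D = 27, E to I = 1 each). With min = 10, level B has exactly 10 occurrences and is therefore kept, since only levels with fewer than min occurrences are lumped.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

kept <- fct_lump_min(x, min = 10)
table(kept)
#> kept
#>     A     B     D Other 
#>    40    10    27    10 
```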
fct_lump_prop
fct_lump_prop() lumps levels whose share of the total weight is less than or equal to a specified proportion (prop).
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
- prop: A numeric value between 0 and 1 that specifies the share of the total weight (or of the total number of occurrences if w is NULL) that a level must exceed to remain as a separate level. Levels with proportions less than or equal to prop are lumped together.
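The sketch below applies a 20% threshold to invented data with 87 values in total (counts A = 40, B = 10, C = 5, D = 27, E to I = 1 each). Only A (about 46%) and D (about 31%) exceed the threshold; B, at about 11%, is lumped with the rest.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

big <- fct_lump_prop(x, prop = 0.2)
table(big)
#> big
#>     A     D Other 
#>    40    27    20 
```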